This hands-on lab assumes you are logged into a virtual machine that has completed Appendix A - Environment Setup, as the user hduser (password hduser).
This lab runs the pi example provided with Hadoop. The example estimates pi using a quasi-Monte Carlo method: rather than computing an exact value, it generates sets of randomly generated sample points and applies a simple equation to them to approximate pi. When you run the job you therefore specify how many sample sets to run and how many random points to generate in each sample set.
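To see the idea without Hadoop, the estimate can be sketched in a few lines of plain Java. This is only an illustration of the sampling approach, using ordinary pseudo-random points; the bundled Hadoop example distributes the counting across map tasks and uses quasi-random sample points instead.

import java.util.Random;

// Illustration only: estimate pi by throwing random points at the unit square
// and counting how many land inside the quarter circle of radius 1.
public class PiSketch {
    public static void main(String[] args) {
        long samples = 1000;   // comparable to the "random points per sample set" argument
        long inside = 0;
        Random random = new Random();
        for (long i = 0; i < samples; i++) {
            double x = random.nextDouble();
            double y = random.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }
        // the quarter circle covers pi/4 of the unit square,
        // so pi is roughly 4 * inside / samples
        System.out.println("pi is approximately " + (4.0 * inside / samples));
    }
}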
The goals of this lab are to:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 1000
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi
Hadoop's strength is that it can run in a distributed manner across commodity hardware. This lab simulates running Hadoop in a distributed manner.
The goals of this lab are to:
$ sudo vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
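Setting mapreduce.framework.name to yarn tells Hadoop to submit MapReduce jobs to the YARN resource manager instead of running them with the local job runner.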
$ start-yarn.sh
$ jps
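If YARN started correctly, the jps listing should now include ResourceManager and NodeManager processes.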
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 1000
This lab uses Hadoop's distributed file system, HDFS, to store and distribute files.
The goals of this lab are to:
NOTE: If you receive the warning "WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable", it can safely be ignored. It appears because the bundled native Hadoop libraries were compiled for 32-bit systems while we are running on a 64-bit system.
$ sudo vim $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
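The fs.default.name property tells HDFS clients where to reach the NameNode, in this case on localhost, port 9000.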
$ start-dfs.sh
$ jps
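If HDFS started correctly, the jps listing should now include NameNode, DataNode and SecondaryNameNode processes.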
$ hdfs dfs -ls /
$ hdfs dfs -mkdir /books
$ hdfs dfs -ls /
$ hdfs dfs -ls /books
$ hdfs dfs -copyFromLocal /opt/data/moby_dick.txt /books
$ hdfs dfs -cat /books/moby_dick.txt
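The hdfs dfs commands above are all this lab requires, but the same operations are also available from Java through the org.apache.hadoop.fs.FileSystem API. The following is a minimal sketch for illustration only (the class name and package are invented; the filesystem URL matches the fs.default.name value configured above):

package com.manifestcorp.hadoop.hdfs;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBrowse {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // equivalent of: hdfs dfs -ls /books
        for (FileStatus status : fs.listStatus(new Path("/books"))) {
            System.out.println(status.getPath());
        }

        // equivalent of: hdfs dfs -cat /books/moby_dick.txt (first line only)
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/books/moby_dick.txt"))));
        System.out.println(reader.readLine());
        reader.close();
        fs.close();
    }
}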
This lab combines the distributed processing of Hadoop with the distributed filesystem of HDFS by running the wordcount example against the data stored in HDFS in the previous lab.
The goals of this lab are to:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /books out
$ hdfs dfs -ls out
$ hdfs dfs -cat out/_SUCCESS
$ hdfs dfs -cat out/part-r-00000
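The _SUCCESS file is just an empty marker indicating the job completed successfully; the actual results are in part-r-00000, which lists each word followed by a tab and the number of times it appears.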
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /books out
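Note that running the job a second time with the same output directory fails, because MapReduce refuses to overwrite existing output. To rerun the job, remove the output directory first:
$ hdfs dfs -rm -r out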
In this lab you will write your first mapper, reducer, and job classes and then run them.
The goals of this lab are to:
$ cd ~/workspaces/hadoop-wordcount
$ mkdir -p src/main/java/com/manifestcorp/hadoop/wc
$ sudo vim src/main/java/com/manifestcorp/hadoop/wc/WordCountMapper.java
package com.manifestcorp.hadoop.wc;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
private static final String SPACE = " ";
private static final IntWritable ONE = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = value.toString().split(SPACE);
for (String str: words) {
word.set(str);
context.write(word, ONE);
}
}
}
$ sudo vim src/main/java/com/manifestcorp/hadoop/wc/WordCountReducer.java
package com.manifestcorp.hadoop.wc;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();   // sum the counts emitted for this word
        }
        context.write(key, new IntWritable(total));
    }
}
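To see how the two classes fit together, consider the input line "to be or not to be". The mapper emits (to,1), (be,1), (or,1), (not,1), (to,1), (be,1); the framework then groups the pairs by key, so the reducer receives to with [1, 1], be with [1, 1], or with [1] and not with [1], and writes out the totals to 2, be 2, or 1 and not 1.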
$ sudo vim src/main/java/com/manifestcorp/hadoop/wc/MyWordCount.java
package com.manifestcorp.hadoop.wc;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MyWordCount {
public static void main(String[] args) throws Exception {
Job job = new Job();
job.setJobName("my word count");
job.setJarByClass(MyWordCount.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
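The packaging step below assumes the hadoop-wordcount project cloned during environment setup already provides a Maven pom.xml. For reference only, a minimal pom might look like the sketch below; the groupId is a guess based on the package name, the artifactId and version are taken from the jar file used in the next commands, and hadoop-client 2.2.0 is the standard client dependency:

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.manifestcorp.hadoop</groupId>
  <artifactId>hadoop-mywordcount</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <properties>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.2.0</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>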
$ mvn clean package
$ hadoop jar target/hadoop-mywordcount-0.0.1-SNAPSHOT.jar com.manifestcorp.hadoop.wc.MyWordCount /books out-2
$ hdfs dfs -ls out-2
$ hdfs dfs -cat out-2/_SUCCESS
$ hdfs dfs -cat out-2/part-r-00000
This hands-on lab assumes you have a virtual machine running Ubuntu 12.04 LTS and that you have logged in as a user with sudo privileges.
$ sudo mkdir /opt/data
$ cd /opt/data
$ sudo wget http://www.trieuvan.com/apache/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
$ sudo wget http://www.trieuvan.com/apache/hadoop/common/hadoop-2.2.0/hadoop-2.2.0-src.tar.gz
$ sudo wget http://www.gutenberg.org/ebooks/2489.txt.utf-8
$ sudo mv 2489.txt.utf-8 moby_dick.txt
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
$ sudo adduser hduser sudo
$ su hduser
$ cd ~
$ sudo mkdir -p /opt/hadoop
$ sudo tar vxzf /opt/data/hadoop-2.2.0.tar.gz -C /opt/hadoop
$ sudo chown -R hduser:hadoop /opt/hadoop/hadoop-2.2.0
$ vim ~/.bashrc
# other stuff
# java variables
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
# hadoop variables
export HADOOP_HOME=/opt/hadoop/hadoop-2.2.0
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
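Together these settings put the Hadoop commands (hadoop, hdfs, start-dfs.sh, start-yarn.sh) on your PATH and point each Hadoop component at the installation directory.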
$ source .bashrc
$ hadoop version
$ cd /opt/hadoop/hadoop-2.2.0
$ sudo vim etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
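These two properties register the MapReduce shuffle handler as a NodeManager auxiliary service; MapReduce jobs rely on it to move map output to the reducers.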
$ sudo mv etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
$ sudo vim etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
$ ssh-keygen -t rsa -P ''
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
$ exit
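The key pair allows the start-dfs.sh and start-yarn.sh scripts to ssh to localhost without prompting for a password; the initial ssh localhost also records localhost's host key in known_hosts.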
$ vim etc/hadoop/hadoop-env.sh
# other stuff
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
# more stuff
$ sudo mkdir -p /opt/hdfs/namenode
$ sudo mkdir -p /opt/hdfs/datanode
$ sudo chmod -R 777 /opt/hdfs
$ sudo chown -R hduser:hadoop /opt/hdfs
$ cd /opt/hadoop/hadoop-2.2.0
$ sudo vim etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<!-- number of hdfs instances to replicate to. -->
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<!-- location of Name Node data -->
<name>dfs.namenode.name.dir</name>
<value>file:/opt/hdfs/namenode</value>
</property>
<property>
<!-- location of Data Node data -->
<name>dfs.datanode.data.dir</name>
<value>file:/opt/hdfs/datanode</value>
</property>
</configuration>
$ hdfs namenode -format
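Formatting initializes an empty HDFS filesystem in the NameNode directory configured above (/opt/hdfs/namenode); it only needs to be run once.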
$ sudo vim etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
$ mkdir -p ~/workspaces
$ cd ~/workspaces
$ git clone https://github.com/cjudd/hadoop-logs.git
$ git clone https://github.com/cjudd/hadoop-wordcount.git