I'm assuming you already have a working Raspberry Pi 3 running a standard Ubuntu Server image, and that you know how to log in via SSH. We will install the latest version of Hadoop, which at the time of writing is 2.7.3. Great, let's get started and log in via SSH.
Changing the hostname and configuring a static IP address
We change the hostname to master and the IP address to 192.168.1.188; you can use whatever IP address you prefer here.
Edit the following files:

sudo nano /etc/hostname
sudo nano /etc/hosts
sudo nano /etc/network/interfaces

A sample configuration for the network interface:
# interfaces(5) file used by ifup(8) and ifdown(8)

# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

# The loopback network interface
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
    address 192.168.1.188
    netmask 255.255.255.0
    gateway 192.168.1.1
    dns-nameservers 8.8.8.8 8.8.4.4
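For reference, a minimal sketch of what the other two files could look like after the change, assuming the hostname master and the IP address we chose:

/etc/hostname:

master

/etc/hosts:

127.0.0.1      localhost
192.168.1.188  master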
Creating a new group and user
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
Everything Hadoop-related will happen via the hduser account. Let's switch to this user:
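su - hduser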
Generating SSH keys
Although we are using a single-node setup in this part, I decided to create the SSH keys already. These will be the keys that the nodes use to talk to each other.
cd ~
mkdir .ssh
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
To verify that everything is working, you can simply open an SSH connection to localhost:
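ssh localhost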
Installing the elephant in the room called Hadoop
wget ftp://apache.belnet.be/mirrors/ftp.apache.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
sudo mkdir /opt
cd ~
sudo tar -xvzf hadoop-2.7.3.tar.gz -C /opt/
cd /opt
sudo chown -R hduser:hadoop hadoop-2.7.3/
Depending on what you already did with your Pi, the /opt directory may already exist.
Hadoop is now installed, but we still need quite a bit of tinkering to get it configured right.
Setting a few environment variables
We are using the Oracle Java 8 JDK; let's install it first.
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
Now we need to set a few environment variables. There are a few ways to do this, but I always do it by editing the .bashrc file:
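nano ~/.bashrc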
Add the following lines at the end of the file:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")
export HADOOP_HOME=/opt/hadoop-2.7.3
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Changes in .bashrc are not applied when you save the file. You can either log out and log in again to use the new environment variables, or you can source the file:
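source ~/.bashrc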
If everything is configured right, you should be able to print the installed version of Hadoop.
$ hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
Compiled by jenkins on 2015-06-29T06:04Z
Compiled with protoc 2.5.0
From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a
This command was run using /opt/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar
Configuring Hadoop 2.7.3
Let's go to the directory that contains all of Hadoop's configuration files. We want to edit the hadoop-env.sh file. For some reason we need to configure JAVA_HOME manually in this file; Hadoop seems to ignore our $JAVA_HOME environment variable.
cd $HADOOP_CONF_DIR
nano hadoop-env.sh
Yes, I use nano. Look for the line that sets JAVA_HOME and change it to your Java install directory. This is how the line looked after I changed it:
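Assuming the Oracle JDK from the installer above ends up in /usr/lib/jvm/java-8-oracle, it should look something like this (verify your own path with readlink -f /usr/bin/java):

export JAVA_HOME=/usr/lib/jvm/java-8-oracle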
There are quite a few files that need to be edited now. These are XML files; you just have to paste the code snippets below between the <configuration> tags of each file.
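The first file is core-site.xml:

nano core-site.xml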
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/hdfs/tmp</value>
</property>
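Next is hdfs-site.xml, where we set the replication factor to 1 because we only have a single node:

nano hdfs-site.xml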
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
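The mapred-site.xml file only exists as a template at first, so we make a copy and edit that: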
cp mapred-site.xml.template mapred-site.xml
nano mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx210m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx210m</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>256</value>
</property>
The first property says that we want to use YARN as the MapReduce framework. The other properties are settings specific to our Raspberry Pi: for example, we give the YARN MapReduce Application Master 256 MB of RAM, and the map and reduce containers get the same amount. These values allow us to actually run jobs; the default is 1.5 GB per container, which our Pi can't deliver with its 1 GB of RAM.
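The last file to edit is yarn-site.xml:

nano yarn-site.xml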
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>768</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>128</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>768</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>4</value>
</property>
This file tells Hadoop about the resources of this node, like the maximum amount of memory and the number of cores that can be used. We limit the usable RAM to 768 MB, which leaves a bit of memory for the OS and Hadoop itself. A container always receives an amount of memory that is a multiple of the minimum allocation of 128 MB; for example, a container that asks for 450 MB will be assigned 512 MB.
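Before we can start Hadoop for the first time, we have to create the /hdfs/tmp directory that we configured as hadoop.tmp.dir in core-site.xml, and format the namenode: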
sudo mkdir -p /hdfs/tmp
sudo chown hduser:hadoop /hdfs/tmp
chmod 750 /hdfs/tmp
hdfs namenode -format
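Time to start everything up. The start scripts live in Hadoop's sbin directory: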
cd $HADOOP_HOME/sbin
start-dfs.sh
start-yarn.sh
If you want to verify that everything is working, you can use the jps command. In its output you can see that Hadoop components like the NameNode are running. The numbers in front are process IDs and can be ignored.
7696 ResourceManager
7331 DataNode
7464 SecondaryNameNode
8107 Jps
7244 NameNode
Running a first MapReduce job
For our first job we need some data. I selected a number of books from Project Gutenberg and concatenated them into one large file. There's something for everyone: Shakespeare, Homer, Edgar Allan Poe... The resulting file is about 16 MB in size.
wget http://www.gutenberg.org/cache/epub/11/pg11.txt
wget http://www.gutenberg.org/cache/epub/74/pg74.txt
wget http://www.gutenberg.org/cache/epub/1661/pg1661.txt
wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt
wget http://www.gutenberg.org/cache/epub/5200/pg5200.txt
wget http://www.gutenberg.org/cache/epub/2591/pg2591.txt
wget http://www.gutenberg.org/cache/epub/6130/pg6130.txt
wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt
wget http://www.gutenberg.org/cache/epub/8800/pg8800.txt
wget http://www.gutenberg.org/cache/epub/345/pg345.txt
wget http://www.gutenberg.org/cache/epub/1497/pg1497.txt
wget http://www.gutenberg.org/cache/epub/135/pg135.txt
wget http://www.gutenberg.org/cache/epub/41/pg41.txt
wget http://www.gutenberg.org/cache/epub/120/pg120.txt
wget http://www.gutenberg.org/cache/epub/22381/pg22381.txt
wget http://www.gutenberg.org/cache/epub/2600/pg2600.txt
wget http://www.gutenberg.org/cache/epub/236/pg236.txt
cat pg*.txt > books.txt
If you don't feel like downloading all these books, you can also just create a dummy books.txt that contains a few words.
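For example, any words will do:

echo "hadoop is an elephant and an elephant never forgets" > books.txt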
Hadoop cannot read the books.txt file from our traditional Linux file system; it needs to be stored on HDFS. We can easily copy it there.
hdfs dfs -copyFromLocal books.txt /books.txt
You can make sure that the copy operation went properly by listing the contents of the HDFS root directory.
hdfs dfs -ls /
Now let's count the occurrence of each word in this giant book. We are in luck: the kind developers of Hadoop provide an example that does exactly that.
hadoop jar /opt/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /books.txt /books-result
You can view the progress by surfing to the YARN ResourceManager web interface, which by default listens on port 8088 (http://192.168.1.188:8088 with the IP address we configured earlier). Once the job has finished, you can print the first 20 lines of the result:
hdfs dfs -cat /books-result/part-r* | head -n 20
Since we are talking about multiple books, printing the entire list might take a while. If you look at the output, you see that the wordcount example has room for improvement: uppercase and lowercase words are counted separately, and symbols and punctuation around a word make things messy. But it's time for a first benchmark: how long did our single-node Raspberry Pi work on this wordcount? The average execution time over 5 jobs was 3 minutes and 25 seconds.
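As a rough sketch of one way to address this, not part of the original benchmark, you could normalize the input before copying it to HDFS: lowercase everything and turn punctuation into spaces.

# Lowercase all letters, then replace everything that is not a
# lowercase letter, apostrophe, or newline with a space.
tr 'A-Z' 'a-z' < books.txt | tr -c "a-z'\n" ' ' > books-clean.txt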