Getting started

I'm assuming you already have a working Raspberry Pi 3 running Ubuntu Server and that you know how to log in via SSH. We will install the latest version of Hadoop, which at the time of writing is 2.7.3. Great, let's get started and log in via SSH.

Changing the hostname and configuring a static IP address

We change the hostname to master and give the Pi the static IP address 192.168.1.188; you can use whatever IP address you prefer here.

sudo nano /etc/hostname
sudo nano /etc/hosts
sudo nano /etc/network/interfaces
Sample settings for the network interface (/etc/network/interfaces):
# interfaces(5) file used by ifup(8) and ifdown(8)
# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

# The loopback network interface
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
address	192.168.1.188
netmask	255.255.255.0
gateway	192.168.1.1
dns-nameservers	8.8.8.8	8.8.4.4
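
For completeness, here is a minimal sketch of what the other two files could contain after editing. Your /etc/hosts will probably have more entries, so only adjust the relevant lines.

# /etc/hostname
master

# /etc/hosts
127.0.0.1       localhost
192.168.1.188   master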

Creating a new group and user

sudo addgroup hadoop  
sudo adduser --ingroup hadoop hduser  
sudo adduser hduser sudo  

Everything Hadoop-related will happen as the hduser user. Let's switch to this user.

su hduser  

Generating SSH keys

Although we are using a single-node setup in this part, I decided to create the SSH keys already. These are the keys that the nodes will use to talk to each other.

cd ~  
mkdir .ssh  
ssh-keygen -t rsa -P ""  
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys  
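
When you later add worker nodes, this same public key is what you will copy over to them. A sketch of how that could look, assuming a hypothetical worker at 192.168.1.189 that also has an hduser account:

# 192.168.1.189 is just an example address for a future worker node
ssh-copy-id hduser@192.168.1.189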

To verify that everything is working, you can simply open an SSH connection to localhost.

ssh localhost  

Installing the elephant in the room called Hadoop

wget ftp://apache.belnet.be/mirrors/ftp.apache.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
sudo mkdir /opt  
cd ~  
sudo tar -xvzf hadoop-2.7.3.tar.gz -C /opt/  
cd /opt  
sudo chown -R hduser:hadoop hadoop-2.7.3/  
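
If you want to double-check the extraction and the ownership change, a quick listing does the trick:

ls -ld /opt/hadoop-2.7.3
# should list hduser as owner and hadoop as group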

Depending on what you already did with your Pi, the /opt directory may already exist.

Hadoop is now installed, but we still need quite a bit of tinkering to get it configured right.

Setting a few environment variables

We are using the Oracle Java 8 JDK; let's install it first.

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
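
A quick check should now report Java 1.8 as the default:

java -version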

Now we need to set a few environment variables. There are a few ways to do this, but I always do it by editing the .bashrc file.

nano ~/.bashrc  

Add the following lines at the end of the file:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")  
export HADOOP_HOME=/opt/hadoop-2.7.3  
export HADOOP_MAPRED_HOME=$HADOOP_HOME  
export HADOOP_COMMON_HOME=$HADOOP_HOME  
export HADOOP_HDFS_HOME=$HADOOP_HOME  
export YARN_HOME=$HADOOP_HOME  
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop  
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop  
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin  

Changes to .bashrc are not applied automatically when you save the file. You can either log out and log in again to pick up the new environment variables, or you can source the file:

source ~/.bashrc  
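
You can quickly check that the variables resolve to something sensible:

echo $JAVA_HOME
echo $HADOOP_HOME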

If everything is configured right, you should be able to print the installed version of Hadoop.

$ hadoop version
Hadoop 2.7.3  
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a  
Compiled by jenkins on 2015-06-29T06:04Z  
Compiled with protoc 2.5.0  
From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a  
This command was run using /opt/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar  

Configuring Hadoop 2.7.3

Let's go to the directory that contains all of Hadoop's configuration files. We want to edit the hadoop-env.sh file. For some reason we need to configure JAVA_HOME manually in this file; Hadoop seems to ignore our $JAVA_HOME environment variable.

cd $HADOOP_CONF_DIR  
nano hadoop-env.sh  

Yes, I use nano. Look for the line saying JAVA_HOME and change it to your Java install directory. This was how the line looked after I changed it:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre/  
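
If you are not sure where Java ended up on your system, the same readlink trick we used in .bashrc reveals the path:

readlink -f /usr/bin/java
# prints something like /usr/lib/jvm/java-8-oracle/jre/bin/java,
# so JAVA_HOME is everything up to and including jre/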

There are quite a few files that need to be edited now. These are XML files; you just have to paste the snippets below between the <configuration> tags.

nano core-site.xml  
<property>  
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>  
<property>  
  <name>hadoop.tmp.dir</name>
  <value>/hdfs/tmp</value>
</property>  
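
To be explicit about where these snippets go: after editing, core-site.xml as a whole should look roughly like this (the comment header that ships with the file is omitted):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hdfs/tmp</value>
  </property>
</configuration>

The other XML files below follow the same pattern.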
nano hdfs-site.xml  
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
cp mapred-site.xml.template mapred-site.xml  
nano mapred-site.xml  
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>256</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx210m</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>256</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx210m</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>256</value>
  </property>

The first property tells Hadoop that we want to use YARN as the MapReduce framework. The other properties are settings specific to our Raspberry Pi: the YARN MapReduce ApplicationMaster gets 256 megabytes of RAM, and so do the map and reduce containers. These values allow us to actually run jobs; the defaults are around 1.5 GB, which our Pi can't deliver with its 1 GB of RAM.

nano yarn-site.xml  
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>768</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>768</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>4</value>
  </property>

This file tells Hadoop some information about this node, like the maximum amount of memory and the number of cores that can be used. We limit the usable RAM to 768 megabytes, which leaves a bit of memory for the OS and the Hadoop processes themselves. A container always receives a memory amount that is a multiple of the minimum allocation of 128 megabytes. For example, a container that needs 450 megabytes will get 512 megabytes assigned.

Preparing HDFS

sudo mkdir -p /hdfs/tmp  
sudo chown hduser:hadoop /hdfs/tmp  
chmod 750 /hdfs/tmp  
hdfs namenode -format  

Booting Hadoop

cd $HADOOP_HOME/sbin  
start-dfs.sh  
start-yarn.sh  

If you want to verify that everything is working you can use the jps command. In the output of this command you can see that Hadoop components like the NameNode are running. The numbers are process IDs and will differ on your system.

7696 ResourceManager  
7331 DataNode  
7464 SecondaryNameNode  
8107 Jps  
7244 NameNode  
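
Besides jps, you can also ask HDFS itself for a status report once the daemons are up; on this single-node setup it should list one live datanode:

hdfs dfsadmin -report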

Running a first MapReduce job

For our first job we need some data. I selected a few books from Project Gutenberg and concatenated them into one large file. There's something in there for everyone: Shakespeare, Homer, Edgar Allan Poe... The resulting file is about 16 MB in size.

wget http://www.gutenberg.org/cache/epub/11/pg11.txt  
wget http://www.gutenberg.org/cache/epub/74/pg74.txt  
wget http://www.gutenberg.org/cache/epub/1661/pg1661.txt  
wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt  
wget http://www.gutenberg.org/cache/epub/5200/pg5200.txt  
wget http://www.gutenberg.org/cache/epub/2591/pg2591.txt  
wget http://www.gutenberg.org/cache/epub/6130/pg6130.txt  
wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt  
wget http://www.gutenberg.org/cache/epub/8800/pg8800.txt  
wget http://www.gutenberg.org/cache/epub/345/pg345.txt  
wget http://www.gutenberg.org/cache/epub/1497/pg1497.txt  
wget http://www.gutenberg.org/cache/epub/135/pg135.txt  
wget http://www.gutenberg.org/cache/epub/41/pg41.txt  
wget http://www.gutenberg.org/cache/epub/120/pg120.txt  
wget http://www.gutenberg.org/cache/epub/22381/pg22381.txt  
wget http://www.gutenberg.org/cache/epub/2600/pg2600.txt  
wget http://www.gutenberg.org/cache/epub/236/pg236.txt  
cat pg*.txt > books.txt  

Alternatively, you can just create a dummy books.txt that contains a few words.
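
For example, something along these lines is enough for a quick test:

echo "hadoop on a pi is still hadoop on a pi" > books.txt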

The books.txt file cannot be read by Hadoop from our traditional Linux file system; it needs to be stored on HDFS. We can easily copy it to HDFS.

hdfs dfs -copyFromLocal books.txt /books.txt  

You can make sure that the copy operation went properly by listing the contents of the HDFS root directory.

hdfs dfs -ls /  

Now let's count the occurrence of each word in this giant book. We are in luck: the kind developers of Hadoop provide an example that does exactly that.

hadoop jar /opt/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /books.txt /books-result  

You can view the progress by surfing to http://192.168.1.188:8088/cluster (use the IP address of your Pi). After the job is done you can find the output in the /books-result directory on HDFS. We can view the results of this MapReduce (v2) job using the hdfs command:

hdfs dfs -cat /books-result/part-r* | head -n 20  

Since we are talking about multiple books, printing the entire list might take a while. If you look at the output you will see that the wordcount example has room for improvement: uppercase and lowercase words are counted separately, and symbols and punctuation around a word make things messy. But it's time for a first benchmark: how long did our single-node Raspberry Pi 3 work on this wordcount? The average execution time over 5 jobs was 3 minutes and 25 seconds.
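
If you want to reproduce the timing yourself, wrapping the job in time is the simplest approach. Note that the output directory must not exist yet, so a fresh name is used here (books-result-2 is just an example):

time hadoop jar /opt/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /books.txt /books-result-2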