Recently, I have been tinkering with Hadoop. The need came about while I was searching for an architecture to handle big data processing. After using technologies like OpenMP/OpenMPI on previous projects, I noticed that the computational sector has been paying increasing attention to Hadoop. This blog post will go over a setup I just completed.
We will be using Cloudera to help build our Hadoop cluster. Personally, I chose to go with Cloudera because it is well documented and there is a lot of information available on the web in case you get stuck. The prerequisites are as follows:
CentOS 7
Oracle JDK jdk-8u73
If you are following along with my Linux Container series, you can start up a CentOS 7 container. Launch the container, type ‘ip a’ to get the container’s IP address, and enter that IP address in /etc/hosts as the masternode IP address. Then skip down to the Installing Packages section. The instructions for starting up the container are as follows:
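Here is a minimal sketch assuming the LXC tooling from the Linux Container series; the container name hadoop-master is just an example:

    # Create, start, and attach to a CentOS 7 container (example name: hadoop-master)
    sudo lxc-create -n hadoop-master -t download -- -d centos -r 7 -a amd64
    sudo lxc-start -n hadoop-master
    sudo lxc-attach -n hadoop-master
    ip a    # note the container's IP address for the /etc/hosts entry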
General Server Configuration
To start, I used the minimal install CentOS 7 image on a server that contains 2 hard drives. The first hard drive (/dev/sda) will be used to host the OS and all the components needed to run Hadoop. The second hard drive (/dev/sdb) will be used to build the Hadoop Distributed File System (HDFS).
Before we can start installing the Hadoop components we will need to configure our server. The first step is to install NTP. The following process was used:
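On CentOS 7 that looks roughly like this:

    # Install NTP and enable it so the node keeps an accurate clock
    sudo yum install -y ntp
    sudo systemctl enable ntpd
    sudo systemctl start ntpd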
Second, we needed to configure the hostname for the master node. An entry was made in /etc/hosts that reflects the following:
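For example (the IP address below is a placeholder for your server's actual address):

    192.168.1.10    masternode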
The following command was also issued to help assign the hostname to the server.
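On CentOS 7 this can be done with hostnamectl:

    sudo hostnamectl set-hostname masternode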
Additionally, /etc/sysconfig/network was also edited to include the following:
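A sketch of the relevant lines (the HOSTNAME entry is the important one here):

    NETWORKING=yes
    HOSTNAME=masternode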
Installing packages
First, we need to install Java. You will need to go to Oracle’s website and download the 64-bit RPM file. Once you have downloaded it, run the following command:
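Assuming the RPM downloaded for jdk-8u73 is named as below (adjust the filename to match your download):

    sudo rpm -ivh jdk-8u73-linux-x64.rpm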
Then download the Cloudera Hadoop package and install it using the following commands:
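A sketch using Cloudera's CDH 5 one-click-install repository RPM; verify the URL and the exact package set for your node's roles against Cloudera's documentation:

    # Add the CDH 5 yum repository
    wget https://archive.cloudera.com/cdh5/one-click-install/redhat/7/x86_64/cloudera-cdh-5-0.x86_64.rpm
    sudo yum --nogpgcheck localinstall -y cloudera-cdh-5-0.x86_64.rpm

    # Install the Hadoop services for the master node
    sudo yum install -y hadoop-hdfs-namenode hadoop-hdfs-datanode \
        hadoop-yarn-resourcemanager hadoop-yarn-nodemanager \
        hadoop-mapreduce-historyserver hadoop-client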
Hadoop Configuration
Configuring a Hadoop installation using Cloudera is relatively easy. The ‘alternatives’ application is used to point to the configuration Hadoop should currently use. Once you have set up the configuration path, you basically copy that Hadoop configuration directory to all the DataNodes. This will be done in a follow-up posting. Let’s begin:
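A sketch following Cloudera's convention of copying the empty template configuration; the directory name conf.my_cluster is just an example:

    # Create a custom configuration directory and point 'alternatives' at it
    sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster
    sudo alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
    sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster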
Now we need to tell Hadoop about the paths used by HDFS. On the masternode, add the following entries:
In /etc/hadoop/conf/core-site.xml
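For example, a property along these lines (the hostname matches the masternode entry from /etc/hosts; 8020 is the default NameNode port):

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://masternode:8020</value>
    </property>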
and in /etc/hadoop/conf/hdfs-site.xml
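A sketch assuming the second drive (/dev/sdb) is mounted at /data/1; adjust the paths to your mount point:

    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///data/1/dfs/nn</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///data/1/dfs/dn</value>
    </property>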
Now create the HDFS directories
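These should match the paths in hdfs-site.xml and be owned by the hdfs user:

    sudo mkdir -p /data/1/dfs/nn /data/1/dfs/dn
    sudo chown -R hdfs:hdfs /data/1/dfs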
Create and format the HDFS:
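A sketch: format the NameNode, then start the HDFS services (the DataNode here is the one using /dev/sdb on this server):

    sudo -u hdfs hdfs namenode -format
    sudo service hadoop-hdfs-namenode start
    sudo service hadoop-hdfs-datanode start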
Configuring YARN services
The following file needs to be configured to enable the YARN services:
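A minimal sketch, assuming the file in question is /etc/hadoop/conf/yarn-site.xml and that the YARN local and log directories also live on the second drive:

    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>masternode</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>file:///data/1/yarn/local</value>
    </property>
    <property>
      <name>yarn.nodemanager.log-dirs</name>
      <value>file:///data/1/yarn/logs</value>
    </property>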
Create the necessary directories for YARN services
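Matching the paths assumed above:

    sudo mkdir -p /data/1/yarn/local /data/1/yarn/logs
    sudo chown -R yarn:yarn /data/1/yarn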
Start the Hadoop YARN services:
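With the CDH packages these are managed as system services:

    sudo service hadoop-yarn-resourcemanager start
    sudo service hadoop-yarn-nodemanager start
    sudo service hadoop-mapreduce-historyserver start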
Testing the installation
Once we have finished installing the packages and configuring the setup, we can test whether the installation is working.
Logged in as the hadoop user, create the HDFS home directory for that user:
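A sketch in which the hdfs superuser creates the home directory and hands it over to the hadoop user:

    sudo -u hdfs hadoop fs -mkdir -p /user/hadoop
    sudo -u hdfs hadoop fs -chown hadoop /user/hadoop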
Create the data directory as that user and copy some data:
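For example, as the hadoop user, copy the Hadoop configuration files into an input directory in HDFS:

    hadoop fs -mkdir input
    hadoop fs -put /etc/hadoop/conf/*.xml input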
Run a simple example. This one executes a grep command to see the occurrences of words starting with dfs.
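Using the example jar shipped with the CDH MapReduce packages (the jar path may differ slightly on your install):

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output 'dfs[a-z.]+'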
After the job has run, you should be able to see the results in the output folder. Run the following commands to list and view the contents of the output folder:
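For example (part-r-00000 is the typical name of the first reducer output file):

    hadoop fs -ls output
    hadoop fs -cat output/part-r-00000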
This wraps up this post. You can now start exploring applications that utilize the MapReduce framework Hadoop provides. In a follow-up to this posting, I will show how to set up a DataNode and provide another MapReduce application you can try at home. Stay tuned!