Recently, I have been tinkering with Hadoop. The need came about while I was searching for an architecture to handle big data processing. After using technologies like OpenMP/OpenMPI on previous projects, I noticed that the computational sector has been paying increasing attention to Hadoop. This blog post will go over a setup I just completed.
We will be using Cloudera to help build our Hadoop cluster. Personally, I chose to go with Cloudera because it is well documented and there is a lot of information available on the web in case you get stuck. The prerequisites are as follows:
CentOS 7
Oracle JDK jdk-8u73
If you are following along with my Linux Container series, you can start up a CentOS 7 container. Launch the container, type ‘ip a’ to get the container’s IP address, and enter that IP address in /etc/hosts as the masternode IP address. Then skip down to the Installing Packages section. The instructions for starting up the container are as follows:
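Here is a minimal sketch assuming the LXC tooling from the Linux Container series; the container name hadoop-master is just an example:

    # Create, start, and attach to a CentOS 7 container (example name: hadoop-master)
    sudo lxc-create -n hadoop-master -t download -- -d centos -r 7 -a amd64
    sudo lxc-start -n hadoop-master
    sudo lxc-attach -n hadoop-master
    ip a    # note the container's IP address for the /etc/hosts entry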
General Server Configuration
To start, I used the minimal install CentOS 7 image on a server that contains 2 hard drives. The first hard drive (/dev/sda) will be used to host the OS and all the components needed to run Hadoop. The second hard drive (/dev/sdb) will be used to build the Hadoop Distributed File System (HDFS).
Before we can start installing the Hadoop components we will need to configure our server. The first step is to install NTP. The following process was used:
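On CentOS 7 that looks roughly like this:

    # Install NTP and enable it so the node keeps an accurate clock
    sudo yum install -y ntp
    sudo systemctl enable ntpd
    sudo systemctl start ntpd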
Second, we needed to configure the hostname for the master node. An entry was made in /etc/hosts that reflects the following:
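For example (the IP address below is a placeholder for your server's actual address):

    192.168.1.10    masternode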
The following command was also issued to help assign the hostname to the server.
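On CentOS 7 this can be done with hostnamectl:

    sudo hostnamectl set-hostname masternode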
Additionally, /etc/sysconfig/network was also edited to include the following:
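A sketch of the relevant lines (the HOSTNAME entry is the important one here):

    NETWORKING=yes
    HOSTNAME=masternode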
Installing packages
First, we need to install Java. You will need to go to Oracle’s website and download the 64-bit RPM file. Once you have downloaded it, run the following command:
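Assuming the RPM downloaded for jdk-8u73 is named as below (adjust the filename to match your download):

    sudo rpm -ivh jdk-8u73-linux-x64.rpm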
Then download the Cloudera Hadoop package and install it using the following commands:
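A sketch using Cloudera's CDH 5 one-click-install repository RPM; verify the URL and the exact package set for your node's roles against Cloudera's documentation:

    # Add the CDH 5 yum repository
    wget https://archive.cloudera.com/cdh5/one-click-install/redhat/7/x86_64/cloudera-cdh-5-0.x86_64.rpm
    sudo yum --nogpgcheck localinstall -y cloudera-cdh-5-0.x86_64.rpm

    # Install the Hadoop services for the master node
    sudo yum install -y hadoop-hdfs-namenode hadoop-hdfs-datanode \
        hadoop-yarn-resourcemanager hadoop-yarn-nodemanager \
        hadoop-mapreduce-historyserver hadoop-client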
Hadoop Configuration
Configuring a Hadoop installation using Cloudera is relatively easy. The ‘alternatives’ application is used to point to the configuration Hadoop should currently use. Once you have set up the configuration path, you basically copy that Hadoop configuration directory to all the DataNodes. This will be done in a follow-up posting. Let’s begin:
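A sketch following Cloudera's convention of copying the empty template configuration; the directory name conf.my_cluster is just an example:

    # Create a custom configuration directory and point 'alternatives' at it
    sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster
    sudo alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
    sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster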
Now we need to tell Hadoop about the paths used by HDFS. On the masternode, add the following entries:
In /etc/hadoop/conf/core-site.xml
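For example, a property along these lines (the hostname matches the masternode entry from /etc/hosts; 8020 is the default NameNode port):

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://masternode:8020</value>
    </property>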
and in /etc/hadoop/conf/hdfs-site.xml
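A sketch assuming the second drive (/dev/sdb) is mounted at /data/1; adjust the paths to your mount point:

    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///data/1/dfs/nn</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///data/1/dfs/dn</value>
    </property>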
Now create the HDFS directories
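These should match the paths in hdfs-site.xml and be owned by the hdfs user:

    sudo mkdir -p /data/1/dfs/nn /data/1/dfs/dn
    sudo chown -R hdfs:hdfs /data/1/dfs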
Create and format the HDFS:
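A sketch: format the NameNode, then start the HDFS services (the DataNode here is the one using /dev/sdb on this server):

    sudo -u hdfs hdfs namenode -format
    sudo service hadoop-hdfs-namenode start
    sudo service hadoop-hdfs-datanode start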
Configuring YARN services
The following file needs to be configured to enable the YARN services:
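A minimal sketch, assuming the file in question is /etc/hadoop/conf/yarn-site.xml and that the YARN local and log directories also live on the second drive:

    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>masternode</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>file:///data/1/yarn/local</value>
    </property>
    <property>
      <name>yarn.nodemanager.log-dirs</name>
      <value>file:///data/1/yarn/logs</value>
    </property>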
Create the necessary directories for YARN services
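Matching the paths assumed above:

    sudo mkdir -p /data/1/yarn/local /data/1/yarn/logs
    sudo chown -R yarn:yarn /data/1/yarn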
Start the Hadoop YARN services:
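With the CDH packages these are managed as system services:

    sudo service hadoop-yarn-resourcemanager start
    sudo service hadoop-yarn-nodemanager start
    sudo service hadoop-mapreduce-historyserver start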
Testing the installation
Once we have finished installing the packages and configuring the setup, we can test whether the installation is working.
Logged in as the hadoop user, create the HDFS home directory for that user:
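A sketch in which the hdfs superuser creates the home directory and hands it over to the hadoop user:

    sudo -u hdfs hadoop fs -mkdir -p /user/hadoop
    sudo -u hdfs hadoop fs -chown hadoop /user/hadoop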
Create the data directory as that user and copy some data:
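For example, as the hadoop user, copy the Hadoop configuration files into an input directory in HDFS:

    hadoop fs -mkdir input
    hadoop fs -put /etc/hadoop/conf/*.xml input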
Run a simple example. This one executes a grep command to see the occurrences of words starting with dfs.
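Using the example jar shipped with the CDH MapReduce packages (the jar path may differ slightly on your install):

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output 'dfs[a-z.]+'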
After the job has run, you should be able to see the results in the output folder. Run the following commands to list and view the contents of the output folder:
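For example (part-r-00000 is the typical name of the first reducer output file):

    hadoop fs -ls output
    hadoop fs -cat output/part-r-00000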
This wraps up this post. You can now start exploring applications that utilize the MapReduce framework Hadoop provides. In a follow-up to this posting, I will show how to set up a DataNode and provide another MapReduce application you can try at home. Stay tuned!