Why another post on Hadoop Installation?
It took me a while to get a Hadoop cluster up and running, especially after wading through all the documentation and tutorials available on the internet. Moreover, for a newcomer to the Hadoop ecosystem, it can be quite frustrating to decide between a distribution like Cloudera or MapR and a direct installation from the Apache site. I chose the latter, and it works fine for me. Yes, there are a number of good tutorials out there, but I am sure this will still help a few people like me. Before I start, I assume you have a basic understanding, or at least a general overview, of how Hadoop works. If not, I suggest you get one first.
Now, for those of you who came here by accident, I would like to quote from the Apache Hadoop website, http://hadoop.apache.org/:
“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
For an introduction, I would recommend,
Although it’s a dated tutorial, it gives a good idea of the overall system and, yes, the MapReduce framework. Or, if you’d rather read a book, this is a must:
This book, written by Tom White, is considered ‘the’ book on Hadoop, with a hands-on approach.
I will be using hadoop-0.20.2, a stable release. You can download it, or a newer version, from here.
Or if you want to try out other good tutorials out there, I would suggest the following:
This is a good read and gives you a great insight on the framework. This post will be for a standalone system on Ubuntu.
Once you are comfortable with this, move on to his next tutorial on multiple machines.
I would also recommend his MapReduce tutorial in Python. Although Java is the native API, as he says, Python works well thanks to the Streaming API for Hadoop.
Again, it’s up to you to decide whether to go for a distribution. I would suggest Cloudera. I won’t be writing about it here, though.
MapR is also worth mentioning, especially for its analytics support.
So what now?
I will be going through a general case of a Hadoop installation on an RHEL5 machine, in a step-by-step tutorial format.
1. Some prerequisites:
- Java installation and the installation path. Make sure you have at least a 1.6 build of Java. If not, install it.
java version "1.6.0_10-rc"
Java(TM) SE Runtime Environment (build 1.6.0_10-rc-b28)
Java HotSpot(TM) Server VM (build 11.0-b15, mixed mode)
- Making sure you have a valid hostname. Check it using the hostname command.
- While not necessary, it is a good idea to partition your available disks in a given format (more on this later). Log in as root and use the fdisk -l command to see the available partitions.
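The prerequisite checks above can be run in a few seconds. This is a minimal, read-only sketch; the fdisk line is commented out since it needs root:

```shell
# quick pre-flight checks (read-only; nothing is modified)
java -version 2>&1 | head -n 1   # expect a 1.6 build or newer
hostname                         # should print a valid, resolvable name
# fdisk -l                       # run as root to review available partitions
```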
2. Now the installation
Copy the tar file you downloaded, say hadoop-0.20.2.tar.gz, to /home/hadoop/installations:
cp hadoop-0.20.2.tar.gz /home/hadoop/installations
3. Now untar the file to /usr/local/. Why there? You’ll see.
sudo tar -xzvf /home/hadoop/installations/hadoop-0.20.2.tar.gz -C /usr/local/
Also give the required permissions. This is very important.
sudo chown -R hadoop:hadoop /usr/local/hadoop-0.20.2/
Create a soft link to /usr/local/hadoop-0.20.2
ln -s /usr/local/hadoop-0.20.2 /home/hadoop/hadoop
Create or copy existing configuration files
Now, as you may know, the main configuration files are core-site.xml, hadoop-env.sh, hdfs-site.xml and mapred-site.xml. Refer to one of the above tutorials on how to set them up, or better, the book “Hadoop: The Definitive Guide”. If you have them ready, copy core-site.xml, hadoop-env.sh, hdfs-site.xml and mapred-site.xml to each of your datanodes, or to the main namenode if this is your first install, and configure them appropriately. This alone would take a long time to explain, so I’ll write another post about it. If you have the files ready on another Hadoop server, do this. Yes, it is important that they all share the same attributes.
scp -r firstname.lastname@example.org:/home/hadoop/hadoop/conf/* /home/hadoop/hadoop/conf
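To give a feel for what these files contain, here is a minimal core-site.xml sketch. The hostname and port are assumptions for illustration, not values from this post; adjust them to your namenode:

```xml
<?xml version="1.0"?>
<!-- core-site.xml: minimal sketch; hostname and port are hypothetical -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.org:9000</value>
  </property>
</configuration>
```

hdfs-site.xml and mapred-site.xml follow the same property/name/value layout; the configuration details deserve their own post, as mentioned above.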
Set the environment variables in /etc/profile using an editor like vim.
### Hadoop Environment Variables ###
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export PATH=$PATH:$HADOOP_HOME/bin
Now create the Hadoop System Directories with our hadoop user.
mkdir -p /home/hadoop/mapred/local
mkdir -p /home/hadoop/mapred/system
Now create the other directories as root, but give them ownership permissions for the hadoop user.
mkdir -p /var/log/hadoop
chown hadoop:hadoop /var/log/hadoop
mkdir -p /var/run/hadoop
chown hadoop:hadoop /var/run/hadoop
The important data and mapred directories:
mkdir -p /disk1/hadoop/hdfs/data
mkdir -p /disk1/hadoop/mapred/local
chown -R hadoop:hadoop /disk1/hadoop
If you are adding a datanode to an existing Hadoop system, you should add an entry to /etc/hosts for every new datanode.
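For example, the /etc/hosts entries might look like this (the IPs and hostnames here are hypothetical; use your own):

```
192.168.1.10   namenode.example.org    namenode
192.168.1.21   datanode1.example.org   datanode1
```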
Set up passphrase-less SSH login from the namenode to the datanodes. The idea is to copy the namenode’s public key, id_dsa.pub, to each new datanode’s /home/hadoop/.ssh/authorized_keys. If you don’t know how to create the keys, follow this link; it is explained very lucidly.
# create the .ssh directory if it does not exist
mkdir -p /home/hadoop/.ssh
scp email@example.com:/home/hadoop/.ssh/authorized_keys /home/hadoop/.ssh
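If you still need to generate the key pair on the namenode, a sketch follows. Note this uses an RSA key rather than the DSA key (id_dsa) mentioned above, since recent OpenSSH releases have dropped DSA key generation; the procedure is otherwise the same:

```shell
# on the namenode: create a key pair with an empty passphrase,
# then authorize it locally (RSA shown; the post uses DSA)
mkdir -p ~/.ssh && chmod 700 ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa -q
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

Then copy authorized_keys to each datanode with scp, as shown above.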
If you are setting up the namenode, you need to format HDFS before you start the daemons. Do not format a running Hadoop filesystem; this will erase all your data. Before formatting, ensure that the dfs.name.dir directory exists. More on this here: http://wiki.apache.org/hadoop/GettingStartedWithHadoop
hadoop namenode -format
Stop the cluster if it is running.
Do add the IP of the new datanode to conf/slaves and conf/includes, and restart/start the cluster.
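A restart sketch, run on the namenode. It assumes $HADOOP_HOME is set as in the /etc/profile snippet above and uses the 0.20.x control scripts, which read conf/slaves to find the datanodes:

```shell
# restart the cluster from the namenode (0.20.x control scripts;
# assumes $HADOOP_HOME points at the install)
HADOOP_HOME=${HADOOP_HOME:-/home/hadoop/hadoop}
if [ -x "$HADOOP_HOME/bin/stop-all.sh" ]; then
  "$HADOOP_HOME/bin/stop-all.sh"    # stop HDFS and MapReduce daemons
  "$HADOOP_HOME/bin/start-all.sh"   # start them again, reading conf/slaves
fi
```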
This should get you up and running. This is by no means a complete listing, but I have tried to keep it short and clean. I’ll write more on the configuration files and other administrative tasks in later posts. Comments and suggestions appreciated! 🙂