Hadoop Series: Pseudo-Distributed Mode Hadoop Installation

In this tutorial, we will learn the steps required to set up Hadoop on a single node, also called pseudo-distributed mode. Later in this series we will also set up Hadoop on multiple machines, and we will cover MapReduce, Hive, Pig and more. So stay tuned, here it comes... :)

We need the following to start with.
Ubuntu: I always prefer Ubuntu as my Linux flavour. However, if you are using a very low-end machine with very little RAM, you can install Lubuntu instead. Lubuntu is very lightweight, so any low-end machine should work. If you have a Windows machine, you can install Hadoop using Cygwin, VirtualBox or VMware Player.

Here we go... steps for installing Hadoop.
If you are using Windows, follow steps 1 to 4 first.

1. Download and install VirtualBox from https://www.virtualbox.org/wiki/Downloads

2. Download Ubuntu from http://www.ubuntu.com/download/desktop. If you have a low-configuration
machine, you can use Lubuntu instead; download it from the Lubuntu website.
In both cases you will get a .iso file.

3. Start VirtualBox and create a new VM using the downloaded .iso file.

4. Using the .iso file, install Ubuntu/Lubuntu.

5. After Linux is installed, start a command prompt (Terminal) window. In Lubuntu you
can find Terminal under Accessories. In Ubuntu, just type "terminal" in the search box of
the Dash at the top left.

6. Add a new group called "hadoop" using the following command.

              sudo addgroup hadoop

It will prompt for a password; give the password of the user you are logged in as. sudo is used when we want to run a command as the super user (the admin of the system, you could say).
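
If you want to confirm that the group was created, an optional quick check is the following command (it just reads the group database, nothing more):

              getent group hadoop

It should print a line like hadoop:x:1001: (the group id will differ from machine to machine).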

7. Add a new user for Hadoop in the group "hadoop".

             sudo adduser --ingroup hadoop hduser

hduser is the name of the new user. Once you hit enter for this command, it will ask for a few things, like the password you want to set, the display name of the user, etc. Give appropriate answers; you can set the display name to hduser as well for easy maintenance.
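
Once the user exists, you can optionally confirm its group membership with:

             id hduser

The output should list hadoop as the primary group of hduser.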

8. Now add hduser to the list of sudoers, so that you can run any command with sudo when you are logged in as hduser.

             sudo adduser hduser sudo

9. Now log out and log in as hduser.

10. Once you are logged in as hduser, open a terminal window as explained earlier.
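
Before going further, it is worth confirming that the sudo rights from step 8 took effect for hduser. An optional quick check:

             sudo whoami

It will ask for hduser's password and should then print "root".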

11. Hadoop is developed in Java, so Java should be installed on your machine before we start using Hadoop. We need Java 1.5+, meaning any version of Java newer than 1.5 will work.
I am going to install Java 6.

            sudo apt-get install openjdk-6-jdk

    apt-get is the package manager for Lubuntu/Ubuntu. It helps us install different software very easily: you just mention the name of the software you want to install, and it checks its repository list for the corresponding link from which it can download that software, then installs it. It does even more than this, but I just wanted to put it in simple words for people who are new to the Linux environment.
   openjdk-6-jdk is one of the implementations of Java 6. You can use Oracle's Java implementation as well.
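
After the install finishes, you can confirm which Java is active and where it lives; we will need that path later for JAVA_HOME in hadoop-env.sh. The exact path varies by machine and architecture, but these commands usually reveal it:

            java -version
            readlink -f $(which java)

The second command typically prints something like /usr/lib/jvm/java-6-openjdk-i386/jre/bin/java; everything before /jre/bin/java is your Java home.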

12. Now we need to install an SSH server. SSH stands for Secure Shell. In simple words, if I want to log in remotely from one Linux machine to another Linux machine, I use SSH. SSH gives me access to the shell of the remote machine.
          
           sudo apt-get install openssh-server

      If I want to log in to machine X remotely using SSH, then openssh-server should be installed on that machine; this is the command for installing it. In our case we are creating a single-machine, pseudo-distributed mode cluster, so we install openssh-server on the local machine.
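
You can optionally confirm that the SSH daemon is running before moving on:

           sudo service ssh status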

13. Once SSH is installed, we can log in to a remote machine using the following command

        ssh <ip-address>

     If you try ssh localhost, you will notice that it prompts you for a password. Now we want to make this login password-less. One way of doing that is to use keys. We can generate keys using the following command.

       ssh-keygen -t rsa -P ""

It will prompt you for a path to store the keys; don't type anything, just press enter.
This command will generate two keys under "/home/hduser/.ssh/": id_rsa and id_rsa.pub.
id_rsa is the private key.
id_rsa.pub is the public key.
    If I want to log in to a remote machine X, I share my public key with machine X. In our case it is the local machine, so the following command is used.

      ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@localhost

This will prompt for a password; give the password for hduser.
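
If ssh-copy-id is not available on your system, the same effect can be achieved manually; a rough equivalent, assuming the default key path from above, is:

      cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
      chmod 600 /home/hduser/.ssh/authorized_keys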

14. You can verify everything using ssh localhost. It should be password-less; if it is not, then your SSH configuration may be wrong.

15. Download Hadoop 1.0.3 (the hadoop-1.0.3 release tarball) from the Apache Hadoop site.

16. Extract Hadoop and put it in the folder "/home/hduser/hadoop".
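
The exact extraction commands depend on where the browser saved the archive; assuming it was downloaded to /home/hduser/Downloads, something like the following works:

      cd /home/hduser
      tar -xzf Downloads/hadoop-1.0.3.tar.gz
      mv hadoop-1.0.3 /home/hduser/hadoop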

17. Now we need to make changes in the Hadoop configuration files. You will find these files in the "/home/hduser/hadoop/conf" folder.
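
You can edit the files in this folder with any text editor; for example, with nano (any editor you prefer will do):

      nano /home/hduser/hadoop/conf/hadoop-env.sh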

18. There are 4 important files in this folder:

     a) hadoop-env.sh
     b) hdfs-site.xml
     c) mapred-site.xml
     d) core-site.xml

hadoop-env.sh is a file that contains Hadoop environment related properties. Here we can set properties like where the Java home is, what the heap memory size is, what the Hadoop classpath is, which version of IP to use, etc. We will set the Java home in this file. For me the Java home is "/usr/lib/jvm/java-6-openjdk-i386", so put the following line in the file and save it.

     export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386

hdfs-site.xml is the file that contains properties related to HDFS (the Hadoop Distributed File System). We need to set the replication factor here. By default the replication factor is 3; since we are installing Hadoop on a single machine, we will set it to 1. Copy the following in between the configuration tags in the file.

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>

mapred-site.xml is the file that contains properties related to MapReduce. Here we will set the host and port of the machine on which the JobTracker runs. Copy the following in between the configuration tags.

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

core-site.xml is the property file that contains properties which are common to, or used by, both MapReduce and HDFS. Here we will set the host and port of the machine on which the NameNode will be running. The other property tells Hadoop where it should store files like the fsimage, blocks, etc. Copy the following in between the configuration tags.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/hadoop_tmp_files</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
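
One thing the config file does not do for you is create the directory named in hadoop.tmp.dir. Creating it up front, owned by hduser, avoids permission errors when the daemons start; a minimal sketch (the path must match whatever you put in core-site.xml):

      mkdir -p /home/hduser/hadoop_tmp_files
      chmod 750 /home/hduser/hadoop_tmp_files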

Now open a terminal and format the NameNode with the following command. The NameNode should be formatted only once, before you start using your Hadoop cluster; if you format the NameNode later, you will lose all the data stored on HDFS. Notice that the "/home/hduser/hadoop/bin/" folder contains all the important scripts to start Hadoop, stop Hadoop, access HDFS, format HDFS, etc.

          /home/hduser/hadoop/bin/hadoop namenode -format
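
Typing the full path to the bin folder every time gets tedious. Optionally, you can append the following line to /home/hduser/.bashrc and open a new terminal, after which commands like hadoop and start-all.sh work without the full path:

          export PATH=$PATH:/home/hduser/hadoop/bin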

Now you can start Hadoop using the following command.

         /home/hduser/hadoop/bin/start-all.sh

You can check whether Hadoop has started using the following command

         jps

It shows all running Java processes. It should show the following processes.

1. NameNode
2. SecondaryNameNode
3. DataNode
4. JobTracker
5. TaskTracker

If you can see these processes, your Hadoop single-node instance is up and running.
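
At this point you can run a quick sanity check against HDFS, and stop all the daemons again when you are done, using the same bin folder as before:

         /home/hduser/hadoop/bin/hadoop fs -ls /
         /home/hduser/hadoop/bin/stop-all.sh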

Hope this will be useful for you. Please share your feedback so that I can improve on these posts. I will also share a video on this topic. Please feel free to share your questions as well. Stay tuned for a few more blogs on Hadoop and machine learning, and lots of videos :)




   

