Wednesday, February 3, 2016

Hadoop (Linux)

Hadoop 2.7.2 Single-Node Installation on Ubuntu 14.04.1 LTS
In this chapter, we'll install a single-node Hadoop cluster backed by the Hadoop Distributed File System on Ubuntu.
Update Ubuntu
sudo apt-get update
Installing Java
Java is the primary requirement for running Hadoop on any system, so make sure Java is installed before you continue.
Installing Java 8 on Ubuntu
First, add the WebUpd8 Team Java PPA repository to your system and install Oracle Java 8 using the following commands.
$ sudo add-apt-repository ppa:webupd8team/java

$ sudo apt-get update


$ sudo apt-get install oracle-java8-installer


Verify Installed Java Version

After successfully installing Oracle Java in the step above, verify the installed version using the following command.
adminvm@admin:~$ java -version


Configuring Java Environment

The WebUpd8 PPA also provides a package that sets the Java environment variables. Install it using the following command.
$ sudo apt-get install oracle-java8-set-default
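To confirm that the variables were set (a quick check; open a new terminal or log in again first, since the package writes them to a profile script):

$ echo $JAVA_HOME

This should print /usr/lib/jvm/java-8-oracle, the same path used for Hadoop later on.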


Creating Hadoop User

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hduser


$ sudo adduser hduser sudo
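Optionally, confirm that hduser now belongs to both the hadoop and sudo groups:

$ groups hduser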

Installing SSH

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost.
So we need to have SSH up and running on our machine, configured to allow public key authentication.
Hadoop uses SSH (to access its nodes) which would normally require the user to enter a password. However, this requirement can be eliminated by creating and setting up SSH certificates using the following commands. If asked for a filename just leave it blank and press the enter key to continue.

$ sudo apt-get install ssh


Create and Set Up SSH Certificates

$ su hduser
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
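Before moving on, we can check that passwordless SSH to localhost works (the first connection may ask to confirm the host key):

$ ssh localhost
$ exit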


Downloading Hadoop 2.7.2

Now download the Hadoop 2.7.2 binary archive using the command shown below. You can also select an alternate download mirror to increase the download speed.

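A typical download, assuming the official Apache archive is used (any Apache mirror with the same file path also works):

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz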

$ tar xvzf hadoop-2.7.2.tar.gz

Create folder /usr/local/hadoop

$ sudo mkdir /usr/local/hadoop


$ sudo mv hadoop-2.7.2 /usr/local/hadoop/


$ sudo chown -R hduser:hadoop /usr/local/hadoop
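A quick listing confirms that hduser now owns the Hadoop files:

$ ls -l /usr/local/hadoop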


Setup Configuration Files

The following files will have to be modified to complete the Hadoop setup:
  1. ~/.bashrc
  2. /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/hadoop-env.sh
  3. /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/core-site.xml
  4. /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/mapred-site.xml (copied from mapred-site.xml.template)
  5. /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/yarn-site.xml
  6. /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/hdfs-site.xml

1. ~/.bashrc:
Before editing the .bashrc file in our home directory, we need to find the path where Java has been installed to set the JAVA_HOME environment variable using the following command:
update-alternatives --config java
Now we can append the following to the end of ~/.bashrc:
hduser@adminvm:~$ gedit ~/.bashrc

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_INSTALL=/usr/local/hadoop/hadoop-2.7.2
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END

2. Refresh the .bashrc using the following command:
hduser@adminvm:~$ source ~/.bashrc
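With the new variables loaded, the hadoop command should now be on the PATH; a quick sanity check:

$ hadoop version

This should report Hadoop 2.7.2.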

3. We need to set JAVA_HOME by modifying the hadoop-env.sh file.

$ cd /usr/local/hadoop/hadoop-2.7.2/etc/hadoop

$ sudo gedit hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

Adding the above statement in the hadoop-env.sh file ensures that the value of JAVA_HOME variable will be available to Hadoop whenever it is started up.
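To double-check the edit, the variable can be grepped from the file:

$ grep JAVA_HOME /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/hadoop-env.sh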
4. Edit the core-site.xml
The core-site.xml file contains configuration properties that Hadoop uses when starting up.
This file can be used to override the default settings that Hadoop starts with.
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp

Open the file and enter the following in between the <configuration></configuration> tag:
sudo gedit /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/core-site.xml

<configuration>
<property>
 <name>hadoop.tmp.dir</name>
 <value>/app/hadoop/tmp</value>
 <description>A base for other temporary directories.</description>
</property>

<property>
 <name>fs.default.name</name>
 <value>hdfs://localhost:54310</value>
 <description>The name of the default file system.  A URI whose
 scheme and authority determine the FileSystem implementation.  The
 uri's scheme determines the config property (fs.SCHEME.impl) naming
 the FileSystem implementation class.  The uri's authority is used to
 determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
5. Edit the mapred-site.xml
By default, the hadoop-2.7.2 folder contains a mapred-site.xml.template file, which has to be copied/renamed to mapred-site.xml:
cp /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/mapred-site.xml

sudo gedit /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/mapred-site.xml

The mapred-site.xml file is used to specify which framework is being used for MapReduce.
We need to enter the following content in between the <configuration></configuration> tag:


<configuration>
   <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
   </property>
</configuration>


6. Edit the yarn-site.xml
sudo gedit /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/yarn-site.xml

We need to enter the following content in between the <configuration></configuration> tag:

<configuration>
   <property>
       <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
   </property>
</configuration>

7. Edit the hdfs-site.xml
The  hdfs-site.xml file needs to be configured for each host in the cluster that is being used.
It is used to specify the directories which will be used as the namenode and the datanode on that host.
Before editing this file, we need to create two directories which will contain the namenode and the datanode for this Hadoop installation.
This can be done using the following commands:

sudo mkdir -p /usr/local/hadoop_data_store/hdfs/namenode

sudo mkdir -p /usr/local/hadoop_data_store/hdfs/datanode

sudo chown -R hduser:hadoop /usr/local/hadoop_data_store


Open the file and enter the following content in between the <configuration></configuration> tag:
sudo gedit /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/hdfs-site.xml

<configuration>
<property>
 <name>dfs.replication</name>
 <value>1</value>
 <description>Default block replication.
 The actual number of replications can be specified when the file is created.
 The default is used if replication is not specified in create time.
 </description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_data_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_data_store/hdfs/datanode</value>
</property>
</configuration>
Format the New Hadoop Filesystem
hadoop namenode -format


Note that hadoop namenode -format command should be executed once before we start using Hadoop.
If this command is executed again after Hadoop has been used, it'll destroy all the data on the Hadoop file system.
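Note that in Hadoop 2.x the hadoop namenode command is deprecated; the equivalent current form, run as hduser, is:

$ hdfs namenode -format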
Starting Hadoop
Now it's time to start the newly installed single node cluster.
We can use start-all.sh, or start-dfs.sh and start-yarn.sh separately.
cd /usr/local/hadoop/hadoop-2.7.2/sbin
sudo su hduser
start-all.sh

We can check if it's really up and running:
hduser@admin:/usr/local/hadoop/hadoop-2.7.2/sbin$ jps


Access Hadoop Services in Browser
By default, the Hadoop NameNode web UI starts on port 50070. Access your server on port 50070 in your favorite web browser.
http://localhost:50070/ - web UI of the NameNode daemon
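If a browser is not handy, the same check can be done from the terminal (a minimal sketch using curl, which may need to be installed separately). If the daemon is running, curl prints the HTTP response headers:

$ curl -sI http://localhost:50070/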



Create a Directory in HDFS using the following commands

hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/hduser
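With the home directory in place, copy a file into HDFS and list it to confirm that writes work (the /etc/hosts file here is just an example):

$ hdfs dfs -put /etc/hosts /user/hduser/
$ hdfs dfs -ls /user/hduser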



The HDFS directories can also be viewed through the NameNode web UI's file system browser.
