
Hadoop-Cluster

MapReduce on a cluster.

Prerequisites

You need at least two Linux-based hosts to set up a Hadoop cluster: one acts as the master and the others act as slaves. You can add any number of slaves to your cluster. Here, for the sake of simplicity and demonstration, I have used two hosts, one as the master and the other as the slave.

Step-by-Step Process

You might need to update your system first:

apt-get update

Hadoop runs on Java. If you don't already have Java installed, install it using:

apt-get install default-jdk -y

Check the Java version:

java -version


There are many versions of Hadoop available online. Select one of the stable releases and download it using the wget command.

http://www.apache.org/dyn/closer.cgi/hadoop/common/

wget http://apache.mirrors.lucidnetworks.net/hadoop/common/stable/hadoop-3.2.0.tar.gz
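Optionally, you can verify the download by computing its SHA-512 checksum and comparing it against the checksum published for the release on the Apache downloads page (a quick sanity check, not a substitute for GPG verification):

sha512sum hadoop-3.2.0.tar.gz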

Once the download is complete, extract the package:

tar -xvzf hadoop-3.2.0.tar.gz

Move the extracted files to the local directory /usr/local/hadoop:

mv hadoop-3.2.0 /usr/local/hadoop

Open the hadoop-env.sh file using the nano command and look for the export JAVA_HOME= line. Uncomment it and set JAVA_HOME, either to a static value or to a dynamically derived one:

nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
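If you want to confirm what the dynamic form resolves to on your machine before committing it to the file, you can echo it first:

echo $(readlink -f /usr/bin/java | sed "s:bin/java::")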

Check that Hadoop is installed correctly by running the command below:

/usr/local/hadoop/bin/hadoop


Create a directory for the compiled WordCount classes.

mkdir wordcount_classes

Compile the Java file with the following command (if the Java file is not in the present working directory, use the full path to where it is located):

javac -classpath $(/usr/local/hadoop/bin/hadoop classpath) -d wordcount_classes/ '/home/paras/Downloads/WordCount.java'

You can print the classpath yourself by issuing: $ echo $(/usr/local/hadoop/bin/hadoop classpath)


Consolidate the files in the wordcount_classes/ directory into a single jar file:

jar -cvf wc.jar -C wordcount_classes/ .
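To confirm that the classes were packaged, you can list the archive's contents:

jar -tf wc.jar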


Run the jar file in Hadoop:

/usr/local/hadoop/bin/hadoop jar wc.jar WordCount /usr/input /output
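In standalone mode the job reads and writes the local filesystem, so assuming the job succeeded and wrote its result to the local /output directory, you can inspect it directly:

cat /output/part-r-00000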

Hadoop in Pseudo-Distributed Mode

Edit core-site.xml:

sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following configuration:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>


Edit hdfs-site.xml:

sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following configuration:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>


Create a public/private key pair for passwordless SSH:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

$ chmod 0600 ~/.ssh/authorized_keys

Now verify that you can log in without a password:

$ ssh localhost

Format the filesystem:

/usr/local/hadoop/bin/hdfs namenode -format


Start the NameNode, Secondary NameNode, and DataNode daemons:

/usr/local/hadoop/sbin/start-dfs.sh
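Once the script finishes, you can check which daemons are running with jps (a process-listing tool that ships with the JDK); you should see NameNode, SecondaryNameNode, and DataNode:

jps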


Create the directory in HDFS and put the input file into it:

$ /usr/local/hadoop/bin/hdfs dfs -mkdir /user 
$ /usr/local/hadoop/bin/hdfs dfs -mkdir /user/paras
$ /usr/local/hadoop/bin/hdfs dfs -put '/home/paras/Downloads/WordCountText.txt' /user/paras
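You can verify the upload by listing the directory:

$ /usr/local/hadoop/bin/hdfs dfs -ls /user/paras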


Run Hadoop to execute the jar file:

$ /usr/local/hadoop/bin/hadoop jar wc.jar WordCount /user/paras /output
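This time the output lands in HDFS, so assuming the job completed, you can print the result with:

$ /usr/local/hadoop/bin/hdfs dfs -cat /output/part-r-00000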


You can see the contents of the output directory in the NameNode web UI: go to http://localhost:50070/ for Hadoop 2.x and older, or http://localhost:9870/ for Hadoop 3.x.


Opening the file part-r-00000 shows the word counts.


Run WordCount on the Hadoop cluster with 2 VMs

  1. Configure two VMs.
  2. Rename the master's hostname to HadoopMaster and the slave's hostname to HadoopSlave.
  3. Install Java on both.
  4. Install Hadoop on both, then make the changes to the configuration files listed below.

core-site.xml:

sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://HadoopMaster:9000</value>
    </property>
</configuration>

Note: on recent Hadoop releases fs.default.name is a deprecated alias for fs.defaultFS; both are still accepted.


hdfs-site.xml:

HadoopMaster:

sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>
    </property>
</configuration>


HadoopSlave:

sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>
    </property>
</configuration>


yarn-site.xml

On both HadoopMaster and HadoopSlave:

sudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml

<configuration>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>HadoopMaster:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>HadoopMaster:8035</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>HadoopMaster:8050</value>
    </property>
</configuration>


mapred-site.xml:

sudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>HadoopMaster:5431</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Note: older guides use mapred.framework.name and mapreduce.job.tracker; the current keys are mapreduce.framework.name and mapreduce.jobtracker.address.

masters (only on the master node, i.e. HadoopMaster):

sudo gedit /usr/local/hadoop/etc/hadoop/masters

HadoopMaster

workers (only on the master node, i.e. HadoopMaster):

sudo gedit /usr/local/hadoop/etc/hadoop/workers

HadoopSlave

hosts (on both HadoopMaster and HadoopSlave):

sudo gedit /etc/hosts

127.0.0.1	localhost
<master node's IPv4 Address> HadoopMaster
<slave node's IPv4 Address> HadoopSlave
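After saving the file on both machines, it is worth checking that each node can resolve and reach the other, for example from the master:

ping -c 1 HadoopSlave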

Note: if the Hadoop version is 2.x or older, you might need to edit the slaves file instead of workers.

~/.bashrc: open the file using sudo gedit ~/.bashrc and add the lines below at the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
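Reload the file so the new PATH takes effect in the current shell, then confirm that the hadoop binary is found:

source ~/.bashrc
hadoop version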


If you have not configured your hadoop-env.sh file yet, edit it as described in the section above (single-node standalone setup).


Setting up a passwordless connection between master and slave: refer to the link below to set up passwordless SSH to localhost and to the remote machine. This has to be done on the master node as well as the slave node.

https://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/
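In short (a minimal sketch, assuming the same user, paras as in the paths above, exists on both hosts and a key pair has already been generated as in the previous section), copy your public key to the other node:

# On HadoopMaster:
ssh-copy-id paras@HadoopSlave

# On HadoopSlave:
ssh-copy-id paras@HadoopMaster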

Format the filesystem:

/usr/local/hadoop/bin/hdfs namenode -format


Start the NameNode, Secondary NameNode, and DataNode daemons:

/usr/local/hadoop/sbin/start-dfs.sh
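Since yarn-site.xml and mapred-site.xml were configured above, you will likely also want to start YARN on the master, then run jps on each node to confirm the daemons: the master should show NameNode, SecondaryNameNode, and ResourceManager, and the slave should show DataNode and NodeManager:

/usr/local/hadoop/sbin/start-yarn.sh
jps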


Create the directory in HDFS and put the input file into it:

$ /usr/local/hadoop/bin/hdfs dfs -mkdir /user 
$ /usr/local/hadoop/bin/hdfs dfs -mkdir /user/paras
$ /usr/local/hadoop/bin/hdfs dfs -put '/home/paras/Downloads/WordCountText.txt' /user/paras


Run Hadoop to execute the jar file:

$ /usr/local/hadoop/bin/hadoop jar wc.jar WordCount /user/paras /output


References

http://pingax.com/install-apache-hadoop-ubuntu-cluster-setup/

http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

https://www.linuxhelp.com/how-to-install-hadoop-in-ubuntu

https://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/
