D1: Getting Started
Here is a quick reference for installing and running a sample MapReduce (M/R) program in Hadoop on a single node.
Download the latest Hadoop distribution, hadoop-X.tar.gz, from here. At the
time of writing this article the latest release is hadoop-1.2.0.
1. Untar hadoop-1.2.0.tar.gz.
$ tar -xvf hadoop-1.2.0.tar.gz
$ cd hadoop-1.2.0
2. JAVA_HOME:
Make sure that Java is available on your sandbox. Now open the script
conf/hadoop-env.sh, uncomment the JAVA_HOME line, and set it to your Java
installation path.
E.g.: export JAVA_HOME=/usr/local/jdk1.6.0_38/
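Before editing conf/hadoop-env.sh, it can help to confirm the path you plan to use actually points at a JDK. A minimal sketch (the JDK path here is only the example from above; substitute your own install location):

```shell
# The path below is an example -- replace it with your actual JDK location.
JAVA_HOME=/usr/local/jdk1.6.0_38/
if [ -x "${JAVA_HOME}/bin/java" ]; then
  echo "JAVA_HOME looks valid"
else
  echo "adjust JAVA_HOME"
fi
```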
3. Configuration:
To set up a pseudo-cluster on a single box, add the following to the
three config files below to get started.
a. conf/core-site.xml :
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
b. conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
c. conf/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>127.0.0.1:9001</value>
</property>
</configuration>
Note: We have set the HDFS and MapReduce machines to localhost. If
we were using a cluster, we would have given the target machine's IP instead.
One interesting point to remember here is that we can use the same set of
conf files on all the machines in the cluster.
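If you prefer to script the configuration, the three files above can be written in one pass. A minimal sketch, assuming you are inside the hadoop-1.2.0 directory (where Hadoop 1.x keeps its conf/ folder):

```shell
#!/bin/sh
# Write the three pseudo-cluster config files shown above.
# Run from inside the hadoop-1.2.0 directory.
mkdir -p conf

cat > conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://127.0.0.1:9000</value>
  </property>
</configuration>
EOF

cat > conf/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

cat > conf/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>127.0.0.1:9001</value>
  </property>
</configuration>
EOF
```

Since the same conf files work on every machine in a cluster, a script like this is also an easy way to push identical configuration to all nodes.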
4. Set up SSH :
We should set up passwordless SSH so we can log in to any machine in the
cluster from any other machine. As this is a single-node pseudo-cluster
setup, we only need passwordless SSH login to localhost. Try
$ ssh localhost
If a prompt appears for a password, skip this and jump to step 5. If instead
you see a message such as "Connection refused on port 22", it's because sshd
is not up.
$ /etc/init.d/sshd start
Now try 'ssh localhost' again; if it prompts for a password, go to step 5.
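A non-interactive way to run this check: ssh's BatchMode option makes the connection fail immediately instead of prompting, so a script can tell "already set up" apart from "needs key setup or sshd". A sketch:

```shell
# BatchMode=yes makes ssh fail rather than prompt for a password,
# so a zero exit status means passwordless login already works.
if ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true 2>/dev/null; then
  echo "passwordless ssh to localhost works"
else
  echo "set up keys (step 5) or start sshd"
fi
```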
If you don't see the sshd binary, get it from your Linux distribution's repo;
for Ubuntu users:
$ sudo apt-get install openssh-server
or
download from here and install using command