Sunday 26 May 2013

Hadoop Lab

D1: Getting Started

 

Here is a quick reference for installing and running a sample M/R program in Hadoop on a single node.

Download the latest Hadoop distribution, hadoop-X.tar.gz, from here. At the
time of writing, the latest release is hadoop-1.2.0.


1. Untar hadoop-1.2.0.tar.gz.

                $ tar -xvf hadoop-1.2.0.tar.gz
                $ cd hadoop-1.2.0

2. JAVA_HOME:  

Make sure that Java is available on your sandbox. Then open the script 
conf/hadoop-env.sh, uncomment the JAVA_HOME line, and set it to the Java installation path.

        Eg: export JAVA_HOME=/usr/local/jdk1.6.0_38/
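This edit can also be scripted. A minimal sketch, assuming the stock hadoop-env.sh still contains its default commented-out JAVA_HOME line (the JDK path is just an example; use your own):

```shell
# Uncomment and set JAVA_HOME in conf/hadoop-env.sh in one step.
# Assumes the file has a line like "# export JAVA_HOME=...";
# the JDK path below is an example -- substitute your installed path.
sed -i 's|^# *export JAVA_HOME=.*|export JAVA_HOME=/usr/local/jdk1.6.0_38/|' conf/hadoop-env.sh
```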

3. Configuration

To set up a pseudo-cluster on a single box, you need to add the following
to the three config files below to get started.

        a.  conf/core-site.xml :

            <configuration>
                      <property>
                         <name>fs.default.name</name>
                         <value>hdfs://127.0.0.1:9000</value>
                      </property>
           </configuration>

      b. conf/hdfs-site.xml

          <configuration>
                <property>
                  <name>dfs.replication</name>
                  <value>1</value>
               </property>
         </configuration>

   
     c. conf/mapred-site.xml

        <configuration>
               <property>
                     <name>mapred.job.tracker</name>
                     <value>127.0.0.1:9001</value>
              </property>
        </configuration>       

Note: We have set the HDFS and map-reduce machines as localhost. If we were using a cluster, we would have given the target machine's IP. One interesting point to remember here is that we can use the same set of conf files on all the machines in the cluster.
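Since a typo in any of these XML files will stop the daemons from starting, a quick well-formedness check can save debugging time. A minimal sketch, assuming python3 is on the PATH (any XML parser would do):

```shell
# Parse each config file; a well-formed file prints OK, a broken one raises an error
for f in conf/core-site.xml conf/hdfs-site.xml conf/mapred-site.xml; do
  python3 -c "import sys, xml.dom.minidom; xml.dom.minidom.parse(sys.argv[1])" "$f" \
    && echo "$f OK"
done
```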

4. Set up SSH : 

We should set up passwordless SSH so that we can log in to any machine in the cluster
from any other machine. As this is a single-node pseudo-cluster setup, we
should set up a passwordless SSH login to localhost. Try
  
                         $ ssh localhost

If a prompt appears for a password, skip this and jump to step 5. If instead you
see a message like "Connection refused on port 22", it's because sshd is
not up.
                         $ /etc/init.d/sshd start

Now try 'ssh localhost'; if it prompts for a password, go to step 5.
If you don't see the sshd binary, get it from your Linux distribution's repo.
For Ubuntu users:
           $ sudo apt-get install openssh-server
                                       or
download from here and install it.
   
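The passwordless login itself is done by generating a keypair with an empty passphrase and authorizing the public key for localhost. A minimal sketch of the usual key-based setup, assuming the default ~/.ssh paths (adjust if your keys live elsewhere, and note ssh-keygen will ask before overwriting an existing key):

```shell
# Generate an RSA keypair with an empty passphrase so ssh never prompts
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Authorize our own public key for logins to this machine (localhost)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# This should now log in without a password prompt
ssh localhost echo ok
```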


Friday 24 May 2013

Data Pipeline Ecosystem


Data I/O
  • Avro
  • Java Serialization

Data Streaming
  • Kafka
  • ActiveMQ
  • RabbitMQ

Data Stores (NoSQL)
  • HBase
  • Cassandra
  • MongoDB
  • CouchDB
 
Data Storage
  • HDFS

Data Transfer
  • Sqoop
  • Flume

Data Exploring/Retrieval
  • Hive
  • Pig

Data Analysis/Computations
  • Map/Reduce, YARN
  • Aggregation
  • CQL


Job Schedulers/Co-ordinators
  • Oozie
  • ZooKeeper
  • Apache Mesos

Data Management [New]
  • Apache Falcon