HBase administration using the Java API, with code examples

I have not given a formal introduction to HBase, but this post will help those who have already set up an active HBase installation. I will be dealing with the administrative work that can be done on HBase using the Java API, which is vast and easy to use. I have explained the code wherever I found it necessary, but this post is by no means complete. As usual, I have provided the full code at the end. Cheers. 🙂

If you want to follow along, you had better import all of this; if you are using an IDE like Eclipse, you’ll follow along just fine as it automatically fixes up your imports. The only thing you need to do is to set the classpath to include all the jar files from the Hadoop and/or HBase installation, especially the hadoop-0.*.*-core.jar and the jar files inside the lib folder. I’ll put up another post on that later.


import java.io.IOException;
import java.util.Collection;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HServerInfo;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.MasterNotRunningException;
import org.apache.hadoop.hbase.ZooKeeperConnectionException;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

1. Creating a table in HBase

    public void createTable (String tablename, String familyname) throws IOException {

        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor tabledescriptor = new HTableDescriptor(Bytes.toBytes(tablename));

        tabledescriptor.addFamily(new HColumnDescriptor (familyname));

        admin.createTable(tabledescriptor);

    }
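
For example, assuming a running cluster and the HbaseAdmin wrapper class from the full listing at the end of this post, a minimal call might look like this (the table and family names are just placeholders):

    HbaseAdmin admin = new HbaseAdmin();
    admin.createTable("testtable", "testfamily");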

2. Adding a column to an existing table

    public void addColumn (String tablename, String columnname) throws IOException{

        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.addColumn(tablename, new HColumnDescriptor(columnname));
        System.out.println("Added column : " + columnname + " to table " + tablename);
    }

3. Deleting a column from an existing table

    public void delColumn (String tablename, String columnname) throws IOException{

        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.deleteColumn(tablename, columnname);
        System.out.println("Deleted column : " + columnname + " from table " + tablename);
    }

4. Check if your HBase cluster is running properly

    public static void checkIfRunning() throws MasterNotRunningException, ZooKeeperConnectionException{
        //Create the required configuration.
        Configuration conf = HBaseConfiguration.create();
        //Check if HBase is running.
        try{
            HBaseAdmin.checkHBaseAvailable(conf);
        }catch(Exception e){
            System.err.println("Exception at " + e);
            System.exit(1);
        }
    }

5. Major compaction

    public void majorCompact (String tablename) throws IOException{

        //Create the required configuration.
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        try{
            // This only requests a major compaction; it runs asynchronously.
            admin.majorCompact(tablename);
            System.out.println("Major compaction requested for " + tablename);
        }catch(Exception e){
            System.out.println(e);
        }
    }

6. Minor compaction

    public void minorcompact(String trname) throws IOException, InterruptedException{
        //Create the required configuration.
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.compact(trname);
    }

7. Get the cluster status.

    public ClusterStatus getclusterstatus () throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        return admin.getClusterStatus();
    }

8. Get all the cluster details.

    public void printClusterDetails() throws IOException{
        ClusterStatus status = getclusterstatus();

        Collection<HServerInfo> serverinfo = status.getServerInfo();
        for (HServerInfo s : serverinfo){
            System.out.println("Servername " + s.getServerName());
            System.out.println("Hostname " + s.getHostname());
            System.out.println("Hostname:Port " + s.getHostnamePort());
            System.out.println("Info port " + s.getInfoPort());
            System.out.println("Server load " + s.getLoad().toString());
            System.out.println();
        }

        String version = status.getHBaseVersion();
        System.out.println("Version " + version);

        int regioncounts = status.getRegionsCount();
        System.out.println("Region count: " + regioncounts);

        int servers = status.getServers();
        System.out.println("Servers: " + servers);

        double averageload = status.getAverageLoad();
        System.out.println("Average load: " + averageload);

        int deadservers = status.getDeadServers();
        System.out.println("Dead servers: " + deadservers);

        Collection<String> deadservernames = status.getDeadServerNames();
        for (String s : deadservernames){
            System.out.println("Dead server name " + s);
        }
    }

9. Disable a table.

    public void disabletable(String tablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.disableTable(tablename);
    }

10. Enable a table

    public void enabletable(String tablename) throws IOException{
        //Create the required configuration.
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.enableTable(tablename);
    }

11. Delete a table.

    public void deletetable(String tablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.deleteTable(tablename);
    }

12. Check if table is available

   public void isTableAvailable(String tablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        boolean result = admin.isTableAvailable(tablename);
        System.out.println("Table " + tablename + " available ?" + result);
    }

13. Check if table is enabled

    public void isTableEnabled(String tablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        boolean result = admin.isTableEnabled(tablename);
        System.out.println("Table " + tablename + " enabled ?" + result);
    }

14. Check if table is disabled

    public void isTableDisabled(String tablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        boolean result = admin.isTableDisabled(tablename);
        System.out.println("Table " + tablename + " disabled ?" + result);
    }

15. Check if table exists.

    public void tableExists(String tablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        boolean result = admin.tableExists(tablename);
        System.out.println("Table " + tablename + " exists ?" + result);
    }

16. List all tables

    public void listTables () throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        for (HTableDescriptor table : admin.listTables()){
            System.out.println(table.getNameAsString());
        }
    }

17. Flush a table.

    public void flush(String tablename) throws IOException, InterruptedException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.flush(tablename);
    }

18. Shut down HBase.

    public void shutdown() throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        System.out.println("Shutting down..");
        admin.shutdown();
    }

19. Modify column for a table.

    @SuppressWarnings("deprecation")
    public void modifyColumn(String tablename, String columnname, String descriptor) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.modifyColumn(tablename, columnname, new HColumnDescriptor(descriptor));

    }

20. Modify an existing table.

    public void modifyTable(String tablename, String newtablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.modifyTable(Bytes.toBytes(tablename), new HTableDescriptor(newtablename));

    }

21. Split based on tablename.

    public void split(String tablename) throws IOException, InterruptedException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.split(tablename);
    }

22. Check if master is running.

    public void isMasterRunning() throws MasterNotRunningException, ZooKeeperConnectionException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin administer = new HBaseAdmin(conf);
        System.out.println( "Master running ? "+ administer.isMasterRunning());
    }

There are lots more; you can check the Java API for HBase and build what you need. These are the ones I found necessary.

The full listing of the code:

/*
 * Hbase administration basic tools.
 *
 * */

import java.io.IOException;
import java.util.Collection;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HServerInfo;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.MasterNotRunningException;
import org.apache.hadoop.hbase.ZooKeeperConnectionException;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseAdmin {

    public HbaseAdmin(){

    }

    public void addColumn (String tablename, String columnname) throws IOException{

        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.addColumn(tablename, new HColumnDescriptor(columnname));
        System.out.println("Added column : " + columnname + " to table " + tablename);
    }

    public void delColumn (String tablename, String columnname) throws IOException{

        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.deleteColumn(tablename, columnname);
        System.out.println("Deleted column : " + columnname + " from table " + tablename);
    }

    public void createTable (String tablename, String familyname) throws IOException {

        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor tabledescriptor = new HTableDescriptor(Bytes.toBytes(tablename));

        tabledescriptor.addFamily(new HColumnDescriptor (familyname));

        admin.createTable(tabledescriptor);

    }

    public void majorCompact (String tablename) throws IOException{

        //Create the required configuration.
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        try{
            // This only requests a major compaction; it runs asynchronously.
            admin.majorCompact(tablename);
            System.out.println("Major compaction requested for " + tablename);
        }catch(Exception e){
            System.out.println(e);
        }
    }

    public static void checkIfRunning() throws MasterNotRunningException, ZooKeeperConnectionException{
        //Create the required configuration.
        Configuration conf = HBaseConfiguration.create();
        //Check if HBase is running.
        try{
            HBaseAdmin.checkHBaseAvailable(conf);
        }catch(Exception e){
            System.err.println("Exception at " + e);
            System.exit(1);
        }
    }

    public void minorcompact(String trname) throws IOException, InterruptedException{
        //Create the required configuration.
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.compact(trname);
    }

    public void deletetable(String tablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.deleteTable(tablename);
    }

    public void disabletable(String tablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.disableTable(tablename);
    }

    public void enabletable(String tablename) throws IOException{
        //Create the required configuration.
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.enableTable(tablename);
    }

    public void flush(String tablename) throws IOException, InterruptedException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.flush(tablename);
    }

    public ClusterStatus getclusterstatus () throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        return admin.getClusterStatus();
    }

    public void printClusterDetails() throws IOException{
        ClusterStatus status = getclusterstatus();

        Collection<HServerInfo> serverinfo = status.getServerInfo();
        for (HServerInfo s : serverinfo){
            System.out.println("Servername " + s.getServerName());
            System.out.println("Hostname " + s.getHostname());
            System.out.println("Hostname:Port " + s.getHostnamePort());
            System.out.println("Info port " + s.getInfoPort());
            System.out.println("Server load " + s.getLoad().toString());
            System.out.println();
        }

        String version = status.getHBaseVersion();
        System.out.println("Version " + version);

        int regioncounts = status.getRegionsCount();
        System.out.println("Region count: " + regioncounts);

        int servers = status.getServers();
        System.out.println("Servers: " + servers);

        double averageload = status.getAverageLoad();
        System.out.println("Average load: " + averageload);

        int deadservers = status.getDeadServers();
        System.out.println("Dead servers: " + deadservers);

        Collection<String> deadservernames = status.getDeadServerNames();
        for (String s : deadservernames){
            System.out.println("Dead server name " + s);
        }

    }

    public void isTableAvailable(String tablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        boolean result = admin.isTableAvailable(tablename);
        System.out.println("Table " + tablename + " available ?" + result);
    }

    public void isTableEnabled(String tablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        boolean result = admin.isTableEnabled(tablename);
        System.out.println("Table " + tablename + " enabled ?" + result);
    }

    public void isTableDisabled(String tablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        boolean result = admin.isTableDisabled(tablename);
        System.out.println("Table " + tablename + " disabled ?" + result);
    }

    public void tableExists(String tablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        boolean result = admin.tableExists(tablename);
        System.out.println("Table " + tablename + " exists ?" + result);
    }

    public void shutdown() throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        System.out.println("Shutting down..");
        admin.shutdown();
    }

    public void listTables () throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        for (HTableDescriptor table : admin.listTables()){
            System.out.println(table.getNameAsString());
        }
    }

    @SuppressWarnings("deprecation")
    public void modifyColumn(String tablename, String columnname, String descriptor) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.modifyColumn(tablename, columnname, new HColumnDescriptor(descriptor));

    }

    public void modifyTable(String tablename, String newtablename) throws IOException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.modifyTable(Bytes.toBytes(tablename), new HTableDescriptor(newtablename));

    }

    public void split(String tablename) throws IOException, InterruptedException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.split(tablename);
    }

    public void isMasterRunning() throws MasterNotRunningException, ZooKeeperConnectionException{
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin administer = new HBaseAdmin(conf);
        System.out.println( "Master running ? "+ administer.isMasterRunning());
    }

    public static void main (String[] args) throws IOException{

        
        HbaseAdmin admin = new HbaseAdmin();

        //Check if Hbase is running properly
        HbaseAdmin.checkIfRunning();
        admin.printClusterDetails();

        //other functions based on arguments.

    }
}

And hey, I’ll update this post soon with more details, especially about compaction! Till then, cheers! 🙂


Shell scripting: Arithmetic using expr, bc and dc tools

You will definitely need to do some math on the shell sometime or the other. ‘expr’ has always been the most popular tool out there for doing complicated mathematical expressions. I was looking at some other options as well when I came across the bc and dc tools. I will explain each one of them in this post.

expr

This is by far the most famous way of doing some math on the bash shell. There are two main kinds of expressions: string expressions and the usual numerical ones. I will be writing about the latter.

expr 40 % 5
0
expr 40 / 5
8
expr 40 / 5 / 8
1
expr 40 / 5 / 8 + 1
2
expr 40 / 5 / 8 + 1 * 10
expr: syntax error

Of course, while doing multiplication you need to escape the ‘*’ with a backslash ‘\’. And thus,

expr 40 / 5 / 8 + 1 \* 10
11

The usual precedence rules for brackets, division, multiplication, addition and subtraction also govern here. Now let’s look at the others.

bc

bc is a language that supports arbitrary precision numbers with interactive execution of statements. It starts by processing code from all the files listed on the command line, in the order listed. A neat way to calculate stuff is:

echo 2*30/3 | bc
20
echo "20 + 5 * 3" | bc
35

Again this follows the basic BODMAS rules.

dc

Stands for desk calculator. It’s an interactive calculator on the shell. It supports basic arithmetic and uses the standard + – / * symbols, but they are entered after the operands (reverse Polish notation). Once you enter the operator, get the calculated output by entering ‘p’, similar to our ‘=’ symbol on a calculator. And you can keep going.

dc
98
9
*
p
882
10
/
p
88

If I find more useful tools, I’ll update this post. If you have better ideas to implement this, feel free to suggest!


Shell scripting: awk tutorial

This is one of the best Linux line-processing utilities I have come across. It can do all sorts of stuff: add, replace, find and index text, and basically much more. I’ll just get to the examples.

The Basics:

Suppose we have a file like:

echo $somefile
this_is_something_interesting

echo $somefile | awk -F '_' '{print toupper($3)}'
SOMETHING

Now, -F is the field delimiter, and we split the contents of the variable or file on that delimiter, which in our case is ‘_’. print is the standard function to print stuff out. $3 contains the third field, which in this case is ‘something’, and toupper() converts it to uppercase. Duh!

This example shows how awk converts a string into an array.

echo $time
10:20:30

hms=`echo $time | awk '{split($0,a,":"); print a[1], a[2], a[3]}'`
echo $hms
10 20 30

Here the delimiter is ‘:’. split() stores the pieces in the variable a, which acts like an array; a[2] contains the second field, 20, and so on.

Okay, so how do we get the second-to-last or the last field of a string?

c=`echo $i | awk 'BEGIN{FS="_"}{for (i=1; i<=NF; i++) if (i==NF-1) print $i}'` # NF holds the total number of fields; c gets the second-to-last field

This is the basic syntax. You start with BEGIN, where you set the field separator FS. Now, NF is a special variable that contains the number of fields. So we loop through the fields until we reach NF-1, the second-to-last one, which we check using the ‘if’ condition. Then just print it out of course!

Substitution using awk:
This is done using ‘sub’. The first and only the FIRST occurrence of ‘shower’ is replaced by ‘stream’. ‘$0’ means the entire string.

text=`echo $text | awk '{sub("shower","stream"); print $0}'` # substitution only first

Global substitution using awk:
Using ‘gsub’, every match in the entire string or file is replaced. The pattern a[a-z] matches an ‘a’ followed by any lowercase letter, and each match is replaced by x.

text1=`echo $text | awk '{gsub("a[a-z]","x"); print $0}'` # global, a followed by an alpha replaced by x

Similarly, the pattern here is ‘a*d’: zero or more ‘a’s followed by a ‘d’, with the whole match replaced by ‘tt’. (To match an ‘a’ followed by anything up to a ‘d’, you would use ‘a.*d’ instead.)

text2=`echo $text | awk '{sub("a*d","tt"); print $0}'` # zero or more a's followed by d, replaced with tt

Similarly for numbers:

cat=`echo $name | awk '{sub("[0-9]+",""); print $0}'` # removes the first run of digits

Another example:

short=`echo $name | awk '{gsub("[b-z]",""); print $0}'` # globally removes every character from b to z

Substring using awk:
Suppose we want just a part of the string. (8,8) means go to the 8th character, and get me the next 8 characters.

echo $caption
thisislinuxjunkies

object=`echo $caption | awk '{print substr($0,8,8)}'` # substring
echo $object
inuxjunk

Reading a particular set of lines:
‘NR’ is a special awk variable that holds the number of lines read so far.

cat $myfile
a
b
c
d
awk 'NR < 3' $myfile # prints lines while fewer than 3 have been read
a
b

This is not a complete set of things you get to do with awk. I’ll update this post if I find more neat tricks! Questions appreciated! 🙂


An HDFSClient for Hadoop using the native Java API, a tutorial

I’d like to talk about doing some day-to-day administrative tasks on the Hadoop system. Although the hadoop fs <commands> can get you to do most things, it’s still worthwhile to explore the rich Java API for Hadoop. This post is by no means complete, but it can get you started well.

The most basic step is to create an object of this class.

HDFSClient client = new HDFSClient();

Of course, you need to import a bunch of stuff. But if you are using an IDE like Eclipse, you’ll follow along just fine by importing these. This should work fine for the entire code.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

1. Copying from the local file system to HDFS.
Copies a local file onto HDFS. You do have the hadoop file system command to do the same.

hadoop fs -copyFromLocal <local fs> <hadoop fs>

I am not explaining much here, as the comments are quite helpful. Of course, when adding the configuration resources, make sure to point them at your Hadoop installation’s configuration files. For mine, it looks like this:

Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);

This is how the Java API looks:

public void copyFromLocal (String source, String dest) throws IOException {

Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);
Path srcPath = new Path(source);

Path dstPath = new Path(dest);
// Check if the destination directory exists
if (!(fileSystem.exists(dstPath))) {
System.out.println("No such destination " + dstPath);
return;
}

// Get the filename out of the file path
String filename = source.substring(source.lastIndexOf('/') + 1, source.length());

try{
fileSystem.copyFromLocalFile(srcPath, dstPath);
System.out.println("File " + filename + " copied to " + dest);
}catch(Exception e){
System.err.println("Exception caught! :" + e);
System.exit(1);
}finally{
fileSystem.close();
}
}

2. Copying files from HDFS to the local file system.

The hadoop fs command is the following.

hadoop fs -copyToLocal <hadoop fs> <local fs>

And this is how the Java method looks:

public void copyToLocal (String source, String dest) throws IOException {

Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);
Path srcPath = new Path(source);

Path dstPath = new Path(dest);
// Check if the source file exists
if (!(fileSystem.exists(srcPath))) {
System.out.println("No such source " + srcPath);
return;
}

// Get the filename out of the file path
String filename = source.substring(source.lastIndexOf('/') + 1, source.length());

try{
fileSystem.copyToLocalFile(srcPath, dstPath);
System.out.println("File " + filename + " copied to " + dest);
}catch(Exception e){
System.err.println("Exception caught! :" + e);
System.exit(1);
}finally{
fileSystem.close();
}
}

3. Renaming a file in HDFS.

You can use the mv command in this context.

hadoop fs -mv <this name> <new name>

public void renameFile (String fromthis, String tothis) throws IOException{
Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);
Path fromPath = new Path(fromthis);
Path toPath = new Path(tothis);

if (!(fileSystem.exists(fromPath))) {
System.out.println("No such source " + fromPath);
return;
}

if (fileSystem.exists(toPath)) {
System.out.println("Already exists! " + toPath);
return;
}

try{
boolean isRenamed = fileSystem.rename(fromPath, toPath);
if(isRenamed){
System.out.println("Renamed from " + fromthis + " to " + tothis);
}
}catch(Exception e){
System.out.println("Exception :" + e);
System.exit(1);
}finally{
fileSystem.close();
}

}

4. Upload or add a file to HDFS

public void addFile(String source, String dest) throws IOException {

// Conf object will read the HDFS configuration parameters
Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);

// Get the filename out of the file path
String filename = source.substring(source.lastIndexOf('/') + 1, source.length());

// Create the destination path including the filename.
if (dest.charAt(dest.length() - 1) != '/') {
dest = dest + "/" + filename;
} else {
dest = dest + filename;
}

// Check if the file already exists
Path path = new Path(dest);
if (fileSystem.exists(path)) {
System.out.println("File " + dest + " already exists");
return;
}

// Create a new file and write data to it.
FSDataOutputStream out = fileSystem.create(path);
InputStream in = new BufferedInputStream(new FileInputStream(
new File(source)));

byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
out.write(b, 0, numBytes);
}

// Close all the file descriptors
in.close();
out.close();
fileSystem.close();
}

5. Delete a file from HDFS.

You can use the following:

For removing a directory or a file:

hadoop fs -rmr <hdfs path>

If you want to skip the trash also, use:

hadoop fs -rmr -skipTrash <hdfs path>


public void deleteFile(String file) throws IOException {
Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);

Path path = new Path(file);
if (!fileSystem.exists(path)) {
System.out.println("File " + file + " does not exist");
return;
}

fileSystem.delete(new Path(file), true);

fileSystem.close();
}

6. Get the modification time of a file in HDFS.

If you have any ideas on this, let me know. 🙂

public void getModificationTime(String source) throws IOException{

Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);
Path srcPath = new Path(source);

// Check if the file already exists
if (!(fileSystem.exists(srcPath))) {
System.out.println("No such source " + srcPath);
return;
}
// Get the filename out of the file path
String filename = source.substring(source.lastIndexOf('/') + 1, source.length());

FileStatus fileStatus = fileSystem.getFileStatus(srcPath);
long modificationTime = fileStatus.getModificationTime();

System.out.format("File %s; Modification time : %d %n", filename, modificationTime);

}
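
The raw value is just milliseconds since the epoch. If you want it human-readable, something like the following should work inside the method above (my sketch using java.text.SimpleDateFormat; not part of the original code):

java.text.SimpleDateFormat sdf = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
System.out.println("Modification time: " + sdf.format(new java.util.Date(modificationTime)));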

7. Get the block locations of a file in HDFS.

public void getBlockLocations(String source) throws IOException{

Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);
Path srcPath = new Path(source);

// Check if the file already exists
if (!(ifExists(srcPath))) {
System.out.println("No such source " + srcPath);
return;
}
// Get the filename out of the file path
String filename = source.substring(source.lastIndexOf('/') + 1, source.length());

FileStatus fileStatus = fileSystem.getFileStatus(srcPath);

BlockLocation[] blkLocations = fileSystem.getFileBlockLocations(fileStatus, 0, fileStatus.getLen());
int blkCount = blkLocations.length;

System.out.println("File " + filename + " stored at:");
for (int i=0; i < blkCount; i++) {
String[] hosts = blkLocations[i].getHosts();
System.out.format("Host %d: %s %n", i, Arrays.toString(hosts));
}

}

8. List all the datanodes by hostname.
This is neater than looking up the /etc/hosts file on the namenode.

public void getHostnames () throws IOException{
Configuration config = new Configuration();
config.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
config.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
config.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fs = FileSystem.get(config);
DistributedFileSystem hdfs = (DistributedFileSystem) fs;
DatanodeInfo[] dataNodeStats = hdfs.getDataNodeStats();

String[] names = new String[dataNodeStats.length];
for (int i = 0; i < dataNodeStats.length; i++) {
names[i] = dataNodeStats[i].getHostName();
System.out.println((dataNodeStats[i].getHostName()));
}
}

9. Create a new directory in HDFS.
Creating a directory is done as:

hadoop fs -mkdir <hadoop fs path>


public void mkdir(String dir) throws IOException {
Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);

Path path = new Path(dir);
if (fileSystem.exists(path)) {
System.out.println("Dir " + dir + " already exists!");
return;
}

fileSystem.mkdirs(path);

fileSystem.close();
}

10. Read a file from HDFS

public void readFile(String file) throws IOException {
Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);

Path path = new Path(file);
if (!fileSystem.exists(path)) {
System.out.println("File " + file + " does not exist");
return;
}

FSDataInputStream in = fileSystem.open(path);

String filename = file.substring(file.lastIndexOf('/') + 1,
file.length());

OutputStream out = new BufferedOutputStream(new FileOutputStream(
new File(filename)));

byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
out.write(b, 0, numBytes);
}

in.close();
out.close();
fileSystem.close();
}

11. Checking if a file exists in HDFS

public boolean ifExists (Path source) throws IOException{

Configuration config = new Configuration();
config.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
config.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
config.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem hdfs = FileSystem.get(config);
boolean isExists = hdfs.exists(source);
return isExists;
}

I know this is in no way complete, but this is already a rather long post. I hope it is useful. Responses appreciated!
And here is the complete code for HDFSClient.java. Happy Hadooping! 🙂

/*
Feel free to use, copy and distribute this program in any form.
HDFSClient.java
https://linuxjunkies.wordpress.com/
2011
*/

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class HDFSClient {

public HDFSClient() {

}

public static void printUsage(){
System.out.println("Usage: hdfsclient add <local_path> <hdfs_path>");
System.out.println("Usage: hdfsclient read <hdfs_path>");
System.out.println("Usage: hdfsclient delete <hdfs_path>");
System.out.println("Usage: hdfsclient mkdir <hdfs_path>");
System.out.println("Usage: hdfsclient copyfromlocal <local_path> <hdfs_path>");
System.out.println("Usage: hdfsclient copytolocal <hdfs_path> <local_path>");
System.out.println("Usage: hdfsclient rename <old_hdfs_path> <new_hdfs_path>");
System.out.println("Usage: hdfsclient modificationtime <hdfs_path>");
System.out.println("Usage: hdfsclient getblocklocations <hdfs_path>");
System.out.println("Usage: hdfsclient gethostnames");
}

public boolean ifExists (Path source) throws IOException{

Configuration config = new Configuration();
config.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
config.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
config.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem hdfs = FileSystem.get(config);
boolean isExists = hdfs.exists(source);
return isExists;
}

public void getHostnames () throws IOException{
Configuration config = new Configuration();
config.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
config.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
config.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fs = FileSystem.get(config);
DistributedFileSystem hdfs = (DistributedFileSystem) fs;
DatanodeInfo[] dataNodeStats = hdfs.getDataNodeStats();

String[] names = new String[dataNodeStats.length];
for (int i = 0; i < dataNodeStats.length; i++) {
names[i] = dataNodeStats[i].getHostName();
System.out.println((dataNodeStats[i].getHostName()));
}
}

public void getBlockLocations(String source) throws IOException{

Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);
Path srcPath = new Path(source);

// Check if the file already exists
if (!(ifExists(srcPath))) {
System.out.println("No such source " + srcPath);
return;
}
// Get the filename out of the file path
String filename = source.substring(source.lastIndexOf('/') + 1, source.length());

FileStatus fileStatus = fileSystem.getFileStatus(srcPath);

BlockLocation[] blkLocations = fileSystem.getFileBlockLocations(fileStatus, 0, fileStatus.getLen());
int blkCount = blkLocations.length;

System.out.println("File " + filename + " stored at:");
for (int i=0; i < blkCount; i++) {
String[] hosts = blkLocations[i].getHosts();
System.out.format("Host %d: %s %n", i, Arrays.toString(hosts));
}

}

public void getModificationTime(String source) throws IOException{

Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);
Path srcPath = new Path(source);

// Check if the file already exists
if (!(fileSystem.exists(srcPath))) {
System.out.println("No such source " + srcPath);
return;
}
// Get the filename out of the file path
String filename = source.substring(source.lastIndexOf('/') + 1, source.length());

FileStatus fileStatus = fileSystem.getFileStatus(srcPath);
long modificationTime = fileStatus.getModificationTime();

System.out.format("File %s; Modification time : %d %n", filename, modificationTime);

}

public void copyFromLocal (String source, String dest) throws IOException {

Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);
Path srcPath = new Path(source);

Path dstPath = new Path(dest);
// Check if the destination directory exists
if (!(fileSystem.exists(dstPath))) {
System.out.println("No such destination " + dstPath);
return;
}

// Get the filename out of the file path
String filename = source.substring(source.lastIndexOf('/') + 1, source.length());

try{
fileSystem.copyFromLocalFile(srcPath, dstPath);
System.out.println("File " + filename + " copied to " + dest);
}catch(Exception e){
System.err.println("Exception caught! :" + e);
System.exit(1);
}finally{
fileSystem.close();
}
}

public void copyToLocal (String source, String dest) throws IOException {

Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);
Path srcPath = new Path(source);

Path dstPath = new Path(dest);
// Check if the source file exists
if (!(fileSystem.exists(srcPath))) {
System.out.println("No such source " + srcPath);
return;
}

// Get the filename out of the file path
String filename = source.substring(source.lastIndexOf('/') + 1, source.length());

try{
fileSystem.copyToLocalFile(srcPath, dstPath);
System.out.println("File " + filename + " copied to " + dest);
}catch(Exception e){
System.err.println("Exception caught! :" + e);
System.exit(1);
}finally{
fileSystem.close();
}
}

public void renameFile (String fromthis, String tothis) throws IOException{
Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);
Path fromPath = new Path(fromthis);
Path toPath = new Path(tothis);

if (!(fileSystem.exists(fromPath))) {
System.out.println("No such source " + fromPath);
return;
}

if (fileSystem.exists(toPath)) {
System.out.println("Already exists! " + toPath);
return;
}

try{
boolean isRenamed = fileSystem.rename(fromPath, toPath);
if(isRenamed){
System.out.println("Renamed from " + fromthis + " to " + tothis);
}
}catch(Exception e){
System.out.println("Exception :" + e);
System.exit(1);
}finally{
fileSystem.close();
}

}

public void addFile(String source, String dest) throws IOException {

// Conf object will read the HDFS configuration parameters
Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);

// Get the filename out of the file path
String filename = source.substring(source.lastIndexOf('/') + 1, source.length());

// Create the destination path including the filename.
if (dest.charAt(dest.length() - 1) != '/') {
dest = dest + "/" + filename;
} else {
dest = dest + filename;
}

// Check if the file already exists
Path path = new Path(dest);
if (fileSystem.exists(path)) {
System.out.println("File " + dest + " already exists");
return;
}

// Create a new file and write data to it.
FSDataOutputStream out = fileSystem.create(path);
InputStream in = new BufferedInputStream(new FileInputStream(
new File(source)));

byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
out.write(b, 0, numBytes);
}

// Close all the file descriptors
in.close();
out.close();
fileSystem.close();
}

public void readFile(String file) throws IOException {
Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);

Path path = new Path(file);
if (!fileSystem.exists(path)) {
System.out.println("File " + file + " does not exist");
return;
}

FSDataInputStream in = fileSystem.open(path);

String filename = file.substring(file.lastIndexOf('/') + 1,
file.length());

OutputStream out = new BufferedOutputStream(new FileOutputStream(
new File(filename)));

byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
out.write(b, 0, numBytes);
}

in.close();
out.close();
fileSystem.close();
}

public void deleteFile(String file) throws IOException {
Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);

Path path = new Path(file);
if (!fileSystem.exists(path)) {
System.out.println("File " + file + " does not exist");
return;
}

fileSystem.delete(new Path(file), true);

fileSystem.close();
}

public void mkdir(String dir) throws IOException {
Configuration conf = new Configuration();
conf.addResource(new Path("/home/hadoop/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/home/hadoop/hadoop/conf/mapred-site.xml"));

FileSystem fileSystem = FileSystem.get(conf);

Path path = new Path(dir);
if (fileSystem.exists(path)) {
System.out.println("Dir " + dir + " already exists!");
return;
}

fileSystem.mkdirs(path);

fileSystem.close();
}

public static void main(String[] args) throws IOException {

if (args.length < 1) {
printUsage();
System.exit(1);
}

HDFSClient client = new HDFSClient();
if (args[0].equals("add")) {
if (args.length < 3) {
System.out.println("Usage: hdfsclient add <local_path> " + "<hdfs_path>");
System.exit(1);
}
client.addFile(args[1], args[2]);

} else if (args[0].equals("read")) {
if (args.length < 2) {
System.out.println("Usage: hdfsclient read <hdfs_path>");
System.exit(1);
}
client.readFile(args[1]);

} else if (args[0].equals("delete")) {
if (args.length < 2) {
System.out.println("Usage: hdfsclient delete <hdfs_path>");
System.exit(1);
}

client.deleteFile(args[1]);
} else if (args[0].equals("mkdir")) {
if (args.length < 2) {
System.out.println("Usage: hdfsclient mkdir <hdfs_path>");
System.exit(1);
}

client.mkdir(args[1]);
}else if (args[0].equals("copyfromlocal")) {
if (args.length < 3) {
System.out.println("Usage: hdfsclient copyfromlocal <from_local_path> <to_hdfs_path>");
System.exit(1);
}

client.copyFromLocal(args[1], args[2]);
} else if (args[0].equals("rename")) {
if (args.length < 3) {
System.out.println("Usage: hdfsclient rename <old_hdfs_path> <new_hdfs_path>");
System.exit(1);
}

client.renameFile(args[1], args[2]);
}else if (args[0].equals("copytolocal")) {
if (args.length < 3) {
System.out.println("Usage: hdfsclient copytolocal <from_hdfs_path> <to_local_path>");
System.exit(1);
}

client.copyToLocal(args[1], args[2]);
}else if (args[0].equals("modificationtime")) {
if (args.length < 2) {
System.out.println("Usage: hdfsclient modificationtime <hdfs_path>");
System.exit(1);
}

client.getModificationTime(args[1]);
}else if (args[0].equals("getblocklocations")) {
if (args.length < 2) {
System.out.println("Usage: hdfsclient getblocklocations <hdfs_path>");
System.exit(1);
}

client.getBlockLocations(args[1]);
} else if (args[0].equals("gethostnames")) {

client.getHostnames();
}else {

printUsage();
System.exit(1);
}

System.out.println("Done!");
}
}

Data Structures using Java Part 9: equals(), Degree of equality and Testing

equals
——-
Every class has one equals() method.

r1.equals(r2);
r1 == r2;

Now, the first statement is true if the objects pointed to by r1 and r2 have the same contents, provided the class defines equals() that way (by default it behaves like ==). The second statement is true iff both references point to the very same object. Many classes define equals() to compare the contents of the two objects.
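
A quick sketch of the difference (my example; String overrides equals() to compare contents):

String r1 = new String("cat"); // two distinct objects
String r2 = new String("cat"); // with the same contents

System.out.println(r1.equals(r2)); // true: same contents
System.out.println(r1 == r2);      // false: different objects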

Four degrees of equality
————————–
1. Reference equality, ==
2. Shallow structural equality, i.e. fields are ==
3. Deep structural equality, i.e. fields are equals()
4. Logical equality
a. Fractions 1/3 and 2/6 are “equals” from a mathematical point of view
b. “Set” objects are “equals” if they contain the same elements, irrespective of order.
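
As an illustration of logical equality, a hypothetical Fraction class might compare cross products rather than raw fields, so that 1/3 and 2/6 come out “equals” (my sketch, not from the original notes):

public class Fraction{
	private int num;
	private int den;

	public Fraction(int num, int den){
		this.num = num;
		this.den = den;
	}

	public boolean equals(Fraction other){
		// 1/3 equals 2/6 because 1*6 == 3*2
		return (long) num * other.den == (long) den * other.num;
	}
}

The SList equals() below, from the notes, is deep structural equality: it walks both lists and compares the items with equals().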

public class SList{
	public boolean equals(SList other){
		if (size != other.size)
			return false; // two SLists with different sizes cannot be equal

		SListNode n1 = head;       // initial offsets of both the SLists
		SListNode n2 = other.head;

		while(n1 != null){
			if(!n1.item.equals(n2.item)){
				return false;
			}
			n1 = n1.next;
			n2 = n2.next;
		} // end of while

		return true;
	}
}
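
One caveat worth noting (my addition, not from the original notes): the method above takes an SList parameter, so it overloads rather than overrides Object.equals(). For it to kick in through an Object reference, you would also want something like:

public boolean equals(Object other){
	if (!(other instanceof SList))
		return false;
	return this.equals((SList) other);
}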

Data Structures using Java Part 7: Linked Lists Part 2 (Doubly Linked)

private field
————–
A private field is invisible and inaccessible to other classes.
Reasons:
1. To prevent the data from being corrupted by other classes.
2. You can improve the implementation without causing other classes to fail.

public class Date{

	private int day;
	private int month;
	private int year;

	public void setMonth(int m){
		// error checking code runs here before assigning
		month = m;
	}

	public Date(int month, int day, int year){
	// implementation and error checking code
	}
}
public class Evil{

	public void tamper(){

	Date d = new Date(10,10,2010); // this step is possible

	// these are not going to work!
	d.day = 2000;   // won't compile: day is private
	d.setMonth(56); // compiles, but setMonth's error checking rejects it
	}
}

Interfaces
————
The interface of a class is made up of two things:
1. the prototypes for its public methods, and
2. the descriptions of those methods’ behaviour.

Abstract Data Types (ADT)
——————————
An ADT is a class (or set of classes) that has a well-defined interface, but whose implementation details are firmly hidden from other classes.

Invariant
———-
An invariant is a fact about a data structure that is always true (assuming there are no bugs). For example, a “Date” object always represents a valid date.
Not all classes are ADTs. Some just store data (no invariants).

The SList ADT
—————–

Another advantage of having a separate SList class is that it enforces two invariants:
1. “size” is always correct.
2. The list is never circularly linked.

Both of these goals are accomplished through simple Java mechanisms; only the SList methods can change the lists.

SList ensures this:
1. The fields of the SList class (“head” and “size”) are private.
2. No method of SList returns an SListNode.
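
Putting those two rules together, a skeleton might look like this (my sketch, assuming an SListNode(int item, SListNode next) constructor):

public class SList{

	private SListNode head; // private: other classes cannot reach the nodes
	private int size;       // private: only SList methods update it

	public void insertFront(int item){
		head = new SListNode(item, head);
		size++; // keeps the “size is always correct” invariant
	}
}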

Doubly Linked Lists
———————


Inserting or deleting an element at the front of the list is easy:

public void deleteFront(){
	if (head != null){
		head = head.next;
		size--;
	}
}

Inserting or deleting an item at the end of the list, however, costs a slow traversal of the whole list.

public class DListNode{

	int item;
	DListNode next; // reference to the next node in the list
	DListNode prev; // reference to the previous node in the list

}

public class DList{

	private DListNode head; // points to start of the list
	private DListNode tail; // points to the end of the list
}

This lets us insert and delete items at both ends in constant running time.
Removing the tail node (with at least 2 items in the DList):

tail.prev.next = null;
tail = tail.prev;

Sentinel
———
A special node that does not represent, and does not store, an item. In the list below, “head” points to the sentinel, which is why it can never be null.

Circularly linked Doubly linked list
————————————

 // no changes here
 public class DListNode{

	int item;
	DListNode next; // reference to the next node in list
	DListNode prev; // reference to the previous node in the list

}

// changes here
public class DList{

	private DListNode head; // head will point to the sentinel
	private int size; // keeps track of the size of the list, excluding the sentinel.
}

DList invariants with the sentinel:
——————————–
1. For any DList d, d.head is never null.
2. For any DListNode x, x.next is never null.
3. For any DListNode x, x.prev is never null.
4. For any DListNode x, if x.next == y, then y.prev == x.
5. For any DListNode x, if x.prev == y, then y.next == x.
6. A DList’s “size” field is the number of DListNodes, NOT COUNTING the sentinel. The sentinel is always hidden.
7. In an empty DList, the sentinel’s next and prev fields point to itself.
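
Here is how an insertFront might look with the sentinel (my sketch, consistent with the invariants above and the DList/DListNode classes shown earlier):

public void insertFront(int item){
	DListNode node = new DListNode();
	node.item = item;
	node.next = head.next; // old first node, or the sentinel itself if the list is empty
	node.prev = head;      // “head” points to the sentinel
	head.next.prev = node; // maintains invariant 4
	head.next = node;      // maintains invariant 5
	size++;
}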


Linux Tutorial Part 2: Basic Linux Shell Skills 2

16. cat – concatenates files

  • cat abc.txt – dumps the contents of ‘abc.txt’ to STDOUT
  • cat abc.txt def.txt – dumps both files to STDOUT
  • cat abc.txt def.txt > abcdef.txt – creates new concatenated file

17. mkdir – creates a new directory

18. cp – copies files

  • cp abc.txt testdir/
  • By default, ‘cp’ does NOT preserve the original modification time
  • cp -v def.txt testdir/

19. mv – moves files

  • mv abc.txt testdir/ – moves the file, preserving timestamp

20. rm – removes files/directories

  • rm abc.txt
  • rm -rf testdir – removes recursively and forces, without prompting

21. touch – creates blank file/updates timestamp

  • touch test.txt – will create a zero-byte file, if it doesn’t exist
  • touch def.txt – will update the timestamp
  • touch -t 201001011529 abcdef.txt – changes timestamp

22. stat – reveals statistics of files

  • stat def.txt – reveals full attributes of the file

23. find – finds files using search patterns

  • find / -name ‘fstab’
  • Note: ‘find’ can search for fields returned by the ‘stat’ command

24. alias – returns/sets aliases for commands

  • alias – dumps current aliases
  • alias copy=’cp -v’

25. date – shows the current date and time.



Linux Tutorial Part 1: Basic Shell Skills 1

1. uname

‘uname -a’ returns :

  • OS – Linux
  • Fully Qualified Domain Name (FQDN)
  • Kernel version – e.g. 2.6.31-14
  • 2.6 = major version
  • 31 = minor version
  • -14 = anything after the minor version indicates that the kernel was patched by the distributor
  • Date and time that the kernel was compiled

2. ifconfig

  • lists out all the network interfaces
  • useful to check or set the IP of an interface.

3. tty – reveals the current terminal
4. whoami – reveals the currently logged-in user
5. which – reveals where in the search path a program is located

6. echo – prints to the screen

  • echo $PATH – dumps the current path to STDOUT
  • echo $PWD – dumps the contents of the $PWD variable
  • echo $OLDPWD – dumps the most recently visited directory

7. set – prints and optionally sets shell variables
8. clear – clears the screen or terminal
9. reset – resets the screen buffer
10. history – reveals your command history

  • !362 – executes the 362nd command in our history
  • command history is maintained on a per-user basis via:
  • ~/.bash_history
  • ~ = the user’s $HOME directory in the BASH shell

11. pwd – prints the working directory
12. cd – changes directory to desired directory

  • ‘cd ‘ with no options changes to the $HOME directory
  • ‘cd ~’ changes to the $HOME directory
  • ‘cd /’ changes to the root of the file system
  • ‘cd Desktop/’ changes us to the relative directory ‘Desktop’
  • ‘cd ..’ changes us one-level up in the directory tree
  • ‘cd ../..’ changes us two-levels up in the directory tree

13. Arrow keys (up and down) navigates through your command history
14. ls – lists files and directories

  • ls / – lists the contents of the ‘/’ mount point
  • ls -l – lists the contents of a directory in long format:
  • Includes: permissions, links, ownership, size, date, name
  • ls -ld /etc – lists properties of the directory ‘/etc’, NOT the contents of ‘/etc’
  • ls -ltr – sorts chronologically from older to newer (bottom)
  • ls --help – returns possible usage information
  • ls -a – reveals hidden files. e.g. ‘.bash_history’
  • files or directories prefixed with ‘.’ are hidden. e.g. ‘.bash_history’

15. cat – retrieves information from a file, as with the OS version in our case.
