Using Oracle NoSQL Database with Cloudera Distribution for Hadoop

By Deepak Vohra

Get a test project up and running to explore the basic principles involved.

Published June 2012

Introduced in 2011, Oracle NoSQL Database is a highly available, highly scalable key/value (nonrelational) database that supports CRUD operations through a Java API. A related technology, the Hadoop MapReduce framework, provides a distributed environment for developing applications that process large quantities of data in parallel on large clusters.

In this article we discuss integrating Oracle NoSQL Database with Cloudera Distribution for Hadoop (CDH) on the Windows OS via an Oracle JDeveloper project (available for download). We will also demonstrate processing the NoSQL Database data in Hadoop using a MapReduce job.

Setup

The following software is required for this project. Download and install anything on the list you don’t already have according to the respective instructions.

  • Oracle JDeveloper
  • Oracle NoSQL Database 1.2.123, installed in C:\OracleNoSQL\kv-1.2.123 in this article
  • Cloudera Distribution for Hadoop (CDH), the hadoop-0.20.1+169.127 release used here, installed in the Cygwin home directory
  • Cygwin, from which the Hadoop commands are run
  • JDK 7

Install Java 1.7 in a directory without spaces in its name and add that directory to the path. Set the JAVA_HOME environment variable to the JDK installation directory.
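For example, with the JDK installed in C:\JDK7\Java\jdk1.7.0_05 (the path used throughout this article), the variable can be set from a Windows command prompt as follows; adjust the path to your own installation:

> set JAVA_HOME=C:\JDK7\Java\jdk1.7.0_05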

Configuring Oracle NoSQL Database in Oracle JDeveloper

First, we’ll need to configure the NoSQL Database server as an external tool in JDeveloper. Select Tools>External Tools. In the External Tools window select New. In the Create External Tool wizard select Tool Type: External Program and click Next. On the Program Options page, specify the following options.

Field                 Value
Program Executable    C:\JDK7\Java\jdk1.7.0_05\bin\java.exe
Arguments             -jar ./lib/kvstore-1.2.123.jar kvlite
Run Directory         C:\OracleNoSQL\kv-1.2.123


Click Finish in Create External Tools:

[Figure 1]

Oracle NoSQL Database is now configured as an external tool. (The external tool name may vary if other tools that use the same program executable are also configured.) Click OK in External Tools.

Next, select Tools>Java 1. The Oracle NoSQL Database server starts up and a key-value (KV) store is created. 

[Figure 2]

The NoSQL Database store is created with the following arguments by default:

Argument    Value
-root       kvroot
-store      kvstore
-host       localhost
-port       5000
-admin      5001
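With the defaults made explicit, the kvlite invocation configured above is equivalent to running the following command from the C:\OracleNoSQL\kv-1.2.123 directory (omitting any flag yields its default value):

> java -jar ./lib/kvstore-1.2.123.jar kvlite -root kvroot -store kvstore -host localhost -port 5000 -admin 5001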

 

On subsequent runs of the external tool, the NoSQL Database server opens the existing KV store with the same configuration with which it was created:

[Figure 3]


Running the HelloBigDataWorld Example

The NoSQL Database package includes some examples in the C:\OracleNoSQL\kv-1.2.123\examples directory. We will run the following examples in this article:

  • hello.HelloBigDataWorld
  • hadoop.CountMinorKeys 

The HelloBigDataWorld example can be run using an external tool configuration or as a Java application.

Running as an External Tool

To run HelloBigDataWorld as an external tool, select Tools>External Tools and create a new external tool configuration using the same procedure as for the NoSQL Database server. We need two configurations: one for compiling the HelloBigDataWorld file and another for running the compiled application. Specify the following program options for compiling HelloBigDataWorld.

Program Option        Value
Program Executable    C:\JDK7\Java\jdk1.7.0_05\bin\javac.exe
Arguments             -cp ./examples;./lib/kvclient-1.2.123.jar examples/hello/HelloBigDataWorld.java
Run Directory         C:\OracleNoSQL\kv-1.2.123


The program options for compiling the hello/HelloBigDataWorld.java file are shown below. Click Finish.

[Figure 4]

An external tool named Javac is created. Select Tools>Javac to compile the hello/HelloBigDataWorld.java file. Next, create an external tool for running the compiled hello.HelloBigDataWorld class using the following configuration.

Program Option        Value
Program Executable    C:\JDK7\Java\jdk1.7.0_05\bin\java.exe
Arguments             -cp ./examples;./lib/kvclient-1.2.123.jar hello.HelloBigDataWorld
Run Directory         C:\OracleNoSQL\kv-1.2.123


The classpath must include the kvclient-1.2.123.jar file. Click Finish.

[Figure 5]

To run the hello.HelloBigDataWorld class select Tools>Java. The hello.HelloBigDataWorld application runs and a short message is written.

[Figure 6]
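Equivalently, the compiled example can be run directly from a command prompt in the C:\OracleNoSQL\kv-1.2.123 directory:

> java -cp ./examples;./lib/kvclient-1.2.123.jar hello.HelloBigDataWorld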

Running as a Java Application

Next, we will run the hello.HelloBigDataWorld application as a Java application in an Oracle JDeveloper project. To create a new application:

  • Select Java Desktop Application in New Gallery.
  • Specify an Application Name (e.g., NoSQLDB) and select the default directory. Click Next.
  • Specify a Project Name (e.g., NoSQLDB) and click Finish.

Next, create a Java class in the project.

  • Select Java Class in New Gallery and click OK.
  • In Create Java Class specify class name as “HelloBigDataWorld” and package as “hello”. Click OK. The hello.HelloBigDataWorld class is added to the application.
  • Copy the contents of the hello/HelloBigDataWorld.java file from the C:\OracleNoSQL\kv-1.2.123\examples directory into the new class in Oracle JDeveloper.

In the example application, a new oracle.kv.KVStore is created using the KVStoreFactory class:

store = KVStoreFactory.getStore(new KVStoreConfig(storeName, hostName + ":" + hostPort));

Key/value pairs are created and stored in the KV store:

final String keyString = "Hello";
final String valueString = "Big Data World!";
store.put(Key.createKey(keyString), Value.createValue(valueString.getBytes()));

The key/value pair is retrieved from the store and output. Subsequently, the KV store is closed.

final ValueVersion valueVersion = store.get(Key.createKey(keyString));
System.out.println(keyString + " " + new String(valueVersion.getValue().getValue())+ "\n ");
store.close();

The hello.HelloBigDataWorld class is shown below.

[Figure 7]
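For reference, here is a minimal sketch of the class assembled from the fragments above, assuming the default store settings (kvstore, localhost, 5000) and omitting exception handling:

package hello;

import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

public class HelloBigDataWorld {

    public static void main(String[] args) {
        // Default store settings from the table earlier in this article.
        final String storeName = "kvstore";
        final String hostName = "localhost";
        final String hostPort = "5000";

        // Open a handle to the KV store.
        final KVStore store = KVStoreFactory.getStore(
            new KVStoreConfig(storeName, hostName + ":" + hostPort));

        // Create a key/value pair and store it.
        final String keyString = "Hello";
        final String valueString = "Big Data World!";
        store.put(Key.createKey(keyString),
                  Value.createValue(valueString.getBytes()));

        // Retrieve the pair, print it, and close the store handle.
        final ValueVersion valueVersion = store.get(Key.createKey(keyString));
        System.out.println(keyString + " " +
            new String(valueVersion.getValue().getValue()));
        store.close();
    }
}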

To run the HelloBigDataWorld class, add the C:\OracleNoSQL\kv-1.2.123\lib\kvclient-1.2.123.jar file to the Libraries and Classpath.

[Figure 8]

To run the application, right-click on the class and select Run. The hello.HelloBigDataWorld class runs and one line of output is generated; the example application creates only one key/value pair.

In the next section we will run the hadoop.CountMinorKeys.java example. To prepare for that, run the HelloBigDataWorld example again so that the KV store contains data for the MapReduce job:

[Figure 9]
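Note that the example writes the same key, "Hello", each time. If you want several distinct records under one major key, which is what the next example counts, one hypothetical variation (not part of the shipped example) is to vary the key's minor path; this fragment assumes the open store handle from the class above:

// Hypothetical variation: store several records under the major path
// /Hello, distinguished by the minor path components record0..record2.
for (int i = 0; i < 3; i++) {
    Key key = Key.createKey(java.util.Arrays.asList("Hello"),
                            java.util.Arrays.asList("record" + i));
    store.put(key, Value.createValue(("value " + i).getBytes()));
}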

Processing NoSQL Database Data in Hadoop

Next, we will run the Hadoop example in C:\OracleNoSQL\kv-1.2.123\examples\hadoop\CountMinorKeys.java. Create a Java class hadoop.CountMinorKeys in the project and copy the contents of the examples\hadoop\CountMinorKeys.java file into it.

[Figure 10]
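In outline, CountMinorKeys is a conventional MapReduce job: it reads the KV store records through the oracle.kv.hadoop.KVInputFormat class, the mapper emits each record's major key path with a count of 1, and the reducer sums the counts, yielding the number of records under each major key. The following is a rough sketch of that shape, not the shipped source verbatim; see examples\hadoop\CountMinorKeys.java for the actual job wiring:

package hadoop;

import java.io.IOException;

import oracle.kv.Key;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CountMinorKeysSketch {

    // KVInputFormat presents each record as (Text key, Text value), with
    // the key in Key's URI string format (e.g., /Hello).
    public static class KeyMapper
            extends Mapper<Text, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(Text keyArg, Text valueArg, Context context)
                throws IOException, InterruptedException {
            // Parse the key and emit its major path with a count of 1.
            Key key = Key.fromString(keyArg.toString());
            context.write(new Text(key.getMajorPath().toString()), ONE);
        }
    }

    // Sum the counts emitted for each major key path.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}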

Add the CDH jar file to the project.

[Figure 11]

Configuring the Hadoop Cluster

Next, we will configure the Hadoop cluster. In CDH2 there are three configuration files: core-site.xml, mapred-site.xml, and hdfs-site.xml. In conf/core-site.xml, specify the fs.default.name parameter, which is the URI of the NameNode.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9100</value>
    </property>
</configuration>

The core-site.xml is shown below.

[Figure 12]

In conf/mapred-site.xml, specify the mapred.job.tracker parameter, which is the host (or IP address) and port of the JobTracker. Specify localhost as the host and 9101 as the port.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9101</value>
    </property>
</configuration>

The conf/mapred-site.xml is shown below.

[Figure 13]

Specify the dfs.replication parameter in the conf/hdfs-site.xml configuration file. The dfs.replication parameter specifies the number of machines each file block is replicated to before the file becomes available; the value should not exceed the number of DataNodes. (We use one DataNode in this example.)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

The conf/hdfs-site.xml is shown below.

[Figure 14]


Having configured the Hadoop cluster, we can now start it. But first we need to format the Hadoop Distributed File System (HDFS) storage used for processing the data. Run the following commands in Cygwin.

> cd hadoop-0.20.1+169.127
> bin/hadoop namenode -format

A storage directory, \tmp\hadoop-dvohra\dfs, is created.

[Figure 15]
 

We also need to create a deployment profile for the hadoop.CountMinorKeys application:

  • Select the project node in Application Navigator and select File>New.
  • In New Gallery select Deployment Profiles>JAR File and click OK.
  • In Create Deployment Profile, specify the Deployment Profile Name (hadoop) and click OK.
  • In Edit JAR Deployment Profile Properties, accept the default settings and click OK.
  • A new deployment profile is created. Click OK.

To deploy the deployment profile right-click on the NoSQL project and select Deploy>hadoop.

[Figure 16]


In Deployment Action, select Deploy to JAR file and click Next. Click Finish in Summary. The hadoop.jar file is deployed to the deploy directory in the JDeveloper project. Copy hadoop.jar to the C:\cygwin\home\dvohra\hadoop-0.20.1+169.127 directory, because the application will be run from the hadoop-0.20.1+169.127 directory in Cygwin.

Starting the Hadoop Cluster

Typically, a multi-node Hadoop cluster consists of the following nodes.

Node Name           Function                                                             Type
NameNode            Manages the HDFS storage layer (formatted in the previous section)  master
JobTracker          Manages MapReduce data processing; assigns tasks                    master
DataNode            Stores filesystem data; HDFS storage layer processing               slave
TaskTracker         Runs MapReduce processing tasks                                     slave
Secondary NameNode  Stores filesystem modifications and periodically merges them
                    with the current HDFS state


Next, we shall start the nodes in the cluster. To start the NameNode run the following commands in Cygwin.

> cd hadoop-0.20.1+169.127
> bin/hadoop namenode

[Figure 17]

Start the Secondary NameNode with the following commands:

> cd hadoop-0.20.1+169.127
> bin/hadoop secondarynamenode

[Figure 18]


Start the DataNode:

> cd hadoop-0.20.1+169.127
> bin/hadoop datanode

[Figure 19]


Start the JobTracker:

> cd hadoop-0.20.1+169.127
> bin/hadoop jobtracker

[Figure 20]


Start the TaskTracker:

> cd hadoop-0.20.1+169.127
> bin/hadoop tasktracker

[Figure 21]


Running a MapReduce Job

Next, we shall run the hadoop.CountMinorKeys application, for which we created the hadoop.jar file. The hadoop.CountMinorKeys application runs a MapReduce job on the Oracle NoSQL Database data in the KV store and generates output in the Hadoop HDFS. The NoSQL Database Java client API is packaged in the kvclient-1.2.123.jar file. Copy kvclient-1.2.123.jar from the C:\OracleNoSQL\kv-1.2.123\lib directory to the C:\cygwin\home\dvohra\hadoop-0.20.1+169.127\lib directory, which is in the classpath of Hadoop. Run the hadoop.jar with the following commands in Cygwin.

> cd hadoop-0.20.1+169.127
> bin/hadoop jar hadoop.jar hadoop.CountMinorKeys kvstore dvohra-PC:5000 hdfs://localhost:9100/tmp/hadoop/output/

The arguments are the KV store name (kvstore), the store's host and port (dvohra-PC:5000), and the HDFS output directory. The MapReduce job runs and the output is generated in the hdfs://localhost:9100/tmp/hadoop/output/ directory.

[Figure 22]

List the files in the tmp/hadoop/output directory with the following command.

> bin/hadoop dfs -ls hdfs://localhost:9100/tmp/hadoop/output

The MapReduce job output is generated in the part-r-00000 file, which gets listed with the previous command.

[Figure 23]


Get the part-r-00000 file to the local filesystem with the following command:

> bin/hadoop dfs -get hdfs://localhost:9100/tmp/hadoop/output/part-r-00000 part-r-00000

The MapReduce job output is shown in Oracle JDeveloper; the output lists the number of records for each major key in the KV store, which was created with the first example application, hello.HelloBigDataWorld.

[Figure 24]

Congratulations, your project is complete! 



Deepak Vohra
is a NuBean consultant, Web developer, and Oracle Certified Associate, Oracle Database 10g.