
Oracle 12c & Hadoop: Optimal Storage and Processing of Big Data

 


This article illustrates how to use Hadoop ecosystem tools to extract data from an Oracle 12c database, use the Hadoop framework to process and transform the data, and then load the data processed within Hadoop back into an Oracle 12c database.

This article covers basic concepts:

  • What is Big Data? Big Data refers to data sets so large that a single machine cannot store and process them. The data comes in different formats (structured and unstructured), from different sources, and grows at great velocity. 
  • What is Apache Hadoop? It is a framework that allows distributed processing of large data sets across many (potentially thousands of) machines. Hadoop was inspired by concepts first published by Google (MapReduce and the Google File System). The Hadoop framework consists of HDFS and MapReduce. 
  • What is HDFS? HDFS (Hadoop Distributed File System) is the Hadoop file system that enables storing large data sets across multiple machines. 
  • What is MapReduce? It is the data processing component of the Hadoop framework, consisting of a Map phase and a Reduce phase (see the small shell analogy after this list). 
  • What is Apache Sqoop? Apache Sqoop(TM) is a tool to transfer bulk data between Apache Hadoop and structured data stores such as relational databases. It is part of the Hadoop ecosystem. 
  • What is Apache Hive? Hive is a tool to query and manage large datasets stored in Hadoop HDFS. It is also part of the Hadoop ecosystem. 
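
As a rough single-machine intuition for the Map and Reduce phases (only an analogy, not how Hadoop is implemented), the classic Unix pipeline below mimics them: the first stage emits a key per input record (the map), sort groups identical keys together (the shuffle), and uniq -c aggregates each group (the reduce). The file name is a placeholder for any tab-separated input, such as the user_activity.txt file produced later in this article.

# map: emit the first tab-separated field (e.g. an IP address) of every record
# shuffle: sort brings identical keys together
# reduce: uniq -c counts the records in each group
cut -f1 user_activity.txt | sort | uniq -c
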
Where Does Hadoop Fit In? In this article, we use the Apache Hadoop ecosystem (Apache Sqoop) to extract data from an Oracle 12c database and store it in the Hadoop Distributed File System (HDFS). We then use another Hadoop ecosystem tool (Apache Hive) to transform the data and process it with MapReduce (we could also write Java MapReduce programs to do the same). Finally, Apache Sqoop is used to load the data processed within Hadoop into an Oracle 12c database. The following image describes where Hadoop fits in the process. This scenario represents a practical solution for processing big data coming from an Oracle database as the source; the only condition is that the data source must be structured. Note that Hadoop can also process non-structured data such as videos, log files, etc.


[Figure global-overview.jpg: global overview of where Hadoop fits in the process]

System Preparation

Extract Data from Oracle Database into HDFS:

  • Connect to the Pluggable Database (PDB) in Oracle 12cR1:
    Note: This step is NOT required for Oracle database releases prior to Oracle 12c Release 1. In this article we connect to a pluggable database; if you are connecting to a pre-12c Oracle database, all you need is a running database service. This step also works for Real Application Clusters and Exadata environments.
 In Oracle 12cR1, you can connect to a pluggable database in one of four ways:
  • Use the default service name that the database software creates automatically.
  • Create your own service for the PDB using the SRVCTL utility.
  • Issue an ALTER SESSION command, for example ALTER SESSION SET CONTAINER = MYPDB;, which sets the container to the MYPDB pluggable database (see the sketch after this list).
  • Use Enterprise Manager 12c or Enterprise Manager 12c Express.
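
For example, the ALTER SESSION approach is a one-liner once you are connected to the container database as a privileged user. The following is a minimal sketch; MYPDB is a placeholder PDB name:

sqlplus / as sysdba
SQL> ALTER SESSION SET CONTAINER = MYPDB;
Session altered.
SQL> SHOW CON_NAME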

 


Create a new service for the pluggable database called "orawiss12c":

sandbox1(orawiss):/home/oracle/wissem>sqlplus / as sysdba

SQL*Plus: Release 12.1.0.1.0 Production on Mon Nov 25 10:57:42 2013

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options 

SQL> select con_id,dbid,NAME,OPEN_MODE from v$pdbs;

CON_ID            DBID               NAME                OPEN_MODE

    2             4061740350         PDB$SEED            READ ONLY
    3             1574282659         ORAWISS12C          READ WRITE

SQL>


To create a service for a PDB using the SRVCTL utility, use the add service command and specify the PDB in the -pdb parameter.
Then start the new service called orawissPDB.

sandbox1(+ASM):/home/oracle>srvctl add service -db orawiss -service orawissPDB -pdb orawiss12c
sandbox1(+ASM):/home/oracle>srvctl start service -db orawiss -service orawissPDB
sandbox1(+ASM):/home/oracle>



Create an entry in the tnsnames.ora file for the newly created service orawissPDB.
Then test the connection.


sandbox1(+ASM):/home/oracle>cat /opt/app/oracle/product/12.1/db_1/network/admin/tnsnames.ora
orawissPDB =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = sandbox1.XXX.com)(PORT = 1522))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = orawissPDB)
    )
  )
sandbox1(+ASM):/home/oracle>
sandbox1(orawiss):/home/oracle/wissem >sqlplus wissem/wissem@orawissPDB

SQL*Plus: Release 12.1.0.1.0 Production on Mon Nov 25 10:51:24 2013
Copyright (c) 1982, 2013, Oracle.
All rights reserved.
Last Successful login time: Sun Jul 14 2013 05:20:12 -05:00
Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options

SQL> exit

Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options

sandbox1(orawiss):/home/oracle/wissem/ORAWISS/archivelog>



Download the JDBC Driver for Oracle:  http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html.
You must accept the license agreement before you can download the driver.
Download the ojdbc6.jar file.


[Figure oracle-jdbc-driver.jpg: Oracle JDBC driver download page]

Copy the ojdbc6.jar file to the /usr/lib/sqoop/lib/ directory by running the following command:


[Figure oracle-jdbc-driver-2.jpg: copying ojdbc6.jar into the Sqoop lib directory]
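
The screenshot above shows the copy step; in essence it comes down to something like the following (assuming the driver was downloaded to the current directory and the user has write access to the Sqoop lib directory):

[wissem@localhost ~]$ cp ojdbc6.jar /usr/lib/sqoop/lib/
[wissem@localhost ~]$ ls /usr/lib/sqoop/lib/ojdbc6.jar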


Test the connection to the Oracle database: we can test the connection to the Oracle database using Sqoop.
Note: We are using sandbox1 as the Oracle database server and localhost as the machine where Hadoop, Sqoop and Hive are installed. Typically, Hadoop is installed on multiple machines, while Sqoop and Hive are installed on separate client machines. For simplicity, the entire Hadoop ecosystem and the Hadoop framework are installed on a single machine here.



[Figure sqoop-eval-60.jpg: sqoop eval output for the Oracle connection test]


We selected 1 from the dual table and Sqoop returned the value 1, which means the connection to the Oracle pluggable database is established.
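
The screenshot above was produced with Sqoop's eval tool; the command is essentially the following (a sketch using the same connection details as the import later in this article):

[wissem@localhost ~]$ sqoop eval --connect jdbc:oracle:thin:@//sandbox1.xxx.com:1522/orawissPDB --username WISSEM -P --query "SELECT 1 FROM dual"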

 

Internet Active Users Dataset:

In this example, an online marketing company that offers internet users banners and web pages to download software, games, etc., has millions of active users per day around the globe. The inventory department wants to know how many times each user is active per day; "active" means the user clicked on a banner, clicked on an offer, or simply signed up. For simplicity, let's imagine we need an Oracle table with just two fields: user IP and active date. The table could have many other columns, such as a banner identifier, traffic source, country, etc. Let's call the table "User_Activity"; it may hold billions of records per day.

We will write a program that mines active users' data. Data is collected every time a user performs an "action", at many locations around the globe, so we may gather a large volume of log data. The data is structured, so it is a good candidate to be managed by an Oracle database; the user activity table is also a good candidate for analysis with MapReduce. The goal is to know how many times each user is active per day.

Let's create the table in the Oracle 12c database and populate it with some sample data.

sandbox1(orawiss):/home/oracle/wissem>sqlplus  wissem/wissem@orawissPDB
SQL*Plus: Release 12.1.0.1.0 Production on Wed Nov 27 04:43:09 2013
Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Tue Nov 26 2013 11:26:04 -05:00

Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options

SQL> create table user_activity(IP_ADRR Varchar2(30), ACTIVE_DATE DATE, Traffi_source Varchar2(50),
ISO_COUNTRY Varchar2(2));
Table created.
SQL> ALTER TABLE user_activity ADD PRIMARY KEY (IP_ADRR, ACTIVE_DATE);
Table altered.

SQL> select * from user_activity;

IP_ADRR            ACTIVE_DATE           TRAFFI_SOURCE    IS
------------------ --------------------- ---------------- --
98.245.153.111     11/27/2013 04:49:14   Google           TN
65.245.153.105     11/27/2013 04:49:59   Yahoo            BE
98.245.153.111     11/27/2013 04:50:14   MSN              TN
98.245.153.111     11/27/2013 04:50:25   MSN              TN
98.245.153.111     11/27/2013 04:50:28   MSN              TN
98.245.153.111     11/27/2013 04:50:30   MSN              TN
65.245.153.105     11/27/2013 04:50:40   Yahoo            BE
65.245.153.105     11/27/2013 04:50:41   Yahoo            BE
65.245.153.105     11/27/2013 04:50:42   Yahoo            BE
……
……
SQL>
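
The sample rows shown above were inserted beforehand; a few illustrative INSERT statements (with values chosen only to match the sample output) look like this:

SQL> INSERT INTO user_activity VALUES ('98.245.153.111', SYSDATE, 'Google', 'TN');
SQL> INSERT INTO user_activity VALUES ('65.245.153.105', SYSDATE, 'Yahoo', 'BE');
SQL> INSERT INTO user_activity VALUES ('98.245.153.111', SYSDATE, 'MSN', 'TN');
SQL> COMMIT;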


Extract data from Oracle Database into HDFS:

We use Sqoop to import the data from Oracle into HDFS. Sqoop has an import option to import data from an RDBMS; we need to specify a connection string, a table name, and a field separator. More details about sqoop import can be found in the Sqoop user's guide. Sqoop can import data in parallel: we can specify the number of map tasks (parallel processes) used to perform the import from Oracle with the -m or --num-mappers argument. By default, four tasks are used. In our example we will use only one mapper.

[wissem@localhost ~]$ sqoop import --connect jdbc:oracle:thin:@//sandbox1.xxx.com:1522/orawissPDB --table
USER_ACTIVITY --fields-terminated-by '\t' --num-mappers 1 --username WISSEM -P
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Enter password:
13/11/27 14:39:23 INFO manager.SqlManager: Using default fetchSize of 1000
13/11/27 14:39:23 INFO tool.CodeGenTool: Beginning code generation
13/11/27 14:39:26 INFO manager.OracleManager: Time zone has been set to GMT
13/11/27 14:39:27 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM USER_ACTIVITY t WHERE 1=0
13/11/27 14:39:27 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
13/11/27 14:39:27 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar
Note: /tmp/sqoop-wissem/compile/6a6e927fb008e151d1eeb04bff6f24ef/USER_ACTIVITY.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/11/27 14:39:30 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wissem/compile/6a6e927fb008e151d1eeb04bff6f24ef/USER_ACTIVITY.jar
13/11/27 14:39:30 INFO manager.OracleManager: Time zone has been set to GMT
13/11/27 14:39:34 WARN manager.OracleManager: The table USER_ACTIVITY contains a multi-column primary key. Sqoop will default to the column ACTIVE_DATE only for this job.
13/11/27 14:39:34 INFO manager.OracleManager: Time zone has been set to GMT
13/11/27 14:39:35 WARN manager.OracleManager: The table USER_ACTIVITY contains a multi-column primary key. Sqoop will default to the column IP_ADRR only for this job.
13/11/27 14:39:35 INFO mapreduce.ImportJobBase: Beginning import of USER_ACTIVITY
13/11/27 14:39:36 INFO manager.OracleManager: Time zone has been set to GMT
13/11/27 14:39:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/11/27 14:39:42 INFO mapred.JobClient: Running job: job_201311271428_0001
13/11/27 14:39:43 INFO mapred.JobClient: map 0% reduce 0%
13/11/27 14:40:04 INFO mapred.JobClient: map 100% reduce 0%
13/11/27 14:40:07 INFO mapred.JobClient: Job complete: job_201311271428_0001
13/11/27 14:40:07 INFO mapred.JobClient: Counters: 23
13/11/27 14:40:07 INFO mapred.JobClient: File System Counters
13/11/27 14:40:07 INFO mapred.JobClient: FILE: Number of bytes read=0
13/11/27 14:40:07 INFO mapred.JobClient: FILE: Number of bytes written=216292
13/11/27 14:40:07 INFO mapred.JobClient: FILE: Number of read operations=0
13/11/27 14:40:07 INFO mapred.JobClient: FILE: Number of large read operations=0
13/11/27 14:40:07 INFO mapred.JobClient: FILE: Number of write operations=0
13/11/27 14:40:07 INFO mapred.JobClient: HDFS: Number of bytes read=87
13/11/27 14:40:07 INFO mapred.JobClient: HDFS: Number of bytes written=407
13/11/27 14:40:07 INFO mapred.JobClient: HDFS: Number of read operations=1
13/11/27 14:40:07 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/11/27 14:40:07 INFO mapred.JobClient: HDFS: Number of write operations=1
13/11/27 14:40:07 INFO mapred.JobClient: Job Counters
13/11/27 14:40:07 INFO mapred.JobClient: Launched map tasks=1
13/11/27 14:40:07 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=20435
13/11/27 14:40:07 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
13/11/27 14:40:07 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/11/27 14:40:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/11/27 14:40:07 INFO mapred.JobClient: Map-Reduce Framework
13/11/27 14:40:07 INFO mapred.JobClient: Map input records=9
13/11/27 14:40:07 INFO mapred.JobClient: Map output records=9
13/11/27 14:40:07 INFO mapred.JobClient: Input split bytes=87
13/11/27 14:40:07 INFO mapred.JobClient: Spilled Records=0
13/11/27 14:40:07 INFO mapred.JobClient: CPU time spent (ms)=1000
13/11/27 14:40:07 INFO mapred.JobClient: Physical memory (bytes) snapshot=60805120
13/11/27 14:40:07 INFO mapred.JobClient: Virtual memory (bytes) snapshot=389214208
13/11/27 14:40:07 INFO mapred.JobClient: Total committed heap usage (bytes)=15990784
13/11/27 14:40:07 INFO mapreduce.ImportJobBase: Transferred 407 bytes in 31.171 seconds (13.057 bytes/sec)
13/11/27 14:40:07 INFO mapreduce.ImportJobBase: Retrieved 9 records.
[wissem@localhost ~]$


 

Sqoop used the MapReduce framework (one mapper and no reducer); Sqoop automatically created a jar file named "USER_ACTIVITY.jar" and extracted the contents of the user activity table from the Oracle database using SQL. The resulting file in HDFS is named part-m-00000. This is the Hadoop file naming convention defined in the FileOutputFormat class (see the mapreduce.output.basename property): the "m" part means the file is map output, while "r" indicates reducer output when there is a reduce phase. As we defined one map task, we have a single output file named part-m-00000. Let's look at the content of part-m-00000 and then rename it to user_activity.txt.
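
Before inspecting the content, we can simply list the import directory to see the part-m-00000 file described above:

[wissem@localhost ~]$ hadoop fs -ls /user/wissem/USER_ACTIVITY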

 

[wissem@localhost ~]$ hadoop fs -cat /user/wissem/USER_ACTIVITY/part-m-00000

98.245.153.111 2013-11-27 04:49:14.0 Google TN
65.245.153.105 2013-11-27 04:49:59.0 Yahoo BE
98.245.153.111 2013-11-27 04:50:14.0 MSN TN
98.245.153.111 2013-11-27 04:50:25.0 MSN TN
98.245.153.111 2013-11-27 04:50:28.0 MSN TN
98.245.153.111 2013-11-27 04:50:30.0 MSN TN
65.245.153.105 2013-11-27 04:50:40.0 Yahoo BE
65.245.153.105 2013-11-27 04:50:41.0 Yahoo BE
65.245.153.105 2013-11-27 04:50:42.0 Yahoo BE
[wissem@localhost ~]$
[wissem@localhost ~]$ hadoop fs -mv USER_ACTIVITY/part-m-00000 /user/wissem/USER_ACTIVITY/user_activity.txt
[wissem@localhost ~]$ hadoop fs -cat /user/wissem/USER_ACTIVITY/user_activity.txt
98.245.153.111 2013-11-27 04:49:14.0 Google TN
65.245.153.105 2013-11-27 04:49:59.0 Yahoo BE
98.245.153.111 2013-11-27 04:50:14.0 MSN TN
98.245.153.111 2013-11-27 04:50:25.0 MSN TN
98.245.153.111 2013-11-27 04:50:28.0 MSN TN
98.245.153.111 2013-11-27 04:50:30.0 MSN TN
65.245.153.105 2013-11-27 04:50:40.0 Yahoo BE
65.245.153.105 2013-11-27 04:50:41.0 Yahoo BE
65.245.153.105 2013-11-27 04:50:42.0 Yahoo BE

[wissem@localhost ~]$

Now we have the data inside the Hadoop Distributed File System. Let's move on to the second part of the ETL process, which is the transformation part.

Data Transformation in Hadoop: The goal is to know how many times each user is active per day.
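
Expressed in plain SQL against the USER_ACTIVITY table created earlier, that aggregation would look roughly like this:

SQL> SELECT COUNT(*), IP_ADRR, TRUNC(ACTIVE_DATE)
     FROM user_activity
     GROUP BY IP_ADRR, TRUNC(ACTIVE_DATE);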


This query could take several minutes to run on an Oracle database. But what if we have thousands of billions of records and the query starts taking hours to run? As a distributed framework, Hadoop stores the file in blocks of 64 MB (the default) spread across different servers. A Hadoop Map program runs on each node that stores the data, and a reducer takes the sorted results of the Map phase and generates an output file in HDFS. This guarantees that each node of the Hadoop framework processes a smaller piece of the file, and all nodes of the cluster working in parallel deliver the output file. Now that we have explained how the Hadoop system works, let's run a MapReduce program against the user_activity.txt file. We could write Java classes for the Map and Reduce phases, but HiveQL (Hive's SQL-like language) can also do the work without the need to write lines of Java code. First, let's create a table in Hive with the tab ('\t') delimiter.


[wissem@localhost ~]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Hive history file=/tmp/wissem/hive_job_log_024ccdda-b362-4ef2-9222-2c374dd2ac83_188192255.txt
hive> create table user_Activity (IP_ADRR STRING, ACTIVE_DATE STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
OK
Time taken: 13.84 seconds
hive> show tables user_Activity;
OK
Time taken: 0.235 seconds
hive>


Now that we have created a table in Hive, let's populate it with the user activity data extracted from the Oracle database in the previous section of this article.


hive> LOAD DATA INPATH "/user/wissem/USER_ACTIVITY/user_activity.txt" INTO TABLE user_Activity;
Loading data to table default.user_activity
Table default.user_activity stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 407, raw_data_size: 0]
OK
Time taken: 1.318 seconds
hive> select * from user_activity;
OK
98.245.153.111 2013-11-27 04:49:14.0
65.245.153.105 2013-11-27 04:49:59.0
98.245.153.111 2013-11-27 04:50:14.0
98.245.153.111 2013-11-27 04:50:25.0
98.245.153.111 2013-11-27 04:50:28.0
98.245.153.111 2013-11-27 04:50:30.0
65.245.153.105 2013-11-27 04:50:40.0
65.245.153.105 2013-11-27 04:50:41.0
65.245.153.105 2013-11-27 04:50:42.0
Time taken: 0.541 seconds
hive>


Now let's process the user activity data and write the results to HDFS. We use the INSERT OVERWRITE DIRECTORY syntax of HiveQL; we name the output directory in HDFS 'USER_ACTIVITY_OUT', and it is overwritten if it already exists. As of Hive 0.11.0 the separator used can be specified; in earlier versions it was always the ^A character (\001).
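
For reference, on Hive 0.11.0 or later the delimiter can be declared directly in the statement, roughly as sketched below (the session that follows keeps the default ^A separator):

INSERT OVERWRITE DIRECTORY 'USER_ACTIVITY_OUT'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT COUNT(*), IP_ADRR, SUBSTR(ACTIVE_DATE, 1, 10)
FROM USER_ACTIVITY
GROUP BY IP_ADRR, SUBSTR(ACTIVE_DATE, 1, 10);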


hive> INSERT OVERWRITE DIRECTORY 'USER_ACTIVITY_OUT' SELECT COUNT(*), IP_ADRR , SUBSTR(ACTIVE_DATE, 1,
10) FROM USER_ACTIVITY group by IP_ADRR , SUBSTR(ACTIVE_DATE, 1, 10);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer= <number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max= <number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks= <number>
Starting Job = job_201311271428_0003, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201311271428_0003
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201311271428_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-11-27 15:32:20,392 Stage-1 map = 0%, reduce = 0%
2013-11-27 15:32:27,475 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.54 sec
2013-11-27 15:32:28,488 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.54 sec
2013-11-27 15:32:29,497 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.54 sec
2013-11-27 15:32:30,526 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.54 sec
2013-11-27 15:32:31,543 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.36 sec
2013-11-27 15:32:32,582 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.36 sec
MapReduce Total cumulative CPU time: 4 seconds 360 msec
Ended Job = job_201311271428_0003
Moving data to: USER_ACTIVITY_OUT
2 Rows loaded to USER_ACTIVITY_OUT
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 4.36 sec HDFS Read: 634 HDFS Write: 56 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 360 msec
OK
Time taken: 16.466 seconds
hive> quit;


Note from the output above that we can monitor the execution of the job via the JobTracker URL:
http://localhost:50030/jobdetails.jsp?jobid=job_201311271428_0003



[Figure hadoop-oracle-job-results.jpg: JobTracker page showing the completed Hive job]
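
The same status information is available from the command line; with the classic (MRv1) job client used in this environment, a command along these lines should work:

[wissem@localhost ~]$ hadoop job -status job_201311271428_0003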


Now, we can check the contents of the output file.

[wissem@localhost ~]$ hadoop fs -cat /user/wissem/USER_ACTIVITY_OUT/000000_0

4 65.245.153.105 2013-11-27
5 98.245.153.111 2013-11-27

[wissem@localhost ~]$




Load data from Hadoop into the Oracle database: let's first create a table called USER_STATS in Oracle 12c. This table will contain the final processed data.



SQL> CREATE TABLE USER_STATS (ACTIVES NUMBER,IP_ADRR VARCHAR2(30), ACTIVE_DATE VARCHAR2(10));
Table created.



Now we load the processed data into the Oracle database using Sqoop's export option.


[wissem@localhost ~]$
sqoop export --connect jdbc:oracle:thin:@//sandbox1.xxx.com:1522/orawissPDB --export-dir '/user/wissem/USER_ACTIVITY_OUT/000000_0' --table WISSEM.USER_STATS --fields-terminated-by '\001'
--input-null-string '\\N' --input-null-non-string '\\N' -m 1 --username WISSEM -P

Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Enter password:
13/11/27 15:36:06 INFO manager.SqlManager: Using default fetchSize of 1000
13/11/27 15:36:06 INFO tool.CodeGenTool: Beginning code generation
13/11/27 15:36:09 INFO manager.OracleManager: Time zone has been set to GMT
13/11/27 15:36:10 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM WISSEM.USER_STATS t WHERE 1=0
13/11/27 15:36:11 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
13/11/27 15:36:11 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar
Note: /tmp/sqoop-wissem/compile/e49e98500225d2ef01254258389240fd/WISSEM_USER_STATS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/11/27 15:36:12 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wissem/compile/e49e98500225d2ef01254258389240fd/WISSEM.USER_STATS.jar
13/11/27 15:36:12 INFO mapreduce.ExportJobBase: Beginning export of WISSEM.USER_STATS
13/11/27 15:36:13 INFO manager.OracleManager: Time zone has been set to GMT
13/11/27 15:36:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/11/27 15:36:15 INFO input.FileInputFormat: Total input paths to process : 1
13/11/27 15:36:15 INFO input.FileInputFormat: Total input paths to process : 1
13/11/27 15:36:15 INFO mapred.JobClient: Running job: job_201311271428_0004
13/11/27 15:36:16 INFO mapred.JobClient: map 0% reduce 0%
13/11/27 15:36:23 INFO mapred.JobClient: map 100% reduce 0%
13/11/27 15:36:24 INFO mapred.JobClient: Job complete: job_201311271428_0004
13/11/27 15:36:24 INFO mapred.JobClient: Counters: 24
13/11/27 15:36:24 INFO mapred.JobClient: File System Counters
13/11/27 15:36:24 INFO mapred.JobClient: FILE: Number of bytes read=0
13/11/27 15:36:24 INFO mapred.JobClient: FILE: Number of bytes written=214755
13/11/27 15:36:24 INFO mapred.JobClient: FILE: Number of read operations=0
13/11/27 15:36:24 INFO mapred.JobClient: FILE: Number of large read operations=0
13/11/27 15:36:24 INFO mapred.JobClient: FILE: Number of write operations=0
13/11/27 15:36:24 INFO mapred.JobClient: HDFS: Number of bytes read=202
13/11/27 15:36:24 INFO mapred.JobClient: HDFS: Number of bytes written=0
13/11/27 15:36:24 INFO mapred.JobClient: HDFS: Number of read operations=4
13/11/27 15:36:24 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/11/27 15:36:24 INFO mapred.JobClient: HDFS: Number of write operations=0
13/11/27 15:36:24 INFO mapred.JobClient: Job Counters
13/11/27 15:36:24 INFO mapred.JobClient: Launched map tasks=1
13/11/27 15:36:24 INFO mapred.JobClient: Data-local map tasks=1
13/11/27 15:36:24 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=7608
13/11/27 15:36:24 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
13/11/27 15:36:24 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/11/27 15:36:24 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/11/27 15:36:24 INFO mapred.JobClient: Map-Reduce Framework
13/11/27 15:36:24 INFO mapred.JobClient: Map input records=2
13/11/27 15:36:24 INFO mapred.JobClient: Map output records=2
13/11/27 15:36:24 INFO mapred.JobClient: Input split bytes=143
13/11/27 15:36:24 INFO mapred.JobClient: Spilled Records=0
13/11/27 15:36:24 INFO mapred.JobClient: CPU time spent (ms)=820
13/11/27 15:36:24 INFO mapred.JobClient: Physical memory (bytes) snapshot=58933248
13/11/27 15:36:24 INFO mapred.JobClient: Virtual memory (bytes) snapshot=388550656
13/11/27 15:36:24 INFO mapred.JobClient: Total committed heap usage (bytes)=15990784
13/11/27 15:36:24 INFO mapreduce.ExportJobBase: Transferred 202 bytes in 10.349 seconds (19.5188 bytes/sec)
13/11/27 15:36:24 INFO mapreduce.ExportJobBase: Exported 2 records.

[wissem@localhost ~]$



Now that the results are exported successfully, let’s check the USER_STATS table.



SQL> select * from USER_STATS;

ACTIVES     IP_ADRR              ACTIVE_DAT
---------- --------------------- ----------
4            65.245.153.105      2013-11-27
5            98.245.153.111      2013-11-27



Conclusion:

In this article, we have seen how an Oracle database and Hadoop (framework and ecosystem) can work together to deliver an efficient, optimal enterprise architecture to store and process large data sets.

About the author:

Wissem is a senior Oracle DBA/architect with over 12 years of experience, specialized in Oracle RAC and Data Guard. He currently works for Schneider Electric / APC Global Operations. Wissem is the first Oracle ACE in Spain and holds OCP and OCE DBA certifications. You can follow Wissem on his blog, www.oracle-class.com, or on Twitter at @orawiss.