This article illustrates how to use Hadoop ecosystem tools to extract data from an Oracle 12c database, process and transform that data with the Hadoop framework, and then load the processed data back into an Oracle 12c database.
It covers three basic steps: extracting data from the Oracle database into HDFS, transforming the data in Hadoop, and loading the processed results from Hadoop back into the Oracle database.
Extract Data from Oracle Database into HDFS:
sandbox1(orawiss):/home/oracle/wissem>sqlplus / as sysdba
SQL*Plus: Release 12.1.0.1.0 Production on Mon Nov 25 10:57:42 2013
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options
SQL> select con_id,dbid,NAME,OPEN_MODE from v$pdbs;
    CON_ID       DBID NAME        OPEN_MODE
         2 4061740350 PDB$SEED    READ ONLY
         3 1574282659 ORAWISS12C  READ WRITE
SQL>
To create a service for a PDB using the SRVCTL utility, use the add service command and specify the PDB in the -pdb parameter.
Then start the new service called orawissPDB.
sandbox1(+ASM):/home/oracle>srvctl add service -db orawiss -service orawissPDB -pdb orawiss12c
sandbox1(+ASM):/home/oracle>srvctl start service -db orawiss -service orawissPDB
sandbox1(+ASM):/home/oracle>
Create an entry in the tnsnames.ora file for the service orawissPDB created.
Then test the connection.
sandbox1(+ASM):/home/oracle>cat /opt/app/oracle/product/12.1/db_1/network/admin/tnsnames.ora
orawissPDB =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = sandbox1.XXX.com)(PORT = 1522))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = orawissPDB)
    )
  )
sandbox1(+ASM):/home/oracle>
sandbox1(orawiss):/home/oracle/wissem >sqlplus wissem/wissem@orawissPDB
SQL*Plus: Release 12.1.0.1.0 Production on Mon Nov 25 10:51:24 2013
Copyright (c) 1982, 2013, Oracle.
All rights reserved.
Last Successful login time: Sun Jul 14 2013 05:20:12 -05:00
Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options
SQL> exit
Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options
sandbox1(orawiss):/home/oracle/wissem/ORAWISS/archivelog>
Download the JDBC Driver for Oracle: http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html.
You must accept the license agreement before you can download the driver.
Download the ojdbc6.jar file.
Copy the ojdbc6.jar file to the /usr/lib/sqoop/lib/ directory.
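For example, assuming the jar was downloaded to the current directory (the source path here is hypothetical; adjust it to your download location):
[wissem@localhost ~]$ sudo cp ojdbc6.jar /usr/lib/sqoop/lib/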
Test the connection to the Oracle database: we can verify the connection using Sqoop.
Note: we use sandbox1 as the Oracle database server and localhost as the machine where Hadoop, Sqoop and Hive are installed. In a typical deployment Hadoop runs across multiple machines, with Sqoop and Hive installed on separate client machines; for simplicity, the entire Hadoop ecosystem and framework is installed on a single machine here.
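The exact command from the original session is not shown; a minimal test along these lines, using Sqoop's eval tool against the pluggable database service, would verify connectivity:
[wissem@localhost ~]$ sqoop eval --connect jdbc:oracle:thin:@//sandbox1.xxx.com:1522/orawissPDB --username WISSEM -P --query "SELECT 1 FROM dual"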
We selected 1 from the dual table and Sqoop returned the value 1, which means the connection to the Oracle pluggable database is established.
Internet Active Users Dataset:
In this example, an online marketing company that serves internet users banners and web pages offering software downloads, games and so on has millions of active users per day around the globe. The inventory department wants to know how many times each user is active per day, where "active" means the user clicked on a banner, clicked on an offer, or simply signed up. For simplicity, imagine we need an Oracle table with just two key fields: the user IP and the activity date. The table can have many other columns, such as a banner identifier, traffic source, country, etc. Let's call this table "User_Activity"; it may receive billions of records per day.
We will write a program that mines the active users' data. Data is collected every time a user performs an "action", at many locations around the globe, so we may gather a large volume of log data. The data is structured, which makes it a good candidate to be managed by an Oracle database, and the user activity table is also a good candidate for analysis with MapReduce. The goal is to know how many times each user is active per day.
Let's create the table in the Oracle 12c database and populate it with some sample data.
sandbox1(orawiss):/home/oracle/wissem>sqlplus wissem/wissem@orawissPDB
SQL*Plus: Release 12.1.0.1.0 Production on Wed Nov 27 04:43:09 2013
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Last Successful login time: Tue Nov 26 2013 11:26:04 -05:00
Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options
SQL> create table user_activity(IP_ADRR Varchar2(30), ACTIVE_DATE DATE, Traffi_source Varchar2(50),
ISO_COUNTRY Varchar2(2));
Table created.
SQL> ALTER TABLE user_activity ADD PRIMARY KEY (IP_ADRR, ACTIVE_DATE);
Table altered.
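The original INSERT statements are not shown; the sample rows listed below could be generated with statements of this form (a sketch based on the data that follows):
SQL> INSERT INTO user_activity VALUES ('98.245.153.111', TO_DATE('11/27/2013 04:49:14', 'MM/DD/YYYY HH24:MI:SS'), 'Google', 'TN');
SQL> INSERT INTO user_activity VALUES ('65.245.153.105', TO_DATE('11/27/2013 04:49:59', 'MM/DD/YYYY HH24:MI:SS'), 'Yahoo', 'BE');
SQL> COMMIT;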
SQL> select * from user_activity;
IP_ADRR ACTIVE_DATE TRAFFI_SOURCE IS
------------------ -------------- -------------------------------------------------- --
98.245.153.111 11/27/2013 04:49:14 Google TN
65.245.153.105 11/27/2013 04:49:59 Yahoo BE
98.245.153.111 11/27/2013 04:50:14 MSN TN
98.245.153.111 11/27/2013 04:50:25 MSN TN
98.245.153.111 11/27/2013 04:50:28 MSN TN
98.245.153.111 11/27/2013 04:50:30 MSN TN
65.245.153.105 11/27/2013 04:50:40 Yahoo BE
65.245.153.105 11/27/2013 04:50:41 Yahoo BE
65.245.153.105 11/27/2013 04:50:42 Yahoo BE
……
……
SQL>
Extract data from Oracle Database into HDFS:
We use Sqoop to import the data from Oracle into HDFS. Sqoop's import option imports data from an RDBMS; we need to specify a connection string, the table name and the field separator. More details about sqoop import can be found in the Sqoop user's guide. Sqoop can import data in parallel: the number of map tasks (parallel processes) used for the import is set with the -m or --num-mappers argument, and four tasks are used by default. In our example we use only one mapper; a parallel variant is sketched below for reference.
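If we wanted to import in parallel instead, we would also have to name a split column, because the table's primary key spans two columns; a hedged variant (not part of the original session) might look like this:
[wissem@localhost ~]$ sqoop import --connect jdbc:oracle:thin:@//sandbox1.xxx.com:1522/orawissPDB --table USER_ACTIVITY --fields-terminated-by '\t' --num-mappers 4 --split-by ACTIVE_DATE --username WISSEM -P
The actual single-mapper import used in this article follows.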
[wissem@localhost ~]$ sqoop import --connect jdbc:oracle:thin:@//sandbox1.xxx.com:1522/orawissPDB --table
USER_ACTIVITY --fields-terminated-by '\t' --num-mappers 1 --username WISSEM -P
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Enter password:
13/11/27 14:39:23 INFO manager.SqlManager: Using default fetchSize of 1000
13/11/27 14:39:23 INFO tool.CodeGenTool: Beginning code generation
13/11/27 14:39:26 INFO manager.OracleManager: Time zone has been set to GMT
13/11/27 14:39:27 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM USER_ACTIVITY t WHERE 1=0
13/11/27 14:39:27 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
13/11/27 14:39:27 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar
Note: /tmp/sqoop-wissem/compile/6a6e927fb008e151d1eeb04bff6f24ef/USER_ACTIVITY.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/11/27 14:39:30 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wissem/compile/6a6e927fb008e151d1eeb04bff6f24ef/USER_ACTIVITY.jar
13/11/27 14:39:30 INFO manager.OracleManager: Time zone has been set to GMT
13/11/27 14:39:34 WARN manager.OracleManager: The table USER_ACTIVITY contains a multi-column primary key. Sqoop will default to the column ACTIVE_DATE only for this job.
13/11/27 14:39:34 INFO manager.OracleManager: Time zone has been set to GMT
13/11/27 14:39:35 WARN manager.OracleManager: The table USER_ACTIVITY contains a multi-column primary key. Sqoop will default to the column IP_ADRR only for this job.
13/11/27 14:39:35 INFO mapreduce.ImportJobBase: Beginning import of USER_ACTIVITY
13/11/27 14:39:36 INFO manager.OracleManager: Time zone has been set to GMT
13/11/27 14:39:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/11/27 14:39:42 INFO mapred.JobClient: Running job: job_201311271428_0001
13/11/27 14:39:43 INFO mapred.JobClient: map 0% reduce 0%
13/11/27 14:40:04 INFO mapred.JobClient: map 100% reduce 0%
13/11/27 14:40:07 INFO mapred.JobClient: Job complete: job_201311271428_0001
13/11/27 14:40:07 INFO mapred.JobClient: Counters: 23
13/11/27 14:40:07 INFO mapred.JobClient: File System Counters
13/11/27 14:40:07 INFO mapred.JobClient: FILE: Number of bytes read=0
13/11/27 14:40:07 INFO mapred.JobClient: FILE: Number of bytes written=216292
13/11/27 14:40:07 INFO mapred.JobClient: FILE: Number of read operations=0
13/11/27 14:40:07 INFO mapred.JobClient: FILE: Number of large read operations=0
13/11/27 14:40:07 INFO mapred.JobClient: FILE: Number of write operations=0
13/11/27 14:40:07 INFO mapred.JobClient: HDFS: Number of bytes read=87
13/11/27 14:40:07 INFO mapred.JobClient: HDFS: Number of bytes written=407
13/11/27 14:40:07 INFO mapred.JobClient: HDFS: Number of read operations=1
13/11/27 14:40:07 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/11/27 14:40:07 INFO mapred.JobClient: HDFS: Number of write operations=1
13/11/27 14:40:07 INFO mapred.JobClient: Job Counters
13/11/27 14:40:07 INFO mapred.JobClient: Launched map tasks=1
13/11/27 14:40:07 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=20435
13/11/27 14:40:07 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
13/11/27 14:40:07 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/11/27 14:40:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/11/27 14:40:07 INFO mapred.JobClient: Map-Reduce Framework
13/11/27 14:40:07 INFO mapred.JobClient: Map input records=9
13/11/27 14:40:07 INFO mapred.JobClient: Map output records=9
13/11/27 14:40:07 INFO mapred.JobClient: Input split bytes=87
13/11/27 14:40:07 INFO mapred.JobClient: Spilled Records=0
13/11/27 14:40:07 INFO mapred.JobClient: CPU time spent (ms)=1000
13/11/27 14:40:07 INFO mapred.JobClient: Physical memory (bytes) snapshot=60805120
13/11/27 14:40:07 INFO mapred.JobClient: Virtual memory (bytes) snapshot=389214208
13/11/27 14:40:07 INFO mapred.JobClient: Total committed heap usage (bytes)=15990784
13/11/27 14:40:07 INFO mapreduce.ImportJobBase: Transferred 407 bytes in 31.171 seconds (13.057 bytes/sec)
13/11/27 14:40:07 INFO mapreduce.ImportJobBase: Retrieved 9 records.
[wissem@localhost ~]$
Sqoop used the MapReduce framework (one mapper and no reducer); it automatically created a jar file named "USER_ACTIVITY.jar" and extracted the contents of the user activity table from the Oracle database using SQL. The resulting file in HDFS is named part-m-00000. This follows the Hadoop file naming convention defined in the FileOutputFormat class (see the mapreduce.output.basename property): the "m" part means the file was written by a map task, while "r" would mean it is the output of a reducer. Since we defined one map task, we have a single output file named part-m-00000. Let's look at the content of part-m-00000 and then rename it to user_activity.txt.
[wissem@localhost ~]$ hadoop fs -cat /user/wissem/USER_ACTIVITY/part-m-00000
98.245.153.111 2013-11-27 04:49:14.0 Google TN
65.245.153.105 2013-11-27 04:49:59.0 Yahoo BE
98.245.153.111 2013-11-27 04:50:14.0 MSN TN
98.245.153.111 2013-11-27 04:50:25.0 MSN TN
98.245.153.111 2013-11-27 04:50:28.0 MSN TN
98.245.153.111 2013-11-27 04:50:30.0 MSN TN
65.245.153.105 2013-11-27 04:50:40.0 Yahoo BE
65.245.153.105 2013-11-27 04:50:41.0 Yahoo BE
65.245.153.105 2013-11-27 04:50:42.0 Yahoo BE
[wissem@localhost ~]$
[wissem@localhost ~]$ hadoop fs -mv USER_ACTIVITY/part-m-00000 /user/wissem/USER_ACTIVITY/user_activity.txt
[wissem@localhost ~]$ hadoop fs -cat /user/wissem/USER_ACTIVITY/user_activity.txt
98.245.153.111 2013-11-27 04:49:14.0 Google TN
65.245.153.105 2013-11-27 04:49:59.0 Yahoo BE
98.245.153.111 2013-11-27 04:50:14.0 MSN TN
98.245.153.111 2013-11-27 04:50:25.0 MSN TN
98.245.153.111 2013-11-27 04:50:28.0 MSN TN
98.245.153.111 2013-11-27 04:50:30.0 MSN TN
65.245.153.105 2013-11-27 04:50:40.0 Yahoo BE
65.245.153.105 2013-11-27 04:50:41.0 Yahoo BE
65.245.153.105 2013-11-27 04:50:42.0 Yahoo BE
[wissem@localhost ~]$
Now we have the data inside the Hadoop Distributed File System. Let's move on to the second part of the ETL process, the transformation.
Data Transformation in Hadoop: The goal is to know how many times per day each user is active.
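In Oracle, this question maps to an aggregation along these lines (shown only for illustration; it is not part of the original session):
SQL> SELECT COUNT(*), ip_adrr, TRUNC(active_date) FROM user_activity GROUP BY ip_adrr, TRUNC(active_date);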
This query could take several minutes to run on an Oracle database, but what if we had billions of records and the query started taking hours to run? As a distributed framework, Hadoop stores the file in blocks of 64 MB (the default) spread across different servers. The map program runs on each node storing the data, and a reducer takes the sorted results of the map phase and generates an output file in HDFS. This guarantees that each node of the Hadoop cluster processes a smaller piece of the file, and all nodes working in parallel deliver the output. Now that we have explained how Hadoop will do the work, let's run a MapReduce program against the user_activity.txt file. We could write Java classes for the map and reduce phases, but HiveQL (Hive's SQL-like language) can do the work without writing any Java code. First, let's create a table in Hive with a tab ('\t') delimiter.
[wissem@localhost ~]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Hive history file=/tmp/wissem/hive_job_log_024ccdda-b362-4ef2-9222-2c374dd2ac83_188192255.txt
hive> create table user_Activity (IP_ADRR STRING, ACTIVE_DATE STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
OK
Time taken: 13.84 seconds
hive> show tables user_Activity;
OK
Time taken: 0.235 seconds
hive>
Now that we have created a table in Hive, let's populate it with the user activity data extracted from the Oracle database in the previous section of this article.
hive> LOAD DATA INPATH "/user/wissem/USER_ACTIVITY/user_activity.txt" INTO TABLE user_Activity;
Loading data to table default.user_activity
Table default.user_activity stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 407, raw_data_size: 0]
OK
Time taken: 1.318 seconds
hive> select * from user_activity;
OK
98.245.153.111 2013-11-27 04:49:14.0
65.245.153.105 2013-11-27 04:49:59.0
98.245.153.111 2013-11-27 04:50:14.0
98.245.153.111 2013-11-27 04:50:25.0
98.245.153.111 2013-11-27 04:50:28.0
98.245.153.111 2013-11-27 04:50:30.0
65.245.153.105 2013-11-27 04:50:40.0
65.245.153.105 2013-11-27 04:50:41.0
65.245.153.105 2013-11-27 04:50:42.0
Time taken: 0.541 seconds
hive>
Now let's process the user activity data and write the results to HDFS. We use HiveQL's INSERT OVERWRITE DIRECTORY syntax and call the output directory in HDFS 'USER_ACTIVITY_OUT'; it is overwritten if it already exists. As of Hive 0.11.0 the output separator can be specified; in earlier versions it is always the ^A character (\001).
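On Hive 0.11.0 or later, the separator could be chosen explicitly with a ROW FORMAT clause, for example (a sketch; the session below relies on the default ^A separator):
hive> INSERT OVERWRITE DIRECTORY 'USER_ACTIVITY_OUT' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT COUNT(*), IP_ADRR, SUBSTR(ACTIVE_DATE, 1, 10) FROM USER_ACTIVITY GROUP BY IP_ADRR, SUBSTR(ACTIVE_DATE, 1, 10);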
hive> INSERT OVERWRITE DIRECTORY 'USER_ACTIVITY_OUT' SELECT COUNT(*), IP_ADRR , SUBSTR(ACTIVE_DATE, 1,
10) FROM USER_ACTIVITY group by IP_ADRR , SUBSTR(ACTIVE_DATE, 1, 10);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer= <number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201311271428_0003, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201311271428_0003
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201311271428_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-11-27 15:32:20,392 Stage-1 map = 0%, reduce = 0%
2013-11-27 15:32:27,475 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.54 sec
2013-11-27 15:32:28,488 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.54 sec
2013-11-27 15:32:29,497 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.54 sec
2013-11-27 15:32:30,526 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.54 sec
2013-11-27 15:32:31,543 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.36 sec
2013-11-27 15:32:32,582 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.36 sec
MapReduce Total cumulative CPU time: 4 seconds 360 msec
Ended Job = job_201311271428_0003
Moving data to: USER_ACTIVITY_OUT
2 Rows loaded to USER_ACTIVITY_OUT
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 4.36 sec HDFS Read: 634 HDFS Write: 56 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 360 msec
OK
Time taken: 16.466 seconds
hive> quit;
Note from the output above that we can monitor the execution of the job via the following
URL: http://localhost:50030/jobdetails.jsp?jobid=job_201311271428_0003
Now, we can check the contents of the output file.
[wissem@localhost ~]$ hadoop fs -cat /user/wissem/USER_ACTIVITY_OUT/000000_0
4 65.245.153.105 2013-11-27
5 98.245.153.111 2013-11-27
[wissem@localhost ~]$
Load data from Hadoop into Oracle database: Let's now create a table called USER_STATS in Oracle 12c; this table will contain the final processed data.
SQL> CREATE TABLE USER_STATS (ACTIVES NUMBER,IP_ADRR VARCHAR2(30), ACTIVE_DATE VARCHAR2(10));
Table created.
Next, we load the processed data into the Oracle database using Sqoop's export option.
[wissem@localhost ~]$
sqoop export --connect jdbc:oracle:thin:@//sandbox1.xxx.com:1522/orawissPDB --export-dir '/user/wissem/USER_ACTIVITY_OUT/000000_0' --table WISSEM.USER_STATS --fields-terminated-by '\001'
--input-null-string '\\N' --input-null-non-string '\\N' -m 1 --username WISSEM -P
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Enter password:
13/11/27 15:36:06 INFO manager.SqlManager: Using default fetchSize of 1000
13/11/27 15:36:06 INFO tool.CodeGenTool: Beginning code generation
13/11/27 15:36:09 INFO manager.OracleManager: Time zone has been set to GMT
13/11/27 15:36:10 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM WISSEM.USER_STATS t WHERE 1=0
13/11/27 15:36:11 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
13/11/27 15:36:11 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar
Note: /tmp/sqoop-wissem/compile/e49e98500225d2ef01254258389240fd/WISSEM_USER_STATS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/11/27 15:36:12 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wissem/compile/e49e98500225d2ef01254258389240fd/WISSEM.USER_STATS.jar
13/11/27 15:36:12 INFO mapreduce.ExportJobBase: Beginning export of WISSEM.USER_STATS
13/11/27 15:36:13 INFO manager.OracleManager: Time zone has been set to GMT
13/11/27 15:36:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/11/27 15:36:15 INFO input.FileInputFormat: Total input paths to process : 1
13/11/27 15:36:15 INFO input.FileInputFormat: Total input paths to process : 1
13/11/27 15:36:15 INFO mapred.JobClient: Running job: job_201311271428_0004
13/11/27 15:36:16 INFO mapred.JobClient: map 0% reduce 0%
13/11/27 15:36:23 INFO mapred.JobClient: map 100% reduce 0%
13/11/27 15:36:24 INFO mapred.JobClient: Job complete: job_201311271428_0004
13/11/27 15:36:24 INFO mapred.JobClient: Counters: 24
13/11/27 15:36:24 INFO mapred.JobClient: File System Counters
13/11/27 15:36:24 INFO mapred.JobClient: FILE: Number of bytes read=0
13/11/27 15:36:24 INFO mapred.JobClient: FILE: Number of bytes written=214755
13/11/27 15:36:24 INFO mapred.JobClient: FILE: Number of read operations=0
13/11/27 15:36:24 INFO mapred.JobClient: FILE: Number of large read operations=0
13/11/27 15:36:24 INFO mapred.JobClient: FILE: Number of write operations=0
13/11/27 15:36:24 INFO mapred.JobClient: HDFS: Number of bytes read=202
13/11/27 15:36:24 INFO mapred.JobClient: HDFS: Number of bytes written=0
13/11/27 15:36:24 INFO mapred.JobClient: HDFS: Number of read operations=4
13/11/27 15:36:24 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/11/27 15:36:24 INFO mapred.JobClient: HDFS: Number of write operations=0
13/11/27 15:36:24 INFO mapred.JobClient: Job Counters
13/11/27 15:36:24 INFO mapred.JobClient: Launched map tasks=1
13/11/27 15:36:24 INFO mapred.JobClient: Data-local map tasks=1
13/11/27 15:36:24 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=7608
13/11/27 15:36:24 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
13/11/27 15:36:24 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/11/27 15:36:24 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/11/27 15:36:24 INFO mapred.JobClient: Map-Reduce Framework
13/11/27 15:36:24 INFO mapred.JobClient: Map input records=2
13/11/27 15:36:24 INFO mapred.JobClient: Map output records=2
13/11/27 15:36:24 INFO mapred.JobClient: Input split bytes=143
13/11/27 15:36:24 INFO mapred.JobClient: Spilled Records=0
13/11/27 15:36:24 INFO mapred.JobClient: CPU time spent (ms)=820
13/11/27 15:36:24 INFO mapred.JobClient: Physical memory (bytes) snapshot=58933248
13/11/27 15:36:24 INFO mapred.JobClient: Virtual memory (bytes) snapshot=388550656
13/11/27 15:36:24 INFO mapred.JobClient: Total committed heap usage (bytes)=15990784
13/11/27 15:36:24 INFO mapreduce.ExportJobBase: Transferred 202 bytes in 10.349 seconds (19.5188 bytes/sec)
13/11/27 15:36:24 INFO mapreduce.ExportJobBase: Exported 2 records.
[wissem@localhost ~]$
Now that the results are exported successfully, let’s check the USER_STATS table.
SQL> select * from USER_STATS;
ACTIVES IP_ADRR ACTIVE_DAT
---------- --------------------- ----------
4 65.245.153.105 2013-11-27
5 98.245.153.111 2013-11-27
In this article, we have seen how the Oracle database and Hadoop (framework and ecosystem) can work together to deliver an efficient enterprise architecture for storing and processing large data sets.
Wissem is a senior Oracle DBA / architect with over 12 years of experience, specialized in Oracle RAC and Data Guard. He currently works for Schneider Electric / APC global operations. Wissem is the first Oracle ACE in Spain and holds OCP and OCE DBA certifications. You can follow Wissem on his blog, www.oracle-class.com, or on Twitter at @orawiss.