Overview: Integrate Hadoop using Oracle Data Integration
- Review the Scenario
- Ingesting Data Using Sqoop and Oracle GoldenGate
- Transforming Data Using Hive, Spark, or Pig
- Loading Data to Oracle DB using Oracle Loader for Hadoop
- Accessing Hadoop Data from Oracle using Big Data SQL
- Data Sessionization using Pig
- Execute all Steps Using an ODI Package
This demo illustrates how you can move and transform all your data using Oracle Data Integration - whether that data resides in Oracle Database, Hadoop, third-party databases, applications, files, or a combination of these sources. The "Design once, run anywhere" paradigm allows you to focus on the logical rules of data transformation and movement while the choice of implementation is separate and can evolve as use cases change and new technologies become available.
This demo is included in the virtual machine Big Data Lite 4.2, which is downloadable from http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html.
Time to Complete
Approximately one hour
Oracle MoviePlex is an online movie streaming company. Its web site collects every customer interaction in massive JSON formatted log files. It also maintains movie data in an Oracle source database. By unlocking the information contained in these sources and combining it with enterprise data in its data warehouse, the company will be able to enrich its understanding of customer behavior, the effectiveness of product offers, the organization of web site content, and more.
The company is using Oracle's Big Data Management System to unify its data platform and facilitate these analyses. This is achieved by implementing a Data Reservoir pattern, in which both structured and unstructured data are collected and staged in a Hadoop environment for further analysis and for loading into target databases.
Oracle Data Integration provides a unified, tool-driven approach to declaratively define the integration of all data. The concept of "Design once, run anywhere" means that users can define integration processes independently of the implementation language and run them in different environments. A transformation that is executed today on an RDBMS can be reused to run on the Hadoop cluster and can take advantage of future Hadoop languages and optimizations as they become available.
For Oracle MoviePlex, a variety of mechanisms are showcased. Data is loaded from a source database to Hive tables, both in bulk using Sqoop through Oracle Data Integrator (ODI) and through change data capture using Oracle GoldenGate (OGG). Data is transformed by joining, filtering, and aggregating with Hive, Spark, or Pig in ODI, and the resulting data can be loaded into a target Oracle DB using the optimized Oracle Loader for Hadoop (OLH) or Oracle SQL Connector for HDFS (OSCH). Hadoop data can also be used in Oracle DB through Big Data SQL, where ODI transparently generates the external tables needed to expose Hive tables in the Oracle DB for use in SQL queries.
Let's begin the tutorial by reviewing how data is moved and transformed using Oracle Data Integration for Big Data.
Resetting the demo
You can reset the demo environment from a previously run demo or hands-on lab by executing the provided reset script.
Downloading and Installing Big Data Lite Virtual Machine
Please follow the instructions in the Big Data Lite Deployment Guide for details on downloading, installing, and starting Big Data Lite 4.2.
Starting required services
Prepare Oracle Database for Oracle GoldenGate
Part 1 - Ingesting Data Using Sqoop and Oracle GoldenGate
In this section, you will learn how to ingest data from external sources into Hadoop, using Sqoop for bulk load and Oracle GoldenGate for change data capture.
Ingest Bulk Data Using Sqoop in Oracle Data Integrator
Oracle Data Integrator allows the creation of mappings to move and transform data. In this step we will review and execute an existing mapping to load data from an Oracle DB source to a Hive target table.
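For reference, the physical implementation that ODI's Sqoop knowledge module produces boils down to a sqoop import invocation. The sketch below shows a minimal, hypothetical equivalent launched from Python; the connection URL, credentials, and table names are illustrative assumptions, not the demo's actual values.

import subprocess

# Hypothetical equivalent of the Sqoop bulk load generated by the ODI mapping.
# Connection URL, credentials, and table names are illustrative assumptions.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@localhost:1521/orcl",
    "--username", "MOVIEDEMO", "--password", "welcome1",
    "--table", "MOVIE",              # Oracle source table
    "--hive-import",                 # load straight into Hive
    "--hive-table", "movie",         # Hive target table
    "--num-mappers", "4",            # parallel map tasks
], check=True)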
Open and review the Sqoop mapping
Execute the Sqoop mapping
Ingest Change Data using Oracle GoldenGate for Big Data
Oracle GoldenGate captures committed transactions from a source database and replicates those changes to a target system. It is non-invasive, highly performant, and reliable in capturing and applying changes. Oracle GoldenGate for Big Data provides components to replicate changes captured by GoldenGate into different Hadoop target technologies, with adapters for Hive, HDFS, HBase, Flume, and Kafka. In this tutorial we will replicate inserts into the MOVIE table in Oracle to the corresponding movie_updates table in Hive. We will use the Hive adapter first and later also demonstrate the delivery mechanism to Kafka. The GoldenGate processes for the Hive example are set up in the steps below.
Start Oracle GoldenGate and Set Up Replication from Oracle to Hive
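Once replication from Oracle to Hive is running, a quick way to confirm that changes are arriving is to query the Hive target table. A minimal PySpark sketch, assuming the movie_updates table from this section (the movie_id column is an assumption):

from pyspark.sql import SparkSession

# Query the Hive table fed by GoldenGate to confirm replicated inserts.
spark = SparkSession.builder.appName("gg_check").enableHiveSupport().getOrCreate()
spark.table("movie_updates").orderBy("movie_id", ascending=False).show(10)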
Replicate Data from Oracle to Kafka
Oracle GoldenGate for Big Data provides a built-in component to replicate changes captured by GoldenGate into Kafka. It can also deliver the data in formats such as JSON, XML, and Avro, in addition to CSV. The GoldenGate processes for the Kafka handler example are set up in the steps below.
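To see the change records the Kafka handler publishes, a consumer can subscribe to the topic. Here is a minimal sketch using the kafka-python client; the topic name and the JSON output format are assumptions:

import json
from kafka import KafkaConsumer

# Consume GoldenGate change records; topic name and JSON format are assumptions.
consumer = KafkaConsumer(
    "MOVIE",                               # hypothetical topic named after the source table
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.value)                    # e.g. operation type plus before/after images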
Set up Kafka
Working with Kafka handler
Merge Updates using Hive in Oracle Data Integrator
In our demo we have used Oracle Data Integrator to load the initial data using Sqoop and Oracle GoldenGate to replicate changes in real time. We have also used a view to reconcile the updates on the fly. We can also merge the data in bulk using an ODI mapping, so that the merged result holds the same data as the original table in Oracle.
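The merge logic can be sketched in PySpark: combine the Sqoop-loaded base table with the GoldenGate delta and keep the latest row per key. The key column movie_id, the ordering column op_ts, the assumption that both tables share a schema, and the target table name are all illustrative, not the demo's actual design.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("movie_merge").enableHiveSupport().getOrCreate()

base = spark.table("movie")            # initial Sqoop load
delta = spark.table("movie_updates")   # GoldenGate-replicated changes

# Keep the most recent version of each row; movie_id and op_ts are assumed columns.
w = Window.partitionBy("movie_id").orderBy(F.col("op_ts").desc())
merged = (base.unionByName(delta)
              .withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")
              .drop("rn"))

merged.write.mode("overwrite").saveAsTable("movie_merged")  # hypothetical target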
Open and review the Hive merge mapping
Execute the Hive merge mapping
Part 2 - Transforming Data using Hive, Spark, and Pig
In this section, you will use Oracle Data Integrator to transform the movie data previously loaded with Sqoop and GoldenGate. The use case is to join the table movie with customer activity event data that was previously loaded into an HDFS file using Flume and is now exposed as a Hive table movieapp_log_odistage. The activity data contains rating actions; we will calculate an average rating for every movie and store the result in a Hive table movie_rating. You can create one logical mapping and choose Hive, Spark, or Pig as the execution engine for the staging location. ODI will then generate Hive SQL, Spark-Python, or Pig Latin and execute it in the appropriate server engine.
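As an illustration of what the generated Spark-Python code for this logical design might look like, here is a minimal sketch; the column names and the way rating events are identified are assumptions:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("movie_rating").enableHiveSupport().getOrCreate()

movie = spark.table("movie")
activity = spark.table("movieapp_log_odistage")

# Average the rating events per movie; the 'rating' column and the predicate
# identifying rating actions are illustrative assumptions.
avg_ratings = (activity.filter(F.col("rating").isNotNull())
                       .groupBy("movie_id")
                       .agg(F.avg("rating").alias("avg_rating")))

result = movie.join(avg_ratings, "movie_id")
result.write.mode("overwrite").saveAsTable("movie_rating")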
Also as part of this chapter, we will show a mapping that takes a nested JSON HDFS file as input and flattens it to calculate movie ratings from its contents.
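The flattening step itself can be sketched in PySpark: read the nested JSON, explode the nested array, and aggregate. The file path and the schema (a ratings array per movie record) are assumptions for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("flatten_json").enableHiveSupport().getOrCreate()

# Hypothetical path and schema: one record per movie with a nested 'ratings' array.
nested = spark.read.json("/user/odi/demo/movie_ratings.json")

flat = (nested.select("movie_id", F.explode("ratings").alias("r"))
              .select("movie_id", F.col("r.rating").alias("rating")))

flat.groupBy("movie_id").agg(F.avg("rating").alias("avg_rating")).show()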
This demo shows that with ODI you can create logical mappings declaratively without considering any implementation details; those can be added later in the physical design.
Transform Movie Data using Hive
Open and review the Hive mapping
Execute the Hive mapping
Transform Movie Data using Spark
Open and review the mapping
Execute the Spark mapping
Transform Movie Data using Pig
Open and review the mapping
Execute the Pig mapping
Transform Nested JSON Data
Prepare and review source HDFS JSON data
Open and review the mapping
Execute the mapping
Part 3 - Loading Data to Oracle DB using Oracle Loader for Hadoop
In this task we load the results of the prior Hive transformation from the resulting Hive table into the Oracle DB data warehouse. We use the Oracle Loader for Hadoop (OLH) bulk data loader, which uses load mechanisms specifically optimized for Oracle DB.
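Under the covers, OLH runs as a Hadoop job submitted with hadoop jar; ODI assembles the configuration and launches it for you. A hedged sketch of an equivalent manual invocation, with the jar path and the configuration file name as assumptions:

import subprocess

# Launch Oracle Loader for Hadoop manually; the jar location and the XML
# configuration file (target table, input data, JDBC URL) are assumptions.
subprocess.run([
    "hadoop", "jar", "/u01/connectors/olh/jlib/oraloader.jar",
    "oracle.hadoop.loader.OraLoader",
    "-conf", "/home/oracle/olh_movie_rating_conf.xml",
], check=True)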
Load Movie Data to Oracle DB using ODI Mapping and Oracle Loader for Hadoop
Execute the OLH mapping
Part 4 - Accessing Hadoop Data from Oracle using Big Data SQL
In the next section, ODI will transform data from both Oracle DB and Hive in a single mapping using Oracle Big Data SQL. Big Data SQL enables Oracle Exadata to seamlessly query data on the Oracle Big Data Appliance using Oracle's rich SQL dialect. Data stored in Hadoop is queried in exactly the same way as all other data in Oracle Database.
In this use case we combine the previously used activity data with customer data that is stored in Oracle DB. We summarize all purchase activities for each customer and join them with the core customer data.
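Behind the scenes, ODI creates an Oracle external table over the Hive table using the ORACLE_HIVE access driver. A minimal sketch of such a DDL issued from Python with cx_Oracle; the credentials, column list, and exact access-parameter syntax are assumptions that may vary by release:

import cx_Oracle

ddl = """
CREATE TABLE movie_rating_ext (
  movie_id   NUMBER,
  avg_rating NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.movie_rating)
)
REJECT LIMIT UNLIMITED
"""

# Hypothetical credentials and service name for the Big Data Lite database.
conn = cx_Oracle.connect("moviedemo", "welcome1", "localhost/orcl")
cur = conn.cursor()
cur.execute(ddl)

# Hive-backed rows can now be joined with ordinary Oracle tables in plain SQL.
cur.execute("SELECT movie_id, avg_rating FROM movie_rating_ext WHERE ROWNUM <= 5")
print(cur.fetchall())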
Calculate Sales from Hive and Oracle Tables using Big Data SQL
Open and review the Big Data SQL mapping
Execute the Big Data SQL mapping
Part 5 - Data Sessionization using Pig
This demo shows how to execute a complex Pig mapping using ODI, including the ability to use user-defined functions and table functions. The mapping uses the Pig function Sessionize from the Apache DataFu library, which is included in this Hadoop distribution.
The objective of this mapping is to order the activities in the Hive table movieapp_log_odistage into separate sessions for different users and to calculate minimum, maximum, and average session duration statistics by geography.
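For readers who want the gist of the sessionization logic without Pig, the same idea can be sketched in PySpark using window functions instead of DataFu's Sessionize (a deliberate substitution): a new session starts whenever the gap between consecutive events for a user exceeds a timeout. The column names and the 30-minute timeout are assumptions.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("sessionize").enableHiveSupport().getOrCreate()
logs = spark.table("movieapp_log_odistage")  # custid/time columns are assumptions

w = Window.partitionBy("custid").orderBy("time")

# A new session starts when the gap to the previous event exceeds 30 minutes.
gap = F.unix_timestamp("time") - F.unix_timestamp(F.lag("time").over(w))
logs = logs.withColumn("is_new", F.when(gap.isNull() | (gap > 1800), 1).otherwise(0))
logs = logs.withColumn(
    "session_id",
    F.sum("is_new").over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow)),
)

# Session durations, ready for min/max/avg statistics by geography.
sessions = (logs.groupBy("custid", "session_id")
                .agg((F.max(F.unix_timestamp("time"))
                      - F.min(F.unix_timestamp("time"))).alias("duration_s")))
sessions.show()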
Load, Sessionize, and Analyze User Activities using Pig in ODI
Open and review the mapping
Execute the Pig mapping
Part 6 - Execute all Steps Using an ODI Package
In this section we use an ODI Package to execute all of the previous ODI steps sequentially. Packages can be used to orchestrate various jobs such as mappings, procedures, tools, or other packages, and they allow conditional logic and variables to be included in the execution flow. This package also makes use of a File Copy tool that copies a JSON file from the local filesystem into HDFS, as well as procedures that issue Hive SQL calls to truncate tables.
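As a rough analogy for what the package does, the sketch below strings the same kinds of steps together in Python: fail-fast sequencing, a file copy into HDFS, and a Hive truncation call. All paths and names are hypothetical.

import subprocess

def step(cmd):
    """Run one step and stop the flow on failure, like a package's error path."""
    subprocess.run(cmd, check=True)

# ODI File Copy tool equivalent: stage the JSON source file into HDFS.
step(["hdfs", "dfs", "-put", "-f",
      "/home/oracle/movie/movie_ratings.json", "/user/odi/demo/"])

# ODI procedure equivalent: issue Hive SQL to truncate the staging table.
step(["hive", "-e", "TRUNCATE TABLE movieapp_log_odistage"])

# The mappings themselves would follow here, launched through the ODI agent.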
In this tutorial, you have learned how to:
- Ingest Data Using Sqoop and Oracle GoldenGate
- Transform Data Using Hive, Pig, and Spark
- Load Data to Oracle DB using Oracle Loader for Hadoop
- Access Hadoop Data from Oracle using Big Data SQL
- Sessionize Data using Pig
- Use ODI Packages to orchestrate multiple jobs
- Author: Alex Kotopoulis