ODI 12.2.1: Big Data


Before You Begin

Purpose

This tutorial covers how to create ODI mappings for big data using Apache Hadoop®, Hive™, Pig™, and Spark™.

(Apache Hadoop, Hadoop, Hive, Spark, Apache, and the Apache feather logo are trademarks of The Apache Software Foundation. Apache Hadoop, Hive, and Spark project logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.)

Time to Complete

Approximately 1 hour.

Background

Given a pre-installed environment with data servers and schemas for Hadoop, Hive, Pig, and Spark, create a single mapping that can be used with three different big data engines.

Scenario

Oracle MoviePlex is a fictitious online movie streaming company. Customers log in to Oracle MoviePlex, where they are presented with a targeted list of movies based on their past viewing behavior. Because of this personalized experience and reliable, fast performance, customers spend a lot of money with the company, and it has become extremely profitable.

How do you harness big data to create a personalized experience for customers?

What Do You Need?

  • Software
    • Oracle Enterprise Linux (OEL) 6.4 or higher
    • Oracle Database 12c Release 1 Enterprise Edition (12.1.0.2) - including Oracle Big Data SQL-enabled external tables, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more
    • Oracle Data Integrator (ODI) 12.2.1 or higher
    • Oracle Big Data Connectors 4.0 or higher
    • Oracle NoSQL Database Enterprise Edition 12cR1 or higher
    • Cloudera 5.1 or higher
    • Alternatively, the Oracle Big Data Lite Virtual Machine (the VM, see Prerequisites) contains all of the above pre-installed.
  • Hardware
    • At least 2 GB RAM for each target (database or ODI)
    • At least 6 GB RAM for the VM (if you are using that)

Prerequisites

Before starting this tutorial, you should:

  • Have at least 6 GB RAM available if you are using the Oracle Big Data Lite VM (version 4.0.1 or higher)
  • Have access to or have installed the sample lab artifacts
  • Have attended ODI 12c training, [D82167GC10] Oracle Data Integrator 12c: Integration and Administration, or equivalent experience.

Typographic Conventions

These are the conventions used throughout this practice to indicate what you type and what you click.

  • Filename or /path/subdirectory/. Often you can navigate to a destination without having to type it.
  • Command or value that you type, such as gedit .bashrc. Be mindful of periods at the end of sentences that are not part of the command.
  • Click this, such as a button or icon.
  • Some variable or <variable> as something you replace. Often prefixed with "my" such as <myName> or <myHost>.
  • A Menu Item or dialog Field Prompt or Tab name. For example, "click the Designer tab and select Models > New Model".
  • Generally a red rectangle in the graphics is not part of the original image, but added to draw your attention to an item.

1. Hive

From the Apache Hive site: "The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL."
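
HiveQL is deliberately SQL-like. Purely for flavor, the sketch below runs the same shape of aggregate query against an in-memory SQLite database. SQLite is only a convenient stand-in here, and the ratings table and its contents are invented; against Hive, the identical SELECT would run over data in distributed storage.

```python
import sqlite3

# A hypothetical ratings table, standing in for data Hive would read from HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movieid INTEGER, rating REAL)")
conn.executemany("INSERT INTO ratings VALUES (?, ?)",
                 [(1, 4.0), (1, 5.0), (2, 3.0)])

# HiveQL supports essentially this same group-and-aggregate query.
rows = conn.execute(
    "SELECT movieid, AVG(rating) FROM ratings "
    "GROUP BY movieid ORDER BY movieid"
).fetchall()
print(rows)  # [(1, 4.5), (2, 3.0)]
```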

1.1 Hive Logical Mapping

  1. On the main Linux desktop menu bar, click the ODI launcher shortcut.

    menu bar

    Only click it once; do not double-click. If you double-click, two copies will start and run very slowly. You can close the second copy.

  2. Click Connect to Repository and select Login Name: ODI Movie Demo.

    Wallet signin

    The User: SUPERVISOR and Password will be pre-populated for you. Click . (If you are ever prompted for the Wallet password, it is welcome1.)

  3. In the navigation panel, on the Topology tab, expand [+] Physical Architecture > Technologies > Hive and review the Hive dataservers.


    These were added for you before the lab.

  4. On the Designer tab, expand [+] Projects > Big Data Demo > Demo > Mappings, right-click Mappings and select New Mapping. Set Name: CalcRatings.

    New Mapping Mapping Name

    Deselect Create Empty Dataset and then click .

  5. From Designer > Models > HiveMovie, drag and drop the following datastores as sources onto the mapping canvas:

    • movie
    • movieapp_log_avro
    Drag datastores

    Do not worry too much about alignment and pretty layout.

  6. Create a Filter (funnel) behind movieapp_log_avro with Properties > Condition > Filter Condition:
    movieapp_log_avro.activity < 2.

    Add filter

    Dragging the activity attribute to the canvas will create the filter. There are other ways to do it as well.

  7. Drag an AGGREGATE (Σ) from the Component Palette, drop it behind FILTER, and then drop into it the following attributes from movieapp_log_avro:

    • movieid
    • rating
    Create Aggregate

    This automatically makes the connection from the output of Filter to the input of Aggregate.

  8. Change the Target Expression: for AGGREGATE.rating to AVG(movieapp_log_avro.rating).

    AVG rating

    NOTE: The AVG must be in UPPER CASE for Pig to work. The yellow math-function keywords such as AVG can be hard to see.

  9. Drag a JOIN from the Component Palette and drop it behind the movie and AGGREGATE components.

    Add JOIN
  10. Drop onto the JOIN component:

    • movie.movie_id
    • AGGREGATE.movieid
    Cols to Join

    The Join Condition: should now say, movie.movie_id = AGGREGATE.movieid (note one attribute has an underscore and the other does not).

  11. Drag and drop the following target from Models > HiveMovie to behind JOIN: movie_rating.

    movie_rating

    You may want to change the scale/zoom to something smaller, like 75%.

  12. Connect JOIN to movie_rating and, in the Attribute Matching dialog, select Auto Map.


    Click .

  13. Drag AGGREGATE.rating to movie_rating.avg_rating.

    Avg rating
  14. Change the Target Expression: for movie_rating.avg_rating to XROUND(AGGREGATE.rating).


    NOTE: XROUND is a user-defined function that maps to the correct rounding operation for each technology. Optionally, review XROUND by opening Designer > Projects > Big Data Demo > Demo > User Functions > Numeric > XROUND.

    XROUND
  15. For the components FILTER, AGGREGATE, and JOIN, change the Execute on Hint: to Staging.


    Each of the components has the Hint in a slightly different place in Properties.

You have created a Logical mapping for Hive.
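
Stepping back, the logical mapping you just assembled implements a simple dataflow: keep log rows with activity < 2, average the rating per movieid, join the averages to movie on movie_id, and round the result. The plain-Python sketch below walks through that dataflow with invented sample data; it is only a conceptual illustration, not the code ODI generates.

```python
# Invented stand-ins for the movie and movieapp_log_avro datastores.
movies = [{"movie_id": 1, "title": "The Matrix"},
          {"movie_id": 2, "title": "Up"}]
activity_log = [
    {"movieid": 1, "activity": 1, "rating": 4.0},
    {"movieid": 1, "activity": 1, "rating": 5.0},
    {"movieid": 1, "activity": 3, "rating": 2.0},  # filtered out: activity >= 2
    {"movieid": 2, "activity": 1, "rating": 3.0},
]

# FILTER: movieapp_log_avro.activity < 2
filtered = [r for r in activity_log if r["activity"] < 2]

# AGGREGATE: AVG(rating) grouped by movieid
sums = {}
for r in filtered:
    total, count = sums.get(r["movieid"], (0.0, 0))
    sums[r["movieid"]] = (total + r["rating"], count + 1)
avg_rating = {mid: total / count for mid, (total, count) in sums.items()}

# JOIN on movie.movie_id = AGGREGATE.movieid, then round the average.
# Python's round() stands in for the XROUND user function here (note that
# Python rounds ties like 4.5 to the nearest even integer).
movie_rating = [
    {"movie_id": m["movie_id"], "title": m["title"],
     "avg_rating": round(avg_rating[m["movie_id"]])}
    for m in movies if m["movie_id"] in avg_rating
]
print(movie_rating)
```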

1.2 Hive Physical Mapping

  1. Switch to the Physical tab. Rename the Physical design to Name: Hive.


    As soon as you click away from the Name field, the upper tab will change from "Physical" to "Hive."

  2. Select target MOVIE_RATING and switch to the Integration Knowledge Module tab. IKM should be IKM Hive Append.GLOBAL.

    Change TRUNCATE option to True.


    Save (Diskette) and close the CalcRatings mapping tab.

You have created a Physical mapping for Hive.

1.3 Running Hive Mapping

  1. Run (green arrow) the CalcRatings mapping with the following parameters:

    • Physical Mapping Design: Hive
    • Logical Agent: Local
    Run

    Click . On the Session started Information dialog, click .

  2. Wait a few seconds, then go to Operator tab and review the execution results.


    They should all be successful (green check circle), none should be errors (red x circle).

  3. Go to task "40 - Load MOVIE_RATING - IKM Hive Append" and open the Task Editor. Go to the Code tab and review the generated Hive SQL code.


    Scroll down to see more of the code you configured.

  4. In Models > HiveMovie, right-click target movie_rating and review inserted data by selecting View Data.


    Close the Data: movie_rating tab.

You have successfully run the Hive mapping and populated the movie_rating table.

You have integrated Big Data Hive with ODI via a mapping.

2. Pig

From the Apache Pig site: "Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets."

2.1 Pig Logical Mapping

  1. On the Designer tab, execute the procedure Projects > Big Data Demo > Demo > Procedures > Truncate movie_rating to remove all rows from the target table movie_rating.


    Click . On the Session started Information dialog, click .

  2. Review the Pig data servers in Topology > Physical Architecture: the data server Pig executes Pig scripts in MapReduce mode.


    These have been pre-configured for you.

  3. Go to Designer > Projects > Big Data Demo > Demo > Mappings and open the mapping editor for CalcRatings.

    Select the background of the logical mapping diagram and go to Properties. Set Staging Location Hint: Pig.

    alt desc

Note how similar the Hive and Pig versions of the mapping are: very little in the configuration changes when you switch engines.
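
One way to picture why so little changes is a strategy pattern: the logical mapping is a fixed plan, and the staging location hint plus the Knowledge Modules select the engine that executes it. The toy sketch below illustrates that separation; all names are invented, and a real engine would generate HiveQL, Pig Latin, or PySpark rather than running steps in-process.

```python
# The logical plan is a fixed list of transformations; each "engine" only
# decides how the same plan gets executed (a rough analogy for ODI's
# staging hints and Knowledge Modules; names here are invented).
def avg_step(rows):
    # Average the rating over the (already filtered) rows.
    return [{"avg_rating": sum(r["rating"] for r in rows) / len(rows)}]

logical_plan = [
    ("FILTER", lambda rows: [r for r in rows if r["activity"] < 2]),
    ("AGGREGATE", avg_step),
]

def run_on_engine(engine, plan, rows):
    # A real engine (Hive, Pig, Spark) would generate and submit
    # engine-specific code here; this sketch applies the plan directly.
    for _name, step in plan:
        rows = step(rows)
    return rows

log = [{"activity": 1, "rating": 4.0},
       {"activity": 1, "rating": 5.0},
       {"activity": 6, "rating": 1.0}]

# Same logical plan, same answer, regardless of the "engine" chosen.
results = {engine: run_on_engine(engine, logical_plan, log)
           for engine in ("Hive", "Pig", "Spark")}
print(results["Hive"])
```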

2.2 Pig Physical Mapping

  1. Switch to the Physical tab.

    Click the Create a New Physical Design tab (tab w/plus), rename it in Properties > Name: to Pig.


    As soon as you click away from the Name field, the upper tab will change from "Physical1" to "Pig."

  2. Go to MOVIE_RATING target and change IKM to <default>.


    An IKM is not needed in this mode; leaving one set would lead to warnings.

  3. Review Extract Options of physical components such as JOIN. Note the PARALLEL_NUMBER option for performance.


    Do not change anything, you are just looking.

  4. Save (Diskette) your work, but leave the CalcRatings mapping tab open for the next step.

    Save mapping

You have created a Pig physical diagram.

2.3 Running Pig Mapping

  1. Run (green arrow) the CalcRatings mapping with the following parameters:

    • Physical Design: Pig
    • Logical Agent: Local
    Run Mapping Physical/Logical

    Click . On the Session started Information dialog, click .

  2. Go to Operator tab and review the execution results.

    Oper tab

    They should all be successful (green check circle), none should be errors (red x circle).

  3. Go to task "30 - Load JOIN_AP - LKM Pig to Hive" and open the Task Editor. Go to the Code tab and review the generated Pig Latin code under Source Code.

    Pig Latin code
  4. Click on the Details tab and review the Pig details.

    Pig details
  5. Back in the mapping, review the inserted data by right-clicking the target movie_rating in the Logical view and selecting Data.

    View Data

    Close all open tabs.

You have run a mapping with ODI and Pig and checked the output status on Operator.

You have integrated Big Data Pig with ODI via a mapping.

3. Spark

From the Spark FAQ: "Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning."

3.1 Spark Logical Mapping

  1. On the Topology tab, expand [+] Physical Architecture > Technologies > Spark and review the Spark data servers.

    Topo: Phys: Spark
  2. On the Designer tab, expand [+] Projects > Big Data Demo > Demo > Mappings. Open CalcRatings in the mapping editor.

    Open CalcRatings
  3. Select the background of the logical mapping diagram and go to Properties. Set Staging Location Hint: Spark.


    Not SparkLocal, but just plain Spark.

You have configured the logical mapping view for using Spark.

3.2 Spark Physical Mapping

  1. Switch to the Physical tab.

    Click the Create a New Physical Design tab (tab w/plus), rename it in Properties > Name: to Spark.


    As soon as you click away from the Name field, the upper tab will change from "Physical2" to "Spark."

  2. Select JOIN_AP and change the Properties > Loading Knowledge Module > Options: OVERWRITE_TARGET_TABLE to True.

    JOIN_AP LKM
  3. Select MOVIE_RATING in the TARGET_GROUP and change Integration Knowledge Module: to <Default>.

    MOVIE_RATING IKM default
  4. Review Extract Options properties of physical components such as JOIN.

    CACHE_DATA

    Note the CACHE_DATA option.

  5. Save (Diskette) the mapping, but leave the CalcRatings mapping tab open for the next step.

    Save mapping

You have completed the Spark physical mapping.

3.3 Running Spark Mapping

  1. Run (green arrow) the CalcRatings mapping with the following parameters:

    • Physical Design: Spark
    • Logical Agent: Local
      Physical/Logical

    Click . On the Session started Information dialog, click .

  2. Wait a minute, then go to the Operator tab and review the execution results.

    Oper tab

    They should all be successful (green check circle), none should be errors (red x circle). The Warnings (yellow triangle !) are about a deprecated Java option SPARK_JAVA_OPTS, and can be safely ignored.

  3. Go to task "30 - Load JOIN_AP - LKM Spark to Hive" and open the Task Editor. Go to the Code tab and review the generated PySpark code under Target; it is written to a file and executed in the subsequent task.

    PySpark Code

    Close the Session Task Load tab.

  4. Back in the mapping, review the inserted data by right-clicking the target movie_rating in the Logical view and selecting Data.

    View Data

    Close all open tabs.

You have successfully run the Spark ODI integration and validated the inserted data in the table.

You have integrated Big Data Spark with ODI via a mapping.

Want to Learn More?

The Big Data engines (Hive, Pig, and Spark) are remarkably similar in use when it comes to ODI mappings. They can be wildly different internally, but by making small changes on the physical mapping, you can use all three interchangeably at run-time. The Knowledge Module architecture of ODI insulates you from the complexities of the different technologies.

In this Oracle By Example (OBE) tutorial, you learned how to create an ODI mapping for big data using:

  • Hive
  • Pig
  • Spark


Credits

As always, a team effort. This is a partial list of the team:

  • Lead Curriculum Developer: Alex Kotopoulis
  • Other Contributors: Steve Friedberg, Julien Testut, Ayush Ganeriwal