This tutorial illustrates how you can securely analyze data across the big data platform - whether that data resides in Oracle Database 12c, Hadoop, Oracle NoSQL Database or a combination of these sources. Importantly, you will be able to leverage your existing Oracle skill sets and applications to gain these insights. Oracle Big Data SQL allows you to apply Oracle's rich SQL dialect and security policies across the data platform - greatly simplifying the ability to gain insights from all your data.
Note: there are two parts to Big Data SQL: 1) enhanced Oracle Database 12c external tables and 2) Oracle Big Data SQL Servers on the Oracle Big Data Appliance or DIY Hadoop clusters (see the Big Data SQL Datasheet for supported deployments). On the Hadoop cluster, Big Data SQL Cell Servers apply SmartScan over data stored in HDFS in order to achieve fast performance (see this blog post for details). The Oracle Big Data Lite Virtual Machine used for this lab does not have the Big Data SQL Cell Server installed.
Time to Complete
Approximately 90 mins
This tutorial requires the Oracle Big Data Lite Virtual Machine (VM). You can download the VM from the Big Data Lite page on OTN.
Before starting this lesson, perform the following:
- After starting Big Data Lite, ensure the following services are started by using the Start/Stop Services application found on the Linux desktop. Note that started services have an * next to their name.
Note: Ignore any errors generated by these statements.
This tutorial is divided into the following sections:
- Review the Scenario
- Configuring Oracle Big Data SQL
- Create Oracle Tables Over an HDFS Sourced Application Log
- Leverage the Hive Metastore to Access Data in Hadoop and Oracle NoSQL Database
- Applying Oracle Database Security Policies Across the Big Data Platform
- Using Oracle Analytic SQL Across All Your Data
- Using SQL Pattern Matching on your web log data
Oracle MoviePlex is an online movie streaming company. Every user that accesses the site is presented with his/her own movie recommendations based on past viewing activity. This list of recommended movies is updated frequently and is part of the user's profile. Oracle NoSQL Database stores these profiles - delivering application query requests with very low latency for large user communities. Additionally, the web site collects every customer interaction in massive JSON formatted log files. By unlocking the information contained in that activity data and combining it with recommendation data and enterprise data in its data warehouse, the company will be able to enrich its understanding of customer behavior, the effectiveness of product offers, the organization of web site content, and more.
The company is using Oracle's Big Data Management System to unify their data platform and facilitate these analyses.
Oracle Big Data Management System unifies the data platform by providing a common query language, management platform and security framework across Hadoop, NoSQL and Oracle Database. Oracle Big Data SQL is a key component of the platform. It enables Oracle Database 12c to seamlessly query data in Hadoop and NoSQL using Oracle's rich SQL dialect. Data stored in Hadoop or Oracle NoSQL Database is queried in exactly the same way as all other data in Oracle Database. This means that users can begin to gain insights from these new sources using their existing skill sets and applications.
For Oracle MoviePlex, every click on its web site is streamed into HDFS. After the data lands in HDFS, it is immediately accessible to Oracle Database users through Oracle Big Data SQL. In addition, the recommendation data in Oracle NoSQL Database is also accessible through Oracle Big Data SQL. In this hands-on lab, you will learn how to combine the "click data" stored in HDFS with recommendation data in NoSQL Database and revenue data in Oracle Database to better understand the shopping and purchasing patterns of customers visiting the site.
Let's begin the tutorial by reviewing how access to the BDA is configured in Oracle Database.
Part 1 - Configuring Oracle Big Data SQL
In this section, you learn how to configure Oracle Big Data SQL. This configuration process enables Oracle Database 12c to query data in Hadoop or Oracle NoSQL Database.
As mentioned in the overview, this VM uses enhanced external tables in Oracle Database 12c to access data in HDFS and Oracle NoSQL Database. It does not have the Big Data SQL Server Cells installed. In a true Oracle Big Data environment, there are two installation tasks:
- Install Oracle Big Data SQL on the Hadoop cluster. This step sets up Big Data SQL Server Cells on each node - enabling SmartScan on local data.
- For Oracle Database 12c, run the Big Data SQL installation script on each Oracle Database node. This step sets up connectivity from Oracle Database to the Big Data SQL Server Cells on the Hadoop cluster. It also includes installing a Hadoop client, configuration directories and files, the Big Data SQL Agent, Oracle directory objects and more.
Let's review some of the important elements that are produced in the second configuration task.
Review the Common and Cluster Directories
Two file system directories - the Common and Cluster directories - are set up in the Oracle Database home. These directories store configuration files that enable the Exadata Server to connect to the BDA. A short description of each follows.
First, let's review the Common directory:
- The Common directory contains a few subdirectories and an important file named bigdata.properties. This file stores configuration information that is common to all BDA clusters. Specifically, it contains property-value pairs used to configure the JVM and identify a default cluster.
- The bigdata.properties file must be accessible to the operating system user under which the Oracle Database runs.
Next, the Cluster directory:
- The Cluster directory contains configuration files required to connect to a specific BDA cluster.
- In addition, the Cluster directory must be a subdirectory of the Common directory - and the name of the directory is important: it is the name that you will use to identify the cluster. This will be described in more detail later.
Oracle directory objects that correspond to these file system directories are created by the install process.
Review Oracle Directory Objects
As previously shown, the configuration files have been saved to the file system. The installation process creates corresponding Oracle directory objects that point to these folders.
The Oracle directory objects have a specific naming convention:
- ORACLE_BIGDATA_CONFIG : the Oracle directory object that references the Common directory
- ORACLE_BIGDATA_CL_bigdatalite : the Oracle directory object that references the Cluster directory. The naming convention for this directory object is as follows:
- The name begins with ORACLE_BIGDATA_CL_
- It is followed by the cluster name (i.e. "bigdatalite"). This name is case sensitive (so don't forget quotes for lowercase names!) and is limited to 15 characters.
- It must match the physical directory name in the file system (repeat: it's case sensitive!).
Review these Oracle directory objects:
Now that we have reviewed the configuration, let's create Oracle tables that access data in HDFS and Oracle NoSQL Database!
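As a quick sanity check before moving on, the directory objects described above can be listed from the database. This is a sketch; the paths returned on your system may differ.

```sql
-- List the Big Data SQL directory objects created by the install.
-- The ORACLE_BIGDATA% naming convention is described in Part 1.
SELECT directory_name, directory_path
FROM   all_directories
WHERE  directory_name LIKE 'ORACLE_BIGDATA%'
ORDER  BY directory_name;
```

You should see ORACLE_BIGDATA_CONFIG (the Common directory) and ORACLE_BIGDATA_CL_bigdatalite (the Cluster directory).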
Part 2 - Create Oracle Table Over Application Log
In this section, you will create an Oracle table over data stored in HDFS and then query that data. This example will use the ORACLE_HDFS driver; it will not leverage metadata stored in the Hive Catalog.
Review Application Log Stored in HDFS
The movie application streamed data into HDFS - specifically into the following directory: /user/oracle/moviework/applog_json. Let's review that log data:
Create Oracle Table Over Application Log
Now that you have reviewed the source data, create an Oracle table over the file. This table will be very simple: a single column where each record contains a JSON document. You will then use Oracle SQL to easily parse the JSON fields found in each document:
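A minimal sketch of such a table follows. The table name movielog, the DEFAULT_DIR directory object and the JSON field names in the sample query are assumptions for illustration; the HDFS path comes from the previous step.

```sql
-- Single-column external table over the JSON application log in HDFS.
-- The ORACLE_HDFS driver reads the files directly; no Hive metadata is used.
CREATE TABLE movielog (
  click VARCHAR2(4000)     -- each record holds one JSON document
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HDFS
  DEFAULT DIRECTORY DEFAULT_DIR
  LOCATION ('/user/oracle/moviework/applog_json')
)
REJECT LIMIT UNLIMITED;

-- Oracle's JSON dot notation can then parse fields in each document:
SELECT m.click.custid, m.click.movieid
FROM   movielog m
WHERE  ROWNUM <= 10;
```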
In a matter of minutes, you were able to create and query Oracle Database tables over data sourced in HDFS - and then join that data with other Oracle Database tables.
Next, we will leverage metadata already available in the Hive Metastore to make it even easier to query complex data in Hadoop.
Part 3 - Leverage the Hive Metastore to Access Data in Hadoop & Oracle NoSQL Database
Hive enables SQL access to data stored in Hadoop and NoSQL stores. There are two parts to Hive: the Hive execution engine and the Hive Metastore.
The Hive execution engine launches MapReduce job(s) based on the SQL that has been issued. MapReduce is a batch processing framework and is not intended for interactive query and analysis - but it is extremely useful for querying massive data sets using the well understood SQL language. Importantly, no coding is required (Java, Pig, etc.). The SQL supported by Hive is still limited (SQL92), but improvements are being made over time.
The Hive Metastore has become the standard metadata repository for data stored in Hadoop. It contains the definitions of tables (table name, columns and data types), the location of data files (e.g. directory in HDFS), and the routines required to parse that data (e.g. StorageHandlers, InputFormats and SerDes). The data accessed through Hive does not have to be stored in Hadoop. For example, Oracle NoSQL Database offers a StorageHandler that makes its data accessible through Hive. This capability will be leveraged by Oracle Big Data SQL.
There are many query execution engines that use the Hive Metastore while bypassing the Hive execution engine. Oracle Big Data SQL is an example of such an engine. This means that the same metadata can be shared across multiple products (e.g. Hive, Oracle Big Data SQL, Impala, Pig, Spark SQL, etc.); you will see an example of this in action in the following exercises.
Let's begin by reviewing the tables that have been defined in Hive. After reviewing these Hive definitions, we'll create tables in Oracle Database that will query the underlying Hive data stored in HDFS and Oracle NoSQL Database:
Review Tables Stored in Hive
Tables in Hive are organized into databases. In our example, several tables have been created in the default database. Connect to Hive and investigate these tables.
Leverage Hive Metadata When Creating Oracle Tables
Oracle Big Data SQL is able to leverage the Hive metadata when creating and querying tables.
In this section, you will create Oracle tables over three Hive tables, including movieapp_log_avro and recommendation. Oracle Big Data SQL will utilize the existing StorageHandlers and SerDes required to process this data.
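For example, a table over the Hive movieapp_log_avro table might be created as sketched below. The Oracle column list and the DEFAULT_DIR directory object are assumptions; the columns must match the Hive table's definition.

```sql
-- External table that delegates storage details to the Hive Metastore.
-- Big Data SQL picks up the Avro SerDe and file locations from Hive.
CREATE TABLE movieapp_log_avro (
  custid   NUMBER,
  movieid  NUMBER,
  activity NUMBER,
  logtime  VARCHAR2(20)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.movieapp_log_avro)
)
REJECT LIMIT UNLIMITED;
```

Because the StorageHandler and SerDe come from the Hive Metastore, no parsing logic needs to be restated on the Oracle side.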
Part 4 - Big Data SQL Performance Features
Big Data SQL provides numerous features that enhance query performance. These include:
- SmartScan: data-local scans on the Hadoop cluster that filter data based on SQL query predicates
- Storage Indexes: automatically generated, in-memory indexes that enable SmartScan to skip reading blocks that do not contain data matching the query predicate
- Bloom Filters: push a predicate that was applied to a joined look-up table down to the data stored on the Hadoop cluster
- Partition Pruning: avoids reading Hive partitions based on query predicates
- Predicate Pushdown: intelligent data sources - like Oracle NoSQL Database, HBase, Parquet and ORC files - are able to process predicates and leverage optimized storage performance capabilities
A simple, single-node VM is not an environment for evaluating performance. However, Big Data Lite will allow you to utilize Partition Pruning and Predicate Pushdown - enabling you to better understand how these performance features work. Because Big Data Lite does not include Big Data SQL Cells, you will not be able to see the value from the SmartScan, Storage Index and Bloom Filter features.
Querying Partitioned Hive Tables
In this exercise, we will examine the performance impact of Hive partition pruning. The Hive table movieapp_log_avro is a non-partitioned table defined over Avro data; we queried this table in a previous exercise. A second table - partitioned by month - has been created.
- Launch beeline and review the table definition and its partitions
- In SQL Developer, create a Big Data SQL-enabled table over the Hive partitioned table
- Compare query performance between the non-partitioned and partitioned sources
Predicate Pushdown to Intelligent Data Sources
A delimited text file is not an intelligent data source. The data contained in the source may be interesting - but it doesn't provide capabilities to optimize retrieval of data. Oracle NoSQL Database, Parquet and ORC files are examples of intelligent sources. They provide numerous features that optimize data retrieval. You can review this blog post for details.
This exercise will examine the performance benefit of predicate pushdown into Parquet files - which provide a compressed, efficient columnar store. This example uses the same data as found in the previous example; movieapp_log_month_parquet is a partitioned Hive table where data for each month is stored in Parquet files.
- Launch beeline and review the table definition and its partitions
- In SQL Developer, create a Big Data SQL-enabled table over the Hive partitioned table
- Compare query performance between the Parquet source and the previous example
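For instance, a query with a month predicate against the partitioned table lets Big Data SQL read only the matching partition directories in HDFS. The month value format shown here is an assumption.

```sql
-- Only the matching month partition is read; all other months are pruned.
SELECT COUNT(*)
FROM   movieapp_log_month_parquet
WHERE  month = '2012-07';
```

Because the data is stored in Parquet, the month predicate can additionally be pushed into the file scan itself, so only the needed column chunks and row groups are processed.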
Part 5 - Applying Oracle Database Security Policies Across the Big Data Platform
In most deployments, Oracle Database contains critical and sensitive data that must be protected. A rich set of Oracle Database security features - including strong authentication, row-level access, data redaction, data masking, auditing and more - has been utilized to ensure that data remains safe. These same security policies can be leveraged when using Oracle Big Data SQL. This means that a single set of security policies can be utilized to protect all of your data.
In our example, we need to protect personally identifiable information, including the customer last name and customer id. To accomplish this task, an Oracle Data Redaction policy has already been set up on the CUSTOMER table that obscures these two fields. This was accomplished using the DBMS_REDACT PL/SQL package, shown here:
  BEGIN
    DBMS_REDACT.ADD_POLICY(
      object_schema       => 'MOVIEDEMO',
      object_name         => 'CUSTOMER',
      column_name         => 'CUST_ID',
      policy_name         => 'customer_redaction',
      function_type       => DBMS_REDACT.PARTIAL,
      function_parameters => '9,1,7',
      expression          => '1=1');
  END;
  /

  BEGIN
    DBMS_REDACT.ALTER_POLICY(
      object_schema       => 'MOVIEDEMO',
      object_name         => 'CUSTOMER',
      action              => DBMS_REDACT.ADD_COLUMN,
      column_name         => 'LAST_NAME',
      policy_name         => 'customer_redaction',
      function_type       => DBMS_REDACT.PARTIAL,
      function_parameters => 'VVVVVVVVVVVVVVVVVVVVVVVVV,VVVVVVVVVVVVVVVVVVVVVVVVV,*,3,25',
      expression          => '1=1');
  END;
  /
The first PL/SQL call creates a policy called customer_redaction:
- It is applied to the CUST_ID column of the MOVIEDEMO.CUSTOMER table
- It performs a partial redaction - i.e. it is not necessarily applied to all characters in the field
- It replaces the first 7 characters with the number "9"
- The redaction policy will always apply, since the expression describing when it applies is specified as "1=1"
The second API call updates the customer_redaction policy, redacting a second column in that same table. It will replace characters 3 to 25 in the LAST_NAME column with an '*'. Note: the application of redaction policies does not change the underlying data. Oracle Database performs the redaction at execution time, just before the data is displayed to the application user.
Querying these two columns in the customer table produces the following result:
SELECT cust_id, last_name FROM customer;
Importantly, SQL executed against redacted data remains unchanged. For example, queries can use the cust_id and last_name columns in join conditions, apply filters to them, etc. The fact that the data is redacted is transparent to application code.
In the next section, you apply redaction policies to our tables that have data sourced in Hadoop.
Apply Redaction Policies to Data Stored in Hadoop and Oracle NoSQL Database
Here, you apply an equivalent redaction policy to our Oracle Big Data SQL tables, with the following effects:
- The first procedure redacts data sourced from JSON in HDFS
- The second procedure redacts Avro data sourced from Hive
- The third procedure redacts data sourced from Oracle NoSQL Database
- Each policy redacts the customer id field
Part 6 - Using Oracle Analytic SQL Across All Your Data
Oracle Big Data SQL allows you to utilize Oracle's rich SQL dialect to query all your data, regardless of where that data may reside. We will take a look at a couple of analytic queries that deliver unique insights across our three data sources.
Gaining Insights From All Your Data
This next example will enrich Oracle MoviePlex's understanding of customers by utilizing an RFM analysis. This query will identify:
- Recency : when was the last time the customer accessed the site?
- Frequency : what is the level of activity for that customer on the site?
- Monetary : how much money has the customer spent?
To answer these questions, SQL analytic functions will be applied to data residing in both the application logs on Hadoop and sales data in Oracle Database tables. Customers will be categorized into 5 buckets of increasing importance. For example, an RFM combined score of 551 indicates that the customer is in the highest tier of customers in terms of recent visits (R=5) and activity on the site (F=5), however the customer is in the lowest tier in terms of spend (M=1). Perhaps this is a customer that performs research on the site, but then decides to buy movies elsewhere!
We want to target customers who we may be losing to the competition. Therefore, execute the following query - which finds important customers (high monetary score) that have not visited the site recently (low recency score):
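The shape of an RFM query like this can be sketched with the NTILE(5) analytic function. The table and column names below are illustrative assumptions, not the lab's exact schema.

```sql
-- RFM sketch: bucket customers 1-5 on recency, frequency and monetary
-- value. Activity comes from the Big Data SQL table over the HDFS logs;
-- sales come from an Oracle Database table.
SELECT cust_id,
       NTILE(5) OVER (ORDER BY last_visit)  AS recency,
       NTILE(5) OVER (ORDER BY num_visits)  AS frequency,
       NTILE(5) OVER (ORDER BY total_sales) AS monetary
FROM (
  SELECT a.custid            AS cust_id,
         MAX(a.logtime)      AS last_visit,
         COUNT(*)            AS num_visits,
         SUM(s.sales_amount) AS total_sales
  FROM   movieapp_log_avro a
  JOIN   movie_sales s ON s.cust_id = a.custid
  GROUP  BY a.custid
);
```

Wrapping this in an outer query that filters on, say, monetary = 5 AND recency <= 2 would find the high-value customers who have gone quiet.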
Part 7 - Introduction to SQL Pattern Matching
This section covers the new SQL pattern matching and analytical SQL functionality that is part of Oracle Database 12c. Row pattern matching in native SQL improves application development, developer productivity and query efficiency for row-sequence analysis. This new feature is an important addition to your SQL toolbox.
Recognizing patterns in a sequence of rows has been a capability that was widely desired, but not really possible with SQL until now. There were many workarounds, but these were difficult to write, hard to understand, and inefficient to execute. With Oracle Database 12c you can use the MATCH_RECOGNIZE clause to perform pattern matching in SQL to do the following:
- Logically group and order the data that is used in the MATCH_RECOGNIZE clause using the PARTITION BY and ORDER BY clauses.
- Define business rules/patterns using the PATTERN clause. These patterns use regular-expression syntax, a powerful and expressive feature, applied to the pattern variables.
- Specify the logical conditions required to map a row to a row pattern variable using the DEFINE clause.
- Define output measures, which are expressions within the MEASURES clause.
- Control the output (summary vs. detailed) from the pattern matching process.
The moviedemo schema contains a view called MOVIEAPP_LOG_JSON_V, which returns a formatted version of the JSON click data stream from our web application log file. The view returns the following columns:
Using this click data, we will create a sessionization data set which tracks each session, the duration of the session and the number of clicks/events.
Tasks and Keywords in Pattern Matching
Let us go over some of the tasks and keywords used in pattern matching. Building a pattern matching statement can be broken down into four simple steps:
1. Organize the data
- PARTITION BY: logically divide/partition the rows into groups
- ORDER BY: logically order the rows within a partition
2. Define the business rules
- PATTERN: defines the pattern variables that must be matched, the sequence in which they must be matched, and the number of rows which must be matched
- DEFINE: specifies the conditions that define a pattern variable
- AFTER MATCH SKIP: determines where to restart the matching process after a match is found
3. Define the output measures
- MEASURES: defines row pattern measure columns
- CLASSIFIER(): finds which pattern variable applies to which rows, i.e. identifies which component of a pattern applies to a specific row
4. Control the output
- ONE ROW PER MATCH: returns one summary row of output for each match
- ALL ROWS PER MATCH: returns one detail row of output for each row of each match
Pattern Match Example: Web Log Sessionization Analysis
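A sessionization query along these lines can be sketched as follows. The column names (CUST_ID, CLICK_TIME) and the 10-minute session gap are assumptions for illustration.

```sql
-- One output row per session: MATCH_RECOGNIZE groups each customer's
-- clicks into sessions whenever consecutive clicks are at most
-- 10 minutes (10/1440 of a day) apart.
SELECT cust_id, session_id, start_time, no_of_events, duration
FROM movieapp_log_json_v
MATCH_RECOGNIZE (
  PARTITION BY cust_id          -- one sequence of rows per customer
  ORDER BY click_time           -- in click order
  MEASURES MATCH_NUMBER()                     AS session_id,
           FIRST(click_time)                  AS start_time,
           COUNT(*)                           AS no_of_events,
           LAST(click_time) - FIRST(click_time) AS duration
  ONE ROW PER MATCH             -- summary output: one row per session
  PATTERN (b s*)                -- a session: first click, then near-in-time clicks
  DEFINE s AS (s.click_time - PREV(s.click_time)) <= 10/1440
);
```

The pattern variable b is deliberately left undefined, so it matches any row and anchors the start of each session; s then absorbs every subsequent click that follows within the gap.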
Part 8 - Checking the Pattern Matching Process
In this section, we will use the CLASSIFIER() measure to show which pattern variable is being assigned to each row. This will help us debug our pattern matching process and ensure it is working correctly.
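A debugging sketch of this idea follows, reusing the session pattern from the previous part; as before, the CUST_ID and CLICK_TIME column names are assumptions.

```sql
-- ALL ROWS PER MATCH emits every click, and CLASSIFIER() shows the
-- pattern variable (b or s) assigned to each row - useful for checking
-- that sessions start and extend where we expect.
SELECT cust_id, click_time, session_id, pattern_var
FROM movieapp_log_json_v
MATCH_RECOGNIZE (
  PARTITION BY cust_id
  ORDER BY click_time
  MEASURES MATCH_NUMBER() AS session_id,
           CLASSIFIER()   AS pattern_var
  ALL ROWS PER MATCH
  PATTERN (b s*)
  DEFINE s AS (s.click_time - PREV(s.click_time)) <= 10/1440
);
```

The first row of each session should be classified as B and every continuation row as S; a stray B mid-session would indicate the gap condition is splitting sessions too aggressively.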
Part 9 - Creating a More Useful Data Set
Using the CLASSIFIER() function we have established that our pattern matching process is working correctly. What we need to do now is condense the data set so that we have one row for each session. We can do that by changing the way we output the data. It would also be useful to include some additional business metrics as part of the data set. The following sections explain how to do this.
Creating a Summary Report
Adding Business Value
Part 10 - Other Useful 12c Analytical SQL Features
To make the following code easier to read we have wrapped the MATCH_RECOGNIZE clause that we created above within a view called movieapp_analytics_v. Of course you could incorporate the following additional features into the MATCH_RECOGNIZE statements above.
How Many Distinct Customers?
A useful metric within click data is to count the number of distinct customers using our site each month. Normally we would just use the COUNT(DISTINCT expr) function. However, this can require a lot of resources to search through a large dataset and return the exact number of distinct values.
With the release of Oracle Database 12c, Oracle provides a much faster way to do this type of analysis. Using APPROX_COUNT_DISTINCT, it is possible to get a reasonably accurate estimate of the number of distinct values within a column. This new function can process large amounts of data significantly faster than the COUNT(DISTINCT expr) function, with negligible deviation from the exact result. For more information about this new feature, please refer to the SQL Reference Guide.
Let's calculate the number of unique sessions per month using both functions:
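A side-by-side comparison might look like the sketch below; the session_month and cust_id columns are assumptions about what the view exposes.

```sql
-- Compare the exact and approximate distinct counts per month.
SELECT session_month,
       COUNT(DISTINCT cust_id)        AS exact_count,
       APPROX_COUNT_DISTINCT(cust_id) AS approx_count
FROM   movieapp_analytics_v
GROUP  BY session_month
ORDER  BY session_month;
```

On a large data set the approximate version typically finishes far sooner, since it maintains a small fixed-size sketch instead of tracking every distinct value.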
Finding the Top 1% of Customers
Finding the Bottom 1% of Customers
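Both of these reports can be sketched with the 12c row-limiting clause; the sales_amount column is an assumption about the view.

```sql
-- Top 1% of customers by total spend, using FETCH FIRST ... PERCENT.
SELECT cust_id, SUM(sales_amount) AS total_sales
FROM   movieapp_analytics_v
GROUP  BY cust_id
ORDER  BY total_sales DESC
FETCH FIRST 1 PERCENT ROWS ONLY;

-- Bottom 1%: the same query ordered ascending, i.e.
-- ORDER BY total_sales ASC FETCH FIRST 1 PERCENT ROWS ONLY;
```

Before 12c this required wrapping the query in an inline view with ROWNUM or an analytic ranking function; the FETCH clause expresses the same intent directly.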
In this tutorial, you have learned how to:
- Configure Oracle Big Data SQL on Oracle Exadata
- Create Oracle external tables that access data in HDFS, Oracle NoSQL Database and Hive
- Apply Oracle security policies across data in both Oracle Database and Hadoop
- Use Oracle's rich SQL dialect to analyze data across the data platform
- Use Oracle's new rich SQL pattern matching features to analyze web log data
- Use the MATCH_RECOGNIZE clause to perform pattern matching in SQL
- Use the main keywords that are used in pattern matching
- Create sessionization analysis from web log file data
- Quickly change the business rules for an existing query
- Add calculated measures to increase business value
- Speed up processing by using the new APPROX_COUNT_DISTINCT function to quickly calculate the number of distinct values
- Quickly find the top/bottom values using the new FETCH feature
For detailed information on pattern matching, see the Pattern Matching chapter in the Oracle Database 12c Data Warehousing Guide reference guide. Among other things, this chapter contains the following detailed examples:
- Stock Market Examples: based on common tasks involving share prices and patterns.
- Security Log Analysis Examples: deals with a computer system that issues error messages and authentication checks, and stores the events in a system file.
- Sessionization Examples: analysis of user activity, typically involving multiple events in a single session. Pattern matching makes it easy to express queries for sessionization.