Before You Begin
In this 20-minute tutorial, you learn how to create and run a new Oracle Big Data Manager pipeline that contains Data Copy and Data Extract jobs. You also import a note into Oracle Big Data Manager that displays the copied and extracted data.
Background
You can use pipelines in Oracle Big Data Manager to easily chain Data Copy and Data Extract jobs. One job can automatically trigger another job without the need for manual intervention. The Analytic Pipelines feature is built on top of the Oozie workflow scheduler.
What Do You Need?
- Access to an instance of Oracle Big Data Cloud Service and the required login credentials.
- Access to Oracle Big Data Manager on an Oracle Big Data Cloud Service instance. A port must be opened to permit access to Oracle Big Data Manager, as described in Enabling Oracle Big Data Manager.
- The required sign in credentials for Oracle Big Data Manager.
- Read/Write privileges to the HDFS home directory that is associated with your Oracle Big Data Manager username. For example, if you logged in to Oracle Big Data Manager with username john, and your HDFS home directory is /user/john, then you must have Read/Write privileges to the /user/john HDFS directory. In this tutorial, we use the demo user, which has Read/Write privileges to its /user/demo HDFS home directory. (A quick scripted check is sketched after this list.)
- Read/Write privileges to the /tmp HDFS directory.
- Basic familiarity with HDFS, Spark, database concepts and SQL, and optionally, Apache Zeppelin.
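If you want to confirm these privileges before you begin, the following minimal sketch (not part of the original tutorial) probes your HDFS home directory. It assumes you can run commands on a cluster node where the hdfs CLI is available; the demo username and the probe file name are placeholders.

```python
# Minimal HDFS privilege probe, assuming shell access to a cluster node
# with the `hdfs` CLI on the PATH. Substitute your own username.
import subprocess

user = "demo"                           # placeholder: your Big Data Manager username
probe = f"/user/{user}/.bdm_probe"      # hypothetical scratch file

# -touchz creates an empty file; success implies write access to the directory
subprocess.run(["hdfs", "dfs", "-touchz", probe], check=True)
# -ls reads the directory listing; success implies read access
subprocess.run(["hdfs", "dfs", "-ls", f"/user/{user}"], check=True)
# remove the scratch file again
subprocess.run(["hdfs", "dfs", "-rm", probe], check=True)
print(f"Read/Write access to /user/{user} looks OK")
```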
Access the Oracle Big Data Manager Console
- Sign in to Oracle Cloud and open your Oracle Big Data Cloud Service console.
(Screen capture: bdcs-console.png)
- In the row for the cluster, click Manage this service, and then click Oracle Big Data Manager console from the context menu to display the Oracle Big Data Manager Home page.
Upload Local Files to HDFS
- Right-click the taxidropoff_11.csv file, select Save link as from the context menu, and then save it to your local machine.
- Right-click the taxidropoff_files_1_10.zip file, select Save link as from the context menu, and then save it to your local machine.
- On the Big Data Manager page, click the Data tab.
- In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /tmp HDFS directory, and then click File upload on the toolbar.
(Screen capture: file-upload.png)
- In the Files upload dialog box, click Choose files to upload. In the Open dialog box, navigate to your local directory that contains the taxidropoff_11.csv and taxidropoff_files_1_10.zip files. Hold down the Ctrl key, and then select the two files. The two files are displayed in the Name column. Click Upload.
- When the two files are uploaded successfully to the /tmp HDFS directory, the Upload has finished message is displayed in the Details section of the dialog box. Click Close to close the dialog box.
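If you prefer to script the upload instead of using the console, the same transfer can be done over WebHDFS. The sketch below is an illustration only, not part of the tutorial: it assumes WebHDFS is enabled and reachable, and the endpoint, port, and username are placeholders.

```python
# Illustrative WebHDFS upload of the two tutorial files to /tmp.
# NAMENODE and USER are placeholders; adjust them for your cluster.
import requests

NAMENODE = "http://namenode.example.com:50070"   # hypothetical WebHDFS endpoint
USER = "demo"                                    # placeholder username

def webhdfs_put(local_path, hdfs_path):
    # Step 1: ask the NameNode where to write; it answers with a 307 redirect.
    url = f"{NAMENODE}/webhdfs/v1{hdfs_path}?op=CREATE&user.name={USER}&overwrite=true"
    r = requests.put(url, allow_redirects=False)
    r.raise_for_status()
    # Step 2: stream the file body to the DataNode URL from the redirect.
    with open(local_path, "rb") as f:
        r2 = requests.put(r.headers["Location"], data=f)
    r2.raise_for_status()   # 201 Created on success

for name in ("taxidropoff_11.csv", "taxidropoff_files_1_10.zip"):
    webhdfs_put(name, f"/tmp/{name}")
```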
Create and Run a Data Copy Job
- On the Big Data Manager page, click the Jobs tab to display the Jobs page. Click Create new job, and then select Data Copy from the context menu. The Create new job dialog box is displayed.
- Accept the default and unique Data Copy job name in the Job name field.
- In the Source(s) section of the Source and destination tab, click Select file or directory. The Select file or directory dialog box is displayed. Select HDFS storage (hdfs) from the Location drop-down list. Navigate to the /tmp directory, click the taxidropoff_11.csv file, and then click Select.
- In the Destination section of the Source and destination tab, click Select file or directory. The Select file or directory dialog box is displayed. Select HDFS storage (hdfs) from the Location drop-down list. Click Open home directory on the toolbar to navigate to your HDFS home directory automatically. For example, if you logged in to Oracle Big Data Manager with username john, and your HDFS directory is /user/john, then your destination directory should be /user/john. Again, in this tutorial, we use the demo user with the /user/demo HDFS home directory. Click Select.
(Screen capture: create-new-job.png)
Important: Your job name and destination HDFS directory will be different from what is shown in the preceding screen capture. Substitute the /user/demo path shown in the screen capture with the actual HDFS home directory path associated with your username.
- Click Create. The Jobs page is refreshed, and the new Data Copy job is displayed in the list of available jobs.
(Screen capture: copy-job-template.png)
- In the row for the new Data Copy job, click Manage this job, and then click Run now from the context menu. The Execute Data copy job # your-job-number dialog box is displayed. Click Create. If the Data Copy job executes successfully, the Last execution field shows Succeeded. This indicates that the source file was copied successfully to your designated HDFS directory.
(Screen capture: job-details.png)
Note: You can click the job name link to display the job details while the job is executing or after the execution is completed.
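To confirm the copy outside the console, you can ask WebHDFS for the file's status. As with the earlier sketch, the endpoint and username are placeholders, and this check is an optional addition rather than part of the tutorial.

```python
# Confirm that the copied file landed in the destination directory.
# NAMENODE and USER are the same placeholders as in the upload sketch.
import requests

NAMENODE = "http://namenode.example.com:50070"
USER = "demo"

url = (f"{NAMENODE}/webhdfs/v1/user/{USER}/taxidropoff_11.csv"
       f"?op=GETFILESTATUS&user.name={USER}")
r = requests.get(url)
r.raise_for_status()                      # a 404 would mean the copy did not land
status = r.json()["FileStatus"]
print(status["type"], status["length"])   # expect FILE and a nonzero length
```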
Create and Run a New Pipeline
- On the Oracle Big Data Manager page, click the Pipelines tab to display the Pipelines page.
- In the Create new empty pipeline section, click Create. The New pipeline page is displayed. Click Hide help to hide the tooltips, or click anywhere on the page. Accept the default and unique pipeline name.
- Click the Existing job tab. Click and drag the Data Copy job that you created in the previous section from the list of available jobs onto the Add new job node at the end of the pipeline in the Pipeline Editor. The Data Copy job is added to the pipeline.
(Screen capture: add-copy-job.png)
- Create a new Data Extract job to copy and extract the contents of the taxidropoff_files_1_10.zip file into your designated HDFS directory. Click the Add new job node at the end of the pipeline to add a new Data Extract job. Click Data Extract from the Select item context menu. The Edit Data Extract properties dialog box is displayed. Accept the default and unique Data Extract job name in the Job name field.
- In the Source(s) section of the Source and destination tab, click Select file or directory. The Select file or directory dialog box is displayed. Select HDFS storage (hdfs) from the Location drop-down list, if it is not already selected. Navigate to the /tmp directory, click the taxidropoff_files_1_10.zip file, and then click Select.
- In the Destination section of the Source and destination tab, click Select file or directory. The Select file or directory dialog box is displayed. Select HDFS storage (hdfs) from the Location drop-down list. Click Open home directory on the toolbar to navigate to your HDFS home directory automatically, and then click Select. For example, if you logged in to Oracle Big Data Manager with username john, and your HDFS directory is /user/john, then your destination directory should be /user/john. Again, in this tutorial, we use the demo user with the /user/demo HDFS home directory. Click Update.
(Screen capture: new-extract-job.png)
Important: Your Data Extract job name and destination HDFS directory will be different from what is shown in the preceding screen capture. Substitute the /user/demo path shown in the screen capture with the actual HDFS home directory path associated with your username.
- Click Pipeline properties. The Pipeline properties inspector is displayed. In the View options section, change the graph orientation to Horizontal, and then click Save to hide the Pipeline properties inspector.
(Screen capture: completed-pipeline.png)
- Click Save and run pipeline on the toolbar. The Enable pipeline? dialog box is displayed. Click Enable, save, and run pipeline. If the pipeline executes successfully, Succeeded is displayed to the left of the Execution # 1 link. This indicates that the Data Extract job in the pipeline copied the taxidropoff_files_1_10.zip file and extracted its contents into your designated destination HDFS directory. The zip file contains ten files, taxidropoff_1.csv through taxidropoff_10.csv. The taxidropoff_files_1_10.zip file itself is not saved in this HDFS directory.
- You can view the pipeline execution page either while the pipeline is executing or after the execution is completed. On the Oracle Big Data Manager page, click the Pipelines tab to display the Pipelines page. In the list of available pipelines, click your new pipeline. The Pipeline details page is displayed. It contains the Pipeline overview and Execution history sections. You can drill down on the Pipeline overview section to display the pipeline flow.
(Screen capture: pipeline-details-page.png)
- In the Execution history section, click Execution #1. The Pipeline execution page is displayed. This page contains the Summary, Status, and Executed jobs sections.
(Screen capture: pipeline-execution.png)
Note: You can click any executed job name to view the job's details and output.
- In the Executed jobs section, click Graph in the View as field to display the pipeline in a graph format instead of a list. If the pipeline executed successfully, a green check mark badge is displayed at the bottom right of the Data Copy and Data Extract jobs.
(Screen capture: pipeline-execut.png)
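To double-check the extraction outside the console, you can list your home directory over WebHDFS (same placeholder endpoint and username as in the earlier sketches) and look for the extracted CSV files:

```python
# List the destination directory and pick out the extracted CSV files.
# NAMENODE and USER are illustrative placeholders, as before.
import requests

NAMENODE = "http://namenode.example.com:50070"
USER = "demo"

url = f"{NAMENODE}/webhdfs/v1/user/{USER}?op=LISTSTATUS&user.name={USER}"
r = requests.get(url)
r.raise_for_status()
names = [f["pathSuffix"] for f in r.json()["FileStatuses"]["FileStatus"]]
csvs = sorted(n for n in names if n.startswith("taxidropoff_") and n.endswith(".csv"))
# Expect taxidropoff_1.csv through taxidropoff_10.csv from the Data Extract job,
# plus taxidropoff_11.csv from the earlier Data Copy job.
print(csvs)
assert "taxidropoff_files_1_10.zip" not in names   # the zip itself is not kept
```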
View the Copied and Extracted Data in a Zeppelin Note
In this section, you restart the Spark interpreter and import a note into Oracle Big Data Manager Notebook. This note verifies that the data was copied and extracted to your HDFS home directory successfully.
- On the Oracle Big Data Manager page, click the Notebook tab.
- View and restart the Spark interpreter. Click the Menu drop-down list, and then select Interpreter. The Interpreters page is displayed. Scroll down to the spark interpreter section. Make sure that the ZEPPELIN_IMPERSONATE_SPARK_PROXY_USER property is set to true. If it's not, click edit, enter true in the value field, and then click Save at the bottom of the Properties section.
- Click restart. A Do you want to restart this interpreter message box is displayed. Click OK.
- Right-click the copy_extract_data_from_pipeline_to_hdfs.json file, select Save link as from the context menu, and then save it to your local machine.
- On the Notebook tab banner, click Home. In the Notebook section, click Import note. The Import new note dialog box is displayed.
- In the Import AS field, enter Data Copy and Extract Jobs to HDFS Pipeline Template. Click the Choose a JSON here icon. In the Open dialog box, navigate to your local directory that contains the copy_extract_data_from_pipeline_to_hdfs.json file, and then select the file. The note is imported and displayed in the list of available notes in the Notebook.
- Click the Data Copy and Extract Jobs to HDFS Pipeline Template note to view it. The initial status of each paragraph in the note is READY, which indicates that the paragraph has not been executed yet.
- In the Load data from HDFS and count number of lines paragraph, replace the /user/demo/ path in the load command with your actual HDFS home directory path. For example, if you logged in to Oracle Big Data Manager with username john, edit the path in the load command to reflect the actual path of your HDFS home directory as follows: .load("hdfs:/user/john/*.csv"). A rough PySpark equivalent of this step is sketched after this list.
- Click Run all paragraphs on the Note's toolbar to run all paragraphs in this note. A Run all paragraphs confirmation message is displayed. Click OK. When a paragraph executes successfully, its status changes to FINISHED. The note loads the data from your HDFS home directory, counts the number of lines, analyzes the data, and then displays the data in tabular and bar chart formats.
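For reference, the note's load-and-count step boils down to something like the following PySpark sketch. This is a paraphrase rather than the note's exact code: the reader options are assumptions, and /user/john is the placeholder home directory from the example above.

```python
# Rough PySpark equivalent of the note's "Load data from HDFS and
# count number of lines" paragraph. The options are assumptions; the
# actual note may configure the CSV reader differently.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("taxidropoff").getOrCreate()

df = (spark.read
      .format("csv")
      .option("inferSchema", "true")     # assumed: let Spark guess column types
      .load("hdfs:/user/john/*.csv"))    # substitute your HDFS home directory

print(df.count())                        # total number of rows across the CSV files
df.show(5)                               # peek at the first few rows
```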
Working with Oracle Big Data Manager Analytic Pipelines