Running a Batch Spark Job in a Big Data Cloud Cluster

Before You Begin

This 15-minute tutorial shows you how to run a simple batch Spark job in a Big Data Cloud cluster.

Background

This tutorial analyzes the New York City (NYC) Taxi & Limousine Commission Trip Record Data. Two jobs are created for this task:

  1. TripParserJob: This job reads the NYC Taxi logs stored in the Oracle Storage Cloud container and stores them as a comma-separated values (CSV) file in the Hadoop Distributed File System (HDFS).
  2. TripProcessorJob: This job reads the output generated by the TripParserJob and computes the average fare for each hour of the day. The result is a set of key-value pairs, where the key is an hour of the day and the value is the average fare paid by customers during that hour across the entire time period covered by the input file. The output is stored as a text file in the Oracle Storage Cloud container associated with the BDC cluster. A minimal sketch of this computation follows this list.
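
The job implementations ship in the tutorial JAR, so you do not write this code yourself. For orientation only, here is a minimal Scala sketch of the averaging step that TripProcessorJob performs. The column positions for the pickup timestamp and the fare amount, and the timestamp format, are assumptions made for illustration; the real parsed layout may differ.

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch of the averaging step only; not the shipped TripProcessorJob.
  object TripProcessorSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("TripProcessorSketch"))

      // Assumed layout: column 1 holds the pickup timestamp ("yyyy-MM-dd HH:mm:ss")
      // and column 10 holds the fare amount.
      val trips = sc.textFile("hdfs:///user/oracle/data/parsedTrip")
        .map(_.split(","))
        .map(cols => (cols(1).substring(11, 13), cols(10).toDouble)) // (hour, fare)

      // Average fare per hour: accumulate (sum, count) per hour, then divide.
      val avgByHour = trips
        .mapValues(fare => (fare, 1L))
        .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
        .mapValues { case (sum, count) => sum / count }

      // Persist as "hour<TAB>average" text in the cluster's storage container.
      avgByHour.map { case (hour, avg) => s"$hour\t$avg" }
        .saveAsTextFile("swift://storageContainerName.main/processedJob")

      sc.stop()
    }
  }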

What Do You Need?

  • A running BDC cluster.
  • BDC account credentials or the Big Data Cloud Console direct URL (for example: https://xxx.xxx.xxx.xxx:1080/).
  • BDC cluster login credentials.
  • Oracle Storage Cloud credentials, tenant name, and container name.
  • The smallTrip.csv file uploaded to the Oracle Storage Cloud container that is linked to the BDC cluster.
  • For instructions on how to upload or create objects in Oracle Storage Cloud Service, see Creating a Single Object.

Navigate to the Big Data Cloud Console Jobs Page

  1. Log in to your BDC account.
    Note: If you have the direct URL for the Big Data Cloud Console, you can navigate to it directly and continue from step 3.
  2. In the Services page, click the Manage this Service icon for the cluster where you want to create the job, and then click Big Data Cloud Console.
    Services page - Context menu of a service
  3. A window titled Authentication Required appears. Enter your BDC cluster user name and password, and click OK.
    Authentication page
  4. In the Big Data Cloud Console, click Jobs.
    Big Data Cloud - Compute Edition Console

Create the TripParserJob

  1. In the Big Data Cloud Console Jobs page, click New Job.
    Big Data Cloud - Compute Edition Console New Job button
  2. Enter a Name and Description for your TripParserJob, and click Next.
    New Job - Details page
  3. Provide your configuration parameters for executing the job and click Next. In this example, the following parameters are used:
    • Driver Cores: 2
    • Driver Memory: 2 GB
    • Executor Cores: 2
    • Executor Memory: 3 GB
    • No. of Executors: 2
    • Queue: api
    New Job - Configuration page
  4. Provide your driver file information, such as File Path, Main Class, Arguments, Additional Jars, and Additional Support Files, and click Next. In this example, the following information is entered:
    • File Path: hdfs:///spark/examples/perf-jobs-apache-openstack-1.1.0-20160628.173357-1.jar
    • Main Class: com.oracle.spoccs.jobs.TripParserJob
    • Arguments:
      inDS=swift://storageContainerName.main/smallTrip.csv
      outDS=hdfs:///user/oracle/data/parsedTrip
      fs.swift.SERVICE_NAME=main
      fs.swift.CONTAINER_NAME=storageContainerName
      fs.swift.service.main.auth.url=https://identityDomainName.storage.oraclecloud.com/auth/v2.0/tokens
      fs.swift.service.main.tenant=Storage-tenantName
      fs.swift.service.main.username=Storageadmin
      fs.swift.service.main.password=storagePassword
      fs.swift.service.main.public=true
      fs.swift.service.http.location-aware=false
      Change the storageContainerName, identityDomainName, tenantName, Storageadmin, and storagePassword values in the arguments to match your configuration. A sketch of how a driver might consume these arguments appears at the end of this section.
      New Job - Driver File page
  5. In the confirmation page, confirm your responses and click Create.
    New Job - Confirmation page
  6. After the job completes successfully, create the TripProcessorJob.
    Spark Jobs page - perf-job-demo-Job1 status
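
The console passes each line in the Arguments field to the driver's main method as a separate string. The shipped JAR's parsing is not documented here, but a plausible minimal sketch is to split each argument at the first = sign, copy every fs.swift.* property into the Hadoop configuration so that the hadoop-openstack Swift connector can authenticate against Oracle Storage Cloud, and treat inDS and outDS as the input and output paths:

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch only: one plausible way to honor the key=value job arguments.
  object TripParserSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("TripParserSketch"))

      // Split each "key=value" argument at the first '='.
      val kv = args.map(_.split("=", 2)).collect { case Array(k, v) => k -> v }.toMap

      // Hand the fs.swift.* properties to Hadoop so swift:// URIs resolve.
      kv.filterKeys(_.startsWith("fs.swift."))
        .foreach { case (k, v) => sc.hadoopConfiguration.set(k, v) }

      // Copy the raw trip log from Swift to HDFS; the real TripParserJob
      // would also parse and clean the records at this point.
      sc.textFile(kv("inDS")).saveAsTextFile(kv("outDS"))

      sc.stop()
    }
  }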

Create the TripProcessorJob

  1. In the Big Data Cloud Console Jobs page, click New Job.
    Big Data Cloud - Compute Edition Console New Job button
  2. Enter a Name and Description for your TripProcessorJob, and click Next.
    New Job - Details page
  3. Provide your configuration parameters for executing the job and click Next. In this example, the following parameters are used (a sketch of the equivalent Spark properties appears at the end of this section):
    • Driver Cores: 2
    • Driver Memory: 2 GB
    • Executor Cores: 2
    • Executor Memory: 3 GB
    • No. of Executors: 2
    • Queue: api
    New Job - Configuration page
  4. Provide your driver file information, such as File Path, Main Class, Arguments, Additional Jars, and Additional Support Files, and click Next. In this example, the following information is entered:
    • File Path: hdfs:///spark/examples/perf-jobs-apache-openstack-1.1.0-20160628.173357-1.jar
    • Main Class: com.oracle.spoccs.jobs.TripProcessorJob
    • Arguments:
      inDS=hdfs:///user/oracle/data/parsedTrip
      outDS=swift://storageContainerName.main/processedJob
      fs.swift.SERVICE_NAME=main
      fs.swift.CONTAINER_NAME=storageContainerName
      fs.swift.service.main.auth.url=https://identityDomainName.storage.oraclecloud.com/auth/v2.0/tokens
      fs.swift.service.main.tenant=Storage-tenantName
      fs.swift.service.main.username=Storageadmin
      fs.swift.service.main.password=storagePassword
      fs.swift.service.main.public=true
      fs.swift.service.http.location-aware=false
    Change the storageContainerName, identityDomainName, tenantName, Storageadmin, and storagePassword values in the arguments to match your configuration.
    New Job - Driver File page
    Note: The output that was generated by the TripParserJob is used as the input here.
  5. In the confirmation page, confirm your responses and click Create.
    New Job - Confirmation page
  6. After the job completes successfully, proceed to viewing the output.
    Spark Jobs page - perf-job-demo-Job2 status
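
The configuration fields in step 3 correspond to standard Spark resource properties that the console presumably sets on your behalf. For reference only, here is a minimal sketch of the equivalent programmatic settings; the property names are standard Spark/YARN settings, but the mapping to the console fields is an assumption:

  import org.apache.spark.SparkConf

  // Equivalent Spark properties for the console configuration fields.
  val conf = new SparkConf()
    .set("spark.driver.cores", "2")        // Driver Cores
    .set("spark.driver.memory", "2g")      // Driver Memory
    .set("spark.executor.cores", "2")      // Executor Cores
    .set("spark.executor.memory", "3g")    // Executor Memory
    .set("spark.executor.instances", "2")  // No. of Executors
    .set("spark.yarn.queue", "api")        // Queue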

View the Output

  1. In the Big Data Cloud Console, click Data Stores.
    Big Data Cloud - Compute Edition Console tabs
  2. Click Cloud Storage, because the final output of the TripProcessorJob was stored in the Oracle Storage Cloud container.
    Data Stores page - HDFS and Cloud Storage tab
  3. Enter the outDS value from the TripProcessorJob arguments (in this case, processedJob) in the Filter by Prefix field and press Enter.
    Cloud Storage page - Filtering by Prefix
    Notice that the output files are created in the Oracle Storage Cloud container. To inspect their contents from a spark-shell session instead, see the sketch below.
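
If you also want to read the result back from a spark-shell session on the cluster, a minimal sketch is to reapply the same fs.swift.* properties used in the job arguments and load the output path (placeholder values shown; replace them as before):

  // Reapply the Swift properties from the job arguments, then read the output.
  sc.hadoopConfiguration.set("fs.swift.service.main.auth.url",
    "https://identityDomainName.storage.oraclecloud.com/auth/v2.0/tokens")
  sc.hadoopConfiguration.set("fs.swift.service.main.tenant", "Storage-tenantName")
  sc.hadoopConfiguration.set("fs.swift.service.main.username", "Storageadmin")
  sc.hadoopConfiguration.set("fs.swift.service.main.password", "storagePassword")
  sc.hadoopConfiguration.set("fs.swift.service.main.public", "true")

  // Each line should pair an hour of the day with its average fare.
  sc.textFile("swift://storageContainerName.main/processedJob").take(24).foreach(println)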

Want to Learn More?