Oracle by Example: Copying Data from an HTTP(S) Server with Oracle Big Data Manager

Before You Begin

In this 15-minute tutorial, you learn how to use Oracle Big Data Manager to copy data files hosted on a Hypertext Transfer Protocol Secure (HTTPS) server to the Hadoop Distributed File System (HDFS) on your cluster.

Background

This is the first tutorial in the Working with Oracle Big Data Manager series. Complete the tutorials in this series sequentially.

What Do You Need?

  • Access to an HTTP server such as Apache HTTP Server, Oracle HTTP Server (OHS), or Oracle WebLogic Server.
  • Access to an instance of Oracle Big Data Cloud Service and the required login credentials.
  • Access to Oracle Big Data Manager on a non-secure Oracle Big Data Cloud Service instance. A port must be opened to permit access to Oracle Big Data Manager, as described in Enabling Oracle Big Data Manager.
  • The required sign-in credentials for Oracle Big Data Manager.
  • Read/Write privileges to the /user/demo HDFS directory.
  • Basic familiarity with HDFS, Spark, and optionally, Apache Zeppelin.

Section 1: Access the Oracle Big Data Manager Console

  1. Sign in to Oracle Cloud and open your Oracle Big Data Cloud Service console.
    Description of the illustration bdcs-console.png
  2. In the row for the cluster, click Manage this service, and then click Oracle Big Data Manager console from the context menu to display the Oracle Big Data Manager Home page.
    Description of the illustration select-bdm-console.png

Section 2: Configure and Run Your HTTP(S) Server

In this section, you copy two files to a new HTTP(S) web server directory on your cluster, and then you configure and run the HTTP(S) server that will host the data files.

  1. Copy the taxidropoff_files.zip file to a new directory named taxi_telemetry on the machine where you will run the HTTP(S) server. This file contains 12 .csv data files that were created from several datasets on the NYC Taxi & Limousine Commission website. The taxi_telemetry directory will be accessible through web browsers, HTTP(S), and Oracle Big Data Manager.
  2. Extract the taxidropoff_files.zip file into the taxi_telemetry directory.
  3. Make sure that you have the following files in your taxi_telemetry HTTP(S) web server directory.
    Description of the illustration files-on-cluster.png
  4. Right-click the list_of_files.txt file, select Save link as from the context menu, and then save it in the taxi_telemetry directory on the machine where you will run the HTTP(S) server. Edit the list_of_files.txt file: replace each occurrence of your_host_name with your host name and each occurrence of your_port_number with your port number, and then save the file.
  5. Description of the illustration list-of-files.png

    Note: Your taxi_telemetry HTTP(S) web server directory should contain the list_of_files.txt file and the 12 .csv data files.

  6. In this tutorial, we use the SimpleHTTPServer module that ships with Python 2 (in Python 3, the equivalent module is http.server). You can use this built-in HTTP server (or your own HTTP(S) server) to turn any directory on your cluster into a web server directory. In your terminal window, cd into the taxi_telemetry directory.
    $ cd taxi_telemetry
  7. You can use any available port with your HTTP(S) server. The server listens on this port for HTTP(S) requests for the data hosted in your web server directory. This tutorial uses port 17777. To make sure that your port is available, enter the following command:
    $ netstat -tlnp 2>/dev/null | grep [port#]

    Note: In the preceding command, substitute [port#] with your port number.

  8. Start the HTTP(S) server. For example, to start the server on port 17777, enter the following command at the $ prompt:
    $ python -m SimpleHTTPServer 17777
    Description of the illustration start-http-server.png

    Note: In the preceding command, substitute port 17777 with your port number.

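The netstat check in step 7 can also be done portably from Python, which is convenient on hosts where netstat is not installed. A minimal sketch (the host and the example port 17777 follow this tutorial):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if no server is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(2)
        # connect_ex returns 0 only when something is listening on the port.
        return s.connect_ex((host, port)) != 0

# Example: check the tutorial's port before starting the HTTP(S) server.
# port_is_free(17777)
```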

Section 3: Copy a Data File from an HTTP(S) Server to HDFS

In this section, you copy the taxidropoff_1.csv data file from the taxi_telemetry directory to HDFS using the Copy here from HTTP(S) feature in Data explorer.

  1. On the Oracle Big Data Manager page, click the Data tab.
    Description of the illustration data-tab.png
  2. In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /user/demo directory, and then click Copy here from HTTP(S) on the toolbar.
    Description of the illustration copy-from-http.png

    The New copy data job dialog box is displayed. It has the Sources and Destination sections and the General and Advanced tabs.

  3. In the Sources section, accept the default Direct link in the Source type drop-down list, and select HTTP(S) from the Source location drop-down list. In the Enter a valid HTTP(S) URI field, enter a URI in the following format, substituting your_host_name with your host name and your_port_number with the port number that you are using with your HTTP(S) server:
    http://your_host_name:your_port_number/taxidropoff_1.csv
    Description of the illustration sources-section.png
  4. Note: You can click Add source to add more source locations to this copy data job. In addition to HTTP(S), you can specify other storage source locations, such as HDFS storage (hdfs) and Oracle Object Storage Classic (bdcs). In this example, we include only the HTTP(S) source.

    Description of the illustration add-source-location.png
  5. In the Destination section, make sure that the /user/demo HDFS destination directory is displayed. To make any changes to the destination directory, click Edit destination.
    Description of the illustration destination-section.png
  6. In the General tab, accept the defaults for all the fields, and then click Create.
    Description of the illustration general-tab.png
  7. A Data Copy Job Created window is displayed for the requested data transfer (copy). The window displays job details such as the job number, progress, start date and time, and duration. To display additional details about the data copy job, click View more details. When the file is copied successfully to HDFS, a Succeeded window is displayed.
  8. In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /user/demo directory, and then click Refresh on the toolbar. The taxidropoff_1.csv file is now displayed in the /user/demo HDFS directory.

    Description of the illustration file-in-hdfs.png

Section 4: Copy Multiple Data Files from an HTTP(S) Server to HDFS

In this section, you copy seven .csv data files from the taxi_telemetry directory to HDFS using the Link to list of files option in the New copy data job dialog box. You will use the list_of_files.txt file, which contains the URLs for the seven .csv data files.

  1. On the Oracle Big Data Manager page, click the Data tab.
  2. In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /user/demo directory, and then click Copy here from HTTP(S) on the toolbar. The New copy data job dialog box is displayed.
  3. In the Sources section, select Link to list of files from the Source type drop-down list. Select HTTP(S) from the Source location drop-down list. In the Enter a valid HTTP(S) URI field, enter a URI that references the list_of_files.txt file, using the following format:
    http://your_host_name:your_port_number/list_of_files.txt

    Note: In the preceding URI, substitute your_host_name with your host name and your_port_number with the port number that you are using with your HTTP(S) server.

    Description of the illustration sources-section-2.png

    To display the contents of the list_of_files.txt file in the taxi_telemetry directory, enter the following command:

    $ cat list_of_files.txt
  4. In the Destination section, make sure that the /user/demo HDFS destination directory is displayed.
  5. In the General tab, accept the defaults for all the fields, and then click Create.
  6. A Data Copy Job Created window is displayed for the requested data transfer. When the files are copied successfully to HDFS, a Succeeded window is displayed. Click Close to close the window.
  7. The copied .csv files are now displayed in the /user/demo HDFS directory. If the files are not displayed, click Refresh on the toolbar.

    Description of the illustration files-in-hdfs.png

Next Tutorial

Analyzing Data with Oracle Big Data Manager Notebook