Retrieving, Transforming, and Consolidating Web Data with Oracle Database

by Yuli Vasiliev

Learn the Oracle Database features that enable obtaining, transforming, and merging diverse data from heterogeneous sources.


Published November 2012

It has become the norm that enterprise-class databases can store different types of data natively. Thus, for example, you might have CLOBs, BLOBs, and entire XML or HTML documents stored in your database, maybe even within the same table. However, the ability of your database to provide native storage for non-relational data does not automatically mean it offers a toolset to efficiently operate on that content. So, you still might need to employ external tools when it comes to manipulating data stored natively in your database.

OTN is all about helping you become familiar enough with Oracle technologies to make an informed decision. Articles, software downloads, documentation, and more. Join up and get the technical resources you need to do your job.

Though, if you're an Oracle Database user, you most likely won't need to look for external tools when you have to handle the content stored in your database natively. Oracle Database provides an unprecedented wealth of tools for dealing with non-relational data. Even if you need to operate on data that you don't have stored in your database, you still will be able to obtain necessary information from that data using solely Oracle Database tools, without having to retrieve and store the data first.

Suppose you need to query an XML document stored either in an Oracle database or stored externally. You could take advantage of XQuery, a tool integrated with Oracle Database, which provides the ability to work with XML data from within a SQL environment. And making XQuery even more powerful is the fact that it can be used to query existing relational data stored in an Oracle database, allowing you to combine the data from your Oracle database and external sources into one query result. XQuery query results can be then further processed in a parent SQL query or PL/SQL block, allowing greater flexibility for data transformation and consolidation.

It's important to realize that a non-Oracle Database data source you need to access is not always a source of non-relational data. Thus, you might have to access data stored in another relational database system, such as SQL Server, Sybase, or DB2. In such cases, the Oracle Warehouse Builder feature of Oracle Database comes to the rescue with its ability to connect to heterogeneous data sources and then extract, transform, and integrate data from those sources.

This article gives a quick overview of some Oracle Database features that you can employ when you need to query, transform, and consolidate data from various sources, providing several examples that illustrate how to take advantage of those features.

Exploring and Merging Diverse Data with XQuery

The XQuery language provides an efficient means of querying and transforming any data that is expressible in XML format. With XQuery, you can bring together pieces of information from disparate data sources, accessing both database data and non-database data within a single query. Thus, you can easily join data from multiple sources, including relational data sources, XML and HTML files, office documents, or simply plain-text files.

Oracle XML DB, which is part of Oracle Database, supports the XQuery language through SQL/XML functions, such as XMLQuery, XMLTable, XMLExists, and XMLCast. These functions provide a general interface between the SQL and XQuery languages, letting you get the best of both worlds: SQL and XML.

You compose an XQuery expression to access XML data, relational data, or both, and then you pass that expression, as an argument, to one of the SQL/XML functions mentioned above. Depending on the function used, the results might represent a new XML document or fragment or relational rows, or the results can be further used in the parent SQL query, say, in a join expression. Figure 1 is a graphical depiction of this.

XQuery expressions operate on relational and non-relational data, generating either relational rows or XML content

Figure 1: XQuery expressions operate on relational and non-relational data, generating either relational rows or XML content.

As mentioned, an XQuery query can be issued from within a SQL environment. Figure 2 shows a simple XQuery query issued from within a SQL*Plus session:

Issuing an XQuery query from within a SQL*Plus session

Figure 2: Issuing an XQuery query from within a SQL*Plus session.

In this simple example, an XQuery expression derives its value from the data stored in the EMPLOYEES table belonging to the HR sample schema and returns an XML document.

The example shown in Listing 1 queries the OTN -- New Articles RSS feed document, pulling headlines and URLs for the most recent Oracle Technology Network articles. This RSS feed is available from the Oracle Website.

Caution: Before launching this query, make sure that your computer is connected to the internet.

SELECT XMLQuery(
 'for $i in $h//channel
 return 
 <headlines>
 <title>OTN new articles on Oracle Solaris</title>
 <items>
 {for $j in $h//item
 where ora:contains($j, "Oracle Solaris")
 return <item> {($j/title, $j/link)}</item>}
 </items>
 </headlines>'
 PASSING xmlparse (document httpuritype ('http://feeds.delicious.com/v2/rss/OracleTechnologyNetwork/
otntecharticle').getCLOB()) as "h"
 RETURNING CONTENT).getStringVal() as RESULT FROM DUAL;

Listing 1: Example of query with RSS feed.

The query in Listing 1 should generate an XML document that looks similar to Listing 2. Of course, the actual content of the document will depend on the article headlines currently included in the OTN -- New Articles RSS feed.

<headlines>
 <title>OTN new articles on Oracle Solaris</title>
 <items>
 <item>
 <title>
 Tech Article: How to Extend The Oracle Solaris Studio IDE with NetBeans Plug-Ins
 </title>
 <link>
 http://www.oracle.com/technetwork/articles/servers-storage-dev/howto-extend-ide-1727299.html
 </link>
 </item>
 <item>
 <title>
 Tech Article: How to Automate the Installation of Oracle Solaris Cluster 4.0
 </title>
 <link>
 http://www.oracle.com/technetwork/articles/servers-storage-admin/howto-automate-install-cluster4-
1841534.html
 </link>
 </item>
 </items>
</headlines>

Listing 2: Results of example query.

Joining Diverse Data

Being able to query either relational data or Web data is a fine thing, but how can you join data extracted from disparate sources in a single query? In the following example, you query the XML document shown in Listing 3 and a table from an Oracle Database sample schema.

<EMPLOYEES>
 <EMPLOYEE>
 <EMPID>100</EMPID>
 <BONUS>1000</BONUS>
 </EMPLOYEE>
 <EMPLOYEE>
 <EMPID>101</EMPID>
 <BONUS>1000</BONUS>
 </EMPLOYEE>
 <EMPLOYEE>
 <EMPID>102</EMPID>
 <BONUS>1000</BONUS>
 </EMPLOYEE>
</EMPLOYEES>

Listing 3: XML document to query.

For simplicity, you could just put this document within the document directory of your Web server, so that the document is accessible with an URL, like any other document available on the Web. Make sure that your Web server is up and running, after which you can issue the query shown in Listing 4 to join the data derived from the XML document and the hr.employees relational table within a single an XQuery expression.

SELECT XMLQuery(
'for $z in 1
 return (
 <EMPLOYEES> {for $i in ora:view("HR","employees")/ROW,
 $j in $emps/EMPLOYEES/EMPLOYEE
 where $i/EMPLOYEE_ID = $j/EMPID
 return (<EMPLOYEE>
 <EMPID>{xs:string($i/EMPLOYEE_ID)}</EMPID>
 <ENAME>{xs:string(fn:concat($i/FIRST_NAME, " ", $i/LAST_NAME))}</ENAME>
 <BONUS>{xs:integer($j/BONUS)}</BONUS>
 </EMPLOYEE>)} </EMPLOYEES>)'
 PASSING xmlparse (document httpuritype
 ('http://localhost/bonus.xml').getCLOB()) as "emps"
 RETURNING CONTENT).getStringVal() as RESULT FROM DUAL;

Listing 4: Query to join the data.

The query in Listing 4 should generate and output the XML document shown in Listing 5:

<EMPLOYEES>
 <EMPLOYEE>
 <EMPID>100</EMPID>
 <ENAME>Steven King</ENAME>
 <BONUS>1000</BONUS>
 </EMPLOYEE>
 <EMPLOYEE>
 <EMPID>101</EMPID>
 <ENAME>Neena Kochhar</ENAME>
 <BONUS>1000</BONUS>
 </EMPLOYEE>
 <EMPLOYEE>
 <EMPID>102</EMPID>
 <ENAME>Lex De Haan</ENAME>
 <BONUS>1000</BONUS>
 </EMPLOYEE>
</EMPLOYEES>

Listing 5: Generated XML document.

Now suppose you need to return the query results as relational rows. This is where the XMLTable function will come in handy, as shown in Listing 6:

SELECT e.empid,e.ename, e.bonus FROM 
 XMLTable(
 'for $i in ora:view("HR","employees")/ROW,
 $j in $emps/EMPLOYEES/EMPLOYEE
 where $i/EMPLOYEE_ID = $j/EMPID
 return
 (<EMPLOYEE>
 <EMPID>{xs:string($i/EMPLOYEE_ID)}</EMPID>
 <ENAME>{xs:string(fn:concat($i/FIRST_NAME, " ", $i/LAST_NAME))}</ENAME>
 <BONUS>{xs:integer($j/BONUS)}</BONUS>
 </EMPLOYEE>)'
 PASSING xmlparse (document httpuritype
 ('http://localhost/bonus.xml').getCLOB()) as "emps"
 COLUMNS empid NUMBER PATH 'EMPID',
 ename VARCHAR2(50) PATH 'ENAME',
 bonus NUMBER PATH 'BONUS'
) e;

Listing 6: Using the XML Table function.

The query in Listing 6 should result in the following output:

 EMPID ENAME BONUS
---------- -------------------------------------------- --------
 100 Steven King 1000
 101 Neena Kochhar 1000
 102 Lex De Haan 1000

The interesting part is that these relational rows might be further used in the parent SQL query if it were, say, a join, as in the example in Listing 7:

SELECT e.empid,e.ename, e.bonus, d.department_name FROM hr.departments d,
 XMLTable(
 'for $i in ora:view("HR","employees")/ROW,
 $j in $emps/EMPLOYEES/EMPLOYEE
 where $i/EMPLOYEE_ID = $j/EMPID
 return
 (<EMPLOYEE>
 <EMPID>{xs:string($i/EMPLOYEE_ID)}</EMPID>
 <ENAME>{xs:string(fn:concat($i/FIRST_NAME, " ", $i/LAST_NAME))}</ENAME>
 <EDEPT>{xs:string($i/DEPARTMENT_ID)}</EDEPT>
 <BONUS>{xs:integer($j/BONUS)}</BONUS>
 </EMPLOYEE>)'
 PASSING xmlparse (document httpuritype
 ('http://localhost/bonus.xml').getCLOB()) as "emps"
 COLUMNS empid NUMBER PATH 'EMPID',
 ename VARCHAR2(50) PATH 'ENAME',
 edept VARCHAR2(50) PATH 'EDEPT',
 bonus NUMBER PATH 'BONUS'
) e
WHERE d.department_id=e.edept;

Listing 7: Using the relational rows in a join.

This join should produce the following output:

 EMPID ENAME BONUS DEPARTMENT_NAME
---------- ------------------------------ ---------- ---------------
 100 Steven King 1000 Executive
 101 Neena Kochhar 1000 Executive
 102 Lex De Haan 1000 Executive

It's interesting to note that XQuery can be used not only to grab data from static content but also from content generated on the fly. For example, you might pass the results of a search engine query to an XQuery expression for querying. Leading search engines today offer APIs that let you retrieve search results programmatically. All you need to do is write a script, say, in Python or PHP, which will take parameters for searching as arguments and retrieve the search results in XML format. Then, you can call that script from within the PASSING clause of the XMLTable or XMLQuery function.

Exposing Data Stored in an Oracle Database Through Oracle XML DB HTTP Server

As you learned in the preceding sections, XQuery is good when it comes to accessing external sources and bringing data extracted from them into your Oracle database. But what if you need to expose to the outside data that is stored in your Oracle database so that data becomes available through an URL on the Web as HTML or XML. Oracle Database provides this functionality through Oracle XML DB HTTP Server, which, like XQuery, comes preinstalled as part of Oracle Database.

Here's an example that illustrates of how you might use the DBUri servlet preinstalled in the Oracle XML DB HTTP Server to access the content of an Oracle database's table or view from a browser. Just point your browser to the following URL:

http://yourservername:8080/oradb/HR/EMPLOYEES

You'll be asked to enter a database user name and password for authentication. In this particular case, you enter hr as the user name and hr as the password. As a result, an XML document should be generated and displayed. Listing 8 is a snippet of that XML document.

<EMPLOYEES>
 <ROW>
 <EMPLOYEE_ID>100</EMPLOYEE_ID>
 <FIRST_NAME>Steven</FIRST_NAME>
 <LAST_NAME>King</LAST_NAME>
 <EMAIL>SKING</EMAIL>
 <PHONE_NUMBER>515.123.4567</PHONE_NUMBER>
 <HIRE_DATE>17-JUN-87</HIRE_DATE>
 <JOB_ID>AD_PRES</JOB_ID>
 <SALARY>24000</SALARY>
 <DEPARTMENT_ID>90</DEPARTMENT_ID>
 </ROW>

...

</EMPLOYEES>

Listing 8: Snippet from the XML document that is generated.

Of course, viewing relational data in XML format from a browser is not a common task for most users. Rather, they would expect to get an HTML page showing data organized in a table. To address this problem, you could create an XSLT stylesheet that will transform into HTML the XML document that is generated by the servlet. This XSLT stylesheet can be saved in the Oracle XML DB repository, which is an integrated part of Oracle Database, allowing you to store and represent database content in hierarchical structures using a file/folder/URL approach, much like a conventional file system. The PLS/SQL code to create the XSLT stylesheet discussed here as a repository resource might look like Listing 9:

DECLARE
 a BOOLEAN;
BEGIN
 a := DBMS_XDB.createResource('/home/employee.xsl', 
 '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:template match="/">
 <html>
 <head>
 <title>HR employees table</title>
 </head>
 <body>
 <table border="1" cellspacing="0">
 <tr>
 <th><b>EMP_ID</b></th>
 <th><b>FIRST_NAME</b></th>
 <th><b>LAST_NAME</b></th> 
 <th><b>EMAIL</b></th> 
 <th><b>PHONE_NUMBER</b></th> 
 </tr>
 <xsl:for-each select="EMPLOYEES">
 <xsl:for-each select="ROW">
 <tr>
 <td><xsl:value-of select="EMPLOYEE_ID"/></td>
 <td><xsl:value-of select="FIRST_NAME"/></td>
 <td><xsl:value-of select="LAST_NAME"/></td>
 <td><xsl:value-of select="EMAIL"/></td>
 <td><xsl:value-of select="PHONE_NUMBER"/></td>
 </tr>
 </xsl:for-each>
 </xsl:for-each>
 </table>
 </body>
 </html>
 </xsl:template>
 </xsl:stylesheet>');
 COMMIT;
END;
/

Listing 9: Code for creating XSLT stylesheet.

Now you can use the following URL to apply the above stylesheet to the results being generated by the servlet:

http://yourservername:8080/oradb/HR/EMPLOYEES?contenttype=text/html&transform=/home/employee.xsl

This time, the output should look like Figure 3:

Representing relational data in a browser using the DBUri servlet and XSL transformations

Figure 3: Representing relational data in a browser using the DBUri servlet and XSL transformations.

As you no doubt have realized, your options are not limited to a browser when it comes to the DBUri servlet. Thus, you might invoke the servlet from within programming code using HTTP-aware APIs. For further details, you can refer to the documentation: Oracle XML DB Developer's Guide.

Handling Heterogeneous Data Sources with Oracle Warehouse Builder

Of course, the Web comprises more than HTML, XML, plain text, and other documents. Every data source that you can access on the Web can be considered to be a source of Web content. Thus, for example, you might need to access and manipulate data stored in a remote non-Oracle database system. This functionality is normally beyond the reach of the tools discussed in this article so far. This is where Oracle Warehouse Builder—a powerful data integration feature that is part of Oracle Database—proves useful.

Using the Oracle Warehouse Builder tool, you can perform extract, transform, and load (ETL) operations on data from disparate sources, including non-Oracle databases such as SQL Server and Teradata. Aside from native support for different relational databases, Oracle Warehouse Builder can work with many other data sources, including unstructured data sources such as PDF, cloud-based services such as Twitter, multidimensional databases such as Essbase, and flat files.

It is important to emphasize that Oracle Warehouse Builder is not just a tool for consolidating data from heterogeneous sources. It's a full-featured solution for data modeling, data profiling, data quality, and metadata management, allowing you to turn your Oracle database into a data warehouse—a relational database designed for query and analysis. With Oracle Warehouse Builder, you can profile source data, implement data movement and transformation logic, and design the target schema. This model is represented schematically in Figure 4:

Moving data from the source to the target through the Oracle Warehouse Builder environment

Figure 4: Moving data from the source to the target through the Oracle Warehouse Builder environment.

Perhaps the most interesting thing about Oracle Warehouse Builder is that it doesn't necessarily require you to choose as the target for data movement your Oracle database, as would be the case with XQuery. You can move data into any other remote database, including non-Oracle databases. In fact, the actual list of supported targets depends on the source you're using. For example, if you're using delimited or fixed-format flat files as the source, the list of supported targets includes delimited, fixed-format, and XML-format flat files. The table of sources and targets supported in Oracle Warehouse Builder for different application types can be found in the documentation: Oracle Warehouse Builder Sources and Targets Guide.

To simplify the tasks of extracting, validating, transforming, and integrating diverse data, Oracle Warehouse Builder offers Design Center, a graphical drag-and-drop environment that allows you to accomplish all these tasks without having to write programming code.

To configure, deploy, and then execute a project in Design Center, you typically carry out the following steps. These steps omit details and are meant to show only the general steps you need to take:

  1. Create a new project. This is done with a few clicks in Design Center.
  2. Create a new module for the database (or another data source) that you're going to use as the source. At this stage, you define the source location, specifying all the details required to find the source on the network and connect to it. And then you specify the source details, down to tables.
  3. Create the target module. At this point, you identify the location of the target and specify its details, down to tables.
  4. Create the template mapping module. Here, you drag and drop the source and target tables onto the mapping module canvas, and then connect the source tables with the target tables. At this point, you should see something like Figure 5:

    Logical view of a code template mapping

    Figure 5: Logical view of a code template mapping.

  5. Create the execution view for the code template mapping. In this step, you create the execution unit and associate it with an appropriate code template.
  6. Deploy the code template mapping. The deployment is done with one click in the Project Navigator. The deployment should result in the generation of an EAR file deployed to the Warehouse Builder Control Center Agent.
  7. Execute the code template mapping. You can start execution with a single click in the Project Navigator. It completes the tasks included in the code template associated with the code template mapping, resulting with moving data from the source to the target.

The details to the steps above deserve an article in their own right. If you need one, you might examine the "Oracle Warehouse Builder 11g Release 2 and Heterogeneous Databases" article by Mark Rittman. Also, you might want to look at my article "Using the JDBC Connectivity Layer in Oracle Warehouse Builder."

Oracle SQL Developer Migration Capabilities

Continuing with Oracle Database's tools that you might use to work with external database sources, it's worth mentioning Oracle SQL Developer, a tool that allows you to migrate third-party databases to Oracle Database. Starting with version 11g of Oracle Database, Oracle SQL Developer comes with the default Oracle Database installation. Apart from that, you can download the Oracle SQL Developer kit for free and then install it on your computer, as a separate installation.

Using Oracle SQL Developer's migration capabilities, you can copy metadata and data from a non-Oracle database—including MySQL, SQL Server, Sybase, or Microsoft Access—into your Oracle database. The process of migration can be easily accomplished with the Create/Edit/Select Database Connection wizard. Figure 6 illustrates what a database connection to a third-party database might look like after the migration process has been successfully completed.

Example of a database connection in Oracle SQL Developer

Figure 6: Example of a database connection in Oracle SQL Developer.

For further details on Oracle SQL Developer's migration capabilities, refer to the Oracle Database documentation: Oracle Database SQL Developer User's Guide.

Conclusion

Regardless of the huge volume of data you have stored in your database, you still might need to extract some information from external sources available, say, on the Web. The data obtained from such sources can be then merged into your database for consolidation and reporting.

Of course, you cannot expect that external sources keep their data in the same format as your database does. And, in fact, you don't have to worry about that. If you are an Oracle Database user, you won't even need to look for external tools when it comes to accessing data that is outside of your database. Oracle Database gives you all the necessary tools to query outside data and then seamlessly integrate it into your database.

See Also

About the Author

Yuli Vasiliev is a software developer, freelance author, and consultant currently specializing in open source development, Java technologies, business intelligence (BI), databases, service-oriented architecture (SOA), and, more recently, virtualization. He is the author of a series of books on Oracle technology, the most recent one being Oracle Business Intelligence: An introduction to Business Analysis and Reporting (Packt, 2010).

Revision 1.0, 11/08/2012