Big Data Open Source

It would be hard to write the story of big data without including open source—the two are tied together. The development of open source software was a huge factor in the evolution of big data. And open source technology continues to be an integral part of the big data ecosystem because of its capability for fast innovation. In fact, the most important names in big data software—Hadoop, Spark, Cassandra, and Kafka—are all open source.

How are companies using open source for big data?

Although open source software has a reputation as a favorite of hobbyists and amateur developers, the business world has been adopting open source in mission-critical environments for quite some time.

Some of the reasons that companies choose open source software include:

  • Competitive features and technical capabilities
  • Quality of the solutions
  • Ability to customize and fix issues
  • Low barrier to entry

Arguably, one of the greatest advantages of open source is its large and devoted developer community. The most popular open source projects have a huge developer base working to patch and improve the technology. Developers are drawn to open source for its competitive features and innovative capabilities, which are especially valuable when compared with what traditional software offers.

Open source is especially beneficial to companies that don’t have the in-house development or IT resources to build their own software. Alternatively, companies that do have those resources turn to open source to give their employees the leading-edge technology that they are more interested in working with.

How do companies see open source?

Open source technology holds a great deal of promise. But it is not without challenges. According to the 2016 North Bridge and Black Duck Future of Open Source Study, almost 33 percent of companies have no process for identifying, tracking or remediating known open source vulnerabilities, which could leave them open to security threats.

Open source has been very advantageous to the big data community. With its ready-to-go code, open source software has enabled companies to get products to market faster. But it has always carried a certain amount of risk. The OpenSSL Heartbleed security vulnerability in 2014 is just one example of that risk.

Despite the benefits gained from having many contributors, open source software isn’t immune to ordinary programming mistakes and security blunders. Most software engineers don’t track open source use, leaving many companies unaware of the resulting security and compliance risks they could be facing.

For open source to be fully usable and effective, most businesses need it to be integrated and supported to some degree. That is easier said than done, because in a sense open source is never complete. There’s always something new to work on. In addition, open source products are often not easy to work with, and using them may require training. Compatibility with existing applications and hardware is another concern. As a result, most companies end up adopting open source through another company.

Companies like Oracle, Databricks, and DataStax have been working with open source in this way. These companies brought open source into the enterprise and made it fully usable. There is a huge benefit to this, because these companies add value to open source through commits and various other improvements.

At the 2017 Open Source Summit, Linux founder Linus Torvalds acknowledged and welcomed the corporate influence on open source projects and the work done on them by corporate developers. “It’s very important to have companies in open source,” he said. “It’s one thing I have been very happy about.”

How is Oracle Big Data using open source?

In 2017 Oracle was named one of the top 35 companies that play a major role in developing and maintaining open source software. Through the purchase of Sun Microsystems in 2010, Oracle inherited some of the world’s most popular open source technologies. Our support for open source big data technologies has been one of the dominant growth drivers for us in the past few years. Oracle continues to support open source development and foundations.

When it comes to big data, Oracle has been especially proactive in working with open source software. The next section describes how Oracle uses open source in various areas of our big data platform. At Oracle, working with big data involves three key steps:

  • Integrate big data and bring it into your system
  • Manage your big data and have a place to store it
  • Analyze your data to understand it, visualize it, make sense of it, and even build proactive models based on machine learning

Integration and big data

Many of our big data customers are specifically demanding open source offerings. Oracle is committed to developing, supporting, and promoting open source. Oracle data integration products, such as Oracle Data Integration and Oracle GoldenGate, include open source technology, along with many other platforms.

We are also noticing that many customers want to modernize their open source frameworks and the supporting technologies that are constantly changing. On the data integration side, we currently support around twenty-five different open source technologies, data sources, targets, and execution frameworks. Some of the technologies we support include:

  • Apache Kafka
  • Apache Hive
  • Apache HBase
  • Hadoop Distributed File System (HDFS)
  • Apache Cassandra

What customers are looking at these days is the maturity level of their big data products. One of the most important factors to consider is whether the vendor has an acceptable support strategy around the big data frameworks. It is critical that the vendor isn’t casual about its commitment to open source technology.

Beyond product maturity, keep in mind that a big data business solution is typically going to be a mix of open source and non-open source software. Companies have been solving big data problems with open source solutions alone, but doing so requires a great deal of commitment, dedication, and expertise.

You can and should leverage open source technology where it makes sense. But most often, you’ll need to combine it with a variety of other vendor technologies as well.

For example, in the early days of establishing data lakes, companies wanted to leverage a product like Kafka, with its ability to take many inputs and distribute to many outputs. But to make Kafka more reliable and robust, a technology like Oracle GoldenGate was required. While GoldenGate isn’t open source, GoldenGate and Kafka together make a better ingest option for a data lake than a product like Sqoop with Kafka, because GoldenGate is a much more robust and mature product than Sqoop.
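The "many inputs, many outputs" idea can be pictured with a small in-memory sketch. This is not Kafka's API and says nothing about how GoldenGate connects to it; the `TopicLog` class, the source names, and the consumer names are all hypothetical and exist only to show several producers appending to one log while each consumer reads the full stream at its own pace.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only log standing in for a single topic (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._records = []                # ordered, append-only list of records
        self._offsets = defaultdict(int)  # next unread position for each consumer

    def publish(self, source, payload):
        """Any number of inputs append to the same log."""
        self._records.append((source, payload))

    def poll(self, consumer):
        """Return the records this consumer has not seen yet and advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:]
        self._offsets[consumer] = len(self._records)
        return batch


# Two inputs write to the same log.
log = TopicLog()
log.publish("orders_db", {"order_id": 1})
log.publish("clickstream", {"page": "/home"})

# Two outputs each receive every record, independently of one another.
lake_batch = log.poll("data_lake_loader")
alert_batch = log.poll("alerting_service")

# A later record reaches a consumer on its next poll.
log.publish("orders_db", {"order_id": 2})
lake_next = log.poll("data_lake_loader")
```

Keeping one offset per consumer is what lets a slow output fall behind without affecting the others. A real deployment would also need durability, partitioning, and delivery guarantees, which this sketch leaves out.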

Big data management

From a data management perspective, Oracle’s big data product stack is heavily based on open source.

Oracle chose this approach to take advantage of open source innovation and have better control over the functionality made available to customers. With big data, there are multiple components within the stack that continuously evolve. That’s why we made the decision to have our own open source Hadoop distribution.

We also believe that using open source software enables Oracle to provide better support for our customers. At the same time, we know that other software ecosystems are developing interesting open source projects that continue to evolve. That’s why Oracle continues to contribute to many different development communities. For example, Oracle’s development efforts are evolving to use an object store as a data lake.

Oracle actively contributes to open source communities and offers customers some of our own IP for better performance and capabilities.

R programming language

At Oracle, we haven’t just adopted R; we’ve actually improved it. Oracle’s supported redistribution of open source R (which is a free download) can run in the database and on Hadoop, and is now faster because we’ve parallelized it.

R can run on multiple nodes and on a cluster instead of a single machine, so customers can run bigger, more complex algorithms on more data sets without relying on sampling. Oracle’s improvements let users keep the familiar R syntax while different implementations underneath it make it scalable and performant.
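To see why spreading work across nodes removes the need for sampling, consider a statistic that can be assembled from per-partition results. The sketch below is written in Python rather than R and does not represent Oracle's implementation; the function names are hypothetical, and threads stand in for cluster nodes. Each partition returns a partial sum and count, and combining them gives the exact mean over the full data set.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """Work done independently on one partition: return its sum and its size."""
    return sum(chunk), len(chunk)


def distributed_mean(values, workers=4):
    """Compute the exact mean by combining per-partition sums and counts."""
    if not values:
        raise ValueError("values must not be empty")
    # Split the full data set into roughly equal partitions (no sampling).
    size = max(1, -(-len(values) // workers))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    # Threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(1, 1001))
mean_all = distributed_mean(data)
```

The same split-then-combine shape applies to any computation whose partial results can be merged, which is why the user-facing syntax can stay the same while the execution underneath changes.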

In addition, Oracle has made the following improvements to R:

  • Created algorithms that operate in the database and use R syntax
  • Took R scripts and made them executable
  • Made it simpler for users to launch R scripts and leverage SQL

Oracle has expanded into the Hadoop space as well, introducing an R interface for Hive.

Oracle’s commitment to R, Hadoop, and open source isn’t just about the technology. When the R community created the R Consortium in 2015, Oracle was a founding member. The R Consortium was founded to provide benefits and support to the R open source community. Oracle continues to support the growth and development of R and has encouraged the adoption of best practices for R package quality.

Spatial and graph database for big data

Oracle Spatial and Oracle Graph analytic services and data models support big data workloads on Apache Hadoop and NoSQL database technologies. Both incorporate open source libraries and components to round out our offerings. Oracle has used several of these components for infrastructure purposes, mostly on Apache-based projects.

Oracle views the relationship as mutually beneficial. For example, our analytics on the spatial/graph side are custom built, but we accelerated that process by basing it on an open source project called Green-Marl, a domain-specific language for graph data analysis that enables us to turn around analytics questions for customers more quickly.

When Oracle contributes to open source, we typically leverage open source, customize it, and enhance it. Here are examples of Oracle’s contributions to open source:

  • Cytoscape: Oracle develops components that we ship (such as an extension to GDAL) so that others can load data into their spatial databases.
  • Property graph side: Oracle finds opportunities to extend the products or projects that we work on, identifying bugs and security issues as well as providing feedback to the appropriate developers. The area we have contributed to the most is the W3C RDF standard.
  • GDAL: Oracle has incorporated this library for import, export, and format conversion of spatial data. Oracle provides the Oracle Spatial and Oracle Graph driver.