Michael Chen | Content Strategist | May 15, 2024
In-database machine learning refers to the integration of machine learning algorithms and techniques into a database management system. All processes—including data set selection, training algorithms, and evaluating models—stay within the database. With in-database machine learning, organizations can perform complex analytical tasks directly within their databases, eliminating the need to move data between systems. That removes the latency, data integrity, and security concerns involved with data import/export processes.
Consider a company that’s suddenly experiencing a lot of customer churn. Machine learning, or ML, algorithms might predict which customers are likely to decamp to a competitor, suggest personalized marketing campaigns, and make other recommendations on how to reengage those buyers. Maybe you have excess inventory of a frequently purchased item; offering a special promotion can move stock and make customers happy. If machine learning is available directly within the database, these suggestions can be generated much faster, on the most up-to-date data. The company can pivot quickly. And because there’s no need to move data to an external ML engine, worries about exposing customer information are eliminated.
In-database machine learning brings machine learning algorithms directly into the database, eliminating the need to move data back and forth between different systems. Traditionally, machine learning required that data be extracted from the database and processed in a separate ML analytics platform or tool. This can be time-consuming and resource-intensive, especially when dealing with large data sets.
With in-database machine learning, data stays put while machine learning algorithms are executed natively within the database environment. A key benefit of embedding ML algorithms within the database is faster and more efficient analysis.
Simply put, moving data slows everything down.
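To make that concrete, here’s a minimal sketch of the traditional path, where every row is pulled out of the database and handed to an external ML library before training can even begin. The database file, table, and column names below are illustrative placeholders, and the example assumes the extracted columns are already numeric.

```python
# The traditional path: extract every row from the database, then train externally.
# "sales.db", the customers table, and the churned column are illustrative placeholders.
import sqlite3

import pandas as pd
from sklearn.linear_model import LogisticRegression

conn = sqlite3.connect("sales.db")                     # source database
frame = pd.read_sql("SELECT * FROM customers", conn)   # full extract: the slow, risky step
conn.close()

X = frame.drop(columns=["churned"])                    # features (assumed numeric here)
y = frame["churned"]                                   # label

model = LogisticRegression(max_iter=1000).fit(X, y)    # training happens outside the database
```

Everything after the SELECT happens outside the database, and that extraction step is exactly the movement in-database machine learning removes.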
In-database machine learning is particularly helpful for the large data sets needed to, for example, train AI models. With in-database machine learning, the database environment uses tools for coding, building models, and testing that are native to the platform. That allows all tables in the database to be used for data-intensive projects, with just a few clicks.
In-database machine learning also provides infrastructure consistency, whether in training or deployment, meaning that IT teams are freed from creating new production-ready infrastructures—not to mention the related maintenance and QA work—to support the next stages of model usage.
Key Takeaways
In-database machine learning is a seamless experience because employees work with their familiar database systems and tools. Likewise, analysts can use their existing databases and familiar query languages to perform advanced analytics without the need for additional software or hardware investments. By analyzing data directly within the database, organizations can uncover valuable insights on the very latest data and make more timely, data-driven decisions.
Without in-database machine learning, companies looking to apply ML analytics to their data will need to perform extract/transform/load (ETL) or extract/load/transform (ELT) processes and shift data to external systems. Under this traditional model, data scientists may perform manual import/export operations, or systems may be integrated via APIs; in either case, multiple extra steps are necessary to get data sets ready for machine learning functions—and those extra steps open the door to potential problems, including:
In-database machine learning skips the export/import steps, keeping ML tasks in the same environment as the data itself without requiring rebuilding or reformatting to ensure compatibility. Staying within the database also removes the need to maintain the intermediary systems that handle those transfers.
At scale, a number of hurdles exist when using a diverse set of data sources for machine learning tasks, particularly AI model training. They include the following:
In-database machine learning is important for data teams right now because of the rapid, continued growth of data volume and variety. Simply put, data-intensive tasks are going to get harder, not easier, so it’s more critical than ever to integrate in-database machine learning into workflows.
At its most basic, in-database machine learning works similarly to standard machine learning. The primary difference is that all of the steps required to move data between systems—from extracts to transformation/cleaning—are simply removed. However, this does come with some limitations and requirements due to the nature of working within a database environment.
In broad strokes, here’s how in-database machine learning works.
Everything starts with the initial data load into the database. For in-database machine learning, one caveat applies: The database must support the capability, specifically keeping the code close to the data so the full efficiency improvements are possible.
Whether machine learning algorithms are located in-database or in a third-party platform, they still need to undergo the requisite optimization process. That means training the model, evaluating the results, and fine-tuning as needed. The biggest difference with in-database machine learning is that these steps are performed within the database, rather than in a system separate from where the data is housed. This eliminates moving data between multiple systems and data stores to perform model optimization tasks.
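In practice, that usually means the application sends a short command and the database does the heavy lifting next to the data. The sketch below is vendor-neutral: ml_train is a hypothetical stand-in for whatever training routine a given database exposes (a stored procedure, SQL function, or extension), not a real API, and parameter placeholder syntax varies by driver.

```python
# Vendor-neutral sketch: ask the database to train a model where the data lives.
# "ml_train" and the table/column names are hypothetical; real routines and the
# DB-API paramstyle ("%s" here) vary by vendor and driver.

def train_in_database(conn, model_name: str, table: str, target: str) -> None:
    """Issue a single training call; no rows leave the database server."""
    cur = conn.cursor()
    try:
        # The database runs training, evaluation, and tuning internally and
        # stores the resulting model alongside the data.
        cur.execute("CALL ml_train(%s, %s, %s)", (model_name, table, target))
        conn.commit()
    finally:
        cur.close()
```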
In traditional machine learning, data must be moved from databases into a repository, such as a data lake, for training the model, evaluating the results, and performing refinements such as tweaking individual algorithms and parameters. These steps consume compute resources and weigh down infrastructure. With in-database machine learning, database-native APIs can handle these tasks, even as the model moves from development to test to production environments.
Using in-database machine learning, revisions to the ML model can propagate to other databases, whether in development, testing, or production environments, simply by versioning a table. Refinements integrate instantly, allowing functions to execute without disruption from additional steps or bogged-down compute resources.
When insights are generated using ML models directly within a database, the result is near real-time insights without additional steps or concerns over ETL/ELT latency and data integrity.
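Concretely, scoring can live inside an ordinary query, so an application reads predictions the same way it reads any other column. In the hypothetical sketch below, PREDICT(...) is a placeholder for the engine’s own scoring function, and the table and column names are illustrative; only the small result set travels back to the application.

```python
# Hypothetical scoring query: predictions are computed next to the freshest rows,
# and only the results are returned. PREDICT(...) stands in for the engine's own
# scoring function; the table and columns are illustrative.

def churn_risk_for_new_customers(conn, model_name: str):
    cur = conn.cursor()
    try:
        cur.execute(
            f"SELECT customer_id, PREDICT({model_name} USING *) AS churn_risk "
            "FROM new_customers"
        )
        return cur.fetchall()
    finally:
        cur.close()
```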
In-database machine learning naturally shortens processes and reduces hardware needs for organizations, creating a number of benefits. While this approach comes with its own set of limitations, the common benefits are as follows:
Moving data between systems is, at best, cumbersome. At worst, it can introduce errors, latency, and security risks while slowing operations. By keeping analysis tasks within the database, the extra hurdles of ELT/ETL, across exporting, transforming, and loading data, are eliminated, ensuring that the overall analytics process moves as swiftly as possible.
When an organization removes the need to shift large data sets, it gains savings in storage and expert labor, along with lower latency. After all, time is money. In addition, improved efficiency reduces hours spent troubleshooting hardware and software issues, delivering a second level of cost reduction.
Scalability often depends on resources: The more money, manpower, or CPUs needed for a process, the more difficult it is to scale on demand. The removal of data movement processes eliminates the extra compute power needed to complete steps such as exporting or format conversion. Keeping data within the database reduces the need to address compatibility issues and improves compute efficiency, thus offering far greater flexibility and easier scaling to meet demand.
ELT/ETL processes are a primary source of duplicate data within a network. Duplication can stem from many sources, such as a hardware issue that interrupts an export and leaves corrupted data, or problems with data transformation tools that lead to accidental edits or deletions. Each step of an ELT/ETL process opens up risks that can damage a data set’s quality and accuracy while also slowing process efficiency.
Machine learning within the database keeps data in one place. This eliminates the need to move data, reducing export/import and input/output. As a result, processes can happen within the native environment, without relying on other systems. This frees up automation tools and capabilities for various tasks, such as deployment, auditing, and maintenance checks. Users can benefit from these features without worrying about compatibility or integration issues that may arise.
In-database machine learning tools come in a spectrum of services and capabilities. In many cases, these tools are similar to what a database vendor may provide as standalone capabilities, either as a subset of integrated features or as an embedded connection to the vendor’s machine learning platform. For example, Oracle Database provides machine learning capabilities within the environment to eliminate the need to move data from system to system. In this case, Oracle Database provides exploration, preparation, and modeling using Oracle Machine Learning tools, such as SQL, R, Python, REST, automated machine learning (AutoML), and no-code interfaces, along with a variety of available algorithms.
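As an illustration of the Python interface, a short Oracle Machine Learning for Python (OML4Py) session can train and score a model without pulling rows to the client. This is a minimal sketch only: the credentials, table, and column names are placeholders, and exact method signatures may differ by release.

```python
# Minimal OML4Py sketch: the oml package pushes work to the database rather than
# pulling rows to the client. Credentials, table, and column names are placeholders.
import oml

oml.connect(user="ml_user", password="<password>", dsn="mydb_high")

customers = oml.sync(table="CUSTOMERS")          # proxy for an existing table; no data moves
train, test = customers.split(ratio=(0.8, 0.2))  # split computed in the database

model = oml.glm("classification")                # in-database generalized linear model
model = model.fit(train.drop("CHURNED"), train["CHURNED"])

# Score the holdout rows in-database and fetch only the small result.
scores = model.predict(test.drop("CHURNED"), supplemental_cols=test[:, ["CUST_ID"]])
print(scores.head())
```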
Though it comes with compelling benefits, in-database machine learning is highly dependent on the features and capabilities of the database environment. This may lead to issues with future migration or when the ML model requires something beyond the environment’s native capabilities.
The most common drawbacks and limitations of in-database machine learning include the following:
If everything lines up with a project’s machine learning needs and goals, going from testing to deployment is actually a simple step. However, these models are based on the specific capabilities of an organization’s in-database tools. What happens when the project evolves into something more complex or requires migration? Working with in-database tools can make the immediate ML workload faster and more efficient, but the future can be a question mark, so it’s necessary to consider whether longer-term goals align with current capabilities.
In-database machine learning works only on supported database applications and may offer a limited set of APIs. This is changing as the capabilities of in-database machine learning systems grow, but as a general rule, standalone tools offer more power and features, along with a wealth of specialists available to help companies exploit those features.
The greatest strength of in-database machine learning also leads to one of its biggest drawbacks: By keeping data within the database environment, ETL/ELT steps are skipped—but that means opportunities for auditing and data cleansing are bypassed as well.
In many cases, databases won’t share the same compute resources as machine learning tools, particularly for large-scale or extremely complex models that require high-performance computing. Because of that, the scope of in-database machine learning models often has a cap. Every organizational setup is different; similarly, every project’s needs are different, and this is a tradeoff to consider during initial planning stages.
HeatWave MySQL is a leading MySQL cloud service that connects data in repositories to features such as analytics and machine learning in a single platform, with no need for cumbersome ETL/ELT processes, even when data is stored outside of a MySQL database. With HeatWave MySQL, teams can build, train, deploy, and explain machine learning models swiftly and efficiently in a scalable, cost-effective, and powerful platform. HeatWave MySQL is available on Oracle Cloud Infrastructure (OCI), Amazon Web Services (AWS), and Microsoft Azure.
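As a sketch of what that looks like in practice, HeatWave AutoML exposes training and scoring as callable routines, so a short session from a standard MySQL client is enough. Exact routine arguments vary by HeatWave version, and the schema, table, and target column names below are placeholders.

```python
# Sketch of a HeatWave AutoML session via mysql-connector-python. The schema,
# tables, and target column are placeholders; routine arguments may differ by version.
import mysql.connector

conn = mysql.connector.connect(host="<db-system>", user="admin",
                               password="<password>", database="demo")
cur = conn.cursor()

# Train a classification model on a table that already lives in the database.
cur.execute("CALL sys.ML_TRAIN('demo.customers', 'churned', "
            "JSON_OBJECT('task', 'classification'), @model)")

# Load the model and write predictions for new rows to another table, all in-database.
cur.execute("CALL sys.ML_MODEL_LOAD(@model, NULL)")
cur.execute("CALL sys.ML_PREDICT_TABLE('demo.new_customers', @model, "
            "'demo.customer_predictions', NULL)")

conn.commit()
cur.close()
conn.close()
```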
To get started, organizations need to ensure their database provides in-database machine learning. They may also need to invest in training to upskill their data scientists and analysts in ML techniques. But once they get rolling, in-database machine learning can be a game-changer for organizations looking to fully leverage the power of ML. By bringing the algorithms to the data, rather than vice versa, decision-makers gain faster and more efficient analysis.
AI models come in many sizes and complexity levels, from LLMs to simpler ML models. What do they all have in common? A hunger for data. Here are four components of an AI-ready data infrastructure.
How can in-database ML be used effectively?
In-database machine learning works only when companies employ a database that supports it. The database’s underlying compute resources must be considered, along with the size and scope of both the database and the machine learning model.
What are the benefits of in-database ML?
In-database machine learning removes the need to extract and move data between systems. This creates a natural set of efficiency benefits, and in some cases it can shrink process times from weeks to days by eliminating reliance on external tools for ETL/ELT. From a big-picture perspective, it also reduces the total cost of ownership and increases scalability and operational efficiency through lower resource usage.
What are some issues to consider when using in-database ML?
Before deciding to use in-database machine learning for a project, teams should weigh the following factors:
These questions can clarify the pros and cons of in-database machine learning and should be considered for each project.
What are some future trends in in-database ML?
Providers continue to improve and expand their in-database machine learning capabilities, and that means a number of trends are on the horizon. As more tools and platforms support in-database machine learning, data scientists will be able to build and deploy more complex models. This also delivers greater transparency, because the model lives on a unified platform instead of being visible only to whoever is driving the machine learning tools. Greater capabilities also mean usability with larger data sets, and thus faster training, testing, and deployment.