Data science combines multiple fields, including statistics, scientific methods, artificial intelligence (AI), and data analysis, to extract value from data. Those who practice data science are called data scientists, and they combine a range of skills to analyze data collected from the web, smartphones, customers, sensors, and other sources to derive actionable insights.
Data science encompasses preparing data for analysis, including cleansing, aggregating, and manipulating the data to perform advanced data analysis. Analytic applications and data scientists can then review the results to uncover patterns and enable business leaders to draw informed insights.
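As a rough illustration of the cleansing and aggregating steps described above, here is a minimal Python sketch that uses only the standard library. The sensor readings, field names, and the `cleanse` and `aggregate` helpers are hypothetical and invented for this example. Real projects would more likely use a dataframe library or in-database tooling.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw sensor readings; field names are illustrative only.
raw_readings = [
    {"sensor": "a", "temp_c": "21.5"},
    {"sensor": "a", "temp_c": "22.5"},
    {"sensor": "b", "temp_c": None},      # missing value
    {"sensor": "b", "temp_c": "19.0"},
    {"sensor": " B ", "temp_c": "21.0"},  # inconsistent label
    {"sensor": "a", "temp_c": "n/a"},     # unparseable value
]


def cleanse(records):
    """Drop records with missing or unparseable values and normalize labels."""
    cleaned = []
    for record in records:
        try:
            temp = float(record["temp_c"])
        except (TypeError, ValueError):
            continue  # skip rows that cannot be converted to a number
        cleaned.append({"sensor": record["sensor"].strip().lower(), "temp_c": temp})
    return cleaned


def aggregate(records):
    """Group cleaned records by sensor and compute the mean temperature."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["sensor"]].append(record["temp_c"])
    return {sensor: mean(values) for sensor, values in grouped.items()}


clean_readings = cleanse(raw_readings)
mean_temp_by_sensor = aggregate(clean_readings)
print(mean_temp_by_sensor)
```

Keeping cleansing and aggregation as separate functions makes it possible to inspect the cleaned records before any summary statistics are computed.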
Data science is one of the most exciting fields out there today. But why is it so important?
Because companies are sitting on a treasure trove of data. As modern technology has enabled the creation and storage of increasing amounts of information, data volumes have exploded. It’s estimated that 90 percent of the data in the world was created in the last two years. For example, Facebook users upload 10 million photos every hour.
But this data is often just sitting in databases and data lakes, mostly untouched.
The wealth of data being collected and stored by these technologies can bring transformative benefits to organizations and societies around the world—but only if we can interpret it. That’s where data science comes in.
Data science reveals trends and produces insights that businesses can use to make better decisions and create more innovative products and services. Perhaps most importantly, it enables machine learning (ML) models to learn from the vast amounts of data being fed to them, rather than mainly relying upon business analysts to see what they can discover from the data.
Data is the bedrock of innovation, but its value comes from the information data scientists can glean from it, and then act upon.
To better understand data science—and how you can harness it—it’s equally important to know other terms related to the field, such as artificial intelligence (AI) and machine learning. Often, you’ll find that these terms are used interchangeably, but there are nuances.
Here’s a simple breakdown:
Organizations are using data science to turn data into a competitive advantage by refining products and services. Data science and machine learning use cases include:
Many companies have made data science a priority and are investing in it heavily. In Gartner’s recent survey of more than 3,000 CIOs, respondents ranked analytics and business intelligence as the top differentiating technology for their organizations. The CIOs surveyed see these technologies as the most strategic for their companies, and are investing accordingly.
The process of analyzing and acting upon data is iterative rather than linear, but this is how the data science lifecycle typically flows for a data modeling project:
Planning: Define a project and its potential outputs.
Building a data model: Data scientists often use a variety of open source libraries or in-database tools to build machine learning models. Often, users will want APIs to help with data ingestion, data profiling and visualization, or feature engineering. They will need the right tools as well as access to the right data and other resources, such as compute power.
Evaluating a model: Data scientists must achieve a high level of accuracy for their models before they can feel confident deploying them. Model evaluation will typically generate a comprehensive suite of evaluation metrics and visualizations to measure model performance against new data, and rank models over time to enable optimal behavior in production. Model evaluation goes beyond raw performance to take into account expected baseline behavior.
Explaining models: Being able to explain the internal mechanics of the results of machine learning models in human terms has not always been possible—but it is becoming increasingly important. Data scientists want automated explanations of the relative weighting and importance of factors that go into generating a prediction, and model-specific explanatory details on model predictions.
Deploying a model: Taking a trained machine learning model and getting it into the right systems is often a difficult and laborious process. This can be made easier by operationalizing models as scalable and secure APIs, or by using in-database machine learning models.
Monitoring models: Unfortunately, deploying a model isn’t the end of it. Models must always be monitored after deployment to ensure that they are working properly. The data the model was trained on may no longer be relevant for future predictions after a period of time. For example, in fraud detection, criminals are always coming up with new ways to hack accounts.
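The evaluating, explaining, and monitoring steps above can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the fraud-scoring model is a hand-written weighted sum, not a trained model, and the names `WEIGHTS`, `THRESHOLD`, `contributions`, `evaluate`, and `drift_score` are hypothetical. The sketch shows three ideas: comparing accuracy on held-out data with a majority-class baseline, breaking a prediction down into per-feature contributions, and flagging when live data has moved away from the training data.

```python
from statistics import mean, pstdev

# Hypothetical fraud-scoring model: a weighted sum of features plus a threshold.
# The weights, threshold, and feature names are illustrative assumptions.
WEIGHTS = {"amount": 0.002, "foreign": 0.5, "night": 0.25}
THRESHOLD = 1.0


def contributions(row):
    """Explain a prediction: how much each feature adds to the score."""
    return {name: WEIGHTS[name] * row[name] for name in WEIGHTS}


def predict(row):
    """Flag a transaction as fraud (1) when its score reaches the threshold."""
    return 1 if sum(contributions(row).values()) >= THRESHOLD else 0


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def evaluate(rows, labels):
    """Compare model accuracy with a majority-class baseline on held-out data."""
    majority = max(set(labels), key=labels.count)
    return {
        "model": accuracy([predict(r) for r in rows], labels),
        "baseline": accuracy([majority] * len(labels), labels),
    }


def drift_score(training_values, live_values):
    """Shift of the live mean, measured in training standard deviations."""
    spread = pstdev(training_values)
    if spread == 0:
        return 0.0 if mean(live_values) == mean(training_values) else float("inf")
    return abs(mean(live_values) - mean(training_values)) / spread


# Made-up held-out transactions and labels (1 = fraud).
held_out_rows = [
    {"amount": 100, "foreign": 0, "night": 0},
    {"amount": 400, "foreign": 1, "night": 1},
    {"amount": 50, "foreign": 0, "night": 1},
    {"amount": 300, "foreign": 1, "night": 0},
    {"amount": 200, "foreign": 0, "night": 0},
]
held_out_labels = [0, 1, 0, 1, 1]

report = evaluate(held_out_rows, held_out_labels)
explanation = contributions(held_out_rows[1])

# Made-up transaction amounts seen during training and after deployment.
training_amounts = [100, 120, 80, 110, 90]
live_amounts = [300, 280, 320]
amount_drift = drift_score(training_amounts, live_amounts)
```

A model is only worth deploying if `report["model"]` clearly beats `report["baseline"]`. A large `amount_drift` suggests the model may need retraining. The point at which drift counts as "large" is a judgment call that this sketch does not settle.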
Building, evaluating, deploying, and monitoring machine learning models can be a complex process. That’s why there’s been an increase in the number of data science tools. Data scientists use many types of tools, but one of the most common is open source notebooks, which are web applications for writing and running code, visualizing data, and seeing the results—all in the same environment.
Some of the most popular notebooks are Jupyter, RStudio, and Zeppelin. Notebooks are very useful for conducting analysis, but have their limitations when data scientists need to work as a team. Data science platforms were built to solve this problem.
To determine which data science tool is right for you, it’s important to ask the following questions: What kind of languages do your data scientists use? What kind of working methods do they prefer? What kind of data sources are they using?
For example, some users prefer to have a datasource-agnostic service that uses open source libraries. Others prefer the speed of in-database machine learning algorithms.
At most organizations, data science projects are typically overseen by three types of managers:
Business managers: These managers work with the data science team to define the problem and develop a strategy for analysis. They may be the head of a line of business, such as marketing, finance, or sales, and have a data science team reporting to them. They work closely with the data science and IT managers to ensure that projects are delivered.
IT managers: Senior IT managers are responsible for the infrastructure and architecture that will support data science operations. They are continually monitoring operations and resource usage to ensure that data science teams operate efficiently and securely. They may also be responsible for building and updating IT environments for data science teams.
Data science managers: These managers oversee the data science team and their day-to-day work. They are team builders who can balance team development with project planning and monitoring.
But the most important player in this process is the data scientist.
As a specialty, data science is young. It grew out of the fields of statistical analysis and data mining. The Data Science Journal debuted in 2002, published by the International Council for Science: Committee on Data for Science and Technology. By 2008 the title of data scientist had emerged, and the field quickly took off. There has been a shortage of data scientists ever since, even though more and more colleges and universities have started offering data science degrees.
A data scientist’s duties can include developing strategies for analyzing data; preparing data for analysis; exploring, analyzing, and visualizing data; building models with data using programming languages such as Python and R; and deploying models into applications.
The data scientist doesn’t work solo. In fact, the most effective data science is done in teams. In addition to a data scientist, this team might include a business analyst who defines the problem, a data engineer who prepares the data and manages how it is accessed, an IT architect who oversees the underlying processes and infrastructure, and an application developer who deploys the models or outputs of the analysis into applications and products.
Despite the promise of data science and huge investments in data science teams, many companies are not realizing the full value of their data. In their race to hire talent and create data science programs, some companies have experienced inefficient team workflows, with different people using different tools and processes that don’t work well together. Without more disciplined, centralized management, executives might not see a full return on their investments.
This chaotic environment presents many challenges.
Data scientists can’t work efficiently. Because access to data must be granted by an IT administrator, data scientists often have long waits for data and the resources they need to analyze it. Once they have access, the data science team might analyze the data using different (and possibly incompatible) tools. For example, a scientist might develop a model using the R language, but the application it will be used in is written in a different language. This is why it can take weeks, or even months, to deploy the models into useful applications.
Application developers can’t access usable machine learning. Sometimes the machine learning models that developers receive are not ready to be deployed in applications. And because access points can be inflexible, models can’t be deployed in all scenarios and scalability is left to the application developer.
IT administrators spend too much time on support. Because of the proliferation of open source tools, IT can have an ever-growing list of tools to support. A data scientist in marketing, for example, might be using different tools than a data scientist in finance. Teams might also have different workflows, which means that IT must continually rebuild and update environments.
Business managers are too removed from data science. Data science workflows are not always integrated into business decision-making processes and systems, making it difficult for business managers to collaborate knowledgeably with data scientists. Without better integration, business managers find it difficult to understand why it takes so long to go from prototype to production—and they are less likely to back the investment in projects they perceive as too slow.
Many companies realized that without an integrated platform, data science work was inefficient, insecure, and difficult to scale. This realization led to the development of data science platforms. These platforms are software hubs around which all data science work takes place. A good platform alleviates many of the challenges of implementing data science, and helps businesses turn their data into insights faster and more efficiently.
With a centralized machine learning platform, data scientists can work in a collaborative environment using their favorite open source tools, with all their work synced by a version control system.
A data science platform reduces redundancy and drives innovation by enabling teams to share code, results, and reports. It removes bottlenecks in the flow of work by simplifying management and incorporating best practices.
In general, the best data science platforms aim to:
Data science platforms are built for collaboration by a range of users including expert data scientists, citizen data scientists, data engineers, and machine learning engineers or specialists. For example, a data science platform might allow data scientists to deploy models as APIs, making it easy to integrate them into different applications. Data scientists can access tools, data, and infrastructure without having to wait for IT.
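To make the idea of deploying a model as an API more concrete, here is a minimal sketch using only Python's standard library. It is not how any particular platform works. The `predict` function is a hypothetical stand-in for a trained model, and the `/predict` path, port, and input names `x1` and `x2` are assumptions chosen for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a trained model: a hypothetical weighted sum of two inputs."""
    return 0.5 * features["x1"] + 2.0 * features["x2"]


def handle_prediction(body: bytes):
    """Turn a JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        features = json.loads(body)
        result = {"prediction": predict(features)}
        status = 200
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing inputs, or non-numeric values.
        result = {"error": "expected a JSON object with numeric x1 and x2"}
        status = 400
    return status, json.dumps(result).encode("utf-8")


class PredictionHandler(BaseHTTPRequestHandler):
    """Expose handle_prediction at POST /predict (path is an assumption)."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_prediction(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port=8000):
    """Create (but do not start) the HTTP server on localhost."""
    return HTTPServer(("127.0.0.1", port), PredictionHandler)
```

Calling `make_server().serve_forever()` would start the service, after which any application that can send a JSON POST request to `/predict` could use the model, regardless of the language it is written in. The sketch leaves out authentication, scaling, and model versioning, which a production deployment would need.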
The demand for data science platforms has exploded in the market. In fact, the platform market is expected to grow at a compound annual growth rate of more than 39 percent over the next few years and is projected to reach US$385 billion by 2025.
If you’re ready to explore data science platforms, there are some key capabilities to consider:
Choose a project-based UI that encourages collaboration. The platform should empower people to work together on a model, from conception to final development. It should give each team member self-service access to data and resources.
Prioritize integration and flexibility. Make sure the platform includes support for the latest open source tools, common version-control providers, such as GitHub, GitLab, and Bitbucket, and tight integration with other resources.
Include enterprise-grade capabilities. Ensure the platform can scale with your business as your team grows. The platform should be highly available, have robust access controls, and support a large number of concurrent users.
Make data science more self-service. Look for a platform that takes the burden off IT and engineering, and makes it easy for data scientists to spin up environments instantly, track all of their work, and deploy models into production.
Ensure easier model deployment. Model deployment and operationalization is one of the most important steps of the machine learning lifecycle, but it’s often disregarded. Make sure that the service you choose makes it easier to operationalize models, whether by providing APIs or by ensuring that users build models in a way that allows for easy integration.
Your organization could be ready for a data science platform, if you’ve noticed that:
A data science platform can deliver real value to your business. Oracle’s data science platform includes a wide range of services that provide a comprehensive, end-to-end experience designed to accelerate model deployment and improve data science results.