Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract value from data. Data scientists combine a range of skills—including statistics, computer science, and business knowledge—to analyze data collected from the web, smartphones, customers, sensors, and other sources.
Data science reveals trends and produces insights that businesses can use to make better decisions and create more innovative products and services. Data is the bedrock of innovation, but its value comes from the information data scientists can glean from it and then act upon.
Tools for Data Scientists
Data scientists use many types of tools, but among the most common are open source notebooks: web applications for writing and running code, visualizing data, and seeing the results, all in the same environment. Some of the most popular notebooks include Jupyter, RStudio, and Zeppelin. Notebooks are very useful for conducting analysis but have their limitations when data scientists need to work as a team. Data science platforms emerged to solve this problem.
As modern technology has enabled the creation and storage of increasing amounts of information, the volume of data has soared. It’s estimated that 90 percent of the data in the world was created in the last two years. For example, Facebook users upload 10 million photos every hour. The number of connected devices in the world—the Internet of Things (IoT)—is projected to grow to more than 75 billion by 2025.
The wealth of data being collected and stored by these technologies can bring transformative benefits to organizations and societies around the world, but only if we can interpret it. That’s where data science comes in.
As a specialty, data science is young. It grew out of the fields of statistical analysis and data mining. The Data Science Journal debuted in 2002, published by the International Council for Science: Committee on Data for Science and Technology. By 2008, the title of data scientist had emerged and the field quickly took off. There has been a shortage of data scientists ever since, even though more and more colleges and universities have started offering data science degrees.
A data scientist’s duties can include developing strategies for analyzing data; preparing data for analysis; exploring, analyzing, and visualizing data; building models with data using programming languages such as Python and R; and deploying models into applications.
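As a rough illustration of the duties listed above (preparing, exploring, and modeling data), here is a minimal Python sketch. The records, the function names, and the choice of a straight-line model are all assumptions made for the example; a real project would typically use dedicated libraries and far more data.

```python
from statistics import mean

# Hypothetical raw records: (ad_spend, sales); None marks a missing value.
raw = [(1.0, 2.1), (2.0, 3.9), (None, 5.0), (3.0, 6.2), (4.0, 7.8), (5.0, None)]


def prepare(records):
    """Prepare data: drop rows that contain a missing value."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def explore(records):
    """Explore data: return a few summary statistics."""
    xs, ys = zip(*records)
    return {"n": len(records), "mean_x": mean(xs), "mean_y": mean(ys)}


def fit_line(records):
    """Build a model: ordinary least squares for y = slope * x + intercept."""
    xs, ys = zip(*records)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in records)
    slope = sxy / sxx
    return slope, my - slope * mx


def predict(model, x):
    """Use the model: estimate y for a new x."""
    slope, intercept = model
    return slope * x + intercept


clean = prepare(raw)
summary = explore(clean)
model = fit_line(clean)
print(summary)
print(predict(model, 6.0))
```

Each function maps to one duty, so a step can be rerun or replaced on its own when the analysis is revisited.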
The data scientist doesn’t work solo. In fact, the most effective data science is done in teams. In addition to a data scientist, this team might include a business analyst who defines the problem, a data engineer who prepares the data and manages how it is accessed, an IT architect who oversees the underlying processes and infrastructure, and an application developer who deploys the models or outputs of the analysis into applications and products.
Organizations are using data science teams to turn data into a competitive advantage by refining products and services. For example, companies analyze data collected from call centers to identify customers who are likely to churn, so marketing can take action to retain them. Logistics companies analyze traffic patterns, weather conditions, and other factors to improve delivery speeds and reduce costs. Healthcare companies analyze medical test data and reported symptoms to help doctors diagnose diseases earlier and treat them more effectively.
Most companies have made data science a priority and are investing in it heavily. In Gartner's recent survey of more than 3,000 CIOs, respondents ranked analytics and business intelligence as the top differentiating technology for their organizations. The CIOs surveyed see these technologies as the most strategic for their companies; therefore, they are attracting the most new investment.
The process of analyzing and acting upon data is iterative rather than linear, but this is how the work typically flows for a data modeling project:
The data science process is typically overseen by three types of manager:
Despite the promise of data science and huge investments in data science teams, many companies are not realizing the full value of their data. In their race to hire talent and create data science programs, some companies have experienced inefficient team workflows, with different people using different tools and processes that don’t work well together. Without more disciplined, central management, executives might not see a full return on their investments. This chaotic environment presents many challenges.
Data scientists can’t work efficiently. Because access to data must be granted by an IT administrator, data scientists often face long waits for the data and resources they need to analyze it. Once they have access, the data science team might analyze the data using different and possibly incompatible tools. For example, a scientist might develop a model using the R language, but the application it will be used in is written in a different language. This is why it can take weeks, or even months, to deploy the models into useful applications.
Application developers can’t access usable machine learning. Sometimes the machine learning models that developers receive must be recoded or are not ready to be deployed in applications. And because access points can be inflexible, models can’t be deployed in all scenarios and scalability is left to the application developer.
IT administrators spend too much time on support. Because of the proliferation of open source tools, IT has an ever-growing list of tools to support. A data scientist in marketing, for example, might be using different tools than a data scientist in finance. Teams might also have different workflows, which means IT must continually rebuild and update environments.
Business managers are too removed from data science. Data science workflows are not always integrated into business decision-making processes and systems, making it difficult for business managers to collaborate knowledgeably with data scientists. Without better integration, business managers find it difficult to understand why it takes so long to go from prototype to production, and they are less likely to back the investment in projects they perceive as too slow.
Companies realized that without an integrated platform, data science work was inefficient, insecure, and difficult to scale. This realization led to the emergence of data science platforms. These platforms are software hubs around which all data science work takes place. A good platform alleviates many of the challenges of implementing data science and helps businesses turn their data into insights faster and more efficiently.
With a centralized platform, data scientists can work in a collaborative environment using their favorite open source tools, with all their work synced by a version control system.
A data science platform reduces redundancy and drives innovation by allowing teams to share code, results, and reports. It removes bottlenecks in the flow of work by simplifying management and using open source tools, frameworks, and infrastructure.
For example, a data science platform might allow data scientists to deploy models as APIs, making it easy to integrate them into different applications. Data scientists can access tools, data, and infrastructure without having to wait for IT.
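To make "deploying a model as an API" concrete, here is a minimal sketch using only the Python standard library. It is not how any particular platform works: the `MODEL` coefficients, the feature names, and the `serve` helper are hypothetical, and a production service would add authentication, input validation, and logging.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical trained model: coefficients of a simple linear score.
MODEL = {"intercept": 0.5, "weights": {"calls": 0.2, "tenure": -0.1}}


def score(features):
    """Apply the linear model to a dict of feature values."""
    return MODEL["intercept"] + sum(
        weight * float(features.get(name, 0.0))
        for name, weight in MODEL["weights"].items()
    )


def handle_json(body):
    """Turn a JSON request body into a JSON response body."""
    features = json.loads(body)
    return json.dumps({"score": score(features)})


class PredictHandler(BaseHTTPRequestHandler):
    """Answer POST requests with the model's score for the posted features."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = handle_json(self.rfile.read(length)).encode("utf-8")
            status = 200
        except (ValueError, TypeError, AttributeError):
            # Malformed JSON or a body that is not a JSON object of numbers.
            payload = json.dumps({"error": "invalid request"}).encode("utf-8")
            status = 400
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Start the API on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the endpoint, after which any application that can send an HTTP POST with a JSON body can request a score, whatever language it is written in.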
Demand for data science platforms has exploded. In fact, the platform market is expected to grow at a compound annual growth rate of more than 39 percent over the next few years and is projected to reach US$385 billion by 2025.
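For readers unfamiliar with compound annual growth, the arithmetic can be sketched in a few lines of Python. The 39 percent rate comes from the text above; the starting value of 100 and the five-year horizon are arbitrary placeholders, not market figures.

```python
def project(value, annual_rate, years):
    """Value after compounding at annual_rate for the given number of years."""
    return value * (1 + annual_rate) ** years


def cagr(start, end, years):
    """Compound annual growth rate implied by a start and end value."""
    return (end / start) ** (1 / years) - 1


baseline = 100.0  # placeholder index value, not a real market size
after_five = project(baseline, 0.39, 5)
print(after_five)
print(cagr(baseline, after_five, 5))
```

At this rate a quantity grows to more than five times its starting size in five years, which is why a compound rate sounds more modest than the total growth it implies.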
If you’re ready to explore the capabilities of data science platforms, there are some key capabilities to consider:
Finding and recruiting talent is the biggest barrier that companies face when they want to use data science for competitive advantage. In a recent McKinsey & Company survey, half of executives across geographies and industries reported greater difficulty in recruiting analytical talent than any other kind of skill. Retention is also a problem according to 40 percent of those surveyed.
In addition to data scientists, McKinsey reports that there are shortages in other analytics categories. In particular, there are shortages of skilled workers who can translate between business problems and the proper application of data science, and workers who are skilled at data visualization.
Indeed.com, Glassdoor, and Bloomberg provide further proof that there is significant demand for data science talent:
Artificial intelligence (AI) enables technology and machines to process data to learn, evolve, and execute human tasks.
Machine learning, a subset of artificial intelligence (AI), focuses on building systems that learn from data, with the goal of automating and speeding up decisions and accelerating time to value.
Machine learning, artificial intelligence, and data science are changing the way businesses approach complex problems to alter the trajectory of their respective industries.