While companies create data products specific to their own requirements and goals, some steps in the value chain are consistent across organizations.
by Gwen Shapira
The term “data scientist” evokes images of a single genius working alone, applying esoteric formulas to vast amounts of data in search of useful insights. But this is only one step of a process. Data analysis is not a goal in itself; the goal is to enable the business to make better decisions. Data scientists must build products that allow everyone in the organization to use data better, enabling data-driven decision making in every department and at every level.
The data value chain is captured in products that automatically collect, clean and analyze data, delivering information and predictions to executive dashboards or reports. Analysis runs automatically and continuously as new data arrives, and the data scientists can work with the business on refining the models and improving prediction accuracy.
While each company creates data products specific to its own requirements and goals, some steps in the value chain are consistent across organizations:
Decide on the objectives: The first step of the data value chain must happen before there is data: the business unit has to decide on objectives for the data science teams. These objectives usually require significant data collection and analysis. Since we are looking at data to drive decision-making, we need a measurable way to know if the business is advancing toward its goals. Key metrics or performance indicators must be identified early in the process.
Identify business levers: The business should make changes to improve the key metrics and reach its goals. If there is nothing that can be changed, there can be no improvement regardless of how much data is collected and analyzed. Identifying the goals, metrics and levers early gives the project direction and avoids meaningless data analysis. For example, the goal can be improving customer retention, one of the metrics can be the percentage of customers renewing their subscriptions, and the business levers can be the design of the renewal page, the timing and content of reminder emails, and special promotions.
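To make the retention example concrete, here is a minimal sketch of how the renewal metric might be computed and compared across two settings of a single lever. The Subscription record, the renewal_rate function and the sample data are all hypothetical, invented for illustration; a real product would read these records from the company's own subscription system.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    customer_id: str
    renewed: bool  # True if the customer renewed at the end of the term


def renewal_rate(subscriptions):
    """Return the percentage of subscriptions that were renewed."""
    if not subscriptions:
        raise ValueError("no subscriptions to measure")
    renewed = sum(1 for s in subscriptions if s.renewed)
    return 100.0 * renewed / len(subscriptions)


# Hypothetical data: the same metric measured for two versions of the
# renewal page (one of the business levers).
current_page = [
    Subscription("c1", True),
    Subscription("c2", False),
    Subscription("c3", True),
    Subscription("c4", False),
]
redesigned_page = [
    Subscription("c5", True),
    Subscription("c6", True),
    Subscription("c7", False),
    Subscription("c8", True),
]

print("current page:   ", renewal_rate(current_page))
print("redesigned page:", renewal_rate(redesigned_page))
```

With so few records the difference means nothing statistically; the point is only that a metric tied to a lever gives the team a number to move.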
Data collection: Cast a wide net for data. More data, especially data from more diverse sources, enables finding better correlations, building better models and finding more actionable insights. The economics of big data mean that while individual records are often useless, having every record available for analysis can provide real value. Companies are instrumenting their websites to closely track user clicks and mouse movements and attaching RFID tags to products to track their movement through stores, while coaches are attaching sensors to athletes’ bodies to track the way they move.
Data cleaning: The first step in data analysis is to improve data quality. Data scientists correct spelling mistakes, handle missing data and weed out nonsense information. This is the most critical step in the data value chain: even with the best analysis, junk data will generate wrong results and mislead the business. More than one company has been surprised to discover that a large percentage of its customers live in Schenectady, NY, a rather small city with a population of fewer than 70,000 people. However, Schenectady has zip code 12345, so it is disproportionately represented in almost every customer profile database, since consumers are often reluctant to enter their real details into online forms. Analyzing this data will lead to erroneous conclusions unless the data analysts take steps to validate and clean it. It is especially important that this step scales, since a continuous data value chain requires that incoming data be cleaned immediately and at very high rates. This usually means automating the process, but it doesn't mean humans can't be involved.
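As an illustration of the kind of automated check the Schenectady example calls for, the sketch below classifies incoming zip codes as ok, missing, invalid or suspect. The list of suspect values and the status names are assumptions made for this example, not an established standard. Because 12345 is also a real zip code, such records are flagged for review rather than discarded, which is one place where humans can stay involved.

```python
import re
from collections import Counter

# Assumed placeholder values that people commonly type into forms.
# 12345 is a real zip code too, so these are flagged, not deleted.
SUSPECT_ZIPS = {"12345", "00000", "99999"}
FIVE_DIGITS = re.compile(r"[0-9]{5}")


def clean_zip(raw):
    """Return (zip_or_None, status) for one raw form value.

    status is one of 'ok', 'missing', 'invalid' or 'suspect'.
    """
    if raw is None or not raw.strip():
        return None, "missing"
    value = raw.strip()
    if not FIVE_DIGITS.fullmatch(value):
        return None, "invalid"
    if value in SUSPECT_ZIPS:
        return value, "suspect"
    return value, "ok"


# Hypothetical batch of incoming values.
incoming = [" 12345 ", "94107", "1234", None, "12345", "10001-"]
summary = Counter(status for _, status in map(clean_zip, incoming))
print(summary)
```

Tracking the share of suspect and invalid values over time gives an early warning when a form or an upstream feed starts producing junk.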
Data modeling: Data scientists build models that correlate the data with business outcomes and recommend changes to the levers identified earlier. This is where the unique expertise of data scientists becomes critical to business success. They must have a strong background in statistics and machine learning to build scientifically accurate models and avoid two traps: meaningless correlations, and overfitted models that rely so heavily on existing data that their future predictions are useless. But a statistical background is not enough; data scientists need to understand the business well enough to recognize whether the results of the mathematical models are meaningful and relevant.
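The trap of a model that leans too heavily on existing data can be shown with a small, deliberately extreme sketch. It compares a lookup-table "memorizer" with a simple least-squares line on synthetic data, and scores both on records they were trained on and on held-out records. The data, the function names and the choice of models are all assumptions made for illustration, and the noise is a fixed pattern rather than a random draw so the example behaves the same on every run.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(points):
    """Remember every training pair; fall back to the mean for unseen x."""
    table = {x: y for x, y in points}
    fallback = sum(y for _, y in points) / len(points)
    return lambda x: table.get(x, fallback)


def mse(model, points):
    """Mean squared error of a model over a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


# Synthetic data: a linear trend plus a fixed pseudo-noise pattern.
data = [(x, 2 * x + ((x * 7) % 11) - 5) for x in range(100)]
train, holdout = data[0::2], data[1::2]  # even x for training, odd x held out

line = fit_line(train)
memorizer = fit_memorizer(train)

for name, model in (("line", line), ("memorizer", memorizer)):
    print(name, "train:", mse(model, train), "holdout:", mse(model, holdout))
```

The memorizer is perfect on the data it has seen and far worse than the line on data it has not, which is why models should be judged on held-out data rather than on the records used to build them.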
Grow a data science team: Since data scientists are notoriously difficult to hire, it’s a good idea to build a team that allows those with an advanced degree in statistics to focus on data modeling and predictions. Others on the team (qualified infrastructure engineers, software developers and ETL experts) build the data collection infrastructure, data pipeline and data products that stream the data through the models and display the results to the business in the form of reports and dashboards. These teams typically use large-scale data analysis platforms like Hadoop to automate the data collection and analysis and run the entire process as a product.
Optimize and repeat: The data value chain is a repeatable process and leads to continuous improvements, both to the business and to the data value chain itself. Based on the results of the model, the business will make changes to the driving levers and the data science team will measure the effect. Based on those measurements, the business can decide on further action while the data science team improves its data collection, data cleanup and data models. The faster the business can repeat the process, the sooner it can make course corrections and get value out of the data. Ideally, after multiple iterations, the model will generate accurate predictions, the business will reach the predefined goals, and the resulting data value chain will be used for monitoring and reporting as everyone moves on to solve the next business challenge.
Gwen Shapira is a solutions architect at Cloudera and an Oracle ACE Director.