The new big data technologies provide new ways to process data efficiently and find the value hidden in it.
by Gwen Shapira, July 2012
Big data may be the latest trend in business technology, but it can also mean a big headache. When asked about their priorities, CIOs talk about “reducing costs,” “adding strategic value to the business,” and “improving customer satisfaction.” They are unlikely to mention a desire to store, manage, and analyze massive amounts of data. Big data is a tool, not the end goal, and it is not a particularly easy tool for an organization to adopt.
Big data requires massive amounts of storage space, the adoption of new technologies, and significant processing power. All of these can be expensive and difficult to justify in a typically constrained IT budget. Big data is also a relatively new development in IT operations, which increases the risk of exceeded budgets, missed project deadlines, and disappointing results. This increased risk can keep businesses from adopting big data methodologies.
So why are many companies in a wide range of industries so eager to adopt big data as part of their IT strategy? With big data technologies, companies can take advantage of data they were unable to analyze in the past: Web logs, industrial sensor readings, and mobile and social media information. These types of data don’t fit well within traditional enterprise data warehouse and business intelligence tools—they are unstructured, require large volumes of storage space, and arrive at very high rates.
Big data technologies support the analysis of this previously untapped data. They are designed to be more efficient in analyzing large volumes of unstructured data. Once the data is processed and aggregated, it can be integrated within the existing enterprise infrastructure and can be used to improve customer relations, spot business trends, target online marketing, and find new sources of customers.
At Pythian, we noticed four major use cases where our customers leverage big data:
1. Analyze customer behavior. A large website such as Facebook generates 25 terabytes of logs each day. Even much smaller websites typically generate many gigabytes of web log data every day. These logs contain valuable information about customer behavior. By analyzing this information, managers can find out how many users use each feature of the site, how small changes to the home page design can generate more sales, and how a small bug fix can increase the number of return visitors. They can also chart the path each customer takes through the online store to figure out what drives or delays sales.
Without big data analysis systems, analyzing data at these volumes is prohibitively expensive. Using Hadoop, the most popular big data analysis platform, companies can store and analyze the data cost-effectively and load the results into the enterprise data warehouse for traditional analysis in conjunction with sales and customer relations data from enterprise databases.
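The kind of log analysis described above is often written as a MapReduce job. The sketch below shows the map and reduce logic in plain Python, the way a Hadoop Streaming job would structure it; the log format (a simplified Apache-style line with the request path in the sixth field) and all names here are hypothetical, not taken from the article.

```python
# Sketch of MapReduce-style page-hit counting over web logs.
# In a real Hadoop Streaming job, the mapper and reducer would read
# stdin and write stdout; here they are plain functions for clarity.

def map_hits(log_lines):
    """Mapper: emit (request_path, 1) for each log line."""
    for line in log_lines:
        fields = line.split()
        if len(fields) > 5:
            yield fields[5], 1  # request path, e.g. "/checkout"

def reduce_hits(pairs):
    """Reducer: sum the counts for each page."""
    totals = {}
    for page, count in pairs:
        totals[page] = totals.get(page, 0) + count
    return totals

# Hypothetical simplified log lines (client - - [date] "METHOD path PROTO" status bytes)
logs = [
    '10.0.0.1 - - [01/Jul/2012] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jul/2012] "GET /checkout HTTP/1.1" 200 128',
    '10.0.0.1 - - [01/Jul/2012] "GET /home HTTP/1.1" 200 512',
]
hit_counts = reduce_hits(map_hits(logs))
```

Hadoop's value is that the same two-step logic runs unchanged whether the input is three lines or terabytes per day, with the framework handling distribution.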
2. Recommendation systems. In 2009, Netflix awarded a US$1,000,000 prize to the team that improved Netflix’s own movie recommendation system by 10 percent. By improving the accuracy of its recommendations, Netflix increased the likelihood that customers would follow the recommendations and order more movies online.
Other online retailers also make purchase recommendations—Amazon’s website will recommend books you are likely to enjoy, Zappos’s online advertisements will use your purchase history to feature products you are likely to buy, and Nordstrom will personalize email marketing with recommendations.
Making good recommendations is not simple. There are many data sources to take into account: purchase history, product ratings, products a visitor looked at but didn’t buy, products friends bought, responses to previous recommendations, products bought by people in a similar geographic region, products purchased by people with similar tastes, and so on. The possibilities are endless, and the main lesson learned from the Netflix challenge is that using more data improves results much faster than using smarter algorithms.
By using enterprise data stores that integrate with the R statistical language, companies can rapidly develop and improve their recommendation algorithms using data already stored in the enterprise data warehouse and integrating it with unstructured data stored in Hadoop and other NoSQL data sources. This eliminates the need to copy massive amounts of data, reduces the cost of the IT infrastructure required to support the project, and allows our customers to quickly deploy improved recommendations on their websites.
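To make the "more data beats smarter algorithms" point concrete, here is a deliberately simple co-purchase recommender: it scores candidate items by how much the purchase histories of other shoppers overlap with yours. The data and names are hypothetical toy inputs, and real systems combine many more signals, as the article lists above.

```python
from collections import Counter

def recommend(purchases, user):
    """Rank items co-purchased by shoppers with similar histories.

    purchases: dict mapping user -> set of purchased items (toy data).
    Items another shopper bought are weighted by how many purchases
    that shopper shares with `user` -- a crude similarity measure.
    """
    owned = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(owned & items)
        if overlap:
            for item in items - owned:
                scores[item] += overlap
    return [item for item, _ in scores.most_common()]

# Hypothetical purchase histories
purchases = {
    "ann": {"book_a", "book_b"},
    "bob": {"book_a", "book_b", "book_c"},
    "carol": {"book_b", "book_d"},
}
suggestions = recommend(purchases, "ann")
```

The algorithm itself is trivial; what improves its output is feeding it more histories, which is exactly where scalable storage such as Hadoop earns its keep.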
3. Improve connection with your customers. One of our customers specializes in integrating information from social media with the traditional customer support infrastructure. Their system follows conversations in social media sites such as Twitter, filters them for relevance for their customer companies, and processes the content of the conversation using natural language algorithms. Whenever a tweet is determined to be a complaint or request for help, their system can automatically create a customer support ticket that can be managed and tracked using traditional customer support systems.
Finding relevant information in petabytes of text that is typically full of misspellings and ungrammatical sentences requires large storage systems, powerful text-mining algorithms, and seamless integration into the corporate customer relationship management (CRM) system. Big data technologies supply the scalable storage and processing power, and data integration solutions are typically custom-developed using integration frameworks from big data vendors. Our customers just supply the brilliant algorithms.
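As a rough illustration of the filter-and-ticket flow described above, the sketch below uses a naive keyword match where a production system would use trained natural-language classifiers; the keyword list, ticket fields, and function names are all hypothetical.

```python
# Naive stand-in for the NLP filtering step: flag tweets containing
# complaint-like terms. Real systems use trained text classifiers.
COMPLAINT_TERMS = {"broken", "refund", "help", "terrible", "not working"}

def is_support_candidate(tweet):
    """Return True if the tweet looks like a complaint or help request."""
    text = tweet.lower()
    return any(term in text for term in COMPLAINT_TERMS)

def to_ticket(tweet, author):
    """Build a CRM ticket record; field names here are hypothetical."""
    return {
        "source": "twitter",
        "author": author,
        "body": tweet,
        "status": "open",
    }

tweet = "My order arrived broken, I need a refund!"
ticket = to_ticket(tweet, "@unhappy_user") if is_support_candidate(tweet) else None
```

The hard parts the article points to, which are misspellings, ungrammatical text, and petabyte scale, are exactly what this keyword approach cannot handle, which is why the customer's value lies in the algorithms layered on top of the big data infrastructure.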
Text analysis is not just for cutting-edge customer support. It is used by Thomson Reuters to process legal documents and power its search engine, and it is used by media publishers to process job postings and blogs to identify the most recent trends.
4. Speed up the business insight cycle. A different customer struggled with a large overnight job that processed large amounts of data for a report due to the chief technology officer by 8 a.m. the next day. Over time, the volume of data and the complexity of the report grew until the job took more than 12 hours to complete, degrading production performance.
The developers of the report decided to move it from their transactional database to a more scalable Hadoop cluster in order to speed up data processing and reduce the impact on the production system. To support the new process, relevant data is extracted each night from the transactional database into structured files in Hadoop and processed by Hadoop jobs, and the results are loaded back into the relational database, where the morning report is generated and distributed.
The new job takes less than two hours to process the data, with most of that time spent loading and extracting the data. As data volumes and business requirements continue to grow, the customer can keep adding servers to the Hadoop cluster to hold processing times down and always distribute the report on time. In addition, faster report generation and the reduced impact on production allow the customer to generate the report multiple times each day. This, in turn, keeps business executives up to date with the latest information and allows them to respond faster to operational changes.
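The nightly extract-process-load cycle described above can be sketched as three stages. In this toy version the stages are plain Python functions over a few rows; in the customer's actual setup the extract and load steps would talk to the transactional database and the processing step would run as Hadoop jobs over files in HDFS. The record layout and function names are assumptions for illustration.

```python
# Sketch of the nightly extract -> process (Hadoop) -> load cycle.

def extract(transactions):
    """Dump the day's rows as tab-separated records (DB -> HDFS files)."""
    return [f"{t['region']}\t{t['amount']}" for t in transactions]

def process(records):
    """Aggregate totals per region -- the work offloaded to Hadoop jobs."""
    totals = {}
    for rec in records:
        region, amount = rec.split("\t")
        totals[region] = totals.get(region, 0.0) + float(amount)
    return totals

def load(totals):
    """Produce sorted rows ready to insert into the reporting tables."""
    return sorted(totals.items())

# Hypothetical day of transactional data
day = [
    {"region": "east", "amount": 10.0},
    {"region": "west", "amount": 5.0},
    {"region": "east", "amount": 2.5},
]
report_rows = load(process(extract(day)))
```

Because the middle stage does the heavy lifting, scaling it is a matter of adding nodes to the cluster, which is why the customer can keep the report on schedule as volumes grow.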
In many cases, big data deals with data the organization already collects but cannot process in a cost-effective manner. The new big data technologies provide new ways to process the data efficiently and find the value hidden in it. The platforms and systems to enable new types of data processing already exist—the challenge is in leveraging them to bring value to your business, as demonstrated by our customers.
Gwen Shapira is a senior consultant at Pythian and an Oracle ACE Director.