To make AI results more relevant, businesses turn to “last-mile” training

By fine-tuning large language models using specialized data, banks, hospitals, and others are boosting AI's accuracy—without the cost of training from scratch.

Aaron Ricadela | September 8, 2023

As powerful as generative AI systems have become, they can still stumble on the most complex and specialized cases—such as those presented by banks, pharma companies, publishers, hospitals, and mega retailers. Training large language models (LLMs) from scratch to learn the nuances of a specialized field can run into millions of dollars a month.

Instead, savvy companies are increasingly fine-tuning generative AI models to hit higher accuracy levels than they’re capable of off the shelf. Using techniques known as “last-mile” training, they’re feeding the models modest amounts of their own data so the systems can excel at industry-specific tasks—without breaking the bank on computing costs. By augmenting the general knowledge that models have learned with companies’ private financial, scientific, or sales data, the fine-tuners can customize models for a fraction of the cost and computing resources needed to train from the ground up.

“These models are foundations—they’re incredible starting points, and some problems they solve out of the box,” says Alex Ratner, CEO of Snorkel AI, whose software, Snorkel Flow, is used by banks, pharma, and tech companies to automate labeling of training data and speed up AI software development. “But in most enterprise settings with complex data and sophisticated use cases, you need to fine-tune and customize the model. You can’t tweak the model architecture or fiddle with the algorithms. The difference is the data you feed them.”

Supersized AI models’ ability to aid medical diagnoses, comb market analyses, or parse conversations is upending how businesses apply computing power to their most important work. Generative AI is named for its ability to draft written summaries, blog posts, or press releases, create images, or write working code. Building the AI models involves training complex statistical systems for millions of hours on data sets drawn largely from the web, so they learn how language works and can apply that knowledge to new areas. The resulting models are unleashing efficiency and accuracy gains that may add trillions of dollars to the global economy during the next decade.

Last-mile training strategies help generative AI systems tackle specialized jobs for dramatically less cost than tuning billions of parameters in the public cloud.

Until recently, AI performance gains came mostly from bulking up the billions of parameters in an AI model that weight the importance of relationships gleaned from the input data. Yet the millions of dollars it costs to train large language models and adjust their billions of parameters (OpenAI’s GPT-4 reportedly uses 1.76 trillion of them) are prohibitive for most businesses.

“There was a race to compete for more and more parameters,” says Jaron Waldman, chief product officer at AI model company Cohere. “You quickly get into realms where it’s so expensive to train and so slow to serve.”

So banks, pharma companies, publishers, hospitals, and retailers are leaving most parameters untouched, letting LLM purveyors, such as OpenAI, Cohere, or open source providers, handle the first-mile training, then fine-tuning generative AI models with a smattering of their own data.

The last-mile strategies that help systems excel at specialized jobs can cost as little as $25,000, using eight cloud-based GPU-powered servers over a few days—or even hours—and yield marked accuracy gains. Costs vary greatly, of course, with a model’s complexity and accuracy requirements. Still, they’re dramatically less than the roughly $2.5 million it would take to fully train a model of even 65 billion or 70 billion parameters using hundreds of GPUs from a public cloud provider.

“We can take these models trained on the general internet and adapt them to do things specific to your business,” says Greg Pavlik, senior vice president for AI at Oracle, which is working with companies in retail, healthcare, and pharmaceuticals to fine-tune AI models using its generative AI services. “What is the art of the possible today? The art of the possible is pretty freaking cool.”

Summaries in seconds

Oracle invested in generative AI company Cohere in June and is building AI services on Oracle Cloud Infrastructure (OCI) that use Cohere models. Oracle AI teams are huddled with key customers on last-mile training scenarios that refine models using their own data.

They’re working with a large retailer on a model that could summarize a customer’s chatbot session in seconds when an agent picks up the phone to take over the conversation. A project with a United States cancer treatment center aims to extract information from oncologist, chemotherapist, and nutritionist notes, store it in a structured database, and predict the chance a patient may suffer a recurrence or need emergency room care. Oracle is also working with a European pharmaceutical company to summarize drugs’ effectiveness in clinical trials for regulatory reporting.

Microsoft in July said its Copilot generative AI software will be able to add businesses’ meeting transcripts, emails, and chats to pretrained AI models to generate summaries or combine companies’ internal data with information on the web to create strengths, weaknesses, opportunities, and threats (SWOT) analyses. OpenAI has introduced fine-tuned language models that can call other programs to complete tasks such as emailing a colleague, extracting structured data from text, or converting natural language into database queries.
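The tool-calling pattern described above can be sketched in a few lines: the model emits a structured call naming a function and its arguments, and application code dispatches it. The tool names, argument schema, and hard-coded model output below are illustrative stand-ins, not OpenAI's actual function-calling format.

```python
import json

def send_email(to, subject):
    """Hypothetical tool the application exposes to the model."""
    return f"emailed {to}: {subject}"

def to_sql(question):
    """Hypothetical natural-language-to-query tool."""
    return f"-- query derived from: {question}"

TOOLS = {"send_email": send_email, "to_sql": to_sql}

# Stand-in for a fine-tuned model's response: instead of free text, it names
# a tool and supplies JSON-encoded arguments for the application to execute.
model_output = '{"name": "send_email", "arguments": {"to": "ana@example.com", "subject": "Q3 summary"}}'

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # → emailed ana@example.com: Q3 summary
```

The model never runs the email or database code itself; it only chooses the tool and fills in the arguments, which keeps the execution under the application's control.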

Even if companies refine LLMs based on their data, there are still reasons to tread carefully when implementing them, including privacy, intellectual property, and reputation concerns. The models can memorize consumers’ personal information and ingest code during training, then repeat it later. They’re also prone to returning biased or erroneous information that isn’t acceptable when a business’s reputation, patients’ health, or customers’ money is on the line. The panoply of open source language models gaining popularity can pose licensing challenges. These kinds of problems are exactly what fine-tuning has to iron out.

“These models are trained to get a prompt and produce a statistically plausible output,” says Ratner. “Surprising behavior is part of what you have to deal with in the last mile.”

Frozen weights

Companies that get it right could reap huge gains. Generative AI could lift global GDP by 7%—or nearly $7 trillion—over the next 10 years by increasing office work productivity, hastening drug discovery, and speeding software development, an April Goldman Sachs report predicts. McKinsey & Co. estimated in June that three-quarters of generative AI’s expected value will come from four business areas: customer service, sales and marketing, software engineering, and R&D. Generative AI and related technologies could eventually automate work that now takes up 60% to 70% of employees’ time and add as much as $4.4 trillion to global GDP, as half of today’s work becomes automated between 2030 and 2060, the consultancy said.

Data scientists are employing techniques called instruction tuning and reinforcement learning from human feedback (RLHF) to show neural networks fresh examples of how humans (or machines) label data. They use parameter-efficient fine-tuning (PEFT) to select which parameters to change. In instruction tuning, teams create a data set of instructions and their correct responses, using those to teach an LLM to follow similar demonstrations at inference time. Reinforcement learning expands on the approach by creating a “reward model” with human preferences that further refines the network. PEFT techniques can lower computing costs by integrating a small number of new parameters into a large model, training only the new parameters to improve problem-solving ability.
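As an illustration, an instruction-tuning data set is essentially a collection of instruction/response pairs flattened into training prompts. The field names and prompt template below follow a common community convention and are illustrative, not any specific vendor's required schema.

```python
# Illustrative instruction-tuning records: each pairs an instruction
# (optionally with input context) with the response the model should learn.
instruction_data = [
    {
        "instruction": "Summarize the customer's chat session in one sentence.",
        "input": "Customer asked twice about a delayed refund and was escalated.",
        "output": "The customer is waiting on a delayed refund and has been escalated to an agent.",
    },
    {
        "instruction": "Classify the sentiment of this support message.",
        "input": "I've been on hold for an hour.",
        "output": "displeased",
    },
]

def to_training_example(record):
    """Flatten one record into the prompt/completion pair used at training time."""
    prompt = f"### Instruction:\n{record['instruction']}\n"
    if record.get("input"):
        prompt += f"### Input:\n{record['input']}\n"
    prompt += "### Response:\n"
    return {"prompt": prompt, "completion": record["output"]}

training_examples = [to_training_example(r) for r in instruction_data]
```

At fine-tuning time the model is trained to continue each prompt with its completion, which is how a few dozen curated pairs can steer an already-trained model toward a company-specific task.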

Separate from these training approaches, a method called retrieval augmented generation (RAG) creates enterprise applications from models that pull information from stored documents formatted for AI analysis to formulate a more specialized, current answer, beyond what they learned in training.
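A minimal sketch of that retrieval flow: fetch the stored documents most relevant to a query, then prepend them to the prompt so the model answers from current, company-specific material rather than only from its training data. Production RAG systems rank documents with vector similarity search; the word-overlap scorer here is an illustrative stand-in.

```python
def retrieve(query, documents, k=1):
    """Rank stored documents by word overlap with the query. A real RAG
    system would use embeddings and vector similarity search instead."""
    q = set(query.lower().split())
    ranked = sorted(documents, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, documents):
    """Prepend retrieved context so the answer is grounded in the documents."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# Hypothetical enterprise documents formatted for AI analysis.
docs = [
    "Refund policy: refunds are processed within five business days.",
    "Shipping: orders ship from the warehouse within 24 hours.",
]
prompt = build_prompt("How long do refunds take to process?", docs)
```

Because the documents are looked up at answer time, RAG can keep responses current without retraining any of the model's weights.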

By “freezing” nearly all of a model’s weights and retraining only a few, on as few as a few dozen or a few hundred additional examples, companies can materially boost accuracy for analyzing financial reports, aiding oil and gas discovery, reviewing call center transcripts, or scouring medical records for a cancer-causing protein. Of course, more training may be needed to reach acceptable levels of accuracy. And heftier problems, such as fine-tuning a model on all the medical records in a large hospital chain, may require thousands of additional examples.

The efficiency gains are tangible. Oracle has reported retraining 1% of the weights in a 176-billion parameter model using eight GPUs and 320 GB of memory, compared with hundreds of GPUs over days for a full fine-tuning of all a model’s parameters—or weeks for training from scratch. A PEFT technique from Microsoft called LoRA can cut GPU memory requirements threefold and the number of trainable parameters by 10,000 times compared with fine-tuning all of a 175-billion parameter model’s weights.
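The arithmetic behind those savings can be sketched with a toy LoRA-style adapter: the pretrained weight matrix is frozen, and only two small low-rank matrices are trained. The dimensions, initialization, and rank below are illustrative, not any vendor's configuration.

```python
import numpy as np

d = 1024  # hidden dimension of one pretrained weight matrix (illustrative)
r = 8     # adapter rank: only the two small matrices below are trained

W = np.random.randn(d, d)         # pretrained weight, frozen during fine-tuning
A = np.random.randn(r, d) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))              # trainable, zero-initialized so the adapter
                                  # starts with no effect on the output

def forward(x):
    # Effective weight is W + B @ A; gradients flow only through A and B.
    return x @ W.T + x @ (B @ A).T

full_params = d * d      # parameters updated by full fine-tuning of this layer
lora_params = 2 * d * r  # parameters updated by the adapter
print(full_params // lora_params)  # → 64: far fewer trainable parameters
```

The ratio grows with the layer size, which is why, across all the layers of a 175-billion parameter model, the trainable-parameter reduction can reach the orders of magnitude the article cites.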

“If there’s some accuracy bar at a large company that’s not being hit, we need a tool to drive the model in that direction” for the customer, says Cohere’s Waldman. “With relatively few examples, the model itself becomes malleable to hit the accuracy you need for an enterprise.”

Waldman and his colleagues have seen targeted training deliver. For example, Cohere’s mainstay Command model last year struggled to correctly respond to prompts asking it to encode data with JSON, used to exchange data between applications on the web. Cohere’s team fed the 50-billion parameter model 50 new examples, and by the next day the AI system was wrapping JSON objects flawlessly. “That was really powerful, and it drove all the people building LLMs toward careful curation of data,” Waldman says.

The company has a software development kit that lets customers do that kind of refinement themselves—while keeping the results private.

Specialized data sets

To be sure, LLMs already have a native ability to give plausible answers to novel questions with the help of just a few examples, a phenomenon called few-shot learning, which lets users show models new examples at inference time. That ability improves as models grow in size. Experts can also shape LLMs’ responses through so-called prompt engineering, in which machine learning specialists craft precise instructions at runtime to lift performance.

That’s often enough refinement without last-mile training if the task is something such as summarizing articles or broadly classifying chat messages as belonging to happy or displeased customers, says Jun Qian, a vice president of AI development at Oracle. “Fine-tuning is a way to customize your model if few-shot isn’t enough.”
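The few-shot pattern amounts to assembling labeled demonstrations into the prompt itself, with no weight updates involved. A minimal sketch, using the chat-sentiment task as the example (the messages and labels are invented):

```python
def few_shot_prompt(labeled_examples, query):
    """Build an in-context ('few-shot') prompt: demonstrations are shown to
    the model at inference time; none of its weights change."""
    demos = "\n".join(f"Message: {m}\nSentiment: {s}" for m, s in labeled_examples)
    return f"{demos}\nMessage: {query}\nSentiment:"

shots = [
    ("Thanks, that solved it!", "happy"),
    ("I've been on hold for an hour.", "displeased"),
]
prompt = few_shot_prompt(shots, "This is the third time my order is late.")
```

The model is expected to continue the pattern and emit a label after the final "Sentiment:"; when that isn't accurate enough, fine-tuning picks up where the prompt leaves off.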

Companies with large, specialized data sets and deep coffers can also just train from scratch. That’s what financial data and news provider Bloomberg LP did with BloombergGPT, a 50-billion parameter LLM trained for 1.3 million hours on 512 GPUs in order to more accurately interpret market news and company reports than generalist models could.

Commercial, government, and university efforts have applied last-mile training to open source AI models. Meta in July released its Llama 2 model with a commercial license aimed at professional developers; an earlier, leaked version led to offshoots from Stanford University, the University of California at Berkeley, and others, whose GPL licenses may not be palatable to enterprises. The Falcon 40B model, available with an Apache 2.0 license aimed at commercial use, was developed by the emirate of Abu Dhabi’s government technology research council.

Companies also face costs once they put models into production. They need to license APIs from model vendors for access, and most will turn to cloud computing providers for training capacity, letting them add or drop computing as usage dictates. Cloud providers are happy to host the training work, but they expect inference—serving up answers to users—to become the more lucrative business over time.

“Even a few months ago, the assumption was that you’re going to be continually retraining,” says Pavlik. “People will do that, but that’s not where the mass market is. We want to bring these models closer to customers’ businesses.”
