For Matt Thomson, a pioneering researcher at the California Institute of Technology, developing cancer treatments is mostly a big data challenge: applying machine learning (ML) models to patient data at massive scale to design new therapies for the hardest-to-cure tumors.
“We know that if we can harness the body's own immune system and get it to attack a tumor, we can cure cancer,” Thomson says. “But for some of the worst cancers, this kind of strategy doesn't work. So now we're using machine learning to look at all the data associated with patients where this works or doesn't work, and then design new therapies.”
Thomson is the principal investigator for Caltech’s Single-Cell Profiling and Engineering Center, informally dubbed the Thomson Lab. He and his team integrate and analyze widely variable data sets to build and apply large language models for protein engineering, the design of proteins with new or improved functions.
Those models contain up to 100 billion parameters and require expertise in distributed computing to host, run, and fine-tune them at scale. Each model must be run thousands of times during protein design-test cycles. Protein design requires not just single models, but also libraries of models specialized for downstream applications such as immunomodulation (reducing or enhancing the immune response) and thermostability (a protein’s ability to retain its structure and function at elevated temperatures). The challenge Thomson Lab faces is gaining access to the high performance computing (HPC) GPUs needed to run and test models at this massive scale.
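To make the idea concrete, here is a minimal sketch of one design-test round, assuming a library of specialized scoring models; the model names and scoring functions below are illustrative placeholders, not the lab’s actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SpecializedModel:
    """Stand-in for a fine-tuned downstream model (names are hypothetical)."""
    name: str
    score: Callable[[str], float]  # maps a protein sequence to a fitness score

def design_test_cycle(candidates: List[str],
                      library: Dict[str, SpecializedModel],
                      keep_fraction: float = 0.1) -> List[str]:
    """One design-test round: score each candidate sequence with every
    specialized model, then keep only the top-scoring fraction."""
    ranked = sorted(
        candidates,
        key=lambda seq: sum(m.score(seq) for m in library.values()),
        reverse=True,
    )
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return ranked[:cutoff]

# Toy usage: two placeholder scorers standing in for real fine-tuned models.
library = {
    "immunomodulation": SpecializedModel("immunomodulation", lambda s: s.count("K")),
    "thermostability":  SpecializedModel("thermostability",  lambda s: s.count("P")),
}
survivors = design_test_cycle(["MKKP", "MPPP", "MKKK"], library, keep_fraction=0.5)
```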
“100 billion parameters won't fit on a single GPU,” Thomson says. “Gaining access to adequate and elastic HPC resources requires a multiyear contract. Within the academic community it is almost impossible to gain that level of funding.”
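The arithmetic behind that statement is straightforward. Assuming half-precision weights (2 bytes per parameter) and an 80 GB data-center GPU, both common choices rather than details from the lab, the weights alone overflow a single device:

```python
import math

# Back-of-envelope memory math for a 100-billion-parameter model.
# Assumes fp16/bf16 weights (2 bytes per parameter) and an 80 GB GPU;
# both are common choices, not details confirmed by the lab.
params = 100e9
bytes_per_param = 2
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: {weights_gb:.0f} GB")  # -> 200 GB

gpu_memory_gb = 80
gpus_needed = math.ceil(weights_gb / gpu_memory_gb)
print(f"Minimum GPUs just to hold the weights: {gpus_needed}")  # -> 3
# Fine-tuning needs several times more, for optimizer states and activations.
```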
Historically, individual researchers and organizations built their own one-off computers for this kind of work, but those became obsolete in a matter of months. More recently, the lab has used Caltech’s own HPC cluster, but as its research progressed, even those powerful resources proved inadequate.
So Thomson turned to the cloud. The lab’s first attempt with a well-known cloud infrastructure provider was stymied by hidden costs and the burdens of internal administration. Through his network of contacts, Thomson connected with members of Oracle’s AI and ML team, leading to the design of a proof of concept (PoC) for creating and testing models on Oracle Cloud Infrastructure (OCI) GPU instances.
“By having ready access to the latest GPU instances on OCI, it is both possible and practical to enable researchers to leverage the latest technology,” Thomson says. “This may soon make on-premises HPC clusters obsolete for this type of research.”
For context: with each model, about 80 gigabytes of data out of a total database of about 20 terabytes is pulled into and held in GPU memory while the model is being trained. In the PoC, 1,000 models were created. Previously, Thomson Lab had been able to test only 10 models at a time.
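A simplified sketch of that scale-out pattern might look like the following; train_one_model and the worker count are hypothetical stand-ins for the lab’s actual jobs and GPU instance fleet.

```python
# Hypothetical sketch of the PoC scale-out pattern: many independent
# training jobs, each staging its own ~80 GB slice of a ~20 TB corpus,
# dispatched concurrently instead of ten at a time.
from concurrent.futures import ProcessPoolExecutor

def train_one_model(dataset_id: int) -> str:
    # In a real pipeline this step would: (1) pull the ~80 GB training
    # slice from shared storage, (2) load it into GPU memory, (3) train
    # the model, and (4) write checkpoints back out.
    return f"model-{dataset_id} trained"

if __name__ == "__main__":
    dataset_ids = range(1000)  # the PoC created 1,000 models
    # The worker count would match the number of GPU instances available.
    with ProcessPoolExecutor(max_workers=32) as pool:
        for result in pool.map(train_one_model, dataset_ids):
            print(result)
```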
“During the PoC, Oracle was really collaborative in working with us, and the Oracle team continues to demonstrate their commitment to advancing our work,” Thomson says. “Other vendors will offer incentives for you to sign up, but then they don’t show any real interest in working with an organization of our size.”
Biological research requires the consolidation of increasingly larger amounts of data with myriad new mathematical models. Historically, the research community hasn’t relied on professional-level databases, opting instead to use inexpensive open source database services.
For example, Thomson Lab works with more than 100 data sets consisting of as many as 10 million rows and 30,000 columns each, generating about 20 terabytes of new data each week. Currently, data sets are stored individually as CSV files on local hard drives. But without a data storage and management system that can hold all of Caltech’s data sets, along with those of other research organizations, machine learning models can’t be trained on all available and relevant information.
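One plausible first step toward consolidation, sketched below with hypothetical paths and assuming the pandas and pyarrow libraries, is converting the scattered CSV files into a compressed, columnar format that can be queried as one collection.

```python
# Sketch: converting scattered per-experiment CSV files into a single
# queryable columnar store. The directory layout and glob pattern are
# hypothetical; a production system would write to shared object
# storage rather than a local folder.
import glob
import os
import pandas as pd  # pyarrow (or fastparquet) is needed for to_parquet

os.makedirs("consolidated", exist_ok=True)
for csv_path in glob.glob("local_drives/**/*.csv", recursive=True):
    name = os.path.splitext(os.path.basename(csv_path))[0]
    # Stream in chunks so a 10-million-row, 30,000-column file
    # never has to fit in memory all at once.
    for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=100_000)):
        chunk.to_parquet(f"consolidated/{name}_part{i:04d}.parquet")
```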
The goal, then, is for Thomson Lab to work with Oracle to develop a data storage and management system that holds all of these data sets while remaining dynamically accessible to researchers at any institution.
Thomson is optimistic that Caltech’s work with Oracle will lead to groundbreaking advances in cancer research and care.
“All the tools are there,” he says. “We want to work with Oracle to bring everything together and make it economically possible in a mutually agreeable monetization model, not just for Caltech, but also for similar organizations. There is no ceiling on what we can accomplish together.”