With support from one GPU up to tens of thousands of GPUs, Oracle Cloud Infrastructure (OCI) Compute virtual machines and bare metal instances can power applications for computer vision, natural language processing, recommendation systems, and more. For training large language models (LLMs), including conversational AI and diffusion models, OCI Supercluster provides ultralow latency cluster networking, HPC storage, and OCI Compute bare metal instances powered by NVIDIA GPUs.
Learn about OCI’s supercluster architecture and hear from customers Adept and MosaicML.
Each OCI Compute bare metal instance is connected using OCI’s ultralow latency cluster networking that can scale up to 32,768 NVIDIA A100 GPUs in a single cluster. These instances use OCI’s unique high performance network architecture that leverages RDMA over Converged Ethernet (RoCE) v2 for creating RDMA superclusters with microseconds of latency between nodes and near line rate bandwidth of 200 Gb/sec between GPUs.
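To put the latency and bandwidth figures above in perspective, a rough first-order model treats the time to move a message between two GPUs as a fixed latency plus the message size divided by the bandwidth. The sketch below is an illustration only, not an OCI tool: the function name, the 2-microsecond default latency, and the message sizes are assumptions chosen for the example, and real transfers also depend on protocol overhead and congestion.

```python
def transfer_time_seconds(message_bytes, latency_s=2e-6, bandwidth_bits_per_s=200e9):
    """Estimate one-way transfer time as latency + size / bandwidth.

    The defaults are illustrative assumptions: 2 microseconds of latency
    and the 200 Gb/sec line rate quoted in the text.
    """
    return latency_s + (message_bytes * 8) / bandwidth_bits_per_s


if __name__ == "__main__":
    # Hypothetical message sizes, from a small control message to a large tensor.
    for size in (1_000, 1_000_000, 1_000_000_000):
        t = transfer_time_seconds(size)
        print(f"{size:>13,} bytes: {t * 1e6:,.2f} microseconds")
```

Under this simple model, small messages are dominated by latency and large messages by bandwidth, which is why both numbers matter for distributed training.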
OCI’s implementation of RoCE v2 provides
High performance computing on Oracle Cloud Infrastructure provides powerful, cost-effective computing capabilities to solve complex mathematical and scientific problems across industries.
OCI’s bare metal servers coupled with Oracle’s cluster networking provide access to ultralow-latency RDMA over Converged Ethernet (RoCE) v2, with latency of less than 2 microseconds across clusters of tens of thousands of cores.
The chart shows the performance of Oracle’s cluster networking fabric. With popular CFD codes, OCI scaling stays above 100% even below 10,000 simulation cells per core, the same performance that you would see on-premises. Because there is no virtualization penalty, bare metal HPC machines can use all the cores on the node without reserving any of them for overhead.
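One common way to read a figure such as "above 100%" is as strong-scaling parallel efficiency: the speedup relative to a baseline run, divided by the increase in core count. The sketch below assumes that definition (the exact metric behind the chart is not stated here), and every timing, core count, and mesh size in it is a made-up number for illustration.

```python
def cells_per_core(total_cells, cores):
    """Average number of simulation cells handled by each core."""
    return total_cells / cores


def scaling_efficiency(base_cores, base_seconds, cores, seconds):
    """Strong-scaling efficiency relative to a baseline run.

    1.0 means ideal (linear) scaling; values above 1.0 are superlinear.
    """
    speedup = base_seconds / seconds
    ideal_speedup = cores / base_cores
    return speedup / ideal_speedup


if __name__ == "__main__":
    # Hypothetical CFD case: all values below are assumptions, not measurements.
    total_cells = 14_000_000
    base_cores, base_seconds = 36, 1000.0
    cores, seconds = 1440, 24.0

    eff = scaling_efficiency(base_cores, base_seconds, cores, seconds)
    print(f"cells per core: {cells_per_core(total_cells, cores):,.0f}")
    print(f"scaling efficiency: {eff:.1%}")
```

Efficiency usually drops as the number of cells per core shrinks, because communication starts to outweigh computation; a low-latency network delays that drop.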
HPC on OCI rivals the performance of on-premises solutions with the elasticity and consumption-based costs of the cloud, offering the on-demand potential to scale to tens of thousands of cores simultaneously.
With HPC on OCI, you get access to high-frequency processors; fast and dense local storage; high-throughput, ultralow-latency RDMA cluster networks; and the tools to automate and run jobs seamlessly.
OCI can provide latencies as low as 1.7 microseconds—lower than any other cloud vendor, according to an analysis by Exabyte.io. By enabling RDMA-connected clusters, OCI has expanded cluster networking for bare metal servers equipped with NVIDIA A100 GPUs.
The groundbreaking backend network fabric lets customers use Mellanox’s ConnectX-5 100 Gb/sec network interface cards with RDMA over Converged Ethernet (RoCE) v2 to create clusters with the same low-latency networking and application scalability that can be achieved on-premises.
OCI’s bare metal NVIDIA GPU instances offer startups a high performance computing platform for applications that rely on machine learning, image processing, and massively parallel high performance computing jobs. GPU instances are ideally suited for model training, inference computation, physics and image rendering, and massively parallel applications.
The BM.GPU4.8 instances have eight NVIDIA A100 GPUs and use Oracle’s low-latency cluster networking, based on remote direct memory access (RDMA) over Converged Ethernet (RoCE), with less than 2-microsecond latency. Customers can now host more than 500 GPU clusters and easily scale on demand.
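Because each BM.GPU4.8 instance has eight GPUs, the number of bare metal instances needed for a given GPU count is a simple ceiling division. The helper below is a hypothetical illustration (the function and constant names are not part of any OCI SDK); it applies the calculation to the 32,768-GPU cluster size cited earlier.

```python
GPUS_PER_BM_GPU4_8 = 8  # eight NVIDIA A100 GPUs per BM.GPU4.8 instance, per the text


def instances_needed(total_gpus, gpus_per_instance=GPUS_PER_BM_GPU4_8):
    """Return the number of instances required to supply total_gpus GPUs."""
    if total_gpus < 0 or gpus_per_instance <= 0:
        raise ValueError("GPU counts must be non-negative and instances must have GPUs")
    # Ceiling division using integer arithmetic only.
    return -(-total_gpus // gpus_per_instance)


if __name__ == "__main__":
    print(instances_needed(32_768))
```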
Customers such as Adept, an ML research and product lab developing a universal AI teammate, are using the power of OCI and NVIDIA technologies to build the next generation of AI models. Running thousands of NVIDIA GPUs on clusters of OCI bare metal compute instances and capitalizing on OCI’s network bandwidth, Adept can train large-scale AI and ML models faster and more economically than before.
“With the scalability and computing power of OCI and NVIDIA technology, we are training a neural network to use every software application, website, and API in existence—building on the capabilities that software makers have already created.”
David Luan, CEO
“We view this relationship with OCI as long term. We’re excited about taking advantage of the GPUs and using that to train our next generation of voice AI. There's a lot that we think that OCI will provide for us in terms of future growth.”
James Hom, Cofounder and Vice President of Products
“We selected Oracle because of the affordability and performance of the GPUs combined with Oracle’s extensive cloud footprint. GPUs are very important for training deep neural network models. The higher the GPU performance, the better our models. And because we work in several different countries and regions, we needed the infrastructure to support that.”
Nils Helset, Cofounder and CEO
“When running experiments with the same configuration, the A100 uses about 25% less time on average. What makes it even better is the smooth process of setting up the machine on Oracle Cloud.”
Shuyang Cao, Graduate Student Research Assistant
University of Michigan
Learn why MosaicML found that OCI is the best foundation for AI training.
OCI provides world-class technical experts to help you get up and running. We remove the technical barriers of a complex deployment—from planning to launch—to help ensure your success.
OCI is built for enterprises seeking higher performance, consistently lower costs, and easier cloud migration for their current on-premises applications. When compared to AWS, OCI offers
Jag Brar, OCI Vice President and Distinguished Engineer, and Pradeep Vincent, OCI Senior Vice President and Chief Technical Architect
OCI offers many unique services, including cluster network, an ultrahigh performance network with support for remote direct memory access (RDMA). In our previous First Principles video and blog, “Building a high performance network in the public cloud,” we explained how OCI’s cluster network uses RDMA over Converged Ethernet (RoCE) to support RDMA.