
Deploying LLMs with NVIDIA GPUs on OCI Compute Bare Metal

Introduction

Have you ever wondered how to deploy a large language model (LLM) on Oracle Cloud Infrastructure (OCI)? In this solution, you’ll learn how to deploy LLMs on OCI Compute Bare Metal instances powered by NVIDIA GPUs, using an inference server called vLLM.

vLLM can be deployed as a server that implements the OpenAI API protocol. This lets it act as a drop-in replacement for applications built on the OpenAI API: a client that would normally call an OpenAI model (such as GPT-3.5 or GPT-4) can send the same request to vLLM instead, which generates text based on just two things (see the client sketch after this list):

  • The original user’s query
  • The model name of the LLM you want to run text generation against
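For example, a client request looks the same whether it targets OpenAI or a vLLM server. The following minimal Python sketch assumes a vLLM server is already running at http://localhost:8000 and serving mistralai/Mistral-7B-Instruct-v0.2; the endpoint, the placeholder API key, and the model name are illustrative assumptions, not fixed parts of this solution.

    # Minimal client sketch: host, port, and model name are assumptions
    # for illustration; substitute the values of your own deployment.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # point the client at vLLM instead of OpenAI
        api_key="EMPTY",  # vLLM doesn't validate the key by default
    )

    response = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",  # the LLM to run text generation against
        messages=[{"role": "user", "content": "What is a bare metal GPU instance?"}],
    )
    print(response.choices[0].message.content)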

These LLMs can come from any well-formed Hugging Face repository (the developer’s choice), so unless we’ve built the models from source, we’ll need to authenticate to Hugging Face with an access token to pull them. A sketch of this step follows.
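As a minimal illustration, the Python sketch below logs in with a token read from an environment variable and then loads a model through vLLM’s offline API. The variable name HF_TOKEN, the repository name, and the assumption of an 8-GPU bare metal shape are all illustrative choices, not requirements of this solution.

    # Minimal sketch: authenticate to Hugging Face, then load a model with vLLM.
    # HF_TOKEN and the repository name are illustrative assumptions.
    import os
    from huggingface_hub import login
    from vllm import LLM, SamplingParams

    # Log in with an access token read from the environment (never hardcode it).
    login(token=os.environ["HF_TOKEN"])

    # Pull the weights from the Hugging Face repository and shard them across
    # the instance's GPUs; tensor_parallel_size should match the GPU count
    # (8 here, assuming an 8-GPU bare metal shape).
    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=8)

    # Quick smoke test: generate a short completion.
    outputs = llm.generate(["Hello from OCI!"], SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)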

Prerequisites and setup

  1. Oracle Cloud account—sign-up page
  2. Oracle Cloud Infrastructure—documentation
  3. OCI Generative AI—documentation
  4. vLLM—getting started documentation
