Deploying an NVIDIA NIM to EC2

This demo will show you how to deploy an NVIDIA NIM to an Amazon EC2 instance and enable it for inference.

GitHub Repository

https://github.com/schuettc/ec2-nim-deployment

Requirements

  • NGC API key
  • Route 53 Hosted Zone/Domain
  • On-Demand vCPU quota

NGC API Key

To use this demo, you will need an NGC API key. You can generate one here: https://org.ngc.nvidia.com/setup/personal-keys

We will upload this key to AWS Secrets Manager, and the EC2 instance will use it to log in to NGC and download the container.
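
If you prefer to store the key yourself rather than use the included script, a minimal sketch with the AWS CLI might look like the following (the secret name NGCApiKeySecret matches the .env default; substitute your own key and region):

# Store the NGC API key as a plaintext secret (adjust the name and region to match your .env)
aws secretsmanager create-secret \
  --name NGCApiKeySecret \
  --secret-string "REPLACE_WITH_YOUR_NGC_API_KEY" \
  --region us-west-2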

Route 53 Hosted Zone/Domain

To enable TLS security on the Application Load Balancer, we need to generate a certificate using AWS Certificate Manager. In order to generate this certificate, we need a domain and Hosted Zone in Amazon Route 53. We will configure this before launching the stack.
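
If you are not sure of your Hosted Zone ID, one way to look it up (assuming the AWS CLI is configured) is:

# Look up the Hosted Zone ID for your domain; example.com is a placeholder
aws route53 list-hosted-zones-by-name \
  --dns-name example.com \
  --query "HostedZones[0].Id" \
  --output text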

On-Demand vCPU quota

This demo will launch a G5 EC2 instance. The number of vCPUs used depends on the instance size; for example, a g5.12xlarge instance uses 48 vCPUs. Be sure to check your current On-Demand vCPU utilization and quota to confirm you can launch this instance. You can request an increase if necessary.
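
One way to check your current quota is the Service Quotas CLI. The quota code below is assumed to be the one for "Running On-Demand G and VT instances"; confirm it in the Service Quotas console for your account:

# Check the On-Demand G and VT instance vCPU quota (quota code is an assumption; verify before relying on it)
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-DB2E81BA \
  --query "Quota.Value"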

Setup

.env Configuration

Before launching, be sure to configure a .env file. Many of these fields can be left with their defaults.

# Stack Configuration
STACK_NAME=nim-stack
AWS_REGION=us-west-2
TEMPLATE_FILE=nims-stack.yaml

# Instance Configuration
INSTANCE_TYPE=g5.12xlarge

# AWS Resources
KEY_NAME=REPLACE_WITH_YOUR_KEY
VPC_ID=
SUBNET_IDS=

# Domain Configuration
DOMAIN_NAME=HOSTNAME.EXAMPLE.COM
HOSTED_ZONE_ID=REPLACE_WITH_YOUR_HOSTED_ZONE_ID

# NVIDIA Configuration
NGC_API_KEY_SECRET_NAME=NGCApiKeySecret
NGC_API_KEY=REPLACE_WITH_YOUR_NGC_API_KEY
REPOSITORY=nim/meta/llama3-8b-instruct
LATEST_TAG=1.0.0

You will need to update the following before launching:

# AWS Resources
KEY_NAME=REPLACE_WITH_YOUR_KEY

# Domain Configuration
DOMAIN_NAME=HOSTNAME.EXAMPLE.COM
HOSTED_ZONE_ID=REPLACE_WITH_YOUR_HOSTED_ZONE_ID

# NVIDIA Configuration
NGC_API_KEY=REPLACE_WITH_YOUR_NGC_API_KEY

Secret Update

To upload the NGC API key to Secrets Manager, run the script:

./update_ngc_api_key.sh

This API Key will be retrieved by the EC2 instance during deployment.
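
Under the hood this is just a Secrets Manager lookup; a minimal sketch of the call the instance makes (the secret name comes from NGC_API_KEY_SECRET_NAME in the .env file, and the instance role must allow secretsmanager:GetSecretValue):

# Fetch the NGC API key from Secrets Manager
NGC_API_KEY=$(aws secretsmanager get-secret-value \
  --secret-id NGCApiKeySecret \
  --query SecretString \
  --output text)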

Launch CloudFormation Stack

You can create the stack either in the AWS Console or with the included script. The script populates the required parameters from the values configured in the .env file; if you use the console instead, you will need to enter those parameters manually.
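
If you deploy from the command line, the script's behavior can be approximated with aws cloudformation deploy. The parameter names below are illustrative assumptions; check nims-stack.yaml for the actual parameter keys:

# Sketch of a CLI deployment; parameter names are assumptions
aws cloudformation deploy \
  --stack-name nim-stack \
  --template-file nims-stack.yaml \
  --region us-west-2 \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides \
    InstanceType=g5.12xlarge \
    KeyName=REPLACE_WITH_YOUR_KEY \
    DomainName=HOSTNAME.EXAMPLE.COM \
    HostedZoneId=REPLACE_WITH_YOUR_HOSTED_ZONE_ID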

Inference

Once the deployment succeeds, you can run inference requests against it using a variety of tools. Included is a script that uses curl to make a request to the endpoint using the configured domain name.

curl -s -X POST "https://$DOMAIN_NAME/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$model_id"'",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"},
      {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
      {"role": "user", "content": "Write a short limerick about the wonders of GPU computing."}
    ],
    "max_tokens": 100
  }' | jq '.choices[0].message.content'

To test the endpoint and inference:

./inference.sh

How it Works

Components Overview

EC2 Instance

  • Uses a G5 instance type (default: g5.12xlarge) for GPU capabilities
  • Runs Ubuntu with NVIDIA drivers, Docker, and NVIDIA Container Toolkit
  • Automatically pulls and runs the specified NIM container

Auto Scaling Group (ASG)

  • Manages a single EC2 instance
  • Provides automatic instance replacement and simplifies updates
  • Uses a Launch Template for instance configuration

Application Load Balancer (ALB)

  • Provides HTTPS termination
  • Routes traffic to the EC2 instance
  • Uses an ACM certificate for HTTPS

VPC and Networking

  • Creates a new VPC with public subnets (or uses existing ones)
  • Sets up necessary security groups and routing

Route 53

  • Creates a DNS record pointing to the ALB

EC2 Configuration

  1. The stack creates or uses existing VPC and subnets
  2. An EC2 instance is launched with the necessary software and configurations
  3. The NIM container is pulled and started on the EC2 instance
  4. An ALB is set up to route traffic to the EC2 instance
  5. A Route 53 record is created to point the specified domain to the ALB

The stack uses conditions to determine whether to create new VPC resources or use existing ones, making it flexible for different deployment scenarios.

UserData Script Details

The UserData script in the Launch Template performs several key setup tasks:

  1. System Updates and Dependencies

    • Updates the package list
    • Installs python3-pip, the AWS CLI, and CloudFormation helper scripts
  2. NVIDIA Driver and CUDA Installation

    • Installs Linux headers
    • Adds NVIDIA CUDA repository
    • Installs CUDA drivers
  3. Docker Installation

    • Adds Docker repository
    • Installs Docker CE
  4. NVIDIA Container Toolkit Installation

    • Adds NVIDIA Docker repository
    • Installs nvidia-docker2
    • Restarts Docker service
  5. NGC API Key Retrieval

    • Fetches the NGC API Key from AWS Secrets Manager
  6. NIM Container Setup and Run

    • Sets up environment variables for the container
    • Creates a local cache directory
    • Logs in to the NGC container registry
    • Runs the NIM container (a sketch follows this list) with the following configuration:
      • Uses NVIDIA runtime
      • Exposes port 8000
      • Mounts a local cache volume
      • Sets the NGC API Key as an environment variable
  7. Signaling

    • Signals the success or failure of the setup to CloudFormation
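
A minimal sketch of what steps 5 and 6 boil down to is shown below. The exact flags, cache path, and image reference in the real UserData may differ; the repository and tag here come from the .env defaults:

# Sketch of the NGC login and container launch performed by UserData (paths and flags are assumptions)
NGC_API_KEY=$(aws secretsmanager get-secret-value --secret-id NGCApiKeySecret --query SecretString --output text)
mkdir -p /opt/nim/.cache

# Log in to the NGC registry; the username for API-key auth is the literal string $oauthtoken
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

# Run the NIM container with the NVIDIA runtime, the cache mount, and port 8000 exposed
docker run -d --runtime=nvidia --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v /opt/nim/.cache:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:1.0.0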

Docker Container and ALB Interaction

  1. Container Exposure

    • The NIM container exposes port 8000 inside the EC2 instance
  2. Security Group Configuration

    • The EC2 security group allows inbound traffic on port 8000 from the ALB security group
  3. ALB Target Group

    • An ALB Target Group is created with the following properties:
      • Protocol: HTTP
      • Port: 8000
      • Health check path: /health
  4. ALB Listener

    • The ALB listener is configured for HTTPS (port 443)
    • It forwards traffic to the Target Group
  5. Request Flow

    • Incoming HTTPS requests hit the ALB
    • The ALB terminates SSL and forwards the request as HTTP to the EC2 instance on port 8000
    • The Docker container receives the request on port 8000 and processes it
  6. Health Checks

    • The ALB periodically sends health check requests to the /health endpoint
    • The NIM container responds to these health checks, allowing the ALB to determine if the instance is healthy
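
Once the stack is up, you can hit the same path yourself to confirm the instance is reporting healthy (assuming the listener forwards all paths to the target group):

# Manually check the health endpoint through the ALB; expect a 200 status code
curl -s -o /dev/null -w "%{http_code}\n" "https://$DOMAIN_NAME/health"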

Instance Types and Costs

This CloudFormation template supports the following instance types for running an NVIDIA NIM:

G5 Instances

  • g5.xlarge: 1 GPU, 4 vCPUs, 16 GB RAM
  • g5.2xlarge: 1 GPU, 8 vCPUs, 32 GB RAM
  • g5.4xlarge: 1 GPU, 16 vCPUs, 64 GB RAM
  • g5.8xlarge: 1 GPU, 32 vCPUs, 128 GB RAM
  • g5.16xlarge: 1 GPU, 64 vCPUs, 256 GB RAM
  • g5.12xlarge: 4 GPUs, 48 vCPUs, 192 GB RAM
  • g5.24xlarge: 4 GPUs, 96 vCPUs, 384 GB RAM
  • g5.48xlarge: 8 GPUs, 192 vCPUs, 768 GB RAM

⚠️ Warning: These instances can be very expensive to run. Please check the current AWS pricing for your region before deploying. For example, as of 2023, the costs for G5 instances in the US East (N. Virginia) region range from $1.006/hour for g5.xlarge to $16.288/hour for g5.48xlarge.
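
To put that in perspective, at the quoted $16.288/hour a g5.48xlarge left running costs roughly $391 per day, or about $2,736 per week.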

Always remember to stop or terminate your instances when not in use to avoid unnecessary charges.