Serving a trained model in production is a different beast from training it. You need low latency, high throughput, good hardware utilization, and stability, all at once. NVIDIA Triton Inference Server is built to handle exactly that. In this post, we'll take a ResNet50 model from Hugging Face that has been fine-tuned on the EuroSAT dataset, convert it to ONNX, and serve it with Triton.
What We're Building
EuroSAT is a satellite image classification dataset covering 10 land-use classes: forests, highways, industrial zones, residential areas, and more. A ResNet50 fine-tuned on it classifies 64x64 Sentinel-2 patches into these categories; the patches are resized to the network's 224x224 input during preprocessing.
NVIDIA Triton is a production-grade inference server that supports multiple backends (PyTorch, TensorFlow, ONNX Runtime, TensorRT, and more), dynamic batching, concurrent model execution, and both HTTP and gRPC endpoints out of the box.
HuggingFace ResNet50 (EuroSAT) → ONNX Export → Triton Model Repository → Triton Server → HTTP/gRPC Inference
Prerequisites
- Docker with NVIDIA Container Toolkit installed
- A GPU (the commands use --gpus=all)
- The Hugging Face model ID for your ResNet50 EuroSAT checkpoint
Step 1: Pull the Container Images
We use two separate NVIDIA containers, one for model conversion and one for serving.
# PyTorch container for ONNX conversion
docker pull nvcr.io/nvidia/pytorch:26.03-py3
# Triton Inference Server
docker pull nvcr.io/nvidia/tritonserver:26.04-py3
Keeping these separate is intentional. The PyTorch image is heavyweight and geared toward model development; the Triton image carries only what serving needs. There's no reason to bloat your production inference container with training-time dependencies.
Step 2: Convert the HuggingFace Model to ONNX
ONNX (Open Neural Network Exchange) is the format Triton's ONNX Runtime backend expects. Hugging Face's optimum library makes this conversion straightforward.
2a. Start the PyTorch Container
docker run --gpus=all -it -v ${PWD}:/workspace nvcr.io/nvidia/pytorch:26.03-py3
The -v ${PWD}:/workspace flag mounts your current directory into the container so exported files persist after the container exits.
2b. Install Dependencies
pip install timm accelerate
pip install "optimum-onnx[onnxruntime]"
2c. Export to ONNX
optimum-cli export onnx \
--model <hf/resnet50-eurosat> \
--task image-classification \
resnet50-eurosat-onnx
Replace <hf/resnet50-eurosat> with your actual Hugging Face model ID, for example a microsoft/resnet-50 checkpoint you have fine-tuned on EuroSAT, or a community checkpoint. The --task image-classification flag tells optimum which input/output signature to produce.
This generates a resnet50-eurosat-onnx/ directory containing model.onnx plus the model config and the image preprocessor config.
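The export directory is also a handy source of truth for preprocessing. As a minimal sketch, assuming the directory contains a preprocessor_config.json with image_mean and image_std keys (the usual layout for Hugging Face image processors, but worth checking for your checkpoint), you could read the normalization statistics like this. The helper name load_norm_stats is ours, and the ImageNet fallback is an assumption:

```python
import json
from pathlib import Path

# ImageNet statistics, used only as a fallback when the file lacks these keys.
# Falling back to them is an assumption, not something the export guarantees.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def load_norm_stats(export_dir):
    """Return (mean, std) read from preprocessor_config.json in export_dir."""
    config_path = Path(export_dir) / "preprocessor_config.json"
    config = json.loads(config_path.read_text())
    mean = config.get("image_mean", IMAGENET_MEAN)
    std = config.get("image_std", IMAGENET_STD)
    return list(mean), list(std)
```

Calling load_norm_stats("resnet50-eurosat-onnx") after the export gives you values to pass to the normalization step on the client side, so that serving-time preprocessing matches what the checkpoint expects.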
Step 3: Inspect the ONNX Model
Before writing the Triton config, confirm the exact input and output tensor names and shapes. Triton is strict about these.
import onnx
model = onnx.load("resnet50-eurosat-onnx/model.onnx")
print("Inputs:")
print(model.graph.input)
print("Outputs:")
print(model.graph.output)
For a standard image classification ResNet50, you'll see:
- Input: pixel_values, shape [-1, 3, 224, 224], dtype float32
- Output: logits, shape [-1, 10], dtype float32 (10 classes for EuroSAT)
Keep these handy for the next step.
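The raw protobuf printout is verbose, and dynamic axes show up as symbolic names rather than -1. The following is a sketch of one way to turn the graph into the name, data type, and dims that a Triton config needs. The helper names (to_triton_dims, summarize_io) and the dtype table are ours, the table covers only a common subset, and onnx is imported lazily so the pure helpers work without it:

```python
# Maps ONNX element type names to Triton config data types (common subset).
ONNX_TO_TRITON_DTYPE = {
    "FLOAT": "TYPE_FP32",
    "FLOAT16": "TYPE_FP16",
    "DOUBLE": "TYPE_FP64",
    "INT32": "TYPE_INT32",
    "INT64": "TYPE_INT64",
    "UINT8": "TYPE_UINT8",
    "BOOL": "TYPE_BOOL",
}


def to_triton_dims(onnx_dims):
    """Keep positive integer dims; map symbolic or unknown dims to -1."""
    return [d if isinstance(d, int) and d > 0 else -1 for d in onnx_dims]


def summarize_io(model_path):
    """Return name, Triton data type, and dims for each graph input/output."""
    import onnx  # lazy import: only needed when reading a real model file

    graph = onnx.load(model_path).graph
    # Older exports may list weights as graph inputs; skip those.
    initializers = {init.name for init in graph.initializer}

    def describe(value_info):
        tensor_type = value_info.type.tensor_type
        dims = [
            dim.dim_value if dim.HasField("dim_value") else dim.dim_param
            for dim in tensor_type.shape.dim
        ]
        onnx_dtype = onnx.TensorProto.DataType.Name(tensor_type.elem_type)
        return {
            "name": value_info.name,
            "data_type": ONNX_TO_TRITON_DTYPE.get(onnx_dtype, onnx_dtype),
            "dims": to_triton_dims(dims),
        }

    return {
        "inputs": [describe(v) for v in graph.input if v.name not in initializers],
        "outputs": [describe(v) for v in graph.output],
    }
```

Running summarize_io("resnet50-eurosat-onnx/model.onnx") inside the PyTorch container should give you values you can copy into config.pbtxt. If your export marks more axes as dynamic than just the batch axis, they will appear as -1 here too.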
Step 4: Prepare the Model Repository
Triton expects models in a specific directory layout:
model_repository/
└── <model-name>/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
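The 1/ subdirectory is the model version. Triton can serve multiple versions of a model at the same time (which ones are loaded is controlled by the model's version policy), and clients can target a specific version in each request, which is useful for A/B testing or canary deployments.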
mkdir -p model_repository/image_classification/1
cp resnet50-eurosat-onnx/model.onnx model_repository/image_classification/1/model.onnx
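If you would rather script this step, for example in a CI job, the same layout can be produced and checked from Python. This is a sketch using only the standard library; build_model_repository and missing_files are hypothetical helper names, not part of Triton:

```python
import shutil
from pathlib import Path


def build_model_repository(onnx_path, repo_root, model_name, version=1):
    """Copy an ONNX file to <repo_root>/<model_name>/<version>/model.onnx."""
    version_dir = Path(repo_root) / model_name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    target = version_dir / "model.onnx"
    shutil.copyfile(onnx_path, target)
    return target


def missing_files(repo_root, model_name, version=1):
    """List the expected files that are absent, relative to repo_root."""
    model_dir = Path(repo_root) / model_name
    expected = [
        model_dir / "config.pbtxt",
        model_dir / str(version) / "model.onnx",
    ]
    return [p.relative_to(repo_root).as_posix() for p in expected if not p.is_file()]
```

Right after copying the model, missing_files("model_repository", "image_classification") should report only the config.pbtxt, which is what the next step creates.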
Step 5: Write the Triton Config
Create model_repository/image_classification/config.pbtxt:
name: "image_classification"
backend: "onnxruntime"
max_batch_size: 0
input [
  {
    name: "pixel_values"
    data_type: TYPE_FP32
    dims: [ -1, 3, 224, 224 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, 10 ]
  }
]
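One detail worth spelling out: with max_batch_size: 0, Triton treats the batch axis as an ordinary dimension, so it appears in dims and the dynamic batcher is not used. If you set max_batch_size to a positive value instead, Triton adds the batch dimension itself, so it must be dropped from dims, and you can opt into dynamic batching. The sketch below renders both variants; render_config is a hypothetical helper, the batched variant is an alternative this post does not use, and it assumes the ONNX model's first axis is dynamic:

```python
def render_config(name, inputs, outputs, max_batch_size=0):
    """Render config.pbtxt text for the onnxruntime backend.

    inputs and outputs are lists of (tensor_name, data_type, full_dims),
    where full_dims includes the batch axis, e.g. [-1, 3, 224, 224].
    """

    def tensor_block(tensor_name, data_type, full_dims):
        # With batching enabled, Triton supplies the batch axis implicitly,
        # so it is left out of dims.
        dims = full_dims[1:] if max_batch_size > 0 else full_dims
        dims_text = ", ".join(str(d) for d in dims)
        return (
            "  {\n"
            f'    name: "{tensor_name}"\n'
            f"    data_type: {data_type}\n"
            f"    dims: [ {dims_text} ]\n"
            "  }"
        )

    lines = [
        f'name: "{name}"',
        'backend: "onnxruntime"',
        f"max_batch_size: {max_batch_size}",
        "input [",
        ",\n".join(tensor_block(*t) for t in inputs),
        "]",
        "output [",
        ",\n".join(tensor_block(*t) for t in outputs),
        "]",
    ]
    if max_batch_size > 0:
        # Empty block: enable the dynamic batcher with default settings.
        lines.append("dynamic_batching { }")
    return "\n".join(lines) + "\n"
```

For example, render_config("image_classification", [("pixel_values", "TYPE_FP32", [-1, 3, 224, 224])], [("logits", "TYPE_FP32", [-1, 10])]) reproduces the structure of the config above, while passing max_batch_size=8 yields dims of [ 3, 224, 224 ] and [ 10 ] plus a dynamic_batching block. The value 8 is only an illustration; a sensible limit depends on your GPU memory and latency budget.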
Your final directory structure should look like this:
model_repository/
└── image_classification/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
Step 6: Start Triton Inference Server
docker run \
--gpus=all \
--rm \
--shm-size=256m \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
-v ${PWD}/model_repository:/models \
nvcr.io/nvidia/tritonserver:26.04-py3 \
tritonserver --model-repository=/models
Port mapping:
| Port | Protocol | Use |
|---|---|---|
| 8000 | HTTP | REST inference + management |
| 8001 | gRPC | gRPC inference |
| 8002 | HTTP | metrics |
When the server starts successfully, the log includes lines like these (the timestamp and source-location prefix of each line is omitted):
Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
Started Metrics Service at 0.0.0.0:8002
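Just before them, Triton prints a model status table in which the model should be listed as READY:
+----------------------+---------+--------+
| Model                | Version | Status |
+----------------------+---------+--------+
| image_classification | 1       | READY  |
+----------------------+---------+--------+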
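Rather than watching the log, you can also poll the readiness endpoints on the HTTP port: /v2/health/ready for the server and /v2/models/<name>/ready for a single model. Here is a small standard-library sketch; ready_urls and is_ready are hypothetical helper names, and the opener parameter exists only to make the function easy to test:

```python
import urllib.error
import urllib.request


def ready_urls(base_url, model_name):
    """Build the server-level and model-level readiness URLs."""
    base = base_url.rstrip("/")
    return {
        "server": f"{base}/v2/health/ready",
        "model": f"{base}/v2/models/{model_name}/ready",
    }


def is_ready(url, opener=urllib.request.urlopen, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all mean "not ready".
        return False
```

With the container from this step running, is_ready(ready_urls("http://localhost:8000", "image_classification")["model"]) should return True once the model has loaded. This is handy in startup scripts that need to wait before sending traffic.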
Step 7: Send an Inference Request
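With the server running, you can query it over HTTP using Triton's HTTP client or plain curl. Here's a quick Python example using tritonclient (install it with pip install "tritonclient[http]"; the example also needs numpy, Pillow, and torchvision):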
import tritonclient.http as httpclient
import numpy as np
from PIL import Image
from torchvision import transforms
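# Preprocess the image the same way as during fine-tuning. The ImageNet
# mean/std below are an assumption; check preprocessor_config.json in the
# export directory for the values your checkpoint actually uses.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])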
image = Image.open("eurosat_sample.jpg").convert("RGB")
tensor = transform(image).unsqueeze(0).numpy() # shape: [1, 3, 224, 224]
client = httpclient.InferenceServerClient(url="localhost:8000")
inputs = [httpclient.InferInput("pixel_values", tensor.shape, "FP32")]
inputs[0].set_data_from_numpy(tensor)
outputs = [httpclient.InferRequestedOutput("logits")]
response = client.infer(model_name="image_classification", inputs=inputs, outputs=outputs)
logits = response.as_numpy("logits")
# The class order must match the checkpoint's id2label mapping in config.json.
EUROSAT_CLASSES = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway",
    "Industrial", "Pasture", "PermanentCrop", "Residential",
    "River", "SeaLake"
]
predicted_class = EUROSAT_CLASSES[logits.argmax()]
print(f"Predicted: {predicted_class}")
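If you want to skip tritonclient, for instance to use curl, the HTTP endpoint accepts a JSON body posted to /v2/models/image_classification/infer and returns the output tensors as JSON. The sketch below builds that body and turns the response into ranked class probabilities. The function names are ours, and it assumes a single image per request with the response data returned as a flat list:

```python
import json
import math


def build_infer_request(flat_pixels, shape):
    """Build the JSON body for POST /v2/models/<model>/infer."""
    return json.dumps({
        "inputs": [{
            "name": "pixel_values",
            "shape": list(shape),
            "datatype": "FP32",
            "data": list(flat_pixels),  # row-major, flattened tensor values
        }],
        "outputs": [{"name": "logits"}],
    })


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(response_json, labels, k=3):
    """Return the k most likely (label, probability) pairs for one image."""
    outputs = json.loads(response_json)["outputs"]
    logits = next(o["data"] for o in outputs if o["name"] == "logits")
    # Assumes batch size 1: the first len(labels) values are that image's logits.
    probs = softmax(logits[: len(labels)])
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

With the tensor from the example above, build_infer_request(tensor.ravel().tolist(), tensor.shape) gives you a body you can save to a file and send with curl, and top_k(response_text, EUROSAT_CLASSES) ranks the classes instead of returning only the argmax. JSON is a bulky encoding for image tensors, so for anything beyond a quick test the client library's binary transport is likely the better choice.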
Summary
| Step | What Happens |
|---|---|
| Pull containers | Get PyTorch (conversion) and Triton (serving) images |
| Export to ONNX | Use optimum-cli to convert the HF model |
| Inspect ONNX | Confirm tensor names and shapes |
| Build model repo | Set up the model_repository/<name>/1/model.onnx structure |
| Write config | Define backend, inputs, outputs in config.pbtxt |
| Start server | tritonserver --model-repository=/models |
| Infer | Send requests via HTTP, gRPC, or tritonclient |
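Triton handles the serving infrastructure, including health checks, metrics, multi-GPU placement, and batching (dynamic batching requires max_batch_size greater than 0, which the config in this post does not set), so your application code only needs to worry about sending requests and interpreting logits.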