Deploying a ResNet50 EuroSAT Image Classifier on NVIDIA Triton Inference Server

Serving a trained model in production is a different beast from training it. You need low latency, high throughput, good hardware utilization, and stability, all at once. NVIDIA Triton Inference Server is built to handle exactly that. In this post, we'll take a ResNet50 checkpoint from Hugging Face that has been fine-tuned on the EuroSAT dataset, convert it to ONNX, and serve it with Triton.

What We're Building

EuroSAT is a satellite image classification dataset covering 10 land-use classes: forests, highways, industrial zones, residential areas, and more. A ResNet50 fine-tuned on it classifies 64x64 Sentinel-2 patches into these categories. The patches are resized to 224x224, the input size the model expects, before inference.

NVIDIA Triton is a production-grade inference server that supports multiple backends (PyTorch, TensorFlow, ONNX Runtime, TensorRT, and more), dynamic batching, concurrent model execution, and both HTTP and gRPC endpoints out of the box.

HuggingFace ResNet50 (EuroSAT) → ONNX Export → Triton Model Repository → Triton Server → HTTP/gRPC Inference

Prerequisites

Docker with NVIDIA Container Toolkit installed
A GPU (the commands use --gpus=all)
The Hugging Face model ID for your ResNet50 EuroSAT checkpoint

Step 1: Pull the Container Images

We use two separate NVIDIA containers, one for model conversion and one for serving.

# PyTorch container for ONNX conversion
docker pull nvcr.io/nvidia/pytorch:26.03-py3

# Triton Inference Server
docker pull nvcr.io/nvidia/tritonserver:26.04-py3

Keeping these separate is intentional. The PyTorch image is heavyweight and optimized for model work; the Triton image is lean and optimized for serving. There's no reason to bloat your production inference container with training-time dependencies.

Step 2: Convert the HuggingFace Model to ONNX

ONNX (Open Neural Network Exchange) is the format Triton's ONNX Runtime backend expects. Hugging Face's optimum library makes this conversion straightforward.

2a. Start the PyTorch Container

docker run --gpus=all -it -v ${PWD}:/workspace nvcr.io/nvidia/pytorch:26.03-py3

The -v ${PWD}:/workspace flag mounts your current directory into the container so exported files persist after the container exits.

2b. Install Dependencies

pip install timm accelerate
pip install "optimum-onnx[onnxruntime]"

2c. Export to ONNX

optimum-cli export onnx \
  --model <hf/resnet50-eurosat> \
  --task image-classification \
  resnet50-eurosat-onnx

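Replace <hf/resnet50-eurosat> with your actual Hugging Face model ID, for example a community ResNet50 checkpoint fine-tuned on EuroSAT. The base microsoft/resnet-50 checkpoint is trained on ImageNet, so exporting it unchanged would give you 1,000 logits rather than 10. The --task image-classification flag tells optimum which input/output signature to produce.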

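This generates a resnet50-eurosat-onnx/ directory containing model.onnx alongside the model and image-processor configuration files (typically config.json and preprocessor_config.json; an image model has no tokenizer).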
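Exports of Hugging Face image classifiers normally carry the checkpoint's config.json, and that file usually includes an id2label mapping from logit index to class name. The sketch below reads the class names from it instead of hardcoding them. It assumes config.json is present in the export directory and has an id2label key; load_labels is a hypothetical helper name, not part of optimum.

```python
import json
from pathlib import Path


def load_labels(export_dir: str) -> list[str]:
    """Return class names ordered by logit index, read from config.json."""
    config = json.loads((Path(export_dir) / "config.json").read_text())
    # JSON object keys are strings ("0", "1", ...), so sort them numerically;
    # a plain string sort would place "10" before "2".
    id2label = config["id2label"]
    return [id2label[key] for key in sorted(id2label, key=int)]
```

Calling load_labels("resnet50-eurosat-onnx") after the export should return the ten EuroSAT class names in the order the model's logits use, which is what you need when mapping predictions back to labels.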

Step 3: Inspect the ONNX Model

Before writing the Triton config, confirm the exact input and output tensor names and shapes. Triton is strict about these.

import onnx

model = onnx.load("resnet50-eurosat-onnx/model.onnx")

print("Inputs:")
print(model.graph.input)

print("Outputs:")
print(model.graph.output)

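For a standard image classification ResNet50, you'll see something like the following. ONNX reports dynamic dimensions by symbolic name (for example batch_size); they are written as -1 here because that is how Triton's config expresses them. If your export reports further dynamic dimensions, use -1 for those as well.

Input: pixel_values, shape [-1, 3, 224, 224], dtype float32
Output: logits, shape [-1, 10], dtype float32 (10 classes for EuroSAT)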

Keep these handy for the next step.
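Printing the raw graph protobuf is verbose, so it can help to reduce it to the three things the Triton config needs: tensor name, data type, and dims. The following is a minimal sketch under the assumption that the onnx package is installed in the PyTorch container; to_triton_dims and describe_io are hypothetical helper names introduced here for illustration.

```python
from typing import Sequence, Union

Dim = Union[int, str, None]


def to_triton_dims(dims: Sequence[Dim]) -> list[int]:
    """Map ONNX dims to Triton dims: fixed sizes stay, dynamic ones become -1."""
    return [d if isinstance(d, int) and d > 0 else -1 for d in dims]


def describe_io(onnx_path: str) -> dict[str, list[tuple[str, str, list[int]]]]:
    """Return (name, dtype, triton_dims) for every graph input and output."""
    import onnx  # imported lazily so to_triton_dims works without onnx installed

    graph = onnx.load(onnx_path).graph

    def summarize(values):
        rows = []
        for value in values:
            tensor_type = value.type.tensor_type
            # elem_type is an enum number; Name() turns it into e.g. "FLOAT"
            dtype = onnx.TensorProto.DataType.Name(tensor_type.elem_type)
            # A dimension is either a fixed dim_value or a symbolic dim_param
            dims = [
                dim.dim_value if dim.HasField("dim_value") else dim.dim_param
                for dim in tensor_type.shape.dim
            ]
            rows.append((value.name, dtype, to_triton_dims(dims)))
        return rows

    return {"inputs": summarize(graph.input), "outputs": summarize(graph.output)}
```

Running describe_io("resnet50-eurosat-onnx/model.onnx") gives you values that can be copied into config.pbtxt almost verbatim; the ONNX type name FLOAT corresponds to TYPE_FP32 in Triton.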

Step 4: Prepare the Model Repository

Triton expects models in a specific directory layout:

model_repository/
└── <model-name>/
    ├── config.pbtxt
    └── 1/
        └── model.onnx

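The 1/ subdirectory is the model version. By default Triton serves only the highest-numbered version, but a version_policy setting in config.pbtxt can make several versions available at once, and clients can request a specific version by number. That is useful for A/B testing or canary deployments, with the traffic split decided by the caller or a gateway in front of Triton.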

mkdir -p model_repository/image_classification/1

cp resnet50-eurosat-onnx/model.onnx model_repository/image_classification/1/model.onnx
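If you build the repository from a script, for instance to add a second version later, the same layout can be produced in Python. This is a small sketch using only the standard library; add_model_version is a hypothetical helper, and the paths are the ones used in this post.

```python
import shutil
from pathlib import Path


def add_model_version(repo: str, model_name: str, onnx_file: str, version: int = 1) -> Path:
    """Copy an ONNX file into <repo>/<model_name>/<version>/model.onnx."""
    # Triton only picks up version directories named with positive integers
    if version < 1:
        raise ValueError("version must be a positive integer")
    version_dir = Path(repo) / model_name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    target = version_dir / "model.onnx"
    shutil.copyfile(onnx_file, target)
    return target
```

For example, add_model_version("model_repository", "image_classification", "resnet50-eurosat-onnx/model.onnx") reproduces the two shell commands above, and passing version=2 places a new export next to the first one.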

Step 5: Write the Triton Config

Create model_repository/image_classification/config.pbtxt:

name: "image_classification"
backend: "onnxruntime"
max_batch_size: 0

input [
  {
    name: "pixel_values"
    data_type: TYPE_FP32
    dims: [ -1, 3, 224, 224 ]
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, 10 ]
  }
]

Your final directory structure should look like this:

model_repository/
└── image_classification/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
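The config above sets max_batch_size: 0, which tells Triton not to batch and is why the batch dimension appears explicitly in dims. To let Triton batch requests itself, max_batch_size must be positive, dims must describe a single sample, and a dynamic_batching block is added. The sketch below renders both variants as text so the difference is easy to compare. The render_config helper and the 100-microsecond queue delay are illustrative assumptions, not tuned values, and the tensor names and shapes are the ones assumed throughout this post.

```python
def render_config(name: str, max_batch_size: int = 0, queue_delay_us: int = 100) -> str:
    """Render a config.pbtxt for the ResNet50 ONNX model used in this post."""
    batching = max_batch_size > 0
    # With max_batch_size > 0 Triton adds the batch dimension itself,
    # so dims must describe one sample rather than a whole batch.
    input_dims = "[ 3, 224, 224 ]" if batching else "[ -1, 3, 224, 224 ]"
    output_dims = "[ 10 ]" if batching else "[ -1, 10 ]"
    lines = [
        f'name: "{name}"',
        'backend: "onnxruntime"',
        f"max_batch_size: {max_batch_size}",
        "",
        "input [",
        "  {",
        '    name: "pixel_values"',
        "    data_type: TYPE_FP32",
        f"    dims: {input_dims}",
        "  }",
        "]",
        "",
        "output [",
        "  {",
        '    name: "logits"',
        "    data_type: TYPE_FP32",
        f"    dims: {output_dims}",
        "  }",
        "]",
    ]
    if batching:
        # Let Triton wait briefly so concurrent requests can share a batch
        lines += [
            "",
            "dynamic_batching {",
            f"  max_queue_delay_microseconds: {queue_delay_us}",
            "}",
        ]
    return "\n".join(lines) + "\n"
```

render_config("image_classification") reproduces the config shown earlier, while render_config("image_classification", max_batch_size=8) produces the batched variant. Client code is unaffected either way, since each request still sends a tensor with a leading batch dimension.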

Step 6: Start Triton Inference Server

docker run \
  --gpus=all \
  --rm \
  --shm-size=256m \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v ${PWD}/model_repository:/models \
  nvcr.io/nvidia/tritonserver:26.04-py3 \
  tritonserver --model-repository=/models

Port mapping:

Port	Protocol	Use
8000	HTTP	REST inference + management
8001	gRPC	gRPC inference
8002	HTTP	metrics

When the server starts successfully, the log ends with lines like these (timestamps and source locations trimmed):

Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
Started Metrics Service at 0.0.0.0:8002

Earlier in the log, Triton prints a model status table, which should list image_classification, version 1, with status READY.
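Rather than watching the log, a deployment script can poll Triton's HTTP readiness endpoints, /v2/health/ready for the server and /v2/models/<name>/ready for a single model. The following is a minimal sketch using only the standard library; wait_until_ready and http_ok are hypothetical helper names, and the timeout values are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all mean "not ready"
        return False


def wait_until_ready(base_url: str, model: str, timeout_s: float = 60.0,
                     interval_s: float = 1.0,
                     probe: Callable[[str], bool] = http_ok) -> bool:
    """Poll server and model readiness until both pass or time runs out."""
    urls = [f"{base_url}/v2/health/ready", f"{base_url}/v2/models/{model}/ready"]
    deadline = time.monotonic() + timeout_s
    while True:
        if all(probe(url) for url in urls):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With the server from this step running, wait_until_ready("http://localhost:8000", "image_classification") should return True once the model has loaded. The probe parameter exists only so the polling logic can be exercised without a live server.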

Step 7: Send an Inference Request

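With the server running, you can query it over HTTP using Triton's Python client or plain curl. Here's a quick Python example using tritonclient (install it with pip install "tritonclient[http]"; the example also needs Pillow and torchvision):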

import tritonclient.http as httpclient
import numpy as np
from PIL import Image
from torchvision import transforms

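# Preprocess the image the same way as during fine-tuning. These are the standard
# ImageNet statistics; check them against the checkpoint's preprocessor config.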
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("eurosat_sample.jpg").convert("RGB")
tensor = transform(image).unsqueeze(0).numpy()  # shape: [1, 3, 224, 224]

client = httpclient.InferenceServerClient(url="localhost:8000")

inputs = [httpclient.InferInput("pixel_values", list(tensor.shape), "FP32")]
inputs[0].set_data_from_numpy(tensor)

outputs = [httpclient.InferRequestedOutput("logits")]

response = client.infer(model_name="image_classification", inputs=inputs, outputs=outputs)
logits = response.as_numpy("logits")

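# Class order must match the checkpoint's label mapping (id2label in config.json);
# the alphabetical order below is the common EuroSAT convention.
EUROSAT_CLASSES = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway",
    "Industrial", "Pasture", "PermanentCrop", "Residential",
    "River", "SeaLake"
]

# logits has shape [1, 10]; take the argmax of the first (and only) row
predicted_class = EUROSAT_CLASSES[int(logits[0].argmax())]
print(f"Predicted: {predicted_class}")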
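For environments where installing tritonclient is not an option, the same request can be sent as plain JSON to the /v2/models/<name>/infer endpoint, which is what a curl call would do. The sketch below builds that request with the standard library and turns the returned logits into probabilities. The helper names are hypothetical, and the tensor names pixel_values and logits are the ones assumed throughout this post.

```python
import json
import math
import urllib.request


def build_infer_body(pixels: list[float], shape: list[int]) -> bytes:
    """Encode one FP32 input tensor as a JSON inference request."""
    if math.prod(shape) != len(pixels):
        raise ValueError("shape does not match the number of values")
    body = {
        # "data" holds the tensor flattened in row-major order
        "inputs": [{"name": "pixel_values", "shape": shape,
                    "datatype": "FP32", "data": pixels}],
        "outputs": [{"name": "logits"}],
    }
    return json.dumps(body).encode("utf-8")


def softmax(logits: list[float]) -> list[float]:
    """Convert logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def infer(base_url: str, model: str, pixels: list[float], shape: list[int]) -> list[float]:
    """POST the request and return class probabilities for a single image."""
    request = urllib.request.Request(
        f"{base_url}/v2/models/{model}/infer",
        data=build_infer_body(pixels, shape),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        outputs = json.load(response)["outputs"]
    logits = next(o["data"] for o in outputs if o["name"] == "logits")
    return softmax(logits)
```

With the tensor from the example above, infer("http://localhost:8000", "image_classification", tensor.flatten().tolist(), list(tensor.shape)) returns ten probabilities. JSON encoding is convenient for debugging but bulky for large tensors, so the client library is likely the better choice for sustained traffic.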

Summary

Step	What Happens
Pull containers	Get PyTorch (conversion) and Triton (serving) images
Export to ONNX	Use `optimum-cli` to convert the HF model
Inspect ONNX	Confirm tensor names and shapes
Build model repo	Set up the `model_repository/<name>/1/model.onnx` structure
Write config	Define backend, inputs, outputs in `config.pbtxt`
Start server	`tritonserver --model-repository=/models`
Infer	Send requests via HTTP, gRPC, or `tritonclient`

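Triton handles the serving infrastructure, including health checks, metrics, and scheduling across model instances and GPUs, so your application code only needs to send requests and interpret logits. Note that the config in this post sets max_batch_size: 0, which leaves batching disabled. To use dynamic batching, set max_batch_size to a positive value, drop the batch dimension from dims, and add a dynamic_batching block to config.pbtxt.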
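The metrics on port 8002 are served in Prometheus text format at /metrics, so they can be checked without a full monitoring stack. The sketch below pulls out the samples for one metric, such as the nv_inference_request_success counter. The parser is deliberately simple and assumes label values contain no closing braces or escaped quotes; parse_metric and fetch_metrics are hypothetical helper names.

```python
import re
import urllib.request

# name{label="value",...} number   (the label block is optional)
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][\w:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metric(text: str, metric: str) -> list[tuple[dict[str, str], float]]:
    """Return (labels, value) pairs for every sample of the given metric."""
    samples = []
    for line in text.splitlines():
        if line.startswith("#"):  # skip HELP and TYPE comment lines
            continue
        match = _SAMPLE.match(line)
        if match and match.group("name") == metric:
            labels = dict(_LABEL.findall(match.group("labels") or ""))
            samples.append((labels, float(match.group("value"))))
    return samples


def fetch_metrics(url: str = "http://localhost:8002/metrics") -> str:
    """Download the raw metrics text from a running Triton server."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")
```

After sending a few requests, parse_metric(fetch_metrics(), "nv_inference_request_success") should show a per-model count that grows with each successful inference.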