A rocket ignition sequence takes about 6 seconds. Impressive engineering. Completely acceptable for a launch.
Your API cold start taking the same amount of time? That is a disaster.
Cold starts in serverless LLM deployments are one of the most underestimated problems in production AI. The first request to a scaled-down deployment stalls while the system pulls an image, loads a multi-gigabyte model, and initializes the GPU, leaving your user waiting long enough to notice.
Here is a breakdown of the problem at each layer, and the tools that actually help.
Why Cold Starts Hit LLMs Especially Hard
Standard serverless cold starts are measured in milliseconds to low seconds. LLM cold starts are different in scale:
- A container image for a GPU-based inference server can be 10–30GB
- A 7B parameter model in FP16 weighs ~14GB
- CUDA initialization and model warmup add another 10–60 seconds depending on hardware
Stack those up and you can easily hit 2–5 minutes from scale-zero to first token. That is not a cold start. That is a full reboot.
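The arithmetic is easy to sanity-check. Using the figures above plus assumed throughput numbers (the 100 MB/s registry and 500 MB/s load-path figures are illustrative, not benchmarks):

```python
# Back-of-envelope cold start budget for a 7B FP16 model.
# Throughput values are assumptions for illustration, not measurements.
image_gb = 20        # GPU inference image (10-30 GB range)
model_gb = 14        # 7B parameters in FP16
registry_mb_s = 100  # assumed registry pull throughput, MB/s
load_mb_s = 500      # assumed disk -> RAM -> VRAM effective throughput, MB/s
gpu_init_s = 30      # CUDA init + warmup, midpoint of the 10-60s range

pull_s = image_gb * 1024 / registry_mb_s
load_s = model_gb * 1024 / load_mb_s
total_s = pull_s + load_s + gpu_init_s
print(f"pull={pull_s:.0f}s load={load_s:.0f}s init={gpu_init_s}s "
      f"total={total_s / 60:.1f}min")
```

Even with generous assumptions, the image pull dominates, which is why it gets fixed first.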
There are three distinct phases where time gets lost, and each needs its own fix.
Phase 1: Container Image Download
The first bottleneck is pulling the container image itself. GPU inference images are large, and traditional container registries were not designed for this.
Use Cloud Object Storage Instead of a Registry
GCS and S3 offer significantly higher throughput than standard container registries, and they support parallel chunk downloads. For large GPU images, the difference is meaningful.
The pattern here is to store your image layers in object storage and use a snapshotter that can pull from there directly. This is especially effective when your compute is already in the same cloud region, giving you essentially local network speeds.
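The mechanics behind that throughput difference are worth seeing. The sketch below illustrates parallel ranged reads, the core trick object storage enables; the chunk size is a hypothetical tuning value, `bucket_get` stands in for any ranged-read callable (e.g. a thin wrapper over an S3 `GetObject` with a `Range` header), and a real snapshotter does this per image layer:

```python
import concurrent.futures

CHUNK = 64 * 1024 * 1024  # 64 MiB per ranged GET (hypothetical tuning value)

def chunk_ranges(total_size: int, chunk: int = CHUNK):
    """Split an object into inclusive byte ranges for parallel HTTP Range GETs."""
    return [(start, min(start + chunk, total_size) - 1)
            for start in range(0, total_size, chunk)]

def fetch(bucket_get, key: str, total_size: int, workers: int = 16) -> bytes:
    """Download one object by issuing many ranged reads concurrently.

    bucket_get(key, start, end) is any callable that returns bytes for
    the inclusive range [start, end].
    """
    ranges = chunk_ranges(total_size)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: bucket_get(key, *r), ranges)
    return b"".join(parts)
```

A single HTTP connection rarely saturates a cloud NIC; many concurrent ranged reads do, which is the whole advantage over a registry that serves each layer as one sequential stream.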
Lazy Pulling with Stargz-Snapshotter
The conventional approach pulls the entire image before starting the container. Lazy pulling flips this: the container starts running while the image is still being downloaded in the background, fetching chunks on demand as they are accessed.
Stargz-Snapshotter implements this for containerd. Your image needs to be in the eStargz format, but the conversion is straightforward and the startup time reduction is significant. In practice, containers can start in seconds even for images that would otherwise take minutes to pull.
```shell
# Convert an existing image to eStargz format
ctr-remote image optimize --oci IMAGE_REF OUTPUT_REF
```
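On the runtime side, containerd needs the stargz snapshotter registered as a proxy plugin. A typical registration in `/etc/containerd/config.toml` looks like this (the socket path shown is the snapshotter's default):

```toml
# /etc/containerd/config.toml
[proxy_plugins]
  [proxy_plugins.stargz]
    type = "snapshot"
    address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"
```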
Container Checkpoints
Rather than starting from an image every time, you can snapshot a running container (including its memory state) and restore from that snapshot on the next cold start. CRIU (Checkpoint/Restore In Userspace) makes this possible.
This is particularly powerful because you can checkpoint a container that has already completed initialization: CUDA context loaded, model in memory, server warmed up. The next cold start skips all of that and restores directly to a ready state.
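With a CRIU-enabled runtime this becomes a two-command workflow. The sketch below uses Podman's built-in checkpoint support; the container name and export path are hypothetical, and both commands require root and an installed CRIU:

```shell
# Snapshot a warmed-up container, including its in-memory state
sudo podman container checkpoint --export /ckpt/inference.tar.gz inference-server

# On the next cold start, restore directly to the ready state
sudo podman container restore --import /ckpt/inference.tar.gz
```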
Phase 2: Model Loading
Even once the container is running, loading the model takes time. A naive implementation copies the model from disk to CPU RAM and then transfers it to GPU memory. That is two full copies of a multi-gigabyte file.
Use the Right Model Format
The format your model is stored in has a direct impact on load time.
Safetensors (developed by Hugging Face) is designed for fast, safe loading. Unlike pickle-based PyTorch checkpoints, safetensors supports memory-mapped loading and does not execute arbitrary code during deserialization. For large models, the load time improvement over .bin files is noticeable.
```python
from safetensors.torch import load_file

# Direct memory-mapped load, no intermediate copy
state_dict = load_file("model.safetensors", device="cuda")
```
GGUF (the format used by llama.cpp and its ecosystem) is optimized for efficient loading and supports quantized weights natively. If you are running quantized models for inference, GGUF eliminates a lot of the overhead from format conversion at load time.
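GGUF files are also trivial to identify programmatically: the file opens with the ASCII magic `GGUF` followed by a little-endian uint32 format version. A minimal format check (not a full parser; the stub file below is a fabricated header for demonstration only):

```python
import os
import struct
import tempfile

def is_gguf(path: str) -> bool:
    """Return True if the file carries the GGUF magic and a plausible version."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return False
    (version,) = struct.unpack("<I", header[4:8])
    return version >= 1

# Demo with stub files (a real GGUF file continues with tensor metadata)
tmp = tempfile.mkdtemp()
demo = os.path.join(tmp, "stub.gguf")
with open(demo, "wb") as f:
    f.write(b"GGUF" + struct.pack("<I", 3))
bad = os.path.join(tmp, "not_gguf.bin")
with open(bad, "wb") as f:
    f.write(b"\x00\x01\x02\x03\x04\x05\x06\x07")
```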
Stream Directly to GPU Memory
The standard model loading path goes: storage → CPU RAM → GPU VRAM. That CPU RAM step is wasteful: you are moving gigabytes of data twice.
Run:ai's Model Streamer eliminates the intermediate copy by streaming model weights directly from object storage to GPU memory. It also parallelizes the transfer, saturating available bandwidth rather than loading sequentially.
For a 13B model on an A100, this can cut model loading time by 40–60% compared to the standard path.
```python
from runai_model_streamer import SafetensorsStreamer

with SafetensorsStreamer() as streamer:
    streamer.stream_file("s3://your-bucket/model.safetensors")
    # Tensors land directly in GPU memory as they stream in
    state_dict = {name: tensor for name, tensor in streamer.get_tensors()}
model.load_state_dict(state_dict)
```
Phase 3: GPU Initialization
CUDA initialization is expensive. The first time a GPU context is created, drivers are loaded, kernels are compiled, and the device is configured. For complex models with custom CUDA kernels (common in optimized inference servers), this can add tens of seconds.
CUDA Checkpointing with CRIU
CRIU's CUDA checkpoint support extends the container checkpoint idea specifically to GPU state. You can snapshot a process that has a fully initialized CUDA context (GPU memory included) and restore it later without going through initialization again.
CUDA Checkpoint from NVIDIA integrates with this workflow, allowing you to checkpoint and restore GPU processes. Combined with CRIU for the CPU side, you can effectively snapshot an entire inference server in a ready state and restore it in seconds.
```shell
# 1. Move CUDA state off the GPU into process memory
cuda-checkpoint --toggle --pid <inference_server_pid>
# 2. Checkpoint the full process (including the parked GPU state)
criu dump --tree <inference_server_pid> --images-dir /checkpoint/path

# Restore later: process first, then GPU state back onto the device
criu restore -d --images-dir /checkpoint/path
cuda-checkpoint --toggle --pid <inference_server_pid>
```
This is still a relatively new approach in production ML infrastructure, but for latency-critical deployments it is worth evaluating seriously.
Putting It Together
No single technique eliminates cold start by itself. The real gains come from stacking them:
| Layer | Technique | Typical Improvement |
|-------|-----------|---------------------|
| Image pull | Stargz lazy pulling | 60–80% reduction in pull time |
| Image pull | Object storage (GCS/S3) | 2–4x faster pull throughput |
| Model load | Safetensors + direct GPU stream | 40–60% faster load |
| GPU init | CUDA checkpoint + CRIU | Near-instant restore |
For most teams, the highest-leverage starting point is lazy pulling (big impact, low integration cost) and switching to safetensors (almost no cost if you are already on Hugging Face). The checkpoint approaches require more infrastructure investment but unlock the best possible cold start times.
Tools Referenced
- Stargz-Snapshotter: lazy image pulling for containerd
- Run:ai Model Streamer: direct GPU memory streaming
- CUDA Checkpoint: GPU process checkpointing
- CRIU: checkpoint/restore for Linux processes
Cold start will never be zero. But with the right stack, it can stop being the thing that defines your API's first impression.