The Physics of LLM Inference: From First Principles to Cluster Scale

AI/ML & Data Science Intermediate

Learn why latency spikes the moment concurrent users hit an endpoint, and how continuous batching, PagedAttention, and distributed serving keep throughput high. The session includes live demos of Managed Inference and fine-tune-to-deploy workflows on Crusoe Cloud, plus a dive into Crusoe MemoryAlloy, which reinvents KV caching for cluster-scale inference.

Many developers can download a model from Hugging Face, but few grasp why latency spikes the moment concurrent users hit an endpoint. This talk breaks down the physics of LLM inference: the compute-bound prefill phase versus the memory-bound decode phase, and the KV cache that sits at the center of both. By storing the keys and values of earlier tokens, the cache cuts the attention cost of generating each new token from O(n²) to O(n) in the sequence length n, but at long contexts and high concurrency it can consume more memory than the model weights themselves. We'll look at how continuous batching and PagedAttention address these bottlenecks on a single GPU, then scale the picture up with live demos on Crusoe Cloud: running inference with Crusoe Managed Inference, and fine-tuning a model before deploying it at scale. The talk closes with a look at Crusoe MemoryAlloy, which reinvents KV caching for cluster-scale inference.
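
To make the memory claim concrete, here is a rough back-of-the-envelope sketch of KV cache sizing. The function name and the model dimensions (32 layers, 32 KV heads, head dimension 128, fp16, roughly 7 billion parameters) are illustrative assumptions rather than figures from the talk, and the formula ignores allocator overhead, quantization, and grouped-query attention variants that shrink the number of KV heads.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper, not a library API).

    The leading factor of 2 accounts for storing one key tensor and one
    value tensor per layer; bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed, illustrative dimensions for a 7B-parameter-class model.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128
PARAMS = 7e9

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1, batch_size=1)
full_batch = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096, batch_size=32)
weights = PARAMS * 2  # fp16 weights, 2 bytes per parameter

print(f"KV cache per token:       {per_token / 2**20:.1f} MiB")   # 0.5 MiB
print(f"KV cache, 32 x 4096 toks: {full_batch / 2**30:.1f} GiB")  # 64.0 GiB
print(f"Model weights (fp16):     {weights / 2**30:.1f} GiB")
```

Under these assumptions a single token costs about half a mebibyte of cache, so 32 concurrent sequences of 4,096 tokens need roughly 64 GiB, several times the size of the fp16 weights. That gap is why serving capacity tends to be limited by cache memory rather than by the model itself.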