A guide to LLM inference

By: Pioneer Team

An overview of what inference is and what affects inference speed and performance.

Most discussion of building and deploying AI models centers on training, but inference is what runs on every single request. It is where latency, cost, and user experience are decided.

This guide covers what inference is, why encoder and decoder models behave differently, the metrics that define "fast," what slows inference down, and the concrete techniques to make it faster and cheaper.

What is LLM inference?

LLM inference is the process of running a trained model to generate output from an input. Training builds the model once; inference runs every time someone sends a prompt, which is where ongoing latency and cost accumulate.

For most chat-style models, inference happens in two phases:

  • Prefill: the model reads the entire prompt in parallel and builds the Key-Value (KV) cache. This phase is compute-bound.

  • Decode: the model generates the response one token at a time, reusing the KV cache at each step. This phase is memory-bandwidth-bound.

As the model processes tokens, it computes a key and a value vector for each one at every attention layer. The KV cache stores those vectors so the model does not have to recompute the entire context every time it generates a new token. It is the main reason generation stays fast, and, as you will see, one of the main consumers of memory during inference.

These two phases have different bottlenecks, and that difference is the root of nearly every optimization in this guide.
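To make the two phases concrete, here is a minimal pure-Python sketch of single-head attention with a KV cache. It is an illustration under simplifying assumptions, not how a production engine is written: `project`, `KVCache`, `prefill`, and `decode_step` are hypothetical names, the "learned" projections are stand-in scalings, and each token vector doubles as its own query.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def project(token_vec, factor):
    """Stand-in for a learned projection (assumption: it only scales the input)."""
    return [factor * x for x in token_vec]


class KVCache:
    """Stores one key and one value vector per token seen so far."""

    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # how many key/value pairs were actually computed

    def append(self, token_vec):
        self.keys.append(project(token_vec, 0.5))
        self.values.append(project(token_vec, 2.0))
        self.projections += 1


def prefill(cache, prompt_vecs):
    """Build the cache for the whole prompt (a GPU does this in parallel)."""
    for vec in prompt_vecs:
        cache.append(vec)


def decode_step(cache, new_vec):
    """Compute K/V for the new token only, then attend over the cached context."""
    cache.append(new_vec)
    return attend(new_vec, cache.keys, cache.values)


def decode_step_no_cache(all_vecs):
    """Same result, but every key and value is recomputed from scratch."""
    keys = [project(v, 0.5) for v in all_vecs]
    values = [project(v, 2.0) for v in all_vecs]
    return attend(all_vecs[-1], keys, values)


prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
prefill(cache, prompt)
new_token = [0.5, -0.5]
cached = decode_step(cache, new_token)
uncached = decode_step_no_cache(prompt + [new_token])
print(cached, uncached, cache.projections)
```

Both paths return the same attention output, but the cached path computes one new key/value pair per step instead of reprojecting the whole context. The price is that the cache itself has to sit in memory and be read at every step.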

Flow diagram showing decoder LLM inference: prompt to prefill (builds KV cache, compute-bound) to decode loop (reuses KV cache to generate tokens one at a time, memory-bandwidth-bound).


Encoder vs decoder models: not all inference is autoregressive

Inference works differently depending on the model type. Decoder models (most chat LLMs) generate text token by token, one step at a time. Encoder models (like BERT and the GLiNER family) read the whole input in a single forward pass and produce a result directly, with no decode loop and no growing KV cache. That makes encoder models dramatically faster for tasks like classification and entity extraction.

This matters more than it first appears. Many jobs people reach for an LLM to do, such as named entity recognition, classification, routing, PII detection, and guardrails, do not need open-ended text generation at all. An encoder model handles them in one pass.

The latency consequence is direct: decoder latency grows with output length and context length, while encoder inference stays closer to a fixed, single-pass cost. GLiNER is a concrete example. It is built on a bidirectional transformer encoder and frames entity recognition as a matching problem in a shared embedding space, all in a single forward pass. No autoregression, no token-by-token loop.

If your task is extraction or classification rather than generation, the fastest path is often not a faster decoder. It is the right encoder.

What is LLM latency?

LLM latency is the time between a request and its response, but it is not a single number. There are three metrics developers track, and they trade off against each other:

  • Time to first token (TTFT): how long until the first token appears. This drives perceived responsiveness. A chatbot generally needs TTFT under about 500ms to feel responsive, and a code-completion tool wants it closer to 100ms.

  • Time per output token (TPOT): the steady-state generation speed after the first token. A TPOT of 100ms per token works out to 10 tokens per second, faster than most people read.

  • Throughput: total tokens per second across all concurrent requests. This is the metric that drives cost at scale and matters most for batch jobs.

The key insight is that these pull in different directions. Interactive applications optimize TTFT and TPOT, while batch and offline workloads optimize throughput. The same model can look "fast" or "slow" depending on which metric you measure. For live, independently measured numbers across providers, Artificial Analysis benchmarks API endpoints continuously on speed, latency, and price.
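The three metrics are easy to compute once you record when each token arrives. The sketch below assumes you have a send timestamp and a list of per-token arrival timestamps (in seconds) for each request; `latency_metrics` and `throughput` are illustrative names, not part of any particular serving library.

```python
def latency_metrics(request_sent, token_times):
    """Return (ttft, tpot) in seconds for a single request.

    request_sent: timestamp when the request was sent.
    token_times:  arrival timestamps of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_sent
    if len(token_times) > 1:
        # Average gap between tokens after the first one.
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot


def throughput(requests):
    """Output tokens per second across all requests in the measurement window.

    requests: list of (request_sent, token_times) tuples.
    """
    start = min(sent for sent, _ in requests)
    end = max(times[-1] for _, times in requests)
    total_tokens = sum(len(times) for _, times in requests)
    return total_tokens / (end - start)


# Two overlapping requests with made-up timestamps.
req_a = (0.0, [0.4, 0.5, 0.6, 0.7, 0.8])
req_b = (0.2, [0.5, 1.0])
print(latency_metrics(*req_a))
print(throughput([req_a, req_b]))
```

Note that throughput is a property of the whole system over a time window, while TTFT and TPOT belong to an individual request, which is why they can move in opposite directions.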

What affects LLM inference speed?

Once generation starts, inference speed is governed mostly by memory bandwidth, not raw compute. During decode, the GPU spends more time moving weights and the KV cache in and out of memory than it spends doing math. The factors that matter most:

  • Model size and architecture: more parameters means more memory to move per token. Mixture-of-experts models help by activating only part of the network per token.

  • Output length: the single biggest driver of end-to-end latency, because every output token is a separate decode step.

  • KV cache size: it grows linearly with batch size multiplied by sequence length, can exceed the size of the model weights in memory, and caps how many requests you can run at once.

  • Batch size and concurrency: larger batches raise throughput but add queueing delay to individual requests, so per-request latency and total throughput trade off against each other.

  • Context length: longer prompts mean more prefill work and a bigger KV cache.

  • Cold starts: if a model scales to zero when idle, the first request has to wait for weights to load into GPU memory before any tokens stream.

  • Decoding strategy and attention design: the decoding strategy (greedy, sampling, or beam search) changes how much work each step does, and attention variants like multi-query and grouped-query attention shrink the KV cache, which reduces how much memory moves per token.

  • Serving stack: the GPU generation, the inference engine (vLLM, TensorRT-LLM, SGLang), and the scheduler all matter.

  • Reasoning tokens: reasoning models generate hidden "thinking" tokens before answering, which inflates end-to-end latency even when raw token speed is high.
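A back-of-the-envelope calculation makes the memory factors above tangible. The sketch below uses the standard KV cache sizing formula and a simple bandwidth ceiling on decode speed. The model shape, the 16 GB weight size, and the 2 TB/s bandwidth figure are hypothetical round numbers chosen for illustration, and the bound assumes a dense model that reads every weight once per decode step.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The factor of 2 covers one key and one value vector per token,
    per layer, per KV head. bytes_per_value=2 corresponds to FP16/BF16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def decode_steps_per_second_bound(weight_bytes, kv_bytes, bandwidth_bytes_per_s):
    """Upper bound on decode steps per second if memory bandwidth is the limit.

    Each step must read the weights and the current KV cache at least once,
    so steps/s cannot exceed bandwidth / bytes moved per step.
    """
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)


# Hypothetical 8B-parameter-class model in FP16 with grouped-query attention.
weights = 16e9  # roughly 8e9 parameters * 2 bytes
one_request = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=1)
full_batch = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=32)

print(one_request / 2**30, "GiB of KV cache for one 4096-token request")
print(full_batch / 2**30, "GiB of KV cache at batch size 32")
print(decode_steps_per_second_bound(weights, one_request, 2e12), "steps/s ceiling")
```

With these assumed numbers, a single request needs about half a gibibyte of cache, and a batch of 32 needs more memory for the cache than for the weights. Reducing `num_kv_heads` (as multi-query and grouped-query attention do) or `bytes_per_value` (as KV cache quantization does) shrinks the cache proportionally.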

Most "my model is slow" problems trace back to one of these causes. In the next section, we'll walk through common tactics to address them.

How to increase inference speed and reduce latency

Reducing LLM inference latency comes down to moving less data through memory and avoiding redundant computation. The main techniques, with realistic gains and trade-offs:

  • Continuous batching: instead of waiting for a full batch, the server slots new requests in as others finish, keeping the GPU busy. Anyscale measured up to 23x higher throughput over naive batched serving with this approach.

  • KV cache management and prompt caching: the PagedAttention algorithm reduces wasted KV cache memory, and caching shared or repeated prefixes avoids recomputing them, which is a large TTFT win for system prompts and RAG.

  • Quantization: dropping precision to FP8, INT8, or INT4 shrinks the model 2 to 8x depending on the baseline and target precision, and speeds up matrix multiplications at a small accuracy cost. Recent work pushes this further: Google's TurboQuant compresses the KV cache to 3 bits with near-zero accuracy loss, cutting memory roughly 6x.

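  • Speculative decoding: a small, fast draft model proposes several tokens and the large model verifies them in one pass. Reported speedups are typically 2 to 3x, and the output matches what the large model would produce on its own (identical under greedy decoding, identically distributed under sampling), because the large model replaces the first rejected token with its own choice and discards the rest of the draft. It is now a standard technique in production serving stacks like vLLM.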

  • Prefill/decode disaggregation: because prefill is compute-bound and decode is memory-bandwidth-bound, splitting them onto separate nodes lets each scale independently, with reported throughput gains around 2x. It runs in production at Meta and Hugging Face and is maturing in open-source engines.

  • Smaller, distilled, and encoder models: well-distilled 7B to 20B models handle a large share of single-turn tasks, and encoder models skip the decode loop entirely for classification and extraction.
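As an illustration of the speculative decoding item above, here is a toy greedy version. It is a sketch, not any engine's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the large and small models (each maps a token list to the next token), verification is written as a loop where a real engine would use one batched forward pass, and the bonus token that real implementations take when every draft token is accepted is omitted for brevity.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: the large model generates one token per forward pass."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; returns (new_tokens, target_passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        n = min(k, goal - len(tokens))
        # 1. The cheap draft model proposes n tokens autoregressively.
        draft = []
        for _ in range(n):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every proposal (one batched pass in practice).
        target_passes += 1
        for i in range(n):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                # Keep the accepted prefix, then use the target's own token.
                tokens.extend(draft[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(draft)  # every proposal was accepted
    return tokens[len(prompt):], target_passes


# Toy "models": deterministic functions of the context (assumption).
def target_next(ctx):
    return (3 * ctx[-1] + len(ctx)) % 11


def draft_next(ctx):
    guess = target_next(ctx)
    # The draft disagrees with the target on every fifth position.
    return (guess + 1) % 11 if len(ctx) % 5 == 0 else guess


spec_tokens, passes = speculative_decode(target_next, draft_next, [1, 2], 20)
print(spec_tokens == greedy_decode(target_next, [1, 2], 20), passes)
```

The output is the same sequence the target would have produced alone, but the target runs far fewer than 20 verification passes. The speedup in practice depends on how often the draft agrees with the target and on how cheap the draft is to run.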

A useful rule of thumb: for TTFT, look at prompt caching, disaggregation, and prefill optimization. For TPOT, look at quantization and speculative decoding. For throughput, look at continuous batching and right-sizing the model.
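To see why continuous batching helps throughput, the following toy simulation counts decode steps for the same workload under both scheduling policies. It is a deliberately simplified model: each request is represented only by its output length, every decode step is assumed to cost the same regardless of how many slots are filled, and prefill is ignored. The function names and the workload are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue right away."""
    queue = deque(output_lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots before the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each in-flight request emits one token; finished requests leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Two long requests mixed with six short ones, four slots available.
workload = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(workload, batch_size=4))
print(continuous_batching_steps(workload, batch_size=4))
```

Under static batching the short requests finish early and their slots sit idle until the long request completes. Continuous batching reuses those slots, so the second long request starts almost immediately instead of waiting for the first batch to drain.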

Most of these are things you either build and operate yourself or get from your provider, which brings us to cost.

How to reduce LLM inference costs

The most effective way to reduce LLM inference costs is to use the smallest model that still passes your evals, then stack caching, batching, and smart routing on top. Provider analyses report that combining these layers can cut inference bills by 60 to 80%. The levers, roughly by impact:

  • Right-size the model and route: send simple tasks (classification, extraction, intent detection) to small models, and reserve frontier models for genuinely hard work. For mixed workloads this is one of the largest cost levers available, since the price gap between a small model and a frontier one runs to roughly 100x per token, and it compounds with the caching and batching below.

  • Prompt and response caching: cache repeated contexts and identical requests so you stop paying to regenerate the same answers.

  • Batching: higher GPU utilization means a lower effective cost per request.

  • Quantization: a smaller memory footprint runs on cheaper or fewer GPUs.

  • Self-host versus API: at high, steady volume, self-hosting can be cheaper, but the economics are set by GPU utilization, not a single token count. A GPU billed by the hour only pays off if you keep it busy: at low utilization the effective cost per token can run roughly 10x higher than a managed API, while a fully saturated GPU at very high volume can be several times cheaper. The crossover depends on your traffic and which API you compare against (CloudZero tracks current per-token API prices), so model it on your own numbers rather than a headline figure. At low or spiky volume, managed APIs usually win on both cost and operational overhead.
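The utilization argument in the self-host item can be modeled in a few lines. This is a rough sketch under stated assumptions: the GPU price, peak throughput, and API price below are hypothetical placeholders, and the model ignores engineering time, redundancy, and the difference between input and output token pricing.

```python
def self_host_cost_per_million_tokens(gpu_cost_per_hour, peak_tokens_per_second,
                                      utilization):
    """Effective cost per million tokens for an hourly-billed GPU.

    utilization: fraction of peak throughput actually served, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def breakeven_utilization(gpu_cost_per_hour, peak_tokens_per_second,
                          api_price_per_million):
    """Utilization above which self-hosting is cheaper per token than the API."""
    return (gpu_cost_per_hour * 1_000_000
            / (peak_tokens_per_second * 3600 * api_price_per_million))


# Hypothetical inputs: replace with your own measurements and quotes.
GPU_COST = 4.00      # dollars per GPU-hour
PEAK_TPS = 2000      # tokens per second at full saturation
API_PRICE = 1.00     # dollars per million tokens

for utilization in (0.05, 0.25, 1.0):
    cost = self_host_cost_per_million_tokens(GPU_COST, PEAK_TPS, utilization)
    print(f"utilization {utilization:.0%}: ${cost:.2f} per million tokens")
print(f"breakeven utilization: {breakeven_utilization(GPU_COST, PEAK_TPS, API_PRICE):.0%}")
```

Because cost per token is inversely proportional to utilization, the same GPU can land on either side of the API price. Plugging in your own traffic profile matters more than the specific figures used here.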

One practical caution: measure each layer separately. Stacking three optimizations at once feels productive, but if you cannot attribute the savings, you will not know which lever to keep tuning.

A different approach: Adaptive Inference with Pioneer

Most teams face a binary choice. Pay frontier prices to a closed API, and absorb the latency, the spend, and the reality that your traffic feeds someone else's training set. Or self-host an open model, and own the GPUs, autoscaling, evals, observability, and a retraining pipeline you probably do not have time to build. Either way, the model ships on day one and stays exactly the same. It never learns from your traffic.

Pioneer, our inference API, is built to remove both problems. It gives you neutral access to leading closed-source, open-source, small, and large models behind one endpoint, and it is drop-in: change the base URL on your existing OpenAI or Anthropic client and you are running.

Behind that endpoint is Adaptive Inference. Pioneer clusters your production traffic by task, identifies and diagnoses failures, and fine-tunes a fleet of small open-source models in the background. When it finds one that is cheaper and better for your specific use case, it alerts you, and you decide when to route traffic to it. Your weights and training data stay yours, and the optimization is included: you pay for inference, and the improvement comes with it.

In other words, the levers in this guide (right-sizing, distillation, and routing) are not a one-time setup. Pioneer runs them continuously on your real traffic, which is what keeps the model getting cheaper and more accurate over time.


Frequently asked questions

What is LLM inference in simple terms?
Inference is running a trained model to answer a prompt. Training teaches the model once; inference is the model using what it learned, every time a user sends a request.

What is the difference between training and inference?
Training builds the model by adjusting its weights on large datasets, and happens once (or periodically). Inference uses the finished model to generate output for real requests. Training is a fixed, upfront cost; inference is an ongoing, per-request cost.

What is a good TTFT for a chatbot?
For conversational interfaces, a time to first token under about 500ms generally feels responsive. Latency-sensitive uses like code completion aim closer to 100ms. Perceived speed depends more on TTFT than on total generation time.

What is the difference between latency and throughput?
Latency is how fast a single request responds, measured by time to first token and time per output token. Throughput is how many tokens the system produces per second across all concurrent requests. Interactive apps optimize latency; batch jobs optimize throughput.

Are encoder models faster than decoder models?
For classification and extraction tasks, usually yes. Encoder models process the input in one forward pass with no token-by-token generation and no growing KV cache, so their latency stays closer to a fixed cost. Decoder models are slower because they generate output autoregressively.

How can I make my LLM faster without losing accuracy?
Speculative decoding gives 2 to 3x speedups with identical output, and prompt caching cuts time to first token on repeated contexts with no quality change. Routing simple tasks to smaller or encoder models also speeds things up without affecting the quality of harder tasks.

Is it cheaper to self-host or use an inference API?
At high, steady volume, self-hosting can be cheaper but requires you to run the infrastructure. At low or variable volume, managed APIs are usually cheaper once you account for idle GPUs and engineering time. The crossover point depends on your utilization.