The request is the wrong unit of scale for LLMs on Kubernetes

In a companion post I wrote about building a whole LLM serving platform on Kubernetes. This one is about the single assumption that platform has to unlearn first, because it is the one I would have got wrong coming straight from web-app instincts: that a request is a useful unit of work.

For a normal service it usually is. One request hits an API, does some bounded work, maybe touches a database, returns JSON, and ends. Requests are similar enough to each other that requests-per-second is a decent proxy for load. LLM serving breaks that quietly, and the failure mode is nasty precisely because your old dashboard stays calm while it happens.

Picture it: requests per second are flat. CPU looks fine. Memory looks normal. The HPA is asleep. And yet time to first token is drifting up, GPU memory pressure is climbing, the queue is growing, and users are saying the model is "thinking forever." Nothing at the HTTP layer moved. The traffic did not get heavier in requests. It got heavier in tokens.

Tokens are the work

Kubernetes sees one request. Your ingress sees one request. Your gateway sees one request. The GPU sees something completely different: prefill work, decode work, KV cache growth, memory pressure, and time spent generating output one token at a time.

One request might carry a 20-token question and produce a 50-token answer. Another might carry a long system prompt, full chat history, six retrieved documents, tool output, and a user asking for a 4,000-token report. Both are one HTTP request. They are not remotely the same workload.

This is why a deployment can look stable at the HTTP layer while the model server is genuinely struggling. The API did not get more traffic. The work inside each request got bigger, and requests-per-second cannot see inside the envelope.

Input and output tokens are two different problems

It is tempting to lump all tokens into one bucket. That is a fine start, but input and output stress the system in different ways, and the difference maps onto the two phases of inference.

Prefill processes the input prompt — system prompt, instructions, chat history, retrieved documents, tool results, the user message, and whatever formatting your app adds. Decode generates the response one token at a time: predict a token, append it, use the longer sequence to predict the next, repeat until a stop condition.

A way to remember it that has stuck with me:

  • Input tokens decide how heavy it is to start answering. Long prompts mean the model has to chew through the whole input before the first output token appears, which hurts time to first token.
  • Output tokens decide how long the model stays busy. Long responses keep the model in the decode loop, occupying the GPU the whole time.

Streaming makes long output feel better because the user watches it arrive, but it does not remove the backend work. The GPU is still busy for every one of those tokens. This is exactly why serious serving metrics talk about time to first token, time per output token, and tokens per second rather than a single "latency" number — both NVIDIA's and Databricks' published serving benchmarks split latency that way. A normal API request is one operation. An LLM request is a sequence of token work.
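To make those three numbers concrete, here is a minimal sketch of deriving them from the arrival time of each streamed token. The `StreamTiming` and `stream_timing` names are mine, and the timestamps are made up for illustration; a real client would record them as chunks arrive.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    ttft_s: float        # time to first token
    tpot_s: float        # mean time per output token after the first
    tokens_per_s: float  # output tokens divided by end-to-end time


def stream_timing(request_start: float, token_times: list[float]) -> StreamTiming:
    """Derive latency metrics from the arrival time of each output token."""
    if not token_times:
        raise ValueError("no output tokens were produced")
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    # Decode time is spread over the tokens that follow the first one.
    later = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / later if later else 0.0
    throughput = len(token_times) / total if total > 0 else 0.0
    return StreamTiming(ttft, tpot, throughput)


# Hypothetical stream: first token after 0.5 s, then one token every 0.25 s.
timing = stream_timing(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(timing)
```

TTFT here covers everything before the first token, queueing included, which is why a long prompt and a busy queue both show up in the same number.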

One request can hide an enormous prompt

Here is the trap that catches teams in production: the user does not send the real prompt. The application builds it.

A user types twelve tokens:

What is our refund policy for enterprise customers?

By the time your app sends the request to the model, the prompt might be:

system prompt:            700 tokens
developer instructions:   400 tokens
chat history:           1,500 tokens
retrieved policy docs:  6,000 tokens
citations + metadata:     600 tokens
user question:             12 tokens
formatting:               300 tokens
-------------------------------------
total input:           ~9,500 tokens

The user sent one short question. The model received roughly 9,500 input tokens before it generated anything. Teams measure the user-message size and miss the assembled-prompt size, and the gap is where the surprises live.

Retrieval makes this sharper. Bumping RAG top_k from 4 chunks to 12 looks like a harmless relevance tweak — no manifest changed, no model changed, request count identical, and the answers might even get better. But now every request carries thousands of extra input tokens, which moves TTFT, GPU memory pressure, KV cache usage, batch composition, and cost per interaction. That is why prompt assembly itself needs observability: not just "this request had 9,500 input tokens," but where they came from — history, retrieval, tool results, system instructions, or an agent loop that appends every intermediate step. Without that breakdown, token growth stays invisible until latency tells you about it.
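A breakdown like that does not need much machinery. The sketch below assumes the application already knows the token count of each prompt part (counted with the model's own tokenizer at assembly time) and only ranks the sources by share. The `prompt_breakdown` helper and the source labels are hypothetical; the counts are the ones from the refund-policy example above.

```python
def prompt_breakdown(parts: dict[str, int]) -> dict[str, float]:
    """Return each source's share of the assembled prompt, largest first."""
    total = sum(parts.values())
    if total == 0:
        return {}
    ranked = sorted(parts.items(), key=lambda kv: kv[1], reverse=True)
    return {source: count / total for source, count in ranked}


# Token counts per source, taken from the example prompt above.
assembled = {
    "system": 700,
    "developer": 400,
    "history": 1_500,
    "retrieval": 6_000,
    "citations": 600,
    "user": 12,
    "formatting": 300,
}

total_tokens = sum(assembled.values())
shares = prompt_breakdown(assembled)
for source, share in shares.items():
    print(f"{source:<12}{assembled[source]:>6} tokens  {share:6.1%}")
print(f"{'total':<12}{total_tokens:>6} tokens")
```

Emitting these per-source counts alongside the request is what lets you answer "which change moved the number" later, instead of only seeing that the total went up.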

A context window is a limit, not a target

Long-context models are useful — bigger documents, longer conversations, richer workflows. But "the model supports 128k context" does not mean "send 128k context casually." That is like saying a node has 1 TB of memory so every process should try to use it.

Long context changes the shape of serving. A handful of long-context requests can consume enough GPU memory and serving time to degrade everyone else. A chat session gets more expensive as its history grows. An agent quietly appends tool traces until each turn is heavier than the last. A summarize feature drifts from "summarize this page" to "summarize this folder" without the request count moving at all.

The fix is a habit, not a number: bucket prompts by size — short, medium, long, very long, batch — and watch the buckets. A 500-token chat and a 50,000-token document analysis should not be treated as the same class of work just because both arrived through /v1/chat/completions.
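As a rough sketch of that habit, a bucketing function can be a sorted list of boundaries and a binary search. The boundaries below are illustrative assumptions, not recommendations; pick yours from your own distribution. "Batch" is left out because it describes how the work arrives rather than how big it is, so I would tag it from the caller instead.

```python
import bisect
from collections import Counter

# Illustrative boundaries (assumptions): a bucket ends just below each value.
BOUNDS = [1_000, 8_000, 32_000]
LABELS = ["short", "medium", "long", "very_long"]


def size_bucket(input_tokens: int) -> str:
    """Map an assembled-prompt token count to a size bucket label."""
    return LABELS[bisect.bisect_right(BOUNDS, input_tokens)]


# Hypothetical input-token counts observed over some window.
observed = [120, 500, 950, 4_200, 9_512, 50_000]
counts = Counter(size_bucket(n) for n in observed)
print(dict(counts))
```

The label then becomes a dimension on every latency metric, so a 500-token chat and a 50,000-token analysis stop sharing a line on the chart.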

Output length is a capacity control, not a UX detail

Input tokens get the blame because long prompts are easy to see. Output tokens matter just as much for capacity. Two requests with the same input can cost wildly different amounts:

Request A:  1,000 input  +   100 output
Request B:  1,000 input  + 2,000 output

Same route, same prompt size, very different serving time — B keeps the model decoding far longer, GPU occupied the whole time, streaming or not. This is why max_tokens is not just a product parameter; it is a capacity control. If every request may generate 4,000 tokens, you have accepted a worst-case capacity problem even when most responses are short.

Track both the requested max output tokens (the risk you accepted) and the actual generated tokens (the work you did). If many requests hit the cap, users are getting truncated answers. If almost none approach it, your default is too generous. Output length is not formatting — it is how long each request rents the GPU.
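Here is one way to turn that into numbers, as a sketch under assumptions: each request is reduced to a (requested max, actually generated) pair, and the 90% "near the cap" threshold is an arbitrary choice of mine. If your API returns a finish reason, a length-type finish reason is probably a more direct truncation signal than comparing counts.

```python
def output_cap_stats(pairs: list[tuple[int, int]], near: float = 0.9) -> dict[str, float]:
    """pairs holds (requested_max_tokens, generated_tokens) per request."""
    if not pairs:
        raise ValueError("need at least one request")
    n = len(pairs)
    hit = sum(used >= cap for cap, used in pairs)
    close = sum(used >= near * cap for cap, used in pairs)
    return {
        "hit_cap": hit / n,    # share of requests that ran into the cap
        "near_cap": close / n,  # share that used at least `near` of the cap
        # Generated tokens as a fraction of the output budget that was allowed.
        "utilisation": sum(u for _, u in pairs) / sum(c for c, _ in pairs),
    }


# Hypothetical sample: every request was allowed 4,000 output tokens.
stats = output_cap_stats([(4_000, 120), (4_000, 4_000), (4_000, 300), (4_000, 3_800)])
print(stats)
```

A high hit rate points at truncated answers; a very low utilisation points at a default that is more generous than the workload needs.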

Same request count, completely different load

This is the comparison I would put on the wall:

               Window A     Window B
requests          1,000        1,000
avg input           500        8,000
avg output          150        1,000
------------------------------------
total tokens       650k       9,000k

Both windows show 1,000 requests. The second has roughly 14x the token volume. A request-count dashboard reports "traffic is flat." A token dashboard reports "the workload changed completely." The useful question is not only "how many requests are we serving?" but also "how many input tokens are arriving, how many output tokens are we generating, and where do they come from?"
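The arithmetic behind that comparison is short enough to write down. This is only the worked calculation, with a hypothetical `token_volume` helper, using the averages from the two windows above.

```python
def token_volume(requests: int, avg_input: int, avg_output: int) -> int:
    """Total tokens processed in a window: requests x (input + output)."""
    return requests * (avg_input + avg_output)


window_a = token_volume(1_000, 500, 150)
window_b = token_volume(1_000, 8_000, 1_000)
ratio = window_b / window_a

print(f"A: {window_a:,} tokens, B: {window_b:,} tokens, ratio: {ratio:.1f}x")
```

Adding input and output together undersells the difference in one respect: they load the system differently, so it is worth keeping the two sums separate on a real dashboard.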

What Kubernetes sees vs. what the model server feels

Kubernetes is excellent at containers: scheduling, restarts, resource limits, rollouts, pinning workloads to GPU nodes. What it does not do is understand the shape of an LLM request. A pod can be healthy while the model server is drowning. CPU can look boring while GPU memory is the real ceiling. Generic memory can look fine while the KV cache is under pressure. Request count can look flat while token volume has exploded.

That is the division of labour: Kubernetes is the orchestration layer, the model server is the execution layer, the application builds the prompt, and the platform team has to connect the signals. If those layers do not share token-level metrics, you scale the wrong thing — and CPU-based HPA is the classic wrong thing.

The encouraging part is that the model servers already expose the right signals. vLLM, for instance, publishes Prometheus metrics for prompt tokens, generation tokens, time to first token, time per output token, queue time, prefill and decode time, KV cache usage, and running vs. waiting requests. The production surface of LLM serving is already token-aware. Your dashboard should be too.
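To show how little is needed to read those signals, here is a sketch that parses a Prometheus text-format payload and pulls out a few token-level values. The payload is a made-up sample, not real server output, and the metric names are the ones I understand vLLM to use; they have changed between versions, so check them against the /metrics endpoint of the build you actually run. The parser is deliberately minimal and assumes samples carry no trailing timestamp.

```python
# Hypothetical payload in Prometheus text exposition format (values invented).
SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 7.0
vllm:num_requests_running{model_name="demo"} 12.0
vllm:prompt_tokens_total{model_name="demo"} 950000.0
vllm:generation_tokens_total{model_name="demo"} 120000.0
"""


def parse_metrics(text: str) -> dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comments."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamp assumed).
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        values[name] = values.get(name, 0.0) + float(value)
    return values


metrics = parse_metrics(SAMPLE)
prompt = metrics["vllm:prompt_tokens_total"]
generated = metrics["vllm:generation_tokens_total"]
input_share = prompt / (prompt + generated)
print(f"waiting={metrics['vllm:num_requests_waiting']:.0f} input share={input_share:.1%}")
```

In practice you would let Prometheus scrape the endpoint and compute rates there; the point is only that the token counters are already sitting next to the request counters.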

Token-based observability is the actual job

Request count still matters at the API boundary — auth, rate limits, logs, tracing, errors. But it needs token context. At minimum, capture per request:

input / output / total tokens
requested max output tokens
time to first token
time per output token
end-to-end latency, queue time
model, version, deployment
tenant, route / feature, finish reason

And where you can, the prompt breakdown — system, history, retrieved context, tool results, user message — because that is where the production surprises hide.
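One possible shape for that per-request record is sketched below. Every field name is my own choice rather than any standard schema, version and deployment are left out for brevity, and the values in the example are invented. Time per output token is derived rather than stored, on the assumption that end-to-end latency and TTFT are measured from the same start.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RequestRecord:
    model: str
    tenant: str
    route: str
    input_tokens: int
    output_tokens: int
    max_output_tokens: int
    ttft_s: float
    e2e_latency_s: float
    queue_time_s: float
    finish_reason: str
    # Optional breakdown of input tokens by source (system, history, ...).
    prompt_sources: dict[str, int] = field(default_factory=dict)

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    @property
    def tpot_s(self) -> float:
        """Mean time per output token after the first one."""
        later = self.output_tokens - 1
        return (self.e2e_latency_s - self.ttft_s) / later if later > 0 else 0.0

    def to_log_line(self) -> str:
        data = {**asdict(self), "total_tokens": self.total_tokens, "tpot_s": self.tpot_s}
        return json.dumps(data, sort_keys=True)


# Hypothetical request, reusing the 9,512-token prompt from earlier.
record = RequestRecord(
    model="demo-model", tenant="acme", route="/v1/chat/completions",
    input_tokens=9_512, output_tokens=101, max_output_tokens=4_000,
    ttft_s=1.5, e2e_latency_s=6.5, queue_time_s=0.25, finish_reason="stop",
    prompt_sources={"retrieval": 6_000, "history": 1_500},
)
print(record.to_log_line())
```

One structured line per request is enough to build every chart in the list that follows, as long as the token fields are there from the start.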

For dashboards, averages lie. Average token count stays calm while the tail gets ugly, so watch p50/p95/p99 for both input and output, and slice latency by token bucket:

input  token p50 / p95 / p99
output token p50 / p95 / p99
TTFT by input-token bucket
TPOT by output-token bucket
queue time by token bucket
% of requests near the context limit
% of requests hitting the output cap
retrieved-context tokens per request
KV cache usage over time
waiting requests AND waiting-token estimate

That last line matters most. Do not only ask how many requests are waiting — ask how many tokens are waiting. A queue of 20 short chats and a queue of 20 long document analyses are not the same queue, and treating them the same is how you mis-size everything downstream.
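A waiting-token estimate can be crude and still be useful. The sketch below assumes the queue knows each request's input token count and requested output cap, and guesses that a typical response uses a quarter of its cap. That fraction is an assumption to replace with your own measured ratio, and the `QueuedRequest` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueuedRequest:
    input_tokens: int
    max_output_tokens: int


def waiting_tokens(queue: list[QueuedRequest], expected_output_fraction: float = 0.25) -> int:
    """Estimate queued work: known prefill tokens plus expected decode tokens."""
    prefill = sum(r.input_tokens for r in queue)
    decode = sum(r.max_output_tokens * expected_output_fraction for r in queue)
    return prefill + round(decode)


# Two queues with the same depth and very different amounts of work.
short_chats = [QueuedRequest(500, 400) for _ in range(20)]
long_docs = [QueuedRequest(50_000, 4_000) for _ in range(20)]

print(len(short_chats), waiting_tokens(short_chats))
print(len(long_docs), waiting_tokens(long_docs))
```

Both queues report a depth of 20, while the estimate differs by almost two orders of magnitude, which is the number a sizing decision should be looking at.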

Product changes are infrastructure changes

The uncomfortable truth of running these platforms: a product change can become an infrastructure change overnight. Adding a field to a web-app response rarely matters. Adding context to a prompt changes capacity. Increasing retrieval depth changes latency. Keeping longer history changes memory pressure. Allowing longer outputs changes GPU occupancy.

The product team says "we only changed the prompt." The platform team hears "we changed the workload." Both are right. This is not an argument against improving prompts — tokens are the product. It is an argument for making token impact visible before and after every change. A new prompt that improves quality but triples average input tokens might be a great trade; it should just be a conscious one, not a latency mystery discovered in production.

Practical rules

If you are starting to serve LLMs on Kubernetes:

  • Measure input and output tokens on every request, from day one. Do not wait for the first incident.
  • Track the assembled prompt, not the user message. The model does not care what the user typed; it cares what you sent.
  • Break input down by source — system, history, retrieval, tools, user — so you can see which change moved the number.
  • Separate requested vs. actual output tokens. One is accepted risk, the other is real work.
  • Use token buckets in latency dashboards. A single p95 line that mixes tiny chats and huge documents hides which class of work is actually slow.
  • Budget RAG retrieval. top_k is a capacity knob, not only a relevance knob.
  • Treat context windows as limits. Set sane output defaults; make long answers intentional.
  • Separate workload classes — interactive chat, long RAG, report generation, agent loops, and batch are different shapes.

None of this makes the system slower or less capable. It makes it understandable, and you cannot operate what you cannot measure.
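The retrieval-budget rule is the easiest of these to sketch in code. The version below assumes chunks arrive ranked best first with a known token count each, and fills a fixed token budget instead of taking a fixed top_k. The `select_chunks` name, the chunk ids, and the 2,000-token budget are all illustrative.

```python
def select_chunks(ranked: list[tuple[str, int]], budget: int) -> tuple[list[str], int]:
    """Pick chunks in rank order until the token budget is spent.

    ranked holds (chunk_id, token_count) pairs, best match first.
    Returns the chosen ids and the tokens they use.
    """
    chosen: list[str] = []
    used = 0
    for chunk_id, tokens in ranked:
        if used + tokens > budget:
            continue  # too big for what is left; a smaller chunk may still fit
        chosen.append(chunk_id)
        used += tokens
    return chosen, used


# Hypothetical retrieval result, ranked by relevance.
ranked = [("a", 500), ("b", 700), ("c", 900), ("d", 300), ("e", 600)]
chosen, used = select_chunks(ranked, budget=2_000)
print(chosen, used)
```

Skipping an oversized chunk and continuing is a design choice; stopping at the first chunk that does not fit keeps strict rank order at the cost of leaving some budget unused. Either way, the input size per request is now bounded by a number someone chose on purpose.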

The real unit of scale

The request stays useful at the boundary — you need it for auth, rate limits, logs, tracing, and user flows. It just cannot tell you how much prompt the model processed, how long it generated, how much KV cache it needed, or whether the work was a short chat or a long agent loop.

Tokens get you closer to the truth. Input tokens explain the work before the first response appears. Output tokens explain how long the model keeps going. Distributions explain why averages lie. Sources explain which product change moved the workload. Token-aware metrics explain why your cluster looks healthy while users still feel latency.

You are not really scaling requests. You are scaling token work across expensive, memory-constrained, latency-sensitive GPUs. Once you see the platform that way, the rest of it — the router, the autoscaler, the cost model — starts to make sense.