Building a production LLM platform on Kubernetes

I have run Kubernetes in production before, but not for this. At RIDE Capital I ran ASP.NET Core microservices on EKS; at Kfzteile24 I wrote the terragrunt stacks that provisioned the clusters they sat on. That was stateless HTTP and the occasional queue worker, the kind of workload Kubernetes was built to schedule.

Serving large language models is a different animal, and most of the surprises are not where you expect them. You can stand up a cluster, attach a GPU node, deploy a model server, and have something answering on /v1/chat/completions in an afternoon. Then it falls over in ways your old dashboards cannot see, because the assumptions that make Kubernetes good at web apps quietly stop being true.

This is how I would build a production LLM serving platform today, and why each piece is there. It is vendor-neutral on purpose. I will use a concrete example throughout: a security assistant that reads alerts from Wazuh and explains them. But the architecture holds for any internal LLM API.

What "production-ready" actually means here

A production LLM platform is not one that never fails. It is one that is:

Predictable under load — a heavy request cannot quietly starve everyone else.
Observable at the token level — not just requests per second, but input tokens, output tokens, time to first token, and GPU memory pressure.
GPU-efficient — your most expensive resource is never idle when there is work, and never thrashing when there is too much.
Secure by default — tenant isolation, audit logs, and secrets handled properly from the first deploy, not bolted on later.
Operable — you can update a model, roll back a bad deploy, and explain a latency spike without guessing.

If you cannot do those five things, you have a demo, not a platform. Everything below is in service of one of them.

The architecture, in one diagram

Here is the shape of the whole thing. Read it top to bottom; each layer adds something Kubernetes alone will not give you.

Apps / internal tools
        |
        v
API gateway          auth, rate limits, request validation, TLS
        |
        v
LLM router           model selection, token estimation, queue-aware routing
        |
        v
Model servers        vLLM / TGI / TensorRT-LLM  (OpenAI-compatible)
        |
        v
GPU node pool        NVIDIA GPUs, DCGM exporter, local model cache
        |
        v
Observability        Prometheus, Grafana, Loki, OpenTelemetry
   +  Security        audit logs, network policies, SIEM
        |
        v
Storage              object store + model registry, Redis, PostgreSQL

The core idea: Kubernetes runs the platform, but your routing, metrics, autoscaling, and security layers have to understand LLM behavior. A generic K8s setup treats every pod and every request as roughly interchangeable. LLM serving punishes that assumption.

Decide the product before you pick the tools

The fastest way to build a generic, unremarkable platform is to start from the infrastructure. Start from the product instead, because the product decides your context window limits, your latency targets, and your data-residency story.

For the example I will carry through this piece, the product is narrow on purpose:

An AI security analyst that connects to Wazuh, reads alerts, explains them in plain language, maps them to MITRE ATT&CK, and suggests the next investigation step.

That is a far better starting point than "a platform that hosts LLMs," because it tells you exactly what to build: an OpenAI-compatible chat API, a retrieval layer over security runbooks, per-tenant isolation, and EU-friendly data handling. Pick your own narrow product. The architecture is the same; the constraints are what make it real.

The MVP: one model, one GPU, one server

Do not start with a 70B model, multi-node tensor parallelism, and a prefill/decode split. Start with one cluster, one GPU node pool, one model, one inference engine, and one monitoring stack. You will learn more from one model under real traffic than from a clever topology under none.

For the model, a 7B-8B instruct model (Llama 3.1 8B, Mistral 7B, Qwen2.5 7B) on a single L4 or A10G is plenty to prove the platform. For the server, I would reach for vLLM: it speaks the OpenAI API, does continuous batching, and manages the KV cache with PagedAttention, which is most of what makes serving efficient.
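
To see why an 8B model fits a 24 GB card yet leaves limited room for concurrent long requests, here is a rough back-of-envelope sketch. The layer count, KV head count, and head dimension are the published Llama 3.1 8B configuration. The 24 GB card, the 16-bit weights and cache, and the decision to ignore activations and runtime overhead are simplifying assumptions, so treat the result as an optimistic upper bound rather than a measurement.

```python
# Back-of-envelope KV cache budget for a single GPU.
# Every input here is an assumption to replace with your own model and hardware.

GIB = 1024 ** 3

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_budget_tokens(gpu_bytes: float, utilization: float, weight_bytes: float, per_token: int) -> int:
    """Tokens of KV cache that fit once the weights are loaded (ignores activations)."""
    usable = gpu_bytes * utilization - weight_bytes
    return max(0, int(usable // per_token))

per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)  # Llama 3.1 8B config
weights = 8e9 * 2  # roughly 8 billion parameters at 2 bytes each
budget = kv_budget_tokens(
    gpu_bytes=24 * GIB,      # assumed 24 GB card
    utilization=0.90,        # matches --gpu-memory-utilization in the deployment
    weight_bytes=weights,
    per_token=per_token,
)

print(f"KV cache per token: {per_token // 1024} KiB")
print(f"Approximate KV budget: {budget} tokens")
print(f"Full 8192-token sequences that fit at once: {budget // 8192}")
```

The takeaway is the shape of the constraint, not the exact figure: concurrency on one GPU is bounded by tokens in flight, which is why the later sections count tokens rather than requests.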

A first deployment looks like this. Note the probes — model loading is slow, so the startup probe has a long failure budget while the readiness probe gates traffic until the weights are actually in GPU memory.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-8b
  namespace: llm-serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: llama-8b }
  template:
    metadata:
      labels: { app: llama-8b, model: llama-8b }
    spec:
      nodeSelector: { workload: llm }
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # fine for a first test; pin an exact version tag in production
          args:
            - --model=meta-llama/Llama-3.1-8B-Instruct
            - --max-model-len=8192
            - --gpu-memory-utilization=0.90
          ports: [{ containerPort: 8000 }]
          resources:
            limits: { nvidia.com/gpu: 1 }
            requests: { cpu: "4", memory: 16Gi }
          startupProbe:        # model load can take minutes
            httpGet: { path: /health, port: 8000 }
            failureThreshold: 60
            periodSeconds: 10
          readinessProbe:      # no traffic until weights are loaded
            httpGet: { path: /health, port: 8000 }
            periodSeconds: 10

The GPU nodes get a taint so ordinary workloads do not land on your expensive hardware, and a label so the model pods can select them. That separation matters more than it looks: a stray CPU pod scheduled onto a GPU node is money on fire.

The router Kubernetes will not give you

This is the component people skip, and it is the one that earns its keep first.

If you point your gateway straight at the model pods with plain round-robin, you are treating a 20-token chat and a 9,000-token document analysis as the same unit of work. They are not. The router sits between the gateway and the model servers and makes decisions Kubernetes has no way to make, because Kubernetes cannot see inside the request.

The logic is small but high-leverage:

estimate input tokens + requested max output tokens
if tenant over quota         -> 429, retry-after
if prompt is very large      -> route to long-context pool
if a replica's queue is deep -> route to a less loaded one
if tenant is premium         -> priority queue
else                         -> default pool

Build it in whatever your team is fastest in — Go or a small Python/FastAPI service both work. The point is not the language; it is that capacity decisions for LLMs need to be token-aware, and the only place that can happen is a layer that estimates token cost before it picks a backend.
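
The routing rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Request` and `Replica` types, the pool names, the 6,000-token cutoff, the 60-second retry hint, and the four-characters-per-token estimate are all hypothetical choices made for the example. A real router would use the model's tokenizer and read queue depth from the servers' metrics.

```python
from __future__ import annotations
from dataclasses import dataclass

LONG_CONTEXT_THRESHOLD = 6000  # tokens; hypothetical cutoff for the long-context pool

@dataclass
class Request:
    tenant: str
    prompt: str
    max_output_tokens: int
    premium: bool = False

@dataclass
class Replica:
    name: str
    pool: str         # "default" or "long-context" (assumed pool names)
    queue_depth: int  # waiting requests, e.g. scraped from the server's metrics

def estimate_tokens(text: str) -> int:
    """Crude heuristic of about 4 characters per token; use the real tokenizer in practice."""
    return max(1, len(text) // 4)

def route(req: Request, replicas: list[Replica], remaining_quota: int) -> dict:
    """Pick a backend from estimated token cost, tenant quota, and replica load."""
    cost = estimate_tokens(req.prompt) + req.max_output_tokens
    if cost > remaining_quota:
        return {"status": 429, "retry_after_s": 60}
    if not replicas:
        return {"status": 503}
    pool = "long-context" if cost > LONG_CONTEXT_THRESHOLD else "default"
    # Fall back to any replica if the preferred pool is empty.
    candidates = [r for r in replicas if r.pool == pool] or replicas
    target = min(candidates, key=lambda r: r.queue_depth)
    return {
        "status": 200,
        "replica": target.name,
        "priority": "high" if req.premium else "normal",
        "estimated_tokens": cost,
    }

fleet = [Replica("a", "default", 5), Replica("b", "default", 1), Replica("c", "long-context", 0)]
print(route(Request("acme", "Explain this alert", 200), fleet, remaining_quota=10_000))
```

The design choice worth noting is that the estimate includes the requested maximum output, since a short prompt asking for a long report still occupies the GPU for a long time.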

Token-based accounting, not request counts

The single most expensive mistake you can make is billing and limiting by request count. One request might be 20 input tokens and 50 output tokens; the next might be 9,000 input tokens and a 2,000-token report. Counting both as "one request" makes your quotas and your cost model fiction.
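
A quota measured in tokens can be sketched as a token bucket whose units are LLM tokens rather than requests. The class name, the per-minute budget, and the injectable clock are assumptions made for illustration; in a multi-replica router this state would live somewhere shared, such as the Redis instance in the storage layer.

```python
import time

class TokenQuota:
    """Per-tenant budget measured in LLM tokens, refilled continuously."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.refill_per_s = tokens_per_minute / 60.0
        self.clock = clock
        self.available = self.capacity
        self.updated = clock()

    def _refill(self) -> None:
        """Credit the tokens earned since the last call, capped at capacity."""
        now = self.clock()
        earned = (now - self.updated) * self.refill_per_s
        self.available = min(self.capacity, self.available + earned)
        self.updated = now

    def try_spend(self, tokens: int):
        """Return (allowed, retry_after_seconds) for a request costing `tokens`."""
        self._refill()
        if tokens <= self.available:
            self.available -= tokens
            return True, 0.0
        if tokens > self.capacity:
            return False, float("inf")  # can never fit in this budget; reject outright
        return False, (tokens - self.available) / self.refill_per_s

quota = TokenQuota(tokens_per_minute=10_000)
print(quota.try_spend(70))      # a short chat request
print(quota.try_spend(11_000))  # a large report that exceeds the whole budget
```

Under a request-count limit those two calls would look identical; here the second one is rejected because of what it actually costs.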

Record a usage event per request and store the tokens:

CREATE TABLE llm_usage_events (
  id               UUID PRIMARY KEY,
  tenant_id        TEXT        NOT NULL,
  model_name       TEXT        NOT NULL,
  input_tokens     INTEGER     NOT NULL,
  output_tokens    INTEGER     NOT NULL,
  total_tokens     INTEGER     NOT NULL,   -- input_tokens + output_tokens
  ttft_ms          INTEGER,                -- time to first token
  total_latency_ms INTEGER,
  status_code      INTEGER,
  created_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

Once you have that table, cost per tenant, cost per model, cost per million tokens, and abuse detection all fall out of a GROUP BY. Without it, you are guessing at the one number — cost per token — that decides whether the product is viable. I wrote a whole companion post on why tokens, not requests, are the real unit of scale; this table is where that idea becomes operational.
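
Here is a small sketch of that GROUP BY, using Python's built-in sqlite3 as a stand-in for PostgreSQL so the example runs on its own. The table is trimmed to the columns the query needs, the tenant names are made up, and the price per million tokens is a hypothetical placeholder for your own GPU economics.

```python
import sqlite3

PRICE_PER_MILLION_TOKENS = 0.50  # hypothetical blended cost; substitute your own figure

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE llm_usage_events (
        tenant_id     TEXT    NOT NULL,
        model_name    TEXT    NOT NULL,
        input_tokens  INTEGER NOT NULL,
        output_tokens INTEGER NOT NULL,
        total_tokens  INTEGER NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO llm_usage_events VALUES (?, ?, ?, ?, ?)",
    [
        ("acme", "llama-8b", 20, 50, 70),            # a short chat
        ("acme", "llama-8b", 9000, 2000, 11000),     # a long document analysis
        ("globex", "llama-8b", 300, 200, 500),
    ],
)

def usage_by_tenant(conn):
    """Return (tenant, requests, tokens, estimated_cost) rows, heaviest tenant first."""
    rows = conn.execute("""
        SELECT tenant_id, COUNT(*) AS requests, SUM(total_tokens) AS tokens
        FROM llm_usage_events
        GROUP BY tenant_id
        ORDER BY tokens DESC
    """).fetchall()
    return [(t, n, tok, tok / 1_000_000 * PRICE_PER_MILLION_TOKENS) for t, n, tok in rows]

for tenant, requests, tokens, cost in usage_by_tenant(conn):
    print(f"{tenant}: {requests} requests, {tokens} tokens, ${cost:.6f}")
```

Swapping `tenant_id` for `model_name` in the query gives cost per model with no other changes.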

Autoscaling the LLM-aware way

CPU-based horizontal pod autoscaling is the wrong instrument here. A model server can be at 95% GPU memory and modest CPU at the same time. Scale on the signals that actually predict pain: queue depth, waiting requests, and p95 time to first token. KEDA with Prometheus metrics gives you exactly that.

if queue_depth > 50 and p95_ttft > 2s   -> add a warm replica
if gpu_memory_pressure > 90%            -> shed long-context work away from that replica
if queue_depth == 0 for 30 min          -> scale down, but keep a warm floor

The non-obvious rule: do not scale your primary model to zero. Cold-starting a model means pulling weights and loading them into GPU memory, which can take minutes — far too long to do in the request path. Keep a minimum warm replica. For an MVP, min: 1, max: 3, scaling on queue depth and TTFT, is a sane starting point.
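
The scaling rule, including the warm floor, can be written out as a pure function. In a real cluster KEDA and the HPA evaluate this for you from Prometheus queries; the function below only makes the decision explicit so it can be reasoned about and tested. The parameter names are hypothetical, and the thresholds are the illustrative ones from the text, not tuned values.

```python
def desired_replicas(
    current: int,
    queue_depth: int,
    p95_ttft_s: float,
    idle_minutes: float,
    min_replicas: int = 1,   # warm floor: never scale the primary model to zero
    max_replicas: int = 3,
    queue_threshold: int = 50,
    ttft_threshold_s: float = 2.0,
    idle_scale_down_minutes: float = 30.0,
) -> int:
    """Decide the replica count from queue depth, p95 time to first token, and idle time."""
    if queue_depth > queue_threshold and p95_ttft_s > ttft_threshold_s:
        target = current + 1   # users are waiting and latency is degrading
    elif queue_depth == 0 and idle_minutes >= idle_scale_down_minutes:
        target = current - 1   # sustained idleness: release one replica
    else:
        target = current
    # Clamp so the floor and ceiling always hold, whatever the signals say.
    return max(min_replicas, min(max_replicas, target))

print(desired_replicas(current=1, queue_depth=80, p95_ttft_s=3.5, idle_minutes=0))
print(desired_replicas(current=1, queue_depth=0, p95_ttft_s=0.2, idle_minutes=45))
```

Requiring both a deep queue and a slow TTFT before scaling up avoids adding a GPU for a brief burst that the existing replica absorbs without users noticing.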

Cold starts are a first-class problem

Because loading is slow, treat it as part of the design rather than an afterthought. The pattern that works:

object store / registry
        |
   init container pulls weights
        |
   local NVMe / PV cache
        |
   vLLM starts from the local path
        |
   readiness passes only once loaded

Pre-pull images, cache weights on fast local disk, and keep warm replicas sized to your floor of demand. The goal is that scaling up adds capacity in seconds of scheduling plus a warm start, not minutes of cold model loading while a queue backs up.
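
A quick estimate shows why the local cache matters. The sketch below only models the time to read the weights from storage. The 16 GB figure assumes an 8B model at 16-bit precision, and both throughput numbers are hypothetical placeholders; measure your own object store and disks before relying on them.

```python
def load_seconds(weight_gb: float, throughput_gb_per_s: float) -> float:
    """Time to read the weights once, ignoring GPU transfer and engine warm-up."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return weight_gb / throughput_gb_per_s

WEIGHTS_GB = 16.0  # roughly an 8B model at 2 bytes per parameter

# Hypothetical sustained read throughput in GB/s for each source.
SOURCES = {
    "object store over the network": 0.2,
    "local NVMe cache": 2.0,
}

for name, gb_per_s in SOURCES.items():
    print(f"{name}: about {load_seconds(WEIGHTS_GB, gb_per_s):.0f} s to read the weights")
```

Whatever the real numbers turn out to be, the ratio between the two paths is what determines how long a queue builds while a new replica comes up.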

RAG for the security use case

For the Wazuh analyst, the base model is not enough — it does not know your runbooks, your past incidents, or your MITRE mappings. Retrieval-augmented generation closes that gap:

Wazuh alert -> normalize -> embed -> vector search
   -> retrieve runbooks + similar past alerts
   -> LLM explains severity + suggested next step
   -> analyst reviews in the dashboard

For an MVP, PostgreSQL with pgvector is the pragmatic choice: one fewer system to run, and good enough until your corpus is large enough to justify a dedicated vector database. Retrieval depth is not a free quality knob, though — every extra chunk you retrieve is extra input tokens on every request, which is exactly the kind of invisible cost growth that the token accounting above is there to catch.
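
The cost of retrieval depth is simple arithmetic, sketched below. All the numbers are made up for illustration: the system prompt size, the alert size, the chunk size, and the daily request volume are assumptions, not measurements from a real deployment.

```python
def prompt_tokens(alert_tokens: int, chunks: int, tokens_per_chunk: int, system_tokens: int = 300) -> int:
    """Input tokens for one request: system prompt, the alert, and every retrieved chunk."""
    return system_tokens + alert_tokens + chunks * tokens_per_chunk

def daily_input_tokens(requests_per_day: int, **prompt_kwargs) -> int:
    """Total input tokens per day at a fixed retrieval depth."""
    return requests_per_day * prompt_tokens(**prompt_kwargs)

# Compare retrieval depths at an assumed 10,000 alerts per day.
for depth in (3, 5, 10):
    total = daily_input_tokens(10_000, alert_tokens=400, chunks=depth, tokens_per_chunk=500)
    print(f"top-{depth}: {total:,} input tokens per day")
```

Because the chunk term is multiplied by every request, a change in retrieval depth that looks small in a config file can dominate the input-token bill.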

Security from day one

If the product is a security tool, the platform has to look serious from the first commit. The non-negotiable minimum: OIDC for humans, per-tenant API keys, network policies between namespaces, secrets in a real secret manager (not env vars baked into images), and audit logs on every privileged action. Add admission control with Kyverno or OPA Gatekeeper to forbid privileged containers, latest tags in production, and unapproved registries. If you are building on Wazuh anyway, point it at your own nodes and gateway logs: a security company whose platform cannot detect problems in itself is not credible.

A 30-day build order

You cannot build all of this at once, and you should not try. The order that keeps you shippable:

Week 1  cluster + GPU operator + one vLLM model, verified end to end
Week 2  gateway + router + API keys + Prometheus/Grafana + token metrics
Week 3  Wazuh ingestion + alert-explanation endpoint + pgvector RAG
Week 4  KEDA autoscaling + queue + rate limits + load test + dashboard

Before you call it production-ready, you should be able to answer yes to each of these:

Can one GPU node fail without a total outage?
Can the system reject overload safely (429, not a stall)?
Can you see p95 TTFT and cost per tenant?
Can you update a model without downtime, and roll back a bad deploy?
Can you prove where customer data lives and delete it on request?

If any answer is no, that is your next sprint.

The principle

The mistake I see — and the one I had to unlearn from my web-app instincts — is treating an LLM platform as "Kubernetes plus a model." Kubernetes is necessary and it is excellent at what it does, but it sees containers and requests. Your platform has to see tokens, queues, GPU memory, and cost. The router, the token accounting, the LLM-aware autoscaling, and the security layer are the parts that turn a model behind an ingress into something you can actually operate, bill for, and stand behind.

Build the narrow product first. Make it observable at the token level. Then scale the parts that hurt.