How I would secure a Kubernetes cluster from day one

I ran .NET microservices on EKS in production for a few years — at RIDE Capital, and at Kfzteile24 where I wrote the terragrunt stacks the clusters sat on. The thing that aged worst in hindsight was not uptime or cost. It was that we treated security as something to bolt on later, after the workloads were running and the deadlines had moved. Later always arrives as an incident, an audit, or a customer security questionnaire you cannot answer.

A fresh Kubernetes cluster is default-allow. Out of the box, any pod can talk to any other pod, run as root, mount host paths, and reach the API server with whatever the default service account was granted. None of that is malicious — Kubernetes ships primitives, not a secure cluster. Security is the part you opt into. Now that I am building toward a security product around Wazuh, here is the day-one hardening I would never defer again, layer by layer.

The mental model: default-deny everything

Every control below is the same move applied to a different surface: flip the default from allow to deny, then permit exactly what you need. Identity, network, workloads, secrets, and runtime. If you internalise that one idea, the specific YAML is just detail.

1. RBAC: stop handing out cluster-admin

The most common real-world mistake is not an exotic exploit — it is cluster-admin bound to humans, CI pipelines, and service accounts that needed three verbs on one resource. And the quiet one: every pod gets the namespace's default service account token mounted automatically, so a compromised container can immediately start probing the API.

Two defaults to change on day one. First, stop auto-mounting tokens where they are not used:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  namespace: shop
automountServiceAccountToken: false

Second, grant narrow, namespaced Roles instead of reaching for ClusterRoles:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: shop
  name: read-config
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]

A workload that only reads ConfigMaps should be able to do exactly that and nothing else. The test I apply: if a single pod is compromised, what can its token do? With least-privilege RBAC the answer is "almost nothing." With the defaults, the answer is often "more than you would like to explain to an auditor."
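
A Role grants nothing until it is bound to a subject. A minimal sketch of that binding, assuming a dedicated service account called config-reader (a hypothetical name, as is the binding's name) for the one workload that actually reads ConfigMaps:

```yaml
# Hypothetical service account for the one workload that reads ConfigMaps.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: config-reader   # assumed name, for illustration only
  namespace: shop
---
# Bind the read-config Role to that account, in the shop namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-config-binding   # assumed name
  namespace: shop
subjects:
  - kind: ServiceAccount
    name: config-reader
    namespace: shop
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: read-config
```

To check the result, kubectl auth can-i list configmaps -n shop --as=system:serviceaccount:shop:config-reader should answer yes, and the same question about secrets should answer no.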

2. Network policies: a flat network is a gift to an attacker

By default, cluster networking is flat. Any pod can open a connection to any other pod, in any namespace. That is the single most useful thing to an attacker who lands one foothold, because lateral movement is free.

Fix it with a default-deny policy per namespace, then allow specific flows:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: shop
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-from-web
  namespace: shop
spec:
  podSelector:
    matchLabels: { app: api }
  ingress:
    - from:
        - podSelector:
            matchLabels: { app: web }
      ports:
        - { protocol: TCP, port: 8080 }

One caveat that bites people: NetworkPolicy is only enforced if your CNI implements it. Calico and Cilium do; the AWS VPC CNI supports it in recent versions, but only once network policy enforcement is switched on in the add-on configuration. A policy with no enforcer is a comment, not a control. Verify that it actually blocks traffic; do not assume.
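
One side effect of the default-deny policy above: it also blocks egress, DNS included, so pods in shop cannot resolve service names until that is allowed back in. A sketch of the two extra allows this example probably needs, assuming cluster DNS runs in kube-system under the common k8s-app: kube-dns label (check your distribution) and that web reaches api on port 8080. The policy names are made up:

```yaml
# Allow every pod in shop to reach cluster DNS.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress   # assumed name
  namespace: shop
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: kube-system }
          podSelector:
            matchLabels: { k8s-app: kube-dns }   # assumption: usual CoreDNS label
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
---
# Egress half of the web -> api flow; the ingress half is api-from-web.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-to-api   # assumed name
  namespace: shop
spec:
  podSelector:
    matchLabels: { app: web }
  policyTypes: ["Egress"]
  egress:
    - to:
        - podSelector:
            matchLabels: { app: api }
      ports:
        - { protocol: TCP, port: 8080 }
```

With default-deny in both directions, every flow needs an allow on each side: ingress on the receiver and egress on the sender.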

3. Admission control: make the rules un-skippable

RBAC and network policy are runtime fences. Admission control stops bad workloads from being admitted in the first place. Nothing in a default cluster prevents a teammate from shipping a privileged container that runs as root from a :latest tag pulled off an unknown registry.

Start with the built-in Pod Security Admission, set to restricted at the namespace level:

apiVersion: v1
kind: Namespace
metadata:
  name: shop
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
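
For reference, restricted rejects any pod that does not explicitly give up root, privilege escalation, and extra capabilities. A minimal sketch of a Deployment that should pass it, assuming a hypothetical image and the api service account from earlier:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: shop
spec:
  replicas: 1
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      serviceAccountName: api
      securityContext:
        runAsNonRoot: true                         # required by restricted
        seccompProfile: { type: RuntimeDefault }   # required by restricted
      containers:
        - name: api
          image: registry.example.com/shop/api:1.4.2   # hypothetical image and tag
          ports:
            - containerPort: 8080
          securityContext:
            allowPrivilegeEscalation: false   # required by restricted
            capabilities: { drop: ["ALL"] }   # required by restricted
            readOnlyRootFilesystem: true      # optional; drop it if the app writes to disk
```

runAsNonRoot only works if the image itself runs as a non-root user; otherwise the pod is admitted but the container fails to start.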

Then add a policy engine — Kyverno or OPA Gatekeeper — for the org-specific rules PSA does not cover: approved registries only, required resource limits, no :latest, no hostPath. A Kyverno rule reads almost like English:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce
  rules:
    - name: no-privileged-containers
      match:
        any:
          - resources: { kinds: ["Pod"] }
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): "false"
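
The other rules from that list follow the same shape. A sketch of three of them, modelled on Kyverno's published sample policies; registry.example.com is a placeholder for whatever registry you actually approve, the policy name is made up, and init and ephemeral containers would need the same patterns:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-images   # assumed name
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-image-tag
      match:
        any:
          - resources: { kinds: ["Pod"] }
      validate:
        message: "An image tag is required."
        pattern:
          spec:
            containers:
              - image: "*:*"
    - name: disallow-latest-tag
      match:
        any:
          - resources: { kinds: ["Pod"] }
      validate:
        message: "The :latest tag is not allowed."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
    - name: approved-registry-only
      match:
        any:
          - resources: { kinds: ["Pod"] }
      validate:
        message: "Images must come from registry.example.com."
        pattern:
          spec:
            containers:
              - image: "registry.example.com/*"   # placeholder registry
```

This keeps the spec-level validationFailureAction used in the policy above. As far as I know, newer Kyverno releases prefer a per-rule failureAction, so check which field your version expects.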

The point is to move security left of the incident. A policy that rejects the deploy is worth ten dashboards that notice the breach afterward.

4. Secrets: base64 is not encryption

Kubernetes Secrets are base64-encoded, not encrypted. Anyone who can read the object through the API reads the secret, and so does anyone who can read etcd or its backups unless you have explicitly turned on encryption at rest. On EKS that means enabling envelope encryption with a KMS key; do it when you create the cluster, because retrofitting is painful.
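
If the cluster is created with eksctl, that is one stanza in the cluster config. A sketch with placeholder values; the cluster name, region, and key ARN are all hypothetical, and with Terraform or terragrunt the equivalent is the encryption configuration on the cluster resource:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: shop-prod        # hypothetical cluster name
  region: eu-central-1   # hypothetical region
secretsEncryption:
  # Placeholder ARN: point this at a KMS key you own.
  keyARN: arn:aws:kms:eu-central-1:111122223333:key/00000000-0000-0000-0000-000000000000
```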

Better still, keep the real secret out of the cluster entirely. The External Secrets Operator pulls from AWS Secrets Manager (or Vault) and projects a Kubernetes Secret that is never the source of truth:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db
  namespace: shop
spec:
  secretStoreRef: { name: aws-sm, kind: ClusterSecretStore }
  target: { name: db-credentials }
  data:
    - secretKey: password
      remoteRef: { key: prod/shop/db, property: password }
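
The aws-sm store that the ExternalSecret references has to exist too. A sketch of what it might look like, assuming the operator authenticates through a service account mapped to an IAM role (IRSA); the region and the service account's name and namespace are assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-sm
spec:
  provider:
    aws:
      service: SecretsManager
      region: eu-central-1          # assumed region
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets        # assumed service account
            namespace: external-secrets   # assumed namespace
```

The IAM role behind that service account is part of the blast radius, so it is worth scoping it to the secret paths the cluster really needs rather than to all of Secrets Manager.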

And the boring rules that prevent most leaks: never bake secrets into images, never put them in plain env in a committed manifest, and lock down get/list on secrets with RBAC so a compromised pod cannot enumerate them.
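
For the RBAC part, a sketch of a Role that lets a workload read exactly one named Secret and nothing else (the Role name is made up; db-credentials is the target from the ExternalSecret above):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: shop
  name: read-db-credentials   # assumed name
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["db-credentials"]
    verbs: ["get"]   # no list or watch: a list returns every secret's contents
```

Pods that only consume the Secret as a mounted volume or an environment variable do not need even this; the kubelet fetches it for them.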

5. Runtime: assume something gets in

Everything above is prevention. Mature security assumes prevention eventually fails and asks a second question: would I even know? On most clusters I have seen, the honest answer is no.

Two layers close that gap. Falco watches syscalls (via eBPF) and fires on the patterns that almost always mean trouble — a shell spawned inside a container, an unexpected write to a system binary, an outbound connection from a pod that should never make one. Wazuh sits a level out: node OS logs, file-integrity monitoring, authentication events, and — importantly — the Kubernetes API audit log, correlated in one place so "someone exec'd into a prod pod at 3am" is a detection, not a story you reconstruct later.
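
To make the Falco half concrete, here is a sketch of a custom rule for the shell-in-a-container case. It assumes Falco's default ruleset is loaded first, because spawned_process and container are macros defined there; the rule name and the namespace filter are illustrative, and the default rules already ship a broader variant of this check:

```yaml
- rule: Shell spawned in shop container   # hypothetical custom rule
  desc: A shell was started inside a container in the shop namespace
  condition: >
    spawned_process and container
    and proc.name in (bash, sh, zsh, ash)
    and k8s.ns.name = "shop"
  output: >
    Shell in container (user=%user.name pod=%k8s.pod.name
    container=%container.name cmdline=%proc.cmdline)
  priority: WARNING
  tags: [container, shell]
```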

So turn on API server audit logging, ship those logs somewhere they cannot be wiped by whoever you are trying to catch, and put a runtime sensor on the nodes. A cluster that cannot detect a problem in itself is not finished — that is the line I keep coming back to as I build on Wazuh, and it applies just as much to the platform you run today.
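
On a control plane you run yourself, what gets logged is decided by an audit policy file passed to the API server with --audit-policy-file. A minimal sketch of one, with levels that are my assumption of a reasonable starting point rather than a recommendation from any standard:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages: ["RequestReceived"]
rules:
  # Record who exec'd, attached, or port-forwarded into which pod.
  - level: Request
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach", "pods/portforward"]
  # Secrets and ConfigMaps: metadata only, so values never land in the log.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Everything else: who did what, without request bodies.
  - level: Metadata
```

Rules are evaluated top to bottom and the first match wins, which is why the catch-all sits last. On EKS the policy itself is managed by AWS, so there the equivalent step is enabling the audit log type in control plane logging.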

The day-one checklist

When I stand up a cluster now, this is the list I will not move past:

[ ] RBAC: no cluster-admin for humans/CI; namespaced Roles; least privilege
[ ] automountServiceAccountToken: false unless the pod calls the API
[ ] NetworkPolicy default-deny per namespace, then allow explicitly
[ ] CNI actually enforces NetworkPolicy (verify with a blocked-traffic test)
[ ] Pod Security Admission: restricted
[ ] Kyverno/OPA: no privileged, no :latest, approved registries, required limits
[ ] etcd encryption at rest (KMS) enabled at cluster creation
[ ] secrets via External Secrets Operator; none baked into images or manifests
[ ] API server audit logging on, shipped off-cluster
[ ] runtime detection (Falco) + host/audit correlation (Wazuh)

The principle

Kubernetes gives you the primitives to build a secure cluster and a default configuration that is none of those things. The work is not exotic — it is flipping a handful of defaults from allow to deny, encrypting what should have been encrypted, and admitting that prevention fails so detection has to exist. Almost all of it is one-time configuration you do before the first real workload lands.

I learned that order the expensive way: ship first, secure later, explain later. Day-one hardening is cheaper than the meeting where you explain why you skipped it.