Pod Reaper

On nodes that are close to full, the Kubernetes scheduler can occasionally bind a backup pod to a node that cannot actually start it. The pod then stays in ContainerCreating indefinitely. Because the node still hosts that pod, the node cannot be drained, which blocks node recycling until someone intervenes manually.

The pod reaper watches the pods the operator manages and deletes the ones that get wedged in ContainerCreating, so their StatefulSet or Deployment can reschedule them onto a healthy node.

Configuration

The pod reaper is configured through the operator’s Helm values:

operator:
  config:
    podReaper:
      enabled: true
      timeout: 5m
      interval: 1m

Property	Type	Default	Description
`operator.config.podReaper.enabled`	`boolean`	`true`	Whether the pod reaper runs.
`operator.config.podReaper.timeout`	`duration`	`5m`	How long a pod may stay stuck in `ContainerCreating` before it is reaped.
`operator.config.podReaper.interval`	`duration`	`1m`	How often the operator scans for stuck pods.

Durations are written as a number and a unit, for example 30s, 5m, or 1h.
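
As a rough illustration of that format, the sketch below parses a duration string into a Python timedelta. It assumes a single whole number followed by one of the units s, m, or h, as in the examples above; the operator's actual parser is not documented here and may accept other forms. The function name parse_duration is hypothetical.

```python
import re
from datetime import timedelta

# Assumption: one whole number plus one unit (s, m, or h). The operator's
# real parser may accept more units or compound values.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}
_DURATION = re.compile(r"(\d+)([smh])")


def parse_duration(text: str) -> timedelta:
    """Convert a string such as "30s", "5m", or "1h" to a timedelta."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


if __name__ == "__main__":
    for raw in ("30s", "5m", "1h"):
        print(raw, "=", parse_duration(raw).total_seconds(), "seconds")
```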

What the pod reaper does

It only acts on managed backup pods

The reaper considers backup pods and schema registry backup pods. The operator marks these pods with the io.kannika/reap: "true" annotation, and the reaper only deletes pods that carry it. Restore pods are run as Jobs with their own retry semantics and are left alone; pods that are not managed by Kannika Armory are never touched.

To exclude a specific workload from reaping, set the io.kannika/reap annotation to "false" on the Backup or SchemaRegistryBackup resource. The operator propagates that value onto its pods, so they are no longer reaped. See Opt a backup out of reaping.
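
The annotation rule can be pictured as a simple gate on the pod's metadata. The following is a minimal sketch, not the operator's code: it treats a pod as a plain dictionary shaped like the Kubernetes Pod JSON, and the function name is_reap_candidate is hypothetical.

```python
REAP_ANNOTATION = "io.kannika/reap"


def is_reap_candidate(pod: dict) -> bool:
    """Return True only when the pod carries io.kannika/reap: "true"."""
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    # Annotation values are strings, so compare against "true", not True.
    return annotations.get(REAP_ANNOTATION) == "true"


if __name__ == "__main__":
    managed = {"metadata": {"annotations": {REAP_ANNOTATION: "true"}}}
    opted_out = {"metadata": {"annotations": {REAP_ANNOTATION: "false"}}}
    unmanaged = {"metadata": {"name": "some-other-pod"}}
    for name, pod in (("managed", managed), ("opted out", opted_out), ("unmanaged", unmanaged)):
        print(name, "->", is_reap_candidate(pod))
```

Under this reading, a pod with no annotation at all is handled the same way as one that is opted out: it is never a candidate.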

It only reaps pods stuck in ContainerCreating

A pod is reaped when all of the following hold:

a container is waiting with reason ContainerCreating;
the pod was created longer ago than the configured timeout;
the pod is not already being deleted.

The age is taken from the pod’s creationTimestamp, so it covers the whole time since the pod was created, including any time spent waiting to be scheduled. The operator re-checks on every interval, so a pod is reaped within one interval of crossing the timeout.
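
The three conditions can be sketched as a single predicate over the pod object. This is an illustration under stated assumptions rather than the operator's implementation: the pod is a dictionary shaped like the Kubernetes Pod JSON, only status.containerStatuses is inspected (whether init containers are also considered is not stated here), and the names is_stuck and parse_timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2024-05-01T12:00:00Z (UTC)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def is_stuck(pod: dict, timeout: timedelta, now: datetime) -> bool:
    """Return True when all three reaping conditions hold."""
    metadata = pod.get("metadata") or {}

    # Condition 3: a pod with a deletionTimestamp is already being deleted.
    if metadata.get("deletionTimestamp"):
        return False

    # Condition 1: some container is waiting with reason ContainerCreating.
    # Assumption: only regular containers are checked, not init containers.
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    creating = any(
        ((status.get("state") or {}).get("waiting") or {}).get("reason") == "ContainerCreating"
        for status in statuses
    )
    if not creating:
        return False

    # Condition 2: age is measured from creationTimestamp, so time spent
    # waiting to be scheduled counts toward the timeout.
    age = now - parse_timestamp(metadata["creationTimestamp"])
    return age > timeout
```

Passing now as an argument, instead of reading the clock inside the function, keeps the check deterministic and easy to test.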

Other waiting reasons, such as ImagePullBackOff or CreateContainerConfigError, are deliberately left alone: they signal a permanent problem (a missing image, a missing secret) that deleting and recreating the pod would not fix, so reaping would only create a restart loop that hides the real error.

It deletes the pod and records an event

When a pod is reaped, the operator deletes it so its controller recreates it, and records a warning event in the pod’s namespace. Because the reaped pod is gone, the event shows up in the namespace event list rather than under kubectl describe pod:

$ kubectl get events -n <namespace> --field-selector reason=StuckPodReaped
LAST SEEN   TYPE      REASON           OBJECT             MESSAGE
30s         Warning   StuckPodReaped   pod/<stuck-pod>    Pod <stuck-pod> was stuck in ContainerCreating past the reaper timeout and was deleted so it can reschedule

A structured log line is emitted as well, with the pod name and namespace.

Examples

Tune the timeout

Reap pods that have been stuck for more than ten minutes, scanning every two minutes:

operator:
  config:
    podReaper:
      enabled: true
      timeout: 10m
      interval: 2m
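
To see what these two values mean in practice, the sketch below computes how old a stuck pod is at the first scan that finds it past the timeout. It assumes scans run at a fixed interval and ignores the time a scan or a deletion takes; first_reaping_scan and its first_scan parameter (the pod's age at the first scan after it was created) are hypothetical names used only for illustration.

```python
from datetime import timedelta


def first_reaping_scan(timeout: timedelta, interval: timedelta, first_scan: timedelta) -> timedelta:
    """Return the pod's age at the first scan where age > timeout."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    age_at_scan = first_scan
    # Step from scan to scan until the pod's age exceeds the timeout.
    while age_at_scan <= timeout:
        age_at_scan += interval
    return age_at_scan


if __name__ == "__main__":
    timeout, interval = timedelta(minutes=10), timedelta(minutes=2)
    for offset in (timedelta(0), timedelta(seconds=30), timedelta(seconds=90)):
        print(offset, "->", first_reaping_scan(timeout, interval, offset))
```

Under these assumptions, with timeout: 10m and interval: 2m a stuck pod is deleted when it is between 10 and 12 minutes old, depending on where the scan cycle falls relative to the pod's creation time.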

Opt a backup out of reaping

Annotate a Backup (or SchemaRegistryBackup) with io.kannika/reap: "false" to keep the reaper from deleting that resource's pods, while leaving the reaper enabled for all other workloads:

apiVersion: kannika.io/v1alpha
kind: Backup
metadata:
  name: my-backup
  annotations:
    io.kannika/reap: "false"
spec:
  # ...

Disable the pod reaper

Turn the reaper off entirely if you prefer to handle stuck pods manually:

operator:
  config:
    podReaper:
      enabled: false