Pod Reaper

On nodes that are close to full, the Kubernetes scheduler can occasionally bind a backup pod to a node that cannot actually start it. The pod then stays in ContainerCreating indefinitely. Because the node still hosts that pod, the node cannot be drained, which blocks node recycling and requires manual intervention.

The pod reaper watches the pods the operator manages and deletes the ones that get wedged in ContainerCreating, so their StatefulSet or Deployment can reschedule them onto a healthy node.

The pod reaper is configured through the operator’s Helm values:

values.yaml
operator:
  config:
    podReaper:
      enabled: true
      timeout: 5m
      interval: 1m

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| operator.config.podReaper.enabled | boolean | true | Whether the pod reaper runs. |
| operator.config.podReaper.timeout | duration | 5m | How long a pod may stay stuck in ContainerCreating before it is reaped. |
| operator.config.podReaper.interval | duration | 1m | How often the operator scans for stuck pods. |

Durations are written as a number and a unit, for example 30s, 5m, or 1h.
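
To make the duration format concrete, here is a minimal sketch of a parser for values such as 30s, 5m, or 1h. It is not the operator's own parser: the function name parse_duration is hypothetical, and the sketch assumes a single whole number followed by one of the three units shown above. The operator may accept more forms than this.

```python
import re
from datetime import timedelta

# Assumption: only the three units used in the examples above are handled.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text: str) -> timedelta:
    """Parse one '<number><unit>' duration such as '30s', '5m' or '1h'."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    # Build the timedelta from the keyword that matches the unit letter.
    return timedelta(**{_UNITS[unit]: int(value)})
```

For example, parse_duration("5m") returns the same value as timedelta(minutes=5), and a string such as "5 minutes" raises ValueError.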

The reaper considers backup pods and schema registry backup pods. The operator marks these pods with the io.kannika/reap: "true" annotation, and the reaper only deletes pods that carry it. Restore pods are run as Jobs with their own retry semantics and are left alone; pods that are not managed by Kannika Armory are never touched.

To exclude a specific workload from reaping, set the io.kannika/reap annotation to "false" on the Backup or SchemaRegistryBackup resource. The operator propagates that value onto its pods, so they are no longer reaped. See Opt a backup out of reaping.
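
The annotation rule can be sketched as a small predicate. This is an illustration only: it assumes the pod is available as a plain dictionary shaped like the Kubernetes Pod JSON, the helper name is_reap_candidate is hypothetical, and comparing against the exact string "true" is an assumption about how the operator reads the value.

```python
REAP_ANNOTATION = "io.kannika/reap"


def is_reap_candidate(pod: dict) -> bool:
    """Return True only when the pod carries io.kannika/reap: "true"."""
    # metadata.annotations may be absent or null on a pod.
    annotations = pod.get("metadata", {}).get("annotations") or {}
    # Assumption: any value other than the exact string "true" opts the pod out.
    return annotations.get(REAP_ANNOTATION) == "true"
```

A pod without the annotation, or with the value "false" propagated from its Backup, is skipped by this check.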

It only reaps pods stuck in ContainerCreating

A pod is reaped when all of the following hold:

  • a container is waiting with reason ContainerCreating;
  • the pod was created more than timeout ago;
  • it is not already being deleted.

The age is taken from the pod’s creationTimestamp, so it covers the whole time since the pod was created, including any time spent waiting to be scheduled. The operator re-checks on every interval, so a pod is reaped within one interval of crossing the timeout.
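
The three conditions above can be sketched as a single predicate. This is a rough model rather than the operator's implementation: the pod is assumed to be a dictionary shaped like the Kubernetes Pod JSON, the function names are hypothetical, and only status.containerStatuses is inspected, which is an assumption about which containers the operator looks at.

```python
from datetime import datetime, timedelta, timezone


def is_stuck_creating(pod: dict) -> bool:
    """True if any container is waiting with reason ContainerCreating."""
    statuses = pod.get("status", {}).get("containerStatuses") or []
    return any(
        ((s.get("state") or {}).get("waiting") or {}).get("reason")
        == "ContainerCreating"
        for s in statuses
    )


def should_reap(pod: dict, now: datetime, timeout: timedelta) -> bool:
    """Apply the three conditions: stuck, older than timeout, not terminating."""
    metadata = pod.get("metadata", {})
    # A pod with a deletionTimestamp is already being deleted.
    if metadata.get("deletionTimestamp"):
        return False
    # Kubernetes timestamps are RFC 3339, for example 2024-01-01T12:00:00Z.
    created = datetime.fromisoformat(
        metadata["creationTimestamp"].replace("Z", "+00:00")
    )
    # Age counts from creation, so time spent unscheduled is included.
    return is_stuck_creating(pod) and now - created > timeout
```

With a timeout of 5m, a pod created at 12:00:00Z that is still in ContainerCreating is not reaped by a scan at 12:04:00Z but is reaped by a scan at 12:06:00Z.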

Other waiting reasons, such as ImagePullBackOff or CreateContainerConfigError, are deliberately left alone: they signal a permanent problem (a missing image, a missing secret) that deleting and recreating the pod would not fix, so reaping would only create a restart loop that hides the real error.

When a pod is reaped, the operator deletes it so its controller recreates it, and records a warning event in the pod’s namespace. Because the reaped pod is gone, the event shows up in the namespace event list rather than under kubectl describe pod:

$ kubectl get events -n <namespace> --field-selector reason=StuckPodReaped
LAST SEEN   TYPE      REASON           OBJECT            MESSAGE
30s         Warning   StuckPodReaped   pod/<stuck-pod>   Pod <stuck-pod> was stuck in ContainerCreating past the reaper timeout and was deleted so it can reschedule

A structured log line is emitted as well, with the pod name and namespace.

Reap pods that have been stuck for more than ten minutes, scanning every two minutes:

values.yaml
operator:
  config:
    podReaper:
      enabled: true
      timeout: 10m
      interval: 2m
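
To see how timeout and interval interact with these values, the sketch below works out which scan reaps a stuck pod. It is a simplified model, not the operator's scheduler: it assumes scans run at exact multiples of interval on a shared clock, and the function name first_reaping_scan is hypothetical.

```python
from datetime import timedelta


def first_reaping_scan(
    created_at: timedelta, timeout: timedelta, interval: timedelta
) -> timedelta:
    """Time of the first scan at which the pod is older than the timeout.

    Assumption: scans happen at 0, interval, 2 * interval, ... and
    created_at is the pod's creation time on that same clock.
    """
    deadline = created_at + timeout
    # The pod must be strictly older than the timeout, so take the first
    # scan tick that falls after the deadline.
    ticks = deadline // interval + 1
    return ticks * interval


# A pod created 3 minutes in, with timeout 10m and interval 2m.
scan = first_reaping_scan(
    timedelta(minutes=3), timedelta(minutes=10), timedelta(minutes=2)
)
print(scan)  # 0:14:00, so the pod is 11 minutes old when it is reaped
```

Under this model a stuck pod is deleted when it is between 10 and 12 minutes old, which matches the rule that a pod is reaped within one interval of crossing the timeout.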

Annotate a Backup (or SchemaRegistryBackup) with io.kannika/reap: "false" to keep the reaper from deleting its pods, while leaving it enabled for everything else:

backup.yaml
apiVersion: kannika.io/v1alpha
kind: Backup
metadata:
  name: my-backup
  annotations:
    io.kannika/reap: "false"
spec:
  # ...

Turn the reaper off entirely if you prefer to handle stuck pods manually:

values.yaml
operator:
  config:
    podReaper:
      enabled: false