Overview

Pods’ progress metrics

Every backup pod and restore job exposes metrics in the Prometheus format that describe the current state of the process. They can be queried with a GET request to the /metrics endpoint on port 9000.
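As a rough illustration of consuming this endpoint from a script, the sketch below fetches the metrics and splits each sample line into a name, a label dictionary and a numeric value. It is not part of the product: the function names are made up for this example, the default URL assumes the pod's port 9000 is reachable on localhost (for instance through a port-forward), and the parser only handles the simple `name{labels} value` lines and does not unescape label values.

```python
import re
import urllib.request

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces, e.g. topic="std-1".
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text output into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything that is not a sample line
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://localhost:9000/metrics", timeout=5):
    """GET the metrics endpoint (URL is an assumption) and parse the body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling `fetch_metrics()` returns a list that can then be filtered by metric name, for example to keep only the `backup_topic_progress` samples.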

Here’s an example of a backup’s metrics:

curl localhost:9000/metrics
# HELP gauge Total number of bytes present in the storage (affected by chosen compression algorithm)
# TYPE backup_sink_bytes_count gauge
backup_sink_bytes_count{topic="std-1"} 1750738
backup_sink_bytes_count{topic="std-2"} 1324392

# HELP counter Total number of bytes present in the storage, including expired segments (affected by chosen compression algorithm)
# TYPE backup_sink_bytes_count_total counter
backup_sink_bytes_count_total{topic="std-1"} 1750738
backup_sink_bytes_count_total{topic="std-2"} 1324392

# HELP counter Total number of records sent to the storage since program start
# TYPE backup_sink_jobs_count counter
backup_sink_jobs_count{topic="std-1"} 753
backup_sink_jobs_count{topic="std-2"} 376

# HELP gauge Indicator of time spent idle: 0=fully busy, 1=fully idle
# TYPE backup_sink_jobs_idleratio gauge
backup_sink_jobs_idleratio{topic="std-1"} 0.9982043009466929
backup_sink_jobs_idleratio{topic="std-2"} 0.9987926299963017

# HELP counter Last offset backed-up for each topic's partition
# TYPE backup_sink_offset counter
backup_sink_offset{topic="std-1",partition="0"} 388
backup_sink_offset{topic="std-1",partition="1"} 284
backup_sink_offset{topic="std-1",partition="2"} 272
backup_sink_offset{topic="std-2",partition="2"} 168
backup_sink_offset{topic="std-2",partition="1"} 146
backup_sink_offset{topic="std-2",partition="0"} 161

# HELP gauge Total number of records present in the storage sink.
# TYPE backup_sink_records_count gauge
backup_sink_records_count{topic="std-1"} 947
backup_sink_records_count{topic="std-2"} 478

# HELP counter Total number of records written to the storage sink, including expired segments
# TYPE backup_sink_records_count_total counter
backup_sink_records_count_total{topic="std-1"} 947
backup_sink_records_count_total{topic="std-2"} 478

# HELP counter Total number of records filtered out (user-defined filters, plugins, ...)
# TYPE backup_sink_records_filtered_count counter
backup_sink_records_filtered_count{topic="std-1"} 0
backup_sink_records_filtered_count{topic="std-2"} 0

# HELP gauge Total size of backup'd records (key + headers + payload)
# TYPE backup_sink_records_size gauge
backup_sink_records_size{topic="std-1"} 11328438
backup_sink_records_size{topic="std-2"} 5968210

# HELP counter Total size of backup'd records (key + headers + payload), including expired segments
# TYPE backup_sink_records_size_total counter
backup_sink_records_size_total{topic="std-1"} 11328438
backup_sink_records_size_total{topic="std-2"} 5968210

# HELP gauge Number of segments (.kan files) written to the storage
# TYPE backup_sink_segments_count gauge
backup_sink_segments_count{topic="std-1"} 2
backup_sink_segments_count{topic="std-2"} 2

# HELP counter Number of segments (.kan files) written to the storage, including expired segments
# TYPE backup_sink_segments_count_total counter
backup_sink_segments_count_total{topic="std-1"} 2
backup_sink_segments_count_total{topic="std-2"} 2

# HELP gauge State of this topic's backup task: 0=Created, 1xx=Running, 2xx=Paused, 3xx=Backoff, 8xx=Done, 9xx=Failed
# TYPE backup_state gauge
backup_state{topic="std-1"} 100
backup_state{topic="std-2"} 100

# HELP gauge Progress of a topic's partition backup as a number between 0 and 1
# TYPE backup_topic_partition_progress gauge
backup_topic_partition_progress{topic="std-1",partition="1"} 0.9665551839464883
backup_topic_partition_progress{topic="std-1",partition="2"} 0.9581881533101045
backup_topic_partition_progress{topic="std-1",partition="0"} 0.9607843137254902
backup_topic_partition_progress{topic="std-2",partition="2"} 0.949438202247191
backup_topic_partition_progress{topic="std-2",partition="0"} 0.9642857142857143
backup_topic_partition_progress{topic="std-2",partition="1"} 0.9735099337748344

# HELP gauge Overall progress of a topic's backup as a number between 0 and 1
# TYPE backup_topic_progress gauge
backup_topic_progress{topic="std-1"} 0.961842550327361
backup_topic_progress{topic="std-2"} 0.9624112834359133

# HELP jobs_state Number of jobs in a given running state
# TYPE jobs_state gauge
jobs_state{state="Created"} 0
jobs_state{state="Done"} 0
jobs_state{state="Running"} 2
jobs_state{state="Paused"} 0
jobs_state{state="Failed"} 0
jobs_state{state="Backoff"} 0
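The numeric backup_state values and the per-partition progress ratios in the example above can be turned into something easier to read. The following is a minimal sketch, not the product's own logic: both helper names are hypothetical, "1xx" is assumed to mean the range 100 to 199 (and likewise for the other families), and the 0.96 threshold is an arbitrary value chosen for illustration.

```python
def decode_state(code):
    """Map a backup_state value to a state name using the documented ranges."""
    code = int(code)
    if code == 0:
        return "Created"
    # Assumption: the hundreds digit selects the state family (1xx, 2xx, ...).
    names = {1: "Running", 2: "Paused", 3: "Backoff", 8: "Done", 9: "Failed"}
    return names.get(code // 100, "Unknown")


def lagging_partitions(progress, threshold=0.96):
    """Return the (topic, partition) keys whose progress is below threshold."""
    return sorted(key for key, value in progress.items() if value < threshold)


# Values copied from the example output above.
example_progress = {
    ("std-1", "0"): 0.9607843137254902,
    ("std-1", "1"): 0.9665551839464883,
    ("std-1", "2"): 0.9581881533101045,
    ("std-2", "0"): 0.9642857142857143,
    ("std-2", "1"): 0.9735099337748344,
    ("std-2", "2"): 0.949438202247191,
}
```

With the example data, `decode_state(100)` gives "Running" for both topics, and `lagging_partitions(example_progress)` returns partition 2 of std-1 and partition 2 of std-2.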

Push metrics

Backup and Restore jobs (including schema registry backups and restores) push their metrics to the API so that the Armory console can display their progress.

Metrics are sent to the event-gateway Kubernetes service, which listens on port 8082.

By default, metrics pushing is configured and enabled when you install the main Helm chart. To disable it, set the operator.config.eventGateway.enabled flag to false.

The following properties change the configuration of the event gateway service, which points to the API:

api:
  eventGateway:
    enabled: true
    service:
      name: "event-gateway"
      type: ClusterIP
      port: 8082 # port that the service will expose
      targetPort: 8082 # container port
      annotations: { }
      labels: { }

To change where restores push their metrics, update the operator.config.eventGateway.service properties:

operator:
  config:
    eventGateway:
      enabled: true
      service:
        name: event-gateway
        namespace: ""
        port: 8082
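To make the name, namespace and port values above more concrete, the sketch below builds the address such a service would have under standard Kubernetes DNS naming. This is only an illustration: the function is hypothetical, how the operator itself resolves an empty namespace is not stated here (the sketch assumes it falls back to the short service name, which resolves within the caller's own namespace), and `cluster.local` is the usual default cluster domain rather than something this page specifies.

```python
def event_gateway_address(name="event-gateway", namespace="", port=8082,
                          cluster_domain="cluster.local"):
    """Build a host:port string for a Kubernetes service.

    Assumption: an empty namespace means "same namespace as the caller",
    so the short service name is used as the host.
    """
    if namespace:
        host = f"{name}.{namespace}.svc.{cluster_domain}"
    else:
        host = name
    return f"{host}:{port}"
```

With the defaults shown in the values above, this yields `event-gateway:8082`. With a hypothetical namespace such as `armory-system`, it yields `event-gateway.armory-system.svc.cluster.local:8082`.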