Improve metrics #3847

tiithansen · 2024-12-13T07:49:42Z

This PR makes the following changes:

Introduce a new metric, gha_runner_job, which can be used to link jobs to runner pods (the metric is only exported while the job is running).
Replace high-cardinality histograms with last-duration gauges, since jobs can run at long, irregular intervals, which makes rate functions hard to use.
Add a new duration gauge showing how long a job sat in the queue before being picked up.
Fix the name label to always contain the clean runnerScaleSetName (the value used in the GHA job's runs-on property to select a runner).
Fix duration export so that durations are only exported if both times used in the calculation are set.
Remove job_workflow_ref, runner_id and runner_name from the duration metrics, as they cause a new series to be created with each run.

Example queries:

Memory usage per job:

label_replace(gha_runner_job{repository=~"$repository.*", job_name=~"$job.*"}, "pod", "$1", "pod_name", "(.*)") * on(pod) group_right(job_name) sum(container_memory_working_set_bytes{container!=""}) by (pod, container)

CPU usage per job:

label_replace(gha_runner_job{repository=~"$repository.*", job_name=~"$job.*"}, "pod", "$1", "pod_name", "(.*)") * on(pod) group_right(job_name) sum(rate(container_cpu_usage_seconds_total{container!=""}[1m])) by (pod, container)

CPU Throttling:

label_replace(gha_runner_job{repository=~"$repository.*", job_name=~"$job.*"}, "pod", "$1", "pod_name", "(.*)") * on(pod) group_right(job_name) sum(
    sum by (container,pod)
        (rate(container_cpu_cfs_throttled_periods_total{container!=""}[1m]))
 /
    sum by (container,pod)
        (rate(container_cpu_cfs_periods_total{container!=""}[1m]))
) by (pod, container)


atsu85 · 2024-12-13T08:25:27Z

cmd/ghalistener/metrics/metrics.go

@@ -144,75 +144,25 @@ var (
 		completedJobsTotalLabels,
 	)

-	jobStartupDurationSeconds = prometheus.NewHistogramVec(
-		prometheus.HistogramOpts{
+	jobLastStartupDurationSeconds = prometheus.NewGaugeVec(


Worth adding a comment (in addition to the commit message) explaining why a Gauge is used when a Histogram would ideally seem the better data type. It might avoid a lot of WTFs and time wasted basically reverting this change ;)

atsu85 · 2024-12-13T08:30:46Z

cmd/ghalistener/metrics/metrics.go

typo in commit description?

They cause a creation of new services with each job execution.

vs

They cause a creation of new series with each job execution.

Also worth mentioning cardinality explosion and OOMs
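
A rough, stdlib-only sketch of the cardinality concern, for illustration: it simulates how many distinct label sets (series) accumulate over repeated runs of one job, with and without a per-run runner_name label. The label values, the seriesKey and countSeries helpers, and the assumption that every run gets a unique runner name are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a canonical identity for a label set, mimicking how a
// time series is identified by its full set of label name/value pairs.
func seriesKey(labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for n := range labels {
		names = append(names, n)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, n := range names {
		parts = append(parts, n+"="+labels[n])
	}
	return strings.Join(parts, ",")
}

// countSeries simulates `runs` executions of one job and returns how many
// distinct series result, with or without a per-run runner_name label.
func countSeries(runs int, withRunnerName bool) int {
	seen := make(map[string]struct{})
	for i := 0; i < runs; i++ {
		labels := map[string]string{
			"repository": "example/repo", // hypothetical value
			"job_name":   "build",        // hypothetical value
		}
		if withRunnerName {
			// Assumption: each run lands on a uniquely named runner.
			labels["runner_name"] = fmt.Sprintf("runner-%d", i)
		}
		seen[seriesKey(labels)] = struct{}{}
	}
	return len(seen)
}

func main() {
	fmt.Println("with runner_name:   ", countSeries(1000, true))
	fmt.Println("without runner_name:", countSeries(1000, false))
}
```

Under these assumptions the series count grows linearly with the number of runs when runner_name is present, and stays constant once it is removed.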

tiithansen added 6 commits December 13, 2024 09:11

fix: Only observe duration if both times used in duration calculation…

9a7febc

…s are set

feat: Replace duration histograms with gauges which lost last executi…

0be5981

…on times. Its difficult to calculate any duration is intervals between jobs are not frequent enough. Last duration would give a better overview.

fix: Remove runner_name, runner_id and job_workflow_ref labels

00c8719

They cause a creation of new services with each job execution.

fix: Consistently report same value for name label as the value used …

83a75d7

…in GHA runs-on

feat: Add metric to export last duration job spent waiting in queue i…

c2eba45

…n otherwords waiting for a runner

feat: Add new metric which would enable to join job to runner pod to …

fb40b23

…query memory, cpu and cpu throttling metrics

tiithansen requested review from mumoshu, toast-gear, rentziass and a team as code owners December 13, 2024 07:49

Merge branch 'master' into improve-metrics

8c32cd2

atsu85 reviewed Dec 13, 2024
