
Improving Kubernetes-Mixin API Server Rules Consistency

Nov 16, 2024

Preamble

A very interesting and widely used project in the Kubernetes space is kubernetes-mixin. The project, written in Jsonnet, provides a well-tested and customisable set of Prometheus rules, alerts, and Grafana dashboards.

An important set of rules is the one concerning the Kubernetes API server, providing valuable insights such as the server's availability, error budget, workqueue length, and more.

I recently made a small contribution to the project, fixing an issue I had been seeing for a while. Getting to the fix forced me to dive deep into the details of its Prometheus rules, and I am writing this to share what I learned.

The Problem

What I observed in some very large clusters was that the 30-day availability and error budget would, from time to time, go (slightly) above 100%.

GitHub Issue — https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/975

Root Cause Analysis

The problem happened sporadically, especially on large and heavily used clusters.

The kubernetes-mixin splits the API server availability into three views (illustrated by the example queries below):

  • Write availability (verb="write"), i.e. requests whose verb label matches the selector verb=~"POST|PUT|PATCH|DELETE"
  • Read availability (verb="read"), i.e. requests whose verb label matches the selector verb=~"LIST|GET"
  • Overall availability (verb="all").
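For illustration, here is what those selectors look like when applied directly to the raw apiserver_request_total metric. These are ad-hoc example queries of mine, not rules taken from the mixin, and they assume a cluster label is available (as the mixin's rules do):

# write request rate, by cluster and response code
sum by (cluster, code) (rate(apiserver_request_total{verb=~"POST|PUT|PATCH|DELETE"}[5m]))

# read request rate, by cluster and response code
sum by (cluster, code) (rate(apiserver_request_total{verb=~"LIST|GET"}[5m]))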

Maths to the Rescue

The problem affected all three views of the availability, but for the sake of simplicity I will refer only to the write view, since its expression is the easiest of the three to analyse.

To understand why the 30-day availability was going above 100%, I started from the PromQL expression and analysed it.

The following is the expression (in Jsonnet, where the %(SLODays)s placeholder expands to 30d) for the write 30-day availability:

1 - (
  (
    # too slow
    sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase%(SLODays)s{verb="write"})
    -
    sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase%(SLODays)s{verb="write",le="1"})
  )
  +
  # errors
  sum by (cluster) (code:apiserver_request_total:increase%(SLODays)s{verb="write",code=~"5.."} or vector(0))
)
/
sum by (cluster) (code:apiserver_request_total:increase%(SLODays)s{verb="write"})

The expression's terms can be refactored as follows:

1 - ((all_writes - fast_enough_writes) + write_errors) / write_requests_total
=
1 - (slow_writes + write_errors) / write_requests_total
=
1 - unavailable_writes_percent

The refactored terms are simpler to analyse:

  • For the availability to go above 100%, the unavailable_writes_percent term must be negative (1 - a is greater than 1, i.e. above 100%, only if a is negative)
  • unavailable_writes_percent is calculated as a fraction: ((all_writes - fast_enough_writes) + write_errors) / write_requests_total
  • All the terms in the calculation (all_writes, fast_enough_writes, write_errors, write_requests_total) are individually non-negative, and the denominator, made of one positive term only, is positive
  • Therefore it must be the numerator that evaluates to negative values from time to time, i.e. (all_writes - fast_enough_writes) + write_errors < 0
  • Rearranging, all_writes + write_errors < fast_enough_writes, i.e. fast_enough_writes > all_writes + write_errors.

Usually, write_errors is a very small amount, especially when compared to the other two terms, so let’s assume that it amounts to zero, which leaves us with fast_enough_writes > all_writes.
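To see how a negative numerator translates into an availability above 100%, here is a toy example with made-up numbers, chosen only for illustration:

all_writes = 1,000,000
fast_enough_writes = 1,000,500
write_errors = 0
write_requests_total = 1,000,000

unavailable_writes_percent = ((1,000,000 - 1,000,500) + 0) / 1,000,000 = -0.0005
availability = 1 - (-0.0005) = 1.0005, i.e. 100.05%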

all_writes is calculated from the recording rule

  • cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb="write"} (from now on, the “30d count” metric, or just the “count”).

fast_enough_writes, on the other hand, is calculated from another recording rule:

  • cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb="write", le="1"} (from now on, the “30d bucket” metric, or just the “bucket”).

The fast_enough_writes term filters the bucket metric to count the write operations that completed within one second (le="1"). When the system is healthy, we can expect this number to be close to the overall number of write requests (i.e. the infinity bucket, le="+Inf").
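Put together, the two rules express the latency part of the SLI. For instance, the fraction of writes completing within one second over the last 30 days could be read back with a query along these lines (an ad-hoc query of mine, not part of the mixin):

# share of write requests that completed within 1s over the last 30 days
sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb="write", le="1"})
/
sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb="write"})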

Seeking Confirmation

Plot of bucket infinity vs count.

To confirm the hypothesis, for one of the clusters showing this behaviour I plotted

cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{cluster="mycluster", le="+Inf", verb="GET", scope="resource"}

(the green line in the image) against

cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h{cluster="mycluster", verb="GET", scope="resource"}

(the yellow line). It is easy to notice from the image that the yellow line (the count) is always below or equal to the green one (the bucket infinity). The same was true for all the verbs I tested.
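Besides eyeballing the two lines, the discrepancy can also be surfaced directly with a query like the following sketch of mine, which returns negative values whenever the count falls behind the infinity bucket:

# negative results mean the count rule saw less data than the bucket rule
cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h{verb="GET", scope="resource"}
- ignoring(le)
cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{verb="GET", scope="resource", le="+Inf"}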

The Cause / Digging Into Prometheus Internals

Recording Rules Are Not Metrics

The count and bucket metrics look like parts of a Prometheus histogram. For a histogram metric, Prometheus guarantees that the count and the infinity bucket (le="+Inf") amount to the same value, yet for some reason this does not seem to hold in our case.
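On the raw histogram scraped from the API server (apiserver_request_sli_duration_seconds, the metric these recording rules are built from), the invariant can be checked with a sketch like this one of mine; it should return an empty result, confirming that the raw series are consistent:

# non-empty result would mean count and +Inf bucket disagree on the raw metric
apiserver_request_sli_duration_seconds_count
!= ignoring(le)
apiserver_request_sli_duration_seconds_bucket{le="+Inf"}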

The reason is that the Kubernetes API server does not expose any 30d metric out of the box: those are recording rules. As a matter of fact, these two series, the 30d count and the 30d bucket, are each calculated by a recording rule (which in turn builds on another pair of recording rules for the 1h count and bucket).

Prometheus

By digging a little bit into how Prometheus evaluates recording rules, we discover that recording rules in the same rule group are guaranteed to have the same timestamp, and that they are executed in order, but they are not guaranteed to be executed on the same data.

Key — Recording rules in a group are not guaranteed to be executed on the same data.

By quickly checking how the 30d (and 1h) rules are calculated, I found that they are indeed part of the same rule group, so they share the same timestamp, but, because of what was stated above, they are not guaranteed to be executed on the same data (reference issue), which seems to be the problem here.

Interestingly, in the way the rules are organised, the bucket rules (both 1h and 30d) appear after the respective count rules. This aligns with the behaviour observed above: assume that, between the evaluation of the count and bucket recording rules, new data arrives (increasing the total number of observed events). In that case, the bucket rule is calculated on fresher (more) data than the count was. The count rule would therefore evaluate to a lower number than the infinity bucket rule (count < bucket{le="+Inf"}), which is precisely what was observed.
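To picture that ordering, this is roughly what the 1h pair looks like as plain Prometheus rule YAML (a simplified sketch, not the actual mixin output; selectors and aggregation labels are abbreviated):

groups:
  - name: kube-apiserver-availability.rules
    rules:
      # evaluated first: the 1h count
      - record: cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h
        expr: sum by (cluster, verb, scope) (increase(apiserver_request_sli_duration_seconds_count[1h]))
      # evaluated second: the 1h bucket, possibly against freshly ingested samples
      - record: cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h
        expr: sum by (cluster, verb, scope, le) (increase(apiserver_request_sli_duration_seconds_bucket[1h]))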

Solution

Merged Pull Request — https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/976.

I’ve though of different possible solutions to the issue, such as

  • Introducing an offset to ensure calculation at times t were executed on data from max t - offset, by sizing the offset wisely it would be possible to ensure this issue does not appear, but also your results would be skewed by that offset
  • Capping the value at 100%, by accepting the loss in accuracy
  • Reordering rules to calculate the bucket first, and the count as the value of the infinity bucket just calculated.
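For completeness, the second option could be as simple as wrapping the recorded availability in clamp_max; a sketch, assuming the availability ends up in a recording rule named apiserver_request:availability30d (adjust the name to your setup):

# cap the 30d write availability at 100%
clamp_max(apiserver_request:availability30d{verb="write"}, 1)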

The last solution was the one with the fewest drawbacks, and it is the one implemented in the merged PR. Basically, it reorders the rules in the rule group as follows:

  1. Calculate the 1h bucket using its existing formula, as is
  2. Calculate the 1h count from it, changing its expression to count1h = bucket1h{le="+Inf"}
  3. Calculate the 30d bucket using its existing formula, as is
  4. Calculate the 30d count from it, with the expression count30d = bucket30d{le="+Inf"}.

This ordering ensures that the count is coherent with the bucket, because the equality is enforced by the expression itself and by the rule evaluation order, without the trade-offs of skewing the data or losing accuracy.
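In plain Prometheus rule YAML, the reordered 1h pair would look roughly like this (again a simplified sketch of mine, not the literal rules from the merged PR):

groups:
  - name: kube-apiserver-availability.rules
    rules:
      # 1. the bucket keeps its existing expression and is evaluated first
      - record: cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h
        expr: sum by (cluster, verb, scope, le) (increase(apiserver_request_sli_duration_seconds_bucket[1h]))
      # 2. the count is derived from the infinity bucket just recorded, so
      #    count == bucket{le="+Inf"} holds by construction (sum by drops the le label)
      - record: cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h
        expr: sum by (cluster, verb, scope) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{le="+Inf"})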

Conclusion

The takeaway is that when dealing with Prometheus recording rules, and using them to calculate new aggregations we want to expose as histograms, mimicking the behaviour of native metrics, we should keep in mind that recording rules are not metrics, and that, although unlikely, at scale those differences can cause unexpected behaviour.

Taking preventive steps, like enforcing in the rules' expressions the assumptions we rely on (such as the count being equal to the infinity bucket), can help prevent such issues from occurring.

Moreover, Prometheus is not a perfect system, and does not aspire to be, so when looking at dashboards built on top of it, we should never blindly trust the data we see, as (especially with aggregations) small errors can snowball quickly.

To conclude, I hope you liked this article, and see you next time!

Written by Lorenzo Felletti, Cloud & DevOps Engineer and open-source enthusiast.