Improving Kubernetes-Mixin API Server Rules Consistency
A journey into troubleshooting an insidious, and subtle, issue that may occur with Prometheus Recording Rules
Preamble
A very interesting and widely used project in the Kubernetes space is kubernetes-mixin. The project, written in Jsonnet, provides a well-tested and customisable set of Prometheus rules, alerts, and Grafana dashboards.
An important subset of these rules covers the Kubernetes API server, providing valuable insights such as the server's availability, error budget, workqueue lengths, and more.
I recently made a small contribution to that project, solving an issue I had been seeing for a while. Solving it forced me to take a deep dive into the details of Prometheus recording rules, and I am writing this to share what I learned.
The Problem
What I observed in some very large clusters was that the 30-day availability and error budget occasionally went (slightly) above 100%.
GitHub Issue — https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/975
Root Cause Analysis
The problem happened sporadically, especially on large and heavily used clusters.
The kubernetes-mixin splits the API server availability into three views (a query sketch follows the list):
- Write availability (verb="write"), i.e. requests whose verb label matches the selector verb=~"POST|PUT|PATCH|DELETE"
- Read availability (verb="read"), i.e. requests whose verb label matches the selector verb=~"LIST|GET"
- Overall availability (verb="all").
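As a purely illustrative sketch (not taken from the mixin itself), this is roughly how the raw write and read request rates could be queried ad hoc against the apiserver_request_sli_duration_seconds histogram using the same verb selectors, assuming a job="apiserver" label and a cluster label on the series:
# Write requests per second over the last 5 minutes
sum by (cluster) (
  rate(apiserver_request_sli_duration_seconds_count{job="apiserver", verb=~"POST|PUT|PATCH|DELETE"}[5m])
)
# Read requests per second over the last 5 minutes
sum by (cluster) (
  rate(apiserver_request_sli_duration_seconds_count{job="apiserver", verb=~"LIST|GET"}[5m])
)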
Maths to the Rescue
The problem affected all three views of the availability, but for the sake of simplicity I will refer only to the write one in what follows, since it has the easiest expression of the three to analyse.
To understand why the 30d availability was going above 100%, I started by looking at the PromQL expression and analysing it.
The following is the expression for the write 30-day availability (rendered from the project's Jsonnet, with the SLO window set to 30d):
1 - (
  (
    # too slow
    sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb="write"})
    -
    sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb="write",le="1"})
  )
  +
  # errors
  sum by (cluster) (code:apiserver_request_total:increase30d{verb="write",code=~"5.."} or vector(0))
)
/
sum by (cluster) (code:apiserver_request_total:increase30d{verb="write"})
The expression terms can be refactored as follows:
1 - ((all_writes - fast_enough_writes) + write_errors) / write_requests_total
=
1 - (slow_writes + write_errors) / write_requests_total
=
1 - unavailable_writes_percent
The refactored terms are simpler to analyse:
- To go above 100%, the unavailable_writes_percent term needs to be negative (1 - a is greater than 1, i.e. > 100%, if a is negative)
- unavailable_writes_percent is calculated as a fraction: ((all_writes - fast_enough_writes) + write_errors) / write_requests_total
- All terms involved in the calculation are individually positive (all_writes, fast_enough_writes, write_errors, write_requests_total), and the denominator is positive (it consists of a single positive term)
- Therefore it must be the numerator that evaluates to negative values from time to time
- This means that (all_writes - fast_enough_writes) + write_errors < 0, i.e. all_writes + write_errors < fast_enough_writes (a worked numerical example follows).
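To make this concrete, here is a worked example with purely hypothetical numbers:
all_writes         = 1000000
fast_enough_writes = 1000050   (hypothetical: slightly larger than all_writes)
write_errors       = 0
numerator    = (1000000 - 1000050) + 0 = -50
availability = 1 - (-50 / 1000000) = 1.00005 = 100.005%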
Usually, write_errors is a very small amount, especially when compared to the other two terms, so let's assume it amounts to zero, which leaves us with fast_enough_writes > all_writes.
all_writes is calculated as cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb="write"}; from now on, I will refer to this as the "30d count" metric, or just the "count".
fast_enough_writes is, on the other hand, calculated from another recording rule: cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb="write", le="1"}; from now on, I will refer to this as the "30d bucket" metric, or just the "bucket".
The term fast_enough_writes filters the bucket metric to get the number of write operations that took less than one second (1s) to complete. When the system is healthy, we can expect this number to be close to the overall number of write requests (i.e. the infinity bucket, le="+Inf").
Seeking Confirmation

To confirm the hypothesis, for one of the clusters showing this behaviour I plotted cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{cluster="mycluster", le="+Inf", verb="GET", scope="resource"} (the green line in the image) against cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h{cluster="mycluster", verb="GET", scope="resource"} (the yellow line). It is easy to notice from the image that the yellow line (count) is always below or equal to the green one (infinity bucket). That was true for all the verbs I tested.
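Another way to spot the mismatch, sketched here under the assumption that the two recording rules carry the standard cluster, verb, scope (and le) labels, is to subtract the infinity bucket from the count directly; any negative sample indicates that the count has fallen behind the bucket:
  cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h
- ignoring(le)
  cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{le="+Inf"}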
The Cause / Digging Into Prometheus Internals
Recording Rules Are Not Metrics
The count and bucket metrics look like they are part of a histogram metric in Prometheus. In Prometheus, we are guaranteed that in a histogram metric the count and the infinity bucket (le="+Inf") amount to the same value, but for some reason this doesn't seem to be true in our case.
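For a raw, scraped histogram this invariant can be checked directly. The following sketch (assuming the job="apiserver" label) should return 0 for every series, unlike the recording-rule comparison shown earlier:
  apiserver_request_sli_duration_seconds_count{job="apiserver"}
- ignoring(le)
  apiserver_request_sli_duration_seconds_bucket{job="apiserver", le="+Inf"}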
The reason is that the Kubernetes API server does not expose any 30d metric out of the box; those are recording rules. As a matter of fact, these two metrics, the 30d count and bucket, are each calculated by a recording rule (which in turn uses another pair of recording rules for the 1h count and bucket).
Prometheus
By digging a little bit into how Prometheus evaluates recording rules, we discover that recording rules in the same rule group are guaranteed to have the same timestamp, and that they are executed in order, but they are not guaranteed to be executed on the same data.
Key — Recording rules in a group are not guaranteed to be executed on the same data.
By quickly checking how the 30d (and 1h) rules are calculated, I found that they are indeed part of the same rule group, so they share the same timestamp; but, because of what is stated above, they are not guaranteed to be executed on the same data (reference issue), which seems to be the issue here.
Interestingly, in the way they are organised, the bucket rules (both 1h and 30d) appear after the respective count rules. This would align with the behaviour observed above: let's assume that, between the evaluation of the count and bucket recording rules, new data arrives (increasing the total number of observed events). In that case, the bucket metric is calculated on "fresher", i.e. more, data than the count is. Therefore, the count rule would evaluate to a lower number than the infinity bucket rule (count < bucket{le="+Inf"}), which is precisely what was observed.
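To illustrate the shape of the problem, here is a simplified sketch of a rule group with the count rule recorded before the bucket rule; it is not the exact upstream definition (the real group has more rules, filters, and the 30d aggregations), just an illustration of the ordering:
groups:
  - name: kube-apiserver-availability.rules   # illustrative, simplified
    rules:
      # Evaluated first: the count...
      - record: cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h
        expr: sum by (cluster, verb, scope) (increase(apiserver_request_sli_duration_seconds_count{job="apiserver"}[1h]))
      # ...then the bucket, which may see samples ingested in the meantime
      - record: cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h
        expr: sum by (cluster, verb, scope, le) (increase(apiserver_request_sli_duration_seconds_bucket{job="apiserver"}[1h]))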
Solution
Merged Pull Request — https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/976.
I thought of different possible solutions to the issue, such as:
- Introducing an offset, so that calculations at time t are executed on data no newer than t - offset; by sizing the offset wisely it would be possible to ensure this issue does not appear, but the results would also be skewed by that offset
- Capping the value at 100%, accepting the loss in accuracy
- Reordering the rules to calculate the bucket first, and the count as the value of the infinity bucket just calculated.
This last solution was the one with the fewest drawbacks, and it is the one implemented in the merged PR. Basically, it reorders the rules in the rule group as follows (a sketch of the resulting rules comes after the list):
- Calculate the 1h bucket using its current formula, as is
- Calculate the 1h count by changing its expression to count1h = bucket1h{le="+Inf"}
- Calculate the 30d bucket using its current formula, as is
- Calculate the 30d count by changing its expression to count30d = bucket30d{le="+Inf"}.
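As a rough sketch of what this looks like for the 1h pair (the exact expressions in the merged PR may differ; this only illustrates the reordering and the derived count):
rules:
  # Bucket first, still computed from the raw histogram
  - record: cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h
    expr: sum by (cluster, verb, scope, le) (increase(apiserver_request_sli_duration_seconds_bucket{job="apiserver"}[1h]))
  # Count derived from the +Inf bucket just recorded above, so the two cannot diverge
  - record: cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h
    expr: sum by (cluster, verb, scope) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{le="+Inf"})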
This ordering ensures the count is coherent with the bucket, because coherence is enforced by the expression and by the rule evaluation order, without the trade-off of skewed data or any other drawback.
Conclusion
The takeaway is that when dealing with Prometheus rules, and using them to calculate new aggregations that we want to expose as histograms, mimicking the behaviour of native metrics, we should take into account that recording rules are not like metrics, and that, although unlikely, at scale those differences might cause unexpected behaviour.
Taking preventive steps, like enforcing in the rules' expressions assumptions such as the count being equal to the infinity bucket, might help us prevent such issues from occurring.
Moreover, Prometheus is not a perfect system, and does not aspire to be, so when looking at dashboards based on it, we should never blindly trust the data we see, as (especially with aggregations) small errors may snowball and become big quite fast.
To conclude, I hope you liked this article, and see you next time!