etcd database space exceeded due to many old retinaendpoints.retina.sh objects #1132

tomglaza · 2024-12-12T08:22:12Z

Describe the bug
Retina sometimes fails to remove retinaendpoints.retina.sh objects, which leads to the error 'etcdserver: mvcc: database space exceeded' and halts cluster operation.

To Reproduce
It is difficult to give clear reproduction steps, because the problem (at least since the last update) occurs only periodically. Most stale objects accumulate in namespaces where jobs are started by spark-operator. Many of the pods in these namespaces end up with the status Error, ContainerStatusUnknown, or OOMKilled.
The last time I deleted all retinaendpoints.retina.sh objects (two weeks ago, when there were 10 times more of them than pods), everything was fine for a while. The problem has now recurred; below is a summary of the etcd database:

[root@master-3 ~]# etcdctl get /registry --prefix --keys-only | grep -v ^$ | awk -F '/'  '{ h[$3]++ } END {for (k in h) print h[k], k}' | sort -nr | head
20811 events
8927 retina.sh
3785 cilium.io
3500 kyverno.io
2895 argoproj.io
1929 pods
1169 configmaps
1032 services
1010 replicasets
633 secrets

As you can see, the number of retina.sh objects is much higher than the number of pods or cilium.io objects, which I believe indicates an incorrect state.
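To see which namespaces hold most of the retina.sh keys, the same kind of aggregation can be narrowed to one resource. This is a minimal sketch, not something taken from the report: the helper name `count_by_namespace` is hypothetical, and it assumes the usual etcd key layout for namespaced custom resources, `/registry/<group>/<resource>/<namespace>/<name>`.

```shell
# Hypothetical helper: reads etcd key names on stdin and prints
# "<count> <namespace>" pairs, largest count first.
# Assumes keys look like /registry/<group>/<resource>/<namespace>/<name>,
# so the namespace is the 5th '/'-separated field (field 1 is empty).
count_by_namespace() {
  awk -F '/' 'NF > 1 { n[$5]++ } END { for (k in n) print n[k], k }' | sort -nr
}

# Example use against a live cluster (not run here):
#   etcdctl get /registry/retina.sh/retinaendpoints --prefix --keys-only | count_by_namespace
```

If the spark-operator namespaces dominate the output, that would support the observation that the stale objects come from pods ending in Error, ContainerStatusUnknown, or OOMKilled.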

Expected behavior
The number of retina.sh objects in the etcd database should not significantly exceed the number of pod objects.
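One way to check this expectation is to list the retinaendpoints objects that have no matching pod. The sketch below assumes that a RetinaEndpoint shares its namespace and name with the pod it describes; that assumption, the helper name `list_orphans`, and the file names are not from this report and should be verified before deleting anything.

```shell
# Hypothetical helper: prints "namespace/name" for every entry in the
# endpoints file that has no counterpart in the pods file.
# Both files hold one "namespace name" pair per line.
# ASSUMPTION: a RetinaEndpoint has the same namespace and name as its pod.
list_orphans() {
  pods_file=$1
  endpoints_file=$2
  tmp_pods=$(mktemp)
  tmp_eps=$(mktemp)
  # comm needs both inputs sorted with the same collation, hence LC_ALL=C.
  awk '{ print $1 "/" $2 }' "$pods_file" | LC_ALL=C sort -u > "$tmp_pods"
  awk '{ print $1 "/" $2 }' "$endpoints_file" | LC_ALL=C sort -u > "$tmp_eps"
  # -13 hides lines unique to the pods list and lines common to both,
  # leaving only the endpoints without a pod.
  LC_ALL=C comm -13 "$tmp_pods" "$tmp_eps"
  rm -f "$tmp_pods" "$tmp_eps"
}

# Example use against a live cluster (not run here):
#   cols='NS:.metadata.namespace,NAME:.metadata.name'
#   kubectl get pods -A --no-headers -o custom-columns="$cols" > pods.txt
#   kubectl get retinaendpoints.retina.sh -A --no-headers -o custom-columns="$cols" > endpoints.txt
#   list_orphans pods.txt endpoints.txt
```

The output is only a list of candidates; a pod created between the two kubectl calls could make a live endpoint look orphaned, so it is worth re-checking before any cleanup.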

Platform (please complete the following information):

OS: Alma Linux 8
Kubernetes Version: 1.30.6
Host: self-host
Retina Version: v0.0.19



github-project-automation bot added this to Retina Triage Board Dec 12, 2024

nddq added type/bug Something isn't working area/operator labels Dec 12, 2024

nddq moved this to Accepted in Retina Triage Board Dec 12, 2024
