webhook: PR build frequently failing #2179

stevenhorsman · 2024-12-03T16:42:42Z

On several recent PRs I've found the webhook build failing, e.g. https://github.com/confidential-containers/cloud-api-adaptor/actions/runs/12144031128/job/33862372335?pr=2173
with:

not ok 3 [webhook] test default parameters can be changed
# (in test file tests/e2e/webhook_tests.bats, line 149)
#   `kubectl apply -f -' failed
# runtimeclass.node.k8s.io/kata-wh-test created
# deployment.apps/peer-pods-webhook-controller-manager env updated
# peer-pods-webhook-controller-manager has been successfully rolled out
# peer-pods-webhook-controller-manager is ready
# All pods have the correct TARGET_RUNTIMECLASS value: kata-wh-test
# Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "mwebhook.peerpods.io": failed to call webhook: Post "https://peer-pods-webhook-webhook-service.peer-pods-webhook-system.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded
# Error from server (NotFound): error when deleting "/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook/hack/pod.yaml": pods "nginx" not found

Having to re-run these builds is getting pretty annoying, so it would be good to investigate and fix this.

wainersm · 2024-12-17T13:57:06Z

I've investigated that issue.

The failing test, "test default parameters can be changed", modifies the already-deployed webhook, then launches a pod and checks that it was mutated. After changing the webhook, it simply waits for the deployment to be ready:

<snip>
kubectl set env deployment/peer-pods-webhook-controller-manager \
		-n peer-pods-webhook-system TARGET_RUNTIMECLASS="$runtimeclass"

# Wait for the controller pods to be ready.
wait_for_deployment peer-pods-webhook-controller-manager peer-pods-webhook-system
<snip>

I suspect that waiting for readiness of the deployment's pods isn't enough; the test should wait on something else to ensure the service is fully operational again. I thought the certificate might need to be regenerated by cert-manager, so I added kubectl wait --for=condition=Ready certificate/peer-pods-webhook-serving-cert -n peer-pods-webhook-system, but then I realized the certificate is always Ready anyway. I poked around but didn't find the exact root of the problem; I lack knowledge of k8s webhooks.

So I added an arbitrary sleep 15 after the deployment becomes Ready. I ran the tests 10 times and they all passed. Is that an acceptable workaround?
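As an alternative to a fixed sleep, a polling loop with a deadline can be sketched as below. This is a minimal sketch, not code from the test suite: `wait_until` and `webhook_has_endpoints` are hypothetical helpers, and the choice of "the webhook Service has at least one ready endpoint" as the condition is an assumption. The Service and namespace names are taken from the URL in the error log above. Whether this condition is sufficient to avoid the timeout has not been verified.

```shell
#!/usr/bin/env bash

# wait_until TIMEOUT INTERVAL CMD...
# Re-runs CMD until it succeeds, giving up once roughly TIMEOUT seconds
# of sleeping have accumulated (time spent inside CMD is not counted).
# Hypothetical helper, not part of the existing tests.
wait_until() {
    local timeout="$1" interval="$2" elapsed=0
    shift 2
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}

# Assumed condition: the webhook Service lists at least one ready address.
webhook_has_endpoints() {
    local addrs
    addrs=$(kubectl get endpoints peer-pods-webhook-webhook-service \
        -n peer-pods-webhook-system \
        -o jsonpath='{.subsets[*].addresses[*].ip}') || return 1
    [ -n "$addrs" ]
}

# Possible usage in place of "sleep 15":
#   wait_until 60 2 webhook_has_endpoints
```

Compared with a plain sleep, the loop returns as soon as the condition holds and fails with a clear message when it never does.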

ldoktor · 2024-12-18T07:04:34Z

Not sure I'll have enough time to play with this but when I was developing a webhook I used to use: kubectl -n webhook-namespace rollout restart deployment webhook-server && oc rollout status --watch --timeout=0s -n webhook-namespace deployment webhook-server to wait for it to start. It's true that if the webhook takes some time to initialize, it'd be even better to wait for a log message in the deployment output. But I'd be against plain sleep ...
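The "wait for a log message" idea above could look roughly like the following. This is a sketch under assumptions: `wait_for_output` is a hypothetical helper, and the startup message the webhook actually logs is not given in this thread, so the pattern in the usage comment is a placeholder that would need to be replaced after checking the real logs.

```shell
#!/usr/bin/env bash

# wait_for_output TIMEOUT PATTERN CMD...
# Runs CMD once per second until its stdout contains PATTERN (a grep
# basic regular expression), or fails after about TIMEOUT seconds.
# Hypothetical helper, not part of the existing tests.
wait_for_output() {
    local timeout="$1" pattern="$2" elapsed=0
    shift 2
    until "$@" 2>/dev/null | grep -q -- "$pattern"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Possible usage; "<startup message>" is a placeholder, not a known log line:
#   kubectl rollout status deployment/peer-pods-webhook-controller-manager \
#       -n peer-pods-webhook-system --timeout=120s
#   wait_for_output 60 "<startup message>" \
#       kubectl logs deployment/peer-pods-webhook-controller-manager \
#       -n peer-pods-webhook-system
```

Note that kubectl logs against a deployment reads from a single pod, so with several replicas or containers the command would likely need adjusting.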

wainersm · 2024-12-18T13:57:20Z

Hi @ldoktor !

> Not sure I'll have enough time to play with this but when I was developing a webhook I used to use: kubectl -n webhook-namespace rollout restart deployment webhook-server && oc rollout status --watch --timeout=0s -n webhook-namespace deployment webhook-server to wait for it to start. It's true that if the webhook takes some time to initialize, it'd be even better to wait for a log message in the deployment output. But I'd be against plain sleep ...

Interesting that you use kubectl rollout to wait for the deployment to be ready. What I couldn't find online or in the documentation is whether rollout checks the livenessProbe or not.
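On the probe question: as far as the Kubernetes documentation describes it, a Deployment rollout counts a replica as available once its pod is Ready, and pod readiness is driven by the readinessProbe; the livenessProbe only decides when a container is restarted. To see which probes the webhook deployment actually defines, something like the sketch below could be used. `show_probes` is a hypothetical helper, and it assumes kubectl access to the cluster.

```shell
#!/usr/bin/env bash

# show_probes DEPLOYMENT NAMESPACE
# Prints the readiness, liveness and startup probe definitions of every
# container in the deployment's pod template (an empty value means the
# probe is not configured). Hypothetical helper.
show_probes() {
    local deploy="$1" ns="$2" probe
    for probe in readinessProbe livenessProbe startupProbe; do
        printf '%s: ' "$probe"
        kubectl get deployment "$deploy" -n "$ns" \
            -o jsonpath="{.spec.template.spec.containers[*].${probe}}"
        printf '\n'
    done
}

# Possible usage:
#   show_probes peer-pods-webhook-controller-manager peer-pods-webhook-system
```

If the readinessProbe turns out to be missing or only checks a health endpoint unrelated to the webhook server, that would be consistent with the deployment reporting Ready before the webhook can answer requests.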

stevenhorsman added the bug and CI labels Dec 3, 2024

xutao323 mentioned this issue Dec 9, 2024

providers/libvirt: add support for aarch64 host #2193

Merged

stevenhorsman mentioned this issue Dec 11, 2024

webhook: Skip flakey e2e test #2205

Merged

wainersm pushed a commit that referenced this issue Dec 16, 2024

webhook: Skip flakey e2e test

c6919be

The third webhook test is frequently failing (see #2179), so skip this until there is time to investigate and fix it. Signed-off-by: stevenhorsman <[email protected]>
