webhook: PR build frequently failing #2179

Open
stevenhorsman opened this issue Dec 3, 2024 · 3 comments
Labels
bug Something isn't working CI Issues related to CI workflows

Comments

@stevenhorsman
Member

On several recent PRs I've found the webhook build failing, e.g. https://github.com/confidential-containers/cloud-api-adaptor/actions/runs/12144031128/job/33862372335?pr=2173
with:

not ok 3 [webhook] test default parameters can be changed
# (in test file tests/e2e/webhook_tests.bats, line 149)
#   `kubectl apply -f -' failed
# runtimeclass.node.k8s.io/kata-wh-test created
# deployment.apps/peer-pods-webhook-controller-manager env updated
# peer-pods-webhook-controller-manager has been successfully rolled out
# peer-pods-webhook-controller-manager is ready
# All pods have the correct TARGET_RUNTIMECLASS value: kata-wh-test
# Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "mwebhook.peerpods.io": failed to call webhook: Post "https://peer-pods-webhook-webhook-service.peer-pods-webhook-system.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded
# Error from server (NotFound): error when deleting "/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook/hack/pod.yaml": pods "nginx" not found

Having to re-run these jobs is getting pretty annoying, so it would be good to investigate and fix this.

@stevenhorsman stevenhorsman added bug Something isn't working CI Issues related to CI workflows labels Dec 3, 2024
stevenhorsman added a commit to stevenhorsman/cloud-api-adaptor that referenced this issue Dec 11, 2024
The third webhook test is frequently failing (see confidential-containers#2179),
so skip this until there is time to investigate and fix it.

Signed-off-by: stevenhorsman <[email protected]>
stevenhorsman added a commit to stevenhorsman/cloud-api-adaptor that referenced this issue Dec 13, 2024
The third webhook test is frequently failing (see confidential-containers#2179),
so skip this until there is time to investigate and fix it.

Signed-off-by: stevenhorsman <[email protected]>
wainersm pushed a commit that referenced this issue Dec 16, 2024
The third webhook test is frequently failing (see #2179),
so skip this until there is time to investigate and fix it.

Signed-off-by: stevenhorsman <[email protected]>
@wainersm
Member

I've investigated that issue.

The failing test, "test default parameters can be changed", modifies the already deployed webhook, then launches a pod and checks that the pod got mutated. After changing the webhook, it simply waits for the deployment to be ready:

<snip>
kubectl set env deployment/peer-pods-webhook-controller-manager \
		-n peer-pods-webhook-system TARGET_RUNTIMECLASS="$runtimeclass"

# Wait for the controller pods to be ready.
wait_for_deployment peer-pods-webhook-controller-manager peer-pods-webhook-system
<snip>

I suspect that waiting for readiness of the deployment pod isn't enough; the test should wait on something else to ensure the service is fully operational again. I thought the certificate might need to be re-generated by cert-manager, so I added a `kubectl wait --for=condition=Ready certificate/peer-pods-webhook-serving-cert -n peer-pods-webhook-system`, but I realized the certificate is always Ready anyway. I poked around but didn't find the exact root of the problem; I lack knowledge of k8s webhooks.

Well, then I added an arbitrary `sleep 15` after the deployment becomes Ready. I ran the tests 10x and they all passed. Is that an acceptable workaround?
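
For comparison, a fixed sleep could in principle be replaced by polling the webhook until it answers. The following is only a sketch under assumptions: `retry_until` is a hypothetical helper that does not exist in the repository, and the usage line assumes the webhook declares `sideEffects: None`, so that a server-side dry-run request is actually sent to it.

```shell
# retry_until TIMEOUT INTERVAL CMD [ARGS...]
# Hypothetical helper (not part of the repository): re-runs CMD every
# INTERVAL seconds until it succeeds or TIMEOUT seconds have elapsed.
retry_until() {
	local timeout="$1" interval="$2" deadline
	shift 2
	deadline=$(( $(date +%s) + timeout ))
	until "$@"; do
		if [ "$(date +%s)" -ge "$deadline" ]; then
			echo "retry_until: timed out after ${timeout}s: $*" >&2
			return 1
		fi
		sleep "$interval"
	done
}

# Possible use in place of the fixed sleep. Assumption: the webhook is
# called on server-side dry-run requests (sideEffects: None), so the
# command only succeeds once the webhook service responds.
#   retry_until 60 2 kubectl apply --dry-run=server -f hack/pod.yaml
```

Unlike a plain sleep, this returns as soon as the webhook responds and fails loudly if it never does within the timeout.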

@ldoktor
Contributor

ldoktor commented Dec 18, 2024

Not sure I'll have enough time to play with this but when I was developing a webhook I used to use: kubectl -n webhook-namespace rollout restart deployment webhook-server && oc rollout status --watch --timeout=0s -n webhook-namespace deployment webhook-server to wait for it to start. It's true that if the webhook takes some time to initialize, it'd be even better to wait for a log message in the deployment output. But I'd be against plain sleep ...
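
A rough sketch of the "wait for a log message" idea, under stated assumptions: `wait_for_log` is a hypothetical helper, and the startup message is a placeholder because the actual text the webhook logs once it is serving is not known here. It does a plain substring match on the output of whatever command is passed in.

```shell
# wait_for_log TEXT TIMEOUT CMD [ARGS...]
# Hypothetical helper: re-runs CMD once per second until its output
# contains TEXT (substring match, not a regex) or TIMEOUT seconds pass.
wait_for_log() {
	local text="$1" timeout="$2" deadline output
	shift 2
	deadline=$(( $(date +%s) + timeout ))
	while true; do
		# Ignore failures of CMD itself (e.g. the pod is not created yet).
		output="$("$@" 2>&1)" || true
		case "$output" in
			*"$text"*) return 0 ;;
		esac
		if [ "$(date +%s)" -ge "$deadline" ]; then
			echo "wait_for_log: '$text' not seen after ${timeout}s" >&2
			return 1
		fi
		sleep 1
	done
}

# Possible use after the rollout has completed. The quoted message is a
# placeholder, not the webhook's real log line:
#   kubectl rollout status deployment/peer-pods-webhook-controller-manager \
#       -n peer-pods-webhook-system --timeout=120s
#   wait_for_log "<startup message>" 60 kubectl logs \
#       deployment/peer-pods-webhook-controller-manager -n peer-pods-webhook-system
```

Note that `kubectl logs deployment/...` reads from a single pod chosen by kubectl, so with several replicas this only shows that one of them printed the message.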

@wainersm
Member

Hi @ldoktor !

> Not sure I'll have enough time to play with this but when I was developing a webhook I used to use: kubectl -n webhook-namespace rollout restart deployment webhook-server && oc rollout status --watch --timeout=0s -n webhook-namespace deployment webhook-server to wait for it to start. It's true that if the webhook takes some time to initialize, it'd be even better to wait for a log message in the deployment output. But I'd be against plain sleep ...

Interesting, we are already using kubectl rollout to wait for the deployment to be ready. What I couldn't find online or in the documentation is whether rollout checks the livenessProbe or not.


3 participants