CAPI pivot test always case failing in e2es #5252

Open
nrb opened this issue Dec 13, 2024 · 2 comments
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

nrb (Contributor) commented Dec 13, 2024

/kind failing-test

What steps did you take and what happened:

Both pull request jobs and periodic jobs are regularly failing on the capa-e2e.[It] [unmanaged] [Cluster API Framework] Self Hosted Spec Should pivot the bootstrap cluster to a self-hosted cluster test case.

A sample periodic job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464

A sample pull request job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-aws/5250/pull-cluster-api-provider-aws-e2e/1867146874104844288

What did you expect to happen:

The test case would pass more often.

Anything else you would like to add:

Having dug into this a few times (see PRs #5249 and #5251), I've come to the conclusion that, for some reason, the container image for the CAPA manager that's built during the test run isn't present on the Kubeadm control plane node during a clusterctl move.

The samples below pull information from the periodic job at https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464

build log output

   [FAILED] Timed out after 1200.001s.
  Timed out waiting for all MachineDeployment self-hosted-rjpecj/self-hosted-lv1y15-md-0 Machines to be upgraded to kubernetes version v1.29.9
  The function passed to Eventually returned the following error:
      <*errors.fundamental | 0xc003693da0>: 
      old Machines remain
      {
          msg: "old Machines remain",
          stack: [0x25eeeaa, 0x4f0046, 0x4ef159, 0xa6931f, 0xa6a3ec, 0xa67a46, 0x25eeb93, 0x25f2ece, 0x26aaa6b, 0xa45593, 0xa5974d, 0x47b3a1],
      }
  In [It] at: /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/machine_helpers.go:221 @ 12/11/24 22:12:08.155 

clusterctl move output

From https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/logs/self-hosted-rjpecj/clusterctl-move.log

Deleting AWSMachine="self-hosted-lv1y15-md-0-9xwxz-5hxvg" Namespace="self-hosted-rjpecj"
Retrying with backoff cause="error adding delete-for-move annotation from \"infrastructure.cluster.x-k8s.io/v1beta2, Kind=AWSMachine\" self-hosted-rjpecj/self-hosted-lv1y15-md-0-9xwxz-5hxvg: Internal error occurred: failed calling webhook \"mutation.awsmachine.infrastructure.cluster.x-k8s.io\": failed to call webhook: Post \"https://capa-webhook-service.capa-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-awsmachine?timeout=10s\": dial tcp 10.106.211.204:443: connect: connection refused"
Deleting AWSMachine="self-hosted-lv1y15-md-0-9xwxz-5hxvg" Namespace="self-hosted-rjpecj"
Retrying with backoff cause="error adding delete-for-move annotation from \"infrastructure.cluster.x-k8s.io/v1beta2, Kind=AWSMachine\" self-hosted-rjpecj/self-hosted-lv1y15-md-0-9xwxz-5hxvg: Internal error occurred: failed calling webhook \"mutation.awsmachine.infrastructure.cluster.x-k8s.io\": failed to call webhook: Post \"https://capa-webhook-service.capa-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-awsmachine?timeout=10s\": dial tcp 10.106.211.204:443: connect: connection refused"
Deleting AWSMachine="self-hosted-lv1y15-md-0-9xwxz-5hxvg" Namespace="self-hosted-rjpecj"

(retries continue until the job's terminated)

Since this is failing to reach webhooks, I looked at the CAPA control plane.
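
For anyone trying to reproduce this, a minimal sketch for confirming the webhook is unreachable, assuming access to the self-hosted cluster's kubeconfig (SELF_HOSTED_KUBECONFIG is a placeholder for that path):

  # Does the webhook Service have any ready endpoints, and are the CAPA pods behind it running?
  kubectl --kubeconfig "$SELF_HOSTED_KUBECONFIG" -n capa-system get endpoints capa-webhook-service
  kubectl --kubeconfig "$SELF_HOSTED_KUBECONFIG" -n capa-system get pods -o wide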

capa-manager Pod

This is the most obvious problem; the container image can't be pulled, leaving the pod stuck in ImagePullBackOff.

https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/resources/capa-system/Pod/capa-controller-manager-7f5964cb58-wmvb5.yaml

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:58Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: gcr.io/k8s-staging-cluster-api/capa-manager:e2e
    imageID: ""
    lastState: {}
    name: manager
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: Back-off pulling image "gcr.io/k8s-staging-cluster-api/capa-manager:e2e"
        reason: ImagePullBackOff
  hostIP: 10.0.136.158
  hostIPs:
  - ip: 10.0.136.158
  phase: Pending
  podIP: 192.168.74.199
  podIPs:
  - ip: 192.168.74.199
  qosClass: BestEffort
  startTime: "2024-12-11T21:52:55Z"
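
A quick way to surface the underlying pull error, as a sketch (the pod name is taken from the artifact above; the kubeconfig path is a placeholder):

  # The pod's events should show the repeated failed pulls of the retagged image.
  kubectl --kubeconfig "$SELF_HOSTED_KUBECONFIG" -n capa-system describe pod capa-controller-manager-7f5964cb58-wmvb5
  kubectl --kubeconfig "$SELF_HOSTED_KUBECONFIG" -n capa-system get events --field-selector reason=Failed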

Associated Node

The node associated with the pod does not list the gcr.io/k8s-staging-cluster-api/capa-manager:e2e image as being present.

From https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/resources/Node/ip-10-0-136-158.us-west-2.compute.internal.yaml

 images:
  - names:
    - docker.io/calico/cni@sha256:e60b90d7861e872efa720ead575008bc6eca7bee41656735dcaa8210b688fcd9
    - docker.io/calico/cni:v3.24.1
    sizeBytes: 87382462
  - names:
    - docker.io/calico/node@sha256:43f6cee5ca002505ea142b3821a76d585aa0c8d22bc58b7e48589ca7deb48c13
    - docker.io/calico/node:v3.24.1
    sizeBytes: 80180860
  - names:
    - registry.k8s.io/etcd@sha256:29901446ff08461789b7cd8565fc5b538134e58f81ca1f50fd65d0371cf6571e
    - registry.k8s.io/etcd:3.5.11-0
    sizeBytes: 57232947
  - names:
    - registry.k8s.io/kube-apiserver@sha256:b88538e7fdf73583c8670540eec5b3620af75c9ec200434a5815ee7fba5021f3
    - registry.k8s.io/kube-apiserver:v1.29.9
    sizeBytes: 35210641
  - names:
    - registry.k8s.io/kube-controller-manager@sha256:f2f18973ccb6996687d10ba5bd1b8f303e3dd2fed80f831a44d2ac8191e5bb9b
    - registry.k8s.io/kube-controller-manager:v1.29.9
    sizeBytes: 33739229
  - names:
    - docker.io/calico/kube-controllers@sha256:4010b2739792ae5e77a750be909939c0a0a372e378f3c81020754efcf4a91efa
    - docker.io/calico/kube-controllers:v3.24.1
    sizeBytes: 31125927
  - names:
    - registry.k8s.io/provider-aws/aws-ebs-csi-driver@sha256:02c42645c7a672bbf313ed420e384507dbf0b04992624a3979b87aa4b3f9228e
    - registry.k8s.io/provider-aws/aws-ebs-csi-driver:v1.17.0
    sizeBytes: 30172691
  - names:
    - registry.k8s.io/kube-proxy@sha256:124040dbe6b5294352355f5d34c692ecbc940cdc57a8fd06d0f38f76b6138906
    - registry.k8s.io/kube-proxy:v1.29.9
    sizeBytes: 28600769
  - names:
    - registry.k8s.io/kube-proxy@sha256:559a093080f70ca863922f5e4bb90d6926d52653a91edb5b72c685ebb65f1858
    - registry.k8s.io/kube-proxy:v1.29.8
    sizeBytes: 28599399
  - names:
    - registry.k8s.io/sig-storage/csi-provisioner@sha256:e468dddcd275163a042ab297b2d8c2aca50d5e148d2d22f3b6ba119e2f31fa79
    - registry.k8s.io/sig-storage/csi-provisioner:v3.4.0
    sizeBytes: 27427836
  - names:
    - registry.k8s.io/sig-storage/csi-resizer@sha256:3a7bdf5d105783d05d0962fa06ca53032b01694556e633f27366201c2881e01d
    - registry.k8s.io/sig-storage/csi-resizer:v1.7.0
    sizeBytes: 25809460
  - names:
    - registry.k8s.io/sig-storage/csi-snapshotter@sha256:714aa06ccdd3781f1a76487e2dc7592ece9a12ae9e0b726e4f93d1639129b771
    - registry.k8s.io/sig-storage/csi-snapshotter:v6.2.1
    sizeBytes: 25537921
  - names:
    - registry.k8s.io/sig-storage/csi-attacher@sha256:34cf9b32736c6624fc9787fb149ea6e0fbeb45415707ac2f6440ac960f1116e6
    - registry.k8s.io/sig-storage/csi-attacher:v4.2.0
    sizeBytes: 25508181
  - names:
    - registry.k8s.io/kube-scheduler@sha256:9c164076eebaefdaebad46a5ccd550e9f38c63588c02d35163c6a09e164ab8a8
    - registry.k8s.io/kube-scheduler:v1.29.9
    sizeBytes: 18851030
  - names:
    - registry.k8s.io/coredns/coredns@sha256:1eeb4c7316bacb1d4c8ead65571cd92dd21e27359f0d4917f1a5822a73b75db1
    - registry.k8s.io/coredns/coredns:v1.11.1
    sizeBytes: 18182961
  - names:
    - gcr.io/k8s-staging-provider-aws/cloud-controller-manager@sha256:533d2d64c213719da59c5791835ba05e55ddaaeb2b220ecf7cc3d88823580fc7
    - gcr.io/k8s-staging-provider-aws/cloud-controller-manager:v1.20.0-alpha.0
    sizeBytes: 15350315
  - names:
    - registry.k8s.io/sig-storage/csi-node-driver-registrar@sha256:4a4cae5118c4404e35d66059346b7fa0835d7e6319ff45ed73f4bba335cf5183
    - registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.7.0
    sizeBytes: 10147874
  - names:
    - registry.k8s.io/sig-storage/livenessprobe@sha256:2b10b24dafdc3ba94a03fc94d9df9941ca9d6a9207b927f5dfd21d59fbe05ba0
    - registry.k8s.io/sig-storage/livenessprobe:v2.9.0
    sizeBytes: 9194114
  - names:
    - registry.k8s.io/pause@sha256:7031c1b283388d2c2e09b57badb803c05ebed362dc88d84b480cc47f72a21097
    - registry.k8s.io/pause:3.9
    sizeBytes: 321520
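
The same absence can be confirmed straight from the Node object, as a sketch (node name from the artifact above; kubeconfig path is a placeholder):

  # List every image cached on the node; the grep should come back empty.
  kubectl --kubeconfig "$SELF_HOSTED_KUBECONFIG" get node ip-10-0-136-158.us-west-2.compute.internal \
    -o jsonpath='{range .status.images[*]}{.names}{"\n"}{end}' | grep capa-manager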

KubeadmConfig

The KubeadmConfig shows that containerd should pull the container image from ECR and retag it before the node joins the cluster.

https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/resources/self-hosted-rjpecj/KubeadmConfig/self-hosted-lv1y15-control-plane-qhfvf.yaml

  preKubeadmCommands:
  - mkdir -p /opt/cluster-api
  - ctr -n k8s.io images pull "public.ecr.aws/m3v9m3w5/capa/update:e2e"
  - ctr -n k8s.io images tag "public.ecr.aws/m3v9m3w5/capa/update:e2e" gcr.io/k8s-staging-cluster-api/capa-manager:e2e

The KubeadmControlPlane has the same entry.
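
If anyone has SSH/SSM access to the affected control plane node, re-running the same commands by hand (a sketch only, not verified against this job) should tell us whether the ECR pull itself fails or whether cloud-init never ran it:

  # Re-run the preKubeadmCommands manually on the node.
  sudo ctr -n k8s.io images pull "public.ecr.aws/m3v9m3w5/capa/update:e2e"
  sudo ctr -n k8s.io images tag "public.ecr.aws/m3v9m3w5/capa/update:e2e" gcr.io/k8s-staging-cluster-api/capa-manager:e2e
  # Confirm the retagged image is now visible to the runtime the kubelet uses.
  sudo ctr -n k8s.io images ls -q | grep capa-manager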

Creating the test image

Based on our end-to-end test definitions, the image is successfully created and uploaded to ECR. All other tests seem to be able to find it.

The ensureTestImageUploaded function is what logs in to ECR and uploads the image so that the nodes may then download it. https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/test/e2e/shared/aws.go#L676

The ginkgo suites require this function to pass.

Expect(ensureTestImageUploaded(e2eCtx)).NotTo(HaveOccurred())
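
Since the repository lives in a public ECR registry, the tag can also be checked from any machine. This is just a sketch that verifies the tag resolves, not that its digest matches what this particular run built:

  docker pull public.ecr.aws/m3v9m3w5/capa/update:e2e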

Environment:

  • Cluster-api-provider-aws version: main
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release): Ubuntu on Kube CI
k8s-ci-robot added the kind/failing-test, needs-priority, and needs-triage labels Dec 13, 2024
nrb (Contributor, Author) commented Dec 13, 2024

/triage accepted
/priority critical-urgent
/assign

k8s-ci-robot added the triage/accepted and priority/critical-urgent labels and removed the needs-triage and needs-priority labels Dec 13, 2024
nrb (Contributor, Author) commented Dec 13, 2024

I think the preKubeadmCommands are passed to the node via cloud-init.
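
If so, the node's cloud-init logs should show whether the ctr pull ran and what it returned. A sketch, assuming an Ubuntu AMI with the default cloud-init log locations:

  sudo cloud-init status --long
  grep -n 'ctr -n k8s.io images' /var/log/cloud-init-output.log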

Is this possibly related to #4745?

nrb changed the title from "CAPI pivot test case failing in e2es" to "CAPI pivot test always case failing in e2es" Dec 13, 2024
nrb pinned this issue Dec 13, 2024