EKS e2e tests permanently failing #5237

Open
nrb opened this issue Dec 4, 2024 · 5 comments · May be fixed by #5239
Assignees: nrb
Labels: kind/bug · priority/critical-urgent · triage/accepted

Comments


nrb commented Dec 4, 2024

/kind bug

What steps did you take and what happened:

CI is failing with errors like this: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-aws/5211/pull-cluster-api-provider-aws-e2e-eks/1856371404925046784

It appears that the "EKS control plane with addons" test never succeeds.

Further investigation showed that the control plane was constantly blocked waiting for the CoreDNS addon update to complete.
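
While the update is stuck, the addon state can be checked with the AWS CLI (a sketch; the cluster name is a placeholder):

aws eks describe-addon \
  --cluster-name <cluster-name> \
  --addon-name coredns \
  --query 'addon.{status:status,health:health}'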

What did you expect to happen:

Tests pass.

Anything else you would like to add:

I have tried running the test locally and have found that the CoreDNS pods never get scheduled.

A sample kubectl describe output:

Name:                 coredns-787cb67946-g9qjz
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      coredns
Node:                 <none>
Labels:               eks.amazonaws.com/component=coredns
                      k8s-app=kube-dns
                      pod-template-hash=787cb67946
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/coredns-787cb67946
Containers:
  coredns:
    Image:       602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/coredns:v1.11.1-eksbuild.8
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf9wg (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  kube-api-access-kf9wg:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               <none>
Tolerations:                  CriticalAddonsOnly op=Exists
                              node-role.kubernetes.io/control-plane:NoSchedule
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  topology.kubernetes.io/zone:ScheduleAnyway when max skew 1 is exceeded for selector k8s-app=kube-dns
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  2m38s (x64 over 12m)  default-scheduler  no nodes available to schedule pods
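
For reference, the output above can be reproduced with (the pod name is from this run):

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl describe pod -n kube-system coredns-787cb67946-g9qjz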

I'm trying to get access to the EKS nodes to validate the taints defined on them, but so far haven't been able to do so.
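
If any nodes do register, their taints are visible through the API without node access, e.g.:

kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints'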

Changing the Kubernetes version to 1.29 also results in this behavior.

Environment:

  • Cluster-api-provider-aws version: main
  • Kubernetes version (`kubectl version`): v1.30
  • OS (e.g. from /etc/os-release):
k8s-ci-robot added the kind/bug, needs-priority, and needs-triage labels on Dec 4, 2024.

nrb commented Dec 4, 2024

/triage accepted
/priority critical-urgent

k8s-ci-robot added the triage/accepted and priority/critical-urgent labels and removed the needs-triage and needs-priority labels on Dec 4, 2024.

nrb commented Dec 4, 2024

/assign


nrb commented Dec 5, 2024

I created an EKS cluster through the AWS console with only the VPC CNI and CoreDNS addons, and I hit the same problem: the CoreDNS pods are never scheduled.

I enabled the scheduler log (delivered to CloudWatch) and it's looping on this:

I1205 18:44:19.381666      11 schedule_one.go:1040] "Unable to schedule pod; no nodes are registered to the cluster; waiting" pod="kube-system/coredns-787cb67946-7dd5f"

This leads me to believe that something in the networking or DNS configuration may be incorrect.
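
For reference, I turned the scheduler log on with something like the following (the cluster name is a placeholder); the entries land in the CloudWatch log group /aws/eks/<cluster-name>/cluster:

aws eks update-cluster-config \
  --name <cluster-name> \
  --logging '{"clusterLogging":[{"types":["scheduler"],"enabled":true}]}'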


nrb commented Dec 5, 2024

I've made sure my API server access is public, so I don't believe that's the issue. And throughout this process I've been able to use kubectl to interact with the cluster.

[Screenshot, 2024-12-05: AWS console showing the cluster's API server endpoint access set to Public]
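
The endpoint access settings can also be confirmed from the CLI (cluster name is a placeholder):

aws eks describe-cluster --name <cluster-name> \
  --query 'cluster.resourcesVpcConfig.{public:endpointPublicAccess,private:endpointPrivateAccess}'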

nrb linked pull request #5239 on Dec 5, 2024 that will close this issue.

nrb commented Dec 5, 2024

Creating a node myself allowed the CoreDNS pods to schedule.

I1205 19:41:33.701770      11 schedule_one.go:304] "Successfully bound pod to node" pod="kube-system/coredns-787cb67946-7dd5f" node="ip-10-0-8-110.us-west-2.compute.internal" evaluatedNodes=1 feasibleNodes=1

So I think the test is no longer valid: an EKS cluster with just a control plane and no worker nodes no longer schedules CoreDNS, even though it did in the past.
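
For anyone reproducing this, a single node is enough to unblock scheduling; a minimal sketch with a managed node group (all names, IDs, and ARNs are placeholders):

aws eks create-nodegroup \
  --cluster-name <cluster-name> \
  --nodegroup-name dns-test \
  --subnets <subnet-id> \
  --node-role <node-instance-role-arn> \
  --scaling-config minSize=1,maxSize=1,desiredSize=1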
