EKS e2e tests permanently failing #5237

Open
nrb opened this issue Dec 4, 2024 · 5 comments · May be fixed by #5239
Assignees: nrb
Labels: kind/bug · priority/critical-urgent · triage/accepted

Comments


nrb commented Dec 4, 2024

/kind bug

What steps did you take and what happened:

CI is failing with errors like this: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-aws/5211/pull-cluster-api-provider-aws-e2e-eks/1856371404925046784

It appears that the "EKS control plane with addons" test never succeeds.

Further investigation showed that the control plane was constantly blocked waiting for the CoreDNS addon update to complete.
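
While the update is stuck, the addon state can be checked with the AWS CLI (a sketch; the cluster name is a placeholder):

aws eks describe-addon \
  --cluster-name <cluster-name> \
  --addon-name coredns \
  --query 'addon.{status:status,health:health}'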

What did you expect to happen:

Tests pass.

Anything else you would like to add:

I have tried running the test locally and have found that the CoreDNS pods never get scheduled.

A sample kubectl describe output:

Name:                 coredns-787cb67946-g9qjz
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      coredns
Node:                 <none>
Labels:               eks.amazonaws.com/component=coredns
                      k8s-app=kube-dns
                      pod-template-hash=787cb67946
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/coredns-787cb67946
Containers:
  coredns:
    Image:       602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/coredns:v1.11.1-eksbuild.8
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf9wg (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  kube-api-access-kf9wg:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               <none>
Tolerations:                  CriticalAddonsOnly op=Exists
                              node-role.kubernetes.io/control-plane:NoSchedule
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  topology.kubernetes.io/zone:ScheduleAnyway when max skew 1 is exceeded for selector k8s-app=kube-dns
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  2m38s (x64 over 12m)  default-scheduler  no nodes available to schedule pods
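
For reference, the output above can be reproduced with (the pod name is from this run):

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl describe pod -n kube-system coredns-787cb67946-g9qjz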

I'm trying to get access to the EKS nodes to validate the taints defined on them, but so far haven't been able to do so.
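
If any nodes do register, their taints are visible through the API without node access, e.g.:

kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints'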

Changing the Kubernetes version to 1.29 also results in this behavior.

Environment:

  • Cluster-api-provider-aws version: main
  • Kubernetes version (`kubectl version`): v1.30
  • OS (e.g. from /etc/os-release):
k8s-ci-robot added the kind/bug, needs-priority, and needs-triage labels on Dec 4, 2024.

nrb commented Dec 4, 2024

/triage accepted
/priority critical-urgent

k8s-ci-robot added the triage/accepted and priority/critical-urgent labels and removed the needs-triage and needs-priority labels on Dec 4, 2024.

nrb commented Dec 4, 2024

/assign


nrb commented Dec 5, 2024

I created an EKS cluster through the AWS console with only the VPC CNI and CoreDNS addons, and I hit the same problem: the CoreDNS pods are never scheduled.

I enabled the scheduler log (delivered to CloudWatch) and it's looping on this:

I1205 18:44:19.381666      11 schedule_one.go:1040] "Unable to schedule pod; no nodes are registered to the cluster; waiting" pod="kube-system/coredns-787cb67946-7dd5f"

This leads me to believe that something in the networking or DNS configuration may be incorrect.
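
For reference, I turned the scheduler log on with something like the following (the cluster name is a placeholder); the entries land in the CloudWatch log group /aws/eks/<cluster-name>/cluster:

aws eks update-cluster-config \
  --name <cluster-name> \
  --logging '{"clusterLogging":[{"types":["scheduler"],"enabled":true}]}'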


nrb commented Dec 5, 2024

I've made sure my API server access is public, so I don't believe that's the issue. And throughout this process I've been able to use kubectl to interact with the cluster.

[Screenshot, 2024-12-05: AWS console showing the cluster's API server endpoint access set to Public]
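
The endpoint access settings can also be confirmed from the CLI (cluster name is a placeholder):

aws eks describe-cluster --name <cluster-name> \
  --query 'cluster.resourcesVpcConfig.{public:endpointPublicAccess,private:endpointPrivateAccess}'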

nrb linked pull request #5239 on Dec 5, 2024 that will close this issue.

nrb commented Dec 5, 2024

Creating a node myself allowed the CoreDNS pods to schedule.

I1205 19:41:33.701770      11 schedule_one.go:304] "Successfully bound pod to node" pod="kube-system/coredns-787cb67946-7dd5f" node="ip-10-0-8-110.us-west-2.compute.internal" evaluatedNodes=1 feasibleNodes=1

So I think the test is no longer valid: an EKS cluster with just a control plane and no worker nodes no longer schedules CoreDNS, even though it did in the past.
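
For anyone reproducing this, a single node is enough to unblock scheduling; a minimal sketch with a managed node group (all names, IDs, and ARNs are placeholders):

aws eks create-nodegroup \
  --cluster-name <cluster-name> \
  --nodegroup-name dns-test \
  --subnets <subnet-id> \
  --node-role <node-instance-role-arn> \
  --scaling-config minSize=1,maxSize=1,desiredSize=1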
