ELB health check fails with Kubernetes >=v1.30.x #5139

dkoshkin · 2024-10-07T22:02:12Z

/kind bug

What steps did you take and what happened:
Follow the quickstart documentation with Kubernetes v1.30.5 and a custom built AMI (the public AMIs are missing for that version and the default v1.31.0 version).
The ELB health check fails and the cluster is stuck after creating the first control-plane instance. The AWS console shows that 0 of 1 instances are in service.

The CAPA API defaults create a Classic ELB with an SSL health check target. (HTTPS also doesn't work, but TCP does)
Starting in Go 1.22, the RSA key exchange cipher suites are disabled by default - crypto/tls: disable RSA key exchange cipher suites by default golang/go#63413
Kubernetes >=v1.30 switched to Go 1.22 kubernetes/kubernetes@ddb0b8d
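
To make "RSA ciphers" concrete: only suites whose names start with TLS_RSA_WITH_ use static RSA key exchange. Suites such as TLS_ECDHE_RSA_... use ECDHE key exchange and mention RSA only for the certificate type, so they are not affected. A minimal sketch that picks the RSA key exchange suites out of a tls-cipher-suites value follows. The helper names are hypothetical and it classifies purely by name prefix:

```python
def key_exchange(suite: str) -> str:
    """Classify a cipher suite name by the key exchange its prefix implies."""
    if suite.startswith("TLS_RSA_WITH_"):
        return "rsa"    # static RSA key exchange, disabled by default since Go 1.22
    if suite.startswith("TLS_ECDHE_"):
        return "ecdhe"  # ephemeral ECDH; RSA/ECDSA only names the certificate type
    return "other"      # e.g. TLS 1.3 suites such as TLS_AES_128_GCM_SHA256


def rsa_key_exchange_suites(flag_value: str) -> list[str]:
    """Return the static-RSA suites found in a comma-separated flag value."""
    names = [s.strip() for s in flag_value.split(",") if s.strip()]
    return [s for s in names if key_exchange(s) == "rsa"]


example = (
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,"
    "TLS_RSA_WITH_AES_128_GCM_SHA256,"
    "TLS_AES_128_GCM_SHA256"
)
print(rsa_key_exchange_suites(example))
```

If this returns an empty list for the value configured on the API server, the server offers no static-RSA suite, which per this report is the situation in which the classic ELB SSL/HTTPS health check fails.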

What did you expect to happen:
The defaults should result in a working cluster.

Anything else you would like to add:

Changing the health check to TCP in the AWS console did fix the check, but this update is not allowed by a webhook here and even after removing the webhook, the new value from AWSCluster never got updated.
Setting the following on the apiserver and other control-plane components allowed the ELB health check to pass:

tls-cipher-suites: ...,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_3DES_EDE_CBC_SHA

Using an NLB loadbalancer works

  controlPlaneLoadBalancer:
    loadBalancerType: nlb

Some discussion about this in the Kubernetes Slack: https://kubernetes.slack.com/archives/C3QUFP0QM/p1726622974749509

Environment:

Cluster-api-provider-aws version: 2.6.1
Kubernetes version: (use kubectl version): v1.30.5
OS (e.g. from /etc/os-release):

AndiDog · 2024-10-10T11:03:02Z

/triage accepted
/priority important-soon

richardcase · 2024-10-14T13:40:06Z

I'm running into this as well.

richardcase · 2024-10-15T09:16:58Z

/priority critical-urgent
/milestone v2.7.0

richardcase · 2024-10-15T09:19:05Z

This was discussed at the office hours 14th October 2024. The summary is that:

We should update the templates, docs, and release notes to show how to explicitly specify the TLS cipher suites for the Kubernetes components. We should also recommend that new clusters consider using NLB instead of a classic ELB.
In the future with an API version bump we could consider making nlb the default.

richardcase · 2024-10-15T10:18:39Z

/help

k8s-ci-robot · 2024-10-15T10:18:42Z

@richardcase:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

richardcase · 2024-10-28T15:10:50Z

/milestone v2.8.0

richardcase · 2024-10-29T14:10:45Z

/milestone v2.7.2

richardcase · 2024-10-29T14:13:03Z

I tried setting the tls cipher suites but that didn't work:

https://gist.github.com/richardcase/47118a404bc832904c399ba1360462f2

dkoshkin · 2024-10-29T16:42:27Z

@richardcase I wasn't able to just apply your spec directly because of some IAM issues, but was able to create by explicitly setting this public AMI:

spec:
  ami:
    id: ami-0797a44e8719c5e53

AWSCluster:

  controlPlaneLoadBalancer:
    crossZoneLoadBalancing: false
    healthCheckProtocol: HTTPS
    loadBalancerType: classic
    scheme: internet-facing

and KCP:

    initConfiguration:
      localAPIEndpoint: {}
      nodeRegistration:
        imagePullPolicy: IfNotPresent
        kubeletExtraArgs:
          cloud-provider: external
          tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_3DES_EDE_CBC_SHA
        name: '{{ ds.meta_data.local_hostname }}'
    joinConfiguration:
      discovery: {}
      nodeRegistration:
        imagePullPolicy: IfNotPresent
        kubeletExtraArgs:
          cloud-provider: external
          tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_3DES_EDE_CBC_SHA
        name: '{{ ds.meta_data.local_hostname }}'

AndiDog · 2024-11-13T17:09:20Z

I got it working with this additional argument in the template (I'm using cluster-template-flatcar-machinepool.yaml but that shouldn't matter):

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: "${CLUSTER_NAME}-control-plane"
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          cloud-provider: external

          # This is needed for Kubernetes v1.30+ since else it uses the Go defaults which don't
          # work with AWS classic load balancers, see
          # https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5139. If you use
          # another load balancer type such as NLB, this is not needed.
          #
          # The list consists of the secure ciphers from Go 1.23.3, plus some less secure
          # RSA ciphers which the AWS classic load balancer instance health check supports.
          tls-cipher-suites: TLS_AES_128_GCM_SHA256,TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA
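
The flag value above is a base list of suites with the RSA key exchange suites appended. A minimal sketch of how such a value could be assembled without duplicates is shown here. The function name is hypothetical, the base list is shortened to three entries for brevity, and the four RSA suites are copied from the template above rather than from any AWS documentation:

```python
def merge_cipher_suites(base, extra):
    """Join two suite lists into one flag value, keeping order and dropping duplicates."""
    seen = set()
    merged = []
    for name in list(base) + list(extra):
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return ",".join(merged)


# Shortened base list for illustration; the template above lists more suites.
BASE = [
    "TLS_AES_128_GCM_SHA256",
    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
]

# RSA key exchange suites taken from the template above.
CLASSIC_ELB_RSA = [
    "TLS_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_RSA_WITH_AES_128_CBC_SHA",
    "TLS_RSA_WITH_AES_256_CBC_SHA",
]

print(merge_cipher_suites(BASE, CLASSIC_ELB_RSA))
```

Keeping the two lists separate makes it easy to drop the RSA suites again once the cluster no longer sits behind a classic ELB.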

Do we really want to hardcode these less secure settings in the template? That makes it very likely that users will copy them blindly. I'm rather thinking of other options:

Explicitly set NLB as load balancer type in templates, making it easier for users to make the switch later once we make breaking changes to the API specs. However, these template changes should also go into a new 2.x release, not a patch release. Document what needs to be done for ELBs. This gives some chance of silent breakage for users (different LB type than expected; what happens if they apply the changed manifest on existing clusters...).
Only create a documentation page about ELBs. This makes it hard for users to figure out what's wrong.
(My preference:) Perform the correction in CAPA code with an if block. This is the easiest way to make the defaults work immediately without changes by the users, and it's also easy to rip out later if AWS improves TLS cipher support in ELB instance health checks.
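
The third option could look roughly like the following. This is only a language-neutral illustration of the decision logic written in Python; the real controller is Go, and every name here is hypothetical. It assumes that an unset load balancer type means classic (as the issue description says of the API defaults) and that a user-supplied tls-cipher-suites value must never be overwritten:

```python
def parse_major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as 'v1.30.5'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def needs_rsa_cipher_workaround(kubernetes_version: str, load_balancer_type: str) -> bool:
    """True when a classic ELB fronts a Kubernetes v1.30+ API server."""
    is_classic = load_balancer_type in ("", "classic")  # assumption: unset means classic
    return is_classic and parse_major_minor(kubernetes_version) >= (1, 30)


def default_api_server_args(extra_args, kubernetes_version, load_balancer_type, workaround_value):
    """Return a copy of extra_args with tls-cipher-suites defaulted when required."""
    args = dict(extra_args)
    if "tls-cipher-suites" not in args and needs_rsa_cipher_workaround(
        kubernetes_version, load_balancer_type
    ):
        args["tls-cipher-suites"] = workaround_value
    return args


print(default_api_server_args({"cloud-provider": "external"}, "v1.30.5", "classic", "<suite list>"))
```

Because the whole workaround sits behind a single condition, removing it later would be a small change.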

AndiDog · 2024-11-25T21:56:37Z

In the office hours we discussed switching the default load balancer type to NLB.

Next steps:

Test what happens when the LB type changes (downtime?). The webhook code currently doesn't forbid a change of that field.
Test use of secondaryControlPlaneLoadBalancer or other alternatives to do a migration


AverageMarcus mentioned this issue Oct 15, 2024: CAPA: Kubernetes v30 giantswarm/roadmap#3644 (open, 5 tasks)

This was referenced Nov 14, 2024:

Update CAPI / CAPZ / CAPA controller to support K8s release v1.30.x giantswarm/roadmap#3661 (open)
Remove outdated ciphers from API server flags giantswarm/roadmap#3766 (closed)
