eks-prow-build-cluster: Reconsider instance type selection #5066

Open
Tracked by #5167
tzneal opened this issue Mar 30, 2023 · 20 comments
Labels: kind/cleanup · lifecycle/frozen · sig/k8s-infra · sig/testing
Milestone: v1.33

Comments

tzneal commented Mar 30, 2023

What should be cleaned up or changed:

Some changes were made to the EKS cluster in an attempt to resolve test flakes. These changes also increased the per-node cost, so we should consider reverting them to reduce cost:

a) Changing to an instance type without instance storage

b) Changing back to an AMD CPU type

c) Changing to a roughly 8 CPU / 64 GB type to more closely match the existing GCP cluster nodes

The cluster currently uses r5d.4xlarge nodes (16 CPU / 128 GB) with an on-demand cost of $1.152 per hour.

An r5a.4xlarge (16 CPU / 128 GB) has an on-demand cost of $0.904 per hour.

An r5a.2xlarge (8 CPU / 64 GB) has an on-demand cost of $0.45 per hour.
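
For rough scale, a quick sketch of what those hourly prices mean per node per month, assuming ~730 on-demand hours a month and ignoring spot or reserved pricing (both assumptions, not figures from this issue):

  // cost_sketch.go — rough per-node monthly cost from the on-demand prices
  // quoted above; ~730 hours/month is an assumption, spot/reserved pricing ignored.
  package main

  import "fmt"

  func main() {
      hoursPerMonth := 730.0
      prices := []struct {
          name   string
          hourly float64
      }{
          {"r5d.4xlarge", 1.152},
          {"r5a.4xlarge", 0.904},
          {"r5a.2xlarge", 0.450},
      }
      for _, p := range prices {
          fmt.Printf("%-12s $%.3f/hr  ≈ $%.0f per node-month\n",
              p.name, p.hourly, p.hourly*hoursPerMonth)
      }
  }

That works out to roughly $840, $660, and $330 per node-month respectively, so option (c) would save on the order of $500 per node-month at on-demand rates.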

Provide any links for context:

@tzneal tzneal added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Mar 30, 2023
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 30, 2023

tzneal commented Mar 30, 2023

/sig k8s-infra

@k8s-ci-robot k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 30, 2023
@BenTheElder BenTheElder added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Mar 31, 2023

xmudrii commented Apr 2, 2023

I'm going to transfer this issue to k/k8s.io as other issues related to this cluster are already there.
/transfer-issue k8s.io

@k8s-ci-robot k8s-ci-robot transferred this issue from kubernetes/test-infra Apr 2, 2023

xmudrii commented Apr 2, 2023

/assign @xmudrii @pkprzekwas

@BenTheElder

One thing to consider: because Kubernetes doesn't have I/O / IOPS isolation, sizing really large nodes changes the CPU-to-I/O ratio (and this won't be 1:1 between GCP and AWS anyhow). So while really large nodes allow high-core-count jobs or bin-packing more jobs per node, the latter can cause issues by over-packing for I/O throughput.

This is less of an issue today than when we ran Bazel builds widely, but it's still something that can cause performance issues. The existing size is semi-arbitrary and may be somewhat GCP-specific, but right now tests that are likely to be I/O heavy sometimes reserve that I/O by reserving ~all of the CPU at our current node sizes.
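
To make the ratio concern concrete, a toy sketch (the throughput figure is invented, not a measurement from the build cluster): with no I/O isolation, every extra job packed onto a node shrinks each job's effective share of that node's disk throughput.

  // Illustrative only: how bin-packing more jobs onto one node divides its
  // (un-isolated) disk throughput. nodeDiskMBps is a made-up figure.
  package main

  import "fmt"

  func main() {
      nodeCPUs := 16.0
      nodeDiskMBps := 1000.0 // hypothetical sustained throughput of a node's local SSD

      for _, cpusPerJob := range []float64{16, 8, 4, 2} {
          jobsPerNode := nodeCPUs / cpusPerJob
          fmt.Printf("%2.0f-CPU jobs: %2.0f per node, ~%4.0f MB/s of disk each\n",
              cpusPerJob, jobsPerNode, nodeDiskMBps/jobsPerNode)
      }
  }

This is why jobs that reserve ~all of a node's CPU effectively reserve its I/O as well.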

xmudrii commented Apr 2, 2023

xref #4686

xmudrii commented Apr 2, 2023

To add to what @BenTheElder said: we already had issues with GOMAXPROCS for unit tests. We have "migrated" 5 jobs so far and one was affected (potentially one more). To avoid such issues, we might want to use instances close to what we have on GCP. We can't have a 1:1 mapping, but we can try using similar instances based on what AWS offers.

Not having to deal with things such as GOMAXPROCS is going to make the migration smoother, and we'll avoid spending a lot of time debugging such issues.
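
For context on why this bites: at the time of this discussion, the Go runtime sizes GOMAXPROCS from the machine's CPU count (or the GOMAXPROCS environment variable), not from the pod's cgroup CPU limit, so a test pod limited to a few CPUs on a 16-vCPU node still defaults to 16-way parallelism. A minimal sketch (the pod CPU limit below is hypothetical, not taken from any real job config):

  package main

  import (
      "fmt"
      "runtime"
  )

  func main() {
      // Defaults to the node's vCPU count (or $GOMAXPROCS if set), not to
      // the pod's cgroup CPU request/limit.
      fmt.Println("NumCPU:            ", runtime.NumCPU())
      fmt.Println("GOMAXPROCS default:", runtime.GOMAXPROCS(0))

      // Workaround until the runtime detects the cgroup limit itself: export
      // GOMAXPROCS in the job environment, or pin it in code.
      podCPULimit := 4 // hypothetical pod CPU limit
      runtime.GOMAXPROCS(podCPULimit)
      fmt.Println("GOMAXPROCS pinned: ", runtime.GOMAXPROCS(0))
  }

Setting GOMAXPROCS to match the pod's CPU allocation keeps test parallelism the same regardless of how large the underlying node is.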

dims commented Apr 2, 2023

xmudrii commented Apr 2, 2023

@dims Thanks for driving this forward. Just to note, though: this fixes it only for k/k; other subprojects might be affected and would need to apply a similar patch.

@BenTheElder

Go is expected to solve GOMAXPROCS upstream (detecting this in the stdlib has been accepted), and GOMAXPROCS can also be set in CI in the meantime. As-is, jobs already have this wrong, and we should resolve that independently of selecting node size.

tzneal commented Apr 3, 2023

> As-is, jobs already have this wrong, and we should resolve that independently of selecting node size.

+1 for setting this on existing jobs. I have a secret hope that it might generally reduce flakiness a bit.

@TerryHowe

Maybe try a general-purpose node like an m5.2xlarge or m6g.2xlarge?

xmudrii commented Apr 3, 2023

@TerryHowe We need to use memory-optimized instances because our jobs tend to use a lot of memory.

xmudrii commented Apr 3, 2023

Update: we decided to go with a three-step phased approach:

  • Switch from r5d.4xlarge to r6id.2xlarge (this instance size should be very close to what we have on GCP)
  • Switch from r6id.2xlarge to r6i.2xlarge (i.e. switch from local NVMe SSDs to EBS)
  • Switch from r6i.2xlarge to r6a.2xlarge (i.e. switch to AMD CPUs)

Note: the order of the phases might change.

Each phase should last at least 24 hours to ensure that tests are stable. I just started the first phase and I think we should leave it on until Wednesday morning CEST.

xmudrii commented Apr 3, 2023

Update: we tried r6id.2xlarge but it seems that 8 vCPUs are not enough:

  Type     Reason             Age   From                Message
  ----     ------             ----  ----                -------
  Warning  FailedScheduling   44s   default-scheduler   0/20 nodes are available: 20 Insufficient cpu. preemption: 0/20 nodes are available: 20 No preemption victims found for incoming pod.
  Normal   NotTriggerScaleUp  38s   cluster-autoscaler  pod didn't trigger scale-up: 1 Insufficient cpu

I'm trying r5ad.4xlarge instead.

xmudrii commented Apr 25, 2023

/retitle eks-prow-build-cluster: Reconsider instance type selection

@k8s-ci-robot k8s-ci-robot changed the title Consider instance type selection on the EKS Cluster eks-prow-build-cluster: Reconsider instance type selection Apr 25, 2023

ameukam commented Nov 15, 2023

@xmudrii are we still doing this? Do we want to use an instance type with fewer resources?

xmudrii commented Nov 15, 2023

@ameukam I would still like to look into this, but we'd most likely need to adopt Karpenter to be able to do it (#5168).
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Nov 15, 2023

xmudrii commented Feb 12, 2024

Blocked by #5168
/unassign @xmudrii @pkprzekwas

ameukam commented Dec 6, 2024

@xmudrii @koksay I think it's unblocked now. Do we still need to reconsider the instance types?

xmudrii commented Dec 6, 2024

@ameukam Yes, let's figure this out after the 1.32 release.
/milestone v1.33

@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Dec 6, 2024