WIP: 🐛 Attempt to clean up CF IAM users #5242

nrb · 2024-12-06T22:24:51Z

What type of PR is this?
/kind failing-test

What this PR does / why we need it:

Periodic tests seemed to get into a failure loop because an IAM user
with the same name already existed, which is not allowed. This then
failed the entire CloudFormation stack. Despite the stack claiming to
have been rolled back, the next iteration would run into the same
problem.

This change includes IAM users in the list of resources we need to
specifically delete in the case of a CloudFormation failure, just in
case they've leaked.
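
For readers unfamiliar with why a leaked IAM user needs explicit handling, here is a minimal sketch of the cleanup idea. It is not the code in this PR. `FakeIAM` and `delete_leaked_user` are hypothetical names, and the fake client is an in-memory stand-in whose method names and response shapes mirror boto3's IAM client. The sketch only handles access keys and attached managed policies; a real user may also have inline policies, group memberships, or other attachments that block deletion, and real list calls can be paginated.

```python
from types import SimpleNamespace


class NoSuchEntityException(Exception):
    """Raised by the fake client when the user does not exist."""


class FakeIAM:
    """Hypothetical in-memory stand-in for an IAM client (illustration only)."""

    # boto3 clients expose modeled errors as client.exceptions.<Name>.
    exceptions = SimpleNamespace(NoSuchEntityException=NoSuchEntityException)

    def __init__(self, users):
        # users maps user name -> {"keys": [...], "policies": [...]}
        self.users = users

    def _user(self, name):
        try:
            return self.users[name]
        except KeyError:
            raise NoSuchEntityException(name) from None

    def list_access_keys(self, UserName):
        keys = self._user(UserName)["keys"]
        return {"AccessKeyMetadata": [{"AccessKeyId": k} for k in keys]}

    def delete_access_key(self, UserName, AccessKeyId):
        self._user(UserName)["keys"].remove(AccessKeyId)

    def list_attached_user_policies(self, UserName):
        arns = self._user(UserName)["policies"]
        return {"AttachedPolicies": [{"PolicyArn": a} for a in arns]}

    def detach_user_policy(self, UserName, PolicyArn):
        self._user(UserName)["policies"].remove(PolicyArn)

    def delete_user(self, UserName):
        user = self._user(UserName)
        # Deletion is refused while anything is still attached to the user.
        if user["keys"] or user["policies"]:
            raise RuntimeError("DeleteConflict: user still has attachments")
        del self.users[UserName]


def delete_leaked_user(iam, user_name):
    """Strip what blocks deletion, then delete the user.

    Returns True if the user was deleted, False if it did not exist.
    """
    try:
        keys = iam.list_access_keys(UserName=user_name)["AccessKeyMetadata"]
        for key in keys:
            iam.delete_access_key(UserName=user_name, AccessKeyId=key["AccessKeyId"])
        attached = iam.list_attached_user_policies(UserName=user_name)
        for policy in attached["AttachedPolicies"]:
            iam.detach_user_policy(UserName=user_name, PolicyArn=policy["PolicyArn"])
        iam.delete_user(UserName=user_name)
    except iam.exceptions.NoSuchEntityException:
        return False
    return True


if __name__ == "__main__":
    name = "bootstrapper.cluster-api-provider-aws.sigs.k8s.io"
    # The key ID and policy ARN below are made-up placeholder values.
    iam = FakeIAM({name: {"keys": ["AKIAEXAMPLE"], "policies": ["arn:aws:iam::aws:policy/Example"]}})
    print(delete_leaked_user(iam, name))
```

Returning False for a missing user keeps the cleanup idempotent, so it can run after every failed stack regardless of whether anything actually leaked.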

Special notes for your reviewer:

The periodic tests at https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-aws#periodic-e2e-release-2-7 were failing roughly every other day between Nov 23, 2024 and Dec 6, 2024.
We'd seen failures prior to that, but testgrid's history doesn't appear to go that far back.

Nearly all the failures within the capa-e2e.[SynchronizedBeforeSuite] function contained this log entry:

STEP: Event details for AWSIAMUserBootstrapper : Resource: AWS::IAM::User, Status: CREATE_FAILED, Reason: Resource handler returned message: "Resource of type 'AWS::IAM::User' with identifier 'bootstrapper.cluster-api-provider-aws.sigs.k8s.io' already exists." (RequestToken: 9149fdc5-32aa-007f-086d-d60101e23ee9, HandlerErrorCode: AlreadyExists) @ 12/05/24 15:56:04.338
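
As an aside, a log entry like the one above carries enough information to identify the leaked resource. The following is a rough sketch, assuming the event text keeps this exact shape; `EVENT_RE` and `parse_already_exists` are hypothetical names and are not part of the e2e suite.

```python
import re

# Matches the fields of an "already exists" event line like the one quoted above.
EVENT_RE = re.compile(
    r"Resource: (?P<type>[\w:]+), "
    r"Status: (?P<status>\w+), "
    r"Reason: .*?with identifier '(?P<identifier>[^']+)' already exists\."
    r".*?HandlerErrorCode: (?P<code>\w+)"
)

# Sample taken from the failing periodic job's output.
SAMPLE = (
    "STEP: Event details for AWSIAMUserBootstrapper : "
    "Resource: AWS::IAM::User, Status: CREATE_FAILED, "
    "Reason: Resource handler returned message: "
    "\"Resource of type 'AWS::IAM::User' with identifier "
    "'bootstrapper.cluster-api-provider-aws.sigs.k8s.io' already exists.\" "
    "(RequestToken: 9149fdc5-32aa-007f-086d-d60101e23ee9, "
    "HandlerErrorCode: AlreadyExists) @ 12/05/24 15:56:04.338"
)


def parse_already_exists(line):
    """Return the event fields as a dict, or None if the line is not an AlreadyExists failure."""
    match = EVENT_RE.search(line)
    if match is None or match.group("code") != "AlreadyExists":
        return None
    return match.groupdict()


if __name__ == "__main__":
    print(parse_already_exists(SAMPLE))
```

The extracted identifier is the name of the IAM user that would need to be removed before the stack can be created again.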

Checklist:

includes emojis
adds or updates e2e tests

Release note:
NONE

k8s-ci-robot · 2024-12-06T22:25:04Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from nrb. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

nrb · 2024-12-06T22:25:21Z

/test ?

k8s-ci-robot · 2024-12-06T22:25:23Z

@nrb: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-aws-build
/test pull-cluster-api-provider-aws-build-docker
/test pull-cluster-api-provider-aws-test
/test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-aws-apidiff-main
/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e-clusterclass
/test pull-cluster-api-provider-aws-e2e-conformance
/test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
/test pull-cluster-api-provider-aws-e2e-eks
/test pull-cluster-api-provider-aws-e2e-eks-gc
/test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-aws-apidiff-main
pull-cluster-api-provider-aws-build
pull-cluster-api-provider-aws-build-docker
pull-cluster-api-provider-aws-test
pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

nrb · 2024-12-06T22:29:22Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2024-12-07T09:51:55Z

/test pull-cluster-api-provider-aws-e2e

nrb · 2024-12-08T01:18:36Z

Probably needs to be rebased onto #5240

Periodic tests seemed to get into a failure loop because an IAM user with the same name already existed, which is not allowed. This then failed the entire CloudFormation stack. Despite the stack claiming to have been rolled back, the next iteration would run into the same problem.

This change includes IAM users in the list of resources we need to specifically delete in the case of a CloudFormation failure, just in case they've leaked.

Signed-off-by: Nolan Brubaker <[email protected]>

nrb · 2024-12-09T13:57:17Z

/test pull-cluster-api-provider-aws-e2e

nrb · 2024-12-09T14:34:54Z

/test pull-cluster-api-provider-aws-test

VPC limit was reached for this test.

k8s-ci-robot · 2024-12-09T15:03:50Z

@nrb: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-provider-aws-e2e | `5a34a13` | link | false | `/test pull-cluster-api-provider-aws-e2e` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

richardcase · 2024-12-12T07:46:04Z

This looks good to me. It also potentially points to a failure in aws-janitor, as that should clean up an AWS account that has a failed test.

nrb · 2024-12-12T16:21:52Z

@richardcase Yeah, I asked about the janitor on Slack. The IAM code only looks at roles and instance profiles (https://github.com/kubernetes-sigs/boskos/tree/master/aws-janitor/resources).

I suspect that multiple periodics are using CF at the same time and stepping on each other. With your account logging PR, we can double-check that in the future.

k8s-ci-robot requested review from dlipovetsky and richardcase December 6, 2024 22:25

nrb changed the title ~~🐛 Attempt to clean up CF IAM users~~ WIP: 🐛 Attempt to clean up CF IAM users Dec 6, 2024

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 6, 2024

nrb force-pushed the clean-up-cf-user branch from a3bedc8 to d77ca9d Compare December 6, 2024 22:37

nrb force-pushed the clean-up-cf-user branch from d77ca9d to 5a34a13 Compare December 9, 2024 13:57

nrb mentioned this pull request Dec 11, 2024

🌱 Bump CAPI to 1.8.6 #5249

Open

1 task

nrb mentioned this pull request Dec 13, 2024

🐛 fix(awscluster): update with secondary control plane load balancer #5248

Open

5 tasks
