E2E test images: httpd images failed to push to staging #20884

Closed
claudiubelu opened this issue Feb 17, 2021 · 46 comments · Fixed by kubernetes/kubernetes#99609

@claudiubelu
Contributor

claudiubelu commented Feb 17, 2021

What happened:

The Image Builder postsubmit jobs post-kubernetes-push-e2e-httpd-test-images and post-kubernetes-push-e2e-new-httpd-test-images are failing with a 401 Unauthorized error while trying to push to gcr.io/k8s-staging-e2e-test-images.

What you expected to happen:

It should have been able to push the images.

How to reproduce it (as minimally and precisely as possible):

Rerun the jobs.

Please provide links to example occurrences, if any:

[1] https://testgrid.k8s.io/sig-testing-images#post-kubernetes-push-e2e-httpd-test-images
[2] https://testgrid.k8s.io/sig-testing-images#post-kubernetes-push-e2e-httpd-new-test-images
[3] https://testgrid.k8s.io/sig-testing-images#kubernetes-e2e-windows-servercore-cache

Anything else we need to know?:

Worth noting that the job passed on 2021.02.09 but failed on 2021.02.15. The Prow job config is fine: re-running the k8s-staging-e2e-test-images.sh script that generated the job reveals no diff.

Additionally, on 2021.02.11 the kubernetes-e2e-windows-servercore-cache job passed [3]; that job is defined similarly to the other two.

@claudiubelu claudiubelu added the kind/bug label Feb 17, 2021
@claudiubelu
Contributor Author

/assign @spiffxp

@spiffxp
Member

spiffxp commented Feb 18, 2021

In case the problem is with the k8s-staging-e2e-test-images project itself, I looked at other jobs that push to it.

No jobs other than the httpd ones have run since agnhost last passed, so it's hard to tell whether this is isolated to "the httpd jobs" or affects "all jobs pushing to this project".
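A quick way to see what has actually run in the staging project is to list its recent GCB builds (a sketch; assumes read access to the project):

# List the most recent Cloud Build jobs in the staging project to see which
# image builds have run and how they ended.
gcloud builds list \
  --project=k8s-staging-e2e-test-images \
  --limit=20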

@spiffxp
Member

spiffxp commented Feb 18, 2021

Nothing merged to https://github.com/kubernetes/test-infra/commits/master/config/jobs/image-pushing between the last pass and the first fail

@spiffxp
Member

spiffxp commented Feb 18, 2021

Things merged to https://github.com/kubernetes/k8s.io that could have changed something:

@spiffxp
Member

spiffxp commented Feb 18, 2021

I saw cluster-api-aws recently push an image, so I'm using it as a known-good reference:

bad="e2e-test-images"
good="cluster-api-aws"

for s in "${bad}" "${good}"; do 
  p="k8s-staging-${s}"
  output="iam-${s}.txt"
  (
    for b in $(gsutil ls -p "${p}"); do
      gsutil iam get "${b}" | jq 'del(.etag)'
    done
    gcloud --format=json projects get-iam-policy "${p}" | jq 'del(.etag)'
  ) > "${output}"
  num=$(gcloud projects describe k8s-infra-prow-build "--format=value(projectNumber)")
  sed -i.bak -e "s/${s}/subproject/g;s/${num}/123456789/g" "${output}"
  rm -f "${output}.bak"
done

diff -y -W 100 iam-{${bad},${good}}.txt

No diff in the GCS bucket IAM; the e2e project appears to have additional project-level role bindings, but how would extra bindings restrict access?

{						    {
      "members": [				      "members": [
        "serviceAccount:service-456067983721@ |	        "serviceAccount:service-433651898792@
      ],				      <
      "role": "roles/containerregistry.Servic <
    },					      <
    {					      <
      "members": [			      <
        "serviceAccount:service-456067983721@ <
      ],					      ],
      "role": "roles/containerscanning.Servic	      "role": "roles/containerscanning.Servic
    },						    },
    {						    {
      "members": [				      "members": [
        "serviceAccount:456067983721-compute@ |	        "serviceAccount:433651898792-compute@
        "serviceAccount:456067983721@cloudser |	        "serviceAccount:433651898792@cloudser
        "serviceAccount:service-456067983721@ |	        "serviceAccount:service-433651898792@
      ],					      ],
      "role": "roles/editor"			      "role": "roles/editor"
    },						    },
    {						    {
      "members": [				      "members": [
        "user:[email protected]"		        "user:[email protected]"
      ],					      ],
      "role": "roles/owner"			      "role": "roles/owner"
    },						    },
    {						    {
      "members": [				      "members": [
        "serviceAccount:456067983721@cloudbui <
      ],				      <
      "role": "roles/secretmanager.secretAcce <
    },					      <
    {					      <
      "members": [			      <
        "group:k8s-infra-staging-subproject@k	        "group:k8s-infra-staging-subproject@k
      ],					      ],
      "role": "roles/serviceusage.serviceUsag	      "role": "roles/serviceusage.serviceUsag
    },						    },

@spiffxp
Member

spiffxp commented Feb 18, 2021

@spiffxp
Member

spiffxp commented Feb 18, 2021

that worked, which leads me to believe a change introduced by kubernetes/kubernetes#99030 is the culprit

@claudiubelu
Contributor Author

that worked, which leads me to believe a change introduced by kubernetes/kubernetes#99030 is the culprit

I wonder how. The changes in that PR affect Windows, since they are made to the Dockerfile_windows files. The only changes outside of those files are the VERSION files, which changed 2.4.38-alpine to 2.4.38-1. Note that the failure occurs before getting to the Windows images: it fails on the linux/amd64 image, which is the first one in the list.

@spiffxp
Member

spiffxp commented Feb 18, 2021

I can't find a reference for this, but I'm wondering if the Docker registry API doesn't like that tag format?

@spiffxp
Member

spiffxp commented Feb 18, 2021

https://docs.docker.com/engine/reference/commandline/tag/#extended-description

A tag name must be valid ASCII and may contain lowercase and uppercase letters, digits, underscores, periods and dashes. A tag name may not start with a period or a dash and may contain a maximum of 128 characters.

which makes it sound like the new tag would be valid
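A quick local sanity check of the new tag against that rule (the regex here is my own transcription of the documented constraints, not something taken from registry code):

# Hypothetical check: first character must not be a period or dash,
# total length at most 128 characters.
tag="2.4.38-1"
if [[ "${tag}" =~ ^[A-Za-z0-9_][A-Za-z0-9._-]{0,127}$ ]]; then
  echo "tag '${tag}' looks valid"
else
  echo "tag '${tag}' looks invalid"
fi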

@claudiubelu
Contributor Author

I was able to build and push claudiubelu/httpd:2.4.38-1, which includes claudiubelu/httpd:2.4.38-1-linux-amd64: https://paste.ubuntu.com/p/G5ydg43RZm/ (the logs from the first images were cut)
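For reference, the shape of such a manual build-and-push is roughly the following (repository name and build arg are illustrative; the real builds go through test/images/image-util.sh, which supplies BASEIMAGE and friends):

# Rough sketch of a local buildx build-and-push of just the linux/amd64 image.
docker buildx build \
  --platform linux/amd64 \
  --build-arg BASEIMAGE=httpd:2.4.38-alpine \
  -t claudiubelu/httpd:2.4.38-1-linux-amd64 \
  --push \
  test/images/httpd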

@spiffxp
Member

spiffxp commented Feb 18, 2021

Going to try messing with the source file contents and submitting builds manually

# get the source
mkdir -p gcb && cd gcb
gsutil cp gs://k8s-staging-e2e-test-images-gcb/source/1612863662.49-29f09d2c41c5417f952f03585182c7aa.tgz .
tar xvzf 1612863662.49-29f09d2c41c5417f952f03585182c7aa.tgz && rm 1612863662.49-29f09d2c41c5417f952f03585182c7aa.tgz
# make a change
vi test/images/httpd/VERSION
# try running a build with the change
tar -czf ../spiffxp-http-gcb.tgz *
gsutil cp ../spiffxp-http-gcb.tgz gs://k8s-staging-e2e-test-images-gcb/source/
gcloud builds submit \
  --verbosity debug \
  --config /Users/spiffxp/w/kubernetes/kubernetes/test/images/cloudbuild.yaml \
  --substitutions _PULL_BASE_REF=master,_WHAT=httpd,_GIT_TAG=v20210218-v1.21.0-alpha.3-197-g9e5fcc49ec5 \
  --project k8s-staging-e2e-test-images \
  --gcs-log-dir gs://k8s-staging-e2e-test-images-gcb/logs \
  --gcs-source-staging-dir gs://k8s-staging-e2e-test-images-gcb/source \
  gs://k8s-staging-e2e-test-images-gcb/source/spiffxp-http-gcb.tgz

@spiffxp
Member

spiffxp commented Feb 18, 2021

Same result (same SHA as before)

#5 pushing manifest for gcr.io/k8s-staging-e2e-test-images/httpd:2.4.38-monkeys-linux-amd64
#5 pushing manifest for gcr.io/k8s-staging-e2e-test-images/httpd:2.4.38-monkeys-linux-amd64 0.3s done
#5 ERROR: failed commit on ref "manifest-sha256:ec099bd94d243bcfba864acbeaf87ab38e6b4a67c25d8508103e0ad96446d84c": unexpected status: 401 Unauthorized

@spiffxp
Member

spiffxp commented Feb 18, 2021

trying a different image

vi test/images/nginx/VERSION #
# submit with _WHAT=nginx

yields a similar error

#5 pushing layers 0.0s done
#5 pushing manifest for gcr.io/k8s-staging-e2e-test-images/nginx:1.14-monkeys-linux-amd64
#5 pushing manifest for gcr.io/k8s-staging-e2e-test-images/nginx:1.14-monkeys-linux-amd64 0.3s done
#5 ERROR: failed commit on ref "manifest-sha256:a2d0ea7d3550b0853d04263025e6cfcc353f3e102fe725d19b9fc51282603f02": unexpected status: 401 Unauthorized

@spiffxp
Member

spiffxp commented Feb 18, 2021

Using source file from "good" build (gs://k8s-staging-e2e-test-images-gcb/source/1612863662.49-29f09d2c41c5417f952f03585182c7aa.tgz)

Manually submitting with no changes works

Changing to push nginx with new version

vi test/images/nginx/VERSION
# submit with _WHAT=nginx

yields a similar error

#5 exporting to image
#5 exporting layers done
#5 exporting manifest sha256:a2d0ea7d3550b0853d04263025e6cfcc353f3e102fe725d19b9fc51282603f02 0.0s done
#5 exporting config sha256:f4ac389d78cd8be151962a9fc7227b3b23862d9040eeb0686158a3229da60022 0.0s done
#5 pushing layers 0.1s done
#5 pushing manifest for gcr.io/k8s-staging-e2e-test-images/nginx:1.14-monkeys-linux-amd64
#5 pushing manifest for gcr.io/k8s-staging-e2e-test-images/nginx:1.14-monkeys-linux-amd64 0.3s done
#5 ERROR: failed commit on ref "manifest-sha256:a2d0ea7d3550b0853d04263025e6cfcc353f3e102fe725d19b9fc51282603f02": unexpected status: 401 Unauthorized
------
 > exporting to image:
------
failed to solve: rpc error: code = Unknown desc = failed commit on ref "manifest-sha256:a2d0ea7d3550b0853d04263025e6cfcc353f3e102fe725d19b9fc51282603f02": unexpected status: 401 Unauthorized
make: *** [Makefile:43: all-build-and-push] Error 1

so... new tag, unchanged manifest = error... is this working as intended, or should we be allowing this?
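One way to separate buildkit from the registry itself would be to re-push an existing image under a new tag with the plain docker pusher (a sketch; the tags are illustrative and this assumes push access to the staging registry):

# If this succeeds, the registry accepts a new tag for an already-present
# manifest, which would point at buildkit's push path rather than GCR.
src="gcr.io/k8s-staging-e2e-test-images/nginx:1.14-linux-amd64"
dst="gcr.io/k8s-staging-e2e-test-images/nginx:1.14-monkeys-linux-amd64"
docker pull "${src}"
docker tag "${src}" "${dst}"
docker push "${dst}"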

@spiffxp
Member

spiffxp commented Feb 18, 2021

https://cloud.google.com/container-registry/docs/access-control#permissions_and_roles

My guess now is that update or delete permissions may be missing. If someone wants to take a look through https://github.com/kubernetes/k8s.io/blob/main/infra/gcp/ensure-staging-storage.sh and the files it sources, a PR would be appreciated.

@spiffxp
Member

spiffxp commented Feb 19, 2021

/wg k8s-infra
/sig testing
/area prow
/priority critical-urgent
/milestone v1.21

@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 19, 2021
@k8s-ci-robot k8s-ci-robot added sig/testing, area/prow, and priority/critical-urgent labels Feb 19, 2021
@spiffxp
Member

spiffxp commented Feb 22, 2021

Per https://cloud.google.com/container-registry/docs/gcr-service-account, the GCR service account for staging projects today may be overprivileged:

  • service-{project_id}@containerregistry.iam.gserviceaccount.com should have role roles/containerregistry.ServiceAgent
  • service-{project_id}@containerregistry.iam.gserviceaccount.com should NOT have role roles/editor

Per https://cloud.google.com/container-registry/docs/access-control#permissions_and_roles, the following are required for "Push (write) and pull (read) images for existing registry hosts in a project"

  • storage.objects.create
  • storage.objects.delete
  • storage.objects.get
  • storage.objects.list
  • storage.objects.update

The bucket permissions are for "Add a registry to a project by pushing the first image to the registry host", which I think the k8s-infra staging project script does.

I used https://cloud.google.com/iam/docs/troubleshooting-access to gather permissions for different accounts against the gcs bucket backing gcr.io/k8s-staging-e2e-test-images (ref: https://gist.github.com/spiffxp/0110b663a229ebda1e805d92b23e646b)
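The individual checks look roughly like this (a sketch; I'm assuming the beta Policy Troubleshooter gcloud command available at the time, and the account e-mail is a placeholder since the real ones are redacted above):

# Ask the Policy Troubleshooter whether a given account holds a given
# permission on the GCS bucket backing the gcr.io registry.
principal="SERVICE_ACCOUNT_EMAIL"  # placeholder
gcloud beta policy-troubleshoot iam \
  "//storage.googleapis.com/projects/_/buckets/artifacts.k8s-staging-e2e-test-images.appspot.com" \
  --principal-email="${principal}" \
  --permission="storage.objects.update"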

permission email access
storage.buckets.delete [email protected] UNKNOWN_INFO_DENIED
storage.buckets.get [email protected] GRANTED
storage.buckets.update [email protected] UNKNOWN_INFO_DENIED
storage.objects.create [email protected] GRANTED
storage.objects.delete [email protected] GRANTED
storage.objects.get [email protected] GRANTED
storage.objects.list [email protected] GRANTED
storage.objects.update [email protected] GRANTED
storage.buckets.delete [email protected] UNKNOWN_INFO_DENIED
storage.buckets.get [email protected] GRANTED
storage.buckets.update [email protected] UNKNOWN_INFO_DENIED
storage.objects.create [email protected] GRANTED
storage.objects.delete [email protected] GRANTED
storage.objects.get [email protected] GRANTED
storage.objects.list [email protected] GRANTED
storage.objects.update [email protected] GRANTED
storage.buckets.delete [email protected] GRANTED
storage.buckets.get [email protected] GRANTED
storage.buckets.update [email protected] GRANTED
storage.objects.create [email protected] GRANTED
storage.objects.delete [email protected] GRANTED
storage.objects.get [email protected] GRANTED
storage.objects.list [email protected] GRANTED
storage.objects.update [email protected] UNKNOWN_INFO_DENIED

Going to look at these in order:

  • Is the GCR service account setup correctly?
    • [email protected] is GCR's service account that writes to GCS on GCR's behalf
    • it has the project bindings roles/containerregistry.ServiceAgent (OK) and roles/editor (it shouldn't)
    • because of roles/editor it has roles/storage.legacyBucketOwner for the gcr.io bucket (among others)
    • it is missing storage.objects.update for the gcr.io bucket (so neither of these roles grants it)... should it have it?
  • Is the GCB service account setup correctly?
    • [email protected] is what the GCB job runs as
    • it has roles/cloudbuild.builds.builder for the project
    • this gives it storage.objects.update on the GCS bucket, so it should be fine to write to gcr.io?
    • it lacks some storage.buckets.* permissions; does it need them? (the project-level bindings for both accounts can be double-checked with the sketch below)
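A sketch for double-checking the project-level bindings called out above (the service-account e-mail is a placeholder for the redacted names):

# Dump the project-level role bindings that mention a given service account.
sa="SERVICE_ACCOUNT_EMAIL"  # placeholder
gcloud projects get-iam-policy k8s-staging-e2e-test-images \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:${sa}" \
  --format="table(bindings.role)"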

@spiffxp
Member

spiffxp commented Feb 22, 2021

Give GCR's SA roles/storage.objectAdmin directly on the gcr.io bucket

gsutil iam ch serviceAccount:[email protected]:objectAdmin gs://artifacts.k8s-staging-e2e-test-images.appspot.com

Run same build as #20884 (comment), same result

So removing that

gsutil iam ch -d serviceAccount:[email protected]:objectAdmin gs://artifacts.k8s-staging-e2e-test-images.appspot.com

@spiffxp
Member

spiffxp commented Feb 23, 2021

Giving GCB's SA roles/storage.legacyBucketOwner directly on the gcr.io bucket to give it storage.buckets.update

gsutil iam ch serviceAccount:[email protected]:legacyBucketOwner gs://artifacts.k8s-staging-e2e-test-images.appspot.com

The only other storage role with storage.buckets.update is roles/storage.admin, which also has storage.buckets.delete (which I'm trying to avoid)
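One way to verify that claim is to list the permissions carried by the candidate roles (a sketch):

# Show which storage.buckets.* permissions each candidate role carries.
for role in roles/storage.legacyBucketOwner roles/storage.admin; do
  echo "== ${role}"
  gcloud iam roles describe "${role}" \
    --format="value(includedPermissions)" | tr ';' '\n' | grep 'storage.buckets.'
done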

Run same build as #20884 (comment), same result

@spiffxp
Member

spiffxp commented May 7, 2021

/remove-priority critical-urgent
/priority important-longterm
/milestone v1.22
I think this is still relevant, but I don't think we urgently need to do anything to solve this anymore, correct @claudiubelu?

If we don't need to solve it ourselves, I'd rather wait for the fix to come to us.

AFAICT the latest release of moby/buildkit still lacks the fix, as does docker buildx

But I suspect it's coming soon:

The alternative would be someone taking on the work of using unreleased code.

@k8s-ci-robot k8s-ci-robot added priority/important-longterm and removed priority/critical-urgent labels May 7, 2021
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.21, v1.22 May 7, 2021
@spiffxp
Member

spiffxp commented May 7, 2021

I'm kicking this off the wg-k8s-infra project board since it seems like no action is required on our part

@spiffxp
Member

spiffxp commented May 7, 2021

/remove-area prow
/area artifacts
This also doesn't specifically have to do with prow. I'm not entirely sure it even belongs in this repo, but I'll leave it here tagged with area/artifacts for now, since the issue is with pushing multi-arch artifacts to k8s.gcr.io that are built with a broken docker buildx

@k8s-ci-robot k8s-ci-robot added area/artifacts and removed area/prow labels May 7, 2021
@spiffxp
Member

spiffxp commented May 7, 2021

/sig release
/area release-eng

@k8s-ci-robot k8s-ci-robot added sig/release and area/release-eng labels May 7, 2021
@spiffxp
Member

spiffxp commented May 7, 2021

/area images
The actual dependency will need to come in through the image used by https://github.com/kubernetes/kubernetes/blob/master/test/images/cloudbuild.yaml

So like a PR to update this:

# Install docker Buildx for fast docker builds
ENV HOME=/root
RUN mkdir -p $HOME/.docker/cli-plugins \
&& curl -fsSL "https://github.com/docker/buildx/releases/download/v0.4.1/buildx-v0.4.1.linux-amd64" --output $HOME/.docker/cli-plugins/docker-buildx \
&& chmod a+x $HOME/.docker/cli-plugins/docker-buildx

And a subsequent PR to kubernetes/kubernetes to update cloudbuild.yaml
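For illustration, the only part of that RUN step that should need to change is the pinned release (vX.Y.Z is a placeholder for whichever buildx release ends up containing the fix):

# Illustrative only; not a specific version recommendation.
curl -fsSL "https://github.com/docker/buildx/releases/download/vX.Y.Z/buildx-vX.Y.Z.linux-amd64" \
  --output $HOME/.docker/cli-plugins/docker-buildx \
&& chmod a+x $HOME/.docker/cli-plugins/docker-buildx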

@spiffxp
Member

spiffxp commented Jul 22, 2021

Opened #22977 to take a crack at bumping buildx to the latest release to see if this fixes this issue

@claudiubelu
Contributor Author

Opened #22977 to take a crack at bumping buildx to the latest release to see if this fixes this issue

Has this failed since then? I thought we solved this a few months ago. The issue was that the same SHA couldn't be pushed again, so we added a label to the images' Dockerfiles. I haven't seen it fail since.
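For context, the reason that workaround helps is that adding a label changes the image config, and therefore the manifest digest, so the registry no longer sees an identical manifest being re-pushed under a new tag. A toy illustration (image names are made up):

# Building the same context with an extra label yields a different image ID
# (config digest), so the pushed manifest digest differs as well.
docker build -t example/httpd:a .
docker build -t example/httpd:b --label rebuild-marker=b .
docker inspect --format '{{.Id}}' example/httpd:a example/httpd:b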

@spiffxp
Member

spiffxp commented Jul 26, 2021

The issue was that the same SHA couldn't be pushed again

This is what I'm trying to prove is fixed. We should never have had to do the workaround in the first place, but some part of buildx or its dependencies was written in a way that didn't work with GCR (among other container registries)

@spiffxp
Member

spiffxp commented Jul 27, 2021

/milestone v1.23

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.22, v1.23 Jul 27, 2021
@claudiubelu
Contributor Author

The issue was resolved some time ago. We have promoted httpd images since this was opened, and the last job was green as well: https://testgrid.k8s.io/sig-testing-images#post-kubernetes-push-e2e-httpd-test-images

/close

@k8s-ci-robot
Contributor

@claudiubelu: Closing this issue.

In response to this:

The issue was resolved some time ago. We have promoted httpd images since this was opened, and the last job was green as well: https://testgrid.k8s.io/sig-testing-images#post-kubernetes-push-e2e-httpd-test-images

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
