
[DRAFT] feat: draft of lessons learned on publish security #538

Open
wants to merge 10 commits into base: main
Conversation

@lwasser (Member) commented Dec 14, 2024

This is the start of a draft post on PyPI publishing best practices. It should be super duper user-friendly and easy to read. I will need feedback!!

I will be working on some graphics and such next.


```yaml
    python-version: '3.9'

# Step 3: Cache dependencies
- name: Cache dependencies
  uses: actions/cache@v4  # assumed action; the original step was truncated
  with:
    path: ~/.cache/pip
    key: pip-${{ hashFiles('**/requirements.txt') }}
```
lwasser (Member Author):
This is not tested and could be wrong.

webknjaz:

This is one of the uses, but oftentimes building is performed by `pypa/build`, so it'd be just `pip install build` and not multiple direct deps as one might expect from a `requirements.txt` file. So the example seems a bit unrealistic, except for some atypical projects with surprising build automation requirements.
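For context, a typical publish workflow's build step can be as simple as this (a minimal sketch, not code from the post):

```yaml
# Sketch: building dists with the pypa/build frontend; nothing
# beyond `build` itself needs to be installed.
- name: Build distributions
  run: |
    python -m pip install build
    python -m build
```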

lwasser (Member Author):

Oh, right, good point. What is a good example of caching dependencies (other than what we say with Ultralytics)? I use caching when I'm publishing.

We `pip install hatch` and use `hatch build`.

What do you think about removing the example and just telling people that if they do cache dependencies for any reason, don't do it when the build is for publishing?

For instance, doing it for a website build could be OK if it's a preview, etc., vs. publishing a package?

@lwasser lwasser marked this pull request as ready for review December 17, 2024 18:51
@lwasser (Member Author) commented Dec 17, 2024

Hey @webknjaz @hugovk, do you have a bit of time to review this post for accuracy? I'm trying to provide a more beginner-friendly overview of what we learned that our community can quickly understand and implement.

We use the workflow @webknjaz helped us create for the pyosmeta package. And we want to link to all of the PyPI / PSF posts that are relevant too. Many thanks for your time!

@webknjaz left a comment:

Here are some quick notes. I might've missed something.

Let's also ask @woodruffw for feedback.


## Is your PyPI publication workflow secure?

The recent Python package breach [involving Ultralytics](https://blog.pypi.org/posts/2024-12-11-ultralytics-attack-analysis/) has highlighted the need for secure PyPI publishing workflows. Hackers exploited a GitHub workflow (`pull_request_target`) to inject malicious code, which was published to PyPI. Users who downloaded the package unknowingly allowed their machines to be hijacked for Bitcoin mining.

webknjaz:

> a GitHub workflow (`pull_request_target`)

`pull_request_target` is not a workflow, it's one of the events that may trigger a workflow run. A workflow may “subscribe” to different events like `push` and `pull_request`; this is just another event type, with elevated privileges in its security context.
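For illustration, a minimal sketch of how a workflow subscribes to events (the event list here is just an example):

```yaml
# A workflow "subscribes" to trigger events under `on:`.
on:
  push:
  pull_request:         # runs in the fork's low-privilege context
  pull_request_target:  # runs in the base repo's privileged context -- risky
```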

> unknowingly allowed their machines to be hijacked for Bitcoin mining.

Not only that, but they also did that in Google Colab, which got their accounts banned pretty swiftly...

lwasser (Member Author):

OOOOH, no way. I missed the Google Colab part!! That's a pretty bold move for a hacker? Maybe they were new hackers?

The Ultralytics breach is a wake-up call for all maintainers: secure your workflows now to protect your users and the Python ecosystem. Start with these key actions:

### 🔐 Secure your workflows 🔐
- 🚫 Avoid risky events like `pull_request_target` and adopt release-based workflows.

webknjaz:

It's not immediately clear what makes a workflow “release-based”. I haven't read past this point at the time of writing this comment, though.
Rephrase, perhaps?

lwasser (Member Author):

Good point - fixed!!

- 🚫 Avoid risky trigger events in GitHub Actions like `pull_request_target`
- Consider using publish workflows on GitHub that are triggered by a GitHub release rather than a pull request merge (see the sketch below).
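A release-triggered publish workflow would subscribe only to the release event, something like this minimal sketch:

```yaml
# Sketch: publish on a published GitHub release, not on PR merges
on:
  release:
    types: [published]
```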


### 🔐 Secure your workflows 🔐
- 🚫 Avoid risky events like `pull_request_target` and adopt release-based workflows.
- ♻️ Don’t cache dependencies in your publish workflows to prevent tampering.

webknjaz:

That's probably an oversimplification. Not caching the deps may as well lead to something like this if the transitive deps are unpinned and get updated in an uncontrolled manner. The problem was that the cache was poisoned in a job with a security context that had high privileges, and then another job picked it up, effectively reusing what that first one “pinned” in the installed deps.

But if there's no cache, a compromised transitive dependency can be downloaded from PyPI, resulting in the same problem.

Security exists in context and not in isolation. It should come from a place of understanding how different bits fit together.

That said, I wouldn't make a call of whether using caching or not will save us all. The problem was that it was possible to compromise the cache that ended up being used.

I think it's possible to use cache securely by having separate cache keys unique to the dist building job. So caches for building the dists and testing them would not leak into each other. That's the real fix for cache.
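Something along these lines, perhaps (a hypothetical sketch; the key prefix is made up):

```yaml
# Sketch: a cache key prefix unique to the dist-building job, so a
# cache populated by the test jobs can never be restored here.
- name: Cache pip (dist build only)
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: build-dists-pip-${{ hashFiles('pyproject.toml') }}
```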

Another side of this is the supply chain problem, which cache ended up being a part of in this instance. But as I mentioned earlier, the supply chain could be compromised differently, without cache being a factor.

The solution to this is pinning your dependencies. This means recording every single version number for every single transitive dependency there is. One of the low-level tools to use for this is pip-tools, in the context of `pip install` (until we have PEP 751, that is). Non-pip installers have different mechanisms. With pip, you can use `pip install -r requirements.in -c constraints.txt`, where `requirements.in` would contain a direct and unrestricted dependency list, and `constraints.txt` would set the entire tree to exact versions.

A complete solution is usually an integration of different aspects through multiple tools (or a "one-tool-to-rule-them-all" which some projects aim to become). I've been taking this to extremes with "DIY lock files" in https://github.com/webknjaz/ep2024-multilock-toy-template / https://github.com/ansible/awx-plugins/tree/2c2b22c/dependencies/lock-files.

Additionally, there's `--require-hashes`, which also records every single dist file hash, not just the version (this is not compatible with `-c`, though, due to a bug in pip).
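In pip-tools terms, that flow looks roughly like this (a sketch; the file names are illustrative):

```console
pip install pip-tools
# pin every transitive dependency to an exact version
pip-compile --output-file=constraints.txt requirements.in
# install the direct deps, constrained to the pinned tree
pip install -r requirements.in -c constraints.txt
```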

To summarize, the solution here is to be in control of your deps, make sure their updates are transparent and controllable on all the levels, and to know what influences them… Out of context, disallowing caching is a band-aid at best. But it may be harmful in that it would cultivate a false sense of security.

lwasser (Member Author):

OK. This is super helpful. I could remove the caching part altogether, given so many moving pieces, and caching seems more nuanced than just locking down the workflow itself with better triggers and tighter permissions. But I also have a question about the dependencies, because it is my understanding and experience that pinning all deps can also be highly problematic for users. For instance, if you have an environment with many different packages and try installing a package with pinned deps, that could lead to significant user issues.

Unless we're just talking about deps in the workflow itself, such as hatch, pypa/build, etc.

Web apps make sense to pin, but I can't quite wrap my head around pinning deps for a package. Could you please help me understand? And do you agree - maybe for more beginner users we just drop caching altogether as a topic?

### 🔐 Secure your workflows 🔐
- 🚫 Avoid risky events like `pull_request_target` and adopt release-based workflows.
- ♻️ Don’t cache dependencies in your publish workflows to prevent tampering.
- If you reference branches in a pull request, clean or sanitize branch names in your workflow.

lwasser (Member Author):

Great suggestions!! I may add these links lower down in the post at the end, so the takeaways are just text... but I'll definitely include them. This is what I have now:

- If you reference branches in a pull request, clean or sanitize branch names in your workflow.
- Consider using zizmor to check your GitHub workflows for security vulnerabilities!

I haven't used zizmor yet, but I saw a package that Hugo maintains uses it, so I may try it out with stravalib and pyosmeta!!


### Lock down GitHub repo access
- 🔒 Restrict repository access to essential maintainers only.
- ✅ Add automated checks to ensure releases are authorized and secure.

webknjaz:

I think this is mostly about setting an environment in the job and making sure that environment has required reviewers in the repo settings.
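In workflow terms, that's roughly this (a sketch; the environment name `pypi` is an assumption, and the required reviewers live in the repo settings, not in the YAML):

```yaml
jobs:
  publish:
    runs-on: ubuntu-latest
    # binds the job to a GitHub environment; if that environment has
    # required reviewers configured, a human must approve each run
    environment:
      name: pypi
      url: https://pypi.org/p/pyosmeta
    permissions:
      id-token: write  # for PyPI Trusted Publishing
```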

lwasser (Member Author):

Looking at the blog posts, it seemed like the hackers somehow got direct access to that repo. Did I read that wrong?


* [<i class="fa-brands fa-discourse"></i> Discourse](https://pyopensci.discourse.group/)
* [<i class="fa-brands fa-mastodon"></i> Mastodon](https://fosstodon.org/@pyopensci)
* [<i class="fa-solid fa-cloud"></i> Bluesky](https://bsky.app/profile/pyopensci.bsky.social)

webknjaz:

Pro tip: consider updating the Bluesky username to pyopensci.org. This can be configured and verified in like 3 minutes (through either DNS or a `.well-known` document on the website) and adds credibility to the account. You can create a repository in the org, call it `.well-known`, and add a file with the account DID as follows: webknjaz/.well-known@5a3c310. Just don't forget to enable Pages in the repo settings before clicking Verify.
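For reference, verification just means Bluesky can fetch the account DID from the domain (the DID below is a placeholder, not the real one):

```console
# what Bluesky fetches during handle verification
curl https://pyopensci.org/.well-known/atproto-did
did:plc:xxxxxxxxxxxxxxxxxxxxxxxx
```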

lwasser (Member Author):

WOAH!! I never knew about this for Bluesky!
Thank you for this tip - and for all of the thoughtful feedback. I'm going to work on this some more tomorrow!

```yaml
publish:
  name: >-
    Publish Python 🐍 distribution 📦 to PyPI
  if: github.repository_owner == 'pyopensci'
```

webknjaz:

I started using the exact repository ID some time ago. This is more accurate and would survive renames and transfers:

Suggested change:

```diff
-  if: github.repository_owner == 'pyopensci'
+  if: github.repository_id == 174412809
```

(looked up @ https://api.github.com/repos/pyOpenSci/pyopensci.github.io)

If you want to check for the org, you could use the org ID. Org names are mutable and if the org gets renamed for whatever reason (might be a user account that hosts a repo in some cases), this check would break.

lwasser (Member Author):

Oh wow - I've always used the username as the conditional value. The only challenging piece here for users is that they'd have to use the API to get that repo ID value. I don't see a clear way to get it from the GitHub GUI. We could make a note about an owner name being mutable, however, and provide the command to find the ID (it requires a token, I think?).
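For what it's worth, the lookup appears to work without a token for public repos (a sketch using curl or the gh CLI):

```console
# no token needed for public repos (subject to API rate limits)
curl -s https://api.github.com/repos/pyOpenSci/pyopensci.github.io | jq .id
# or with the gh CLI, which reuses your existing auth
gh api repos/pyOpenSci/pyopensci.github.io --jq .id
```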

_posts/2024-12-13-python-packaging-security.md (outdated, resolved)

You can see how to set up GitHub Actions securely in our own [PyPI publishing GitHub workflow](https://github.com/pyOpenSci/pyosMeta/blob/main/.github/workflows/publish-pypi.yml), which follows the Trusted Publisher approach.

**Note:** Trusted Publisher workflows are currently only available for GitHub. Support for GitLab may be coming in the future—stay tuned!

webknjaz:

In the doc you already linked, there's an example for GitLab as well: https://docs.pypi.org/trusted-publishers/adding-a-publisher/#gitlab-cicd. There are 4 platforms that support this mechanism already, and I saw some Bitbucket folks asking around whether they can integrate.

lwasser (Member Author):

Fixed the breakout and added that link!! I somehow read that GitLab support wasn't available yet, but obviously I wasn't looking in the right place! Thanks!

lwasser and others added 5 commits December 18, 2024 18:15
Co-authored-by: 🇺🇦 Sviatoslav Sydorenko (Святослав Сидоренко) <[email protected]>
Co-authored-by: 🇺🇦 Sviatoslav Sydorenko (Святослав Сидоренко) <[email protected]>
Co-authored-by: 🇺🇦 Sviatoslav Sydorenko (Святослав Сидоренко) <[email protected]>
@lwasser (Member Author) commented Dec 19, 2024

NOTES: I'm doing a big rewrite (more of a reorg) of this content. I'll start with the high-level concepts of securing the environment / GitHub and GitHub-PyPI, and then the other things like sanitizing branch names and such can come next.
This gives users 2 key takeaways (plus adding zizmor as a third!) to start with now, rather than 10 things, which is a lot! So I'd avoid more reviews until I push these edits, given the structure will change (to hopefully be more clear and more accurate!) 🚀

I'm super appreciative of all of the feedback.
