Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redact package normalization #42

Closed
wants to merge 1 commit into from

Conversation

darakian
Copy link
Collaborator

Normalization is not followed in practice.
Package names as they appear on pypi.org are the only knowable source of truth.

Normalization is not followed in practice.
Package names as they appear on pypi.org are the only knowable source of truth.
@oliverchang
Copy link
Contributor

Thanks for raising this @darakian ! Could you please explain the rationale here a bit more re "Normalization is not followed in practice"?

CC @di who is on the PSF. WDYT? Should these advisories be encoded using the normalized project name, or as they appear on pypi.org?

@darakian
Copy link
Collaborator Author

Simple example is https://pypi.org/project/PyYAML/
which exposes itself to the python runtime as yaml

https://pyyaml.org/wiki/PyYAMLDocumentation

@oliverchang
Copy link
Contributor

Some rationale for this (copied from github/advisory-database#52 (comment))

This is to make these package names more consistent and easier to consume and index on.

The same package in Python can be specified in an infinite number of ways.

e.g. pip install Flask-Caching, pip install flask.caching pip install flask......caching pip install flask----caching all have the same effect and refer to the same package. Having a normalized name makes it easier to have more consistency.

Package URLs also made the same decision: https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#pypi

@oliverchang
Copy link
Contributor

Simple example is https://pypi.org/project/PyYAML/ which exposes itself to the python runtime as yaml

https://pyyaml.org/wiki/PyYAMLDocumentation

Sure! But I think there's a distinction here between an import name (which the OSV spec does not intend to encode) and a package name.

I'm not sure normalization implies any relationship between the import name and the package name. It's intended to make PyYAML more consistent according to Python's own rules because pyYAML, PyYAML, pyyaml are all considered the same package.

I'll let @di chime in as well since he's the Python expert.

@darakian
Copy link
Collaborator Author

Flask-Caching is the string which most people would refer to the package hosted here https://pypi.org/project/Flask-Caching/. It is the string to which all those other forms are mapped.

@di
Copy link
Member

di commented Mar 29, 2022

Given that the un-normalized display name might change from one release to the next, I think it makes sense to continue using the normalized name here: it means that OSV can consistently refer to the same project without having to track the un-normalized name.

It also means consumers don't have to do normalization of every project name themselves when using the database.

Note that PyPI respects normalization everywhere, e.g. https://pypi.org/project/flask-caching will always resolve to the same project no matter what the un-normalized name is, and uses the normalized name internally when referring to the project. The un-normalized name is only used for display/vanity purposes.

@oliverchang
Copy link
Contributor

Thanks @di for the input! If the un-normalized display name is unstable and can change any time, then that's a pretty strong argument against this.

To add to this..

Flask-Caching is the string which most people would refer to the package hosted here https://pypi.org/project/Flask-Caching/. It is the string to which all those other forms are mapped.

So we need a consistent way to refer to the package, whether it's Flask-Caching, or flask-caching. In this example, even Flask-Caching's own documentation says pip install flask-caching, which is a different casing.

Using the display name (i.e. Flask-Caching) also has a dependency on pypi.org, because it's not possible to compute the display name from any of these other forms unless we query pypi.org. The latter (flask-caching) only requires a simple offline algorithm. This seems much simpler to do.

@darakian
Copy link
Collaborator Author

darakian commented Mar 29, 2022

also has a dependency on pypi.org

Packages are inherently dependent on their package registries.

If Flask-Caching and flask.caching are equivalent on pypi then I don't see why either name is not acceptable for an advisory about a package on pypi.

@di
Copy link
Member

di commented Mar 29, 2022

If Flask-Caching and flask.caching are equivalent on pypi then I don't see why either name is not acceptable for an advisory about a package on pypi.

I mean, the same is true for the normalized name as well then, no?

I'm not seeing the downside to using the normalized name here, only upsides, but maybe I'm missing something.

@darakian
Copy link
Collaborator Author

darakian commented Mar 29, 2022

I mean, the same is true for the normalized name as well then, no?

Agreed I think that the normal form should also be acceptable.

I'm not seeing the downside to using the normalized name here, only upsides, but maybe I'm missing something.

I see it as an unnecessary restriction on the field.

Edit:
I'm not opposed to this being rejected but it seems like it creates extra work to not allow the same values as the backing package registry.

@oliverchang
Copy link
Contributor

Thanks for opening this issue!

We're going to close this for now, given the reasons that we enumerated in this thread. It does create slightly extra work (applying a regex replace), but it comes with significant upsides around database consistency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants