Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash is unexpectedly None for empty files #69

Open
stefan6419846 opened this issue Sep 20, 2024 · 2 comments
Open

Hash is unexpectedly None for empty files #69

stefan6419846 opened this issue Sep 20, 2024 · 2 comments

Comments

@stefan6419846
Copy link
Contributor

Generating hashes for empty files will always return None, which is not documented and different from the usual hashing algorithms as well as contradicting the SPDX standard.

Example:

from commoncode.hash import sha1
from hashlib import sha1 as sha1_hashlib
from tempfile import NamedTemporaryFile

with NamedTemporaryFile() as temporary_file:
    temporary_file.write(b'')
    temporary_file.seek(0)
    print(sha1(location=temporary_file.name))
    print(sha1_hashlib(string=temporary_file.read(), used_for_security=False).hexdigest())

The reason seems to be that

self.h = msg and hmodule(msg).digest()[: self.digest_size] or None
does not use msg is not None, but basically bool(msg), which is False for empty inputs as well.

Replacing the line with

            self.h = msg is not None and hmodule(msg).digest()[:self.digest_size] or None

(as well as replacing the same pattern in sha1_git_hasher) seems to fix this issue.

@stefan6419846 stefan6419846 changed the title Hash is None for empty files Hash is unexpectedly None for empty files Sep 20, 2024
@pombredanne
Copy link
Member

Stefan: Thanks for the report!
This is intentional because there is a lot of emptiness out there! Returning None for empty files has been a design choice to avoid creating dummy checksums.

FWIW, this predates SPDX and this library is not SPDX specific .... With this said, this has not been documented correctly and should be documented. I will keep this open until we document this alright.

And for the SPDX compatibility, this is handled when creating SPDX documents in ScanCode Toolkit at https://github.com/aboutcode-org/scancode-toolkit/blob/498467c1e5f677edc20e92b12910f16d771ccad3/src/formattedcode/output_spdx.py#L255 where we effectively return a SHA1 for empty files there. (And in the future, other checksums, hardcoded in a similar fashion).

@stefan6419846
Copy link
Contributor Author

Thanks for the explanation. Then this apparently is just a matter of missing/improper documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants