
untokenize() does not round-trip for tab characters #128031

Open
tomasr8 opened this issue Dec 17, 2024 · 5 comments
Labels: stdlib (Python modules in the Lib dir), topic-parser, type-feature (A feature request or enhancement)

tomasr8 commented Dec 17, 2024

Bug report

Bug description:

Tab characters used as whitespace do not round-trip through untokenize. Here's an example:

import tokenize, io

source = "a +\tb"

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
x = tokenize.untokenize(tokens)
print(x)
# prints 'a + b' -- the tab has been replaced by a single space

Here, the tab is replaced with a regular space. Given that untokenize tries to match the source string exactly, I think we should fix this. We can do that by copying the characters from the original source line instead of always emitting plain spaces, as Untokenizer.add_whitespace currently does:

self.tokens.append(" " * col_offset)
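
A rough sketch of the idea (illustrative only -- add_whitespace would need access to the token's physical source line, here called line, and the actual PR may differ):

col_offset = col - self.prev_col
if col_offset:
    if line:
        # Copy the original characters (spaces or tabs) from the source line
        self.tokens.append(line[self.prev_col:col])
    else:
        # No source line available; fall back to plain spaces
        self.tokens.append(" " * col_offset)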

I'll send a fix in a moment :)

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

tomasr8 added the type-bug (An unexpected behavior, bug, or error), stdlib (Python modules in the Lib dir), and topic-parser labels Dec 17, 2024
terryjreedy added the type-feature (A feature request or enhancement) label and removed the type-bug label Dec 17, 2024

terryjreedy commented Dec 17, 2024

The example behavior is the same back to 3.8, and I believe it is intentional and not a bug. untokenize does not promise to reproduce the source string exactly; that is, untokenize('tokenize'(source)) == source is not guaranteed. (I quote 'tokenize' because the call details are omitted.) The current doc says "the spacing between tokens (column positions) may change."

Rather, the doc says that untokenize should produce a source string that tokenizes back to a matching stream of TokenInfo objects; in other words, 'tokenize'(untokenize(stream)) matches stream. The match is on the first two items of each tuple, the token type and value. Since there is no WHITESPACE token, there are no ' ' and '\t' values to match -- or fail to match.

Even though spacing may change, the current code tries to maintain it by inferring whitespace from the differences between successive end and start columns, so the start and end positions may match as well. But the line values before and after may still differ. The proposal is to avoid this too where possible. (I do not know what other cases the existing disclaimer might cover.)
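
For instance, with the example from the report, the guaranteed part of the round trip holds even though the line attributes differ:

import io, tokenize

source = "a +\tb"
stream = list(tokenize.generate_tokens(io.StringIO(source).readline))
new_source = tokenize.untokenize(stream)  # 'a + b'
stream2 = list(tokenize.generate_tokens(io.StringIO(new_source).readline))

# The (type, value) pairs are guaranteed to match...
assert [t[:2] for t in stream] == [t[:2] for t in stream2]
# ...but the physical source lines are not:
print(repr(stream[0].line))   # 'a +\tb'
print(repr(stream2[0].line))  # 'a + b'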

There are two backward-compatibility issues. First, some users may want or depend on the current behavior, perhaps because the change would break their tests. Second, adding a required parameter with no default to Untokenizer.add_whitespace could break user code: even though the absence of "Untokenizer" from tokenize.__all__ (and from the doc) makes it private, it might be in use. If the proposal were otherwise acceptable, a post in the Ideas category on discuss.python.org might be a good idea.
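
The second issue could be avoided by giving the new parameter a default, so the old signature keeps working (a sketch, not a commitment):

def add_whitespace(self, start, line=None):
    # Existing callers that pass only 'start' are unaffected;
    # 'line' is consulted only when a physical source line is supplied.
    ...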

ericvsmith commented:

I don't think this is worth changing. As @terryjreedy points out, the intent was never to reproduce the exact source code that was given to tokenize. I think this change would start us down a slippery slope toward trying to achieve perfect fidelity.


tomasr8 commented Dec 18, 2024

Fair enough; if you think this is not worth fixing, I'm happy to close the issue :)
Though, the motivation for this was the docstring of untokenize, which says that for 5-tuple input the output should match the source exactly:

cpython/Lib/tokenize.py, lines 321 to 330 at bad3cde:

    Round-trip invariant for full input:
        Untokenized source will match input source exactly

    Round-trip invariant for limited input:
        # Output bytes will tokenize back to the input
        t1 = [tok[:2] for tok in tokenize(f.readline)]
        newcode = untokenize(t1)
        readline = BytesIO(newcode).readline
        t2 = [tok[:2] for tok in tokenize(readline)]
        assert t1 == t2

Maybe we should update the docstring to match what the docs say then?

terryjreedy self-assigned this Dec 18, 2024
ericvsmith commented:

Yes, I think the docstring should be changed to match the docs.

terryjreedy commented:

Agree with Eric. The full-input part of the docstring was merged by Thomas Wouters on 2006-12-12 in 89f507f. The initial version of the current .rst doc was merged by Georg Brandl on 2007-08-15 in 116aa62 and is essentially unchanged since. As best I remember, @pablogsal has commented on other issues that the guarantee is the one given in the doc.

There are other sentences in both doc and docstring that need revision. I intend to later post suggested revisions here for discussion.
