untokenize() does not round-trip for tab characters #128031
Comments
The example behavior is the same back to 3.8, and I believe it is intentional and not a bug. Untokenize does not intend to reproduce the source string exactly; that is, `untokenize(tokenize(source)) == source` is not guaranteed. (I quote 'tokenize' loosely, as call details are omitted.) The current doc says "the spacing between tokens (column positions) may change." Rather, the doc says that untokenize should produce a source string that tokenizes back to a matching stream of TokenInfo objects: `tokenize(untokenize(stream))` matches `stream`. The matching is that the first two items of each tuple, the token type and value, are the same. Since there is no WHITESPACE token, there are no whitespace tokens to match.

Even though spacing may change, the current code tries to maintain it by inferring whitespace from the differences between successive end and start columns, so the start and end positions may match as well. But the before and after line values may still differ, and the proposal is to also avoid this where possible. (I do not know what other cases might be covered by the existing disclaimer.)

There are two back-compatibility issues. First, there may be users who want or depend upon the current behavior, perhaps because the change would break tests. Second, adding a required parameter with no default to Untokenizer.add_whitespace could break user code, even though the absence of "Untokenizer" from the documented API suggests it is private.
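The guarantee described in this comment can be sketched with a quick check (a hypothetical snippet, not one from the issue itself):

```python
import io
import tokenize

# Round-trip a source line that uses a tab between tokens.
source = "x =\t1\n"
toks1 = list(tokenize.generate_tokens(io.StringIO(source).readline))
out = tokenize.untokenize(toks1)

# The source string itself is NOT reproduced exactly...
assert out != source
# ...but re-tokenizing the output yields a matching stream:
# the token type and value of each TokenInfo are the same.
toks2 = list(tokenize.generate_tokens(io.StringIO(out).readline))
assert [(t.type, t.string) for t in toks1] == \
       [(t.type, t.string) for t in toks2]
```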
I don't think this is worth changing. As @terryjreedy points out, the intent was never to produce the same source code as the input to tokenize. I think this change would start a slippery slope toward trying to achieve perfect fidelity.
Fair enough; if you think this is not worth fixing, I'm happy to close the issue :)

(Lines 321 to 330 in bad3cde)
Maybe we should update the docstring to match what the docs say, then?
Yes, I think the docstring should be changed to match the docs.
Agree with Eric. The "full input" part of the docstring was merged by Thomas Wouters on 2006-12-12 in 89f507f. The initial version of the current .rst doc was merged by Georg Brandl on 2007-08-15 in 116aa62 and is essentially unchanged since. As best I remember, @pablogsal has commented on other issues that the guarantee is the one given in the doc. There are other sentences in both the doc and the docstring that need revision; I intend to post suggested revisions here later for discussion.
Bug report
Bug description:
Tab characters used as whitespace do not round-trip through untokenize(). When such a line is tokenized and then untokenized, the tab is replaced with a regular space. Given that untokenize tries to match the source string exactly, I think we should fix this. We can do that by taking the characters from the source line rather than always inserting a plain space:
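A minimal reproduction consistent with the description (the original example was not preserved in this rendering, so the exact snippet is an assumption):

```python
import io
import tokenize

# A line with a tab used as inter-token whitespace.
source = "a =\t1\n"
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
result = tokenize.untokenize(tokens)

print(repr(source))  # 'a =\t1\n'
print(repr(result))  # 'a = 1\n' -- the tab came back as a single space
```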
cpython/Lib/tokenize.py, line 185 in 9c2bb7d
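A sketch of the idea (a hypothetical helper, not the actual patch in the linked PR): instead of padding the gap between tokens with `" " * gap`, the untokenizer could slice the original characters out of the token's physical source line:

```python
def whitespace_from_line(line, last_col, start_col):
    """Return the original inter-token text (spaces, tabs, ...)
    taken from the physical source line, instead of space padding."""
    return line[last_col:start_col]

line = "a =\t1\n"
# Gap between '=' (ends at column 3) and '1' (starts at column 4):
print(repr(whitespace_from_line(line, 3, 4)))  # '\t'
print(repr(whitespace_from_line(line, 1, 2)))  # ' '
```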
I'll send a fix in a moment :)
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux
Linked PRs
untokenize()
#128032