
untokenize() does not round-trip for tab characters #128031

Open
tomasr8 opened this issue Dec 17, 2024 · 5 comments
Labels: stdlib (Python modules in the Lib dir), topic-parser, type-feature (A feature request or enhancement)

tomasr8 commented Dec 17, 2024

Bug report

Bug description:

Tab characters used as whitespace do not round-trip through untokenize. Here's an example:

import tokenize, io

source = "a +\tb"

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
x = tokenize.untokenize(tokens)
print(x)
# prints 'a + b' -- the tab has been replaced by a single space

Here, the tab is replaced with a regular space. Given that untokenize tries to match the source string exactly, I think we should fix this. We can do that by copying the characters from the original source line instead of always emitting plain spaces, as Untokenizer.add_whitespace currently does:

self.tokens.append(" " * col_offset)
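
A rough sketch of the idea (illustrative only -- add_whitespace would need access to the token's physical source line, here called line, and the actual PR may differ):

col_offset = col - self.prev_col
if col_offset:
    if line:
        # Copy the original characters (spaces or tabs) from the source line
        self.tokens.append(line[self.prev_col:col])
    else:
        # No source line available; fall back to plain spaces
        self.tokens.append(" " * col_offset)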

I'll send a fix in a moment :)

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

tomasr8 added the type-bug (An unexpected behavior, bug, or error), stdlib (Python modules in the Lib dir), and topic-parser labels Dec 17, 2024
terryjreedy added the type-feature (A feature request or enhancement) label and removed the type-bug label Dec 17, 2024

terryjreedy commented Dec 17, 2024

The example behavior is the same back to 3.8, and I believe it is intentional and not a bug. untokenize does not promise to reproduce the source string exactly; that is, untokenize('tokenize'(source)) == source is not guaranteed. (I quote 'tokenize' because the call details are omitted.) The current doc says "the spacing between tokens (column positions) may change."

Rather, the doc says that untokenize should produce a source string that tokenizes back to a matching stream of TokenInfo objects; in other words, 'tokenize'(untokenize(stream)) matches stream. The match is on the first two items of each tuple, the token type and value. Since there is no WHITESPACE token, there are no ' ' and '\t' values to match -- or fail to match.

Even though spacing may change, the current code tries to maintain it by inferring whitespace from the differences between successive end and start columns, so the start and end positions may match as well. But the line values before and after may still differ. The proposal is to avoid this too where possible. (I do not know what other cases the existing disclaimer might cover.)
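
For instance, with the example from the report, the guaranteed part of the round trip holds even though the line attributes differ:

import io, tokenize

source = "a +\tb"
stream = list(tokenize.generate_tokens(io.StringIO(source).readline))
new_source = tokenize.untokenize(stream)  # 'a + b'
stream2 = list(tokenize.generate_tokens(io.StringIO(new_source).readline))

# The (type, value) pairs are guaranteed to match...
assert [t[:2] for t in stream] == [t[:2] for t in stream2]
# ...but the physical source lines are not:
print(repr(stream[0].line))   # 'a +\tb'
print(repr(stream2[0].line))  # 'a + b'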

There are two backward-compatibility issues. First, some users may want or depend on the current behavior, perhaps because the change would break their tests. Second, adding a required parameter with no default to Untokenizer.add_whitespace could break user code: even though the absence of "Untokenizer" from tokenize.__all__ (and from the doc) makes it private, it might be in use. If the proposal were otherwise acceptable, a post in the Ideas category on discuss.python.org might be a good idea.
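
The second issue could be avoided by giving the new parameter a default, so the old signature keeps working (a sketch, not a commitment):

def add_whitespace(self, start, line=None):
    # Existing callers that pass only 'start' are unaffected;
    # 'line' is consulted only when a physical source line is supplied.
    ...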

ericvsmith commented:

I don't think this is worth changing. As @terryjreedy points out, the intent was never to reproduce the exact source code that was given to tokenize. I think this change would start us down a slippery slope toward trying to achieve perfect fidelity.


tomasr8 commented Dec 18, 2024

Fair enough; if you think this is not worth fixing, I'm happy to close the issue :)
Though, the motivation for this was the docstring of untokenize, which says that for 5-tuple input the output should match the source exactly:

cpython/Lib/tokenize.py, lines 321 to 330 at bad3cde:

    Round-trip invariant for full input:
        Untokenized source will match input source exactly

    Round-trip invariant for limited input:
        # Output bytes will tokenize back to the input
        t1 = [tok[:2] for tok in tokenize(f.readline)]
        newcode = untokenize(t1)
        readline = BytesIO(newcode).readline
        t2 = [tok[:2] for tok in tokenize(readline)]
        assert t1 == t2

Maybe we should update the docstring to match what the docs say then?

terryjreedy self-assigned this Dec 18, 2024
ericvsmith commented:

Yes, I think the docstring should be changed to match the docs.

terryjreedy commented:

Agree with Eric. The full-input part of the docstring was merged by Thomas Wouters on 2006-12-12 in 89f507f. The initial version of the current .rst doc was merged by Georg Brandl on 2007-08-15 in 116aa62 and is essentially unchanged since. As best I remember, @pablogsal has commented on other issues that the guarantee is the one given in the doc.

There are other sentences in both doc and docstring that need revision. I intend to later post suggested revisions here for discussion.
