Implement comment tracking in the new lexer #389

mcy · 2024-12-09T21:51:25Z

This change does two things:

It adds support for attaching comments to tokens in the token package, as well as a way to query the actual text of a comment token. Comment tokens that are token trees represent tokens that are grouped together.
It adds support in the lexer for automatically attaching comments. I have included a test consisting of all of the funny cases listed on protobuf.com.

jhump

I don't think it's worth complicating the lexer this way.

In the current lexer, comment attribution is super dumb and simple. In order to implement trailing vs. leading comments in source code info, there is a later step that can decide to "donate" one token's leading detached comments to a prior token's trailing comments.

That also makes it way more predictable as far as what is a "trailing" comment in the AST, like for a formatter.

So in the current lexer, IIRC, a trailing comment is attributed only when the token is the last (non-skippable) token on the line and there is a comment between the token and the newline. For all other cases, the comments are associated with the following token.
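To make that rule concrete, here is a minimal sketch of the check as described in this comment. It is not the code in either lexer: the `span` type and the `isTrailing` function are hypothetical names chosen for illustration, and the sketch assumes line numbers are already known for the previous token, the comment, and the next token.

```go
package main

import "fmt"

// span is a hypothetical record of the 1-based lines an item occupies.
type span struct {
	startLine, endLine int
}

// isTrailing reports whether comment should become a trailing comment of
// prev instead of a leading comment of next. It returns true only when the
// comment starts on the line where prev ends and next starts on a later
// line, so that prev is the last token on its line.
func isTrailing(prev, comment, next span) bool {
	return comment.startLine == prev.endLine && next.startLine > prev.endLine
}

func main() {
	prev := span{1, 1}

	// Comment on the same line as prev, next token further down.
	fmt.Println(isTrailing(prev, span{1, 1}, span{3, 3})) // true

	// Comment on its own line: it stays with the following token.
	fmt.Println(isTrailing(prev, span{2, 2}, span{3, 3})) // false

	// Comment between two tokens on one line: prev is not last on the line.
	fmt.Println(isTrailing(prev, span{1, 1}, span{1, 1})) // false
}
```

Everything that fails the check is attributed to the following token, which is what keeps the result predictable.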

Also, there was a fuzz test performance issue with having to calculate line numbers to do these checks, so I had to write #343 to work around it. That change makes the current lexer track the current line number, just so it can annotate a token to record when the previous token's end line matches the subsequent comment's start line (and also that the next token's start line is higher, to make sure the two tokens are not on the same line).

I see your logic just scans the text for newlines, which is likely much faster than the logic I had that induced the perf issue (which used the O(log n) search to compute line numbers for the tokens and comments). But it still could be an issue for a pathological source file to have to do that scan instead of using trivial integer ops as each token is lexed to dead-reckon line numbers for this purpose.
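As a rough illustration of the trade-off, the sketch below contrasts the two ways of getting a line number: an O(log n) binary search over precomputed line start offsets, and a counter that is bumped as each piece of text is consumed. All names here (`lineStarts`, `lineByLookup`, `lineTracker`) are hypothetical, and neither half is taken from the actual lexer or from #343.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// lineStarts returns the byte offset at which each line of src begins.
func lineStarts(src string) []int {
	starts := []int{0}
	for i := 0; i < len(src); i++ {
		if src[i] == '\n' {
			starts = append(starts, i+1)
		}
	}
	return starts
}

// lineByLookup returns the 1-based line containing offset. The binary
// search finds the first line start greater than offset; the count of
// line starts at or before offset is the line number.
func lineByLookup(starts []int, offset int) int {
	return sort.Search(len(starts), func(i int) bool { return starts[i] > offset })
}

// lineTracker dead-reckons the current line as text is consumed.
type lineTracker struct {
	line int // 1-based line of the next unconsumed byte
}

// advance consumes text (a token, comment, or whitespace run). It returns
// the line on which text starts and the line of the position just past it.
func (t *lineTracker) advance(text string) (start, end int) {
	start = t.line
	t.line += strings.Count(text, "\n")
	return start, t.line
}

func main() {
	src := "foo // c\nbar"
	starts := lineStarts(src)

	t := &lineTracker{line: 1}
	offset := 0
	for _, piece := range []string{"foo", " ", "// c", "\n", "bar"} {
		start, _ := t.advance(piece)
		// Both methods agree on the start line of every piece.
		fmt.Println(start == lineByLookup(starts, offset))
		offset += len(piece)
	}
}
```

In the tracker, each byte is counted once as it is consumed, so comparing two line numbers afterwards is a plain integer comparison with no search and no rescan of the text between tokens.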

mcy · 2024-12-12T20:04:42Z

As discussed elsewhere, we're going to punt this for now. I'm going to convert this to draft for future reference, but what we'll probably do is make comment attribution lazy.

mcy added 2 commits December 9, 2024 13:33

add iters.Enumerate

615d187

token changes

f841375

mcy requested a review from jhump December 9, 2024 21:51

lexer changes

6673c09

mcy force-pushed the mcy/comments branch from 70af9c6 to 6673c09 Compare December 9, 2024 21:54

jhump reviewed Dec 10, 2024

mcy marked this pull request as draft December 12, 2024 20:06
