Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement comment tracking in the new lexer #389

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft

Implement comment tracking in the new lexer #389

wants to merge 3 commits into from

Conversation

mcy
Copy link
Member

@mcy mcy commented Dec 9, 2024

This change does two things:

  1. It adds support for attacking comments to tokens in the token package, as well as a way to query the actual text of a comment token. Comment tokens which are token trees are used to represent tokens that are grouped together.

  2. It adds support in the lexer for automatically attaching comments. I have included a test consisting of all of the funny cases listed on protobuf.com.

@mcy mcy requested a review from jhump December 9, 2024 21:51
Copy link
Member

@jhump jhump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's worth complicating the lexer this way.

In the current lexer, comment attribution is super dumb and simple. In order to implement trailing vs. leading comments in source code info, there is a later step that can decide to "donate" one token's leading detached comments to a prior token's trailing comments.

That also makes it way more predictable as far as what is a "trailing" comment in the AST, like for a formatter.

So in the current lexer, IIRC, a trailing comment is attributed only when the token is the last (non-skippable) token on the line and there is a comment between the token and the newline. For all other cases, the comments are associated with the following token.

Also, there was a fuzz test performance issue with having to calculate line numbers to do these checks. So I had to write #343 to work-around that. That causes the current lexer to track the current line number, just so it can annotate a token to keep track of when the previous token's end line matches the subsequent comment's start line (and to also make sure that the next token's start line is higher -- to make sure the two tokens are not on the same line).

I see your logic just scans the text for newlines, which is likely much faster than the logic I had that induced the perf issue (which used the O(log n) search to compute line numbers for the tokens and comments). But it still could be an issue for a pathological source file to have to do that scan instead of using trivial integer ops as each token is lexed to dead-reckon line numbers for this purpose.

@mcy
Copy link
Member Author

mcy commented Dec 12, 2024

As discussed elsewhere, W're going to punt this for now. I'm going to convert this to draft for future reference, but what we'll probably do is make comment attribution lazy.

@mcy mcy marked this pull request as draft December 12, 2024 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants