Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a failing test case for comments #84

Closed
wants to merge 1 commit into from

Conversation

stefnotch
Copy link
Contributor

To prepare for me looking closer into #83 , I wanted to make sure that I understand some of the parsing edge cases.

While doing so, I actually did find a case that the current parser appears to not handle correctly. The case is

// A comment with a nested multiline comment
// Notice how the "//" inside the multiline comment doesn't take effect
            r"/*
//*
*/commented
*/not commented",

The parser currently goes

  1. /* starts a multiline comment. Increase the block_depth
  2. The next line has one token, namely //. Ignore that. <=== This is where the error happens. Actually, that's not a // token, instead it's a nested /* token
  3. The next line has a */ token. Decrease the block_depth back to zero. Then output commented, since it's no longer inside a multiline comment
  4. The next line has a */ token. Decrease the block_depth again. This should be a parsing error, but we'll silently ignore it. Output not commented.

I see two ways of fixing this

  • Introduce a separate regex for parsing inside a multiline comment.
  • Wildly change the architecture to use nom (or another crate like winnow) for parsing tasks, since my experience is that using them for parsing leads to fewer subtle bugs.

@stefnotch
Copy link
Contributor Author

I believe using a parser combinator library, like nom, would make parsing "quoted strings" easier. This would be useful for #81

@robtfm
Copy link
Collaborator

robtfm commented Apr 26, 2024

good point. also noticed that */ on a line by itself will output instead of */, so i added fixes for both into #81 (and rolled your extra tests in there too).

i've not used nom but it might be worthwhile if it gets any more complex - does it add much compile time overhead?

@stefnotch
Copy link
Contributor Author

Lovely, thanks for putting my unit tests in there. So I can safely close this PR, right?

Regarding nom, using nom actually tends to improve compile times. The reason is not that nom is such a simple crate, but rather that the regex crate is an absolutely fascinating beast.
The regex crate comes with a parser, a virtual machine, multiple regex backends, a ton of Unicode stuff and more.

I would start with a very simple usage of nom and then give you some time to review it. And after that, we can convert more stuff to nom. Does that sound alright?

@robtfm
Copy link
Collaborator

robtfm commented Apr 26, 2024

yes that sounds very good to me

@stefnotch
Copy link
Contributor Author

Perfect, I'll get around to it hopefully soon.

@stefnotch stefnotch closed this Apr 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants