Add a failing test case for comments #84

stefnotch · 2024-04-25T18:28:57Z

Before looking more closely at #83, I wanted to make sure that I understand some of the parsing edge cases.

While doing so, I found a case that the current parser does not appear to handle correctly:

// A comment with a nested multiline comment
// Notice how the "//" inside the multiline comment doesn't take effect
            r"/*
//*
*/commented
*/not commented",

The parser currently proceeds as follows:

1. /* starts a multiline comment. Increase the block_depth.
2. The next line has one token, namely //. Ignore that. <=== This is where the error happens: inside a multiline comment this is not a // token, but a / followed by a nested /* token.
3. The next line has a */ token. Decrease the block_depth back to zero. Then output commented, since it's no longer inside a multiline comment.
4. The next line has a */ token. Decrease the block_depth again. This should be a parsing error, but it is silently ignored. Output not commented.
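
The walk-through above can be modeled with a small stand-alone Rust sketch. This is a simplified model of the described behavior, not the crate's actual code: the function name `strip_comments_buggy` and the prefix matching are assumptions made for illustration. The only thing it shares with the description is the order of decisions, in particular that `//` wins even while `block_depth` is non-zero.

```rust
/// Simplified, hypothetical model of the scanning order described above.
/// It is NOT the real parser; it only reproduces the reported symptom.
fn strip_comments_buggy(input: &str) -> String {
    let mut out = String::new();
    let mut block_depth: usize = 0;
    for line in input.lines() {
        let mut rest = line;
        while !rest.is_empty() {
            if rest.starts_with("//") {
                // Modeled bug: `//` is honored even inside a block comment,
                // so the `/*` hidden in `//*` is never seen.
                break;
            } else if rest.starts_with("/*") {
                block_depth += 1;
                rest = &rest[2..];
            } else if rest.starts_with("*/") {
                // saturating_sub models the silently ignored extra `*/`.
                block_depth = block_depth.saturating_sub(1);
                rest = &rest[2..];
            } else {
                let ch = rest.chars().next().unwrap();
                if block_depth == 0 {
                    out.push(ch);
                }
                rest = &rest[ch.len_utf8()..];
            }
        }
        // Keep line structure so line numbers stay stable.
        out.push('\n');
    }
    out
}
```

Fed the four-line test input, this model emits both `commented` and `not commented`, which matches the faulty behavior described in the steps.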

I see two ways of fixing this:

1. Introduce a separate regex for parsing inside a multiline comment.
2. Wildly change the architecture to use nom (or another crate like winnow) for parsing tasks, since in my experience using them for parsing leads to fewer subtle bugs.
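
As a rough illustration of the idea behind the first option (different rules apply while inside a multiline comment), here is a hand-written sketch that uses neither regex nor nom. `strip_comments` is a hypothetical helper and not part of the crate. It assumes that block comments nest, it ignores string literals, and it chooses to report an unmatched `*/` as an error instead of silently ignoring it.

```rust
/// Hypothetical sketch: strip `//` and nested `/* */` comments.
/// Newlines inside comments are kept so line numbers stay stable.
fn strip_comments(input: &str) -> Result<String, String> {
    let mut out = String::new();
    let mut depth: usize = 0;
    let mut rest = input;
    while let Some(ch) = rest.chars().next() {
        if depth > 0 {
            // Inside a block comment only `/*` and `*/` are meaningful;
            // `//` is just two ordinary characters here.
            if rest.starts_with("/*") {
                depth += 1;
                rest = &rest[2..];
            } else if rest.starts_with("*/") {
                depth -= 1;
                rest = &rest[2..];
            } else {
                if ch == '\n' {
                    out.push('\n');
                }
                rest = &rest[ch.len_utf8()..];
            }
        } else if rest.starts_with("//") {
            // Line comment: skip up to (not including) the newline.
            let end = rest.find('\n').unwrap_or(rest.len());
            rest = &rest[end..];
        } else if rest.starts_with("/*") {
            depth = 1;
            rest = &rest[2..];
        } else if rest.starts_with("*/") {
            return Err("unmatched `*/`".to_string());
        } else {
            out.push(ch);
            rest = &rest[ch.len_utf8()..];
        }
    }
    if depth > 0 {
        return Err("unterminated block comment".to_string());
    }
    Ok(out)
}
```

With the test input from above, the `//*` line is read as a `/` followed by a nested `/*`, so only `not commented` survives, preceded by the preserved newlines.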

stefnotch · 2024-04-25T18:31:17Z

I believe using a parser combinator library, like nom, would make parsing "quoted strings" easier. This would be useful for #81
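
To give a feel for the combinator style without pulling in a dependency, here is a tiny hand-rolled sketch in plain Rust. It does not use nom and does not reproduce nom's API: `PResult`, `tag`, `take_until_char`, and `quoted` are hypothetical names chosen for illustration, and escape sequences inside the quotes are not handled.

```rust
/// A parser returns the remaining input plus the parsed value, or None.
type PResult<'a, T> = Option<(&'a str, T)>;

/// Match a literal prefix and return it.
fn tag<'a>(prefix: &str, input: &'a str) -> PResult<'a, &'a str> {
    input
        .strip_prefix(prefix)
        .map(|rest| (rest, &input[..prefix.len()]))
}

/// Consume everything up to (not including) `stop`.
/// Fails if `stop` never appears.
fn take_until_char<'a>(stop: char, input: &'a str) -> PResult<'a, &'a str> {
    let idx = input.find(stop)?;
    Some((&input[idx..], &input[..idx]))
}

/// A quoted string is built by sequencing the small parsers above:
/// opening quote, contents, closing quote.
fn quoted<'a>(input: &'a str) -> PResult<'a, &'a str> {
    let (rest, _) = tag("\"", input)?;
    let (rest, contents) = take_until_char('"', rest)?;
    let (rest, _) = tag("\"", rest)?;
    Some((rest, contents))
}
```

Each step either succeeds and hands the remaining input to the next step, or fails as a whole, which is the property that tends to make such parsers easier to reason about than a single large regex.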

robtfm · 2024-04-26T00:45:34Z

good point. also noticed that */ on a line by itself will output instead of */, so i added fixes for both into #81 (and rolled your extra tests in there too).

i've not used nom but it might be worthwhile if it gets any more complex - does it add much compile time overhead?

stefnotch · 2024-04-26T05:54:33Z

Lovely, thanks for putting my unit tests in there. So I can safely close this PR, right?

Regarding nom: using it actually tends to improve compile times. The reason is not that nom is such a simple crate, but rather that the regex crate is an absolutely fascinating beast. It comes with a parser, a virtual machine, multiple regex backends, a ton of Unicode handling, and more.

I would start with a very simple usage of nom and then give you some time to review it. And after that, we can convert more stuff to nom. Does that sound alright?

robtfm · 2024-04-26T07:51:06Z

yes that sounds very good to me

stefnotch · 2024-04-26T07:53:14Z

Perfect, I'll hopefully get around to it soon.

Add a failing test case for comments

cab736a

stefnotch closed this Apr 26, 2024
