use proper version of CTRE + use multiline mode with it #14

Open
wants to merge 1 commit into
base: master

hanickadot

Use the latest version of CTRE (it switched from master to main a long time ago). Also use multiline range, since the default mode is single line.

@hanickadot
Author

I noticed someone added CTRE here, but they selected the master branch, which I left behind in the CTRE repository 2 years ago.

@hanickadot
Author

Also, in the meantime CTRE has started to support mode switching, such as (?i), and word boundaries. Case-insensitive mode is still ASCII-only (there is work in progress toward full Unicode 14 support, but it's not there yet).

@BurntSushi

What does multiline range do for CTRE? Usually multi-line mode makes ^ and $ match at the beginning/end of lines in addition to the beginning/end of text. I don't see any regexes in this benchmark that make use of ^/$, so I would assume it doesn't matter? Interestingly, the existing programs don't seem consistent. For example, it looks like multiline is enabled for PCRE2 but not for RE2.
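
As a rough illustration of the conventional meaning described above, here is a minimal sketch using Python's standard `re` module. It is only an analogy: it says nothing about how CTRE, PCRE2, or RE2 behave, and the sample text is made up for the example.

```python
import re

# Hypothetical three-line sample text, not taken from the benchmark corpus.
text = "Tom\nSawyer\nFinn"

# Default mode: ^ anchors only at the very start of the text.
default_hits = re.findall(r"^\w+", text)

# Multi-line mode: ^ also anchors immediately after each newline.
multiline_hits = re.findall(r"^\w+", text, flags=re.MULTILINE)

# A pattern that uses neither ^ nor $ gives the same result either way.
plain_default = re.findall(r"Tom|Sawyer|Finn", text)
plain_multiline = re.findall(r"Tom|Sawyer|Finn", text, flags=re.MULTILINE)

print(default_hits)
print(multiline_hits)
print(plain_default == plain_multiline)
```

In Python, at least, the flag changes results only for patterns that contain an anchor, which is the reasoning behind the assumption that it should not matter for this benchmark.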

@hanickadot
Author

hanickadot commented Jul 12, 2022

Multiline mode in CTRE (and also in PCRE) doesn't allow . to match a newline. I noticed a slight inconsistency when single-line mode is used:

Regex: '.{2,4}(Tom|Sawyer|Huckleberry|Finn)'
[      ctre] time:   351.5 ms (+/-  1.1 %), matches:     2419
[rust_regex] time:    28.3 ms (+/-  1.9 %), matches:     1976
[rust_regrs] time:   976.8 ms (+/-  0.8 %), matches:     1976

But with multiline it has expected results:

Regex: '.{2,4}(Tom|Sawyer|Huckleberry|Finn)'
[      ctre] time:   332.9 ms (+/-  0.2 %), matches:     1976
[rust_regex] time:    28.4 ms (+/-  1.1 %), matches:     1976
[rust_regrs] time:   983.7 ms (+/-  0.8 %), matches:     1976
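
The difference in match counts above is consistent with . matching a newline in one configuration and not in the other. A minimal sketch of that effect, using Python's `re` module as a stand-in (the `re.DOTALL` flag, the `count_matches` helper, and the sample text are assumptions for illustration, not CTRE's API or the benchmark input):

```python
import re

PATTERN = r".{2,4}(Tom|Sawyer|Huckleberry|Finn)"

# Hypothetical sample: only one character separates the newline from "Finn",
# so ".{2,4}" can reach "Finn" only if the dot is allowed to consume "\n".
text = "said Tom\n Finn"


def count_matches(pattern, subject, flags=0):
    """Count non-overlapping matches of pattern in subject."""
    return sum(1 for _ in re.finditer(pattern, subject, flags))


# The dot stops at newlines (Python's default).
dot_excludes_newline = count_matches(PATTERN, text)

# The dot may also consume newlines.
dot_includes_newline = count_matches(PATTERN, text, re.DOTALL)

print(dot_excludes_newline, dot_includes_newline)
```

When the dot is allowed to cross line boundaries, extra matches appear wherever a name sits fewer than two characters after a newline, so an engine configured that way would be expected to report a higher count.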

@BurntSushi

Hmmm, for PCRE, "multiline" mode doesn't impact whether . matches \n:

By default, for the purposes of matching "start of line" and "end of line",
PCRE2 treats the subject string as consisting of a single line of characters,
even if it actually contains newlines. The "start of line" metacharacter (^)
matches only at the start of the string, and the "end of line" metacharacter
($) matches only at the end of the string, or before a terminating newline
(except when PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless
PCRE2_DOTALL is set, the "any character" metacharacter (.) does not match at
a newline. This behaviour (for ^, $, and dot) is the same as Perl.

In order to make PCRE match \n for ., you need to enable "dot all" mode:

If this bit is set, a dot metacharacter in the pattern matches any character,
including one that indicates a newline. However, it only ever matches one
character, even if newlines are coded as CRLF. Without this option, a dot
does not match when the current position in the subject is at a newline.
This option is equivalent to Perl's /s option, and it can be changed within
a pattern by a (?s) option setting. A negative class such as [^a] always
matches newline characters, independent of the setting of this option.

Most regex engines I'm aware of don't have "dot all" mode enabled by default. That is, by default, . typically does not match \n.
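
To make the distinction concrete, here is a small sketch in Python's `re` module, whose (?s) and (?m) inline options follow the same Perl-style convention quoted above. It is an analogy only and is not PCRE2 or CTRE code; the two-line sample text is made up.

```python
import re

# Hypothetical sample text with a single newline in the middle.
text = "a\nb"

# Default: the dot skips the newline.
dot_default = re.findall(r".", text)

# "Dot all" via the inline (?s) option: the dot matches the newline too.
dot_inline_s = re.findall(r"(?s).", text)

# Multi-line via the inline (?m) option: no effect on what the dot matches.
dot_inline_m = re.findall(r"(?m).", text)

# A negated class matches the newline regardless of either option.
negated_class = re.findall(r"[^a]", text)

print(dot_default, dot_inline_s, dot_inline_m, negated_class)
```

Here only (?s) changes what the dot accepts, while (?m) leaves it untouched.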

But in any case, it looks like my question was answered. I had assumed CTRE's "multiline" mode corresponded to the same thing in other regex engines, but it sounds like it might be a little different.

@hanickadot
Author

Good point, maybe I should implement dot_all mode too. Thanks!

@HFTrader
Contributor

Hijacking this thread.
@hanickadot I have rerun the tests using main instead of master. For the most part the performance was unchanged, with the exception of the two regexes below, which jumped from 55 to 355 ms. Am I perhaps doing something wrong? Please advise.

.{0,2}(Tom|Sawyer|Huckleberry|Finn)
.{2,4}(Tom|Sawyer|Huckleberry|Finn)
