forked from rust-lang/regex
[pull] master from rust-lang:master #8
Open: pull wants to merge 549 commits into mesalock-linux:master from rust-lang:master
Otherwise it's possible for the fuzzer to build a regex that is big enough to timeout on a big haystack.
The fuzzer keeps finding regexes that just fit into the limit, contain a Unicode word boundary assertion and come with a decent-sized haystack. This in turn results in slowish searches. The searches aren't horrifically slow, but it looks like they become much slower with the sanitizers enabled. So... drop the size limit down even more.
This fixes a bug where the calculation for the min/max length of a regex could overflow if the counted repetitions in the pattern are big enough. The panic only happens when debug assertions are enabled, which means there is no panic by default in release mode. One may wonder whether other bad things happen in release mode though, since in that case, the arithmetic will wrap around instead. Since this is in new code and since the regex crate doesn't yet utilize the min/max attributes of an Hir, the wrap around in this case is completely innocuous. Fixes #995
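To illustrate the overflow described above, here is a minimal sketch of how a minimum-length calculation can use checked arithmetic so that huge counted repetitions yield "unknown" instead of panicking (debug) or wrapping (release). The function names are hypothetical and are not taken from regex-syntax; they only show the general technique.

```rust
/// Hypothetical helper: minimum match length of `sub{min_count,}` given the
/// minimum match length of `sub`. Returns `None` when the product overflows.
fn repetition_min_len(sub_min_len: usize, min_count: u32) -> Option<usize> {
    let count = usize::try_from(min_count).ok()?;
    sub_min_len.checked_mul(count)
}

/// Hypothetical helper: minimum match length of a concatenation. Any unknown
/// or overflowing part makes the whole result unknown.
fn concat_min_len(parts: &[Option<usize>]) -> Option<usize> {
    parts
        .iter()
        .try_fold(0usize, |acc, part| acc.checked_add((*part)?))
}

fn main() {
    // A small repetition such as `(?:abc){4}` is computed exactly.
    assert_eq!(repetition_min_len(3, 4), Some(12));
    // A sub-expression that is already enormous overflows when repeated,
    // which is reported as `None` rather than wrapping around.
    assert_eq!(repetition_min_len(usize::MAX, 2), None);
    // Concatenation propagates the unknown result.
    assert_eq!(concat_min_len(&[Some(2), Some(5)]), Some(7));
    assert_eq!(concat_min_len(&[Some(2), None]), None);
}
```

With plain `*` and `+` the same inputs would panic when debug assertions are enabled and silently wrap otherwise, which matches the behavior the commit message describes.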
And also enable debug assertions to try and catch more bugs.
We keep beating back the OSS-fuzz timeouts. It keeps finding bigger and bigger haystacks with even smallish regexes that have Unicode word boundaries in them. This results in using the PikeVM which is just slow. There's really nothing to be done other than to tell the fuzzer: "this is OK."
This bug results in some regexes reporting matches at every position even when they shouldn't. The bug happens because the internal literal optimizer winds up using an "empty" searcher that reports a match at every position. This is technically correct whenever the literal searcher is used as a prefilter (although slower than necessary), but an optimization added later enabled the searcher to run on its own and not as a prefilter, i.e., without the confirm step by the regex engine. In that context, the "empty" searcher is totally incorrect.
So this bug corresponds to a code path where the "empty" literal searcher is used, but is also in a case where the literal searcher is used directly to find matches and not as a prefilter. I believe at least the following are required to trigger this path:
* The literals extracted need to be "complete." That is, the language described by the regex is small and finite.
* There needs to be at least 26 distinct starting bytes among all of the elements in the language described by the regex.
* There needs to be fewer than 26 distinct ending bytes among all of the elements in the language described by the regex.
* Possibly other criteria...
The actual fix is to change the code that selects the literal searcher. Indeed, there was even a comment in the erroneous code saying that the path was impossible, but of course, it isn't. We change that path to return None, as it should have long ago. This in turn results in the case outlined above not using a literal searcher and just the regex engine. Fixes #999
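The difference between the two roles of a literal searcher can be sketched as follows. This is a toy model, not the regex crate's internals: `empty_searcher_candidates` and `confirm` are hypothetical stand-ins for the "empty" searcher and for the regex engine's confirm step.

```rust
/// Stand-in for an "empty" literal searcher: every position is a candidate.
fn empty_searcher_candidates(haystack: &str) -> Vec<usize> {
    (0..=haystack.len()).collect()
}

/// Stand-in for the regex engine's confirm step: does `needle` start at `at`?
fn confirm(haystack: &str, needle: &str, at: usize) -> bool {
    haystack.as_bytes()[at..].starts_with(needle.as_bytes())
}

/// Prefilter mode: candidates are only reported after being confirmed.
fn prefiltered_matches(haystack: &str, needle: &str) -> Vec<usize> {
    empty_searcher_candidates(haystack)
        .into_iter()
        .filter(|&at| confirm(haystack, needle, at))
        .collect()
}

fn main() {
    let haystack = "xabxab";
    // Used on its own, the empty searcher "matches" at all 7 positions.
    assert_eq!(empty_searcher_candidates(haystack).len(), 7);
    // Used as a prefilter, the confirm step discards the false positives.
    assert_eq!(prefiltered_matches(haystack, "ab"), vec![1, 4]);
}
```

In prefilter mode the empty searcher only costs time, since every candidate is re-checked. Run on its own, its candidates are reported as matches directly, which is the incorrect behavior the fix avoids by returning None for that path.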
This regex failed to compile in `regex <1.8`, but the migration to regex-automata tweaked the rules in a subtle way that permitted it to compile despite the fact that the old/status-quo matching engines can't handle it correctly. By that, I mean that they may permit the \B to match between code units. That in turn results in panicking when slicing a &str. In `regex 1.9`, this regex will actually be able to be compiled, but the matching engines will correctly and robustly never report matches that split UTF-8 code units. For now, we just add code that causes `regex 1.8` to have the same behavior as previous releases. Fixes #1006
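The panic mentioned above comes from slicing a `&str` at an offset that falls inside a multi-byte UTF-8 sequence. The following sketch uses only the standard library to show the boundary check; `safe_slice` is a hypothetical helper, not part of the regex crate.

```rust
/// Hypothetical helper: slice `haystack` only if both offsets fall on UTF-8
/// character boundaries; otherwise return `None` instead of panicking.
fn safe_slice(haystack: &str, start: usize, end: usize) -> Option<&str> {
    haystack.get(start..end)
}

fn main() {
    // 'a' occupies byte 0; U+00E9 occupies bytes 1 and 2.
    let haystack = "a\u{e9}";
    assert_eq!(haystack.len(), 3);
    assert!(haystack.is_char_boundary(1));
    // Offset 2 is in the middle of the two-byte sequence.
    assert!(!haystack.is_char_boundary(2));
    // A match ending at offset 2 cannot be turned into a `&str`.
    assert_eq!(safe_slice(haystack, 1, 2), None);
    // Writing `&haystack[1..2]` instead would panic at runtime.
    assert_eq!(safe_slice(haystack, 1, 3), Some("\u{e9}"));
}
```

A matching engine that lets `\B` match at offset 2 here would hand back exactly the kind of span that makes `&haystack[start..end]` panic.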
This essentially copied the visit_alternation_in methods, but for concatenations. This is useful for some niche use cases where one wants to visit concatenations in reverse. PR #1017
This effectively copies my regex-automata work into this crate and does a bunch of rejiggering to make it work. In particular, we wire up its new test harness to the public regex crate API. In this commit, that means the regex crate API is being simultaneously tested using both the old and new test suites. This does *not* get rid of the old regex crate implementation. That will happen in a subsequent commit. This is just a staging commit to prepare for that.
If we need this again, we should just rewrite it in Rust and put it in 'regex-cli'.
All of the old tests should be covered by either porting them over explicitly, or in the TOML test suite.
We're going to drop the old benchmark suite in favor of rebar, but it's worth recording some final results. This ensures we get a fair comparison with the regex crate before and after its internals have been rewritten.
We are going to remove the old benchmark harness, but it seems like a good idea to save the old measurements. In the future, benchmarks will be maintained by rebar: https://github.com/BurntSushi/rebar
As stated in a previous commit, we'll be moving to rebar. (rebar isn't actually published at time of writing, but it's essentially ready to go.)
We had previously released regex 1.10.4 but omitted a changelog entry for it. So this adds it.
This is an update from a change made to the trait: rust-lang/rust#127481 There shouldn't be any behavior changes here. PR #1219
rustc seems to warn about this. And I would prefer writing the lifetime here anyway. That it wasn't was probably an oversight.
It looks like rustc picks this up now but didn't before.
This complements `matched_any` with a means to check if a set of patterns all matched the haystack. PR #1228
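The relationship between the two predicates can be sketched with a toy type. `PatternFlags` is hypothetical and only models one matched/not-matched flag per pattern; it is not the regex crate's set-match type, and the method name `matched_all` is assumed here by analogy with `matched_any`.

```rust
/// Hypothetical stand-in for a set-match result: one flag per pattern.
struct PatternFlags {
    matched: Vec<bool>,
}

impl PatternFlags {
    /// True if at least one pattern matched.
    fn matched_any(&self) -> bool {
        self.matched.iter().any(|&m| m)
    }

    /// True if every pattern matched (assumed name, see lead-in).
    fn matched_all(&self) -> bool {
        self.matched.iter().all(|&m| m)
    }
}

fn main() {
    let some = PatternFlags { matched: vec![true, false, true] };
    assert!(some.matched_any());
    assert!(!some.matched_all());

    let all = PatternFlags { matched: vec![true, true] };
    assert!(all.matched_any());
    assert!(all.matched_all());
}
```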
This was an oversight when porting the old generator shell script to regex-cli. It hasn't been an issue because I don't think we've generated data for a new release of Unicode with this new infrastructure yet. This was flagged by unit tests that failed because \d was no longer a subset of \w.
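The kind of invariant that caught this can be sketched as a range-subset check. The tables below are tiny ASCII-only placeholders, not the generated Unicode data, and `is_subset` is a hypothetical helper that assumes the superset ranges are sorted and already merged, so each subset range must fit inside a single superset range.

```rust
/// Hypothetical check: every inclusive range in `sub` is contained in some
/// range of `sup`. Assumes `sup` is canonical (sorted, adjacent ranges merged).
fn is_subset(sub: &[(char, char)], sup: &[(char, char)]) -> bool {
    sub.iter().all(|&(lo, hi)| {
        sup.iter().any(|&(sup_lo, sup_hi)| sup_lo <= lo && hi <= sup_hi)
    })
}

fn main() {
    // Placeholder tables: ASCII-only approximations of \d and \w.
    let digit = [('0', '9')];
    let word = [('0', '9'), ('A', 'Z'), ('_', '_'), ('a', 'z')];
    assert!(is_subset(&digit, &word));

    // If the generator forgets to fold digits into the word table, the
    // invariant fails, much like the unit test failure described above.
    let broken_word = [('A', 'Z'), ('_', '_'), ('a', 'z')];
    assert!(!is_subset(&digit, &broken_word));
}
```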
I am teetering on removing this cursed implementation. Fixes #1231
This adds a new predicate that supports very minimal introspection ability into why DFA construction failed. Closes #1236
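A predicate of this kind lets a caller react to one specific failure cause without exposing the whole error representation. The sketch below is a toy model under stated assumptions: the `BuildError` enum, the `build` function and the method name `is_size_limit_exceeded` are all hypothetical here and do not reproduce regex-automata's actual types.

```rust
/// Hypothetical error type for DFA construction.
#[derive(Debug)]
enum BuildError {
    /// The DFA would need more memory than the configured limit.
    SizeLimitExceeded { limit: usize },
    /// The pattern uses a feature the DFA cannot handle.
    Unsupported(&'static str),
}

impl BuildError {
    /// Minimal introspection: was the failure caused by the size limit?
    fn is_size_limit_exceeded(&self) -> bool {
        matches!(self, BuildError::SizeLimitExceeded { .. })
    }
}

/// Hypothetical builder: returns the DFA size in bytes on success.
fn build(estimated_bytes: usize, limit: usize, unsupported: bool) -> Result<usize, BuildError> {
    if unsupported {
        return Err(BuildError::Unsupported("unsupported look-around"));
    }
    if estimated_bytes > limit {
        return Err(BuildError::SizeLimitExceeded { limit });
    }
    Ok(estimated_bytes)
}

fn main() {
    match build(10_000, 4_096, false) {
        Ok(size) => println!("built a DFA of {size} bytes"),
        // Only this cause can be fixed by raising the limit or by falling
        // back to a slower engine with bounded memory.
        Err(err) if err.is_size_limit_exceeded() => println!("too big: {err:?}"),
        Err(err) => println!("cannot build: {err:?}"),
    }
}
```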
See Commits and Changes for more details.
Created by pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )