[pull] master from rust-lang:master #8

Open
wants to merge 549 commits into
base: master

@pull pull bot commented Jan 9, 2020

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

@pull pull bot added the ⤵️ pull label Jan 9, 2020
@pull pull bot added the merge-conflict Resolve conflicts manually label Feb 19, 2020
BurntSushi and others added 28 commits April 21, 2023 07:58
Otherwise it's possible for the fuzzer to build a regex that is big
enough to time out on a big haystack.
The fuzzer keeps finding regexes that just fit into the limit, have a
Unicode word boundary assertion and come with a decent sized haystack.
This in turn results in slowish searches. The searches aren't
horrifically slow, but it looks like they become much slower with the
sanitizers enabled.

So... drop the size limit down even more.
This fixes a bug where the calculation for the min/max length of a regex
could overflow if the counted repetitions in the pattern are big enough.
The panic only happens when debug assertions are enabled, which means
there is no panic by default in release mode.

One may wonder whether other bad things happen in release mode though,
since in that case, the arithmetic will wrap around instead. Since this
is in new code and since the regex crate doesn't yet utilize the min/max
attributes of an Hir, the wrap around in this case is completely
innocuous.

Fixes #995
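
As a rough illustration of the overflow described above, the sketch below computes the maximum length of a counted repetition with checked arithmetic. The function names are hypothetical and this is not the regex-syntax implementation; it only shows how `checked_mul` avoids both the debug-mode panic and the release-mode wrap around.

```rust
/// Hypothetical helper (not the regex-syntax code): the maximum length in
/// bytes of a repetition, given the maximum length of the repeated
/// sub-expression and the upper repetition bound. `None` means "unbounded
/// or too large to represent".
fn repeat_max_len(sub_max_len: Option<usize>, max_reps: Option<u32>) -> Option<usize> {
    let len = sub_max_len?;
    let reps = max_reps? as usize;
    // checked_mul returns None on overflow instead of panicking (when
    // overflow checks are on) or wrapping around (when they are off).
    len.checked_mul(reps)
}

/// What unchecked multiplication amounts to when overflow checks are
/// disabled: the result silently wraps around.
fn repeat_max_len_wrapping(sub_max_len: usize, max_reps: u32) -> usize {
    sub_max_len.wrapping_mul(max_reps as usize)
}
```

With these definitions, `repeat_max_len(Some(usize::MAX), Some(2))` yields `None`, while the wrapping variant returns a small, meaningless number for the same inputs.
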
And also enable debug assertions to try and catch
more bugs.
We keep beating back the OSS-fuzz timeouts. It keeps finding bigger and
bigger haystacks with even smallish regexes that have Unicode word
boundaries in them. This results in using the PikeVM which is just slow.
There's really nothing to be done other than to tell the fuzzer: "this
is OK."
This bug results in some regexes reporting matches at every position
even when they shouldn't. The bug happens because the internal literal
optimizer winds up using an "empty" searcher that reports a match at
every position. This is technically correct whenever the literal
searcher is used as a prefilter (although slower than necessary), but an
optimization added later enabled the searcher to run on its own and not
as a prefilter, i.e., without the confirm step by the regex engine. In
that context, the "empty" searcher is totally incorrect.

So this bug corresponds to a code path where the "empty" literal
searcher is used, but is also in a case where the literal searcher is
used directly to find matches and not as a prefilter. I believe at
least the following are required to trigger this path:

* The literals extracted need to be "complete." That is, the language
described by the regex is small and finite.
* There needs to be at least 26 distinct starting bytes among all of
the elements in the language described by the regex.
* There needs to be fewer than 26 distinct ending bytes among all of
the elements in the language described by the regex.
* Possibly other criteria...

The actual fix is to change the code that selects the literal searcher.
Indeed, there was even a comment in the erroneous code saying that the
path was impossible, but of course, it isn't. We change that path to
return None, as it should have long ago. This in turn results in the
case outlined above not using a literal searcher and just the regex
engine.

Fixes #999
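
The difference between the two uses of a literal searcher can be sketched as follows. This is a toy model under stated assumptions, not the regex crate's internals: `candidates`, `search_with_prefilter` and `standalone_searcher` are hypothetical names, the "empty" searcher is modeled as `None`, and the regex engine's confirm step is modeled as a closure.

```rust
/// Candidate finder. `None` models the "empty" searcher: it has no literal
/// to look for, so it reports every position as a candidate.
fn candidates(haystack: &str, literal: Option<&str>) -> Vec<usize> {
    match literal {
        Some(lit) => haystack.match_indices(lit).map(|(i, _)| i).collect(),
        None => (0..=haystack.len())
            .filter(|&i| haystack.is_char_boundary(i))
            .collect(),
    }
}

/// Prefilter use: every candidate is confirmed by the engine, so spurious
/// candidates cost time but never produce a wrong answer.
fn search_with_prefilter(
    haystack: &str,
    literal: Option<&str>,
    confirm: impl Fn(&str, usize) -> bool,
) -> Vec<usize> {
    candidates(haystack, literal)
        .into_iter()
        .filter(|&i| confirm(haystack, i))
        .collect()
}

/// Standalone use: candidates are reported as matches without confirmation,
/// so the "empty" searcher has to be rejected up front. Returning `None`
/// here means "do not use a literal searcher, run the regex engine".
fn standalone_searcher(literal: Option<&str>) -> Option<&str> {
    literal.filter(|lit| !lit.is_empty())
}
```

In this model, the "empty" searcher behind a confirm step still gives the right answer, while using its candidates directly would report a match at every position.
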
This regex failed to compile in `regex <1.8`, but the migration to
regex-automata tweaked the rules in a subtle way that permitted it
to compile despite the fact that the old/status-quo matching engines
can't handle it correctly. By that, I mean that they may permit the \B
to match between code units. That in turn results in panicking when
slicing a &str.

In `regex 1.9`, this regex will actually be able to be compiled, but
the matching engines will correctly and robustly never report matches
that split a UTF-8 encoded codepoint. For now, we just add code that
causes `regex 1.8` to have the same behavior as previous releases.

Fixes #1006
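
To see why a match offset that falls inside a codepoint leads to a panic, here is a small sketch using only the standard library. `slice_match` is a hypothetical helper, not part of the regex crate; it relies on `str::get`, the non-panicking counterpart of `&haystack[start..end]`.

```rust
/// Returns the text for a byte span, or `None` when either offset falls
/// inside a UTF-8 encoded codepoint. Indexing with `&haystack[start..end]`
/// would panic in exactly the cases where this returns `None`.
fn slice_match(haystack: &str, start: usize, end: usize) -> Option<&str> {
    haystack.get(start..end)
}

/// True when `offset` is a position where a match may legally start or end.
fn is_valid_match_offset(haystack: &str, offset: usize) -> bool {
    haystack.is_char_boundary(offset)
}
```

For the haystack `"aé"`, where `é` occupies bytes 1 and 2, an empty match reported at offset 2 cannot be sliced out of the `&str`.
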
This essentially copies the `visit_alternation_in` methods, but
for concatenations. This is useful for some niche use cases
where one wants to visit concatenations in reverse.

PR #1017
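
The idea of an "in between children" hook for concatenations can be sketched with a toy tree and visitor. Everything below (`Node`, `Visitor`, `visit`, `Printer`) is a simplified, hypothetical stand-in written for illustration; the real regex-syntax visitor traits have different signatures.

```rust
/// A toy pattern tree, far simpler than regex-syntax's types.
enum Node {
    Literal(char),
    Concat(Vec<Node>),
    Alternation(Vec<Node>),
}

/// Toy visitor with hooks that fire between the children of a node.
trait Visitor {
    fn visit_literal(&mut self, _c: char) {}
    /// Called between the branches of an alternation.
    fn visit_alternation_in(&mut self) {}
    /// Called between the elements of a concatenation.
    fn visit_concat_in(&mut self) {}
}

/// Recursive traversal that calls the matching hook between siblings.
fn visit<V: Visitor>(node: &Node, v: &mut V) {
    match node {
        Node::Literal(c) => v.visit_literal(*c),
        Node::Concat(children) => {
            for (i, child) in children.iter().enumerate() {
                if i > 0 {
                    v.visit_concat_in();
                }
                visit(child, v);
            }
        }
        Node::Alternation(children) => {
            for (i, child) in children.iter().enumerate() {
                if i > 0 {
                    v.visit_alternation_in();
                }
                visit(child, v);
            }
        }
    }
}

/// Example visitor: renders the tree, marking concatenation seams with '.'.
struct Printer(String);

impl Visitor for Printer {
    fn visit_literal(&mut self, c: char) {
        self.0.push(c);
    }
    fn visit_alternation_in(&mut self) {
        self.0.push('|');
    }
    fn visit_concat_in(&mut self) {
        self.0.push('.');
    }
}
```

Running `Printer` over a concatenation of `a`, the alternation `b|c`, and `d` produces a string in which each concatenation seam is visible, which is the kind of position information the new hook exposes.
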
This effectively copies my regex-automata work into this crate and does
a bunch of rejiggering to make it work. In particular, we wire up its
new test harness to the public regex crate API. In this commit, that
means the regex crate API is being simultaneously tested using both the
old and new test suites.

This does *not* get rid of the old regex crate implementation. That will
happen in a subsequent commit. This is just a staging commit to prepare
for that.
If we need this again, we should just rewrite it in Rust and put it in
'regex-cli'.
All of the old tests should be covered by either porting them over
explicitly, or in the TOML test suite.
We're going to drop the old benchmark suite in favor of rebar, but it's
worth recording some final results. This ensures we get a fair
comparison with the regex crate before and after its internals have been
rewritten.
We are going to remove the old benchmark harness, but it seems like a
good idea to save the old measurements.

In the future, benchmarks will be maintained by rebar:
https://github.com/BurntSushi/rebar
As stated in a previous commit, we'll be moving to rebar. (rebar isn't
actually published at time of writing, but it's essentially ready to
go.)
purrden and others added 30 commits June 2, 2024 19:30
We had previously released regex 1.10.4 but omitted a changelog entry for
it. So this adds it.
This is an update from a change made to the trait:
rust-lang/rust#127481

There shouldn't be any behavior changes here.

PR #1219
rustc seems to warn about this. And I would prefer writing the lifetime
here anyway. That it wasn't was probably an oversight.
It looks like rustc picks this up now but didn't before.
This complements `matched_any` with a means to check if a set of
patterns all matched the haystack.

PR #1228
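
The relationship between the two predicates can be sketched with a stand-in type. `PatternMatches` and `check` below are hypothetical and use plain substring search instead of regexes; only the method names `matched_any` and `matched_all` come from the text above.

```rust
/// Hypothetical stand-in for a set-match result: one flag per pattern.
struct PatternMatches {
    matched: Vec<bool>,
}

impl PatternMatches {
    /// True if at least one pattern matched.
    fn matched_any(&self) -> bool {
        self.matched.iter().any(|&m| m)
    }

    /// True if every pattern matched. In this sketch an empty set of
    /// patterns vacuously counts as "all matched".
    fn matched_all(&self) -> bool {
        self.matched.iter().all(|&m| m)
    }
}

/// Toy matcher: each "pattern" is a plain substring.
fn check(haystack: &str, needles: &[&str]) -> PatternMatches {
    PatternMatches {
        matched: needles.iter().map(|n| haystack.contains(n)).collect(),
    }
}
```

For example, `check("foobar", &["foo", "baz"])` satisfies `matched_any` but not `matched_all`.
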
This was an oversight when porting the old generator shell
script to regex-cli. This hasn't been an issue because I don't think
we've generated data for a new release of Unicode with this new
infrastructure yet.

This was flagged by unit tests that failed because \d was no longer a
subset of \w.
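
The kind of invariant such a unit test checks can be sketched as a subset test over codepoint ranges. `is_subset` is a hypothetical helper, and the tables in the usage note are small ASCII-only examples, not the generated Unicode data.

```rust
/// Returns true if every inclusive range in `sub` lies inside a single
/// inclusive range of `sup`. Assumes `sup` is canonical, i.e. adjacent or
/// overlapping ranges have already been merged.
fn is_subset(sub: &[(u32, u32)], sup: &[(u32, u32)]) -> bool {
    sub.iter().all(|&(lo, hi)| {
        sup.iter().any(|&(sup_lo, sup_hi)| sup_lo <= lo && hi <= sup_hi)
    })
}
```

With ASCII digits `[(0x30, 0x39)]` and ASCII word characters `[(0x30, 0x39), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A)]`, the check passes; dropping the digit range from the word table makes it fail, which mirrors the failure described above.
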
I am teetering on removing this cursed implementation.

Fixes #1231
This adds a new predicate that supports very minimal introspection
ability into why DFA construction failed.

Closes #1236