Name		Name	Last commit message	Last commit date
parent directory ..
_gen		_gen
README.md		README.md
build.sh		build.sh
common.sh		common.sh
do-done-edited.dot		do-done-edited.dot
do-done.re2c.cc		do-done.re2c.cc
else-elif-edited.dot		else-elif-edited.dot
else-elif.re2c.cc		else-elif.re2c.cc
fixed-strings.re2c.cc		fixed-strings.re2c.cc
fixed-strings.sh		fixed-strings.sh
make_pat.py		make_pat.py
re2_grep.cc		re2_grep.cc
synthetic-rsc.re2c.cc		synthetic-rsc.re2c.cc
trie.re2c.cc		trie.re2c.cc
words.sh		words.sh

README.md

fgrep-problem-benchmarks

The "fgrep problem" is to match a set of fixed strings in a corpus. The matching can be done "in parallel" by a DFA.

There are various methods to generate the DFA.

See the top of fixed-strings.sh for instructions on running these benchmarks.

Lobste.rs Discussion about Aho-Corasick Algorithm

Visualization

Simple DFA for "do" | "done" | "break". See below for the DFA for the bigger benchmark problem.

Description of the Problem and the re2c Solution

Matching this file:
-rw-rw-r-- 1 andy andy 338M Nov 10 11:09 _tmp/all-10.txt

Against 13 keywords:
   for while continue break if fi then else elif case esac do done

DFA for fixed-strings.re2.cc
Source generated by re2c from fixed-strings.re2c.cc Note that there are ~35 states yy2 - yy36.
Native code stats for fixed-strings.re2c.cc. GrepFixedStrings() is compiled to 791 bytes of native code. It has one or two variables, so all of them should fix in registers.

Approximate Benchmark Results

Benchmark	User Space Time (ms)	Real Time (ms)
cat	0	46
wc -l	139	189
fgrep	1,368	1,432
grep	1,093	1,133
ripgrep	949	962
`fixed-strings.cc read:count-lines` (touch every byte in C++)	49	186
`fixed-strings.cc read:re2c-match` (re2c compiler)	1,211	1,347
`re2_grep.cc` (RE2 interpreter)	1,906	2,055
Python re	5,781	5,946

NOTE: Each benchmarks matches the same set of strings, but the output is slightly different. I did enough experiments to convince myself that this doesn't matter. Aside from I/O, which we've accounted for, matching is the lion's share of the work, not say printing lines.

Observations

grep is faster than native code! I think this is because doesn't touch every byte. It knows how to skip bytes in the fashion of Boyer-Moore. See links below.
I think that Python is slower because it's a Perl-style backtracking engine rather than an automata-based angine. However, re.findall() must create a Python string object for every result (nearly 30 million of them), which is slow. So it might not be completely fair.

Links

Why GNU Grep is Fast (Discussions)
ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}
- ripgrep uses many strategies, including NFA, DFA, Aho-Corasick, and Teddy (SIMD)!
aho-corasick library in Rust.

TODO

Measure running time of other regex implementations on the same problem. (Pull requests accepted.)
- Are all backtracking engines slower than automata-based engines?
- Compare Perl 5 vs. Python. Perl 5 is the "original" backtracking engine.
- Does interpreting vs. compiling make a difference? grep is an interpreter but it beats a compiler, probably due to the strings it does not try to match.
- Others like Go regex, which is based on re2?
- redgrep? Uses derivatives to construct automata. (I tried to build it but the dependencies seem out of date, and it doesn't have binaries.)
Plot running time vs. number of strings. Right now we have a fixed set of 13 strings. It would be nice to do 2, 10, 20, 30, etc.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fgrep-problem-benchmarks

fgrep-problem-benchmarks

README.md

fgrep-problem-benchmarks

Visualization

Description of the Problem and the re2c Solution

Approximate Benchmark Results

Observations

Links

TODO

Files

fgrep-problem-benchmarks

Directory actions

More options

Directory actions

More options

Latest commit

History

fgrep-problem-benchmarks

Folders and files

parent directory

README.md

fgrep-problem-benchmarks

Visualization

Description of the Problem and the re2c Solution

Approximate Benchmark Results

Observations

Links

TODO