Tcl core 8.5 with PCRE (and DFA) regular expressions engine #5
Open: sebres wants to merge 29 commits into core-8-5-branch from core_8_5_pcre
…2e77] from [1520767fff]; rebased to sebres-8-5-timerate, conflicts resolved (fixes of Tcl_RegexpObjCmd from tclCmdMZ.c going to tclRegexp.c#TclRegexpClassic, etc)
… additionally linker options (e.g. to compile with PCRE), for example:
set PCREDIR="..\..\lib\pcre"
nmake -nologo -f makefile.vc release OPTS=threads,thrdalloc OPTIMIZATIONS="-DHAVE_PCRE -I%PCREDIR% -Ox -Ot -Oi -Gs" ADDLINKOPTS="%PCREDIR%\pcre.lib"
… interp engine if specified as parameter), test cases extended in order to cover this situation; bug fixing (test-cases repaired) in case of no HAVE_PCRE; several optimizations and code review (e. g. option -type compiled now if token is a simple word);
Several known bugs of the classic regexp engine are fixed in PCRE, so the constraints are now deactivated for these test cases (in -pcre mode). Test cases extended for several PCRE features. The PCRE implementation is almost complete now (todo: binary recognition, i.e. option "-binary" or "-bytearray"; normalize the code of Tcl_RegsubObjCmd, which is too large at the moment).
…now in classic also) and offsets (PCRE vectors), etc; Test cases extended;
sebres changed the title from "core 8.5 with PCRE (and DFA) regular expressions engine" to "Tcl core 8.5 with PCRE (and DFA) regular expressions engine" on Nov 17, 2017
…storage; Common handling for fast access of reStorage rewritten (one pointer instead of multiple references); Bugs fixed.
Implemented a real DFA workspace that is reallocated on demand if it is too small; small code review.
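To illustrate what "reallocate on demand" could look like, here is a minimal C sketch of a grow-and-retry loop. It is not the code from this branch: `demo_dfa_run`, `demo_workspace`, `demo_fake_match` and the `DEMO_ERR_*` values are hypothetical names made up for the example, and the stub matcher only stands in for a real DFA call such as PCRE's `pcre_dfa_exec`, which reports a too-small workspace with `PCRE_ERROR_DFA_WSSIZE`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Illustrative error codes (assumption); real PCRE code would test for
 * PCRE_ERROR_DFA_WSSIZE and PCRE_ERROR_NOMEMORY instead. */
#define DEMO_ERR_WSSIZE (-19)
#define DEMO_ERR_NOMEM  (-6)

/* Hypothetical matcher signature: gets an int workspace and its size. */
typedef int (*demo_dfa_fn)(void *ctx, int *workspace, int wscount);

/* Workspace that is kept between calls so it only ever grows. */
typedef struct {
    int *data;
    int  count;
} demo_workspace;

/* Call `match`; whenever it reports a too-small workspace, double the
 * workspace (capped at max_count) and try again. */
int demo_dfa_run(demo_dfa_fn match, void *ctx, demo_workspace *ws, int max_count)
{
    for (;;) {
        int rc = match(ctx, ws->data, ws->count);
        if (rc != DEMO_ERR_WSSIZE || ws->count >= max_count) {
            return rc;              /* success, other error, or cap reached */
        }
        int new_count;
        if (ws->count == 0) {
            new_count = 32;         /* arbitrary initial size */
        } else if (ws->count > max_count / 2) {
            new_count = max_count;  /* avoids overflow when doubling */
        } else {
            new_count = ws->count * 2;
        }
        if (new_count > max_count) {
            new_count = max_count;
        }
        int *p = realloc(ws->data, (size_t)new_count * sizeof *p);
        if (p == NULL) {
            return DEMO_ERR_NOMEM;  /* the old workspace stays valid */
        }
        ws->data = p;
        ws->count = new_count;
    }
}

/* Stub standing in for a DFA matcher: "matches" (returns 1) once the
 * workspace holds at least *ctx ints. */
int demo_fake_match(void *ctx, int *workspace, int wscount)
{
    int needed = *(const int *)ctx;
    (void)workspace;
    return wscount < needed ? DEMO_ERR_WSSIZE : 1;
}
```

Keeping the workspace in a long-lived structure (per thread or per interpreter, for example) means the cost of growing is paid only for the first pattern that needs the larger size.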
…allows static linkage of pcre
(conflicts resolved)
…ility tests; illustrates wrong handling of indices when using the pcre engine: it retrieves byte offsets instead of char offsets (tests 'regexp-3.8m*-pcre' and 'regexp-17.3m*-pcre' fail)
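The difference between byte and char offsets can be shown with a small C sketch. PCRE in UTF-8 mode reports match positions as byte offsets into the subject, while Tcl's `-indices` are expected in characters, so each offset has to be converted. `demo_utf8_byte_to_char_offset` is a hypothetical helper written for this illustration (not taken from the branch); it assumes valid UTF-8 and an offset that lies on a character boundary inside the string. Tcl's `Tcl_NumUtfChars` does comparable counting.

```c
#include <stddef.h>

/* Convert a byte offset into a UTF-8 string to a character offset by
 * counting the characters that start within the first byte_off bytes. */
size_t demo_utf8_byte_to_char_offset(const char *s, size_t byte_off)
{
    size_t chars = 0;
    for (size_t i = 0; i < byte_off; i++) {
        /* Continuation bytes have the form 10xxxxxx; every other byte
         * starts a new character. */
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            chars++;
        }
    }
    return chars;
}
```

For the subject `"a\xC3\xA4\xE2\x82\xAC" "b"` (the characters a, ä, €, b) the final `b` sits at byte offset 6 but at char offset 3, which is exactly the kind of mismatch the failing tests expose.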
…set, etc); more tests
…ic nfa regexp, probably possible if RE gets recompiled on demand without PCRE_UTF8, but there are '\w' which could then confuse some byte sequences with chars, so to avoid regression let parse it as utf-8 now); added regression test cases regexp-27.*
…gexp* parameter flags replaced all, inline, indices; additional tests for regsub over multi-byte string (check correct initial offsets)
…ith PCRE_JAVASCRIPT_COMPAT (unfortunately it introduces another ugly restriction: a lone closing square bracket in a pattern causes a compile-time error, but it can be escaped like "\]", so let's use it until \uXXXX can be compiled in Tcl directly, e.g. by some pre-processor)
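A pre-processor of the kind mentioned above could be sketched in C roughly as follows. This is only an assumption about how such a step might look, not something this branch contains: `demo_rewrite_u_escapes` is a hypothetical function that rewrites `\uXXXX` (exactly four hex digits) into the `\x{XXXX}` form that PCRE accepts in UTF-8 mode, and copies every other escape unchanged so that an escaped backslash followed by `u` is left alone.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Rewrite \uXXXX as \x{XXXX}. Returns the output length without the
 * terminating NUL, or (size_t)-1 if `out` is too small. */
size_t demo_rewrite_u_escapes(const char *in, char *out, size_t out_size)
{
    size_t o = 0;
    if (out_size == 0) {
        return (size_t)-1;
    }
    for (size_t i = 0; in[i] != '\0'; ) {
        if (in[i] == '\\' && in[i + 1] == 'u'
                && isxdigit((unsigned char)in[i + 2])
                && isxdigit((unsigned char)in[i + 3])
                && isxdigit((unsigned char)in[i + 4])
                && isxdigit((unsigned char)in[i + 5])) {
            /* 8 output bytes: \x{ + 4 hex digits + } */
            if (o + 8 >= out_size) {
                return (size_t)-1;
            }
            memcpy(out + o, "\\x{", 3);
            memcpy(out + o + 3, in + i + 2, 4);
            out[o + 7] = '}';
            o += 8;
            i += 6;
        } else if (in[i] == '\\' && in[i + 1] != '\0') {
            /* Copy any other escape as a pair, so its second character
             * is never mistaken for the start of a new escape. */
            if (o + 2 >= out_size) {
                return (size_t)-1;
            }
            out[o++] = in[i++];
            out[o++] = in[i++];
        } else {
            if (o + 1 >= out_size) {
                return (size_t)-1;
            }
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
    return o;
}
```

With such a step the pattern could be compiled without PCRE_JAVASCRIPT_COMPAT, so the restriction on a lone `]` would not apply; whether that trade-off is worth the extra pass over every pattern is a separate question.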
(conflicts resolved)
… array of known offsets with middle indices)
…red if captured groups present or -line boundary matching used)
…here another significantly slower:
% regexp -type p -inline {^([\w]+)://([^/\s?#]+)([^\s?#]*)(?:\?([^\s#]*)?(?:#([^\s]*))?)?$} "http://[email protected]/uri?args#id"
- 1.308844 µs/# 764033 # 764033 #/sec 1000.000 net-ms
+ 0.918862 µs/# 1088303 # 1088303 #/sec 1000.000 net-ms
% proc test {s} { timerate { regexp -type p {\w+} $s } }; test " [string repeat abc 10000] "
- 101.726 µs/# 9831 # 9830.3 #/sec 1000.070 net-ms
+ 168.865 µs/# 5922 # 5921.9 #/sec 1000.017 net-ms
% proc test {s} { timerate { regexp -type p {\w+} $s } }; test " [string repeat abc 100] "
- 1.166015 µs/# 857622 # 857622 #/sec 1000.000 net-ms
+ 1.811757 µs/# 551951 # 551950 #/sec 1000.001 net-ms
so maybe it must be optional (-jit option)
…NST_REGEXP by compiling of switch command
Another performance comparison: https://gist.github.com/sebres/5de04dac9426a47b683974f5919986e5 (greedy vs. non-greedy, without and with PCRE)
Artificial PR for core 8.5 with PCRE (and DFA) regular expressions engine.
I've rebased the original, incomplete patch onto a newer 8.5 (many conflicts resolved), relative to my branch sebres-8-5-timerate (in order to test performance as well).
Compared to the original variant provided by Jeffrey with patch pcre-20080121, it is complete:
Additionally:
`regexp -type dfa -inline {a|ab|abc} -abc-` returns dfa-alternatives `{abc ab a}` ;)
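The "all alternatives, longest first" result can be modelled with a toy C sketch. It is deliberately not a regex engine and not the code of this branch: `demo_all_matches` is a hypothetical function that handles literal alternatives only, finds the leftmost position where at least one of them matches, and returns the lengths of all alternatives matching there in descending order, which mirrors the `{abc ab a}` result above.

```c
#include <stddef.h>
#include <string.h>

/* Toy model (literal alternatives only): find the leftmost position in
 * `subject` where at least one alternative matches, store the lengths of
 * all alternatives matching there in lens[] (longest first) and return
 * how many were stored. *start receives the match position.
 * Returns 0 if nothing matches anywhere. */
size_t demo_all_matches(const char *subject,
                        const char *const *alts, size_t nalts,
                        size_t *start, size_t *lens, size_t max_lens)
{
    size_t slen = strlen(subject);
    for (size_t pos = 0; pos < slen; pos++) {
        size_t n = 0;
        for (size_t i = 0; i < nalts && n < max_lens; i++) {
            size_t alen = strlen(alts[i]);
            if (alen <= slen - pos
                    && strncmp(subject + pos, alts[i], alen) == 0) {
                /* Insertion sort keeps lens[] in descending order. */
                size_t j = n++;
                while (j > 0 && lens[j - 1] < alen) {
                    lens[j] = lens[j - 1];
                    j--;
                }
                lens[j] = alen;
            }
        }
        if (n > 0) {
            *start = pos;
            return n;
        }
    }
    return 0;
}
```

For the subject `-abc-` and the alternatives `a`, `ab`, `abc` this yields position 1 and the lengths 3, 2, 1. A backtracking (NFA-style) matcher would instead stop at the first alternative that succeeds.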
Todo's:
- `regexp -binary` for real binary capability through PCRE;
- `regexp -dict` for capturing named groups of PCRE.

As regards performance, both PCRE and DFA are much faster than the classic NFA of Tcl (up to 10 times, and still faster on large regexps).
Here is an excerpt as a foretaste:
Note that the DFA does not really have back-references here (it matches alternatives instead); regsub is used only to minimize the overhead of some Tcl internals during measurement (such as setting variables or creating the lists for -indices or -inline).
Tested with PCRE 8.40 up to 8.45.
If the TCT is interested, I'll rebase it to fossil as soon as possible and provide my 8.6 and 8.7 branches for this.
I'll simply skip this work (the rebase) if nobody needs it.
Ah, yes, don't forget: Thanks to Jeffrey for the original work!