Tcl core 8.5 with PCRE (and DFA) regular expressions engine #5

Open: 29 commits to be merged into base: core-8-5-branch

sebres (Owner) commented Nov 17, 2017

Artificial PR for core 8.5 with PCRE (and DFA) regular expressions engine.

I've rebased the original, incomplete patch onto a newer 8.5 (many conflicts resolved), on top of my branch sebres-8-5-timerate (in order to test performance as well).

Compared to the original variant provided by Jeffrey with patch pcre-20080121, it is complete:

  • many bugs fixed; regsub now works as expected and is more robust and faster;
  • has better backwards compatibility with the classic NFA regexp of Tcl,
    e. g. UCP (Unicode properties for \d, \w, etc.) and many others.

Additionally:

  • better UTF8 and UCP support;
  • types are compiled now;
  • I've fixed all known bugs (of the classic regexp) for this new engine;
  • added DFA type (still draft, but it works well), thus for example
    `regexp -type dfa -inline {a|ab|abc} -abc-` returns dfa-alternatives `{abc ab a}` ;)
  • new test cases added in order to explain the differences (advantages and disadvantages) of all 3 variants;
  • I also have a build for Windows, i.e. auto-scripts and a makefile for Windows, so if somebody needs them, just ask :)

Todos:

  • review and rewrite a few pieces of code to make them easier to understand and to avoid code duplication;
  • regexp -binary for real binary capability through PCRE;
  • regexp -dict for capturing named groups of PCRE;
  • provide a real DFA workspace (currently a fixed-size buffer on the stack) that is reallocated on demand if too small;
  • documentation.

As regards performance, PCRE as well as DFA are much faster than the classic NFA of Tcl (up to 10 times, and even faster on large regexps).
Here is an excerpt as a foretaste:

% timerate -calibrate {}

% foreach t {c p d} { 
    proc test_$t {} \
    [string map [list _REENG_ $t] \
      {puts _REENG_:[timerate {regsub -type _REENG_ -all -line {^((\d{2})-(\d{2})-(\d{2,4})|NULL)$} "10-10-2017\nNULL\n20-10-2017" {**\1**\2**}}]} 
]; puts "% [info body test_$t]"; test_$t }
% puts c:[timerate {regsub -type c -all -line {^((\d{2})-(\d{2})-(\d{2,4})|NULL)$} "10-10-2017\nNULL\n20-10-2017" {**\1**\2**}}]
-c:32.4819 µs/# 30763 # 30786.4 #/sec 999.241 net-ms
% puts p:[timerate {regsub -type p -all -line {^((\d{2})-(\d{2})-(\d{2,4})|NULL)$} "10-10-2017\nNULL\n20-10-2017" {**\1**\2**}}]
+p:1.266349 µs/# 774477 # 789671 #/sec 980.758 net-ms
% puts d:[timerate {regsub -type d -all -line {^((\d{2})-(\d{2})-(\d{2,4})|NULL)$} "10-10-2017\nNULL\n20-10-2017" {**\1**\2**}}]
+d:1.204760 µs/# 813269 # 830040 #/sec 979.794 net-ms

Note that the DFA does not really have back-references here (it matches alternatives instead); regsub is used only to minimize the overhead of some Tcl internals in the measurement (such as setting variables or creating the lists for -indices or -inline).

Tested with PCRE 8.40 up to 8.45.

If the TCT is interested, I'll rebase it to fossil as soon as possible and provide my 8.6 and 8.7 branches for this.
I'd rather spare myself this work (the rebase) if nobody needs it.

Ah, yes, don't forget: Thanks to Jeffrey for the original work!

…2e77] from [1520767fff];

rebased to sebres-8-5-timerate, conflicts resolved (fixes of Tcl_RegexpObjCmd from tclCmdMZ.c going to tclRegexp.c#TclRegexpClassic, etc)
… additionally linker options (e. g. to compile with PCRE), for example:

set PCREDIR="..\..\lib\pcre"
nmake -nologo -f makefile.vc release OPTS=threads,thrdalloc OPTIMIZATIONS="-DHAVE_PCRE -I%PCREDIR% -Ox -Ot -Oi -Gs" ADDLINKOPTS="%PCREDIR%\pcre.lib"
… interp engine if specified as parameter), test cases extended in order to cover this situation;

bug fixing (test-cases repaired) in case of no HAVE_PCRE;
several optimizations and code review (e. g. option -type compiled now if token is a simple word);
Several known bugs of classic regexp-engine fixed in PCRE, thus the constraints deactivated now for this test-cases (in -pcre mode);
Test cases extended for several PCRE features.
PCRE almost ready implemented now (todo: binary recognition resp. option "-binary" or "-bytearray", normalize code of Tcl_RegsubObjCmd because too large ATM).
…now in classic also) and offsets (PCRE vectors), etc;

Test cases extended;
sebres changed the title from "core 8.5 with PCRE (and DFA) regular expressions engine" to "Tcl core 8.5 with PCRE (and DFA) regular expressions engine" Nov 17, 2017
…storage;

Common handling for fast access of reStorage rewritten (one pointer instead of multiple references);
Bugs fixed.
sebres (Owner, Author) commented Nov 17, 2017

Implemented a real DFA workspace that is reallocated on demand if too small; small code review.
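The "start on the stack, reallocate on demand" idea can be sketched roughly as follows. This is only an illustration, not the code from this branch: `MatchProc`, `RunWithGrowingWorkspace`, `FakeMatch`, `WS_STACK_SIZE` and `WS_TOO_SMALL` are hypothetical names, and `WS_TOO_SMALL` merely stands in for the error a DFA matcher such as `pcre_dfa_exec` reports when its workspace is exhausted (`PCRE_ERROR_DFA_WSSIZE`).

```c
#include <assert.h>
#include <stdlib.h>

#define WS_STACK_SIZE 64    /* hypothetical initial workspace size (ints) */
#define WS_TOO_SMALL  (-1)  /* stand-in for PCRE_ERROR_DFA_WSSIZE */

/* Hypothetical matcher signature: returns WS_TOO_SMALL if wscount is insufficient. */
typedef int (MatchProc)(int *workspace, int wscount, void *clientData);

/*
 * Run the matcher with a stack workspace first; on WS_TOO_SMALL double the
 * size on the heap and retry from scratch, up to maxCount ints.
 */
static int
RunWithGrowingWorkspace(MatchProc *match, void *clientData, int maxCount)
{
    int stackWs[WS_STACK_SIZE];
    int *ws = stackWs, *heapWs = NULL;
    int count = WS_STACK_SIZE;
    int rc;

    assert(match != NULL);
    while ((rc = match(ws, count, clientData)) == WS_TOO_SMALL) {
        int *newWs;

        if (count > maxCount / 2) {
            break;              /* limit reached, report WS_TOO_SMALL */
        }
        count *= 2;
        newWs = (int *) realloc(heapWs, (size_t) count * sizeof(int));
        if (newWs == NULL) {
            break;              /* out of memory, heapWs is still valid */
        }
        ws = heapWs = newWs;
    }
    free(heapWs);               /* free(NULL) is a no-op */
    return rc;
}

/* Demo matcher: pretends to need *(int *)clientData workspace slots. */
static int
FakeMatch(int *workspace, int wscount, void *clientData)
{
    int needed = *(int *) clientData;

    if (wscount < needed) {
        return WS_TOO_SMALL;
    }
    if (needed > 0) {
        workspace[needed - 1] = 0;  /* touch the last slot that is needed */
    }
    return 1;
}
```

The common case (small patterns) never touches the heap, and the upper limit keeps a pathological pattern from growing the workspace without bound.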

@sebres sebres changed the base branch from sebres_8_5_timerate to core-8-5-branch June 29, 2021 13:52
sebres added 10 commits June 29, 2021 20:11
…ility tests;

illustrates wrong handling of indices using pcre engine - retrieves bytes offsets instead of char offsets (tests 'regexp-3.8m*-pcre' and 'regexp-17.3m*-pcre' fail)
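To illustrate the difference between the two kinds of offsets: PCRE in UTF-8 mode reports match positions as byte offsets, whereas Tcl's -indices are counted in characters. A minimal sketch of the conversion, assuming valid UTF-8 and an offset that lies on a character boundary, might look like the following (the helper name `Utf8ByteToCharOffset` is hypothetical; inside the Tcl core, `Tcl_NumUtfChars` serves a similar purpose):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the characters contained in the first byteOffset bytes of a UTF-8
 * string. Continuation bytes have the form 10xxxxxx, so every byte that is
 * not a continuation byte starts a new character.
 */
static size_t
Utf8ByteToCharOffset(const char *s, size_t byteOffset)
{
    size_t i, chars = 0;

    assert(s != NULL);
    for (i = 0; i < byteOffset; i++) {
        if (((unsigned char) s[i] & 0xC0) != 0x80) {
            chars++;
        }
    }
    return chars;
}
```

For the string "héllo", for example, the first "l" sits at byte offset 3 but at character offset 2, because "é" occupies two bytes.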
…ic nfa regexp, probably possible if RE gets recompiled on demand without PCRE_UTF8, but there are '\w' which could then confuse some byte sequences with chars, so to avoid regression let parse it as utf-8 now);

added regression test cases regexp-27.*
…gexp* parameter flags replaced all, inline, indices;

additional tests for regsub over multi-byte string (check correct initial offsets)
…ith PCRE_JAVASCRIPT_COMPAT (unfortunately it introduces another ugly restriction - lone closing square bracket in a pattern causes a compile-time error, but it can be escaped like "\]", so let use it unless \uXXXX can be compiled in tcl directly, e. g. by some pre-processor)
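The pre-processor alternative mentioned above could, as a rough sketch, rewrite each `\uXXXX` escape into the `\x{XXXX}` form that PCRE accepts in UTF-8 mode, so that PCRE_JAVASCRIPT_COMPAT would not be needed for this purpose. The function names `IsHex4` and `RewriteUnicodeEscapes` are hypothetical and not part of this branch; the sketch skips escaped backslashes but does not treat `\Q...\E` quoting specially.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if p starts with exactly four hex digits (stops safely at NUL). */
static int
IsHex4(const char *p)
{
    int i;

    for (i = 0; i < 4; i++) {
        if (!isxdigit((unsigned char) p[i])) {
            return 0;
        }
    }
    return 1;
}

/*
 * Copy pattern src to dst, rewriting \uXXXX as \x{XXXX}. Each rewrite adds
 * two bytes, so a dst of 2 * strlen(src) + 1 bytes is always sufficient.
 * Returns the length of the rewritten pattern.
 */
static size_t
RewriteUnicodeEscapes(const char *src, char *dst)
{
    char *out = dst;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (src[0] != '\\') {
            *out++ = *src++;            /* ordinary character */
        } else if (src[1] == 'u' && IsHex4(src + 2)) {
            memcpy(out, "\\x{", 3);     /* \uXXXX -> \x{XXXX} */
            memcpy(out + 3, src + 2, 4);
            out[7] = '}';
            out += 8;
            src += 6;
        } else {
            *out++ = *src++;            /* backslash ... */
            if (*src != '\0') {
                *out++ = *src++;        /* ... plus the escaped character */
            }
        }
    }
    *out = '\0';
    return (size_t) (out - dst);
}
```

With this approach a lone `]` in a pattern would keep its usual PCRE meaning, at the cost of one extra pass over the pattern before compilation.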
… array of known offsets with middle indices)
…red if captured groups present or -line boundary matching used)
…here another significantly slower:

  % regexp -type p -inline {^([\w]+)://([^/\s?#]+)([^\s?#]*)(?:\?([^\s#]*)?(?:#([^\s]*))?)?$} "http://[email protected]/uri?args#id"
  - 1.308844 µs/# 764033 # 764033 #/sec 1000.000 net-ms
  + 0.918862 µs/# 1088303 # 1088303 #/sec 1000.000 net-ms
  % proc test {s} { timerate { regexp -type p {\w+} $s } }; test "  [string repeat abc 10000]  "
  - 101.726 µs/# 9831 # 9830.3 #/sec 1000.070 net-ms
  + 168.865 µs/# 5922 # 5921.9 #/sec 1000.017 net-ms
  % proc test {s} { timerate { regexp -type p {\w+} $s } }; test "  [string repeat abc 100]  "
  - 1.166015 µs/# 857622 # 857622 #/sec 1000.000 net-ms
  + 1.811757 µs/# 551951 # 551950 #/sec 1000.001 net-ms
so maybe it should be optional (-jit option)
sebres (Owner, Author) commented Nov 12, 2024

Another performance comparison - https://gist.github.com/sebres/5de04dac9426a47b683974f5919986e5 (greedy vs. non-greedy without and with PCRE)
