regex slowness #7
I also have a performance problem. I have three versions of the program:
I measure two times: regex pre-compile time and processing time. Program #1: 1.7s / 1.5s. All versions are fed the same regex and input. I then followed the documentation on profiling Cython code and got the following result:

```
   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   652884   16.477    0.000   25.349    0.000  re2.pyx:394(_search)
     9479    6.059    0.001   41.806    0.004  export_plain.py:60(match)
   652884    4.243    0.000   33.602    0.000  {method 'search' of 're2.Pattern' objects}
   652884    4.010    0.000   29.359    0.000  re2.pyx:442(search)
   652884    3.056    0.000    3.056    0.000  re2.pyx:114(__init__)
   652953    2.145    0.000    2.145    0.000  {isinstance}
   652884    2.002    0.000    2.002    0.000  re2.pyx:123(__dealloc__)
   652953    1.911    0.000    1.911    0.000  re2.pyx:75(unicode_to_bytestring)
   652953    1.902    0.000    1.902    0.000  re2.pyx:86(pystring_to_bytestring)
        1    0.330    0.330   42.492   42.492  export_plain.py:98(export_fname)
     9479    0.173    0.000    0.173    0.000  {built-in method sub}
    10000    0.120    0.000    0.120    0.000  {method 'split' of 'str' objects}
     8967    0.063    0.000    0.099    0.000  re2.pyx:801(get)
    10069    0.061    0.000    0.061    0.000  {method 'strip' of 'str' objects}
       69    0.043    0.001    0.146    0.002  re2.pyx:806(prepare_pattern)
     9036    0.038    0.000    0.038    0.000  re2.pyx:788(__next)
       69    0.022    0.000    0.169    0.002  re2.pyx:905(_compile)
        1    0.005    0.005    0.177    0.177  export_plain.py:36(load)
       69    0.002    0.000    0.003    0.000  re2.pyx:784(__init__)
       69    0.001    0.000    0.170    0.002  re2.pyx:763(compile)
       38    0.001    0.000    0.001    0.000  {method 'write' of 'file' objects}
       69    0.001    0.000    0.171    0.002  {re2.compile}
        1    0.001    0.001   42.669   42.669  export_plain.py:160(main)
        3    0.000    0.000    0.000    0.000  {open}
       69    0.000    0.000    0.000    0.000  {method 'append' of 'list' objects}
       19    0.000    0.000    0.000    0.000  {method 'join' of 'str' objects}
        1    0.000    0.000    0.000    0.000  genericpath.py:38(isdir)
        1    0.000    0.000   42.669   42.669  export_plain.py:153(run_re2_test)
        1    0.000    0.000    0.000    0.000  {posix.stat}
        4    0.000    0.000    0.000    0.000  {time.time}
        1    0.000    0.000    0.000    0.000  posixpath.py:59(join)
        1    0.000    0.000   42.670   42.670  <string>:1(<module>)
        1    0.000    0.000    0.000    0.000  {method 'encode' of 'unicode' objects}
        3    0.000    0.000    0.000    0.000  {method 'rfind' of 'str' objects}
        2    0.000    0.000    0.000    0.000  posixpath.py:109(basename)
        1    0.000    0.000    0.000    0.000  posixpath.py:117(dirname)
        1    0.000    0.000    0.000    0.000  stat.py:40(S_ISDIR)
        2    0.000    0.000    0.000    0.000  {len}
        1    0.000    0.000    0.000    0.000  {method 'extend' of 'list' objects}
        1    0.000    0.000    0.000    0.000  {method 'startswith' of 'str' objects}
        1    0.000    0.000    0.000    0.000  {method 'endswith' of 'str' objects}
        1    0.000    0.000    0.000    0.000  stat.py:24(S_IFMT)
        1    0.000    0.000    0.000    0.000  {method '__enter__' of 'file' objects}
        1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}
```

It looks like the _search function (re2.pyx:393) costs most of the time. I guess the actual Match function (re2.pyx:424) accounts for most of the time within it,
but I don't understand how it can be so different from the pure C version.
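The heaviest per-call rows above (`re2.pyx:114(__init__)`, `unicode_to_bytestring`, `pystring_to_bytestring`) each run 652,884 times, so a fixed per-match wrapper cost plausibly dominates rather than the matching itself. A minimal sketch of isolating that per-call cost with the stdlib `re` module (the pattern, input, and function name here are illustrative assumptions, not from the issue):

```python
import re
import timeit

# Illustrative pattern and input; the issue's actual regex and data are not shown.
pattern = re.compile(r'(?P<word>\w+)')
text = 'profiling wrapper overhead'

def calls_per_second(n=100_000):
    # Time n search() calls. Each call pays a fixed cost (argument handling,
    # Match-object allocation) on top of the engine's actual matching work --
    # the same kind of overhead the profile above points at in the wrapper.
    elapsed = timeit.timeit(lambda: pattern.search(text), number=n)
    return n / elapsed
```

Comparing this figure between `re` and `re2` on the same pattern would show how much of the gap is wrapper overhead rather than engine speed.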
Played with re2/pyre2 recently to estimate its performance improvements over the standard re module, and found a noticeable slowdown instead:

```
In [1]: import re

In [2]: import re2

In [3]: r1 = re.compile(r'^(?:(?P<_db>databases/(?P<db_name>.+)/(?P<db_value>.+))|(?P<_col>collections/(?P<col_name>.+)/(?P<col_value>.+))|(?P<_u>users/(?P<u_name>.+)/(?P<u_value>.+))|(?P<_gr>groups/(?P<gr_name>.+)/(?P<gr_value>.+))|(?P<_comb>combinations/(?P<comb_name>.+)/(?P<comb_value>.+))|(?P<_ind>indexes/(?P<ind_name>.+)/(?P<ind_value>.+))|(?P<_inv>invoices/(?P<inv_name>.+)/(?P<inv_value>.+))|(?P<_pl>plans/(?P<pl_name>.+)/(?P<pl_value>.+))|(?P<_srv>services/(?P<srv_name>.+)/(?P<srv_value>.+))|(?P<_req>requests/(?P<req_name>.+)/(?P<req_value>.+)))$')

In [4]: r2 = re2.compile(r'^(?:(?P<_db>databases/(?P<db_name>.+)/(?P<db_value>.+))|(?P<_col>collections/(?P<col_name>.+)/(?P<col_value>.+))|(?P<_u>users/(?P<u_name>.+)/(?P<u_value>.+))|(?P<_gr>groups/(?P<gr_name>.+)/(?P<gr_value>.+))|(?P<_comb>combinations/(?P<comb_name>.+)/(?P<comb_value>.+))|(?P<_ind>indexes/(?P<ind_name>.+)/(?P<ind_value>.+))|(?P<_inv>invoices/(?P<inv_name>.+)/(?P<inv_value>.+))|(?P<_pl>plans/(?P<pl_name>.+)/(?P<pl_value>.+))|(?P<_srv>services/(?P<srv_name>.+)/(?P<srv_value>.+))|(?P<_req>requests/(?P<req_name>.+)/(?P<req_value>.+)))$')

In [5]: %timeit r1.match('invoices/qCMSrSo5aBcQ6rqS3a6J1VQvRx9PRAec/BVGqqoiA92ws2BqaAZNh0YagTpRUUvyZ')

In [6]: %timeit r2.match('invoices/qCMSrSo5aBcQ6rqS3a6J1VQvRx9PRAec/BVGqqoiA92ws2BqaAZNh0YagTpRUUvyZ')
```

pyre2 is as of 543f228 (Sun Feb 5 10:49:45 2012 -0500), re2 is as of 7bf5e87ef4e8 (Sun Oct 21 10:48:11 2012 -0400)
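One possible contributing factor (my own assumption, not established in this thread): the ten-way alternation of named groups forces the engine to track thirty captures on every call. A hedged restructuring sketch, using the stdlib `re` module as a stand-in for re2, collapses the alternatives into a single pattern whose resource kind is one group, so only three captures are extracted:

```python
import re

# Restructured pattern: the resource kind becomes an alternation inside a
# single group instead of ten parallel branches with their own named groups.
# This rewrite is a suggestion for comparison, not the issue author's code.
FULL = re.compile(
    r'^(?P<kind>databases|collections|users|groups|combinations|'
    r'indexes|invoices|plans|services|requests)'
    r'/(?P<name>.+)/(?P<value>.+)$'
)

s = 'invoices/qCMSrSo5aBcQ6rqS3a6J1VQvRx9PRAec/BVGqqoiA92ws2BqaAZNh0YagTpRUUvyZ'
m = FULL.match(s)
```

Timing this variant under both engines would indicate whether capture-group count, rather than the engine itself, drives the slowdown.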
Using the latest pyre2 and RE2, when trying to process a big regex for webserver logs, Python's re is far quicker than pyre2.
weblog structure:

```
$domain $remote_addr $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$request_time"
```
One can reproduce the bug by adding this to tests/performance.py:
| test | Description | # total runs | re time(s) | re2 time(s) | % regex time |
|---|---|---|---|---|---|
| weblog scan | search and extract weblog data | 1 | 8.678 | 94.259 | 1086.13% |
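The tests/performance.py snippet itself is not shown above, but the quoted log format can be matched with a straightforward anchored pattern. A sketch of one such pattern (the field sub-expressions and the sample line are my own assumptions for illustration, using the stdlib `re` module):

```python
import re

# Hypothetical pattern for the log format quoted above; each field mirrors
# one variable in the format string. Not the issue's actual test code.
LOG_RE = re.compile(
    r'^(?P<domain>\S+) (?P<remote_addr>\S+) (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
    r'(?P<body_bytes_sent>\d+) "(?P<http_referer>[^"]*)" '
    r'"(?P<http_user_agent>[^"]*)" "(?P<request_time>[^"]*)"$'
)

# Invented sample line for demonstration.
line = ('example.com 203.0.113.5 - [21/Oct/2012:10:48:11 +0000] '
        '"GET /index.html HTTP/1.1" 200 512 "-" "curl/7.29.0" "0.003"')
m = LOG_RE.match(line)
```

Note the many capture groups: as with the pattern in the earlier comment, submatch extraction is a plausible place for the re/re2 gap to originate.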
I can't tell whether this is related to RE2, pyre2, or my code.