
regex slowness #7

Open
RaHus opened this issue Oct 11, 2011 · 2 comments

Comments


RaHus commented Oct 11, 2011

Using the latest pyre2 and RE2, when processing web server logs with a big regex, Python's re is far faster than pyre2.
Weblog structure: '$domain $remote_addr $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$request_time"'
One can reproduce the bug by adding this to tests/performance.py:


def getweblogdata():
    # Any large access log will do; mine is about 9 MB.
    return open('/var/log/apache2/access.log')

@register_test("weblog scan",
               r'^(([\w\d]|[\w][\w\d\-]*[\w\d])\.)*([\w]|[\w][\w\d\-]*[\w\d])[\s]([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})[\s]*(.+)[\s]\[(.*)\][\s]"(GET|HEAD|POST|PUT|DELETE|OPTIONS)[\s](.*)[\s](HTTP.*)"[\s]([\d]*)[\s]([\d]*)[\s]"(.*)"[\s]"(.+)"[\s]"(.+)"',
               1,
               data=getweblogdata())
def weblog_matches(pattern, data):
    """
    Match weblog data line by line.
    """
    total = 0
    for line in data:
        p = pattern.match(line)
        if p is None:
            continue
        total += len(p.groups())
    data.seek(0)
    return total

| test | description | # total runs | re time (s) | re2 time (s) | % regex time |
|------|-------------|--------------|-------------|--------------|--------------|
| weblog scan | search and extract weblog data | 1 | 8.678 | 94.259 | 1086.13% |

I can't tell whether this is related to RE2, pyre2, or my code.
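For anyone without a 9 MB access log at hand, here is a minimal, self-contained sketch of the same per-line comparison. The sample log line and its field values are made up to fit the structure described above; the regex is the one from the test. Swapping `re` for `re2` (if pyre2 is installed) times the other engine:

```python
import re
import timeit

# Hypothetical log line following the weblog structure from the report;
# all field values are invented for illustration.
LINE = ('example.com 192.168.0.1 - [11/Oct/2011:13:55:36 +0200] '
        '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0" "0.002"')

# The pattern from the performance test above, split for readability.
PATTERN = (r'^(([\w\d]|[\w][\w\d\-]*[\w\d])\.)*([\w]|[\w][\w\d\-]*[\w\d])[\s]'
           r'([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})[\s]*(.+)[\s]\[(.*)\][\s]'
           r'"(GET|HEAD|POST|PUT|DELETE|OPTIONS)[\s](.*)[\s](HTTP.*)"[\s]'
           r'([\d]*)[\s]([\d]*)[\s]"(.*)"[\s]"(.+)"[\s]"(.+)"')

def bench(module, n=1000):
    """Time n match() calls against a pre-compiled pattern."""
    pat = module.compile(PATTERN)
    return timeit.timeit(lambda: pat.match(LINE), number=n)

print('re : %.4fs for 1000 matches' % bench(re))
# To compare against pyre2, install it and run the same benchmark:
# import re2; print('re2: %.4fs for 1000 matches' % bench(re2))
```

Because the work is one `match()` call per line, any fixed per-call overhead in the binding is multiplied by the number of log lines, which is exactly the regime where a slow wrapper around a fast engine loses.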


vls commented Feb 17, 2012

I also have a performance problem.

I have three versions of the program:

  1. pure Python using built-in re module
  2. Python using Pyre2
  3. C/C++ using re2

I measured two times: regex pre-compile time and processing time.

| program | compile time (s) | process time (s) |
|---------|------------------|------------------|
| #1      | 1.7              | 1.5              |
| #2      | 0.04             | 4.4              |
| #3      | 0.02             | 0.8              |

All three are fed the same regex and input.

Then I followed the Cython profiling documentation and got the following result:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   652884   16.477    0.000   25.349    0.000 re2.pyx:394(_search)
     9479    6.059    0.001   41.806    0.004 export_plain.py:60(match)
   652884    4.243    0.000   33.602    0.000 {method 'search' of 're2.Pattern' objects}
   652884    4.010    0.000   29.359    0.000 re2.pyx:442(search)
   652884    3.056    0.000    3.056    0.000 re2.pyx:114(__init__)
   652953    2.145    0.000    2.145    0.000 {isinstance}
   652884    2.002    0.000    2.002    0.000 re2.pyx:123(__dealloc__)
   652953    1.911    0.000    1.911    0.000 re2.pyx:75(unicode_to_bytestring)
   652953    1.902    0.000    1.902    0.000 re2.pyx:86(pystring_to_bytestring)
        1    0.330    0.330   42.492   42.492 export_plain.py:98(export_fname)
     9479    0.173    0.000    0.173    0.000 {built-in method sub}
    10000    0.120    0.000    0.120    0.000 {method 'split' of 'str' objects}
     8967    0.063    0.000    0.099    0.000 re2.pyx:801(get)
    10069    0.061    0.000    0.061    0.000 {method 'strip' of 'str' objects}
       69    0.043    0.001    0.146    0.002 re2.pyx:806(prepare_pattern)
     9036    0.038    0.000    0.038    0.000 re2.pyx:788(__next)
       69    0.022    0.000    0.169    0.002 re2.pyx:905(_compile)
        1    0.005    0.005    0.177    0.177 export_plain.py:36(load)
       69    0.002    0.000    0.003    0.000 re2.pyx:784(__init__)
       69    0.001    0.000    0.170    0.002 re2.pyx:763(compile)
       38    0.001    0.000    0.001    0.000 {method 'write' of 'file' objects}
       69    0.001    0.000    0.171    0.002 {re2.compile}
        1    0.001    0.001   42.669   42.669 export_plain.py:160(main)
        3    0.000    0.000    0.000    0.000 {open}
       69    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
       19    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        1    0.000    0.000    0.000    0.000 genericpath.py:38(isdir)
        1    0.000    0.000   42.669   42.669 export_plain.py:153(run_re2_test)
        1    0.000    0.000    0.000    0.000 {posix.stat}
        4    0.000    0.000    0.000    0.000 {time.time}
        1    0.000    0.000    0.000    0.000 posixpath.py:59(join)
        1    0.000    0.000   42.670   42.670 :1()
        1    0.000    0.000    0.000    0.000 {method 'encode' of 'unicode' objects}
        3    0.000    0.000    0.000    0.000 {method 'rfind' of 'str' objects}
        2    0.000    0.000    0.000    0.000 posixpath.py:109(basename)
        1    0.000    0.000    0.000    0.000 posixpath.py:117(dirname)
        1    0.000    0.000    0.000    0.000 stat.py:40(S_ISDIR)
        2    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
        1    0.000    0.000    0.000    0.000 stat.py:24(S_IFMT)
        1    0.000    0.000    0.000    0.000 {method '__enter__' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

It looks like the _search function (re2.pyx:393) costs most of the time. I guess the actual Match function (re2.pyx:424) accounts for most of the time inside it:

result = self.re_pattern.Match(sp[0], <int>pos, <int>size, anchoring, m.matches, self.ngroups + 1)

but I don't understand how it can be so different from the pure C version.

PS:
Pyre2 revision : commit 543f228
re2 revision : changeset: 79:0c439a6bd795
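A back-of-the-envelope reading of the profile above suggests the gap is per-call wrapper overhead rather than RE2 itself. Summing the tottime (exclusive time) of the rows that run once per search() call gives a rough per-call cost; note cProfile inflates absolute numbers, so this overstates the real cost, but the scaling with call count is the point:

```python
# Rough estimate from the cProfile output above: wrapper bookkeeping
# (the _search body, per-call Match object __init__/__dealloc__, string
# conversions, isinstance checks) is paid once per search() call.
calls = 652884

wrapper_tottime = (
    16.477   # re2.pyx:394 (_search)
    + 4.243  # method 'search' of 're2.Pattern' objects
    + 4.010  # re2.pyx:442 (search)
    + 3.056  # re2.pyx:114 (__init__)
    + 2.145  # {isinstance}
    + 2.002  # re2.pyx:123 (__dealloc__)
    + 1.911  # re2.pyx:75 (unicode_to_bytestring)
    + 1.902  # re2.pyx:86 (pystring_to_bytestring)
)
per_call_us = wrapper_tottime / calls * 1e6
print('wrapper overhead: %.1fs total, ~%.0f us per call'
      % (wrapper_tottime, per_call_us))
```

Tens of microseconds of fixed overhead per call dwarfs RE2's actual matching time on short inputs, which is consistent with the C/C++ version finishing in 0.8 s while the pyre2 version takes 4.4 s on the same data.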


amyodov commented Nov 8, 2012

I played with re2/pyre2 recently to estimate its performance improvement over the standard re module, and found a noticeable slowdown instead:

In [1]: import re

In [2]: import re2

In [3]: r1 = re.compile(r'^(?:(?P<_db>databases/(?P<db_name>.+)/(?P<db_value>.+))|(?P<_col>collections/(?P<col_name>.+)/(?P<col_value>.+))|(?P<_u>users/(?P<u_name>.+)/(?P<u_value>.+))|(?P<_gr>groups/(?P<gr_name>.+)/(?P<gr_value>.+))|(?P<_comb>combinations/(?P<comb_name>.+)/(?P<comb_value>.+))|(?P<_ind>indexes/(?P<ind_name>.+)/(?P<ind_value>.+))|(?P<_inv>invoices/(?P<inv_name>.+)/(?P<inv_value>.+))|(?P<_pl>plans/(?P<pl_name>.+)/(?P<pl_value>.+))|(?P<_srv>services/(?P<srv_name>.+)/(?P<srv_value>.+))|(?P<_req>requests/(?P<req_name>.+)/(?P<req_value>.+)))$')

In [4]: r2 = re2.compile(r'^(?:(?P<_db>databases/(?P<db_name>.+)/(?P<db_value>.+))|(?P<_col>collections/(?P<col_name>.+)/(?P<col_value>.+))|(?P<_u>users/(?P<u_name>.+)/(?P<u_value>.+))|(?P<_gr>groups/(?P<gr_name>.+)/(?P<gr_value>.+))|(?P<_comb>combinations/(?P<comb_name>.+)/(?P<comb_value>.+))|(?P<_ind>indexes/(?P<ind_name>.+)/(?P<ind_value>.+))|(?P<_inv>invoices/(?P<inv_name>.+)/(?P<inv_value>.+))|(?P<_pl>plans/(?P<pl_name>.+)/(?P<pl_value>.+))|(?P<_srv>services/(?P<srv_name>.+)/(?P<srv_value>.+))|(?P<_req>requests/(?P<req_name>.+)/(?P<req_value>.+)))$')

In [5]: %timeit r1.match('invoices/qCMSrSo5aBcQ6rqS3a6J1VQvRx9PRAec/BVGqqoiA92ws2BqaAZNh0YagTpRUUvyZ')
100000 loops, best of 3: 6.57 us per loop

In [6]: %timeit r2.match('invoices/qCMSrSo5aBcQ6rqS3a6J1VQvRx9PRAec/BVGqqoiA92ws2BqaAZNh0YagTpRUUvyZ')
1000 loops, best of 3: 272 us per loop

pyre2 is as of 543f228 (Sun Feb 5 10:49:45 2012 -0500), re2 is as of 7bf5e87ef4e8 (Sun Oct 21 10:48:11 2012 -0400)
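The %timeit comparison above can be reproduced as a plain script using the stdlib timeit module. This sketch shortens the pattern to three of the ten alternation branches for readability (the full version behaves the same way); replace `re` with `re2` to time the other engine:

```python
import re
import timeit

# Three of the ten named-group alternation branches from the comment above.
PATTERN = (r'^(?:(?P<_db>databases/(?P<db_name>.+)/(?P<db_value>.+))'
           r'|(?P<_inv>invoices/(?P<inv_name>.+)/(?P<inv_value>.+))'
           r'|(?P<_req>requests/(?P<req_name>.+)/(?P<req_value>.+)))$')

TEXT = ('invoices/qCMSrSo5aBcQ6rqS3a6J1VQvRx9PRAec/'
        'BVGqqoiA92ws2BqaAZNh0YagTpRUUvyZ')

pat = re.compile(PATTERN)
m = pat.match(TEXT)
# The greedy .+ backtracks so inv_name stops at the last '/'.
assert m is not None and m.group('inv_name') == 'qCMSrSo5aBcQ6rqS3a6J1VQvRx9PRAec'

# Time 10000 match() calls; swap `re` for `re2` to compare engines.
elapsed = timeit.timeit(lambda: pat.match(TEXT), number=10000)
print('re: %.1f us per match' % (elapsed / 10000 * 1e6))
```

The heavy use of named capture groups is relevant here: each re2 match must materialize every submatch through the Cython wrapper, so per-match conversion cost grows with the group count.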
