regex-based parser
Uses ua-parser/uap-rust#3

- add an optional dependency on `ua-parser-rs` under the `regex` key
- add a regex-based parser

misc:

- update the classifiers
- bump requires-python to 3.9 (3.8 is basically EOL)
- update CI to better split up the steps
- fix up the check for binary pyyaml: requirements_dev was removed in
  81da21a, in May 2023, so this hasn't been working for 18 months
- fix CLI script to correctly handle optional modules so it can run on
  pypy and graal, add regex, make tracemalloc optional as pypy doesn't
  support it (didn't check graal)
- update tox: remove cpython 3.8, pypy 3.8 and 3.9 from tox (3.8's
  last supporting release was 7.3.11 in May 2023, 3.9's was 7.3.16 in
  April 2024), add graal

Fixes ua-parser#166
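
The optional-module handling mentioned above can be sketched as follows. This is only an illustration, not the code from this commit: `load_optional`, `pick` and `REGISTRY` are hypothetical names, and `json.loads` stands in for an installed backend. Binding a missing backend to `None` (rather than leaving the name unbound) lets the use site fail with a readable message.

```python
import importlib
from typing import Any, Callable, Dict, Optional


def load_optional(module: str, attr: str) -> Optional[Any]:
    """Return `attr` from `module`, or None when the module is not installed."""
    try:
        return getattr(importlib.import_module(module), attr)
    except ImportError:
        return None


def pick(name: str, registry: Dict[str, Optional[Callable[..., Any]]]) -> Callable[..., Any]:
    """Look up a factory by name, exiting with a readable message if unusable."""
    factory = registry.get(name)
    if factory is None:
        raise SystemExit(f"parser {name!r} is unknown or not installed")
    return factory


# Stand-ins for the demo: `json.loads` plays an installed backend,
# the bogus module name plays a missing optional dependency.
REGISTRY = {
    "present": load_optional("json", "loads"),
    "absent": load_optional("no_such_module_for_demo", "Resolver"),
}

print(pick("present", REGISTRY)("[1, 2]"))
```

Requesting `"absent"` raises `SystemExit` with the message above instead of a `NameError` deep inside the script.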
masklinn committed Oct 13, 2024
1 parent 022ab80 commit 1a4b20c
Showing 9 changed files with 196 additions and 69 deletions.
53 changes: 18 additions & 35 deletions .github/workflows/ci.yml
@@ -2,15 +2,8 @@ name: CI

on:
push:
branches: [ '*' ]
pull_request:
branches: [ '*' ]
workflow_dispatch:
schedule:
# cron is kinda random, assumes 22:00 UTC is a low ebb, eastern
# countries are very early morning, and US are mid-day to
# mid-afternoon
- cron: '0 22 * * 2'

jobs:
checks:
@@ -79,7 +72,6 @@ jobs:
test:
runs-on: ubuntu-latest
needs: compile
continue-on-error: ${{ matrix.python-version == '3.13' || matrix.python-version == 'pypy-3.11' }}
strategy:
fail-fast: false
matrix:
@@ -88,19 +80,14 @@ jobs:
- sdist
- source
python-version:
- "3.8"
- "3.9"
- "3.10"
- "3.11"
- "3.12"
- "3.13"
- "pypy-3.8"
- "pypy-3.9"
- "pypy-3.10"
# - "pypy-3.11"
# don't enable graal because it's slower than even pypy and
# fails because oracle/graalpython#385
# - "graalpy-23"
- "graalpy-24"
include:
- source: sdist
artifact: dist/*.tar.gz
@@ -116,34 +103,30 @@ jobs:
with:
python-version: ${{ matrix.python-version }}
allow-prereleases: true
- name: Install test dependencies
run: |
python -mpip install --upgrade pip
# cyaml is outright broken on pypy
if ! ${{ startsWith(matrix.python-version, 'pypy-') }}; then
# if binary wheels are not available for the current
# package install libyaml-dev so we can install pyyaml
# from source
if ! pip download --only-binary pyyaml -rrequirements_dev.txt > /dev/null 2>&1; then
sudo apt install libyaml-dev
fi
- run: python -mpip install --upgrade pip
- run: |
# if binary wheels are not available for the current
# package install libyaml-dev so we can install pyyaml
# from source
if ! pip download --only-binary :all: pyyaml > /dev/null 2>&1; then
sudo apt install libyaml-dev
fi
python -mpip install pytest pyyaml
# re2 is basically impossible to install from source so don't
# bother, and suppress installation failure so the test does
# not fail (re2 tests will just be skipped for versions /
# implementations for which google does not provide a binary
# wheel)
python -mpip install --only-binary :all: google-re2 || true
- run: python -mpip install pytest pyyaml
# install rs accelerator if available, ignore if not
- run: python -mpip install ua-parser-rs || true
# re2 is basically impossible to install from source so don't
# bother, and suppress installation failure so the test does
# not fail (re2 tests will just be skipped for versions /
# implementations for which google does not provide a binary
# wheel)
- run: 'python -mpip install --only-binary :all: google-re2 || true'
- name: download ${{ matrix.source }} artifact
if: matrix.artifact
uses: actions/download-artifact@v4
with:
name: ${{ matrix.source }}
path: dist/
- name: install package in environment
run: |
pip install ${{ matrix.artifact || '.' }}
run: pip install ${{ matrix.artifact || '.' }}
- name: run tests
run: pytest -v -Werror -Wignore::ImportWarning --doctest-glob="*.rst" -ra
2 changes: 2 additions & 0 deletions doc/conf.py
@@ -19,9 +19,11 @@
rst_epilog = """
.. |pyyaml| replace:: ``PyYaml``
.. |re2| replace:: ``google-re2``
.. |regex| replace:: ``regex``
.. _pyyaml: https://pyyaml.org
.. _re2: https://pypi.org/project/google-re2
.. _regex: https://pypi.org/project/ua-parser-rs
"""

# -- General configuration ---------------------------------------------------
18 changes: 11 additions & 7 deletions doc/installation.rst
@@ -5,11 +5,14 @@ Installation
Python Version
==============

ua-parser currently supports Python 3.8 and newer, as well as recent
versions of PyPy supporting the same standards.
ua-parser currently supports CPython 3.9 and newer, recent Pypy
(supporting 3.10), and Graal 24.

.. note:: While PyPy is supported, it is not *fast*, and google-re2 is
not supported on it.
.. note::

While pypy and graal are supported, they are rather slow when using
pure python mode and ``[re2]`` is not supported, so using the
``[regex]`` feature is very strongly recommended.

Installation
============
@@ -21,13 +24,14 @@ Installation
Optional Dependencies
=====================

ua-parser currently has two optional dependencies, |re2|_ and
|pyyaml|_. These dependencies will be detected and used automatically
ua-parser currently has three optional dependencies, |regex|_, |re2|_ and
|pyyaml|_. These dependencies will be detected and used automatically
if installed, but can also be installed via and alongside ua-parser:

.. code-block:: sh
$ pip install 'ua-parser[regex]'
$ pip install 'ua-parser[re2]'
$ pip install 'ua-parser[yaml]'
$ pip install 'ua-parser[re2,yaml]'
$ pip install 'ua-parser[regex,yaml]'
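
As a rough way to check which of these optional dependencies are present in an environment, one can probe for their import names without importing them. This is a sketch under assumptions: `installed_extras` is a hypothetical helper, `ua_parser_rs` is the import name used by the new `regex.py`, and `re2` and `yaml` are assumed to be the import names of `google-re2` and `PyYaml`.

```python
from importlib.util import find_spec
from typing import Dict, List

# extra name -> top-level module it is assumed to provide
OPTIONAL: Dict[str, str] = {
    "regex": "ua_parser_rs",
    "re2": "re2",
    "yaml": "yaml",
}


def installed_extras(mapping: Dict[str, str] = OPTIONAL) -> List[str]:
    """List the extras whose module can be located, without importing it."""
    return [extra for extra, module in mapping.items() if find_spec(module) is not None]


print(installed_extras())
```

The output depends on what is installed, so none is shown here.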
22 changes: 17 additions & 5 deletions pyproject.toml
@@ -7,9 +7,8 @@ name = "ua-parser"
description = "Python port of Browserscope's user agent parser"
version = "1.0.0a1"
readme = "README.rst"
requires-python = ">=3.8"
requires-python = ">=3.9"
dependencies = []
optional-dependencies = { yaml = ["PyYaml"], re2 = ["google-re2"] }

license = {text = "Apache 2.0"}
urls = {repository = "https://github.com/ua-parser/uap-python"}
@@ -35,20 +34,33 @@ classifiers = [
"Topic :: Internet :: WWW/HTTP",
"Topic :: Software Development :: Libraries :: Python Modules",
"Programming Language :: Python",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy"
"Programming Language :: Python :: Implementation :: PyPy",
# no graalpy classifier yet (pypa/trove-classifiers#188)
# "Programming Language :: Python :: Implementation :: GraalPy",
]

[project.optional-dependencies]
yaml = ["PyYaml"]
re2 = ["google-re2"]
regex = ["ua-parser-rs"]

[tool.setuptools.packages.find]
where = ["src"]

[tool.setuptools.package-data]
"ua_parser" = ["py.typed"]

[tool.ruff]
exclude = [
"src/ua_parser/_lazy.py",
"src/ua_parser/_matchers.py",
]

[tool.ruff.lint]
select = ["F", "E", "W", "I", "RET", "RUF", "PT"]
ignore = [
@@ -63,7 +75,7 @@ known-first-party = ["ua_parser"]
combine-as-imports = true

[tool.mypy]
python_version = "3.8"
python_version = "3.9"
files = "src,tests"

# can't use strict because it's only global
14 changes: 9 additions & 5 deletions setup.py
@@ -67,16 +67,20 @@ def run(self) -> None:
dest_lazy = outdir / "_lazy.py"
dest_legacy = outdir / "_regexes.py"

with dest.open("wb") as eager, dest_lazy.open("wb") as lazy, dest_legacy.open(
"wb"
) as legacy:
with (
dest.open("wb") as eager,
dest_lazy.open("wb") as lazy,
dest_legacy.open("wb") as legacy,
):
eager = EagerWriter(eager)
lazy = LazyWriter(lazy)
legacy = LegacyWriter(legacy)

for section in ["user_agent_parsers", "os_parsers", "device_parsers"]:
with eager.section(section), lazy.section(section), legacy.section(
section
with (
eager.section(section),
lazy.section(section),
legacy.section(section),
):
extract = EXTRACTORS[section]
for p in regexes[section]:
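
The parenthesized `with` form used in this hunk is only officially documented from Python 3.10 onwards (CPython 3.9's parser happens to accept it). As a hedged alternative that does not rely on that syntax, several files can be opened together with `contextlib.ExitStack`. The helper `write_all` and the file names below are illustrative, not taken from `setup.py`.

```python
import contextlib
import tempfile
from pathlib import Path
from typing import List


def write_all(outdir: Path, names: List[str], payload: bytes) -> List[Path]:
    """Open every file at once, write the payload, and close them all on exit."""
    paths = [outdir / name for name in names]
    with contextlib.ExitStack() as stack:
        # enter_context registers each file so it is closed when the stack exits,
        # even if a later open() or write() raises.
        handles = [stack.enter_context(path.open("wb")) for path in paths]
        for handle in handles:
            handle.write(payload)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = write_all(Path(tmp), ["eager.py", "lazy.py", "legacy.py"], b"# generated\n")
    contents = [path.read_bytes() for path in written]

print(contents)
```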
39 changes: 31 additions & 8 deletions src/ua_parser/__main__.py
@@ -11,7 +11,7 @@
import sys
import threading
import time
import tracemalloc
import types
from typing import (
Any,
Callable,
@@ -38,12 +38,21 @@
)
from .caching import Cache, Local
from .loaders import load_builtins, load_yaml
from .re2 import Resolver as Re2Resolver

try:
from .re2 import Resolver as Re2Resolver
except ImportError:
pass
try:
from .regex import Resolver as RegexResolver
except ImportError:
pass
from .user_agent_parser import Parse

CACHEABLE = {
"basic": True,
"re2": True,
"regex": True,
"legacy": False,
}

@@ -58,6 +67,17 @@
]
)

try:
import tracemalloc
except ImportError:
snapshot = types.SimpleNamespace(
compare_to=lambda _1, _2: [],
)
tracemalloc = types.SimpleNamespace( # type: ignore
start=lambda: None,
take_snapshot=lambda: snapshot,
)


def get_rules(parsers: List[str], regexes: Optional[io.IOBase]) -> Matchers:
if regexes:
@@ -178,6 +198,8 @@ def get_parser(
r = BasicResolver(rules)
elif parser == "re2":
r = Re2Resolver(rules)
elif parser == "regex":
r = RegexResolver(rules)
else:
sys.exit(f"unknown parser {parser!r}")

@@ -327,6 +349,7 @@ def run_threaded(args: argparse.Namespace) -> None:
("locking-lru", CachingResolver(basic, caching.Lru(CACHESIZE))),
("local-lru", CachingResolver(basic, Local(lambda: caching.Lru(CACHESIZE)))),
("re2", Re2Resolver(load_builtins())),
("regex", RegexResolver(load_builtins())),
]
for name, resolver in resolvers:
print(f"{name:11}: ", end="", flush=True)
@@ -436,14 +459,14 @@ def __call__(
bench.add_argument(
"--bases",
nargs="+",
choices=["basic", "re2", "legacy"],
default=["basic", "re2", "legacy"],
choices=["basic", "re2", "regex", "legacy"],
default=["basic", "re2", "regex", "legacy"],
help="""Base resolvers to benchmark. `basic` is a linear search
through the regexes file, `re2` is a prefiltered regex set
implemented in C++, `legacy` is the legacy API (essentially a
basic resolver with a clearing cache of fixed 200 entries, but
less layered so usually slightly faster than an equivalent
basic-based resolver).""",
implemented in C++, `regex` is a prefiltered regex set implemented
in Rust, `legacy` is the legacy API (essentially a basic resolver
with a clearing cache of fixed 200 entries, but less layered so
usually slightly faster than an equivalent basic-based resolver).""",
)
bench.add_argument(
"--caches",
76 changes: 76 additions & 0 deletions src/ua_parser/regex.py
@@ -0,0 +1,76 @@
__all__ = ["Resolver"]

from operator import attrgetter

import ua_parser_rs # type: ignore

from .core import (
Device,
Domain,
Matchers,
OS,
PartialResult,
UserAgent,
)


class Resolver:
ua: ua_parser_rs.UserAgentExtractor
os: ua_parser_rs.OSExtractor
de: ua_parser_rs.DeviceExtractor

def __init__(self, matchers: Matchers) -> None:
ua, os, de = matchers
self.ua = ua_parser_rs.UserAgentExtractor(
map(
attrgetter("regex", "family", "major", "minor", "patch", "patch_minor"),
ua,
)
)
self.os = ua_parser_rs.OSExtractor(
map(
attrgetter("regex", "family", "major", "minor", "patch", "patch_minor"),
os,
)
)
self.de = ua_parser_rs.DeviceExtractor(
map(
attrgetter("regex", "regex_flag", "family", "brand", "model"),
de,
)
)

def __call__(self, ua: str, domains: Domain, /) -> PartialResult:
user_agent = os = device = None
if Domain.USER_AGENT in domains:
if m := self.ua.extract(ua):
user_agent = UserAgent(
m.family,
m.major,
m.minor,
m.patch,
m.patch_minor,
)
if Domain.OS in domains:
if m := self.os.extract(ua):
os = OS(
m.family,
m.major,
m.minor,
m.patch,
m.patch_minor,
)
if Domain.DEVICE in domains:
if m := self.de.extract(ua):
device = Device(
m.family,
m.brand,
m.model,
)
return PartialResult(
domains=domains,
string=ua,
user_agent=user_agent,
os=os,
device=device,
)
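
The constructor above relies on `operator.attrgetter` with several names, which returns a callable producing one tuple per object. A small standalone sketch of that behaviour follows; `FakeMatcher` is a hypothetical stand-in, not a type from `ua_parser.core`.

```python
from dataclasses import dataclass
from operator import attrgetter
from typing import Optional


@dataclass(frozen=True)
class FakeMatcher:
    # Hypothetical stand-in exposing a few of the attributes read above.
    regex: str
    family: Optional[str] = None
    major: Optional[str] = None


# With several names, attrgetter returns a tuple of the attribute values,
# in the order the names were given.
fields = attrgetter("regex", "family", "major")

matchers = [
    FakeMatcher(r"Firefox/(\d+)", family="Firefox"),
    FakeMatcher(r"(Chrome)/(\d+)"),
]

rows = list(map(fields, matchers))
print(rows)
```

Each element of `rows` is a plain tuple, which is the shape an extension-module constructor can consume without knowing about the Python-side matcher classes.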