Skip to content

Commit

Permalink
perf: Parse HogQL with C++ for a huge speedup (#17659)
Browse files Browse the repository at this point in the history
* Add partial C++ parser of HogQL

* Support all the rules and add proper error handling

* Use `AlignAfterOpenBracket: BlockIndent`

* Reuse all the parser tests for the C++ backend

* Fix relationship between headers and implementations

* Add more error handling and fix minor issues

* Build both Python and C++ parsers in the package.json script

* Update ARRAY JOIN error assertion

* Improve timeit

* Move the C extension to the top level

* Refactor `vector_to_list_string`

* Build the parser on Linux

* Build wheels for the parser

* Simplify Linux build and fix macOS

* Fix Homebrew paths on x86 and don't fail fast in CI

* Set MACOSX_DEPLOYMENT_TARGET for C++20

* Set up QEMU for Linux ARM builds

* Publish the wheels on PyPI

* Avoiding Linux ARM emulation in CI for now

* Build sdist too

* Revert Dockerfile changes

* Fix PyPI publish

* Add README and optimize sdist build

* Use setup.py directly instead of build

* Use PyPI hogql-parser instead of local

* Also revert production-unit.Dockerfile

* Fix sdist upload and add Linux ARM back

* No Linux ARM build in the end

* Fix artifact uploading

* Do try building Linux ARM

We need this for prod.

* Use `npm` in `grammar:build`

`pnpm` is not available in that job.

* Fix formatting of hogql_parser

* Build everything on macOS

* Revert "Build everything on macOS"

Not so fast actually.

* Use hogql-parser=0.1.1

* Fix dylib in macOS wheel

* Bump hogql-parser version

* Fix missing module error

* Delete timeit.py

* Make error handling robust

* Format the C++

* Use `hogql-parser==0.1.1`

* Fix reserved keyword error assertions

* Use HEAD hogql_paresr in CI

* Fix `apt` usage

* Add some sudo in CI

* Ensure package will be releasable before build

* Bump version to 0.1.3

* Cover C++ `unquote_string` with tests

* Use BuildJet ARM runners for ARM builds

* Add some instructions

* Add HogQL version check to backend CI

* Update requirements.txt

* Use `setuptools` instead of the deprecated `distutils`

* Fix working dir in backend CI

* Align ANTLR versions

* Add test for "mismatched input"

This is thrown differently than other HogQLSyntaxExceptions in C++, so might help reveal what's going on with tests failing only on Linux CI and not macOS dev

* Add types and bump version

* Comment instead of failing version check

* Automate hogql-release version bump

* Fix checkout token

* Don't build hogql-parser if there were no changes

* Update query snapshots

* Update query snapshots

* Update query snapshots

* Update query snapshots

* Improve documentation

* Use new hogql-parser version

* Fix error start and end initialization

* Note `antlr4-cpp-runtime`

Co-authored-by: Marius Andra <[email protected]>

* Also remove NUL chars in C++

* Check ANTLR4 runtime archive checksum for security

* Note more decrefs to add

* Add vector size checks

* Use new hogql-parser version

* Don't support the `start` arg in C++ `parse_expr`

* Use new hogql-parser version

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Marius Andra <[email protected]>
  • Loading branch information
3 people authored Oct 13, 2023
1 parent d8e67c0 commit 16a71f6
Show file tree
Hide file tree
Showing 50 changed files with 18,577 additions and 1,328 deletions.
2 changes: 1 addition & 1 deletion .dockerignore
Original file line number Diff line number Diff line change
Expand Up @@ -33,4 +33,4 @@
!plugin-server/.prettierrc
!share/GeoLite2-City.mmdb
!hogvm/python
!unit.json
!unit.json
41 changes: 36 additions & 5 deletions .github/actions/run-backend-tests/action.yml
Original file line number Diff line number Diff line change
Expand Up @@ -49,11 +49,17 @@ runs:
python-version: ${{ inputs.python-version }}
token: ${{ inputs.token }}

- name: Determine if hogql-parser has changed compared to master
shell: bash
id: hogql-parser-diff
run: |
changed=$(git diff --quiet HEAD master -- hogql_parser/ && echo "false" || echo "true")
echo "::set-output name=changed::$changed"
- name: Install SAML (python3-saml) dependencies
shell: bash
run: |
sudo apt-get update
sudo apt-get install libxml2-dev libxmlsec1-dev libxmlsec1-openssl
sudo apt-get update && sudo apt-get install libxml2-dev libxmlsec1-dev libxmlsec1-openssl
- uses: syphar/restore-virtualenv@v1
id: cache-backend-tests
Expand All @@ -63,12 +69,37 @@ runs:
- uses: syphar/restore-pip-download-cache@v1
if: steps.cache-backend-tests.outputs.cache-hit != 'true'

- name: Install python dependencies
- name: Install Python dependencies
if: steps.cache-backend-tests.outputs.cache-hit != 'true'
shell: bash
run: |
python -m pip install -r requirements-dev.txt
python -m pip install -r requirements.txt
pip install -r requirements.txt -r requirements-dev.txt
- name: Install the working version of hogql-parser
if: steps.hogql-parser-diff.outputs.changed == 'true'
shell: bash
# This is not cached currently, as it's important to build the current HEAD version of hogql-parser if it has
# changed (requirements.txt has the already-published version)
run: |
sudo apt-get install libboost-all-dev unzip cmake curl uuid pkg-config
curl https://www.antlr.org/download/antlr4-cpp-runtime-4.13.0-source.zip --output antlr4-source.zip
# Check that the downloaded archive is the expected runtime - a security measure
anltr_known_md5sum="ff214b65fb02e150b4f515d7983bca92"
antlr_found_ms5sum="$(md5sum antlr4-source.zip | cut -d' ' -f1)"
if [[ "$anltr_known_md5sum" != "$antlr_found_ms5sum" ]]; then
echo "Unexpected MD5 sum of antlr4-source.zip!"
echo "Known: $anltr_known_md5sum"
echo "Found: $antlr_found_ms5sum"
exit 64
fi
unzip antlr4-source.zip -d antlr4-source && cd antlr4-source
cmake .
DESTDIR=out make install
sudo cp -r out/usr/local/include/antlr4-runtime /usr/include/
sudo cp out/usr/local/lib/libantlr4-runtime.so* /usr/lib/
sudo ldconfig
cd ..
pip install ./hogql_parser
- name: Set up needed files
shell: bash
Expand Down
130 changes: 130 additions & 0 deletions .github/workflows/build-hogql-parser.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,130 @@
name: Release hogql-parser

on:
push:
branches:
- master
paths:
- hogql_parser/**
- .github/workflows/build-hogql-parser.yml
pull_request:
paths:
- hogql_parser/**
- .github/workflows/build-hogql-parser.yml

concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true

jobs:
check-version:
name: Check version legitimacy
runs-on: ubuntu-22.04
outputs:
parser_any_changed: ${{ steps.changed-files-yaml.outputs.parser_any_changed }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # Fetching all for comparison since last push (not just last commit)

- name: Check if hogql_parser/ has changed
id: changed-files-yaml
uses: tj-actions/changed-files@v39
with:
since_last_remote_commit: true
files_yaml: |
parser:
- hogql_parser/**
- name: Notify about release needed
if: steps.changed-files-yaml.outputs.parser_any_changed == 'true'
shell: bash
run: |
published=$(curl -fSsl https://pypi.org/pypi/hogql-parser/json | jq -r '.info.version')
local=$(python hogql_parser/setup.py --version)
# TODO: Only comment if no comment alraedy exists for $local
if [[ "$published" == "$local" ]]; then
MESSAGE_BODY="It looks like the code of `hogql-parser` has changed since last push, but its version stayed the same at $local. 👀\nMake sure to resolve this in `hogql_parser/setup.py` before merging!"
curl -s -u posthog-bot:${{ secrets.POSTHOG_BOT_GITHUB_TOKEN || secrets.GITHUB_TOKEN }} -X POST -d "{ \"body\": \"$MESSAGE_BODY\" }" "https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.pull_request.number }}/comments"
fi
build-wheels:
name: Build wheels on ${{ matrix.os }}
needs: check-version
runs-on: ${{ matrix.os }}
timeout-minutes: 30
if: ${{ needs.check-version.outputs.parser_any_changed == 'true' }}
strategy:
matrix:
# As of October 2023, GitHub doesn't have ARM Actions runners… and ARM emulation is insanely slow
# (20x longer) on the Linux runners (while being reasonable on the macOS runners). Hence, we use
# BuildJet as a provider of ARM runners - this solution saves a lot of time and consequently some money.
os: [ubuntu-22.04, buildjet-2vcpu-ubuntu-2204-arm, macos-12]

steps:
- uses: actions/checkout@v4

- if: ${{ !endsWith(matrix.os, '-arm') }}
uses: actions/setup-python@v4
with:
python-version: '3.11'

- if: ${{ endsWith(matrix.os, '-arm') }}
uses: deadsnakes/[email protected] # Unfortunately actions/setup-python@v4 just doesn't work on ARM! This does
with:
python-version: '3.11'

- name: Build sdist
if: matrix.os == 'ubuntu-22.04' # Only build the sdist once
run: cd hogql_parser && python setup.py sdist

- name: Install cibuildwheel
run: python -m pip install cibuildwheel==2.16.*

- name: Build wheels
run: cd hogql_parser && python -m cibuildwheel --output-dir dist
env:
MACOSX_DEPLOYMENT_TARGET: '12' # A modern target allows us to use C++20

- uses: actions/upload-artifact@v3
with:
path: |
hogql_parser/dist/*.whl
hogql_parser/dist/*.tar.gz
if-no-files-found: error

publish:
name: Publish on PyPI
needs: build-wheels
environment: pypi-hogql-parser
permissions:
id-token: write
runs-on: ubuntu-22.04
steps:
- name: Fetch wheels
uses: actions/download-artifact@v3
with:
name: artifact
path: dist/

- name: Publish package to PyPI
uses: pypa/gh-action-pypi-publish@release/v1

- uses: actions/checkout@v4
with:
token: ${{ secrets.POSTHOG_BOT_GITHUB_TOKEN }}
ref: ${{ github.event.pull_request.head.ref }}

- name: Update hogql-parser in requirements
shell: bash
run: |
local=$(python hogql_parser/setup.py --version)
sed -i "s/hogql-parser==.*/hogql-parser==${local}/g" requirements.in
sed -i "s/hogql-parser==.*/hogql-parser==${local}/g" requirements.txt
- uses: EndBug/add-and-commit@v9
with:
add: '["requirements.in", "requirements.txt"]'
message: 'Use new hogql-parser version'
default_author: github_actions
github_token: ${{ secrets.POSTHOG_BOT_GITHUB_TOKEN }}
21 changes: 8 additions & 13 deletions .github/workflows/ci-backend.yml
Original file line number Diff line number Diff line change
Expand Up @@ -98,7 +98,6 @@ jobs:
- uses: actions/checkout@v3
with:
fetch-depth: 1
path: 'current/'

- name: Set up Python
uses: actions/setup-python@v4
Expand All @@ -119,38 +118,30 @@ jobs:
sudo apt-get update
sudo apt-get install libxml2-dev libxmlsec1 libxmlsec1-dev libxmlsec1-openssl
- name: Install python dependencies
- name: Install Python dependencies
if: steps.cache-backend-tests.outputs.cache-hit != 'true'
run: |
cd current
python -m pip install -r requirements.txt -r requirements-dev.txt
- name: Check for syntax errors, import sort, and code style violations
run: |
cd current
ruff .
- name: Check formatting
run: |
cd current
black --exclude posthog/hogql/grammar --check .
- name: Check static typing
run: |
cd current
mypy -p posthog --exclude bin/migrate_kafka_data.py --exclude posthog/hogql/grammar/HogQLParser.py --exclude gunicorn.config.py --enable-recursive-aliases
- name: Check if "schema.py" is up to date
run: |
cd current
npm run schema:build:python && git diff --exit-code
- name: Check if antlr definitions are up to date
- name: Check if ANTLR definitions are up to date
run: |
# Installing a version of ant compatible with what we use in development from homebrew (4.13)
# "apt-get install antlr" would install 4.7 which is incompatible with our grammar.
export ANTLR_VERSION=4.13.0
# java version doesn't matter
cd ..
sudo apt-get install default-jre
mkdir antlr
cd antlr
Expand All @@ -162,9 +153,13 @@ jobs:
export CLASSPATH=".:$PWD/antlr.jar:$CLASSPATH"
export PATH="$PWD:$PATH"
cd ../current
cd ../posthog
antlr | grep "Version"
npm run grammar:build && git diff --exit-code
env:
# Installing a version of ANTLR compatible with what's in Homebrew as of October 2023 (version 4.13),
# as apt-get is quite out of date. The same version must be set in hogql_parser/pyproject.toml
ANTLR_VERSION: '4.13.0'

check-migrations:
needs: changes
Expand Down
25 changes: 25 additions & 0 deletions .vscode/launch.json
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,31 @@
"console": "integratedTerminal",
"python": "${workspaceFolder}/env/bin/python",
"cwd": "${workspaceFolder}"
},
{
"name": "Pytest: Current File",
"type": "python",
"request": "launch",
"module": "pytest",
"args": ["${file}", "-vvv"],
"console": "integratedTerminal",
"justMyCode": true
},
{
"name": "(lldb) Attach",
"type": "cppdbg",
"request": "attach",
"program": "/Users/twixes/.pyenv/versions/3.10.10/envs/posthog-3.10/bin/python",
"MIMode": "lldb"
},
{
"name": "Python C++ Debugger: Current File",
"type": "pythoncpp",
"request": "launch",
"pythonConfig": "custom",
"pythonLaunchName": "Pytest: Current File",
"cppConfig": "custom",
"cppAttachName": "(lldb) Attach"
}
],
"compounds": [
Expand Down
3 changes: 3 additions & 0 deletions hogql_parser/.clang-format
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
BasedOnStyle: Chromium
ColumnLimit: 120
AlignAfterOpenBracket: BlockIndent
5 changes: 5 additions & 0 deletions hogql_parser/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# Build
build/
*.egg-info
*.so
dist/
50 changes: 50 additions & 0 deletions hogql_parser/CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
# Developing `hogql-parser`

## Mandatory reading

If you're new to Python C/C++ extensions, there are some things you must have in your mind.

### [Objects, Types and Reference Counts in CPython](https://docs.python.org/3/c-api/intro.html#objects-types-and-reference-counts)

Key takeaways:

1. `Py_INCREF()` and `Py_DECREF()` need to be used accurately, or there'll be memory leaks (or, less likely, segfaults).
1. `Py_None`, `Py_True`, and `Py_False` are singletons, but they still need to be incref'd/decref'd - the best way to do create a new reference to them is wrapping them in `Py_NewRef()`.
1. Pretty much only `PyList_SET_ITEM()` _steals_ references (i.e. assumes ownership of objects passed into it), if you pass an object into any other function and no longer need it after that - remember to `Py_DECREF` it!

### [Building Values in CPython](https://docs.python.org/3/c-api/arg.html#building-values)

Key takeaways:

1. Use `Py_BuildValue()` for building tuples, dicts, and lists of static size. Use type-specific functions (e.g. `PyUnicode_FromString()` or `PyList_New()`) otherwise.
1. `str`-building with `s` involves `strlen`, while `s#` doesn't - it's better to use the latter with C++ strings.
1. `object`-passing with `O` increments the object's refcount, while doing it with `N` doesn't - we should use `N` pretty much exclusively, because the parse tree converter is about creating new objects (not borrowing).

## Conventions

1. Use `snake_case`. ANTLR is `camelCase`-heavy because of its Java heritage, but both the C++ stdlib and CPython are snaky.
2. Use the `auto` type for ANTLR and ANTLR-derived types, since they can be pretty verbose. Otherwise specify the type explictly.
3. Stay out of Python land as long as possible. E.g. avoid using `PyObject*`s` for bools or strings.
Do use Python for parsing numbers though - that way we don't need to consider integer overflow.
4. If any child rule results in an AST node, so must the parent rule - once in Python land, always in Python land.
E.g. it doesn't make sense to create a `vector<PyObject*>`, that should just be a `PyObject*` of Python type `list`.

## How to develop locally on macOS

1. Install libraries:

```bash
brew install boost antlr4-cpp-runtime
```

1. Install `hogql_parser` by building from local sources:

```bash
pip install ./hogql_parser
```

1. If you now run tests, the locally-built version of `hogql_parser` will be used:

```bash
pytest posthog/hogql/
```
Loading

0 comments on commit 16a71f6

Please sign in to comment.