perf: Parse HogQL with C++ for a huge speedup #17659

Merged
merged 78 commits into from
Oct 13, 2023
Changes from 69 commits
Commits
46d04cf
Add partial C++ parser of HogQL
Twixes Sep 28, 2023
a8dc5d4
Support all the rules and add proper error handling
Twixes Oct 2, 2023
e53ec36
Use `AlignAfterOpenBracket: BlockIndent`
Twixes Oct 2, 2023
3be9872
Reuse all the parser tests for the C++ backend
Twixes Oct 2, 2023
7906414
Fix relationship between headers and implementations
Twixes Oct 2, 2023
b866ebf
Add more error handling and fix minor issues
Twixes Oct 2, 2023
b35c93d
Build both Python and C++ parsers in the package.json script
Twixes Oct 2, 2023
f2a62ea
Update ARRAY JOIN error assertion
Twixes Oct 2, 2023
102d0a1
Improve timeit
Twixes Oct 2, 2023
6a695af
Move the C extension to the top level
Twixes Oct 2, 2023
ef5101a
Refactor `vector_to_list_string`
Twixes Oct 2, 2023
ec8ea8b
Build the parser on Linux
Twixes Oct 4, 2023
4af40cc
Build wheels for the parser
Twixes Oct 5, 2023
b5101ac
Simplify Linux build and fix macOS
Twixes Oct 5, 2023
ea137d5
Fix Homebrew paths on x86 and don't fail fast in CI
Twixes Oct 5, 2023
c261a81
Set MACOSX_DEPLOYMENT_TARGET for C++20
Twixes Oct 5, 2023
841bddb
Set up QEMU for Linux ARM builds
Twixes Oct 5, 2023
2a8d760
Publish the wheels on PyPI
Twixes Oct 5, 2023
c7dcefe
Avoiding Linux ARM emulation in CI for now
Twixes Oct 5, 2023
30a6257
Build sdist too
Twixes Oct 5, 2023
37ae412
Revert Dockerfile changes
Twixes Oct 5, 2023
6a4eece
Fix PyPI publish
Twixes Oct 5, 2023
5bf907b
Add README and optimize sdist build
Twixes Oct 5, 2023
9c4f952
Use setup.py directly instead of build
Twixes Oct 6, 2023
39f1b52
Use PyPI hogql-parser instead of local
Twixes Oct 6, 2023
d4cd499
Also revert production-unit.Dockerfile
Twixes Oct 6, 2023
feccbd2
Fix sdist upload and add Linux ARM back
Twixes Oct 6, 2023
c142308
No Linux ARM build in the end
Twixes Oct 6, 2023
db1b7d9
Fix artifact uploading
Twixes Oct 6, 2023
8c363d6
Do try building Linux ARM
Twixes Oct 6, 2023
e1453dd
Use `npm` in `grammar:build`
Twixes Oct 6, 2023
e194c89
Fix formatting of hogql_parser
Twixes Oct 6, 2023
e54387b
Build everything on macOS
Twixes Oct 6, 2023
c30afea
Revert "Build everything on macOS"
Twixes Oct 6, 2023
ca1fa3c
Use hogql-parser=0.1.1
Twixes Oct 6, 2023
7e10a55
Fix dylib in macOS wheel
Twixes Oct 6, 2023
713bfec
Bump hogql-parser version
Twixes Oct 6, 2023
bfed4f6
Fix missing module error
Twixes Oct 6, 2023
1214e21
Delete timeit.py
Twixes Oct 6, 2023
58df63b
Make error handling robust
Twixes Oct 6, 2023
62e7497
Format the C++
Twixes Oct 6, 2023
cc7f5c6
Use `hogql-parser==0.1.1`
Twixes Oct 9, 2023
66a980d
Fix reserved keyword error assertions
Twixes Oct 9, 2023
0e53256
Use HEAD hogql_paresr in CI
Twixes Oct 9, 2023
cff8a67
Fix `apt` usage
Twixes Oct 9, 2023
507b78b
Add some sudo in CI
Twixes Oct 9, 2023
bfe4603
Ensure package will be releasable before build
Twixes Oct 9, 2023
0d089c4
Bump version to 0.1.3
Twixes Oct 9, 2023
81d3c4b
Cover C++ `unquote_string` with tests
Twixes Oct 9, 2023
f3dfb89
Use BuildJet ARM runners for ARM builds
Twixes Oct 9, 2023
e8f94ed
Add some instructions
Twixes Oct 9, 2023
d0b6647
Add HogQL version check to backend CI
Twixes Oct 9, 2023
729a897
Update requirements.txt
Twixes Oct 9, 2023
826b9d4
Use `setuptools` instead of the deprecated `distutils`
Twixes Oct 9, 2023
1dfd70c
Fix working dir in backend CI
Twixes Oct 9, 2023
99c5bdf
Align ANTLR versions
Twixes Oct 10, 2023
2da9925
Add test for "mismatched input"
Twixes Oct 10, 2023
4922e64
Add types and bump version
Twixes Oct 10, 2023
3051b6e
Comment instead of failing version check
Twixes Oct 10, 2023
c50bf47
Automate hogql-release version bump
Twixes Oct 10, 2023
39edff4
Fix checkout token
Twixes Oct 10, 2023
954e1ac
Merge branch 'master' into turbo-parser
Twixes Oct 10, 2023
d3c8f34
Don't build hogql-parser if there were no changes
Twixes Oct 10, 2023
2b78d5d
Update query snapshots
github-actions[bot] Oct 10, 2023
2f28676
Update query snapshots
github-actions[bot] Oct 10, 2023
e0854eb
Update query snapshots
github-actions[bot] Oct 10, 2023
527ec7e
Update query snapshots
github-actions[bot] Oct 10, 2023
19617f5
Improve documentation
Twixes Oct 11, 2023
7a89bf2
Use new hogql-parser version
github-actions[bot] Oct 11, 2023
4b5e8a4
Fix error start and end initialization
Twixes Oct 12, 2023
4096807
Note `antlr4-cpp-runtime`
Twixes Oct 12, 2023
5e5ba40
Also remove NUL chars in C++
Twixes Oct 12, 2023
debb5ed
Check ANTLR4 runtime archive checksum for security
Twixes Oct 12, 2023
640c159
Note more decrefs to add
Twixes Oct 12, 2023
eca0d64
Add vector size checks
Twixes Oct 12, 2023
6015bc4
Use new hogql-parser version
github-actions[bot] Oct 12, 2023
4d4d6c0
Don't support the `start` arg in C++ `parse_expr`
Twixes Oct 13, 2023
29980df
Use new hogql-parser version
github-actions[bot] Oct 13, 2023
2 changes: 1 addition & 1 deletion .dockerignore
@@ -33,4 +33,4 @@
!plugin-server/.prettierrc
!share/GeoLite2-City.mmdb
!hogvm/python
!unit.json
!unit.json
32 changes: 27 additions & 5 deletions .github/actions/run-backend-tests/action.yml
@@ -49,11 +49,17 @@ runs:
python-version: ${{ inputs.python-version }}
token: ${{ inputs.token }}

- name: Determine if hogql-parser has changed compared to master
shell: bash
id: hogql-parser-diff
run: |
changed=$(git diff --quiet HEAD master -- hogql_parser/ && echo "false" || echo "true")
echo "::set-output name=changed::$changed"

- name: Install SAML (python3-saml) dependencies
shell: bash
run: |
sudo apt-get update
sudo apt-get install libxml2-dev libxmlsec1-dev libxmlsec1-openssl
sudo apt-get update && sudo apt-get install libxml2-dev libxmlsec1-dev libxmlsec1-openssl

- uses: syphar/restore-virtualenv@v1
id: cache-backend-tests
@@ -63,12 +69,28 @@ runs:
- uses: syphar/restore-pip-download-cache@v1
if: steps.cache-backend-tests.outputs.cache-hit != 'true'

- name: Install python dependencies
- name: Install Python dependencies
if: steps.cache-backend-tests.outputs.cache-hit != 'true'
shell: bash
run: |
python -m pip install -r requirements-dev.txt
python -m pip install -r requirements.txt
pip install -r requirements.txt -r requirements-dev.txt

- name: Install the working version of hogql-parser
if: steps.hogql-parser-diff.outputs.changed == 'true'
shell: bash
# This is not cached currently, as it's important to build the current HEAD version of hogql-parser if it has
# changed (requirements.txt has the already-published version)
run: |
sudo apt-get install libboost-all-dev unzip cmake curl uuid pkg-config
curl https://www.antlr.org/download/antlr4-cpp-runtime-4.13.0-source.zip --output antlr4-source.zip
Collaborator

Might be worth checking the md5sum here?

Member Author

Good idea, added that!
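
The checksum check suggested above could be sketched roughly as follows. This is only an illustration using Python's `hashlib`, not the step that was actually added to the workflow: the helper name `file_digest_matches` and the sample bytes are hypothetical, and SHA-256 is assumed here in place of the MD5 mentioned in the comment.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def file_digest_matches(path: Path, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Return True if the file's digest equals the expected hex digest (hypothetical helper)."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        # Read in chunks so a large archive doesn't have to fit in memory
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            digest.update(chunk)
    # compare_digest performs a constant-time comparison of the two hex strings
    return hmac.compare_digest(digest.hexdigest(), expected_hex.lower())


# Demo on a throwaway file standing in for the downloaded archive
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "antlr4-source.zip"
    archive.write_bytes(b"pretend this is the ANTLR runtime archive")
    known_good = hashlib.sha256(archive.read_bytes()).hexdigest()
    ok = file_digest_matches(archive, known_good)
    tampered = file_digest_matches(archive, "0" * 64)
```

In a real CI step the expected digest would be pinned in the workflow itself rather than computed from the download, so that a swapped archive fails the build.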

unzip antlr4-source.zip -d antlr4-source && cd antlr4-source
cmake .
DESTDIR=out make install
sudo cp -r out/usr/local/include/antlr4-runtime /usr/include/
sudo cp out/usr/local/lib/libantlr4-runtime.so* /usr/lib/
sudo ldconfig
cd ..
pip install ./hogql_parser

- name: Set up needed files
shell: bash
130 changes: 130 additions & 0 deletions .github/workflows/build-hogql-parser.yml
@@ -0,0 +1,130 @@
name: Release hogql-parser

on:
push:
branches:
- master
paths:
- hogql_parser/**
- .github/workflows/build-hogql-parser.yml
pull_request:
paths:
- hogql_parser/**
- .github/workflows/build-hogql-parser.yml

concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true

jobs:
check-version:
name: Check version legitimacy
runs-on: ubuntu-22.04
outputs:
parser_any_changed: ${{ steps.changed-files-yaml.outputs.parser_any_changed }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # Fetching all for comparison since last push (not just last commit)

- name: Check if hogql_parser/ has changed
id: changed-files-yaml
uses: tj-actions/changed-files@v39
with:
since_last_remote_commit: true
files_yaml: |
parser:
- hogql_parser/**
- name: Notify about release needed
if: steps.changed-files-yaml.outputs.parser_any_changed == 'true'
shell: bash
run: |
published=$(curl -fSsl https://pypi.org/pypi/hogql-parser/json | jq -r '.info.version')
local=$(python hogql_parser/setup.py --version)
# TODO: Only comment if no comment already exists for $local
if [[ "$published" == "$local" ]]; then
MESSAGE_BODY="It looks like the code of \`hogql-parser\` has changed since last push, but its version stayed the same at $local. 👀\nMake sure to resolve this in \`hogql_parser/setup.py\` before merging!"
curl -s -u posthog-bot:${{ secrets.POSTHOG_BOT_GITHUB_TOKEN || secrets.GITHUB_TOKEN }} -X POST -d "{ \"body\": \"$MESSAGE_BODY\" }" "https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.pull_request.number }}/comments"
fi
build-wheels:
name: Build wheels on ${{ matrix.os }}
needs: check-version
runs-on: ${{ matrix.os }}
timeout-minutes: 30
if: ${{ needs.check-version.outputs.parser_any_changed == 'true' }}
strategy:
matrix:
# As of October 2023, GitHub doesn't have ARM Actions runners… and ARM emulation is insanely slow
# (20x longer) on the Linux runners (while being reasonable on the macOS runners). Hence, we use
# BuildJet as a provider of ARM runners - this solution saves a lot of time and consequently some money.
os: [ubuntu-22.04, buildjet-2vcpu-ubuntu-2204-arm, macos-12]

steps:
- uses: actions/checkout@v4

- if: ${{ !endsWith(matrix.os, '-arm') }}
uses: actions/setup-python@v4
with:
python-version: '3.11'

- if: ${{ endsWith(matrix.os, '-arm') }}
uses: deadsnakes/[email protected] # Unfortunately actions/setup-python@v4 just doesn't work on ARM! This does
with:
python-version: '3.11'

- name: Build sdist
if: matrix.os == 'ubuntu-22.04' # Only build the sdist once
run: cd hogql_parser && python setup.py sdist

- name: Install cibuildwheel
run: python -m pip install cibuildwheel==2.16.*

- name: Build wheels
run: cd hogql_parser && python -m cibuildwheel --output-dir dist
env:
MACOSX_DEPLOYMENT_TARGET: '12' # A modern target allows us to use C++20

- uses: actions/upload-artifact@v3
with:
path: |
hogql_parser/dist/*.whl
hogql_parser/dist/*.tar.gz
if-no-files-found: error

publish:
name: Publish on PyPI
needs: build-wheels
environment: pypi-hogql-parser
permissions:
id-token: write
runs-on: ubuntu-22.04
steps:
- name: Fetch wheels
uses: actions/download-artifact@v3
with:
name: artifact
path: dist/

- name: Publish package to PyPI
uses: pypa/gh-action-pypi-publish@release/v1

- uses: actions/checkout@v4
with:
token: ${{ secrets.POSTHOG_BOT_GITHUB_TOKEN }}
ref: ${{ github.event.pull_request.head.ref }}

- name: Update hogql-parser in requirements
shell: bash
run: |
local=$(python hogql_parser/setup.py --version)
sed -i "s/hogql-parser==.*/hogql-parser==${local}/g" requirements.in
sed -i "s/hogql-parser==.*/hogql-parser==${local}/g" requirements.txt
- uses: EndBug/add-and-commit@v9
with:
add: '["requirements.in", "requirements.txt"]'
message: 'Use new hogql-parser version'
default_author: github_actions
github_token: ${{ secrets.POSTHOG_BOT_GITHUB_TOKEN }}
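
For readers less familiar with `sed`, the "Update hogql-parser in requirements" step can be approximated in Python as below. This is a sketch under assumptions, not code from the PR: the function name `pin_hogql_parser`, the sample requirements text, and the version numbers are all made up for illustration.

```python
import re


def pin_hogql_parser(requirements_text: str, version: str) -> str:
    """Rewrite any `hogql-parser==<old>` pin to the given version (hypothetical helper)."""
    # Mirrors `sed "s/hogql-parser==.*/hogql-parser==<version>/g"`: `.` doesn't match
    # newlines, so everything from the pin to the end of that line is replaced
    return re.sub(r"hogql-parser==.*", f"hogql-parser=={version}", requirements_text)


# Made-up requirements content, only for demonstration
sample = (
    "some-package==1.0.0\n"
    "hogql-parser==0.1.3\n"
    "    # via -r requirements.in\n"
    "other-package==2.0.0\n"
)
updated = pin_hogql_parser(sample, "0.1.7")
```

As with the `sed` expression, the pattern is not anchored to the start of a line, so a package whose name merely ends in `hogql-parser` would be rewritten too; that is acceptable only as long as no such package is listed.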
21 changes: 8 additions & 13 deletions .github/workflows/ci-backend.yml
@@ -98,7 +98,6 @@ jobs:
- uses: actions/checkout@v3
with:
fetch-depth: 1
path: 'current/'

- name: Set up Python
uses: actions/setup-python@v4
@@ -119,38 +118,30 @@ jobs:
sudo apt-get update
sudo apt-get install libxml2-dev libxmlsec1 libxmlsec1-dev libxmlsec1-openssl

- name: Install python dependencies
- name: Install Python dependencies
if: steps.cache-backend-tests.outputs.cache-hit != 'true'
run: |
cd current
python -m pip install -r requirements.txt -r requirements-dev.txt

- name: Check for syntax errors, import sort, and code style violations
run: |
cd current
ruff .

- name: Check formatting
run: |
cd current
black --exclude posthog/hogql/grammar --check .

- name: Check static typing
run: |
cd current
mypy -p posthog --exclude bin/migrate_kafka_data.py --exclude posthog/hogql/grammar/HogQLParser.py --exclude gunicorn.config.py --enable-recursive-aliases

- name: Check if "schema.py" is up to date
run: |
cd current
npm run schema:build:python && git diff --exit-code

- name: Check if antlr definitions are up to date
- name: Check if ANTLR definitions are up to date
run: |
# Installing a version of ant compatible with what we use in development from homebrew (4.13)
# "apt-get install antlr" would install 4.7 which is incompatible with our grammar.
export ANTLR_VERSION=4.13.0
# java version doesn't matter
cd ..
sudo apt-get install default-jre
mkdir antlr
cd antlr
@@ -162,9 +153,13 @@ jobs:
export CLASSPATH=".:$PWD/antlr.jar:$CLASSPATH"
export PATH="$PWD:$PATH"

cd ../current
cd ../posthog
antlr | grep "Version"
npm run grammar:build && git diff --exit-code
env:
# Installing a version of ANTLR compatible with what's in Homebrew as of October 2023 (version 4.13),
# as apt-get is quite out of date. The same version must be set in hogql_parser/pyproject.toml
ANTLR_VERSION: '4.13.0'

check-migrations:
needs: changes
25 changes: 25 additions & 0 deletions .vscode/launch.json
@@ -96,6 +96,31 @@
"console": "integratedTerminal",
"python": "${workspaceFolder}/env/bin/python",
"cwd": "${workspaceFolder}"
},
{
"name": "Pytest: Current File",
"type": "python",
"request": "launch",
"module": "pytest",
"args": ["${file}", "-vvv"],
"console": "integratedTerminal",
"justMyCode": true
},
{
"name": "(lldb) Attach",
"type": "cppdbg",
"request": "attach",
"program": "/Users/twixes/.pyenv/versions/3.10.10/envs/posthog-3.10/bin/python",
"MIMode": "lldb"
},
{
"name": "Python C++ Debugger: Current File",
Member Author

Awesome for debugging a Python C++ extension: you can set breakpoints in C++ code and they work perfectly when this debug config is run in e.g. test_parser_cpp.py

"type": "pythoncpp",
"request": "launch",
"pythonConfig": "custom",
"pythonLaunchName": "Pytest: Current File",
"cppConfig": "custom",
"cppAttachName": "(lldb) Attach"
}
],
"compounds": [
3 changes: 3 additions & 0 deletions hogql_parser/.clang-format
@@ -0,0 +1,3 @@
BasedOnStyle: Chromium
ColumnLimit: 120
AlignAfterOpenBracket: BlockIndent
5 changes: 5 additions & 0 deletions hogql_parser/.gitignore
@@ -0,0 +1,5 @@
# Build
build/
*.egg-info
*.so
dist/
50 changes: 50 additions & 0 deletions hogql_parser/CONTRIBUTING.md
@@ -0,0 +1,50 @@
# Developing `hogql-parser`

## Mandatory reading
Collaborator

This is great, I applaud you for writing this 👏


If you're new to Python C/C++ extensions, there are some things you must keep in mind.

### [Objects, Types and Reference Counts in CPython](https://docs.python.org/3/c-api/intro.html#objects-types-and-reference-counts)

Key takeaways:

1. `Py_INCREF()` and `Py_DECREF()` need to be used accurately, or there'll be memory leaks (or, less likely, segfaults).
1. `Py_None`, `Py_True`, and `Py_False` are singletons, but they still need to be incref'd/decref'd - the best way to create a new reference to them is wrapping them in `Py_NewRef()`.
1. Pretty much only `PyList_SET_ITEM()` _steals_ references (i.e. assumes ownership of objects passed into it). If you pass an object into any other function and no longer need it after that, remember to `Py_DECREF()` it!
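
The counting model behind these takeaways can be observed from the Python side with `sys.getrefcount()`. The snippet below is only an illustration on CPython, not code from the parser: taking a reference (here, by storing the object in a list) raises the count by one, and releasing it brings the count back, which is what `Py_INCREF()` and `Py_DECREF()` do by hand in C/C++.

```python
import sys

obj = object()  # a fresh object that only this variable refers to

before = sys.getrefcount(obj)
container = [obj]  # the list now holds its own reference to obj
after_append = sys.getrefcount(obj)
del container  # destroying the list releases that reference again
after_del = sys.getrefcount(obj)

print(after_append - before, after_del - before)
```

The absolute numbers are deliberately not shown: `sys.getrefcount()` counts its own argument as a temporary reference, so only the differences are meaningful.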

### [Building Values in CPython](https://docs.python.org/3/c-api/arg.html#building-values)

Key takeaways:

1. Use `Py_BuildValue()` for building tuples, dicts, and lists of static size. Use type-specific functions (e.g. `PyUnicode_FromString()` or `PyList_New()`) otherwise.
1. `str`-building with `s` involves `strlen`, while `s#` doesn't - it's better to use the latter with C++ strings.
1. `object`-passing with `O` increments the object's refcount, while doing it with `N` doesn't - we should use `N` pretty much exclusively, because the parse tree converter is about creating new objects (not borrowing).

## Conventions

1. Use `snake_case`. ANTLR is `camelCase`-heavy because of its Java heritage, but both the C++ stdlib and CPython are snaky.
2. Use the `auto` type for ANTLR and ANTLR-derived types, since they can be pretty verbose. Otherwise specify the type explicitly.
3. Stay out of Python land as long as possible. E.g. avoid using `PyObject*`s for bools or strings.
Do use Python for parsing numbers though - that way we don't need to consider integer overflow.
4. If any child rule results in an AST node, so must the parent rule - once in Python land, always in Python land.
E.g. it doesn't make sense to create a `vector<PyObject*>`, that should just be a `PyObject*` of Python type `list`.
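
As a small illustration of why convention 3 defers number parsing to Python: Python's `int` is arbitrary-precision, so a literal that would not fit a fixed-width C++ integer still parses exactly. The literal below is a made-up example, not one taken from the test suite.

```python
literal = "18446744073709551616"  # 2**64, one past the largest unsigned 64-bit value

value = int(literal)  # Python ints are arbitrary-precision, so this cannot overflow
fits_in_uint64 = value <= 2**64 - 1

print(value, fits_in_uint64)
```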

## How to develop locally on macOS

1. Install libraries:

```bash
brew install boost antlr
```

1. Install `hogql_parser` by building from local sources:

```bash
pip install ./hogql_parser
```

1. If you now run tests, the locally-built version of `hogql_parser` will be used:

```bash
pytest posthog/hogql/
```