-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
perf: Parse HogQL with C++ for a huge speedup (#17659)
* Add partial C++ parser of HogQL * Support all the rules and add proper error handling * Use `AlignAfterOpenBracket: BlockIndent` * Reuse all the parser tests for the C++ backend * Fix relationship between headers and implementations * Add more error handling and fix minor issues * Build both Python and C++ parsers in the package.json script * Update ARRAY JOIN error assertion * Improve timeit * Move the C extension to the top level * Refactor `vector_to_list_string` * Build the parser on Linux * Build wheels for the parser * Simplify Linux build and fix macOS * Fix Homebrew paths on x86 and don't fail fast in CI * Set MACOSX_DEPLOYMENT_TARGET for C++20 * Set up QEMU for Linux ARM builds * Publish the wheels on PyPI * Avoiding Linux ARM emulation in CI for now * Build sdist too * Revert Dockerfile changes * Fix PyPI publish * Add README and optimize sdist build * Use setup.py directly instead of build * Use PyPI hogql-parser instead of local * Also revert production-unit.Dockerfile * Fix sdist upload and add Linux ARM back * No Linux ARM build in the end * Fix artifact uploading * Do try building Linux ARM We need this for prod. * Use `npm` in `grammar:build` `pnpm` is not available in that job. * Fix formatting of hogql_parser * Build everything on macOS * Revert "Build everything on macOS" Not so fast actually. * Use hogql-parser=0.1.1 * Fix dylib in macOS wheel * Bump hogql-parser version * Fix missing module error * Delete timeit.py * Make error handling robust * Format the C++ * Use `hogql-parser==0.1.1` * Fix reserved keyword error assertions * Use HEAD hogql_paresr in CI * Fix `apt` usage * Add some sudo in CI * Ensure package will be releasable before build * Bump version to 0.1.3 * Cover C++ `unquote_string` with tests * Use BuildJet ARM runners for ARM builds * Add some instructions * Add HogQL version check to backend CI * Update requirements.txt * Use `setuptools` instead of the deprecated `distutils` * Fix working dir in backend CI * Align ANTLR versions * Add test for "mismatched input" This is thrown differently than other HogQLSyntaxExceptions in C++, so might help reveal what's going on with tests failing only on Linux CI and not macOS dev * Add types and bump version * Comment instead of failing version check * Automate hogql-release version bump * Fix checkout token * Don't build hogql-parser if there were no changes * Update query snapshots * Update query snapshots * Update query snapshots * Update query snapshots * Improve documentation * Use new hogql-parser version * Fix error start and end initialization * Note `antlr4-cpp-runtime` Co-authored-by: Marius Andra <[email protected]> * Also remove NUL chars in C++ * Check ANTLR4 runtime archive checksum for security * Note more decrefs to add * Add vector size checks * Use new hogql-parser version * Don't support the `start` arg in C++ `parse_expr` * Use new hogql-parser version --------- Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Marius Andra <[email protected]>
- Loading branch information
Showing
50 changed files
with
18,577 additions
and
1,328 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -33,4 +33,4 @@ | |
!plugin-server/.prettierrc | ||
!share/GeoLite2-City.mmdb | ||
!hogvm/python | ||
!unit.json | ||
!unit.json |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,130 @@ | ||
name: Release hogql-parser | ||
|
||
on: | ||
push: | ||
branches: | ||
- master | ||
paths: | ||
- hogql_parser/** | ||
- .github/workflows/build-hogql-parser.yml | ||
pull_request: | ||
paths: | ||
- hogql_parser/** | ||
- .github/workflows/build-hogql-parser.yml | ||
|
||
concurrency: | ||
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} | ||
cancel-in-progress: true | ||
|
||
jobs: | ||
check-version: | ||
name: Check version legitimacy | ||
runs-on: ubuntu-22.04 | ||
outputs: | ||
parser_any_changed: ${{ steps.changed-files-yaml.outputs.parser_any_changed }} | ||
steps: | ||
- uses: actions/checkout@v4 | ||
with: | ||
fetch-depth: 0 # Fetching all for comparison since last push (not just last commit) | ||
|
||
- name: Check if hogql_parser/ has changed | ||
id: changed-files-yaml | ||
uses: tj-actions/changed-files@v39 | ||
with: | ||
since_last_remote_commit: true | ||
files_yaml: | | ||
parser: | ||
- hogql_parser/** | ||
- name: Notify about release needed | ||
if: steps.changed-files-yaml.outputs.parser_any_changed == 'true' | ||
shell: bash | ||
run: | | ||
published=$(curl -fSsl https://pypi.org/pypi/hogql-parser/json | jq -r '.info.version') | ||
local=$(python hogql_parser/setup.py --version) | ||
# TODO: Only comment if no comment alraedy exists for $local | ||
if [[ "$published" == "$local" ]]; then | ||
MESSAGE_BODY="It looks like the code of `hogql-parser` has changed since last push, but its version stayed the same at $local. 👀\nMake sure to resolve this in `hogql_parser/setup.py` before merging!" | ||
curl -s -u posthog-bot:${{ secrets.POSTHOG_BOT_GITHUB_TOKEN || secrets.GITHUB_TOKEN }} -X POST -d "{ \"body\": \"$MESSAGE_BODY\" }" "https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.pull_request.number }}/comments" | ||
fi | ||
build-wheels: | ||
name: Build wheels on ${{ matrix.os }} | ||
needs: check-version | ||
runs-on: ${{ matrix.os }} | ||
timeout-minutes: 30 | ||
if: ${{ needs.check-version.outputs.parser_any_changed == 'true' }} | ||
strategy: | ||
matrix: | ||
# As of October 2023, GitHub doesn't have ARM Actions runners… and ARM emulation is insanely slow | ||
# (20x longer) on the Linux runners (while being reasonable on the macOS runners). Hence, we use | ||
# BuildJet as a provider of ARM runners - this solution saves a lot of time and consequently some money. | ||
os: [ubuntu-22.04, buildjet-2vcpu-ubuntu-2204-arm, macos-12] | ||
|
||
steps: | ||
- uses: actions/checkout@v4 | ||
|
||
- if: ${{ !endsWith(matrix.os, '-arm') }} | ||
uses: actions/setup-python@v4 | ||
with: | ||
python-version: '3.11' | ||
|
||
- if: ${{ endsWith(matrix.os, '-arm') }} | ||
uses: deadsnakes/[email protected] # Unfortunately actions/setup-python@v4 just doesn't work on ARM! This does | ||
with: | ||
python-version: '3.11' | ||
|
||
- name: Build sdist | ||
if: matrix.os == 'ubuntu-22.04' # Only build the sdist once | ||
run: cd hogql_parser && python setup.py sdist | ||
|
||
- name: Install cibuildwheel | ||
run: python -m pip install cibuildwheel==2.16.* | ||
|
||
- name: Build wheels | ||
run: cd hogql_parser && python -m cibuildwheel --output-dir dist | ||
env: | ||
MACOSX_DEPLOYMENT_TARGET: '12' # A modern target allows us to use C++20 | ||
|
||
- uses: actions/upload-artifact@v3 | ||
with: | ||
path: | | ||
hogql_parser/dist/*.whl | ||
hogql_parser/dist/*.tar.gz | ||
if-no-files-found: error | ||
|
||
publish: | ||
name: Publish on PyPI | ||
needs: build-wheels | ||
environment: pypi-hogql-parser | ||
permissions: | ||
id-token: write | ||
runs-on: ubuntu-22.04 | ||
steps: | ||
- name: Fetch wheels | ||
uses: actions/download-artifact@v3 | ||
with: | ||
name: artifact | ||
path: dist/ | ||
|
||
- name: Publish package to PyPI | ||
uses: pypa/gh-action-pypi-publish@release/v1 | ||
|
||
- uses: actions/checkout@v4 | ||
with: | ||
token: ${{ secrets.POSTHOG_BOT_GITHUB_TOKEN }} | ||
ref: ${{ github.event.pull_request.head.ref }} | ||
|
||
- name: Update hogql-parser in requirements | ||
shell: bash | ||
run: | | ||
local=$(python hogql_parser/setup.py --version) | ||
sed -i "s/hogql-parser==.*/hogql-parser==${local}/g" requirements.in | ||
sed -i "s/hogql-parser==.*/hogql-parser==${local}/g" requirements.txt | ||
- uses: EndBug/add-and-commit@v9 | ||
with: | ||
add: '["requirements.in", "requirements.txt"]' | ||
message: 'Use new hogql-parser version' | ||
default_author: github_actions | ||
github_token: ${{ secrets.POSTHOG_BOT_GITHUB_TOKEN }} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
BasedOnStyle: Chromium | ||
ColumnLimit: 120 | ||
AlignAfterOpenBracket: BlockIndent |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# Build | ||
build/ | ||
*.egg-info | ||
*.so | ||
dist/ |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
# Developing `hogql-parser` | ||
|
||
## Mandatory reading | ||
|
||
If you're new to Python C/C++ extensions, there are some things you must have in your mind. | ||
|
||
### [Objects, Types and Reference Counts in CPython](https://docs.python.org/3/c-api/intro.html#objects-types-and-reference-counts) | ||
|
||
Key takeaways: | ||
|
||
1. `Py_INCREF()` and `Py_DECREF()` need to be used accurately, or there'll be memory leaks (or, less likely, segfaults). | ||
1. `Py_None`, `Py_True`, and `Py_False` are singletons, but they still need to be incref'd/decref'd - the best way to do create a new reference to them is wrapping them in `Py_NewRef()`. | ||
1. Pretty much only `PyList_SET_ITEM()` _steals_ references (i.e. assumes ownership of objects passed into it), if you pass an object into any other function and no longer need it after that - remember to `Py_DECREF` it! | ||
|
||
### [Building Values in CPython](https://docs.python.org/3/c-api/arg.html#building-values) | ||
|
||
Key takeaways: | ||
|
||
1. Use `Py_BuildValue()` for building tuples, dicts, and lists of static size. Use type-specific functions (e.g. `PyUnicode_FromString()` or `PyList_New()`) otherwise. | ||
1. `str`-building with `s` involves `strlen`, while `s#` doesn't - it's better to use the latter with C++ strings. | ||
1. `object`-passing with `O` increments the object's refcount, while doing it with `N` doesn't - we should use `N` pretty much exclusively, because the parse tree converter is about creating new objects (not borrowing). | ||
|
||
## Conventions | ||
|
||
1. Use `snake_case`. ANTLR is `camelCase`-heavy because of its Java heritage, but both the C++ stdlib and CPython are snaky. | ||
2. Use the `auto` type for ANTLR and ANTLR-derived types, since they can be pretty verbose. Otherwise specify the type explictly. | ||
3. Stay out of Python land as long as possible. E.g. avoid using `PyObject*`s` for bools or strings. | ||
Do use Python for parsing numbers though - that way we don't need to consider integer overflow. | ||
4. If any child rule results in an AST node, so must the parent rule - once in Python land, always in Python land. | ||
E.g. it doesn't make sense to create a `vector<PyObject*>`, that should just be a `PyObject*` of Python type `list`. | ||
|
||
## How to develop locally on macOS | ||
|
||
1. Install libraries: | ||
|
||
```bash | ||
brew install boost antlr4-cpp-runtime | ||
``` | ||
|
||
1. Install `hogql_parser` by building from local sources: | ||
|
||
```bash | ||
pip install ./hogql_parser | ||
``` | ||
|
||
1. If you now run tests, the locally-built version of `hogql_parser` will be used: | ||
|
||
```bash | ||
pytest posthog/hogql/ | ||
``` |
Oops, something went wrong.