Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Parse HogQL with C++ for a huge speedup #17659

Merged
merged 78 commits into from
Oct 13, 2023
Merged

perf: Parse HogQL with C++ for a huge speedup #17659

merged 78 commits into from
Oct 13, 2023

Conversation

Twixes
Copy link
Member

@Twixes Twixes commented Sep 28, 2023

Changes

This adds a new C++ parser of HogQL, which is a 10-100x speedup over the Python parser.

TODOs remaining:

  • Convert the remaining rule conversions
  • Sort out cross-platform compilation (include/lib paths, currently only macOS works)

How did you test this code?

The new parser must pass the same test suite as the Python parser.

@Twixes Twixes force-pushed the turbo-parser branch 2 times, most recently from ade2a44 to afc4224 Compare October 4, 2023 08:06
@Twixes Twixes had a problem deploying to pypi-hogql-parser October 5, 2023 20:36 — with GitHub Actions Failure
@Twixes Twixes had a problem deploying to pypi-hogql-parser October 5, 2023 20:53 — with GitHub Actions Failure
@Twixes Twixes had a problem deploying to pypi-hogql-parser October 5, 2023 21:25 — with GitHub Actions Failure
@Twixes Twixes had a problem deploying to pypi-hogql-parser October 5, 2023 22:22 — with GitHub Actions Failure
@Twixes Twixes temporarily deployed to pypi-hogql-parser October 6, 2023 07:44 — with GitHub Actions Inactive
@Twixes Twixes had a problem deploying to pypi-hogql-parser October 6, 2023 07:57 — with GitHub Actions Failure
@Twixes Twixes had a problem deploying to pypi-hogql-parser October 6, 2023 09:46 — with GitHub Actions Failure
@Twixes Twixes had a problem deploying to pypi-hogql-parser October 6, 2023 10:46 — with GitHub Actions Failure
@Twixes Twixes temporarily deployed to pypi-hogql-parser October 6, 2023 12:21 — with GitHub Actions Inactive
@Twixes Twixes force-pushed the turbo-parser branch 3 times, most recently from ac4e44c to 32e0901 Compare October 6, 2023 15:11
@Twixes Twixes temporarily deployed to pypi-hogql-parser October 6, 2023 15:32 — with GitHub Actions Inactive
@Twixes Twixes had a problem deploying to pypi-hogql-parser October 6, 2023 18:07 — with GitHub Actions Failure
@Twixes Twixes marked this pull request as ready for review October 9, 2023 08:02
@Twixes Twixes requested a review from mariusandra October 9, 2023 08:02
@Twixes Twixes force-pushed the turbo-parser branch 3 times, most recently from 83b0e7c to 4149e36 Compare October 9, 2023 10:29
@Twixes Twixes had a problem deploying to pypi-hogql-parser October 9, 2023 12:12 — with GitHub Actions Failure
@Twixes Twixes force-pushed the turbo-parser branch 4 times, most recently from 9e06dc1 to a3396ed Compare October 9, 2023 14:50
@Twixes Twixes temporarily deployed to pypi-hogql-parser October 11, 2023 08:10 — with GitHub Actions Inactive
static PyObject* method_parse_expr(PyObject* self, PyObject* args, PyObject* kwargs) {
parser_state* state = get_module_state(self);
const char* str;
int start; // TODO: Determine if this `start` kwarg of `parse_expr` is needed for anything
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not quite understood how the start arg of parse_expr is for in the Python original. Do we need it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only use is this.

Basically in case the subquery here has a parser error:

image

.. then we won't show the error at a random location in the query.

Since we're anyway replacing the subquery with a join, this doesn't really matter... and it can be solved differently regardless. So feel free to remove and make life easier here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Removed this bit in C++ then

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was renamed to _test_parser.py, but Git doesn't see it that way because of the factory pattern's added indentation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was renamed to _test_parse_string.py, but Git doesn't see it that way because of the factory pattern's added indentation

self.assertEqual(parse_string("`a\\asd`"), "a\asd")
self.assertEqual(parse_string("`a\\vsd`"), "a\vsd")
self.assertEqual(parse_string("`a\\\\sd`"), "a\\sd")
self.assertEqual(parse_string("`a\\0sd`"), "asd")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only change in this whole test suite: NUL is no longer supported, because Python ignores it in PyUnicode_FromStringAndSize. This is surprising to me, but I really couldn't figure out a way to keep the NUL in the str constructed from C++

self.assertEqual(e.exception.start, 7)
self.assertEqual(e.exception.end, 16)

def test_malformed_sql(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only addition to this test suite

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason with hogql-parser installed, pytest no longer wants to collect these manage.py command tests unless the posthog.management.commands is a module, which requires __init__.py

"MIMode": "lldb"
},
{
"name": "Python C++ Debugger: Current File",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome for debugging a Python C++ extension, can set breakpoints in C++ code and they work perfectly when this debug config is ran in e.g. test_parser_cpp.py

Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got about half way...

# changed (requirements.txt has the already-published version)
run: |
sudo apt-get install libboost-all-dev unzip cmake curl uuid pkg-config
curl https://www.antlr.org/download/antlr4-cpp-runtime-4.13.0-source.zip --output antlr4-source.zip
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth chekcing the md5sum here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, added that!

hogql_parser/CONTRIBUTING.md Outdated Show resolved Hide resolved
@@ -301,7 +326,7 @@ def visitJoinOpLeftRight(self, ctx: HogQLParser.JoinOpLeftRightContext):

def visitJoinOpFull(self, ctx: HogQLParser.JoinOpFullContext):
tokens = []
if ctx.LEFT():
if ctx.FULL():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥳

hogql_parser/string.cpp Outdated Show resolved Hide resolved
if (node.has_value() && node.type() == typeid(PyObject*)) {
PyObject* py_node = any_cast<PyObject*>(node);
if (py_node && is_ast_node_instance(py_node)) {
// FIXME: This is leak, because the value argument is not decref'd. Fix for all PyObject_SetAttrString calls.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there something to do here? 🧐

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, need to do a bunch of DECREFs, caught this late in development. But this is not urgent at all, so I was thinking I'll fix this as a follow-up PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"It's a memory leak, no big deal" make me question things... 🤣

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a big deal in production – just not urgent in terms of functionality. Simply put, the primary goal of this already-huge PR is to get all the tests to pass. To not make it even huger, the production-readiness memory tightness is next 😅

Comment on lines 252 to 253
Py_DECREF(flattened_queries);
throw HogQLParsingException("Unexpected query node type: " + string(Py_TYPE(query)->tp_name));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should anything be DECREF-d on the select_queries when this throws?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, yeah, they all should be decremented

Copy link
Member Author

@Twixes Twixes Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do the same thing as with the other missing DECREFs noted above – follow-up PR tightening that up. For now just added a comment here

Comment on lines +281 to +285
for (size_t i = 0; i < window_expr_ctxs.size(); i++) {
PyDict_SetItemString(
window_exprs, visitAsString(identifier_ctxs[i]).c_str(), visitAsPyObject(window_expr_ctxs[i])
);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The grammar dictates that we have the same count of window exprs and identifiers, but without double checking that it's the case, I'm a bit nervous about reading out of bounds memory with identifier_ctxs[i]. Probably fine though 😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently a parse tree where this doesn't hold is impossible. But good to be defensive, I added a check to ensure this also holds in the future

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added a similar check in RatioExpr

@Twixes Twixes temporarily deployed to pypi-hogql-parser October 12, 2023 18:13 — with GitHub Actions Inactive
@Twixes
Copy link
Member Author

Twixes commented Oct 12, 2023

BTW I went into Codespaces to debug the 187 issue, which ended up being pretty helpful (well, step-through debugging just refused to work, but it's anyway infinitely easier to iterate in an actual dev environment as opposed to CI).

Turns out the issue was that uninitialized integer struct members (so error.start and error.end) had the handy value 0 on macOS… but this was just a happy coincidence, and in fact the uninitialized value is whatever garbage happens to be at that point in memory. The garbage was zero on macOS, and non-zero on Linux.

Now we initialize these members with zero, which solves the problem.

I think I've been thinking too much about Go recently, where every primitive type has a "zero" default value. :D

@mariusandra
Copy link
Collaborator

BTW I went into Codespaces to debug the 187 issue, which ended up being pretty helpful (well, step-through debugging just refused to work, but it's anyway infinitely easier to iterate in an actual dev environment as opposed to CI).

Turns out the issue was that uninitialized integer struct members (so error.start and error.end) had the handy value 0 on macOS… but this was just a happy coincidence, and in fact the uninitialized value is whatever garbage happens to be at that point in memory. The garbage was zero on macOS, and non-zero on Linux.

Now we initialize these members with zero, which solves the problem.

I think I've been thinking too much about Go recently, where every primitive type has a "zero" default value. :D

Hahaha... if there's a "classic C" moment, this is probably it. I've fallen victim of such problems in the past. Mostly when forgetting to memset a new "array" (or pointer of pointers) or such... Great that you found this!

if (node.has_value() && node.type() == typeid(PyObject*)) {
PyObject* py_node = any_cast<PyObject*>(node);
if (py_node && is_ast_node_instance(py_node)) {
// FIXME: This is leak, because the value argument is not decref'd. Fix for all PyObject_SetAttrString calls.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"It's a memory leak, no big deal" make me question things... 🤣

}
PyObject* expr = visitAsPyObject(ctx->columnExpr());

if (find(RESERVED_KEYWORDS.begin(), RESERVED_KEYWORDS.end(), boost::algorithm::to_lower_copy(alias)) !=
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are those boost::algorithm::to_lower_copy(alias) strings also getting cleared up, or should we explicitly free them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use new anywhere, so no need to delete anything – this is on the stack

Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I guess you deserve a medal for the work done here 💪! Very well done!

There are still some loose ends, but since this isn't directly used anywhere, I'm happy to get it in, and iterate.

I didn't spot anything leaky further down the code, but I'm sure I could have easily missed something. However there are the memory leaks you already know of.

Just to verify, I ran the simplest memory leak test in the world in shell_plus:

sql="SELECT groupArray(start_of_period) AS date, groupArray(counts) AS total, status FROM (SELECT if(equals(status, 'dormant'), negate(sum(counts)), negate(negate(sum(counts)))) AS counts, start_of_period, status FROM (SELECT periods.start_of_period AS start_of_period, 0 AS counts, status FROM (SELECT minus(dateTrunc('day', assumeNotNull(toDateTime('2023-09-20 23:59:59'))), toIntervalDay(number)) AS start_of_period FROM numbers(dateDiff('day', dateTrunc('day', assumeNotNull(toDateTime('2023-09-13 00:00:00'))), dateTrunc('day', plus(assumeNotNull(toDateTime('2023-09-20 23:59:59')), toIntervalDay(1))))) AS numbers) AS periods CROSS JOIN (SELECT status FROM (SELECT 1) ARRAY JOIN ['new', 'returning', 'resurrecting', 'dormant'] AS status) AS sec ORDER BY status ASC, start_of_period ASC UNION ALL SELECT start_of_period, count(DISTINCT person_id) AS counts, status FROM (SELECT events.person.id AS person_id, min(events.person.created_at) AS created_at, arraySort(groupUniqArray(dateTrunc('day', events.timestamp))) AS all_activity, arrayPopBack(arrayPushFront(all_activity, dateTrunc('day', created_at))) AS previous_activity, arrayPopFront(arrayPushBack(all_activity, dateTrunc('day', toDateTime('1970-01-01 00:00:00')))) AS following_activity, arrayMap((previous, current, index) -> if(equals(previous, current), 'new', if(and(equals(minus(current, toIntervalDay(1)), previous), notEquals(index, 1)), 'returning', 'resurrecting')), previous_activity, all_activity, arrayEnumerate(all_activity)) AS initial_status, arrayMap((current, next) -> if(equals(plus(current, toIntervalDay(1)), next), '', 'dormant'), all_activity, following_activity) AS dormant_status, arrayMap(x -> plus(x, toIntervalDay(1)), arrayFilter((current, is_dormant) -> equals(is_dormant, 'dormant'), all_activity, dormant_status)) AS dormant_periods, arrayMap(x -> 'dormant', dormant_periods) AS dormant_label, arrayConcat(arrayZip(all_activity, initial_status), arrayZip(dormant_periods, dormant_label)) AS temp_concat, arrayJoin(temp_concat) AS period_status_pairs, period_status_pairs.1 AS start_of_period, period_status_pairs.2 AS status FROM events WHERE and(greaterOrEquals(timestamp, minus(dateTrunc('day', assumeNotNull(toDateTime('2023-09-13 00:00:00'))), toIntervalDay(1))), less(timestamp, plus(dateTrunc('day', assumeNotNull(toDateTime('2023-09-20 23:59:59'))), toIntervalDay(1))), equals(event, '$pageview')) GROUP BY person_id) GROUP BY start_of_period, status) WHERE and(lessOrEquals(start_of_period, dateTrunc('day', assumeNotNull(toDateTime('2023-09-20 23:59:59')))), greaterOrEquals(start_of_period, dateTrunc('day', assumeNotNull(toDateTime('2023-09-13 00:00:00'))))) GROUP BY start_of_period, status ORDER BY start_of_period ASC) GROUP BY status LIMIT 10000 "
from posthog.hogql.parser import parse_select
import gc
i = 0
while True:
    i += 1
    parse_select(sql, backend='python') and gc.collect() and i

With the python parser, things are slow, but memory stays more or less steady:

2023-10-13 11 53 15

On C++ the gc.collect() function returns 0 items collected, so i is not even printed... and memory goes vroom 📈

2023-10-13 11 56 29

Since memory leaks can always reappear after any changes, we should really invest in some tooling that just runs a bunch of queries in a loop (both good and bad queries), maybe even just the parser tests, and checks that nothing grows unexpectedly... 🤔

static PyObject* method_parse_expr(PyObject* self, PyObject* args, PyObject* kwargs) {
parser_state* state = get_module_state(self);
const char* str;
int start; // TODO: Determine if this `start` kwarg of `parse_expr` is needed for anything
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only use is this.

Basically in case the subquery here has a parser error:

image

.. then we won't show the error at a random location in the query.

Since we're anyway replacing the subquery with a join, this doesn't really matter... and it can be solved differently regardless. So feel free to remove and make life easier here.

@mariusandra
Copy link
Collaborator

Aside: we should/could enable the C++ parser for testing if a HogQL Query Modifier is set.

@neilkakkar
Copy link
Collaborator

Very cool test! Out of pure curiosity, how did you find a SQL for which the leak would happen 😮

@mariusandra
Copy link
Collaborator

Very cool test! Out of pure curiosity, how did you find a SQL for which the leak would happen 😮

There are comments in the codebase that say // TODO: this leaks 😅

@@ -0,0 +1,50 @@
# Developing `hogql-parser`

## Mandatory reading
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great I applaud you for writing this 👏

@Twixes Twixes temporarily deployed to pypi-hogql-parser October 13, 2023 11:17 — with GitHub Actions Inactive
@Twixes Twixes enabled auto-merge (squash) October 13, 2023 11:22
@Twixes Twixes merged commit 16a71f6 into master Oct 13, 2023
76 checks passed
@Twixes Twixes deleted the turbo-parser branch October 13, 2023 13:58
daibhin pushed a commit that referenced this pull request Oct 23, 2023
* Add partial C++ parser of HogQL

* Support all the rules and add proper error handling

* Use `AlignAfterOpenBracket: BlockIndent`

* Reuse all the parser tests for the C++ backend

* Fix relationship between headers and implementations

* Add more error handling and fix minor issues

* Build both Python and C++ parsers in the package.json script

* Update ARRAY JOIN error assertion

* Improve timeit

* Move the C extension to the top level

* Refactor `vector_to_list_string`

* Build the parser on Linux

* Build wheels for the parser

* Simplify Linux build and fix macOS

* Fix Homebrew paths on x86 and don't fail fast in CI

* Set MACOSX_DEPLOYMENT_TARGET for C++20

* Set up QEMU for Linux ARM builds

* Publish the wheels on PyPI

* Avoiding Linux ARM emulation in CI for now

* Build sdist too

* Revert Dockerfile changes

* Fix PyPI publish

* Add README and optimize sdist build

* Use setup.py directly instead of build

* Use PyPI hogql-parser instead of local

* Also revert production-unit.Dockerfile

* Fix sdist upload and add Linux ARM back

* No Linux ARM build in the end

* Fix artifact uploading

* Do try building Linux ARM

We need this for prod.

* Use `npm` in `grammar:build`

`pnpm` is not available in that job.

* Fix formatting of hogql_parser

* Build everything on macOS

* Revert "Build everything on macOS"

Not so fast actually.

* Use hogql-parser=0.1.1

* Fix dylib in macOS wheel

* Bump hogql-parser version

* Fix missing module error

* Delete timeit.py

* Make error handling robust

* Format the C++

* Use `hogql-parser==0.1.1`

* Fix reserved keyword error assertions

* Use HEAD hogql_paresr in CI

* Fix `apt` usage

* Add some sudo in CI

* Ensure package will be releasable before build

* Bump version to 0.1.3

* Cover C++ `unquote_string` with tests

* Use BuildJet ARM runners for ARM builds

* Add some instructions

* Add HogQL version check to backend CI

* Update requirements.txt

* Use `setuptools` instead of the deprecated `distutils`

* Fix working dir in backend CI

* Align ANTLR versions

* Add test for "mismatched input"

This is thrown differently than other HogQLSyntaxExceptions in C++, so might help reveal what's going on with tests failing only on Linux CI and not macOS dev

* Add types and bump version

* Comment instead of failing version check

* Automate hogql-release version bump

* Fix checkout token

* Don't build hogql-parser if there were no changes

* Update query snapshots

* Update query snapshots

* Update query snapshots

* Update query snapshots

* Improve documentation

* Use new hogql-parser version

* Fix error start and end initialization

* Note `antlr4-cpp-runtime`

Co-authored-by: Marius Andra <[email protected]>

* Also remove NUL chars in C++

* Check ANTLR4 runtime archive checksum for security

* Note more decrefs to add

* Add vector size checks

* Use new hogql-parser version

* Don't support the `start` arg in C++ `parse_expr`

* Use new hogql-parser version

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Marius Andra <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants