perf: Parse HogQL with C++ for a huge speedup #17659

Twixes · 2023-09-28T07:35:26Z

Changes

This adds a new C++ parser of HogQL, which is a 10-100x speedup over the Python parser.

TODOs remaining:

Convert the remaining rule conversions
Sort out cross-platform compilation (include/lib paths, currently only macOS works)

How did you test this code?

The new parser must pass the same test suite as the Python parser.

Twixes · 2023-10-10T21:00:16Z

hogql_parser/parser.cpp

+static PyObject* method_parse_expr(PyObject* self, PyObject* args, PyObject* kwargs) {
+  parser_state* state = get_module_state(self);
+  const char* str;
+  int start;  // TODO: Determine if this `start` kwarg of `parse_expr` is needed for anything


I have not quite understood what the start arg of parse_expr is for in the Python original. Do we need it?

I think the only use is this.

Basically in case the subquery here has a parser error:

.. then we won't show the error at a random location in the query.

Since we're anyway replacing the subquery with a join, this doesn't really matter... and it can be solved differently regardless. So feel free to remove and make life easier here.

👍 Removed this bit in C++ then

Twixes · 2023-10-11T09:25:46Z

posthog/hogql/test/test_parser.py

This was renamed to _test_parser.py, but Git doesn't see it that way because of the factory pattern's added indentation

Twixes · 2023-10-11T09:25:54Z

posthog/hogql/test/test_parse_string.py

This was renamed to _test_parse_string.py, but Git doesn't see it that way because of the factory pattern's added indentation
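
A rough sketch of the factory pattern referred to in these two comments, to show why wrapping a suite in a function adds one level of indentation to every test. Everything named here is hypothetical: `fake_parse_string` stands in for a real parser entry point, and `parse_string_test_factory` and the `"cpp"` backend label are illustrative rather than taken from the PR.

```python
import unittest


def fake_parse_string(text: str, backend: str) -> str:
    # Hypothetical stand-in for a parser entry point: both "backends" just
    # strip the surrounding backticks so the example stays self-contained.
    return text.strip("`")


def parse_string_test_factory(backend: str):
    # The whole TestCase lives inside this function, so every test method is
    # indented one level deeper than in a plain module-level class.
    class TestParseString(unittest.TestCase):
        def test_backticks_are_stripped(self):
            self.assertEqual(fake_parse_string("`asd`", backend), "asd")

    return TestParseString


# One thin subclass per backend; each inherits the identical set of tests.
class TestParseStringPython(parse_string_test_factory("python")):
    pass


class TestParseStringCpp(parse_string_test_factory("cpp")):
    pass
```

With this shape, adding a test to the shared class exercises both backends at once, which is what lets the new parser be held to the same suite as the Python one.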

Twixes · 2023-10-11T09:27:24Z

posthog/hogql/test/_test_parse_string.py

+            self.assertEqual(parse_string("`a\\asd`"), "a\asd")
+            self.assertEqual(parse_string("`a\\vsd`"), "a\vsd")
+            self.assertEqual(parse_string("`a\\\\sd`"), "a\\sd")
+            self.assertEqual(parse_string("`a\\0sd`"), "asd")


This is the only change in this whole test suite: NUL is no longer supported, because Python ignores it in PyUnicode_FromStringAndSize. This is surprising to me, but I really couldn't figure out a way to keep the NUL in the str constructed from C++

Twixes · 2023-10-11T09:28:05Z

posthog/hogql/test/_test_parser.py

+            self.assertEqual(e.exception.start, 7)
+            self.assertEqual(e.exception.end, 16)
+
+        def test_malformed_sql(self):


This is the only addition to this test suite

Twixes · 2023-10-11T09:29:30Z

posthog/management/commands/__init__.py

For some reason, with hogql-parser installed, pytest no longer wants to collect these manage.py command tests unless posthog.management.commands is a regular package, which requires __init__.py

Twixes · 2023-10-11T09:31:04Z

.vscode/launch.json

+            "MIMode": "lldb"
+        },
+        {
+            "name": "Python C++ Debugger: Current File",


Awesome for debugging a Python C++ extension: you can set breakpoints in C++ code and they work perfectly when this debug config is run in e.g. test_parser_cpp.py

mariusandra

Got about half way...

mariusandra · 2023-10-12T14:01:00Z

.github/actions/run-backend-tests/action.yml

+          # changed (requirements.txt has the already-published version)
+          run: |
+              sudo apt-get install libboost-all-dev unzip cmake curl uuid pkg-config
+              curl https://www.antlr.org/download/antlr4-cpp-runtime-4.13.0-source.zip --output antlr4-source.zip


Might be worth checking the md5sum here?

Good idea, added that!
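
The CI step in question is a shell script, so the check added there is presumably a shell one-liner. As an illustration only, the same idea in Python might look like the sketch below. The helper names `file_digest` and `verify_archive` are hypothetical, the expected digest must come from a trusted source (none is given in this thread), and SHA-256 is assumed as the default although the reviewer mentioned md5sum.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 16) -> str:
    """Hash a file in chunks so a large archive never has to fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: Path, expected_hex: str, algorithm: str = "sha256") -> None:
    """Raise ValueError if the archive's digest differs from the pinned value."""
    actual = file_digest(path, algorithm)
    if actual != expected_hex.lower():
        raise ValueError(f"{path.name}: expected {algorithm} {expected_hex}, got {actual}")
```

Calling `verify_archive` right after the download and before unpacking makes the build fail loudly if the archive ever changes upstream.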

hogql_parser/CONTRIBUTING.md

mariusandra · 2023-10-12T14:13:03Z

posthog/hogql/parser.py

@@ -301,7 +326,7 @@ def visitJoinOpLeftRight(self, ctx: HogQLParser.JoinOpLeftRightContext):

    def visitJoinOpFull(self, ctx: HogQLParser.JoinOpFullContext):
        tokens = []
-        if ctx.LEFT():
+        if ctx.FULL():


hogql_parser/string.cpp

mariusandra · 2023-10-12T14:29:39Z

hogql_parser/parser.cpp

+    if (node.has_value() && node.type() == typeid(PyObject*)) {
+      PyObject* py_node = any_cast<PyObject*>(node);
+      if (py_node && is_ast_node_instance(py_node)) {
+        // FIXME: This is leak, because the value argument is not decref'd. Fix for all PyObject_SetAttrString calls.


Is there something to do here? 🧐

Yes, need to do a bunch of DECREFs, caught this late in development. But this is not urgent at all, so I was thinking I'll fix this as a follow-up PR

"It's a memory leak, no big deal" make me question things... 🤣

Definitely a big deal in production – just not urgent in terms of functionality. Simply put, the primary goal of this already-huge PR is to get all the tests to pass. To not make it even huger, the production-readiness memory tightness is next 😅

mariusandra · 2023-10-12T14:36:44Z

hogql_parser/parser.cpp

+        Py_DECREF(flattened_queries);
+        throw HogQLParsingException("Unexpected query node type: " + string(Py_TYPE(query)->tp_name));


Should anything be DECREF-d on the select_queries when this throws?

Good catch, yeah, they all should be decremented

I'll do the same thing as with the other missing DECREFs noted above – follow-up PR tightening that up. For now just added a comment here

mariusandra · 2023-10-12T14:42:53Z

hogql_parser/parser.cpp

+      for (size_t i = 0; i < window_expr_ctxs.size(); i++) {
+        PyDict_SetItemString(
+            window_exprs, visitAsString(identifier_ctxs[i]).c_str(), visitAsPyObject(window_expr_ctxs[i])
+        );
+      }


The grammar dictates that we have the same count of window exprs and identifiers, but without double checking that it's the case, I'm a bit nervous about reading out of bounds memory with identifier_ctxs[i]. Probably fine though 😅

Currently a parse tree where this doesn't hold is impossible. But good to be defensive, I added a check to ensure this also holds in the future

Also added a similar check in RatioExpr

Co-authored-by: Marius Andra <[email protected]>

Twixes · 2023-10-12T21:11:54Z

BTW I went into Codespaces to debug the 187 issue, which ended up being pretty helpful (well, step-through debugging just refused to work, but it's anyway infinitely easier to iterate in an actual dev environment as opposed to CI).

Turns out the issue was that uninitialized integer struct members (so error.start and error.end) had the handy value 0 on macOS… but this was just a happy coincidence, and in fact the uninitialized value is whatever garbage happens to be at that point in memory. The garbage was zero on macOS, and non-zero on Linux.

Now we initialize these members with zero, which solves the problem.

I think I've been thinking too much about Go recently, where every primitive type has a "zero" default value. :D

mariusandra · 2023-10-13T07:26:13Z

Hahaha... if there's a "classic C" moment, this is probably it. I've fallen victim to such problems in the past, mostly when forgetting to memset a new "array" (or pointer of pointers) or such... Great that you found this!

mariusandra · 2023-10-13T07:42:47Z

hogql_parser/parser.cpp

+    }
+    PyObject* expr = visitAsPyObject(ctx->columnExpr());
+
+    if (find(RESERVED_KEYWORDS.begin(), RESERVED_KEYWORDS.end(), boost::algorithm::to_lower_copy(alias)) !=


Are those boost::algorithm::to_lower_copy(alias) strings also getting cleared up, or should we explicitly free them?

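We don't use new anywhere, so there's nothing to delete: the lowercased copy is a temporary with automatic storage duration, and its destructor releases its buffer when the expression ends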

mariusandra

Overall I guess you deserve a medal for the work done here 💪! Very well done!

There are still some loose ends, but since this isn't directly used anywhere, I'm happy to get it in, and iterate.

I didn't spot anything leaky further down the code, but I'm sure I could have easily missed something. However there are the memory leaks you already know of.

Just to verify, I ran the simplest memory leak test in the world in shell_plus:

sql="SELECT groupArray(start_of_period) AS date, groupArray(counts) AS total, status FROM (SELECT if(equals(status, 'dormant'), negate(sum(counts)), negate(negate(sum(counts)))) AS counts, start_of_period, status FROM (SELECT periods.start_of_period AS start_of_period, 0 AS counts, status FROM (SELECT minus(dateTrunc('day', assumeNotNull(toDateTime('2023-09-20 23:59:59'))), toIntervalDay(number)) AS start_of_period FROM numbers(dateDiff('day', dateTrunc('day', assumeNotNull(toDateTime('2023-09-13 00:00:00'))), dateTrunc('day', plus(assumeNotNull(toDateTime('2023-09-20 23:59:59')), toIntervalDay(1))))) AS numbers) AS periods CROSS JOIN (SELECT status FROM (SELECT 1) ARRAY JOIN ['new', 'returning', 'resurrecting', 'dormant'] AS status) AS sec ORDER BY status ASC, start_of_period ASC UNION ALL SELECT start_of_period, count(DISTINCT person_id) AS counts, status FROM (SELECT events.person.id AS person_id, min(events.person.created_at) AS created_at, arraySort(groupUniqArray(dateTrunc('day', events.timestamp))) AS all_activity, arrayPopBack(arrayPushFront(all_activity, dateTrunc('day', created_at))) AS previous_activity, arrayPopFront(arrayPushBack(all_activity, dateTrunc('day', toDateTime('1970-01-01 00:00:00')))) AS following_activity, arrayMap((previous, current, index) -> if(equals(previous, current), 'new', if(and(equals(minus(current, toIntervalDay(1)), previous), notEquals(index, 1)), 'returning', 'resurrecting')), previous_activity, all_activity, arrayEnumerate(all_activity)) AS initial_status, arrayMap((current, next) -> if(equals(plus(current, toIntervalDay(1)), next), '', 'dormant'), all_activity, following_activity) AS dormant_status, arrayMap(x -> plus(x, toIntervalDay(1)), arrayFilter((current, is_dormant) -> equals(is_dormant, 'dormant'), all_activity, dormant_status)) AS dormant_periods, arrayMap(x -> 'dormant', dormant_periods) AS dormant_label, arrayConcat(arrayZip(all_activity, initial_status), arrayZip(dormant_periods, dormant_label)) AS temp_concat, 
arrayJoin(temp_concat) AS period_status_pairs, period_status_pairs.1 AS start_of_period, period_status_pairs.2 AS status FROM events WHERE and(greaterOrEquals(timestamp, minus(dateTrunc('day', assumeNotNull(toDateTime('2023-09-13 00:00:00'))), toIntervalDay(1))), less(timestamp, plus(dateTrunc('day', assumeNotNull(toDateTime('2023-09-20 23:59:59'))), toIntervalDay(1))), equals(event, '$pageview')) GROUP BY person_id) GROUP BY start_of_period, status) WHERE and(lessOrEquals(start_of_period, dateTrunc('day', assumeNotNull(toDateTime('2023-09-20 23:59:59')))), greaterOrEquals(start_of_period, dateTrunc('day', assumeNotNull(toDateTime('2023-09-13 00:00:00'))))) GROUP BY start_of_period, status ORDER BY start_of_period ASC) GROUP BY status LIMIT 10000 "
from posthog.hogql.parser import parse_select
import gc
i = 0
while True:
    i += 1
    parse_select(sql, backend='python') and gc.collect() and i

With the Python parser, things are slow, but memory stays more or less steady.

On C++ the gc.collect() function returns 0 items collected, so i is not even printed... and memory goes vroom 📈

Since memory leaks can always reappear after any changes, we should really invest in some tooling that just runs a bunch of queries in a loop (both good and bad queries), maybe even just the parser tests, and checks that nothing grows unexpectedly... 🤔
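
As a starting point for that kind of tooling, here is a minimal sketch built on the standard library's tracemalloc. The function name `measure_growth` and its defaults are hypothetical, and any pass/fail threshold would need tuning against real runs. One known limitation: tracemalloc only sees memory allocated through Python's allocators, so leaked Python objects (missing DECREFs) show up, but memory leaked through raw malloc or new in C++ would not.

```python
import gc
import tracemalloc
from typing import Callable


def measure_growth(fn: Callable[[], object], iterations: int = 200, warmup: int = 20) -> int:
    """Return how many bytes of traced memory are still held after calling fn repeatedly."""
    for _ in range(warmup):
        fn()  # let caches and lazily created objects settle before measuring
    gc.collect()
    tracemalloc.start()
    try:
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            fn()
        gc.collect()  # drop anything that is merely waiting for the cycle collector
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before
```

In a real check, fn would wrap a parser call such as parse_select on a mix of valid and invalid queries, and the test would assert that the returned growth stays below a small budget.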

mariusandra · 2023-10-13T10:03:24Z

Aside: we should/could enable the C++ parser for testing if a HogQL Query Modifier is set.

neilkakkar · 2023-10-13T10:08:16Z

Very cool test! Out of pure curiosity, how did you find a SQL for which the leak would happen 😮

mariusandra · 2023-10-13T10:12:13Z

There are comments in the codebase that say // TODO: this leaks 😅

neilkakkar · 2023-10-13T10:25:46Z

hogql_parser/CONTRIBUTING.md

@@ -0,0 +1,50 @@
+# Developing `hogql-parser`
+
+## Mandatory reading


this is great I applaud you for writing this 👏

* Add partial C++ parser of HogQL
* Support all the rules and add proper error handling
* Use `AlignAfterOpenBracket: BlockIndent`
* Reuse all the parser tests for the C++ backend
* Fix relationship between headers and implementations
* Add more error handling and fix minor issues
* Build both Python and C++ parsers in the package.json script
* Update ARRAY JOIN error assertion
* Improve timeit
* Move the C extension to the top level
* Refactor `vector_to_list_string`
* Build the parser on Linux
* Build wheels for the parser
* Simplify Linux build and fix macOS
* Fix Homebrew paths on x86 and don't fail fast in CI
* Set MACOSX_DEPLOYMENT_TARGET for C++20
* Set up QEMU for Linux ARM builds
* Publish the wheels on PyPI
* Avoiding Linux ARM emulation in CI for now
* Build sdist too
* Revert Dockerfile changes
* Fix PyPI publish
* Add README and optimize sdist build
* Use setup.py directly instead of build
* Use PyPI hogql-parser instead of local
* Also revert production-unit.Dockerfile
* Fix sdist upload and add Linux ARM back
* No Linux ARM build in the end
* Fix artifact uploading
* Do try building Linux ARM
  We need this for prod.
* Use `npm` in `grammar:build`
  `pnpm` is not available in that job.
* Fix formatting of hogql_parser
* Build everything on macOS
* Revert "Build everything on macOS"
  Not so fast actually.
* Use hogql-parser=0.1.1
* Fix dylib in macOS wheel
* Bump hogql-parser version
* Fix missing module error
* Delete timeit.py
* Make error handling robust
* Format the C++
* Use `hogql-parser==0.1.1`
* Fix reserved keyword error assertions
* Use HEAD hogql_paresr in CI
* Fix `apt` usage
* Add some sudo in CI
* Ensure package will be releasable before build
* Bump version to 0.1.3
* Cover C++ `unquote_string` with tests
* Use BuildJet ARM runners for ARM builds
* Add some instructions
* Add HogQL version check to backend CI
* Update requirements.txt
* Use `setuptools` instead of the deprecated `distutils`
* Fix working dir in backend CI
* Align ANTLR versions
* Add test for "mismatched input"
  This is thrown differently than other HogQLSyntaxExceptions in C++, so might help reveal what's going on with tests failing only on Linux CI and not macOS dev
* Add types and bump version
* Comment instead of failing version check
* Automate hogql-release version bump
* Fix checkout token
* Don't build hogql-parser if there were no changes
* Update query snapshots
* Update query snapshots
* Update query snapshots
* Update query snapshots
* Improve documentation
* Use new hogql-parser version
* Fix error start and end initialization
* Note `antlr4-cpp-runtime`
  Co-authored-by: Marius Andra <[email protected]>
* Also remove NUL chars in C++
* Check ANTLR4 runtime archive checksum for security
* Note more decrefs to add
* Add vector size checks
* Use new hogql-parser version
* Don't support the `start` arg in C++ `parse_expr`
* Use new hogql-parser version

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Marius Andra <[email protected]>

Twixes force-pushed the turbo-parser branch 2 times, most recently from ade2a44 to afc4224 Compare October 4, 2023 08:06

Twixes had a problem deploying to pypi-hogql-parser October 5, 2023 20:36 — with GitHub Actions Failure

Twixes force-pushed the turbo-parser branch from 6ecd517 to cdda7d4 Compare October 5, 2023 20:42

Twixes had a problem deploying to pypi-hogql-parser October 5, 2023 20:53 — with GitHub Actions Failure

Twixes had a problem deploying to pypi-hogql-parser October 5, 2023 21:25 — with GitHub Actions Failure

Twixes force-pushed the turbo-parser branch from 36fc8ed to 885e233 Compare October 5, 2023 22:14

Twixes had a problem deploying to pypi-hogql-parser October 5, 2023 22:22 — with GitHub Actions Failure

Twixes temporarily deployed to pypi-hogql-parser October 6, 2023 07:44 — with GitHub Actions Inactive

Twixes force-pushed the turbo-parser branch from 4bb2b84 to 265b47e Compare October 6, 2023 07:48

Twixes had a problem deploying to pypi-hogql-parser October 6, 2023 07:57 — with GitHub Actions Failure

Twixes force-pushed the turbo-parser branch from 76a1d48 to 658d2a2 Compare October 6, 2023 09:38

Twixes had a problem deploying to pypi-hogql-parser October 6, 2023 09:46 — with GitHub Actions Failure

Twixes had a problem deploying to pypi-hogql-parser October 6, 2023 10:46 — with GitHub Actions Failure

Twixes temporarily deployed to pypi-hogql-parser October 6, 2023 12:21 — with GitHub Actions Inactive

Twixes force-pushed the turbo-parser branch 3 times, most recently from ac4e44c to 32e0901 Compare October 6, 2023 15:11

Twixes temporarily deployed to pypi-hogql-parser October 6, 2023 15:32 — with GitHub Actions Inactive

Twixes had a problem deploying to pypi-hogql-parser October 6, 2023 18:07 — with GitHub Actions Failure

Twixes marked this pull request as ready for review October 9, 2023 08:02

Twixes requested a review from mariusandra October 9, 2023 08:02

Twixes force-pushed the turbo-parser branch 3 times, most recently from 83b0e7c to 4149e36 Compare October 9, 2023 10:29

Twixes had a problem deploying to pypi-hogql-parser October 9, 2023 12:12 — with GitHub Actions Failure

Twixes force-pushed the turbo-parser branch 4 times, most recently from 9e06dc1 to a3396ed Compare October 9, 2023 14:50

Improve documentation

19617f5

Twixes temporarily deployed to pypi-hogql-parser October 11, 2023 08:10 — with GitHub Actions Inactive

Use new hogql-parser version

7a89bf2

Twixes commented Oct 11, 2023

mariusandra reviewed Oct 12, 2023

Twixes and others added 6 commits October 12, 2023 16:51

Fix error start and end initialization

4b5e8a4

Note antlr4-cpp-runtime

4096807

Co-authored-by: Marius Andra <[email protected]>

Also remove NUL chars in C++

5e5ba40

Check ANTLR4 runtime archive checksum for security

debb5ed

Note more decrefs to add

640c159

Add vector size checks

eca0d64

Twixes temporarily deployed to pypi-hogql-parser October 12, 2023 18:13 — with GitHub Actions Inactive

Use new hogql-parser version

6015bc4

mariusandra reviewed Oct 13, 2023

mariusandra mentioned this pull request Oct 13, 2023

feat(hogql): query modifiers #17946

Merged

mariusandra approved these changes Oct 13, 2023

neilkakkar reviewed Oct 13, 2023

Don't support the start arg in C++ parse_expr

4d4d6c0

Twixes temporarily deployed to pypi-hogql-parser October 13, 2023 11:17 — with GitHub Actions Inactive

Use new hogql-parser version

29980df

Twixes enabled auto-merge (squash) October 13, 2023 11:22

Twixes merged commit 16a71f6 into master Oct 13, 2023
76 checks passed

Twixes deleted the turbo-parser branch October 13, 2023 13:58

Twixes mentioned this pull request Oct 16, 2023

perf(hogql): Fix C++ parser leaks #18022

Merged
