perf: Parse HogQL with C++ for a huge speedup #17659
Conversation
hogql_parser/parser.cpp
static PyObject* method_parse_expr(PyObject* self, PyObject* args, PyObject* kwargs) {
  parser_state* state = get_module_state(self);
  const char* str;
  int start; // TODO: Determine if this `start` kwarg of `parse_expr` is needed for anything
I have not quite understood what the `start` arg of `parse_expr` is for in the Python original. Do we need it?
I think the only use is this: basically, in case the subquery here has a parser error, we won't show the error at a random location in the query.
Since we're replacing the subquery with a join anyway, this doesn't really matter... and it can be solved differently regardless. So feel free to remove it and make life easier here.
👍 Removed this bit in C++ then
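For context, here is a hypothetical sketch (not the PR's code, which dropped the argument) of how the kwarg could have been wired up had it been kept, with its only job being to shift reported error positions into the caller's full query, as described above:

```cpp
#include <Python.h>

// Hypothetical sketch: accept `start` as an optional keyword argument and use it
// only to offset error positions. `method_parse_expr_sketch` is illustrative and
// not part of the actual module.
static PyObject* method_parse_expr_sketch(PyObject* self, PyObject* args, PyObject* kwargs) {
  const char* str;
  int start = 0;  // default offset: report positions relative to the given string
  static const char* kwlist[] = {"expr", "start", nullptr};
  if (!PyArg_ParseTupleAndKeywords(args, kwargs, "s|i", const_cast<char**>(kwlist), &str, &start)) {
    return NULL;
  }
  // ... parse `str` here; on a syntax error, add `start` to the error's start/end
  // so the caret points at the right spot in the caller's full query ...
  Py_RETURN_NONE;
}
```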
posthog/hogql/test/test_parser.py
This was renamed to `_test_parser.py`, but Git doesn't see it that way because of the factory pattern's added indentation.
This was renamed to `_test_parse_string.py`, but Git doesn't see it that way because of the factory pattern's added indentation.
        self.assertEqual(parse_string("`a\\asd`"), "a\asd")
        self.assertEqual(parse_string("`a\\vsd`"), "a\vsd")
        self.assertEqual(parse_string("`a\\\\sd`"), "a\\sd")
        self.assertEqual(parse_string("`a\\0sd`"), "asd")
This is the only change in this whole test suite: NUL is no longer supported, because Python ignores it in `PyUnicode_FromStringAndSize`. This is surprising to me, but I really couldn't figure out a way to keep the NUL in the str constructed from C++.
        self.assertEqual(e.exception.start, 7)
        self.assertEqual(e.exception.end, 16)

    def test_malformed_sql(self):
This is the only addition to this test suite
For some reason, with `hogql-parser` installed, pytest no longer wants to collect these manage.py command tests unless `posthog.management.commands` is a package, which requires `__init__.py`.
"MIMode": "lldb" | ||
}, | ||
{ | ||
"name": "Python C++ Debugger: Current File", |
Awesome for debugging a Python C++ extension: you can set breakpoints in C++ code, and they work perfectly when this debug config is run on e.g. `test_parser_cpp.py`.
Got about halfway...
#   changed (requirements.txt has the already-published version)
run: |
  sudo apt-get install libboost-all-dev unzip cmake curl uuid pkg-config
  curl https://www.antlr.org/download/antlr4-cpp-runtime-4.13.0-source.zip --output antlr4-source.zip
Might be worth checking the `md5sum` here?
Good idea, added that!
@@ -301,7 +326,7 @@ def visitJoinOpLeftRight(self, ctx: HogQLParser.JoinOpLeftRightContext):

     def visitJoinOpFull(self, ctx: HogQLParser.JoinOpFullContext):
         tokens = []
-        if ctx.LEFT():
+        if ctx.FULL():
🥳
    if (node.has_value() && node.type() == typeid(PyObject*)) {
      PyObject* py_node = any_cast<PyObject*>(node);
      if (py_node && is_ast_node_instance(py_node)) {
        // FIXME: This is a leak, because the value argument is not decref'd. Fix for all PyObject_SetAttrString calls.
Is there something to do here? 🧐
Yes, we need to do a bunch of DECREFs; I caught this late in development. But this is not urgent at all, so I was thinking I'll fix this in a follow-up PR.
"It's a memory leak, no big deal" make me question things... 🤣
Definitely a big deal in production – just not urgent in terms of functionality. Simply put, the primary goal of this already-huge PR is to get all the tests to pass. To avoid making it even bigger, the memory tightening needed for production readiness comes next 😅
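As a reference for that follow-up, a minimal sketch (not the PR's actual code) of one way to avoid this class of leak: wrap the attribute set so the caller's reference is always released, since `PyObject_SetAttrString` stores its own reference rather than stealing the one passed in.

```cpp
#include <Python.h>

// Sketch of a possible fix pattern: set the attribute, then release our
// reference to `value`. PyObject_SetAttrString takes its own reference to the
// value, so the reference we hold must be dropped afterwards, on both the
// success and the error path.
static int set_attr_and_release(PyObject* obj, const char* attr_name, PyObject* value) {
  if (value == NULL) {
    return -1;  // a previous call already failed and set a Python exception
  }
  int result = PyObject_SetAttrString(obj, attr_name, value);
  Py_DECREF(value);  // drop ownership whether or not the set succeeded
  return result;
}
```

Using a helper like this would also make the follow-up PR a mostly mechanical sweep over the existing `PyObject_SetAttrString` call sites.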
hogql_parser/parser.cpp
      Py_DECREF(flattened_queries);
      throw HogQLParsingException("Unexpected query node type: " + string(Py_TYPE(query)->tp_name));
Should anything be `DECREF`-d on the `select_queries` when this throws?
Good catch, yeah, they all should be decremented
I'll do the same thing as with the other missing DECREFs noted above – follow-up PR tightening that up. For now just added a comment here
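A sketch of what that follow-up cleanup could look like on this error path, assuming the gathered select queries live in a `std::vector<PyObject*>` as in the surrounding visitor (the function name and the use of `std::runtime_error` in place of `HogQLParsingException` are illustrative):

```cpp
#include <Python.h>

#include <stdexcept>
#include <string>
#include <vector>

// Illustrative error-path cleanup: release every reference gathered so far
// before throwing, so the partially built result does not leak.
static void throw_unexpected_node(std::vector<PyObject*>& select_queries, PyObject* flattened_queries, PyObject* query) {
  for (PyObject* select_query : select_queries) {
    Py_DECREF(select_query);
  }
  select_queries.clear();
  Py_DECREF(flattened_queries);
  throw std::runtime_error(std::string("Unexpected query node type: ") + Py_TYPE(query)->tp_name);
}
```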
    for (size_t i = 0; i < window_expr_ctxs.size(); i++) {
      PyDict_SetItemString(
          window_exprs, visitAsString(identifier_ctxs[i]).c_str(), visitAsPyObject(window_expr_ctxs[i])
      );
    }
The grammar dictates that we have the same count of window exprs and identifiers, but without double-checking that it's the case, I'm a bit nervous about reading out-of-bounds memory with `identifier_ctxs[i]`. Probably fine though 😅
Currently a parse tree where this doesn't hold is impossible. But it's good to be defensive, so I added a check to ensure this also holds in the future.
Also added a similar check in RatioExpr
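The added defensive check could look roughly like the following generic sketch (names are illustrative, not the exact code that landed):

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Generic guard for two parallel context vectors (e.g. window exprs and their
// identifiers): if a future grammar change ever breaks the "same count"
// invariant, fail loudly instead of indexing out of bounds.
template <typename A, typename B>
void check_parallel_counts(const std::vector<A>& left, const std::vector<B>& right, const std::string& what) {
  if (left.size() != right.size()) {
    throw std::runtime_error(
        "Unexpected mismatch in " + what + ": " + std::to_string(left.size()) + " vs " + std::to_string(right.size())
    );
  }
}
```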
BTW I went into Codespaces to debug the 187 issue, which ended up being pretty helpful (well, step-through debugging just refused to work, but it's anyway infinitely easier to iterate in an actual dev environment as opposed to CI). Turns out the issue was that uninitialized integer struct members ended up holding garbage values. Now we initialize these members with zero, which solves the problem. I think I've been thinking too much about Go recently, where every primitive type has a "zero" default value. :D
Hahaha... if there's a "classic C" moment, this is probably it. I've fallen victim to such problems in the past. Mostly when forgetting to...
    }
    PyObject* expr = visitAsPyObject(ctx->columnExpr());

    if (find(RESERVED_KEYWORDS.begin(), RESERVED_KEYWORDS.end(), boost::algorithm::to_lower_copy(alias)) !=
Are those `boost::algorithm::to_lower_copy(alias)` strings also getting cleaned up, or should we explicitly `free` them?
We don't use `new` anywhere, so there's no need to `delete` anything – this is on the stack.
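To illustrate the point with a standalone sketch (not the parser's actual code): `to_lower_copy` returns a `std::string` by value, so its storage is released automatically when the value goes out of scope, and there is nothing to `free` or `delete`.

```cpp
#include <algorithm>
#include <string>
#include <vector>

#include <boost/algorithm/string/case_conv.hpp>

// The lowered copy is an ordinary value type; its destructor runs automatically
// at the end of the enclosing scope (or full expression, if used as a temporary).
bool is_reserved_keyword(const std::string& alias, const std::vector<std::string>& reserved_keywords) {
  std::string lowered = boost::algorithm::to_lower_copy(alias);
  return std::find(reserved_keywords.begin(), reserved_keywords.end(), lowered) != reserved_keywords.end();
}
```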
Overall I guess you deserve a medal for the work done here 💪! Very well done!
There are still some loose ends, but since this isn't directly used anywhere, I'm happy to get it in and iterate.
I didn't spot anything leaky further down the code, but I'm sure I could have easily missed something. However, there are the memory leaks you already know of.
Just to verify, I ran the simplest memory leak test in the world in `shell_plus`:
sql="SELECT groupArray(start_of_period) AS date, groupArray(counts) AS total, status FROM (SELECT if(equals(status, 'dormant'), negate(sum(counts)), negate(negate(sum(counts)))) AS counts, start_of_period, status FROM (SELECT periods.start_of_period AS start_of_period, 0 AS counts, status FROM (SELECT minus(dateTrunc('day', assumeNotNull(toDateTime('2023-09-20 23:59:59'))), toIntervalDay(number)) AS start_of_period FROM numbers(dateDiff('day', dateTrunc('day', assumeNotNull(toDateTime('2023-09-13 00:00:00'))), dateTrunc('day', plus(assumeNotNull(toDateTime('2023-09-20 23:59:59')), toIntervalDay(1))))) AS numbers) AS periods CROSS JOIN (SELECT status FROM (SELECT 1) ARRAY JOIN ['new', 'returning', 'resurrecting', 'dormant'] AS status) AS sec ORDER BY status ASC, start_of_period ASC UNION ALL SELECT start_of_period, count(DISTINCT person_id) AS counts, status FROM (SELECT events.person.id AS person_id, min(events.person.created_at) AS created_at, arraySort(groupUniqArray(dateTrunc('day', events.timestamp))) AS all_activity, arrayPopBack(arrayPushFront(all_activity, dateTrunc('day', created_at))) AS previous_activity, arrayPopFront(arrayPushBack(all_activity, dateTrunc('day', toDateTime('1970-01-01 00:00:00')))) AS following_activity, arrayMap((previous, current, index) -> if(equals(previous, current), 'new', if(and(equals(minus(current, toIntervalDay(1)), previous), notEquals(index, 1)), 'returning', 'resurrecting')), previous_activity, all_activity, arrayEnumerate(all_activity)) AS initial_status, arrayMap((current, next) -> if(equals(plus(current, toIntervalDay(1)), next), '', 'dormant'), all_activity, following_activity) AS dormant_status, arrayMap(x -> plus(x, toIntervalDay(1)), arrayFilter((current, is_dormant) -> equals(is_dormant, 'dormant'), all_activity, dormant_status)) AS dormant_periods, arrayMap(x -> 'dormant', dormant_periods) AS dormant_label, arrayConcat(arrayZip(all_activity, initial_status), arrayZip(dormant_periods, dormant_label)) AS temp_concat, arrayJoin(temp_concat) AS period_status_pairs, period_status_pairs.1 AS start_of_period, period_status_pairs.2 AS status FROM events WHERE and(greaterOrEquals(timestamp, minus(dateTrunc('day', assumeNotNull(toDateTime('2023-09-13 00:00:00'))), toIntervalDay(1))), less(timestamp, plus(dateTrunc('day', assumeNotNull(toDateTime('2023-09-20 23:59:59'))), toIntervalDay(1))), equals(event, '$pageview')) GROUP BY person_id) GROUP BY start_of_period, status) WHERE and(lessOrEquals(start_of_period, dateTrunc('day', assumeNotNull(toDateTime('2023-09-20 23:59:59')))), greaterOrEquals(start_of_period, dateTrunc('day', assumeNotNull(toDateTime('2023-09-13 00:00:00'))))) GROUP BY start_of_period, status ORDER BY start_of_period ASC) GROUP BY status LIMIT 10000 "
from posthog.hogql.parser import parse_select
import gc
i = 0
while True:
i += 1
parse_select(sql, backend='python') and gc.collect() and i
With the Python parser, things are slow, but memory stays more or less steady.
On C++, `gc.collect()` returns `0` items collected, so `i` is not even printed... and memory goes vroom 📈
Since memory leaks can always reappear after any changes, we should really invest in some tooling that just runs a bunch of queries in a loop (both good and bad queries), maybe even just the parser tests, and checks that nothing grows unexpectedly... 🤔
Aside: we should/could enable the C++ parser for testing if a HogQL Query Modifier is set.
Very cool test! Out of pure curiosity, how did you find a SQL query for which the leak would happen? 😮
There are comments in the codebase that say...
@@ -0,0 +1,50 @@
# Developing `hogql-parser`

## Mandatory reading
This is great, I applaud you for writing this 👏
* Add partial C++ parser of HogQL
* Support all the rules and add proper error handling
* Use `AlignAfterOpenBracket: BlockIndent`
* Reuse all the parser tests for the C++ backend
* Fix relationship between headers and implementations
* Add more error handling and fix minor issues
* Build both Python and C++ parsers in the package.json script
* Update ARRAY JOIN error assertion
* Improve timeit
* Move the C extension to the top level
* Refactor `vector_to_list_string`
* Build the parser on Linux
* Build wheels for the parser
* Simplify Linux build and fix macOS
* Fix Homebrew paths on x86 and don't fail fast in CI
* Set MACOSX_DEPLOYMENT_TARGET for C++20
* Set up QEMU for Linux ARM builds
* Publish the wheels on PyPI
* Avoiding Linux ARM emulation in CI for now
* Build sdist too
* Revert Dockerfile changes
* Fix PyPI publish
* Add README and optimize sdist build
* Use setup.py directly instead of build
* Use PyPI hogql-parser instead of local
* Also revert production-unit.Dockerfile
* Fix sdist upload and add Linux ARM back
* No Linux ARM build in the end
* Fix artifact uploading
* Do try building Linux ARM (we need this for prod)
* Use `npm` in `grammar:build` (`pnpm` is not available in that job)
* Fix formatting of hogql_parser
* Build everything on macOS
* Revert "Build everything on macOS" (not so fast actually)
* Use hogql-parser=0.1.1
* Fix dylib in macOS wheel
* Bump hogql-parser version
* Fix missing module error
* Delete timeit.py
* Make error handling robust
* Format the C++
* Use `hogql-parser==0.1.1`
* Fix reserved keyword error assertions
* Use HEAD hogql_paresr in CI
* Fix `apt` usage
* Add some sudo in CI
* Ensure package will be releasable before build
* Bump version to 0.1.3
* Cover C++ `unquote_string` with tests
* Use BuildJet ARM runners for ARM builds
* Add some instructions
* Add HogQL version check to backend CI
* Update requirements.txt
* Use `setuptools` instead of the deprecated `distutils`
* Fix working dir in backend CI
* Align ANTLR versions
* Add test for "mismatched input" (this is thrown differently than other HogQLSyntaxExceptions in C++, so might help reveal what's going on with tests failing only on Linux CI and not macOS dev)
* Add types and bump version
* Comment instead of failing version check
* Automate hogql-release version bump
* Fix checkout token
* Don't build hogql-parser if there were no changes
* Update query snapshots
* Update query snapshots
* Update query snapshots
* Update query snapshots
* Improve documentation
* Use new hogql-parser version
* Fix error start and end initialization
* Note `antlr4-cpp-runtime`
* Also remove NUL chars in C++
* Check ANTLR4 runtime archive checksum for security
* Note more decrefs to add
* Add vector size checks
* Use new hogql-parser version
* Don't support the `start` arg in C++ `parse_expr`
* Use new hogql-parser version

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Marius Andra <[email protected]>
Changes
This adds a new C++ parser for HogQL, which is 10-100x faster than the Python parser.
TODOs remaining:
How did you test this code?
The new parser must pass the same test suite as the Python parser.