Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(hog): C++ parser #23100

Merged
merged 39 commits into from
Jun 21, 2024
Merged

feat(hog): C++ parser #23100

merged 39 commits into from
Jun 21, 2024

Conversation

mariusandra
Copy link
Collaborator

@mariusandra mariusandra commented Jun 20, 2024

Problem

The Python parser is slow

Changes

  • Ports the Hog node parsing over to C++
  • Also makes parsing {return} work

How did you test this code?

Moved the python-only parser tests into the universar parser tests file.

TODO:

  • Release a new hogql_parser and run CI
  • Release a new @posthog/hogvm

@posthog-bot
Copy link
Contributor

It looks like the code of hogql-parser has changed since last push, but its version stayed the same at 1.0.15. 👀
Make sure to resolve this in hogql_parser/setup.py before merging!

@posthog-bot
Copy link
Contributor

It looks like the code of hogql-parser has changed since last push, but its version stayed the same at 1.0.15. 👀
Make sure to resolve this in hogql_parser/setup.py before merging!

@mariusandra mariusandra changed the base branch from master to trailing-commas June 20, 2024 12:58
@mariusandra mariusandra marked this pull request as ready for review June 20, 2024 13:30
Base automatically changed from trailing-commas to master June 20, 2024 14:05
@mariusandra
Copy link
Collaborator Author

The build is still in progress (couldn't release a new hogql_parser because launchpad.net is down and the ARM deadsnakes python repo with it), but everything else could use some 👀

Copy link
Member

@Twixes Twixes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Only some suggestions around code clarity (and a couple missing DECREFs)

.github/workflows/build-hogql-parser.yml Outdated Show resolved Hide resolved
hogvm/typescript/src/execute.ts Show resolved Hide resolved
hogql_parser/parser.cpp Outdated Show resolved Hide resolved
Comment on lines +703 to +724
PyObject* declarations = PyList_New(0);
if (!declarations) {
throw PyInternalError();
}
auto declaration_ctxs = ctx->declaration();
for (auto declaration_ctx : declaration_ctxs) {
if (!declaration_ctx->statement() || !declaration_ctx->statement()->emptyStmt()) {
PyObject* statement;
try {
statement = visitAsPyObject(declaration_ctx);
} catch (...) {
Py_DECREF(declarations);
throw;
}
int append_code = PyList_Append(declarations, statement);
Py_DECREF(statement);
if (append_code == -1) {
Py_DECREF(declarations);
throw PyInternalError();
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

visitPyListOfObjects() would be simpler, though wouldn't skip the empty statements – not sure how important this is

Suggested change
PyObject* declarations = PyList_New(0);
if (!declarations) {
throw PyInternalError();
}
auto declaration_ctxs = ctx->declaration();
for (auto declaration_ctx : declaration_ctxs) {
if (!declaration_ctx->statement() || !declaration_ctx->statement()->emptyStmt()) {
PyObject* statement;
try {
statement = visitAsPyObject(declaration_ctx);
} catch (...) {
Py_DECREF(declarations);
throw;
}
int append_code = PyList_Append(declarations, statement);
Py_DECREF(statement);
if (append_code == -1) {
Py_DECREF(declarations);
throw PyInternalError();
}
}
}
PyObject* declarations = visitPyListOfObjects(ctx->declaration());

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely want to skip the empty statements in the parser and return a relatively clean tree... 🤔

Comment on lines 685 to 691
vector<string> identifiers;
auto identifier_ctxs = ctx->identifier();
identifiers.reserve(identifier_ctxs.size());
for (auto identifier_ctx : identifier_ctxs) {
identifiers.push_back(visitAsString(identifier_ctx));
}
return X_PyList_FromStrings(identifiers);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would do the same:

Suggested change
vector<string> identifiers;
auto identifier_ctxs = ctx->identifier();
identifiers.reserve(identifier_ctxs.size());
for (auto identifier_ctx : identifier_ctxs) {
identifiers.push_back(visitAsString(identifier_ctx));
}
return X_PyList_FromStrings(identifiers);
visitPyListOfObjects(ctx->identifier());

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to revert this. Not sure why, but it failed with Parsing failed due to bad type casting (even after adding the return I forgot the first time). The above works, so I'll 🙈

Copy link
Member

@Twixes Twixes Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, of course, because the identifier rule visitor returns a string. So there need to be two steps (and an error guard):

Suggested change
vector<string> identifiers;
auto identifier_ctxs = ctx->identifier();
identifiers.reserve(identifier_ctxs.size());
for (auto identifier_ctx : identifier_ctxs) {
identifiers.push_back(visitAsString(identifier_ctx));
}
return X_PyList_FromStrings(identifiers);
PyObject* ret = X_PyList_FromStrings(visitAsVectorOfStrings(ctx->identifier()));
if (!ret) {
throw PyInternalError();
}
return ret;

Copy link
Member

@Twixes Twixes Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though actually this goes against convention No. 3:

Stay out of Python land as long as possible. E.g. avoid using PyObject*s` for bools or strings.

So in this spirit, this visitor should only return the vector of strings, as is conventional. Then the rule that actually builds an AST node using IdentifierList can use X_PyList_FromStrings() itself – so that this method doesn't have to be needlessly concerned with Python internals

hogql_parser/parser.cpp Outdated Show resolved Hide resolved
hogql_parser/parser.cpp Outdated Show resolved Hide resolved
hogql_parser/parser.cpp Outdated Show resolved Hide resolved
hogql_parser/parser.cpp Outdated Show resolved Hide resolved
hogql_parser/parser.cpp Outdated Show resolved Hide resolved
@posthog-bot
Copy link
Contributor

It looks like the code of hogql-parser has changed since last push, but its version stayed the same at 1.0.18. 👀
Make sure to resolve this in hogql_parser/setup.py before merging!

@mariusandra mariusandra requested a review from Twixes June 21, 2024 13:09
@posthog-bot
Copy link
Contributor

It looks like the code of hogql-parser has changed since last push, but its version stayed the same at 1.0.18. 👀
Make sure to resolve this in hogql_parser/setup.py before merging!

Copy link
Member

@Twixes Twixes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!
One optional note on a point where my initial suggestion was wrong

Comment on lines 685 to 691
vector<string> identifiers;
auto identifier_ctxs = ctx->identifier();
identifiers.reserve(identifier_ctxs.size());
for (auto identifier_ctx : identifier_ctxs) {
identifiers.push_back(visitAsString(identifier_ctx));
}
return X_PyList_FromStrings(identifiers);
Copy link
Member

@Twixes Twixes Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, of course, because the identifier rule visitor returns a string. So there need to be two steps (and an error guard):

Suggested change
vector<string> identifiers;
auto identifier_ctxs = ctx->identifier();
identifiers.reserve(identifier_ctxs.size());
for (auto identifier_ctx : identifier_ctxs) {
identifiers.push_back(visitAsString(identifier_ctx));
}
return X_PyList_FromStrings(identifiers);
PyObject* ret = X_PyList_FromStrings(visitAsVectorOfStrings(ctx->identifier()));
if (!ret) {
throw PyInternalError();
}
return ret;

@mariusandra mariusandra temporarily deployed to pypi-hogql-parser June 21, 2024 14:17 — with GitHub Actions Inactive
@posthog-bot
Copy link
Contributor

It looks like the code of hogql-parser has changed since last push, but its version stayed the same at 1.0.19. 👀
Make sure to resolve this in hogql_parser/setup.py before merging!

@mariusandra mariusandra temporarily deployed to pypi-hogql-parser June 21, 2024 14:42 — with GitHub Actions Inactive
@mariusandra mariusandra temporarily deployed to pypi-hogql-parser June 21, 2024 16:01 — with GitHub Actions Inactive
@mariusandra mariusandra enabled auto-merge (squash) June 21, 2024 16:04
@mariusandra mariusandra merged commit 01a9f36 into master Jun 21, 2024
86 of 87 checks passed
@mariusandra mariusandra deleted the hog-plus-plus branch June 21, 2024 16:22
thmsobrmlr pushed a commit that referenced this pull request Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants