feat(hog): Hog bytecode function STL #24653

Merged
merged 19 commits into from
Aug 29, 2024
6 changes: 6 additions & 0 deletions .github/workflows/ci-hog.yml
Original file line number Diff line number Diff line change
Expand Up @@ -122,6 +122,12 @@ jobs:
# as apt-get is quite out of date. The same version must be set in hogql_parser/pyproject.toml
ANTLR_VERSION: '4.13.2'

- name: Check if STL bytecode is up to date
  if: needs.changes.outputs.hog == 'true'
  run: |
      python -m hogvm.stl.compile
      git diff --exit-code

- name: Run HogVM Python tests
  if: needs.changes.outputs.hog == 'true'
  run: |
Expand Down
1 change: 1 addition & 0 deletions .prettierignore
Original file line number Diff line number Diff line change
Expand Up @@ -17,3 +17,4 @@ dist/
node_modules/
pnpm-lock.yaml
posthog/templates/email/*
hogvm/typescript/src/stl/bytecode.ts
3 changes: 2 additions & 1 deletion bin/hog
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,8 @@ if [[ "$@" == *".hoge"* ]]; then
fi
exec node $CLI_PATH "$@"
fi

elif [[ "$@" == *"--out"* ]]; then
exec python3 -m posthog.hogql.cli --out "$@"
elif [[ "$@" == *".hog"* ]]; then
exec python3 -m posthog.hogql.cli --run "$@"
else
Expand Down
42 changes: 7 additions & 35 deletions hogvm/README.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
# HogVM

A HogVM is a 🦔 that runs HogQL bytecode. Its purpose is to locally evaluate HogQL expressions against any object.
A HogVM is a 🦔 that runs Hog bytecode. Its purpose is to locally evaluate Hog/QL expressions against any object.

## HogQL bytecode
## Hog bytecode

HogQL Bytecode is a compact representation of a subset of the HogQL AST nodes. It follows a certain structure:
Hog Bytecode is a compact representation of a subset of the Hog AST nodes. It follows a certain structure:

```
1 + 2 # [_H, op.INTEGER, 2, op.INTEGER, 1, op.PLUS]
Expand All @@ -23,11 +23,11 @@ The `python/execute.py` function in this folder acts as the reference implementa

### Operations

To be considered a PostHog HogQL Bytecode Certified Parser, you must implement the following operations:
Here's a sample list of Hog bytecode operations, missing about half of them and likely out of date:

```bash
FIELD = 1 # [arg3, arg2, arg1, FIELD, 3] # arg1.arg2.arg3
CALL = 2 # [arg2, arg1, CALL, 'concat', 2] # concat(arg1, arg2)
CALL_GLOBAL = 2 # [arg2, arg1, CALL_GLOBAL, 'concat', 2] # concat(arg1, arg2)
AND = 3 # [val3, val2, val1, AND, 3] # val1 and val2 and val3
OR = 4 # [val3, val2, val1, OR, 3] # val1 or val2 or val3
NOT = 5 # [val, NOT] # not val
Expand Down Expand Up @@ -60,29 +60,9 @@ INTEGER = 33 # [INTEGER, 123] # 123
FLOAT = 34 # [FLOAT, 123.12] # 123.12
```
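
To make the layout concrete, here is a minimal sketch of a stack evaluator for the arithmetic subset only. It is not the reference implementation in `python/execute.py`. `INTEGER = 33` and `FLOAT = 34` come from the list above, `MULTIPLY = 8` is inferred from the `.hoge` snapshots in this PR, and `PLUS = 6` as well as the helper name `run_arithmetic` are assumptions made for illustration. The sketch follows the README example, where the operations start right after `_H`.

```python
from typing import Any

# INTEGER and FLOAT are taken from the list above; PLUS and MULTIPLY are assumed values.
PLUS = 6
MULTIPLY = 8
INTEGER = 33
FLOAT = 34


def run_arithmetic(bytecode: list[Any]) -> Any:
    """Evaluate a tiny arithmetic-only subset of Hog bytecode (hypothetical helper)."""
    if not bytecode or bytecode[0] != "_H":
        raise ValueError("Bytecode must start with the _H identifier")
    stack: list[Any] = []
    ip = 1  # skip the identifier; this sketch assumes no version number follows it
    while ip < len(bytecode):
        op = bytecode[ip]
        if op in (INTEGER, FLOAT):
            ip += 1  # the literal value is stored right after the opcode
            stack.append(bytecode[ip])
        elif op == PLUS:
            # The right operand was pushed first, so the left one is on top.
            left, right = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == MULTIPLY:
            left, right = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"Unsupported operation: {op}")
        ip += 1
    return stack.pop()


result = run_arithmetic(["_H", INTEGER, 2, INTEGER, 1, PLUS])
print(result)  # 3
```

Operands are pushed in reverse, so the first value popped is the left-hand side, matching the `[arg2, arg1, OP]` convention used in the list.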

### Async Operations

Some operations can't be computed directly, and are thus asked back to the caller. These include:

```bash
IN_COHORT = 27 # [val2, val1, IREGEX] # val1 in cohort val2
NOT_IN_COHORT = 28 # [val2, val1, NOT_IREGEX] # val1 not in cohort val2
```

The arguments for these instructions will be passed on to the provided `async_operation(*args)` in reverse:

```python
def async_operation(*args):
if args[0] == op.IN_COHORT:
return db.queryInCohort(args[1], args[2])
return False

execute_bytecode(to_bytecode("'user_id' in cohort 2"), {}, async_operation).result
```

### Functions

A PostHog HogQL Bytecode Certified Parser must also implement the following function calls:
A Hog Certified Parser must also implement the following function calls:

```bash
concat(...) # concat('test: ', 1, null, '!') == 'test: 1!'
Expand All @@ -96,19 +76,11 @@ ifNull(val, alternative) # ifNull('string', false) == 'string'

### Null handling

In HogQL equality comparisons, `null` is treated as any other variable. Its presence will not make functions automatically return `null`, as is the ClickHouse default.
In Hog/QL equality comparisons, `null` is treated as any other variable. Its presence will not make functions automatically return `null`, as is the ClickHouse default.

```sql
1 == null # false
1 != null # true
```

Nulls are just ignored in `concat`
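
As a rough Python illustration of the two rules above (null is an ordinary value in equality, and `concat` skips nulls), assuming Hog `null` maps to Python `None`. The helper names `hog_equals` and `hog_concat` are hypothetical, and the real STL may stringify booleans and floats differently from Python's `str`.

```python
from typing import Any


def hog_equals(left: Any, right: Any) -> bool:
    """Equality where null (None) is an ordinary value and never propagates."""
    return left == right


def hog_concat(*args: Any) -> str:
    """Concatenate the arguments as strings, skipping nulls (None)."""
    return "".join(str(arg) for arg in args if arg is not None)


print(hog_equals(1, None))                 # False
print(not hog_equals(1, None))             # True
print(hog_concat("test: ", 1, None, "!"))  # test: 1!
```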


## Known broken features

- **Regular Expression** support is implemented, but NOT GUARANTEED to work the same way across platforms. Different implementations (ClickHouse, Python, Node) use different Regexp engines. ClickHouse uses `re2`, the others use `pcre`. Use the case-insensitive regex operators instead of passing in modifier flags through the expression.
- **DateTime** comparisons are not supported.
- **Cohort Matching** operations are not implemented.
- Only a small subset of functions is enabled. This list is bound to expand.
9 changes: 9 additions & 0 deletions hogvm/__tests__/__snapshots__/bytecodeStl.hoge
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
["_H", 1, 32, "--- arrayMap ----", 2, "print", 1, 35, 52, "lambda", 1, 0, 6, 33, 2, 36, 0, 8, 38, 53, 0, 33, 1, 33, 2,
33, 3, 43, 3, 2, "arrayMap", 2, 2, "print", 1, 35, 32, "--- arrayExists ----", 2, "print", 1, 35, 52, "lambda", 1, 0, 6,
32, "%nana%", 36, 0, 17, 38, 53, 0, 32, "apple", 32, "banana", 32, "cherry", 43, 3, 2, "arrayExists", 2, 2, "print", 1,
35, 52, "lambda", 1, 0, 6, 32, "%boom%", 36, 0, 17, 38, 53, 0, 32, "apple", 32, "banana", 32, "cherry", 43, 3, 2,
"arrayExists", 2, 2, "print", 1, 35, 52, "lambda", 1, 0, 6, 32, "%boom%", 36, 0, 17, 38, 53, 0, 43, 0, 2, "arrayExists",
2, 2, "print", 1, 35, 32, "--- arrayFilter ----", 2, "print", 1, 35, 52, "lambda", 1, 0, 6, 32, "%nana%", 36, 0, 17, 38,
53, 0, 32, "apple", 32, "banana", 32, "cherry", 43, 3, 2, "arrayFilter", 2, 2, "print", 1, 35, 52, "lambda", 1, 0, 6,
32, "%e%", 36, 0, 17, 38, 53, 0, 32, "apple", 32, "banana", 32, "cherry", 43, 3, 2, "arrayFilter", 2, 2, "print", 1, 35,
52, "lambda", 1, 0, 6, 32, "%boom%", 36, 0, 17, 38, 53, 0, 43, 0, 2, "arrayFilter", 2, 2, "print", 1, 35]
10 changes: 10 additions & 0 deletions hogvm/__tests__/__snapshots__/bytecodeStl.stdout
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
--- arrayMap ----
[2, 4, 6]
--- arrayExists ----
true
false
false
--- arrayFilter ----
['banana']
['apple', 'cherry']
[]
4 changes: 3 additions & 1 deletion hogvm/__tests__/__snapshots__/lambdas.hoge
Original file line number Diff line number Diff line change
Expand Up @@ -4,4 +4,6 @@
33, 2, 52, "lambda", 1, 0, 6, 33, 2, 36, 0, 8, 38, 53, 0, 54, 1, 2, "print", 1, 35, 32, "--------", 2, "print", 1, 35,
52, "lambda", 1, 0, 20, 36, 0, 2, "print", 1, 35, 32, "moo", 2, "print", 1, 35, 32, "cow", 2, "print", 1, 35, 31, 38,
53, 0, 33, 2, 36, 3, 54, 1, 35, 32, "--------", 2, "print", 1, 35, 52, "lambda", 0, 0, 14, 32, "moo", 2, "print", 1, 35,
32, "cow", 2, "print", 1, 35, 31, 38, 53, 0, 36, 4, 54, 0, 35, 35, 35, 35, 35, 35]
32, "cow", 2, "print", 1, 35, 31, 38, 53, 0, 36, 4, 54, 0, 35, 32, "-------- lambdas do not survive json --------", 2,
"print", 1, 35, 36, 0, 2, "print", 1, 35, 36, 0, 2, "jsonStringify", 1, 2, "print", 1, 35, 36, 0, 2, "jsonStringify", 1,
2, "jsonParse", 1, 36, 5, 2, "print", 1, 35, 35, 35, 35, 35, 35, 35]
4 changes: 4 additions & 0 deletions hogvm/__tests__/__snapshots__/lambdas.stdout
Original file line number Diff line number Diff line change
Expand Up @@ -12,3 +12,7 @@ cow
--------
moo
cow
-------- lambdas do not survive json --------
fn<lambda(1)>
"fn<lambda(1)>"
fn<lambda(1)>
12 changes: 12 additions & 0 deletions hogvm/__tests__/bytecodeStl.hog
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
print('--- arrayMap ----')
print(arrayMap(x -> x * 2, [1,2,3]))

print('--- arrayExists ----')
print(arrayExists(x -> x like '%nana%', ['apple', 'banana', 'cherry']))
print(arrayExists(x -> x like '%boom%', ['apple', 'banana', 'cherry']))
print(arrayExists(x -> x like '%boom%', []))

print('--- arrayFilter ----')
print(arrayFilter(x -> x like '%nana%', ['apple', 'banana', 'cherry']))
print(arrayFilter(x -> x like '%e%', ['apple', 'banana', 'cherry']))
print(arrayFilter(x -> x like '%boom%', []))
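
For readers comparing this test against `bytecodeStl.stdout`, the following Python sketch mirrors the semantics the test exercises. It is only an approximation for illustration: the PR implements these functions as compiled Hog bytecode, and the names `like`, `array_map`, `array_exists` and `array_filter` are hypothetical. `like` is assumed to be case-sensitive, with `%` matching any run of characters and `_` matching a single character.

```python
import re
from typing import Any, Callable


def like(value: str, pattern: str) -> bool:
    """SQL-style LIKE: % matches any run of characters, _ matches exactly one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch) for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def array_map(fn: Callable[[Any], Any], items: list[Any]) -> list[Any]:
    return [fn(item) for item in items]


def array_exists(fn: Callable[[Any], bool], items: list[Any]) -> bool:
    return any(fn(item) for item in items)


def array_filter(fn: Callable[[Any], bool], items: list[Any]) -> list[Any]:
    return [item for item in items if fn(item)]


fruits = ["apple", "banana", "cherry"]
print(array_map(lambda x: x * 2, [1, 2, 3]))                # [2, 4, 6]
print(array_exists(lambda x: like(x, "%nana%"), fruits))    # True
print(array_exists(lambda x: like(x, "%boom%"), []))        # False
print(array_filter(lambda x: like(x, "%e%"), fruits))       # ['apple', 'cherry']
```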
7 changes: 7 additions & 0 deletions hogvm/__tests__/lambdas.hog
Original file line number Diff line number Diff line change
Expand Up @@ -28,3 +28,10 @@ let noArg := () -> {
print('cow')
}
noArg()

print('-------- lambdas do not survive json --------')

print(b)
print(jsonStringify(b)) // just a json string "fn<lambda(1)>"
let c := jsonParse(jsonStringify(b))
print(c) // prints a string, can't be called
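
The behaviour this test pins down can be sketched in Python as follows. `HogClosure` and `json_stringify` are hypothetical stand-ins, not the VM's actual types. The sketch assumes that a closure is serialized through its display string, which is what the expected stdout (`fn<lambda(1)>`) suggests.

```python
import json
from dataclasses import dataclass


@dataclass
class HogClosure:
    """Hypothetical stand-in for a Hog closure value."""

    name: str
    arg_count: int

    def __str__(self) -> str:
        return f"fn<{self.name}({self.arg_count})>"


def json_stringify(value: object) -> str:
    # Values json cannot encode (such as a closure) fall back to their string form.
    return json.dumps(value, default=str)


b = HogClosure(name="lambda", arg_count=1)
encoded = json_stringify(b)
c = json.loads(encoded)

print(b)            # fn<lambda(1)>
print(encoded)      # "fn<lambda(1)>"
print(c)            # fn<lambda(1)>
print(callable(c))  # False
```

After the round trip `c` is a plain string, so it prints the same text as `b` but can no longer be called.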
94 changes: 83 additions & 11 deletions hogvm/python/execute.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
from hogvm.python.objects import is_hog_error, new_hog_closure, CallFrame, ThrowFrame, new_hog_callable, is_hog_upvalue
from hogvm.python.operation import Operation, HOGQL_BYTECODE_IDENTIFIER, HOGQL_BYTECODE_IDENTIFIER_V0
from hogvm.python.stl import STL
from hogvm.python.stl.bytecode import BYTECODE_STL
from dataclasses import dataclass

from hogvm.python.utils import (
Expand Down Expand Up @@ -67,6 +68,7 @@ def execute_bytecode(
call_stack.append(
CallFrame(
ip=2 if bytecode[0] == HOGQL_BYTECODE_IDENTIFIER else 1,
chunk="root",
stack_start=0,
arg_len=0,
closure=new_hog_closure(
Expand All @@ -75,32 +77,49 @@ def execute_bytecode(
arg_count=0,
upvalue_count=0,
ip=2 if bytecode[0] == HOGQL_BYTECODE_IDENTIFIER else 1,
chunk="root",
name="",
)
),
)
)
frame = call_stack[-1]
chunk_bytecode: list[Any] = bytecode

    def stack_keep_first_elements(count: int):
    def set_chunk_bytecode():
        nonlocal chunk_bytecode, last_op
        if not frame.chunk or frame.chunk == "root":
            chunk_bytecode = bytecode
            last_op = len(bytecode) - 1
        elif frame.chunk.startswith("stl/") and frame.chunk[4:] in BYTECODE_STL:
            chunk_bytecode = BYTECODE_STL[frame.chunk[4:]][1]
            last_op = len(bytecode) - 1
        else:
            raise HogVMException(f"Unknown chunk: {frame.chunk}")

    def stack_keep_first_elements(count: int) -> list[Any]:
        nonlocal stack, mem_stack, mem_used
        if count < 0 or len(stack) < count:
            raise HogVMException("Stack underflow")
        for upvalue in reversed(upvalues):
            if upvalue["location"] >= count:
                if not upvalue["closed"]:
                    upvalue["closed"] = True
                    upvalue["value"] = stack[upvalue["location"]]
            else:
                break
        removed = stack[count:]
        stack = stack[0:count]
        mem_used -= sum(mem_stack[count:])
        mem_stack = mem_stack[0:count]
        return removed

    def next_token():
        nonlocal frame
        nonlocal frame, chunk_bytecode
        if frame.ip >= last_op:
            raise HogVMException("Unexpected end of bytecode")
        frame.ip += 1
        return bytecode[frame.ip]
        return chunk_bytecode[frame.ip]

    def pop_stack():
        if not stack:
Expand Down Expand Up @@ -145,7 +164,7 @@ def capture_upvalue(index) -> dict:
symbol: Any = None
while frame.ip <= last_op:
ops += 1
symbol = bytecode[frame.ip]
symbol = chunk_bytecode[frame.ip]
if (ops & 127) == 0: # every 128th operation
check_timeout()
elif debug:
Expand Down Expand Up @@ -232,6 +251,7 @@ def capture_upvalue(index) -> dict:
arg_count=0,
upvalue_count=0,
ip=-1,
chunk="stl",
)
)
)
Expand All @@ -244,6 +264,20 @@ def capture_upvalue(index) -> dict:
arg_count=STL[chain[0]].maxArgs or 0,
upvalue_count=0,
ip=-1,
chunk="stl",
)
)
)
elif chain[0] in BYTECODE_STL and len(chain) == 1:
push_stack(
new_hog_closure(
new_hog_callable(
type="stl",
name=chain[0],
arg_count=len(BYTECODE_STL[chain[0]][0]),
upvalue_count=0,
ip=0,
chunk=f"stl/{chain[0]}",
)
)
)
Expand All @@ -262,6 +296,7 @@ def capture_upvalue(index) -> dict:
stack_keep_first_elements(stack_start)
push_stack(response)
frame = call_stack[-1]
set_chunk_bytecode()
continue # resume the loop without incrementing frame.ip

case Operation.GET_LOCAL:
Expand Down Expand Up @@ -343,6 +378,7 @@ def capture_upvalue(index) -> dict:
new_hog_callable(
type="local",
name=name,
chunk=frame.chunk,
arg_count=arg_count,
upvalue_count=upvalue_count,
ip=frame.ip + 1,
Expand Down Expand Up @@ -402,30 +438,59 @@ def capture_upvalue(index) -> dict:
push_stack(None)
frame = CallFrame(
ip=func_ip,
chunk=frame.chunk,
stack_start=len(stack) - arg_len,
arg_len=arg_len,
closure=new_hog_closure(
new_hog_callable(
type="stl",
type="local",
name=name,
arg_count=arg_len,
upvalue_count=0,
ip=-1,
ip=func_ip,
chunk=frame.chunk,
)
),
)
call_stack.append(frame)
continue # resume the loop without incrementing frame.ip
else:
# Shortcut for calling STL functions (can also be done with an STL function closure)
if version == 0:
args = [pop_stack() for _ in range(arg_count)]
else:
args = list(reversed([pop_stack() for _ in range(arg_count)]))
if functions is not None and name in functions:
if version == 0:
args = [pop_stack() for _ in range(arg_count)]
else:
args = stack_keep_first_elements(len(stack) - arg_count)
push_stack(functions[name](*args))
elif name in STL:
if version == 0:
args = [pop_stack() for _ in range(arg_count)]
else:
args = stack_keep_first_elements(len(stack) - arg_count)
push_stack(STL[name].fn(args, team, stdout, timeout.total_seconds()))
elif name in BYTECODE_STL:
arg_names = BYTECODE_STL[name][0]
if len(arg_names) != arg_count:
raise HogVMException(f"Function {name} requires exactly {len(arg_names)} arguments")
frame.ip += 1 # advance for when we return
frame = CallFrame(
ip=0,
chunk=f"stl/{name}",
stack_start=len(stack) - arg_count,
arg_len=arg_count,
closure=new_hog_closure(
new_hog_callable(
type="stl",
name=name,
arg_count=arg_count,
upvalue_count=0,
ip=0,
chunk=f"stl/{name}",
)
),
)
set_chunk_bytecode()
call_stack.append(frame)
continue # resume the loop without incrementing frame.ip
else:
raise HogVMException(f"Unsupported function call: {name}")
case Operation.CALL_LOCAL:
Expand All @@ -452,10 +517,12 @@ def capture_upvalue(index) -> dict:
frame.ip += 1 # advance for when we return
frame = CallFrame(
ip=callable["ip"],
chunk=callable["chunk"],
stack_start=len(stack) - callable["argCount"],
arg_len=callable["argCount"],
closure=closure,
)
set_chunk_bytecode()
call_stack.append(frame)
continue # resume the loop without incrementing frame.ip

Expand Down Expand Up @@ -509,6 +576,7 @@ def capture_upvalue(index) -> dict:
call_stack = call_stack[0:call_stack_len]
push_stack(exception)
frame = call_stack[-1]
set_chunk_bytecode()
frame.ip = catch_ip
continue
else:
Expand All @@ -517,6 +585,10 @@ def capture_upvalue(index) -> dict:
message=exception.get("message"),
payload=exception.get("payload"),
)
case _:
raise HogVMException(
f'Unexpected node while running bytecode in chunk "{frame.chunk}": {chunk_bytecode[frame.ip]}'
)

frame.ip += 1
if debug:
Expand Down
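
The chunk lookup that `set_chunk_bytecode()` performs in this diff can be sketched in isolation as below. This is a simplified illustration, not the code from the PR: `resolve_chunk` is a hypothetical pure function, the `BYTECODE_STL` entry contains placeholder bytecode rather than the compiled STL, and `ValueError` stands in for `HogVMException`. Note that the sketch derives `last_op` from the resolved chunk's own length, which differs from the diff, where both branches use `len(bytecode) - 1`.

```python
from typing import Any, Optional

# Hypothetical stand-in for BYTECODE_STL: name -> (argument names, bytecode).
BYTECODE_STL: dict[str, tuple[list[str], list[Any]]] = {
    "arrayExists": (["func", "arr"], [36, 1, 38]),  # placeholder, not real STL bytecode
}


def resolve_chunk(chunk: Optional[str], root_bytecode: list[Any]) -> tuple[list[Any], int]:
    """Return (bytecode, last_op) for the chunk name stored on a call frame."""
    if not chunk or chunk == "root":
        resolved = root_bytecode
    elif chunk.startswith("stl/") and chunk[4:] in BYTECODE_STL:
        resolved = BYTECODE_STL[chunk[4:]][1]
    else:
        raise ValueError(f"Unknown chunk: {chunk}")
    # last_op is the index of the final element of the chunk being executed.
    return resolved, len(resolved) - 1


root = ["_H", 1, 33, 1, 35]
print(resolve_chunk("root", root))             # (['_H', 1, 33, 1, 35], 4)
print(resolve_chunk("stl/arrayExists", root))  # ([36, 1, 38], 2)
```

Returning the pair instead of mutating `nonlocal` variables keeps the lookup easy to test on its own; the VM loop would still need to call it whenever `frame` changes (on call, return, and when unwinding to a catch block), as the diff does.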