feat(hog): Hog bytecode function STL (#24653)
mariusandra authored Aug 29, 2024
1 parent f5d92e0 commit 00bab5e
Showing 34 changed files with 398 additions and 95 deletions.
6 changes: 6 additions & 0 deletions .github/workflows/ci-hog.yml
@@ -122,6 +122,12 @@ jobs:
# as apt-get is quite out of date. The same version must be set in hogql_parser/pyproject.toml
ANTLR_VERSION: '4.13.2'

- name: Check if STL bytecode is up to date
if: needs.changes.outputs.hog == 'true'
run: |
python -m hogvm.stl.compile
git diff --exit-code
- name: Run HogVM Python tests
if: needs.changes.outputs.hog == 'true'
run: |
1 change: 1 addition & 0 deletions .prettierignore
@@ -17,3 +17,4 @@ dist/
node_modules/
pnpm-lock.yaml
posthog/templates/email/*
hogvm/typescript/src/stl/bytecode.ts
3 changes: 2 additions & 1 deletion bin/hog
@@ -22,7 +22,8 @@ if [[ "$@" == *".hoge"* ]]; then
fi
exec node $CLI_PATH "$@"
fi

elif [[ "$@" == *"--out"* ]]; then
exec python3 -m posthog.hogql.cli --out "$@"
elif [[ "$@" == *".hog"* ]]; then
exec python3 -m posthog.hogql.cli --run "$@"
else
42 changes: 7 additions & 35 deletions hogvm/README.md
@@ -1,10 +1,10 @@
# HogVM

A HogVM is a 🦔 that runs HogQL bytecode. Its purpose is to locally evaluate HogQL expressions against any object.
A HogVM is a 🦔 that runs Hog bytecode. Its purpose is to locally evaluate Hog/QL expressions against any object.

## HogQL bytecode
## Hog bytecode

HogQL Bytecode is a compact representation of a subset of the HogQL AST nodes. It follows a certain structure:
Hog Bytecode is a compact representation of a subset of the Hog AST nodes. It follows a certain structure:

```
1 + 2 # [_H, op.INTEGER, 2, op.INTEGER, 1, op.PLUS]
@@ -23,11 +23,11 @@ The `python/execute.py` function in this folder acts as the reference implementa

### Operations

To be considered a PostHog HogQL Bytecode Certified Parser, you must implement the following operations:
Here's a sample list of Hog bytecode operations, missing about half of them and likely out of date:

```bash
FIELD = 1 # [arg3, arg2, arg1, FIELD, 3] # arg1.arg2.arg3
CALL = 2 # [arg2, arg1, CALL, 'concat', 2] # concat(arg1, arg2)
CALL_GLOBAL = 2 # [arg2, arg1, CALL_GLOBAL, 'concat', 2] # concat(arg1, arg2)
AND = 3 # [val3, val2, val1, AND, 3] # val1 and val2 and val3
OR = 4 # [val3, val2, val1, OR, 3] # val1 or val2 or val3
NOT = 5 # [val, NOT] # not val
@@ -60,29 +60,9 @@ INTEGER = 33 # [INTEGER, 123] # 123
FLOAT = 34 # [FLOAT, 123.12] # 123.12
```
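
To make the layout above concrete, here is a minimal Python sketch of how a stack machine could evaluate the `1 + 2` example. It is not the reference implementation in `python/execute.py`: the string opcodes and the `run` helper are hypothetical stand-ins for the numeric `op.*` values, only three operations are handled, and the version number that the snapshots in this commit carry after `_H` is ignored.

```python
from typing import Any

# Hypothetical symbolic opcodes; the real VM uses numeric values such as INTEGER = 33.
H, INTEGER, PLUS, NOT = "_H", "INTEGER", "PLUS", "NOT"


def run(bytecode: list[Any]) -> Any:
    """Evaluate a tiny subset of the bytecode layout on a value stack."""
    if not bytecode or bytecode[0] != H:
        raise ValueError("bytecode must start with the _H identifier")
    stack: list[Any] = []
    ip = 1  # skip the identifier
    while ip < len(bytecode):
        op = bytecode[ip]
        if op == INTEGER:
            ip += 1  # the literal follows the opcode
            stack.append(bytecode[ip])
        elif op == PLUS:
            left = stack.pop()   # operands were pushed in reverse order,
            right = stack.pop()  # so the left-hand side is on top
            stack.append(left + right)
        elif op == NOT:
            stack.append(not stack.pop())
        else:
            raise ValueError(f"unsupported operation: {op}")
        ip += 1
    return stack.pop()


# 1 + 2, encoded as in the README example
result = run([H, INTEGER, 2, INTEGER, 1, PLUS])
```

Pushing operands in reverse means the operator pops its left-hand side first, which matches the `[arg2, arg1, OP]` pattern used throughout the operations list.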

### Async Operations

Some operations can't be computed directly, and are thus asked back to the caller. These include:

```bash
IN_COHORT = 27 # [val2, val1, IN_COHORT] # val1 in cohort val2
NOT_IN_COHORT = 28 # [val2, val1, NOT_IN_COHORT] # val1 not in cohort val2
```

The arguments for these instructions will be passed on to the provided `async_operation(*args)` in reverse:

```python
def async_operation(*args):
if args[0] == op.IN_COHORT:
return db.queryInCohort(args[1], args[2])
return False

execute_bytecode(to_bytecode("'user_id' in cohort 2"), {}, async_operation).result
```

### Functions

A PostHog HogQL Bytecode Certified Parser must also implement the following function calls:
A Hog Certified Parser must also implement the following function calls:

```bash
concat(...) # concat('test: ', 1, null, '!') == 'test: 1!'
@@ -96,19 +76,11 @@ ifNull(val, alternative) # ifNull('string', false) == 'string'

### Null handling

In HogQL equality comparisons, `null` is treated as any other variable. Its presence will not make functions automatically return `null`, as is the ClickHouse default.
In Hog/QL equality comparisons, `null` is treated as any other variable. Its presence will not make functions automatically return `null`, as is the ClickHouse default.

```sql
1 == null # false
1 != null # true
```

Nulls are just ignored in `concat`
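
The null rules above can be sketched in Python, using `None` for `null`. The helper names `hog_equals` and `hog_concat` are hypothetical and this is not the STL implementation; in particular, `str()` is only an approximation of how Hog stringifies values such as booleans.

```python
from typing import Any


def hog_equals(left: Any, right: Any) -> bool:
    # null (None here) is compared like any other value instead of propagating
    return left == right


def hog_concat(*args: Any) -> str:
    # nulls are skipped; everything else is stringified and joined
    return "".join(str(arg) for arg in args if arg is not None)


greeting = hog_concat("test: ", 1, None, "!")
one_is_null = hog_equals(1, None)
```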


## Known broken features

- **Regular Expression** support is implemented, but NOT GUARANTEED to work the same way across platforms. Different implementations (ClickHouse, Python, Node) use different Regexp engines. ClickHouse uses `re2`, the others use `pcre`. Use the case-insensitive regex operators instead of passing in modifier flags through the expression.
- **DateTime** comparisons are not supported.
- **Cohort Matching** operations are not implemented.
- Only a small subset of functions is enabled. This list is bound to expand.
9 changes: 9 additions & 0 deletions hogvm/__tests__/__snapshots__/bytecodeStl.hoge
@@ -0,0 +1,9 @@
["_H", 1, 32, "--- arrayMap ----", 2, "print", 1, 35, 52, "lambda", 1, 0, 6, 33, 2, 36, 0, 8, 38, 53, 0, 33, 1, 33, 2,
33, 3, 43, 3, 2, "arrayMap", 2, 2, "print", 1, 35, 32, "--- arrayExists ----", 2, "print", 1, 35, 52, "lambda", 1, 0, 6,
32, "%nana%", 36, 0, 17, 38, 53, 0, 32, "apple", 32, "banana", 32, "cherry", 43, 3, 2, "arrayExists", 2, 2, "print", 1,
35, 52, "lambda", 1, 0, 6, 32, "%boom%", 36, 0, 17, 38, 53, 0, 32, "apple", 32, "banana", 32, "cherry", 43, 3, 2,
"arrayExists", 2, 2, "print", 1, 35, 52, "lambda", 1, 0, 6, 32, "%boom%", 36, 0, 17, 38, 53, 0, 43, 0, 2, "arrayExists",
2, 2, "print", 1, 35, 32, "--- arrayFilter ----", 2, "print", 1, 35, 52, "lambda", 1, 0, 6, 32, "%nana%", 36, 0, 17, 38,
53, 0, 32, "apple", 32, "banana", 32, "cherry", 43, 3, 2, "arrayFilter", 2, 2, "print", 1, 35, 52, "lambda", 1, 0, 6,
32, "%e%", 36, 0, 17, 38, 53, 0, 32, "apple", 32, "banana", 32, "cherry", 43, 3, 2, "arrayFilter", 2, 2, "print", 1, 35,
52, "lambda", 1, 0, 6, 32, "%boom%", 36, 0, 17, 38, 53, 0, 43, 0, 2, "arrayFilter", 2, 2, "print", 1, 35]
10 changes: 10 additions & 0 deletions hogvm/__tests__/__snapshots__/bytecodeStl.stdout
@@ -0,0 +1,10 @@
--- arrayMap ----
[2, 4, 6]
--- arrayExists ----
true
false
false
--- arrayFilter ----
['banana']
['apple', 'cherry']
[]
4 changes: 3 additions & 1 deletion hogvm/__tests__/__snapshots__/lambdas.hoge
@@ -4,4 +4,6 @@
33, 2, 52, "lambda", 1, 0, 6, 33, 2, 36, 0, 8, 38, 53, 0, 54, 1, 2, "print", 1, 35, 32, "--------", 2, "print", 1, 35,
52, "lambda", 1, 0, 20, 36, 0, 2, "print", 1, 35, 32, "moo", 2, "print", 1, 35, 32, "cow", 2, "print", 1, 35, 31, 38,
53, 0, 33, 2, 36, 3, 54, 1, 35, 32, "--------", 2, "print", 1, 35, 52, "lambda", 0, 0, 14, 32, "moo", 2, "print", 1, 35,
32, "cow", 2, "print", 1, 35, 31, 38, 53, 0, 36, 4, 54, 0, 35, 35, 35, 35, 35, 35]
32, "cow", 2, "print", 1, 35, 31, 38, 53, 0, 36, 4, 54, 0, 35, 32, "-------- lambdas do not survive json --------", 2,
"print", 1, 35, 36, 0, 2, "print", 1, 35, 36, 0, 2, "jsonStringify", 1, 2, "print", 1, 35, 36, 0, 2, "jsonStringify", 1,
2, "jsonParse", 1, 36, 5, 2, "print", 1, 35, 35, 35, 35, 35, 35, 35]
4 changes: 4 additions & 0 deletions hogvm/__tests__/__snapshots__/lambdas.stdout
@@ -12,3 +12,7 @@ cow
--------
moo
cow
-------- lambdas do not survive json --------
fn<lambda(1)>
"fn<lambda(1)>"
fn<lambda(1)>
12 changes: 12 additions & 0 deletions hogvm/__tests__/bytecodeStl.hog
@@ -0,0 +1,12 @@
print('--- arrayMap ----')
print(arrayMap(x -> x * 2, [1,2,3]))

print('--- arrayExists ----')
print(arrayExists(x -> x like '%nana%', ['apple', 'banana', 'cherry']))
print(arrayExists(x -> x like '%boom%', ['apple', 'banana', 'cherry']))
print(arrayExists(x -> x like '%boom%', []))

print('--- arrayFilter ----')
print(arrayFilter(x -> x like '%nana%', ['apple', 'banana', 'cherry']))
print(arrayFilter(x -> x like '%e%', ['apple', 'banana', 'cherry']))
print(arrayFilter(x -> x like '%boom%', []))
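
For readers unfamiliar with these functions, the following Python sketch mirrors what the test exercises. It is an illustration only: the helpers `like`, `array_map`, `array_exists`, and `array_filter` are hypothetical names, and the real versions are Hog functions compiled to bytecode, not Python.

```python
import re
from typing import Any, Callable


def like(value: str, pattern: str) -> bool:
    """SQL-style LIKE: % matches any run of characters, _ matches exactly one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch) for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def array_map(fn: Callable[[Any], Any], items: list[Any]) -> list[Any]:
    return [fn(item) for item in items]


def array_exists(fn: Callable[[Any], bool], items: list[Any]) -> bool:
    return any(fn(item) for item in items)


def array_filter(fn: Callable[[Any], bool], items: list[Any]) -> list[Any]:
    return [item for item in items if fn(item)]


fruits = ["apple", "banana", "cherry"]
doubled = array_map(lambda x: x * 2, [1, 2, 3])
has_nana = array_exists(lambda x: like(x, "%nana%"), fruits)
with_e = array_filter(lambda x: like(x, "%e%"), fruits)
```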
7 changes: 7 additions & 0 deletions hogvm/__tests__/lambdas.hog
@@ -28,3 +28,10 @@ let noArg := () -> {
print('cow')
}
noArg()

print('-------- lambdas do not survive json --------')

print(b)
print(jsonStringify(b)) // just a JSON string "fn<lambda(1)>"
let c := jsonParse(jsonStringify(b))
print(c) // prints a string, can't be called
94 changes: 83 additions & 11 deletions hogvm/python/execute.py
@@ -9,6 +9,7 @@
from hogvm.python.objects import is_hog_error, new_hog_closure, CallFrame, ThrowFrame, new_hog_callable, is_hog_upvalue
from hogvm.python.operation import Operation, HOGQL_BYTECODE_IDENTIFIER, HOGQL_BYTECODE_IDENTIFIER_V0
from hogvm.python.stl import STL
from hogvm.python.stl.bytecode import BYTECODE_STL
from dataclasses import dataclass

from hogvm.python.utils import (
@@ -67,6 +68,7 @@ def execute_bytecode(
call_stack.append(
CallFrame(
ip=2 if bytecode[0] == HOGQL_BYTECODE_IDENTIFIER else 1,
chunk="root",
stack_start=0,
arg_len=0,
closure=new_hog_closure(
@@ -75,32 +77,49 @@ def execute_bytecode(
arg_count=0,
upvalue_count=0,
ip=2 if bytecode[0] == HOGQL_BYTECODE_IDENTIFIER else 1,
chunk="root",
name="",
)
),
)
)
frame = call_stack[-1]
chunk_bytecode: list[Any] = bytecode

def stack_keep_first_elements(count: int):
def set_chunk_bytecode():
nonlocal chunk_bytecode, last_op
if not frame.chunk or frame.chunk == "root":
chunk_bytecode = bytecode
last_op = len(bytecode) - 1
elif frame.chunk.startswith("stl/") and frame.chunk[4:] in BYTECODE_STL:
chunk_bytecode = BYTECODE_STL[frame.chunk[4:]][1]
last_op = len(bytecode) - 1
else:
raise HogVMException(f"Unknown chunk: {frame.chunk}")

def stack_keep_first_elements(count: int) -> list[Any]:
nonlocal stack, mem_stack, mem_used
if count < 0 or len(stack) < count:
raise HogVMException("Stack underflow")
for upvalue in reversed(upvalues):
if upvalue["location"] >= count:
if not upvalue["closed"]:
upvalue["closed"] = True
upvalue["value"] = stack[upvalue["location"]]
else:
break
removed = stack[count:]
stack = stack[0:count]
mem_used -= sum(mem_stack[count:])
mem_stack = mem_stack[0:count]
return removed

def next_token():
nonlocal frame
nonlocal frame, chunk_bytecode
if frame.ip >= last_op:
raise HogVMException("Unexpected end of bytecode")
frame.ip += 1
return bytecode[frame.ip]
return chunk_bytecode[frame.ip]

def pop_stack():
if not stack:
@@ -145,7 +164,7 @@ def capture_upvalue(index) -> dict:
symbol: Any = None
while frame.ip <= last_op:
ops += 1
symbol = bytecode[frame.ip]
symbol = chunk_bytecode[frame.ip]
if (ops & 127) == 0: # every 128th operation
check_timeout()
elif debug:
@@ -232,6 +251,7 @@ def capture_upvalue(index) -> dict:
arg_count=0,
upvalue_count=0,
ip=-1,
chunk="stl",
)
)
)
@@ -244,6 +264,20 @@ def capture_upvalue(index) -> dict:
arg_count=STL[chain[0]].maxArgs or 0,
upvalue_count=0,
ip=-1,
chunk="stl",
)
)
)
elif chain[0] in BYTECODE_STL and len(chain) == 1:
push_stack(
new_hog_closure(
new_hog_callable(
type="stl",
name=chain[0],
arg_count=len(BYTECODE_STL[chain[0]][0]),
upvalue_count=0,
ip=0,
chunk=f"stl/{chain[0]}",
)
)
)
@@ -262,6 +296,7 @@ def capture_upvalue(index) -> dict:
stack_keep_first_elements(stack_start)
push_stack(response)
frame = call_stack[-1]
set_chunk_bytecode()
continue # resume the loop without incrementing frame.ip

case Operation.GET_LOCAL:
@@ -343,6 +378,7 @@ def capture_upvalue(index) -> dict:
new_hog_callable(
type="local",
name=name,
chunk=frame.chunk,
arg_count=arg_count,
upvalue_count=upvalue_count,
ip=frame.ip + 1,
@@ -402,30 +438,59 @@ def capture_upvalue(index) -> dict:
push_stack(None)
frame = CallFrame(
ip=func_ip,
chunk=frame.chunk,
stack_start=len(stack) - arg_len,
arg_len=arg_len,
closure=new_hog_closure(
new_hog_callable(
type="stl",
type="local",
name=name,
arg_count=arg_len,
upvalue_count=0,
ip=-1,
ip=func_ip,
chunk=frame.chunk,
)
),
)
call_stack.append(frame)
continue # resume the loop without incrementing frame.ip
else:
# Shortcut for calling STL functions (can also be done with an STL function closure)
if version == 0:
args = [pop_stack() for _ in range(arg_count)]
else:
args = list(reversed([pop_stack() for _ in range(arg_count)]))
if functions is not None and name in functions:
if version == 0:
args = [pop_stack() for _ in range(arg_count)]
else:
args = stack_keep_first_elements(len(stack) - arg_count)
push_stack(functions[name](*args))
elif name in STL:
if version == 0:
args = [pop_stack() for _ in range(arg_count)]
else:
args = stack_keep_first_elements(len(stack) - arg_count)
push_stack(STL[name].fn(args, team, stdout, timeout.total_seconds()))
elif name in BYTECODE_STL:
arg_names = BYTECODE_STL[name][0]
if len(arg_names) != arg_count:
raise HogVMException(f"Function {name} requires exactly {len(arg_names)} arguments")
frame.ip += 1 # advance for when we return
frame = CallFrame(
ip=0,
chunk=f"stl/{name}",
stack_start=len(stack) - arg_count,
arg_len=arg_count,
closure=new_hog_closure(
new_hog_callable(
type="stl",
name=name,
arg_count=arg_count,
upvalue_count=0,
ip=0,
chunk=f"stl/{name}",
)
),
)
set_chunk_bytecode()
call_stack.append(frame)
continue # resume the loop without incrementing frame.ip
else:
raise HogVMException(f"Unsupported function call: {name}")
case Operation.CALL_LOCAL:
@@ -452,10 +517,12 @@ def capture_upvalue(index) -> dict:
frame.ip += 1 # advance for when we return
frame = CallFrame(
ip=callable["ip"],
chunk=callable["chunk"],
stack_start=len(stack) - callable["argCount"],
arg_len=callable["argCount"],
closure=closure,
)
set_chunk_bytecode()
call_stack.append(frame)
continue # resume the loop without incrementing frame.ip

@@ -509,6 +576,7 @@ def capture_upvalue(index) -> dict:
call_stack = call_stack[0:call_stack_len]
push_stack(exception)
frame = call_stack[-1]
set_chunk_bytecode()
frame.ip = catch_ip
continue
else:
@@ -517,6 +585,10 @@ def capture_upvalue(index) -> dict:
message=exception.get("message"),
payload=exception.get("payload"),
)
case _:
raise HogVMException(
f'Unexpected node while running bytecode in chunk "{frame.chunk}": {chunk_bytecode[frame.ip]}'
)

frame.ip += 1
if debug:
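
The chunk lookup introduced in this file can be isolated as a small standalone sketch. It assumes, based on how the diff indexes it, that `BYTECODE_STL` maps a function name to an `(argument names, bytecode)` pair; the entry below is a made-up placeholder, and `resolve_chunk` is a hypothetical helper, not part of the commit. One deliberate difference is flagged in the comments: the diff derives `last_op` from the root `bytecode` in both branches, whereas this sketch bounds it by the selected chunk.

```python
from typing import Any

# Assumed shape, inferred from the diff: name -> (argument names, bytecode).
# The argument names and bytecode below are placeholders, not real STL output.
BYTECODE_STL: dict[str, tuple[list[str], list[Any]]] = {
    "arrayMap": (["func", "arr"], ["<op>", "<op>", "<op>"]),
}


def resolve_chunk(chunk: str, root_bytecode: list[Any]) -> tuple[list[Any], int]:
    """Return the bytecode for a frame's chunk and the index of its last op."""
    if not chunk or chunk == "root":
        selected = root_bytecode
    elif chunk.startswith("stl/") and chunk[4:] in BYTECODE_STL:
        selected = BYTECODE_STL[chunk[4:]][1]
    else:
        raise ValueError(f"Unknown chunk: {chunk}")
    # Assumption: the instruction pointer is bounded by the selected chunk's
    # length, rather than by the root bytecode's length as in the diff.
    return selected, len(selected) - 1


root = ["_H", 1, 31, 38]
root_code, root_last = resolve_chunk("root", root)
stl_code, stl_last = resolve_chunk("stl/arrayMap", root)
```

Keying frames by a chunk name rather than by a bytecode reference keeps closures serializable as plain data, at the cost of a lookup whenever the active frame changes.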