refactor: Snowflake batch export #18427
Conversation
Force-pushed from 1e80500 to 8d9f615
Co-authored-by: Brett Hoerner <[email protected]>
We should hold off on merging this one until #18467 is merged due to potential conflicts.
There's a conflict from my merge, but LGTM.
I like the plans to extract batch exports to their own directory/app/service/image, no real comments there at this point.
posthog/temporal/utils.py (outdated)
# Ideally, any new exceptions should be added to the previous blocks after the first time and we will never land here.
heartbeat_details = None
received = False
logger.warning("Did not receive details from previous activity Execution due to an unexpected error")
Should this include the exception itself?
Ah, yeah, this log should be promoted to ERROR level and use logger.exception.
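For illustration, a minimal sketch of that change, assuming a stdlib-style logger; the surrounding try/except and the parse_heartbeat_details helper are hypothetical stand-ins for the code under review:

```python
import logging

from temporalio import activity

logger = logging.getLogger(__name__)

try:
    # parse_heartbeat_details is a hypothetical helper standing in for the real parsing.
    heartbeat_details = parse_heartbeat_details(activity.info().heartbeat_details)
    received = True
except Exception:
    heartbeat_details = None
    received = False
    # logger.exception logs at ERROR level and attaches the active traceback,
    # so the unexpected error is captured alongside the message.
    logger.exception("Did not receive details from previous activity Execution due to an unexpected error")
```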
except snowflake.connector.ProgrammingError:
    # TODO: logging? Other handling?
    raise
Will this exception end up in some existing log at least?
It will bubble up and the activity will fail, so we will see it in the Temporal error logs.
The TODO was just me wondering whether we should do anything more, maybe log it ourselves to add some context. I'll do that at least once we merge the structlog PR and I can rebase this one.
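A rough sketch of what that extra context could look like once structlog lands (the wrapper and names here are hypothetical, not this PR's code):

```python
import snowflake.connector
import structlog
from snowflake.connector.cursor import SnowflakeCursor

logger = structlog.get_logger()


def execute_logged(cursor: SnowflakeCursor, query: str) -> None:
    """Hypothetical wrapper: log context on failure, then let the error bubble up."""
    try:
        cursor.execute(query)
    except snowflake.connector.ProgrammingError:
        # structlog's exception() records the traceback plus any key/value context.
        logger.exception("Snowflake query failed", query=query)
        raise  # the activity still fails, so Temporal's error logs capture it too
```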
"""Executes a PUT query using the provided cursor to the provided table_name. | ||
|
||
Sadly, Snowflake's execute_async does not work with PUT statements. So, we pass the execute |
Totally aside, it's wild to me that their execute fn does or doesn't work only with certain statements. Weird!
There is some (very little) explanation of why here: snowflakedb/snowflake-connector-python#1227 (comment).
We add a file_no to the file_name when executing PUT as Snowflake will reject any files with the same
name. Since batch exports re-use the same file, our name does not change, but we don't want Snowflake
to reject or overwrite our new data.
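A hedged sketch of that naming scheme (assuming the connector's file_stream argument to execute, which lets the file name in the PUT command differ from the on-disk temporary file; the PR's actual code may differ):

```python
from snowflake.connector.cursor import SnowflakeCursor


def put_file_to_table(cursor: SnowflakeCursor, file, table_name: str, file_no: int) -> None:
    # Suffix file_no onto the name in the PUT command so each upload of the
    # reused local file lands in the table stage under a unique name.
    file.seek(0)
    cursor.execute(
        f'PUT file://{file.name}_{file_no} @%"{table_name}"',
        file_stream=file,
    )
```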
Is the unique filename thing true for the lifetime of the table? Do we need to worry at all about a temporary file name (which I think is what we're using here) being reused?
If COPY is successful after we are done uploading everything, the files will be purged, so their names become available again.
That being said, the purge can fail (even without COPY failing), or we could fail somewhere before COPY. In that case, we are hoping that Python will not generate the same name for our temporary file again. First, Python generates an infinite sequence of random names (with this: https://github.com/python/cpython/blob/3.10/Lib/tempfile.py#L132), and every time it needs a name for a temp file, it calls next to get the next randomly generated name (here: https://github.com/python/cpython/blob/3.10/Lib/tempfile.py#L252) and uses that as the file name (here: https://github.com/python/cpython/blob/3.10/Lib/tempfile.py#L556).
So, worst possible scenario, our protection from collisions boils down to how likely it is, or under what circumstances, that the Python RNG will choose the same sequence of 8 characters. We are not manually seeding it for any reason, so I think we should be safe, and if it ever happens it will be a good anecdote.
EDIT: I walked through the name generation process in this comment because I was keeping notes as I looked up how it worked just now. Not my intention to imply I know everything or sound pedantic!
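A quick demonstration of that mechanism (this pokes at CPython's private tempfile internals, so it is for illustration only):

```python
import tempfile

# CPython keeps one process-wide _RandomNameSequence and calls next() on it
# whenever a new temporary file needs a name (private internals, 3.10).
names = tempfile._get_candidate_names()

# Each name is 8 characters drawn from lowercase letters, digits, and underscore.
print(next(names))  # e.g. 'q2m8xj_r'
print(next(names))  # a different random name on every call
```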
Thanks for all the detail!
Force-pushed …batch-exports-and-real-tests from 7f46d0e to 4dcf261
Size Change: -2.64 kB (0%) Total Size: 2.01 MB
Problem
Snowflake is the first destination we supported for PostHog batch exports. As such, the code has been missing some best practices and features that we introduced in destinations implemented later but never ported back to Snowflake.
Changes
This heartbeat API is going to be rolled out to all destinations, which is why I've dropped it in a utilities module. Long term, I would like to shuffle modules around and end up with a file structure like:
The ideas behind a refactoring like this are:

a. If we eventually want batch exports to be separate from PostHog, for starters, it has to be in its own directory.
b. This also makes it easy to eventually refactor the models to avoid any dependencies on PostHog.

a. Currently, things being separated between batch_exports/ and temporal/ makes no sense.
b. But it also doesn't make sense to fully integrate into PostHog if, long term, we want batch exports to be a separate package.
c. The only exception is the batch_exports/http.py module, which will remain part of PostHog.

This PR also has some other non-heartbeat related changes:
BatchExportTemporaryFile class to manage file writing.
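For a sense of what such a class involves, here is a minimal sketch (hypothetical: a wrapper over NamedTemporaryFile that tracks what has been written; the PR's actual implementation may differ):

```python
import tempfile


class BatchExportTemporaryFile:
    """Hypothetical sketch: wrap a named temporary file and track what was written."""

    def __init__(self):
        self._file = tempfile.NamedTemporaryFile()
        self.bytes_total = 0
        self.records_total = 0

    @property
    def name(self) -> str:
        return self._file.name

    def write_record(self, content: bytes) -> int:
        # Count each write as one record; a simplification for this sketch.
        written = self._file.write(content)
        self.bytes_total += written
        self.records_total += 1
        return written

    def rewind(self) -> None:
        # Seek back to the start so the file can be re-read (e.g. for an upload).
        self._file.seek(0)

    def reset(self) -> None:
        # Truncate after an upload so the same file can be reused for the next batch.
        self._file.seek(0)
        self._file.truncate()
        self.bytes_total = 0
        self.records_total = 0

    def close(self) -> None:
        self._file.close()
```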
How did you test this code?
New unit tests, run against real Snowflake: