refactor: Snowflake batch export #18427

Merged
35 commits merged into master on Nov 20, 2023

Conversation

tomasfarias
Contributor

@tomasfarias tomasfarias commented Nov 6, 2023

Problem

Snowflake was the first destination we supported for PostHog batch exports. As such, the code has been missing some best practices and features we introduced in destinations implemented later but never ported back to Snowflake.

Changes

  • Wrap the Snowflake API to make it async.
    • Which allows us to heartbeat...
  • Implement Heartbeat on progress to allow resuming from partial exports.
    • And now that we heartbeat we can...
  • Implement resume based on last heartbeat progress values.
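Roughly, the first two changes can be sketched as below. This is a minimal illustration, not the actual PostHog code: `execute_async_query`, `export_rows`, `upload_part`, and the `heartbeat` callable are hypothetical names, and the real activity would report progress via Temporal's `temporalio.activity.heartbeat()` rather than a plain callback.

```python
import asyncio

async def execute_async_query(cursor, query: str):
    # Run a blocking Snowflake cursor call in a worker thread so the event
    # loop (and therefore heartbeating) is never blocked by the query.
    return await asyncio.to_thread(cursor.execute, query)

async def export_rows(chunks, upload_part, heartbeat):
    # Upload chunks of rows, reporting progress after each one. In the real
    # activity, `heartbeat` would call temporalio.activity.heartbeat() with
    # the last exported inserted_at so a retried run can resume from there.
    last_inserted_at = None
    for chunk in chunks:
        await asyncio.to_thread(upload_part, chunk)
        last_inserted_at = chunk[-1]["inserted_at"]
        heartbeat(last_inserted_at)
    return last_inserted_at
```

On resume, the activity would read the last heartbeat detail and restrict its ClickHouse query to rows after that `inserted_at`.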

This heartbeat API is going to be rolled out to all destinations, which is why I've dropped it in a utilities module. Long term, I would like to shuffle modules around and end up with a file structure like:

. batch_exports
├── __main__.py (start_worker() with all workflows and activities in this package)
├── activities (common activities shared by all workflows)
│   ├── create_batch_export.py
│   └── ...
├── workflows
│   ├── base.py (PostHogWorkflow base class)
│   ├── bigquery_batch_export_workflow.py
│   ├── snowflake_batch_export_workflow.py
│   └── ...
├── models (or maybe just a models.py)
│   ├── batch_export_model.py
│   ├── backfill_batch_export_model.py
│   └── ...
├── utilities (if any of these gets too big, we can move them to their own separate module)
│   ├── heartbeat.py
│   ├── temporary_file.py
│   ├── clickhouse.py
│   ├── service.py
│   └── ...
├── temporal
│   ├── client.py
│   ├── codec.py
│   ├── worker.py
│   └── ...
└── tests
    ├── workflows
    │   ├── test_bigquery_batch_export_workflow.py
    │   └── ...
    ├── temporal
    │   ├── test_codec.py
    │   └── ...
    └── utilities
        ├── test_clickhouse.py
        ├── test_heartbeat.py
        └── ...

The ideas behind a refactoring like this are:

  1. Separate batch exports from the rest of the PostHog monolith.
    a. If we eventually want batch exports to be separate from PostHog, for starters, it has to live in its own directory.
    b. This also makes it easy to eventually refactor the models to avoid any dependencies to PostHog.
  2. Temporal is just the tool we use for the job, batch exports is the actual application. So, batch exports should be the root dir.
    a. Currently, things being separated between batch_exports/ and temporal/ makes no sense.
    b. But it also doesn't make sense to fully integrate into PostHog if, long term, we want batch exports to be a separate package.
    c. The only exception is the batch_exports/http.py module which will remain part of PostHog.

This PR also has some other non-heartbeat related changes:

  • Use the BatchExportTemporaryFile class to manage file writing.
  • Add unit tests that optionally run against real Snowflake, like those implemented for BigQuery/Redshift.


How did you test this code?

New unit tests ran against real Snowflake:

$  SNOWFLAKE_USERNAME="username" SNOWFLAKE_ACCOUNT="account" SNOWFLAKE_PASSWORD="password" SNOWFLAKE_WAREHOUSE="warehouse" DEBUG=1 pytest posthog/temporal/tests/batch_exports/test_snowflake_batch_export_workflow.py::test_snowflake_export_workflow -vv
===================================================== test session starts =====================================================
platform linux -- Python 3.10.10, pytest-7.4.0, pluggy-0.13.1 -- src/github.com/PostHog/posthog/.direnv/python-3.10.10/bin/python
cachedir: .pytest_cache
django: settings: posthog.settings (from ini)
rootdir: src/github.com/PostHog/posthog
configfile: pytest.ini
plugins: asyncio-0.21.1, icdiff-0.6, flaky-3.7.0, env-0.8.2, Faker-17.5.0, syrupy-1.7.4, mock-3.11.1, split-0.8.1, django-4.5.2, cov-4.1.0
asyncio: mode=strict
collected 4 items                                                                                                             

posthog/temporal/tests/batch_exports/test_snowflake_batch_export_workflow.py::test_snowflake_export_workflow[None-hour] PASSED [ 25%]
posthog/temporal/tests/batch_exports/test_snowflake_batch_export_workflow.py::test_snowflake_export_workflow[None-day] PASSED [ 50%]
posthog/temporal/tests/batch_exports/test_snowflake_batch_export_workflow.py::test_snowflake_export_workflow[exclude_events1-hour] PASSED [ 75%]
posthog/temporal/tests/batch_exports/test_snowflake_batch_export_workflow.py::test_snowflake_export_workflow[exclude_events1-day] PASSED [100%]

--------------------------------------------------- snapshot report summary ---------------------------------------------------

===================================================== 4 passed in 35.77s ======================================================

@tomasfarias tomasfarias marked this pull request as draft November 6, 2023 17:39
@tomasfarias tomasfarias changed the title from "refactor: Snowflake batch export is now async" to "refactor: Snowflake batch export" Nov 6, 2023
Base automatically changed from refactor/batch-exports-tests-simplification to master November 7, 2023 10:07
@tomasfarias tomasfarias force-pushed the refactor/snowflake-batch-exports-and-real-tests branch from 1e80500 to 8d9f615 Compare November 7, 2023 10:09
@tomasfarias tomasfarias marked this pull request as ready for review November 9, 2023 14:26
@tomasfarias tomasfarias requested a review from a team November 9, 2023 14:26
@tomasfarias
Contributor Author

We should hold off on merging this one until #18467 is merged due to potential conflicts.

Contributor

@bretthoerner bretthoerner left a comment


There's a conflict from my merge, but LGTM.

I like the plans to extract batch exports to their own directory/app/service/image, no real comments there at this point.

posthog/temporal/utils.py
# Ideally, any new exceptions should be added to the previous blocks after the first time and we will never land here.
heartbeat_details = None
received = False
logger.warning("Did not receive details from previous activity Execution due to an unexpected error")
Contributor


Should this include the exception itself?

Contributor Author


Ah, yeah, this log should be promoted to ERROR level and use logger.exception.
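For illustration, the promoted version might look something like this. A sketch only: `parse_heartbeat_details` and the fallback shape are assumptions, not the actual PR code.

```python
import logging

logger = logging.getLogger(__name__)

def parse_heartbeat_details(details):
    # Try to read progress values from the previous activity execution,
    # falling back to None. Unlike the original logger.warning call,
    # logger.exception logs at ERROR level and attaches the active traceback.
    try:
        return details[0]
    except Exception:
        logger.exception(
            "Did not receive details from previous activity Execution due to an unexpected error"
        )
        return None
```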


except snowflake.connector.ProgrammingError:
# TODO: logging? Other handling?
raise
Contributor


Will this exception end up in some existing log at least?

Contributor Author


It will bubble up and the activity will fail, so we will see it in the temporal error logs.

Contributor Author

@tomasfarias tomasfarias Nov 10, 2023


The TODO was just thinking if we should do anything more, maybe log ourselves to add some context. I'll do that at least once we merge the structlog PR and I can rebase this one.

"""Executes a PUT query using the provided cursor to the provided table_name.

Sadly, Snowflake's execute_async does not work with PUT statements. So, we pass the execute
Contributor


Totally aside, it's wild to me that their execute fn does or doesn't work only with certain statements. Weird!

Contributor Author

@tomasfarias tomasfarias Nov 10, 2023


There is some (very little) explanation of why here: snowflakedb/snowflake-connector-python#1227 (comment).


We add a file_no to the file_name when executing PUT as Snowflake will reject any files with the same
name. Since batch exports re-use the same file, our name does not change, but we don't want Snowflake
to reject or overwrite our new data.
Contributor


Is the unique filename thing true for the lifetime of the table? Do we need to worry at all about a temporary file name (which is I think what we're using here) being reused?

Contributor Author

@tomasfarias tomasfarias Nov 10, 2023


If COPY is successful after we are done uploading everything, the files will be purged, so their names become available again.

That being said, the purge can fail (even without COPY failing), or we could fail somewhere before COPY. In that case, we are hoping that Python will not generate the same name for our temporary file again. First, Python generates an infinite sequence of random names (with this: https://github.com/python/cpython/blob/3.10/Lib/tempfile.py#L132), and every time it needs a name for a temp file, it calls next to get the next randomly generated name (here: https://github.com/python/cpython/blob/3.10/Lib/tempfile.py#L252), and uses that as the file name (here: https://github.com/python/cpython/blob/3.10/Lib/tempfile.py#L556).

So, in the worst possible scenario, our protection from collisions boils down to how likely it is, or under what circumstances, that the Python RNG will choose the same sequence of 8 characters. We are not manually seeding it for any reason, so I think we should be safe, and if it ever happens it will make for a good anecdote.

EDIT: I ran through the name generation process in this comment as I was keeping notes as I looked up how it worked just now. Not my intention to imply I know everything or sound pedantic!
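The naming behaviour described above is easy to observe directly against the stdlib; this only demonstrates `tempfile`'s random suffixes, not the batch export code.

```python
import os
import tempfile

# Each NamedTemporaryFile draws its name from tempfile's shared random
# name sequence: the default "tmp" prefix plus a random suffix.
with tempfile.NamedTemporaryFile() as first, tempfile.NamedTemporaryFile() as second:
    name_a = os.path.basename(first.name)
    name_b = os.path.basename(second.name)

# Two live temporary files never share a name; reuse after deletion is only
# as likely as the unseeded RNG repeating the same random draw.
```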

Contributor


Thanks for all the detail!

@tomasfarias tomasfarias force-pushed the refactor/snowflake-batch-exports-and-real-tests branch from 7f46d0e to 4dcf261 Compare November 17, 2023 14:46
Contributor

github-actions bot commented Nov 17, 2023

Size Change: -2.64 kB (0%)

Total Size: 2.01 MB

Filename Size Change
frontend/dist/toolbar.js 2.01 MB -2.64 kB (0%)

compressed-size-action

@tomasfarias tomasfarias merged commit 1c6ec08 into master Nov 20, 2023
67 checks passed
@tomasfarias tomasfarias deleted the refactor/snowflake-batch-exports-and-real-tests branch November 20, 2023 10:10