
feat: Start tracking records exported #21008

Merged
merged 11 commits into master from feat/start-tracking-records-exported on Mar 20, 2024

Conversation

@tomasfarias (Contributor) commented Mar 19, 2024

Problem

Not sure why we weren't doing this already. Anyway, some test cases were failing, but the bug has been squashed and there should be no failures now. Also, BigQuery tests pass when run manually.

Changes

Pass along BatchExportTemporaryFile.records_total to update_batch_export_run_status and use it to update the Django model's records_completed.
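A minimal sketch of what this change amounts to, using an in-memory stand-in for the Django model (the function signature matches the diff reviewed below; everything else here is illustrative, not the actual PostHog code):

```python
from dataclasses import dataclass
from uuid import UUID, uuid4


@dataclass
class BatchExportRun:  # stand-in for the Django model
    id: UUID
    status: str = "Running"
    latest_error: str | None = None
    records_completed: int = 0


RUNS: dict[UUID, BatchExportRun] = {}  # stand-in for the database


def update_batch_export_run_status(
    run_id: UUID, status: str, latest_error: str | None, records_completed: int = 0
) -> BatchExportRun:
    """Signature as in the diff; the body is a toy in-memory update."""
    run = RUNS[run_id]
    run.status = status
    run.latest_error = latest_error
    run.records_completed = records_completed
    return run


run = BatchExportRun(id=uuid4())
RUNS[run.id] = run

# records_total would come from BatchExportTemporaryFile after the export flushes.
records_total = 1234
update_batch_export_run_status(run.id, "Completed", None, records_completed=records_total)
assert run.records_completed == 1234
```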


Does this work well for both Cloud and self-hosted?

Works anywhere that can run batch exports.

How did you test this code?

Added assertions on records_completed to a handful of tests.

@tiina303 requested a review from bretthoerner on March 19, 2024 at 16:24
Comment on lines 221 to +222
rows_exported.add(len(batch))
total_rows_exported += len(batch)
@tomasfarias (Contributor, Author)

Not sure if we could just read the counter instead of having this duplicated.
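For context, a self-contained toy version of the pattern being discussed (the Counter class here is a hypothetical stand-in; the real rows_exported is a metrics counter in the activity). Whether the metric's value can be read back depends on the metrics API, which is why a separate local total exists at all:

```python
class Counter:
    """Toy metrics counter with an .add() method, standing in for the real one."""

    def __init__(self) -> None:
        self.value = 0

    def add(self, n: int) -> None:
        self.value += n


rows_exported = Counter()
total_rows_exported = 0

for batch in (["a", "b"], ["c"]):  # toy batches
    rows_exported.add(len(batch))      # emit the metric
    total_rows_exported += len(batch)  # local copy used as the activity result

assert total_rows_exported == rows_exported.value == 3
```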

Comment on lines +441 to +443
def update_batch_export_run_status(
    run_id: UUID, status: str, latest_error: str | None, records_completed: int = 0
) -> BatchExportRun:
@tomasfarias (Contributor, Author)

Could just be named update_batch_export_run as it's doing more than setting the status.

@tomasfarias (Contributor, Author) commented Mar 19, 2024

On a broader note, I think we should get rid of the ORM (in batch exports) and move to something like aiosql.
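For illustration only, a rough sketch of the aiosql style being suggested (the table, query name, and sqlite3 driver are invented here just to keep the example self-contained; batch exports would presumably use an async driver such as asyncpg):

```python
import sqlite3

import aiosql

SQL = """
-- name: set_records_completed!
UPDATE runs SET records_completed = :records_completed WHERE id = :run_id;
"""

# Queries are defined as plain SQL and exposed as Python callables.
queries = aiosql.from_str(SQL, "sqlite3")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, records_completed INTEGER)")
conn.execute("INSERT INTO runs (id, records_completed) VALUES (1, 0)")

# The SQL lives in one place; Python only supplies parameters.
queries.set_records_completed(conn, run_id=1, records_completed=42)
conn.commit()
```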

@bretthoerner (Contributor) left a comment

I keep kicking the tests, they are hanging on Temporal starting and other things. May be related to change to use Depot?

Either way, this looks fine. I have some question about style (using vars outside of the context manager that created them). I cut my teeth on Python but I find it to be so, so ugly now. 🙃

@@ -354,6 +354,8 @@ async def flush_to_bigquery(bigquery_table, table_schema):

jsonl_file.reset()

return jsonl_file.records_total
@bretthoerner (Contributor)

This is de-dented outside of the with BatchExportTemporaryFile() as jsonl_file: block. While it may work (which surprises me, but Python loves to leak things from scope), I think we should put it inside?

@tomasfarias (Contributor, Author) commented Mar 19, 2024

Yeah, with statements do not introduce a new scope. I am fine with de-denting this though, and I do plan to address it more in future PRs.
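A quick self-contained illustration of the scoping point (nothing PostHog-specific): names bound inside a with block remain visible after it, so whether this is safe comes down to what state __exit__ leaves behind.

```python
import tempfile


def count_lines() -> int:
    with tempfile.TemporaryFile(mode="w+") as f:
        f.write("a\nb\n")
        f.seek(0)
        line_count = len(f.readlines())
    # `f` and `line_count` are still in scope here; the file is closed,
    # but plain attributes (like a records_total counter) could still be read.
    return line_count


assert count_lines() == 2
```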

posthog/temporal/batch_exports/postgres_batch_export.py (outdated review thread, resolved)
posthog/temporal/batch_exports/bigquery_batch_export.py (outdated review thread, resolved)
@@ -503,6 +503,8 @@ async def flush_to_s3(last_uploaded_part_timestamp: str, last=False):

await s3_upload.complete()

return local_results_file.records_total
@bretthoerner (Contributor)

Also de-dented, but it competes with the s3_upload which happens at an outer scope...

Maybe I'm being un-Pythonic? The use of important information outside of the with gives me the creeps, but maybe this is fine and good. I guess it's fine as long as __exit__ leaves the state we need?

@tomasfarias (Contributor, Author)

await s3_upload.complete() should be in the __exit__ of S3MultiPartUpload. I took it out as I was going crazy trying to debug a batch export completing with an extra number of parts, so I tried being explicit to throw some clarity at the problem.

In the end, I think I randomly fixed the bug in another PR when I wasn't looking for it. Anyway, this remained outside of __exit__ and should be added back in, as it wasn't the cause of the bug.

@tomasfarias (Contributor, Author) commented Mar 19, 2024

When we move it, we do have to account for exceptions: we don't want to complete the upload in case of an exception, and we have to be very careful if we are aborting, as a retry could continue the upload. Maybe the solution is to move this outside of the activity and into the workflow, as starting, completing, or aborting an upload should never fail except for wrong credentials, which we can very precisely wrap in a try/except.

I can open a follow-up PR to deal with this.
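To make the follow-up concrete, a hedged sketch of the __exit__ behaviour discussed above (this is not PostHog's S3MultiPartUpload; the class and its methods are invented for illustration): complete only on a clean exit, and on error do nothing, leaving the complete/abort decision to an outer retry policy.

```python
import asyncio


class MultiPartUpload:
    """Toy stand-in for an S3 multipart upload wrapper."""

    def __init__(self) -> None:
        self.parts: list[bytes] = []

    async def upload_part(self, data: bytes) -> None:
        self.parts.append(data)  # placeholder for the real part upload call

    async def complete(self) -> None:
        print(f"completed upload with {len(self.parts)} part(s)")

    async def __aenter__(self) -> "MultiPartUpload":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None:
            await self.complete()
        # On error we neither complete nor abort here: a retried activity may
        # want to continue the upload, so aborting is left to the caller.
        return False  # never swallow exceptions


async def main() -> None:
    async with MultiPartUpload() as upload:
        await upload.upload_part(b"some data")


asyncio.run(main())
```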

@tomasfarias (Contributor, Author)

> I keep kicking the tests, they are hanging on Temporal starting and other things. May be related to change to use Depot?

Not entirely sure; they've been very flaky in this and other PRs. Eventually they pass, but with the snapshot bot adding commits, all the progress is reset 🙃

@tomasfarias (Contributor, Author)

> Either way, this looks fine. I have some question about style (using vars outside of the context manager that created them). I cut my teeth on Python but I find it to be so, so ugly now. 🙃

Fair point. I think this has to do with our temporary file doing two things: being a file and a writer to a file. The writer itself outlives the context of the file and should be able to report how many records it wrote. I am making this distinction explicit to support more file formats (Parquet for S3, WIP in #20979), as in that case I need different writers to deal with the different formats. That PR should already make things better, as the writer is defined outside the context manager, I think.
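A loose sketch of that file/writer split (names invented here; the real version lives in the batch exports code and #20979): the writer owns records_total, so the count outlives the temporary file's context manager.

```python
import json
import tempfile


class JSONLWriter:
    """Toy writer that counts the records it serializes."""

    def __init__(self, file) -> None:
        self.file = file
        self.records_total = 0

    def write_record(self, record: dict) -> None:
        self.file.write(json.dumps(record) + "\n")
        self.records_total += 1


with tempfile.TemporaryFile(mode="w+") as f:
    writer = JSONLWriter(f)
    for record in ({"event": "pageview"}, {"event": "click"}):
        writer.write_record(record)

# The file is closed, but the writer and its count outlive the with block.
assert writer.records_total == 2
```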

@tomasfarias merged commit 53355af into master on Mar 20, 2024
100 of 101 checks passed
@tomasfarias deleted the feat/start-tracking-records-exported branch on March 20, 2024 at 10:23

sentry-io bot commented Mar 20, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ ActivityError: Activity task failed temporalio.worker._workflow_instance in run_act... View Issue
  • ‼️ ActivityError: Activity task failed temporalio.worker._workflow_instance in run_act... View Issue
  • ‼️ ActivityError: Activity task failed temporalio.worker._workflow_instance in run_act... View Issue
  • ‼️ ActivityError: Activity task failed temporalio.worker._workflow_instance in run_act... View Issue
  • ‼️ InterfaceError: connection already closed django.db.backends.postgresql.base in create_cu... View Issue

Did you find this useful? React with a 👍 or 👎
