refactor: Structlog batch exports logging #18458

Merged: 44 commits merged into master from refactor/batch-exports-logging-new on Nov 13, 2023

Conversation

tomasfarias (Contributor):

Problem

The existing logging implementation for batch exports is a bit clunky:

  • We hardcode the log source to "batch_exports", but it would be nice to distinguish exports from backfills.
  • Logs are not rendered as JSON, so they are not parseable by log aggregators (Loki).
  • Logs repeat themselves: every log line restates the export type.
  • All the code is in the generic batch_exports.py module.
  • The Kafka logging handler is threaded instead of async. An async handler would let us avoid blocking other tasks (mostly heartbeats).

Changes

  • Move over to structlog to handle logging (structlog is already a dependency via the django-structlog requirement, so nothing new here, although I did bump the version, which should be a supported upgrade). This allows us to set context variables (like team_id and destination) once and for all.
  • Moreover, structlog allows us to render logs as JSON, so log aggregators will be happy.
  • Move logger to its own module.
  • Use aiokafka to make kafka producer async.

Overall, the logging system now looks easier to understand and, eventually, extend. The line diff may be deceptive: a lot of it is in docstrings to aid understanding, and more than half is in unit tests (which we didn't have much of before).
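To illustrate the direction, here is a minimal sketch of the configuration and context binding described above. The processor list and the body of configure_logger are illustrative assumptions, not the exact code in this PR; only bind_batch_exports_logger's signature mirrors the new module.

```python
import structlog
from structlog.types import FilteringBoundLogger


def configure_logger() -> None:
    """Render logs as JSON lines so aggregators like Loki can parse them."""
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,  # pick up any bound context vars
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.JSONRenderer(),
        ],
    )


async def bind_batch_exports_logger(team_id: int, destination: str | None = None) -> FilteringBoundLogger:
    """Return a logger with team_id and destination bound once, so every line carries them."""
    # async to match the PR's signature; nothing needs awaiting in this sketch.
    if not structlog.is_configured():
        configure_logger()
    return structlog.get_logger().bind(team_id=team_id, destination=destination)
```

The idea is that each log line comes out as a single JSON object with the bound fields already attached, instead of repeating them in the message text.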


How did you test this code?

Added unit tests.
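For context, structlog ships a capture_logs helper that makes this kind of test straightforward; a hypothetical example (the test name and assertions are illustrative, not copied from the PR):

```python
import structlog
from structlog.testing import capture_logs


def test_logger_binds_team_and_destination():
    """Bound context should show up on every captured log entry."""
    with capture_logs() as cap_logs:
        logger = structlog.get_logger().bind(team_id=1, destination="BigQuery")
        logger.info("exporting batch")

    assert cap_logs == [
        {"team_id": 1, "destination": "BigQuery", "event": "exporting batch", "log_level": "info"}
    ]
```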

tomasfarias changed the title from "Refactor/batch exports logging new" to "refactor: batch exports logging" on Nov 7, 2023
tomasfarias changed the title from "refactor: batch exports logging" to "refactor: Structlog batch exports logging" on Nov 7, 2023
tomasfarias force-pushed the refactor/batch-exports-logging-new branch 21 times, most recently from 1838321 to 288a97c, on November 8, 2023 18:09
@@ -83,7 +83,7 @@ runs:
   - uses: syphar/restore-virtualenv@v1
     id: cache-backend-tests
     with:
-      custom_cache_key_element: v1
+      custom_cache_key_element: v2
tomasfarias (Contributor Author), Nov 9, 2023:

Not sure if this is necessary, but CI was insisting on re-using a virtual environment with an old version of structlog and hogql_parser.


    yield dataset

    bigquery_client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)
tomasfarias (Contributor Author):

Instead of hardcoding the test dataset, I'd rather create a new one and then clean it up.
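For illustration, a create-then-clean-up fixture along these lines could look as follows; the bigquery_client fixture and the dataset naming are assumptions, not the exact fixture in this PR:

```python
import uuid

import pytest
from google.cloud import bigquery


@pytest.fixture
def dataset(bigquery_client: bigquery.Client):
    """Create a throwaway dataset for the test and delete it afterwards."""
    dataset_id = f"test_batch_exports_{uuid.uuid4().hex}"
    dataset = bigquery_client.create_dataset(dataset_id)

    yield dataset

    bigquery_client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)
```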

def producer(self) -> aiokafka.AIOKafkaProducer:
    if self._producer is None:
        self._producer = aiokafka.AIOKafkaProducer(
            bootstrap_servers=settings.KAFKA_HOSTS + ["localhost:9092"],
tomasfarias (Contributor Author), Nov 9, 2023:

Not including extra hosts here was making CI tests fail due to a connection error with kafka 🤷.

Contributor:

Huh, weird, it should be ["kafka:9092"] which works in CI for other Kafka clients.

tomasfarias marked this pull request as ready for review on November 9, 2023 14:23
tomasfarias requested a review from a team on November 9, 2023 14:23
tomasfarias force-pushed the refactor/batch-exports-logging-new branch from 288a97c to 4189919 on November 9, 2023 18:41
tomasfarias (Contributor Author):

Rebased on master to resolve conflicts, will keep an eye out for tests failing.

bretthoerner (Contributor) left a comment:

Looks great. I mostly have some async-related questions and one worry about the Kafka producer dying.

-def flush_to_bigquery():
+async def flush_to_bigquery():
     logger.info(
         "Copying %s records of size %s bytes to BigQuery",
Contributor:

It's interesting to me that this became async, but nothing in it is awaited? Is the new logger async, and if so, is it OK that we don't await? (Sorry, I'm still adjusting to Python async.) I see a similar thing happened in other destination implementations.

tomasfarias (Contributor Author):

Probably an artifact of rebasing; this doesn't need to be async.

tomasfarias (Contributor Author), Nov 10, 2023:

Probably once we migrate BigQuery to async; for now, I'll remove it. (Update: changed my mind and kept it, to do async logging in case any handlers happen to block.)

tomasfarias (Contributor Author), Nov 10, 2023:

The logger is not async: structlog has async methods, but they just wrap the log call in a thread, and you pay the overhead of doing that, which doesn't seem worth it, especially since the queue's put_nowait method returns immediately anyway.

That being said, the queue listener that produces logs to Kafka is async, and will release the event loop while it's waiting for work (logs) to arrive on the queue.
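Roughly, the handoff looks like the sketch below: the logging side enqueues without blocking, and the async listener drains the queue and produces to Kafka. The names (put_to_queue, listen_and_produce, the topic argument) are placeholders, not the merged code:

```python
import asyncio
import json

import aiokafka

queue: asyncio.Queue = asyncio.Queue(maxsize=-1)


def put_to_queue(event_dict: dict) -> None:
    """Called from the sync logging path; put_nowait returns immediately."""
    queue.put_nowait(json.dumps(event_dict).encode("utf-8"))


async def listen_and_produce(producer: aiokafka.AIOKafkaProducer, topic: str) -> None:
    """Release the event loop while waiting for logs, then produce them to Kafka."""
    while True:
        msg = await queue.get()  # yields control until a log line arrives
        await producer.send_and_wait(topic, msg)
        queue.task_done()
```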

tomasfarias (Contributor Author), Nov 10, 2023:

I mean, we might as well use the async log methods. There is some overhead from spawning the thread, like I said, but it means we can release the event loop while emitting logs, and we never know when a handler might block. This just makes it safer, and we are IO-bound anyway, so the thread spawn overhead isn't worth worrying about.

tomasfarias (Contributor Author), Nov 10, 2023:

Actually, I tried this again and remembered why we cannot use the async logging methods (e.g. logger.ainfo, logger.adebug): Temporal runs on its own custom event loop that disallows a lot of thread-related methods (like run_in_executor and to_thread), which the logging library uses.

So, for now, I'm going back to sync logging, and I've removed this await. This shouldn't block any more than the current sync handlers anyway...

tomasfarias (Contributor Author), Nov 10, 2023:

Ah, okay: I did some more digging. The Workflow runs in a custom event loop; however, activities run in the normal asyncio loop. So we can do async logging in activities; I was just seeing failures from using async logging everywhere.

This PR is getting quite long, so if it's okay with you I'll revisit async logging as we move each destination to async. We have to work on unblocking everything for BigQuery anyway, so we might as well add logging to the list.
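(For context on the distinction above: activities run on a regular asyncio event loop, so structlog's thread-backed async methods work there, while workflow code does not allow them. A rough, hypothetical sketch; the activity name is a placeholder:)

```python
import structlog
from temporalio import activity


@activity.defn
async def insert_into_bigquery_activity() -> None:
    # Activities run on the regular asyncio event loop, so structlog's thread-backed
    # async methods (ainfo/adebug) are usable here, unlike in workflow code.
    logger = structlog.get_logger().bind(destination="BigQuery")
    await logger.ainfo("Copying records to BigQuery")
```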

Contributor:

Sounds good to me.

async def bind_batch_exports_logger(team_id: int, destination: str | None = None) -> FilteringBoundLogger:
    """Return a bound logger for BatchExports."""
    if not structlog.is_configured():
        await configure_logger()
Contributor:

Couldn't this be called concurrently while another task is waiting on configure_logger to complete? I wouldn't worry about the race so much normally, but doing things like creating multiple listen_tasks makes me a bit nervous. In JS I'd stash the configure promise so any future caller would await the same instance.

Maybe I'm missing something about Python async though?
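(The Python equivalent of stashing the promise would be caching an asyncio.Task so that every caller awaits the same one; a hypothetical sketch, assuming configure_logger is a coroutine as the snippet above suggests:)

```python
import asyncio

_configure_task: asyncio.Task | None = None


async def ensure_configured() -> None:
    """Run configure_logger() at most once, even with concurrent callers."""
    global _configure_task
    if _configure_task is None:
        # configure_logger is the coroutine from the snippet above.
        _configure_task = asyncio.create_task(configure_logger())
    await _configure_task
```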

Contributor:

Also, I guess leaving these promises dangling is cool in Python? (Related to the other async question I asked about loggers, I guess.)

return (listen_task, worker_shutdown_handler_task)

tomasfarias (Contributor Author), Nov 10, 2023:

Aah, good catch, yeah, we should keep a reference to these tasks as they could be gc'd otherwise. The crab compiler would have probably warned us about this one!

I've added a BACKGROUND_LOGGER_TASKS set to hold references to these tasks. This way, they won't be garbage collected under us.
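For reference, this is the usual asyncio pattern for keeping fire-and-forget tasks alive; a sketch in which only the set name comes from the comment above, while the helper is illustrative:

```python
import asyncio

# Strong references so the tasks can't be garbage collected while still running.
BACKGROUND_LOGGER_TASKS: set[asyncio.Task] = set()


def keep_reference(task: asyncio.Task) -> asyncio.Task:
    """Track a background task and drop the reference once it finishes."""
    BACKGROUND_LOGGER_TASKS.add(task)
    task.add_done_callback(BACKGROUND_LOGGER_TASKS.discard)
    return task
```

The done callback removes the reference when the task completes, so the set doesn't grow forever.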

tomasfarias (Contributor Author):

> Couldn't this be called concurrently while another task is waiting on configure_logger to complete?

Yeah, but configure_logger is actually all sync. I probably got confused because it spawns an async task but isn't async itself. So we can just make this function sync and get rid of any potential race conditions.

posthog/temporal/workflows/logger.py (outdated conversation, resolved)
try:
    while True:
        msg = await self.queue.get()
        await self.produce(msg)
Contributor:

So if the Brokers are unhealthy (or something) and this throws, it seems like our Kafka listen task is forever dead and there is no recovery or shutdown?

And if so, I guess we'd lose logs and slowly OOM via the asyncio.Queue(maxsize=-1)?

I don't think it needs to be super robust or perfect but it does make me a little nervous that there's no real recovery (or I'm missing it!).

tomasfarias (Contributor Author), Nov 10, 2023:

Yeah, fair point. We can do the same as the current approach and catch exceptions, so that, worst case, we at least continue to get from the queue and won't OOM. It means we will miss some logs, but I think we can live with that (at least for now).
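A sketch of what that guard could look like as a method on the queue listener; the exception handling details are assumptions, not the exact merged code:

```python
async def listen(self) -> None:
    """Keep draining the queue even if an individual produce call fails."""
    while True:
        msg = await self.queue.get()
        try:
            await self.produce(msg)
        except Exception:
            # Kafka may be unhealthy; drop this log line rather than letting the
            # listener die and the queue grow without bound. CancelledError is a
            # BaseException, so shutdown still propagates normally.
            pass
        finally:
            self.queue.task_done()
```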

tomasfarias (Contributor Author), Nov 10, 2023:

Perhaps we should also catch any errors when initializing the producer, just in case kafka is unhealthy right as we are coming up.

EDIT: I did just that. Won't put any logs to the queue if the producer fails to start for whatever reason.
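Roughly, that start-up guard amounts to something like this method sketch; the listen_to_queue flag and the method name are assumptions, while AIOKafkaProducer.start() is the call that raises if brokers are unreachable:

```python
async def start(self) -> bool:
    """Try to start the aiokafka producer; if Kafka is down, don't enqueue logs at all."""
    try:
        await self.producer.start()
    except Exception:
        # Brokers unreachable right as we come up: disable log production
        # instead of crashing the worker.
        self.listen_to_queue = False
        return False
    self.listen_to_queue = True
    return True
```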

tomasfarias (Contributor Author):

Thank you for making our code more reliable!

tomasfarias force-pushed the refactor/batch-exports-logging-new branch from 909137e to 98cf769 on November 10, 2023 10:36
tomasfarias force-pushed the refactor/batch-exports-logging-new branch from 7c2d28c to 28ca083 on November 10, 2023 11:01
tomasfarias force-pushed the refactor/batch-exports-logging-new branch from 28ca083 to 403c87d on November 10, 2023 14:53
tomasfarias merged commit a8f6d92 into master on Nov 13, 2023
67 checks passed
tomasfarias deleted the refactor/batch-exports-logging-new branch on November 13, 2023 15:07