
feat: add KafkaBatchPublisher #19

Merged: 12 commits into main, Oct 3, 2024

Conversation

@wtfzambo (Collaborator) commented on Sep 24, 2024:

In this PR

  • KafkaBatchPublisher class as requested
  • Related unit tests
  • Some refactoring of adjacent unit tests
  • Dev package chispa for easy handling of PySpark DataFrame equality assertions (see the example below)
  • Cleanup of pytest logs so they're now readable

[Screenshot: example of chispa output]
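(The screenshot isn't reproduced here. For illustration, a chispa equality assertion in a test typically looks like the sketch below; the spark fixture and the DataFrame contents are made up for the example.)

from chispa import assert_df_equality


def test_publisher_preserves_rows(spark):
    expected = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
    actual = spark.createDataFrame([("b", 2), ("a", 1)], ["key", "value"])

    # On mismatch, chispa prints a readable row-by-row diff instead of a bare AssertionError.
    assert_df_equality(actual, expected, ignore_row_order=True)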

@wtfzambo linked an issue on Sep 24, 2024 that may be closed by this pull request
@wtfzambo changed the title from "feat: add kafka-ui service so I can see what goes on" to "feat: add KafkaBatchPublisher" on Sep 24, 2024
@wtfzambo force-pushed the feature/14-add-kafka-batch-publisher branch from 9c9d7f0 to c87f136 on September 25, 2024 15:44
@wtfzambo marked this pull request as ready for review on September 25, 2024 15:50
pyproject.toml Outdated
@@ -70,7 +71,15 @@ source = [
skip_empty = true

[tool.pytest.ini_options]
addopts = "-vv --tb=auto --disable-warnings"
addopts = "-v --tb=short --disable-warnings -ra --no-header --show-capture=log"
Collaborator (reviewer):
Add a description to this configuration explaining why it's added. Also drop --disable-warnings.

@wtfzambo (author):
Sure. I'm doing all this logging-related stuff because I use the terminal to run pytest. With default settings, the output is too cluttered with unnecessary information, making it painful to read.

  • -vv produced overly verbose output, with lots of information that isn't needed just for running tests.
  • --tb=short makes tracebacks for failed tests more concise: it prints the failing line plus its immediate context instead of the whole function.
  • -ra prints a recap at the end of the output; the a stands for all outcomes except passed.
  • --no-header removes the header block at the start of the output, which carries little useful information.
  • --show-capture=log: by default pytest captures everything. Usually that's not an issue, until we work with libraries (Spark) that produce a lot of output we don't need for unit testing; all of it gets captured and printed. On top of that, logging written by us gets printed twice, once in the captured stderr/stdout and once in the captured logs. The end result is a very cluttered pytest output. Capturing only logs keeps it much cleaner without losing useful information (especially our own logging).

"""Initialize the KafkaBatchPublisher object.

Args:
    kafka_options (dict[str, Any]): Kafka options for the connection.
Collaborator (reviewer):
Regarding your message on Slack @wtfzambo, I completely agree with you: this kind of configuration contradicts the whole purpose of with_config. When we introduced it in the previous iteration, we were taking the kafka_options as arguments in the constructor. If that's not necessary, why not just take the KafkaConfig type here as the input to initialize the class and drop with_config altogether?

@wtfzambo (author):
Admittedly I still have limited visibility here, but I'd also drop the with_config method.

Especially if KafkaReader and KafkaWriter are supposed to be made public to the user.

The only counter-argument I can think of is that passing the whole Config object is a lot more verbose than the current constructor, which takes just the bare minimum of values needed to work with Kafka.
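For illustration, a minimal sketch of what initializing the publisher straight from a config object could look like; the KafkaConfig fields and the to_options helper below are assumptions, not the actual sparkle API:

from dataclasses import dataclass
from typing import Any


@dataclass
class KafkaConfig:
    # Hypothetical shape; the real KafkaConfig in sparkle may carry more fields.
    bootstrap_servers: str
    security_protocol: str = "PLAINTEXT"

    def to_options(self) -> dict[str, Any]:
        # Translate the config into the option map Spark's Kafka sink expects.
        return {
            "kafka.bootstrap.servers": self.bootstrap_servers,
            "kafka.security.protocol": self.security_protocol,
        }


class KafkaBatchPublisher:
    def __init__(self, config: KafkaConfig, kafka_topic: str) -> None:
        # Taking the config directly removes the need for a separate with_config step.
        self.kafka_options = config.to_options()
        self.kafka_topic = kafka_topic

The trade-off mentioned above still applies: the caller has to build a full config object even when only a couple of values are actually needed.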

src/sparkle/writer/kafka_writer.py (outdated, resolved)
(
    kafka_df.write.format("kafka")
    .options(**self.kafka_options)
    .option("topic", self.kafka_topic)
Collaborator (reviewer):
Does this mean we need to initialize one instance of this writer every time we want to write to a topic? I think it's more flexible to take the topic name when we invoke write; that way we can also handle multiple topics and just iterate if needed.

@wtfzambo (author):
This is the same as in the other writer I copied from, and it was the other observation I had. I think I spoke to you about this on Slack.

I said something like "the topic is set during instantiation, so for each different topic we have to create a different instance of the writer class?"

I kept it like this in the batch writer because it's not my design and I don't know where we're going yet. I don't like to change others' designs until I know exactly what the route is.
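As a rough sketch of the reviewer's suggestion (method and parameter names here are illustrative, not the merged API), moving the topic to the publish call lets one instance serve several topics:

from pyspark.sql import DataFrame


class KafkaBatchPublisher:
    def __init__(self, kafka_options: dict[str, str]) -> None:
        self.kafka_options = kafka_options

    def publish(self, kafka_df: DataFrame, topic: str) -> None:
        # kafka_df is expected to already carry the key/value columns the Kafka sink requires.
        (
            kafka_df.write.format("kafka")
            .options(**self.kafka_options)
            .option("topic", topic)  # topic chosen per call, not per instance
            .save()
        )


# One publisher can then fan out to multiple topics:
# publisher = KafkaBatchPublisher({"kafka.bootstrap.servers": "localhost:9092"})
# for topic in ("orders", "payments"):
#     publisher.publish(kafka_df, topic)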

tests/conftest.py (outdated, resolved)
def cleanup_logging_handlers():
    """Fixture to cleanup logging handlers after tests.

    Prevents logging errors at the end of the report.
Collaborator (reviewer):
Is it causing any harm? If not, I'd suggest not implementing it like this, as it just silences any possible error in the handlers and might introduce unwanted behavior with future updates.

@wtfzambo (author):
Is it causing any harm?
To my eyes, yes. It's also related to pytest logs being cluttered.

This is a long-standing known issue of pytest and a sufficiently deep rabbit hole: pytest-dev/pytest#5502 (comment).

There are some situations/tests (pardon the imprecision; I delved into this too many years ago) where pytest attempts to write to closed streams AFTER the session is over.

This results in error messages being printed at the very end of the pytest recap, which provide no useful information and clutter the output.

This fixture prevents that from happening, and since its scope is session it shouldn't tamper with anything test-related.
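A minimal sketch of what such a session-scoped fixture can look like (the exact implementation in this PR may differ):

import logging

import pytest


@pytest.fixture(autouse=True, scope="session")
def cleanup_logging_handlers():
    """Close logging handlers once the whole test session is over.

    Avoids the "I/O operation on closed file" noise pytest can print when
    handlers still reference captured streams after the session ends.
    """
    yield
    # Runs after the last test: detach and close every handler on the root logger.
    root = logging.getLogger()
    for handler in list(root.handlers):
        root.removeHandler(handler)
        handler.close()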

def log_spark_dataframe(df: DataFrame, *, truncate: bool = False, name: str = "") -> None:
    """Logs the contents of a Spark DataFrame in tabular format.

    Useful when Pytest is configured to capture only logs, so `df.show()` won't work.
Collaborator (reviewer):
Isn't it better to configure Pytest to not only capture logs?

@wtfzambo (author):
Usually yes, but see the explanation above.
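The helper's body isn't shown in this excerpt; one way it can be implemented is to render the same table df.show() would print and hand it to a logger. This sketch leans on PySpark's internal showString, which show() itself delegates to, so treat it as an assumption about the actual code:

import logging

from pyspark.sql import DataFrame

logger = logging.getLogger(__name__)


def log_spark_dataframe(df: DataFrame, *, truncate: bool = False, name: str = "") -> None:
    # df.show() writes to stdout, which is discarded when pytest captures only logs,
    # so render the table to a string and emit it through the logging system instead.
    num_rows = df.count()
    # showString(numRows, truncate, vertical) backs df.show(); 20 mirrors show()'s
    # default column truncation and 0 disables truncation entirely.
    table = df._jdf.showString(num_rows, 20 if truncate else 0, False)
    logger.info("DataFrame %s:\n%s", name or "<unnamed>", table)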

Federico Zambelli and others added 12 commits October 3, 2024 10:22
Also:
- reshaped some data so that type checker stops complaining
Chispa doesn't have stubs or a py.typed marker, and mypy throws an error for that. This seems unnecessarily strict for a package that is just used for comparing DataFrames in unit testing.

This commit disables the behavior for this specific package.
Co-authored-by: Reza Khanipour <[email protected]>
@wtfzambo force-pushed the feature/14-add-kafka-batch-publisher branch from 750c607 to f992d87 on October 3, 2024 08:49
@farbodahm merged commit 91901f7 into main on Oct 3, 2024 (2 checks passed)
@farbodahm deleted the feature/14-add-kafka-batch-publisher branch on October 3, 2024 09:23
Successfully merging this pull request may close: Add Kafka batch publisher