-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add KafkaBatchPublisher #19
Conversation
9c9d7f0
to
c87f136
Compare
pyproject.toml
Outdated
@@ -70,7 +71,15 @@ source = [ | |||
skip_empty = true | |||
|
|||
[tool.pytest.ini_options] | |||
addopts = "-vv --tb=auto --disable-warnings" | |||
addopts = "-v --tb=short --disable-warnings -ra --no-header --show-capture=log" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a description to this configuration and why it's added. Also drop --disable-warnings
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I'm doing all this logging-related stuff because I use the terminal to run pytest. With default settings, the output is too cluttered with unnecessary information, making it painful to read.
- vv was producing output too verbose with lots of unneeded information for testing purposes only
- tb=short makes tracebacks for failed tests more concise: prints the failing line + related context instead of the whole damn function
- ra prints a Recap at the end of the output.
a
stands for all tests except passed ones. - no-header reduces verbosity at the beginning of the output with little helpful info
- show-capture=log This is because pytest, by default, captures EVERYTHING. Usually, this is not an issue until we work with libraries (spark) that produce A LOT of output which often we don't need for unit testing. This is captured and printed. On top of that, logging written by us gets printed twice, once in the captured
stderr/out
, once in thecaptured logs
. The end result is that pytest output is very cluttered. By capturinglog
only, it's much cleaner without missing out useful info (especially logging of our own).
"""Initialize the KafkaBatchPublisher object. | ||
|
||
Args: | ||
kafka_options (dict[str, Any]): Kafka options for the connection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With regards to your communication in Slack @wtfzambo I completely agree with you, this kind of configuration, is contradicting the whole purpose of with_config
. So initially when we introduced it in the previous iteration, was we were taking in the kafka_options, as an argument list in the construct. If that's not necessary, why not just take in the KafkaConfig type here as the input to initialize the construct and drop with_config
altogether?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Admittedly, I still have limited vision, but I'd also drop the with_config
method.
Especially if KafkaReader
and KafkaWriter
are supposed to be made public to the user.
The only counter-aspect that I can think of is that passing the whole Config
object is a lot more verbose than the current constructor method, which takes just the bare minimum of values to work with Kafka.
( | ||
kafka_df.write.format("kafka") | ||
.options(**self.kafka_options) | ||
.option("topic", self.kafka_topic) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean we need to initialize one instance of this writer everytime we want to write to a topic? I think it's more flexible to get the topic name when we want to invoke write
, that way, we can also get multiple topics, and just iterate if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same in the other writer I copied from, and the other observation I had. I think I spoke to you about this on slack.
I said something like "the topic is set during the istantiation, so for each different topic we gotta create a different instance of the writer class?"
I kept it like this in the batch writer because it's not my design and IDK where we're going yet. I don't like to change other's designs until I know exactly what the route is.
def cleanup_logging_handlers(): | ||
"""Fixture to cleanup logging handlers after tests. | ||
|
||
Prevents logging errors at the end of the report. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it causing any harm? If not I'd suggest to not implement it like this as it's just silencing any possible error in handlers, and with future updates might introduce unwanted behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it causing any harm?
To my eyes, yes. It's also related to pytest logs being cluttered.
This is a long standing known issue of pytest and a sufficiently deep rabbit hole: pytest-dev/pytest#5502 (comment).
There are some situations/tests (pardon the imprecision; I delved into this too many years ago) where pytest attempts to write to closed streams AFTER the session is over.
This results in error messages being printed at the very end of the pytest recap, which provide no useful information and clutter the output.
This fixture prevents this from happening, since it's scope is session
it shouldn't tamper with anything test related.
def log_spark_dataframe(df: DataFrame, *, truncate: bool = False, name: str = "") -> None: | ||
"""Logs the contents of a Spark DataFrame in tabular format. | ||
|
||
Useful when Pytest is configured to capture only logs, so `df.show()` won't work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it better to configure Pytest
to not only capture logs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually yes, but see explanation above
Also: - reshaped some data so that type checker stops complaining
See: https://github.com/MrPowers/chispa Also: - cleanup pytest log output
Chispa doesn't have stubs or py.typed marker. Mypy throws error for that. This seems unnecessarily strict for a package that is just used for comparing dataframes in unit testing. This commit disables the behavior for this specific package.
Co-authored-by: Reza Khanipour <[email protected]>
Co-authored-by: Reza Khanipour <[email protected]>
750c607
to
f992d87
Compare
In this PR
KafkaBatchPublisher
class as requestedchispa output example