Automated data diffs as part of CI validation #2537

rousik · 2023-04-19T21:12:38Z

rousik
Apr 19, 2023
Collaborator

Any change to the pudl codebase may affect the outputs generated by the pipeline. Such a change can be intentional (e.g. modifying a transform process, fixing bugs, adding new datasets) or erroneous (e.g. a refactoring that introduces an unwanted or unintended data change).

Many changes (especially refactorings) should be no-ops with respect to the resulting data. To ensure that this is so, it would be useful to automate output data diffing. The proposal is this:

Whenever a PR is opened against one of the recognized target branches (e.g. dev, main), the ETL is run with some configuration.
If the ETL succeeds, the outputs are stored in public data buckets (perhaps with a limited TTL).
A data diff process compares these outputs with the outputs obtained by running the code on the target branch.
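
To make the comparison step concrete, here is a minimal sketch of a coarse, file-level diff. It assumes (this is not part of the proposal) that both sets of outputs have already been downloaded into two local directories and that the ETL writes byte-for-byte deterministic files; the names `snapshot`, `diff_outputs`, and `DataDiff` are hypothetical. A real implementation would more likely compare tables row by row.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its content digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


@dataclass
class DataDiff:
    """Hypothetical container for the result of comparing two output sets."""
    added: list[str] = field(default_factory=list)    # only in the PR outputs
    removed: list[str] = field(default_factory=list)  # only in the target-branch outputs
    changed: list[str] = field(default_factory=list)  # in both, but contents differ

    @property
    def is_empty(self) -> bool:
        return not (self.added or self.removed or self.changed)


def diff_outputs(base_dir: Path, pr_dir: Path) -> DataDiff:
    """Compare target-branch outputs (base_dir) with PR outputs (pr_dir)."""
    base, pr = snapshot(base_dir), snapshot(pr_dir)
    return DataDiff(
        added=sorted(pr.keys() - base.keys()),
        removed=sorted(base.keys() - pr.keys()),
        changed=sorted(k for k in base.keys() & pr.keys() if base[k] != pr[k]),
    )
```

A checksum comparison only says that a file changed, not how. It could serve as a cheap first pass before a more detailed, table-aware diff is run on the files it flags.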

If there are no data diffs, there is no problem. If there are data diffs and the PR is explicitly marked as "expecting data changes" (e.g. by setting a specific label on the PR), the test can still pass. If the PR introduces unexpected data changes, the results of the data diff can be added to the PR comments automatically so that the author can analyze and fix the problems.
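
The pass/fail rule described above could be sketched roughly as follows. The label name `expect-data-changes`, the `CheckResult` type, and the function name are all hypothetical. The sketch also assumes two things the proposal does not spell out: that an unlabeled PR with data diffs fails the check, and that no comment is posted when the changes are expected.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical label name; the proposal only says "a specific label on the PR".
EXPECTED_CHANGES_LABEL = "expect-data-changes"


@dataclass(frozen=True)
class CheckResult:
    passed: bool
    comment: Optional[str]  # text to post on the PR, or None if nothing to post


def evaluate_data_diff(changed_tables: List[str], pr_labels: Set[str]) -> CheckResult:
    """Decide whether the data-diff check passes for a PR.

    changed_tables: names of outputs that differ from the target branch.
    pr_labels: labels currently set on the PR.
    """
    # No diffs: nothing to report.
    if not changed_tables:
        return CheckResult(passed=True, comment=None)

    # Diffs exist, but the author has declared them intentional.
    if EXPECTED_CHANGES_LABEL in pr_labels:
        return CheckResult(passed=True, comment=None)

    # Unexpected diffs: fail (an assumption) and summarize them for the author.
    lines = ["Unexpected data changes detected in:"]
    lines += [f"- {name}" for name in sorted(changed_tables)]
    lines.append(f"Add the '{EXPECTED_CHANGES_LABEL}' label if these are intended.")
    return CheckResult(passed=False, comment="\n".join(lines))
```

Keeping the decision in a pure function, separate from whatever fetches labels and posts comments, makes the rule easy to unit test without touching the CI system.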

rousik · 2023-04-19T21:12:57Z

rousik
Apr 19, 2023
Collaborator Author

See https://catalystcooperative.slack.com/archives/C02BSMJJVR8/p1681746492529679 for the Slack conversation.
