
Generate avro files from bq tables #507

Merged: 13 commits from avro-frontfill into master on Oct 8, 2024
Conversation

chowbao (Contributor) commented Oct 7, 2024

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with the Jira ticket associated with the PR.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated the README with any added features, breaking changes, and new instructions for using the repository.

What

Adds a DAG to generate hourly AVRO files from BQ tables.
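For illustration, a BigQuery EXPORT DATA statement is one way to express such an hourly AVRO export (a sketch only: the operator the DAG actually uses isn't shown in this thread, and the bucket path and table choice below are hypothetical):

EXPORT DATA OPTIONS (
  uri = 'gs://<bucket>/history_ledgers/{batch_run_date}/*.avro',  -- hypothetical path
  format = 'AVRO',
  overwrite = true
) AS
select *
from `{project_id}.{dataset_id}.history_ledgers`
where closed_at >= '{batch_run_date}'
  and closed_at < '{next_batch_run_date}';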

Why

We needed a way to generate frontfill AVRO files for external analytics partnerships.

Known limitations

  • Opted for individual table.sql files instead of generating the SQL programmatically through Airflow/Python, even though this duplicates SQL. The main reason is that it is unknown what kind of query logic will need to be propagated to the queries that generate the AVRO files. For example, history_effects needs to (1) exclude columns, (2) flatten a column, and (3) exclude a column from the flattened columns (see the sketch after this list). This seems easier to manage at the individual table SQL level than through function parameters in Airflow.
  • Did not include a monthly time window backfill DAG.
  • history_effects and history_operations are currently not run, pending confirmation that the tables work for external partners.
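A sketch of the history_effects case described above, assembled from the excerpts quoted later in this review (the templated parameters and the where clause follow the DAG's conventions; treat this as illustrative, not the exact file contents):

select
  * except (details, details_json, batch_id, batch_insert_ts, batch_run_date),
  details.* except (claimants, type),
  details.type as details_type
from `{project_id}.{dataset_id}.history_effects`
where batch_run_date >= '{batch_run_date}'
  and batch_run_date < '{next_batch_run_date}'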

@chowbao requested a review from a team as a code owner on October 7, 2024.
from datetime import datetime

from airflow import DAG

# get_default_dag_args is a shared helper defined elsewhere in this repo
dag = DAG(
    "generate_avro_files",
    default_args=get_default_dag_args(),
    start_date=datetime(2024, 10, 1, 0, 0),
)
chowbao (author):
AVRO files for < 10/1/24 are already generated

chowbao (author) commented Oct 7, 2024:
Hmm, we can. I wonder if that would be confusing for Dune, though: 10/1 wouldn't be the full month, but instead there would be 10/1 and then another folder for 10/8, 10/9, 10/10, etc.

> Any concerns about catching the last 7 days up?

Nope.

sydneynotthecity (Contributor) left a comment:

Nit: I noticed the SQL files are not filtering based on the partition key. These tables aren't partitioned by closed_at yet. Can you check the difference in processing bytes/execution time when excluding the partition key batch_run_date? It might be negligible, but I have concerns about larger tables, like history_transactions and trust_lines.

Looks good overall; I just want to confirm the queries are performant before we run them hourly 😅

except(details, details_json, batch_id, batch_insert_ts, batch_run_date),
details.*
except(claimants, type),
details.type as details_type
sydneynotthecity (Contributor):

Is this soroban_operation_type? Do you think we need to rename this column to be clearer? It might be worth discussing with Andre.

chowbao (author) commented Oct 7, 2024:
> Is this soroban_operation_type?

Yes, it is.

> Do you think we need to rename this column to be clearer?

Yeah, we can. I'll regenerate the files and mention it to Andre.
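If the rename goes through, the projection would change along these lines (a sketch, assuming the column name suggested above is the one adopted):

details.type as soroban_operation_type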

Comment on lines 14 to 15
and closed_at >= '{batch_run_date}'
and closed_at < '{next_batch_run_date}'
sydneynotthecity (Contributor):

closed_at is not on the table; I don't think this will run.

chowbao (author):
Updated to ledger_closed_at
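For reference, the corrected filter would read (same templated parameters as the original lines):

and ledger_closed_at >= '{batch_run_date}'
and ledger_closed_at < '{next_batch_run_date}'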

"project_id": project,
"dataset_id": dataset,
"batch_run_date": batch_run_date,
"prev_batch_run_date": prev_batch_run_date,
sydneynotthecity (Contributor):

Is this used in the SQL query? I didn't see the parameter anywhere in the SQL files.

chowbao (author):

prev_batch_run_date is not; I'll remove it.

chowbao (author) commented Oct 7, 2024:

> Nit: I noticed the SQL files are not filtering based on the partition key.

Ahh, I can add a batch_run_date filter as well.

> Can you check the difference in processing bytes/execution time by excluding the partition key batch_run_date?

Opting not to check this, because it makes sense to use the partition column regardless.
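A sketch of the combined predicate (assuming batch_run_date is the partition column and ledger_closed_at bounds the hourly window, as discussed above):

where batch_run_date >= '{batch_run_date}'
  and batch_run_date < '{next_batch_run_date}'
  and ledger_closed_at >= '{batch_run_date}'
  and ledger_closed_at < '{next_batch_run_date}'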

@chowbao merged commit 16be143 into master on Oct 8, 2024. 2 checks passed.
@sydneynotthecity deleted the avro-frontfill branch on November 21, 2024.