Generate avro files from bq tables #507
Conversation
dags/generate_avro_files_dag.py
Outdated
dag = DAG(
    "generate_avro_files",
    default_args=get_default_dag_args(),
    start_date=datetime(2024, 10, 1, 0, 0),
AVRO files for < 10/1/24 are already generated
Hmm, we can. I wonder if that would be confusing for Dune though. 10/1 wouldn't be the full month; instead there would be 10/1 and then another folder for 10/8, 10/9, 10/10, etc.
Any concerns about catching up the last 7 days?
Nope
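Since this thread turns on backfill behavior, here is a minimal sketch of how start_date and catchup interact, assuming Airflow 2.4+ (the DAG id and placeholder task are illustrative, not the PR's actual code): with catchup=True, Airflow schedules one run per interval between start_date and now, which is what would catch the last 7 days up.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="generate_avro_files_sketch",  # illustrative name
    start_date=datetime(2024, 10, 1),
    schedule="@hourly",
    catchup=True,  # backfill one run per missed interval since start_date
) as dag:
    # stands in for the real export tasks
    placeholder = EmptyOperator(task_id="export_avro")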
Nit: I noticed the SQL files are not filtering based on the partition key. These tables aren't partitioned by closed_at yet. Can you check the difference in processed bytes/execution time from excluding the partition key batch_run_date? It might be negligible, but I have concerns about larger tables, like history_transactions and trust_lines.
Looks good overall, just want to confirm the queries are performant before we run them hourly :sweat_smile:
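One way to answer the performance question above, as a sketch assuming the google-cloud-bigquery client (table and column names are illustrative): dry-run both variants of a query and compare the bytes each would scan.

from google.cloud import bigquery

client = bigquery.Client()

with_partition_filter = """
    select * from `project.dataset.history_transactions`
    where batch_run_date = '2024-10-01'  -- partition key, enables pruning
      and ledger_closed_at >= '2024-10-01'
"""
without_partition_filter = """
    select * from `project.dataset.history_transactions`
    where ledger_closed_at >= '2024-10-01'  -- no pruning, scans all partitions
"""

for label, sql in [("with partition filter", with_partition_filter),
                   ("without partition filter", without_partition_filter)]:
    job = client.query(
        sql,
        job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
    )
    # dry-run jobs return immediately and report the bytes a real run would scan
    print(f"{label}: {job.total_bytes_processed:,} bytes")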
except(details, details_json, batch_id, batch_insert_ts, batch_run_date),
details.*
except(claimants, type),
details.type as details_type
Is this soroban_operation_type? Do you think we need to rename this column to be clearer? Might be worth discussing with Andre.
> is this soroban_operation_type?

Yes it is.

> Do you think we need to rename this column to be clearer?

Yeah, we can. I'll regenerate the files and mention it to Andre.
and closed_at >= '{batch_run_date}'
and closed_at < '{next_batch_run_date}'
closed_at is not on the table; I don't think this will run.
Updated to ledger_closed_at
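For reference, a sketch of the corrected filter once ledger_closed_at replaces the non-existent closed_at; including the partition column batch_run_date alongside it is an assumption here, prompted by the partition-key nit above, not something the PR necessarily does:

where_clause = """
    where batch_run_date = '{batch_run_date}'
      and ledger_closed_at >= '{batch_run_date}'
      and ledger_closed_at < '{next_batch_run_date}'
"""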
"project_id": project, | ||
"dataset_id": dataset, | ||
"batch_run_date": batch_run_date, | ||
"prev_batch_run_date": prev_batch_run_date, |
Is this used in the SQL query? I didn't see the parameter anywhere in the SQLs
prev_batch_run_date is not. I'll remove it.
Ahh I can add
Opting to not check this because it makes sense to use the partition column.
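A quick way to catch unused parameters like the one found above, sketched with illustrative names: collect the {placeholder} names from each SQL template and diff them against the params dict.

import re

params = {
    "project_id": "my-project",            # illustrative values
    "dataset_id": "crypto_stellar",
    "batch_run_date": "2024-10-01T00:00:00",
    "prev_batch_run_date": "2024-09-30T23:00:00",  # unused -> gets flagged
}

sql_template = """
    select * from `{project_id}.{dataset_id}.history_trades`
    where ledger_closed_at >= '{batch_run_date}'
"""

placeholders = set(re.findall(r"{(\w+)}", sql_template))
unused = set(params) - placeholders
if unused:
    print(f"params never referenced in the SQL: {sorted(unused)}")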
PR Checklist
PR Structure
Thoroughness
What
Adds a DAG to generate hourly AVRO files from BQ tables.
Why
Needed a way to generate frontfill AVRO files for external analytics partnerships
Known limitations
history_effects and history_operations are currently not run until confirmation that the tables work for external partners.
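For readers unfamiliar with the mechanism: based on the hunks above, each generated SQL file plausibly wraps an hourly select in a BigQuery EXPORT DATA statement that writes Avro to GCS. This is a hedged sketch, not the PR's actual SQL; the bucket, table, and column names are illustrative.

export_sql = """
export data options (
    uri = 'gs://my-bucket/avro/history_trades/{batch_run_date}/*.avro',
    format = 'AVRO',
    overwrite = true
) as
select *
    except(batch_id, batch_insert_ts, batch_run_date)
from `{project_id}.{dataset_id}.history_trades`
where batch_run_date = '{batch_run_date}'
    and ledger_closed_at >= '{batch_run_date}'
    and ledger_closed_at < '{next_batch_run_date}'
"""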