Add AWS Athena plugin #133
Conversation
Assumes use of the https://github.com/Tomme/dbt-athena adapter

Considerations:

- `query-comment:` in `dbt_project.yml` must be set to nothing (see the sketch below this list)
  - Could be set to some other custom value
  - Standard block comment in dbt `/*...*/` is not supported in Athena
    - See getredash/redash#2399
- JSON in `./integration_tests/public_data/json/**/*.json` modified
  - Otherwise, fails integration test check for equal values
  - See https://github.com/rcongiu/Hive-JSON-Serde#json-data-files
- Piggybacks on the redshift macros for non-hive-compatible partitions
  - uses redshift helper macros because the query syntax is the same
  - we use a similar spec setup as Redshift for this reason

New env variables need to be set for CI:

- ATHENA_TEST_DBNAME
  - this is an AWS Glue "Data Source"
  - the schema is set to dbt_external_tables_integration_tests_athena
    - hardcoded like the other schemas in sample.profiles.yml
    - this is an AWS Glue "Database"
- AWS_REGION
- ATHENA_TEST_BUCKET
- ATHENA_TEST_WORKGROUP
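To make the first consideration concrete, here is a minimal sketch of the `query-comment` workaround in `dbt_project.yml` (illustrative only; the alternative value shown is made up):

```yaml
# dbt_project.yml (illustrative sketch)

# Leave the query comment unset, because Athena DDL rejects the standard
# /* ... */ block comment that dbt would otherwise prepend to each query.
query-comment:

# Alternatively, per the description above, some other custom value:
# query-comment: "run by dbt"
```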
ci/circleci: integration-athena will fail as long as the environment variables are not set and the infra is not set up in the AWS account that CircleCI is running in.
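For context, here is a sketch of how those environment variables would typically be consumed in `sample.profiles.yml` with the dbt-athena adapter. It follows the adapter's documented profile keys and is not necessarily the exact file in this PR:

```yaml
# sample.profiles.yml (illustrative sketch, not the exact file in this PR)
athena:
  target: default
  outputs:
    default:
      type: athena
      region_name: "{{ env_var('AWS_REGION') }}"
      database: "{{ env_var('ATHENA_TEST_DBNAME') }}"        # the AWS Glue "Data Source"
      schema: dbt_external_tables_integration_tests_athena   # hardcoded, like the other test schemas
      s3_staging_dir: "s3://{{ env_var('ATHENA_TEST_BUCKET') }}/"
      work_group: "{{ env_var('ATHENA_TEST_WORKGROUP') }}"
```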
I hope this is a wanted contribution to the code! TBH I did this for myself and then got inspired to contribute. I'm 💯 OK with running my own modified local dbt package, so no hard feelings if this is out of scope and closed. Cheers 🍾
@daniel-cortez-stevenson This is really cool! I want to spend some time actually playing around with this (first step, spin up Athena in this sandbox) — but just to say, I really appreciate the work you've put in, and your desire to contribute it back for the benefit of other community members.
# Is there a better way around this?
query-comment:
Yes, this is something the adapter can reimplement! That's where the logic lives, with just this sort of reason in mind. The base implementation is defined on the core adapter interface: `_QueryComment.add()` in `core.dbt.adapters.base.query_headers`.

The right place for this is probably a new issue in the dbt-athena adapter.
nice, thanks for the pointer! I've opened up a PR in the dbt-athena adapter repo.
@daniel-cortez-stevenson Could you please add a link to your PR? I cannot find it and need inspiration for how to solve a similar issue
nvm. found it: Tomme/dbt-athena#64
What do you think of it?
@@ -0,0 +1,61 @@
{% macro athena__refresh_external_table(source_node) %}
(After glancing quickly over many lines of code that look vaguely familiar, like an old dream): How much overlap is there between the logic here and the Redshift macros? Would it make sense to abstract the shared logic into a set of common macros, called by both?
A ton of overlap! I was torn between being DRY and avoiding creating a dependency between the Athena and Redshift plugins, which may not have such dependencies on the AWS side (i.e. Athena and Redshift syntax could diverge). I ended up not following either option 100%.

I think either option is fine. If you decide which option is preferred, I'll make the changes in the code to follow!

Note: the "avoiding shared dependencies" idea came from common guidance to avoid shared dependencies in microservices, and my own experience writing ETLs (I rewrote a lot of ETLs to not have shared code, because data and transformations can change between ETLs).
I am willing to update the athena adapter and use this package, but I am getting this error:

    An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: 1 validation error detected at 'queryString' failed to satisfy constraint: Member must have length less than or equal to 262144

The maximum query string length in Athena (262,144 bytes) is not an adjustable quota, so the queries need to be split before reaching this limit.
Athena DDL does not always respect `/* ... */` block comments. This is the case with (at least) `create external table ...` statements. Currently, if you try to run `create external table ...` statements, the following error is returned: `pyathena.error.OperationalError: FAILED: ParseException line 1:0 cannot recognize input near '/' '*' '{'`

Here I monkey-patched `_QueryComment.add` to prepend `-- ` to the `query_comment` and replace any newlines with a space. This allows the default `query_comment` to be added to `create external table` statements. We might want to do this so that dbt-labs/dbt-external-tables can add Athena support as proposed in dbt-labs/dbt-external-tables/pull/133.

Without this modification, the only way to run `dbt run-operation stage_external_sources --vars "ext_full_refresh: true"` without an error is to set `query_comment:` to nothing, a specific value, or a macro.
@jtcohen6 now that we've resolved the issue with the query comment in the dbt-athena adapter, do we want to move forward with this PR? The steps might be:
I have copied the macros in this PR as overrides for the dbt-external-tables plugin to be able to read CloudTrail data, and the macros worked great with Athena. Any idea how this PR should proceed to get merged in?
@daniel-cortez-stevenson @jtcohen6 is there any plan to move this PR forward? It's a pretty interesting feature that will be important to have.
I've been having a play with this PR content (against the dbt-athena-community adapter, which is the recommended one afaik now) - I'm having difficulties with later versions of dbt-athena-community. How are the dependencies on the adapters handled? It seems like a change could be unwittingly introduced in an adapter that could break this build and consumers of this package. (Aside: if I get it working against latest to my satisfaction, I'll raise a new PR, assuming this one doesn't get sorted in the meantime.)
Would love to see this merged - right now it seems there is no good solution for how to manage external tables in Athena.
What is left to do with this PR? Can I help complete it?
It seems that this won't be merged anytime soon. Does anyone have a working solution with the latest dbt-athena version? I'm looking for a way to define source Athena tables and LF-tag (Lake Formation) permissions.
Yeah, too bad that this PR is stuck. I am using external tables in my day-to-day work. I've installed this package:

    - package: dbt-labs/dbt_external_tables
      version: 0.8.2

Then added these lines to my dispatch config and run hooks:

    dispatch:
      - macro_namespace: dbt_external_tables
        search_order: [my_custom_dbt_macros, dbt_external_tables]

    # Stage the external tables when dbt starts running, using https://github.com/dbt-labs/dbt-external-tables
    on-run-start:
      - "{{ dbt_external_tables.stage_external_sources() }}"

And created a private git repo with the macros from this PR, which I installed in `packages.yml`. That package has a `dbt_project.yml` with:

    name: 'my_custom_dbt_macros'
    version: '1.0.0'
    config-version: 2
    require-dbt-version: ">=1.0.0"

And the files from this PR in a macros folder.
Thanks @jessedobbelaere! I'd rather have had this merged, but in the meantime, if anyone needs this, I packaged the solution above at https://github.com/rstml/dbt-external-tables-athena
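For anyone who wants to try that package, installing it from git in `packages.yml` looks roughly like this (the `revision` value is a placeholder; pin whatever tag or commit you have reviewed):

```yaml
packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.2
  - git: "https://github.com/rstml/dbt-external-tables-athena.git"
    revision: main  # placeholder - pin a specific tag or commit
```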
@rstml nice one, feel free to post this also in #db-athena in the dbt Slack workspace, I believe that some folks will be interested.
I imagine adding Athena support could be out of scope for this repository. @jtcohen6 can you confirm whether or not this feature is desirable? If yes, I'd get it merge-ready.
There is actually another PR, #203; I'm not sure what the difference is. Also, looking at the supported engines, I see that they are all verified adapters. dbt-athena-community is now trusted; I'm not sure if this is a criterion used to merge such an integration.
#203 looks very similar, but it also changes the redshift helper for some reason: https://github.com/dbt-labs/dbt-external-tables/pull/203/files#diff-462105eddcb198df9f7384b4bef97c092c55e93924896866db36c41373a811dc
Frankly, this is how I initially felt reading the docs. However, there's great demand for something like this. Maybe initially this wasn't the intended use for this extension, but why not extend it to encompass all those databases/use cases? I noticed there's a PR to add Trino support too: #153
@nicor88 the original PR was based on the original athena adapter, not the community one. When I wanted to use Athena with external tables last year, I saw the original was already stalled for a year, so I used it as a basis and made minimal changes to create the #203 PR based on dbt-athena-community - I asked for advice and a merge with no response, and we've been using it at my client off my GitHub PR branch since last summer.

@rstml I can't remember why I had to change the redshift helper now, but I suspect it was because there was shared code between the two (in the original PR that I started from) and I had to make it deal with the seriously funky quoting behaviour of Athena.

I think that now dbt-athena-community probably supports most of what's needed internally, and patching into dbt-external-tables could be neater than either current PR - but I gave up putting any thought into it after being ignored for so long 🤷

It would be great, one way or another, to properly support external tables in Athena!
@nicor88 FYI, a bit of a warning on an issue I've had with my PR #203: dbt-athena-community will go and try to delete the data in S3 backing a table, with no way I can see to disable this behaviour from dbt-external-tables. If you have laid an external table over some data in S3, then change from an external table to a normal model, it will try to delete the data underneath as part of the table drop. This is inappropriate and either results in breaking the dbt run with permissions errors or potentially unrecoverable loss of data.
@brabster if those external tables are treated as sources in dbt-athena-community, nothing should happen, because they are managed outside dbt-athena, therefore there isn't any delete happening.
Yeah, that's it. If you have a source external table called "transactions" over a bunch of files landing from somewhere, and you decide you want to put a model there instead, you change the name of the source to "transactions-external" and create a model "transactions". The new external table "transactions-external" gets created fine, and the model "transactions" sees the existing table and dbt-athena-community starts deleting all the data underneath, which could be bad. I think that's the behaviour I've seen, anyway; just giving you a heads up about that potential issue to avoid 👍
closing in favor of #203 |
Description & motivation
Added AWS Athena support so we can use AWS Athena external sources in dbt. This is nice for using public data sources like Open Street Maps and Facebook Data for Good Population Density.
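To make the intended usage concrete, an Athena external source spec might look roughly like the sketch below. This is hypothetical: it mirrors the Redshift-style spec this PR says it follows, and the source name, schema, bucket, and partition column are invented for illustration; the exact options supported are defined by the macros in this PR.

```yaml
# models/staging/sources.yml (hypothetical example)
version: 2

sources:
  - name: osm                      # invented source name
    schema: my_external_schema     # invented schema (an AWS Glue database)
    tables:
      - name: planet_features      # invented table name
        external:
          location: "s3://my-public-data-bucket/osm/planet/"        # invented bucket
          row_format: "serde 'org.openx.data.jsonserde.JsonSerDe'"  # JSON serde from rcongiu/Hive-JSON-Serde, referenced in the description
          partitions:
            - name: snapshot_date  # invented partition column
              data_type: varchar
```

With a spec like this in place, `dbt run-operation stage_external_sources` (mentioned earlier in this thread) creates or refreshes the external tables.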
Checklist