Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AWS Athena plugin #133

Closed

Conversation

daniel-cortez-stevenson

Description & motivation

Added AWS Athena so we can use AWS Athena external sources in DBT. This is nice for using public data sources like Open Street Maps and Facebook Data for Good Population Density.

Assumes use of the dbt-athena adapter

Considerations:

New env variables need to be set for CI:

  • ATHENA_TEST_DBNAME
    • this is an AWS Glue "Data Source"
    • the schema is set to dbt_external_tables_integration_tests_athena
      • hardcoded like the other schemas in sample.profiles.yml
      • this is an AWS Glue "Database"
  • AWS_REGION
  • ATHENA_TEST_BUCKET
  • ATHENA_TEST_WORKGROUP

Checklist

  • I have verified that these changes work locally
  • I have updated the README.md (if applicable)
  • I have added an integration test for my fix/feature (if applicable)

Assumes use of the https://github.com/Tomme/dbt-athena adapter

Considerations:
- `query-comment:` in `dbt_project.yml` must be set to nothing
    - Could be set to some other custom value
    - Standard block comment in dbt `/*...*/` is not supported in Athena
    - See getredash/redash#2399
- JSON in `./integration_tests/public_data/json/**/*.json` modified
    - Otherwise, fails integration test check for equal values
    - See https://github.com/rcongiu/Hive-JSON-Serde#json-data-files
- Piggybacks on the redshift macros for non-hive-compatible partitions
    - uses redshift helper macros because the query syntax is the same
    - we use a similar spec setup as Redshift for this reason

New env variables need to be set for CI:
- ATHENA_TEST_DBNAME
    - this is an AWS Glue "Data Source"
    - the schema is set to dbt_external_tables_integration_tests_athena
        - hardcoded like the other schemas in sample.profiles.yml
        - this is an AWS Glue "Database"
- AWS_REGION
- ATHENA_TEST_BUCKET
- ATHENA_TEST_WORKGROUP
@daniel-cortez-stevenson
Copy link
Author

ci/circleci: integration-athena will fail as long as the environment variables are not set and the infra is not set up in the AWS account that circleci is running in.

@daniel-cortez-stevenson
Copy link
Author

I hope this is a wanted contribution to the code! TBH I did this for myself and then got inspired to contribute. I'm 💯 OK with running my own modified local dbt package, so no hard feelings if this is out of scope and closed. Cheers 🍾

@daniel-cortez-stevenson
Copy link
Author

The query-comment: thing is troublesome ... any ideas how to fix this 'hack' a little more sustainably? Is it the job of this package, dbt-athena, dbt-core, or AWS Athena to resolve this?

Copy link
Collaborator

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daniel-cortez-stevenson This is really cool! I want to spend some time actually playing around with this (first step, spin up Athena in this sandbox) — but just to say, I really appreciate the work you've put in, and your desire to contribute it back for the benefit of other community members.

Comment on lines +31 to +32
# Is there a better way around this?
query-comment:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is something the adapter can reimplement! That's where the logic lives, with just this sort of reason in mind. The base implementation is defined on the core adapter interface: _QueryComment.add() in core.dbt.adapters.base.query_headers

The right place for this is probably a new issue in the dbt-athena adapter

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thanks for the pointer! I've opened up a PR in the dbt-athena adapter repo.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daniel-cortez-stevenson Could you please add a link to your PR? I cannot find it and need inspiration for how to solve a similar issue

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm. found it: Tomme/dbt-athena#64

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of it?

@@ -0,0 +1,61 @@
{% macro athena__refresh_external_table(source_node) %}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(After glancing quickly over many lines of code that look vaguely familiar, like an old dream): How much overlap is there between the logic here and the Redshift macros? Would it make sense to abstract the shared logic into a set of common macros, called by both?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A ton of overlap! I was torn between being DRY and avoiding creating a dependency between the Athena and Redshift plugins, which may not have such dependencies on the AWS side (ie. Athena and Redshift syntax could diverge). I ended up not following either option 100%.

I think either option is fine. If you would decide which option is preferred, I'd make the changes in the code to follow!

note: the "avoiding shared dependencies" idea came from common guidance to avoid shared dependencies in microservices, and my own experience writing ETLs (I re-wrote a lot ETLs to not have shared code, because data can change and transformations can change between ETLs).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am willing to update the athena adapter an use this package, but I am getting this error:

An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: 1 validation error detected at 'queryString' failed to satisfy constraint: Member must have length less than or equal to 262144

The maximum query string length in Athena (262,144 bytes) is not an adjustable quota.
It needs a to split the queries before reach this limit.

Tomme pushed a commit to Tomme/dbt-athena that referenced this pull request Apr 5, 2022
Athena DDL does not always respect /* ... */ block quotations. This is
the case with (at least) `create external table ...` statements.

Currently, if you try to run `create external table ...` statements, the
following error is returned: `pyathena.error.OperationalError: FAILED:
ParseException line 1:0 cannot recognize input near '/' '*' '{'`

Here I monkey-patched _QueryComment.add to prepend "-- " to the
`query_comment` and replace any newlines with " ".

This allows the default `query_comment` to be added to `create external
table` statements.

We might want to do this so that dbt-labs/dbt-external-tables can add
Athena support as is proposed in dbt-labs/dbt-external-tables/pull/133

Without this modification, the only way to run `dbt run-operation
stage_external_sources --vars "ext_full_refresh: true"` without an error
is to set `query_comment:` to nothing, a specific value, or a macro.
@daniel-cortez-stevenson
Copy link
Author

@jtcohen6 now that we've resolved the issue with the query comment in the DBT Athena adapter, do we want to move forward with this PR?

The steps might be:

  1. Resolve the architecture issue for the Redshift and Athena plugins by duplicating some code OR by moving some shared macros to common
  2. Add CI/CD support for the Athena plugin

@cbhagl cbhagl mentioned this pull request Aug 22, 2022
3 tasks
@jessedobbelaere
Copy link

I have copied the macro's in this PR as overrides for the dbt-external-tables plugin to be able to read cloudtrail data, and the macro's worked great with Athena. Any idea how this PR should proceed to get merged in?

@nicor88
Copy link

nicor88 commented Jan 11, 2023

@daniel-cortez-stevenson @jtcohen6 is there any plan to move this PR forward? it's a pretty interesting feature that will be important to have.

@brabster
Copy link

brabster commented May 2, 2023

I've been having a play with this PR content (against the dbt-athena-community [email protected], which is the recommended one afaik now) - I'm having difficulties with later versions of dbt-athena-community. How are the dependencies on the adapters handled? Seems like a change could be unwittingly introduced in an adapter that could break this build and consumers of this package. (Aside, if I get it working against latest to my satisfaction I'll raise a new PR, assuming this one doesn't get sorted in the meantime)

@mathewcohle
Copy link

would love to see this merge - right now it seems there is no good solution how to manage EXTERNAL TABLE resources with dbt and Athena setup

@aidan-o-boyle-kroo
Copy link

what is left to do with this pr? can i help complete it?

atommych added a commit to atommych/tf-athena-dbt-elementary that referenced this pull request Nov 19, 2023
@rstml
Copy link

rstml commented Jan 17, 2024

It seems that this won't be merged anytime soon. Does anyone have a working solution with the latest dbt-athena version?

I'm looking for a solution to define source Athena tables and lftag permissions.

@jessedobbelaere
Copy link

Yeah too bad that this PR is stuck.

I am using external tables in my day to day, I've installed this package:

 - package: dbt-labs/dbt_external_tables
    version: 0.8.2

Then added these lines to my dbt_project.yml:

dispatch:
  - macro_namespace: dbt_external_tables
    search_order: [my_custom_dbt_macros, dbt_external_tables]

# Stage the external tables when dbt starts running, using https://github.com/dbt-labs/dbt-external-tables
on-run-start:
  - "{{ dbt_external_tables.stage_external_sources() }}"

And created a private git repo with the macro from this PR, which I installed in packages.yml:

  - git: [email protected]:XXX/dbt-macros.git
    revision: 0.0.1

Which has a dbt_project.yml:

name: 'my_custom_dbt_macros'
version: '1.0.0'
config-version: 2

require-dbt-version: ">=1.0.0"

And the files from this PR in a dbt_external_tables subfolder.
Seems to work, but I only have a basic use case... (Cloudtrail S3 data)

@rstml
Copy link

rstml commented Jan 18, 2024

Thanks @jessedobbelaere!

I'd rather had this merged, but in the mean time, if anyone needs this, I packaged the solution above at https://github.com/rstml/dbt-external-tables-athena

@nicor88
Copy link

nicor88 commented Jan 18, 2024

@rstml nice one, feel free to post this also in #db-athena in dbt slack workspace, I believe that some folks will be interested.

@daniel-cortez-stevenson
Copy link
Author

I imagine adding Athena support could be out of scope for this repository.

@jtcohen6 can you confirm whether or not this feature is desirable?

If yes, I'd get it merge-ready.

@nicor88
Copy link

nicor88 commented Jan 18, 2024

There is actually even another PR #203 not sure what it's the difference.

Also looking at the supported engines I see:

  • Redshift (Spectrum)
  • Snowflake
  • BigQuery
  • Spark
  • Synapse
  • Azure SQL

that are all verified adapters. dbt-athena-community is now trusted not sure if this is a criteria used to merge such integration.

@rstml
Copy link

rstml commented Jan 18, 2024

#203 looks very similar, but it also changes redshift helper for some reason https://github.com/dbt-labs/dbt-external-tables/pull/203/files#diff-462105eddcb198df9f7384b4bef97c092c55e93924896866db36c41373a811dc

I imagine adding Athena support could be out of scope for this repository.

Frankly, this is how I initially felt reading the docs.

However, there's great demand for something like dbt-external-tables for Athena/Trino/etc since data often loaded as files and there's a need to define metadata somewhere.

Maybe initially this wasn't intended use for this extension. But why not extend it to encompass all those databases/use cases?

I noticed there's a PR to add Trino support too: #153

@brabster
Copy link

brabster commented Jan 18, 2024

@nicor88 the original PR was based on the original athena adapter, not the community one. When I wanted to use Athena with external tables last year, I saw the original was already stalled for a year so used it as a basis and made minimal changes to create #203 PR based on dbt-athena-community - which I asked for advice and merge with no response and we've been using at my client off my GitHub PR branch since last summer.

@rstml I can't remember why I had to change the redshift helper now, but I suspect it was because there was shared code between the two (in the original PR that I started from) and I had to make it deal with the serious funky quoting behaviour of Athena.

I think that now dbt-athena-community probably supports most of what's needed internally and patching into dbt-external-tables could be neater than either current PR - but I gave up putting any thought into it after being ignored for so long 🤷

It would be great, one way or another, to properly support external tables in Athena!

@nicor88
Copy link

nicor88 commented Jan 19, 2024

@brabster thanks for the context, and It would be great, one way or another, to properly support external tables in Athena! I can't agree more.

@jtcohen6 @dataders could you consider to have a look at this?

@brabster
Copy link

@nicor88 FYI, bit of a warning on an issue I've had with my PR #203, dbt-athena-community will go and try to delete the data in S3 backing a table, with no way I can see to disable this behaviour from dbt-external-tables.

If you have laid an external table over some data in S3, then change from an external table to a normal model, it will try to delete the data underneath as part of the table drop. This is inappropriate and either results in breaking dbt run with permissions errors or potentially uncoverable loss of data.

@nicor88
Copy link

nicor88 commented Mar 26, 2024

@brabster f those external tables are treated as sources in dbt-athena-community nothing should happen, because they are managed outside dbt-athena, therefore there isn't any delete happening.
Of course if a some point the user start to treat them inside dbt-athena, then there will be a delete underlying.

@brabster
Copy link

Yeah that's it. If you have a source external table called "transactions" over a bunch of files landing from somewhere, and you decide you want to put a model there instead, you change the name of the source to "transactions-external" and create a model "transactions".

The new external table "transactions-external" gets created fine, and the model "transactions" sees the existing table and dbt-athena-community starts deleting all the data underneath, which could be bad. I think that's the behaviour I've seen, anyway, just giving you a heads up about that potential issue to avoid 👍

@dataders
Copy link
Collaborator

dataders commented Apr 4, 2024

closing in favor of #203

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.