
Superset deployment #3715

Merged — 17 commits merged Aug 16, 2024

Conversation

@bendnorman (Member) commented Jul 11, 2024

Overview

This PR contains superset configuration and cloud deployment changes for our data exploration tool.

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

@bendnorman bendnorman self-assigned this Jul 11, 2024
@bendnorman bendnorman added labels Jul 11, 2024: datasette (Issues related to accessing PUDL data via Datasette), duckdb (Issues referring to duckdb, the embedded OLAP database), superset
These changes have not been applied! I couldn't
figure out how to mount the bucket as a volume in Cloud Run
using Terraform, so I used the GUI. Now, when I run `terraform plan`,
Terraform wants to change a bunch of attributes for the Cloud Run
instance.

This commit also changes the default Superset registration user.
@bendnorman bendnorman linked an issue Jul 25, 2024 that may be closed by this pull request
@jdangerx (Member) left a comment:
Driveby comments since I was looking at a bunch of tickets. Glad you got that role stuff sorted out - ping me whenever you want a more in-depth review!

superset/superset_config.py (resolved)
terraform/main.tf (resolved)
@@ -0,0 +1,14 @@
# hadolint ignore=DL3006
FROM apache/superset
@bendnorman (Member, Author):

@jdangerx should we pin the base image version? If so, can we set up dependabot to keep the base image and the requirements.txt file up to date?

@jdangerx (Member):

Yes, we should pin the base image version since a superset upgrade might require a DB migration as well: https://superset.apache.org/docs/installation/upgrading-superset/

I think we can set up Dependabot to make PRs to keep the docker image up to date by adding a docker block to the updates in dependabot.yml, and then labeling the Dockerfile source: https://github.com/dependabot-fixtures/docker-with-source/blob/main/Dockerfile

And I think we can add a pip block for requirements.txt.

I think both of those can be in a follow-up PR, though, and we should set some sort of reminder later to check to see if the dependabot config actually worked...
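The Dependabot setup sketched above might look something like this. This is an illustration only: the directory paths and the pinned tag are assumptions, not values from the repo.

```yaml
# Hypothetical dependabot.yml additions — paths and intervals are guesses.
version: 2
updates:
  - package-ecosystem: "docker"
    directory: "/superset"      # where the Dockerfile lives (assumed)
    schedule:
      interval: "weekly"
  - package-ecosystem: "pip"
    directory: "/superset"      # where requirements.txt lives (assumed)
    schedule:
      interval: "weekly"
```

With the base image pinned to a specific tag in the Dockerfile (e.g. `FROM apache/superset:4.0.2` — tag chosen purely for illustration), the docker ecosystem block can then open PRs to bump it.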

If this is the first time running superset locally or you recently ran `docker compose down` you'll need to run the commands in `setup.sh`.

## Making changes to the production deployment
TODO: instructions on how to connect to Cloud SQL
@bendnorman (Member, Author):

I need to figure out how to use the Cloud Auth Proxy to make changes to the production database. Also, how can we protect against people making changes to the production database when they are just experimenting with the local deployment?

@jdangerx (Member):

How did you make changes to the Cloud SQL database for the initial deploy?

I think we can use the Docker version of the Cloud SQL Auth Proxy in a mutate-prod docker compose file. That would just be the same as our dev docker-compose, but with the Cloud SQL proxy standing in for the postgres service. Then executing superset db-upgrade or whatever, within the pudl-superset docker-compose service, would point at prod Cloud SQL.

We'd need to make a new service account to give that Cloud SQL Auth Proxy. To stop people from accidentally changing the prod DB, we could restrict the ability to create a key for that SA to only a subset of Catalyst. So it'd require a fair amount of effort/oversight to be able to make a change to prod DB.
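The mutate-prod compose file described above could be sketched roughly like this. The image tag, instance connection name, credential path, and service names are all placeholders, not values from this repo:

```yaml
# Hypothetical docker-compose.mutate-prod.yml — the Cloud SQL Auth Proxy
# stands in for the local postgres service, so superset CLI commands run
# against prod Cloud SQL.
services:
  cloud-sql-proxy:
    image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0  # placeholder tag
    command:
      - "--address=0.0.0.0"
      - "--port=5432"
      - "my-project:us-central1:superset-db"   # placeholder instance name
    volumes:
      - ./sa-key.json:/config/sa-key.json:ro   # key for the dedicated SA
    environment:
      GOOGLE_APPLICATION_CREDENTIALS: /config/sa-key.json

  pudl-superset:
    build: .
    environment:
      DATABASE_HOST: cloud-sql-proxy   # point Superset at the proxy
    depends_on:
      - cloud-sql-proxy
```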

## Making changes to the production deployment
TODO: instructions on how to connect to Cloud SQL

## Deploy to Cloud Run
@bendnorman (Member, Author):

What should the deployment flow be for Superset? I can think of a few reasons to trigger a new deployment:

  • a package in requirements.txt is updated
  • the base image is updated
  • we make a change to superset_config.py
  • there is new data — changes to tables might break shared dashboards, so we'll probably want to redeploy Superset on new data releases. Also, should Superset point at nightly or stable?

@jdangerx (Member):

I think redeploying Superset for nightly builds is a good way to test that our deployment infrastructure still works, and has the benefit of catching all these other changes too - so long as the cloud build doesn't cost a ton I think that's the easiest way forward.

I also think that eventually we probably want both nightly and stable to be on Superset - we can default to using stable, but have the option to connect to nightly in the database selector if people want.
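A nightly redeploy along these lines might be wired up with a scheduled workflow. This is a sketch only — the workflow name, cron time, secret name, project, and region are all assumptions:

```yaml
# Hypothetical .github/workflows/redeploy-superset.yml
name: redeploy-superset
on:
  schedule:
    - cron: "0 10 * * *"   # after the nightly build finishes (assumed time)
  workflow_dispatch: {}     # allow manual redeploys too

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}   # placeholder secret
      - name: Build and deploy to Cloud Run
        run: |
          # project and region below are placeholders
          gcloud builds submit superset/ --tag gcr.io/my-project/superset
          gcloud run deploy superset --image gcr.io/my-project/superset --region us-central1
```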

- "8080:8088"
volumes:
# - {path to your pudl.duckdb file}:/app/pudl.duckdb
- ./roles.json:/app/roles.json
@bendnorman (Member, Author):

Do we want to try to track the role definitions in git so our local deployments can use the roles we're using in production?

@jdangerx (Member):

Yeah, that makes a ton of sense to me :)
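One way to keep `roles.json` in sync, sketched below, would be Superset's CLI, which inherits Flask-AppBuilder's role commands. The service name and paths match the compose snippet above, but verify the exact commands and flags against the deployed Superset version before relying on this:

```
# Export roles from the running container and track them in git:
docker compose exec pudl-superset superset fab export-roles --path /app/roles.json
git add roles.json

# On a fresh local deployment, load the tracked roles back in:
docker compose exec pudl-superset superset fab import-roles --path /app/roles.json
```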

Comment on lines +66 to +74
def FLASK_APP_MUTATOR(app: Flask) -> None: # noqa: N802
"""Superset function that allows you to configure the Flask app.

Args:
app: The Flask app instance
"""
app.config.update(
PREFERRED_URL_SCHEME="https",
)
@bendnorman (Member, Author):

Unfortunately, this didn't resolve the HTTP redirect issue. I think the issue might be related to the Auth0 OAuth development tokens we're using.

@jdangerx (Member):

Hmm, nothing in that list looks suspicious to me, I'd love to see the authentication logs and see if there are specific errors showing up, or what the callback URL is being parsed as, etc.
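One common cause of this symptom, offered here as a hedged suggestion rather than a confirmed diagnosis: when TLS terminates at the Cloud Run load balancer, Flask sees plain HTTP and builds `http://` URLs unless it trusts the `X-Forwarded-*` headers. Superset exposes an `ENABLE_PROXY_FIX` config flag for this; the manual equivalent inside `FLASK_APP_MUTATOR` would look roughly like:

```python
# Sketch for superset_config.py — PREFERRED_URL_SCHEME alone doesn't help
# if Flask ignores the X-Forwarded-Proto header the load balancer sets.
from flask import Flask
from werkzeug.middleware.proxy_fix import ProxyFix

ENABLE_PROXY_FIX = True  # Superset config flag that wraps the app in ProxyFix


def FLASK_APP_MUTATOR(app: Flask) -> None:  # noqa: N802
    """Trust one proxy hop so url_for() generates https URLs."""
    app.wsgi_app = ProxyFix(app.wsgi_app, x_proto=1, x_host=1)
    app.config.update(PREFERRED_URL_SCHEME="https")
```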

@bendnorman bendnorman marked this pull request as ready for review August 14, 2024 01:28
@@ -102,3 +102,297 @@ resource "google_storage_bucket_iam_binding" "binding" {
"user:[email protected]",
]
}

@bendnorman (Member, Author):

I'd like to learn more about how to structure terraform projects. It'd be helpful to isolate the superset setup in a separate file or something so we can easily reuse it for other projects like the pudl usage metrics repo.

@jdangerx (Member):

It's pretty straightforward - terraform mashes all the *.tf files into one thing before parsing, so you can split however you want. Check out this article: https://build5nines.com/terraform-split-main-tf-into-seperate-files/
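Concretely, the split might look like the layout below. Since Terraform concatenates every `*.tf` file in a directory before parsing, resources can move to their own file verbatim; the file names and resource values here are suggestions only:

```hcl
# Suggested layout:
#   terraform/
#     main.tf       # provider config, shared buckets/IAM
#     superset.tf   # Cloud Run service, Cloud SQL instance, secrets
#     variables.tf  # shared input variables
#
# superset.tf — a resource moved here as-is, no code changes required.
# Name, region, and tier below are placeholders.
resource "google_sql_database_instance" "superset" {
  name             = "superset-db"
  database_version = "POSTGRES_15"
  region           = "us-central1"
  settings {
    tier = "db-f1-micro"
  }
}
```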

@jdangerx jdangerx self-requested a review August 14, 2024 14:36
@jdangerx (Member) left a comment:

Thanks for wrangling all this! There's definitely bits to change but also it seems like it works.

The main blocking thing is just pinning the docker image - once that's done I'm happy to merge to main.

It's probably also a good idea to get people to run their own Auth0 apps instead of using the production one for development.

Some improvements/follow-ups that could be separate PRs:

  • triggering redeploys on nightly build
  • define a new SA specifically for Superset and give that all the accesses it needs - probably refactoring to use for_each along the way.
  • get a docker-compose setup for pointing at production


superset/README.md (three comment threads, two resolved)



}
}

resource "google_sql_database_instance" "postgres_pvp_instance_name" {
@jdangerx (Member):

What does pvp mean here?

@bendnorman (Member, Author):

I grabbed this terraform code from the google docs and forgot to rename the terraform resource. Is it too late to rename it?

@jdangerx (Member):

If you rename it, it will delete the resource and create a new one. So it sort of depends on how hard it is to re-initialize the state.
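One hedged alternative worth noting: on Terraform 1.1+, a rename doesn't have to destroy the instance. A `moved` block (or `terraform state mv` on older versions) records the rename in state, so `terraform plan` shows a move instead of a destroy-and-recreate:

```hcl
# The new resource name below is a suggestion, not an agreed-upon choice.
moved {
  from = google_sql_database_instance.postgres_pvp_instance_name
  to   = google_sql_database_instance.superset
}
```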

deletion_protection = true
}

resource "google_secret_manager_secret" "superset_database_username" {
@jdangerx (Member) commented Aug 14, 2024:

Hooray! I love that you're using the secret manager to manage secrets, it makes my shriveled little security heart sing.

It's a little heavyweight to define all this boilerplate just to say "there are these secrets, here are their names, and this service account has access to them." We can use the for_each syntax of Terraform to drastically reduce that.

Note that if you use for_each it will change all the resource names in Terraform, so it will want to recreate everything -_- which means you'll have to manually re-populate the secrets.
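The `for_each` refactor described above might look roughly like this. The secret names and the service-account member string are placeholders for illustration:

```hcl
locals {
  superset_secrets = toset([
    "superset_database_username",
    "superset_database_password",
    "superset_secret_key",
  ])
}

# One secret resource stamped out per name in the set.
resource "google_secret_manager_secret" "superset" {
  for_each  = local.superset_secrets
  secret_id = each.key
  replication {
    auto {}   # older provider versions use `automatic = true` instead
  }
}

# Grant the (hypothetical) Superset SA access to every secret above.
resource "google_secret_manager_secret_iam_member" "superset_accessor" {
  for_each  = google_secret_manager_secret.superset
  secret_id = each.value.id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:superset@my-project.iam.gserviceaccount.com"
}
```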

@bendnorman (Member, Author) commented Aug 15, 2024:

Ah, good to know! I'll tackle it when I create a new service account.

resource "google_secret_manager_secret_iam_member" "superset_database_username_compute_iam" {
secret_id = google_secret_manager_secret.superset_database_username.id
role = "roles/secretmanager.secretAccessor"
member = "serviceAccount:[email protected]"
@jdangerx (Member):

I think it makes sense to make a new service account that's only for Superset instead of using the default Compute Engine SA. Then it's easier to restrict access to the prod database, which makes it harder to accidentally blow it up.

@bendnorman (Member, Author):

Ah yes it sure does. I'll create an issue for this.

@bendnorman (Member, Author):

Thanks for all the great feedback! I pinned the Docker base image version, updated the Auth0 env var instructions, and set Docker Compose env var defaults. I also created some draft issues that I'll flesh out tomorrow.

@bendnorman bendnorman added this pull request to the merge queue Aug 16, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 16, 2024
@jdangerx (Member):

You're hitting a bunch of this error in CI:

ERROR test/integration/glue_test.py::test_unmapped_utils_eia - TypeError: ForwardRef._evaluate() missing 1 required keyword-only argument: 'recursive_guard'

Which is related to dagster / python version incompatibilities: dagster-io/dagster#22985

@bendnorman bendnorman added this pull request to the merge queue Aug 16, 2024
Merged via the queue into main with commit f257fc8 Aug 16, 2024
17 checks passed
@bendnorman bendnorman deleted the init-superset branch August 16, 2024 20:02