From f5a2d207b272f802703e23f25dd84ee06ae06137 Mon Sep 17 00:00:00 2001 From: Jay Carlton <53479492+jaycarlton@users.noreply.github.com> Date: Wed, 11 Nov 2020 19:02:29 -0500 Subject: [PATCH 1/4] initial reporting module --- .gitignore | 4 + AOU_RW_MODULE_WALKTHROUGH.md | 276 ++++++++++++++++++ README.md | 0 TERRAFORM_QUICKSTART.md | 218 ++++++++++++++ modules/.DS_Store | Bin 0 -> 6148 bytes modules/workbench/README.md | 24 ++ modules/workbench/WORKBENCH-MODULE.md | 63 ++++ modules/workbench/main.tf | 11 + modules/workbench/modules/reporting/.DS_Store | Bin 0 -> 6148 bytes .../reporting/assets/schemas/cohort.json | 57 ++++ .../reporting/assets/schemas/institution.json | 37 +++ .../reporting/assets/schemas/user.json | 192 ++++++++++++ .../reporting/assets/schemas/workspace.json | 177 +++++++++++ .../reporting/assets/views/latest_cohorts.sql | 13 + .../assets/views/latest_institutions.sql | 14 + .../reporting/assets/views/latest_users.sql | 12 + .../assets/views/latest_workspaces.sql | 13 + .../assets/views/table_count_vs_time.sql | 39 +++ modules/workbench/modules/reporting/main.tf | 105 +++++++ .../workbench/modules/reporting/variables.tf | 14 + .../reporting/views/latest_cohorts.sql | 13 + .../reporting/views/latest_institutions.sql | 14 + .../modules/reporting/views/latest_users.sql | 12 + .../reporting/views/latest_workspaces.sql | 13 + .../reporting/views/table_count_vs_time.sql | 39 +++ modules/workbench/providers.tf | 19 ++ modules/workbench/variables.tf | 36 +++ 27 files changed, 1415 insertions(+) create mode 100644 .gitignore create mode 100644 AOU_RW_MODULE_WALKTHROUGH.md create mode 100644 README.md create mode 100644 TERRAFORM_QUICKSTART.md create mode 100644 modules/.DS_Store create mode 100644 modules/workbench/README.md create mode 100644 modules/workbench/WORKBENCH-MODULE.md create mode 100644 modules/workbench/main.tf create mode 100644 modules/workbench/modules/reporting/.DS_Store create mode 100644 
modules/workbench/modules/reporting/assets/schemas/cohort.json create mode 100644 modules/workbench/modules/reporting/assets/schemas/institution.json create mode 100644 modules/workbench/modules/reporting/assets/schemas/user.json create mode 100644 modules/workbench/modules/reporting/assets/schemas/workspace.json create mode 100644 modules/workbench/modules/reporting/assets/views/latest_cohorts.sql create mode 100644 modules/workbench/modules/reporting/assets/views/latest_institutions.sql create mode 100644 modules/workbench/modules/reporting/assets/views/latest_users.sql create mode 100644 modules/workbench/modules/reporting/assets/views/latest_workspaces.sql create mode 100644 modules/workbench/modules/reporting/assets/views/table_count_vs_time.sql create mode 100644 modules/workbench/modules/reporting/main.tf create mode 100644 modules/workbench/modules/reporting/variables.tf create mode 100644 modules/workbench/modules/reporting/views/latest_cohorts.sql create mode 100644 modules/workbench/modules/reporting/views/latest_institutions.sql create mode 100644 modules/workbench/modules/reporting/views/latest_users.sql create mode 100644 modules/workbench/modules/reporting/views/latest_workspaces.sql create mode 100644 modules/workbench/modules/reporting/views/table_count_vs_time.sql create mode 100644 modules/workbench/providers.tf create mode 100644 modules/workbench/variables.tf diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..39f7fa5 --- /dev/null +++ b/.gitignore @@ -0,0 +1,4 @@ +**/.idea/* +*.tfstate +*.backup +*.iml diff --git a/AOU_RW_MODULE_WALKTHROUGH.md b/AOU_RW_MODULE_WALKTHROUGH.md new file mode 100644 index 0000000..624ed38 --- /dev/null +++ b/AOU_RW_MODULE_WALKTHROUGH.md @@ -0,0 +1,276 @@ +# AoU Researcher Workbench Module Walkthrough +## 0. 
Module Structure +The state associated with the current deployment consists of +one `root` module for each environment, in separate directories. + +In order to deploy a full (or partial) environment we need to declare what modules are used and to supply +values to all unbound declared variables. The environment module is unioned with the modules in the +`source` statement. + +The overall source structure looks like the following. Note that +Terraform will collect all `.tf` files in a referenced directory, +so the calling module will need to specify values for the child +modules' `variable` blocks that don't have defaults. + +```text +/repos/workbench/ops/terraform/ +├── AOU_RW_MODULE_WALKTHROUGH.md +├── TERRAFORM-QUICKSTART.md +├── environments +│   ├── local +│   ├── scratch +│   │   ├── SCRATCH-ENVIRONMENT.md +│   │   ├── scratch.tf +│   │   ├── terraform.tfstate +│   │   ├── terraform.tfstate.backup +│   │   └── terraform.tfstate.yet.another.backup +│   └── test +└── modules + └── aou-rw-reporting + ├── providers.tf + ├── reporting.tf + ├── schemas + │   ├── cohort.json + │   ├── institution.json + │   ├── user.json + │   └── workspace.json + ├── variables.tf + └── views + ├── latest_cohorts.sql + ├── latest_institutions.sql + ├── latest_users.sql + ├── latest_workspaces.sql + └── table_count_vs_time.sql +``` +The `modules` directory contains independent, reusable modules for +subsystems that are +* logical to deploy and configure operationally, +* don't depend on each other (at least for exported modules) and +* can be used by AoU or potentially another organization interested in deploying a copy +of all or part of our system. + +## Prerequisites +### 1. Get Terraform +Install Terraform using the directions at [TERRAFORM_QUICKSTART.md](TERRAFORM_QUICKSTART.md). +### 2. Change to the `environments/scratch` directory, then `get` and `init` +The environment for this outline is `scratch`, which exists in a target environment +of your choice. +### 3. 
Assign Values to Input Variables + +The following public variable declarations are representative of those +specified in `modules/reporting/variables.tf` and elsewhere. The description +string is shown when running interactively from the command line without all the +vars coming in from a `-var-file` argument. +```hcl-terraform +variable credentials_file { + description = "Location of service account credentials JSON file." + type = string +} + +variable aou_env { + description = "Short name (all lowercase) of All of Us Workbench deployed environments, e.g. local, test, staging, prod." + type = string +} + +variable project_id { + description = "GCP Project" + type = string +} +``` +Create a `scratch_tutorial.tfvars` file outside of this repository. This file should +contain values for the following [input variables](https://www.terraform.io/docs/configuration/variables.html) that will be different +for different organizations and environments. + +```hcl-terraform +aou_env = "scratch" # Name of environment we're creating or attaching to. Needs to match directory name +project_id = "my-qa-project" # Should not be prod +reporting_dataset_id = "firstname_lastname_scratch_0" # BigQuery dataset id +``` + +The credentials file should point to a JSON key file generated +by Google Cloud IAM (at least on lower environments). The only required +permission is `BigQuery Data Owner`. Neither the credentials nor +the `.tfvars` file itself should be checked into public source control. + +It's sometimes helpful to assign the full path to this `.tfvars` to an environment variable, +as it will need to be provided for most commands. There are several other ways to do this, +but the advantage for us is separating the reusable stuff from the AoU-instance-specific +values. +```shell script +$ SCRATCH_TFVARS=/repos/workbench-devops/terraform/scratch.tfvars +``` + +### 4. 
Initialize Terraform +Run [`terraform init`](https://www.terraform.io/docs/commands/init.html) to initialize the current directory (which should be +`api/terraform/environments/scratch` if working from this repo). It should also be possible to +work from a directory completely separated from source control. It's just +a bit harder to refer to the module definitions. + +If `init` was successful, the output should look something like the following: +``` +Initializing modules... + +Initializing the backend... + +Initializing provider plugins... +- Using previously-installed hashicorp/google v3.5.0 + +Terraform has been successfully initialized! + +You may now begin working with Terraform. Try running "terraform plan" to see +any changes that are required for your infrastructure. All Terraform commands +should now work. + +If you ever set or change modules or backend configuration for Terraform, +rerun this command to reinitialize your working directory. If you forget, other +commands will detect it and remind you to do so if necessary. +``` + +After a successful `init`, the backend, plugins, and modules are in a reasonably good state, +but certain expensive operations are deferred for performance. Look at the `terraform.tfstate` file +in the run directory to confirm nothing of interest is there: +```json +{ + "version": 4, + "terraform_version": "0.13.0", + "serial": 24, + "lineage": "d9d8e034-fad0-03ff-df40-86bdd7a43128", + "outputs": {}, + "resources": [] +} +``` + +### 5. Build a Plan +Terraform creates a plan of action based on the difference between its view of the state +of all the resources and what's stated in the file. + +Run it like so: +``` +terraform plan -var-file=$SCRATCH_TFVARS +``` +The output for me looks like [this](doc/plan_output.txt). You should see a couple of key things: +* A dataset, several tables, and some views will be created. Searching for "will be created" is an easy way to +see this. 
+* All the variables are expanded in the state file, so treat this file as Eyes Only. +* The summary line should show `Plan: 10 to add, 0 to change, 0 to destroy.` + +The `plan` command doesn't edit actual resources, but is important for understanding Terraform's marching +orders. + +### 6. Apply the Plan +Use the `apply` command to make the necessary changes. It will ask you for a `yes` confirmation before proceeding. +In the case of the reporting module, creating the dataset then immediately creating tables may mean +that we need to run it one more time. Luckily, `apply` is idempotent for this case and there's no harm. + +Once everything is applied, rerunning `tf plan` will show that nothing is left to do: +``` +$ tf plan -lock=false -var-file=$SCRATCH_TFVARS +Refreshing Terraform state in-memory prior to plan... +The refreshed state will be used to calculate this plan, but will not be +persisted to local or remote state storage. + +module.aou_rw_scratch_env.module.reporting.google_bigquery_dataset.main: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2] +module.aou_rw_scratch_env.module.reporting.google_bigquery_table.view["latest_cohorts"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/latest_cohorts] +module.aou_rw_scratch_env.module.reporting.google_bigquery_table.main["institution"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/institution] +module.aou_rw_scratch_env.module.reporting.google_bigquery_table.view["table_count_vs_time"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/table_count_vs_time] +module.aou_rw_scratch_env.module.reporting.google_bigquery_table.view["latest_institutions"]: Refreshing state... 
[id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/latest_institutions] +module.aou_rw_scratch_env.module.reporting.google_bigquery_table.view["latest_users"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/latest_users] +module.aou_rw_scratch_env.module.reporting.google_bigquery_table.main["cohort"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/cohort] +module.aou_rw_scratch_env.module.reporting.google_bigquery_table.view["latest_workspaces"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/latest_workspaces] +module.aou_rw_scratch_env.module.reporting.google_bigquery_table.main["user"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/user] +module.aou_rw_scratch_env.module.reporting.google_bigquery_table.main["workspace"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/workspace] + +------------------------------------------------------------------------ + +No changes. Infrastructure is up-to-date. + +This means that Terraform did not detect any differences between your +configuration and real physical resources that exist. As a result, no +actions need to be performed. +``` + +### 7. Selectively removing state +If it's necessary to detach one or more online resources from the local Terraform state (as if it had +never been created or imported), use the `terraform state rm` command. The general pattern is +`terraform state rm <resource_address>`. For example, let's say I've decided I no longer want the view +named `latest_workspaces` to be included in the state file. Running `terraform state rm module.aou_rw_scratch_env.module.reporting.google_bigquery_table.view[\"latest_workspaces\"]` detaches just that view from the state without touching the object itself. + +### 8. 
Handy State commands +The [state command](https://www.terraform.io/docs/commands/state/index.html) is one of the more powerful ones to use, and lets you avoid interacting directly with `.tfstate` +files. +#### Import +Working with resources that already exist requires the `terraform import` command. This seems unintuitive, +but the sample `terraform state list` output shows what's expected. Third-party modules should show +the expected syntax. For [importing a BigQuery dataset](https://www.terraform.io/docs/providers/google/r/bigquery_dataset.html#import) +from the `scratch` environment to the `local` environment, simply do: + +```shell script +terraform import -var-file=$TFVARS_LOCAL \ + module.local.module.reporting.google_bigquery_dataset.main \ + reporting_local +``` +The output should look like this if successful. There are several failure modes involving directory structure, +module path, and differing asset ID configurations for different providers. + +``` +terraform import -var-file=$TFVARS_LOCAL module.local.module.reporting.google_bigquery_dataset.main reporting_local +module.local.module.reporting.google_bigquery_dataset.main: Importing from ID "reporting_local"... +module.local.module.reporting.google_bigquery_dataset.main: Import prepared! + Prepared google_bigquery_dataset for import +module.local.module.reporting.google_bigquery_dataset.main: Refreshing state... [id=projects/my-project/datasets/reporting_local] + +Import successful! + +The resources that were imported are shown above. These resources are now in +your Terraform state and will henceforth be managed by Terraform. 
+``` + +`tf state` should now show that we are managing the resource: +```shell script +tf state list +module.local.module.reporting.google_bigquery_dataset.main +``` + +```shell script + +terraform import -var-file=$TFVARS_LOCAL module.local.module.reporting.google_bigquery_table.main[\"cohort\"] projects/all-of-us-workbench-test/datasets/reporting_local/tables/cohort +terraform import -var-file=$TFVARS_LOCAL module.local.module.reporting.google_bigquery_table.view[\"latest_users\"] projects/all-of-us-workbench-test/datasets/reporting_local/tables/latest_users +``` +These are examples of importing tables. Remember that quotation marks must be escaped. + +```shell script +$ tf state list +module.local.module.reporting.google_bigquery_dataset.main +module.local.module.reporting.google_bigquery_table.main["cohort"] +``` +None of the `terraform state` commands accept variable values, as those have already been interpolated +during a `plan` or `apply` operation. + +**NOTE:** While Terraform is managing the dataset, it's not yet managing any data in it directly. +Running `tf plan` at this point will indicate that, while the dataset is controlled, the tables and +views in it are not. It's probably not a good idea to `terraform destroy` imported resources that +contain other resources you care about; always study the `plan` output carefully. +#### `state list` +`terraform state list` lists all modules and resources under management for the current module. It's +especially handy when trying to find the desired module path string for `import` if you're reusing a +configuration for another environment or system. +#### `state show` +`terraform state show` is a more detailed listing for a given item in the state tree. The command looks like: +``` +terraform state show module.local.module.reporting.google_bigquery_dataset.main +``` + +#### `state pull` +To show the active state file (by default named `terraform.tfstate`), simply do + `terraform state pull | jq`. 
+The `jq` command makes the JSON colorized, though it already has a nice structure. + +I don't know why you'd use `terraform state push`, which applies state that's externalized as JSON somehow. +Likely an advanced feature. + +#### `state rm` +The opposite of `terraform import`, the `state rm` subcommand removes a tracked resource from the +Terraform state file. Some uses for this are for repairing configurations, splitting them up, +or allowing someone else to experiment with changes on a deployed artifact before bringing it +back under control. Happily, this command does not `destroy` objects when removing them. diff --git a/README.md b/README.md new file mode 100644 index 0000000..e69de29 diff --git a/TERRAFORM_QUICKSTART.md b/TERRAFORM_QUICKSTART.md new file mode 100644 index 0000000..ed83b20 --- /dev/null +++ b/TERRAFORM_QUICKSTART.md @@ -0,0 +1,218 @@ +# Terraform Quickstart +The [official documentation](https://www.terraform.io/) for Terraform +is quite readable and exposes the functionality and assumptions at a good pace. +In particular, I found the [Get Started - Google Cloud](https://learn.hashicorp.com/collections/terraform/gcp-get-started) guide to be very helpful. + +It's worth making an alias for terraform and putting it in your `.bash_profile` or other shell init file, as +it's difficult to spell `terraform` correctly when caffeinated. +```shell script +alias tf='terraform' +``` +The above tip also serves as a warning and non-apology that I'm going to forget to spell out the +command name repeatedly below. + +## Installation +For the work so far, I've used the [Terraform CLI](https://www.terraform.io/docs/cli-index.html), which has the advantage of not costing +money or requiring an email registration. On the Mac, `brew install terraform` is pretty much all it takes. + +Terraform works by keeping state on the local filesystem for evaluating diffs and staging changes. 
Primary files for users to author +and check in to source control are: +* main.tf - listing providers and specifying Terraform version and other global options +* .tf - list of resources and their properties and dependencies. This file can reference any other .tf files in the local directory. +* variables.tf - any string, numeric, or map variables to be provided to the script. +* external text files - useful files with text input, such as BigQuery table schema JSON files + +Output files produced by Terraform (and not checked in to source control) include +* tfstate files - a record of the current known state of resources under Terraform's control. + +## Organization +Terraform configuration settings are reusable for all environments (after binding environment-specific +variables in `.tfvars` files). The reuse is provided by Terraform modules. +## Running +If you have a small change to make to a resource under Terraform's management, in the simplest case the workflow is +* Run `terraform init` to initialize the providers +* Run `terraform state list` to list all artifacts currently known and managed by Terraform within +the scope of the `.tf` files in the current directory. +* Run `terraform show` to view the current state of the (managed) world, and check any errors. +* Change the setting in the tf file (such as reporting.tf). +* Run `terraform plan` to see the execution plan. This can be saved with the `-out` argument in +situations where it's important to apply exactly the planned changes. Otherwise, new changes to the +environment might be picked up in the `apply` step, giving possibly significantly different behaviors +than were expected based on the `plan` output. +* Run `terraform apply` to execute the plan and apply the changes. You'll need to type "yes" to + proceed with the changes (or use `-auto-approve` in a non-interactive workflow.) +* Check in changes to the terraform file. 
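+As a concrete sketch of the workflow above, suppose the `.tf` file contains a dataset resource like the one below. The resource and variable names here are illustrative assumptions, not taken from the actual module. +```hcl-terraform +# Hypothetical resource in reporting.tf. After editing friendly_name, +# `terraform plan` shows an in-place update ("~") rather than a +# destroy/create cycle, and `terraform apply` makes the change. +resource "google_bigquery_dataset" "main" { + dataset_id = var.reporting_dataset_id + project = var.project_id + friendly_name = "Reporting Data (v2)" # <- the small change to apply + labels = { + managedbyterraform = "true" + } +} +``` +Because the edit only touches a mutable attribute, the plan summary should read `0 to add, 1 to change, 0 to destroy`.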
+ +## Managing Ownership +### Importing resources +Frequently, resources to be managed already exist. By default, Terraform will try to re-create them +if they're added to a configuration and fail because the name or other unique identifier is already in use. +Using `terraform import` allows the existing resource to be included +in the `tfstate` file as if Terraform created it from scratch. + +### Removing items from Terraform +Occasionally, it's desirable to remove a resource from Terraform state. This can be helpful when reorganizing +resources or `tf` files. The `terraform state rm` command accomplishes this, and moves those resources +into a state where Terraform doesn't know it either created or owned them. The +[official docs](https://www.terraform.io/docs/commands/state/rm.html) are pretty good for this. + +## Good Practices +### Formatting +A built-in formatter is available with the `terraform fmt` command. It spaces assignments in clever ways +that would be difficult to maintain by hand, but that are easy to read. It's easy to set up in IntelliJ +by installing the File Watchers plugin and adding a Terraform Format action. It runs fast, too. + +### Labels +It's handy to have a human-readable label called `managedByTerraform` and set it to `true` for all TF artifacts. +It's possible to set up default labels for this. +### Local Variables +Using a `locals` block allows you to assign values (computed once) to variables to be used elsewhere. This +is especially useful for nested map lookups: +```hcl-terraform +locals { + project = var.aou_env_info[var.aou_env]["project"] + dataset = var.aou_env_info[var.aou_env]["dataset"] +} +``` + +Later, simply reference the value by `dataset_id = local.dataset`. Note that these "local" variables +are available to other `.tf` files, but apparently, since things are all initialized at once and immutable, +it doesn't really matter whether you define them in `chicken.tf` or `egg.tf`. 
It just works as long +as both files are part of the same logical configuration. + +It's useful in some cases to specify `default` values for the resources in use, but it's advisable to +force the user to specify certain fundamental things (such as the AoU environment) every time in order +to avoid migrating the wrong environment prematurely (such as removing artifacts that code running on +that environment expects to be there). + +### Starting with a scratch state collection +It's much faster to work with Terraform-created artifacts, properties, etc., than to attach to existing infrastructure. +For this purpose, it can be handy to add new BigQuery datasets just for the development of the configuration, +capture resource and module identifiers for import, and then tear down the temporary artifacts with `terraform destroy`. + +### Use Modules +[Modules](https://www.terraform.io/docs/configuration/modules.html) are the basis of reuse, +encapsulation, and separation of concerns in Terraform. Frequently, the provider (such as Google +Cloud Platform) has already written handy base modules that provide reasonable +defaults, logical arrangement of resources, and convenient output variable declarations. + +### Separate Private Vars from Community-use Settings +Names of artifacts, deployments (such as test and staging), service accounts, or other pseudo-secrets +should be kept separate from the primary module definitions outlining behavior. For example, looking +at the reporting project, we have: +* public: table schemas, names, and clustering/partitioning settings +* public: view queries (with dataset and project names abstracted out) +* private: names of AoU environments (currently exposed in several places publicly, but of no legitimate +use to the general public) +* private: BigQuery dataset names. We have a simple convention of naming it after the environment, +but this isn't a contract enforced by our application code or the Terraform configurations. 
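+The environment-based naming convention in the last bullet could be sketched as follows; the variable and local names are assumptions for illustration, not the module's actual identifiers. +```hcl-terraform +# Sketch only: derive the per-environment dataset ID from the environment +# short name, so aou_env = "prod" yields "reporting_prod". +locals { + reporting_dataset_id = "reporting_${var.aou_env}" +} +``` 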
+ +Why do we include the environment name in the dataset name (as opposed to just calling it `reporting`) in every +environment? Firstly, we have two environments that share a GCP project, so we would have a name clash. +More fundamentally, though, it would be too easy to apply a query to a dataset in the wrong environment +if it simply referred to `reporting.workspace` instead of `reporting_prod.workspace`, as the BigQuery +console lets you mix datasets from multiple environments as long as you have the required credentials. In most +cases, I'd argue against such inconsistent resource naming. + +### Don't fear the `tfstate` file +Despite the scary name, the contents of `tfstate` are in JSON, and largely readable. You can operate +on it with utilities such as `jq`: + +```shell script +$ jq '.resources[0].instances[0].attributes.friendly_name' terraform.tfstate +"Workbench Scratch Environment Reporting Data" +``` + +I'd keep any operations read-only whenever possible, but I have a feeling one of the keys to mastering +Terraform will be understanding the `tfstate` file. +## Gotchas +### A Terra by any other name +[Terra](https://terra.bio/) and [Terraform](https://www.terraform.io/) are different things, and for +the most part going to one organization for help with the other's platform will result in bemusement +at best. Good luck differentiating them on your resume. + +### Mis-configuring a tfstate file +The file really shouldn't be checked into source control, because +it's not safe to have multiple developers working with it. It's too easy to get into an inconsistent view of the world. + +However, that doesn't mean it's safe to lose track of the tfstate JSON file altogether. +When working with multiple people, a shared online backend with locking is really +required. + +### Using two terminals in the same terraform root module working directory. +Frequent error messages about the lock file and how you can use `-lock=false` but should really never +do so. 
It's basically that two processes think they own something in `.terraform/`. So don't do that. + +### Using `terraform state show` with `for-each` or an array-declared value. +When creating many items of the same type at the same level/scope, it's useful to use arrays or +`for-each`. However, the syntax for `tf state show` is trickier because you need to pass a double-quoted +string index from the command line. + +Given the following output of `terraform state list`: +``` +$ tf state list +module.bigquery_dataset.google_bigquery_dataset.main +module.bigquery_dataset.google_bigquery_table.main["cohort"] +module.bigquery_dataset.google_bigquery_table.main["user"] +module.bigquery_dataset.google_bigquery_table.main["workspace"] +module.bigquery_dataset.google_bigquery_table.view["latest_users"] +``` +The naive approach gives you this [cryptic error message](https://github.com/hashicorp/terraform/pull/22395). +``` +$ tf state show module.bigquery_dataset.google_bigquery_table.main["cohort"] +Error parsing instance address: module.bigquery_dataset.google_bigquery_table.main[cohort] + +This command requires that the address references one specific instance. +To view the available instances, use "terraform state list". Please modify +the address to reference a specific instance. + +``` +The approach that seems to work in Bash is +``` + terraform state show module.bigquery_dataset.google_bigquery_table.main[\"cohort\"] +``` + +### Cloud not quite ready to use newly created resource +When creating a new BigQuery dataset with tables and views +all at once, I once ran into an issue where the new table +wasn't ready for a view creation yet. The error message was +``` +Error: googleapi: Error 404: Not found: Table my-project:my_dataset.user, notFound + + on .terraform/modules/aou_rw_reporting/main.tf line 76, in resource "google_bigquery_table" "view": + 76: resource "google_bigquery_table" "view" { +``` + +Re-running `terraform apply` fixed this. 
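+If re-running `apply` becomes tiresome, one workaround is to make the ordering explicit so view creation waits for all the tables. This is a sketch; the resource names mirror the state listings above, but `var.views` and the file layout are assumptions. +```hcl-terraform +# Views are created only after every table exists, thanks to depends_on. +resource "google_bigquery_table" "view" { + for_each = var.views # assumed: set of view names matching views/<name>.sql + dataset_id = google_bigquery_dataset.main.dataset_id + table_id = each.key + depends_on = [google_bigquery_table.main] + + view { + query = file("${path.module}/views/${each.key}.sql") + use_legacy_sql = false + } +} +``` 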
+### Renaming files and directories +It's really easy to refactor yourself into a corner by renaming modules or directories in their paths. +If you see this error, it probably means you've moved something in the local filesystem that the +cached state was depending on. +``` +Error: Module not found + +The module address +"/repos/workbench/ops/terraform/modules/aou-rw-reporting/" +could not be resolved. + +If you intended this as a path relative to the current module, use +"/repos/workbench/ops/terraform/modules/aou-rw-reporting/" +instead. The "./" prefix indicates that the address is a relative filesystem +path. +``` +So the last chance to rename things freely is just before you've created them and people are depending on them in prod. +It's not really easy to rework your tf files after deployment. (Another good reason for a scratch project). + +### Running in wrong terminal window +If things get created on the wrong cloud, that's not good. I was really confused when I tried running +the AWS tutorial tf file. `tf destroy` is cathartic in such situations. I'm not even sure it's OK to use two +terminals in the same root module at once. + +### Using new BigQuery resources +The BigQuery console UI frequently doesn't list all of the new datasets for several minutes, so using +`bq show` is helpful if you want to see things with your own eyes after a tf operation. + +### Yes Man +If you "yes" out of habit but `terraform apply` or `destroy` bailed out earlier than the prompt, +you'll see a string of `y`s in your terminal. I nearly filed a bug for this, but then realized the `yes` +command with no argument does that for all time (at least, so far...). 
diff --git a/modules/.DS_Store b/modules/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..23f162e0e0933a05a85d2f5c71e9c7d76b2dbaa7 GIT binary patch literal 6148 zcmeHKJxc>Y5PhR5A~q>4x3smkSwbwV{R47Q5DXp&!S*VDm;X%Pd?1F?MzE1LF!Oe2 z=iTOBagzaH>;2sYumZ58JL1*D-2B{qW=EAVBAw3|aE~W!@o>GHRezsw?g?Hn;u()$ z`QvujjRQ~pr(M#|-@~|)NdYM!1*Cu!kOIF~z_O z9M%&RrGOMTRNy$bEARhT^dIK`Ly~qn oC=Ah#iP4U^@pgO`MOoK;>M3B?g`OpcD0Hz;%&HfxlMZ12_g7q5uE@ literal 0 HcmV?d00001 diff --git a/modules/workbench/README.md b/modules/workbench/README.md new file mode 100644 index 0000000..d3235e5 --- /dev/null +++ b/modules/workbench/README.md @@ -0,0 +1,24 @@ +# Workbench Child Modules +The module directories here represent individually deployable subsystems, +microservices, or other functional units. It's easy enough to put all buckets, say, +in a `gcs` module, but that wouldn't really let us operate on an individual components's bucket. + +Following is a broad outline fo each child module. If you feel irritated that you can't see, for example, +all dashboards in one place, you can still go to the Console or use `gcloud`. + +## Reporting +The state for reporting is currently the BigQuery dataset and its tables and views. In the future, +it makes sense to add j +* Reporting-specific metrics +* Notifications on the system +* Reporting-specific logs, specific logs +* Data blocks for views (maybe) + +## Backend Database (future) +This resource is inherently cross-functional, so we can just put +* The application DB +* backup settings +This will take advantage of the `google_sql_database_instance` resource. + +Schema migrations work via `Ruby->Gradle->Liquibase->MySql->🚂` +Maybe it needs a `Terraform` caboose. It looks like there's not currently a Liquibase provider. 
diff --git a/modules/workbench/WORKBENCH-MODULE.md b/modules/workbench/WORKBENCH-MODULE.md new file mode 100644 index 0000000..ef2f3f7 --- /dev/null +++ b/modules/workbench/WORKBENCH-MODULE.md @@ -0,0 +1,63 @@ + +# Workbench Module +The module directories here represent individually deployable subsystems, +microservices, or other functional units. It's easy enough to put all buckets, say, +in a `gcs` module, but that wouldn't really let us operate on an individual component's bucket. + +Following is a broad outline of each child module. If you feel irritated that you can't see, for example, +all dashboards in one place, you can still go to the Console or use `gcloud`. + +A somewhat forward-looking plan for that would look like + +# Workbench Module Development Plan +The Workbench is the topmost parent module in the AoU Workbench +Application configuration. It depends on several modules for individual +subsystems. + +After creating a valid Terraform configuration we're not finished, +as we need to make sure we don't step on other tools or automation. +For example, items that pertain to cloud resources will need to move +out of the workbench JSON config system. + +I already have automation for Stackdriver settings that fetches all of their configurations +and plan to migrate it to Terraform. + +## Reporting +The state for reporting is currently the BigQuery dataset and its tables and views. +Highlights: +* Reporting-specific metrics with the `google_logging_metric` [resource](https://www.terraform.io/docs/providers/google/r/logging_metric.html) +and others +* Notifications on the system +* Reporting-specific logs +* Data blocks for views (maybe) + +## Backend Database (future) +This resource is inherently cross-functional, so we can just put +* The application DB +* backup settings +This will take advantage of the `google_sql_database_instance` resource. 
+ +Schema migrations work via `Ruby->Gradle->Liquibase->MySql->🚂` +Maybe it needs a `Terraform` caboose. It looks like there's not currently a Liquibase provider. + +## Workbench to RDR Pipeline +Instantiate [google_cloud_tasks_queue](https://www.terraform.io/docs/providers/google/r/cloud_tasks_queue.html) +resources as necessary. + +## API Server +* AppEngine versions, instances, logs, etc. It isn't just named +App Engine, since that's the resource that gets created. + +## Action Audit +This module maps to +* Stackdriver logs for each environment. (It will need to + move from the application JSON config, likely.) + +## Tiers and Egress Detection +There is a [sumo logic provider](https://www.sumologic.com/blog/terraform-provider-hosted/) for Terraform, which is very good +news. It looks really svelte. + +We will also want to control the VPC flow logs, +perimeters, etc., but it won't be in this `workbench` module, +because Terra-not-form owns the organization and needs to do +creation manually for now.
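To make the "Workbench to RDR Pipeline" idea above concrete, a queue instantiation might be sketched like this. This is a hypothetical sketch only: the queue name and rate/retry limits are illustrative assumptions, and `var.project_id` / `var.region` are assumed to come from the existing workbench module variables:

```hcl
# Hypothetical sketch of a Cloud Tasks queue for the Workbench -> RDR pipeline.
# The queue name and the rate/retry values are illustrative assumptions.
resource "google_cloud_tasks_queue" "rdr_export" {
  name     = "rdr-export"
  project  = var.project_id
  location = var.region

  rate_limits {
    # Throttle dispatches so the RDR endpoint isn't overwhelmed.
    max_dispatches_per_second = 5
  }

  retry_config {
    # Retry transient failures a bounded number of times.
    max_attempts = 10
  }
}
```

One queue per export type (workspaces, users, etc.) would keep retry and rate settings independently tunable.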
diff --git a/modules/workbench/main.tf b/modules/workbench/main.tf new file mode 100644 index 0000000..fcaeed2 --- /dev/null +++ b/modules/workbench/main.tf @@ -0,0 +1,11 @@ +# Module for creating an instance of the scratch AoU RW Environment +module "reporting" { + source = "./modules/reporting" + + # reporting + aou_env = var.aou_env + reporting_dataset_id = var.reporting_dataset_id + + # provider + project_id = var.project_id +} diff --git a/modules/workbench/modules/reporting/.DS_Store b/modules/workbench/modules/reporting/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..f3525be0944f23b4093325db59846b29f5c5c924 GIT binary patch literal 6148 zcmeHKyH3ME5S$H(2qHx~D6gbKqNcH;P*C#&#Cbp@gCi$E>xO^e7x+q;y<0_z4J|^j zEA7qsyyNxFoxDB(GJ2X_07C$Mx?<;u%@?Nk)mzrFkseXvGnSZPjtXm>x1xREH!7fO zH^7Q(R^l1E>$kz;x+tf4QRa-%4f0Mu`-sM#@+z+Oip&ZdavyQWb#L=78@xmAk16iS zeZU0k{Va0Qh#v60Cd{y8o1-90cULQmyf%2fIG;IQni-()^R_b&Ub?vgu7E4>uM}X- zR_h-p^wt${1zdrx0{VRjbj8fDP8dHOY~m4s*yV6E_Vrg$Il(Y9tP}DMO%j!usA5M9 zNpy|}i^~k_gozHZ!-rUAu@j2K?0kNR;gBq$x2}LIu&=;Dx*Tf#zxw|CzfbZjSHKnc zQwm6Td@~+#N>*E2lhaxo=(lt=jjI!GDeS~j%vdeOr}StX585GShIK-Y(ELXr%HWMF I@S_TR0``(oQUCw| literal 0 HcmV?d00001 diff --git a/modules/workbench/modules/reporting/assets/schemas/cohort.json b/modules/workbench/modules/reporting/assets/schemas/cohort.json new file mode 100644 index 0000000..f5e1408 --- /dev/null +++ b/modules/workbench/modules/reporting/assets/schemas/cohort.json @@ -0,0 +1,57 @@ +[ + { + "description": "Time snapshot was taken, in Epoch milliseconds. Same across all rows and all tables in the snapshot, and uniquely defines a particular snapshot.", + "name": "snapshot_timestamp", + "type": "INTEGER" + }, + { + "name": "cohort_id", + "type": "INTEGER", + "description": "Unique ID of this cohort in the application DB. Should be unique within each snapshot." + }, + { + "name": "creation_time", + "type": "TIMESTAMP", + "description": "Timestamp for creation of this cohort." 
+ }, + { + "name": "creator_id", + "type": "INTEGER", + "description": "User ID of cohort creator. Should be a foreign key into the user table." + }, + { + "name": "criteria", + "type": "STRING", + "description": "JSON serialization of the selection criteria for this cohort. Schema is defined at\nhttps://github.com/all-of-us/workbench/pull/4076/files. TODO: update with permanent URL." + }, + { + "name": "description", + "type": "STRING", + "description": "User-provided cohort description from the Save Cohort dialog." + }, + { + "name": "last_modified_time", + "type": "TIMESTAMP", + "description": "Timestamp of last user modification to the cohort definition." + }, + { + "name": "name", + "type": "STRING", + "description": "User-provided human-readable name for this cohort." + }, + { + "name": "type", + "type": "STRING", + "description": "Deprecated. For internal use only." + }, + { + "name": "version", + "type": "INTEGER", + "description": "Deprecated. For internal use only." + }, + { + "name": "workspace_id", + "type": "INTEGER", + "description": "Application workspace ID of the workspace containing this cohort. Should be a foreign\nkey into the workspace table." + } +] diff --git a/modules/workbench/modules/reporting/assets/schemas/institution.json b/modules/workbench/modules/reporting/assets/schemas/institution.json new file mode 100644 index 0000000..034fad1 --- /dev/null +++ b/modules/workbench/modules/reporting/assets/schemas/institution.json @@ -0,0 +1,37 @@ +[ + { + "description": "Time snapshot was taken, in Epoch milliseconds. Same across all rows and all tables in the snapshot, and uniquely defines a particular snapshot.", + "name": "snapshot_timestamp", + "type": "INTEGER" + }, + { + "name": "display_name", + "type": "STRING", + "description": "Human-readable name for the institution, as shown in the UI." + }, + { + "name": "dua_type_enum", + "type": "STRING", + "description": "Data Use Agreement type." 
+ }, + { + "name": "institution_id", + "type": "INTEGER", + "description": "Unique PK for institution table in application DB. Note that in foreign key relationships,\nthe short_name is typically used in place of this identifier." + }, + { + "name": "organization_type_enum", + "type": "STRING", + "description": "Organization type. If Other, then organization_type_other_text describes this organization's type." + }, + { + "name": "organization_type_other_text", + "type": "STRING", + "description": "If organization_type_enum is Other, this field describes organization's type. Value is not valid\notherwise." + }, + { + "name": "short_name", + "type": "STRING", + "description": "A unique string identifier used in the API to map a user affiliation to an institution. (Database\nidentifiers are not typically exposed in the workbench API)." + } +] diff --git a/modules/workbench/modules/reporting/assets/schemas/user.json b/modules/workbench/modules/reporting/assets/schemas/user.json new file mode 100644 index 0000000..75bf678 --- /dev/null +++ b/modules/workbench/modules/reporting/assets/schemas/user.json @@ -0,0 +1,192 @@ + [ + { + "description": "Time snapshot was taken, in Epoch milliseconds. Same across all rows and all tables in the snapshot, and uniquely defines a particular snapshot.", + "name": "snapshot_timestamp", + "type": "INTEGER" + }, + { + "name": "about_you", + "type": "STRING", + "description": "User's description of themselves." + }, + { + "name": "area_of_research", + "type": "STRING", + "description": "User's primary area of research focus." + }, + { + "name": "compliance_training_bypass_time", + "type": "TIMESTAMP", + "description": "Time Compliance Training administratively bypassed, or null if this requirement\nwas never bypassed for this user." + }, + { + "name": "compliance_training_completion_time", + "type": "TIMESTAMP", + "description": "Time Compliance Training completed, or null if never completed." 
+ }, + { + "name": "compliance_training_expiration_time", + "type": "TIMESTAMP", + "description": "Time user's compliance training will expire, or null if training was not taken or was bypassed." + }, + { + "name": "contact_email", + "type": "STRING", + "description": "User's external email address on file." + }, + { + "name": "creation_time", + "type": "TIMESTAMP", + "description": "Timestamp of account creation." + }, + { + "name": "current_position", + "type": "STRING", + "description": "User's current role/position." + }, + { + "name": "data_access_level", + "type": "STRING", + "description": "An indication of whether the user has completed the requirements for access to the\nregistered tier." + }, + { + "name": "data_use_agreement_bypass_time", + "type": "TIMESTAMP", + "description": "Time an admin bypassed the DUCC requirement for this user, or null if that has not happened.\nThis value is reset to null if a user's bypassed status is revoked." + }, + { + "name": "data_use_agreement_completion_time", + "type": "TIMESTAMP", + "description": "Time user last completed the DUCC." + }, + { + "name": "data_use_agreement_signed_version", + "type": "INTEGER", + "description": "Version of DUCC last signed by this user." + }, + { + "name": "demographic_survey_completion_time", + "type": "TIMESTAMP", + "description": "Timestamp for completion of Demographic Survey (if completed)." + }, + { + "name": "disabled", + "type": "BOOLEAN", + "description": "If true, this account has been disabled by an administrator (or potentially an automatic process)." + }, + { + "name": "era_commons_bypass_time", + "type": "TIMESTAMP", + "description": "Time an administrator bypassed the ERA Commons requirement for this user." + }, + { + "name": "era_commons_completion_time", + "type": "TIMESTAMP", + "description": "Time user completed ERA Commons account link." + }, + { + "name": "family_name", + "type": "STRING", + "description": "User last name (family name)."
+ }, + { + "name": "first_registration_completion_time", + "type": "TIMESTAMP", + "description": "Time user first completed registration." + }, + { + "name": "first_sign_in_time", + "type": "TIMESTAMP", + "description": "Time user first signed into the Workbench with a GSuite account." + }, + { + "name": "free_tier_credits_limit_days_override", + "type": "INTEGER", + "description": "Override value for the default free tier time limit (days)." + }, + { + "name": "free_tier_credits_limit_dollars_override", + "type": "FLOAT", + "description": "Override value for the default free tier spending limit (USD)." + }, + { + "name": "given_name", + "type": "STRING", + "description": "User first name (given name)." + }, + { + "name": "last_modified_time", + "type": "TIMESTAMP", + "description": "Time of last modification to this user account." + }, + { + "name": "professional_url", + "type": "STRING", + "description": "User's URL at primary place of work." + }, + { + "name": "two_factor_auth_bypass_time", + "type": "TIMESTAMP", + "description": "Time an administrator bypassed the two-factor authentication requirement for this user,\nor null if it has not been bypassed." + }, + { + "name": "two_factor_auth_completion_time", + "type": "TIMESTAMP", + "description": "Time user registered a two-factor authentication method satisfying the 2FA requirement." + }, + { + "name": "user_id", + "type": "INTEGER", + "description": "Unique integer ID for this user, as assigned in the main Application DB. Serves as a pseudo-\nprimary key in this table when combined with a snapshot_timestamp. BigQuery doesn't enforce\nthis uniqueness, though." + }, + { + "name": "username", + "type": "STRING", + "description": "User's GSuite username, including appropriate domain for this environment. Uniquely describes a\nuser account (but not constrained to be unique by BigQuery)." + }, + { + "name": "city", + "type": "STRING", + "description": "User-reported city of residence." 
+ }, + { + "name": "country", + "type": "STRING", + "description": "User-reported country of residence." + }, + { + "name": "state", + "type": "STRING", + "description": "User-reported state or province. Not guaranteed to match official abbreviations or spellings." + }, + { + "name": "street_address_1", + "type": "STRING", + "description": "First line of user street address." + }, + { + "name": "street_address_2", + "type": "STRING", + "description": "Second line of user street address." + }, + { + "name": "zip_code", + "type": "STRING", + "description": "Up to 10-digit zip code for user residence." + }, + { + "name": "institution_id", + "type": "INTEGER", + "description": "Foreign key into institution table. Each user is only affiliated\nwith a single institution." + }, + { + "name": "institutional_role_enum", + "type": "STRING", + "description": "Description of the user's role at the institution they are\naffiliated with. Selected from a list of predefined values. If \"other\", see institutional_role_other_text\nfor custom description." + }, + { + "name": "institutional_role_other_text", + "type": "STRING", + "description": "If the institutional_role_enum is \"other\", custom description\nof this user's role in the institution." + } +] diff --git a/modules/workbench/modules/reporting/assets/schemas/workspace.json b/modules/workbench/modules/reporting/assets/schemas/workspace.json new file mode 100644 index 0000000..13ec5bd --- /dev/null +++ b/modules/workbench/modules/reporting/assets/schemas/workspace.json @@ -0,0 +1,177 @@ +[ + { + "description": "Time snapshot was taken, in Epoch milliseconds.
Same across all rows and all tables in the snapshot, and uniquely defines a particular snapshot.", + "name": "snapshot_timestamp", + "type": "INTEGER" + }, + { + "name": "billing_account_type", + "type": "STRING", + "description": "Whether the workspace's billing account is the free tier account, or a user-provided billing\naccount" + }, + { + "name": "billing_status", + "type": "STRING", + "description": "Is the billing account associated with this workspace available to incur costs? For a free\ntier project, this indicates whether a user has an available balance in their quota. For a\nuser-provided billing account, this corresponds to whether payment is valid and up to date." + }, + { + "name": "cdr_version_id", + "type": "INTEGER", + "description": "Foreign key into CDR table." + }, + { + "name": "creation_time", + "type": "TIMESTAMP", + "description": "Time workspace was initially created." + }, + { + "name": "creator_id", + "type": "INTEGER", + "description": "User ID of user who initially created this workspace." + }, + { + "name": "disseminate_research_other", + "type": "STRING", + "description": "Description of user-defined research dissemination option." + }, + { + "name": "last_accessed_time", + "type": "TIMESTAMP", + "description": "No longer in use. Column should be ignored." + }, + { + "name": "last_modified_time", + "type": "TIMESTAMP", + "description": "Last time a modification was made to this workspace, with certain exceptions. In general,\nonly changes that should alter sort order in the Recent Workspaces UI trigger an update\nto the last modified time." + }, + { + "name": "name", + "type": "STRING", + "description": "User-defined name for this workspace. Human-readable." + }, + { + "name": "needs_rp_review_prompt", + "type": "INTEGER", + "description": "If true, the owner of the workspace will be asked to review the Research purpose." 
+ }, + { + "name": "published", + "type": "BOOLEAN", + "description": "If true, this workspace has been published." + }, + { + "name": "rp_additional_notes", + "type": "STRING", + "description": "Research purpose additional notes input." + }, + { + "name": "rp_ancestry", + "type": "BOOLEAN", + "description": "If true, user has reported this workspace will study ancestry." + }, + { + "name": "rp_anticipated_findings", + "type": "STRING", + "description": "Answer to question: \"What are the anticipated findings from the study?\nHow would your findings contribute to the body of scientific knowledge in\nthe field?\" 1000 character limit (applied client-side)." + }, + { + "name": "rp_approved", + "type": "BOOLEAN", + "description": "Status of the most recent Request for Review of Research Purpose\nDescription for this workspace.\nIf true, this workspace has been approved by the Resource Access\nBoard. If false, it was rejected. If null, the workspace has not\nbeen adjudicated. If rp_review_requested is true and rp_approved is\nnull, the workspace has a review pending." + }, + { + "name": "rp_commercial_purpose", + "type": "BOOLEAN", + "description": "If true, this workspace and research have commercial goals." + }, + { + "name": "rp_control_set", + "type": "BOOLEAN", + "description": "Research Control selected. All of Us data will be used as a reference or control\ndataset for comparison with another dataset from a different resource (e.g.\nCase-control studies)." + }, + { + "name": "rp_disease_focused_research", + "type": "BOOLEAN", + "description": "Disease-focused research: The primary purpose of the research is to learn more about\na particular disease or disorder (for example, type 2 diabetes), a trait (for example,\nblood pressure), or a set of related conditions (for example, autoimmune diseases,\npsychiatric disorders)."
+ }, + { + "name": "rp_disease_of_focus", + "type": "STRING", + "description": "For workspaces that include Disease-focused Research, the user-supplied name of the disease\nof focus (in the Name of Disease field)." + }, + { + "name": "rp_drug_development", + "type": "BOOLEAN", + "description": "Drug/Therapeutics Development Research selected. Primary focus of the research\nis drug/therapeutics development. The data will be used to understand treatment-gene\ninteractions or treatment outcomes relevant to the therapeutic(s) of interest." + }, + { + "name": "rp_educational", + "type": "BOOLEAN", + "description": "Educational Purpose: The data will be used for education purposes (e.g. for a college\nresearch methods course, to educate students on population-based research approaches)." + }, + { + "name": "rp_ethics", + "type": "BOOLEAN", + "description": "Ethical, Legal, and Social Implications (ELSI) Research: this\nresearch focuses on ethical, legal, and social implications (ELSI) of, or related to design,\nconduct, and translation of research." + }, + { + "name": "rp_intended_study", + "type": "STRING", + "description": "Intended field of study for this research." + }, + { + "name": "rp_methods_development", + "type": "BOOLEAN", + "description": "Methods development/validation study: The primary purpose of the use of All of Us data is to\ndevelop and/or validate specific methods/tools for analyzing or interpreting data (e.g.\nstatistical methods for describing data trends, developing more powerful methods to detect\ngene-environment, or other types of interactions in genome-wide association studies)." + }, + { + "name": "rp_other_population_details", + "type": "STRING", + "description": "If studying a specific population categorized as Other, user's description of that population." + }, + { + "name": "rp_other_purpose", + "type": "BOOLEAN", + "description": "Other Purpose checkbox (requires rp_other_purpose_details)."
+ }, + { + "name": "rp_other_purpose_details", + "type": "STRING", + "description": "If your purpose of use is different from the options listed above, please\nselect \"Other Purpose\" and provide details regarding your purpose of data use here (500 character limit).\nrp_other_purpose should be true if this field is populated." + }, + { + "name": "rp_population_health", + "type": "BOOLEAN", + "description": "Population Health/Public Health Research: the primary purpose of using All of Us data is to\ninvestigate health behaviors, outcomes, access, and disparities in populations." + }, + { + "name": "rp_reason_for_all_of_us", + "type": "STRING", + "description": "Why All of Us was chosen for the research in this workspace." + }, + { + "name": "rp_review_requested", + "type": "BOOLEAN", + "description": "If true, a review has been requested by the Resource Access Board. This\nflag is currently not reset when a review is completed." + }, + { + "name": "rp_scientific_approach", + "type": "STRING", + "description": "Answer to the question \"What are the scientific approaches you plan to use for your\nstudy? Describe the datasets, research methods, and tools you will use to answer\nyour scientific question(s).\"" + }, + { + "name": "rp_social_behavioral", + "type": "BOOLEAN", + "description": "If true, user states the research focuses on the social or behavioral phenomena or determinants of health." + }, + { + "name": "rp_time_requested", + "type": "TIMESTAMP", + "description": "Time user requested a research purpose review, or null if never requested." + }, + { + "name": "workspace_id", + "type": "INTEGER", + "description": "Primary key of the workspace table in Workbench application database. Along with\nsnapshot_timestamp, serves as a pseudo-primary key for this table."
+ } +] diff --git a/modules/workbench/modules/reporting/assets/views/latest_cohorts.sql b/modules/workbench/modules/reporting/assets/views/latest_cohorts.sql new file mode 100644 index 0000000..a6c0285 --- /dev/null +++ b/modules/workbench/modules/reporting/assets/views/latest_cohorts.sql @@ -0,0 +1,13 @@ +-- All cohorts from the most recent snapshot +SELECT + c.* +FROM + `${project}`.${dataset}.cohort c +WHERE + c.snapshot_timestamp = ( + SELECT + MAX(u.snapshot_timestamp) + FROM + `${project}`.${dataset}.user u) +ORDER BY + c.cohort_id; diff --git a/modules/workbench/modules/reporting/assets/views/latest_institutions.sql b/modules/workbench/modules/reporting/assets/views/latest_institutions.sql new file mode 100644 index 0000000..4ef7a28 --- /dev/null +++ b/modules/workbench/modules/reporting/assets/views/latest_institutions.sql @@ -0,0 +1,14 @@ +-- All institutions from the most recent snapshot. Some may not have +-- users associated with them. +SELECT + i.* +FROM + `${project}`.${dataset}.institution i +WHERE + i.snapshot_timestamp = ( + SELECT + MAX(u.snapshot_timestamp) + FROM + `${project}`.${dataset}.user u) +ORDER BY + i.institution_id; diff --git a/modules/workbench/modules/reporting/assets/views/latest_users.sql b/modules/workbench/modules/reporting/assets/views/latest_users.sql new file mode 100644 index 0000000..173f0bc --- /dev/null +++ b/modules/workbench/modules/reporting/assets/views/latest_users.sql @@ -0,0 +1,12 @@ +SELECT + u.* +FROM + `${project}`.${dataset}.user u +WHERE + u.snapshot_timestamp = ( + SELECT + MAX(u2.snapshot_timestamp) + FROM + `${project}`.${dataset}.user u2) +ORDER BY + u.username; diff --git a/modules/workbench/modules/reporting/assets/views/latest_workspaces.sql b/modules/workbench/modules/reporting/assets/views/latest_workspaces.sql new file mode 100644 index 0000000..742327f --- /dev/null +++ b/modules/workbench/modules/reporting/assets/views/latest_workspaces.sql @@ -0,0 +1,13 @@ +SELECT + w.* +FROM + 
`${project}`.${dataset}.workspace w +WHERE + w.snapshot_timestamp = ( + SELECT + MAX(u.snapshot_timestamp) + FROM + `${project}`.${dataset}.user u) + +ORDER BY + w.workspace_id; diff --git a/modules/workbench/modules/reporting/assets/views/table_count_vs_time.sql b/modules/workbench/modules/reporting/assets/views/table_count_vs_time.sql new file mode 100644 index 0000000..c8753f8 --- /dev/null +++ b/modules/workbench/modules/reporting/assets/views/table_count_vs_time.sql @@ -0,0 +1,39 @@ +-- simple count of each table over time. Demonstrates time series +-- aggregation across snapshots. Note that if the user table is missing a timestamp, +-- we consider it a bad snapshot, but any other table will return zero rows. +SELECT + TIMESTAMP_MILLIS(u.snapshot_timestamp) AS snapshot, + ( + SELECT + COUNT(u_inner.user_id) + FROM + `${project}`.${dataset}.user u_inner + WHERE + u_inner.snapshot_timestamp = u.snapshot_timestamp) AS user_count, + ( + SELECT + COUNT(w.workspace_id) + FROM + `${project}`.${dataset}.workspace w + WHERE + w.snapshot_timestamp = u.snapshot_timestamp) AS workspace_count, + ( + SELECT + COUNT(c.cohort_id) + FROM + `${project}`.${dataset}.cohort c + WHERE + c.snapshot_timestamp = u.snapshot_timestamp) AS cohort_count, + ( + SELECT + COUNT(institution_id) + FROM + `${project}`.${dataset}.institution i + WHERE + i.snapshot_timestamp = u.snapshot_timestamp) AS institution_count +FROM + `${project}`.${dataset}.user u +GROUP BY + u.snapshot_timestamp +ORDER BY + u.snapshot_timestamp; diff --git a/modules/workbench/modules/reporting/main.tf b/modules/workbench/modules/reporting/main.tf new file mode 100644 index 0000000..3b140f4 --- /dev/null +++ b/modules/workbench/modules/reporting/main.tf @@ -0,0 +1,105 @@ +locals { + # + # Tables + # + + # Description attribute for each table. If absent, no description is set (null). 
+ table_to_description = { + cohort = "Workbench Cohorts, including Uncompressed JSON Criteria" + user = "All Workbench users at each snapshot, including disabled or incompletely registered accounts." + institution = "Institutions represented by workbench users in an official affiliation" + workspace = "Workbench Workspaces (Active Only). Includes all user-supplied data about the research being conducted in each workspace." + } + + # Values that never change for this dataset. + TABLE_CONSTANTS = { + time_partitioning = null + expiration_time = null + clustering = [] + labels = { + terraform_managed = "true" + } + } + + TABLE_SCHEMA_SUFFIX = ".json" + # The module path is in a hidden directory under the running directory of the calling module, but + # our view files and schemas for BigQuery are in whatever directory this file is located, and apparently + # don't get loaded into that dir by default. + table_schema_filenames = fileset(pathexpand("${path.module}/assets/schemas"), "*.json") + // fileset() doesn't have an option to output full paths, so we need to re-expand them + table_schema_paths = [for file_name in local.table_schema_filenames : pathexpand("${path.module}/assets/schemas/${file_name}")] + + # Build a vector of objects, one for each table + table_inputs = [for full_path in local.table_schema_paths : { + schema = full_path + # TODO(jaycarlton) I do not yet see a way around doing the replacement twice, as it's not possible + # to refer to other values in the same object when defining it. + table_id = replace(basename(full_path), local.TABLE_SCHEMA_SUFFIX, "") + description = lookup(local.table_to_description, replace(basename(full_path), local.TABLE_SCHEMA_SUFFIX, ""), null) + }] + + # Merge calculated inputs with the ones we use every time.
+ tables = [for table_input in local.table_inputs : + merge(table_input, local.TABLE_CONSTANTS) + ] + + # + # Views + # + VIEW_CONSTANTS = { + # Reporting Subsystem always uses Standard SQL Syntax + use_legacy_sql = false, + labels = { + terraform_managed = "true" + } + } + QUERY_TEMPLATE_SUFFIX = ".sql" + # Local filenames for view templates. Returns something like ["latest_users.sql", "users_by_id.sql"] + view_query_template_filenames = fileset("${path.module}/assets/views", "*.sql") + # expanded to fully qualified path, e.g. ["/repos/workbench/terraform/modules/reporting/views/latest_users.sql", ...] + view_query_template_paths = [for file_name in local.view_query_template_filenames : pathexpand("${path.module}/assets/views/${file_name}")] + + # Create views for each .sql file in the views directory. There is no Terraform + # dependency from the view to the table(s) it queries, and I don't believe the SQL is even checked + # for accuracy prior to creation on the BQ side. + views = [for view_query_template_path in local.view_query_template_paths : + merge({ + view_id = replace(basename(view_query_template_path), local.QUERY_TEMPLATE_SUFFIX, ""), + query = templatefile(view_query_template_path, { + project = var.project_id + dataset = var.reporting_dataset_id + }) + }, local.VIEW_CONSTANTS)] + +} + +# All BigQuery assets for Reporting subsystem +module "main" { + source = "terraform-google-modules/bigquery/google" + version = "~> 4.3" + dataset_id = var.reporting_dataset_id + project_id = var.project_id + location = "US" + + # Note: friendly_name is discovered in plan and apply steps, but can't be + # entered here. Maybe they're just not exposed by the dataset module but the resources are looking + # for them? + dataset_name = "Workbench ${title(var.aou_env)} Environment Reporting Data" # exposed as friendly_name in plan + description = "Daily output of relational tables and time series views for analysis. Views are provided for general ad-hoc analysis." 
+ + tables = local.tables + + # Note that, when creating this module from the ground up, it's common to see an error like + # `Error: googleapi: Error 404: Not found: Table my-project:my_dataset.my_table, notFound`. It seems + # to be a momentary issue due to the dataset's existence not yet being observable to the table/view + # create API. So far, it's always worked on a re-run. + # TODO(jaycarlton) see if there's a way to put a retry on this. I'm not convinced that will work + # outside of a resource context (and inside a third-party module). + views = local.views + + dataset_labels = { + subsystem = "reporting" + terraform_managed = "true" + aou_env = var.aou_env + } +} diff --git a/modules/workbench/modules/reporting/variables.tf b/modules/workbench/modules/reporting/variables.tf new file mode 100644 index 0000000..4b9335c --- /dev/null +++ b/modules/workbench/modules/reporting/variables.tf @@ -0,0 +1,14 @@ +variable aou_env { + description = "Short name (all lowercase) of All of Us Workbench deployed environments, e.g. local, test, staging, prod." + type = string +} + +variable project_id { + description = "GCP Project" + type = string +} + +variable reporting_dataset_id { + description = "BigQuery dataset for workbench reporting data."
+ type = string +} diff --git a/modules/workbench/modules/reporting/views/latest_cohorts.sql b/modules/workbench/modules/reporting/views/latest_cohorts.sql new file mode 100644 index 0000000..a6c0285 --- /dev/null +++ b/modules/workbench/modules/reporting/views/latest_cohorts.sql @@ -0,0 +1,13 @@ +-- All cohorts from the most recent snapshot +SELECT + c.* +FROM + `${project}`.${dataset}.cohort c +WHERE + c.snapshot_timestamp = ( + SELECT + MAX(u.snapshot_timestamp) + FROM + `${project}`.${dataset}.user u) +ORDER BY + c.cohort_id; diff --git a/modules/workbench/modules/reporting/views/latest_institutions.sql b/modules/workbench/modules/reporting/views/latest_institutions.sql new file mode 100644 index 0000000..4ef7a28 --- /dev/null +++ b/modules/workbench/modules/reporting/views/latest_institutions.sql @@ -0,0 +1,14 @@ +-- All institutions from the most recent snapshot. Some may not have +-- users associated with them. +SELECT + i.* +FROM + `${project}`.${dataset}.institution i +WHERE + i.snapshot_timestamp = ( + SELECT + MAX(u.snapshot_timestamp) + FROM + `${project}`.${dataset}.user u) +ORDER BY + i.institution_id; diff --git a/modules/workbench/modules/reporting/views/latest_users.sql b/modules/workbench/modules/reporting/views/latest_users.sql new file mode 100644 index 0000000..173f0bc --- /dev/null +++ b/modules/workbench/modules/reporting/views/latest_users.sql @@ -0,0 +1,12 @@ +SELECT + u.* +FROM + `${project}`.${dataset}.user u +WHERE + u.snapshot_timestamp = ( + SELECT + MAX(u2.snapshot_timestamp) + FROM + `${project}`.${dataset}.user u2) +ORDER BY + u.username; diff --git a/modules/workbench/modules/reporting/views/latest_workspaces.sql b/modules/workbench/modules/reporting/views/latest_workspaces.sql new file mode 100644 index 0000000..742327f --- /dev/null +++ b/modules/workbench/modules/reporting/views/latest_workspaces.sql @@ -0,0 +1,13 @@ +SELECT + w.* +FROM + `${project}`.${dataset}.workspace w +WHERE + w.snapshot_timestamp = ( + SELECT + 
MAX(u.snapshot_timestamp) + FROM + `${project}`.${dataset}.user u) + +ORDER BY + w.workspace_id; diff --git a/modules/workbench/modules/reporting/views/table_count_vs_time.sql b/modules/workbench/modules/reporting/views/table_count_vs_time.sql new file mode 100644 index 0000000..c8753f8 --- /dev/null +++ b/modules/workbench/modules/reporting/views/table_count_vs_time.sql @@ -0,0 +1,39 @@ +-- simple count of each table over time. Demonstrates time series +-- aggregation across snapshots. Note that if the user table is missing a timestamp, +-- we consider it a bad snapshot, but any other table will return zero rows. +SELECT + TIMESTAMP_MILLIS(u.snapshot_timestamp) AS snapshot, + ( + SELECT + COUNT(u_inner.user_id) + FROM + `${project}`.${dataset}.user u_inner + WHERE + u_inner.snapshot_timestamp = u.snapshot_timestamp) AS user_count, + ( + SELECT + COUNT(w.workspace_id) + FROM + `${project}`.${dataset}.workspace w + WHERE + w.snapshot_timestamp = u.snapshot_timestamp) AS workspace_count, + ( + SELECT + COUNT(c.cohort_id) + FROM + `${project}`.${dataset}.cohort c + WHERE + c.snapshot_timestamp = u.snapshot_timestamp) AS cohort_count, + ( + SELECT + COUNT(institution_id) + FROM + `${project}`.${dataset}.institution i + WHERE + i.snapshot_timestamp = u.snapshot_timestamp) AS institution_count +FROM + `${project}`.${dataset}.user u +GROUP BY + u.snapshot_timestamp +ORDER BY + u.snapshot_timestamp; diff --git a/modules/workbench/providers.tf b/modules/workbench/providers.tf new file mode 100644 index 0000000..58890c3 --- /dev/null +++ b/modules/workbench/providers.tf @@ -0,0 +1,19 @@ +// See https://www.terraform.io/docs/configuration/providers.html +// Child modules receive their provider configurations from the root module. 
+terraform { + required_providers { + google = { + source = "hashicorp/google" + } + } +} + +provider "google" { + version = "3.5.0" + project = var.project_id + region = var.region + zone = var.zone + # Rather than provide a credentials_file value, we should + # use application-default credentials. + # https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login +} diff --git a/modules/workbench/variables.tf b/modules/workbench/variables.tf new file mode 100644 index 0000000..557a904 --- /dev/null +++ b/modules/workbench/variables.tf @@ -0,0 +1,36 @@ +# +# Environment Variables +# +variable "aou_env" { + description = "Short name (all lowercase) of an All of Us Workbench deployed environment, e.g. local, test, staging, prod." + type = string +} + +# +# Provider Variables +# + +variable "project_id" { + description = "GCP project ID" + type = string +} + +variable "region" { + description = "GCP region" + type = string + default = "us-central1" +} + +variable "zone" { + description = "GCP zone" + type = string + default = "us-central1-c" +} + +# +# Reporting +# +variable "reporting_dataset_id" { + description = "BigQuery dataset for workbench reporting data."
+ type = string +} From 8c194a99e81f2f4dfcc23e75d5af484acc5fef9d Mon Sep 17 00:00:00 2001 From: Jay Carlton <53479492+jaycarlton@users.noreply.github.com> Date: Sat, 14 Nov 2020 19:54:13 -0500 Subject: [PATCH 2/4] trick TF into honoring view->table dependency --- .../reporting/assets/views/latest_cohorts.sql | 13 ------- .../assets/views/latest_institutions.sql | 14 ------- .../reporting/assets/views/latest_users.sql | 12 ------ .../assets/views/latest_workspaces.sql | 13 ------- .../assets/views/live/live_table.sql | 13 +++++++ .../{ => timeseries}/table_count_vs_time.sql | 0 modules/workbench/modules/reporting/main.tf | 32 ++++++++++----- .../reporting/views/latest_cohorts.sql | 13 ------- .../reporting/views/latest_institutions.sql | 14 ------- .../modules/reporting/views/latest_users.sql | 12 ------ .../reporting/views/latest_workspaces.sql | 13 ------- .../reporting/views/table_count_vs_time.sql | 39 ------------------- 12 files changed, 35 insertions(+), 153 deletions(-) delete mode 100644 modules/workbench/modules/reporting/assets/views/latest_cohorts.sql delete mode 100644 modules/workbench/modules/reporting/assets/views/latest_institutions.sql delete mode 100644 modules/workbench/modules/reporting/assets/views/latest_users.sql delete mode 100644 modules/workbench/modules/reporting/assets/views/latest_workspaces.sql create mode 100644 modules/workbench/modules/reporting/assets/views/live/live_table.sql rename modules/workbench/modules/reporting/assets/views/{ => timeseries}/table_count_vs_time.sql (100%) delete mode 100644 modules/workbench/modules/reporting/views/latest_cohorts.sql delete mode 100644 modules/workbench/modules/reporting/views/latest_institutions.sql delete mode 100644 modules/workbench/modules/reporting/views/latest_users.sql delete mode 100644 modules/workbench/modules/reporting/views/latest_workspaces.sql delete mode 100644 modules/workbench/modules/reporting/views/table_count_vs_time.sql diff --git 
a/modules/workbench/modules/reporting/assets/views/latest_cohorts.sql b/modules/workbench/modules/reporting/assets/views/latest_cohorts.sql deleted file mode 100644 index a6c0285..0000000 --- a/modules/workbench/modules/reporting/assets/views/latest_cohorts.sql +++ /dev/null @@ -1,13 +0,0 @@ --- All cohorts from the most recent snapshot -SELECT - c.* -FROM - `${project}`.${dataset}.cohort c -WHERE - c.snapshot_timestamp = ( - SELECT - MAX(u.snapshot_timestamp) - FROM - `${project}`.${dataset}.user u) -ORDER BY - c.cohort_id; diff --git a/modules/workbench/modules/reporting/assets/views/latest_institutions.sql b/modules/workbench/modules/reporting/assets/views/latest_institutions.sql deleted file mode 100644 index 4ef7a28..0000000 --- a/modules/workbench/modules/reporting/assets/views/latest_institutions.sql +++ /dev/null @@ -1,14 +0,0 @@ --- All institutions from the most recent snapshot. Some may not have --- users associated with them. -SELECT - i.* -FROM - `${project}`.${dataset}.institution i -WHERE - i.snapshot_timestamp = ( - SELECT - MAX(u.snapshot_timestamp) - FROM - `${project}`.${dataset}.user u) -ORDER BY - i.institution_id; diff --git a/modules/workbench/modules/reporting/assets/views/latest_users.sql b/modules/workbench/modules/reporting/assets/views/latest_users.sql deleted file mode 100644 index 173f0bc..0000000 --- a/modules/workbench/modules/reporting/assets/views/latest_users.sql +++ /dev/null @@ -1,12 +0,0 @@ -SELECT - u.* -FROM - `${project}`.${dataset}.user u -WHERE - u.snapshot_timestamp = ( - SELECT - MAX(u2.snapshot_timestamp) - FROM - `${project}`.${dataset}.user u2) -ORDER BY - u.username; diff --git a/modules/workbench/modules/reporting/assets/views/latest_workspaces.sql b/modules/workbench/modules/reporting/assets/views/latest_workspaces.sql deleted file mode 100644 index 742327f..0000000 --- a/modules/workbench/modules/reporting/assets/views/latest_workspaces.sql +++ /dev/null @@ -1,13 +0,0 @@ -SELECT - w.* -FROM - 
`${project}`.${dataset}.workspace w -WHERE - w.snapshot_timestamp = ( - SELECT - MAX(u.snapshot_timestamp) - FROM - `${project}`.${dataset}.user u) - -ORDER BY - w.workspace_id; diff --git a/modules/workbench/modules/reporting/assets/views/live/live_table.sql b/modules/workbench/modules/reporting/assets/views/live/live_table.sql new file mode 100644 index 0000000..c39e19a --- /dev/null +++ b/modules/workbench/modules/reporting/assets/views/live/live_table.sql @@ -0,0 +1,13 @@ +-- All ${table_name} rows from the most recent snapshot +SELECT + t.* +FROM + `${project}`.${dataset}.${table_name} t +WHERE + t.snapshot_timestamp = ( + SELECT + MAX(u.snapshot_timestamp) + FROM + `${project}`.${dataset}.user u) +ORDER BY + t.${table_name}_id; diff --git a/modules/workbench/modules/reporting/assets/views/table_count_vs_time.sql b/modules/workbench/modules/reporting/assets/views/timeseries/table_count_vs_time.sql similarity index 100% rename from modules/workbench/modules/reporting/assets/views/table_count_vs_time.sql rename to modules/workbench/modules/reporting/assets/views/timeseries/table_count_vs_time.sql diff --git a/modules/workbench/modules/reporting/main.tf b/modules/workbench/modules/reporting/main.tf index 3b140f4..fc00bce 100644 --- a/modules/workbench/modules/reporting/main.tf +++ b/modules/workbench/modules/reporting/main.tf @@ -55,14 +55,32 @@ locals { } QUERY_TEMPLATE_SUFFIX = ".sql" # Local filenames for view templates. Returns something like ["latest_users.sql", "users_by_id.sql"] - view_query_template_filenames = fileset("${path.module}/assets/views", "*.sql") + timeseries_view_template_filenames = fileset("${path.module}/assets/views/timeseries", "*.sql") # expanded to fully qualified path, e.g. ["/repos/workbench/terraform/modules/reporting/views/latest_users.sql", ...] 
- view_query_template_paths = [for file_name in local.view_query_template_filenames : pathexpand("${path.module}/assets/views/${file_name}")] + timeseries_view_template_paths = [for file_name in local.timeseries_view_template_filenames : + pathexpand("${path.module}/assets/views/timeseries/${file_name}")] + + live_view_tables = [for table_input in local.table_inputs : table_input["table_id"] ] + live_view_template_path = pathexpand("${path.module}/assets/views/live/live_table.sql") + + # All live views (live_user, live_cohort, etc.) depend on the tables being created first, so we need to make sure + # Terraform treats each view as depending on all the tables. It's not possible to depend on the exact + # table (I think) but this should solve the dependency problem of trying to create the view before + # its table. https://stackoverflow.com/q/64795896/12345554 + live_views = [for table_name in module.main.table_names : + merge({ + view_id = "live_${table_name}" + query = templatefile(local.live_view_template_path, { + project = var.project_id + dataset = var.reporting_dataset_id + table_name = table_name + }) + }, local.VIEW_CONSTANTS)] # Create views for each .sql file in the views directory. There is no Terraform # dependency from the view to the table(s) it queries, and I don't believe the SQL is even checked # for accuracy prior to creation on the BQ side. - views = [for view_query_template_path in local.view_query_template_paths : + timeseries_views = [for view_query_template_path in local.timeseries_view_template_paths : merge({ view_id = replace(basename(view_query_template_path), local.QUERY_TEMPLATE_SUFFIX, ""), query = templatefile(view_query_template_path, { @@ -71,6 +89,7 @@ locals { }) }, local.VIEW_CONSTANTS)] + views = concat(local.live_views, local.timeseries_views) } # All BigQuery assets for Reporting subsystem @@ -88,13 +107,6 @@ module "main" { description = "Daily output of relational tables and time series views for analysis.
Views are provided for general ad-hoc analysis." tables = local.tables - - # Note that, when creating this module fom the ground up, it's common to see an error like - # `Error: googleapi: Error 404: Not found: Table my-project:my_dataset.my_table, notFound`. It seems - # to be a momentary issue due to the dataset's existence not yet being observable to the table/view - # create API. So far, it's always worked on a re-run. - # TODO(jaycarlton) see if there's a way to put a retry on this. I'm not convinced that will work - # outside of a resource context (and inside a third-party module). views = local.views dataset_labels = { diff --git a/modules/workbench/modules/reporting/views/latest_cohorts.sql b/modules/workbench/modules/reporting/views/latest_cohorts.sql deleted file mode 100644 index a6c0285..0000000 --- a/modules/workbench/modules/reporting/views/latest_cohorts.sql +++ /dev/null @@ -1,13 +0,0 @@ --- All cohorts from the most recent snapshot -SELECT - c.* -FROM - `${project}`.${dataset}.cohort c -WHERE - c.snapshot_timestamp = ( - SELECT - MAX(u.snapshot_timestamp) - FROM - `${project}`.${dataset}.user u) -ORDER BY - c.cohort_id; diff --git a/modules/workbench/modules/reporting/views/latest_institutions.sql b/modules/workbench/modules/reporting/views/latest_institutions.sql deleted file mode 100644 index 4ef7a28..0000000 --- a/modules/workbench/modules/reporting/views/latest_institutions.sql +++ /dev/null @@ -1,14 +0,0 @@ --- All institutions from the most recent snapshot. Some may not have --- users associated with them. 
-SELECT - i.* -FROM - `${project}`.${dataset}.institution i -WHERE - i.snapshot_timestamp = ( - SELECT - MAX(u.snapshot_timestamp) - FROM - `${project}`.${dataset}.user u) -ORDER BY - i.institution_id; diff --git a/modules/workbench/modules/reporting/views/latest_users.sql b/modules/workbench/modules/reporting/views/latest_users.sql deleted file mode 100644 index 173f0bc..0000000 --- a/modules/workbench/modules/reporting/views/latest_users.sql +++ /dev/null @@ -1,12 +0,0 @@ -SELECT - u.* -FROM - `${project}`.${dataset}.user u -WHERE - u.snapshot_timestamp = ( - SELECT - MAX(u2.snapshot_timestamp) - FROM - `${project}`.${dataset}.user u2) -ORDER BY - u.username; diff --git a/modules/workbench/modules/reporting/views/latest_workspaces.sql b/modules/workbench/modules/reporting/views/latest_workspaces.sql deleted file mode 100644 index 742327f..0000000 --- a/modules/workbench/modules/reporting/views/latest_workspaces.sql +++ /dev/null @@ -1,13 +0,0 @@ -SELECT - w.* -FROM - `${project}`.${dataset}.workspace w -WHERE - w.snapshot_timestamp = ( - SELECT - MAX(u.snapshot_timestamp) - FROM - `${project}`.${dataset}.user u) - -ORDER BY - w.workspace_id; diff --git a/modules/workbench/modules/reporting/views/table_count_vs_time.sql b/modules/workbench/modules/reporting/views/table_count_vs_time.sql deleted file mode 100644 index c8753f8..0000000 --- a/modules/workbench/modules/reporting/views/table_count_vs_time.sql +++ /dev/null @@ -1,39 +0,0 @@ --- simple count of each table over time. Demonstrates time series --- aggregation across snapshots. Note that if the user table is missing a timestamp, --- we consider it a bad snapshot, but any other table will return zero rows. 
-SELECT - TIMESTAMP_MILLIS(u.snapshot_timestamp) AS snapshot, - ( - SELECT - COUNT(u_inner.user_id) - FROM - `${project}`.${dataset}.user u_inner - WHERE - u_inner.snapshot_timestamp = u.snapshot_timestamp) AS user_count, - ( - SELECT - COUNT(w.workspace_id) - FROM - `${project}`.${dataset}.workspace w - WHERE - w.snapshot_timestamp = u.snapshot_timestamp) AS workspace_count, - ( - SELECT - COUNT(c.cohort_id) - FROM - `${project}`.${dataset}.cohort c - WHERE - c.snapshot_timestamp = u.snapshot_timestamp) AS cohort_count, - ( - SELECT - COUNT(institution_id) - FROM - `${project}`.${dataset}.institution i - WHERE - i.snapshot_timestamp = u.snapshot_timestamp) AS institution_count -FROM - `${project}`.${dataset}.user u -GROUP BY - u.snapshot_timestamp -ORDER BY - u.snapshot_timestamp; From 76ae015a3874f6d483e3f4f700b91ba048570930 Mon Sep 17 00:00:00 2001 From: Jay Carlton <53479492+jaycarlton@users.noreply.github.com> Date: Mon, 30 Nov 2020 11:41:27 -0500 Subject: [PATCH 3/4] fix and fmt --- .gitignore | 1 + modules/workbench/README.md | 45 +++++++++---- modules/workbench/WORKBENCH-MODULE-PLAN.md | 72 +++++++++++++++++++++ modules/workbench/main.tf | 2 +- modules/workbench/modules/reporting/main.tf | 20 +++--- 5 files changed, 117 insertions(+), 23 deletions(-) create mode 100644 modules/workbench/WORKBENCH-MODULE-PLAN.md diff --git a/.gitignore b/.gitignore index 39f7fa5..5e69c9b 100644 --- a/.gitignore +++ b/.gitignore @@ -2,3 +2,4 @@ *.tfstate *.backup *.iml +.DS_Store diff --git a/modules/workbench/README.md b/modules/workbench/README.md index d3235e5..a7e776e 100644 --- a/modules/workbench/README.md +++ b/modules/workbench/README.md @@ -1,24 +1,45 @@ -# Workbench Child Modules +# Workbench Terraform Modules The module directories here represent individually deployable subsystems, microservices, or other functional units. It's easy enough to put all buckets, say, -in a `gcs` module, but that wouldn't really let us operate on an individual components's bucket. 
+in a `gcs` module, but that wouldn't really let us operate on an individual component's bucket. Following is a broad outline of each child module. If you feel irritated that you can't see, for example, all dashboards in one place, you can still go to the Console or use `gcloud`. +## Goals +### Automate ourselves out of a job +All the existing and planned Terraform modules have some level of scripted or otherwise automated +support processes. +## Non-goals +### Become the only game in town +We don't want to get into a position where we force anyone to use Terraform if it's not the best +choice for them. Terraform is still pretty new, and changing rapidly. The Google provider is also +under rapid development. +### Wag the Dog +We do not have any aspirations to absorb any of the tasks that external teams are responsible for, +including building the GCP projects for each of our environments or conducting all administrative +tasks in either pmi-ops or terra projects. If Terraform really "takes off", then it may make sense to +share learnings, and at that point, there may be opportunities for our Terraform stack to use theirs, +or vice versa. While these boundaries may be fuzzy today, hopefully the addition of clear module +inputs and documentation will drive clarification of responsibilities and visibility into state, +dependencies, etc. +### Bypass security approvals +In some cases, actions that require security approval can be performed in Terraform, particularly +around IAM bindings, access groups, and roles. We don't want a situation where an audit finds that +individuals or service accounts were added or modified without going through the proper channels. -## Reporting
In the future, -it makes sense to add j +it makes sense to add these sorts of things: * Reporting-specific metrics * Notifications on the system * Reporting-specific logs * Data blocks for views (maybe) -## Backend Database (future) -This resource is inherently cross-functional, so we can just put -* The application DB -* backup settings -This will take advantage of the `google_sql_database_instance` resource. - -Schema migrations work via `Ruby->Gradle->Liquibase->MySql->🚂` -Maybe it needs a `Terraform` caboose. It looks like there's not currently a Liquibase provider. +In other words, the primary focus of the module is the Reporting system, but it may be convenient to +add reporting-specific artifacts that might otherwise be concerned with Monitoring or other auxiliary +services. diff --git a/modules/workbench/WORKBENCH-MODULE-PLAN.md b/modules/workbench/WORKBENCH-MODULE-PLAN.md new file mode 100644 index 0000000..55d2c0a --- /dev/null +++ b/modules/workbench/WORKBENCH-MODULE-PLAN.md @@ -0,0 +1,72 @@ +# Workbench Module Plan +The module directories here represent individually deployable subsystems, +microservices, or other functional units. It's easy enough to put all buckets, say, +in a `gcs` module, but that wouldn't really let us operate on an individual component's bucket. + +Following is a broad outline of each child module. If you feel irritated that you can't see, for example, +all dashboards in one place, you can still go to the Console or use `gcloud`. + +# Workbench Module Development Plan +The Workbench is the topmost parent module in the AoU Workbench +Application configuration. It depends on several modules for individual +subsystems. + +After creating a valid Terraform configuration we're not finished, +as we need to make sure we don't step on other tools or automation. +For example, items that pertain to cloud resources will need to move +out of the workbench JSON config system.
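The JSON-to-Terraform migration described above can be sketched roughly as follows (the JSON key shape is hypothetical; the variable mirrors the one already declared in `modules/workbench/variables.tf`):

```hcl-terraform
# Hypothetical sketch: a cloud-resource setting moving out of the
# workbench JSON config and into a declared Terraform input variable.
#
# Before (workbench JSON config, key name illustrative):
#   "reporting": { "datasetId": "reporting_local" }
#
# After: the value becomes a module input, supplied per environment
# via a -var-file.
variable "reporting_dataset_id" {
  description = "BigQuery dataset for workbench reporting data."
  type        = string
}
```

Per-environment values would then live in each environment's `.tfvars` file rather than in application config.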
+ +I have automation already for Stackdriver settings that fetches all of their configurations, +and I plan to migrate it to Terraform. + +## Reporting +The state for reporting is currently the BigQuery dataset and its tables and views. +Highlights: +* Reporting-specific metrics with the `google_logging_metric` [resource](https://www.terraform.io/docs/providers/google/r/logging_metric.html) +and others +* Notifications on the system +* Reporting-specific logs +* Data blocks for views (maybe) + +## Backend Database (notional) +This resource is inherently cross-functional, so we can just put in: +* The application DB +* Backup settings +This will take advantage of the `google_sql_database_instance` resource. + +Schema migrations work via `Ruby->Gradle->Liquibase->MySql->🚂` +Maybe it needs a `Terraform` caboose. It looks like there's not currently a Liquibase provider. + +It may not make sense organizationally to do this in Terraform, as there are dependencies on other +team(s) when instantiating or migrating databases. + +## Workbench to RDR Pipeline +Instantiate [google_cloud_tasks_queue](https://www.terraform.io/docs/providers/google/r/cloud_tasks_queue.html) +resources as necessary. + +## API Server +* AppEngine versions, instances, logs, etc. The module isn't just named +App Engine, even though that's the resource that gets created. + +At the moment, there are no plans to rip and replace our existing deployment process or automation, +but we may find areas where the Terraform approach could be helpful (such as managing dependent +deployment artifacts or steps in a declarative way). + +## Action Audit +This module maps to +* Stackdriver logs for each environment. (These will likely need to + move from the application JSON config.) +* Logs-based metrics on the initial log stream +* Sink to BigQuery dataset for each environment (Stackdriver may need to create it initially, in which +case we need to do `terraform import`.)
+* Reporting datasets in BigQuery + +## Tiers and Egress Detection +There is a [Sumo Logic provider](https://www.sumologic.com/blog/terraform-provider-hosted/) for Terraform, which is very good +news. It looks really svelte. + +We will also want to control the VPC flow logs, +perimeters, etc., but it won't be in this `workbench` module, +because Terra (not Terraform) owns the organization and needs to do +creation manually for now. diff --git a/modules/workbench/main.tf b/modules/workbench/main.tf index fcaeed2..776c769 100644 --- a/modules/workbench/main.tf +++ b/modules/workbench/main.tf @@ -1,4 +1,4 @@ -# Module for creating an instance of the scratch AoU RW Environment +# Workbench Analytics Reporting Subsystem module "reporting" { source = "./modules/reporting" diff --git a/modules/workbench/modules/reporting/main.tf b/modules/workbench/modules/reporting/main.tf index fc00bce..5119f78 100644 --- a/modules/workbench/modules/reporting/main.tf +++ b/modules/workbench/modules/reporting/main.tf @@ -58,9 +58,9 @@ locals { timeseries_view_template_filenames = fileset("${path.module}/assets/views/timeseries", "*.sql") # expanded to fully qualified path, e.g. ["/repos/workbench/terraform/modules/reporting/views/latest_users.sql", ...]
timeseries_view_template_paths = [for file_name in local.timeseries_view_template_filenames : - pathexpand("${path.module}/assets/views/timeseries/${file_name}")] + pathexpand("${path.module}/assets/views/timeseries/${file_name}")] - live_view_tables = [for table_input in local.table_inputs : table_input["table_id"] ] + live_view_tables = [for table_input in local.table_inputs : table_input["table_id"]] live_view_template_path = pathexpand("${path.module}/assets/views/live/live_table.sql") # All live views (live_user, live_cohort, etc) depend on the tables being created first, so we need to make sure @@ -68,13 +68,13 @@ locals { # table (I think) but this should solve the dependency problem of trying to create the view before # its table. https://stackoverflow.com/q/64795896/12345554 live_views = [for table_name in module.main.table_names : - merge({ - view_id = "live_${table_name}" - query = templatefile(local.live_view_template_path, { - project = var.project_id - dataset = var.reporting_dataset_id - table_name = table_name - }) + merge({ + view_id = "live_${table_name}" + query = templatefile(local.live_view_template_path, { + project = var.project_id + dataset = var.reporting_dataset_id + table_name = table_name + }) }, local.VIEW_CONSTANTS)] # Create views for each .sql file in the views directory. There is no Terraform @@ -107,7 +107,7 @@ module "main" { description = "Daily output of relational tables and time series views for analysis. Views are provided for general ad-hoc analysis." 
tables = local.tables - views = local.views + views = local.views dataset_labels = { subsystem = "reporting" From 3c9cc54c1f217902ac9b80b730a67c499789353e Mon Sep 17 00:00:00 2001 From: Jay Carlton <53479492+jaycarlton@users.noreply.github.com> Date: Mon, 30 Nov 2020 13:57:49 -0500 Subject: [PATCH 4/4] writeup fixes --- AOU_RW_MODULE_WALKTHROUGH.md | 276 -------------------------- README.md | 45 +++++ modules/workbench/README.md | 45 ----- modules/workbench/WORKBENCH-MODULE.md | 63 ------ 4 files changed, 45 insertions(+), 384 deletions(-) delete mode 100644 AOU_RW_MODULE_WALKTHROUGH.md delete mode 100644 modules/workbench/README.md delete mode 100644 modules/workbench/WORKBENCH-MODULE.md diff --git a/AOU_RW_MODULE_WALKTHROUGH.md b/AOU_RW_MODULE_WALKTHROUGH.md deleted file mode 100644 index 624ed38..0000000 --- a/AOU_RW_MODULE_WALKTHROUGH.md +++ /dev/null @@ -1,276 +0,0 @@ -# AoU Researcher Workbench Module Walkthrough -## 0. Module Structure -The state associated with the current deployment consists of -one `root` module for each environment, in separate directories - -In order to deploy a full (or partial) environment we need to declare what modules are used and to supply -values to all unbound declared variables. The environment module is unioned with the modules in the -`source` statement. - -The overall source structure looks like the following. Note that -Terraform will collect all `.tf` files in a referenced directory, -so the calling module will need to specify values for the chilid -modules' `variable` blocks that don't have defaults. 
- -```text -/repos/workbench/ops/terraform/ -├── AOU_RW_MODULE_WALKTHROUGH.md -├── TERRAFORM-QUICKSTART.md -├── environments -│   ├── local -│   ├── scratch -│   │   ├── SCRATCH-ENVIRONMENT.md -│   │   ├── scratch.tf -│   │   ├── terraform.tfstate -│   │   ├── terraform.tfstate.backup -│   │   └── terraform.tfstate.yet.another.backup -│   └── test -└── modules - └── aou-rw-reporting - ├── providers.tf - ├── reporting.tf - ├── schemas - │   ├── cohort.json - │   ├── institution.json - │   ├── user.json - │   └── workspace.json - ├── variables.tf - └── views - ├── latest_cohorts.sql - ├── latest_institutions.sql - ├── latest_users.sql - ├── latest_workspaces.sql - └── table_count_vs_time.sql -``` -The `modules` directory contains independent, reusable modules foor -subsystems that are -* logical to deploy and configure operationally, -* don't depend on each other (at least for exported modules) and -* can be used by AoU or potentially another organization interested in deploying a copy -of all or part of our system. - -## Prerequisites -### 1. Get Terraform -Install Terraform using the directinos at [TERRAFORM-QUICKSTART.md] -### 2. Change to the `environments/scratch directory` `get` and `init` -The environment for this outline is `scratch`, which exists in a target environment -of your choice. -### 3. Assign Values to Input Variables - -The following public variable declarations are representative of those -specified in `modules/reporting/variables.tf` and elsewhere. The description -string shows when interactively running from the command line without all the -vars cominig in from a `-var-file` argument. -```hcl-terraform -variable credentials_file { - description = "Location of service account credentials JSON file." - type = string -} - -variable aou_env { - description = "Short name (all lowercase) of All of Us Workbench deployed environments, e.g. local, test, staging, prod." 
- type = string -} - -variable project_id { - description = "GCP Project" - type = string -} -``` -Create a `scratch_tutorrial.tfvars` file outside of this repository. This file should -look contain values for the following [input variables](https://www.terraform.io/docs/configuration/variables.html) that will be different -for different organizations and environments. - -```hcl-terraform -aou_env = "scratch" # Name of environment we're creating or attaching to. Needs to match directory name -project_id = "my-qa-project" # Should not be prod -reporting_dataset_id = "firstname_lastname_scratch_0" # BigQuery dataset id -``` - -The credentials file should point to a JSON key file generated -by Google Cloud IAM (at least on lower environments). The only required -permission is `BigQuery Data Owner` Neither the credentials nor -the `.tfvars` file itself should be checked into public source control. - -It's sometimes helpful to assign the full path to this `.tfvars` to an environment variable, -as it will need to e provided for most commands. There are several other ways to do this, -but the advantage for us is separating the reusable stuff from the AoU-instance-specific -values. -```shell script -$ SCRATCH_TFVARS=/rerpos/workbench-devops/terraform/scratch.tfvars -``` - -### 4. Initialize Terraform -Run [`terraform init`](https://www.terraform.io/docs/commands/init.html) to initialize the current directory (which should be -`api/terraform/environrments/scratch` if working from this repo. It should also be possible -work from a directory completely sepaated from source control. It's just -a bit harder to refer to the module definitions. - -If `init` was successful, the following message should print something like the following -like following: -``` -Initializing modules... - -Initializing the backend... - -Initializing provider plugins... -- Using previously-installed hashicorp/google v3.5.0 - -Terraform has been successfully initialized! 
- -You may now begin working with Terraform. Try running "terraform plan" to see -any changes that are required for your infrastructure. All Terraform commands -should now work. - -If you ever set or change modules or backend configuration for Terraform, -rerun this command to reinitialize your working directory. If you forget, other -commands will detect it and remind you to do so if necessary. -``` - -After successfully `init`, while the backend, plugins, and modules are now in a reasonably good state, -but certain expensive operations are deferred for performance. Look at the `terraform.tfstate` file -in the run directory to confirming nothing is of intererst there: -```json -{ - "version": 4, - "terraform_version": "0.13.0", - "serial": 24, - "lineage": "d9d8e034-fad0-03ff-df40-86bdd7a43128", - "outputs": {}, - "resources": [] -} -``` - -### 5. Build a Plan -Terraform creates a plan of action based on the difference between its view of the state -of all the resources, and what's stated in the file. - -Run like -``` -terraform plan -var-file=$SCRATCH_TFVARS -``` -The output for me looks like [this](doc/plan_output.txt). You should see a couple of key things: -* A dataset, several tables, and some views will be created. Searching for "will be created" is an easy way to -see this. -* All the variables are expanded in the state file, so treat this file as Eyes Only. -* The summary line should show `Plan: 10 to add, 0 to change, 0 to destroy.` - -The `plan` command doesn't edit actual resources, but is important for understanding Terraform's marching -orders. - -### 6. Apply the Plan -Use the `apply` command to make the chagnes necessary. It will ask you for a `yes` confirmation beofre proceeding. -In the sase of the reporting module, creating the dataset then immediately crerating tabales may mean -that we need to run one more time. Luckily, `apply` is idempotent for this case and there's no harm. 
- -Once everything is applied, rerunning `tf plan` will show that nothing is left to do: -``` -$ tf plan -lock=false -var-file=$SCRATCH_TFVARS -Refreshing Terraform state in-memory prior to plan... -The refreshed state will be used to calculate this plan, but will not be -persisted to local or remote state storage. - -module.aou_rw_scratch_env.module.reporting.google_bigquery_dataset.main: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2] -module.aou_rw_scratch_env.module.reporting.google_bigquery_table.view["latest_cohorts"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/latest_cohorts] -module.aou_rw_scratch_env.module.reporting.google_bigquery_table.main["institution"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/institution] -module.aou_rw_scratch_env.module.reporting.google_bigquery_table.view["table_count_vs_time"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/table_count_vs_time] -module.aou_rw_scratch_env.module.reporting.google_bigquery_table.view["latest_institutions"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/latest_institutions] -module.aou_rw_scratch_env.module.reporting.google_bigquery_table.view["latest_users"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/latest_users] -module.aou_rw_scratch_env.module.reporting.google_bigquery_table.main["cohort"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/cohort] -module.aou_rw_scratch_env.module.reporting.google_bigquery_table.view["latest_workspaces"]: Refreshing state... 
[id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/latest_workspaces] -module.aou_rw_scratch_env.module.reporting.google_bigquery_table.main["user"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/user] -module.aou_rw_scratch_env.module.reporting.google_bigquery_table.main["workspace"]: Refreshing state... [id=projects/all-of-us-workbench-test/datasets/jaycarlton_terraform_tmp_2/tables/workspace] - ------------------------------------------------------------------------- - -No changes. Infrastructure is up-to-date. - -This means that Terraform did not detect any differences between your -configuration and real physical resources that exist. As a result, no -actions need to be performed. -``` - -### 7. Selectively removing state -If it's necessary to detach one or more online resources from the local Terraform state (as if they had -never been created or imported), use the `terraform state rm` command. The general pattern is -`terraform state rm <resource-address>`. For example, let's say I've decided I no longer want the view -named `latest_workspaces` to be included in the state file. - -### 8. Handy State commands -The [state command](https://www.terraform.io/docs/commands/state/index.html) is one of the more powerful ones to use, and lets you avoid interacting directly with `.tfstate` -files. -#### Import -Working with resources that already exist requires the `terraform import` command. This seems unintuitive, -but the sample `terraform state list` output shows what's expected. Third-party modules should document -the expected syntax.
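As an aside, the bracketed addresses in the refresh output above (`main["cohort"]`, `view["latest_users"]`) come from `for_each` resources. A minimal sketch of that pattern -- not the actual module source; table names and file paths are assumptions based on this repo's asset layout -- looks like:

```hcl
# Sketch only: shows why state addresses look like google_bigquery_table.main["cohort"].
resource "google_bigquery_table" "main" {
  for_each   = toset(["cohort", "institution", "user", "workspace"])
  dataset_id = google_bigquery_dataset.main.dataset_id
  table_id   = each.key
  schema     = file("${path.module}/assets/schemas/${each.key}.json")
}
```

Each instance gets its own state address keyed by the `for_each` value, which is why quoting (and escaping) those keys matters in the import commands below.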
For [importing a BigQuery dataset](https://www.terraform.io/docs/providers/google/r/bigquery_dataset.html#import) -from the `scratch` environment to the `local` environment, simply do: - -```shell script -terraform import -var-file=$TFVARS_LOCAL \ - module.local.module.reporting.google_bigquery_dataset.main \ - reporting_local -``` -The output should look like this if successful. There are several failure modes involving directory structure, -module path, and differing asset ID configurations for different providers. - -``` -terraform import -var-file=$TFVARS_LOCAL module.local.module.reporting.google_bigquery_dataset.main reporting_local -module.local.module.reporting.google_bigquery_dataset.main: Importing from ID "reporting_local"... -module.local.module.reporting.google_bigquery_dataset.main: Import prepared! - Prepared google_bigquery_dataset for import -module.local.module.reporting.google_bigquery_dataset.main: Refreshing state... [id=projects/my-project/datasets/reporting_local] - -Import successful! - -The resources that were imported are shown above. These resources are now in -your Terraform state and will henceforth be managed by Terraform. -``` - -`tf state` should now show that we are managing the resource: -```shell script -tf state list -module.local.module.reporting.google_bigquery_dataset.main -``` - -```shell script -terraform import -var-file=$TFVARS_LOCAL module.local.module.reporting.google_bigquery_table.main[\"cohort\"] projects/all-of-us-workbench-test/datasets/reporting_local/tables/cohort -terraform import -var-file=$TFVARS_LOCAL module.local.module.reporting.google_bigquery_table.view[\"latest_users\"] projects/all-of-us-workbench-test/datasets/reporting_local/tables/latest_users -``` -The commands above are examples of importing individual tables and views. Remember that the quotation marks -in the resource addresses must be escaped.
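Typing those import commands for every table is tedious and error-prone; a small sketch (assuming the same module path and dataset as above) that prints the commands for review before running them:

```shell script
# Hypothetical helper: emit one `terraform import` command per reporting table.
# Review the output for correct quoting, then paste the commands to run them.
for t in cohort institution user workspace; do
  printf 'terraform import -var-file=$TFVARS_LOCAL %s %s\n' \
    "module.local.module.reporting.google_bigquery_table.main[\\\"$t\\\"]" \
    "projects/all-of-us-workbench-test/datasets/reporting_local/tables/$t"
done
```

The `\\\"` produces the escaped `\"` form the examples above require on the command line.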
- -```shell script -$ tf state list -module.local.module.reporting.google_bigquery_dataset.main -module.local.module.reporting.google_bigquery_table.main["cohort"] -``` -None of the `terraform state` commands accept variable values, as those have already been interpolated -during a `plan` or `apply` operation. - -**NOTE** While Terraform is managing the dataset, it's not yet managing any data in it directly. -Running `tf plan` at this point will indicate that, while the dataset is controlled, the tables and -views in it are not. It's probably not a good idea to `terraform destroy` imported resources that -contain other resources you care about; always study the `plan` output carefully. -#### `state list` -`terraform state list` lists all modules and resources under management for the current module. It's -especially handy when trying to find the desired module path string for `import` if you're reusing a -configuration for another environment or system. -#### `state show` -`terraform state show` is a more detailed listing for a given item in the state tree. The command looks like: -``` -terraform state show module.local.module.reporting.google_bigquery_dataset.main -``` - -#### `state pull` -To show the active state file (by default named `terraform.tfstate`), simply run `terraform state pull | jq`. -The `jq` command colorizes the JSON, though it already has a nice structure. - -I don't know why you'd use `terraform state push`, which overwrites the backend state with a local state file. -Likely an advanced feature. - -#### `state rm` -The opposite of `terraform import`, the `state rm` subcommand removes a tracked resource from the -Terraform state file. Some uses for this are repairing configurations, splitting them up, -or allowing someone else to experiment with changes on a deployed artifact before bringing it -back under control. Happily, this command does not `destroy` objects when removing them.
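To finish the example promised in section 7 -- detaching the `latest_workspaces` view -- the invocation would look like this (address taken from the refresh output above; only the state entry is removed, the BigQuery view itself is untouched):

```shell script
terraform state rm \
  module.aou_rw_scratch_env.module.reporting.google_bigquery_table.view[\"latest_workspaces\"]
```

A subsequent `terraform plan` should then offer to create the view again, since Terraform no longer knows it exists.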
diff --git a/README.md b/README.md index e69de29..a7e776e 100644 --- a/README.md +++ b/README.md @@ -0,0 +1,45 @@ +# Workbench Terraform Modules +The module directories here represent individually deployable subsystems, +microservices, or other functional units. It's easy enough to put all buckets, say, +in a `gcs` module, but that wouldn't really let us operate on an individual component's bucket. + +Following is a broad outline of each child module. If you feel irritated that you can't see, for example, +all dashboards in one place, you can still go to the Console or use `gcloud`. +## Goals +### Automate ourselves out of a job +All the existing and planned Terraform modules have some level of scripted or otherwise automated +support processes. +## Non-goals +### Become the only game in town +We don't want to get into a position where we force anyone to use Terraform if it's not the best +choice for them. Terraform is still pretty new, and changing rapidly. The Google provider is also +under rapid development. +### Wag the Dog +We do not have any aspirations to absorb any of the tasks that external teams are responsible for, +including building the GCP projects for each of our environments or conducting all administrative +tasks in either pmi-ops or terra projects. If Terraform really "takes off", then it may make sense to +share learnings, and at that point, there may be opportunities for our Terraform stack to use theirs, +or vice versa. While these boundaries may be fuzzy today, hopefully the addition of clear module +inputs and documentation will drive clarification of responsibilities and visibility into state, +dependencies, etc. +### Bypass security approvals
In some cases, actions that require security approval can be performed in Terraform, particularly +around IAM bindings, access groups, and roles. We don't want a situation where an audit finds that +individuals or service accounts were added or modified without going through the proper channels.
+ -One potential workaround here is to invite sysadmin or security personnel to the private repository +to approve changes to the Terraform module inputs. + +## Currently Supported Modules + +### Reporting +The state for reporting is currently the BigQuery dataset and its tables and views. In the future, +it makes sense to add these sorts of things: +* Reporting-specific metrics +* Notifications on the system +* Reporting-specific logs +* Data blocks for views (maybe) + +In other words, the primary focus of the module is the Reporting system, but it may be convenient to +add reporting-specific artifacts that might otherwise be concerned with Monitoring or other auxiliary +services. diff --git a/modules/workbench/README.md b/modules/workbench/README.md deleted file mode 100644 index a7e776e..0000000 --- a/modules/workbench/README.md +++ /dev/null @@ -1,45 +0,0 @@ -# Workbench Terraform Modules -The module directories here represent individually deployable subsystems, -microservices, or other functional units. It's easy enough to put all buckets, say, -in a `gcs` module, but that wouldn't really let us operate on an individual component's bucket. - -Following is a broad outline of each child module. If you feel irritated that you can't see, for example, -all dashboards in one place, you can still go to the Console or use `gcloud`. -## Goals -### Automate ourselves out of a job -All the existing and planned Terraform modules have some level of scripted or otherwise automated -support processes. -## Non-goals -### Become the only game in town -We don't want to get into a position where we force anyone to use Terraform if it's not the best -choice for them. Terraform is still pretty new, and changing rapidly. The Google provider is also -under rapid development.
-### Wag the Dog -We do not have any aspirations to absorb any of the tasks that external teams are responsible for, -including building the GCP projects for each of our environments or conducting all administrative -tasks in either pmi-ops or terra projects. If Terraform really "takes off", then it may make sense to -share learnings, and at that point, there may be opportunities for our Terraform stack to use theirs, -or vice versa. While these boundaries may be fuzzy today, hopefully the addition of clear module -inputs and documentation will drive clarification of responsibilities and visibility into state, -dependencies, etc. -### Bypass security approvals -In some cases, actions that require security approval can be performed in Terraform, particularly -around IAM bindings, access groups, and roles. We don't want a situation where an audit finds that -individuals or service accounts were added or modified without going through the proper channels. - -One potential workaround here is to invite sysadmin or security personnel to the private repository -to approve changes to the Terraform module inputs. - -## Currently Supported Modules - -### Reporting -The state for reporting is currently the BigQuery dataset and its tables and views. In the future, -it makes sense to add these sorts of things: -* Reporting-specific metrics -* Notifications on the system -* Reporting-specific logs -* Data blocks for views (maybe) - -In other words, the primary focus of the module is the Reporting system, but it may be convenient to -add reporting-specific artifacts that might otherwise be concerned with Monitoring or other auxiliary -services.
diff --git a/modules/workbench/WORKBENCH-MODULE.md b/modules/workbench/WORKBENCH-MODULE.md deleted file mode 100644 index ef2f3f7..0000000 --- a/modules/workbench/WORKBENCH-MODULE.md +++ /dev/null @@ -1,63 +0,0 @@ - -# Workbench Module -The module directories here represent individually deployable subsystems, -microservices, or other functional units. It's easy enough to put all buckets, say, -in a `gcs` module, but that wouldn't really let us operate on an individual component's bucket. - -Following is a broad outline of each child module. If you feel irritated that you can't see, for example, -all dashboards in one place, you can still go to the Console or use `gcloud`. - -A somewhat forward-looking plan for that would look like: - -# Workbench Module Development Plan -The Workbench is the topmost parent module in the AoU Workbench -Application configuration. It depends on several modules for individual -subsystems. - -After creating a valid Terraform configuration we're not finished, -as we need to make sure we don't step on other tools or automation. -For example, items that pertain to cloud resources will need to move -out of the workbench JSON config system. - -I already have automation for Stackdriver settings that fetches all of their configurations, -and I plan to migrate it to Terraform. - -## Reporting -The state for reporting is currently the BigQuery dataset and its tables and views. -Highlights: -* Reporting-specific metrics with the `google_logging_metric` [resource](https://www.terraform.io/docs/providers/google/r/logging_metric.html) -and others -* Notifications on the system -* Reporting-specific logs -* Data blocks for views (maybe) - -## Backend Database (future) -This resource is inherently cross-functional, so we can just include -* The application DB -* backup settings -This will take advantage of the `google_sql_database_instance` resource.
- -Schema migrations work via `Ruby->Gradle->Liquibase->MySql` -Maybe it needs a `Terraform` caboose. It looks like there's not currently a Liquibase provider. - -## Workbench to RDR Pipeline -Instantiate [google_cloud_tasks_queue](https://www.terraform.io/docs/providers/google/r/cloud_tasks_queue.html) -resources as necessary. - -## API Server -* AppEngine versions, instances, logs, etc. It isn't just named -App Engine, since that's the resource that gets created. - -## Action Audit -This module maps to -* Stackdriver logs for each environment. (It will likely need to - move from the application JSON config.) - -## Tiers and Egress Detection -There is a [sumo logic provider](https://www.sumologic.com/blog/terraform-provider-hosted/) for Terraform, which is very good -news. It looks really svelte. - -We will also want to control the VPC flow logs, -perimeters, etc., but it won't be in this `workbench` module, -because Terra-not-form owns the organization and needs to do -creation manually for now.
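Closing the loop on the Reporting highlights above: the `google_logging_metric` resource mentioned there could look roughly like this sketch (the metric name and log filter are invented for illustration, not taken from any real configuration):

```hcl
# Hypothetical reporting metric: count ERROR-severity reporting log entries.
resource "google_logging_metric" "reporting_upload_errors" {
  name   = "reporting/upload_errors"
  filter = "resource.type=\"gae_app\" AND severity>=ERROR AND textPayload:\"reporting\""
  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }
}
```

A metric like this would pair naturally with a notification channel and alert policy once those are brought under Terraform as well.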