-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data warehouse for access logs #46
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
- `ContentsDistribution` extracts the name of the S3 bucket that store CloudFront access logs, which the CloudFront distribution provisions on behalf of us. It highly depends on the implementation details of CDK. - `CkdStack` outputs the name of the S3 bucket for CloudFront access logs. issue codemonger-io#30
- Bumps the CDK version to 2.42.0.
- The issue that the main stack could not be resolved is fixed. It was caused because the name of the main stack contained an extra dash before the deployment stage. Removes the extra dash from the stack name.
- `resolveCodemongerResourceNames` resolves the name of the S3 bucket for access logs of the development stage. - `CodemongerResources` binds the S3 bucket for access logs of the development stage. issue codemonger-io#30
- Introduces a new Lambda function `lambda/mask-access-logs` that masks IP addresses in CloudFront access logs. It saves masked access logs in another S3 bucket. It will be triggered when a new log file is put into the S3 bucket for access logs in the future but we have to manually invoke it for now. - A new CDK construct `AccessLogsMasking` provisions the Lambda function `lambda/mask-access-logs`. It also provisions an S3 bucket where `lambda/mask-access-logs` saves masked access logs. issue codemonger-io#30
- `CdkOpsStack` provisions `AccessLogsMasking`. issue codemonger-io#30
- `lambda/mask-access-logs` now supposes that the input event is a list of SQS messages containing "ObjectCreated:*" S3 events. - `AccessLogsMasking` provisions an SQS queue to capture "ObjectCreated:*" S3 events from the S3 bucket for access logs. It sets the SQS queue as an event source of `MaskAccessLogsLambda`. issue codemonger-io#30
- `lambda/mask-access-logs` ignores access logs files that no longer exist. issue codemonger-io#30
- Provisions a new SQS queue that notifies when a new object is created in the S3 bucket for masked CloudFront access logs; i.e., masking a new CloudFront access logs file has finished. - Introduces a new Lambda function `lambda/delete-access-logs` that deletes files in the original bucket for CloudFront access logs. The above SQS queue triggers this function. - Reorders definitions in `AccessLogsMasking` because a Lambda function and its trigger SQS queue are strongly tied. issue codemonger-io#30
- `AccessLogsMasking` → `AccessLogsETL`. Because the construct will not just mask access logs but do more transformation. So Extract, Transform, and Load (ETL) should be more suitable. issue codemonger-io#30
- `lambda/mask-access-logs` prepends a prefix to the keys of masked access logs files. The prefix is "masked/". issue codemonger-io#30
- `lambda/mask-access-logs` prefixes the date of access log records to the output S3 object key. `lambda/delete-access-logs` excludes prefixed dates to locate the original access logs file. issue codemonger-io#30
- Configures the `env` property of the CDK stack so that correct availability zones (AZs) in the region can be listed. CDK lists only two AZs without these changes. issue codemonger-io#30
- Introduces a new CDK construct `LatestBoto3Layer` that provisions a Lambda layer containing the latest boto3. `lambda/latest-boto3/requirements.txt` lists the modules and versions to be packaged as a Lambda layer. The latest boto3 is necessary because the default boto3 on the Lambda runtime does not support Redshift Serverless. issue codemonger-io#30
- Introduces a new CDK construct `DataWarehouse` that provisions the Redshift Serverless namespace and workgroup for access logs. It also provisions a VPC where the Redshift Serverless cluster will reside. In addition to the Redshift Serverless cluster, it provisions a Lambda function `lambda/populate-dw-database` that populates the database and tables to store access logs. - `CdkOpsStack` provisions `DataWarehouse`. issue codemonger-io#30
- Installs `cdk2-python-library-layer` to create a Lambda layer from a local Python package. issue codemonger-io#30
- Introduces a new Lambda layer `lambda/libdatawarehouse` that provides utilities to handle the data warehouse for CloudFront access logs. A new CDK construct `LibdatawarehouseLayer` provisions it. issue codemonger-io#30
- `lambda/populate-dw-database` uses `libdatawarehouse` to replace commonly used types and functions. - `CdkOpsStack` provisions `LibdatawarehouseLayer` and passes it to `DataWarehouse`. issue codemonger-io#30
- Introduces a new Lambda function `lambda/load-access-logs` that loads CloudFront access logs from the S3 bucket onto the data warehouse. `AccessLogsETL` provisions this function. - `DataWarehouse` introduces a new method `grantQuery` that allows a given `IGrantable` to call the Redshift Data API. The permission is too permissive because I could not figure out how to obtain the ARN of the Redshift Serverless namespace. `DataWarehouse` exports the Redshift Serverless workgroup so that `AccessLogsETL` can configure `lambda/load-access-logs`. - `CdkOpsStack` passes `DataWarehouse`, `LatestBoto3Layer`, and `LibdatawarehouseLayer` to `AccessLogsETL`. issue codemonger-io#30
- `lambda/populate-dw-database` introduces a new column `seq_num` to the `access_log` table, which helps sorting rows in the same order in the original log file. - `lambda/mask-access-logs` prepends a new column `row_num` that records row numbers in the original access logs file. Row numbers are unique only in a single access logs file. - `lambda/load-access-logs` maps `row_num` to `seq_num` in the `access_log` table. - `lambda/mask-access-logs` also masks the "x-forwarded-for" column as IP addresses. issue codemonger-io#30
- The bug that `lambda/load-access-logs` crashed when there were no access logs on a given date. It now makes sure that there are access logs on a given date before running the script. - `AccessLogsETL` grants `lambda/load-access-logs` read permissions of the S3 bucket of access logs. issue codemonger-io#30
- Adds thoughts about VACUUM as comments to `lambda/load-access-logs`. issue codemonger-io#30
- `AccessLogsETL` schedules to run `lambda/load-access-logs` at 2:00 AM every day for production. It schedules to load access logs every hour for development. The rules are disabled by default. issue codemonger-io#30
- Fixes the bug that `lambda/delete-access-logs` crashed when an empty folder was created in the S3 bucket. It ignores the event if the object key ends with a slash; i.e., the last segment is empty. issue codemonger-io#30
- `DataWarehouse` introduces a Step Functions State Machine (workflow) to run VACUUM over the tables. The workflow picks the table one by one and applies a new Lambda function `lambda/vacuum-table` to it. `lambda/vacuum-table` runs `VACUUM` over a given table. issue codemonger-io#30
- `lambda/load-access-logs` starts the VACUUM workflow over the updated tables. "SORT ONLY" is sufficient because `lambda/load-access-logs` never deletes rows. issue codemonger-io#30
- `CdkOpsStack` outputs the ARN of `lambda/load-access-logs`. `DataWarehouse` exposes `lambda/load-access-logs`. issue codemonger-io#30
- Installs `@aws-sdk/client-lmabda` to run a Lambda function in a build script. - Bumps the CDK version to 2.45.0. - Bumps the `constructs` version to 10.1.128. - Bumps the `@aws-sdk/client-cloudformation` version to 3.186.0. - Bumps the `ts-node` version to 10.9.1. issue codemonger-io#30
- Introduces a new script `bin/populate-data-warehouse.js` that populates the database and tables of the data warehouse on a given deployment stage. A new npm script `populate-dw` runs this script. - Tells `git` not to ignore `bin/populate-data-warehouse.js`. issue codemonger-io/codemonger#0
- `CdkOpsStack` provisions `DataWarehouse` and `AccessLogsETL` for production. - `CodemongerResources` resolves the S3 bucket containing CloudFront access logs for production. issue codemonger-io#30
- Fixes the bug that `libdatawarehouse.data_api.wait_for_results` did not time out. It now monitors the elapsed time rather than the loop count. It also takes a new parameter `cancel_at_timeout` that makes the function cancel the statement when it times out. It is `True` by default. issue codemonger-io#30
- Turns on `enhancedVpcRouting` of the Redshift Serverless workgroup for development. To avoid an accidental update of the production database, creates a branch of `DataWarehouse` as `DataWarehouseV2` for development. issue codemonger-io#30
- Fixes the bug that `lambda/load-access-logs` crashed when `user_agent` was `NULL`. Replaces `NULL` with a dash (`-`). issue codemonger-io#30
- Enables `enhancedVpcRouting` of the Redshift Serverless workgroup for production. Merges the changes on `DataWarehouseV2` back to `DataWarehouse`. `DataWarehouseV2` is no longer used. issue codemonger-io#30
- Adds the explanation about the AWS architecture of the data warehouse. `docs/data-warehouse-aws-architecture.drawio` is the original data file of `draw.io`: https://github.com/jgraph/drawio issue codemonger-io#30
- Adds the following information to `README.md`, - link to the documentation about the AWS architecture of the data warehouse for access logs - trouble shooting of the admin password of the data warehouse - how to populate the database and tables on the data warehouse - enabling the EventBridge rule for access log loading issue codemonger-io#30
- Translates `docs/data-warehouse.md` into Japanese. Also refines the English version. issue codemonger-io#30
- Translates the updated part of `README` into Japanese. Also refines the English version. issue codemonger-io#30
- `docs/data-warehouse` includes a link to the section that explains how to deal with the situation where the AWS Secrets Manager regenarates the admin secret. issue codemonger-io#30
- Replaces the section "Continuous delivery" with "DevOps" because the `cdk-ops` no longer contains only the continuous delivery but also the data warehouse. issue codemonger-io#30
kikuomax
commented
Oct 19, 2022
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have self-reviewed the changes.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Introduces the data warehouse for access logs. The CDK stack in the
cdk-ops
folder provisions the data warehouse backed by Amazon Redshift Serverless. I cannot describe all the details here since the changes are enormous. So please refer tocdk-ops/docs/data-warehouse.md
to grasp the entire architecture of the data warehouse.