Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data warehouse for access logs #46

Merged
merged 41 commits into from
Oct 19, 2022
Merged

Conversation

kikuomax
Copy link
Member

@kikuomax kikuomax commented Oct 19, 2022

Introduces the data warehouse for access logs. The CDK stack in the cdk-ops folder provisions the data warehouse backed by Amazon Redshift Serverless. I cannot describe all the details here since the changes are enormous. So please refer to cdk-ops/docs/data-warehouse.md to grasp the entire architecture of the data warehouse.

- `ContentsDistribution` extracts the name of the S3 bucket that store
  CloudFront access logs, which the CloudFront distribution provisions
  on behalf of us. It highly depends on the implementation details of
  CDK.
- `CkdStack` outputs the name of the S3 bucket for CloudFront access
  logs.

issue codemonger-io#30
- Bumps the CDK version to 2.42.0.
- The issue that the main stack could not be resolved is fixed. It was
  caused because the name of the main stack contained an extra dash
  before the deployment stage. Removes the extra dash from the stack
  name.
- `resolveCodemongerResourceNames` resolves the name of the S3 bucket
  for access logs of the development stage.
- `CodemongerResources` binds the S3 bucket for access logs of the
  development stage.

issue codemonger-io#30
- Introduces a new Lambda function `lambda/mask-access-logs` that masks
  IP addresses in CloudFront access logs. It saves masked access logs in
  another S3 bucket. It will be triggered when a new log file is put
  into the S3 bucket for access logs in the future but we have to
  manually invoke it for now.
- A new CDK construct `AccessLogsMasking` provisions the Lambda function
  `lambda/mask-access-logs`. It also provisions an S3 bucket where
  `lambda/mask-access-logs` saves masked access logs.

issue codemonger-io#30
- `CdkOpsStack` provisions `AccessLogsMasking`.

issue codemonger-io#30
- `lambda/mask-access-logs` now supposes that the input event is a list
  of SQS messages containing "ObjectCreated:*" S3 events.
- `AccessLogsMasking` provisions an SQS queue to capture
  "ObjectCreated:*" S3 events from the S3 bucket for access logs. It
  sets the SQS queue as an event source of `MaskAccessLogsLambda`.

issue codemonger-io#30
- `lambda/mask-access-logs` ignores access logs files that no longer
  exist.

issue codemonger-io#30
- Provisions a new SQS queue that notifies when a new object is created
  in the S3 bucket for masked CloudFront access logs; i.e., masking a
  new CloudFront access logs file has finished.
- Introduces a new Lambda function `lambda/delete-access-logs` that
  deletes files in the original bucket for CloudFront access logs. The
  above SQS queue triggers this function.
- Reorders definitions in `AccessLogsMasking` because a Lambda function
  and its trigger SQS queue are strongly tied.

issue codemonger-io#30
- `AccessLogsMasking` → `AccessLogsETL`. Because the construct will not
  just mask access logs but do more transformation. So Extract,
  Transform, and Load (ETL) should be more suitable.

issue codemonger-io#30
- `lambda/mask-access-logs` prepends a prefix to the keys of masked
  access logs files. The prefix is "masked/".

issue codemonger-io#30
- `lambda/mask-access-logs` prefixes the date of access log records to
  the output S3 object key. `lambda/delete-access-logs` excludes
  prefixed dates to locate the original access logs file.

issue codemonger-io#30
- Configures the `env` property of the CDK stack so that correct
  availability zones (AZs) in the region can be listed. CDK lists only
  two AZs without these changes.

issue codemonger-io#30
- Introduces a new CDK construct `LatestBoto3Layer` that provisions a
  Lambda layer containing the latest boto3.
  `lambda/latest-boto3/requirements.txt` lists the modules and versions
  to be packaged as a Lambda layer. The latest boto3 is necessary
  because the default boto3 on the Lambda runtime does not support
  Redshift Serverless.

issue codemonger-io#30
- Introduces a new CDK construct `DataWarehouse` that provisions the
  Redshift Serverless namespace and workgroup for access logs. It also
  provisions a VPC where the Redshift Serverless cluster will reside. In
  addition to the Redshift Serverless cluster, it provisions a Lambda
  function `lambda/populate-dw-database` that populates the database and
  tables to store access logs.
- `CdkOpsStack` provisions `DataWarehouse`.

issue codemonger-io#30
- Installs `cdk2-python-library-layer` to create a Lambda layer from a
  local Python package.

issue codemonger-io#30
- Introduces a new Lambda layer `lambda/libdatawarehouse` that provides
  utilities to handle the data warehouse for CloudFront access logs. A
  new CDK construct `LibdatawarehouseLayer` provisions it.

issue codemonger-io#30
- `lambda/populate-dw-database` uses `libdatawarehouse` to replace
  commonly used types and functions.
- `CdkOpsStack` provisions `LibdatawarehouseLayer` and passes it to
  `DataWarehouse`.

issue codemonger-io#30
- Introduces a new Lambda function `lambda/load-access-logs` that loads
  CloudFront access logs from the S3 bucket onto the data warehouse.
  `AccessLogsETL` provisions this function.
- `DataWarehouse` introduces a new method `grantQuery` that allows a
  given `IGrantable` to call the Redshift Data API. The permission is
  too permissive because I could not figure out how to obtain the ARN of
  the Redshift Serverless namespace. `DataWarehouse` exports the
  Redshift Serverless workgroup so that `AccessLogsETL` can configure
  `lambda/load-access-logs`.
- `CdkOpsStack` passes `DataWarehouse`, `LatestBoto3Layer`, and
  `LibdatawarehouseLayer` to `AccessLogsETL`.

issue codemonger-io#30
- `lambda/populate-dw-database` introduces a new column `seq_num` to
  the `access_log` table, which helps sorting rows in the same order in
  the original log file.
- `lambda/mask-access-logs` prepends a new column `row_num` that records
  row numbers in the original access logs file. Row numbers are unique
  only in a single access logs file.
- `lambda/load-access-logs` maps `row_num` to `seq_num` in the
  `access_log` table.
- `lambda/mask-access-logs` also masks the "x-forwarded-for" column as
  IP addresses.

issue codemonger-io#30
- The bug that `lambda/load-access-logs` crashed when there were no
  access logs on a given date. It now makes sure that there are access
  logs on a given date before running the script.
- `AccessLogsETL` grants `lambda/load-access-logs` read permissions of
  the S3 bucket of access logs.

issue codemonger-io#30
- Adds thoughts about VACUUM as comments to `lambda/load-access-logs`.

issue codemonger-io#30
- `AccessLogsETL` schedules to run `lambda/load-access-logs` at 2:00 AM
  every day for production. It schedules to load access logs every hour
  for development. The rules are disabled by default.

issue codemonger-io#30
- Fixes the bug that `lambda/delete-access-logs` crashed when an empty
  folder was created in the S3 bucket. It ignores the event if the
  object key ends with a slash; i.e., the last segment is empty.

issue codemonger-io#30
- `DataWarehouse` introduces a Step Functions State Machine (workflow)
  to run VACUUM over the tables. The workflow picks the table one by
  one and applies a new Lambda function `lambda/vacuum-table` to it.
  `lambda/vacuum-table` runs `VACUUM` over a given table.

issue codemonger-io#30
- `lambda/load-access-logs` starts the VACUUM workflow over the updated
  tables. "SORT ONLY" is sufficient because `lambda/load-access-logs`
  never deletes rows.

issue codemonger-io#30
- `CdkOpsStack` outputs the ARN of `lambda/load-access-logs`.
  `DataWarehouse` exposes `lambda/load-access-logs`.

issue codemonger-io#30
- Installs `@aws-sdk/client-lmabda` to run a Lambda function in a build
  script.
- Bumps the CDK version to 2.45.0.
- Bumps the `constructs` version to 10.1.128.
- Bumps the `@aws-sdk/client-cloudformation` version to 3.186.0.
- Bumps the `ts-node` version to 10.9.1.

issue codemonger-io#30
- Introduces a new script `bin/populate-data-warehouse.js` that
  populates the database and tables of the data warehouse on a given
  deployment stage. A new npm script `populate-dw` runs this script.
- Tells `git` not to ignore `bin/populate-data-warehouse.js`.

issue codemonger-io/codemonger#0
- `CdkOpsStack` provisions `DataWarehouse` and `AccessLogsETL` for
  production.
- `CodemongerResources` resolves the S3 bucket containing CloudFront
  access logs for production.

issue codemonger-io#30
- Fixes the bug that `libdatawarehouse.data_api.wait_for_results` did
  not time out. It now monitors the elapsed time rather than the loop
  count. It also takes a new parameter `cancel_at_timeout` that makes
  the function cancel the statement when it times out. It is `True` by
  default.

issue codemonger-io#30
- Turns on `enhancedVpcRouting` of the Redshift Serverless workgroup for
  development. To avoid an accidental update of the production database,
  creates a branch of `DataWarehouse` as `DataWarehouseV2` for
  development.

issue codemonger-io#30
- Fixes the bug that `lambda/load-access-logs` crashed when `user_agent`
  was `NULL`. Replaces `NULL` with a dash (`-`).

issue codemonger-io#30
- Enables `enhancedVpcRouting` of the Redshift Serverless workgroup for
  production. Merges the changes on `DataWarehouseV2` back to
  `DataWarehouse`. `DataWarehouseV2` is no longer used.

issue codemonger-io#30
- Adds the explanation about the AWS architecture of the data warehouse.
  `docs/data-warehouse-aws-architecture.drawio` is the original data
  file of `draw.io`: https://github.com/jgraph/drawio

issue codemonger-io#30
- Adds the following information to `README.md`,
    - link to the documentation about the AWS architecture of the data
      warehouse for access logs
    - trouble shooting of the admin password of the data warehouse
    - how to populate the database and tables on the data warehouse
    - enabling the EventBridge rule for access log loading

issue codemonger-io#30
- Translates `docs/data-warehouse.md` into Japanese. Also refines the
  English version.

issue codemonger-io#30
- Translates the updated part of `README` into Japanese. Also refines
  the English version.

issue codemonger-io#30
- `docs/data-warehouse` includes a link to the section that explains how
  to deal with the situation where the AWS Secrets Manager regenarates
  the admin secret.

issue codemonger-io#30
- Replaces the section "Continuous delivery" with "DevOps" because the
  `cdk-ops` no longer contains only the continuous delivery but also the
  data warehouse.

issue codemonger-io#30
@kikuomax kikuomax self-assigned this Oct 19, 2022
Copy link
Member Author

@kikuomax kikuomax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have self-reviewed the changes.

@kikuomax kikuomax merged commit e7562e9 into codemonger-io:main Oct 19, 2022
@kikuomax kikuomax deleted the access-logs branch October 19, 2022 07:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Access analysis (infrastructure)
1 participant