Access analysis (infrastructure) #30
I think we need specific goals for access analysis.

The columns of the fact table for access logs.

I have decided to dedicate this issue to the development of the basic infrastructure. I will develop tools for analysis on top of the developed infrastructure in another issue.
- `ContentsDistribution` extracts the name of the S3 bucket that stores CloudFront access logs, which the CloudFront distribution provisions on our behalf. This highly depends on the implementation details of CDK.
- `CdkStack` outputs the name of the S3 bucket for CloudFront access logs. issue codemonger-io#30
- `resolveCodemongerResourceNames` resolves the name of the S3 bucket for access logs of the development stage.
- `CodemongerResources` binds the S3 bucket for access logs of the development stage. issue codemonger-io#30
- Introduces a new Lambda function `lambda/mask-access-logs` that masks IP addresses in CloudFront access logs and saves the masked access logs in another S3 bucket. In the future, it will be triggered when a new log file is put into the S3 bucket for access logs, but for now we have to invoke it manually.
- A new CDK construct `AccessLogsMasking` provisions the Lambda function `lambda/mask-access-logs`. It also provisions an S3 bucket where `lambda/mask-access-logs` saves masked access logs. issue codemonger-io#30
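The masking step could be sketched as below. The actual masking scheme of `lambda/mask-access-logs` is not specified here, so zeroing the last IPv4 octet (or the IPv6 host bits) is an assumption for illustration:

```python
import ipaddress

def mask_ip(value: str) -> str:
    """Mask an IP address by zeroing its host bits.

    Assumed scheme: an IPv4 address keeps its first three octets, an
    IPv6 address its first four hextets; non-IP values (e.g. "-")
    pass through unchanged.
    """
    try:
        ip = ipaddress.ip_address(value)
    except ValueError:
        return value  # not an IP address; leave the field as-is
    prefix = 24 if ip.version == 4 else 64
    # ip_interface(...).network masks the host bits of the address
    return str(ipaddress.ip_interface(f"{value}/{prefix}").network.network_address)
```

A masked log file then never reveals a full client address, only the network it came from.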
- `CdkOpsStack` provisions `AccessLogsMasking`. issue codemonger-io#30
- `lambda/mask-access-logs` now assumes that the input event is a list of SQS messages containing "ObjectCreated:*" S3 events.
- `AccessLogsMasking` provisions an SQS queue to capture "ObjectCreated:*" S3 events from the S3 bucket for access logs and sets the queue as an event source of `MaskAccessLogsLambda`. issue codemonger-io#30
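Unpacking the event shape described above could look like the following sketch. It follows the standard S3 → SQS notification format; the function name is illustrative, not taken from the repository:

```python
import json
from urllib.parse import unquote_plus

def extract_s3_objects(event: dict) -> list:
    """Extract (bucket, key) pairs from an SQS event whose message
    bodies are S3 "ObjectCreated:*" notifications."""
    objects = []
    for record in event.get("Records", []):
        # SQS wraps the S3 event as a JSON string in the message body
        body = json.loads(record["body"])
        # S3 test events have no "Records" key; skip them gracefully
        for s3_record in body.get("Records", []):
            if not s3_record.get("eventName", "").startswith("ObjectCreated"):
                continue
            bucket = s3_record["s3"]["bucket"]["name"]
            # S3 URL-encodes object keys in notifications
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects
```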
- `lambda/mask-access-logs` ignores access logs files that no longer exist. issue codemonger-io#30
- Provisions a new SQS queue that notifies when a new object is created in the S3 bucket for masked CloudFront access logs; i.e., when masking of a new CloudFront access logs file has finished.
- Introduces a new Lambda function `lambda/delete-access-logs` that deletes files in the original bucket for CloudFront access logs. The above SQS queue triggers this function.
- Reorders definitions in `AccessLogsMasking` because a Lambda function and its trigger SQS queue are strongly tied. issue codemonger-io#30
- Renames `AccessLogsMasking` to `AccessLogsETL`, because the construct will not just mask access logs but do more transformation, so Extract, Transform, and Load (ETL) is a more suitable name. issue codemonger-io#30
- `lambda/mask-access-logs` prepends a prefix to the keys of masked access logs files. The prefix is "masked/". issue codemonger-io#30
- `lambda/mask-access-logs` prefixes the date of access log records to the output S3 object key. `lambda/delete-access-logs` excludes prefixed dates to locate the original access logs file. issue codemonger-io#30
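The key mapping between the two functions could be sketched as follows. The exact layout of the date inside the key is an assumption for illustration (a date folder between the "masked/" prefix and the original file name):

```python
import posixpath

def masked_key(original_key: str, log_date: str, prefix: str = "masked/") -> str:
    """Build the output key for a masked access logs file.

    Assumed layout: "<prefix><log_date>/<original file name>".
    """
    name = posixpath.basename(original_key)
    return f"{prefix}{log_date}/{name}"

def original_key_from_masked(masked: str, prefix: str = "masked/") -> str:
    """Reverse mapping as `lambda/delete-access-logs` would need it:
    drop the prefix and the date folder to recover the original name."""
    rest = masked[len(prefix):]
    _date, _, name = rest.partition("/")
    return name
```

Keeping the mapping invertible is what lets the delete function locate the original file from the masked one.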
- Configures the `env` property of the CDK stack so that correct availability zones (AZs) in the region can be listed. CDK lists only two AZs without these changes. issue codemonger-io#30
- Introduces a new CDK construct `LatestBoto3Layer` that provisions a Lambda layer containing the latest boto3. `lambda/latest-boto3/requirements.txt` lists the modules and versions to be packaged as a Lambda layer. The latest boto3 is necessary because the default boto3 on the Lambda runtime does not support Redshift Serverless. issue codemonger-io#30
- Introduces a new CDK construct `DataWarehouse` that provisions the Redshift Serverless namespace and workgroup for access logs. It also provisions a VPC where the Redshift Serverless cluster will reside, and a Lambda function `lambda/populate-dw-database` that populates the database and tables to store access logs.
- `CdkOpsStack` provisions `DataWarehouse`. issue codemonger-io#30
- Installs `cdk2-python-library-layer` to create a Lambda layer from a local Python package. issue codemonger-io#30
- Introduces a new Lambda layer `lambda/libdatawarehouse` that provides utilities to handle the data warehouse for CloudFront access logs. A new CDK construct `LibdatawarehouseLayer` provisions it. issue codemonger-io#30
- `lambda/populate-dw-database` uses `libdatawarehouse` to replace commonly used types and functions.
- `CdkOpsStack` provisions `LibdatawarehouseLayer` and passes it to `DataWarehouse`. issue codemonger-io#30
- Introduces a new Lambda function `lambda/load-access-logs` that loads CloudFront access logs from the S3 bucket onto the data warehouse. `AccessLogsETL` provisions this function.
- `DataWarehouse` introduces a new method `grantQuery` that allows a given `IGrantable` to call the Redshift Data API. The permission is too permissive because I could not figure out how to obtain the ARN of the Redshift Serverless namespace. `DataWarehouse` exports the Redshift Serverless workgroup so that `AccessLogsETL` can configure `lambda/load-access-logs`.
- `CdkOpsStack` passes `DataWarehouse`, `LatestBoto3Layer`, and `LibdatawarehouseLayer` to `AccessLogsETL`. issue codemonger-io#30
- `lambda/populate-dw-database` introduces a new column `seq_num` to the `access_log` table, which helps sort rows in the same order as in the original log file.
- `lambda/mask-access-logs` prepends a new column `row_num` that records row numbers in the original access logs file. Row numbers are unique only within a single access logs file.
- `lambda/load-access-logs` maps `row_num` to `seq_num` in the `access_log` table.
- `lambda/mask-access-logs` also masks the "x-forwarded-for" column as IP addresses. issue codemonger-io#30
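Prepending the `row_num` column could be sketched as below. The handling of the two leading CloudFront comment lines (`#Version`, `#Fields`) is an assumption; the real function may treat them differently:

```python
def prepend_row_numbers(lines: list) -> list:
    """Prepend a 1-based row number column to tab-separated access log rows.

    Comment lines starting with "#" are passed through unnumbered
    (an assumption for illustration). Row numbers restart for each
    file, so they are unique only within a single access logs file.
    """
    out = []
    row = 0
    for line in lines:
        if line.startswith("#"):
            out.append(line)
            continue
        row += 1
        out.append(f"{row}\t{line}")
    return out
```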
- Fixes the bug that `lambda/load-access-logs` crashed when there were no access logs on a given date. It now makes sure that there are access logs on a given date before running the script.
- `AccessLogsETL` grants `lambda/load-access-logs` read permissions on the S3 bucket of access logs. issue codemonger-io#30
- Adds thoughts about VACUUM as comments to `lambda/load-access-logs`. issue codemonger-io#30
- `AccessLogsETL` schedules `lambda/load-access-logs` to run at 2:00 AM every day for production, and every hour for development. The rules are disabled by default. issue codemonger-io#30
- Fixes the bug that `lambda/delete-access-logs` crashed when an empty folder was created in the S3 bucket. It ignores the event if the object key ends with a slash; i.e., the last segment is empty. issue codemonger-io#30
- `DataWarehouse` introduces a Step Functions State Machine (workflow) to run VACUUM over the tables. The workflow picks the tables one by one and applies a new Lambda function `lambda/vacuum-table` to each. `lambda/vacuum-table` runs `VACUUM` over a given table. issue codemonger-io#30
- `lambda/load-access-logs` starts the VACUUM workflow over the updated tables. "SORT ONLY" is sufficient because `lambda/load-access-logs` never deletes rows. issue codemonger-io#30
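The statement `lambda/vacuum-table` would submit to the Redshift Data API could be composed as below; the function name and default are illustrative, not taken from the repository:

```python
def vacuum_statement(table: str, sort_only: bool = True) -> str:
    """Compose a Redshift VACUUM statement for a given table.

    "SORT ONLY" re-sorts rows without reclaiming space, which is
    sufficient here because the load function only ever appends rows
    and never deletes any.
    """
    mode = "SORT ONLY " if sort_only else ""
    return f"VACUUM {mode}{table}"
```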
- `CdkOpsStack` outputs the ARN of `lambda/load-access-logs`. `DataWarehouse` exposes `lambda/load-access-logs`. issue codemonger-io#30
- Installs `@aws-sdk/client-lambda` to run a Lambda function in a build script.
- Bumps the CDK version to 2.45.0.
- Bumps the `constructs` version to 10.1.128.
- Bumps the `@aws-sdk/client-cloudformation` version to 3.186.0.
- Bumps the `ts-node` version to 10.9.1. issue codemonger-io#30
- `CdkOpsStack` provisions `DataWarehouse` and `AccessLogsETL` for production.
- `CodemongerResources` resolves the S3 bucket containing CloudFront access logs for production. issue codemonger-io#30
- Fixes the bug that `libdatawarehouse.data_api.wait_for_results` did not time out. It now monitors the elapsed time rather than the loop count. It also takes a new parameter `cancel_at_timeout` that makes the function cancel the statement when it times out. It is `True` by default. issue codemonger-io#30
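The fixed polling loop could look like the sketch below: the timeout is checked against elapsed wall-clock time rather than a loop count, and the statement is cancelled on timeout when `cancel_at_timeout` is set. This is an illustrative reconstruction, not the actual `libdatawarehouse` code; `client` only needs `describe_statement` and `cancel_statement`, so a stub works for testing:

```python
import time

def wait_for_results(client, statement_id, timeout_seconds=30.0,
                     polling_interval=0.05, cancel_at_timeout=True):
    """Poll the Redshift Data API until a statement finishes.

    Raises TimeoutError once the elapsed time exceeds
    `timeout_seconds`, cancelling the statement first if
    `cancel_at_timeout` is True (the default).
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.describe_statement(Id=statement_id)["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            return status
        if time.monotonic() >= deadline:
            if cancel_at_timeout:
                client.cancel_statement(Id=statement_id)
            raise TimeoutError(f"statement {statement_id} timed out")
        time.sleep(polling_interval)
```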
- Turns on `enhancedVpcRouting` of the Redshift Serverless workgroup for development. To avoid an accidental update of the production database, creates a branch of `DataWarehouse` as `DataWarehouseV2` for development. issue codemonger-io#30
- Fixes the bug that `lambda/load-access-logs` crashed when `user_agent` was `NULL`. Replaces `NULL` with a dash (`-`). issue codemonger-io#30
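The substitution could be sketched as below; the function and parameter names are illustrative. A dash is CloudFront's own placeholder for a missing field, so the loaded rows stay consistent with rows that were never NULL:

```python
def normalize_row(row: dict, nullable_fields=("user_agent",)) -> dict:
    """Replace NULL (None) values in the given fields with the
    CloudFront placeholder dash ("-") before loading the row."""
    return {
        key: "-" if key in nullable_fields and value is None else value
        for key, value in row.items()
    }
```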
- Enables `enhancedVpcRouting` of the Redshift Serverless workgroup for production. Merges the changes on `DataWarehouseV2` back to `DataWarehouse`. `DataWarehouseV2` is no longer used. issue codemonger-io#30
- Adds an explanation of the AWS architecture of the data warehouse. `docs/data-warehouse-aws-architecture.drawio` is the original data file for `draw.io`: https://github.com/jgraph/drawio issue codemonger-io#30
- Adds the following information to `README.md`:
  - a link to the documentation about the AWS architecture of the data warehouse for access logs
  - troubleshooting of the admin password of the data warehouse
  - how to populate the database and tables on the data warehouse
  - enabling the EventBridge rule for access log loading

  issue codemonger-io#30
- Translates `docs/data-warehouse.md` into Japanese. Also refines the English version. issue codemonger-io#30
- Translates the updated part of `README` into Japanese. Also refines the English version. issue codemonger-io#30
- `docs/data-warehouse` includes a link to the section that explains how to deal with the situation where the AWS Secrets Manager regenerates the admin secret. issue codemonger-io#30
- Replaces the section "Continuous delivery" with "DevOps", because `cdk-ops` no longer covers only continuous delivery but also the data warehouse. issue codemonger-io#30
I would like to analyze CloudFront logs to know the traffic to my site.
How about trying Amazon Redshift Serverless?