Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Access analysis (infrastructure) #30

Closed
kikuomax opened this issue Jul 22, 2022 · 9 comments · Fixed by #46
Closed

Access analysis (infrastructure) #30

kikuomax opened this issue Jul 22, 2022 · 9 comments · Fixed by #46
Assignees
Labels
enhancement New feature or request

Comments

@kikuomax
Copy link
Member

I would like to analyze CloudFront logs to know traffic to my site.
How about to try Amazon Redshift serverless?

@kikuomax kikuomax added the enhancement New feature or request label Jul 22, 2022
@kikuomax kikuomax self-assigned this Jul 22, 2022
@kikuomax
Copy link
Member Author

I think we need specific goals of access analysis.

@kikuomax
Copy link
Member Author

kikuomax commented Sep 27, 2022

I think we need specific goals of access analysis.

  • Within a specific time period,
    • Where did the traffic to my site come from?
    • How often did the traffic from a specific origin (referer) happen?
    • How many attentions did a specific page on my site get?

@kikuomax
Copy link
Member Author

kikuomax commented Oct 1, 2022

The columns of the fact table for access logs.

  1. datetime: TIMESTAMP
    • date + time
  2. seq_num: INT
    • To retain the order of entries at the same timestamp in the original log file.
  3. edge_location: INTedge_location dimension table
    • x-edge-location
  4. sc_bytes: BIGINT
    • sc-bytes
  5. cs_method: VARCHAR
    • cs-method
  6. page: INTpage dimension table
    • cs-uri-stem
  7. status: SMALLINT
    • sc-status
  8. referer: BIGINT DISTKEYreferer dimension table
    • cs(Referer)
  9. user_agent: BIGINTuser_agent dimension table
    • cs(User-Agent)
  10. cs_protocol: VARCHAR
    • cs-protocol
  11. cs_bytes: BIGINT
    • cs-bytes
  12. time_taken: FLOAT4
    • time-taken
  13. edge_response_result_type: INTresult_type dimension table
    • x-edge-response-result-type
  14. time_to_first_byte: FLOAT4
    • time-to-first-byte

SORTKEY: datetime, seq_num

@kikuomax
Copy link
Member Author

kikuomax commented Oct 1, 2022

The columns of the edge_location dimension table.

  1. id: INT
  2. code: VARCHAR SORTKEY UNIQUE
    • x-edge-location

@kikuomax
Copy link
Member Author

kikuomax commented Oct 1, 2022

The columns of the page dimension table.

  1. id: INT
  2. path: VARCHAR(2048) SORTKEY UNIQUE
    • cs-uri-stem

@kikuomax
Copy link
Member Author

kikuomax commented Oct 1, 2022

The columns of the referer dimension table.

  1. id: BIGINT
  2. url: VARCHAR(2048) SORTKEY UNIQUE
    • cs(Referer)

@kikuomax
Copy link
Member Author

kikuomax commented Oct 1, 2022

The columns of the user_agent dimension table.

  1. id: BIGINT
  2. user_agent: VARCHAR(2048) SORTKEY UNIQUE
    • cs(User-Agent)

@kikuomax
Copy link
Member Author

kikuomax commented Oct 1, 2022

The columns of the result_type dimension table.

  1. id: INT
  2. result_type: VARCHAR SORTKEY UNIQUE
    • x-edge-response-result-type

@kikuomax kikuomax changed the title Access analysis Access analysis (infrastructure) Oct 11, 2022
@kikuomax kikuomax mentioned this issue Oct 11, 2022
3 tasks
@kikuomax
Copy link
Member Author

kikuomax commented Oct 11, 2022

I have decided to designate this issue for development of the basic infrastructure. I will develop tools for analysis on top of the developed infrastructure in another issue.

kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `ContentsDistribution` extracts the name of the S3 bucket that store
  CloudFront access logs, which the CloudFront distribution provisions
  on behalf of us. It highly depends on the implementation details of
  CDK.
- `CkdStack` outputs the name of the S3 bucket for CloudFront access
  logs.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `resolveCodemongerResourceNames` resolves the name of the S3 bucket
  for access logs of the development stage.
- `CodemongerResources` binds the S3 bucket for access logs of the
  development stage.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- Introduces a new Lambda function `lambda/mask-access-logs` that masks
  IP addresses in CloudFront access logs. It saves masked access logs in
  another S3 bucket. It will be triggered when a new log file is put
  into the S3 bucket for access logs in the future but we have to
  manually invoke it for now.
- A new CDK construct `AccessLogsMasking` provisions the Lambda function
  `lambda/mask-access-logs`. It also provisions an S3 bucket where
  `lambda/mask-access-logs` saves masked access logs.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `CdkOpsStack` provisions `AccessLogsMasking`.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `lambda/mask-access-logs` now supposes that the input event is a list
  of SQS messages containing "ObjectCreated:*" S3 events.
- `AccessLogsMasking` provisions an SQS queue to capture
  "ObjectCreated:*" S3 events from the S3 bucket for access logs. It
  sets the SQS queue as an event source of `MaskAccessLogsLambda`.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `lambda/mask-access-logs` ignores access logs files that no longer
  exist.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- Provisions a new SQS queue that notifies when a new object is created
  in the S3 bucket for masked CloudFront access logs; i.e., masking a
  new CloudFront access logs file has finished.
- Introduces a new Lambda function `lambda/delete-access-logs` that
  deletes files in the original bucket for CloudFront access logs. The
  above SQS queue triggers this function.
- Reorders definitions in `AccessLogsMasking` because a Lambda function
  and its trigger SQS queue are strongly tied.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `AccessLogsMasking` → `AccessLogsETL`. Because the construct will not
  just mask access logs but do more transformation. So Extract,
  Transform, and Load (ETL) should be more suitable.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `lambda/mask-access-logs` prepends a prefix to the keys of masked
  access logs files. The prefix is "masked/".

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `lambda/mask-access-logs` prefixes the date of access log records to
  the output S3 object key. `lambda/delete-access-logs` excludes
  prefixed dates to locate the original access logs file.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- Configures the `env` property of the CDK stack so that correct
  availability zones (AZs) in the region can be listed. CDK lists only
  two AZs without these changes.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- Introduces a new CDK construct `LatestBoto3Layer` that provisions a
  Lambda layer containing the latest boto3.
  `lambda/latest-boto3/requirements.txt` lists the modules and versions
  to be packaged as a Lambda layer. The latest boto3 is necessary
  because the default boto3 on the Lambda runtime does not support
  Redshift Serverless.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- Introduces a new CDK construct `DataWarehouse` that provisions the
  Redshift Serverless namespace and workgroup for access logs. It also
  provisions a VPC where the Redshift Serverless cluster will reside. In
  addition to the Redshift Serverless cluster, it provisions a Lambda
  function `lambda/populate-dw-database` that populates the database and
  tables to store access logs.
- `CdkOpsStack` provisions `DataWarehouse`.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- Installs `cdk2-python-library-layer` to create a Lambda layer from a
  local Python package.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- Introduces a new Lambda layer `lambda/libdatawarehouse` that provides
  utilities to handle the data warehouse for CloudFront access logs. A
  new CDK construct `LibdatawarehouseLayer` provisions it.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `lambda/populate-dw-database` uses `libdatawarehouse` to replace
  commonly used types and functions.
- `CdkOpsStack` provisions `LibdatawarehouseLayer` and passes it to
  `DataWarehouse`.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- Introduces a new Lambda function `lambda/load-access-logs` that loads
  CloudFront access logs from the S3 bucket onto the data warehouse.
  `AccessLogsETL` provisions this function.
- `DataWarehouse` introduces a new method `grantQuery` that allows a
  given `IGrantable` to call the Redshift Data API. The permission is
  too permissive because I could not figure out how to obtain the ARN of
  the Redshift Serverless namespace. `DataWarehouse` exports the
  Redshift Serverless workgroup so that `AccessLogsETL` can configure
  `lambda/load-access-logs`.
- `CdkOpsStack` passes `DataWarehouse`, `LatestBoto3Layer`, and
  `LibdatawarehouseLayer` to `AccessLogsETL`.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `lambda/populate-dw-database` introduces a new column `seq_num` to
  the `access_log` table, which helps sorting rows in the same order in
  the original log file.
- `lambda/mask-access-logs` prepends a new column `row_num` that records
  row numbers in the original access logs file. Row numbers are unique
  only in a single access logs file.
- `lambda/load-access-logs` maps `row_num` to `seq_num` in the
  `access_log` table.
- `lambda/mask-access-logs` also masks the "x-forwarded-for" column as
  IP addresses.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- The bug that `lambda/load-access-logs` crashed when there were no
  access logs on a given date. It now makes sure that there are access
  logs on a given date before running the script.
- `AccessLogsETL` grants `lambda/load-access-logs` read permissions of
  the S3 bucket of access logs.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- Adds thoughts about VACUUM as comments to `lambda/load-access-logs`.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `AccessLogsETL` schedules to run `lambda/load-access-logs` at 2:00 AM
  every day for production. It schedules to load access logs every hour
  for development. The rules are disabled by default.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- Fixes the bug that `lambda/delete-access-logs` crashed when an empty
  folder was created in the S3 bucket. It ignores the event if the
  object key ends with a slash; i.e., the last segment is empty.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `DataWarehouse` introduces a Step Functions State Machine (workflow)
  to run VACUUM over the tables. The workflow picks the table one by
  one and applies a new Lambda function `lambda/vacuum-table` to it.
  `lambda/vacuum-table` runs `VACUUM` over a given table.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `lambda/load-access-logs` starts the VACUUM workflow over the updated
  tables. "SORT ONLY" is sufficient because `lambda/load-access-logs`
  never deletes rows.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `CdkOpsStack` outputs the ARN of `lambda/load-access-logs`.
  `DataWarehouse` exposes `lambda/load-access-logs`.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- Installs `@aws-sdk/client-lmabda` to run a Lambda function in a build
  script.
- Bumps the CDK version to 2.45.0.
- Bumps the `constructs` version to 10.1.128.
- Bumps the `@aws-sdk/client-cloudformation` version to 3.186.0.
- Bumps the `ts-node` version to 10.9.1.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 11, 2022
- `CdkOpsStack` provisions `DataWarehouse` and `AccessLogsETL` for
  production.
- `CodemongerResources` resolves the S3 bucket containing CloudFront
  access logs for production.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 19, 2022
- Fixes the bug that `libdatawarehouse.data_api.wait_for_results` did
  not time out. It now monitors the elapsed time rather than the loop
  count. It also takes a new parameter `cancel_at_timeout` that makes
  the function cancel the statement when it times out. It is `True` by
  default.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 19, 2022
- Turns on `enhancedVpcRouting` of the Redshift Serverless workgroup for
  development. To avoid an accidental update of the production database,
  creates a branch of `DataWarehouse` as `DataWarehouseV2` for
  development.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 19, 2022
- Fixes the bug that `lambda/load-access-logs` crashed when `user_agent`
  was `NULL`. Replaces `NULL` with a dash (`-`).

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 19, 2022
- Enables `enhancedVpcRouting` of the Redshift Serverless workgroup for
  production. Merges the changes on `DataWarehouseV2` back to
  `DataWarehouse`. `DataWarehouseV2` is no longer used.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 19, 2022
- Adds the explanation about the AWS architecture of the data warehouse.
  `docs/data-warehouse-aws-architecture.drawio` is the original data
  file of `draw.io`: https://github.com/jgraph/drawio

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 19, 2022
- Adds the following information to `README.md`,
    - link to the documentation about the AWS architecture of the data
      warehouse for access logs
    - trouble shooting of the admin password of the data warehouse
    - how to populate the database and tables on the data warehouse
    - enabling the EventBridge rule for access log loading

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 19, 2022
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 19, 2022
- Translates `docs/data-warehouse.md` into Japanese. Also refines the
  English version.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 19, 2022
- Translates the updated part of `README` into Japanese. Also refines
  the English version.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 19, 2022
- `docs/data-warehouse` includes a link to the section that explains how
  to deal with the situation where the AWS Secrets Manager regenarates
  the admin secret.

issue codemonger-io#30
kikuomax added a commit to kikuomax/codemonger that referenced this issue Oct 19, 2022
- Replaces the section "Continuous delivery" with "DevOps" because the
  `cdk-ops` no longer contains only the continuous delivery but also the
  data warehouse.

issue codemonger-io#30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant