Add datalake network to clickhouse #339

Open: drizzentic wants to merge 4 commits into main from fix-minio-clickhouse-network

Conversation

@drizzentic (Collaborator) commented Dec 10, 2024

This allows ClickHouse to connect to the datalake and fetch assets from it using the s3 table function.
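
For illustration only, a minimal sketch of the wiring this relies on, using the external minio_public network referenced elsewhere in this PR; the bucket, object, and credentials in the commented query are placeholders rather than values from this repository:

  # analytics-datastore-clickhouse/docker-compose.yml (sketch, not the full file)
  services:
    analytics-datastore-clickhouse:
      networks:
        - datalake
  networks:
    datalake:
      name: minio_public # join the network created by the datalake package
      external: true
  # With the service attached to this network, a query such as
  #   SELECT * FROM s3('http://minio-01:9090/<bucket>/<object>.csv', '<access-key>', '<secret-key>', 'CSVWithNames')
  # resolves minio-01 by container name and reads objects over the MinIO API port.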

Summary by CodeRabbit

  • New Features

    • Introduced the datalake network configuration to enable connectivity between ClickHouse and the datalake.
    • Updated service initialization scenarios to include the minio_public network for the Analytics Datastore Clickhouse package.
    • Updated the CDR-DW recipe scenario to include the minio-01 service in the initialization process.
  • Bug Fixes

    • Streamlined Monitoring package scenarios by removing references to the minio-01 service.
  • Chores

    • Improved formatting consistency in health check commands for the OpenFn service.
    • Updated the S3 endpoint in the Loki configuration to use the new MinIO port (9090 instead of 9000).

@drizzentic force-pushed the fix-minio-clickhouse-network branch from 4040c4f to 1d65da2 on December 10, 2024 at 09:45
@drizzentic marked this pull request as ready for review on December 10, 2024 at 09:46
coderabbitai bot (Contributor) commented Dec 10, 2024

Walkthrough

The changes made in the pull request involve the addition of a new network configuration named datalake for the analytics-datastore-clickhouse service in the docker-compose.yml file. This enhancement allows the service to communicate over an additional network while preserving existing configurations for volumes and logs. The overall structure of the docker-compose.yml file remains unchanged, focusing solely on improving network connectivity. Additionally, modifications were made to the MinIO services in the datalake/docker-compose.cluster.yml file, and various feature files were updated to reflect changes in service initialization and destruction processes.

Changes

  • analytics-datastore-clickhouse/docker-compose.yml: added a new network named datalake to the service's network definitions.
  • datalake/docker-compose.cluster.yml: updated the command and healthcheck for the MinIO services to use --address ":9090" and adjusted the healthcheck URL (see the sketch after this list).
  • openfn/docker-compose.yml: removed a space from a variable reference in the health check command for consistency.
  • test/cucumber/features/cluster-mode/clickhouse.cluster.feature: updated expected networks to include minio_public for the initialization and destruction scenarios.
  • test/cucumber/features/cluster-mode/monitoring.feature: removed references to the "minio" services in the initialization and destruction scenarios.
  • test/cucumber/features/single-mode/clickhouse.feature: updated expected networks to include minio_public for the initialization and destruction scenarios.
  • test/cucumber/features/single-mode/monitoring.feature: removed references to the "minio-01" service in the initialization and destruction scenarios.
  • test/cucumber/features/single-mode/recipe.feature: added "minio-01" to the initialization scenario and removed "openfn".
  • monitoring/loki/loki-config.yml: updated the S3 endpoint URL from port 9000 to 9090.
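
As referenced in the datalake/docker-compose.cluster.yml row above, a sketch of the new MinIO address and health check, assuming MinIO's standard /minio/health/live endpoint and that curl is available in the image; the data path and any omitted flags are placeholders, not the repository's exact command:

  minio-01:
    # MinIO API bound to :9090 (previously the default :9000); Loki and ClickHouse
    # now reach it on this port over the shared minio_public network.
    command: server --address ":9090" /data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9090/minio/health/live"]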

Sequence Diagram(s)

sequenceDiagram
    participant Service as Analytics Datastore Clickhouse
    participant Network as Datalake Network

    Service->>Network: Connect to Datalake
    Network-->>Service: Establish connection

🐰 "In the land of code where networks grow,
A new path was forged, as we all know.
With datalake added, connections expand,
Hopping through data, so swift and so grand!
Cheers to the changes, let’s celebrate,
In the world of tech, we innovate!" 🐇

Possibly related PRs

Suggested reviewers

  • ItsMurumba
  • brett-onions
  • sushantpatil1214


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
analytics-datastore-clickhouse/docker-compose.yml (1)

Line range hint 62-64: LGTM! Consider documenting the network integration.

The network configuration correctly references the external MinIO network. This enables ClickHouse to access S3-compatible storage provided by MinIO.

Consider adding a comment in the docker-compose file to document:

  1. The purpose of this network integration (S3 connectivity)
  2. Any required MinIO bucket configurations or permissions
  3. The dependency on MinIO service being up before ClickHouse

Example comment:

  datalake:
+   # Enables ClickHouse to access S3-compatible storage via MinIO
+   # Requires: MinIO service to be running with appropriate bucket permissions
    name: minio_public
    external: true
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between bb22490 and 1d65da2.

📒 Files selected for processing (1)
  • analytics-datastore-clickhouse/docker-compose.yml (1 hunks)
🔇 Additional comments (1)
analytics-datastore-clickhouse/docker-compose.yml (1)

25-25: LGTM! Verify network dependency order.

The addition of the datalake network to the ClickHouse service looks good and aligns with the PR objective of enabling S3 connectivity.

Let's verify the network dependency exists:

✅ Verification successful

Network dependency verified and properly configured

The network configuration is correct and well-structured:

  • MinIO service is properly defined in datalake/docker-compose.yml and creates the network
  • The minio_public network is correctly marked as external in the ClickHouse compose file
  • The network is consistently configured across all services that need MinIO access
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the minio_public network exists and is used by MinIO service
# Expected: Find docker-compose files that define or use the minio_public network

# Search for minio_public network definition or usage
rg -l "minio_public"

# Search for MinIO service definition
rg -l "image:.*(minio)"

Length of output: 198


Script:

#!/bin/bash
# Let's examine the network configurations in the found files
echo "=== analytics-datastore-clickhouse/docker-compose.yml ==="
rg -A 3 "minio_public" analytics-datastore-clickhouse/docker-compose.yml

echo -e "\n=== datalake/docker-compose.yml ==="
rg -A 3 "minio_public" datalake/docker-compose.yml

echo -e "\n=== monitoring/docker-compose.yml ==="
rg -A 3 "minio_public" monitoring/docker-compose.yml

# Let's also check if MinIO service is defined in datalake compose
echo -e "\n=== MinIO service in datalake/docker-compose.yml ==="
rg "services:" -A 20 datalake/docker-compose.yml

Length of output: 1608

@drizzentic marked this pull request as draft on December 11, 2024 at 05:47
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
test/cucumber/features/cluster-mode/clickhouse.cluster.feature (1)

Line range hint 1-33: Consider documenting the network architecture and data flow.

The implementation properly enables ClickHouse-datalake connectivity while maintaining network isolation. Consider:

  1. Documenting the network topology and data flow patterns
  2. Adding monitoring for S3 operations and network connectivity
  3. Implementing retry mechanisms for S3 operations in case of network issues
datalake/docker-compose.cluster.yml (1)

57-57: Document port changes in architecture documentation

The port changes are consistently applied across all MinIO nodes. Consider:

  1. Updating architecture documentation to reflect the new port configuration
  2. Adding a comment in the docker-compose file explaining the reason for the port change (see the sketch below)
  3. Creating a migration guide for existing deployments

Also applies to: 62-62
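
A sketch of the documentation comment suggested in point 2 above; the wording and the abbreviated command are illustrative only, with <data-dirs> standing in for the actual drive arguments:

  minio-01:
    # API port moved from MinIO's default :9000 to :9090 so that consumers on the
    # shared minio_public network (Loki, ClickHouse) target a single documented port.
    # Existing deployments pointing at :9000 must be updated accordingly.
    command: server --address ":9090" <data-dirs>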

test/cucumber/features/single-mode/recipe.feature (2)

33-33: Enhance test coverage for datalake integration

While adding MinIO service check is good, consider enhancing the test coverage:

  1. Add specific tests for ClickHouse-MinIO connectivity
  2. Verify data retrieval from datalake via ClickHouse's S3 function
  3. Include error scenarios (e.g., MinIO unavailability)

Would you like me to help create additional test scenarios for the datalake integration?


Line range hint 1-85: Add dedicated scenarios for datalake functionality

Consider adding the following scenarios to thoroughly test the datalake integration:

  Scenario: Verify ClickHouse can access datalake
    When I store test data in MinIO
    Then ClickHouse should be able to query it via S3 function

  Scenario: Handle datalake connectivity issues
    Given MinIO service is stopped
    When I attempt to query datalake from ClickHouse
    Then appropriate error handling should occur
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1d65da2 and cfdab5c.

📒 Files selected for processing (7)
  • datalake/docker-compose.cluster.yml (3 hunks)
  • openfn/docker-compose.yml (1 hunks)
  • test/cucumber/features/cluster-mode/clickhouse.cluster.feature (2 hunks)
  • test/cucumber/features/cluster-mode/monitoring.feature (0 hunks)
  • test/cucumber/features/single-mode/clickhouse.feature (2 hunks)
  • test/cucumber/features/single-mode/monitoring.feature (1 hunks)
  • test/cucumber/features/single-mode/recipe.feature (1 hunks)
💤 Files with no reviewable changes (1)
  • test/cucumber/features/cluster-mode/monitoring.feature
✅ Files skipped from review due to trivial changes (1)
  • openfn/docker-compose.yml
🔇 Additional comments (4)
test/cucumber/features/single-mode/clickhouse.feature (1)

9-9: LGTM! Addition of minio_public network enables S3 connectivity.

The addition of the minio_public network to ClickHouse's network configuration aligns with the PR objective of enabling datalake access via S3.

Let's verify the corresponding MinIO configuration:

Also applies to: 22-22

✅ Verification successful

Let me check the docker-compose.yml file to verify the MinIO configuration.


Let me check the configuration files for any S3-related settings since they might contain the actual MinIO connection configuration.


Network configuration for MinIO connectivity is properly set up

The docker-compose.yml correctly defines the datalake network that maps to the external minio_public network, which aligns with the network configuration in the Cucumber feature file. While we don't see explicit S3/MinIO connection settings in the configuration files, this is expected as such credentials are typically injected at runtime through environment variables or separate configuration files managed by the deployment system.

  • ✓ Network configuration in docker-compose.yml matches the feature file
  • ✓ External network minio_public is properly declared
  • ✓ Network is added to the ClickHouse service configuration
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if ClickHouse has the necessary S3 configurations to connect to MinIO
rg -l 's3|minio' analytics-datastore-clickhouse/

Length of output: 100


Script:

#!/bin/bash
# Check the docker-compose.yml for MinIO/S3 configuration
cat analytics-datastore-clickhouse/docker-compose.yml

Length of output: 1699


Script:

#!/bin/bash
# Check configuration files for S3/MinIO settings
rg -l "s3|minio" analytics-datastore-clickhouse/general_configs/

Length of output: 66

test/cucumber/features/single-mode/monitoring.feature (1)

18-18: Verify the intentional reduction in volume count.

The reduction from 6 to 3 volumes appears to be related to the decoupling of MinIO from the monitoring package. This aligns with the broader changes of establishing a dedicated datalake configuration.

Let's confirm this is intentional:

✅ Verification successful

Volume count reduction is correct and intentional

The docker-compose.yml shows exactly 3 volumes defined for the monitoring package:

  • prometheus-data
  • grafana-data
  • loki-data

This matches the updated test assertion "There should be 3 volumes" in the monitoring.feature file. The volume count is accurate and reflects the current architecture where MinIO storage has been moved to a dedicated datalake configuration, as evidenced by the presence of MinIO-related files in the separate datalake directory shown in the earlier search results.
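
A sketch of the volume section this implies for monitoring/docker-compose.yml; driver settings and any per-volume options are omitted rather than taken from the repository:

  volumes:
    prometheus-data:
    grafana-data:
    loki-data: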

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any remaining MinIO volume definitions in monitoring
rg -l 'minio.*volume' monitoring/
# Check for any MinIO references in monitoring configs
rg -l 'minio' monitoring/config/

Length of output: 127


Script:

#!/bin/bash
# Let's search for MinIO and volume-related configurations in the entire codebase
rg -l "minio" .
# Search for volume-related test scenarios in cucumber features
fd "monitoring.feature" . --exec grep -l "volume" {}
# Get the content of the monitoring feature file to understand the context
fd "monitoring.feature" . --exec cat {}

Length of output: 5029


Script:

#!/bin/bash
# Let's check the docker-compose files to understand the volume configuration
cat monitoring/docker-compose.yml
# Check if there are any previous versions of the file in git history
git log -p --all -- monitoring/docker-compose.yml | head -n 100

Length of output: 10109

test/cucumber/features/cluster-mode/clickhouse.cluster.feature (1)

9-9: LGTM! Consistent network configuration across cluster nodes.

The addition of minio_public network is properly implemented across all ClickHouse cluster nodes, ensuring uniform S3 access capabilities throughout the cluster.

Let's verify the cluster-wide S3 configuration:

Also applies to: 12-12, 15-15, 18-18, 33-33

✅ Verification successful

Based on the review of the docker-compose files and network configurations, I can now provide a final response:

Network configuration properly implemented for S3/MinIO access

The minio_public network has been correctly added to all ClickHouse nodes in the cluster setup, with consistent configuration across the test scenarios. The network is properly defined as external in the docker-compose files, ensuring proper connectivity between ClickHouse and MinIO services.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for consistent S3 configurations across cluster nodes
rg -l 'cluster.*s3|cluster.*minio' analytics-datastore-clickhouse/
# Check for any cluster-specific S3 settings
rg -l 'clickhouse-\d+.*s3|clickhouse-\d+.*minio' analytics-datastore-clickhouse/

Length of output: 151


Script:

#!/bin/bash
# Let's check the feature file content to understand the cluster setup and network configuration
cat test/cucumber/features/cluster-mode/clickhouse.cluster.feature

# Also check for any docker-compose files that might contain cluster configuration
fd docker-compose --exec cat {}

Length of output: 67600


Script:

#!/bin/bash
# Let's check for any docker-compose files in the analytics-datastore-clickhouse directory
fd docker-compose analytics-datastore-clickhouse/ --exec cat {}

# Also check for any network configurations in the package
rg -l "minio_public.*network" analytics-datastore-clickhouse/

Length of output: 12126


Script:

#!/bin/bash
# Let's check the cluster configuration files for any S3/MinIO related settings
fd -e xml cluster_configs/ analytics-datastore-clickhouse/ --exec cat {}

# Also check the config importer files for any S3/MinIO configurations
fd -e js config/ analytics-datastore-clickhouse/ --exec cat {}

Length of output: 143

datalake/docker-compose.cluster.yml (1)

35-35: Configuration consistent with MinIO-02

The changes mirror those in MinIO-02, maintaining consistency across the cluster.

Also applies to: 40-40

datalake/docker-compose.cluster.yml: review comment resolved
@drizzentic marked this pull request as ready for review on December 11, 2024 at 05:52
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
monitoring/loki/loki-config.yml (1)

42-43: Consider adding retry mechanism for S3 operations

Given that this configuration is part of the datalake network integration, consider adding S3 retry configurations to handle temporary network issues or MinIO failovers. This is especially important for the monitoring stack to maintain data consistency.

Example addition:

  aws:
    s3: http://${MO_SECURITY_ADMIN_USER}:${MO_SECURITY_ADMIN_PASSWORD}@minio-01.:9090/loki
    s3forcepathstyle: true
+   max_retries: 5
+   retry_wait: 30s
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between cfdab5c and 6655f1c.

📒 Files selected for processing (1)
  • monitoring/loki/loki-config.yml (1 hunks)
🔇 Additional comments (1)
monitoring/loki/loki-config.yml (1)

42-42: Verify MinIO port configuration across the platform

The MinIO endpoint port has been changed from 9000 to 9090. While this aligns with the changes in datalake/docker-compose.cluster.yml, we should ensure this port change is consistently applied across all services that interact with MinIO.

✅ Verification successful

Let me check the actual port configurations in these files to ensure consistency.


Port configuration is consistent across MinIO services

The port change from 9000 to 9090 is consistently implemented across all relevant configurations:

  • MinIO server is configured to listen on port 9090 in both docker-compose files
  • Loki's S3 endpoint correctly uses port 9090 to connect to MinIO
  • The port is properly exposed through the Traefik load balancer (see the sketch below)
  • No port conflicts were found as 9090 is exclusively used for MinIO API
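
Purely illustrative, a sketch of how such a Traefik exposure is typically declared for a Swarm service; the service name in the label and any other labels are assumptions, not the repository's actual configuration:

  minio-01:
    deploy:
      labels:
        - traefik.enable=true
        # Tell Traefik which container port the MinIO API now listens on
        - traefik.http.services.minio-api.loadbalancer.server.port=9090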
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining references to port 9000 for MinIO
# and verify the new port 9090 configuration is consistent

echo "Checking for MinIO port configurations..."
rg -l "minio.*:9000" --type yaml

echo "Verifying MinIO service definitions..."
rg -l "minio.*:9090" --type yaml

echo "Checking for potential port conflicts..."
rg -l "port.*9090" --type yaml

Length of output: 497


Script:

#!/bin/bash
# Check the actual MinIO port configurations in the identified files
echo "Content of monitoring/loki/loki-config.yml:"
rg "minio.*:90" monitoring/loki/loki-config.yml -A 2 -B 2

echo -e "\nContent of datalake/docker-compose.cluster.yml:"
rg "minio.*:90" datalake/docker-compose.cluster.yml -A 2 -B 2

echo -e "\nContent of datalake/docker-compose.yml:"
rg "port.*9090|minio.*:90" datalake/docker-compose.yml -A 2 -B 2

Length of output: 2056

monitoring/loki/loki-config.yml: review comment resolved
@brett-onions (Contributor) left a comment

LGTM
