Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(logs): add alloy + loki #451

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

0xNineteen
Copy link
Contributor

@0xNineteen 0xNineteen commented Dec 18, 2024

  • mac support
  • linux support

note: alerting requires proper parsing of logs which requires some changes to our logger which ill put into a separate PR

  • logs are stored in logs/sig.log
  • alloy scrapes the logs and pushes to loki
  • loki is a log database which we query from grafana on the explore page (dashboards + alerting coming in future PR -- see note)

TL;DR

Added log aggregation capabilities to the metrics stack using Loki and Alloy.

What changed?

  • Added Loki for log storage and querying
  • Added Alloy for log collection and forwarding
  • Updated Grafana configuration to include Loki as a data source
  • Added /logs to gitignore
  • Consolidated message handle time metrics into the main Gossip dashboard
  • Updated documentation for running with logs

How to test?

  1. Run sig with log output: ./zig-out/bin/sig gossip -n testnet 2>&1 | tee -a logs/sig.log
  2. Start the metrics stack:
    • Mac: docker compose -f compose_mac.yaml up -d
    • Linux: docker compose -f compose_linux.yaml up -d
  3. Access Grafana at localhost:3000 and verify logs are visible in Explore view
  4. Verify metrics dashboards show handle time data correctly

Copy link
Contributor Author

0xNineteen commented Dec 18, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

@0xNineteen 0xNineteen marked this pull request as ready for review December 18, 2024 12:16
@0xNineteen 0xNineteen marked this pull request as draft December 18, 2024 12:16
@0xNineteen 0xNineteen force-pushed the 12-18-feat_logs_add_alloy_loki branch 4 times, most recently from 8eb43d7 to 55efe88 Compare December 18, 2024 22:21
@0xNineteen 0xNineteen force-pushed the 12-18-feat_logs_add_alloy_loki branch from 6ec7912 to f2a4f09 Compare January 2, 2025 19:25
@0xNineteen 0xNineteen mentioned this pull request Jan 3, 2025
1 task
@0xNineteen 0xNineteen marked this pull request as ready for review January 3, 2025 13:17
@0xNineteen 0xNineteen requested review from dnut and Sobeston January 3, 2025 13:17
@0xNineteen 0xNineteen self-assigned this Jan 3, 2025
Copy link
Contributor

@Sobeston Sobeston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a few hitches getting things up, but seems generally good.

metrics/compose_linux.yaml Outdated Show resolved Hide resolved
metrics/loki/loki.yml Show resolved Hide resolved
@Sobeston

This comment was marked as resolved.

@Sobeston

This comment was marked as resolved.

Sobeston and others added 2 commits January 6, 2025 06:40
* fix and clean up metrics

* allow prom to be accessed by external services

---------

Co-authored-by: x19 <[email protected]>
@0xNineteen
Copy link
Contributor Author

Copy link
Contributor

@dnut dnut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but I just had a few minor nitpicks. I'm not very opinionated about any of this, so feel free to turn them down.

need to also modify the prometheus `target` to point to the different port).
- sig metrics will be published to localhost:12345 (if you change this on the sig cli, you will
need to also modify the prometheus `target` to point to the different port).
- sig gossip metrics are published to localhost:12355
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant with prior bullet

image: grafana/loki:3.0.0
container_name: loki
ports:
- "3100:3100"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to expose these ports to the host system? This shouldn't be necessary just to connect from other containers in the same docker network.

Comment on lines +36 to +37
- "./alloy/config.alloy:/etc/alloy/config.alloy"
- "../logs:/var/log/alloy"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very important at this point, but it's a good practice to mount as read-only if you know the files don't need to be modified by the container. Does config.alloy need to be modified by the application?

Suggested change
- "./alloy/config.alloy:/etc/alloy/config.alloy"
- "../logs:/var/log/alloy"
- "./alloy/config.alloy:/etc/alloy/config.alloy:ro"
- "../logs:/var/log/alloy:ro"

ports:
- "3100:3100"
volumes:
- ./loki:/etc/loki
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- ./loki:/etc/loki
- ./loki:/etc/loki:ro

@@ -41,6 +41,7 @@ pub fn getMetrics(
_: *httpz.Request,
response: *httpz.Response,
) !void {
response.content_type = .TEXT; // expected by prometheus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens when you don't include this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prometheus doesn't record it at all.

prometheus     | time=2025-01-06T15:04:39.954Z level=ERROR source=scrape.go:1590 msg="Failed to determine correct type of scrape target." component="scrape manager" scrape_pool=prometheus target=http://host.docker.internal:12345/metrics content_type="" fallback_media_type="" err="non-compliant scrape target sending blank Content-Type and no fallback_scrape_protocol specified for target"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this due to a recent change in prometheus?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea - worth noting that our prometheus image doesn't have a version on it, pretty sure that means we're tracking the latest image, which is probably undesirable.

@@ -10,6 +10,7 @@
/kcov-output
/results
/validator
/logs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a logs folder? I feel like it's slightly easier to work with a log file when it's placed directly in the current working directory. Doesn't make much difference but just a small convenience.

Also I don't see where a log file is being produced by the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 👀 In review
Development

Successfully merging this pull request may close these issues.

3 participants