
Dedupe events - high memory usage #20883

Open
luk-ada opened this issue Jul 18, 2024 · 5 comments
Labels
domain: performance · transform: dedupe · type: bug

Comments


luk-ada commented Jul 18, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hello

I'm streaming GCP logs from Pub/Sub and ingesting them into Logscale, using Vector for log-to-metrics transformation.

It seems that Vector is not handling the Pub/Sub source very well, and I get quite a lot of duplicates, which are not acceptable to the customer. The Pub/Sub load ranges from 5k/s to 200k/s. I have two Vector instances:

  1. Source: PubSub > Transform: Dedupe (cache: 5,000,000 events) > Sink: vector
  2. Source: Vector > Transform: Dedupe (cache: 15,000,000 events) > Sinks: humio_logs, vector

The first Vector uses ~2.5-3 GB of memory. The second uses 15 GB of RAM and is slowly but continuously growing.

Is de-duplication working correctly? I'm matching with fields.match on message_id, which is a string of 17 digits (17 bytes), so according to the memory utilization estimation it should use about 0.255 GB.

length("11497906994447355") = 17

17 bytes * 15,000,000 * 1e-9 = 0.255 GB
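The back-of-the-envelope estimate above can be reproduced in Python (a sketch that counts only the raw key bytes, ignoring any per-entry bookkeeping the cache might add):

```python
# Estimate raw key storage for the dedupe cache: key size in bytes
# times the number of cached events (ignores per-entry overhead).
key = "11497906994447355"
key_bytes = len(key)            # 17 bytes per message_id
cache_entries = 15_000_000      # cache.num_events for the second Vector

estimated_gb = key_bytes * cache_entries * 1e-9
print(f"{estimated_gb:.3f} GB")  # 0.255 GB
```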

I'm sharing configuration for the second Vector.

Configuration

api:
  address: 0.0.0.0:8686
  enabled: true
  playground: false
data_dir: /vector-data-dir
acknowledgements:
  enabled: true
sources:
  vector:
    type: vector
    address: 0.0.0.0:1500
  internal_streams_metrics_source:
    type: internal_metrics
transforms:
  dedupe:
    type: dedupe
    inputs:
      - vector
    cache:
      num_events: 15000000
    fields:
      match: ["message_id"]
sinks:
  internal_streams_metrics_sink:
    address: 0.0.0.0:9000
    default_namespace: service
    inputs:
      - internal_streams_metrics_source
    type: prometheus_exporter
    acknowledgements:
      enabled: false
  logscale_logs:
    type: humio_logs
    inputs:
      - dedupe
    endpoint: "http://***:8080"
    token: ${***}
    index: gcp
    event_type: gcp-parser
    encoding:
      codec: json
    acknowledgements:
      enabled: true
    batch:
      max_bytes: 1500000
      max_events: 1500
      timeout_secs: 1
    buffer:
      max_events: 10000
      type: memory
      when_full: block
    compression: none
  vector_metrics:
    type: vector
    inputs:
      - dedupe
    address: http://***:1500
    acknowledgements:
      enabled: true
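For context, the dedupe transform remembers the last cache.num_events match-field values and drops events whose values were already seen. A minimal Python sketch of that idea (an illustration only, not Vector's actual Rust implementation):

```python
from collections import OrderedDict

class DedupeCache:
    """Minimal LRU-style dedupe cache sketch: remembers the last
    `num_events` keys and reports whether a key was already seen."""
    def __init__(self, num_events: int):
        self.num_events = num_events
        self.seen = OrderedDict()

    def is_duplicate(self, key: str) -> bool:
        if key in self.seen:
            self.seen.move_to_end(key)  # refresh recency
            return True
        self.seen[key] = None
        if len(self.seen) > self.num_events:
            self.seen.popitem(last=False)  # evict the oldest entry
        return False

cache = DedupeCache(num_events=3)
print(cache.is_duplicate("11497906994447355"))  # False (first time seen)
print(cache.is_duplicate("11497906994447355"))  # True  (duplicate)
```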

Version

vector 0.39.0 (x86_64-unknown-linux-musl 73da9bb 2024-06-17 16:00:23.791735272)

Debug Output

No response

Example Data

example, message_id:

11497906994447360
11497906994447359
11497906994447358
11497906994447357
11497906994447356
11497906994447355
11497906994447354
11497906994447353
11497906994447352
11497906994447351

Additional Context

No response

References

No response

@luk-ada added the `type: bug` label Jul 18, 2024
@jszwedko
Member

@luk-ada thanks for this report.

To have a baseline to compare with, could you try running both Vectors without the dedupe transform and observing the memory use? I'd like to understand how much the transform may be adding over the baseline.

As you noted, the overhead for the keys should be relatively small for your key field, message_id. For 17-byte keys, 5 million messages should take about 85 MB, and 15 million messages about 255 MB, for just the internal state store of the dedupe transform. There is some static overhead per key, but it should be on the order of a few bytes. I suspect that the additional memory use is not in the dedupe transform itself but somewhere else in the pipeline, though it is possible that there is a bug.
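Comparing the observed usage against this estimate gives a rough implied per-entry cost (a sketch; note the 15 GB figure covers the whole process, not just the cache):

```python
# Rough per-entry cost implied by the observed memory use.
observed_bytes = 15e9          # ~15 GB resident for the second Vector
entries = 15_000_000           # cache.num_events
expected_key_bytes = 17        # raw message_id size

implied = observed_bytes / entries
print(f"~{implied:.0f} bytes per cached entry "
      f"vs {expected_key_bytes} bytes of raw key data")  # ~1000 vs 17
```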

@luk-ada
Author

luk-ada commented Jul 23, 2024

@jszwedko thank you for the response. Below you can see the RAM usage before dedupe was enabled. In this case we are talking about the first Vector, which was configured with the PubSub source and the Logscale sink.

[screenshot: memory usage before enabling dedupe]

I've restarted the second Vector on Friday; it is currently using ~10 GB of memory and slowly but constantly growing.

[screenshot: memory usage with dedupe enabled]

PS: please ignore the usage spikes above the limit; those are pod restarts.

@jszwedko
Member

Thanks @luk-ada, that is interesting. It does make it seem like the dedupe transform is causing a large increase in memory usage. Nothing jumped out from a quick review of the code.

I think one (or both) of two things could be helpful:

  • Creating a minimal reproducible example that I could run that manifests the behavior. This would include trimming the config to the minimal necessary and providing some mechanism to generate the input and feed it to a running Vector. Then I could more easily profile locally.
  • Collecting a memory profile yourself by running Vector under valgrind and providing it here. That profile might make it easier to spot where the issue lies.

@jszwedko added the `transform: dedupe` and `domain: performance` labels Jul 26, 2024
@luk-ada
Author

luk-ada commented Jul 30, 2024

Hi @jszwedko

I've prepared a simple setup to reproduce the issue; please check the attached zip. It looks like 15M messages use around 5 GB of RAM. I used WSL 1 on Windows 10; my results are below.

2.6M events: [screenshot]

5.4M events: [screenshot]

11.5M events: [screenshot]

16M events: [screenshot]

20M events: [screenshot]

25M events: [screenshot]

No deduplication: [screenshot]
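From these measurements, the per-entry cost implied by the reproduction can be computed (a sketch of the arithmetic, using the ~5 GB at ~15M messages figure above):

```python
# ~5 GB at ~15M cached entries implies roughly 333 bytes per entry,
# an order of magnitude above the 17-byte raw key size.
observed_gb = 5.0
entries = 15_000_000
bytes_per_entry = observed_gb * 1e9 / entries
print(f"~{bytes_per_entry:.0f} bytes per entry")  # ~333 bytes
```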

Config, generator, and vector.sh to reproduce the setup:

.
├── data
│ └── file
├── generate.py # simple generator written by Copilot
├── log
├── vector.sh
└── vector.yaml

vector-dedupe.zip
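The actual generator is inside the zip; a hypothetical sketch of such a generator (the function name, fields, and starting ID are assumptions, not the real generate.py) could look like:

```python
# Hypothetical generator sketch: emits JSON lines with unique,
# sequential 17-digit message_id values, suitable for a file source.
import json

def generate_lines(count: int, start_id: int = 11497906994447351):
    for i in range(count):
        yield json.dumps({"message_id": str(start_id + i),
                          "message": f"event {i}"})

for line in generate_lines(3):
    print(line)
```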

@jszwedko
Member

jszwedko commented Aug 2, 2024

Thanks for putting this reproduction together! I haven't had a chance to look at it yet, but it should help with reproducing and identifying the reason for the increased memory usage.
