
Dedupe events - high memory usage #20883

Open
luk-ada opened this issue Jul 18, 2024 · 5 comments
Labels
domain: performance · transform: dedupe · type: bug

Comments


luk-ada commented Jul 18, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hello

I'm streaming GCP logs from Pub/Sub and ingesting them into Logscale, using Vector for log-to-metrics transformation.

It seems that Vector is not handling the Pub/Sub source very well, and I get quite a lot of duplicates, which are not acceptable to the customer. The Pub/Sub load ranges from 5k/s to 200k/s. I have two Vector instances:

  1. Source: PubSub > Transform: Dedupe (cache: 5,000,000 events) > Sink: vector
  2. Source: Vector > Transform: Dedupe (cache: 15,000,000 events) > Sinks: humio_logs, vector

The first Vector uses ~2.5-3 GB of memory. The second uses 15 GB of RAM and is slowly but continuously growing.

Is de-duplication working correctly? I'm matching with fields.match on message_id, which is a string of 17 digits (17 bytes), so according to the memory utilization estimation it should use about 0.255 GB.

length("11497906994447355") = 17

17 bytes * 15,000,000 * 1e-9 = 0.255 GB
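The back-of-the-envelope estimate above can be reproduced in Python (a sketch that counts only the raw key bytes, ignoring any per-entry bookkeeping the cache might add):

```python
# Estimate raw key storage for the dedupe cache: key size in bytes
# times the number of cached events (ignores per-entry overhead).
key = "11497906994447355"
key_bytes = len(key)            # 17 bytes per message_id
cache_entries = 15_000_000      # cache.num_events for the second Vector

estimated_gb = key_bytes * cache_entries * 1e-9
print(f"{estimated_gb:.3f} GB")  # 0.255 GB
```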

I'm sharing configuration for the second Vector.

Configuration

api:
  address: 0.0.0.0:8686
  enabled: true
  playground: false
data_dir: /vector-data-dir
acknowledgements:
  enabled: true
sources:
  vector:
    type: vector
    address: 0.0.0.0:1500
  internal_streams_metrics_source:
    type: internal_metrics
transforms:
  dedupe:
    type: dedupe
    inputs:
      - vector
    cache:
      num_events: 15000000
    fields:
      match: ["message_id"]
sinks:
  internal_streams_metrics_sink:
    address: 0.0.0.0:9000
    default_namespace: service
    inputs:
      - internal_streams_metrics_source
    type: prometheus_exporter
    acknowledgements:
      enabled: false
  logscale_logs:
    type: humio_logs
    inputs:
      - dedupe
    endpoint: "http://***:8080"
    token: ${***}
    index: gcp
    event_type: gcp-parser
    encoding:
      codec: json
    acknowledgements:
      enabled: true
    batch:
      max_bytes: 1500000
      max_events: 1500
      timeout_secs: 1
    buffer:
      max_events: 10000
      type: memory
      when_full: block
    compression: none
  vector_metrics:
    type: vector
    inputs:
      - dedupe
    address: http://***:1500
    acknowledgements:
      enabled: true
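For context, the dedupe transform remembers the last cache.num_events match-field values and drops events whose values were already seen. A minimal Python sketch of that idea (an illustration only, not Vector's actual Rust implementation):

```python
from collections import OrderedDict

class DedupeCache:
    """Minimal LRU-style dedupe cache sketch: remembers the last
    `num_events` keys and reports whether a key was already seen."""
    def __init__(self, num_events: int):
        self.num_events = num_events
        self.seen = OrderedDict()

    def is_duplicate(self, key: str) -> bool:
        if key in self.seen:
            self.seen.move_to_end(key)  # refresh recency
            return True
        self.seen[key] = None
        if len(self.seen) > self.num_events:
            self.seen.popitem(last=False)  # evict the oldest entry
        return False

cache = DedupeCache(num_events=3)
print(cache.is_duplicate("11497906994447355"))  # False (first time seen)
print(cache.is_duplicate("11497906994447355"))  # True  (duplicate)
```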

Version

vector 0.39.0 (x86_64-unknown-linux-musl 73da9bb 2024-06-17 16:00:23.791735272)

Debug Output

No response

Example Data

example, message_id:

11497906994447360
11497906994447359
11497906994447358
11497906994447357
11497906994447356
11497906994447355
11497906994447354
11497906994447353
11497906994447352
11497906994447351

Additional Context

No response

References

No response

@luk-ada added the `type: bug` label Jul 18, 2024
@jszwedko
Member

@luk-ada thanks for this report.

To have a baseline to compare with, could you try running both Vectors without the dedupe transform and observing the memory use? I'd like to understand how much the transform may be adding over the baseline.

As you noted, the overhead for the keys should be relatively small for your key field, message_id. For 17-byte keys, 5 million messages should take about 85 MB, and 15 million messages about 255 MB, for just the internal state store of the dedupe transform. There is some static overhead per key, but it should be on the order of a few bytes. I suspect that the additional memory use is not in the dedupe transform itself but somewhere else in the pipeline, though it is possible that there is a bug.
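Comparing the observed usage against this estimate gives a rough implied per-entry cost (a sketch; note the 15 GB figure covers the whole process, not just the cache):

```python
# Rough per-entry cost implied by the observed memory use.
observed_bytes = 15e9          # ~15 GB resident for the second Vector
entries = 15_000_000           # cache.num_events
expected_key_bytes = 17        # raw message_id size

implied = observed_bytes / entries
print(f"~{implied:.0f} bytes per cached entry "
      f"vs {expected_key_bytes} bytes of raw key data")  # ~1000 vs 17
```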

@luk-ada
Author

luk-ada commented Jul 23, 2024

@jszwedko thank you for the response. Below you can see the RAM usage before dedupe was enabled. In this case we are talking about the first Vector, which was configured with the PubSub source and the Logscale sink.

[screenshot: memory usage before enabling dedupe]

I've restarted the second Vector on Friday; it is currently using ~10 GB of memory and slowly but constantly growing.

[screenshot: memory usage with dedupe enabled]

PS: please ignore the usage spikes above the limit; those are pod restarts.

@jszwedko
Member

Thanks @luk-ada, that is interesting. It does make it seem like the dedupe transform is causing a large increase in memory usage. Nothing jumped out from a quick review of the code.

I think one (or both) of two things could be helpful:

  • Creating a minimal reproducible example that I could run that manifests the behavior. This would include trimming the config to the minimal necessary and providing some mechanism to generate the input and feed it to a running Vector. Then I could more easily profile locally.
  • Collecting a memory profile yourself by running Vector under valgrind and providing it here. That profile might make it easier to spot where the issue lies.

@jszwedko added the `transform: dedupe` and `domain: performance` labels Jul 26, 2024
@luk-ada
Author

luk-ada commented Jul 30, 2024

Hi @jszwedko

I've prepared a simple setup to reproduce the issue; please check the attached zip. It looks like 15M messages use around 5 GB of RAM. I used WSL 1 on Windows 10; my results are below.

2.6M events: [screenshot]

5.4M events: [screenshot]

11.5M events: [screenshot]

16M events: [screenshot]

20M events: [screenshot]

25M events: [screenshot]

No deduplication: [screenshot]
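From these measurements, the per-entry cost implied by the reproduction can be computed (a sketch of the arithmetic, using the ~5 GB at ~15M messages figure above):

```python
# ~5 GB at ~15M cached entries implies roughly 333 bytes per entry,
# an order of magnitude above the 17-byte raw key size.
observed_gb = 5.0
entries = 15_000_000
bytes_per_entry = observed_gb * 1e9 / entries
print(f"~{bytes_per_entry:.0f} bytes per entry")  # ~333 bytes
```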

Config, generator, and vector.sh to reproduce the setup:

.
├── data
│ └── file
├── generate.py # simple generator written by Copilot
├── log
├── vector.sh
└── vector.yaml

vector-dedupe.zip
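The actual generator is inside the zip; a hypothetical sketch of such a generator (the function name, fields, and starting ID are assumptions, not the real generate.py) could look like:

```python
# Hypothetical generator sketch: emits JSON lines with unique,
# sequential 17-digit message_id values, suitable for a file source.
import json

def generate_lines(count: int, start_id: int = 11497906994447351):
    for i in range(count):
        yield json.dumps({"message_id": str(start_id + i),
                          "message": f"event {i}"})

for line in generate_lines(3):
    print(line)
```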

@jszwedko
Member

jszwedko commented Aug 2, 2024

Thanks for putting this reproduction together! I haven't had a chance to look at it yet, but it should help with reproducing and identifying the reason for the increased memory usage.
