Dedupe events - high memory usage #20883
@luk-ada thanks for this report. Do you have a baseline to compare against? Could you try running both Vectors without the dedupe transform? As you noted, the overhead for the keys should be relatively small for your key field.
@jszwedko thank you for the response. Below you can see RAM usage before dedupe was enabled. In this case we are talking about the first Vector, which was configured with the PubSub source and the Logscale sink. I restarted the second Vector on Friday; it is currently using ~10 GB of memory and is slowly but constantly growing. PS: please ignore usage above the limit - those are pod restarts.
Thanks @luk-ada, that is interesting. It does make it seem like the dedupe transform is responsible for the extra memory usage. I think one (or both) of two things could be helpful:
Hi @jszwedko, I've prepared a simple setup to reproduce; please check the attached zip. It looks like 15M messages use around 5 GB of RAM. I used WSL 1 on Windows 10; my results are below. Config + generator + vector.sh to run the reproduction:
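The attached zip is not reproduced in this thread. As a rough sketch of what such a generator might look like (the id value and duplication scheme are assumptions for illustration, not the actual script from the zip), one could emit JSON log lines with 17-digit message_id values, repeating some ids so the stream contains known duplicates for dedupe to drop:

```python
import json

def generate_messages(count, duplicate_every=10):
    """Yield JSON log lines with 17-digit message_id values.
    Every `duplicate_every`-th message repeats the previous id,
    so the stream contains known duplicates for dedupe to drop.
    (Hypothetical sketch; not the generator from the attached zip.)"""
    base = 11497906994447355  # 17-digit example id from the report
    last = None
    for i in range(count):
        if last is not None and i % duplicate_every == 0:
            msg_id = last  # emit a duplicate on purpose
        else:
            msg_id = str(base + i)
        last = msg_id
        yield json.dumps({"message_id": msg_id, "message": f"log line {i}"})

if __name__ == "__main__":
    for line in generate_messages(5):
        print(line)
```

Piping such output into a Vector instance with a dedupe transform and watching RSS grow as the cache fills would give a comparable reproduction.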
Thanks for putting this reproduction together! I haven't had a chance to look at it yet, but it should help with reproducing and identifying the cause of the increased memory usage.
Problem
Hello
I'm streaming GCP logs from PubSub and ingesting them into Logscale, with Vector doing the log-to-metrics transformation.
It seems that Vector is not handling the PubSub source very well: I see quite a lot of duplicates, which the customer cannot accept. The PubSub load ranges from 5k/s to 200k/s. I have two Vectors:
The first Vector is using ~2.5-3 GB of memory. The second is using 15 GB of RAM and is slowly growing all the time.
Is de-duplication working correctly? I'm using fields.match on message_id, which is a string of 17 digits (17 bytes), so according to the Memory Utilization Estimation it should use about 0.255 GB:
length("11497906994447355") = 17
17 bytes * 15,000,000 * 1e-9 = 0.255 GB
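The arithmetic above can be checked directly. The 64-byte overhead figure below is a hypothetical per-entry bookkeeping cost (hash-map slots, allocator padding), not a number from Vector's documentation; it only illustrates how quickly the estimate grows once anything beyond the raw key bytes is counted:

```python
def estimate_gb(num_events, key_bytes, overhead_bytes=0):
    """Estimate dedupe cache size in GB: per-event key bytes
    plus an assumed fixed per-entry overhead."""
    return (key_bytes + overhead_bytes) * num_events * 1e-9

key_len = len("11497906994447355")                    # 17-byte message_id
optimistic = estimate_gb(15_000_000, key_len)         # keys only
with_overhead = estimate_gb(15_000_000, key_len, 64)  # + assumed 64 B/entry
print(round(optimistic, 3), round(with_overhead, 3))  # 0.255 1.215
```

Even with a generous assumed overhead, the estimate stays far below the 15 GB observed, which is why the growth looks like a leak or unbounded cache rather than expected key storage.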
I'm sharing configuration for the second Vector.
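As a minimal sketch of the dedupe setup described above (component ids and the cache size are assumptions for illustration, not taken from the actual configuration), the transform matching on message_id might look like:

```toml
[transforms.dedupe_events]
type = "dedupe"
inputs = ["pubsub_in"]         # hypothetical source id
cache.num_events = 15000000    # events remembered before eviction
fields.match = ["message_id"]  # dedupe on the 17-digit id only
```

Note that cache.num_events bounds how many past events are remembered, so the cache size (not the total stream length) determines the steady-state memory footprint.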
Configuration
Version
vector 0.39.0 (x86_64-unknown-linux-musl 73da9bb 2024-06-17 16:00:23.791735272)
Debug Output
No response
Example Data
Example message_id:
Additional Context
No response
References
No response