
Vector Disk Buffer Error: Last written record was unable to be deserialized. Corruption likely #20651

Closed
ShahroZafar opened this issue Jun 12, 2024 · 1 comment
Labels
type: bug (A code related bug.)

Comments

ShahroZafar commented Jun 12, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Often we see one or two of the Vector pods running as agents spike in CPU and memory. Looking at the Kubernetes events, the pods show as OOMKilled. (Vector is basically unable to handle even a small spike in load.)

When the pod comes back up, its CPU usage is very high even though it is not pushing any logs via the kafka sink, and the disk buffer keeps growing. The buffer size is still well under the configured maximum, and there is still space available on the underlying AWS EC2 instance.
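
For context, the agents run with Kubernetes memory limits along these lines (the values below are illustrative, not the exact ones from our deployment); a tight limit like this is what turns a small ingest spike into an OOMKill:

# Illustrative resource settings for the Vector agent DaemonSet
# (placeholder values, not our production ones).
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi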

In the Vector logs I can see this error:

2024-06-12T00:41:37.459225Z ERROR sink{component_kind="sink" component_id=kafka component_type=kafka}: vector_buffers::variants::disk_v2::writer: Last written record was unable to be deserialized. Corruption likely. reason="invalid data: check failed for struct member payload: pointer out of bounds: base 0x7fb675fffff4 offset 1868849526 not in range 0x7fb673e1f000..0x7fb676000000"
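
As a temporary mitigation I have been considering (only a sketch on my side, not verified against this exact failure), the kafka sink's disk buffer could be swapped for a bounded memory buffer so that the corrupted on-disk records are not re-read at startup, at the cost of losing buffered events across restarts:

# Sketch only; the max_events value is a placeholder and would need tuning.
sinks:
  kafka:
    buffer:
      type: memory
      max_events: 10000
      when_full: block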

To sum it up:

  • There was a slight increase in load, which caused Vector to get OOMKilled for some reason
  • When Vector comes back up, it logs the deserialization error shown above
  • The buffer keeps growing once Vector is back online
  • Vector's CPU usage is far too high even though it is not processing anything

Configuration

acknowledgements:
  enabled: true
api:
  address: 0.0.0.0:8686
  enabled: true
  playground: false
data_dir: /vector-data-dir
expire_metrics_secs: 900
sinks:
  kafka:
    batch:
      max_bytes: 1000000
      max_events: 1500
      timeout_secs: 0.5
    bootstrap_servers: kafka:9092
    buffer:
      max_size: 5000000000
      type: disk
      when_full: block
    compression: zstd
    encoding:
      codec: json
    inputs:
    - dedot_keys
    librdkafka_options:
      client.id: vector
      request.required.acks: "1"
    message_timeout_ms: 0
    topic: vector
    type: kafka
  prometheus_exporter:
    address: 0.0.0.0:9090
    buffer:
      max_size: 5000000000
      type: disk
      when_full: block
    flush_period_secs: 60
    inputs:
    - internal_metrics
    type: prometheus_exporter
sources:
  internal_metrics:
    type: internal_metrics
  kubernetes_logs:
    glob_minimum_cooldown_ms: 3000
    ingestion_timestamp_field: ingest_timestamp
    type: kubernetes_logs
transforms:
  dedot_keys:
    inputs:
    - kubernetes_logs
    source: |
      . = map_keys(., recursive: true) -> |key| { replace(key, ".", "_") }
    type: remap

Version

0.38.0

Debug Output

No response

Example Data

No response

Additional Context

Vector is running as an agent on EKS with Kubernetes version 1.29.

Attaching Screenshots

[Five screenshots attached]

References

No response

@ShahroZafar added the type: bug label on Jun 12, 2024
@jszwedko
Member

Hi @ShahroZafar !

Thanks for opening this. I think it is the same issue as #18336 so I'll close this one as a duplicate, but I appreciate all of the additional details you provided in your report!

@jszwedko closed this as not planned (duplicate) on Jun 12, 2024