
Vector Disk Buffer Error: Last written record was unable to be deserialized. Corruption likely #20651

Closed
ShahroZafar opened this issue Jun 12, 2024 · 1 comment
Labels
type: bug (A code related bug.)

Comments

ShahroZafar commented Jun 12, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Often we see one or two of the Vector pods running as agents spike in CPU and memory. Looking at the Kubernetes events, the pods show as OOMKilled. (Vector is basically unable to handle even a small spike in load.)

When the pod comes back up, its CPU usage is very high even though it is not pushing any logs via the kafka sink, and the disk buffer keeps growing. The buffer size is still well under the configured maximum, and there is still space available on the underlying AWS EC2 instance.
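
For context, the agents run with Kubernetes memory limits along these lines (the values below are illustrative, not the exact ones from our deployment); a tight limit like this is what turns a small ingest spike into an OOMKill:

# Illustrative resource settings for the Vector agent DaemonSet
# (placeholder values, not our production ones).
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi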

In the Vector logs I can see this error:

2024-06-12T00:41:37.459225Z ERROR sink{component_kind="sink" component_id=kafka component_type=kafka}: vector_buffers::variants::disk_v2::writer: Last written record was unable to be deserialized. Corruption likely. reason="invalid data: check failed for struct member payload: pointer out of bounds: base 0x7fb675fffff4 offset 1868849526 not in range 0x7fb673e1f000..0x7fb676000000"
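
As a temporary mitigation I have been considering (only a sketch on my side, not verified against this exact failure), the kafka sink's disk buffer could be swapped for a bounded memory buffer so that the corrupted on-disk records are not re-read at startup, at the cost of losing buffered events across restarts:

# Sketch only; the max_events value is a placeholder and would need tuning.
sinks:
  kafka:
    buffer:
      type: memory
      max_events: 10000
      when_full: block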

To sum it up:

  • There was a slight increase in load, which caused Vector to get OOMKilled for some reason
  • When Vector comes back up, it logs the deserialization error shown above
  • The buffer keeps growing once Vector is back online
  • Vector's CPU usage is far too high even though it is not processing anything

Configuration

acknowledgements:
  enabled: true
api:
  address: 0.0.0.0:8686
  enabled: true
  playground: false
data_dir: /vector-data-dir
expire_metrics_secs: 900
sinks:
  kafka:
    batch:
      max_bytes: 1000000
      max_events: 1500
      timeout_secs: 0.5
    bootstrap_servers: kafka:9092
    buffer:
      max_size: 5000000000
      type: disk
      when_full: block
    compression: zstd
    encoding:
      codec: json
    inputs:
    - dedot_keys
    librdkafka_options:
      client.id: vector
      request.required.acks: "1"
    message_timeout_ms: 0
    topic: vector
    type: kafka
  prometheus_exporter:
    address: 0.0.0.0:9090
    buffer:
      max_size: 5000000000
      type: disk
      when_full: block
    flush_period_secs: 60
    inputs:
    - internal_metrics
    type: prometheus_exporter
sources:
  internal_metrics:
    type: internal_metrics
  kubernetes_logs:
    glob_minimum_cooldown_ms: 3000
    ingestion_timestamp_field: ingest_timestamp
    type: kubernetes_logs
transforms:
  dedot_keys:
    inputs:
    - kubernetes_logs
    source: |
      . = map_keys(., recursive: true) -> |key| { replace(key, ".", "_") }
    type: remap

Version

0.38.0

Debug Output

No response

Example Data

No response

Additional Context

Vector is running as an agent on EKS with Kubernetes version 1.29.

Attaching Screenshots

[Five screenshots attached]

References

No response

@ShahroZafar added the type: bug label on Jun 12, 2024
@jszwedko
Member

Hi @ShahroZafar !

Thanks for opening this. I think it is the same issue as #18336 so I'll close this one as a duplicate, but I appreciate all of the additional details you provided in your report!

@jszwedko closed this as not planned (duplicate) on Jun 12, 2024