
Sampler Configuration Setting to Enable Stratified Sampling #20921

Open
hillmandj opened this issue Jul 24, 2024 · 1 comment · May be fixed by #21274
Labels
transform: sample (anything `sample` transform related) · type: feature (a value-adding code addition that introduces new functionality)

Comments

hillmandj commented Jul 24, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

In many cases I'd like to sample logs from different services at the same rate, but not every service generates the same volume of logs. As a result, there is no way to guarantee a uniform distribution of logs into the sampler, so the rate cannot be applied consistently across services. The current sampler implementation is relatively straightforward: it maintains a count of events and emits an event whenever the count modulo the configured rate evaluates to zero.
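
A minimal sketch of that counting behavior (illustrative only, not Vector's actual implementation; the names are made up):

```rust
/// Simplified rate-based sampler: keep one event out of every `rate`
/// by counting events and taking the modulus.
struct Sampler {
    rate: u64, // assumed >= 1
    count: u64,
}

impl Sampler {
    fn new(rate: u64) -> Self {
        Self { rate, count: 0 }
    }

    /// Returns true when the current event should be emitted.
    fn sample(&mut self) -> bool {
        let keep = self.count % self.rate == 0;
        self.count = self.count.wrapping_add(1);
        keep
    }
}

fn main() {
    let mut s = Sampler::new(100);
    let kept = (0..1_000).filter(|_| s.sample()).count();
    assert_eq!(kept, 10); // exactly 1 in 100 events pass
}
```

Because the count is shared across every event the transform sees, a high-volume service effectively sets the cadence for everyone, which is the distribution problem described above.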

The current workaround is to create a separate sampler for each service (or input stream), but that means adding many different files that all have essentially the same configuration.

Ideally, I could set up a configuration with an optional segment_by key (or some other name), so that logs sharing a value for the referenced field maintain their own count and are sampled independently of logs with different values:

type: sample
inputs:
  - log_data_ingress
segment_by: "{{ .service_name }}" # or some other field for the given event
rate: 100

Attempted Solutions

No response

Proposal

I am not a Rust expert, but in my view this could essentially be implemented with a hashmap, where each unique value of segment_by is a key and its count is the value. When incrementing a count, logic similar to what exists today could be applied by checking whether the count for the given key passes the modulo check described earlier.
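
A rough sketch of that idea, assuming the `segment_by` field has already been resolved to a string per event (types and names are illustrative, not Vector internals):

```rust
use std::collections::HashMap;

/// Stratified sampler sketch: one counter per unique `segment_by`
/// value, so each segment is sampled at `rate` independently.
struct StratifiedSampler {
    rate: u64, // assumed >= 1
    counts: HashMap<String, u64>,
}

impl StratifiedSampler {
    fn new(rate: u64) -> Self {
        Self { rate, counts: HashMap::new() }
    }

    /// `segment` is the event's value for the `segment_by` field,
    /// e.g. its service name. Returns true if the event should pass.
    fn sample(&mut self, segment: &str) -> bool {
        let count = self.counts.entry(segment.to_string()).or_insert(0);
        let keep = *count % self.rate == 0;
        *count += 1;
        keep
    }
}

fn main() {
    let mut s = StratifiedSampler::new(100);
    assert!(s.sample("checkout"));  // first event of its segment passes
    assert!(s.sample("auth"));      // independent counter, also passes
    assert!(!s.sample("checkout")); // 1 % 100 != 0, dropped
}
```

One caveat with this approach: the map grows with the cardinality of the segment field, so a real implementation would likely want a bound or eviction policy for high-cardinality keys.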

References

Version

0.39.0

hillmandj added the type: feature label Jul 24, 2024
jszwedko (Member) commented

Thanks for this request @hillmandj. It has come up before in discussions, but I don't think we had a dedicated issue for it. I think calling the option group_by would be consistent with the naming of similar options in other transforms (like reduce).
