Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v2] Implement sanitizers to operate on OTLP data #5551

Closed
wants to merge 22 commits into from

Conversation

varshith257
Copy link
Contributor

@varshith257 varshith257 commented Jun 10, 2024

Which problem is this PR solving?

Description of the changes

  • Rewritten sanitizers for v2 with the existing sanitizers to achieve parity with v1

How was this change tested?

  • Not Yet

Checklist

Copy link

codecov bot commented Jun 10, 2024

Codecov Report

Attention: Patch coverage is 0% with 64 lines in your changes missing coverage. Please review.

Project coverage is 96.27%. Comparing base (65f6cb9) to head (1062ba0).
Report is 180 commits behind head on main.

Files with missing lines Patch % Lines
...orters/storageexporter/sanitizer/utf8_sanitizer.go 0.00% 40 Missing ⚠️
...exporter/sanitizer/empty_service_name_sanitizer.go 0.00% 13 Missing ⚠️
...l/exporters/storageexporter/sanitizer/sanitizer.go 0.00% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5551      +/-   ##
==========================================
- Coverage   96.65%   96.27%   -0.38%     
==========================================
  Files         342      345       +3     
  Lines       16519    16583      +64     
==========================================
  Hits        15966    15966              
- Misses        363      427      +64     
  Partials      190      190              
Flag Coverage Δ
badger_v1 8.05% <ø> (ø)
badger_v2 1.81% <ø> (ø)
cassandra-3.x-v1 16.61% <ø> (ø)
cassandra-3.x-v2 1.74% <ø> (ø)
cassandra-4.x-v1 16.61% <ø> (ø)
cassandra-4.x-v2 1.74% <ø> (ø)
elasticsearch-6.x-v1 18.78% <ø> (ø)
elasticsearch-7.x-v1 18.84% <ø> (ø)
elasticsearch-8.x-v1 19.02% <ø> (ø)
elasticsearch-8.x-v2 1.81% <ø> (+0.01%) ⬆️
grpc_v1 9.52% <ø> (ø)
grpc_v2 7.14% <ø> (+0.01%) ⬆️
kafka 9.74% <ø> (ø)
memory_v2 1.81% <ø> (ø)
opensearch-1.x-v1 18.89% <ø> (+0.01%) ⬆️
opensearch-2.x-v1 18.89% <ø> (ø)
opensearch-2.x-v2 1.81% <ø> (ø)
unittests 94.71% <0.00%> (-0.37%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@varshith257
Copy link
Contributor Author

varshith257 commented Jun 10, 2024

@yurishkuro @albertteoh PTAL. I have rewritten the initial configuration of existing sanitizers to operate on OTLP data format. I have implemented them based on my understanding of OTLP data. Correct me, if I am wrong with understanding of OTLP data

Signed-off-by: Vamshi Maskuri <[email protected]>
@varshith257 varshith257 changed the title [WIP] Implement v2 sanitizers [v2] Implement sanitizers to operate on OTLP data Jun 10, 2024
Signed-off-by: Vamshi Maskuri <[email protected]>
cmd/collector/app/sanitizer_v2/service_name_sanitizer.go Outdated Show resolved Hide resolved
cmd/collector/app/sanitizer_v2/sanitizer.go Outdated Show resolved Hide resolved
// sanitizeEmptyServiceName sanitizes the service names in the span attributes.
func sanitizeEmptyServiceName(span ptrace.Span) ptrace.Span {
attributes := span.Attributes()
serviceNameAttr, ok := attributes.Get("service.name")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

service name is stored in the Resource attributes, not in the span.

Copy link
Contributor Author

@varshith257 varshith257 Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurishkuro span.Resource undefined (type ptrace.Span has no field or method Resource)compilerMissingFieldOrMethod

This means we need to access the resource from the ptrace.ResourceSpans. I think we need to work with ptrace.Traces instead of individual ptrace.Span and access the ResourceSpans which contains the Resource and associated ScopeSpans. Then we need to iterate through each ScopeSpans and their Spans. Am I in the right direction?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to work with ptrace.Traces

@james-ryans this is an interesting situation. In v1 pipeline all non-span data (like batch.Process) was first denormalized into each span so that the unit of work in the pipeline was a single span. In v2 storage API we want to support batching for better efficiency, but it would also mean that unless storage supports storing normalized data (like storing resource / scope records separately, which is unlikely in k/v stores world), each storage would need to do its own denormalization. It also creates challenges to shared capabilities like sanitizers here, since they need to operate on data uniformly before the storage is called.

We can provide a shared lib to do the flattening / denormalization of resources[]/scopes[]/spans[] into just spans[] (essentially otlp2jaeger does that today already), but I don't think it's a good solution considering that OTLP spans cannot distinguish between resource/scope/span attributes without some magic prefixes / naming. In v1 model we had a clear slot Span.Process to maintain that distinction. I am thinking that in v2 storage implementations their internal data model should also be Span{ Resource=..., Scope=..., rest of span }, to avoid unnecessary flattening.

Coming back to sanitizers, I think Sanitize(pdata.Traces) is a good starting point signature. Some sanitizers would only look at Resources in that input, while others might look at everything. The alternative would be to have distinct sanitizers for Resource, Scope, and Span, but I don't think we need this complexity at this point.

@varshith257 does this answer your question?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this answer your question?

Yes. Thanks for it

Signed-off-by: Vamshi Maskuri <[email protected]>
@varshith257 varshith257 marked this pull request as ready for review June 11, 2024 03:03
@varshith257 varshith257 requested a review from a team as a code owner June 11, 2024 03:03
@varshith257
Copy link
Contributor Author

@yurishkuro PTAL and if it's ok, I will move ahead to add tests for them

varshith257 and others added 7 commits June 12, 2024 07:26
Signed-off-by: Vamshi Maskuri <[email protected]>
Signed-off-by: Vamshi Maskuri <[email protected]>
Signed-off-by: Vamshi Maskuri <[email protected]>
Signed-off-by: Vamshi Maskuri <[email protected]>
Signed-off-by: Vamshi Maskuri <[email protected]>
@varshith257
Copy link
Contributor Author

@yurishkuro PTAL

@yurishkuro yurishkuro added v2 changelog:exprimental Change to an experimental part of the code labels Jun 12, 2024
…zer.go

Co-authored-by: Yuri Shkuro <[email protected]>
Signed-off-by: Vamshi Maskuri <[email protected]>
)

// NewUTF8Sanitizer creates a UTF8 sanitizer.
func NewUTF8Sanitizer() SanitizeTraces {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a test

type SanitizeTraces func(traces ptrace.Traces) ptrace.Traces

// NewStandardSanitizers are automatically applied by SpanProcessor.
func NewStandardSanitizers() []SanitizeTraces {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to call this from the exporter

for i := 0; i < resourceSpans.Len(); i++ {
resourceSpan := resourceSpans.At(i)
attributes := resourceSpan.Resource().Attributes()
serviceNameAttr, ok := attributes.Get("service.name")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"service.name"

better to import semantic conventions and use a constant from there

if !utf8.ValidString(k) {
originalKey := k
attributes.PutStr(invalidTagKey, k)
byteSlice := attributes.PutEmptyBytes(badUTF8Prefix + originalKey)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens to attributes.Range if you add attributes in the lambda function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifying the map while iterating over it can lead to undefined behavior. Better, we should first collect the keys and values that need to be sanitized and then apply these changes after the iteration is complete.

varshith257 and others added 5 commits June 12, 2024 21:50
Signed-off-by: Vamshi Maskuri <[email protected]>
Co-authored-by: Yuri Shkuro <[email protected]>
Signed-off-by: Vamshi Maskuri <[email protected]>
Signed-off-by: Vamshi Maskuri <[email protected]>
Signed-off-by: Vamshi Maskuri <[email protected]>
for i := 0; i < resourceSpans.Len(); i++ {
resourceSpan := resourceSpans.At(i)

// Sanitize resource attributes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments like this provide no value

for _, k := range invalidKeys {
originalKey := k
attributes.PutStr(invalidTagKey, k)
binaryAttr := attributes.PutEmptyBytes(badUTF8Prefix + originalKey)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the originalKey is a bad UTF8 string, you are still creating a new key with the same bad string


for k, v := range invalidValues {
originalValue := v
attributes.PutStr(k, invalidTagKey)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this achieve? I suggest you write a specification in English in the comments to the sanitizer what exactly it's trying to do with attributes. And I think starting with the unit tests in this case would be much more useful, because you can see exactly how input and output data are related. For example, I expect the following to be true:

traces1 := inputData // assume it has bad UTF8 strings
traces2 := sanitizeUTF8(traces1) 
assert traces1 != traces2 // definitely changed something
traces3 := sanitizeUTF8(traces2)
assert traces2 == traces3 // the 2nd pass is no-op, since no more invalid data is present

@varshith257 varshith257 reopened this Jun 29, 2024
@varshith257
Copy link
Contributor Author

varshith257 commented Jun 29, 2024

Will complete this PR very soon( I had exams, so i hadn't done progress on this)

@Manoramsharma
Copy link
Contributor

Hi @varshith257 If you are not working on this can I take this from here?

@yurishkuro
Copy link
Member

fixed in #6078

@yurishkuro yurishkuro closed this Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog:exprimental Change to an experimental part of the code v2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants