Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Receiver: cache matchers for series calls #7353

Merged
merged 14 commits into from
Jan 3, 2025

Conversation

pedro-stanaka
Copy link
Contributor

@pedro-stanaka pedro-stanaka commented May 13, 2024

Summary

We have tried caching matchers before with a time-based expiration cache, this time we are trying with LRU cache.

We saw some of our receivers busy with compiling regexes and with high CPU usage, similar to the profile of the benchmark I added here:

image

Benchmark results

Expand!
Result on store-proxy-cache-matchers
BenchmarkProxySeriesRegex-11    	 1545795	       768.7 ns/op	    1144 B/op	      19 allocs/op
BenchmarkProxySeriesRegex-11    	 1548040	       769.4 ns/op	    1144 B/op	      19 allocs/op
BenchmarkProxySeriesRegex-11    	 1545019	       778.3 ns/op	    1144 B/op	      19 allocs/op
BenchmarkProxySeriesRegex-11    	 1539387	       771.1 ns/op	    1144 B/op	      19 allocs/op

Result on main
BenchmarkProxySeriesRegex-11    	  130292	      8803 ns/op	   10288 B/op	      78 allocs/op
BenchmarkProxySeriesRegex-11    	  124045	      8533 ns/op	   10288 B/op	      78 allocs/op
BenchmarkProxySeriesRegex-11    	  125092	      8712 ns/op	   10288 B/op	      78 allocs/op
BenchmarkProxySeriesRegex-11    	  120110	      8676 ns/op	   10288 B/op	      78 allocs/op

The results indicate that the "store-proxy-cache-matchers" branch considerably outperforms the "main" branch in all observed aspects of the BenchmarkProxySeriesRegex function. It is roughly 10 times faster regarding execution time while using about 9 times less memory and making about 4 times fewer allocations per operation. These improvements suggest significant optimizations in the regex handling or related data processing in the "store-proxy-cache-matchers" branch compared to the "main" branch

Changes

  • Adding matcher cache for method MatchersToPromMatchers and a new version which uses the cache.
  • The main change is in matchesExternalLabels function which now receives a cache instance.

Verification

  • I have added tests to the change and new benchmarks.

@pedro-stanaka pedro-stanaka force-pushed the store-proxy-cache-matchers branch from 598b480 to d56e024 Compare May 13, 2024 06:41
@pedro-stanaka pedro-stanaka marked this pull request as ready for review May 13, 2024 08:13
@pedro-stanaka pedro-stanaka marked this pull request as draft May 13, 2024 08:14
@pedro-stanaka pedro-stanaka force-pushed the store-proxy-cache-matchers branch 2 times, most recently from 0528b9c to a58508d Compare May 13, 2024 09:29
@pedro-stanaka pedro-stanaka changed the title Receivers|Store: cache matchers for series calls Receiver: cache matchers for series calls May 13, 2024
@pedro-stanaka pedro-stanaka force-pushed the store-proxy-cache-matchers branch 3 times, most recently from 34e4852 to 3f852a5 Compare May 13, 2024 14:48
@pedro-stanaka pedro-stanaka marked this pull request as ready for review May 14, 2024 13:05
Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The results indicate that the "store-proxy-cache-matchers" branch considerably outperforms the "main" branch in all observed aspects of the BenchmarkProxySeriesRegex function. It is roughly 10 times faster regarding execution time while using about 9 times less memory and making about 4 times fewer allocations per operation. These improvements suggest significant optimizations in the regex handling or related data processing in the "store-proxy-cache-matchers" branch compared to the "main" branch

Was this AI generated? 😄


func (c *MatchersCache) GetOrSet(key LabelMatcher, newItem NewItemFunc) (*labels.Matcher, error) {
c.metrics.requestsTotal.Inc()
if item, ok := c.cache.Get(key); ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using singleflight here to reduce allocations even more

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Did that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kindly as from Cortex: :D

Is it possible to make this interface receive the Prometheus types, instead thanos ones, so we can reuse the same implementation on cortex?

Ex:

GetOrSet(t labels.MatchType, n, v string, newItem NewItemFunc) (*labels.Matcher, error)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alanprot could you link where in Cortex you would use this? I introduced an interface now, which prompb.LabelMatcher implements and made the storepb.LabelMatcher implement it as well. Let me know if that is enough for Cortex to reuse the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok..
I think that works!

But i think the interface definition should be in the storecache package? Other than that i think it would work just fine for us!

@@ -973,6 +986,8 @@ func (rc *receiveConfig) registerFlag(cmd extkingpin.FlagClause) {
"about order.").
Default("false").Hidden().BoolVar(&rc.allowOutOfOrderUpload)

cmd.Flag("matcher-cache-size", "The size of the cache used for matching against external labels. Using 0 disables caching.").Default("0").IntVar(&rc.matcherCacheSize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add this to other components as well like Thanos Store?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do it, I fear making the PR hard to review though.

pkg/store/prometheus.go Outdated Show resolved Hide resolved
}
}

func NewMatchersCache(opts ...MatcherCacheOption) (*MatchersCache, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can just use pkg/cache/inmemory.go? It's another LRU implementation that already exists in the tree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would need to make it generic first, no? Or you mean adding this LRU to also be stored there? Also, I feel like this would introduce the need for the user to configure the cache via YAML configuration in the receiver for example, which would get quite complex.

@yeya24
Copy link
Contributor

yeya24 commented Dec 3, 2024

Hi @pedro-stanaka, do you plan to continue this PR?
We want to add the same cache in Cortex cortexproject/cortex#6382 and think it would be nice to contribute the cache to Thanos directly so that both projects can share the same code.

If you are busy with something else and not planning to come back to this PR, can we create a new PR to add this cache?

@pedro-stanaka
Copy link
Contributor Author

Hi @pedro-stanaka, do you plan to continue this PR? We want to add the same cache in Cortex cortexproject/cortex#6382 and think it would be nice to contribute the cache to Thanos directly so that both projects can share the same code.

If you are busy with something else and not planning to come back to this PR, can we create a new PR to add this cache?

I can take a stab at finishing this up this week probably. Will put on my TODO.

@pedro-stanaka pedro-stanaka force-pushed the store-proxy-cache-matchers branch from f7b7697 to 813b5fe Compare December 5, 2024 09:48
@pull-request-size pull-request-size bot added size/XL and removed size/L labels Dec 5, 2024
@pedro-stanaka
Copy link
Contributor Author

@yeya24 @GiedriusS please take a look at the current version.

Some stuff I think might be good doing, but not sure:

  • We now have feature flags for receivers, should we put this behind a feature flag with a sensible default for the LRU size (whilst still allow users to override)?
  • As Giedrius said, should I already change the Store gateway to use this cache as well?

@pedro-stanaka pedro-stanaka force-pushed the store-proxy-cache-matchers branch 9 times, most recently from f493eb7 to 3a10cf1 Compare December 6, 2024 15:23
@@ -0,0 +1,150 @@
// Copyright (c) The Thanos Authors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this code out of storepb package? storepb sounds more related to the proto itself but this matcher cache can be more generic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved it to the storecache package, which seems a generic cache package.

type MatchersCache interface {
// GetOrSet retrieves a matcher from cache or creates and stores it if not present.
// If the matcher is not in cache, it uses the provided newItem function to create it.
GetOrSet(key LabelMatcher, newItem NewItemFunc) (*labels.Matcher, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Can we take prometheus matcher as input key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we want to convert to a prometheus matcher, I dont see the reason to use it as key. I will create a middle-term matcher that can be represented in a more neutral way. Let's see if with that we can use the cache in Cortex as well.

@pedro-stanaka pedro-stanaka force-pushed the store-proxy-cache-matchers branch 2 times, most recently from 417104d to b2b65a2 Compare December 9, 2024 16:58
@pedro-stanaka pedro-stanaka force-pushed the store-proxy-cache-matchers branch from 0845559 to ae8a59b Compare December 18, 2024 13:14
@pedro-stanaka
Copy link
Contributor Author

@yeya24 @GiedriusS any extra feedback here?

pkg/store/cache/matchers_cache.go Outdated Show resolved Hide resolved
pkg/store/cache/matcher_cache_test.go Outdated Show resolved Hide resolved
pkg/store/cache/matchers_cache.go Outdated Show resolved Hide resolved
yeya24
yeya24 previously approved these changes Dec 23, 2024
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all feedbacks!

func MatcherToPromMatcher(m ConversionLabelMatcher) (*labels.Matcher, error) {
mi, ok := m.(*storepb.LabelMatcher)
if !ok {
return nil, errors.New(fmt.Sprintf("invalid matcher type. Got: %T", m))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe we can just use fmt.Errf()

@yeya24
Copy link
Contributor

yeya24 commented Dec 23, 2024

Unit test discovered a data race, which doesn't seem to be related.

==================
WARNING: DATA RACE
Write at 0x000002eddc60 by goroutine 33:
  github.com/thanos-io/thanos/pkg/query.TestEndpointSetUpdate_StrictEndpointMetadata()
      /home/runner/work/thanos/thanos/pkg/query/endpointset_test.go:470 +0x8a
  testing.tRunner()
      /opt/hostedtoolcache/go/1.23.4/x64/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /opt/hostedtoolcache/go/1.23.4/x64/src/testing/testing.go:1743 +0x44

Previous read at 0x000002eddc60 by goroutine 5016:
  github.com/thanos-io/thanos/pkg/info/infopb.(*StoreInfo).Size()
      /home/runner/work/thanos/thanos/pkg/info/infopb/rpc.pb.go:968 +0x4d
  github.com/thanos-io/thanos/pkg/info/infopb.(*InfoResponse).Size()
      /home/runner/work/thanos/thanos/pkg/info/infopb/rpc.pb.go:936 +0x138
  github.com/thanos-io/thanos/pkg/info/infopb.(*InfoResponse).Marshal()
      /home/runner/work/thanos/thanos/pkg/info/infopb/rpc.pb.go:549 +0x2e
  google.golang.org/protobuf/internal/impl.legacyMarshal()
      /home/runner/go/pkg/mod/google.golang.org/[email protected]/internal/impl/legacy_message.go:411 +0xf9
  google.golang.org/protobuf/proto.MarshalOptions.marshal()
      /home/runner/go/pkg/mod/google.golang.org/[email protected]/proto/encode.go:194 +0x504
  google.golang.org/protobuf/proto.Marshal()
      /home/runner/go/pkg/mod/google.golang.org/[email protected]/proto/encode.go:110 +0x52
  google.golang.org/grpc/encoding/proto.codec.Marshal()
      /home/runner/go/pkg/mod/google.golang.org/[email protected]/encoding/proto/proto.go:47 +0x55
  google.golang.org/grpc/encoding/proto.(*codec).Marshal()
      <autogenerated>:1 +0x53
  google.golang.org/grpc.encode()
      /home/runner/go/pkg/mod/google.golang.org/[email protected]/rpc_util.go:647 +0x6a
  google.golang.org/grpc.(*Server).sendResponse()
      /home/runner/go/pkg/mod/google.golang.org/[email protected]/server.go:1130 +0xc5
  google.golang.org/grpc.(*Server).processUnaryRPC()
      /home/runner/go/pkg/mod/google.golang.org/[email protected]/server.go:1416 +0x2092
  google.golang.org/grpc.(*Server).handleStream()
      /home/runner/go/pkg/mod/google.golang.org/[email protected]/server.go:1780 +0x1824
  google.golang.org/grpc.(*Server).serveStreams.func2.1()
      /home/runner/go/pkg/mod/google.golang.org/[email protected]/server.go:1019 +0x158

@pedro-stanaka
Copy link
Contributor Author

pedro-stanaka commented Dec 24, 2024

Thanks for addressing all feedbacks!

I think I am done now. More flaky tests on CI though

pkg/store/storepb/custom.go Outdated Show resolved Hide resolved
pkg/store/cache/matchers_cache.go Outdated Show resolved Hide resolved
We have tried caching matchers before with a time-based expiration cache, this time we are trying with LRU cache.

We saw some of our receivers busy with compiling regexes and with high CPU usage, similar to the profile of the benchmark I added here:

* Adding matcher cache for method `MatchersToPromMatchers` and a new version which uses the cache.
* The main change is in `matchesExternalLabels` function which now receives a cache instance.

adding matcher cache and refactor matchers

Co-authored-by: Andre Branchizio <[email protected]>

Signed-off-by: Pedro Tanaka <[email protected]>

Using the cache in proxy and tsdb stores (only receiver)

Signed-off-by: Pedro Tanaka <[email protected]>

fixing problem with deep equality

Signed-off-by: Pedro Tanaka <[email protected]>

adding some docs

Signed-off-by: Pedro Tanaka <[email protected]>

Adding benchmark

Signed-off-by: Pedro Tanaka <[email protected]>

undo unecessary changes

Signed-off-by: Pedro Tanaka <[email protected]>

Adjusting metric names

Signed-off-by: Pedro Tanaka <[email protected]>

adding changelog

Signed-off-by: Pedro Tanaka <[email protected]>

wiring changes to the receiver

Signed-off-by: Pedro Tanaka <[email protected]>

Fixing linting

Signed-off-by: Pedro Tanaka <[email protected]>

docs

Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>

Fixing problem with wrong initialization

Signed-off-by: Pedro Tanaka <[email protected]>

Moving interface to storecache package

Signed-off-by: Pedro Tanaka <[email protected]>

remove empty file and fix calls to constructor passing nil;

Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
@pedro-stanaka pedro-stanaka force-pushed the store-proxy-cache-matchers branch from 792c9e1 to 7f2c601 Compare January 3, 2025 13:29
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@yeya24 yeya24 merged commit 626d0e5 into thanos-io:main Jan 3, 2025
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants