Add prometheus.relabel external cache proposal #1600
Open: wilfriedroset wants to merge 4 commits into grafana:main from wilfriedroset:prometheus-relabel-external-cache

Commits:

- 74887d9 Add prometheus.relabel external cache proposal (wilfriedroset)
- 4660ad7 PR review follow-up updates (wilfriedroset)
- 8d68d6b Add GetMultiple and SetMultiple to Cache interface (wilfriedroset)
- 8684ddd Clarify in-memory cache depreciation (wilfriedroset)

# Proposal: Prometheus relabel external cache

- Author(s): Wilfried ROSET (@wilfriedroset), Pierre BAILHACHE (@pbailhache)
- Last updated: 2024-09-02
- Original issue: <https://github.com/grafana/alloy/issues/1600>

## Abstract

This proposal introduces a way to configure the `prometheus.relabel` component to use an external cache such as `Redis` or `Memcached` instead of the default in-memory cache.

## Problem

The `prometheus.relabel` component rewrites the label set of each metric passed along to the exported receiver by applying one or more relabeling rules. To do so it uses a relabeling cache stored in memory, and Alloy creates a dedicated cache for each `prometheus.relabel` component. On its own this is not a huge issue, since multiple rules can be registered in a single component so that they share its cache.

However, if you horizontally scale an Alloy deployment behind a load balancer, you end up with one cache per Alloy instance. Those local caches overlap, which increases the memory footprint of every instance.
Moreover, the cache is tied to the pod, which means that a new Alloy process starts with an empty cache.

For a couple of horizontally scaled Alloy pods this is acceptable, but it does not scale sustainably to a large number of instances processing data in parallel.

## Proposal

Allow `prometheus.relabel` to use `Redis` or `Memcached` as its cache instead of the in-memory one.

We would use the [dskit](https://github.com/grafana/dskit/blob/main/cache/cache.go) cache code to manage connections and client configuration.

We could introduce an interface so that the component's logic stays unchanged and the kind of cache in use is abstracted away from the component.
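
To illustrate the intent, here is a minimal sketch of how the component's hot path could stay backend-agnostic. It assumes a trimmed-down version of the cache interface proposed under Implementation below; the component fields, the key derivation and the empty-labels convention for dropped series are hypothetical, not part of the proposal.

```golang
package relabelsketch

import (
	"strconv"
	"time"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/model/relabel"
)

// relabelCache is a trimmed-down stand-in for the Cache interface proposed
// under Implementation; the concrete backend (in-memory, Redis, Memcached)
// is hidden behind it.
type relabelCache interface {
	Get(key string) (*labels.Labels, error)
	Set(key string, value *labels.Labels, ttl time.Duration) error
}

// Component fields are hypothetical and only show where the cache plugs in.
type Component struct {
	cache    relabelCache
	rules    []*relabel.Config
	cacheTTL time.Duration
}

// relabel returns the rewritten label set and whether the sample is kept.
// The logic is identical whichever backend is configured.
func (c *Component) relabel(lbls labels.Labels) (labels.Labels, bool) {
	key := strconv.FormatUint(lbls.Hash(), 10)

	// Cache hit: an empty label set marks a series dropped by the rules.
	if cached, err := c.cache.Get(key); err == nil && cached != nil {
		return *cached, !cached.IsEmpty()
	}

	// Cache miss: apply the rules, then remember the outcome.
	newLbls, keep := relabel.Process(lbls, c.rules...)
	if !keep {
		newLbls = labels.EmptyLabels()
	}
	_ = c.cache.Set(key, &newLbls, c.cacheTTL)
	return newLbls, keep
}
```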

## Pros and cons

**Pros:**

- No logic change, so the impact on existing users is expected to be nil
- Possibility to use an external cache when needed, including different caches for different relabeling components
- Easy to reuse in other parts of the codebase

**Cons:**

- Configuration is slightly more complex than the previous one

## Alternative solutions

The alternative is to do nothing, should we deem this improvement unnecessary.

## Compatibility

This proposal deprecates the old way of configuring the in-memory cache and drops it in the next major release (e.g. 2.0). Doing so leaves time to migrate the settings to the new `cache` block.

## Implementation

We should add a reusable cache interface compatible with multiple backends; it should be usable with different value types:

```golang
type Cache[valueType any] interface {
	Get(key string) (*valueType, error)
	Set(key string, value *valueType, ttl time.Duration) error
	Remove(key string)
	GetMultiple(keys []string) ([]*valueType, error)
	SetMultiple(values map[string]*valueType, ttl time.Duration) error
	Clear(newSize int) error
	GetCacheSize() int
}
```

Each cache backend should be implemented in a separate file. For now we should support `in_memory`, `redis` and `memcached`.
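
For illustration, a minimal sketch of what the in-memory backend could look like, assuming it wraps an LRU such as `github.com/hashicorp/golang-lru/v2`; the type name, the nil-pointer convention for misses and the absence of TTL support are assumptions, not part of the proposal.

```golang
package cache

import (
	"time"

	lru "github.com/hashicorp/golang-lru/v2"
)

// InMemoryCache is a sketch of the in-memory backend. TTLs are ignored here
// because the plain LRU evicts by size only.
type InMemoryCache[valueType any] struct {
	lru *lru.Cache[string, *valueType]
}

func NewInMemoryCache[valueType any](size int) (*InMemoryCache[valueType], error) {
	c, err := lru.New[string, *valueType](size)
	if err != nil {
		return nil, err
	}
	return &InMemoryCache[valueType]{lru: c}, nil
}

func (c *InMemoryCache[valueType]) Get(key string) (*valueType, error) {
	value, _ := c.lru.Get(key)
	return value, nil // a nil value signals a miss
}

func (c *InMemoryCache[valueType]) Set(key string, value *valueType, _ time.Duration) error {
	c.lru.Add(key, value)
	return nil
}

func (c *InMemoryCache[valueType]) Remove(key string) {
	c.lru.Remove(key)
}

func (c *InMemoryCache[valueType]) GetMultiple(keys []string) ([]*valueType, error) {
	values := make([]*valueType, 0, len(keys))
	for _, key := range keys {
		value, _ := c.lru.Get(key)
		values = append(values, value)
	}
	return values, nil
}

func (c *InMemoryCache[valueType]) SetMultiple(values map[string]*valueType, _ time.Duration) error {
	for key, value := range values {
		c.lru.Add(key, value)
	}
	return nil
}

func (c *InMemoryCache[valueType]) Clear(newSize int) error {
	c.lru.Purge()
	c.lru.Resize(newSize)
	return nil
}

// GetCacheSize returns the number of cached entries.
func (c *InMemoryCache[valueType]) GetCacheSize() int {
	return c.lru.Len()
}
```

The Redis and Memcached backends would implement the same interface on top of the dskit clients mentioned above.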

We should add a way to select the cache and its connection options through the component's arguments.

For example, based on what's done in the [Mimir index cache](https://github.com/grafana/mimir/blob/main/pkg/storage/tsdb/index_cache.go#L47):

```golang
type Arguments struct {
	// Where the relabelled metrics should be forwarded to.
	ForwardTo []storage.Appendable `alloy:"forward_to,attr"`

	// The relabelling rules to apply to each metric before it's forwarded.
	MetricRelabelConfigs []*alloy_relabel.Config `alloy:"rule,block,optional"`

	// DEPRECATED Use type = inmemory and cache_size field.
	InMemoryCacheSizeDeprecated int `alloy:"max_cache_size,attr,optional"`

	// Cache backend configuration.
	CacheConfig cache.CacheConfig `alloy:"cache,block,optional"`
}

type CacheConfig struct {
	cache.BackendConfig `yaml:",inline"`
	InMemory            InMemoryCacheConfig `yaml:"inmemory"`
}

type InMemoryCacheConfig struct {
	CacheSize int `yaml:"cache_size"`
}

// ------

type BackendConfig struct {
	Backend   string                `yaml:"backend"`
	Memcached MemcachedClientConfig `yaml:"memcached"`
	Redis     RedisClientConfig     `yaml:"redis"`
}
```

The configuration should default to the `in_memory` backend.
`max_cache_size` should still be taken into account, but only when `backend = in_memory`. It should also be deprecated, redirecting users to the `InMemoryCacheConfig.CacheSize` field.
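
As a sketch of that migration path, and assuming the `Arguments` and `CacheConfig` types above plus a `Validate` hook on the arguments (all names illustrative), the deprecated attribute could be folded in like this:

```golang
// Sketch only: fold the deprecated max_cache_size attribute into the new
// cache block. Backend and field names are illustrative.
func (a *Arguments) Validate() error {
	// in_memory stays the default when no cache block is provided.
	if a.CacheConfig.Backend == "" {
		a.CacheConfig.Backend = "in_memory"
	}

	// Honour the legacy attribute only for the in-memory backend, and only
	// when the new field is not set explicitly.
	if a.InMemoryCacheSizeDeprecated > 0 &&
		a.CacheConfig.Backend == "in_memory" &&
		a.CacheConfig.InMemory.CacheSize == 0 {
		a.CacheConfig.InMemory.CacheSize = a.InMemoryCacheSizeDeprecated
	}
	return nil
}
```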

Here are some examples:

- Legacy config, unchanged:

```river
prometheus.relabel "legacy_config" {
  forward_to     = [...]
  max_cache_size = 10000000
  rule {
    ...
  }
}
```

- Redis config:

```river
prometheus.relabel "redis_config" {
  forward_to = [...]
  cache {
    backend = "redis"
    redis {
      endpoint = "redis.url"
      username = "user"
      password = "password"
      ...
    }
  }
  ...
}
```

- New in-memory config:

```river
prometheus.relabel "inmemory_config" {
  forward_to = [...]
  cache {
    backend = "inmemory"
    inmemory {
      cache_size = 10000000
    }
  }
  ...
}
```

- Memcached config:

```river
prometheus.relabel "memcached_config" {
  forward_to = [...]
  cache {
    backend = "memcached"
    memcached {
      addresses = "address1, address2"
      timeout   = 10
      ...
    }
  }
  ...
}
```

## Related open issues

N/A

Instead of a bespoke redis/memcached action, how about a redis and memcached component that exports a key value interface? Along with generic accessors. I could see this being used for more than just relabel. Much in the same way we allow components to consume strings but those strings can come from vault, file, http etc.

Could even get clever and allow prometheus.relabel to have a list of caches: if there are problems with redis, then fall back to in memory.

I definitely like the idea of having a unified way to define the caching mechanism.
You are correct, but this adds extra complexity to take into account. A highly available external cache is indeed recommended, but having a fallback is definitely a good idea.
The extra complexity appears when the external cache becomes unavailable and Alloy needs to use the in-memory one. This could lead to OOM, e.g. you planned to use an external cache, you fall back to in-memory, but your pod's limits are not high enough.
Alloy then needs to check whether the external storage is available again. Once it is, Alloy needs to move its in-memory k/v pairs to the external cache, but we are exposed to a race when two Alloy instances use the same cache.
That being said, I'm not that familiar with the code base yet, and your proposal seems harder to achieve.
What do you think about extending our proposal to `loki.relabel` and standardizing the whole thing, as prometheus and loki are the main components using a cache? This would serve as a first step toward your proposal.
Once done, the associated work to create the storage components, such as `storage.redis`, could be picked up right after the initial proposal.

Fallback caches are likely a stretch, though I do feel the kv store should be a generic component. As for the interface, I would go with something like the below. Disclaimer: this is getting into the deeper technical ideas. The reason keys might want to be strings is to mirror how alloy uses maps, where the key is always a string.
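
For illustration only, a rough sketch of the kind of string-keyed, generic interface such a component could export (hypothetical names, not taken from this thread):

```golang
package kvsketch

import "time"

// KVStore is a hypothetical sketch of a generic, string-keyed store that a
// redis or memcached component could export and that other components
// (relabel or otherwise) could consume.
type KVStore[V any] interface {
	Get(key string) (value V, found bool, err error)
	Put(key string, value V, ttl time.Duration) error
	Delete(key string) error
}
```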

You shouldn't have to specifically use that for the first round. Capsules are mainly for when you want to bridge into it being accessible within the alloy config. In this case you're likely passing an interface/pointer from go code to go code, which should work without a problem.

I've updated the proposal to reflect what we have been discussing.

I also think we will likely need `GetMultiple(keys []string) ([]Results)` and `SetMultiple`. Imagine a scenario where you are scraping a million records: hitting redis each time is going to be pretty expensive at one request for each.
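
To illustrate why batching matters on this path, here is a hedged sketch of a batched lookup built on the proposal's `GetMultiple`/`SetMultiple`; the surrounding function is hypothetical and assumes the generic `Cache` interface from the proposal plus the empty-labels convention for dropped series used in the earlier sketch.

```golang
package relabelsketch

import (
	"strconv"
	"time"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/model/relabel"
)

// relabelBatch resolves a whole scrape batch with two cache round trips
// (one GetMultiple, one SetMultiple) instead of one request per series.
// Cache[labels.Labels] refers to the generic interface from the proposal.
func relabelBatch(c Cache[labels.Labels], rules []*relabel.Config, batch []labels.Labels, ttl time.Duration) ([]labels.Labels, error) {
	keys := make([]string, len(batch))
	for i, lbls := range batch {
		keys[i] = strconv.FormatUint(lbls.Hash(), 10)
	}

	// One round trip instead of one Get per series.
	cached, err := c.GetMultiple(keys)
	if err != nil {
		return nil, err
	}

	out := make([]labels.Labels, len(batch))
	misses := make(map[string]*labels.Labels, len(batch))
	for i, entry := range cached {
		if entry != nil {
			out[i] = *entry
			continue
		}
		// Miss: apply the rules and queue the result for a single SetMultiple.
		rewritten, keep := relabel.Process(batch[i], rules...)
		if !keep {
			rewritten = labels.EmptyLabels()
		}
		out[i] = rewritten
		misses[keys[i]] = &rewritten
	}

	// One round trip to store every miss.
	if len(misses) > 0 {
		if err := c.SetMultiple(misses, ttl); err != nil {
			return nil, err
		}
	}
	return out, nil
}
```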
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.

@mattdurham I've tried to take your feedback into account; the PR should be fine. What do you think about it?
We have started to work on the code with @pbailhache. What would be the next step regarding this proposal? Should we merge it if it is to your liking and iterate on the code?
p.s: I don't know what to do about the build 😅

Just opened a draft pull request for the feature: #1692
This is still a work in progress and I'm ready to iterate on it.