Add prometheus.relabel external cache proposal #1600

**File:** `docs/design/1600-prometheus-relabel-external-cache.md` (172 additions)

# Proposal: Prometheus relabel external cache

- Author(s): Wilfried ROSET (@wilfriedroset), Pierre BAILHACHE (@pbailhache)
- Last updated: 2024-09-02
- Original issue: <https://github.com/grafana/alloy/issues/1600>

## Abstract

This proposal introduces a way to configure the `prometheus.relabel` component so that it can use an external cache such as `Redis` or `Memcached` instead of the `in-memory` one.

## Problem

The `prometheus.relabel` component rewrites the label set of each metric passed along to the exported receiver by applying one or more relabeling rules. To do so it uses a relabeling cache stored in memory. For each `prometheus.relabel` component, Alloy creates a dedicated relabel cache. This is not a huge issue per se, because multiple rules can be registered in one component so that they share the same cache.

However, if you're horizontally scaling an Alloy deployment with load balancing, you will end up with one cache per Alloy instance, and those local caches will overlap, increasing the footprint of each instance.
Moreover, the cache is tied to the pod, which means that a new Alloy process starts with an empty cache.

For a couple of horizontally scaled Alloy pods this is acceptable, but it does not scale to a large number of instances processing data in parallel.

## Proposal

Allow using `Redis` or `Memcached` as the cache instead of the `in-memory` one.

Use [dskit](https://github.com/grafana/dskit/blob/main/cache/cache.go) to manage connections and client configuration.

We could create an interface to avoid changing the code logic and to abstract the kind of cache used away from the component.
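
For illustration only, here is a rough sketch of what such an abstraction could look like for a remote backend. The `remoteClient` interface below is a hypothetical stand-in for a dskit cache client (the real dskit method set and signatures may differ), and JSON serialization is just an example choice; the methods match the typed interface proposed in the Implementation section.

```golang
// Sketch only. Requires "context", "encoding/json" and "time".

// remoteClient is a hypothetical stand-in for a dskit cache client
// (Memcached or Redis); the actual dskit interface may differ.
type remoteClient interface {
    GetMulti(ctx context.Context, keys []string) map[string][]byte
    SetAsync(key string, value []byte, ttl time.Duration)
}

// remoteCache adapts a byte-oriented remote client to the typed cache
// interface proposed below, serializing values as JSON.
type remoteCache[valueType any] struct {
    client remoteClient
}

func (c *remoteCache[valueType]) Get(key string) (*valueType, error) {
    found := c.client.GetMulti(context.Background(), []string{key})
    raw, ok := found[key]
    if !ok {
        return nil, nil // cache miss
    }
    value := new(valueType)
    if err := json.Unmarshal(raw, value); err != nil {
        return nil, err
    }
    return value, nil
}

func (c *remoteCache[valueType]) Set(key string, value *valueType, ttl time.Duration) error {
    raw, err := json.Marshal(value)
    if err != nil {
        return err
    }
    c.client.SetAsync(key, raw, ttl) // buffered, fire-and-forget write
    return nil
}
```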

## Pros and cons

**Pros:**

- No logic change, so the impact on users is expected to be nil
- Possibility to use an external cache if needed, including different caches for different relabel components
- Will be easy to reuse in other parts of the codebase

**Cons:**

- Configuration is a bit more complex than the previous one

## Alternative solutions

The alternative is to do nothing, if this improvement is deemed unnecessary.

## Compatibility

This proposal offers to deprecate the old way of configuring the in-memory cache for a couple of releases, then drop it in release+2. Doing so allows migrating the settings to the correct block.

**@tpaschalis** (Member), Sep 13, 2024:
I think our accepted backwards compatibility guarantees go against this. Any breaking changes for stable functionality would have to wait until v2.0.

Is there any reason for wanting to deprecate it? We can't control the pace of users adopting new versions, so IMO doing our best to avoid breaking their configurations and use cases would be the way to go.

**Author**:
We don't want to deprecate anything sooner than what is guaranteed.
I will rephrase to say that we keep both for compatibility until the next major release.


## Implementation

We should add a reusable cache interface compatible with multiple backends; it should be usable with different value types.

```golang
type Cache[valueType any] interface {
    Get(key string) (*valueType, error)
    Set(key string, value *valueType, ttl time.Duration) error
    Remove(key string)
    GetMultiple(keys []string) ([]*valueType, error)
    SetMultiple(values map[string]*valueType, ttl time.Duration) error
    Clear(newSize int) error
    GetCacheSize() int
}
```

Each cache backend should be implemented in a separate file. For now, we should support `inmemory`, `redis`, and `memcached`.
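
As an illustration, here is a minimal map-based sketch of the in-memory backend. It is a sketch only: it ignores TTLs, omits `GetMultiple`/`SetMultiple`, and the real backend would presumably keep the current LRU eviction behaviour (it uses the `sync` and `time` packages).

```golang
// inMemoryCache is a minimal sketch of the in-memory backend.
type inMemoryCache[valueType any] struct {
    mut     sync.RWMutex
    entries map[string]*valueType
}

func newInMemoryCache[valueType any](size int) *inMemoryCache[valueType] {
    return &inMemoryCache[valueType]{entries: make(map[string]*valueType, size)}
}

func (c *inMemoryCache[valueType]) Get(key string) (*valueType, error) {
    c.mut.RLock()
    defer c.mut.RUnlock()
    return c.entries[key], nil // nil means cache miss
}

func (c *inMemoryCache[valueType]) Set(key string, value *valueType, ttl time.Duration) error {
    c.mut.Lock()
    defer c.mut.Unlock()
    c.entries[key] = value // ttl is ignored in this sketch
    return nil
}

func (c *inMemoryCache[valueType]) Remove(key string) {
    c.mut.Lock()
    defer c.mut.Unlock()
    delete(c.entries, key)
}

func (c *inMemoryCache[valueType]) Clear(newSize int) error {
    c.mut.Lock()
    defer c.mut.Unlock()
    c.entries = make(map[string]*valueType, newSize)
    return nil
}

func (c *inMemoryCache[valueType]) GetCacheSize() int {
    c.mut.RLock()
    defer c.mut.RUnlock()
    return len(c.entries)
}
```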

We should add a way to select the cache and its connection options through the component's arguments.

For example, based on what's done in [Mimir index cache](https://github.com/grafana/mimir/blob/main/pkg/storage/tsdb/index_cache.go#L47):

```golang
type Arguments struct {
    // Where the relabelled metrics should be forwarded to.
    ForwardTo []storage.Appendable `alloy:"forward_to,attr"`

    // The relabelling rules to apply to each metric before it's forwarded.
    MetricRelabelConfigs []*alloy_relabel.Config `alloy:"rule,block,optional"`

    // DEPRECATED: use backend = "inmemory" and the cache_size field instead.
    InMemoryCacheSizeDeprecated int `alloy:"max_cache_size,attr,optional"`

    // Cache backend configuration.
    CacheConfig cache.CacheConfig `alloy:"cache,block,optional"`
}

type CacheConfig struct {
    cache.BackendConfig `yaml:",inline"`
    InMemory            InMemoryCacheConfig `yaml:"inmemory"`
}

type InMemoryCacheConfig struct {
    CacheSize int `yaml:"cache_size"`
}

// Defined in the cache package:
type BackendConfig struct {
    Backend   string                `yaml:"backend"`
    Memcached MemcachedClientConfig `yaml:"memcached"`
    Redis     RedisClientConfig     `yaml:"redis"`
}
```

Configuration should default to the `inmemory` backend.
`max_cache_size` should still be taken into account, but only if `backend = "inmemory"`. It should also be deprecated, and we should redirect users to the `InMemoryCacheConfig.CacheSize` field.
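
One possible way to honour the deprecated attribute is a small validation helper, sketched below. The helper name and exact behaviour are illustrative only and assume the `Arguments` and `CacheConfig` structs above (it uses the `fmt` package).

```golang
// applyDeprecatedCacheSize maps the legacy max_cache_size attribute onto the
// new cache block. Hypothetical helper, shown for illustration only.
func applyDeprecatedCacheSize(args *Arguments) error {
    if args.InMemoryCacheSizeDeprecated == 0 {
        // Legacy attribute not set, nothing to do.
        return nil
    }
    if args.CacheConfig.Backend != "" && args.CacheConfig.Backend != "inmemory" {
        return fmt.Errorf("max_cache_size is only supported with the inmemory backend, got %q", args.CacheConfig.Backend)
    }
    args.CacheConfig.Backend = "inmemory"
    if args.CacheConfig.InMemory.CacheSize == 0 {
        args.CacheConfig.InMemory.CacheSize = args.InMemoryCacheSizeDeprecated
    }
    return nil
}
```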

Here are some examples:

- legacy config unchanged

```river
prometheus.relabel "legacy_config" {
forward_to = [...]
max_cache_size = 10000000
rule {
...
}

}
```

- redis config

```river
prometheus.relabel "redis_config" {
forward_to = [...]
cache {
backend = "redis"
redis {

**Collaborator**:
Instead of a bespoke redis/memcached action, how about redis and memcached components that export a key-value interface, along with generic accessors? I could see this being used for more than just relabel, much in the same way we allow components to consume strings that can come from vault, file, http, etc.

**Collaborator**:
storage.redis "relabel" {
  url = "example.com"
}

prometheus.relabel "relabel" {
  cache = storage.redis.relabel.cache
}

// Just for fun
prometheus.remote_write "mimir" {
  endpoint {
    external_labels = {
      cluster = coalesce(storage.redis.relabel["cluster"],"unknown")
    }
  }
}

**Collaborator**:
Could even get clever and allow `prometheus.relabel` to have a list of caches: if there are problems with redis, then fall back to in-memory.

prometheus.relabel "relabel" {
  caches = [storage.redis.relabel.cache,storage.memory.relabel]
}

**Author**:
I definitely like the idea of having a unified way to define the caching mechanism.

> if there are problems with redis then fallback to in memory

You are correct, but this adds extra complexity to take into account. A highly available external cache is indeed recommended, but having a fallback is definitely a good idea.

The extra complexity comes from the fact that when the external cache becomes unavailable, Alloy needs to use the in-memory one. This could lead to OOM, e.g. you planned to use an external cache, then you fall back on in-memory but the limits of your pod are not high enough.
Then Alloy needs to check if the external storage is available again. Once it is, Alloy needs to move its in-memory k/v to the external cache. But we are exposed to a race when two Alloy instances use the same cache.

That being said, I'm not that familiar with the code base yet, and your proposal seems harder to achieve.

What do you think about extending our proposal to `loki.relabel` and standardizing the whole thing, as prometheus and loki are the main components using a cache?
This would serve as a first step toward your proposal. Something like so:

```golang
type RelabelCache interface {
    Get(key uint64) (*labelAndID, error)
    Set(key uint64, value *labelAndID) error
    Remove(key uint64)
    Clear(newSize int) error
    GetCacheSize() int
}
```

Once done, the associated work to create the storage components such as storage.redis could be picked up right after the initial proposal.

**Collaborator**:
Fallback caches are likely a stretch, though I do feel the kv store should be a generic component. As far as the interface goes, I would go with something like the below. Disclaimer: this is getting into the deeper technical ideas. The reason keys might want to be strings is to mirror how Alloy uses maps, where the key is always a string.

```golang
type Cache interface {
    Get(key string) (any, bool)
    Set(key string, value any, ttl int)
    Remove(key string)
    Clear(newSize int) error
    GetCacheSize() int
}
```

**Collaborator**:
You shouldn't have to specifically use that for the first round. Capsules are mainly for when you want to bridge into it being accessible within the Alloy config. In this case you're likely passing an interface/pointer from Go code to Go code, which should work without problems.

**Author**:
I've updated the proposal to reflect what we have been discussing.

**Collaborator**:
I also think we will likely need `GetMultiple(keys []string) ([]Results)` and `SetMultiple`; imagine a scenario where you are scraping a million records, hitting redis once per record is going to be pretty expensive.

**Author**:
@mattdurham I've tried to take your feedback into account; the PR should be fine now, what do you think about it?
We have started to work on the code with @pbailhache. What would be the next step regarding this proposal? Should we merge it if it is to your liking and iterate over the code?

P.S.: I don't know what to do about the build 😅

Just opened a draft pull request for the feature: #1692

This is still a work in progress and I'm ready to iterate over it.

endpoint = "redis.url"
username = "user"
password = "password"
...
}
}
...
}
```

- new in-memory config

```river
prometheus.relabel "inmemory_config" {
forward_to = [...]
cache {
backend = "inmemory"
inmemory {
cache_size = 10000000
}
}
...
}
```

- memcached config

```river
prometheus.relabel "memcached_config" {
forward_to = [...]
cache {
backend = "memcached"
memcached {
addresses = "address1, address2"
timeout = 10
...
}
}
...
}
```

## Related open issues

N/A