[NEW] Introduce slot level metrics to Valkey cluster #20

PingXie · 2024-03-25T03:18:52Z

I’m revisiting the feature proposal we discussed in redis/redis#10472, which aims at providing metrics at the slot level. Despite the substantial effort and detailed discussions back then, we didn’t land this feature. I believe it’s worth reconsidering, given the potential benefits and previous interest.

@kyle-yh-kim @zuiderkwast @madolson

madolson · 2024-03-25T03:23:18Z

I fully agree!

madolson · 2024-03-25T03:25:47Z

I've also re-added my favorite creature comfort. @placeholderkv/core-team thoughts?

zuiderkwast · 2024-03-25T10:30:17Z

Sure, metrics seem fine. I don't have strong opinions about it, only that I think fixing the cluster consistency problems is more important than metrics.

zuiderkwast · 2024-03-25T14:47:16Z

redis/redis#11432

madolson · 2024-03-25T18:40:01Z

I think our big initial play should be cluster overhaul. I think a lot of us want it, and it makes the most compelling sense as the big "next major features".

kyle-yh-kim · 2024-03-26T15:04:21Z

Good to hear back on this thread, hope you all have been doing well.

Where we left off

In total, there were 3 proposed metrics under CLUSTER SLOT-STATS command group;

key_count
cpu_usec
memory_bytes

Next steps

memory_bytes is the most complex of all, but this shouldn't stop us from implementing the first two metrics to gain some momentum.

I will open two PRs for key_count and cpu_usec in the coming days. These PRs will be based off of the already existing PRs for key_count and cpu_usec under Redis repository.

As for CLUSTER SLOT-STATS command format, below was the latest development we agreed upon. Lengthy discussion and rationale can be found here and here.

CLUSTER SLOT-STATS
[SLOTSRANGE start-slot end-slot [start-slot end-slot ...]]|
[ORDERBY column [LIMIT limit] [ASC|DESC]]
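To make the ORDERBY / LIMIT semantics concrete, here is a minimal sketch of sorting per-slot records by one column in descending order and capping the result. The `slotStat` struct, the comparator, and `topSlotsByKeyCount()` are hypothetical names for illustration and are not taken from the PRs.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-slot record; the field names are illustrative only. */
typedef struct {
    int slot;
    uint64_t key_count;
} slotStat;

/* Comparator for "ORDERBY key_count DESC"; ties fall back to ascending slot. */
static int cmpKeyCountDesc(const void *a, const void *b) {
    const slotStat *x = a, *y = b;
    if (x->key_count != y->key_count) return (x->key_count < y->key_count) ? 1 : -1;
    return (x->slot > y->slot) - (x->slot < y->slot);
}

/* Sorts `stats` in place and returns how many entries a LIMIT of `limit`
 * would emit from the front of the array. */
static size_t topSlotsByKeyCount(slotStat *stats, size_t n, size_t limit) {
    qsort(stats, n, sizeof(*stats), cmpKeyCountDesc);
    return (limit < n) ? limit : n;
}
```

A full sort is the simplest option at 16384 slots; a bounded heap would avoid sorting everything when LIMIT is small, at the cost of more code.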

hwware · 2024-03-27T21:12:57Z

It is great, and I prefer to add this feature in CLUSTER INFO.

madolson · 2024-03-27T22:18:24Z

> It is great, and I prefer to add this feature in CLUSTER INFO.

Why cluster info? It's a free form field I guess, it could be a new sub info field I suppose

kyle-yh-kim · 2024-03-28T00:05:35Z

Thanks for chiming in. Personally, I'm opposed to CLUSTER INFO. We could perhaps add aggregated information under CLUSTER INFO, but not for the slot level metrics themselves.

Imagine dumping ~16384 slot level metrics under CLUSTER INFO. This would unnecessarily bloat the info string, when the user may have only wanted to check cluster_state:ok.

A new command dedicated for slot level metrics querying, in this case, CLUSTER SLOT-STATS, is more suitable. For reference, below was the latest command format we agreed on.

CLUSTER SLOT-STATS
[SLOTSRANGE start-slot end-slot [start-slot end-slot ...]]|
[ORDERBY column [LIMIT limit] [ASC|DESC]]

I'll wait for the core team to finalize this decision, before opening the PRs.

zuiderkwast · 2024-04-15T20:19:08Z

@kyle-yh-kim Yeah CLUSTER SLOT-STATS. We're a bit overloaded with the forking stuff, new core team, new project, etc. but I think we want this for our next release. There was already a lot of review done and I think it was almost ready to merge. Do you want to bring over your PR?

The command provides detailed slot usage statistics upon invocation, with initial support for key_count metric. cpu_usec (approved) and memory_bytes (pending-approval) metrics will soon follow after the merger of this PR.

The command provides detailed slot usage statistics upon invocation, with initial support for key-count metric. cpu-usec (approved) and memory-bytes (pending-approval) metrics will soon follow after the merger of this PR.

kyle-yh-kim · 2024-04-23T02:34:10Z

Please ignore my spam references above; I was reviewing the diff manually in the GitHub UI.

PR has now been opened; #351

This PR is part one of the three upcoming PRs;

CLUSTER SLOT-STATS command introduction, with key-count support --> This PR.
cpu-usec support
memory-bytes support

The command provides detailed slot usage statistics upon invocation, with initial support for key-count metric. cpu-usec (approved) and memory-bytes (pending-approval) metrics will soon follow after the merger of this PR. Signed-off-by: Kyle Kim <[email protected]>

kyle-yh-kim · 2024-05-07T17:21:59Z

Moving ahead, I would like to resume our conversation on per-slot memory metrics. I'd argue this is the most important per-slot metric of all, as it enables smoother horizontal scaling through accurate memory tracking at per-slot granularity.

Last time, we converged on its high-level strategy of "online analysis" (amortizing the memory tracking cost across each mutative command, rather than offline RDB snapshot analysis or forking a process), as well as on its performance and memory impact. The following conclusion was drawn before the issue was put on hold by the previously open-sourced Redis core team.

> Overall this data seems really good to me. There is the separate project for improving main memory efficiency of the dictionary, so if these two features are released together it might not be noticeable.

Source: redis/redis#10472 (comment)

As for module considerations, I describe in detail here how this feature can be kept opt-in to maintain backwards compatibility. Opt-in modules will be required to accurately track their value sizes and to call a newly introduced hook, RM_ModuleUpdateMemorySlotStats(), upon each mutation, signaling valkey-server to register the memory gained or lost by the module's registered write commands.

If we are still aligned to this strategy, I will start on its implementation, and open incremental PRs following the merger of the above CLUSTER SLOT-STATS command PR #351.

kyle-yh-kim · 2024-06-11T21:26:16Z

Based on Madelyn's latest comment;

> Defer the decision about memory usage since it was contentious.

Memory metric is our greatest interest, since it would enable smoother horizontal scaling given accurate information of each slot's memory consumption.

Whenever possible, I'd like to understand more on the proposed design's concerns from the core team. Once the concerns are shared, I will evaluate alternative options.

One thing I can state with certainty is that we've put a lot of time and effort into this technical design. Ultimately, no solution comes free of charge; it all boils down to tradeoff decisions (performance, memory, and maintainability).

zuiderkwast · 2024-06-12T00:31:10Z

Hi Kyle! I have two concerns:

Tracking memory for each data structure seems to add considerable complexity. For dict, we'd need keySize and valueSize callbacks in dictType. For quicklist, it's just a size per quicklist I suppose, since the nodes already have a size, but what about compressed nodes? For rax and skiplist, I'd like to see a simple description of how to handle each of these to understand the complexity.
Any performance degradation, or did you say it's a config? When off, there's no performance degradation?

Memory usage is not a concern to me, since we don't need any new structures to track the memory for the single-allocation datastructures (string, listpack, etc.) Modules is no great concern either since I'd imagine it's no disaster if this metric isn't 100% accurate.

What about alternative approaches? Can we check the total memory usage before and after each command? We know which slot each command operated on.

kyle-yh-kim · 2024-06-13T15:03:14Z

Thanks for your prompt response. My response is attached below.

Complexity in memory tracking

quicklist compression
zmalloc_size(node->entry) is called before and after compression to assess its difference. The two hook points are; 1) __quicklistCompressNode(), and 2) __quicklistDecompressNode(). This difference is then accumulated to quicklist->alloc_bytes, where alloc_bytes is a newly introduced size_t field that tracks its allocation bytes.
zskiplist
zskiplistNode holds two major memory allocations: 1) node->ele, and 2) node->level[]. Both can be easily introspected through zmalloc_size(). Similar to dict and quicklist, the lowest common hook points are chosen so that the change is minimally invasive. This can be accomplished with two one-line changes, in 1) zslInsert(), and 2) zslDeleteNode().
rax
There exists an open OSS PR which tracks rax allocation size in its header. The change isn't complex, as we simply add or subtract zmalloc_size(raxNode) per mutation, for which there are about 20 touch points. This effort can be resumed in the Valkey project.
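The common pattern across the three structures above can be sketched on a toy container: a byte counter on the structure itself, adjusted at the insert, delete, and resize (compression-style) hook points. This is only an illustration. `toyList`, `toyNode`, and the function names are hypothetical, and the sketch counts requested sizes rather than what zmalloc_size() would report for the actual allocation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy node and container; `alloc_bytes` mirrors the idea of a per-structure
 * byte counter. None of these names exist in the real code base. */
typedef struct toyNode {
    struct toyNode *next;
    size_t entry_size;    /* bytes held by `entry` */
    unsigned char *entry; /* payload, e.g. a (de)compressed blob */
} toyNode;

typedef struct {
    toyNode *head;
    size_t alloc_bytes;
} toyList;

/* Insert hook point: account for the node header plus its payload. */
static int toyListPush(toyList *l, size_t entry_size) {
    toyNode *n = malloc(sizeof(*n));
    if (!n) return -1;
    n->entry = calloc(1, entry_size ? entry_size : 1);
    if (!n->entry) {
        free(n);
        return -1;
    }
    n->entry_size = entry_size;
    n->next = l->head;
    l->head = n;
    l->alloc_bytes += sizeof(*n) + entry_size;
    return 0;
}

/* Compression-style hook point: the payload changes size, so only the
 * difference between the old and new size is applied to the counter. */
static int toyListResizeHead(toyList *l, size_t new_size) {
    toyNode *n = l->head;
    if (!n) return -1;
    unsigned char *p = realloc(n->entry, new_size ? new_size : 1);
    if (!p) return -1;
    n->entry = p;
    l->alloc_bytes -= n->entry_size;
    l->alloc_bytes += new_size;
    n->entry_size = new_size;
    return 0;
}

/* Delete hook point: subtract exactly what the insert added. */
static void toyListPop(toyList *l) {
    toyNode *n = l->head;
    if (!n) return;
    l->head = n->next;
    l->alloc_bytes -= sizeof(*n) + n->entry_size;
    free(n->entry);
    free(n);
}
```

Because every adjustment lives next to the allocation it describes, a wrong counter can be traced within the one data-structure file.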

Performance degradation

The configuration is based on server.cluster_enabled. If enabled, per-slot memory will be aggregated. Else, the code will be bypassed.

The aggregation comes in two layers;

Track accurate memory usage of Redis key-value entry.
Aggregate memory usage at per-slot level, given that we can track each Redis key-value entry’s memory usage.

Right now, the proposal for CMD is to bypass only the second aggregation and retain the first one. This way, both CMD and CME will have O(1) accurate MEMORY USAGE.
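As a rough illustration of the second layer, the sketch below folds a key's size change into a per-slot counter. It assumes the first layer already yields a signed byte delta per key. `slot_memory_bytes`, `keySlot()`, and `slotMemoryApplyDelta()` are hypothetical names, and the slot mapping is simplified: it applies CRC16 (XMODEM variant) modulo 16384 to the whole key and ignores the {hashtag} rule.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_SLOTS 16384

/* Bitwise CRC16, XMODEM variant: polynomial 0x1021, initial value 0. */
static uint16_t crc16(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021) : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Simplified slot mapping over the whole key (no hashtag handling). */
static unsigned keySlot(const char *key, size_t len) {
    return crc16(key, len) & (NUM_SLOTS - 1);
}

/* Hypothetical per-slot accumulator, one signed counter per slot. */
static int64_t slot_memory_bytes[NUM_SLOTS];

/* Layer 2: fold a key's size change (the output of layer 1) into its slot. */
static void slotMemoryApplyDelta(const char *key, size_t len, int64_t delta) {
    slot_memory_bytes[keySlot(key, len)] += delta;
}
```

The per-command cost of this layer is one array update, which is why bypassing it for CMD saves little beyond the array itself.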

More on performance benchmarking can be found here. In the worst case scenario for CMD, the performance degradation may reach ~1% TPS. For an average workload of 8:2 R/W, the degradation is negligible.

Alternative approaches

> Can we check the total memory usage before and after each command? We know which slot each command operated on.

Yes, this was the very first design candidate we ideated. Initially, we expected this to be as simple as subtracting zmalloc_used_memory()_after_cmd - zmalloc_used_memory()_before_cmd. However, it carried far greater complexity due to the following reasons;

Maintenance and hard-to-follow logic. At first glance, this approach seems simple to implement. However, zmalloc context switching from customer key-space to other intents (including but not limited to: 1. transient / temporary, 2. Redis administration, 3. client input / output buffers) can occur at any depth of a Redis mutative command's call-stack. Out of all zmalloc operations, we must isolate those relevant to customer key-space. Thus, for every mutative Redis command, we must first completely map out these context-switching windows, and then maintain that mapping whenever a new zmalloc call is introduced within them.

The 2nd candidate solves this maintenance problem by logically separating all size tracking within the memory sparse internal data-structure files, such as rax.c, dict.c, quicklist.c and so on. The size tracking will not creep into other depths of call-stacks.

Down the road, if any bug is introduced, 1st candidate will require sweeping across all zmalloc operations within all depths of call-stacks. For the 2nd candidate, we may simply refer to the specific internal data-structure file.

Complex and invasive, as zmalloc cannot be relied upon in all cases.

For example, in order to get the relevant slot number, the input must first be parsed. However, parsing of this input requires zmalloc. We now run into a cyclic dependency, where zmalloc needs slot number to increment, but the slot number can only be obtained once key is parsed through zmalloc. To mitigate, we may temporarily save the size of these variables, then increment them once the slot number is parsed and request is successful. But now, we need a way to carry this additional temporary variable, either through another global variable, or additional argument across all call-stacks.

Another example: robj values are conditionally re-created following the initial parsing (createStringObjectFromLongLongWithOptions()). The size of the initially parsed value may or may not need to be discarded from the slot metrics array, which requires another layer of consideration.

After a few edge case considerations, the implementation touches multiple signatures and growing number of global variables.
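The core difficulty with the before/after approach can be shown on a toy allocator counter. In this sketch, `used_memory`, `trackedAlloc()`, and `runSetAndMeasure()` are hypothetical stand-ins rather than the real zmalloc bookkeeping. The command stores a value (key-space memory) and also queues a reply that outlives the command (client output buffer), so the measured delta covers both.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for a global allocator counter; not the real zmalloc accounting. */
static size_t used_memory = 0;

static void *trackedAlloc(size_t n) {
    void *p = malloc(n);
    if (p) used_memory += n;
    return p;
}

/* Hypothetical write command measured with the before/after approach.
 * The caller owns and frees both returned buffers. */
static long runSetAndMeasure(size_t value_len, size_t reply_len, void **value, void **reply) {
    size_t before = used_memory;
    *value = trackedAlloc(value_len); /* key-space memory: belongs to the slot */
    *reply = trackedAlloc(reply_len); /* client output buffer: does not */
    return (long)(used_memory - before);
}
```

With a 512-byte value and a 5-byte reply the delta comes out as 517, not 512. Separating the two requires knowing which allocations inside the window belong to the key-space, which is the mapping problem described above.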

We’ve also investigated various “offline” approaches, such as 1) background thread, 2) cron-job, and 3) forking, all of which were not preferred due to unbounded upper scanning limit, as well as recency lag.

This was greatly discussed over in the other threads, here and here.

zuiderkwast · 2024-06-13T23:32:21Z

Thanks! Yes, I have seen those threads before but I didn't follow this carefully back then. :)

OK, so memory is tracked even for standalone mode and it has almost 1% throughput impact for standalone and nearly 2% in cluster mode. This makes me think that we should add a config for it and wrap all of these in if, like if (server.memory_tracking) { d->size += zmalloc_size(p); }. If the config is off, CPU branch prediction will make sure it doesn't cost anything to execute this kind of if statements.

Why? I think speed is more important than metrics for some users. 1% is not that much but it adds up.
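A minimal sketch of that gating, assuming a boolean config named memory_tracking as floated above. `toyServer`, `toyDict`, and `toyDictTrackAlloc()` are hypothetical, and the byte count is passed in directly instead of coming from zmalloc_size().

```c
#include <stddef.h>

/* Hypothetical config holder; only the field name follows the suggestion above. */
struct toyServer {
    int memory_tracking;
};
static struct toyServer server = {0};

typedef struct {
    size_t size; /* tracked bytes for this structure */
} toyDict;

/* Accounting runs only when the config is on; when it is off, the cost is a
 * single branch that always goes the same way. */
static inline void toyDictTrackAlloc(toyDict *d, size_t bytes) {
    if (server.memory_tracking) d->size += bytes;
}
```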

…alkey-io#20). The metric tracks network ingress bytes under per-slot context, by reverse calculation of c->argv_len_sum and c->argc, stored under a newly introduced field c->net_input_bytes_curr_cmd. Signed-off-by: Kyle Kim <[email protected]>

kyle-yh-kim · 2024-07-01T04:59:47Z

PR for per-slot Network bytes-in metric has been opened; #720

The metric tracks network ingress bytes under per-slot context, by reverse calculation of c->argv_len_sum and c->argc, stored under a newly introduced field c->net_input_bytes_curr_cmd.
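For intuition on why argument lengths and argument count are enough to recover ingress bytes, here is a sketch of how many bytes a RESP multibulk command occupies on the wire. This is not the PR's formula. `respCommandWireBytes()` and `decimalDigits()` are hypothetical helpers, and the sketch takes each argument's length individually rather than the c->argv_len_sum total.

```c
#include <stddef.h>

/* Number of decimal digits needed to print v (at least 1). */
static size_t decimalDigits(size_t v) {
    size_t d = 1;
    while (v >= 10) {
        v /= 10;
        d++;
    }
    return d;
}

/* Bytes a RESP multibulk command occupies on the wire:
 *   *<argc>\r\n   followed, per argument, by   $<len>\r\n<payload>\r\n */
static size_t respCommandWireBytes(const size_t *arg_lens, size_t argc) {
    size_t total = 1 + decimalDigits(argc) + 2;
    for (size_t i = 0; i < argc; i++)
        total += 1 + decimalDigits(arg_lens[i]) + 2 + arg_lens[i] + 2;
    return total;
}
```

For example, SET key value has argument lengths 3, 3, and 5 and is sent as `*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$5\r\nvalue\r\n`, which is 33 bytes.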

Similar to CPU metric PR, the first revision only holds implementation changes for initial feedback purposes, with pending perf testing. Integration tests are not up-to-date, and thus failing. This will soon be followed-up.

kyle-yh-kim · 2024-07-08T02:11:32Z

Performance benchmarking results are attached below. They will help us decide whether to enable or disable the per-slot metrics by default for all instances with CME (cluster-mode-enabled). For CMD (cluster-mode-disabled) instances, the performance penalty below does not apply.

Performance benchmarking summary

With both cpu-usec and network-bytes-in metrics enabled, we can note a reduction of 0.70% in TPS.

| Metric | Naive | With cpu-usec | Percentage diff |
|---|---|---|---|
| p50 (ms) | 2.183 | 2.206 | 1.05% |
| p90 (ms) | 3.357 | 3.369 | 0.36% |
| p99 (ms) | 3.966 | 4.006 | 1.02% |
| TPS | 158280 | 157179 | -0.70% |

Appendix: Test setup

Server setup

1 server (r6g.xlarge), pre-filled with 3 million keys, 512 bytes each.

Traffic generator setup

8 traffic generators (m6g.large) running on separate ARM instances.
Each traffic generator runs the following command (50 clients, SET command, 512 bytes data size), which pins the server CPU at 100%.

./valkey-benchmark -h ${TARGET_IP} -c 50 -r 3000000 -n 100000000 -t set -d 514

…alkey-io#20). The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations. The metric can be viewed by calling the CLUSTER SLOT-STATS command, with sample response attached below;

```
127.0.0.1:6379> cluster slot-stats slotsrange 0 0
1) 1) (integer) 0
   2) 1) "key-count"
      2) (integer) 1
      3) "network-bytes-out"
      4) (integer) 175
```

Signed-off-by: Kyle Kim <[email protected]>

kyle-yh-kim · 2024-07-11T21:01:24Z

PR for per-slot Network bytes-out metric has been opened; #771

This concludes opening of all three per-slot metrics PRs targeted for Valkey 8.0 rc1, which are now pending review / approval from the core team;

cpu-usec: Add cpu-usec metric support under CLUSTER SLOT-STATS command (#20). #712
network-bytes-in: Add network-bytes-in and network-bytes-out metric support under CLUSTER SLOT-STATS command (#20) #720
network-bytes-out: Add network-bytes-out metric support for CLUSTER SLOT-STATS command (#20) #771

madolson · 2024-07-15T15:31:15Z

@valkey-io/core-team We think there should be a config since there is a small performance impact. Here are the options for naming:

1. cluster-slots-command-metrics for (cpu, network-in, network-out) and cluster-slot-data-metrics (for memory).
2. cluster-slots-operation-metrics for (cpu, network-in, network-out) and cluster-slot-data-metrics (for memory).
3. cluster-slot-stats-network-enabled, cluster-slot-stats-cpu-enabled, cluster-slot-stats-memory-enabled. (This is not in valkey 8, we can finalize the name later)
4. cluster-slot-stats-enabled with a separate future config name for memory.

Please also give input if the config should be mutable or immutable.

zuiderkwast · 2024-07-15T23:06:52Z

I vote 4. cluster-slot-stats-enabled, mutable.

The future config for memory should not be cluster-specific. Name idea: memory-tracking-enabled. Apart from cluster slot-stats, it would make MEMORY USAGE, MEMORY STATS and other info exact (avoid sampling). It should be immutable (since it's non-trivial to make it mutable).

PingXie · 2024-07-16T00:05:47Z

> The future config for memory should not be cluster-specific. Name idea: memory-tracking-enabled. Apart from cluster slot-stats, it would make MEMORY USAGE, MEMORY STATS and other info exact (avoid sampling).

@zuiderkwast, my understanding of the cluster use case is to find the "big" slot(s) with a large memory footprint. Is the non-cluster use case here about finding "big keys" eventually?

zuiderkwast · 2024-07-16T00:41:06Z

> @zuiderkwast, my understanding of the cluster use case is to find the "big" slot(s) with a large memory footprint. Is the non-cluster use case here about finding "big keys" eventually?

Yes, it can be used for that too; valkey-cli --memkeys can definitely benefit. (It's using MEMORY USAGE.)

It's more useful to enable it in a cluster than in standalone mode, but it's not useless in standalone mode. If we keep track of memory per key, then aggregating it per slot is very cheap (presumably), so I don't think we need yet another config for memory per slot.

PingXie · 2024-07-16T01:10:10Z

Got it. Option 4 sounds good to me and the user needs to enable both memory-tracking (future) and cluster-slot-stats-enabled (8.0) to get the memory stats.

madolson · 2024-07-16T03:24:39Z

My preference is 3 -> 4, so I'm OK with 4.

madolson · 2024-07-16T03:25:24Z

@kyle-yh-kim Can you update this PR to use a config with the name cluster-slot-stats-enabled, we can sort out an updated name later, but it would be good to get all of the naming out of the way. For now make the config mutable.

kyle-yh-kim · 2024-07-16T23:46:11Z

Sure. I believe our latest decision was to disable the config by default. The following line will do the trick.

// config.c
createBoolConfig("cluster-slot-stats-enabled", NULL, MODIFIABLE_CONFIG, server.cluster_slot_stats_enabled, 0, NULL, NULL),

The three per-slot metrics PRs have now been updated to include the above config, alongside the previously missing TCL integration tests.

This concludes all planned changes for the three PRs targeted for Valkey 8.0 rc1, now pending review / approval from the core team;

cpu-usec: Add cpu-usec metric support under CLUSTER SLOT-STATS command (#20). #712
network-bytes-in: Add network-bytes-in and network-bytes-out metric support under CLUSTER SLOT-STATS command (#20) #720
network-bytes-out: Add network-bytes-out metric support for CLUSTER SLOT-STATS command (#20) #771

…712) The metric tracks cpu time in micro-seconds, sharing the same value as `INFO COMMANDSTATS`, aggregated under per-slot context. --------- Signed-off-by: Kyle Kim <[email protected]> Signed-off-by: Madelyn Olson <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>

…alkey-io#20). The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations. The metric can be viewed by calling the CLUSTER SLOT-STATS command, with sample response attached below;

```
127.0.0.1:6379> cluster slot-stats slotsrange 0 0
1) 1) (integer) 0
   2) 1) "key-count"
      2) (integer) 0
      3) "cpu-usec"
      4) (integer) 0
      5) "network-bytes-in"
      6) (integer) 0
      7) "network-bytes-out"
      8) (integer) 0
```

Signed-off-by: Kyle Kim <[email protected]>

…io#20). (valkey-io#712) The metric tracks cpu time in micro-seconds, sharing the same value as `INFO COMMANDSTATS`, aggregated under per-slot context. --------- Signed-off-by: Kyle Kim <[email protected]> Signed-off-by: Madelyn Olson <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>

…ER SLOT-STATS command (#20) (#720)

Adds two new metrics for per-slot statistics, network-bytes-in and network-bytes-out. The network bytes are inclusive of replication bytes but exclude other types of network traffic such as clusterbus traffic.

#### network-bytes-in

The metric tracks network ingress bytes under per-slot context, by reverse calculation of `c->argv_len_sum` and `c->argc`, stored under a newly introduced field `c->net_input_bytes_curr_cmd`.

#### network-bytes-out

The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations.

#### sample response

Both metrics are reported under the `CLUSTER SLOT-STATS` command.

```
127.0.0.1:6379> cluster slot-stats slotsrange 0 0
1) 1) (integer) 0
   2) 1) "key-count"
      2) (integer) 0
      3) "cpu-usec"
      4) (integer) 0
      5) "network-bytes-in"
      6) (integer) 0
      7) "network-bytes-out"
      8) (integer) 0
```

---------

Signed-off-by: Kyle Kim <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>

madolson · 2024-07-26T23:10:13Z

The four components for Valkey 8.0 are now merged. We will follow up with memory in Valkey 8.2.

zuiderkwast added the cluster label Mar 25, 2024

madolson added the major-decision-pending Major decision pending by TSC team label Mar 27, 2024

PingXie added this to Valkey 8.0 Apr 15, 2024

PingXie moved this to Todo in Valkey 8.0 Apr 15, 2024

PingXie changed the title ~~Introduce slot level metrics to Redis cluster~~ [NEW] Introduce slot level metrics to Redis cluster Apr 20, 2024

kyle-yh-kim mentioned this issue Jul 15, 2024

Add network-bytes-in and network-bytes-out metric support under CLUSTER SLOT-STATS command (#20) #720

Merged

zuiderkwast mentioned this issue Jul 17, 2024

Rax size tracking #688

Merged

madolson moved this from In Progress to Done in Valkey 8.0 Jul 26, 2024

madolson closed this as completed Jul 26, 2024

kyle-yh-kim mentioned this issue Jul 31, 2024

[NEW] Introduce slot-level memory metrics #852

Open
