-
Notifications
You must be signed in to change notification settings - Fork 656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NEW] Introduce slot level metrics to Valkey cluster #20
Comments
I fully agree! |
I've also re-added my favorite creature comfort. @placeholderkv/core-team thoughts? |
Sure, metrics seem fine. I don't have strong opinions about it, only that I think fixing the cluster consistency problems is more important than metrics. |
I think our big initial play should be cluster overhaul. I think a lot of us want it, and it makes the most compelling sense as the big "next major features". |
Good to hear back on this thread, hope you all have been doing well. Where we left-offIn total, there were 3 proposed metrics under
Next steps
I will open two PRs for As for
|
It is great, and I prefer to add this feature in CLUSTER INFO. |
Why cluster info? It's a free form field I guess, it could be a new sub info field I suppose |
Thanks for chiming in. Personally, I'm opposed to Imagine dumping ~16384 slot level metrics under A new command dedicated for slot level metrics querying, in this case,
I'll wait for the core team to finalize this decision, before opening the PRs. |
@kyle-yh-kim Yeah CLUSTER SLOT-STATS. We're a bit overloaded with the forking stuff, new core team, new project, etc. but I think we want this for our next release. There was already a lot of review done and I think it was almost ready to merge. Do you want to bring over your PR? |
The command provides detailed slot usage statistics upon invocation, with initial support for key_count metric. cpu_usec (approved) and memory_bytes (pending-approval) metrics will soon follow after the merger of this PR.
The command provides detailed slot usage statistics upon invocation, with initial support for key_count metric. cpu_usec (approved) and memory_bytes (pending-approval) metrics will soon follow after the merger of this PR.
The command provides detailed slot usage statistics upon invocation, with initial support for key_count metric. cpu_usec (approved) and memory_bytes (pending-approval) metrics will soon follow after the merger of this PR.
The command provides detailed slot usage statistics upon invocation, with initial support for key_count metric. cpu_usec (approved) and memory_bytes (pending-approval) metrics will soon follow after the merger of this PR.
The command provides detailed slot usage statistics upon invocation, with initial support for key_count metric. cpu_usec (approved) and memory_bytes (pending-approval) metrics will soon follow after the merger of this PR.
The command provides detailed slot usage statistics upon invocation, with initial support for key-count metric. cpu-usec (approved) and memory-bytes (pending-approval) metrics will soon follow after the merger of this PR.
The command provides detailed slot usage statistics upon invocation, with initial support for key-count metric. cpu-usec (approved) and memory-bytes (pending-approval) metrics will soon follow after the merger of this PR.
The command provides detailed slot usage statistics upon invocation, with initial support for key-count metric. cpu-usec (approved) and memory-bytes (pending-approval) metrics will soon follow after the merger of this PR.
Ignore my spam references above, I was reviewing the diff manually over Github UI. PR has now been opened; #351 This PR is part one of the three upcoming PRs;
|
The command provides detailed slot usage statistics upon invocation, with initial support for key-count metric. cpu-usec (approved) and memory-bytes (pending-approval) metrics will soon follow after the merger of this PR. Signed-off-by: Kyle Kim <[email protected]>
Moving ahead, I would like to resume our conversation on per-slot memory metrics. I'd argue this is the most important per-slot metric of all, as it enables for smoother horizontal scaling given the accurate memory tracking at per-slot granularity. Last time, we converged on its high-level strategy in "online analysis" (amortizing memory tracking cost per mutative-command, over offline RDB snapshot analysis / forking a process), as well as its performance and memory impact. The following conclusion was drawn, before the issue was then put on halt by the previously open-sourced Redis-core team.
As for module consideration, I mention in details here to keep this feature as an opt-in service to maintain backwards compatibility. For opt-in modules, they will be required to accurately track its value size, and call a newly introduced hook If we are still aligned to this strategy, I will start on its implementation, and open incremental PRs following the merger of the above |
Based on Madelyn's latest comment;
Memory metric is our greatest interest, since it would enable smoother horizontal scaling given accurate information of each slot's memory consumption. Whenever possible, I'd like to understand more on the proposed design's concerns from the core team. Once the concerns are shared, I will evaluate alternative options. One thing I can state for certainty is that - we've put a lot of time and effort into this technical design. Ultimately, there is no solution that comes free of charge - it all boils down to tradeoff decisions (performance, memory, and maintainability). |
Hi Kyle! I have two concerns:
Memory usage is not a concern to me, since we don't need any new structures to track the memory for the single-allocation datastructures (string, listpack, etc.) Modules is no great concern either since I'd imagine it's no disaster if this metric isn't 100% accurate. What about alternative approaches? Can we check the total memory usage before and after each command? We know which slot each command operated on. |
Thanks for your prompt response. My response is attached below. Complexity in memory tracking
Performance degradationThe configuration is based on The aggregation comes in two layers;
Right now, the proposal for CMD is to bypass only the second aggregation and retain the first one. This way, both CMD and CME will have O(1) accurate More on performance benchmarking can be found here. In the worst case scenario for CMD, the performance degradation may reach ~1% TPS. For an average workload of 8:2 R/W, the degradation is negligible. Alternative approaches
Yes, this was the very first design candidate we ideated. Initially, we expected this to be as simple as subtracting
We’ve also investigated various “offline” approaches, such as 1) background thread, 2) cron-job, and 3) forking, all of which were not preferred due to unbounded upper scanning limit, as well as recency lag. This was greatly discussed over in the other threads, here and here. |
Thanks! Yes, I have seen those threads before but I didn't follow this carefully back then. :) OK, so memory is tracked even for standalone mode and it has almost 1% throughput impact for standalone and nearly 2% in cluster mode. This makes me think that we should add a config for it and wrap all of these in Why? I think speed is more important than metrics for some users. 1% is not that much but it adds up. |
…alkey-io#20). The metric tracks network ingress bytes under per-slot context, by reverse calculation of c->argv_len_sum and c->argc, stored under a newly introduced field c->net_input_bytes_curr_cmd. Signed-off-by: Kyle Kim <[email protected]>
…alkey-io#20). The metric tracks network ingress bytes under per-slot context, by reverse calculation of c->argv_len_sum and c->argc, stored under a newly introduced field c->net_input_bytes_curr_cmd. Signed-off-by: Kyle Kim <[email protected]>
PR for per-slot Network bytes-in metric has been opened; #720 The metric tracks network ingress bytes under per-slot context, by reverse calculation of Similar to CPU metric PR, the first revision only holds implementation changes for initial feedback purposes, with pending perf testing. Integration tests are not up-to-date, and thus failing. This will soon be followed-up. |
Performance benchmarking result has been attached below. This will help us to decide whether to enable or disable the per-slot metrics by default, for all instances with CME (cluster-mode-enabled). For CMD (cluster-mode-disabled) instances, below performance penalty will not apply. Performance benchmarking summaryWith both
Appendix: Test setupServer setup
Traffic generator setup
|
…alkey-io#20). The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations. The metric can be viewed by calling the CLUSTER SLOT-STATS command, with sample response attached below; ``` 127.0.0.1:6379> cluster slot-stats slotsrange 0 0 1) 1) (integer) 0 2) 1) "key-count" 2) (integer) 1 3) "network-bytes-out" 4) (integer) 175 ``` Signed-off-by: Kyle Kim <[email protected]>
PR for per-slot Network bytes-out metric has been opened; #771 This concludes opening of all three per-slot metrics PRs targeted for Valkey 8.0 rc1, which are now pending review / approval from the core team;
|
@valkey-io/core-team We think there should be a config since there is a small performance impact. Here are the options for naming:
Please also give input if the config should be mutable or immutable. |
I vote 4. The future config for memory should not be cluster-specific. Name idea: |
@zuiderkwast, my understanding of the cluster use case is find out the "big" slot(s) with the large memory footprint. Is the non cluster use case here about finding "big keys" eventually? |
Yes, it can be used for that too; It's more useful to enable it in a cluster than in standalone mode, but it's not useless in standalone mode. If we keep track of memory per key, then aggregating it per slot is very cheap (presumably), so I don't think we need yet another config for memory per slot. |
Got it. Option 4 sounds good to me and the user needs to enable both |
My preference is 3 -> 4, so I'm OK with 4. |
@kyle-yh-kim Can you update this PR to use a config with the name |
Sure. I believe our latest decision was to disable the config by default. The following line will do the trick. // config.c
createBoolConfig("cluster-slot-stats-enabled", NULL, MODIFIABLE_CONFIG, server.cluster_slot_stats_enabled, 0, NULL, NULL), The three per-slot metrics PRs have now been updated to include the above config, alongside the previously missing TCL integration tests. This concludes all planned changes for the three PRs targeted for Valkey 8.0 rc1, now pending review / approval from the core team;
|
…712) The metric tracks cpu time in micro-seconds, sharing the same value as `INFO COMMANDSTATS`, aggregated under per-slot context. --------- Signed-off-by: Kyle Kim <[email protected]> Signed-off-by: Madelyn Olson <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>
…alkey-io#20). The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations. The metric can be viewed by calling the CLUSTER SLOT-STATS command, with sample response attached below; ``` 127.0.0.1:6379> cluster slot-stats slotsrange 0 0 1) 1) (integer) 0 2) 1) "key-count" 2) (integer) 0 3) "cpu-usec" 4) (integer) 0 5) "network-bytes-in" 6) (integer) 0 7) "network-bytes-out" 8) (integer) 0 ``` Signed-off-by: Kyle Kim <[email protected]>
…alkey-io#20). The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations. The metric can be viewed by calling the CLUSTER SLOT-STATS command, with sample response attached below; ``` 127.0.0.1:6379> cluster slot-stats slotsrange 0 0 1) 1) (integer) 0 2) 1) "key-count" 2) (integer) 0 3) "cpu-usec" 4) (integer) 0 5) "network-bytes-in" 6) (integer) 0 7) "network-bytes-out" 8) (integer) 0 ``` Signed-off-by: Kyle Kim <[email protected]>
…alkey-io#20). The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations. The metric can be viewed by calling the CLUSTER SLOT-STATS command, with sample response attached below; ``` 127.0.0.1:6379> cluster slot-stats slotsrange 0 0 1) 1) (integer) 0 2) 1) "key-count" 2) (integer) 0 3) "cpu-usec" 4) (integer) 0 5) "network-bytes-in" 6) (integer) 0 7) "network-bytes-out" 8) (integer) 0 ``` Signed-off-by: Kyle Kim <[email protected]>
…alkey-io#20). The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations. The metric can be viewed by calling the CLUSTER SLOT-STATS command, with sample response attached below; ``` 127.0.0.1:6379> cluster slot-stats slotsrange 0 0 1) 1) (integer) 0 2) 1) "key-count" 2) (integer) 0 3) "cpu-usec" 4) (integer) 0 5) "network-bytes-in" 6) (integer) 0 7) "network-bytes-out" 8) (integer) 0 ``` Signed-off-by: Kyle Kim <[email protected]>
…alkey-io#20). The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations. The metric can be viewed by calling the CLUSTER SLOT-STATS command, with sample response attached below; ``` 127.0.0.1:6379> cluster slot-stats slotsrange 0 0 1) 1) (integer) 0 2) 1) "key-count" 2) (integer) 0 3) "cpu-usec" 4) (integer) 0 5) "network-bytes-in" 6) (integer) 0 7) "network-bytes-out" 8) (integer) 0 ``` Signed-off-by: Kyle Kim <[email protected]>
…alkey-io#20). The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations. The metric can be viewed by calling the CLUSTER SLOT-STATS command, with sample response attached below; ``` 127.0.0.1:6379> cluster slot-stats slotsrange 0 0 1) 1) (integer) 0 2) 1) "key-count" 2) (integer) 0 3) "cpu-usec" 4) (integer) 0 5) "network-bytes-in" 6) (integer) 0 7) "network-bytes-out" 8) (integer) 0 ``` Signed-off-by: Kyle Kim <[email protected]>
…alkey-io#20). The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations. The metric can be viewed by calling the CLUSTER SLOT-STATS command, with sample response attached below; ``` 127.0.0.1:6379> cluster slot-stats slotsrange 0 0 1) 1) (integer) 0 2) 1) "key-count" 2) (integer) 0 3) "cpu-usec" 4) (integer) 0 5) "network-bytes-in" 6) (integer) 0 7) "network-bytes-out" 8) (integer) 0 ``` Signed-off-by: Kyle Kim <[email protected]>
…io#20). (valkey-io#712) The metric tracks cpu time in micro-seconds, sharing the same value as `INFO COMMANDSTATS`, aggregated under per-slot context. --------- Signed-off-by: Kyle Kim <[email protected]> Signed-off-by: Madelyn Olson <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>
…ER SLOT-STATS command (#20) (#720) Adds two new metrics for per-slot statistics, network-bytes-in and network-bytes-out. The network bytes are inclusive of replication bytes but exclude other types of network traffic such as clusterbus traffic. #### network-bytes-in The metric tracks network ingress bytes under per-slot context, by reverse calculation of `c->argv_len_sum` and `c->argc`, stored under a newly introduced field `c->net_input_bytes_curr_cmd`. #### network-bytes-out The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations. #### sample response Both metrics are reported under the `CLUSTER SLOT-STATS` command. ``` 127.0.0.1:6379> cluster slot-stats slotsrange 0 0 1) 1) (integer) 0 2) 1) "key-count" 2) (integer) 0 3) "cpu-usec" 4) (integer) 0 5) "network-bytes-in" 6) (integer) 0 7) "network-bytes-out" 8) (integer) 0 ``` --------- Signed-off-by: Kyle Kim <[email protected]> Signed-off-by: Madelyn Olson <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>
The four components for Valkey 8.0 are now merged. We will follow up with memory in Valkey 8.2. |
I’m revisiting the feature proposal we discussed in redis/redis#10472, which aims at providing metrics at the slot level. Despite the substantial effort and detailed discussions back then, we didn’t land this feature. I believe it’s worth reconsidering, given the potential benefits and previous interest.
@kyle-yh-kim @zuiderkwast @madolson
The text was updated successfully, but these errors were encountered: