The p99 of the hgetall command is ten times that of hget, or even higher #2691

Closed · chenbt-hz opened this issue May 31, 2024 · 13 comments
Labels: ☢️ Bug (Something isn't working)

@chenbt-hz (Collaborator) commented May 31, 2024

Is this a regression?

Yes

Description

Screenshots to follow.

Test machine: 104 cores, 256 GB RAM.
Six pika instances deployed on a single machine, each configured with 3 DBs, 4 worker threads, and a thread pool size of 8.
Read-only workload: hget p99 < 1 ms, hgetall p99 ~ 10 ms.
Mixed read/write workload (hset + hmset + hget + hgetall): p99 can even exceed 1 s.

How can the p99 of hgetall be optimized?
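
For context, a rough sketch of the access-pattern difference between the two commands against plain RocksDB, assuming the usual pika-style layout of one meta key plus one RocksDB key per hash field (the function names and key scheme here are illustrative, not pika's actual code):

```cpp
#include <rocksdb/db.h>

#include <memory>
#include <string>
#include <utility>
#include <vector>

// HGET maps to a single point lookup: one key, at most a few block reads.
rocksdb::Status HGetSketch(rocksdb::DB* db, const std::string& field_key,
                           std::string* value) {
  return db->Get(rocksdb::ReadOptions(), field_key, value);
}

// HGETALL maps to a prefix scan over every field key of the hash, so its
// latency grows with the field count and with the number of data/index
// blocks the scan touches; every block-cache miss becomes a disk read,
// which makes its tail latency far more sensitive to cache pressure and
// background I/O than a point lookup.
std::vector<std::pair<std::string, std::string>> HGetallSketch(
    rocksdb::DB* db, const std::string& prefix) {
  std::vector<std::pair<std::string, std::string>> fields;
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix);
       it->Next()) {
    fields.emplace_back(it->key().ToString(), it->value().ToString());
  }
  return fields;
}
```

Under a layout like this, hgetall's tail latency scales with field count and block-cache hit rate, which would be consistent with the read-only numbers above.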

Please provide a link to a minimal reproduction of the bug

No response

Screenshots or videos

The middle section is pure hget, where p99 drops noticeably; the left side is hget + hgetall, the right side is pure hgetall.
[screenshot: p99 over time]

Please provide the version you discovered this bug in (check about page for version information)

No response

Anything else?

No response

chenbt-hz added the ☢️ Bug label on May 31, 2024
@chenbt-hz (Collaborator, Author) commented Jun 3, 2024

Test machine: 104 cores, 256 GB RAM.
Six pika instances deployed on a single machine, each configured with:

db-instance-num : 3
thread-num : 4
thread-pool-size : 8
maxclients : 80000
block-cache: 20G
share-block-cache: yes
enable-partitioned-index-filters: yes
cache-index-and-filter-blocks: yes
level-compaction-dynamic-level-bytes: yes
enable-blob-files : yes
min-blob-size : 2K
blob-file-size : 512M
cache-model : 0

1.04 TB of data in total; with hgetall only, QPS fluctuates between 40k and 80k, peaking at 100k.
p99 shown below:
[screenshot: p99]
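
For reference, a rough sketch of the plain-RocksDB options these settings correspond to (illustrative RocksDB API usage, not pika's actual config-loading code):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

#include <memory>

rocksdb::Options MakeOptionsSketch() {
  rocksdb::Options options;
  rocksdb::BlockBasedTableOptions table_options;

  // block-cache: 20G, share-block-cache: yes -> one LRU cache shared by all DBs.
  static std::shared_ptr<rocksdb::Cache> shared_cache =
      rocksdb::NewLRUCache(20ULL << 30);
  table_options.block_cache = shared_cache;

  // cache-index-and-filter-blocks: yes -> index/filter blocks compete with
  // data blocks for the same 20G, so large scans can evict each other's blocks.
  table_options.cache_index_and_filter_blocks = true;

  // enable-partitioned-index-filters: yes
  table_options.index_type =
      rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
  table_options.partition_filters = true;
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  // level-compaction-dynamic-level-bytes: yes
  options.level_compaction_dynamic_level_bytes = true;

  // enable-blob-files: yes, min-blob-size: 2K, blob-file-size: 512M
  options.enable_blob_files = true;
  options.min_blob_size = 2 << 10;
  options.blob_file_size = 512ULL << 20;

  return options;
}
```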

@AlexStocks (Collaborator)

cbt: This problem occurs when multiple pika instances are started on a single machine, yet CPU, memory, and disk are all below their limits.

@AlexStocks (Collaborator) commented Jun 14, 2024

cbt:
1. Clients connect to pika directly.
2. With one Pika instance running on the machine, request P99 is around 6 ms; with two instances deployed, request P99 soars to 60 ms.

@wangshao1 (Collaborator)

The scenario I tested on my side:
A 48-core physical machine with an NVMe SSD, running two independent pika nodes, each with 400 GB+ of data under its db directory; ping latency between the benchmark client and the pika host is on the order of 0.0x ms. Whether one instance or two are deployed, TP9999 never exceeds 10 ms.
If this reproduces reliably on your side, we can compare the differences in system metrics during the test.

@chenbt-hz (Collaborator, Author)

I recompiled with the latest version and retested this afternoon; with 2 pika instances the P99 still increases substantially:
Each pika holds about 126 GB of data; value size is 300 B.
Test machine: 104 cores, 256 GB RAM, SSD accelerator card.

| Test time | Command | Pika instances | QPS | Cache hit rate | P99 | Disk IO | Disk read IO | CPU usage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 15:30-15:38 | hget | 2 | 20k, occasional 60k | 13-19% | fluctuating, up to 30 ms | 60% at startup, 25% at QPS peak, 12.5% normally | 485 MB at startup, 42 MB normally | 3%, 9% at QPS peak |
| 15:59-16:22 | hget | 1 | 20k, occasional 60k | 11-16% | mostly stable at 3 ms, small fluctuations to 7 ms | rose to 27% at first, gradually dropped to 7% | varies over time: 25 → 50 → 10 MB | 1.75%, 5% at QPS peak |
| 17:23-17:43 | hgetall | 2 | 20k, occasional 60k | 0-22% | fluctuating, 4-48 ms | stable 10%, 32% at QPS peak | stable 60 MB, peak 166 MB | ~5%, 16% at QPS peak |
| 16:32-16:54 | hgetall | 1 | 20k, occasional 60k | 5-21.17% | mostly stable at 3 ms | stable 1% | ~4 MB | 2% |
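
For anyone reproducing these numbers, a minimal client-side p99 sampler (a sketch using hiredis; the hash key, port, and sample count are illustrative):

```cpp
#include <hiredis/hiredis.h>

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Runs `cmd` `samples` times and returns the observed p99 in milliseconds.
double MeasureP99(redisContext* c, const char* cmd, int samples) {
  std::vector<double> lat;
  lat.reserve(samples);
  for (int i = 0; i < samples; ++i) {
    auto start = std::chrono::steady_clock::now();
    redisReply* reply = static_cast<redisReply*>(redisCommand(c, cmd));
    auto end = std::chrono::steady_clock::now();
    if (reply) freeReplyObject(reply);
    lat.push_back(
        std::chrono::duration<double, std::milli>(end - start).count());
  }
  std::sort(lat.begin(), lat.end());
  return lat[static_cast<size_t>(lat.size() * 0.99)];
}

int main() {
  redisContext* c = redisConnect("127.0.0.1", 9221);  // pika's default port
  if (c == nullptr || c->err) return 1;
  std::printf("hget p99:    %.2f ms\n", MeasureP99(c, "HGET myhash f1", 10000));
  std::printf("hgetall p99: %.2f ms\n", MeasureP99(c, "HGETALL myhash", 10000));
  redisFree(c);
  return 0;
}
```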

@AlexStocks (Collaborator)

wsy: Pending baitao stress-testing with pika's official benchmark tool.

@chenbt-hz (Collaborator, Author)

Testing with version 4.0.0 on single instances under 200 GB, two benchmarks reading from the two pikas respectively did not reproduce the p99 surge.

However, some scenarios currently cannot be reproduced and compared because of limitations in the benchmark tool. Could the tool be improved?
The main requirements are:

  • Fix the excessive memory usage; even 200 GB of RAM is not enough (in the current logic, every process and every count must read all the files, so memory is held redundantly)
  • Support reading from multiple pika ports with a single tool, to simulate cluster reads
  • Allow the tool to issue several commands at once (hset + hget + hgetall, etc.) with configurable ratios, to simulate the online workload (see the sketch below)
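
To make the last two requirements concrete, a minimal sketch of such a mixed, multi-port workload (hiredis; the ports, key pattern, and 20/50/30 ratios are illustrative, not a proposed interface for the tool):

```cpp
#include <hiredis/hiredis.h>

#include <random>
#include <vector>

int main() {
  // One tool, several pika ports (requirement 2; ports are illustrative).
  std::vector<redisContext*> conns;
  for (int port : {9221, 9222}) {
    redisContext* c = redisConnect("127.0.0.1", port);
    if (c == nullptr || c->err) return 1;
    conns.push_back(c);
  }

  // Mixed commands with configurable ratios (requirement 3):
  // e.g. 20% hset, 50% hget, 30% hgetall.
  std::mt19937 rng(42);
  std::discrete_distribution<int> pick_cmd({20.0, 50.0, 30.0});
  std::uniform_int_distribution<int> pick_conn(
      0, static_cast<int>(conns.size()) - 1);

  for (int i = 0; i < 100000; ++i) {
    redisContext* c = conns[pick_conn(rng)];
    redisReply* reply = nullptr;
    switch (pick_cmd(rng)) {
      case 0:
        reply = static_cast<redisReply*>(
            redisCommand(c, "HSET h%d f%d v", i % 1000, i));
        break;
      case 1:
        reply = static_cast<redisReply*>(
            redisCommand(c, "HGET h%d f%d", i % 1000, i));
        break;
      default:
        reply = static_cast<redisReply*>(
            redisCommand(c, "HGETALL h%d", i % 1000));
    }
    if (reply) freeReplyObject(reply);
  }
  for (redisContext* c : conns) redisFree(c);
  return 0;
}
```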
