The p99 of the hgetall command is ten times that of hget, or even higher #2691

Closed · chenbt-hz opened this issue May 31, 2024 · 13 comments
Labels: ☢️ Bug (Something isn't working)

@chenbt-hz (Collaborator) commented May 31, 2024

Is this a regression?

Yes

Description

Screenshots to follow.

Test machine: 104 cores, 256 GB RAM.
Six pika instances deployed on a single machine, each configured with 3 DBs, 4 worker threads, and a thread pool size of 8.
Read-only workload: hget p99 < 1 ms, hgetall p99 ~ 10 ms.
Mixed read/write workload (hset + hmset + hget + hgetall): p99 can even exceed 1 s.

How can the p99 of hgetall be optimized?
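
For context, a rough sketch of the access-pattern difference between the two commands against plain RocksDB, assuming the usual pika-style layout of one meta key plus one RocksDB key per hash field (the function names and key scheme here are illustrative, not pika's actual code):

```cpp
#include <rocksdb/db.h>

#include <memory>
#include <string>
#include <utility>
#include <vector>

// HGET maps to a single point lookup: one key, at most a few block reads.
rocksdb::Status HGetSketch(rocksdb::DB* db, const std::string& field_key,
                           std::string* value) {
  return db->Get(rocksdb::ReadOptions(), field_key, value);
}

// HGETALL maps to a prefix scan over every field key of the hash, so its
// latency grows with the field count and with the number of data/index
// blocks the scan touches; every block-cache miss becomes a disk read,
// which makes its tail latency far more sensitive to cache pressure and
// background I/O than a point lookup.
std::vector<std::pair<std::string, std::string>> HGetallSketch(
    rocksdb::DB* db, const std::string& prefix) {
  std::vector<std::pair<std::string, std::string>> fields;
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix);
       it->Next()) {
    fields.emplace_back(it->key().ToString(), it->value().ToString());
  }
  return fields;
}
```

Under a layout like this, hgetall's tail latency scales with field count and block-cache hit rate, which would be consistent with the read-only numbers above.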

Please provide a link to a minimal reproduction of the bug

No response

Screenshots or videos

The middle section is pure hget, where p99 drops noticeably; the left side is hget + hgetall, the right side is pure hgetall.
[screenshot: p99 over time]

Please provide the version you discovered this bug in (check about page for version information)

No response

Anything else?

No response

chenbt-hz added the ☢️ Bug label on May 31, 2024
@chenbt-hz (Collaborator, Author) commented Jun 3, 2024

Test machine: 104 cores, 256 GB RAM.
Six pika instances deployed on a single machine, each configured with:

db-instance-num : 3
thread-num : 4
thread-pool-size : 8
maxclients : 80000
block-cache: 20G
share-block-cache: yes
enable-partitioned-index-filters: yes
cache-index-and-filter-blocks: yes
level-compaction-dynamic-level-bytes: yes
enable-blob-files : yes
min-blob-size : 2K
blob-file-size : 512M
cache-model : 0

1.04 TB of data in total; with hgetall only, QPS fluctuates between 40k and 80k, peaking at 100k.
p99 shown below:
[screenshot: p99]
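
For reference, a rough sketch of the plain-RocksDB options these settings correspond to (illustrative RocksDB API usage, not pika's actual config-loading code):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

#include <memory>

rocksdb::Options MakeOptionsSketch() {
  rocksdb::Options options;
  rocksdb::BlockBasedTableOptions table_options;

  // block-cache: 20G, share-block-cache: yes -> one LRU cache shared by all DBs.
  static std::shared_ptr<rocksdb::Cache> shared_cache =
      rocksdb::NewLRUCache(20ULL << 30);
  table_options.block_cache = shared_cache;

  // cache-index-and-filter-blocks: yes -> index/filter blocks compete with
  // data blocks for the same 20G, so large scans can evict each other's blocks.
  table_options.cache_index_and_filter_blocks = true;

  // enable-partitioned-index-filters: yes
  table_options.index_type =
      rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
  table_options.partition_filters = true;
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  // level-compaction-dynamic-level-bytes: yes
  options.level_compaction_dynamic_level_bytes = true;

  // enable-blob-files: yes, min-blob-size: 2K, blob-file-size: 512M
  options.enable_blob_files = true;
  options.min_blob_size = 2 << 10;
  options.blob_file_size = 512ULL << 20;

  return options;
}
```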

@AlexStocks (Collaborator)

cbt: This problem occurs when multiple pika instances are started on a single machine, yet CPU, memory, and disk are all below their limits.

@AlexStocks (Collaborator) commented Jun 14, 2024

cbt:
1. Clients connect to pika directly.
2. With one Pika instance running on the machine, request P99 is around 6 ms; with two instances deployed, request P99 soars to 60 ms.

@wangshao1 (Collaborator)

The scenario I tested on my side:
A 48-core physical machine with an NVMe SSD, running two independent pika nodes, each with 400 GB+ of data under its db directory; ping latency between the benchmark client and the pika host is on the order of 0.0x ms. Whether one instance or two are deployed, TP9999 never exceeds 10 ms.
If this reproduces reliably on your side, we can compare the differences in system metrics during the test.

@chenbt-hz (Collaborator, Author)

I recompiled with the latest version and retested this afternoon; with 2 pika instances the P99 still increases substantially:
Each pika holds about 126 GB of data; value size is 300 B.
Test machine: 104 cores, 256 GB RAM, SSD accelerator card.

| Test time | Command | Pika instances | QPS | Cache hit rate | P99 | Disk IO | Disk read IO | CPU usage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 15:30-15:38 | hget | 2 | 20k, occasional 60k | 13-19% | fluctuating, up to 30 ms | 60% at startup, 25% at QPS peak, 12.5% normally | 485 MB at startup, 42 MB normally | 3%, 9% at QPS peak |
| 15:59-16:22 | hget | 1 | 20k, occasional 60k | 11-16% | mostly stable at 3 ms, small fluctuations to 7 ms | rose to 27% at first, gradually dropped to 7% | varies over time: 25 → 50 → 10 MB | 1.75%, 5% at QPS peak |
| 17:23-17:43 | hgetall | 2 | 20k, occasional 60k | 0-22% | fluctuating, 4-48 ms | stable 10%, 32% at QPS peak | stable 60 MB, peak 166 MB | ~5%, 16% at QPS peak |
| 16:32-16:54 | hgetall | 1 | 20k, occasional 60k | 5-21.17% | mostly stable at 3 ms | stable 1% | ~4 MB | 2% |
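
For anyone reproducing these numbers, a minimal client-side p99 sampler (a sketch using hiredis; the hash key, port, and sample count are illustrative):

```cpp
#include <hiredis/hiredis.h>

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Runs `cmd` `samples` times and returns the observed p99 in milliseconds.
double MeasureP99(redisContext* c, const char* cmd, int samples) {
  std::vector<double> lat;
  lat.reserve(samples);
  for (int i = 0; i < samples; ++i) {
    auto start = std::chrono::steady_clock::now();
    redisReply* reply = static_cast<redisReply*>(redisCommand(c, cmd));
    auto end = std::chrono::steady_clock::now();
    if (reply) freeReplyObject(reply);
    lat.push_back(
        std::chrono::duration<double, std::milli>(end - start).count());
  }
  std::sort(lat.begin(), lat.end());
  return lat[static_cast<size_t>(lat.size() * 0.99)];
}

int main() {
  redisContext* c = redisConnect("127.0.0.1", 9221);  // pika's default port
  if (c == nullptr || c->err) return 1;
  std::printf("hget p99:    %.2f ms\n", MeasureP99(c, "HGET myhash f1", 10000));
  std::printf("hgetall p99: %.2f ms\n", MeasureP99(c, "HGETALL myhash", 10000));
  redisFree(c);
  return 0;
}
```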

@AlexStocks (Collaborator)

wsy: Pending baitao stress-testing with pika's official benchmark tool.

@chenbt-hz (Collaborator, Author)

Testing with version 4.0.0 on single instances under 200 GB, two benchmarks reading from the two pikas respectively did not reproduce the p99 surge.

However, some scenarios currently cannot be reproduced and compared because of limitations in the benchmark tool. Could the tool be improved?
The main requirements are:

  • Fix the excessive memory usage; even 200 GB of RAM is not enough (in the current logic, every process and every count must read all the files, so memory is held redundantly)
  • Support reading from multiple pika ports with a single tool, to simulate cluster reads
  • Allow the tool to issue several commands at once (hset + hget + hgetall, etc.) with configurable ratios, to simulate the online workload (see the sketch below)
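
To make the last two requirements concrete, a minimal sketch of such a mixed, multi-port workload (hiredis; the ports, key pattern, and 20/50/30 ratios are illustrative, not a proposed interface for the tool):

```cpp
#include <hiredis/hiredis.h>

#include <random>
#include <vector>

int main() {
  // One tool, several pika ports (requirement 2; ports are illustrative).
  std::vector<redisContext*> conns;
  for (int port : {9221, 9222}) {
    redisContext* c = redisConnect("127.0.0.1", port);
    if (c == nullptr || c->err) return 1;
    conns.push_back(c);
  }

  // Mixed commands with configurable ratios (requirement 3):
  // e.g. 20% hset, 50% hget, 30% hgetall.
  std::mt19937 rng(42);
  std::discrete_distribution<int> pick_cmd({20.0, 50.0, 30.0});
  std::uniform_int_distribution<int> pick_conn(
      0, static_cast<int>(conns.size()) - 1);

  for (int i = 0; i < 100000; ++i) {
    redisContext* c = conns[pick_conn(rng)];
    redisReply* reply = nullptr;
    switch (pick_cmd(rng)) {
      case 0:
        reply = static_cast<redisReply*>(
            redisCommand(c, "HSET h%d f%d v", i % 1000, i));
        break;
      case 1:
        reply = static_cast<redisReply*>(
            redisCommand(c, "HGET h%d f%d", i % 1000, i));
        break;
      default:
        reply = static_cast<redisReply*>(
            redisCommand(c, "HGETALL h%d", i % 1000));
    }
    if (reply) freeReplyObject(reply);
  }
  for (redisContext* c : conns) redisFree(c);
  return 0;
}
```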
