[performance] - TopicOperator capacity test case #10050

Merged: 6 commits, May 15, 2024

Member

@see-quick see-quick commented May 1, 2024

Type of change

  • Enhancement / new feature
  • Refactoring

Description

TL;DR: This PR adds a capacity test case to measure and understand the performance of the Topic Operator under varying conditions.

Findings:

Multi-node (ZK-based)

+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+
| Experiment | IN: MAX QUEUE SIZE | IN: MAX BATCH SIZE      | IN: MAX BATCH LINGER (ms) | OUT: Successful KafkaTopics Created | jvm_memory_used_megabytes_total.txt | strimzi_update_status_duration_seconds_max.txt | strimzi_reconciliations_max_queue_size.txt | system_load_average_per_core.txt | strimzi_reconciliations_max_batch_size.txt | strimzi_reconciliations_duration_seconds_max.txt | strimzi_create_topics_duration_seconds_max.txt | strimzi_describe_configs_duration_seconds_max.txt | strimzi_total_time_spend_on_uto_event_queue_duration_seconds.txt |
| 1          | 2147483647         | 100                     | 100                       | 3200                                | 215.26356                           | 0.219718797                                    | 56415.0                                    | 0.605                            | 100.0                                      | 22.554448171                                     | 8.737679829                                    | 0.201271031                                       | 102240.315236831                                                 |
| 2          | 2147483647         | 10                      | 1                         | 3200                                | 214.598832                          | 0.220416525                                    | 56232.0                                    | 0.38875                          | 10.0                                       | 2.898720235                                      | 1.531996913                                    | 0.061099615                                       | 12172.831177297001                                               |
| 3          | 2147483647         | 50                      | 100                       | 4800                                | 254.322944                          | 0.206390701                                    | 188946.0                                   | 1.34625                          | 50.0                                       | 62.053237627                                     | 4.362440202                                    | 60.049029957                                      | 63501.666334768                                                  |
| 4          | 2147483647         | 100                     | 500                       | 4000                                | 221.194208                          | 0.361124567                                    | 94535.0                                    | 0.565                            | 100.0                                      | 15.49413541                                      | 7.430776523                                    | 0.247871354                                       | 105783.41873366499                                               |
| 5          | 2147483647         | 500                     | 1000                      | 5500                                | 338.318472                          | 0.279250863                                    | 244963.0                                   | 1.22125                          | 500.0                                      | 76.091719096                                     | 60.006696895                                   | 60.049583745                                      | 663336.367527252                                                 |
| 6          | 2147483647         | 1000                    | 2000                      | 4300                                | 273.048416                          | 0.254702686                                    | 153003.0                                   | 1.29                             | 1000.0                                     | 90.530778363                                     | 10.867806866                                   | 60.07707421                                       | 1109098.767682166                                                |
+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+

Multi-node (KRaft-based)

+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+
| Experiment | IN: MAX QUEUE SIZE | IN: MAX BATCH SIZE      | IN: MAX BATCH LINGER (ms) | OUT: Successful KafkaTopics Created | jvm_memory_used_megabytes_total.txt | strimzi_update_status_duration_seconds_max.txt | strimzi_reconciliations_max_queue_size.txt | system_load_average_per_core.txt | strimzi_reconciliations_max_batch_size.txt | strimzi_reconciliations_duration_seconds_max.txt | strimzi_create_topics_duration_seconds_max.txt | strimzi_describe_configs_duration_seconds_max.txt | strimzi_total_time_spend_on_uto_event_queue_duration_seconds.txt |
| 1          | 2147483647         | 100                     | 100                       | 5600                                | 195.439096                          | 0.22489477                                     | 58228.0                                    | 0.79                             | 100.0                                      | 1.784441551                                      | 0.139120398                                    | 0.063826924                                       | 40307.747651726                                                  |
| 2          | 2147483647         | 10                      | 1                         | 4300                                | 187.411216                          | 0.40710244                                     | 33721.0                                    | 0.765                            | 10.0                                       | 0.52696775                                       | 0.093613257                                    | 0.071502803                                       | 4324.554014178                                                   |
| 3          | 2147483647         | 50                      | 100                       | 4800                                | 220.39176                           | 0.275977841                                    | 42992.0                                    | 0.71                             | 50.0                                       | 1.234386444                                      | 0.248006673                                    | 0.067473604                                       | 20203.399965616998                                               |
| 4          | 2147483647         | 100                     | 500                       | 5700                                | 203.301648                          | 0.127244311                                    | 66151.0                                    | 0.78125                          | 100.0                                      | 2.267170077                                      | 0.168008125                                    | 0.078424406                                       | 42221.078131318005                                               |
| 5          | 2147483647         | 500                     | 1000                      | 5800                                | 374.976736                          | 0.239079959                                    | 129444.0                                   | 1.28125                          | 500.0                                      | 75.606669081                                     | 0.151590669                                    | 60.064203694                                      | 476065.79083824                                                  |
| 6          | 2147483647         | 1000                    | 2000                      | 5400                                | 306.997096                          | 0.224739169                                    | 134374.0                                   | 1.48                             | 1000.0                                     | 92.287175352                                     | 0.15948684                                     | 60.063947644                                      | 835873.344649334                                                 |
+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+
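
For a quick side-by-side reading of the two multi-node tables above, here is a minimal Python sketch. The values are copied from the `OUT: Successful KafkaTopics Created` and `strimzi_create_topics_duration_seconds_max.txt` columns; the variable and function names are illustrative only. Each experiment is a single run, so the differences should be treated as indicative rather than conclusive.

```python
# Values copied from the ZK-based and KRaft-based tables, experiments 1-6.
zk_topics = [3200, 3200, 4800, 4000, 5500, 4300]
kraft_topics = [5600, 4300, 4800, 5700, 5800, 5400]
zk_create_max_s = [8.737679829, 1.531996913, 4.362440202,
                   7.430776523, 60.006696895, 10.867806866]
kraft_create_max_s = [0.139120398, 0.093613257, 0.248006673,
                      0.168008125, 0.151590669, 0.15948684]


def topic_deltas(zk, kraft):
    """Return KRaft minus ZK topic counts, one entry per experiment."""
    return [k - z for z, k in zip(zk, kraft)]


deltas = topic_deltas(zk_topics, kraft_topics)
rows = zip(deltas, zk_create_max_s, kraft_create_max_s)
for experiment, (delta, zk_s, kraft_s) in enumerate(rows, start=1):
    print(f"experiment {experiment}: topics delta {delta:+d}, "
          f"create max {zk_s:.3f}s (ZK) vs {kraft_s:.3f}s (KRaft)")
```

In these runs KRaft created at least as many topics as ZK in every experiment, and its maximum topic-creation duration stayed well under one second.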

Testing farm ARM

Use Case: capacityUseCase
+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+
| Experiment | IN: MAX QUEUE SIZE | IN: MAX BATCH SIZE      | IN: MAX BATCH LINGER (ms) | OUT: Successful KafkaTopics Created | jvm_memory_used_megabytes_total.txt | strimzi_update_status_duration_seconds_max.txt | strimzi_reconciliations_max_queue_size.txt | system_load_average_per_core.txt | strimzi_reconciliations_max_batch_size.txt | strimzi_reconciliations_duration_seconds_max.txt | strimzi_create_topics_duration_seconds_max.txt | strimzi_describe_configs_duration_seconds_max.txt | strimzi_total_time_spend_on_uto_event_queue_duration_seconds.txt |
| 1          | 2147483647         | 100                     | 100                       | 4500                                | 731.639096                          | 0.098367982                                    | 78873.0                                    | 0.20421875                       | 100.0                                      | 60.754216112                                     | 60.002203204                                   | 60.000849276                                      | 108018.144774452                                                 |
| 2          | 2147483647         | 10                      | 1                         | 2800                                | 209.412064                          | 0.091291686                                    | 46315.0                                    | 0.216875                         | 10.0                                       | 60.092980774                                     | 0.075462826                                    | 60.000769675                                      | 6332.63210499                                                    |
| 3          | 2147483647         | 50                      | 100                       | 4400                                | 209.128448                          | 0.045046907                                    | 79613.0                                    | 0.2634375                        | 50.0                                       | 60.451420408                                     | 0.113813242                                    | 60.001154181                                      | 53516.134807393006                                               |
| 4          | 2147483647         | 100                     | 500                       | 5300                                | 288.959168                          | 0.101506987                                    | 92527.0                                    | 0.26671875                       | 100.0                                      | 60.828857234                                     | 0.195412376                                    | 60.00111564                                       | 118558.314525059                                                 |
| 5          | 2147483647         | 500                     | 1000                      | 4300                                | 272.103896                          | 0.049253628                                    | 73222.0                                    | 0.27296875                       | 500.0                                      | 63.1228414                                       | 0.190617895                                    | 60.001418996                                      | 350705.097103311                                                 |
| 6          | 2147483647         | 1000                    | 2000                      | 3600                                | 832.115216                          | 0.057186219                                    | 55337.0                                    | 0.2925                           | 1000.0                                     | 67.99075671                                      | 0.122869061                                    | 60.002316816                                      | 289868.954987523                                                 |
+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+

Testing farm Intel

Use Case: capacityUseCase
+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+
| Experiment | IN: MAX QUEUE SIZE | IN: MAX BATCH SIZE      | IN: MAX BATCH LINGER (ms) | OUT: Successful KafkaTopics Created | jvm_memory_used_megabytes_total.txt | strimzi_update_status_duration_seconds_max.txt | strimzi_reconciliations_max_queue_size.txt | system_load_average_per_core.txt | strimzi_reconciliations_max_batch_size.txt | strimzi_reconciliations_duration_seconds_max.txt | strimzi_create_topics_duration_seconds_max.txt | strimzi_describe_configs_duration_seconds_max.txt | strimzi_total_time_spend_on_uto_event_queue_duration_seconds.txt |
| 1          | 2147483647         | 100                     | 100                       | 4100                                | 313.79264                           | 0.270303261                                    | 72879.0                                    | 0.120625                         | 100.0                                      | 60.713250828                                     | 60.001026375                                   | 60.001035712                                      | 23303.511383901                                                  |
| 2          | 2147483647         | 10                      | 1                         | 2800                                | 197.653832                          | 0.143603987                                    | 46974.0                                    | 0.15364583333333334              | 10.0                                       | 60.079330599                                     | 1.024587598                                    | 60.00056892                                       | 6277.153443556                                                   |
| 3          | 2147483647         | 50                      | 100                       | 4100                                | 202.440888                          | 0.086903046                                    | 67232.0                                    | 0.12760416666666666              | 50.0                                       | 60.286231975                                     | 60.001271054                                   | 60.000842754                                      | 50124.335254043                                                  |
| 4          | 2147483647         | 100                     | 500                       | 4900                                | 821.440424                          | 0.099006388                                    | 84448.0                                    | 0.16447916666666665              | 100.0                                      | 60.687474644                                     | 0.078397415                                    | 60.001052049                                      | 112642.51520577799                                               |
| 5          | 2147483647         | 500                     | 1000                      | 3400                                | 209.177632                          | 0.119279889                                    | 52186.0                                    | 0.14885416666666665              | 500.0                                      | 63.252552456                                     | 60.00182004                                    | 60.001381924                                      | 290553.26413272496                                               |
| 6          | 2147483647         | 1000                    | 2000                      | 4100                                | 1101.374072                         | 0.128018699                                    | 64970.0                                    | 0.1259375                        | 1000.0                                     | 66.056154521                                     | 0.082030432                                    | 60.002737845                                      | 406052.291981094                                                 |
+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+
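
All four result tables share the same layout, so they can be loaded into plain dictionaries for further analysis. The sketch below is one possible way to do that; `parse_ascii_table` is a hypothetical helper (not part of the test suite), and `SAMPLE` is a trimmed copy of the first three rows and three columns of the Multi-node (ZK-based) table.

```python
def parse_ascii_table(text):
    """Parse a '|'-delimited ASCII table into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +---+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first '|' line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Trimmed excerpt of the ZK-based table (values copied verbatim).
SAMPLE = """
+------------+-------------------------------------+--------------------------------------------------+
| Experiment | OUT: Successful KafkaTopics Created | strimzi_reconciliations_duration_seconds_max.txt |
| 1          | 3200                                | 22.554448171                                     |
| 2          | 3200                                | 2.898720235                                      |
| 3          | 4800                                | 62.053237627                                     |
+------------+-------------------------------------+--------------------------------------------------+
"""

rows = parse_ascii_table(SAMPLE)
best = max(rows, key=lambda r: int(r["OUT: Successful KafkaTopics Created"]))
print(f"Most topics created in experiment {best['Experiment']}")
```

All values are kept as strings, so numeric columns need an explicit `int(...)` or `float(...)` conversion before comparison.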

Checklist

  • Write tests
  • Make sure all tests pass

@see-quick
Member Author

/packit test --labels performance-topic-operator-capacity

@see-quick see-quick self-assigned this May 1, 2024
@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testCapacity

@strimzi-ci

▶️ Build started - check Jenkins for more info. ▶️

@see-quick
Member Author

/packit test --labels performance-topic-operator-capacity

@see-quick see-quick added this to the 0.41.0 milestone May 2, 2024
@strimzi-ci

✔️ Test Summary ✔️

TEST_PROFILE: performance
GROUPS:
TEST_CASE: TopicOperatorPerformance#testCapacity
TOTAL: 6
PASS: 6
FAIL: 0
SKIP: 0
BUILD_NUMBER: 76
OCP_VERSION: 4.15
BUILD_IMAGES: false
FIPS_ENABLED: false
PARALLEL_COUNT: 5
EXCLUDED_GROUPS: loadbalancer,nodeport,olm

@see-quick
Member Author

/packit test --labels performance-topic-operator-capacity

@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testCapacity --env=STRIMZI_USE_KRAFT_IN_TESTS=true

@strimzi-ci

▶️ Build started - check Jenkins for more info. ▶️

@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testCapacity --env=STRIMZI_USE_KRAFT_IN_TESTS=true

@strimzi-ci

▶️ Build started - check Jenkins for more info. ▶️

@strimzi-ci

✔️ Test Summary ✔️

TEST_PROFILE: performance
GROUPS:
TEST_CASE: TopicOperatorPerformance#testCapacity
TOTAL: 6
PASS: 6
FAIL: 0
SKIP: 0
BUILD_NUMBER: 78
OCP_VERSION: 4.15
BUILD_IMAGES: false
FIPS_ENABLED: false
PARALLEL_COUNT: 5
EXCLUDED_GROUPS: loadbalancer,nodeport,olm
ENV_VARIABLES: STRIMZI_USE_KRAFT_IN_TESTS=true

@see-quick see-quick marked this pull request as ready for review May 2, 2024 19:50
@see-quick see-quick requested a review from a team May 2, 2024 19:50
Member

@Frawless Frawless left a comment

LGTM, just one question - do we have some summary readme or something about the perf tests? I think it might be useful not just for us but also for users.

Contributor

@fvaleri fvaleri left a comment

Thanks for adding these additional performance tests.

A KafkaTopic creation event is much heavier to handle than a configuration change event. I was wondering if we should use a workload with both event types in parallel. With that, I get the best performance with batch.size=100 and linger.ms=10. Wdyt?
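
To make the interaction between batch size and linger concrete, here is a minimal Python sketch of a generic "size-or-linger" batching policy: a batch is closed when it reaches `max_batch_size` events or when `max_linger_ms` has passed since its first event. This is an assumption about the general technique, not the Topic Operator's actual implementation; the function name and the arrival times are hypothetical.

```python
def form_batches(arrivals_ms, max_batch_size, max_linger_ms):
    """Group event arrival times (ms) into batches using a size-or-linger rule."""
    batches = []
    current = []
    deadline = None  # linger deadline of the batch currently being filled
    for t in sorted(arrivals_ms):
        # Close the open batch if this event arrives after its linger deadline.
        if current and t > deadline:
            batches.append(current)
            current = []
        # The first event of a new batch starts the linger timer.
        if not current:
            deadline = t + max_linger_ms
        current.append(t)
        # Close the batch as soon as it is full.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # flush whatever is left at the end
    return batches


# Hypothetical arrival times: a burst, a short pause, then a straggler.
batches = form_batches([0, 1, 2, 3, 50, 51, 200], max_batch_size=3, max_linger_ms=10)
print(batches)
```

Under this model, a larger batch size only helps while events arrive faster than the linger window; otherwise the linger timeout closes batches early and mostly adds latency.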

@see-quick
Member Author

> LGTM, just one question - do we have some summary readme or something about the perf tests? I think it might be useful not just for us but also for users.

Not yet. I can write an overview of them in development-docs/TESTING.md (preferably in a follow-up PR).

@see-quick
Member Author

see-quick commented May 6, 2024

> Thanks for adding these additional performance tests.
>
> A KafkaTopic creation event is much heavier to handle than a configuration change event. I was wondering if we should use a workload with both event types in parallel. With that, I get the best performance with batch.size=100 and linger.ms=10. Wdyt?

This follows the UserOperator capacity perf test, which aims to answer these questions:

  1. How much memory and CPU are needed to run such a load (i.e., OUT: KafkaUsers Created)?
  2. How many resources, and which configuration (if not the default), does a user need to handle X KafkaUser CRs?

The findings in the description answer these questions for the Topic Operator. They could be documented somewhere, together with the User Operator findings.

> I was wondering if we should use a workload with both event types in parallel. With that, I get the best performance with batch.size=100 and linger.ms=10.

This is an interesting one, and I think it's another use case, right? Currently, I am just trying to hit the capacity of the Topic Operator, similar to what we do for the UO. So maybe we could add such a use case as another test case to see how it performs?

Note 1: I also think it's important to vary other INPUT parameters (e.g., the number of broker/controller replicas, which we currently do not vary because we use the default 3B-3C setup, or an input parameter to wait x minutes for a batch of KafkaTopics to be created) to see how they affect the OUTPUT metrics (e.g., OUT: Successful KafkaTopics Created).

Note 2: There seems to be an issue when we reach around 4k KafkaTopic CRs because of [1]; fixing it might let us scale even further :)

[1] - #10054

@fvaleri
Contributor

fvaleri commented May 6, 2024

> This follows the UserOperator capacity perf test, which aims to answer these questions:

Ok this makes sense, thanks.

> This is an interesting one, and I think it's another use case, right? Currently, I am just trying to hit the capacity of the Topic Operator, similar to what we do for the UO. So maybe we could add such a use case as another test case to see how it performs?

Yes, this is the one I used in my custom tests to simulate the workload of a busy cluster, which should give more general tuning recommendations. If you confirm my finding, we may even think about changing the default value of linger.ms.

> Note 2: There seems to be an issue when we reach around 4k KafkaTopic CRs because of [1]; fixing it might let us scale even further :)

I'll look into that. Thanks for raising.

@scholzj scholzj modified the milestones: 0.41.0, 0.42.0 May 7, 2024
@see-quick see-quick added the ready for merge label and removed the needs review label May 15, 2024
@see-quick see-quick merged commit a900d30 into strimzi:main May 15, 2024
16 checks passed
Labels: performance, ready for merge

6 participants