Project 4 Custos Testing
To understand the performance of the client application leveraging Custos, we are performing a soak test, a stress test, and a fault tolerance test.
The cluster contains 3 nodes:
- Master - m1.medium instance with 6 cores, 16GB RAM, and 60GB storage
- Workerone - m1.medium instance with 6 cores, 16GB RAM, and 60GB storage
- Workertwo - m1.medium instance with 6 cores, 16GB RAM, and 60GB storage
Namespaces used for the project:
- custos
- keycloak
- olm
- operators
- vault
- cattle-system
- cert-manager
- fleet-system
- ingress-nginx
- kube-node-lease
- kube-public
- kube-system
- security-scan
Soak testing stresses the system over an extended period, simulating a typical system that runs for a long time. It can expose design flaws such as memory leaks, storage running out, and network bottlenecks. While the soak test is running, we record the response status, i.e. the correctness of each response, and the response time, so that we can check its consistency.
By performing a soak test we hope to find out whether the system degrades over time under heavy load. The two most common operations in our application are registering a user and adding a user to a group, so we ran a soak test for ~8 hours that registered 10000 users and added each of those 10000 users to a group.
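A minimal sketch of such a soak loop is shown below. The base URL, endpoint paths, credentials, and group name are placeholders for illustration only, not the actual Custos user-management and group-management API; the real endpoints and auth headers would be substituted in.

```python
import csv
import time
import requests

BASE_URL = "https://custos.example.org/api/v1"   # placeholder, not the real endpoint
AUTH = ("client-id", "client-secret")            # placeholder tenant credentials

def timed_call(method, path, payload):
    """Issue one request and return (status_code, elapsed_ms)."""
    start = time.monotonic()
    resp = requests.request(method, f"{BASE_URL}{path}", json=payload, auth=AUTH, timeout=30)
    return resp.status_code, (time.monotonic() - start) * 1000.0

with open("soak_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["index", "operation", "status", "latency_ms"])
    for i in range(10000):
        username = f"soak-user-{i}"
        # Register a user, then add the same user to a group.
        status, ms = timed_call("POST", "/user-management/user", {"username": username})
        writer.writerow([i, "register_user", status, f"{ms:.1f}"])
        status, ms = timed_call("POST", "/group-management/groups/soak-group/members",
                                {"username": username})
        writer.writerow([i, "add_to_group", status, f"{ms:.1f}"])
        time.sleep(2.5)  # pace the run so the 20000 calls spread over roughly 8 hours
```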
All 20000 API requests were successful; however, we observed some inconsistency in response times over the course of the run.
Here you can observe that some API requests took more than 1500 ms. Since the response times are widely spread, we can say that a consistent response time is not maintained.
We also noticed that the response time differs between APIs: here we can observe that adding a user to a group takes longer than registering a user.
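One way to make this comparison concrete is to compute per-operation latency percentiles from the recorded samples. The sketch below assumes the same CSV file and column layout as the soak-loop sketch above.

```python
import csv
from collections import defaultdict
from statistics import median, quantiles

latencies = defaultdict(list)
with open("soak_results.csv") as f:
    for row in csv.DictReader(f):
        latencies[row["operation"]].append(float(row["latency_ms"]))

for op, samples in latencies.items():
    p95 = quantiles(samples, n=20)[18]   # 95th percentile
    print(f"{op}: p50={median(samples):.0f} ms  p95={p95:.0f} ms  max={max(samples):.0f} ms")
```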
To mitigate this issue, a possible plan is to increase the minimum number of pods for the group management service. This could help achieve similar response times for both the user management service and the group management service.
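Assuming the group management service is scaled by a HorizontalPodAutoscaler in the custos namespace (the HPA name below is a placeholder), the minimum replica count could be raised with the Kubernetes Python client, for example:

```python
from kubernetes import client, config

config.load_kube_config()  # uses the cluster's kubeconfig

autoscaling = client.AutoscalingV1Api()
# Raise the floor on replicas for the group management service so that
# requests do not queue behind a single pod during sustained load.
autoscaling.patch_namespaced_horizontal_pod_autoscaler(
    name="group-management-service",   # placeholder HPA name
    namespace="custos",
    body={"spec": {"minReplicas": 3}},
)
```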
To gauge the limits of the system, we plan to generate load until the system breaks or starts to show signs of consistent failure.
We plan to increase the load by an order of magnitude at each step: starting with 100 users, then 1000, then 10000, and so forth.
While the system is under this generated stress, we record the status code and response time of each request. Ideally, even under stress, we should see consistent response times with a low failure rate. Once we have established the upper limit of our system, we can allocate resources and plan capacity according to the observed traffic.
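A rough sketch of that ramp, reusing the timed_call helper from the soak-test snippet above, is shown below. The endpoint, the thread-per-user concurrency model, and the 5% failure threshold are all assumptions for illustration.

```python
import concurrent.futures

def stress_level(concurrent_users):
    """Fire one register-user request per simulated user and summarise the outcome."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        futures = [
            pool.submit(timed_call, "POST", "/user-management/user",
                        {"username": f"stress-user-{concurrent_users}-{i}"})
            for i in range(concurrent_users)
        ]
        results = [f.result() for f in futures]
    failures = sum(1 for status, _ in results if status >= 400)
    worst_ms = max(ms for _, ms in results)
    print(f"{concurrent_users} users: {failures} failures, worst latency {worst_ms:.0f} ms")
    return failures

# Increase the load by an order of magnitude until failures become consistent.
for level in (100, 1000, 10000):
    if stress_level(level) > level * 0.05:   # stop once more than 5% of requests fail
        break
```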