Project 4 Custos Testing
To understand the performance of the client application leveraging Custos, we are performing a soak test, a stress test, and a fault tolerance test.
The cluster contains 3 nodes:
- Master - m1.medium instance with 6 cores, 16GB RAM, and 60GB storage
- Workerone - m1.medium instance with 6 cores, 16GB RAM, and 60GB storage
- Workertwo - m1.medium instance with 6 cores, 16GB RAM, and 60GB storage
Namespaces used for the project:
- custos
- keycloak
- olm
- operators
- vault
- cattle-system
- cert-manager
- fleet-system
- ingress-nginx
- kube-node-lease
- kube-public
- kube-system
- security-scan
Soak testing stresses the system over an extended period, simulating a typical system that runs for a long time. It can expose design flaws such as memory leaks, storage running out, and network bottlenecks. While the soak test is running, we record the response status, i.e. the correctness of each response, and the response time, so that we can check its consistency.
By performing a soak test we hope to find out whether the system degrades over time under heavy load. The two most common operations in our application are registering a user and adding a user to a group, so we ran a soak test for ~8 hours that registered 10000 users and added each of those 10000 users to a group.
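A minimal sketch of such a soak loop is shown below. The base URL, endpoint paths, credentials, and group name are placeholders for illustration only, not the actual Custos user-management and group-management API; the real endpoints and auth headers would be substituted in.

```python
import csv
import time
import requests

BASE_URL = "https://custos.example.org/api/v1"   # placeholder, not the real endpoint
AUTH = ("client-id", "client-secret")            # placeholder tenant credentials

def timed_call(method, path, payload):
    """Issue one request and return (status_code, elapsed_ms)."""
    start = time.monotonic()
    resp = requests.request(method, f"{BASE_URL}{path}", json=payload, auth=AUTH, timeout=30)
    return resp.status_code, (time.monotonic() - start) * 1000.0

with open("soak_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["index", "operation", "status", "latency_ms"])
    for i in range(10000):
        username = f"soak-user-{i}"
        # Register a user, then add the same user to a group.
        status, ms = timed_call("POST", "/user-management/user", {"username": username})
        writer.writerow([i, "register_user", status, f"{ms:.1f}"])
        status, ms = timed_call("POST", "/group-management/groups/soak-group/members",
                                {"username": username})
        writer.writerow([i, "add_to_group", status, f"{ms:.1f}"])
        time.sleep(2.5)  # pace the run so the 20000 calls spread over roughly 8 hours
```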
All 20000 API requests were successful; however, we observed some inconsistency in response times over the course of the run.
Here you can observe that some API requests took more than 1500 ms. Since the response times are widely spread, we can say that a consistent response time is not maintained.
We also noticed that the response time differs between APIs: here we can observe that adding a user to a group takes longer than registering a user.
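One way to make this comparison concrete is to compute per-operation latency percentiles from the recorded samples. The sketch below assumes the same CSV file and column layout as the soak-loop sketch above.

```python
import csv
from collections import defaultdict
from statistics import median, quantiles

latencies = defaultdict(list)
with open("soak_results.csv") as f:
    for row in csv.DictReader(f):
        latencies[row["operation"]].append(float(row["latency_ms"]))

for op, samples in latencies.items():
    p95 = quantiles(samples, n=20)[18]   # 95th percentile
    print(f"{op}: p50={median(samples):.0f} ms  p95={p95:.0f} ms  max={max(samples):.0f} ms")
```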
To mitigate this issue, a possible plan is to increase the minimum number of pods for the group management service. This could help achieve similar response times for both the user management service and the group management service.
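Assuming the group management service is scaled by a HorizontalPodAutoscaler in the custos namespace (the HPA name below is a placeholder), the minimum replica count could be raised with the Kubernetes Python client, for example:

```python
from kubernetes import client, config

config.load_kube_config()  # uses the cluster's kubeconfig

autoscaling = client.AutoscalingV1Api()
# Raise the floor on replicas for the group management service so that
# requests do not queue behind a single pod during sustained load.
autoscaling.patch_namespaced_horizontal_pod_autoscaler(
    name="group-management-service",   # placeholder HPA name
    namespace="custos",
    body={"spec": {"minReplicas": 3}},
)
```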
To gauge the limits of the system, we plan to generate load until the system breaks or starts to show signs of consistent failure.
We plan to increase the load by an order of magnitude at each step: starting with 100 users, then 1000, then 10000, and so forth.
While the system is under this generated stress, we record the status code and response time of each request. Ideally, even under stress, we should see consistent response times with a low failure rate. Once we have established the upper limit of our system, we can allocate resources and plan capacity according to the observed traffic.
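A rough sketch of that ramp, reusing the timed_call helper from the soak-test snippet above, is shown below. The endpoint, the thread-per-user concurrency model, and the 5% failure threshold are all assumptions for illustration.

```python
import concurrent.futures

def stress_level(concurrent_users):
    """Fire one register-user request per simulated user and summarise the outcome."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        futures = [
            pool.submit(timed_call, "POST", "/user-management/user",
                        {"username": f"stress-user-{concurrent_users}-{i}"})
            for i in range(concurrent_users)
        ]
        results = [f.result() for f in futures]
    failures = sum(1 for status, _ in results if status >= 400)
    worst_ms = max(ms for _, ms in results)
    print(f"{concurrent_users} users: {failures} failures, worst latency {worst_ms:.0f} ms")
    return failures

# Increase the load by an order of magnitude until failures become consistent.
for level in (100, 1000, 10000):
    if stress_level(level) > level * 0.05:   # stop once more than 5% of requests fail
        break
```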