
Project 4 Custos Testing


Test Plan

To understand the performance of the client application leveraging Custos, we performed a soak test, a stress test, and a fault tolerance test.

Setup Configuration

Fig - Rancher Dashboard

The cluster contains 3 nodes:

  • Master - m1.medium instance with 6 cores, 16 GB RAM, and 60 GB storage
  • Workerone - m1.medium instance with 6 cores, 16 GB RAM, and 60 GB storage
  • Workertwo - m1.medium instance with 6 cores, 16 GB RAM, and 60 GB storage

Fig - Cluster Nodes

Memory Allocated for each application

Fig - Kubernetes services

Soak Testing

Soak testing stresses the system over an extended period, simulating a typical deployment that runs for a long time. It can surface design flaws such as memory leaks, storage running out, and network bottlenecks. While the soak test is running, we record the response status, i.e. the correctness of each response, and the consistency of the response time.

By performing the soak test we hope to find out whether the system degrades over time under heavy load. The two most common operations in our application are registering a user and adding a user to a group, so we ran the soak test for ~8 hours, registering 10000 users and adding those 10000 users to a group.
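
A minimal sketch of the kind of load script used for this soak test is shown below, written with Locust. The endpoint paths, payloads, and the pre-created group ID are illustrative placeholders, not the exact Custos API routes used in our runs.

```python
import uuid

from locust import HttpUser, task, between


class CustosSoakUser(HttpUser):
    # Pause 1-2 seconds between iterations so the load stays steady over hours.
    wait_time = between(1, 2)

    @task
    def register_and_add_to_group(self):
        username = f"soak-user-{uuid.uuid4().hex[:8]}"

        # 1. Register a new user (placeholder path and payload).
        self.client.post(
            "/user-management/v1.0.0/user/profile",
            json={"username": username, "first_name": "Soak", "last_name": "Test"},
            name="register user",
        )

        # 2. Add that user to a pre-created group (placeholder path and payload).
        self.client.post(
            "/group-management/v1.0.0/group/membership",
            json={"group_id": "soak-test-group", "username": username},
            name="add user to group",
        )
```

A script like this can then be run headless for the whole window, e.g. `locust -f soak_test.py --host https://<custos-host> --users 10 --spawn-rate 1 --run-time 8h --headless`.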

Fig - Soak testing report

Results

We observed that all 20000 API requests were successful. However, we noticed some inconsistency in response time over the course of the run.

Here you can observe that some API requests took more than 1500 ms. Since the response times are widely spread, we can say that a consistent response time is not maintained.

Fig - Response time overview

We also noticed that the response time differs between APIs. Here we can observe that adding a user to a group takes longer than registering a user.

Fig - Response time distribution

Mitigation

To mitigate these issues, a possible plan is to increase the minimum number of pods for the group management service. This could help achieve a similar response time for both the user management and group management services.
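
As a rough sketch of this mitigation, assuming the group management pods are managed by a Deployment (the Deployment name and namespace below are assumptions, not the actual Custos release values), the replica count could be raised with the Kubernetes Python client:

```python
from kubernetes import client, config


def scale_group_management(replicas: int = 3) -> None:
    """Raise the replica count of the group management Deployment."""
    config.load_kube_config()  # uses the local kubeconfig (e.g. downloaded from Rancher)
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name="group-management-core-service-server",  # assumed Deployment name
        namespace="custos",                           # assumed namespace
        body={"spec": {"replicas": replicas}},
    )


if __name__ == "__main__":
    scale_group_management(replicas=3)
```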

Stress Testing

To gauge the limit of the system, we plan to generate load until it breaks or starts to show signs of consistent failure.

We plan to increase the load by an order of magnitude at each step: starting with 1000 concurrent users, then 10000, and so forth.

While the system is under this generated stress, we record the status code and response time of every request. Ideally, even under stress we should see consistent response times with a low failure rate. Once we have established the upper limit of the system, we can allocate resources and plan according to the observed traffic.
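
The sketch below shows how such a stepped load profile can be expressed with Locust's LoadTestShape; the step durations and spawn rate are illustrative values, not the exact settings used in our runs.

```python
from locust import LoadTestShape


class StepLoadShape(LoadTestShape):
    """Step the concurrent user count up by roughly an order of magnitude."""

    # (end time in seconds, target user count) -- illustrative values.
    steps = [
        (600, 1000),
        (1200, 3000),
        (1800, 10000),
    ]

    def tick(self):
        run_time = self.get_run_time()
        for end_time, users in self.steps:
            if run_time < end_time:
                return users, 100  # (user count, spawn rate per second)
        return None  # returning None stops the test after the last step
```

Placing this class in the same locustfile as the user class above makes Locust follow the stepped profile instead of a fixed user count.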

Results

We started with 3000 users on the system and quickly realised that we had already passed the upper limit, as 60% of the requests failed.

Report for 3000 concurrent users running the application

Fig - Stress test report for 3000 users
Fig - Stress test statistics
Fig - Response time overview

We then reduced the user count to 1000 to find the maximum number of users our application can support. But as you can see below, even with 1000 concurrent users the application could not handle the load; the API failure rate was around 54%.

Report for 1000 concurrent users running the application

Fig - Stress test report for 1000 users
Fig - Stress test statistics
Fig - Response time overview

Lastly, we reduced the load to as low as 50 users, and the system still reported an 8% API failure rate.

Fig - Stress test report for 50 users
Fig - Stress test statistics
Fig - Response time overview

Mitigation

We observed that no horizontal pod autoscaling is configured for the services, so the pods do not scale with the load. If horizontal pod autoscaling were enabled, Kubernetes would scale out the individual deployments that are causing the bottleneck.

Below you can see that the user-profile and iam-admin core services are creating the bottleneck, yet their number of pods remains the same.

Fig - Memory bottleneck
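
As a sketch of this mitigation, a CPU-based HorizontalPodAutoscaler could be attached to the bottlenecked Deployments via the Kubernetes Python client. The Deployment names, namespace, and thresholds below are assumptions for illustration.

```python
from kubernetes import client, config


def create_cpu_hpa(deployment: str, namespace: str = "custos") -> None:
    """Attach a CPU-utilisation HPA to the given Deployment."""
    config.load_kube_config()
    hpa = client.V1HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name=f"{deployment}-hpa"),
        spec=client.V1HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V1CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name=deployment
            ),
            min_replicas=1,
            max_replicas=5,                        # illustrative upper bound
            target_cpu_utilization_percentage=70,  # illustrative threshold
        ),
    )
    client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(namespace, hpa)


if __name__ == "__main__":
    # Assumed names for the two bottlenecked Deployments.
    for name in ("user-profile-core-service-server", "iam-admin-core-service-server"):
        create_cpu_hpa(name)
```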

Fault Tolerance Testing

To test whether the services keep working even when an instance/pod is down, we manually injected a failure into the deployment. We noticed that the system recovered from the failure immediately and its performance was similar to the state before the failure was injected.
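
The failure injection itself is simple; the sketch below deletes one pod of a service with the Kubernetes Python client so that the owning Deployment has to recreate it. The label selector and namespace are assumptions for illustration.

```python
from kubernetes import client, config


def kill_one_pod(label_selector: str = "app=user-profile", namespace: str = "custos") -> None:
    """Delete one matching pod and rely on its Deployment to replace it."""
    config.load_kube_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(namespace, label_selector=label_selector)
    if not pods.items:
        raise RuntimeError(f"no pods match {label_selector!r} in namespace {namespace!r}")
    victim = pods.items[0].metadata.name
    core.delete_namespaced_pod(victim, namespace)
    print(f"Deleted pod {victim}; the Deployment should schedule a replacement.")


if __name__ == "__main__":
    kill_one_pod()
```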

Here you can see that, upon deletion of a pod, the Deployment is configured to create a new instance and add it back to the service.

Fig - Fault tolerance report