
JMeter Testing for Custos Deployment

Nirav Raje edited this page May 9, 2022 · 10 revisions

Overview - JMeter Testing for Custos Deployment using Custos Python SDK

Testing Strategy

  • We performed stress testing and load testing on the Custos microservices using JMeter. The purpose of stress testing is to detect the failure point of each Custos service: we keep increasing the number of concurrent user threads until the service fails. The purpose of load testing is to inspect the performance of the system when it is subjected to a significant load of requests over a certain period of time.
  • In our approach, we first identified the point of failure for each microservice/management client by increasing the concurrent load on it. We observed that all microservices/endpoints handled a load of 100 concurrent threads well.
  • We then increased the concurrent thread count in increments of 50 and noted the point at which requests started failing and the error rate rose significantly.
  • After noting the maximum concurrent thread count each service could handle, we increased the loop count in JMeter for stress testing. We stress tested the services with 2500 to 5000 requests (ramp-up period of 1 second) while running 100-150 concurrent threads.
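The error rates and throughput figures reported below come from JMeter's summary and aggregate listeners; the same numbers can be recomputed offline from the results file a non-GUI JMeter run writes (e.g. `jmeter -n -t <plan>.jmx -l <results>.jtl`). A minimal sketch, assuming JMeter's default CSV JTL format with `timeStamp` (epoch millis), `elapsed` (millis), and `success` columns; the file names are placeholders:

```python
import csv

def summarize_jtl(path):
    """Compute sample count, error %, and throughput from a JMeter CSV JTL file."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    total = len(rows)
    errors = sum(1 for r in rows if r["success"].lower() != "true")
    # timeStamp is the sample start in epoch millis; elapsed is its duration.
    start = min(int(r["timeStamp"]) for r in rows)
    end = max(int(r["timeStamp"]) + int(r["elapsed"]) for r in rows)
    duration_s = max((end - start) / 1000.0, 1e-9)
    return {
        "samples": total,
        "error_pct": round(100.0 * errors / total, 2),
        "throughput_per_s": round(total / duration_s, 2),
    }
```

This mirrors how the summary report derives error % (failed samples / total samples) and throughput (samples / elapsed wall-clock span), so the screenshots below can be cross-checked against the raw JTL files.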

User Management Client - Find Users by Limit

Tested Functionality: find_users() by limit

Failure Point Detection with 300 concurrent threads in JMeter

  • The find_users() gRPC servicer in the user management client was able to handle a load of up to 150 concurrent threads with 0% error and 200 concurrent threads with a low error rate of 1.16%.
  • However, when we increased the load to 300 concurrent threads, the error rate went up to 82% as the pods started restarting after processing just 30 requests.
  • Hence, our failure point for this endpoint is about 300 concurrent requests.

Result Table

Pods started restarting after processing just 30 requests when a load of 300 concurrent requests was enforced

[Screenshot: results table, 300 requests (failure point)]

Summary Report

300 concurrent requests, ~82% error rate

[Screenshot: summary report, 300 requests, ~82% error rate]

Stress Testing (5000 requests; 100 threads)

Aggregate Graph - Throughput: 6.5/sec, Error %: 0

[Screenshot: aggregate graph]

Response Time Graph

[Screenshot: response time graph]


User Management Client - Create User

Tested Functionalities: Register User, Enable User, Add User to Group, Share Entity with Users

Failure Point Detection

  • The gRPC servicer endpoints corresponding to register_user(), enable_user(), add_user_to_group(), and share_entity_with_users() performed well under concurrent request loads of 100, 150, and 200, but started failing beyond 250 requests.
  • The error rate spiked from 2% at 200 concurrent requests to 78% at 300.

Result Table

[Screenshot: create user results table, 300 requests]

Summary Report

Error rate increased from 2% at 200 requests to 78% at 300.

[Screenshot: summary report, 300 requests, 78% error rate]

Aggregate Graph - Throughput: 15.8/min, Error %: 1.20

[Screenshot: aggregate graph]

Response Time Graph

[Screenshot: response time graph]


User Management Client - Update User

Tested Functionality: update_user_profile()

Failure Point Detection

With a load of 300 concurrent requests, failures started appearing after about 100 requests had been processed, as can be seen below.

Result Table

[Screenshot: results table, update user requests failing after 101]

Summary Report

[Screenshot: summary report, 300 requests, ~70% error rate]

Stress Testing (500 requests; 100 threads)

Aggregate Graph - Throughput: 28.8/min, Error %: 0

[Screenshot: aggregate graph, 500 requests]

Response Time Graph

[Screenshot: response time graph, 500 requests]


Group Management Client - Create Group

Tested Functionality: create_group()

Failure Point Detection

The failure point was observed at a load of 300 concurrent requests, after 62 requests had been processed successfully.

Result Table

[Screenshot: results table, 300 requests, 68% error rate]

Summary Report

[Screenshot: summary report, 300 requests, 68% error rate]

Stress Testing (2500 requests; 50 threads)

Aggregate Graph - Throughput: 4.0/sec, Error %: 0

[Screenshot: aggregate graph]

Response Time Graph

[Screenshot: response time graph]


Group Management Client - Add User to Group

Tested Functionality: add_user_to_group()

Failure Point Detection

  • With 150 concurrent requests, the error rate was around 0.67%.
  • With 300 concurrent requests, the error rate was around 6.67%.
  • With 1000 concurrent requests, the error rate spiked up to ~70%.
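As a sanity check, these percentages map back to absolute failure counts, assuming JMeter's summary-report definition of error % as failed samples divided by total samples (a small helper written for illustration, not part of the test plan):

```python
def failed_samples(total_requests, error_rate_pct):
    """Convert a JMeter error percentage back to an absolute failure count."""
    return round(total_requests * error_rate_pct / 100.0)

# Reported add_user_to_group() runs:
print(failed_samples(150, 0.67))   # -> 1 failed request out of 150
print(failed_samples(300, 6.67))   # -> 20 failed requests out of 300
print(failed_samples(1000, 70.0))  # -> 700 failed requests out of 1000
```

So the jump from 300 to 1000 concurrent requests corresponds to roughly 20 versus 700 failed requests.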

Result Table

[Screenshot: results table, 300 requests, 6.67% error rate]

Summary Report

[Screenshot: summary report, 150 requests, 0.67% error rate]

[Screenshot: summary report, 300 requests, 6.67% error rate]

[Screenshot: summary report, 1000 requests, ~70% error rate]

Stress Testing (5000 requests; 250 threads)

Aggregate Graph - Throughput: 8.4/sec, Error %: 0

[Screenshot: aggregate graph, 5000-request stress test]

Graph Results

[Screenshot: graph results, 5000 requests]


Entity Management Client - Create Entity

Tested Functionality: create_entity()

Failure Point Detection

  • With 150 concurrent requests, the error rate was around 1.33%.
  • With 300 concurrent requests, the error rate spiked up to ~44%.

Result Table

[Screenshot: results table, 300 requests, ~44% error rate]

Summary Report

[Screenshot: summary report, 150 requests, 1.33% error rate]

[Screenshot: summary report, 300 requests, ~44% error rate]

Stress Testing (1000 requests; 125 threads)

Aggregate Graph - Throughput: 1.6/sec, Error %: 0

[Screenshot: aggregate graph, 1000 requests]

Response Time Graph

[Screenshot: response time graph, 1000 requests]


Sharing Management Client - Share Entity with Users

Tested Functionality: share_entity_with_users()

Failure Point Detection

With 300 concurrent requests, a breaking point was observed after 72 requests were processed.

Result Table

A large batch of requests failed after the first 72 were processed.

[Screenshot: results table, 300 requests (breaking point)]

Summary Report

[Screenshot: summary report, 300 requests, ~61% error rate]

Stress Testing (1000 requests; 50 threads)

Aggregate Graph - Throughput: 2.6/sec, Error %: 0

[Screenshot: aggregate graph]

Response Time Graph

[Screenshot: response time graph]


Sharing Management Client - Share Entity with Groups

Tested Functionality: share_entity_with_groups()

Failure Point Detection

With a load of 300 concurrent threads, requests started failing after 172 had been processed.

Result Table

[Screenshot: results table, 300 requests (breaking point)]

Summary Report

[Screenshot: summary report, 300 requests, ~17% error rate]

Stress Testing (2000 requests; 150 threads)

Aggregate Graph - Throughput: 6.0/sec, Error %: 0

[Screenshot: aggregate graph]

Response Time Graph

[Screenshot: response time graph]
