
Homework 2: Scalability


Kubernetes deployment

Prerequisites

  1. Minikube or a Kubernetes cluster - to install Minikube, check here

  2. kubectl - to install kubectl, check here

  3. helm - to install helm, check here

Deployment

Once you have the above dependencies installed correctly, we can set up Elves Forecast with the following steps:

  1. Clone the CloudElves repository and check out the deployments branch
git clone https://github.com/airavata-courses/CloudElves.git
cd CloudElves
git checkout deployments
cd deployments
  2. Run the deployment script (make sure you are running the command from within the CloudElves repo directory where all the config files exist)
sh deployments.sh
  3. If you want to delete all the deployments, you can do so by running the cleanup script
sh cleanup.sh

It may take a few minutes for all the services to come up. We can check the status using the command below:

kubectl get all -n elves

successful deployment

Note: The above deployment will install the Postgres database, RabbitMQ, and all the microservices that are part of Elves Forecast within the Kubernetes cluster. It will also create a namespace called 'elves' in which all the deployments take place. The cleanup script ensures all the created resources are deleted promptly.
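If you prefer to wait for readiness instead of repeatedly checking, the commands below are a minimal sketch (the 'elves' namespace comes from the deployment script; the 5-minute timeout is an arbitrary choice):

# Watch the pods in the elves namespace as they come up
kubectl get pods -n elves -w

# Or block until every pod reports Ready, with a 5-minute timeout
kubectl wait --for=condition=ready pod --all -n elves --timeout=300s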

Benchmarking microservices

If we consider our architecture, there are 4 microservices apart from the UI components - gateway, registry, ingestor, and forecast. Of these, the registry and the ingestor are the most process-intensive: the registry logs every user activity (write-intensive) and serves application logs for presenting user history (read-intensive), while the ingestor is responsible for downloading data from NEXRAD servers, plotting it, and serving the plots to the front end. The gateway, although an important core component, only handles authentication and orchestrates calls between the UI and the backend services, and the forecast service's response is mocked. Hence, the performance of our system mainly depends on how well the registry and ingestor microservices perform.

In order to benchmark these services, we carry out load tests with the following settings:

  1. Keeping the memory configuration constant and varying the load (i.e., the number of users and hence transactions per minute)
  2. Keeping the load constant and running it with different memory configurations (1 GB, 2 GB, and 4 GB)

Once we benchmark these services, we can arrive at a good estimate of how many user interactions we can handle concurrently. This can then be used to perform a comprehensive load test on a fully connected system (including all microservices).
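For illustration, a single run under these settings might be driven as follows (a sketch only; the deployment name, test plan file, and report folder are assumptions and not taken from the repository):

# Pin the memory configuration of the service under test, e.g. 2 GB for the registry
kubectl set resources deployment elves-registry -n elves --requests=memory=2Gi --limits=memory=2Gi

# Run the JMeter test plan in non-GUI mode and generate the HTML dashboard report
jmeter -n -t registry-load.jmx -l registry-results.jtl -e -o registry-report/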

Benchmarking results of registry microservice (Load test done by Madhavan Kalkunte Ramachandra)

The registry provides the following APIs:

  1. getAllUsers - returns all the registered users. This is the least frequently called API, used mainly from an administrative standpoint.
  2. getAppLog - returns all the logs for a given user. This is the second most used API, called whenever the user wants to review their activity in the history tab of the UI.
  3. getUser - returns user details for a given user. Called once after login to get user details.
  4. addAppLog - logs a user activity into the database. This is the most frequently called API. Every user request might result in one or more activity log entries. For example, if a user requests a plot, the gateway logs the request first, and then the ingestor logs again once the request is completed.

Load pattern

The load test is designed based on the relative usage of each API. We maintain proportionality between the API calls as addAppLog:getAppLog:getUser:getAllUsers = 10:5:2:1

Test setting:

At the start of each test, the services are restarted and the database is cleared. JVM memory is capped at 1 GB.
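For reference, capping the JVM heap this way is typically done with the -Xmx flag; a minimal sketch (the jar name is an assumption):

# Start the registry with its JVM heap capped at 1 GB
java -Xmx1g -jar registry.jar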

case 1:

addAppLog: 1000 tpm, getAppLog: 500 tpm, getUser: 200 tpm, getAllUsers: 100 tpm

All the requests were processed successfully and constant throughput was observed throughout the entirety of the test. Throughput graph: throughput-case-1

Memory: memory-case-1

case 2:

addAppLog: 4000 tpm, getAppLog: 2000 tpm, getUser: 800 tpm, getAllUsers: 400 tpm

All the requests were processed successfully and constant throughput was observed throughout the entirety of the test.

Throughput graph: throughput-case-1

Memory: memory-case-1

case 3:

addAppLog: 16000 tpm, getAppLog: 8000 tpm, getUser: 3600 tpm, getAllUsers: 1600 tpm

Throughput graph: throughput-case-1

Memory: memory-case-1

Observation: We can observe that as the load test progresses, the throughput decreases. This can be attributed to the increased memory usage visible in the memory graph. As the number of logs per user increases, fetching the logs for a user becomes both compute- and memory-intensive. Hence we see increased memory usage as the load test progresses, and as memory usage increases, garbage collection also increases, which further adds to the response time.

Note: Increasing the memory allocated to the JVM will not address the issue but only delay it. That is, the system, which now breaks at 16000 TPM with a 1 GB JVM, might instead break at 20000-24000 TPM since there will be more heap space available.

Benchmarking results of ingestor microservice (Load test done by Navkar Shah)

This service provides the following APIs:

  1. /getdata - returns a URL for the image of the plot. This endpoint is called once every time a user requests weather data.
  2. /getimage - returns the plot image for the location/filename obtained from /getdata. This is executed after /getdata returns the URL for the image; in case of failure it responds with a file-not-found message. (Example calls for both endpoints are sketched after this list.)
  • The load testing pattern for the ingestor service is a simple 1:1 ratio between getdata and getimage.
  • For all the tests performed, the services are restarted and memory is cleared for accuracy of results.
  • Since the ingestor service performs the heavy and crucial tasks of downloading data from NEXRAD and creating plots of the data received, the test cases need to cover a variety of user request counts and memory allocations to benchmark its performance under different types of load.
  • The following test cases are taken into consideration, with their results:
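For illustration, the two endpoints might be exercised as follows (a sketch only; host, port, and query parameters are assumptions rather than the actual API contract):

# Request a plot for a given radar station and time; the response contains the image URL
curl "http://localhost:8080/getdata?station=KIND&date=2022-03-01"

# Fetch the generated plot using the location returned by /getdata
curl -o plot.png "http://localhost:8080/getimage?file=<location-returned-by-getdata>"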

Test case 1:

  • Memory allocation: 2GB
  • User requests: a) 10 and b) 20 with a ramp-up period of 40 seconds.
  • Observations:
  1. As can be seen from the Threads vs Response Time charts, with 2 GB of memory allocated the response time increases rapidly due to the inability of the application to store and process large data files simultaneously.
  2. The ingestor container stops responding completely when it runs out of memory, and all subsequent requests start failing.

tvt_10_2
tvt_20_2
Test case 2:

  • Memory allocation: 4GB
  • User requests: a) 10, b) 20 and c) 60 with a ramp-up period of 40 seconds.
  • Observations:
  1. It can be observed that the throughput improves with the increase in memory allocation.
  2. This is because the application can process twice as many requests as before, which reduces the number of time-outs.
  3. Although the application works fine with fewer requests, it still fails to handle a large number of concurrent requests, and the container crashes.

Jmeter-Tree-4

The pie charts below show the success and failure percentages, which improved significantly over the previous test, for 10, 20, and 60 threads respectively.

Pie-Chart-4.1 Pie-Chart-4.2 Pie-Chart-4.6

Test case 3:

  • Memory allocation: 6GB
  • User requests: a) 10, b) 20 and c) 60 with a ramp-up period of 40 seconds.
  • Observations:
  1. The overall throughput improves as more requests are processed simultaneously, but for a large-scale deployment the memory allocation becomes a hindrance to the performance of the application.
  2. The rate of failure decreases further, but is not completely dependent on memory space.

Jmeter-Tree-6

The pie charts below show the success and failure rates for 10, 20, and 60 threads respectively.

Pie-Chart-6.1 Pie-Chart-6.2 Pie-Chart-6.3

Overall comparison of error rates:

compare_table

Conclusion:

  1. From the results observed above, one can see that the rate of success of requests depends heavily on the memory allocated to the application, and that the memory needs to be scaled up every time the number of requests increases.
  2. There is no mechanism for throttling requests according to the memory available. This calls for the introduction of a queueing service that would limit the number of user requests reaching the application without them failing on the producer end.
  3. Another approach can be creating replicas of the application and distributing requests among them.

Benchmarking results of forecast Microservice (Load test done by Ayush Sanghavi)

This service provides the following APIs:

  1. /forecast - returns the current weather forecast with all the details, such as minimum temperature, maximum temperature, pressure, humidity, and weather description (a sample call is sketched below).
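A call to this endpoint might look as follows (a sketch; the host and port are assumptions, and the response carries the fields listed above):

# Request the (mocked) current weather forecast
curl "http://localhost:8080/forecast"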

Test case 1:

  • Memory allocation: 2 GB

  • User requests: a) 2000 with a ramp-up period of 10 seconds

Observations:

  • We ran a load test on the forecast microservice, in which the weather forecast response is mocked.

  • We observed no failures whatsoever when requesting the API for the generation of random (mock) data.

  • As a result, all the requests were handled well and no request failed.

  • All the requests were handled correctly and we also received the expected HTTP response data (weather information).

  • The response time increases linearly as the system is flooded with around 2000 incoming requests.

  • The following image shows the response time ranges vs the number of responses.

Running the load test on the entire old system: (Load test done by Ayush Sanghavi)

Test case 1:

  • Memory allocation: 2 GB

  • User requests: a) 10 with a ramp-up period of 10 seconds.

  • The % of the requests passing and failing:

  • The response time with 10 users requesting data from the S3 bucket.

  • The response time vs the request time for 10 users:

  • 20 users requesting data:
  • The % of the requests passing and failing:

  • The response time vs the request time for 20 users:

Observations:

Requests fail after a certain number of downloads and a couple of plots. There are two possible reasons:

a) When the entire application is running, each container has a restricted memory size. With multiple concurrent requests, it is possible that the container memory fills up quickly while the files are being downloaded. As soon as the container memory is full, the container crashes, since it cannot hold more downloaded data. This causes some requests to fail.

b) The requests may also be restricted at the source: the SDK that we are using (NexradAWS) may not allow so many concurrent requests. This could also be a reason for the downloads failing, rather than the container memory issue.

Our plan is to solve this problem by using a queueing service, where all the requests are streamlined and processed one after the other. For the container memory, we could delete old data that has not been looked up for a while.
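The second idea could be implemented as a periodic cleanup of stale downloads, for example (a sketch; the cache directory and the 60-minute threshold are assumptions):

# Delete downloaded NEXRAD files that have not been accessed in the last 60 minutes
find /data/nexrad-cache -type f -amin +60 -delete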

Issues identified in the old system: (Collective work of all team members)

Ingestor:

  1. The data from NEXRAD was downloaded to an in-memory temporary location. When the number of requests increased, memory usage increased and eventually resulted in container crashes.
  2. The HTTP-based interface provided little control over request throttling. Since the getData API is processing- and space-intensive, this resulted in many performance issues such as high latency, and the system was prone to network issues. The user experience was also hampered, as minor network glitches would disrupt the API call and new requests had to be made.
  3. The plots created were also being stored in an in-memory location, making them volatile and memory-consuming.

Registry:

  1. Logging user activity through HTTP calls negatively impacts the user experience, as these calls execute in the main functional flow and add latency and errors to the main functionality.
  2. getLogs returned all logs of a given user. Over time this can grow to large proportions for a user and result in serious performance issues. We observed that beyond a limit, this causes high memory consumption in the JVM and results in frequent GCs, which further affect the performance of the other functionality offered by the microservice.

User Experience Impact

  1. The user experience in the old system, especially when accessed by multiple users in parallel, is poor. This is mainly due to the instability of HTTP calls over long requests. The /data API has a high response time due to the nature of the process, and any small network issue during the call might leave the user hanging.
  2. Due to frequent container crashes even at low load, users might frequently face time-out/connection-refused errors, which results in a poor user experience.
  3. Having user activity logging (which is not a critical functionality) as part of the critical business logic adds more failure points and, even on successful completion, adds to the response time of all requests.

Architectural changes

To address the issues found in the load test, we have made the following changes:

  1. The logging of user activity was changed from an API call to MQ-based logging. This makes the process asynchronous and takes the logging activity out of the main functional flow (see the sketch after this list).
  2. The registry's getLogs API was modified to return only the top 20 logs of a user. However, if the user wants to see all history, it can be fetched explicitly by adding appropriate parameters to the getLogs API. This reduces the amount of reads performed on the database.
  3. The ingestor now downloads the data onto a secondary volume and cleans up after the request is completed. This reduces memory consumption and hence also avoids container crashes.
  4. Gateway-to-ingestor communication was changed from HTTP to MQ. This makes the process asynchronous and can help provide a better user experience.
  5. The ingestor updates the result (plot) in the database, thereby persisting results.
  6. Using MQ between gateway-registry and gateway-ingestor makes the system more fault tolerant.
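To illustrate the MQ-based logging in change 1, a sample activity-log message can be published by hand with the RabbitMQ management CLI (a sketch; the queue name registry.applog is taken from the load-test section below, while the message fields are assumptions):

# Publish a sample user-activity log message to the applog queue via the default exchange
rabbitmqadmin publish exchange=amq.default routing_key=registry.applog payload='{"user":"demo","action":"getData","status":"SUCCESS"}'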

The modified high-level architectural diagram is as follows:

successful deployment

Load Test on modified Architecture (Load test done by Madhavan Kalkunte Ramachandra)

The following test cases were executed:

  1. Single instance of all microservices

a) getData, getStatus: 60 TPM, getHistory: 60 TPM and getAllHistory: 20 TPM

b) getData, getStatus: 120 TPM, getHistory: 120 TPM and getAllHistory: 40 TPM

c) getData, getStatus: 240 TPM, getHistory: 240 TPM and getAllHistory: 80 TPM

  2. Multiple instances of all microservices - 3 and 5 instances

a) getData, getStatus: 240 TPM, getHistory: 240 TPM and getAllHistory: 80 TPM

  3. Fault tolerance with 3 instances

a) getData, getStatus: 120 TPM, getHistory: 120 TPM and getAllHistory: 40 TPM

In each of the above tests, all the services were freshly deployed, including the database and RabbitMQ. The following metrics were used to analyze the performance of the system:

  1. Container crashes
  2. Pile-up of messages on input queues (queue depths can be checked as sketched below)
  3. Input/output validation (the number of requests generated vs. the number of records logged)
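Queue depths can be inspected, for example, with rabbitmqctl inside the RabbitMQ pod (a sketch; the pod name is a placeholder, and the RabbitMQ management UI shown in the screenshots provides the same information):

# List each queue with its current message counts
kubectl exec -n elves <rabbitmq-pod-name> -- rabbitmqctl list_queues name messages messages_ready messages_unacknowledged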

Case 1a: Single instance of all microservices: getData, getStatus: 60 TPM, getHistory: 60 TPM and getAllHistory: 20 TPM

Container crashes

Observations: No container crash was observed

Queue status

60tpm-ingestor-input 60tpm-registry-applog-input 60tpm-registry-ingestor-input

Observations: No pileup was observed in any queues. All the requests were handled by the system promptly without any issues

Validity

60tpm-db-snapshot 60tpm-jmeter-report

Observations: We can see that JMeter generated 308 requests and all 308 requests were logged in the database.

Case 1b: Single instance of all microservices: getData, getStatus: 120 TPM, getHistory: 120 TPM and getAllHistory: 40 TPM

Container crashes

Observations: No crashes were observed

Queue status

120tpm-ingestor-input 120tpm-registry-applog-input 120tpm-registry-ingestor-input

Observations: No pile-up was observed

Validity

Observations: No requests were lost. All were processed

Case 1c: Single instance of all microservices: getData, getStatus: 240 TPM, getHistory: 240 TPM and getAllHistory: 80 TPM

Container crashes

Observations: Ingestor container crashed twice during the load test

Queue status

120tpm-ingestor-input 240tpm-registry-applog-input 240tpm-registry-ingestor-input

Observations:

A pile-up of 300-400 messages was observed in the ingestor input queue, corresponding to the container crash. Once the containers were back up, the messages were quickly processed. A pile-up of ~100 messages was observed throughout the span of the test.

Validity

Observations: No requests were lost. All were processed

Conclusions:

The system was able to handle loads of up to 300 TPM combined without any issues. When the load was increased beyond that, the ingestor microservice crashed frequently and, as a result, we could observe a pile-up on the ingestor input queue. In the third test, we could even observe a constant pile-up in the registry.ingestor queue, although no crash of the corresponding container was observed.

Case 2a: 3 instances of all microservices with getData, getStatus: 240 TPM, getHistory: 240 TPM and getAllHistory: 80 TPM

Container crashes

Observations: No container crashes were observed during the load test

Queue status

240tpm-3dep-ingestor-input 240tpm-3dep-registry-applog-input

Observations:

  1. A pile-up of up to 25-30 messages on average was observed on both the ingestor and registry.ingestor input queues

Validity

Observations: No requests were lost. All were processed

Case 2b: 5 instances of all microservices with getData, getStatus: 240 TPM, getHistory: 240 TPM and getAllHistory: 80 TPM

Container crashes

Observations: No container crashes were observed during the load test

Queue status

240tpm-5dep-ingestor-input 240tpm-5dep-registry-applog-input 240tpm-5dep-registry-ingestor-input

Observations:

  1. A pile-up of up to 3k messages on average was observed on the registry.applog queue
  2. A pile-up of ~100-150 messages was observed on the registry.ingestor queue
  3. No pile-up was observed on the ingestor input queue

Validity

Observations: No requests were lost. All were processed

Conclusions:

With 3 instances, the system handled the load of up to 600 TPM fairly well, whereas with 5 instances the system was able to handle the load completely. However, this load test exposed another issue with respect to the scaling of the registry microservice: when the ingestor's processing speed increases, we observe that the registry slows down. This is potentially because of the persisting of plots in the database, which is a heavier operation than normal logging. Hence, we observe a considerable pile-up in the registry queues in both these tests. Once the ingestor is done processing requests, we can see the applog queue being emptied very quickly.

Case 3 - fault tolerance: 3 instances 120 TPM

In the middle of the test, the ingestor and registry services are scaled down and then scaled back up. As a result, their processing capability is temporarily reduced; the intention of the test is to observe how the system handles such a drop in processing capability.
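The scale-down and scale-up in this test can be reproduced with kubectl (a sketch; the deployment names are assumptions based on the service names):

# Temporarily reduce the ingestor's capacity (the same applies to the registry deployment)
kubectl scale deployment elves-ingestor -n elves --replicas=1

# Later, restore it to the full 3 instances
kubectl scale deployment elves-ingestor -n elves --replicas=3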

Queue status:

ingestor-input applog-input registry-ingestor

Jmeter result:

jmeter-result

Observations:

  1. We can see that, corresponding to the scale-down of a service, its input queue piles up. However, once the services are scaled back up, all the messages are quickly processed and the system returns to its normal processing.
  2. When all the instances are brought down, we can observe in the JMeter result that the HTTP requests fail. Specifically, log fetching is implemented as HTTP calls, and we can see failures in those requests.

User Experience Improvements

  1. The user experience is greatly improved because the system is more stable and responsive, owing to the asynchronous nature of the communication between the gateway and the ingestor.
  2. The UX is also improved by the reduction in system crashes (specifically of the ingestor) and by the fault tolerance introduced by MQ-based communication: even if a container crashes in the middle of processing a request, the request will not be lost (as we observed in the fault tolerance test), so the user is guaranteed to get a response, albeit late.
  3. Updating the logs API to retrieve only the latest 20 logs impacts the UX positively in two ways: first, it makes more sense to see recent activity than everything; second, it reduces the response time of the API.
  4. Moving logging out of the critical functional logic, combined with MQ usage, improved the response time of all requests and made the UI more responsive: the turnaround time for all UI requests is under 100 ms, compared with the old architecture where the /data and /logs APIs would take seconds.

Load test conclusions

The load test on the new architecture indicated that many of the issues identified in the previous architecture have been addressed. However, certain new issues have been introduced. The following are a few we identified:

  • Persisting plots in the database proves to be a sub-optimal solution, as it greatly impacts the overall performance. It may be worthwhile to explore systems such as S3 for persisting results rather than the database (see the sketch after this list).
  • The ingestor container crashing under high load suggests resource leaks, such as high I/O or mismanaged file descriptors.
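For example, the generated plot could be uploaded to an object store and only its key kept in the database (a sketch using the AWS CLI; the bucket and key layout are assumptions):

# Upload the generated plot to S3 instead of storing the image bytes in Postgres
aws s3 cp plot.png s3://elves-plots/<user-id>/<request-id>.png
# The database row would then store only the object key or URL rather than the image itself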

Github branches

Modified architecture code implementations:

  1. Gateway: Check here
  2. Registry: Check here
  3. UI: Check here
  4. Ingestor: Check here

Load test scripts and results:

  1. https://github.com/airavata-courses/CloudElves/tree/HW2/loadtest

Deployment:

  1. https://github.com/airavata-courses/CloudElves/tree/deployments

Docker Images:

  1. Gateway: docker.io/madhavankr/elves-gateway:latest
  2. Registry: docker.io/madhavankr/elves-registry:latest
  3. Ingestor: docker.io/madhavankr/elves-ingestor:latest
  4. Forecast: docker.io/madhavankr/elves-forecast:latest
  5. UI: docker.io/navkar14/elves-ui:latest

References

Creating reports - https://jmeter.apache.org/usermanual/generating-dashboard.html
Basic Postgres database in Kubernetes - https://itnext.io/basic-postgres-database-in-kubernetes-23c7834d91ef
