
Homework 4: Custos Testing

Madhavan K R edited this page May 6, 2022 · 17 revisions

Overview

In this last phase of the course, we test the freshly deployed Custos services. The goal of this exercise is twofold: to verify the correctness of the deployment, and to test non-functional aspects of the system such as scalability, reliability, and fault tolerance.

Functional Testing

Motivation

Functional testing has to come before we test the non-functional aspects, for the following reasons:

  1. We do not own the deployed source code, so we are not aware of its full functionality and behavior.
  2. Non-functional aspects matter only when the functionality works as expected.
  3. As end users, we are told only what the system's capabilities are, and we are exposed to a limited number of endpoints through which to verify them.

Black-Box Testing

With the above in mind, black-box testing is the best-suited strategy to adopt: it lets us explore the system while also validating its functionality and correctness.

Our approach was as follows:

  1. We explored the API endpoints exposed by Custos through the API documentation.
  2. We assessed the information at our disposal: we were provided with a client_id, a client_secret, and an admin_user.
  3. With that information, we determined the largest set of APIs that we could validate.
  4. For each selected API, we noted the required input and the expected output.
  5. A validation succeeds if the output of the API matches the expected output.
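Steps 4 and 5 can be sketched as a small comparison helper. This is only an illustration of the expected-versus-actual check, not the Postman collection itself: `fake_register_user` and its response fields are hypothetical stand-ins for a real HTTP call to a Custos endpoint.

```python
from typing import Any, Callable, Dict, List


def validate_response(expected: Dict[str, Any], actual: Dict[str, Any]) -> List[str]:
    """Return a list of mismatches between expected and actual fields.

    Only the keys listed in `expected` are checked, so extra fields in the
    response (generated IDs, timestamps) do not fail the case.
    """
    mismatches = []
    for key, want in expected.items():
        if key not in actual:
            mismatches.append(f"missing field: {key}")
        elif actual[key] != want:
            mismatches.append(f"{key}: expected {want!r}, got {actual[key]!r}")
    return mismatches


def run_case(call_api: Callable[[Dict[str, Any]], Dict[str, Any]],
             request: Dict[str, Any],
             expected: Dict[str, Any]) -> bool:
    """Send one request through `call_api` and report whether it passed."""
    return not validate_response(expected, call_api(request))


# Hypothetical stand-in for a real API call; it echoes the username back.
def fake_register_user(request: Dict[str, Any]) -> Dict[str, Any]:
    return {"username": request["username"], "is_registered": True, "id": "abc123"}


passed = run_case(
    fake_register_user,
    {"username": "testuser1"},
    {"username": "testuser1", "is_registered": True},
)
print(passed)
```

Checking only the fields named in the expected output keeps the cases stable across runs, since a black-box tester cannot predict server-generated values.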

In total, we tested 16 APIs, most of which belong to the following groups:

  1. Identity Management
  2. User Management
  3. Group Management

We have documented all the API requests and responses in this Postman collection.

Conclusion

All the tested APIs worked as expected, suggesting that our deployment is stable and functionally correct. This allows us to move on to performance tests such as load tests and stress tests.

Performance Testing

In performance testing, we test non-functional aspects of the system such as stability, reliability, fault tolerance, and scalability. To this end, we conducted the following tests on the 16 APIs explored in the functional testing phase:

  1. Load Test
  2. Stress Test
  3. Soak Test

Setup

  • Custos is deployed on a 3-node Kubernetes cluster in which all 3 nodes act as worker nodes. Each node is a v1.Medium instance on the Jetstream1 platform with 16 GB of memory and 60 GB of storage.

  • All the services are deployed as a single instance, i.e., replicas=1.

  • No autoscaler is set up.

The thread groups we created are shown below:

Screen Shot 2022-05-05 at 6 05 40 PM

Load Test

The objective of this test is to increase the load on the system incrementally over time and analyze its performance. This helps us understand the impact of high traffic on the different APIs and how the system responds to it.

We designed a load test that does the following:

  1. Create 100 users over 10 minutes.
  2. Create 100 groups over 10 minutes.
  3. Repeatedly add users to and remove users from groups.
  4. Continuously fetch access tokens for the created users.

The load we've considered is:

  1. 100 user creations in 10 minutes.
  2. 100 group creations in 10 minutes.
  3. 100 User Management actions per minute for 10 minutes.
  4. 100 Group Management actions per minute for 10 minutes.
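To put these numbers in perspective, the arithmetic below totals the request volume the profile implies. It is a rough sketch that assumes the four items are independent request streams and that nothing else is sent during the run; `load_profile` is a hypothetical helper, not part of the test plan.

```python
def load_profile(creations: int, actions_per_minute: int, minutes: int) -> dict:
    """Summarise the request volume of one test run.

    creations          -- users created, and likewise groups created, over the run
    actions_per_minute -- rate of user-management actions, and likewise
                          group-management actions
    minutes            -- length of the run
    """
    total_creations = 2 * creations                   # users + groups
    total_actions = 2 * actions_per_minute * minutes  # user + group management
    total = total_creations + total_actions
    return {
        "total_requests": total,
        "avg_requests_per_second": total / (minutes * 60),
        "seconds_between_user_creations": minutes * 60 / creations,
    }


load = load_profile(creations=100, actions_per_minute=100, minutes=10)
print(load)
```

Under these assumptions the load test sends 2,200 requests in total, an average of under 4 requests per second, with one user creation every 6 seconds.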

Results from Load Test:

Screen Shot 2022-05-05 at 7 13 59 PM
Screen Shot 2022-05-05 at 7 14 16 PM
Screen Shot 2022-05-05 at 7 15 20 PM
Screen Shot 2022-05-05 at 7 15 51 PM
Screen Shot 2022-05-05 at 7 16 11 PM

Issues uncovered during Load test

Requests of all types failed intermittently with "Network closed for unknown reason" or "UNAVAILABLE: io exception". We checked the logs of a few services and captured the following error messages:

UserManagementService

o.a.c.u.m.service.UserManagementService  : Error occurred while pulling users, UNAVAILABLE: Network closed for unknown reason

[ult-executor-58] o.a.c.u.m.service.UserManagementService  : Error occurred while checking username, UNAVAILABLE: io exception

GroupManagementService

o.a.c.g.m.s.GroupManagementService       : Error occurred at createGroups UNAVAILABLE: io exception

io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240) ~[grpc-stub-1.25.0.jar!/:1.25.0]
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:221) ~[grpc-stub-1.25.0.jar!/:1.25.0]
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:140) ~[grpc-stub-1.25.0.jar!/:1.25.0]
	at org.apache.custos.iam.service.IamAdminServiceGrpc$IamAdminServiceBlockingStub.createGroups(IamAdminServiceGrpc.java:1902) ~[iam-admin-core-service-client-stub-1.1.jar!/:1.1]
	at org.apache.custos.iam.admin.client.IamAdminServiceClient.createGroups(IamAdminServiceClient.java:251) ~[iam-admin-core-service-client-stub-1.1.jar!/:1.1]
	at org.apache.custos.group.management.service.GroupManagementService.createKeycloakGroups(GroupManagementService.java:67) ~[classes!/:1.1]
	at org.apache.custos.group.management.service.GroupManagementServiceGrpc$MethodHandlers.invoke(GroupManagementServiceGrpc.java:1268) ~[classes!/:1.1]
	at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172) ~[grpc-stub-1.25.0.jar!/:1.25.0]
	at brave.grpc.TracingServerInterceptor$ScopingServerCallListener.onHalfClose(TracingServerInterceptor.java:159) ~[brave-instrumentation-grpc-5.9.1.jar!/:na]
	at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35) ~[grpc-api-1.25.0.jar!/:1.25.0]
	at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23) ~[grpc-api-1.25.0.jar!/:1.25.0]
	at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40) ~[grpc-api-1.25.0.jar!/:1.25.0]
	at org.apache.custos.integration.core.interceptor.ServiceInterceptor$1.onHalfClose(ServiceInterceptor.java:84) ~[custos-integration-core-1.1.jar!/:1.1]
	at brave.grpc.TracingServerInterceptor$ScopingServerCallListener.onHalfClose(TracingServerInterceptor.java:159) ~[brave-instrumentation-grpc-5.9.1.jar!/:na]
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331) ~[grpc-core-1.25.0.jar!/:1.25.0]
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814) ~[grpc-core-1.25.0.jar!/:1.25.0]
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.25.0.jar!/:1.25.0]
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.25.0.jar!/:1.25.0]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: iam-admin-core-service.custos.svc.cluster.local/10.43.85.1:7000
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:na]
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[na:na]
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:688) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) ~[netty-common-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.42.Final.jar!/:4.1.42.Final]

The error log from GroupManagementService above shows that the io exception and network errors originate in iam-admin-core-service: the connection to iam-admin-core-service.custos.svc.cluster.local on port 7000 was refused.
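When a trace is this long, the "Caused by:" lines are the quickest route to the root cause. The sketch below pulls them out with a regular expression; `root_causes` is a hypothetical helper, and the sample is a shortened copy of the trace above.

```python
import re

# Matches lines such as "Caused by: java.net.ConnectException: Connection refused".
CAUSED_BY = re.compile(r"^Caused by: (?P<exception>[\w.$]+): (?P<message>.*)$")


def root_causes(stack_trace: str):
    """Return (exception, message) pairs for every 'Caused by:' line, in order."""
    causes = []
    for line in stack_trace.splitlines():
        match = CAUSED_BY.match(line.strip())
        if match:
            causes.append((match.group("exception"), match.group("message")))
    return causes


sample = """io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: iam-admin-core-service.custos.svc.cluster.local/10.43.85.1:7000
Caused by: java.net.ConnectException: Connection refused
"""

causes = root_causes(sample)
for exception, message in causes:
    print(exception, "->", message)
```

The last pair in the list is the innermost cause, here the refused connection.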

Stress Test

In stress testing, we push a high throughput through the system in a small window of time. This tests how the system handles sudden bursts of requests, which are common in real-world scenarios.

For this test, we designed a JMeter script that starts sending requests slowly, then increases the throughput manyfold and sustains it for a short duration (5-10 minutes) before completing.
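The shape of such a burst can be sketched as a step function of elapsed time. This is not the JMeter configuration itself, which is expressed through thread group and timer settings; the warm-up length and base rate below are hypothetical values chosen for illustration, while the default burst rate mirrors the 300 actions per minute used in this test.

```python
def target_rate(elapsed_s: float,
                warmup_s: int = 60,        # hypothetical warm-up length
                base_per_min: int = 30,    # hypothetical slow-start rate
                burst_per_min: int = 300,  # burst rate used in this test
                burst_s: int = 600) -> int:
    """Requests per minute to aim for `elapsed_s` seconds into the run."""
    if elapsed_s < 0:
        raise ValueError("elapsed_s must be non-negative")
    if elapsed_s < warmup_s:
        return base_per_min          # slow start
    if elapsed_s < warmup_s + burst_s:
        return burst_per_min         # sudden jump, held for the burst window
    return 0                         # test complete


for t in (0, 59, 60, 300, 659, 660):
    print(t, target_rate(t))
```

The abrupt jump is what distinguishes this profile from the load test, where the load rises incrementally.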

The load we've considered is:

  1. 300 user creations in 10 minutes
  2. 300 group creations in 10 minutes
  3. 300 User Management actions per minute for 10 minutes
  4. 300 Group Management actions per minute for 10 minutes.

Stress Test Results:

Screen Shot 2022-05-05 at 7 21 54 PM
Screen Shot 2022-05-05 at 7 22 06 PM
Screen Shot 2022-05-05 at 7 22 24 PM
Screen Shot 2022-05-05 at 7 22 48 PM
Screen Shot 2022-05-05 at 7 23 17 PM

Issues uncovered during Stress test

  • As in the load test, we saw a lot of failures (up to 45%), a higher failure rate than in the load test.
  • The majority of the failures were network errors caused by iam-admin-core-service.
  • In addition, iam-admin-core-service crashed multiple times during this test.
  • The reason for each crash was OOMKilled: the container ran out of memory and Kubernetes terminated it.

IAM-ADMIN-CORE-SERVICE crash

iam-admin-core-service-oom

OOM Reason

iam-admin-core-oom-reason

As the above image shows, iam-admin-core-service is allocated only 500 MB of memory. Under load it consumes well beyond that allocation, which triggers Kubernetes to kill the container. This repeated throughout the stress test, which is why we see up to 9 restarts on the pod.
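The same diagnosis can be read from the pod's status programmatically. The sketch below scans a pod object, in the JSON form that `kubectl get pod <name> -o json` prints, for containers whose last termination reason was OOMKilled. The sample pod is hand-written for illustration, and the container name in it is an assumption.

```python
import json


def oom_killed_containers(pod: dict):
    """List (container name, restart count) for containers whose most
    recent termination reason was OOMKilled."""
    hits = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            hits.append((status["name"], status.get("restartCount", 0)))
    return hits


# Hand-written sample; only the fields the function reads are included.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "iam-admin-core-service",
        "restartCount": 9,
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
      }
    ]
  }
}
""")

print(oom_killed_containers(sample_pod))
```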

Soak Test

The idea of a soak test is to verify how the system performs under constant load over a long duration. In this test we ran a smaller load (half the load from our load test) for 30 minutes. We expected a much lower error rate and no service crashes.

The load we've considered is:

  1. 150 user creations in 30 minutes.
  2. 150 group creations in 30 minutes.
  3. 50 User Management actions per minute for 30 minutes.
  4. 50 Group Management actions per minute for 30 minutes.

Soak Test Results:

Screen Shot 2022-05-05 at 7 29 56 PM
Screen Shot 2022-05-05 at 7 30 58 PM
Screen Shot 2022-05-05 at 7 31 14 PM

Screen Shot 2022-05-05 at 7 30 12 PM
Screen Shot 2022-05-05 at 7 30 36 PM

Observations from soak test:

  • The system responded much more stably.
  • Up to 97% of the requests succeeded.
  • Other metrics, such as response time and throughput, remained constant throughout the test.
  • This suggests that the system responds well to a constant low load.
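Success percentages like the one above can be recomputed from JMeter's CSV results file, which by default includes a `label` and a `success` column for each sample. The sketch below computes a per-label success rate; the sample rows and label names are made up for illustration and are not taken from our runs.

```python
import csv
import io
from collections import defaultdict


def success_rates(jtl_csv: str) -> dict:
    """Return {label: success percentage} from JMeter CSV results text."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for row in csv.DictReader(io.StringIO(jtl_csv)):
        totals[row["label"]] += 1
        if row["success"].strip().lower() == "true":
            passed[row["label"]] += 1
    return {label: 100.0 * passed[label] / totals[label] for label in totals}


# Made-up rows; a real results file carries more columns.
sample_results = """timeStamp,elapsed,label,responseCode,success
1651793396000,120,Create User,200,true
1651793397000,95,Create User,200,true
1651793398000,3010,Create Group,500,false
1651793399000,110,Create Group,200,true
"""

rates = success_rates(sample_results)
print(rates)
```

Breaking the rate down by label shows whether failures cluster on the APIs that depend on one service.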

Feedback based on Test Results

  • The majority of the issues were caused by failures in iam-admin-core-service. This suggests that this service can act as a single point of failure for the entire system.

  • All the services are implemented with the Java Spring Boot framework. We advise allocating at least 1 GB of memory per container for Spring Boot applications. This is especially true for larger, complex systems, because of the sheer number of dependencies in the project and the way Spring loads them; minor mismanagement can result in high memory consumption.

  • Additionally, iam-admin-core-service can be scaled horizontally to share the load among multiple instances.

  • Overall, the system ran with a single instance of each service and without an autoscaler. Hence there is no fault tolerance, which is visible in the number of failures that resulted from one failed service.
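The memory and scaling recommendations could be expressed as a patch to the service's Kubernetes Deployment. The sketch below only builds the patch document; we have not applied it. The container name and the replica count of 3 are assumptions for illustration, and the memory value follows the 1 GB suggestion above.

```python
import json


def scaling_patch(container_name: str, replicas: int, memory: str) -> dict:
    """Build a Deployment patch that sets the replica count and the
    memory request/limit of one container."""
    return {
        "spec": {
            "replicas": replicas,
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container_name,  # assumed container name
                            "resources": {
                                "requests": {"memory": memory},
                                "limits": {"memory": memory},
                            },
                        }
                    ]
                }
            },
        }
    }


# Hypothetical values: 3 replicas, 1Gi of memory per container.
patch = scaling_patch("iam-admin-core-service", replicas=3, memory="1Gi")
print(json.dumps(patch, indent=2))
```

The printed JSON is in the form that `kubectl patch deployment` accepts. Setting the request equal to the limit is one possible choice; it avoids scheduling a pod onto a node that cannot actually supply the memory the container is allowed to use.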