Homework 4: Custos Testing
In this last phase of the course, we test the freshly deployed Custos services. The goal of this exercise is two-fold: to verify the correctness of the deployment and to evaluate non-functional aspects of the system such as scalability, reliability, and fault tolerance.
Functional testing is an important first step before we can test the non-functional aspects, for the following reasons:
- We do not own the source code that has been deployed, so we have limited insight into its functionality and behavior.
- Non-functional aspects only matter once the functionality works as expected.
- As end users, we only know what we are told about the system's capabilities, and we are exposed to a limited number of endpoints/avenues to verify them.
With the above in mind, black-box testing is the best strategy to adopt: it lets us explore the system while also validating its functionality and correctness.
The following outlines our approach toward achieving this objective:
- We explored the API endpoints exposed by Custos through its API documentation.
- We assessed the information at our disposal: a client_id, a client_secret, and an admin_user.
- With this information, we identified the maximum set of APIs that we could validate.
- For each selected API, we noted the required input and the expected output.
- The validation is successful if the expected output matches the actual output of the API; a minimal sketch of one such check follows this list.
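As an illustration of this validation loop, the sketch below exercises a single token-style endpoint with the provided client credentials and asserts that the response matches the documented shape. It is a minimal sketch only: the base URL, endpoint path, authorization header format, and payload fields are assumptions for illustration, not verbatim from the Custos API documentation.

```python
import base64
import requests

# Assumed values -- replace with the tenant's actual host and credentials.
CUSTOS_HOST = "https://custos.example.org/apiserver"   # hypothetical base URL
CLIENT_ID = "custos-xxxxxxxxxx"                        # provided client_id
CLIENT_SECRET = "xxxxxxxxxx"                           # provided client_secret

def validate_token_endpoint():
    """Black-box check: call the token endpoint with the client credentials
    and assert that the observed output matches the expected output."""
    auth = base64.b64encode(f"{CLIENT_ID}:{CLIENT_SECRET}".encode()).decode()
    resp = requests.post(
        f"{CUSTOS_HOST}/identity-management/v1.0.0/token",   # path assumed
        headers={"Authorization": f"Bearer {auth}"},
        json={"grant_type": "client_credentials"},
        timeout=30,
    )
    assert resp.status_code == 200, f"unexpected status {resp.status_code}"
    assert "access_token" in resp.json(), "expected an access_token field"
    return resp.json()["access_token"]

if __name__ == "__main__":
    validate_token_endpoint()
    print("validation passed")
```

In practice we ran these checks from Postman rather than scripts, but the pass/fail criterion is the same: the observed response must match the expected one.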
We tested 16 APIs in total, primarily belonging to the following groups:
- Identity Management
- User Management
- Group Management
We have documented all the API requests/responses in this Postman collection.
All the tested APIs worked as expected, suggesting that our deployment is stable and functioning correctly. This allows us to move on to performance tests such as load and stress tests on the system.
In performance testing, we plan to evaluate non-functional aspects such as the stability, reliability, fault tolerance, and scalability of the system. To this end, we conduct the following tests against the 16 APIs explored in the functional testing phase:
- Load Test
- Stress Test
- Soak Test
- Custos is deployed on a 3-node Kubernetes cluster in which all 3 nodes act as worker nodes. Each node is a v1.Medium instance on the Jetstream1 platform with 16 GB memory and 60 GB storage.
- All the services are deployed as a single instance each, i.e., replicas=1.
- No autoscaler is set up.
The objective of this test is to incrementally increase the load on the system over time and analyze its performance. This helps us understand the impact of high traffic/load on different APIs and how the system responds to it.
We designed a load test with the following settings:
- Create 100 users over 10 minutes.
- Create 100 groups over 10 minutes.
- Repeatedly add/remove users from groups.
- Continuously fetch access tokens for created users.
The above steps are repeated with increasing volumes to analyse the impact; a minimal Python sketch of this workload follows the list below.
- Load Test 1 - 100 threads over 10 minutes.
- Load Test 2 - 200 threads over 10 minutes.
- Load Test 3 - 400 threads over 10 minutes.
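The sketch below mirrors this thread-based workload in plain Python so the ramp-up logic is easy to follow. The base URL, endpoint paths, and request payloads are assumptions for illustration; the admin token would be obtained as in the earlier validation sketch.

```python
import threading
import time
import uuid
import requests

# Assumed values -- hypothetical base URL, paths, and payloads.
CUSTOS_HOST = "https://custos.example.org/apiserver"
ADMIN_TOKEN = "..."
THREADS = 100            # 100 / 200 / 400 for Load Tests 1-3
RAMP_SECONDS = 10 * 60   # spread the thread start-ups over 10 minutes

def worker(idx: int):
    """One load-test thread: create a user and a group, then exercise
    group membership changes and token fetches for that user."""
    time.sleep(idx * RAMP_SECONDS / THREADS)   # stagger starts across the window
    s = requests.Session()
    s.headers["Authorization"] = f"Bearer {ADMIN_TOKEN}"
    tag = uuid.uuid4().hex[:8]
    s.post(f"{CUSTOS_HOST}/user-management/v1.0.0/user",     # path assumed
           json={"username": f"load-user-{tag}", "password": "Passw0rd!"},
           timeout=30)
    s.post(f"{CUSTOS_HOST}/group-management/v1.0.0/groups",  # path assumed
           json={"name": f"load-group-{tag}"},
           timeout=30)
    # The add/remove-membership and token-fetch calls loop here in the same way.

threads = [threading.Thread(target=worker, args=(i,)) for i in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```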
Across these runs, requests were intermittently failing with "Network closed for unknown reason" or "UNAVAILABLE: io exception". We checked the logs of a few services; the following are the error messages we captured:
UserManagementService
o.a.c.u.m.service.UserManagementService : Error occurred while pulling users, UNAVAILABLE: Network closed for unknown reason
[ult-executor-58] o.a.c.u.m.service.UserManagementService : Error occurred while checking username, UNAVAILABLE: io exception
GroupManagementService
o.a.c.g.m.s.GroupManagementService : Error occurred at createGroups UNAVAILABLE: io exception
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240) ~[grpc-stub-1.25.0.jar!/:1.25.0]
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:221) ~[grpc-stub-1.25.0.jar!/:1.25.0]
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:140) ~[grpc-stub-1.25.0.jar!/:1.25.0]
at org.apache.custos.iam.service.IamAdminServiceGrpc$IamAdminServiceBlockingStub.createGroups(IamAdminServiceGrpc.java:1902) ~[iam-admin-core-service-client-stub-1.1.jar!/:1.1]
at org.apache.custos.iam.admin.client.IamAdminServiceClient.createGroups(IamAdminServiceClient.java:251) ~[iam-admin-core-service-client-stub-1.1.jar!/:1.1]
at org.apache.custos.group.management.service.GroupManagementService.createKeycloakGroups(GroupManagementService.java:67) ~[classes!/:1.1]
at org.apache.custos.group.management.service.GroupManagementServiceGrpc$MethodHandlers.invoke(GroupManagementServiceGrpc.java:1268) ~[classes!/:1.1]
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172) ~[grpc-stub-1.25.0.jar!/:1.25.0]
at brave.grpc.TracingServerInterceptor$ScopingServerCallListener.onHalfClose(TracingServerInterceptor.java:159) ~[brave-instrumentation-grpc-5.9.1.jar!/:na]
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35) ~[grpc-api-1.25.0.jar!/:1.25.0]
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23) ~[grpc-api-1.25.0.jar!/:1.25.0]
at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40) ~[grpc-api-1.25.0.jar!/:1.25.0]
at org.apache.custos.integration.core.interceptor.ServiceInterceptor$1.onHalfClose(ServiceInterceptor.java:84) ~[custos-integration-core-1.1.jar!/:1.1]
at brave.grpc.TracingServerInterceptor$ScopingServerCallListener.onHalfClose(TracingServerInterceptor.java:159) ~[brave-instrumentation-grpc-5.9.1.jar!/:na]
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331) ~[grpc-core-1.25.0.jar!/:1.25.0]
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814) ~[grpc-core-1.25.0.jar!/:1.25.0]
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.25.0.jar!/:1.25.0]
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.25.0.jar!/:1.25.0]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: iam-admin-core-service.custos.svc.cluster.local/10.43.85.1:7000
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:na]
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[na:na]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:688) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) ~[netty-common-4.1.42.Final.jar!/:4.1.42.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.42.Final.jar!/:4.1.42.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.42.Final.jar!/:4.1.42.Final]
It is evident from the above GroupManagementService error log that the IOException/network errors are caused by iam-admin-core-service: the connection to iam-admin-core-service.custos.svc.cluster.local:7000 is being refused.
In stress testing, we plan to push a high throughput through the system within a small window of time. This tests how the system handles sudden bursts of requests, which are common in real-world scenarios.
For this test, we designed a JMeter script that starts sending requests slowly, then increases the throughput many-fold and sustains it for a short duration (5-10 minutes) before completing; a ramp-and-burst sketch in the same spirit follows the list below.
The load we've considered is:
- 300 user creations in 10 minutes
- 300 group creations in 10 minutes
- 300 User Management actions per minute for 10 minutes
- 300 Group Management actions per minute for 10 minutes.
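The sketch below shows a ramp-and-burst profile of this kind; the rates, endpoint path, and payload are illustrative assumptions, while the actual JMeter plan drives all four request types listed above.

```python
import threading
import time
import requests

# Assumed values -- hypothetical base URL and endpoint path.
CUSTOS_HOST = "https://custos.example.org/apiserver"
ADMIN_TOKEN = "..."

def fire(endpoint: str, payload: dict):
    """Send one request; failures are swallowed so the burst keeps going."""
    try:
        requests.post(f"{CUSTOS_HOST}{endpoint}", json=payload,
                      headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
                      timeout=30)
    except requests.RequestException:
        pass

def run_phase(rpm: int, minutes: int, endpoint: str):
    """Hold a fixed rate of `rpm` requests per minute for `minutes` minutes."""
    interval = 60.0 / rpm
    for i in range(rpm * minutes):
        threading.Thread(target=fire,
                         args=(endpoint, {"name": f"stress-group-{i}"})).start()
        time.sleep(interval)

# Start slowly, then increase the throughput many-fold for the burst window,
# here against the group-creation endpoint (path assumed).
run_phase(rpm=30,  minutes=2,  endpoint="/group-management/v1.0.0/groups")
run_phase(rpm=300, minutes=10, endpoint="/group-management/v1.0.0/groups")
```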
The following issues were uncovered during the stress test:
- Similar to the load test, we see a large number of failures (up to 45%); the failure rate actually increased compared to the load test.
- The majority of failures were the network errors caused by iam-admin-core-service.
- Moreover, during the course of this test, iam-admin-core-service crashed multiple times.
- The reason for the crashes was OOMKilled, i.e., the container/pod exceeded its memory allocation and Kubernetes had to terminate it.
IAM-ADMIN-CORE-SERVICE crash
OOM Reason
As we can observe from the above image, iam-admin-core-service is allocated only 500 MB of memory; under load it consumes well beyond this allocation, triggering Kubernetes to kill the container. This repeats throughout the stress test, which is why the pod shows up to 9 restarts.
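This restart loop can also be confirmed programmatically. The sketch below uses the official Kubernetes Python client to read each pod's memory limit, restart count, and last termination reason; the namespace and label selector are assumptions derived from the service DNS name seen in the error logs.

```python
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Namespace and label selector assumed from
# iam-admin-core-service.custos.svc.cluster.local in the error logs.
pods = v1.list_namespaced_pod("custos", label_selector="app=iam-admin-core-service")
for pod in pods.items:
    for spec, status in zip(pod.spec.containers, pod.status.container_statuses or []):
        limit = (spec.resources.limits or {}).get("memory")
        last = status.last_state.terminated
        print(pod.metadata.name,
              "memory limit:", limit,
              "restarts:", status.restart_count,
              "last termination:", last.reason if last else None)
```

For the crashed pods, the last termination reason reported here should match the OOMKilled status observed above.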
The idea of a soak test is to verify how the system performs under a constant load over a long duration. In this test we run a smaller load (half the load from our load test) for 30 minutes. We expect a much lower error rate and no service crashes; a constant-pacing sketch of this workload follows the list below.
The load we've considered is:
- 50 user creations in 10 minutes.
- 50 group creations in 10 minutes.
- 50 User Management actions per minute for 10 minutes.
- 50 Group Management actions per minute for 10 minutes.
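A constant-pacing loop like the sketch below captures the spirit of this test: hold a fixed request rate for the whole window and track the success ratio. The base URL, endpoint path, and parameters are, as before, assumptions for illustration.

```python
import time
import requests

# Assumed values -- hypothetical base URL and endpoint path.
CUSTOS_HOST = "https://custos.example.org/apiserver"
ADMIN_TOKEN = "..."
RATE_PER_MIN = 50     # 50 actions per minute, as listed above
DURATION_MIN = 30

ok = total = 0
interval = 60.0 / RATE_PER_MIN
deadline = time.time() + DURATION_MIN * 60
while time.time() < deadline:
    total += 1
    try:
        r = requests.get(f"{CUSTOS_HOST}/user-management/v1.0.0/users",  # path assumed
                         headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
                         params={"offset": 0, "limit": 10},
                         timeout=30)
        ok += r.status_code == 200
    except requests.RequestException:
        pass
    time.sleep(interval)   # constant pacing: one request every `interval` seconds

print(f"success rate: {100.0 * ok / total:.1f}% over {total} requests")
```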
Observations from the soak test:
- We can see a much more stable response from the system.
- Up to 97% of the requests succeeded.
- Other metrics such as response time and throughput remained constant throughout the test.
- This suggests that the system responds well to a constant low load.
- The majority of the issues were caused by failures in iam-admin-core-service. This suggests that the service can act as a single point of failure for the entire system.
- All the services are implemented with the Java Spring Boot framework. It is advisable to allocate at least 1 GB of memory per container for Spring Boot applications. This is especially true for larger, complex systems because of the sheer number of dependencies in the project and the way Spring loads them; minor mismanagement can result in much higher memory consumption.
- Additionally, iam-admin-core-service can be scaled horizontally so that the load is shared among multiple instances (a sketch of both changes follows this list).
- Overall, the system ran with single instances and no autoscaler. Hence there is no fault tolerance, which is visible in the number of failures caused by a single failed service.
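A minimal sketch of both remediations (raising the memory limit to at least 1 GB and running additional replicas) is shown below using the Kubernetes Python client. The deployment, namespace, and container names are assumptions based on the service DNS entry in the error logs, and the concrete values should follow the cluster's actual capacity.

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Names assumed from iam-admin-core-service.custos.svc.cluster.local in the logs.
NAMESPACE, DEPLOYMENT = "custos", "iam-admin-core-service"

patch = {
    "spec": {
        "replicas": 3,  # horizontal scaling so multiple instances share the load
        "template": {
            "spec": {
                "containers": [{
                    "name": DEPLOYMENT,  # container name assumed to match the deployment
                    "resources": {
                        "requests": {"memory": "1Gi"},
                        "limits": {"memory": "1Gi"},   # >= 1 GB for Spring Boot services
                    },
                }]
            }
        },
    }
}
apps.patch_namespaced_deployment(name=DEPLOYMENT, namespace=NAMESPACE, body=patch)
```

An autoscaler (for example, a HorizontalPodAutoscaler keyed on CPU or memory) could then adjust the replica count automatically instead of relying on a fixed value.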