Homework 4: Custos Testing

Madhavan K R edited this page May 5, 2022 · 17 revisions

Overview

In this last phase of the course, we plan to test the freshly deployed Custos services. The goal of this exercise is twofold: to verify the correctness of the deployment, and to test non-functional aspects of the system such as scalability, reliability, and fault tolerance.

Functional Testing

Motivation

Functional testing has to come before we test the non-functional aspects, for the following reasons:

  1. We do not own the deployed source code, so we are not aware of its functionality and behavior.
  2. Non-functional aspects matter only when the functionality works as expected.
  3. As end users, we are told only what the system's capabilities are, and we are exposed to a limited number of endpoints through which to verify them.

Black-Box Testing

With the above in mind, black-box testing is the best strategy to adopt: it lets us explore the system while also validating its functionality and correctness.

Our approach was as follows:

  1. We explored the API endpoints exposed by Custos through the API documentation.
  2. We assessed the information at our disposal: we were provided with a client_id, a client_secret, and an admin_user.
  3. With that information, we determined the largest set of APIs that we could validate.
  4. For each selected API, we noted the required input and the expected output.
  5. A validation is successful if the API's actual output matches the expected output.
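
Steps 4 and 5 can be sketched as a small comparison helper. This is only an illustrative sketch, not the procedure we actually ran (our checks were done in Postman): the `ApiCase` type, the `validate` function, and all field names are hypothetical, and no real Custos endpoint or response schema is assumed.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ApiCase:
    """One black-box test case: what we send and what we expect back."""
    name: str                        # hypothetical label for the API under test
    request: Dict[str, Any]          # required input noted for the API
    expected_status: int             # expected HTTP status code
    expected_fields: Dict[str, Any]  # fields the response body must contain


def validate(case: ApiCase, status: int, body: Dict[str, Any]) -> List[str]:
    """Return a list of mismatches; an empty list means the case passed."""
    problems = []
    if status != case.expected_status:
        problems.append(f"status: expected {case.expected_status}, got {status}")
    for key, expected in case.expected_fields.items():
        if key not in body:
            problems.append(f"missing field: {key}")
        elif body[key] != expected:
            problems.append(f"{key}: expected {expected!r}, got {body[key]!r}")
    return problems
```

Returning a list of mismatches rather than a single boolean makes it easier to see which part of a response deviated when a case fails.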

In total, we tested 16 APIs, most of which belong to the following groups:

  1. Identity Management
  2. User Management
  3. Group Management

We have documented all the API requests and responses in this Postman collection.

Conclusion

All the tested APIs behaved as expected, which suggests that our deployment is stable and functionally correct. This allows us to proceed to performance tests, such as load and stress tests, on the system.

Performance Testing

In performance testing, we plan to test non-functional aspects of the system such as stability, reliability, fault tolerance, and scalability. To this end, we plan to conduct the following tests on the 16 APIs explored in the functional testing phase:

  1. Load Test
  2. Stress Test
  3. Soak Test

Setup

  • Custos is deployed on a 3-node Kubernetes cluster in which all 3 nodes act as worker nodes. Each node is an m1.medium instance on the Jetstream1 platform with 16 GB of memory and 60 GB of storage.

  • Every service is deployed as a single instance, i.e., replicas=1.

  • No autoscaler is configured.

Load Test

The objective of this test is to incrementally increase the load on the system over time and analyze its performance. This helps us understand how high traffic affects the different APIs and how the system responds to it.

We designed a load test with the following settings:

  1. Create 100 users over 10 minutes.
  2. Create 100 groups over 10 minutes.
  3. Repeatedly add users to and remove users from groups.
  4. Continuously fetch access tokens for the created users.

The above steps are repeated with increasing volumes to analyze the impact.

  • Load Test 1: 100 threads over 10 minutes.
  • Load Test 2: 200 threads over 10 minutes.
  • Load Test 3: 400 threads over 10 minutes.
    [Image: screenshot taken 2022-05-05, 6:05:40 PM]
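
To make the thread counts above concrete, here is a minimal sketch of how such a run could be paced. It assumes that the threads are started evenly across the 10-minute window, which the settings above do not state explicitly. `ramp_schedule`, `run_load_test`, and the `action` callback are hypothetical names for illustration; they are not part of the tool we used, and `action` stands in for whatever request a thread sends.

```python
from concurrent.futures import ThreadPoolExecutor
import time
from typing import Callable, List


def ramp_schedule(threads: int, duration_s: float) -> List[float]:
    """Start offset (seconds) for each thread so starts are spread evenly."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    interval = duration_s / threads
    return [i * interval for i in range(threads)]


def run_load_test(action: Callable[[int], bool], threads: int,
                  duration_s: float) -> float:
    """Run `action` once per thread on the ramp schedule; return failure rate."""
    def worker(index: int, offset: float) -> bool:
        time.sleep(offset)      # wait until this thread's start time
        try:
            return action(index)
        except Exception:
            return False        # count any exception as a failed request

    offsets = ramp_schedule(threads, duration_s)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(worker, range(threads), offsets))
    return results.count(False) / threads
```

Under this assumption, 100 threads over 600 seconds means one new thread every 6 seconds; doubling the thread count to 200 and then 400 halves that interval each time, which is what raises the load between the three tests.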

Issues uncovered during Load test

Requests to all the APIs failed intermittently with either "UNAVAILABLE: Network closed for unknown reason" or "UNAVAILABLE: io exception". We checked the logs of a few services and captured the following error messages:

UserManagementService

o.a.c.u.m.service.UserManagementService  : Error occurred while pulling users, UNAVAILABLE: Network closed for unknown reason

[ult-executor-58] o.a.c.u.m.service.UserManagementService  : Error occurred while checking username, UNAVAILABLE: io exception

GroupManagementService

o.a.c.g.m.s.GroupManagementService       : Error occurred at createGroups UNAVAILABLE: io exception

io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240) ~[grpc-stub-1.25.0.jar!/:1.25.0]
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:221) ~[grpc-stub-1.25.0.jar!/:1.25.0]
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:140) ~[grpc-stub-1.25.0.jar!/:1.25.0]
	at org.apache.custos.iam.service.IamAdminServiceGrpc$IamAdminServiceBlockingStub.createGroups(IamAdminServiceGrpc.java:1902) ~[iam-admin-core-service-client-stub-1.1.jar!/:1.1]
	at org.apache.custos.iam.admin.client.IamAdminServiceClient.createGroups(IamAdminServiceClient.java:251) ~[iam-admin-core-service-client-stub-1.1.jar!/:1.1]
	at org.apache.custos.group.management.service.GroupManagementService.createKeycloakGroups(GroupManagementService.java:67) ~[classes!/:1.1]
	at org.apache.custos.group.management.service.GroupManagementServiceGrpc$MethodHandlers.invoke(GroupManagementServiceGrpc.java:1268) ~[classes!/:1.1]
	at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172) ~[grpc-stub-1.25.0.jar!/:1.25.0]
	at brave.grpc.TracingServerInterceptor$ScopingServerCallListener.onHalfClose(TracingServerInterceptor.java:159) ~[brave-instrumentation-grpc-5.9.1.jar!/:na]
	at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35) ~[grpc-api-1.25.0.jar!/:1.25.0]
	at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23) ~[grpc-api-1.25.0.jar!/:1.25.0]
	at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40) ~[grpc-api-1.25.0.jar!/:1.25.0]
	at org.apache.custos.integration.core.interceptor.ServiceInterceptor$1.onHalfClose(ServiceInterceptor.java:84) ~[custos-integration-core-1.1.jar!/:1.1]
	at brave.grpc.TracingServerInterceptor$ScopingServerCallListener.onHalfClose(TracingServerInterceptor.java:159) ~[brave-instrumentation-grpc-5.9.1.jar!/:na]
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331) ~[grpc-core-1.25.0.jar!/:1.25.0]
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814) ~[grpc-core-1.25.0.jar!/:1.25.0]
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.25.0.jar!/:1.25.0]
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.25.0.jar!/:1.25.0]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: iam-admin-core-service.custos.svc.cluster.local/10.43.85.1:7000
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:na]
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[na:na]
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:688) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514) ~[netty-transport-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) ~[netty-common-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.42.Final.jar!/:4.1.42.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.42.Final.jar!/:4.1.42.Final]

The GroupManagementService log above shows that the IOException/network errors originate from iam-admin-core-service: the connection to iam-admin-core-service.custos.svc.cluster.local (10.43.85.1:7000) was refused.
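
When many such traces have to be checked, the failing downstream service can be pulled out of the "Caused by:" lines automatically. The sketch below is illustrative only: `refused_target` is a hypothetical helper, and the regular expression assumes the "Connection refused: host/ip:port" form seen in the trace above.

```python
import re
from typing import Optional, Tuple

# Matches the "Connection refused: host/ip:port" form seen in the trace above.
REFUSED = re.compile(
    r"Connection refused: (?P<host>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+)"
)


def refused_target(log_text: str) -> Optional[Tuple[str, str, int]]:
    """Return (host, ip, port) of the first refused connection, or None."""
    for line in log_text.splitlines():
        if not line.startswith("Caused by:"):
            continue            # only root-cause lines carry the target
        match = REFUSED.search(line)
        if match:
            return match["host"], match["ip"], int(match["port"])
    return None
```

Run against the trace above, this would single out iam-admin-core-service.custos.svc.cluster.local on port 7000 as the endpoint that refused the connection.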