Torpedo is a framework for testing the resiliency of an Airship-deployed environment. It provides templates and prebuilt tools to test all components of the deployed stack, from hardware nodes to service elements. The report and logging module makes triaging design issues easy.
Label the nodes on which Argo is to run with argo=enabled
Label the nodes on which Metacontroller is to be enabled with metacontroller=enabled
Label the nodes on which Torpedo should run with torpedo-controller=enabled
Label the nodes on which the traffic and chaos jobs are to run with resiliency=enabled
Label the nodes on which the log-collector jobs are to run with log-collector=enabled
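For example, assuming a node named node1 (a placeholder; substitute your node names and repeat per node), the labels can be applied as follows:
kubectl label nodes node1 argo=enabled
kubectl label nodes node1 metacontroller=enabled
kubectl label nodes node1 torpedo-controller=enabled
kubectl label nodes node1 resiliency=enabled
kubectl label nodes node1 log-collector=enabled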
git clone https://github.com/att-comdev/torpedo.git
kubectl create ns metacontroller
cat torpedo/metacontroller-rbac.yaml | kubectl create -n metacontroller -f -
cat torpedo/install_metacontroller.yaml | kubectl create -n metacontroller -f -
kubectl create ns argo
cat torpedo/install_argo.yaml | kubectl create -n argo -f -
cat torpedo/torpedo_crd.yaml | kubectl create -f -
cat torpedo/controller.yaml | kubectl create -f -
cat torpedo/resiliency_rbac.yaml | kubectl create -f -
cat torpedo/torpedo_rbac.yaml | kubectl create -f -
kubectl create configmap torpedo-metacontroller -n metacontroller --from-file=torpedo-metacontroller=torpedo_metacontroller.py
cat torpedo/torpedo-controller.yaml | kubectl create -n metacontroller -f -
cat torpedo/webhook.yaml | kubectl create -n metacontroller -f -
cat <test-suite> | kubectl -n metacontroller create -f -
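Once these steps are complete, the installation can be sanity-checked with standard kubectl queries (exact pod and CRD names depend on the manifests above):
kubectl get pods -n metacontroller
kubectl get pods -n argo
kubectl get crd | grep torpedo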
If Ceph storage is used to create a PVC, create a Ceph secret in the namespace where the PVC is to be created, with the same name as the userSecretName referenced in the Ceph storage class. The Ceph admin key can be obtained with the following command:
kubectl exec -it -n ceph <ceph-mon-pod> -- ceph auth get-key client.admin | base64
Replace the key and name in torpedo/secret.yaml with the key generated by the above command and the name mentioned in the Ceph storage class, respectively, and execute the following command:
cat torpedo/secret.yaml | kubectl create -f -
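Alternatively, the secret can be created directly with kubectl instead of editing torpedo/secret.yaml; a sketch, where <userSecretName>, <ceph-mon-pod>, and <pvc-namespace> are placeholders, the secret type assumes the in-tree RBD provisioner, and no base64 step is needed because kubectl encodes literal values itself:
kubectl create secret generic <userSecretName> --type=kubernetes.io/rbd \
  --from-literal=key="$(kubectl exec -n ceph <ceph-mon-pod> -- ceph auth get-key client.admin)" -n <pvc-namespace>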
1. Openstack
   - Openstack API GET calls (equivalent CLI calls are sketched after this list)
     - Keystone (Service list)
     - Mariadb (Keystone service list)
     - Memcached (Keystone service list)
     - Ingress (Keystone service list)
     - Glance (Image list)
     - Neutron (Port list)
     - Nova (Server list)
     - Cinder (Volume list)
     - Heat (Stack list)
     - Horizon (GET call on the Horizon landing page)
   - Openstack rabbitmq
     - Glance rabbitmq (POST call to create, upload, and delete an image)
     - Neutron rabbitmq (POST call to create and delete a router)
     - Nova rabbitmq (POST call to create and delete a server)
     - Cinder rabbitmq (POST call to create a volume)
     - Heat rabbitmq (POST call to create and delete a stack)
   - Openstack API POST calls
     - Glance (POST call to create, upload, and delete an image)
     - Neutron (POST call to create and delete a router)
     - Neutron dhcp-agent (POST call to create a virtual machine, assign a floating IP to the virtual machine, and initiate a ping request to the floating IP)
     - Openvswitch DB (POST call to create a virtual machine, assign a floating IP to the virtual machine, and initiate a ping request to the floating IP)
     - Openvswitch daemon (POST call to create a virtual machine, assign a floating IP to the virtual machine, and initiate a ping request to the floating IP)
     - Nova Compute (POST call to create and delete a server)
     - Nova Scheduler (POST call to create and delete a server)
     - Nova Conductor (POST call to create and delete a server)
     - Libvirt (POST call to create and delete a server)
     - Cinder Volume (POST call to create a volume)
     - Cinder Scheduler (POST call to create a volume)
     - Heat (POST call to create and delete a stack)
2. UCP
   - UCP API GET calls
     - Keystone (Keystone service list)
     - Promenade (GET call to check health)
     - Armada (Releases list)
     - Drydock (Nodes list)
     - Shipyard (Configdocs list)
     - Barbican (Secrets list)
     - Deckhand (Revisions list)
3. Kubernetes
   - Kubernetes Proxy (creates a pod and a service and initiates a ping request to the service IP)
   - Kubernetes Apiserver (GET call to the pod list)
   - Kubernetes Scheduler (POST call to create and delete a pod)
   - Ingress (GET call to kube-apiserver)
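For reference, the GET-call checks listed above correspond to openstack CLI calls such as the following; Torpedo issues the equivalent API requests through its traffic plugin rather than the CLI:
```
openstack service list   # Keystone
openstack image list     # Glance
openstack port list      # Neutron
openstack server list    # Nova
openstack volume list    # Cinder
openstack stack list     # Heat
```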
The test suite contains the following sections:
- Auth
- Job parameters
- Namespace
- Orchestrator Plugin
- Chaos Plugin
- Volume storage class
- Volume storage capacity
- Volume name
The Auth section consists of Keystone auth information in the case of Openstack and UCP, and of a URL and token in the case of Kubernetes (a sketch of the Kubernetes variant follows the example below).

- auth:
    auth_url: http://keystone-api.openstack.svc.cluster.local:5000/v3
    username: <username>
    password: <password>
    user_domain_name: default
    project_domain_name: default
    project_name: admin
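For Kubernetes test suites, the auth section instead carries the API server URL and a bearer token; a minimal sketch, in which the exact key names and values are illustrative rather than taken from the Torpedo schema:
```
- auth:
    url: https://kubernetes.default.svc.cluster.local:443
    token: <service-account-token>
```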
The Job parameters section in turn consists of the following sub-sections:

- name - Name of the test case.
- service - Name of the service against which the tests are run (for example, nova, cinder).
- component - Component of the service against which the test runs (for example, nova-os-api, cinder-scheduler).
- job-duration - Duration for which the job runs (applies to both chaos and traffic jobs).
- count - Number of times chaos/traffic should be induced on the target service. Takes precedence only if job-duration is set to 0.
- nodes - Used in node power-off scenarios; defaults to None otherwise. Takes a list of nodes, each with the following information:
  - ipmi_ip - IPMI IP of the target node
  - password - IPMI password of the target node
  - user - IPMI username of the target node
  - node_name - Node name of the target node
- sanity-checks - A list of checks to perform while the traffic and chaos jobs are running (for example, get a list of pods, nodes, etc.). Defaults to None. Each check takes three parameters as input:
  - image - Image to be used to run the sanity check
  - name - Name of the sanity check
  - command - Command to be executed
- extra-args - A list of extra parameters that can be passed for a specific test scenario. Defaults to None.
The remaining sections are:

- Namespace - Namespace in which the service to be verified is running.
- Orchestrator Plugin - The plugin to be used to initiate traffic.
- Chaos Plugin - The plugin to be used to initiate chaos.
- Volume storage class - Storage class to be used to create the PVC; this selects the type of storage backing the PVC.
- Volume storage capacity - Capacity of the PVC to be created.
- Volume name - Name of the volume PVC.
```
apiVersion: torpedo.k8s.att.io/v1
kind: Torpedo
metadata:
  name: openstack-torpedo-test
spec:
  auth:
    auth_url: http://keystone-api.openstack.svc.cluster.local:5000/v3
    username: admin
    password: ********
    user_domain_name: default
    project_domain_name: default
    project_name: admin
  job-params:
  - - service: nova
      component: os-api
      kill-interval: 30
      kill-count: 4
      same-node: True
      pod-labels:
      - 'application=nova'
      - 'component=os-api'
      node-labels:
      - 'openstack-nova-control=enabled'
      service-mapping: nova
      name: nova-os-api
      max-nodes: 2
      nodes:
      - ipmi_ip: <ipmi ip>
        node_name: <node name>
        user: <username>
        password: <password>
      sanity-checks:
      - name: pod-list
        image: kiriti29/torpedo-traffic-generator:v1
        command:
        - /bin/bash
        - sanity_checks.sh
        - pod-list
        - "2000"
        - "kubectl get pods --all-namespaces -o wide"
      extra-args: ""
  namespace: openstack
  job-duration: 100
  count: 60
  orchestrator_plugin: "torpedo-traffic-orchestrator"
  chaos_plugin: "torpedo-chaos"
  volume_storage_class: "general"
  volume_storage: "10Gi"
  volume_name: "openstack-torpedo-test"
```
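Assuming the manifest above is saved as openstack-torpedo-test.yaml (the filename is arbitrary), it can be submitted and observed as follows:
```
kubectl create -n metacontroller -f openstack-torpedo-test.yaml
# Inspect the submitted custom resource
kubectl get -n metacontroller -f openstack-torpedo-test.yaml -o yaml
# Watch for the job pods created by the controller (the namespace they run in
# depends on the deployment and node labels)
kubectl get pods --all-namespaces -w
```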
The framework aims at creating chaos in an NC environment and measuring the downtime before the cluster returns to normal behavior, while in parallel collecting all logs pertaining to Openstack API calls, the pod list, the node list, and so on.
1. The test case initially creates a heat stack, which in turn creates a stack of 10 VMs, before any chaos is introduced (ORT tests).
2. Once the heat stack is completely validated, we record the state.
3. Sanity checks are initiated for:

   a. The health of Openstack services:

      - Keystone - GET call on the service list
      - Glance - GET call on the image list
      - Neutron - GET call on the port list
      - Nova - GET call on the server list
      - Heat - GET call on the stack list
      - Cinder - GET call on the volume list

   b. Checks on Kubernetes:

      - Pod list - kubectl get pods --all-namespaces -o wide
      - Node list - kubectl get nodes
      - Rabbitmq cluster status - kubectl exec -it <rabbitmq pod on target node> -n <namespace> -- rabbitmqctl cluster_status
      - Ceph cluster status - kubectl exec -it -n ceph <ceph-mon-pod> -- ceph health

4. The node is then shut down (IPMI power off).
5. In parallel, heat stack creation is initiated again to measure how long the heat stack takes to finish (a simplified CLI sketch of this retry loop appears at the end of this section).

   - Verify that the heat stack is created within 15 minutes (a config parameter). If not, re-initiate the stack creation; this is retried in a loop.
   - The test exits with a failure after a 40-minute time limit (a configurable parameter).

6. If the heat stack creation completes, the shut-down node is brought back up and steps 1-5 are repeated on the other nodes.
7. Logs are captured with request/response times and failure/success messages for the test requests.
8. A report is generated based on the number of test cases that have passed or failed.

The following logs are collected:

- All sanity-check logs (Apache common log format)
- The logs of all pods in all namespaces in the cluster
- The heat logs
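The retry loop in steps 5 and 6 can be approximated with the openstack CLI; a simplified sketch, where the stack name, template, and timeout values stand in for the configured parameters:
```
STACK=torpedo-ort-stack             # placeholder stack name
TEMPLATE=ort_stack.yaml             # placeholder heat template
DEADLINE=$(( $(date +%s) + 2400 ))  # overall 40-minute limit (configurable)

while [ "$(date +%s)" -lt "$DEADLINE" ]; do
  # Remove any partially created stack from the previous attempt
  openstack stack delete "$STACK" --yes --wait 2>/dev/null
  # Give each attempt 15 minutes (the per-attempt config parameter)
  if openstack stack create --wait --timeout 15 -t "$TEMPLATE" "$STACK"; then
    echo "heat stack created successfully"
    exit 0
  fi
  echo "stack creation did not complete in time, re-initiating"
done
echo "stack creation failed within the overall time limit"
exit 1
```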
Muktevi Kiriti
Gurpreet Singh
Hemanth Kumar Nakkina