1.13.3

@ksatchit ksatchit released this 15 Apr 23:20
dc086b3

New Features & Enhancements

  • For updates on 2.0.0-Beta releases, refer to the notes for Litmus 2.0.0-Beta3.

  • Enhances the EC2 termination experiments to filter targets by tags (in addition to IDs), and adds support for list- and percentage-based selection of instances as well as serial and parallel failure modes.

  • Supports collection of chaos metrics for all ChaosEngine resources by default instead of selective monitoring controlled via spec attributes

  • Supports the definition of ‘context’ (metadata) for an experiment via a Kubernetes label on the ChaosEngine that translates to a metric label value on Prometheus. This can be used to group experiment results via context/reason or derive useful insights from metrics.

  • Introduces a new chaos metric, litmuschaos_experiment_verdict, that provides an instance-specific run result (instead of cumulative result stats). It can be used alongside the litmuschaos_awaited_experiments metric to build improved chaos-interleaved dashboards.

  • Adds documentation around supported chaos metrics and their utility.

  • Allows users to specify the terminationGracePeriodSeconds for the chaos experiment and helper pods to allow abort routines to go through (useful in clusters with high API traffic or under group chaos execution on multiple apps at once)

  • Provides new environment variables (translating to stress-ng flags) for node resource chaos experiments to ensure the granular definition of the load/stress profile.

  • Adds abort routines for infra/node and autoscaler experiments and optimizes the same for pod experiments in which they are already defined.

  • Introduces a randomness factor in the pod-delete experiment to ensure that the delete operations occur at random intervals (the random periods being picked within a time range defined by lower-upper bounds).

  • Enhances the pumba chaoslib for stress experiments by providing an additional ENV var for defining the stress image (that is pulled at runtime on the target pod’s node to inject the stressor). This is useful for folks running experiments with images from their private registries.

  • Introduces a tech-preview of a DNS-chaos experiment (available in the litmuschaos/go-runner:ci image) that can cause DNS errors/failures in target containers.

  • Updates the Chaos GitHub Actions used in the PR/commit-based e2e suite on the litmus-go repository.

  • Improves the e2e dashboard to represent the experiment e2e coverage in a clearer way.

  • Begins migrating specific e2e pipelines from GitLab to GitHub Actions, to aid the definition of multiple component/feature-based workflows from within a single branch.

  • Adds a new utility (nsutil) to execute commands in the target container’s namespace, with potential usage in multiple pod-level chaos experiments.
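
To picture the randomized pod-delete intervals mentioned in the list above, here is a minimal shell sketch that picks a wait time between a lower and an upper bound. The function name random_interval and the "lower-upper" argument format are assumptions made for illustration; they are not taken from the experiment's code or its configuration.

```sh
# Illustrative sketch only, not the pod-delete experiment's implementation.
# random_interval "LOWER-UPPER" prints a whole number of seconds in
# the inclusive range [LOWER, UPPER].
random_interval() {
    range="$1"              # hypothetical format, e.g. "5-10"
    lower="${range%-*}"     # text before the dash
    upper="${range#*-}"     # text after the dash
    if [ "$lower" -gt "$upper" ]; then
        echo "lower bound must not exceed upper bound" >&2
        return 1
    fi
    span=$((upper - lower + 1))
    # Read two random bytes as an unsigned integer (0..65535).
    # The slight modulo bias is acceptable for a sketch.
    rand="$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')"
    echo $((lower + rand % span))
}

# Possible use between two delete operations:
#   sleep "$(random_interval 5-10)"
```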

Major Bug Fixes

  • Fixes repeated scheduling of experiment pods upon helper failure/ungraceful exits (error state). The pods will now enter the completed state upon the first error.

  • Appends missing CRD validation schema for image pull policy for experiments

  • Upgrades all litmus artifacts containing CRD spec to use version v1 from v1beta1 to support newer Kubernetes platforms

  • Adds checks to validate the definition of app labels when annotation checks are set to false on the ChaosEngine (and fail fast with appropriate error).

  • Fixes the behavior where multiple “downstream” probes defined in the same phase (pre/post/on chaos) fail if the first probe evaluates to failure.

  • Fixes an issue seen when running chaos on multiple application replicas/targets at once, where only the chaos injection against the last replica/target was considered when determining the success of the experiment.

  • Adds retries to factor in the pending status of helper pods in populated/dense clusters where it takes time for the pod to be scheduled.

  • Adds logs to the Kafka liveness/load pod launched during the Kafka broker failure experiments to help verify service discovery and the success or failure of topic creation.
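
The retry behavior described above for pending helper pods can be approximated from the command line when debugging a dense cluster. The following is only a sketch of the general retry pattern, not Litmus's own logic: the retry and pod_is_running helpers are hypothetical names, as are the pod name and namespace in the usage comment.

```sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    max="$1"
    delay="$2"
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            return 1        # give up: every attempt failed
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# pod_is_running POD NAMESPACE
# Succeeds once the pod has left the Pending phase and is Running.
pod_is_running() {
    phase="$(kubectl get pod "$1" -n "$2" -o jsonpath='{.status.phase}')"
    [ "$phase" = "Running" ]
}

# Possible use (pod name and namespace are placeholders):
#   retry 30 2 pod_is_running my-helper-pod litmus
```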

Major Known Issues & Limitations

Issue:

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode with the default lib (typically used when users don’t want to mount the runtime’s socket files on their pods) can fail even though chaos is injected successfully. This happens when the target’s image lacks certain default utils that are used to detect the chaos processes and kill them (reverting the chaos) at the end of the chaos duration.

Workaround:

Users can work out the commands needed to identify and kill the chaos processes in their image and pass them to the experiment via the env variable CHAOS_KILL_COMMAND.
Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.
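
As a rough idea of the kind of command that could be supplied through CHAOS_KILL_COMMAND, the sketch below finds processes by name using only /proc and shell builtins, which helps when tools such as pgrep or ps are missing from the image. It assumes a Linux container; the helper name pids_by_name and the process name md5sum in the usage comment are assumptions, so substitute whatever process your stress command actually starts.

```sh
# pids_by_name NAME
# Prints the PID of every process whose command name equals NAME,
# relying only on /proc and shell builtins (Linux only).
pids_by_name() {
    name="$1"
    for comm in /proc/[0-9]*/comm; do
        # A process may exit while we iterate, so ignore read errors.
        c=""
        { read -r c < "$comm"; } 2>/dev/null
        if [ "$c" = "$name" ]; then
            pid="${comm#/proc/}"    # strip the leading "/proc/"
            echo "${pid%/comm}"     # strip the trailing "/comm"
        fi
    done
}

# Possible kill command built on top of it (process name is an assumption):
#   kill $(pids_by_name md5sum)
```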

Fix:

This is being actively worked on (a native litmus chaoslib that can inject stress processes without the exec requirement for docker/containerd/crio) and should be available in a subsequent release.

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.3.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos
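
The CRD check above can be made a little stricter by looking for each expected CRD by name. This is a small sketch under the assumption that the operator manifest installs the chaosengines, chaosexperiments, and chaosresults CRDs in the litmuschaos.io group; the function name missing_chaos_crds is hypothetical.

```sh
# missing_chaos_crds
# Reads `kubectl get crds` output on stdin and prints every expected
# Litmus CRD that does not appear in it. Prints nothing if all are present.
missing_chaos_crds() {
    crds="$(cat)"
    for crd in chaosengines.litmuschaos.io \
               chaosexperiments.litmuschaos.io \
               chaosresults.litmuschaos.io; do
        if ! printf '%s\n' "$crds" | grep -qF "$crd"; then
            echo "$crd"
        fi
    done
}

# Possible use against a live cluster:
#   kubectl get crds | missing_chaos_crds
```

Empty output suggests all three CRDs are installed; any printed name points at a CRD that still needs to be applied.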

For more details refer to the documentation at Docs