refactor: rename camunda/zeebe repo
The repository camunda/zeebe will be renamed to camunda/camunda; this
commit adjusts all references to that repository.
Zelldon committed May 22, 2024
1 parent 479929f commit 6b63494
Showing 26 changed files with 59 additions and 59 deletions.
6 changes: 3 additions & 3 deletions chaos-days/blog/2022-08-02-deployment-distribution/index.md
@@ -14,14 +14,14 @@ authors: zell
# Chaos Day Summary


We recently encountered a severe bug [zeebe#9877](https://github.com/camunda/zeebe/issues/9877), and I was wondering why we hadn't spotted it earlier, since we have chaos experiments for it. I realized two things:
We recently encountered a severe bug [zeebe#9877](https://github.com/camunda/camunda/issues/9877), and I was wondering why we hadn't spotted it earlier, since we have chaos experiments for it. I realized two things:

1. The experiments only check for parts of it (BPMN resource only). The production code has changed, and a new feature has been added (DMN), but the experiments/tests haven't been adjusted.
2. More importantly, we disabled the automated execution of the deployment distribution experiment because it was flaky due to a missing standalone gateway in Camunda Cloud SaaS [zeebe-io/zeebe-chaos#61](https://github.com/zeebe-io/zeebe-chaos/issues/61). This is no longer the case; see [Standalone Gateway in CCSaaS](../2022-02-15-Standalone-Gateway-in-CCSaaS/index.md).

On this chaos day I want to bring the automation of this chaos experiment back to life. If I still have time, I want to enhance the experiment.

**TL;DR;** The experiment still worked, and our deployment distribution is still resilient against network partitions. It also works with DMN resources. I can enable the experiment again, and we can close [zeebe-io/zeebe-chaos#61](https://github.com/zeebe-io/zeebe-chaos/issues/61). Unfortunately, we were not able to reproduce [zeebe#9877](https://github.com/camunda/zeebe/issues/9877) but we did some good preparation work for it.
**TL;DR;** The experiment still worked, and our deployment distribution is still resilient against network partitions. It also works with DMN resources. I can enable the experiment again, and we can close [zeebe-io/zeebe-chaos#61](https://github.com/zeebe-io/zeebe-chaos/issues/61). Unfortunately, we were not able to reproduce [zeebe#9877](https://github.com/camunda/camunda/issues/9877) but we did some good preparation work for it.

<!--truncate-->

@@ -190,7 +190,7 @@ We can adjust the experiment further to await the result of the process executio

#### Reproduce our bug

The current experiment didn't reproduce the bug in [zeebe#9877](https://github.com/camunda/zeebe/issues/9877), since the DMN resource has to be distributed multiple times. Currently, we create a network partition such that the distribution doesn't work at all.
The current experiment didn't reproduce the bug in [zeebe#9877](https://github.com/camunda/camunda/issues/9877), since the DMN resource has to be distributed multiple times. Currently, we create a network partition such that the distribution doesn't work at all.

![](deploymentDistributionExperimentV2.png)
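
For reference, such a network partition could be created with the `zbchaos` CLI (or the older chaostoolkit scripts). The following is only a sketch; the sub-commands are assumptions and may differ between zbchaos versions:

```shell
# Sketch (assumed zbchaos sub-commands; verify against your zbchaos version):
# cut the connection between the brokers so the deployment distribution cannot complete
zbchaos disconnect brokers

# deploy a process (and ideally a DMN resource) while the partition is in place
zbchaos deploy process

# heal the partition and let the distribution retry kick in
zbchaos connect brokers
```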

14 changes: 7 additions & 7 deletions chaos-days/blog/2023-02-23-Recursive-call-activity/index.md
@@ -13,27 +13,27 @@ authors: zell
# Chaos Day Summary

Long time no see. Happy to do my first chaos day this year. In the last week we have implemented interesting features, which I would like to experiment with.
[Batch processing](https://github.com/camunda/zeebe/issues/11416) was one of them.
[Batch processing](https://github.com/camunda/camunda/issues/11416) was one of them.

**TL;DR;** Chaos experiment failed. :boom: Batch processing doesn't seem to respect the configured limit, which causes issues with processing and influences the health of the system. We found a bug :muscle:

<!--truncate-->

## Chaos Experiment

In today's chaos experiment, we want to experiment with [Batch processing](https://github.com/camunda/zeebe/issues/11416) and how it can handle error conditions, like deploying an endless recursive process model.
In today's chaos experiment, we want to experiment with [Batch processing](https://github.com/camunda/camunda/issues/11416) and how it can handle error conditions, like deploying an endless recursive process model.

![recursive process](call.png)

### Expected

When we deploy such a process model and create an instance of it, we expect that the execution runs endlessly. In normal process models with batch processing, the execution of a process instance continues until a wait state is reached. In this process model, there exists no wait state. To handle such cases, we have implemented a batch limit, which can be configured via [maxCommandsInBatch](https://github.com/camunda/zeebe/blob/main/dist/src/main/config/broker.standalone.yaml.template#L695). By default, this configuration is set to 100 commands, meaning the stream processor will process at most 100 commands in one batch before it stops to make room for other work.
When we deploy such a process model and create an instance of it, we expect that the execution runs endlessly. In normal process models with batch processing, the execution of a process instance continues until a wait state is reached. In this process model, there exists no wait state. To handle such cases, we have implemented a batch limit, which can be configured via [maxCommandsInBatch](https://github.com/camunda/camunda/blob/main/dist/src/main/config/broker.standalone.yaml.template#L695). By default, this configuration is set to 100 commands, meaning the stream processor will process at most 100 commands in one batch before it stops to make room for other work.
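
As a rough illustration of how this limit could be tuned for an experiment (the environment variable below is an assumption based on the usual mapping of the `zeebe.broker.processing.maxCommandsInBatch` key; verify it against the linked template):

```shell
# Sketch: override the batch limit on a broker via the standard
# environment-variable mapping of zeebe.broker.processing.maxCommandsInBatch
# (assumed key; check broker.standalone.yaml.template for your version).
export ZEEBE_BROKER_PROCESSING_MAXCOMMANDSINBATCH=100
```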

We expect that our limit handling steps in during the execution and that we can also execute other instances or cancel the problematic process instance. Furthermore, we expect the system to stay healthy; we should be able to update our health check continuously.

### Actual

Before we can start with our experiment we need to start our benchmark Zeebe cluster. This has become easier since I wrote the last post. Previously we had to use the scripts and Makefile in the [zeebe/benchmark sub-directory](https://github.com/camunda/zeebe/tree/main/benchmarks/setup).
Before we can start with our experiment we need to start our benchmark Zeebe cluster. This has become easier since I wrote the last post. Previously we had to use the scripts and Makefile in the [zeebe/benchmark sub-directory](https://github.com/camunda/camunda/tree/main/benchmarks/setup).

We have now provided new [Benchmark Helm charts](https://github.com/zeebe-io/benchmark-helm), based on our Camunda Platform Helm charts. They allow us to deploy a new zeebe benchmark setup via:

@@ -75,7 +75,7 @@ We can see that the processing starts immediately quite high and is continuously

**We have two instances running, one on partition three and one on partition one.**

_One interesting fact is that the topology request rate is also up to 0.400 per second, so potentially every 2.5 seconds we send a topology request to the gateway. But there is no application deployed that does this. [I recently found out again](https://github.com/camunda/zeebe/pull/11599#discussion_r1109846523) that we use the Zeebe client in the gateway to request the topology. It might be worth investigating whether this is an issue._
_One interesting fact is that the topology request rate is also up to 0.400 per second, so potentially every 2.5 seconds we send a topology request to the gateway. But there is no application deployed that does this. [I recently found out again](https://github.com/camunda/camunda/pull/11599#discussion_r1109846523) that we use the Zeebe client in the gateway to request the topology. It might be worth investigating whether this is an issue._

After observing this cluster for a while we can see that after around five minutes the cluster fails. The processing for the partitions breaks down to 1/10 of what was processed before. A bit later it looks like it tries to come back, but fails again.

@@ -103,5 +103,5 @@ With this, I mark this chaos experiment as failed. We need to investigate this f
## Found Bugs

* [zbchaos logs debug message on normal usage](https://github.com/zeebe-io/zeebe-chaos/issues/323)
* [Every 2.5 seconds we send a topology request, which is shown in the metrics](https://github.com/camunda/zeebe/issues/11799)
* [Batch processing doesn't respect the limit](https://github.com/camunda/zeebe/issues/11798)
* [Every 2.5 seconds we send a topology request, which is shown in the metrics](https://github.com/camunda/camunda/issues/11799)
* [Batch processing doesn't respect the limit](https://github.com/camunda/camunda/issues/11798)
10 changes: 5 additions & 5 deletions chaos-days/blog/2023-04-06-gateway-termination/index.md
@@ -16,7 +16,7 @@ authors: zell
In today's chaos day, we wanted to experiment with the gateway and resiliency of workers.

We have seen in recent weeks some issues within our benchmarks when gateways have been restarted,
see [zeebe#11975](https://github.com/camunda/zeebe/issues/11975).
see [zeebe#11975](https://github.com/camunda/camunda/issues/11975).

We did a similar experiment [in the past](../2022-02-15-Standalone-Gateway-in-CCSaaS/index.md);
today we want to focus on self-managed ([benchmarks with our helm charts](https://helm.camunda.io/)).
@@ -25,14 +25,14 @@ Ideally, we can automate this as well soon.
Today [Nicolas](https://github.com/npepinpe) joined me on the chaos day :tada:

**TL;DR;** We were able to show that the workers (clients) can reconnect after a gateway is shut down :white_check_mark:
Furthermore, we have discovered a potential performance issue at lower load, which impacts process execution latency ([zeebe#12311](https://github.com/camunda/zeebe/issues/12311)).
Furthermore, we have discovered a potential performance issue at lower load, which impacts process execution latency ([zeebe#12311](https://github.com/camunda/camunda/issues/12311)).

<!--truncate-->

## Chaos Experiment

We will use our [Zeebe benchmark helm charts](https://github.com/zeebe-io/benchmark-helm) to set up the test cluster, and
our helper scripts [here](https://github.com/camunda/zeebe/tree/main/benchmarks/setup).
our helper scripts [here](https://github.com/camunda/camunda/tree/main/benchmarks/setup).
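
As a rough sketch, the cluster setup could look like the following (the Helm repository URL, chart name, and release name are assumptions based on the zeebe-io/benchmark-helm repository; adjust to your environment):

```shell
# Sketch: install the Zeebe benchmark chart into its own namespace
# (repository URL, chart name and release name are assumptions).
helm repo add zeebe-benchmark https://zeebe-io.github.io/benchmark-helm
helm repo update
helm install gateway-termination zeebe-benchmark/zeebe-benchmark \
  --namespace gateway-termination --create-namespace
```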

### Setup:

@@ -47,7 +47,7 @@ We will run the benchmark with a low load, 10 process instances per second creat
we deploy one starter and worker. This reduces the blast radius and allows us to observe more easily how the workers
behave when a gateway is restarted.

During the experiment, we will use our [grafana dashboard](https://github.com/camunda/zeebe/tree/main/monitor/grafana) to
During the experiment, we will use our [grafana dashboard](https://github.com/camunda/camunda/tree/main/monitor/grafana) to
observe to which gateway the worker will connect and which gateway we need to stop/restart.
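
A minimal sketch of the gateway restart itself could look like this (the label selector and pod names are assumptions; they depend on the Helm chart and release name):

```shell
# Sketch: find the gateway pods, then shut down the one the worker is connected to
kubectl get pods -l app.kubernetes.io/component=zeebe-gateway
kubectl delete pod <release>-zeebe-gateway-<hash>

# watch the replacement pod come up and the worker reconnect
kubectl get pods -w
```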


@@ -299,6 +299,6 @@ We first expected that to be related to snapshotting, but snapshots happen much
![snapshot](snapshot-count.png)

Interestingly, it seems to be related to our segment creation (again), even though we recently built
async segment creation into our journal. We need to investigate this further within [zeebe#12311](https://github.com/camunda/zeebe/issues/12311).
async segment creation into our journal. We need to investigate this further within [zeebe#12311](https://github.com/camunda/camunda/issues/12311).

![segment](segment.png)
6 changes: 3 additions & 3 deletions chaos-days/blog/2023-05-15-SST-Partitioning-toggle/index.md
@@ -13,7 +13,7 @@ authors: zell

# Chaos Day Summary

On this chaos day I wanted to experiment with a new experimental feature we released recently: the [enablement of SST file partitioning in RocksDB](https://github.com/camunda/zeebe/pull/12483). This is an experimental feature of RocksDB, which we have now made available to our users as well, since we have seen great performance benefits, especially with larger runtime data.
On this chaos day I wanted to experiment with a new experimental feature we released recently: the [enablement of SST file partitioning in RocksDB](https://github.com/camunda/camunda/pull/12483). This is an experimental feature of RocksDB, which we have now made available to our users as well, since we have seen great performance benefits, especially with larger runtime data.

I wanted to experiment a bit with the SST partitioning and find out whether it would be possible to enable and disable the flag/configuration without issues.
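
A minimal sketch of such a toggle, assuming the experimental flag from the linked PR is exposed via the usual environment-variable mapping (the exact key should be verified against your Zeebe version):

```shell
# Sketch: toggle SST partitioning on a broker (assumed mapping of
# zeebe.broker.experimental.rocksdb.enableSstPartitioning).
export ZEEBE_BROKER_EXPERIMENTAL_ROCKSDB_ENABLESSTPARTITIONING=true   # enable
export ZEEBE_BROKER_EXPERIMENTAL_ROCKSDB_ENABLESSTPARTITIONING=false  # disable
```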

@@ -23,7 +23,7 @@ I wanted to experiment a bit with the SST partitioning and find out whether it w

## Chaos Experiment

For our chaos experiment, we again set up our [normal benchmark cluster](https://github.com/camunda/zeebe/tree/main/benchmarks/setup), this time without any clients (no workers/starters).
For our chaos experiment, we again set up our [normal benchmark cluster](https://github.com/camunda/camunda/tree/main/benchmarks/setup), this time without any clients (no workers/starters).

Setting all client replicas to zero:
```diff
@@ -70,7 +70,7 @@ When operating a cluster, I can enable the SST partitioning without an impact on

### Actual

As linked above, I again used our [benchmark/setup](https://github.com/camunda/zeebe/tree/main/benchmarks/setup) scripts to set up a cluster.
As linked above, I again used our [benchmark/setup](https://github.com/camunda/camunda/tree/main/benchmarks/setup) scripts to set up a cluster.

#### First Part: Verify Steady state
To verify the readiness and run all actions I used the [zbchaos](https://github.com/zeebe-io/zeebe-chaos/tree/zbchaos-v1.0.0) tool.
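
A short sketch of how that looks (assumed zbchaos sub-commands; see the zbchaos v1.0.0 documentation linked above):

```shell
# Sketch: wait until all brokers/partitions are ready, then check that
# a process instance can still be created and completed.
zbchaos verify readiness
zbchaos verify steady-state
```
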
@@ -54,7 +54,7 @@ In our second experiment, we will disable the SST partitioning again.
### Actual

As linked above, I again used our [benchmark/setup](https://github.com/camunda/zeebe/tree/main/benchmarks/setup) scripts to set up a cluster.
As linked above, I again used our [benchmark/setup](https://github.com/camunda/camunda/tree/main/benchmarks/setup) scripts to set up a cluster.

```shell
$ diff ../default/values.yaml values.yaml
Expand Down
@@ -14,7 +14,7 @@ authors: zell

New day, new chaos. :skull: In today's chaos day I want to pick up a topic which has bothered people for a long time. I created a [chaos day three years ago](https://zeebe-io.github.io/zeebe-chaos/2020/07/16/big-multi-instance/) around this topic as well.

Today, we experiment with large multi-instances again. In the recent patch release [8.2.5](https://github.com/camunda/zeebe/releases/tag/8.2.5) we fixed an issue with spawning larger multi-instances. Previously, if you created a process instance with a large multi-instance, it was likely that the process instance would be blacklisted, since the multi-instance spawning ran into `maxMessageSize` limitations.
Today, we experiment with large multi-instances again. In the recent patch release [8.2.5](https://github.com/camunda/camunda/releases/tag/8.2.5) we fixed an issue with spawning larger multi-instances. Previously, if you created a process instance with a large multi-instance, it was likely that the process instance would be blacklisted, since the multi-instance spawning ran into `maxMessageSize` limitations.

This means the process instance was stuck and was no longer executable. In Operate this was not shown, which caused a lot of friction and confusion for users. With the recent fix, Zeebe should chunk even large collections into smaller batches to spawn/execute the multi-instance without any issues.
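
To reproduce the scenario, an instance with a large input collection can be created with `zbctl`; the process id and variable name below are hypothetical, use the ones from your model:

```shell
# Sketch: create an instance with a 100k-element input collection
# ("multi-instance-process" and "items" are placeholder names).
zbctl create instance multi-instance-process \
  --variables "{\"items\": $(seq 1 100000 | jq -sc .)}" \
  --insecure
```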

@@ -197,6 +197,6 @@ When reaching a certain limit (maxMessageSize) we get a described rejection by t

## Found Bugs

* In a previous test I ran into https://github.com/camunda/zeebe/issues/12918
* Related bug regarding the input collection https://github.com/camunda/zeebe/issues/12873
* In a previous test I ran into https://github.com/camunda/camunda/issues/12918
* Related bug regarding the input collection https://github.com/camunda/camunda/issues/12873

@@ -12,15 +12,15 @@ authors: zell

# Chaos Day Summary

Today, we want to experiment with hot backups in SaaS and a larger runtime state in Zeebe, and see how this impacts the ongoing processing in Zeebe (or not?). This is part of the investigation of a recently created bug issue we wanted to verify/reproduce: [#14696](https://github.com/camunda/zeebe/issues/14696).
Today, we want to experiment with hot backups in SaaS and a larger runtime state in Zeebe, and see how this impacts the ongoing processing in Zeebe (or not?). This is part of the investigation of a recently created bug issue we wanted to verify/reproduce: [#14696](https://github.com/camunda/camunda/issues/14696).

**TL;DR;** We were able to prove that hot backups indeed do not impact overall processing throughput in Zeebe. We found that a full Elasticsearch disk might impact or even fail your backups, which is not transparent to the user.

<!--truncate-->

## Chaos Experiment

For the experiment, we have set up a Camunda SaaS cluster (G3-M configuration) and run the [cloud benchmark](https://github.com/camunda/zeebe/tree/main/benchmarks/setup/cloud-default) workload against it. During the experiment, we will run a stable load, which will cause the runtime state to increase. We will create/initiate backups at different stages to verify the impact on processing depending on the state size.
For the experiment, we have set up a Camunda SaaS cluster (G3-M configuration) and run the [cloud benchmark](https://github.com/camunda/camunda/tree/main/benchmarks/setup/cloud-default) workload against it. During the experiment, we will run a stable load, which will cause the runtime state to increase. We will create/initiate backups at different stages to verify the impact on processing depending on the state size.

We kept the starter rate (process instance creation at 100 PI/s) but reduced the worker capacity and replicas.

6 changes: 3 additions & 3 deletions chaos-days/blog/2023-11-30-Job-push-overloading/index.md
@@ -29,7 +29,7 @@ We expect that if the workers are slowing down, the load is distributed to other
### Actual


We deployed a normal benchmark, with [default configurations](https://github.com/camunda/zeebe/blob/main/benchmarks/setup/default/values.yaml).
We deployed a normal benchmark, with [default configurations](https://github.com/camunda/camunda/blob/main/benchmarks/setup/default/values.yaml).


We slowed the workers down, in the sense that we changed [the completionDelay to 1250 ms](https://github.com/zeebe-io/benchmark-helm/blob/main/charts/zeebe-benchmark/templates/worker.yaml#L30)
@@ -56,7 +56,7 @@ So far so good, first experiment worked as expected :white_check_mark:
The normal scenario when something is slow is for a user to scale up. This is what we did in the next experiment: we scaled the workers to 10 replicas (from 3) to verify how the system behaves in this case.
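
A minimal sketch of that scale-up (the deployment name is illustrative; it depends on the benchmark release):

```shell
# Sketch: scale the benchmark workers from 3 to 10 replicas
kubectl scale deployment/worker --replicas=10
```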


Something to keep in mind: when the completion delay is 1250 ms, we [multiply the activation timeout by 6 in our workers](https://github.com/camunda/zeebe/blob/7002d53a079c06ab3a94f5485f022681a41dc9ed/benchmarks/project/src/main/java/io/camunda/zeebe/Worker.java#L113). This means completionDelay: 1250 ms -> job timeout 7.5 s.
Something to keep in mind: when the completion delay is 1250 ms, we [multiply the activation timeout by 6 in our workers](https://github.com/camunda/camunda/blob/7002d53a079c06ab3a94f5485f022681a41dc9ed/benchmarks/project/src/main/java/io/camunda/zeebe/Worker.java#L113). This means completionDelay: 1250 ms -> job timeout 7.5 s.

### Expected

@@ -145,7 +145,7 @@ We wanted to understand and experiment with the impact of a slow worker on diffe

To see such an impact in our metrics, we had to patch our current execution metrics so that they include the BPMN processId; this lets us differentiate between the execution times of different processes.

See the related branch for more details [ck-latency-metrics](https://github.com/camunda/zeebe/tree/ck-latency-metrics)
See the related branch for more details [ck-latency-metrics](https://github.com/camunda/camunda/tree/ck-latency-metrics)


Furthermore, a new process model `slow-task.bpm` was added, along with new deployments to create such instances and work on them. The process model was similar to the benchmark model; only the job type was changed.