diff --git a/chaos-days/blog/2022-08-02-deployment-distribution/index.md b/chaos-days/blog/2022-08-02-deployment-distribution/index.md
index e36295dd7..b4bff58be 100644
--- a/chaos-days/blog/2022-08-02-deployment-distribution/index.md
+++ b/chaos-days/blog/2022-08-02-deployment-distribution/index.md
@@ -14,14 +14,14 @@ authors: zell
 
 # Chaos Day Summary
 
-We encountered recently a severe bug [zeebe#9877](https://github.com/camunda/zeebe/issues/9877) and I was wondering why we haven't spotted it earlier, since we have chaos experiments for it. I realized two things:
+We recently encountered a severe bug [zeebe#9877](https://github.com/camunda/camunda/issues/9877) and I was wondering why we hadn't spotted it earlier, since we have chaos experiments for it. I realized two things:
 
 1. The experiments only check for parts of it (BPMN resource only). The production code has changed, and a new feature has been added (DMN) but the experiments/tests haven't been adjusted.
 2. More importantly we disabled the automated execution of the deployment distribution experiment because it was flaky due to a missing standalone gateway in Camunda Cloud SaaS [zeebe-io/zeebe-chaos#61](https://github.com/zeebe-io/zeebe-chaos/issues/61). This is no longer the case, see [Standalone Gateway in CCSaaS](../2022-02-15-Standalone-Gateway-in-CCSaaS/index.md)
 
 On this chaos day I want to bring the automation of this chaos experiment back to life. If I have still time I want to enhance the experiment.
 
-**TL;DR;** The experiment still worked, and our deployment distribution is still resilient against network partitions. It also works with DMN resources. I can enable the experiment again, and we can close [zeebe-io/zeebe-chaos#61](https://github.com/zeebe-io/zeebe-chaos/issues/61). Unfortunately, we were not able to reproduce [zeebe#9877](https://github.com/camunda/zeebe/issues/9877) but we did some good preparation work for it.
+**TL;DR;** The experiment still worked, and our deployment distribution is still resilient against network partitions. It also works with DMN resources. I can enable the experiment again, and we can close [zeebe-io/zeebe-chaos#61](https://github.com/zeebe-io/zeebe-chaos/issues/61). Unfortunately, we were not able to reproduce [zeebe#9877](https://github.com/camunda/camunda/issues/9877) but we did some good preparation work for it.
@@ -190,7 +190,7 @@ We can adjust the experiment further to await the result of the process executio
 
 #### Reproduce our bug
 
-The current experiment didn't reproduce the bug in [zeebe#9877](https://github.com/camunda/zeebe/issues/9877), since the DMN resource has to be distributed multiple times. Currently, we create a network partition such that the distribution doesn't work at all.
+The current experiment didn't reproduce the bug in [zeebe#9877](https://github.com/camunda/camunda/issues/9877), since the DMN resource has to be distributed multiple times. Currently, we create a network partition such that the distribution doesn't work at all.
 
 ![](deploymentDistributionExperimentV2.png)
 
diff --git a/chaos-days/blog/2023-02-23-Recursive-call-activity/index.md b/chaos-days/blog/2023-02-23-Recursive-call-activity/index.md
index d4a080578..5d34cde33 100644
--- a/chaos-days/blog/2023-02-23-Recursive-call-activity/index.md
+++ b/chaos-days/blog/2023-02-23-Recursive-call-activity/index.md
@@ -13,7 +13,7 @@ authors: zell
 
 # Chaos Day Summary
 
 Long time no see. Happy to do my first chaos day this year. In the last week have implemented interesting features, which I would like to experiment with.
-[Batch processing](https://github.com/camunda/zeebe/issues/11416) was one of them.
+[Batch processing](https://github.com/camunda/camunda/issues/11416) was one of them.
 
 **TL;DR;** Chaos experiment failed. :boom: Batch processing doesn't seem to respect the configured limit, which causes issues with processing and influences the health of the system. We found a bug :muscle:
 
@@ -21,19 +21,19 @@ Long time no see. Happy to do my first chaos day this year. In the last week hav
 
 ## Chaos Experiment
 
-In today's chaos experiment, we want to experiment with [Batch processing](https://github.com/camunda/zeebe/issues/11416) and how it can handle error conditions, like deploying an endless recursive process model.
+In today's chaos experiment, we want to experiment with [Batch processing](https://github.com/camunda/camunda/issues/11416) and how it can handle error conditions, like deploying an endless recursive process model.
 
 ![recursive process](call.png)
 
 ### Expected
 
-When we deploy such a process model and create an instance of it, we expect that the execution is done endlessly. In normal process models with batch processing, the execution of a process instance is done until a wait state is reached. In this process model, there exists no wait state. To handle such cases, we have implemented a batch limit, which can be configured via [maxCommandsInBatch](https://github.com/camunda/zeebe/blob/main/dist/src/main/config/broker.standalone.yaml.template#L695). This configuration is by default set to 100 commands. Meaning the stream processor will process 100 commands until it stops, to make room for other things.
+When we deploy such a process model and create an instance of it, we expect that the execution runs endlessly. In normal process models with batch processing, the execution of a process instance is done until a wait state is reached. In this process model, there exists no wait state. To handle such cases, we have implemented a batch limit, which can be configured via [maxCommandsInBatch](https://github.com/camunda/camunda/blob/main/dist/src/main/config/broker.standalone.yaml.template#L695). This configuration is by default set to 100 commands, meaning the stream processor will process up to 100 commands before it stops to make room for other work.
 
 We expect that our limit handling steps in during the execution and we can execute also other instances or, cancel the problematic process instance. Furthermore, we expect to stay healthy, we should be able to update our health check continuously.
 
 ### Actual
 
-Before we can start with our experiment we need to start our benchmark Zeebe cluster. This has become easier now since I have written the last post. Previously we had to use the scripts and Makefile in the [zeebe/benchmark sub-directory](https://github.com/camunda/zeebe/tree/main/benchmarks/setup).
+Before we can start with our experiment we need to start our benchmark Zeebe cluster. This has become easier since I wrote the last post. Previously we had to use the scripts and Makefile in the [zeebe/benchmark sub-directory](https://github.com/camunda/camunda/tree/main/benchmarks/setup).
 
 We have now provided new [Benchmark Helm charts](https://github.com/zeebe-io/benchmark-helm), based on our Camunda Platform Helm charts.
 They allow us to deploy a new zeebe benchmark setup via:
@@ -75,7 +75,7 @@ We can see that the processing starts immediately quite high and is continuously
 
 **We have two instances running, one on partition three and one on partition one.**
 
-_One interesting fact is that the topology request rate is also up to 0.400 per second, so potentially every 2.5 seconds we send a topology request to the gateway. But there is no application deployed that does this. [I have recently found out again](https://github.com/camunda/zeebe/pull/11599#discussion_r1109846523), that we have the Zeebe client usage in the gateway to request the topology. Might be worth investigating whether this is an issue._
+_One interesting fact is that the topology request rate is also up to 0.400 per second, so potentially every 2.5 seconds we send a topology request to the gateway. But there is no application deployed that does this. [I have recently found out again](https://github.com/camunda/camunda/pull/11599#discussion_r1109846523) that we have the Zeebe client usage in the gateway to request the topology. Might be worth investigating whether this is an issue._
 
 After observing this cluster for a while we can see that after around five minutes the cluster fails. The processing for the partitions breaks down to 1/10 of what was processed before. A bit later it looks like it tries to come back but, failed again.
 
@@ -103,5 +103,5 @@ With this, I mark this chaos experiment as failed. We need to investigate this f
 
 ## Found Bugs
 
 * [zbchaos logs debug message on normal usage](https://github.com/zeebe-io/zeebe-chaos/issues/323)
-* [Every 2.5 seconds we send a topology request, which is shown in the metrics](https://github.com/camunda/zeebe/issues/11799)
-* [Batch processing doesn't respect the limit](https://github.com/camunda/zeebe/issues/11798)
+* [Every 2.5 seconds we send a topology request, which is shown in the metrics](https://github.com/camunda/camunda/issues/11799)
+* [Batch processing doesn't respect the limit](https://github.com/camunda/camunda/issues/11798)
diff --git a/chaos-days/blog/2023-04-06-gateway-termination/index.md b/chaos-days/blog/2023-04-06-gateway-termination/index.md
index 38e43d102..4a01d5e3b 100644
--- a/chaos-days/blog/2023-04-06-gateway-termination/index.md
+++ b/chaos-days/blog/2023-04-06-gateway-termination/index.md
@@ -16,7 +16,7 @@ authors: zell
 
 In today's chaos day, we wanted to experiment with the gateway and resiliency of workers.
 We have seen in recent weeks some issues within our benchmarks when gateways have been restarted,
-see [zeebe#11975](https://github.com/camunda/zeebe/issues/11975).
+see [zeebe#11975](https://github.com/camunda/camunda/issues/11975).
 
 We did a similar experiment [in the past](../2022-02-15-Standalone-Gateway-in-CCSaaS/index.md), today we want to focus on self-managed ([benchmarks with our helm charts](https://helm.camunda.io/)).
 
@@ -25,14 +25,14 @@ Ideally, we can automate this as well soon.
 
 Today [Nicolas](https://github.com/npepinpe) joined me on the chaos day :tada:
 
 **TL;DR;** We were able to show that the workers (clients) can reconnect after a gateway is shutdown :white_check_mark:
-Furthermore, we have discovered a potential performance issue on lower load, which impacts process execution latency ([zeebe#12311](https://github.com/camunda/zeebe/issues/12311)).
+Furthermore, we have discovered a potential performance issue on lower load, which impacts process execution latency ([zeebe#12311](https://github.com/camunda/camunda/issues/12311)).
 ## Chaos Experiment
 
 We will use our [Zeebe benchmark helm charts](https://github.com/zeebe-io/benchmark-helm) to set up the test cluster, and
-our helper scripts [here](https://github.com/camunda/zeebe/tree/main/benchmarks/setup).
+our helper scripts [here](https://github.com/camunda/camunda/tree/main/benchmarks/setup).
 
 ### Setup:
 
@@ -47,7 +47,7 @@ We will run the benchmark with a low load, 10 process instances per second creat
 we deploy one starter and worker. This reduces the blast radius and allows us to observe more easily how the workers behave when a gateway is restarted.
 
-During the experiment, we will use our [grafana dashboard](https://github.com/camunda/zeebe/tree/main/monitor/grafana) to
+During the experiment, we will use our [grafana dashboard](https://github.com/camunda/camunda/tree/main/monitor/grafana) to
 observe to which gateway the worker will connect and which gateway we need to stop/restart.
 
@@ -299,6 +299,6 @@ We first expected that to be related to snapshotting, but snapshots happen much
 
 ![snapshot](snapshot-count.png)
 
 Interestingly is that it seems to be related to our segment creation (again), even if we have
-async segment creation in our journal built recently. We need to investigate this further within [zeebe#12311](https://github.com/camunda/zeebe/issues/12311).
+async segment creation in our journal built recently. We need to investigate this further within [zeebe#12311](https://github.com/camunda/camunda/issues/12311).
 
 ![segment](segment.png)
diff --git a/chaos-days/blog/2023-05-15-SST-Partitioning-toggle/index.md b/chaos-days/blog/2023-05-15-SST-Partitioning-toggle/index.md
index 4ebdfc87b..031e673ba 100644
--- a/chaos-days/blog/2023-05-15-SST-Partitioning-toggle/index.md
+++ b/chaos-days/blog/2023-05-15-SST-Partitioning-toggle/index.md
@@ -13,7 +13,7 @@ authors: zell
 
 # Chaos Day Summary
 
-On this chaos day I wanted to experiment with a new experimental feature we have released recently. The [enablement of the partitioning of the SST files in RocksDB](https://github.com/camunda/zeebe/pull/12483). This is an experimental feature from RocksDb, which we made available now for our users as well, since we have seen great benefits in performance, especially with larger runtime data.
+On this chaos day I wanted to experiment with a new experimental feature we recently released: the [enablement of the partitioning of the SST files in RocksDB](https://github.com/camunda/camunda/pull/12483). This is an experimental feature from RocksDB, which we have now made available for our users as well, since we have seen great benefits in performance, especially with larger runtime data.
 
 I wanted to experiment a bit with the SST partitioning and find out whether it would be possible to enable and disable the flag/configuration without issues.
 
@@ -23,7 +23,7 @@ I wanted to experiment a bit with the SST partitioning and find out whether it w
 
 ## Chaos Experiment
 
-For our chaos experiment we set up again our [normal benchmark cluster](https://github.com/camunda/zeebe/tree/main/benchmarks/setup), this time without any clients (no workers/starters).
+For our chaos experiment we set up again our [normal benchmark cluster](https://github.com/camunda/camunda/tree/main/benchmarks/setup), this time without any clients (no workers/starters).
 Setting all client replicas to zero:
 
 ```diff
@@ -70,7 +70,7 @@ When operating a cluster, I can enable the SST partitioning without an impact on
 
 ### Actual
 
-As linked above I used again our [benchmark/setup](https://github.com/camunda/zeebe/tree/main/benchmarks/setup) scripts to set up a cluster.
+As linked above I used again our [benchmark/setup](https://github.com/camunda/camunda/tree/main/benchmarks/setup) scripts to set up a cluster.
 
 #### First Part: Verify Steady state
 To verify the readiness and run all actions I used the [zbchaos](https://github.com/zeebe-io/zeebe-chaos/tree/zbchaos-v1.0.0) tool.
diff --git a/chaos-days/blog/2023-05-19-Continuing-SST-Partitioning-toggle/index.md b/chaos-days/blog/2023-05-19-Continuing-SST-Partitioning-toggle/index.md
index a26729534..9c83f52ab 100644
--- a/chaos-days/blog/2023-05-19-Continuing-SST-Partitioning-toggle/index.md
+++ b/chaos-days/blog/2023-05-19-Continuing-SST-Partitioning-toggle/index.md
@@ -54,7 +54,7 @@ In our second experiment, we will disable the SST partitioning again.
 
 ### Actual
 
-As linked above I used again our [benchmark/setup](https://github.com/camunda/zeebe/tree/main/benchmarks/setup) scripts to set up a cluster.
+As linked above I used again our [benchmark/setup](https://github.com/camunda/camunda/tree/main/benchmarks/setup) scripts to set up a cluster.
 
 ```shell
 $ diff ../default/values.yaml values.yaml
diff --git a/chaos-days/blog/2023-06-02-Using-Large-Multi-Instance/index.md b/chaos-days/blog/2023-06-02-Using-Large-Multi-Instance/index.md
index 53d32c6fb..ce2fe1d1c 100644
--- a/chaos-days/blog/2023-06-02-Using-Large-Multi-Instance/index.md
+++ b/chaos-days/blog/2023-06-02-Using-Large-Multi-Instance/index.md
@@ -14,7 +14,7 @@ authors: zell
 
 New day new chaos. :skull: In today's chaos day I want to pick up a topic, which had bothered people for long time. I created a [chaos day three years ago](https://zeebe-io.github.io/zeebe-chaos/2020/07/16/big-multi-instance/) around this topic as well.
 
-Today, we experiment with large multi-instances again. In the recent patch release [8.2.5](https://github.com/camunda/zeebe/releases/tag/8.2.5) we fixed an issue with spawning larger multi instances. Previously if you have created a process instance with a large multi-instance it was likely that this caused to blacklist the process instance, since the multi-instance spawning ran into `maxMessageSize` limitations.
+Today, we experiment with large multi-instances again. In the recent patch release [8.2.5](https://github.com/camunda/camunda/releases/tag/8.2.5) we fixed an issue with spawning larger multi-instances. Previously, if you created a process instance with a large multi-instance, it was likely that the process instance would be blacklisted, since the multi-instance spawning ran into `maxMessageSize` limitations.
 This means the process instance was stuck and was no longer executable. In Operate this was not shown and caused a lot of friction or confusion to users. With the recent fix, Zeebe should chunk even large collections into smaller batches to spawn/execute the multi-instance without any issues.
@@ -197,6 +197,6 @@ When reaching a certain limit (maxMessageSize) we get a described rejection by t
 
 ## Found Bugs
 
- * in a previous test I run into https://github.com/camunda/zeebe/issues/12918
- * Related bug regarding the input collection https://github.com/camunda/zeebe/issues/12873
+ * in a previous test I ran into https://github.com/camunda/camunda/issues/12918
+ * Related bug regarding the input collection https://github.com/camunda/camunda/issues/12873
diff --git a/chaos-days/blog/2023-11-07-Hot-backups-impact-on-processing/index.md b/chaos-days/blog/2023-11-07-Hot-backups-impact-on-processing/index.md
index e0e742cc1..4523040c2 100644
--- a/chaos-days/blog/2023-11-07-Hot-backups-impact-on-processing/index.md
+++ b/chaos-days/blog/2023-11-07-Hot-backups-impact-on-processing/index.md
@@ -12,7 +12,7 @@ authors: zell
 
 # Chaos Day Summary
 
-Today, we want to experiment with hot backups in SaaS and a larger runtime state in Zeebe and how it impacts the ongoing processing in Zeebe (or not?). This is part of the investigation of a recently created bug issue we wanted to verify/reproduce [#14696](https://github.com/camunda/zeebe/issues/14696).
+Today, we want to experiment with hot backups in SaaS and a larger runtime state in Zeebe, and how it impacts the ongoing processing (or not?). This is part of the investigation of a recently created bug issue we wanted to verify/reproduce [#14696](https://github.com/camunda/camunda/issues/14696).
 
 **TL;DR;** We were able to prove that hot backups are indeed not impacting overall processing throughput in Zeebe. We found that having a full Elasticsearch disk might impact or even fail your backups, which is intransparent to the user.
 
@@ -20,7 +20,7 @@ Today, we want to experiment with hot backups in SaaS and a larger runtime state
 
 ## Chaos Experiment
 
-For the experiment, we have set up a Camunda SaaS cluster (G3-M configuration), and run the [cloud benchmark](https://github.com/camunda/zeebe/tree/main/benchmarks/setup/cloud-default) workload against it. During the experiment, we will run a stable load, which will cause to increase in the runtime state. We will create/initiate in different stages backups to verify the impact on processing depending on state size.
+For the experiment, we have set up a Camunda SaaS cluster (G3-M configuration), and run the [cloud benchmark](https://github.com/camunda/camunda/tree/main/benchmarks/setup/cloud-default) workload against it. During the experiment, we will run a stable load, which will cause the runtime state to increase. We will create/initiate backups in different stages to verify the impact on processing depending on state size.
 
 We kept the starter rate (creation of process instance 100 PI/s) but reduced the worker capacity and replicas.
diff --git a/chaos-days/blog/2023-11-30-Job-push-overloading/index.md b/chaos-days/blog/2023-11-30-Job-push-overloading/index.md
index 475d98318..81fc6fb37 100644
--- a/chaos-days/blog/2023-11-30-Job-push-overloading/index.md
+++ b/chaos-days/blog/2023-11-30-Job-push-overloading/index.md
@@ -29,7 +29,7 @@ We expect that if the workers are slowing down, the load is distributed to other
 
 ### Actual
 
-We deployed a normal benchmark, with [default configurations](https://github.com/camunda/zeebe/blob/main/benchmarks/setup/default/values.yaml).
+We deployed a normal benchmark with [default configurations](https://github.com/camunda/camunda/blob/main/benchmarks/setup/default/values.yaml).
 We slowed the workers down, in the sense that we changed [the completionDelay to 1250 ms](https://github.com/zeebe-io/benchmark-helm/blob/main/charts/zeebe-benchmark/templates/worker.yaml#L30)
 
@@ -56,7 +56,7 @@ So far so good, first experiment worked as expected :white_check_mark:
 
 The normal scenario when something is slow is for a user to scale up. This is what we did in the next experiment, we scaled the workers to 10 replicas (from 3), to verify how the system behaves in this case.
 
-Something to keep in mind when the completion delay is 1250ms, we [multiply the activation timeout by 6 in our workers](https://github.com/camunda/zeebe/blob/7002d53a079c06ab3a94f5485f022681a41dc9ed/benchmarks/project/src/main/java/io/camunda/zeebe/Worker.java#L113). This means completionDelay: 1250 -> job timeout 7.5s
+Something to keep in mind: when the completion delay is 1250ms, we [multiply the activation timeout by 6 in our workers](https://github.com/camunda/camunda/blob/7002d53a079c06ab3a94f5485f022681a41dc9ed/benchmarks/project/src/main/java/io/camunda/zeebe/Worker.java#L113). This means completionDelay: 1250 -> job timeout 7.5s
 
 ### Expected
 
@@ -145,7 +145,7 @@ We wanted to understand and experiment with the impact of a slow worker on diffe
 
 To see such an impact in our metrics we had to patch our current execution metrics, such that includes the BPMN processId, so we can differentiate between execution times of different processes.
 
-See the related branch for more details [ck-latency-metrics](https://github.com/camunda/zeebe/tree/ck-latency-metrics)
+See the related branch for more details [ck-latency-metrics](https://github.com/camunda/camunda/tree/ck-latency-metrics)
 
 Furthermore, a new process model was added `slow-task.bpm` and new deployments to create such instances and work on them. The process model was similar to the benchmark model, only the job type has been changed.
diff --git a/chaos-days/blog/2023-12-06-Job-Push-resiliency/index.md b/chaos-days/blog/2023-12-06-Job-Push-resiliency/index.md
index 638994977..49770b22e 100644
--- a/chaos-days/blog/2023-12-06-Job-Push-resiliency/index.md
+++ b/chaos-days/blog/2023-12-06-Job-Push-resiliency/index.md
@@ -254,4 +254,4 @@ Again we were able to show that job push is resilient, and can even handle a com
 
 ## Found Bugs
 
 * On restart (especially on cluster restart) it looks like job push engine metrics are counted multiple times
-* [We found a place where we should better handle the exception in pushing async.](https://github.com/camunda/zeebe/blob/a86decce9a46218798663e3466267a49adef506e/transport/src/main/java/io/camunda/zeebe/transport/stream/impl/RemoteStreamPusher.java#L55-L56C14)
+* [We found a place where we should better handle the exception in pushing async.](https://github.com/camunda/camunda/blob/a86decce9a46218798663e3466267a49adef506e/transport/src/main/java/io/camunda/zeebe/transport/stream/impl/RemoteStreamPusher.java#L55-L56C14)
diff --git a/chaos-days/blog/2023-12-19-Dynamic-Scaling-with-Dataloss/index.md b/chaos-days/blog/2023-12-19-Dynamic-Scaling-with-Dataloss/index.md
index 5c2f55e86..33674b015 100644
--- a/chaos-days/blog/2023-12-19-Dynamic-Scaling-with-Dataloss/index.md
+++ b/chaos-days/blog/2023-12-19-Dynamic-Scaling-with-Dataloss/index.md
@@ -147,4 +147,4 @@ We want to improve this behavior in the future but likely can't prevent it compl
 
 That means that there is an increased risk of unavailable partitions during scaling.
 However, this only occurs if another broker becomes unavailable with an unfortunate timing and resolves itself automatically once the broker is available again.
-Zeebe issue: https://github.com/camunda/zeebe/issues/15679
\ No newline at end of file
+Zeebe issue: https://github.com/camunda/camunda/issues/15679
\ No newline at end of file
diff --git a/chaos-days/blog/2023-12-20-Broker-scaling-performance/index.md b/chaos-days/blog/2023-12-20-Broker-scaling-performance/index.md
index 1da854543..5a8d39bad 100644
--- a/chaos-days/blog/2023-12-20-Broker-scaling-performance/index.md
+++ b/chaos-days/blog/2023-12-20-Broker-scaling-performance/index.md
@@ -40,7 +40,7 @@ Each process instance has a single service task:
 
 ![](./one_task.png)
 
-The processing load is generated by our own [benchmarking application](https://github.com/camunda/zeebe/tree/9e723b21b0e408fc2b97fd7d3f6b092af8e62dbe/benchmarks).
+The processing load is generated by our own [benchmarking application](https://github.com/camunda/camunda/tree/9e723b21b0e408fc2b97fd7d3f6b092af8e62dbe/benchmarks).
 
 ### Expected
 
@@ -118,7 +118,7 @@ The new process model consists of 10 tasks with two timers in-between, each dela
 
 ![](./ten_tasks.png)
 
-The processing load is generated by our own [benchmarking application](https://github.com/camunda/zeebe/tree/9e723b21b0e408fc2b97fd7d3f6b092af8e62dbe/benchmarks), initially starting 40 process instances per second.
+The processing load is generated by our own [benchmarking application](https://github.com/camunda/camunda/tree/9e723b21b0e408fc2b97fd7d3f6b092af8e62dbe/benchmarks), initially starting 40 process instances per second.
 
 This results in 400 jobs created and completed per second.
diff --git a/chaos-days/blog/2024-01-19-Job-Activation-Latency/index.md b/chaos-days/blog/2024-01-19-Job-Activation-Latency/index.md
index 46d5b680f..ade6a6ce0 100644
--- a/chaos-days/blog/2024-01-19-Job-Activation-Latency/index.md
+++ b/chaos-days/blog/2024-01-19-Job-Activation-Latency/index.md
@@ -60,7 +60,7 @@ Already we can infer certain performance bottle necks based on the following:
 
 - In the worst case scenario, we have to poll _every_ partition.
 - The gateway does not know in advance which partitions have jobs available.
 - Scaling out your clients may have adverse effects by sending out too many requests which all have to be processed independently
-- [If you have a lot of jobs, you can run into major performance issues when accessing the set of available jobs](https://github.com/camunda/zeebe/issues/11813)
+- [If you have a lot of jobs, you can run into major performance issues when accessing the set of available jobs](https://github.com/camunda/camunda/issues/11813)
 
 So if we have, say, 30 partitions, and each gateway-to-broker request takes 100ms, fetching the jobs on the last partition will take up to 3 seconds, even though the actual activation time on that partition was only 100ms.
 
@@ -70,7 +70,7 @@ One would think a workaround to this issue would simply be to poll more often, b
 
 ### Long polling: a second implementation
 
-To simplify things, the Zeebe team introduced [long polling in 2019](https://github.com/camunda/zeebe/issues/2825). [Long polling](https://en.wikipedia.org/wiki/Push_technology#Long_polling) is a fairly common technique to emulate a push or streaming approach while maintaing the request-response pattern of polling. Essentially, if the server has nothing to send to the client, instead of completing the request it will hold it until content is available, or a timeout is reached.
+To simplify things, the Zeebe team introduced [long polling in 2019](https://github.com/camunda/camunda/issues/2825). [Long polling](https://en.wikipedia.org/wiki/Push_technology#Long_polling) is a fairly common technique to emulate a push or streaming approach while maintaining the request-response pattern of polling. Essentially, if the server has nothing to send to the client, instead of completing the request it will hold it until content is available, or a timeout is reached.
 
 In Zeebe, this means that if we did not reach the maximum number of jobs to activate after polling all partitions, the request is parked but not closed. Eventually when jobs are available, the brokers will make this information known to the gateways, who will then unpark the oldest request and start a new polling round.
 
@@ -90,7 +90,7 @@ However, there are still some issues:
 
 ## Job push: third time's the charm
 
-In order to solve these issues, the team decided to implement [a push-based approach to job activation](https://github.com/camunda/zeebe/issues/11231).
+In order to solve these issues, the team decided to implement [a push-based approach to job activation](https://github.com/camunda/camunda/issues/11231).
 
 Essentially, we added a new `StreamActivatedJobs` RPC to our gRPC protocol, a so-called [server streaming RPC](https://grpc.io/docs/what-is-grpc/core-concepts/#server-streaming-rpc).
 In our case, this is meant to be a long-lived stream, such that the call is completed only if the client terminates it, or if the server is shutting down.
diff --git a/go-chaos/backend/clients.go b/go-chaos/backend/clients.go
index 7c4fc926a..4d5306250 100644
--- a/go-chaos/backend/clients.go
+++ b/go-chaos/backend/clients.go
@@ -15,7 +15,7 @@ package backend
 
 import (
-	"github.com/camunda/zeebe/clients/go/v8/pkg/zbc"
+	"github.com/camunda/camunda/clients/go/v8/pkg/zbc"
 	"github.com/zeebe-io/zeebe-chaos/go-chaos/internal"
 )
diff --git a/go-chaos/backend/connection.go b/go-chaos/backend/connection.go
index 3f789c95b..7460e1408 100644
--- a/go-chaos/backend/connection.go
+++ b/go-chaos/backend/connection.go
@@ -18,7 +18,7 @@ import (
 	"errors"
 	"fmt"
 
-	"github.com/camunda/zeebe/clients/go/v8/pkg/zbc"
+	"github.com/camunda/camunda/clients/go/v8/pkg/zbc"
 	"github.com/zeebe-io/zeebe-chaos/go-chaos/internal"
 	v1 "k8s.io/api/core/v1"
 )
diff --git a/go-chaos/cmd/stress.go b/go-chaos/cmd/stress.go
index f855ef705..0bcdb037a 100644
--- a/go-chaos/cmd/stress.go
+++ b/go-chaos/cmd/stress.go
@@ -18,7 +18,7 @@ import (
 	"errors"
 	"fmt"
 
-	"github.com/camunda/zeebe/clients/go/v8/pkg/zbc"
+	"github.com/camunda/camunda/clients/go/v8/pkg/zbc"
 	"github.com/spf13/cobra"
 	"github.com/zeebe-io/zeebe-chaos/go-chaos/internal"
 	v1 "k8s.io/api/core/v1"
diff --git a/go-chaos/cmd/topology.go b/go-chaos/cmd/topology.go
index 14cc6f73d..cf8d394ed 100644
--- a/go-chaos/cmd/topology.go
+++ b/go-chaos/cmd/topology.go
@@ -22,7 +22,7 @@ import (
 	"strings"
 	"text/tabwriter"
 
-	"github.com/camunda/zeebe/clients/go/v8/pkg/pb"
+	"github.com/camunda/camunda/clients/go/v8/pkg/pb"
 	"github.com/spf13/cobra"
 	"github.com/zeebe-io/zeebe-chaos/go-chaos/internal"
 )
diff --git a/go-chaos/cmd/topology_test.go b/go-chaos/cmd/topology_test.go
index 8d53b3c01..d39d8593e 100644
--- a/go-chaos/cmd/topology_test.go
+++ b/go-chaos/cmd/topology_test.go
@@ -18,7 +18,7 @@ import (
 	"bytes"
 	"testing"
 
-	"github.com/camunda/zeebe/clients/go/v8/pkg/pb"
+	"github.com/camunda/camunda/clients/go/v8/pkg/pb"
 	"github.com/stretchr/testify/assert"
 )
diff --git a/go-chaos/cmd/worker.go b/go-chaos/cmd/worker.go
index 82d4f9761..55e8da593 100644
--- a/go-chaos/cmd/worker.go
+++ b/go-chaos/cmd/worker.go
@@ -19,9 +19,9 @@ import (
 	"os"
 	"strings"
 
-	"github.com/camunda/zeebe/clients/go/v8/pkg/entities"
-	zbworker "github.com/camunda/zeebe/clients/go/v8/pkg/worker"
-	"github.com/camunda/zeebe/clients/go/v8/pkg/zbc"
+	"github.com/camunda/camunda/clients/go/v8/pkg/entities"
+	zbworker "github.com/camunda/camunda/clients/go/v8/pkg/worker"
+	"github.com/camunda/camunda/clients/go/v8/pkg/zbc"
 	"github.com/spf13/cobra"
 	"github.com/zeebe-io/zeebe-chaos/go-chaos/internal"
 	worker "github.com/zeebe-io/zeebe-chaos/go-chaos/worker"
diff --git a/go-chaos/go.mod b/go-chaos/go.mod
index 01c9e2caf..bd76f9489 100644
--- a/go-chaos/go.mod
+++ b/go-chaos/go.mod
@@ -3,7 +3,7 @@ module github.com/zeebe-io/zeebe-chaos/go-chaos
 go 1.19
 
 require (
-	github.com/camunda/zeebe/clients/go/v8 v8.4.5
+	github.com/camunda/camunda/clients/go/v8 v8.4.5
 	github.com/rs/zerolog v1.32.0
 	github.com/spf13/cobra v1.8.0
 	github.com/stretchr/testify v1.9.0
diff --git a/go-chaos/go.sum b/go-chaos/go.sum
index f04112407..6d288174c 100644
--- a/go-chaos/go.sum
+++ b/go-chaos/go.sum
@@ -10,8 +10,8 @@ github.com/Microsoft/hcsshim v0.11.4/go.mod h1:smjE4dvqPX9Zldna+t5FG3rnoHhaB7QYx
 github.com/armon/go-socks5 v0.0.0-20160902184237-e75332964ef5 h1:0CwZNZbxp69SHPdPJAN/hZIm0C4OItdklCFmMRWYpio=
 github.com/asaskevich/govalidator v0.0.0-20200108200545-475eaeb16496 h1:zV3ejI06GQ59hwDQAvmK1qxOQGB3WuVTRoY0okPTAv0=
 github.com/asaskevich/govalidator v0.0.0-20200108200545-475eaeb16496/go.mod h1:oGkLhpf+kjZl6xBf758TQhh5XrAeiJv/7FRz/2spLIg=
-github.com/camunda/zeebe/clients/go/v8 v8.4.5 h1:xusLe/JbXqjT0oo9NyzNPMN5BJnpns9YaofR0fxIr1w=
-github.com/camunda/zeebe/clients/go/v8 v8.4.5/go.mod h1:d61Utm85QiLV75pnWh+5ciLVAu+6tf1bKKW9BXKf5SU=
+github.com/camunda/camunda/clients/go/v8 v8.4.5 h1:xusLe/JbXqjT0oo9NyzNPMN5BJnpns9YaofR0fxIr1w=
+github.com/camunda/camunda/clients/go/v8 v8.4.5/go.mod h1:d61Utm85QiLV75pnWh+5ciLVAu+6tf1bKKW9BXKf5SU=
 github.com/cenkalti/backoff/v4 v4.2.1 h1:y4OZtCnogmCPw98Zjyt5a6+QwPLGkiQsYW5oUqylYbM=
 github.com/cenkalti/backoff/v4 v4.2.1/go.mod h1:Y3VNntkOUPxTVeUxJ/G5vcM//AlwfmyYozVcomhLiZE=
 github.com/containerd/containerd v1.7.11 h1:lfGKw3eU35sjV0aG2eYZTiwFEY1pCzxdzicHP3SZILw=
diff --git a/go-chaos/internal/fake.go b/go-chaos/internal/fake.go
index f2b543653..74f93b677 100644
--- a/go-chaos/internal/fake.go
+++ b/go-chaos/internal/fake.go
@@ -16,10 +16,10 @@ package internal
 import (
 	"context"
 
-	"github.com/camunda/zeebe/clients/go/v8/pkg/commands"
-	"github.com/camunda/zeebe/clients/go/v8/pkg/entities"
-	"github.com/camunda/zeebe/clients/go/v8/pkg/pb"
-	"github.com/camunda/zeebe/clients/go/v8/pkg/zbc"
+	"github.com/camunda/camunda/clients/go/v8/pkg/commands"
+	"github.com/camunda/camunda/clients/go/v8/pkg/entities"
+	"github.com/camunda/camunda/clients/go/v8/pkg/pb"
+	"github.com/camunda/camunda/clients/go/v8/pkg/zbc"
 )
 
 /*
diff --git a/go-chaos/internal/zeebe.go b/go-chaos/internal/zeebe.go
index ddad9ffd8..f7412a4c1 100644
--- a/go-chaos/internal/zeebe.go
+++ b/go-chaos/internal/zeebe.go
@@ -23,8 +23,8 @@ import (
 	"strings"
 	"time"
 
-	"github.com/camunda/zeebe/clients/go/v8/pkg/pb"
-	"github.com/camunda/zeebe/clients/go/v8/pkg/zbc"
+	"github.com/camunda/camunda/clients/go/v8/pkg/pb"
+	"github.com/camunda/camunda/clients/go/v8/pkg/zbc"
 	"google.golang.org/grpc"
 	v1 "k8s.io/api/core/v1"
 )
diff --git a/go-chaos/internal/zeebe_test.go b/go-chaos/internal/zeebe_test.go
index c11b3a15d..043ee15d4 100644
--- a/go-chaos/internal/zeebe_test.go
+++ b/go-chaos/internal/zeebe_test.go
@@ -20,7 +20,7 @@ import (
 	"testing"
 	"time"
 
-	"github.com/camunda/zeebe/clients/go/v8/pkg/pb"
+	"github.com/camunda/camunda/clients/go/v8/pkg/pb"
 	"github.com/stretchr/testify/assert"
 	"github.com/stretchr/testify/require"
 )
diff --git a/go-chaos/worker/chaos_worker.go b/go-chaos/worker/chaos_worker.go
index 62f0ddea2..1dd231acc 100644
--- a/go-chaos/worker/chaos_worker.go
+++ b/go-chaos/worker/chaos_worker.go
@@ -21,8 +21,8 @@ import (
 	"strings"
 	"time"
 
-	"github.com/camunda/zeebe/clients/go/v8/pkg/entities"
-	"github.com/camunda/zeebe/clients/go/v8/pkg/worker"
+	"github.com/camunda/camunda/clients/go/v8/pkg/entities"
+	"github.com/camunda/camunda/clients/go/v8/pkg/worker"
 	"github.com/zeebe-io/zeebe-chaos/go-chaos/internal"
 	chaos_experiments "github.com/zeebe-io/zeebe-chaos/go-chaos/internal/chaos-experiments"
 )
diff --git a/go-chaos/worker/chaos_worker_test.go b/go-chaos/worker/chaos_worker_test.go
index 07a586093..7dc6277b6 100644
--- a/go-chaos/worker/chaos_worker_test.go
+++ b/go-chaos/worker/chaos_worker_test.go
@@ -21,8 +21,8 @@ import (
 	"testing"
 	"time"
 
-	"github.com/camunda/zeebe/clients/go/v8/pkg/entities"
-	"github.com/camunda/zeebe/clients/go/v8/pkg/pb"
+	"github.com/camunda/camunda/clients/go/v8/pkg/entities"
+	"github.com/camunda/camunda/clients/go/v8/pkg/pb"
 	"github.com/stretchr/testify/assert"
 	"github.com/stretchr/testify/require"
 	chaos_experiments "github.com/zeebe-io/zeebe-chaos/go-chaos/internal/chaos-experiments"
diff --git a/go-chaos/worker/fake.go b/go-chaos/worker/fake.go
index 7cd2a25b8..caf1cc573 100644
--- a/go-chaos/worker/fake.go
+++ b/go-chaos/worker/fake.go
@@ -18,9 +18,9 @@ import (
 	"context"
 	"time"
 
-	"github.com/camunda/zeebe/clients/go/v8/pkg/commands"
-	"github.com/camunda/zeebe/clients/go/v8/pkg/pb"
-	"github.com/camunda/zeebe/clients/go/v8/pkg/worker"
+	"github.com/camunda/camunda/clients/go/v8/pkg/commands"
+	"github.com/camunda/camunda/clients/go/v8/pkg/pb"
+	"github.com/camunda/camunda/clients/go/v8/pkg/worker"
 )
 
 type FakeJobClient struct {