From b9c842ecbb783648fa8369f320a2300435c339bd Mon Sep 17 00:00:00 2001
From: Christian Sutter
Date: Fri, 23 Feb 2024 11:09:55 +0000
Subject: [PATCH] Add ADR 005: Publishing API debouncing

---
 docs/adr/005-publishing-api-debounce.md | 121 ++++++++++++++++++++++++
 1 file changed, 121 insertions(+)
 create mode 100644 docs/adr/005-publishing-api-debounce.md

diff --git a/docs/adr/005-publishing-api-debounce.md b/docs/adr/005-publishing-api-debounce.md
new file mode 100644
index 0000000..9ee345d
--- /dev/null
+++ b/docs/adr/005-publishing-api-debounce.md
@@ -0,0 +1,121 @@
# ADR 005: Debouncing Publishing API
2024-03-05

## Context
When content is updated in GOV.UK's upstream content management systems (CMSs), its state is
persisted into the core content repositories through the central workflow-as-a-service tool
[Publishing API][publishing-api]. This service always acts as the main gateway for content
publishing, and in addition to persisting content into the repositories, it subsequently notifies
downstream apps (such as this one) of changes using a RabbitMQ message queue.

For historical reasons relating to how upstream CMSs work internally, a single (ostensibly atomic)
change to a piece of content may actually be represented as a series of changes in Publishing API,
in turn triggering several messages in rapid succession to downstream consumers. For example,
updating "links" between different content items is a separate operation from updating the content
of a page itself.

Currently, this presents two problems for Search API v2:
- Discovery Engine has a strict rate limit on how many times a document can be updated within the
  space of a second (roughly once), which we exceed when several workers process messages for the
  same document simultaneously
- Having a high number of workers listening to the message queue leads to potential out-of-order
  processing of messages, meaning that the last update performed upstream isn't necessarily the
  last one to be processed by us

Both of these points have the potential to lead to data integrity issues, with newer content in the
Discovery Engine datastore being overwritten by outdated content (out-of-order processing), or newer
content not being written at all (hitting rate limits).

Neither of these problems was encountered with the previous Elasticsearch-based Search API v1, as it
did not have rate limiting, and it made use of the messages' payload version and Elasticsearch's
document versioning to avoid overwriting newer documents with older ones.

## Considered options

Discovery Engine does not offer transactions, document versioning, or atomic
"read and update if newer" style operations, which limits our options to:

### Do nothing

Currently, any message coming through from Publishing API triggers an update of the document
directly in Discovery Engine (regardless of the document version or any other workers concurrently
dealing with the same document).

This is not a long-term option, because:
- Having any number of missing or stale documents in search over an extended period of time is
  unacceptable
- The volume of errors is polluting our error management tool (and muting/ignoring them may mask
  future unrelated bugs)

### Reduce worker count to 1 to enforce in-order processing

In theory, limiting incoming content update processing to a single worker in Search API v2 could
bypass both problems, as under those circumstances RabbitMQ would guarantee in-order delivery and
the Discovery Engine API calls would have enough breathing room to avoid rate limiting.

However, this isn't acceptable as:
- we would lose redundancy in case of worker/Kubernetes node failure
- a single worker would struggle to keep up with peak load (or full re-sync scenarios), leading to
  us exceeding our target time for an updated document to become visible in search (<1 minute)

### Implement a locking mechanism

A relatively basic locking mechanism using the shared Publishing AWS ElastiCache Redis instance that
already exists in the environment (to be migrated to a separate instance in the long term once
Platforms have a plan for these) would allow us to reach an acceptable level of correctness without
excessive architectural effort.

For each document, we would keep up to two keys in Redis:
- `lock:<content_id>`: Whether a worker is currently processing the document (key only present if
  locked)
- `latest:<content_id>`: The last successfully processed version for this document within the last
  24 hours, if any (exact expiry TBC)
  - Note: We cannot just rely on the remote document version in Discovery Engine as the document may
    have been deleted as part of a previous operation

Assuming a peak document update load of 100,000 documents in a 24-hour period, and up to 200 bytes
for each set of Redis keys/values (including allowance for overhead), we would store at most around
20 MB (100,000 × 200 bytes) in Redis, meaning a minimally provisioned instance would be sufficient.

If the locking process fails due to Redis being unavailable, we can temporarily fall back onto the
existing "first update wins" behaviour and track the resulting errors; if the downtime is large
enough that a significant number of updates failed under the old behaviour, we can perform a full
resync.

Upon receiving a message, a worker would (see the sketch after this list):
- use the [`redlock-rb` gem][redlock-gem] to acquire a lock on `lock:<content_id>` with a sensible
  timeout
- if the lock currently exists, [wait for it to be available again](#addendum-how-to-wait-for-lock)
- check `latest:<content_id>` to see if a newer message has recently been successfully processed
  - if so, do nothing and go straight to releasing the lock
- process the message (update or delete the document in Discovery Engine)
  - if processing fails, handle as usual (requeue or discard)
  - otherwise use `SET latest:<content_id> <payload_version>` to update the latest processed version
- release the lock
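
The following is a minimal sketch (not existing code) of what this flow could look like in Ruby,
using the [`redlock-rb` gem][redlock-gem] and the `redis` gem. The class name, the
`content_id`/`payload_version` message fields, the injected sync service and the TTL values are
illustrative assumptions:

```ruby
# Illustrative sketch only: class/field names, the injected sync service and the
# TTL values are assumptions, not existing Search API v2 code.
require "redis"
require "redlock"

class DebouncedMessageHandler
  LOCK_TTL_MS = 30_000        # sensible lock timeout for processing (exact value TBC)
  LATEST_TTL_SECONDS = 86_400 # keep "latest" keys for roughly 24 hours (exact expiry TBC)

  # sync_service is any callable that updates or deletes the document in
  # Discovery Engine for a given message.
  def initialize(redis_url:, sync_service:)
    @redis = Redis.new(url: redis_url)
    # retry_count: 0 because waiting for a lock held by another worker is handled
    # separately (see the addendum below).
    @lock_manager = Redlock::Client.new([redis_url], retry_count: 0)
    @sync_service = sync_service
  end

  def call(message)
    content_id = message.fetch("content_id")
    payload_version = message.fetch("payload_version").to_i

    lock_info = @lock_manager.lock("lock:#{content_id}", LOCK_TTL_MS)
    # Lock currently held by another worker: wait for it to become available
    # again (see the addendum below) and retry.
    return :locked unless lock_info

    begin
      # A newer (or equal) version was already processed recently: do nothing and
      # go straight to releasing the lock.
      return :stale if @redis.get("latest:#{content_id}").to_i >= payload_version

      # Update or delete the document in Discovery Engine; failures propagate and
      # are handled as usual by the caller (requeue or discard).
      @sync_service.call(message)

      @redis.set("latest:#{content_id}", payload_version, ex: LATEST_TTL_SECONDS)
      :processed
    ensure
      @lock_manager.unlock(lock_info)
    end
  end
end
```

`Redlock::Client#lock` returns `false` when the lock is already held elsewhere, and the `ensure`
block releases the lock on every path, including the early return when a newer version has already
been processed.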

#### Addendum: How to wait for lock

If the lock for a given document is currently held by another worker, we need to wait for that
worker to complete its processing (or time out) before continuing.

There are broadly two options:
- polling, a.k.a. `sleep` in a loop: if the lock key exists, wait for the remaining time on the lock
  (maybe with a random jitter) before retrying (a rough sketch of this appears at the end of this
  document)
  - very simple to implement, but keeps the worker blocked and unable to process other messages in
    the meantime
- put the message on a special RabbitMQ "delay" queue that gets re-queued after a time-to-live
  expires, using a dead letter exchange
  - avoids blocking workers while waiting for a lock to be released, but adds complexity to the
    RabbitMQ configuration (especially given that we need several separate queues to avoid impacting
    other RabbitMQ consumers, and need to configure the worker to listen to several queues)

## Decision
A decision is pending; we propose implementing the locking mechanism described above, with a polling
wait for lock release.

## Status
Pending.

[publishing-api]: https://github.com/alphagov/publishing-api
[redlock-gem]: https://github.com/leandromoreira/redlock-rb
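
For illustration, here is a rough sketch of the polling wait described in the addendum above, under
the same assumptions as the earlier sketch (the `lock:<content_id>` key naming and the
`wait_for_lock` helper are hypothetical):

```ruby
# Rough sketch only: key names, attempt count and jitter are illustrative assumptions.
require "redis"

# Block until the lock key for the given document disappears, or give up after
# max_attempts retries.
def wait_for_lock(redis, content_id, max_attempts: 10)
  max_attempts.times do
    remaining_ms = redis.pttl("lock:#{content_id}")
    # PTTL returns -2 when the key no longer exists (and -1 if it has no expiry,
    # which should not happen for a lock taken out via Redlock).
    return if remaining_ms.negative?

    # Sleep for the remaining lock time plus a small random jitter. Note that this
    # keeps the worker blocked and unable to process other messages in the meantime.
    sleep((remaining_ms / 1000.0) + rand(0.0..0.25))
  end

  raise "Timed out waiting for lock on #{content_id}"
end
```

Much the same effect could likely be achieved by configuring `redlock-rb`'s built-in retry options
(`retry_count`, `retry_delay` and `retry_jitter`) rather than hand-rolling a loop.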