Add ADR 005: Publishing API debouncing #230

New file: `docs/adr/005-publishing-api-debounce.md` (121 additions)

# ADR 005: Debouncing Publishing API
2024-03-05

## Context
When content is updated in GOV.UK's upstream content management systems (CMSs), its state is
persisted into the core content repositories through the central workflow-as-a-service tool
[Publishing API][publishing-api]. This service always acts as the main gateway for content
publishing and, in addition to persisting content into the repositories, subsequently notifies
downstream apps (such as this one) of changes via a RabbitMQ message queue.

For historical reasons relating to how upstream CMSs work internally, a single (ostensibly atomic)
change to a piece of content may actually be represented as a series of changes in Publishing API,
in turn triggering several messages in rapid succession to downstream consumers. For example,
updating "links" between different content items is a separate operation from updating the content
of a page itself.

Currently, this presents two problems for Search API v2:
- Discovery Engine has a strict rate limit on how many times a document can be updated within the
space of a second (roughly once), which we exceed due to processing several messages
simultaneously across workers
- Having a high number of workers listening to the message queue leads to potential out-of-order
processing of messages, meaning that the last update performed upstream isn't necessarily the last
one to be processed by us

Both of these points have the potential to lead to data integrity issues, with newer content in the
Discovery Engine datastore being overwritten by outdated content (out of order), or newer content
not being written at all (hitting rate limits).

Neither of these problems was encountered with the previous Elasticsearch-based Search API v1, as
it did not have rate limiting, and made use of the messages' payload version and Elasticsearch's
document versioning support to avoid overwriting newer documents with older ones.

## Considered options

Discovery Engine does not offer transactions, document versioning, or atomic read-and-update-if
style operations, which limits our options to:

### Do nothing

Currently, any message coming through from Publishing API triggers an update of the document
directly in Discovery Engine (regardless of the document version or any other workers concurrently
dealing with the same document).

This is not a long-term option, because:
- Having any number of missing or stale documents in search over an extended period of time is
unacceptable
- The volume of errors pollutes our error management tool (and muting or ignoring them may mask
  future, unrelated bugs)

### Reduce worker count to 1 to enforce in-order processing

In theory, limiting incoming content update processing to a single worker in Search API v2 could
avoid this problem: under those circumstances RabbitMQ would guarantee in-order delivery, and the
Discovery Engine API calls would have enough breathing room to avoid running into rate limiting.

However, this isn't acceptable as:
- we would lose redundancy in case of worker/Kubernetes node failure
- a single worker would struggle to keep up with peak load (or full re-sync scenarios), causing us
  to exceed our target for how long it should take for an updated document to be visible in search
  (<1 minute)

### Implement a locking mechanism

A relatively basic locking mechanism using the existing shared Publishing AWS ElastiCache Redis
instance in the environment (to be migrated to a separate instance in the long term, once Platforms
have a plan for these) would allow us to reach an acceptable level of correctness without excessive
architectural effort.

For each document, we would keep up to two keys in Redis:
- `lock:<content_id>`: Whether a worker is currently processing the document (key only present if
locked)
- `latest:<content_id>`: The last successfully processed version for this document within the last
24 hours, if any (exact expiry TBC)
- Note: We cannot just rely on the remote document version in Discovery Engine as the document may
have been deleted as part of a previous operation
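
As an illustration only (the content ID, token, TTLs and version below are invented, and in
practice the `lock:` key would be managed by the locking library rather than set by hand), the
Redis state for a single document might look like:

```ruby
require "redis"
require "securerandom"

redis = Redis.new(url: ENV.fetch("REDIS_URL")) # e.g. the shared ElastiCache instance
content_id = "f75d26a3-25a4-4c31-beea-a77cada4ce12" # hypothetical document

# Held only while a worker is processing this document; expires automatically if the worker
# dies mid-processing. The TTL is illustrative, and in practice redlock-rb manages this key.
redis.set("lock:#{content_id}", SecureRandom.uuid, nx: true, px: 30_000)

# Last successfully processed payload version for this document, kept for roughly 24 hours.
redis.set("latest:#{content_id}", 42, ex: 24 * 60 * 60)
```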

Assuming a peak load of 100,000 document updates in a 24 hour period, and up to 200 bytes for each
set of Redis keys and values (including an allowance for overhead), we would store at most around
20 MB (100,000 × 200 bytes) in Redis, meaning a minimally provisioned instance would be sufficient.

If the locking process fails because Redis is unavailable, we can temporarily fall back on the
existing "first update wins" behaviour and track the resulting errors, performing a full resync if
the downtime is long enough that a significant number of updates failed under the old behaviour.

Upon receiving a message, a worker would:
- use the [`redlock-rb` gem][redlock-gem] to acquire a lock on `lock:<content_id>` with a sensible
  timeout
- if the lock is already held, [wait for it to become available again](#addendum-how-to-wait-for-lock)
- check `latest:<content_id>` to see if a newer message has recently been successfully processed
  - if so, do nothing and go straight to releasing the lock
- process the message (update or delete the document in Discovery Engine)
  - if processing fails, handle it as usual (requeue or discard)
  - otherwise use `SET latest:<content_id> <version>` to update the latest processed version
- release the lock
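
A minimal sketch of how this flow might look, assuming the `redlock-rb` and `redis` gems; the
timeout values, environment variable and helper/constant names here are illustrative rather than
decided:

```ruby
require "redis"
require "redlock"

# Illustrative values only; the real lock timeout and `latest:` expiry are still TBC.
LOCK_TTL_MS = 30_000
LATEST_TTL_SECONDS = 24 * 60 * 60

LockNotAcquired = Class.new(StandardError)

REDIS = Redis.new(url: ENV.fetch("REDIS_URL"))
LOCK_MANAGER = Redlock::Client.new([ENV.fetch("REDIS_URL")])

# Runs the given block (the Discovery Engine update or delete) for a document, guarded by the
# per-document lock and the "latest processed version" check described above.
def with_document_lock(content_id, version)
  lock_info = LOCK_MANAGER.lock("lock:#{content_id}", LOCK_TTL_MS)
  raise LockNotAcquired unless lock_info # held by another worker: wait and retry (see addendum)

  begin
    # Skip entirely if an equal or newer version has already been processed recently.
    latest = REDIS.get("latest:#{content_id}")
    return if latest && latest.to_i >= version

    yield

    # Record the version we just processed so stale messages arriving later can be dropped.
    REDIS.set("latest:#{content_id}", version, ex: LATEST_TTL_SECONDS)
  ensure
    LOCK_MANAGER.unlock(lock_info)
  end
end
```

The message consumer would then wrap its existing handling in
`with_document_lock(content_id, payload_version) { ... }`, rescuing `LockNotAcquired` to wait and
retry (see the addendum below), and rescuing Redis connection errors to fall back on the current
"first update wins" behaviour as described above.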
---

**Contributor** commented on lines +86 to +95:

Is there still a scenario here where we could exceed Discovery Engine's rate
limit if the messages come in quickly in the correct order? Say:

| Time (ms) | Payload version | Content ID |
| --------- | --------------- | ---------- |
| 0         | 0               | cafebabe   |
| 10        | 1               | cafebabe   |
| 20        | 2               | cafebabe   |
| 30        | 3               | cafebabe   |
| 40        | 4               | cafebabe   |
| 50        | 5               | cafebabe   |

If each of those messages gets processed quickly enough, we'll hit the rate limit.

Do we want to put some kind of deliberate delay in to handle that case?

For example, if we made the locks time out after 1 second, and then just
didn't release them explicitly, that would throttle messages to one per
second.

If we do release the locks explicitly, then this won't necessarily slow down
the processing of sequential messages by much.

**Contributor (author)** replied:

Good shout - with the caveat that during peak times latency for an upsert API call to Vertex is
about 1s anyway, and only the parallel requests are a problem. But an enforced delay of sorts would
probably still help us during off-peak times when it's 0.1s.

---

#### Addendum: How to wait for lock

If the lock for a given document is currently held by another worker, we need to wait for that
worker to complete its processing (or time out) before continuing.

There are broadly two options:
- polling a.k.a. `sleep` in a loop: if the lock key exists, wait for the remaining time on the lock
(maybe with a random jitter) before retrying
- very simple to implement but keeps the worker blocked and unable to process other messages in
the meantime
- put the message on a special RabbitMQ "delay" queue that gets re-queued after a time-to-live using
a dead letter exchange
- avoids blocking workers while waiting for a lock to be released, but adds complexity to the
RabbitMQ configuration (especially given that we need several separate queues to avoid impacting
other RabbitMQ consumers, and need to configure the worker to listen to several queues)
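
A minimal sketch of the polling option, reusing the hypothetical `with_document_lock` helper and
`LockNotAcquired` error from the earlier sketch (the attempt bound, sleep interval and jitter are
illustrative):

```ruby
MAX_LOCK_ATTEMPTS = 5 # illustrative bound before giving up

def process_with_lock(content_id, version, &block)
  attempts = 0
  begin
    with_document_lock(content_id, version, &block)
  rescue LockNotAcquired
    attempts += 1
    raise if attempts >= MAX_LOCK_ATTEMPTS # let the usual requeue/discard handling take over

    # Wait out the remaining lifetime of the lock (plus a little jitter), then try again.
    remaining_ms = REDIS.pttl("lock:#{content_id}")
    sleep((remaining_ms.positive? ? remaining_ms / 1000.0 : 0.1) + rand * 0.1)
    retry
  end
end
```

The trade-off described above still applies: the worker is blocked (asleep) while it waits, so the
attempt bound and lock timeout would need tuning against the sub-minute freshness target.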

## Decision
A decision is pending; we propose implementing the described locking behaviour with a polling wait
for lock release.

## Status
Pending.

[publishing-api]: https://github.com/alphagov/publishing-api
[redlock-gem]: https://github.com/leandromoreira/redlock-rb