Add ADR 005: Publishing API debouncing #230

Open

csutter wants to merge 1 commit into main from adr (+121 −0)
# ADR 005: Debouncing Publishing API
2024-03-05

## Context
When content is updated in GOV.UK's upstream content management systems (CMSs), its state is persisted into the core content repositories through the central workflow-as-a-service tool [Publishing API][publishing-api]. This service always acts as the main gateway for content publishing and, in addition to persisting content into the repositories, subsequently notifies downstream apps (such as this one) of changes using a RabbitMQ message queue.

For historical reasons in how upstream CMSs work internally, a single (ostensibly atomic) change to a piece of content may actually be represented as a series of changes in Publishing API, in turn triggering several messages in rapid succession to downstream consumers. For example, updating "links" between different content items involves a separate operation from updating the content of a page itself.

Currently, this presents two problems for Search API v2:
- Discovery Engine has a strict rate limit on how many times a document can be updated within the space of a second (roughly once), which we exceed by processing several messages simultaneously across workers
- Having a high number of workers listening to the message queue leads to potential out-of-order processing of messages, meaning that the last update performed upstream isn't necessarily the last one to be processed by us

Both of these points have the potential to lead to data integrity issues, with newer content in the Discovery Engine datastore being overwritten by outdated content (out of order), or newer content not being written at all (hitting rate limits).

Neither of these problems was encountered with the previous Elasticsearch-based Search API v1, as it did not have rate limiting, and it made use of the messages' payload version and Elasticsearch's versioning ability to avoid overwriting newer documents with older ones.
## Considered options

Discovery Engine does not offer transactions, document versioning, or atomic "read and update if" style operations, which limits our options to the following.

### Do nothing

Currently, any message coming through from Publishing API triggers an update of the document directly in Discovery Engine (regardless of the document version or any other workers concurrently dealing with the same document).

This is not a long-term option, because:
- Having any number of missing or stale documents in search over an extended period of time is unacceptable
- The volume of errors is polluting our error management tool (and muting/ignoring them may mask future unrelated bugs)

### Reduce worker count to 1 to enforce in-order processing

In theory, limiting incoming content update processing to a single worker in Search API v2 could bypass this problem, as under those circumstances RabbitMQ will guarantee in-order delivery and the Discovery Engine API calls would have enough breathing room to avoid rate limiting.

However, this isn't acceptable as:
- we would lose redundancy in case of worker/Kubernetes node failure
- a single worker would struggle to keep up with peak load (or full re-sync scenarios), causing us to exceed our target for how quickly an updated document should be visible in search (under one minute)
### Implement a locking mechanism

A relatively basic locking mechanism, using the shared Publishing AWS ElastiCache Redis instance that already exists in the environment (to be migrated to a separate instance in the long term once Platforms have a plan for these), would allow us to reach an acceptable level of correctness without excessive architectural effort.

For each document, we would keep up to two keys in Redis (sketched below):
- `lock:<content_id>`: Whether a worker is currently processing the document (key only present if locked)
- `latest:<content_id>`: The last successfully processed version for this document within the last 24 hours, if any (exact expiry TBC)
- Note: We cannot just rely on the remote document version in Discovery Engine, as the document may have been deleted as part of a previous operation

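A small sketch of what these keys might look like when accessed through the `redis` gem follows; the key names come from the scheme above, while the 24-hour expiry, the example UUID, and the example version are purely illustrative assumptions:

```ruby
require "redis"

redis = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379"))

content_id = "f75d26a3-25a4-4c31-beea-a77cada4ce12" # illustrative UUID only
version = 42

# `lock:<content_id>` only exists while a worker holds the lock; in practice
# it would be managed by redlock-rb rather than set by hand.
redis.exists?("lock:#{content_id}") # => false when no worker holds the lock

# `latest:<content_id>` records the last successfully processed version and
# expires after roughly 24 hours (exact expiry TBC).
redis.set("latest:#{content_id}", version, ex: 24 * 60 * 60)
redis.get("latest:#{content_id}") # => "42"
```
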
Assuming a peak document update load of 100,000 documents in a 24 hour period, and up to 200 bytes for each set of Redis keys/values including allowance for overhead, we would store a maximum of 20MiB in Redis, meaning a minimally provisioned instance would be sufficient.
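For reference, a rough back-of-the-envelope check of that figure (both input numbers are the assumptions stated above):

```ruby
documents_per_day = 100_000
bytes_per_document = 200 # both keys, their values, and overhead

total_bytes = documents_per_day * bytes_per_document # => 20_000_000 bytes
total_bytes / (1024.0**2)                            # => ~19.1 MiB, within the 20MiB ceiling
```
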
If the locking process fails due to Redis being unavailable, we can temporarily fall back to the existing "first update wins" behaviour and track the resulting errors. If the scale of the downtime is large enough that a significant number of updates failed under the old behaviour, we can perform a full resync.

Upon receiving a message, a worker would (see the sketch after this list):
- use the [`redlock-rb` gem][redlock-gem] to acquire a lock of `lock:<content_id>` with a sensible timeout
- if the lock currently exists, [wait for it to be available again](#addendum-how-to-wait-for-lock)
- check `latest:<content_id>` to see if a newer message has recently been successfully processed
  - if so, do nothing and go straight to releasing the lock
- process the message (update or delete the document in Discovery Engine)
  - if processing fails, handle as usual (requeue or discard)
  - otherwise use `SET latest:<content_id> <version>` to update the latest processed version
- release the lock
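A minimal sketch of that flow, using the [`redlock-rb` gem][redlock-gem] and the `redis` gem, might look roughly like the following. The class, method, and message attribute names, as well as the TTL and expiry values, are illustrative assumptions rather than the app's actual code:

```ruby
require "redis"
require "redlock"

class DocumentMessageProcessor
  LOCK_TTL_MS = 30_000           # how long a single worker may hold the lock
  LATEST_EXPIRY_SECONDS = 86_400 # roughly 24 hours; exact expiry still TBC

  def initialize(redis_url: ENV.fetch("REDIS_URL", "redis://localhost:6379"))
    @redis = Redis.new(url: redis_url)
    @lock_manager = Redlock::Client.new([redis_url])
  end

  # `message` is assumed to expose `content_id` and `payload_version`.
  def call(message)
    lock_info = acquire_lock_with_wait("lock:#{message.content_id}")

    begin
      latest = @redis.get("latest:#{message.content_id}")

      # A newer (or equal) version was already processed recently - skip.
      return if latest && latest.to_i >= message.payload_version

      # Update or delete the document in Discovery Engine; if this raises,
      # the surrounding queue consumer requeues or discards as usual, and the
      # `ensure` below still releases the lock.
      update_or_delete_document(message)

      @redis.set(
        "latest:#{message.content_id}",
        message.payload_version,
        ex: LATEST_EXPIRY_SECONDS
      )
    ensure
      @lock_manager.unlock(lock_info)
    end
  end

  private

  # Polling wait with a little jitter (see the addendum below for trade-offs).
  def acquire_lock_with_wait(resource)
    loop do
      lock_info = @lock_manager.lock(resource, LOCK_TTL_MS)
      return lock_info if lock_info

      sleep(0.5 + rand * 0.5)
    end
  end

  def update_or_delete_document(message)
    # Placeholder for the existing Discovery Engine update/delete call.
  end
end
```
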
#### Addendum: How to wait for lock

If the lock for a given document is currently held by another worker, we need to wait for that worker to complete its processing (or time out) before continuing.

There are broadly two options:
- polling a.k.a. `sleep` in a loop: if the lock key exists, wait for the remaining time on the lock (maybe with a random jitter) before retrying
  - very simple to implement, but keeps the worker blocked and unable to process other messages in the meantime
- put the message on a special RabbitMQ "delay" queue that gets re-queued after a time-to-live using a dead letter exchange (sketched below)
  - avoids blocking workers while waiting for a lock to be released, but adds complexity to the RabbitMQ configuration (especially given that we need several separate queues to avoid impacting other RabbitMQ consumers, and need to configure the worker to listen to several queues)
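To illustrate the second option: a delay queue can be assembled from standard RabbitMQ features, namely a per-queue message TTL plus a dead letter exchange that routes expired messages back onto the main queue. The sketch below uses the `bunny` gem directly, with made-up queue names and a 5-second delay; the real setup would live in the app's existing queue configuration rather than look exactly like this:

```ruby
require "bunny"
require "json"

connection = Bunny.new(ENV.fetch("RABBITMQ_URL", "amqp://localhost"))
connection.start
channel = connection.create_channel

# Hypothetical queue names for illustration only.
main_queue = channel.queue("search_api_v2_documents", durable: true)

# Messages published here sit for 5 seconds (x-message-ttl), then expire and
# are dead-lettered back to the main queue via the default exchange.
delay_queue = channel.queue(
  "search_api_v2_documents_delay",
  durable: true,
  arguments: {
    "x-message-ttl" => 5_000,
    "x-dead-letter-exchange" => "",                 # the default exchange
    "x-dead-letter-routing-key" => main_queue.name  # route back to the main queue
  }
)

# When a worker finds the lock taken, it republishes the message here (and
# acks the original delivery) instead of blocking on the lock.
payload = { content_id: "f75d26a3-25a4-4c31-beea-a77cada4ce12", payload_version: 42 }.to_json
channel.default_exchange.publish(payload, routing_key: delay_queue.name, persistent: true)
```
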
## Decision
Pending. We propose implementing the described locking behaviour with a polling wait for lock release.

## Status
Pending.

[publishing-api]: https://github.com/alphagov/publishing-api
[redlock-gem]: https://github.com/leandromoreira/redlock-rb
Is there still a scenario here where we could exceed Discovery Engine's rate limit if the messages come in quickly in the correct order? Say: if each of those messages gets processed quickly enough, we'll hit the rate limit. Do we want to put some kind of deliberate delay in to handle that case?

For example, if we made the locks time out after 1 second, and then just didn't release them explicitly, that would throttle messages to one message per second. If we do release the locks explicitly, then this won't necessarily slow down the processing of sequential messages by much.
Good shout - with the caveat that during peak times latency for an upsert API call to Vertex is about 1s anyway, and only the parallel requests are a problem. But an enforced delay of sorts would probably still help us during off-peak times when it's 0.1s.