Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimistic lock contention on HCA replicas #6648

Open
hannes-ucsc opened this issue Oct 23, 2024 · 5 comments
Open

Optimistic lock contention on HCA replicas #6648

hannes-ucsc opened this issue Oct 23, 2024 · 5 comments
Assignees
Labels
+ [priority] High bug [type] A defect preventing use of the system as specified indexer [subject] The indexer part of Azul orange [process] Done by the Azul team perf [subject] Performance, efficiency or cost

Comments

@hannes-ucsc
Copy link
Member

hannes-ucsc commented Oct 23, 2024

The solution to #6582 fixed the omission of replicas for several types of low-cardinality, i.e., frequently referenced entities, such as donor_organism, sequencing_prototcol and library_preparation_protocol. This led to increased contention on the respective documents in the replica index.

[WARNING] 2024-10-23T14:36:17.525Z 6e6b10f5-9e45-5b8b-9b5a-4f6695def641 azul.indexer.index_service There was a conflict with document ReplicaCoordinates(entity=EntityReference(entity_type='sequencing_protocol', entity_id='571cc0c7-4dc2-443b-93f4-0ce4af08cf6d'), content_hash='a3a9f3a538a2a649690aa0974f4d1c070f6fc910'): ConflictError(409, 'version_conflict_engine_exception', {'error': {'root_cause': [{'type': 'version_conflict_engine_exception', 'reason': '[sequencing_protocol_571cc0c7-4dc2-443b-93f4-0ce4af08cf6d_a3a9f3a538a2a649690aa0974f4d1c070f6fc910]: version conflict, required seqNo [55840], primary term [1]. current document has seqNo [55843] and primary term [1]', 'index_uuid': 't2PQs1SoTMueXWOCuGtvzQ', 'shard': '16', 'index': 'azul_v2_prod_dcp43_replica'}], 'type': 'version_conflict_engine_exception', 'reason': '[sequencing_protocol_571cc0c7-4dc2-443b-93f4-0ce4af08cf6d_a3a9f3a538a2a649690aa0974f4d1c070f6fc910]: version conflict, required seqNo [55840], primary term [1]. current document has seqNo [55843] and primary term [1]', 'index_uuid': 't2PQs1SoTMueXWOCuGtvzQ', 'shard': '16', 'index': 'azul_v2_prod_dcp43_replica'}, 'status': 409}). Total # of errors: 1, retrying.

During an attempted reindex of catalog dcp32, 2 million replica writes resulted in such a 409 response from ES, compared to one thousand on a previous reindex without the fix for #6582:

CleanShot 2024-10-23 at 08 35 15@2x

The retry queue filled up with notifications and overall progress was slow:

CleanShot 2024-10-23 at 08 31 53@2x

I tried pushing the retries to ES by setting retry_on_conflict but this did not alleviate the issue much. While there were fewer conflicts, the ES client timeout of 1min kicked in more often.

Index: src/azul/indexer/document.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/indexer/document.py b/src/azul/indexer/document.py
--- a/src/azul/indexer/document.py	(revision 0f4f04cf1d7f6f85a84329f733fdd6df2d8618d3)
+++ b/src/azul/indexer/document.py	(date 1729658882164)
@@ -1585,5 +1585,14 @@
             'upsert': super()._body(field_types)
         }
 
+    def to_index(self,
+                 catalog: Optional[CatalogName],
+                 field_types: CataloguedFieldTypes,
+                 bulk: bool = False
+                 ) -> JSON:
+        result = super().to_index(catalog, field_types, bulk)
+        k, v = 'retry_on_conflict', 10
+        return {**result, f'_{k}': v} if bulk else {**result, 'params': {k: v}}
+
 
 CataloguedContribution = Contribution[CataloguedEntityReference]
@hannes-ucsc hannes-ucsc added the orange [process] Done by the Azul team label Oct 23, 2024
@hannes-ucsc
Copy link
Member Author

hannes-ucsc commented Oct 23, 2024

The way the distinction between bulk and non-bulk leaks into the to_index method is annoying. The patch above shows that. The raw bulk API is quite clean but the Python ES client erases that by merging body and parameters using proprietary keyword arguments whose names start with an underscore. The bulk parameter to to_index should be removed and that method should return the keyword arguments to the client method. Another method should be used to translate to_index's return value to a bulk action, if and when required.

@hannes-ucsc
Copy link
Member Author

hannes-ucsc commented Oct 23, 2024

I suspect that one partition of a bundle redundantly emits replicas also emitted by all other partitions. That's a suspicion that would need to be verified.

@hannes-ucsc hannes-ucsc changed the title Contention Optimistic lock contention on HCA replicas Oct 23, 2024
@hannes-ucsc
Copy link
Member Author

Currently reindexing dcp43 with this hot-deployed temporary hotfix:

Index: deployments/prod/environment.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/deployments/prod/environment.py b/deployments/prod/environment.py
--- a/deployments/prod/environment.py	(revision 0f4f04cf1d7f6f85a84329f733fdd6df2d8618d3)
+++ b/deployments/prod/environment.py	(revision cebc97550db9409849f32456f9c51eef9347d164)
@@ -1280,5 +1280,5 @@
             'channel_id': 'C04JWDFCPFZ'  # #team-boardwalk-prod
         }),
 
-        'AZUL_ENABLE_REPLICAS': '1',
+        'AZUL_ENABLE_REPLICAS': '0',
     }

@hannes-ucsc
Copy link
Member Author

hannes-ucsc commented Oct 23, 2024

Assignee to file PR against prod with the above temporary hotfix.

The PR should only reindex lm7 and lm8. dcp43 is currently being reindexed with this hotfix hot-deployed. I hope to get approval for dcp43 this week so that we can skip reindexing dcp42 and file the deletion of dcp42 as a permanent hotfix as well. We would then need to backport three hotfixes: this one, the lm8 fix (#6444) and the dcp42 removal.

@hannes-ucsc
Copy link
Member Author

hannes-ucsc commented Oct 28, 2024

Donor (donor_organism) and protocol (…_protocol) replicas should be handled the same way HCA project and AnVIL dataset replicas are handled currently on develop. We've been using the term implicit hub to refer to these types of entities but I would like to use hot entities instead. What distinguishes these types of entities from, say, specimens, is not the fact that their instances are only implicitly linked to a file (and only HCA projects are implicitly linked), but that they have few instances that are referenced by many files. Moreover, neither projects nor datasets are currently hubs. In fact, datasets will become hubs as part of #6529.

We already track the UUIDs of protocol and donor entities in contributions and aggregate documents, but we need to ensure that the aggregation of such references is not lossy, by failing the aggregation should the aggregator hit its limit.

Untitled

hannes-ucsc added a commit that referenced this issue Oct 29, 2024
This change alleviates the contention (#6648) a little bit but doesn't resolve it. Once #6648 is fully addressed, this should significantly reduce the remaining bits of contention, of which there were just a few dozen incidents per reindex before we fixed the absence of donor and protocol replicas, which are the main drivers of the contention.
hannes-ucsc added a commit that referenced this issue Oct 29, 2024
This change alleviates the contention (#6648) a little bit but doesn't resolve it. Once #6648 is fully addressed, this should significantly reduce the remaining bits of contention, of which there were just a few dozen incidents per reindex before we fixed the absence of donor and protocol replicas, which are the main drivers of the contention.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
+ [priority] High bug [type] A defect preventing use of the system as specified indexer [subject] The indexer part of Azul orange [process] Done by the Azul team perf [subject] Performance, efficiency or cost
Projects
None yet
Development

No branches or pull requests

2 participants