Bug: Index Workers lose connection to RMQ while queue still full #118
@artntek Some brief thoughts on this, hopefully they are useful to you. You wrote:
This seems like a design issue. Workers should always be able to respond to new messages coming in. If they block while processing, I think that is a problem. At a minimum, it seems like they should always be able to say "I'm busy" and reject the message. In addition, if a worker can be written to accept multiple messages, and it hits its capacity limit, can't we use a `nack` to tell RMQ to requeue the message for delivery to another worker? In summary, it strikes me that each worker should 1) be written to be processing the incoming messages in its main thread, and 2) be set to accept a maximum number of messages via a prefetch limit (`basic.qos`). I think a sequence diagram showing this process would be really useful. Here's the analogous sequence diagram for MetaDIG.
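A minimal sketch of that design, using the same Java client the indexer already depends on (`com.rabbitmq.client`); the class name, queue name, and capacity value are illustrative, not taken from the dataone-indexer code:

```java
// Illustrative sketch (not dataone-indexer code): a consumer that caps its
// in-flight work with basic.qos and nacks/requeues messages once at capacity.
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class BoundedIndexWorker {
    private static final int CAPACITY = 5;  // max concurrent index jobs (illustrative)
    private final Semaphore slots = new Semaphore(CAPACITY);
    private final ExecutorService pool = Executors.newFixedThreadPool(CAPACITY);

    public void start() throws Exception {
        Connection conn = new ConnectionFactory().newConnection();
        Channel channel = conn.createChannel();
        // Tell RMQ never to push more than CAPACITY unacked messages at once.
        channel.basicQos(CAPACITY);
        channel.basicConsume("index.queue", false, new DefaultConsumer(channel) {
            @Override
            public void handleDelivery(String tag, Envelope env,
                                       AMQP.BasicProperties props, byte[] body)
                    throws IOException {
                // Runs on the connection's consumer thread: never block here.
                if (!slots.tryAcquire()) {
                    // At capacity: reject and requeue so another worker can take it.
                    channel.basicNack(env.getDeliveryTag(), false, true);
                    return;
                }
                pool.submit(() -> {
                    try {
                        index(body);  // long-running work happens off-thread
                        channel.basicAck(env.getDeliveryTag(), false);
                    } catch (Exception e) {
                        safeNack(channel, env.getDeliveryTag());
                    } finally {
                        slots.release();
                    }
                });
            }
        });
    }

    private void index(byte[] body) { /* parse task, update Solr, etc. */ }

    private void safeNack(Channel ch, long tag) {
        try { ch.basicNack(tag, false, true); } catch (IOException ignored) { }
    }
}
```

The key property is that the consumer callback returns immediately in every case, so the channel stays responsive no matter how long an individual indexing job runs.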
Also, for a little commiseration, see this Reddit thread:
Maybe this is what is causing your `AlreadyClosedException`.
Thank you - great info. I will regroup and propose next steps. In the meantime, the reindex finished on test.adc k8s; I'll post remaining findings from that run below (since we're drifting off-topic for the original title of Metacat Issue 1932, and the info is more relevant here).
After helpful discussions in the 7/23/24 dev meeting, and afterwards with @taojing2002 and @jeanetteclark, I tried the following:
The indexer currently defaults to a pool size of 5 threads per worker. I tried reducing the limit to 1 thread per worker, to see whether that would stop the channel closures. Deployed 7/23 at 13:29. This change did NOT resolve the channel closure issue.
Experimental worker-code changes, based on discussions with J&J:
Deployments
Notes from our 7/24/24 meeting (Attendees: @artntek, @mbjones, @jeanetteclark, @taojing2002)

Problem Statement: RMQ will always try to "pre-fetch" additional messages for the worker to process, even if it's busy. The limit can be set using `basic.qos` (the prefetch count).

Proposed Solution: Have a non-blocking thread that immediately processes incoming messages, handing the long-running indexing work off to separate threads so the consumer channel is never blocked. A rough sketch of that hand-off follows.
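In this sketch (all names are hypothetical), the RMQ consumer callback just offers each delivery into a bounded in-memory buffer and returns immediately, while a dedicated dispatcher thread drains the buffer and does the slow indexing work:

```java
// Hypothetical hand-off between the RMQ consumer thread and a dispatcher
// thread; names and buffer size are illustrative, not from the indexer code.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class NonBlockingDispatcher {
    private final BlockingQueue<byte[]> buffer = new ArrayBlockingQueue<>(10);

    /** Called from the RMQ consumer callback; must never block. */
    public boolean offer(byte[] task) {
        return buffer.offer(task);  // false => caller should nack/requeue
    }

    /** Runs on a dedicated thread, doing the slow indexing work. */
    public void runLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            byte[] task = buffer.take();  // blocks only this dispatcher thread
            index(task);
        }
    }

    private void index(byte[] task) { /* resource-map processing, etc. */ }
}
```

Because `offer` is non-blocking, the worker can always answer RMQ promptly, and the bounded buffer plus the `basic.qos` prefetch limit together cap how much work a single worker ever takes on.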
We agreed that, short term, I will tidy up and release the fixes that I made in the experimental code yesterday (PR #119), since this successfully completed indexing and works for now. However, the next step is to fix this the right way, per the description above. Useful links:
When doing a reindex-all, the indexing runs well until it starts to process the resource maps, which take a long time. At this point, the RMQ channels get closed in a timeframe determined by the `consumer_timeout` (default 30 mins). We're ACK-ing messages immediately, in an attempt to circumvent this issue, but we're still having problems, even with longer timeout settings. The problem is with messages being sent to indexers that are still working on the previous job - the next message cannot be delivered, and so it times out.
For more details, see Metacat Issue 1932
The proposed fix is to use @jeanetteclark's solution from the metadig engine: catch the resulting `com.rabbitmq.client.AlreadyClosedException`, and re-open the connections.