[BUG] Failing integ test due to model is not deployed due to open memory circuit breaker #596
Comments
Same error for 2.13 as well
Action items for the neural-search team from the meeting (03/27/2024) between @prudhvigodithi and @vamshin
I have started the neural-search job on a standalone server to see whether it passes, in case the issue is caused by memory limitations. https://build.ci.opensearch.org/job/integ-test/8077/console Thanks
Getting the same error in the standalone build.
@prudhvigodithi @vibrantvarun I think this confirms it is not an infra setup issue?
Ya, I don't see it as a memory error. @vibrantvarun, can you confirm, based on the logs you posted above, that the test didn't fail because of a lack of resources? I see as
I would like to wait for @martin-gaievski's thoughts on this, as he logged this bug.
Hey @prudhvigodithi and @gaiksaya, what memory setting did you use when you ran the plugin in standalone mode yesterday?
4 CPUs, 8 GB memory
Infra has already updated the Linux hosts to provision Docker containers with 4 CPUs and 16 GB RAM.
What is the bug?
Tests are failing in the distribution pipeline for 2.12. About 6-8 tests fail per run, and the exact set of failing tests differs each time. Example of a trace from the test runner: https://build.ci.opensearch.org/blue/organizations/jenkins/integ-test/detail/integ-test/7696/pipeline/102
Test run results look something like:
How can one reproduce the bug?
It only happens in the distribution pipeline; in plugin CI and in local runs the tests pass. In local runs and plugin CI the memory settings are higher because they are overridden at the plugin level:
https://github.com/opensearch-project/neural-search/blob/main/build.gradle#L388C14-L388C31
What is your host/environment?
The issue is for 2.12; it should be the same in 2.x and main.
Do you have any additional context?
Example of a server log from test cluster from my local copy of infra build tool:
stdout.txt
The following error is in the log and corresponds to a failed test. The memory circuit breaker (CB) from ml-commons opens, then JVM GC kicks in and frees some memory. After that the next few tests succeed, and then the situation repeats.
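To illustrate why the failures come and go, here is a minimal sketch of a heap-based breaker check. It is not the ml-commons implementation: the class name, method names, and the threshold parameter are all hypothetical, and only the standard `java.lang.management` API is used. The idea is that the breaker is "open" while used heap is at or above some percentage of the maximum heap, so a GC cycle that frees memory closes it again until usage climbs back.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical illustration only; not the actual ml-commons circuit breaker.
final class HeapBreakerSketch {

    // Returns true when used heap is at or above thresholdPercent of max heap.
    static boolean isOpen(long usedBytes, long maxBytes, int thresholdPercent) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        return usedBytes * 100.0 / maxBytes >= thresholdPercent;
    }

    // Reads current heap usage from the JVM and applies the same check.
    static boolean isOpenNow(int thresholdPercent) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() may return -1 when the maximum is undefined; fall back to committed.
        long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return isOpen(heap.getUsed(), max, thresholdPercent);
    }
}
```

With a 1 GiB heap, any percentage threshold leaves little headroom for a loaded model, which would be consistent with the breaker flipping open and closed between tests.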
The cluster is started with -Xms1g, -Xmx1g, ignoring plugin settings. As of the time of writing there is no way to change that setting in a test cluster for distribution. It is probably possible to check the CB state before deploying a model from the test, or to try to deploy it and, if an exception occurs because of an open CB, wait and retry.
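The wait-and-retry idea could be sketched roughly as follows. This is an assumption-laden illustration, not code from neural-search: `DeployRetrySketch`, `retryOnOpenCircuitBreaker`, and `isCircuitBreakerOpen` are hypothetical names, and detecting an open CB by looking for "circuit breaker" in the exception message is a guess about how the failure surfaces in the test. The deploy call itself is passed in as a `Callable`, so the helper does not depend on any particular client API.

```java
import java.util.Locale;
import java.util.concurrent.Callable;

// Hypothetical helper for integration tests; names and detection logic are assumptions.
final class DeployRetrySketch {

    // Assumption: an open CB shows up as an exception whose message
    // (or a cause's message) mentions "circuit breaker".
    static boolean isCircuitBreakerOpen(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String msg = c.getMessage();
            if (msg != null && msg.toLowerCase(Locale.ROOT).contains("circuit breaker")) {
                return true;
            }
        }
        return false;
    }

    // Runs the action; on an open-CB failure waits and retries with doubling delay.
    // Any other failure, or the last failed attempt, is rethrown unchanged.
    static <T> T retryOnOpenCircuitBreaker(Callable<T> action, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (!isCircuitBreakerOpen(e) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(delay); // give GC a chance to free heap before retrying
                delay *= 2;
            }
        }
    }
}
```

A test would wrap its model deployment in the helper, for example `retryOnOpenCircuitBreaker(() -> deployModel(modelId), 5, 1000L)`, where `deployModel` stands in for whatever the test already calls. Retrying only on the CB failure keeps genuine deployment errors visible instead of masking them behind the retry loop.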