Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: After milvus recovers from querynode pod failure chaos, upsert/query/search/hybrid search/full text search interfaces experience partial request failures #39321

Open
1 task done
zhuwenxing opened this issue Jan 16, 2025 · 1 comment
Assignees
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@zhuwenxing
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20250115-f5234c3c-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2025-01-15T09:50:16.113Z] 2025-01-15 09:50:16,033 [ERROR][handler]: grpc RpcError: [query], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2025-01-15 09:50:06.032919', 'gRPC error': '2025-01-15 09:50:16.033788'}> (decorators.py:151)
[2025-01-15T09:50:16.113Z] 2025-01-15 09:50:16,044 [ERROR][handler]: grpc RpcError: [query], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2025-01-15 09:50:06.043980', 'gRPC error': '2025-01-15 09:50:16.044898'}> (decorators.py:151)
[2025-01-15T09:50:16.671Z] 2025-01-15 09:50:16,418 [ERROR][handler]: grpc RpcError: [query], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2025-01-15 09:50:06.418154', 'gRPC error': '2025-01-15 09:50:16.418877'}> (decorators.py:151)
[2025-01-15T09:50:16.671Z] 2025-01-15 09:50:16,571 [ERROR][handler]: RPC error: [upsert_rows], <MilvusException: (code=65535, message=code: STREAMING_CODE_TRANSACTION_EXPIRED, cause: transaction expired at 455326986030546950, current 455326986947788807)>, <Time:{'RPC start': '2025-01-15 09:49:52.535649', 'RPC error': '2025-01-15 09:50:16.571479'}> (decorators.py:140)
[2025-01-15T09:50:17.229Z] 2025-01-15 09:50:16,978 [ERROR][handler]: grpc RpcError: [query], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2025-01-15 09:50:06.977818', 'gRPC error': '2025-01-15 09:50:16.978851'}> (decorators.py:151)
[2025-01-15T09:50:18.588Z] 2025-01-15 09:50:18,166 [ERROR][handler]: grpc RpcError: [search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2025-01-15 09:50:08.165668', 'gRPC error': '2025-01-15 09:50:18.166836'}> (decorators.py:151)
[2025-01-15T09:50:18.843Z] 2025-01-15 09:50:18,833 [ERROR][handler]: grpc RpcError: [search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2025-01-15 09:50:08.832727', 'gRPC error': '2025-01-15 09:50:18.833795'}> (decorators.py:151)
[2025-01-15T09:50:19.402Z] 2025-01-15 09:50:19,167 [ERROR][handler]: grpc RpcError: [search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2025-01-15 09:50:09.154298', 'gRPC error': '2025-01-15 09:50:19.167912'}> (decorators.py:151)
[2025-01-15T09:50:23.561Z] 2025-01-15 09:50:23,307 [ERROR][handler]: grpc RpcError: [search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2025-01-15 09:50:13.306532', 'gRPC error': '2025-01-15 09:50:23.307867'}> (decorators.py:151)
[2025-01-15T09:50:24.929Z] 2025-01-15 09:50:24,728 [ERROR][handler]: grpc RpcError: [query], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2025-01-15 09:50:14.726421', 'gRPC error': '2025-01-15 09:50:24.728077'}> (decorators.py:151)
[2025-01-15T09:50:24.929Z] 2025-01-15 09:50:24,873 [ERROR][handler]: grpc RpcError: [search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2025-01-15 09:50:14.871829', 'gRPC error': '2025-01-15
[2025-01-15T09:47:26.726Z] 2025-01-15 09:47:26,480 [ERROR][handler]: RPC error: [upsert_rows], <MilvusException: (code=65535, message=code: STREAMING_CODE_TRANSACTION_EXPIRED, cause: some message has been expired, expired at 455326943025037327, current 455326943522586627)>, <Time:{'RPC start': '2025-01-15 09:47:19.079349', 'RPC error': '2025-01-15 09:47:26.480012'}> (decorators.py:140)
[2025-01-15T09:52:42.768Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__J9Q7sPYr] - pytest_assume.plugin.FailedAssumption: 
[2025-01-15T09:52:42.768Z] 1 Failed Assumptions:
[2025-01-15T09:52:42.768Z] 
[2025-01-15T09:52:42.768Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.full_text_search succ rate 0.9295774647887324, total: 71, average time: 2.7442
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__CYYtuknc] - pytest_assume.plugin.FailedAssumption: 
[2025-01-15T09:52:42.769Z] 4 Failed Assumptions:
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.upsert succ rate 0.75, total: 4, average time: 4.8750
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.search succ rate 0.9398496240601504, total: 133, average time: 0.5713
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.full_text_search succ rate 0.8688524590163934, total: 61, average time: 2.8472
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.hybrid_search succ rate 0.9393939393939394, total: 132, average time: 0.5982
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__25pmqWTQ] - pytest_assume.plugin.FailedAssumption: 
[2025-01-15T09:52:42.769Z] 4 Failed Assumptions:
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.upsert succ rate 0.75, total: 4, average time: 4.4997
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.search succ rate 0.937984496124031, total: 129, average time: 0.5946
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.full_text_search succ rate 0.8688524590163934, total: 61, average time: 2.3147
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.hybrid_search succ rate 0.937984496124031, total: 129, average time: 0.6032
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__eEsBN8e1] - pytest_assume.plugin.FailedAssumption: 
[2025-01-15T09:52:42.769Z] 4 Failed Assumptions:
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.upsert succ rate 0.75, total: 4, average time: 9.5644
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.search succ rate 0.9296875, total: 128, average time: 0.5762
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.full_text_search succ rate 0.8679245283018868, total: 53, average time: 2.9131
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.hybrid_search succ rate 0.9296875, total: 128, average time: 0.5966
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__51qXTUNn] - pytest_assume.plugin.FailedAssumption: 
[2025-01-15T09:52:42.769Z] 4 Failed Assumptions:
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.upsert succ rate 0.75, total: 4, average time: 7.2421
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.search succ rate 0.9466666666666667, total: 150, average time: 0.5688
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.full_text_search succ rate 0.8727272727272727, total: 55, average time: 3.0127
[2025-01-15T09:52:42.769Z] assert False
[2025-01-15T09:52:42.769Z] 
[2025-01-15T09:52:42.769Z] chaos_commons.py:124: AssumptionFailure
[2025-01-15T09:52:42.769Z] >>	pytest.assume(
[2025-01-15T09:52:42.769Z] AssertionError: Expect Succ: Op.hybrid_search succ rate 0.9466666666666667, total: 150, average time: 0.5815
[2025-01-15T09:52:42.769Z] assert False

Expected Behavior

No response

Steps To Reproduce

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/20169/pipeline
log:

artifacts-querynode-pod-failure-20169-server-logs.tar.gz

cluster: 4am
ns: chaos-testing
pod info

[2025-01-15T09:37:52.801Z] + kubectl get pods -o wide

[2025-01-15T09:37:52.802Z] + grep querynode-pod-failure-20169

[2025-01-15T09:37:53.057Z] querynode-pod-failure-20169-etcd-0                                1/1     Running                  0                32m     10.104.30.184   4am-node38   <none>           <none>

[2025-01-15T09:37:53.057Z] querynode-pod-failure-20169-etcd-1                                1/1     Running                  0                32m     10.104.27.165   4am-node31   <none>           <none>

[2025-01-15T09:37:53.057Z] querynode-pod-failure-20169-etcd-2                                1/1     Running                  0                32m     10.104.33.60    4am-node36   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-milvus-datanode-7554488b96-8gxcd      1/1     Running                  3 (31m ago)      32m     10.104.19.6     4am-node28   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-milvus-datanode-7554488b96-sc2ms      1/1     Running                  3 (31m ago)      32m     10.104.23.250   4am-node27   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-milvus-indexnode-65c745b64c-57scg     1/1     Running                  3 (31m ago)      32m     10.104.15.228   4am-node20   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-milvus-indexnode-65c745b64c-mswdk     1/1     Running                  3 (31m ago)      32m     10.104.23.251   4am-node27   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-milvus-indexnode-65c745b64c-nd677     1/1     Running                  3 (31m ago)      32m     10.104.24.25    4am-node29   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-milvus-mixcoord-84c9b95dc5-hh79p      1/1     Running                  3 (31m ago)      32m     10.104.23.249   4am-node27   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-milvus-proxy-79ffd77977-zdzbk         1/1     Running                  3 (31m ago)      32m     10.104.19.254   4am-node28   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-milvus-querynode-9dd6f65c-4xd5b       1/1     Running                  6 (8m28s ago)    32m     10.104.16.83    4am-node21   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-milvus-querynode-9dd6f65c-5v6lg       1/1     Running                  6 (8m28s ago)    32m     10.104.21.180   4am-node24   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-milvus-querynode-9dd6f65c-clh26       1/1     Running                  6 (8m28s ago)    32m     10.104.32.65    4am-node39   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-milvus-streamingnode-5f97c5f885zgwm   1/1     Running                  3 (31m ago)      32m     10.104.23.248   4am-node27   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-minio-0                               1/1     Running                  0                32m     10.104.30.183   4am-node38   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-minio-1                               1/1     Running                  0                32m     10.104.27.164   4am-node31   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-minio-2                               1/1     Running                  0                32m     10.104.33.58    4am-node36   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-minio-3                               1/1     Running                  0                32m     10.104.19.11    4am-node28   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-pulsarv3-bookie-0                     1/1     Running                  0                32m     10.104.33.61    4am-node36   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-pulsarv3-bookie-1                     1/1     Running                  0                32m     10.104.30.191   4am-node38   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-pulsarv3-bookie-2                     1/1     Running                  0                32m     10.104.27.170   4am-node31   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-pulsarv3-bookie-init-mftz2            0/1     Completed                0                32m     10.104.13.199   4am-node16   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-pulsarv3-broker-0                     1/1     Running                  0                32m     10.104.14.242   4am-node18   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-pulsarv3-broker-1                     1/1     Running                  0                32m     10.104.13.200   4am-node16   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-pulsarv3-proxy-0                      1/1     Running                  0                32m     10.104.14.239   4am-node18   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-pulsarv3-proxy-1                      1/1     Running                  0                32m     10.104.13.198   4am-node16   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-pulsarv3-pulsar-init-fp2cf            0/1     Completed                0                32m     10.104.14.241   4am-node18   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-pulsarv3-recovery-0                   1/1     Running                  0                32m     10.104.14.240   4am-node18   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-pulsarv3-zookeeper-0                  1/1     Running                  0                32m     10.104.30.186   4am-node38   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-pulsarv3-zookeeper-1                  1/1     Running                  0                32m     10.104.27.161   4am-node31   <none>           <none>

[2025-01-15T09:37:53.058Z] querynode-pod-failure-20169-pulsarv3-zookeeper-2                  1/1     Running                  0                32m     10.104.33.57    4am-node36   <none>           <none>

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 16, 2025
@zhuwenxing zhuwenxing added this to the 2.6.0 milestone Jan 16, 2025
@zhuwenxing zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Jan 16, 2025
@yanliang567
Copy link
Contributor

/assign @weiliu1031
/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

3 participants