[Bug]: When performing a text match query in a growing segment, the result will be empty. #36962

zhuwenxing · 2024-10-17T09:27:55Z

Is there an existing issue for this?

I have searched the existing issues

Environment

- Milvus version:
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Performed text matching using the token with the highest frequency in the corpus, but no results were returned.

[pytest : test] self = <test_query.TestQueryTextMatch object at 0x7f25041845b0>

[pytest : test] tokenizer = 'default', enable_inverted_index = False

[pytest : test] enable_partition_key = True

[pytest : test] 

[pytest : test]         @pytest.mark.tags(CaseLabel.L0)

[pytest : test]         @pytest.mark.parametrize("enable_partition_key", [True, False])

[pytest : test]         @pytest.mark.parametrize("enable_inverted_index", [True, False])

[pytest : test]         @pytest.mark.parametrize("tokenizer", ["jieba", "default"])

[pytest : test]         def test_query_text_match_normal(

[pytest : test]             self, tokenizer, enable_inverted_index, enable_partition_key

[pytest : test]         ):

[pytest : test]             """

[pytest : test]             target: test text match normal

[pytest : test]             method: 1. enable text match and insert data with varchar

[pytest : test]                     2. get the most common words and query with text match

[pytest : test]                     3. verify the result

[pytest : test]             expected: text match successfully and result is correct

[pytest : test]             """

[pytest : test]             tokenizer_params = {

[pytest : test]                 "tokenizer": tokenizer,

[pytest : test]             }

[pytest : test]             dim = 128

[pytest : test]             fields = [

[pytest : test]                 FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),

[pytest : test]                 FieldSchema(

[pytest : test]                     name="word",

[pytest : test]                     dtype=DataType.VARCHAR,

[pytest : test]                     max_length=65535,

[pytest : test]                     enable_tokenizer=True,

[pytest : test]                     enable_match=True,

[pytest : test]                     is_partition_key=enable_partition_key,

[pytest : test]                     tokenizer_params=tokenizer_params,

[pytest : test]                 ),

[pytest : test]                 FieldSchema(

[pytest : test]                     name="sentence",

[pytest : test]                     dtype=DataType.VARCHAR,

[pytest : test]                     max_length=65535,

[pytest : test]                     enable_tokenizer=True,

[pytest : test]                     enable_match=True,

[pytest : test]                     tokenizer_params=tokenizer_params,

[pytest : test]                 ),

[pytest : test]                 FieldSchema(

[pytest : test]                     name="paragraph",

[pytest : test]                     dtype=DataType.VARCHAR,

[pytest : test]                     max_length=65535,

[pytest : test]                     enable_tokenizer=True,

[pytest : test]                     enable_match=True,

[pytest : test]                     tokenizer_params=tokenizer_params,

[pytest : test]                 ),

[pytest : test]                 FieldSchema(

[pytest : test]                     name="text",

[pytest : test]                     dtype=DataType.VARCHAR,

[pytest : test]                     max_length=65535,

[pytest : test]                     enable_tokenizer=True,

[pytest : test]                     enable_match=True,

[pytest : test]                     tokenizer_params=tokenizer_params,

[pytest : test]                 ),

[pytest : test]                 FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=dim),

[pytest : test]             ]

[pytest : test]             schema = CollectionSchema(fields=fields, description="test collection")

[pytest : test]             data_size = 3000

[pytest : test]             collection_w = self.init_collection_wrap(

[pytest : test]                 name=cf.gen_unique_str(prefix), schema=schema

[pytest : test]             )

[pytest : test]             fake = fake_en

[pytest : test]             if tokenizer == "jieba":

[pytest : test]                 language = "zh"

[pytest : test]                 fake = fake_zh

[pytest : test]             else:

[pytest : test]                 language = "en"

[pytest : test]     

[pytest : test]             data = [

[pytest : test]                 {

[pytest : test]                     "id": i,

[pytest : test]                     "word": fake.word().lower(),

[pytest : test]                     "sentence": fake.sentence().lower(),

[pytest : test]                     "paragraph": fake.paragraph().lower(),

[pytest : test]                     "text": fake.text().lower(),

[pytest : test]                     "emb": [random.random() for _ in range(dim)],

[pytest : test]                 }

[pytest : test]                 for i in range(data_size)

[pytest : test]             ]

[pytest : test]             df = pd.DataFrame(data)

[pytest : test]             log.info(f"dataframe\n{df}")

[pytest : test]             batch_size = 5000

[pytest : test]             for i in range(0, len(df), batch_size):

[pytest : test]                 collection_w.insert(

[pytest : test]                     data[i : i + batch_size]

[pytest : test]                     if i + batch_size < len(df)

[pytest : test]                     else data[i : len(df)]

[pytest : test]                 )

[pytest : test]             collection_w.create_index(

[pytest : test]                 "emb",

[pytest : test]                 {"index_type": "IVF_SQ8", "metric_type": "L2", "params": {"nlist": 64}},

[pytest : test]             )

[pytest : test]             if enable_inverted_index:

[pytest : test]                 collection_w.create_index("word", {"index_type": "INVERTED"})

[pytest : test]             collection_w.load()

[pytest : test]             # analyze the corpus

[pytest : test]             text_fields = ["word", "sentence", "paragraph", "text"]

[pytest : test]             wf_map = {}

[pytest : test]             for field in text_fields:

[pytest : test]                 wf_map[field] = cf.analyze_documents(df[field].tolist(), language=language)

[pytest : test]             # query single field for one token

[pytest : test]             for field in text_fields:

[pytest : test]                 token = wf_map[field].most_common()[0][0]

[pytest : test]                 expr = f"TextMatch({field}, '{token}')"

[pytest : test]                 log.info(f"expr: {expr}")

[pytest : test]                 res, _ = collection_w.query(expr=expr, output_fields=["id", field])

[pytest : test] >               assert len(res) > 0

[pytest : test] E               assert 0 > 0

[pytest : test] E                +  where 0 = len(data: [] )

[pytest : test] 

[pytest : test] testcases/test_query.py:4724: AssertionError

[pytest : test] ------------------------------ Captured log setup ------------------------------

[pytest : test] [2024-10-17 08:14:29 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:46)

[pytest : test] [2024-10-17 08:14:29 - INFO - ci_test]: [setup_method] Start setup test case test_query_text_match_normal. (client_base.py:47)

[pytest : test] ----------------------------- Captured stderr call -----------------------------

[pytest : test] 
Split strings:   0%|          | 0/3000 [00:00<?, ?it/s]
                                                       

Split strings:   0%|          | 0/3000 [00:00<?, ?it/s]
                                                       

Split strings:   0%|          | 0/3000 [00:00<?, ?it/s]
                                                       

Split strings:   0%|          | 0/3000 [00:00<?, ?it/s]
                                                       

[pytest : test] ------------------------------ Captured log call -------------------------------

[pytest : test] [2024-10-17 08:14:29 - DEBUG - ci_test]: (api_request)  : [Connections.has_connection] args: ['default'], kwargs: {} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:29 - DEBUG - ci_test]: (api_response) : False  (api_request.py:37)

[pytest : test] [2024-10-17 08:14:29 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default', '', '', 'default', ''], kwargs: {'host': 'md-36952-1-py-pr-milvus.milvus-ci', 'port': '19530'} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:29 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[pytest : test] [2024-10-17 08:14:29 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['query_M6Tl7wz0', {'auto_id': False, 'description': 'test collection', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'word', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535, 'enable_match':......, kwargs: {'consistency_level': 'Strong'} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:29 - DEBUG - ci_test]: (api_response) : <Collection>:

[pytest : test] -------------

[pytest : test] <name>: query_M6Tl7wz0

[pytest : test] <description>: test collection

[pytest : test] <schema>: {'auto_id': False, 'description': 'test collection', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'word', 'description': '', 'type'......  (api_request.py:37)

[pytest : test] [2024-10-17 08:14:30 - INFO - ci_test]: dataframe

[pytest : test]         id      word                                           sentence                                          paragraph                                               text                                                emb

[pytest : test] 0        0      good                 fight nothing during because down.  guess none government value. opportunity struc...  teach experience result discussion card. night...  [0.25722484194522754, 0.10716873880402633, 0.6...

[pytest : test] 1        1       way               fire wait interest dinner operation.  thousand quite animal suddenly friend data. fe...  whether development change smile until early. ...  [0.6393285775234575, 0.25662668740363703, 0.84...

[pytest : test] 2        2      cell       common budget receive sister listen article.  time person citizen institution full front com...  big opportunity fall arrive it seven lose pass...  [0.5733811147237775, 0.4365202152412886, 0.054...

[pytest : test] 3        3  employee  western agreement window city mention hard req...  let tell upon house. economic lot daughter art...  machine involve agent. science though blood pa...  [0.28186465318599285, 0.8690287069224029, 0.95...

[pytest : test] 4        4     shake                                   may hot provide.         other decade them debate wrong difference.  pass trial such heart news executive. hot tech...  [0.22295701922050015, 0.23748072218399807, 0.7...

[pytest : test] ...    ...       ...                                                ...                                                ...                                                ...                                                ...

[pytest : test] 2995  2995     small                    exactly far risk social pretty.  fund there similar morning. life right attorne...  set myself thing do mrs.\nposition sense entir...  [0.8289662416206579, 0.034599987863862536, 0.2...

[pytest : test] 2996  2996     ready  skill particularly page sign back however whil...  fact campaign within cover next but. internati...  few the young walk position early. girl mr sit...  [0.5687476818480588, 0.37727107847674124, 0.51...

[pytest : test] 2997  2997    police                        stuff picture former study.  stop stop they next party must. own sing floor...  myself majority drug key arm difference. devel...  [0.738114035712616, 0.26428015252046155, 0.679...

[pytest : test] 2998  2998   current               each stay sit range happen idea may.                        modern window room control.  cold alone much commercial space shake.\nnewsp...  [0.680958986682074, 0.838743116079627, 0.16323...

[pytest : test] 2999  2999      yard                               anyone to direction.  economy eye condition. public next article rea...  last leg market no. product recent economic th...  [0.013924658826216851, 0.9652646286447241, 0.5...

[pytest : test] 

[pytest : test] [3000 rows x 6 columns] (test_query.py:4698)

[pytest : test] [2024-10-17 08:14:30 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [[{'id': 0, 'word': 'good', 'sentence': 'fight nothing during because down.', 'paragraph': 'guess none government value. opportunity structure thank common increase break pick.', 'text': 'teach experience result discussion card. night message far itself friend plant tonight. analysis collection indi......, kwargs: {'timeout': 180} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:31 - DEBUG - ci_test]: (api_response) : (insert count: 3000, delete count: 0, upsert count: 0, timestamp: 453287050275979268, success count: 3000, err count: 0  (api_request.py:37)

[pytest : test] [2024-10-17 08:14:31 - DEBUG - ci_test]: (api_request)  : [Collection.create_index] args: ['emb', {'index_type': 'IVF_SQ8', 'metric_type': 'L2', 'params': {'nlist': 64}}, 1200], kwargs: {'index_name': ''} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:31 - DEBUG - ci_test]: (api_response) : Status(code=0, message=)  (api_request.py:37)

[pytest : test] [2024-10-17 08:14:31 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 180], kwargs: {} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:32 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[pytest : test] [2024-10-17 08:14:32 - INFO - ci_test]: Analyze document cost time: 0.015944480895996094 (common_func.py:111)

[pytest : test] [2024-10-17 08:14:32 - INFO - ci_test]: Analyze document cost time: 0.028260469436645508 (common_func.py:111)

[pytest : test] [2024-10-17 08:14:32 - INFO - ci_test]: Analyze document cost time: 0.05806303024291992 (common_func.py:111)

[pytest : test] [2024-10-17 08:14:33 - INFO - ci_test]: Analyze document cost time: 0.07532739639282227 (common_func.py:111)

[pytest : test] [2024-10-17 08:14:33 - INFO - ci_test]: expr: TextMatch(word, 'specific') (test_query.py:4722)

[pytest : test] [2024-10-17 08:14:33 - DEBUG - ci_test]: (api_request)  : [Collection.query] args: ["TextMatch(word, 'specific')", ['id', 'word'], None, 180], kwargs: {} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:33 - DEBUG - ci_test]: (api_response) : data: []   (api_request.py:37)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed ci job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-36952/1/pipeline/

Anything else?

No response


zhuwenxing · 2024-10-17T09:28:50Z

/assign @czs007

PTAL

zhuwenxing · 2024-10-18T06:44:25Z

This case has a relatively high failure rate in CI, so it's set to critical.

czs007 · 2024-10-18T06:47:15Z

working on this.

zhuwenxing · 2024-10-21T10:08:55Z

failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-36693/9/pipeline

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Word frequency: [('win', 10), ('resource', 9), ('beat', 9), ('together', 9), ('hand', 9), ('themselves', 8), ('business', 8), ('exist', 8), ('enough', 8), ('why', 8)] (common_func.py:117)
......
[pytest : test] [2024-10-21 07:36:04 - DEBUG - ci_test]: (api_request)  : [Collection.query] args: ["TextMatch(word, 'win')", ['id', 'word'], None, 180], kwargs: {} (api_request.py:62)

[pytest : test] [2024-10-21 07:36:05 - DEBUG - ci_test]: (api_response) : data: []   (api_request.py:37)

The word "win" appears 10 times in the corpus of the "word" field, but the result returned by text match is empty.
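The check described above can be sketched without a running Milvus instance. The snippet below is only an illustration under stated assumptions: it splits on whitespace instead of using the real tokenizer, the helper name `expected_matches` is hypothetical, and the three rows are made-up sample data rather than the CI corpus. It shows how the most frequent token and the ids that should match it can be derived locally, so an empty query result can be compared against a known non-empty expectation.

```python
from collections import Counter


def expected_matches(rows, field, token):
    """Return ids of rows whose `field` contains `token` as a whole word.

    Whitespace splitting is a simplification of the server-side tokenizer.
    """
    return [row["id"] for row in rows if token in row[field].lower().split()]


# Made-up sample rows standing in for the inserted data.
rows = [
    {"id": 0, "word": "win"},
    {"id": 1, "word": "resource"},
    {"id": 2, "word": "win"},
]

# Word frequency over the "word" field, as the test does for each text field.
freq = Counter(tok for row in rows for tok in row["word"].lower().split())
token, count = freq.most_common(1)[0]

expected_ids = expected_matches(rows, "word", token)

# Stand-in for the empty result returned by the text match query.
returned = []
missing = sorted(set(expected_ids) - {r["id"] for r in returned})
print(f"token={token!r} count={count} missing ids={missing}")
```

Logging the missing ids rather than only asserting `len(res) > 0` would make it easier to see whether all matching rows are absent or only some of them.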


[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Analyze document cost time: 0.007334709167480469 (common_func.py:116)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Word frequency: [('win', 10), ('resource', 9), ('beat', 9), ('together', 9), ('hand', 9), ('themselves', 8), ('business', 8), ('exist', 8), ('enough', 8), ('why', 8)] (common_func.py:117)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Analyze document cost time: 0.019197463989257812 (common_func.py:116)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Word frequency: [('least', 31), ('force', 29), ('claim', 28), ('south', 28), ('environment', 28), ('nothing', 27), ('partner', 27), ('mission', 27), ('follow', 27), ('material', 27)] (common_func.py:117)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Analyze document cost time: 0.02630472183227539 (common_func.py:116)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Word frequency: [('go', 67), ('together', 63), ('open', 62), ('position', 61), ('light', 60), ('half', 59), ('pull', 58), ('pay', 58), ('cold', 58), ('prevent', 57)] (common_func.py:117)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Analyze document cost time: 0.0409548282623291 (common_func.py:116)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Word frequency: [('thing', 94), ('plant', 93), ('environmental', 91), ('along', 91), ('air', 90), ('bill', 89), ('upon', 89), ('improve', 89), ('smile', 89), ('develop', 87)] (common_func.py:117)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: expr: TextMatch(word, 'win') (test_query.py:4722)

[pytest : test] [2024-10-21 07:36:04 - DEBUG - ci_test]: (api_request)  : [Collection.query] args: ["TextMatch(word, 'win')", ['id', 'word'], None, 180], kwargs: {} (api_request.py:62)

[pytest : test] [2024-10-21 07:36:05 - DEBUG - ci_test]: (api_response) : data: []   (api_request.py:37)

@czs007

zhuwenxing · 2024-10-22T06:05:31Z

one more failed ci: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-36693/12/pipeline

aoiasd · 2024-10-29T02:49:27Z

The insert into the growing segment succeeds, but the data can't be searched. Checking the temporary index for growing segments.

xiaofan-luan · 2024-11-05T18:06:20Z

This is the expected behaviour.
For match, it takes 200 ms to reindex the data, so we cannot search it under strong consistency.
Adding another brute-force search might be too much for this case.

zhuwenxing · 2024-11-06T10:54:49Z

> This is the expected behaviour. For match, it takes 200 ms to reindex the data, so we cannot search it under strong consistency. Adding another brute-force search might be too much for this case.

@xiaofan-luan
This is not an expected behavior. It's not that we require being able to query immediately after inserting, but if the data remains in a growing state and is not flushed, these data will never be queryable.
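One way to tell "visible after a short delay" apart from "never visible" is to poll the query with a deadline. The sketch below is a minimal illustration, not the test suite's code: `wait_for_results` is a hypothetical helper, and `query_fn` stands in for whatever callable issues the text match query. The timeout and interval values are arbitrary assumptions.

```python
import time


def wait_for_results(query_fn, timeout_s=5.0, interval_s=0.2):
    """Call `query_fn` until it returns a non-empty result or the deadline passes.

    Returns the last result, which is empty if the data never became visible.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        res = query_fn()
        if res or time.monotonic() >= deadline:
            return res
        time.sleep(interval_s)


# Stand-in for the real query: empty twice, then one matching row.
_responses = iter([[], [], [{"id": 7, "word": "win"}]])
result = wait_for_results(lambda: next(_responses), timeout_s=2.0, interval_s=0.01)
print(result)
```

If the result is still empty after a timeout much longer than a few hundred milliseconds, the delay explanation does not account for the failure.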

xiaofan-luan · 2024-11-06T22:51:29Z

This data should be searchable after a couple of hundred milliseconds.

xiaofan-luan · 2024-11-06T22:53:23Z

This is the expected behaviour. Tantivy data is searchable only after it has been synced to disk, and we cannot sync to disk every time data is written. There is a checker that syncs data to disk at a regular interval.
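The visibility rule described in this comment can be pictured with a toy model. This is not Tantivy's or Milvus's implementation; `ToyTextIndex` and its methods are hypothetical names, and whitespace splitting stands in for the real tokenizer. It only shows why a match query returns nothing until a commit has run, and why data stays invisible if the periodic commit never happens.

```python
class ToyTextIndex:
    """Toy inverted index in which added documents are visible only after commit()."""

    def __init__(self):
        self._pending = []    # (doc_id, text) pairs not yet visible to queries
        self._committed = {}  # token -> set of doc ids visible to queries

    def add(self, doc_id, text):
        # Buffer the document; it is not searchable yet.
        self._pending.append((doc_id, text))

    def commit(self):
        # Move buffered documents into the searchable postings.
        for doc_id, text in self._pending:
            for tok in text.lower().split():
                self._committed.setdefault(tok, set()).add(doc_id)
        self._pending.clear()

    def match(self, token):
        # Only committed documents can be returned.
        return sorted(self._committed.get(token, set()))


index = ToyTextIndex()
index.add(1, "win")
index.add(2, "resource")

before = index.match("win")  # nothing has been committed yet
index.commit()               # what a periodic checker would trigger
after = index.match("win")
print(before, after)
```

In this model, the behavior reported in the issue corresponds to `commit()` never being called for data that stays in the growing segment.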

czs007 · 2025-01-09T13:58:17Z

may be fixed by #39070

zhuwenxing added the kind/bug and needs-triage labels Oct 17, 2024

zhuwenxing assigned yanliang567 Oct 17, 2024

sre-ci-robot assigned czs007 Oct 17, 2024

yanliang567 added this to the 2.5.0 milestone Oct 17, 2024

yanliang567 added the triage/accepted label and removed the needs-triage label Oct 17, 2024

yanliang567 removed their assignment Oct 17, 2024

zhuwenxing added the priority/critical-urgent and severity/critical labels Oct 18, 2024

zhuwenxing added the feature/text match label Oct 18, 2024

yanliang567 added the ci/e2e label Oct 28, 2024

zhuwenxing changed the title ~~[Bug]: flaky test case test_query_text_match_normal in ci~~ [Bug]: When performing a text match query in a growing segment, the result will be empty. Nov 2, 2024

yanliang567 modified the milestones: 2.5.0, 2.5.1, 2.5.2 Dec 24, 2024

yanliang567 modified the milestones: 2.5.2, 2.5.3 Jan 6, 2025

yanliang567 modified the milestones: 2.5.3, 2.5.4 Jan 16, 2025

yanliang567 modified the milestones: 2.5.4, 2.5.5 Jan 24, 2025
