[Bug]: When performing a text match query in a growing segment, the result will be empty. #36962

Open
1 task done
zhuwenxing opened this issue Oct 17, 2024 · 11 comments
Labels
ci/e2e feature/text match kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Performed text matching using the token with the highest frequency in the corpus, but no results were returned.

[pytest : test] self = <test_query.TestQueryTextMatch object at 0x7f25041845b0>

[pytest : test] tokenizer = 'default', enable_inverted_index = False

[pytest : test] enable_partition_key = True

[pytest : test] 

[pytest : test]         @pytest.mark.tags(CaseLabel.L0)

[pytest : test]         @pytest.mark.parametrize("enable_partition_key", [True, False])

[pytest : test]         @pytest.mark.parametrize("enable_inverted_index", [True, False])

[pytest : test]         @pytest.mark.parametrize("tokenizer", ["jieba", "default"])

[pytest : test]         def test_query_text_match_normal(

[pytest : test]             self, tokenizer, enable_inverted_index, enable_partition_key

[pytest : test]         ):

[pytest : test]             """

[pytest : test]             target: test text match normal

[pytest : test]             method: 1. enable text match and insert data with varchar

[pytest : test]                     2. get the most common words and query with text match

[pytest : test]                     3. verify the result

[pytest : test]             expected: text match successfully and result is correct

[pytest : test]             """

[pytest : test]             tokenizer_params = {

[pytest : test]                 "tokenizer": tokenizer,

[pytest : test]             }

[pytest : test]             dim = 128

[pytest : test]             fields = [

[pytest : test]                 FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),

[pytest : test]                 FieldSchema(

[pytest : test]                     name="word",

[pytest : test]                     dtype=DataType.VARCHAR,

[pytest : test]                     max_length=65535,

[pytest : test]                     enable_tokenizer=True,

[pytest : test]     				enable_match=True,

[pytest : test]                     is_partition_key=enable_partition_key,

[pytest : test]                     tokenizer_params=tokenizer_params,

[pytest : test]                 ),

[pytest : test]                 FieldSchema(

[pytest : test]                     name="sentence",

[pytest : test]                     dtype=DataType.VARCHAR,

[pytest : test]                     max_length=65535,

[pytest : test]                     enable_tokenizer=True,

[pytest : test]     				enable_match=True,

[pytest : test]                     tokenizer_params=tokenizer_params,

[pytest : test]                 ),

[pytest : test]                 FieldSchema(

[pytest : test]                     name="paragraph",

[pytest : test]                     dtype=DataType.VARCHAR,

[pytest : test]                     max_length=65535,

[pytest : test]                     enable_tokenizer=True,

[pytest : test]     				enable_match=True,

[pytest : test]                     tokenizer_params=tokenizer_params,

[pytest : test]                 ),

[pytest : test]                 FieldSchema(

[pytest : test]                     name="text",

[pytest : test]                     dtype=DataType.VARCHAR,

[pytest : test]                     max_length=65535,

[pytest : test]                     enable_tokenizer=True,

[pytest : test]     				enable_match=True,

[pytest : test]                     tokenizer_params=tokenizer_params,

[pytest : test]                 ),

[pytest : test]                 FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=dim),

[pytest : test]             ]

[pytest : test]             schema = CollectionSchema(fields=fields, description="test collection")

[pytest : test]             data_size = 3000

[pytest : test]             collection_w = self.init_collection_wrap(

[pytest : test]                 name=cf.gen_unique_str(prefix), schema=schema

[pytest : test]             )

[pytest : test]             fake = fake_en

[pytest : test]             if tokenizer == "jieba":

[pytest : test]                 language = "zh"

[pytest : test]                 fake = fake_zh

[pytest : test]             else:

[pytest : test]                 language = "en"

[pytest : test]     

[pytest : test]             data = [

[pytest : test]                 {

[pytest : test]                     "id": i,

[pytest : test]                     "word": fake.word().lower(),

[pytest : test]                     "sentence": fake.sentence().lower(),

[pytest : test]                     "paragraph": fake.paragraph().lower(),

[pytest : test]                     "text": fake.text().lower(),

[pytest : test]                     "emb": [random.random() for _ in range(dim)],

[pytest : test]                 }

[pytest : test]                 for i in range(data_size)

[pytest : test]             ]

[pytest : test]             df = pd.DataFrame(data)

[pytest : test]             log.info(f"dataframe\n{df}")

[pytest : test]             batch_size = 5000

[pytest : test]             for i in range(0, len(df), batch_size):

[pytest : test]                 collection_w.insert(

[pytest : test]                     data[i : i + batch_size]

[pytest : test]                     if i + batch_size < len(df)

[pytest : test]                     else data[i : len(df)]

[pytest : test]                 )

[pytest : test]             collection_w.create_index(

[pytest : test]                 "emb",

[pytest : test]                 {"index_type": "IVF_SQ8", "metric_type": "L2", "params": {"nlist": 64}},

[pytest : test]             )

[pytest : test]             if enable_inverted_index:

[pytest : test]                 collection_w.create_index("word", {"index_type": "INVERTED"})

[pytest : test]             collection_w.load()

[pytest : test]             # analyze the corpus

[pytest : test]             text_fields = ["word", "sentence", "paragraph", "text"]

[pytest : test]             wf_map = {}

[pytest : test]             for field in text_fields:

[pytest : test]                 wf_map[field] = cf.analyze_documents(df[field].tolist(), language=language)

[pytest : test]             # query single field for one token

[pytest : test]             for field in text_fields:

[pytest : test]                 token = wf_map[field].most_common()[0][0]

[pytest : test]                 expr = f"TextMatch({field}, '{token}')"

[pytest : test]                 log.info(f"expr: {expr}")

[pytest : test]                 res, _ = collection_w.query(expr=expr, output_fields=["id", field])

[pytest : test] >               assert len(res) > 0

[pytest : test] E               assert 0 > 0

[pytest : test] E                +  where 0 = len(data: [] )

[pytest : test] 

[pytest : test] testcases/test_query.py:4724: AssertionError

[pytest : test] ------------------------------ Captured log setup ------------------------------

[pytest : test] [2024-10-17 08:14:29 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:46)

[pytest : test] [2024-10-17 08:14:29 - INFO - ci_test]: [setup_method] Start setup test case test_query_text_match_normal. (client_base.py:47)

[pytest : test] ----------------------------- Captured stderr call -----------------------------

[pytest : test] 
Split strings:   0%|          | 0/3000 [00:00<?, ?it/s]

[pytest : test] ------------------------------ Captured log call -------------------------------

[pytest : test] [2024-10-17 08:14:29 - DEBUG - ci_test]: (api_request)  : [Connections.has_connection] args: ['default'], kwargs: {} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:29 - DEBUG - ci_test]: (api_response) : False  (api_request.py:37)

[pytest : test] [2024-10-17 08:14:29 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default', '', '', 'default', ''], kwargs: {'host': 'md-36952-1-py-pr-milvus.milvus-ci', 'port': '19530'} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:29 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[pytest : test] [2024-10-17 08:14:29 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['query_M6Tl7wz0', {'auto_id': False, 'description': 'test collection', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'word', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535, 'enable_match':......, kwargs: {'consistency_level': 'Strong'} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:29 - DEBUG - ci_test]: (api_response) : <Collection>:

[pytest : test] -------------

[pytest : test] <name>: query_M6Tl7wz0

[pytest : test] <description>: test collection

[pytest : test] <schema>: {'auto_id': False, 'description': 'test collection', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'word', 'description': '', 'type'......  (api_request.py:37)

[pytest : test] [2024-10-17 08:14:30 - INFO - ci_test]: dataframe

[pytest : test]         id      word                                           sentence                                          paragraph                                               text                                                emb

[pytest : test] 0        0      good                 fight nothing during because down.  guess none government value. opportunity struc...  teach experience result discussion card. night...  [0.25722484194522754, 0.10716873880402633, 0.6...

[pytest : test] 1        1       way               fire wait interest dinner operation.  thousand quite animal suddenly friend data. fe...  whether development change smile until early. ...  [0.6393285775234575, 0.25662668740363703, 0.84...

[pytest : test] 2        2      cell       common budget receive sister listen article.  time person citizen institution full front com...  big opportunity fall arrive it seven lose pass...  [0.5733811147237775, 0.4365202152412886, 0.054...

[pytest : test] 3        3  employee  western agreement window city mention hard req...  let tell upon house. economic lot daughter art...  machine involve agent. science though blood pa...  [0.28186465318599285, 0.8690287069224029, 0.95...

[pytest : test] 4        4     shake                                   may hot provide.         other decade them debate wrong difference.  pass trial such heart news executive. hot tech...  [0.22295701922050015, 0.23748072218399807, 0.7...

[pytest : test] ...    ...       ...                                                ...                                                ...                                                ...                                                ...

[pytest : test] 2995  2995     small                    exactly far risk social pretty.  fund there similar morning. life right attorne...  set myself thing do mrs.\nposition sense entir...  [0.8289662416206579, 0.034599987863862536, 0.2...

[pytest : test] 2996  2996     ready  skill particularly page sign back however whil...  fact campaign within cover next but. internati...  few the young walk position early. girl mr sit...  [0.5687476818480588, 0.37727107847674124, 0.51...

[pytest : test] 2997  2997    police                        stuff picture former study.  stop stop they next party must. own sing floor...  myself majority drug key arm difference. devel...  [0.738114035712616, 0.26428015252046155, 0.679...

[pytest : test] 2998  2998   current               each stay sit range happen idea may.                        modern window room control.  cold alone much commercial space shake.\nnewsp...  [0.680958986682074, 0.838743116079627, 0.16323...

[pytest : test] 2999  2999      yard                               anyone to direction.  economy eye condition. public next article rea...  last leg market no. product recent economic th...  [0.013924658826216851, 0.9652646286447241, 0.5...

[pytest : test] 

[pytest : test] [3000 rows x 6 columns] (test_query.py:4698)

[pytest : test] [2024-10-17 08:14:30 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [[{'id': 0, 'word': 'good', 'sentence': 'fight nothing during because down.', 'paragraph': 'guess none government value. opportunity structure thank common increase break pick.', 'text': 'teach experience result discussion card. night message far itself friend plant tonight. analysis collection indi......, kwargs: {'timeout': 180} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:31 - DEBUG - ci_test]: (api_response) : (insert count: 3000, delete count: 0, upsert count: 0, timestamp: 453287050275979268, success count: 3000, err count: 0  (api_request.py:37)

[pytest : test] [2024-10-17 08:14:31 - DEBUG - ci_test]: (api_request)  : [Collection.create_index] args: ['emb', {'index_type': 'IVF_SQ8', 'metric_type': 'L2', 'params': {'nlist': 64}}, 1200], kwargs: {'index_name': ''} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:31 - DEBUG - ci_test]: (api_response) : Status(code=0, message=)  (api_request.py:37)

[pytest : test] [2024-10-17 08:14:31 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 180], kwargs: {} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:32 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[pytest : test] [2024-10-17 08:14:32 - INFO - ci_test]: Analyze document cost time: 0.015944480895996094 (common_func.py:111)

[pytest : test] [2024-10-17 08:14:32 - INFO - ci_test]: Analyze document cost time: 0.028260469436645508 (common_func.py:111)

[pytest : test] [2024-10-17 08:14:32 - INFO - ci_test]: Analyze document cost time: 0.05806303024291992 (common_func.py:111)

[pytest : test] [2024-10-17 08:14:33 - INFO - ci_test]: Analyze document cost time: 0.07532739639282227 (common_func.py:111)

[pytest : test] [2024-10-17 08:14:33 - INFO - ci_test]: expr: TextMatch(word, 'specific') (test_query.py:4722)

[pytest : test] [2024-10-17 08:14:33 - DEBUG - ci_test]: (api_request)  : [Collection.query] args: ["TextMatch(word, 'specific')", ['id', 'word'], None, 180], kwargs: {} (api_request.py:62)

[pytest : test] [2024-10-17 08:14:33 - DEBUG - ci_test]: (api_response) : data: []   (api_request.py:37)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed ci job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-36952/1/pipeline/

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 17, 2024
@zhuwenxing
Contributor Author

/assign @czs007

PTAL

@yanliang567 yanliang567 added this to the 2.5.0 milestone Oct 17, 2024
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 17, 2024
@yanliang567 yanliang567 removed their assignment Oct 17, 2024
@zhuwenxing
Contributor Author

This case has a relatively high failure rate in CI, so it's set to critical.

@zhuwenxing zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Oct 18, 2024
@czs007
Collaborator

czs007 commented Oct 18, 2024

working on this.

@zhuwenxing
Contributor Author

failed job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-36693/9/pipeline

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Word frequency: [('win', 10), ('resource', 9), ('beat', 9), ('together', 9), ('hand', 9), ('themselves', 8), ('business', 8), ('exist', 8), ('enough', 8), ('why', 8)] (common_func.py:117)
......
[pytest : test] [2024-10-21 07:36:04 - DEBUG - ci_test]: (api_request)  : [Collection.query] args: ["TextMatch(word, 'win')", ['id', 'word'], None, 180], kwargs: {} (api_request.py:62)

[pytest : test] [2024-10-21 07:36:05 - DEBUG - ci_test]: (api_response) : data: []   (api_request.py:37)

The word "win" appears 10 times in the corpus of the "word" field, but the result returned by text match is empty.
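
The expected hit count can be cross-checked locally before looking at the server. The sketch below is only a stand-in for the test's `cf.analyze_documents` helper, whose implementation is not shown in this issue. The regex tokenization and the names `tokenize`, `word_frequency`, and `expected_match_ids` are assumptions for illustration, and they will not reproduce the `jieba` tokenizer.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase, then split on anything that is not a-z or 0-9.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def word_frequency(docs):
    # Count how often each token occurs across all documents.
    counter = Counter()
    for doc in docs:
        counter.update(tokenize(doc))
    return counter


def expected_match_ids(docs, token):
    # Row positions whose text contains the token at least once.
    return [i for i, doc in enumerate(docs) if token in set(tokenize(doc))]


# Tiny made-up corpus, not the data from the CI run.
docs = ["win", "resource", "win together", "beat"]
freq = word_frequency(docs)
top_token = freq.most_common(1)[0][0]
ids = expected_match_ids(docs, top_token)
```

If the local count for the top token is non-zero while the text match query returns `data: []`, the mismatch is on the query side rather than in the test data.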


[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Analyze document cost time: 0.007334709167480469 (common_func.py:116)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Word frequency: [('win', 10), ('resource', 9), ('beat', 9), ('together', 9), ('hand', 9), ('themselves', 8), ('business', 8), ('exist', 8), ('enough', 8), ('why', 8)] (common_func.py:117)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Analyze document cost time: 0.019197463989257812 (common_func.py:116)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Word frequency: [('least', 31), ('force', 29), ('claim', 28), ('south', 28), ('environment', 28), ('nothing', 27), ('partner', 27), ('mission', 27), ('follow', 27), ('material', 27)] (common_func.py:117)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Analyze document cost time: 0.02630472183227539 (common_func.py:116)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Word frequency: [('go', 67), ('together', 63), ('open', 62), ('position', 61), ('light', 60), ('half', 59), ('pull', 58), ('pay', 58), ('cold', 58), ('prevent', 57)] (common_func.py:117)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Analyze document cost time: 0.0409548282623291 (common_func.py:116)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: Word frequency: [('thing', 94), ('plant', 93), ('environmental', 91), ('along', 91), ('air', 90), ('bill', 89), ('upon', 89), ('improve', 89), ('smile', 89), ('develop', 87)] (common_func.py:117)

[pytest : test] [2024-10-21 07:36:04 - INFO - ci_test]: expr: TextMatch(word, 'win') (test_query.py:4722)

[pytest : test] [2024-10-21 07:36:04 - DEBUG - ci_test]: (api_request)  : [Collection.query] args: ["TextMatch(word, 'win')", ['id', 'word'], None, 180], kwargs: {} (api_request.py:62)

[pytest : test] [2024-10-21 07:36:05 - DEBUG - ci_test]: (api_response) : data: []   (api_request.py:37)

@czs007

@aoiasd
Contributor

aoiasd commented Oct 29, 2024

The insert into the growing segment succeeds, but the data can't be searched. Checking the temporary index for growing segments.

@zhuwenxing zhuwenxing changed the title [Bug]: flaky test case test_query_text_match_normal in ci [Bug]: When performing a text match query in a growing segment, the result will be empty. Nov 2, 2024
@xiaofan-luan
Collaborator

This is the expected behaviour.
For match it takes 200ms to reindex the data, so we cannot search it with strong consistency.
Adding another brute-force search might be too much for this case.

@zhuwenxing
Contributor Author

This is the expected behaviour. For match it takes 200ms to reindex the data, so we cannot search it with strong consistency. Adding another brute-force search might be too much for this case.

@xiaofan-luan
This is not expected behavior. We do not require data to be queryable immediately after insert. The problem is that data which stays in a growing segment and is never flushed never becomes queryable.
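
One way to tell "visible after a short delay" apart from "never visible" is to retry the query until a deadline. The helper below is a minimal sketch and not part of the Milvus test suite: `wait_for_results`, its parameters, and the `fake_query` stand-in are hypothetical names chosen for illustration.

```python
import time


def wait_for_results(run_query, timeout_s=5.0, interval_s=0.2):
    """Call run_query() until it returns a non-empty list or the timeout expires.

    Returns (results, attempts). An empty result list means the data never
    became visible within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        res = run_query()
        if res:
            return res, attempts
        if time.monotonic() >= deadline:
            return [], attempts
        time.sleep(interval_s)


# Stand-in for a real query: empty for the first two calls, then one hit.
calls = {"n": 0}


def fake_query():
    calls["n"] += 1
    return [{"id": 1}] if calls["n"] >= 3 else []


res, attempts = wait_for_results(fake_query, timeout_s=1.0, interval_s=0.01)
```

In the failing test, `run_query` could wrap the existing call, for example `lambda: collection_w.query(expr=expr, output_fields=["id", field])[0]`. A result that is still empty after a timeout far longer than a few hundred milliseconds would support the "never queryable" reading.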

@xiaofan-luan
Collaborator

This data should be searchable after a couple of hundred milliseconds.

@xiaofan-luan
Collaborator

This is the expected behaviour. Tantivy data is searchable only after it has been synced to disk, and we cannot sync to disk on every write. There is a checker here that syncs data to disk at a regular interval.
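
The visibility rule described here can be illustrated with a toy in-memory model. This is not Tantivy's or Milvus's implementation: `ToyTextIndex` and its methods are hypothetical, and `commit()` merely stands in for whatever the periodic checker does.

```python
class ToyTextIndex:
    """Toy model: documents become searchable only after commit()."""

    def __init__(self):
        self._pending = []    # added but not yet committed
        self._committed = {}  # token -> set of doc ids visible to search

    def add(self, doc_id, text):
        # Writes are buffered; they are not visible to search yet.
        self._pending.append((doc_id, text))

    def commit(self):
        # Move buffered documents into the searchable structure.
        for doc_id, text in self._pending:
            for token in text.lower().split():
                self._committed.setdefault(token, set()).add(doc_id)
        self._pending.clear()

    def search(self, token):
        return sorted(self._committed.get(token, set()))


index = ToyTextIndex()
index.add(0, "win")
index.add(1, "win together")
before = index.search("win")  # nothing has been committed yet
index.commit()                # stands in for the periodic checker
after = index.search("win")
```

In this model a query issued between `add` and `commit` returns nothing, which matches the empty results above. If the commit step never runs for a growing segment, the rows stay invisible indefinitely.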

@yanliang567 yanliang567 modified the milestones: 2.5.0, 2.5.1, 2.5.2 Dec 24, 2024
@yanliang567 yanliang567 modified the milestones: 2.5.2, 2.5.3 Jan 6, 2025
@czs007
Collaborator

czs007 commented Jan 9, 2025

may be fixed by #39070

@yanliang567 yanliang567 modified the milestones: 2.5.3, 2.5.4 Jan 16, 2025
@yanliang567 yanliang567 modified the milestones: 2.5.4, 2.5.5 Jan 24, 2025