Can not search using multiple words in queries and please add unicode support #39

Open
ZeroCool940711 opened this issue Sep 2, 2017 · 11 comments

@ZeroCool940711

Hi there, I found that searches with multiple words return nothing. Also, if I search for anything that uses non-English words, for example words in Spanish or Japanese, the search fails and I get a unicode error. I tried searching for things like "Game of thrones" or "Dark Matter" on my site using flask-whooshee and got no results from those queries, but if I search for "Game", "Dark", or any single word, the search works. Thanks for your time. I hope I can use your extension soon, it looks pretty interesting.

@bkabrda
Collaborator

bkabrda commented Jun 22, 2018

Hi, so I believe that #45 should solve the unicode issue. I'll be releasing version 0.6.0 pretty soon that should have that issue fixed. As for your other issue, I'd need to see your code or a reproducer. Could you please provide one of those?

@bkabrda
Collaborator

bkabrda commented Jun 22, 2018

The unicode support should be fixed in version 0.6.0: https://pypi.org/project/flask-whooshee/0.6.0/

@thejcannon

@bkabrda I believe the multi-word search issue is due to the fact that whoosh.qparser.MultifieldParser doesn't use the whoosh.qparser.PhrasePlugin plugin by default.

There doesn't look to be any way to add or override the plugins when using this library 😢.

@thejcannon

thejcannon commented Apr 22, 2019

I stand corrected, MultifieldParser uses QueryParser and adds the MultiField plugin which means it also has all the default plugins.

So not sure why searching '"foo bar"' is always coming back empty.

@thejcannon

thejcannon commented Apr 22, 2019

OK, seems like an issue with the match_substrings implementation.

If my search_string is "foo bar", the parsed query with match_substrings=True looks like:
(title:*foo* *bar* OR content:*foo* *bar*)

If I set match_substrings=False it looks like:
(title:"foo bar" OR content:"foo bar")

@bkabrda
Collaborator

bkabrda commented May 10, 2019

@thejcannon I'm not sure I follow. Could you please provide a more complete example, or ideally a reproducer that produces the wrong result, together with the result you expect? Thanks!

@thejcannon

Add this to your tests (I put it before test_mw_result_in_different_fields)

def test_multi_word(self):
    # A quoted phrase should match exactly one entry (e4).
    found = self.Entry.query.whooshee_search('"less dangerous"').all()
    self.assertEqual(len(found), 1)
    self.assertEqual(found[0], self.e4)

@Gurbert

Gurbert commented May 14, 2019

Hi there, I can confirm what thejcannon just said about searching for multiple words like '"foo bar"'. It only works when using match_substrings=False. The same is true for searches using the AND and OR operators, like foo AND bar.

Here's the code I am using for my flask route:

@bp.route('/search/')
def search():
    title = 'Search'
    search_query = request.args.get('q')
    try:
        posts = Post.query\
            .whooshee_search(search_query)\
            .filter_by(published=True)
    except ValueError:
        flash('Search string must have at least 3 characters.')
        return redirect(url_for('main.index'))
    return render_template('index.html', posts=posts, title=title)

@bkabrda
Collaborator

bkabrda commented May 15, 2019

Thanks for the detailed info, I'll try looking into this.

@bkabrda
Collaborator

bkabrda commented May 15, 2019

So this is actually pretty complicated and a result of my younger self doing things that my current self doesn't like too much. Let me explain (and see the bottom of this comment for a proposed solution):

When match_substrings is True, the prep_search_string method naively takes the search string and replaces all the whitespace substrings with * *. When no quoting is used, e.g. 'less dangerous' is searched for, this works correctly, yielding search string '*less* *dangerous*'. But when the search string is quoted, e.g. '"less dangerous"', then prep_search_string returns a meaningless result '*"less* *dangerous"*' which is obviously not found. When match_substrings is False, this replacing doesn't happen and everything works ok.
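The behavior described above can be reproduced with a minimal sketch. The function `naive_prep` is a hypothetical stand-in written for illustration, not the library's actual `prep_search_string`; it only assumes that each run of whitespace is replaced with `* *` and that the whole string is wrapped in asterisks.

```python
import re


def naive_prep(search_string):
    """Hypothetical stand-in for the whitespace replacement described above.

    This is an illustration only, not flask-whooshee's real code.
    """
    s = search_string.strip()
    # Replace each run of whitespace with "* *", then wrap the whole string.
    return "*{0}*".format(re.sub(r"\s+", "* *", s))


print(naive_prep('less dangerous'))    # *less* *dangerous*
print(naive_prep('"less dangerous"'))  # *"less* *dangerous"*
print(naive_prep('foo AND bar'))       # *foo* *AND* *bar*
```

The last line suggests, as an inference rather than something confirmed in this thread, why `foo AND bar` might also fail: the operator itself ends up wrapped in asterisks like any other word.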

How to fix this:
I think the correct solution (for when match_substrings is True) is:

  • First parse the search string to get search tokens => this should yield less dangerous as a single token.
  • Then apply the asterisks to the tokens.

So e.g. for whooshee_search('something "and something else"'), the above should yield *something* *and something else* as the real search string, which I think is correct. Do you folks think this makes sense? I can implement this pretty easily if so.
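The two proposed steps could be sketched roughly as follows. This is only an illustration of the idea, not the fix as implemented: `tokenized_prep` is a hypothetical name, and using the standard library's `shlex.split` as the tokenizer is an assumption made here for brevity.

```python
import shlex


def tokenized_prep(search_string):
    """Hypothetical sketch: tokenize first, then wrap each token in asterisks.

    shlex.split is an assumed tokenizer; it keeps a double-quoted phrase
    together as one token and drops the quotes.
    """
    tokens = shlex.split(search_string)
    return " ".join("*{0}*".format(token) for token in tokens)


print(tokenized_prep('something "and something else"'))
# *something* *and something else*
```

Note that `shlex.split` raises `ValueError` on unbalanced quotes and also treats apostrophes as quote characters, so a real implementation would likely need its own tokenizer or extra input handling.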

@KietnaQuTak

KietnaQuTak commented Jul 17, 2019

Hello. Thanks for posting information about how to use quotation marks. Setting match_substrings=False helps me find exact matches for dates and numbers separated by special characters.

In fact, I have stumbled upon a strange difference in behavior between searching for date-like strings (e.g. 1985-07-28) and numbers with special characters (e.g. ABC/1985/07/28/ABC). Searching for such a number gives me a lot of results, so as I understand it, whooshee_search() splits the number on the special characters and searches for each of its parts (using quotation marks gives me an exact match). But searching for a date-like string (1985-07-28) gives me absolutely no results, even though I'm 100% sure that the string "1985" was indexed. I am able to search for date-like strings only with match_substrings=False and with the string in quotation marks. Is that normal behavior? I can understand why a number without quotation marks does not give me an exact match, but I cannot understand why date-like strings give me results only with quotation marks.

PS. Thanks for easy to use implementation of Whoosh :-)
