Skip to content
This repository has been archived by the owner on Feb 19, 2021. It is now read-only.

Add insensitive search for documents #545

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

CkuT
Copy link
Contributor

@CkuT CkuT commented Jun 1, 2019

This PR adds an insensitive-search on document title and content. This should close #115.

As it was pointed out by @danielquinn, str.casefold() is a good start, but not enough: it does not remove accents. For the latter, we need a NFKD unicode normalization (see https://unicode.org/reports/tr15/).

I'm open to any improvements !

@CkuT CkuT requested a review from a team June 1, 2019 13:33
@CkuT CkuT changed the title Add insensitive search for documents WIP: Add insensitive search for documents Jun 1, 2019
@CkuT
Copy link
Contributor Author

CkuT commented Jun 1, 2019

I think the search by tag or correspondant is now broken if there are any accents. The PR is still in WIP.

Copy link
Collaborator

@danielquinn danielquinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good addition, but there were a few things that I think need to be fixed before we accept the merge. I'm sorry that I'm not as readily available to this project as I once was, so if you do make the requested changes and I don't properly respond in a timely manner to toggle approve, feel free to email me directly: paperless at <my-github-username> dot org.

remove_tag_from_selected,
set_correspondent_on_selected
)
from documents.actions import (add_tag_to_selected,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interests of consistency, please don't reformat imports. If anything, imports should conform to an isort configuration of:

isort -m 3 --dont-skip __init__.py --virtual-env ${virtualenv}/.. --quiet ${file_name}


from django.db import migrations, models

from paperless.utils import slugify as slugifyOCR
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it can be tempting to import stuff from modules into a migration to keep your code DRY, this will later bite us in the ass should we decide to rename/remove this function. It will break all migrations for all time as a result.

Instead, if you have logic you wish to make available in a migration, copy it verbatim into the migration.

import unicodedata


def slugify(content):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we come up with a more appropriate name for this than slugify()? Something like make_searchable() would be more apt as this doesn't turn the content into a slug at all.

doc.searchable_content = slugifyOCR(doc.content)
doc.save()

def casefold_backwards(apps, schema_editor):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Protip: you don't need to create an empty backwards method. You can just do migrations.RunPython(casefold_forwards, migrations.RunPython.noop) below instead.

@@ -21,3 +21,28 @@ def test_file_deletion(self):
mock_unlink.assert_any_call(file_path)
mock_unlink.assert_any_call(thumb_path)
self.assertEqual(mock_unlink.call_count, 2)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hooray for tests!

@@ -266,6 +281,13 @@ def __str__(self):
return "{}: {}".format(created, self.correspondent or self.title)
return str(created)

def save(self, *args, **kwargs):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this in .save() and not in a signal. This is much more appropriate.

@CkuT CkuT force-pushed the insensitive-search branch 3 times, most recently from de68565 to 51b0dc0 Compare November 11, 2019 18:33
@CkuT CkuT changed the title WIP: Add insensitive search for documents Add insensitive search for documents Nov 11, 2019
@CkuT CkuT requested a review from danielquinn November 11, 2019 18:34
@CkuT CkuT force-pushed the insensitive-search branch from 51b0dc0 to 4ac538b Compare November 11, 2019 19:40
@CkuT CkuT force-pushed the insensitive-search branch from 4ac538b to 77b0f65 Compare November 11, 2019 19:42
@CkuT CkuT requested a review from a team February 9, 2020 17:37
@CkuT
Copy link
Contributor Author

CkuT commented Feb 9, 2020

Hey ! Does someone have any time to review this ? :)

Copy link
Member

@pitkley pitkley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I like the change, but I have left one comment about this change potentially breaking the search entirely for user-languages that don't use the latin alphabet and I'd like your opinion/ideas on it. 🙂

def make_searchable(content):
return (
unicodedata.normalize("NFKD", content.casefold())
.encode("ASCII", "ignore")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'm not sure if this is the right thing to do. In light of your change only searching the new searchable fields, this straight up breaks search for languages that share no characters with the ASCII codepage, right? Judging from the official Python docs, "ignore" would simply drop non-ASCII characters, resulting in a potentially empty search-term.

As far as I can tell not ignoring these characters here would break the entire feature, your Zürich Weiß example would, after case-folding first, produce zu\xcc\x88rich weiss, which obviously cannot be searched for with zurich weiss.

Maybe the simplest fix is to search both the new ASCII-fied fields and the regular fields?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Search is too rigid: allow accent-insensitive search queries
3 participants