Add insensitive search for documents #545

CkuT · 2019-06-01T13:33:41Z

This PR adds case- and accent-insensitive search on document title and content. This should close #115.

As @danielquinn pointed out, str.casefold() is a good start but not enough: it does not remove accents. For that, we also need NFKD Unicode normalization, which splits accented characters into a base letter plus combining marks that can then be dropped (see https://unicode.org/reports/tr15/).
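To illustrate the point above, here is a minimal sketch using only the standard library. The sample string and the variable names are illustrative, and dropping combining marks with unicodedata.combining() is just one way to remove the accents after decomposition; it is not necessarily what this PR does.

```python
import unicodedata

text = "Zürich Weiß"

# casefold() lowercases aggressively (ß becomes ss) but keeps the umlaut.
folded = text.casefold()

# NFKD splits "ü" into a plain "u" followed by a combining diaeresis (U+0308).
decomposed = unicodedata.normalize("NFKD", folded)

# Dropping the combining marks leaves only the base letters.
stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(stripped)
```

Case-folding alone leaves the accent in place, so a search for "zurich" would still miss the document until the combining marks are removed.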

I'm open to any improvements!

CkuT · 2019-06-01T15:16:33Z

I think the search by tag or correspondent is now broken if there are any accents. The PR is still a WIP.

danielquinn

It's a good addition, but there were a few things that I think need to be fixed before we accept the merge. I'm sorry that I'm not as readily available to this project as I once was, so if you do make the requested changes and I don't properly respond in a timely manner to toggle approve, feel free to email me directly: paperless at <my-github-username> dot org.

danielquinn · 2019-06-21T08:24:25Z

src/documents/admin.py

-    remove_tag_from_selected,
-    set_correspondent_on_selected
-)
+from documents.actions import (add_tag_to_selected,


In the interests of consistency, please don't reformat imports. If anything, imports should conform to an isort configuration of:

isort -m 3 --dont-skip __init__.py --virtual-env ${virtualenv}/.. --quiet ${file_name}

danielquinn · 2019-06-21T08:25:51Z

src/documents/migrations/0023_document_searchable_content.py

+
+from django.db import migrations, models
+
+from paperless.utils import slugify as slugifyOCR


While it can be tempting to import stuff from modules into a migration to keep your code DRY, this will later bite us in the ass should we decide to rename/remove this function. It will break all migrations for all time as a result.

Instead, if you have logic you wish to make available in a migration, copy it verbatim into the migration.
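A rough sketch of what that could look like, showing only the part of the migration module that carries the logic. The helper name _make_searchable, the trailing .decode("ASCII"), the `if doc.content` guard, and the "documents"/"Document" model lookup are assumptions for illustration and are not taken from the PR; the Migration class itself is omitted.

```python
import unicodedata


def _make_searchable(content):
    # Copied into the migration on purpose, so that renaming or removing the
    # helper in paperless/utils.py later cannot break this migration.
    return (
        unicodedata.normalize("NFKD", content.casefold())
        .encode("ASCII", "ignore")
        .decode("ASCII")  # assumption: the helper returns str, not bytes
    )


def casefold_forwards(apps, schema_editor):
    # Use the historical model from the app registry rather than importing
    # the current model class (app and model names are assumed here).
    Document = apps.get_model("documents", "Document")
    for doc in Document.objects.all():
        if doc.content:
            doc.searchable_content = _make_searchable(doc.content)
        doc.save()


# In the Migration class this would be wired up with something like
# migrations.RunPython(casefold_forwards, migrations.RunPython.noop).
```

The migration then depends only on the standard library and on the historical model state, not on application code that may change.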

danielquinn · 2019-06-21T08:27:24Z

src/paperless/utils.py

+import unicodedata
+
+
+def slugify(content):


Can we come up with a more appropriate name for this than slugify()? Something like make_searchable() would be more apt as this doesn't turn the content into a slug at all.

danielquinn · 2019-06-21T08:28:58Z

src/documents/migrations/0023_document_searchable_content.py

+                doc.searchable_content = slugifyOCR(doc.content)
+            doc.save()
+
+    def casefold_backwards(apps, schema_editor):


Protip: you don't need to create an empty backwards method. You can just do migrations.RunPython(casefold_forwards, migrations.RunPython.noop) below instead.

danielquinn · 2019-06-21T08:30:08Z

src/documents/tests/test_document_model.py

@@ -21,3 +21,28 @@ def test_file_deletion(self):
            mock_unlink.assert_any_call(file_path)
            mock_unlink.assert_any_call(thumb_path)
            self.assertEqual(mock_unlink.call_count, 2)
+


Hooray for tests!

danielquinn · 2019-06-21T08:30:30Z

src/documents/models.py

@@ -266,6 +281,13 @@ def __str__(self):
            return "{}: {}".format(created, self.correspondent or self.title)
        return str(created)

+    def save(self, *args, **kwargs):


Thank you for doing this in .save() and not in a signal. This is much more appropriate.

CkuT · 2020-02-09T17:37:51Z

Hey! Does anyone have time to review this? :)

pitkley

Overall I like the change, but I have left one comment about it potentially breaking search entirely for user languages that don't use the Latin alphabet, and I'd like your opinion/ideas on it. 🙂

pitkley · 2020-02-15T10:29:53Z

src/paperless/utils.py

+def make_searchable(content):
+    return (
+        unicodedata.normalize("NFKD", content.casefold())
+        .encode("ASCII", "ignore")


Hm, I'm not sure if this is the right thing to do. In light of your change only searching the new searchable fields, this straight up breaks search for languages that share no characters with the ASCII codepage, right? Judging from the official Python docs, "ignore" would simply drop non-ASCII characters, resulting in a potentially empty search-term.

As far as I can tell, though, not ignoring these characters here would break the entire feature: your Zürich Weiß example would, after case-folding first and then normalizing, produce zu\xcc\x88rich weiss, which obviously cannot be found by searching for zurich weiss.

Maybe the simplest fix is to search both the new ASCII-fied fields and the regular fields?
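A small sketch of both the problem and the suggested fix, in plain Python rather than ORM code. The matches() helper and the sample strings are hypothetical, and make_searchable() is completed with an assumed .decode("ASCII") since the diff above is truncated.

```python
import unicodedata


def make_searchable(content):
    # Same chain as in the diff; the final decode is an assumption.
    return (
        unicodedata.normalize("NFKD", content.casefold())
        .encode("ASCII", "ignore")
        .decode("ASCII")
    )


def matches(query, content):
    """Hypothetical helper: match on the regular text OR the ASCII-fied text."""
    if query.casefold() in content.casefold():
        return True
    ascii_query = make_searchable(query)
    # Skip the ASCII comparison when nothing survived the conversion,
    # because an empty string is a substring of every string.
    return bool(ascii_query.strip()) and ascii_query in make_searchable(content)


# A query without any ASCII characters collapses to an empty string.
print(repr(make_searchable("Счёт")))

print(matches("zurich weiss", "Zürich Weiß"))  # found via the ASCII-fied text
print(matches("счёт", "Счёт за март"))         # found via the regular text
print(matches("счёт", "invoice"))              # no false positive
```

Without the emptiness check, the all-Cyrillic query would match every document, which is the failure mode described above.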

Add insensitive search

e5438ae

CkuT requested a review from a team June 1, 2019 13:33

CkuT added 2 commits June 1, 2019 15:34

Typo

026a6d6

Lint

5b0b832

CkuT changed the title ~~Add insensitive search for documents~~ WIP: Add insensitive search for documents Jun 1, 2019

danielquinn suggested changes Jun 21, 2019

CkuT force-pushed the insensitive-search branch 3 times, most recently from de68565 to 51b0dc0 Compare November 11, 2019 18:33

CkuT changed the title ~~WIP: Add insensitive search for documents~~ Add insensitive search for documents Nov 11, 2019

CkuT requested a review from danielquinn November 11, 2019 18:34

CkuT force-pushed the insensitive-search branch from 51b0dc0 to 4ac538b Compare November 11, 2019 19:40

Handle Tag model for insensitive search

77b0f65

CkuT force-pushed the insensitive-search branch from 4ac538b to 77b0f65 Compare November 11, 2019 19:42

CkuT requested a review from a team February 9, 2020 17:37

pitkley suggested changes Feb 15, 2020
