-
Notifications
You must be signed in to change notification settings - Fork 503
Add insensitive search for documents #545
base: master
Are you sure you want to change the base?
Conversation
I think the search by tag or correspondant is now broken if there are any accents. The PR is still in WIP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good addition, but there were a few things that I think need to be fixed before we accept the merge. I'm sorry that I'm not as readily available to this project as I once was, so if you do make the requested changes and I don't properly respond in a timely manner to toggle approve
, feel free to email me directly: paperless
at <my-github-username>
dot org
.
src/documents/admin.py
Outdated
remove_tag_from_selected, | ||
set_correspondent_on_selected | ||
) | ||
from documents.actions import (add_tag_to_selected, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interests of consistency, please don't reformat imports. If anything, imports should conform to an isort configuration of:
isort -m 3 --dont-skip __init__.py --virtual-env ${virtualenv}/.. --quiet ${file_name}
|
||
from django.db import migrations, models | ||
|
||
from paperless.utils import slugify as slugifyOCR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it can be tempting to import stuff from modules into a migration to keep your code DRY, this will later bite us in the ass should we decide to rename/remove this function. It will break all migrations for all time as a result.
Instead, if you have logic you wish to make available in a migration, copy it verbatim into the migration.
src/paperless/utils.py
Outdated
import unicodedata | ||
|
||
|
||
def slugify(content): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we come up with a more appropriate name for this than slugify()
? Something like make_searchable()
would be more apt as this doesn't turn the content into a slug at all.
doc.searchable_content = slugifyOCR(doc.content) | ||
doc.save() | ||
|
||
def casefold_backwards(apps, schema_editor): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Protip: you don't need to create an empty backwards method. You can just do migrations.RunPython(casefold_forwards, migrations.RunPython.noop)
below instead.
@@ -21,3 +21,28 @@ def test_file_deletion(self): | |||
mock_unlink.assert_any_call(file_path) | |||
mock_unlink.assert_any_call(thumb_path) | |||
self.assertEqual(mock_unlink.call_count, 2) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hooray for tests!
@@ -266,6 +281,13 @@ def __str__(self): | |||
return "{}: {}".format(created, self.correspondent or self.title) | |||
return str(created) | |||
|
|||
def save(self, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing this in .save()
and not in a signal. This is much more appropriate.
de68565
to
51b0dc0
Compare
51b0dc0
to
4ac538b
Compare
4ac538b
to
77b0f65
Compare
Hey ! Does someone have any time to review this ? :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I like the change, but I have left one comment about this change potentially breaking the search entirely for user-languages that don't use the latin alphabet and I'd like your opinion/ideas on it. 🙂
def make_searchable(content): | ||
return ( | ||
unicodedata.normalize("NFKD", content.casefold()) | ||
.encode("ASCII", "ignore") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I'm not sure if this is the right thing to do. In light of your change only searching the new searchable fields, this straight up breaks search for languages that share no characters with the ASCII codepage, right? Judging from the official Python docs, "ignore"
would simply drop non-ASCII characters, resulting in a potentially empty search-term.
As far as I can tell not ignoring these characters here would break the entire feature, your Zürich Weiß
example would, after case-folding first, produce zu\xcc\x88rich weiss
, which obviously cannot be searched for with zurich weiss
.
Maybe the simplest fix is to search both the new ASCII-fied fields and the regular fields?
This PR adds an insensitive-search on document title and content. This should close #115.
As it was pointed out by @danielquinn,
str.casefold()
is a good start, but not enough: it does not remove accents. For the latter, we need aNFKD
unicode normalization (see https://unicode.org/reports/tr15/).I'm open to any improvements !