-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
U-Blad Corpus #1590
U-Blad Corpus #1590
Conversation
remove the Year field from ublad add content_transform func for ublad content add display_type for urls to Ublad add csv_core fields for Ublad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
A few fields are missing an appropriate mapping and I'm not sure about the locale setting.
backend/corpora/ublad/ublad.py
Outdated
import locale | ||
locale.setlocale(locale.LC_ALL, 'nl_NL.UTF-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the corpus is imported when starting up the server, this may have unintended side effects. I would recommend restoring the locale setting when it's no longer needed.
It's not clear from the code when that would be, by the way; please document what this setting is intended to achieve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Luka, this should not live in runtime code. My guess it is used in indexing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used for the indexing of the date; where would I put it then? Or what should I restore the locale to exactly?
backend/corpora/ublad/ublad.py
Outdated
def transform_content(soup): | ||
""" | ||
Transforms the text contents of a page node (soup) into a string consisting | ||
of blocks of text, foregoing the column structure of the OCR'ed material. | ||
""" | ||
page_text = "" | ||
for child in soup.children: | ||
if isinstance(child, Tag) and 'ocr_carea' in child.get('class', []): | ||
paragraph_text = "" | ||
paragraph_list = child.get_text().split('\n') | ||
for item in paragraph_list[1:]: | ||
if not item: | ||
pass | ||
elif item.endswith('-'): | ||
paragraph_text += item.strip('-') | ||
else: | ||
paragraph_text += item + ' ' | ||
if paragraph_text: | ||
page_text += paragraph_text + '\n\n' | ||
page_node = BeautifulSoup(page_text, 'html.parser') | ||
return page_node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat!
Since the return value is just a node with the string content you want, why not use extract_soup_func
to return the string directly?
backend/corpora/ublad/ublad.py
Outdated
FieldDefinition( | ||
name='volume_id', | ||
display_name='Volume ID', | ||
description='Unique identifier for this volume', | ||
hidden=True, | ||
extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={ | ||
'attribute': 'name', | ||
'value': 'identifier_ocn' | ||
} | ||
) | ||
), | ||
FieldDefinition( | ||
name='id', | ||
display_name='Page ID', | ||
description='Unique identifier for this page', | ||
hidden=True, | ||
extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={ | ||
'attribute': 'name', | ||
'value': 'identifier_indexid' | ||
} | ||
) | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These fields are missing es_mapping=keyword_mapping()
backend/corpora/ublad/ublad.py
Outdated
FieldDefinition( | ||
name='edition', | ||
display_name='Edition', | ||
description='The number of the edition in this volume. Every year starts at 1.', | ||
sortable=True, | ||
extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={ | ||
'attribute': 'name', | ||
'value': 'aflevering' | ||
} | ||
) | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
es_mapping=int_mapping()
, I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keyword_mapping cause sometimes it is a phrase instead of a number
backend/corpora/ublad/ublad.py
Outdated
scan_image_type = getattr(settings, 'UBLAD_SCAN_IMAGE_TYPE', 'image/jpeg') | ||
allow_image_download = getattr(settings, 'UBLAD_ALLOW_IMAGE_DOWNLOAD', True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do these attributes need to be configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scan_image_type
is used in GetMediaView
to open the image type correctly, allow_image_download
is used in the frontend to allow downlaods and the default is false
. Or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the values make sense, but you could just write the value directly, i.e.
scan_image_type = 'image/jpeg'
allow_image_download = True
This definition allows the values to be configured in settings
, but why? That would make sense if, for instance, image download would need to be disabled in some environments but not in others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few typos/language use suggestions. See Luka's review for a few code issues that should be resolved. Good work, nice that it's (almost) done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leuke afbeelding, nice.
backend/corpora/ublad/ublad.py
Outdated
import locale | ||
locale.setlocale(locale.LC_ALL, 'nl_NL.UTF-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Luka, this should not live in runtime code. My guess it is used in indexing?
Co-authored-by: Jelte van Boheemen <[email protected]>
Co-authored-by: Jelte van Boheemen <[email protected]>
…lab/I-analyzer into feature/ublad-corpus
I think I have addressed all your comments, see if you agree with my solutions. No worries if this does not make it into the next release! |
backend/es/es_index.py
Outdated
try: | ||
corpus_locale = settings.CORPORA_LOCALES[corpus_name] | ||
if corpus_locale: | ||
locale.setlocale(locale.LC_ALL, corpus_locale) | ||
except: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a catch-all except
(i.e. it's not checking for specific error types); using pass
as the body means that the script will not only move past the error but it will be completely undetectable. This is a big no-no, as it can create hidden bugs. In this case, if an unexpected error occurs during the indexing setup, you want to know what happened (and you want the script to break so you can fix it first).
In most cases, a try/except block should look for specific errors that are expected under the circumstances. If it's a handler for any kind of error (usually because you're evaluating code that is not under the control of the current script, e.g. imported code, user code, etc.), the program should communicate what happened; usually by writing to the log or stderr.
In this case, it looks the block is meant to catch a AttributeError
or KeyError
, so a more specific except
could be used. Both of those errors could also be avoided by providing appropriate fallbacks with getattr
/ dict.get
during the lookup on line 101; so you would not need an except clause at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really agree with this solution. If locale switching is implemented in the application-wide indexing procedure, it increases the complexity of how indexing works generally; unless it's a common problem, it would be better to solve this problem in the corpus class, to keep the application more modular.
In this case, I expect it's possible to set and reset the locale in the documents()
method of the corpus class instead. Something like this:
def documents(self, sources=None):
#set locale
for doc in super().documents(sources):
yield doc
# reset locale
If it is is necessary to do this here, please make sure it's properly documented:
- This change adds a new setting, which should be documented in the settings documentation
- If this option is useful for other corpora, it should be included in the documentation on writing Python corpora
- There should be a unit test (or several) for the new functionality. As it is, it's quite likely that the functionality would be inadvertently broken or deleted at some point, since it will serve no apparent function; that is, no test will break if you remove it.
backend/corpora/ublad/ublad.py
Outdated
scan_image_type = getattr(settings, 'UBLAD_SCAN_IMAGE_TYPE', 'image/jpeg') | ||
allow_image_download = getattr(settings, 'UBLAD_ALLOW_IMAGE_DOWNLOAD', True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the values make sense, but you could just write the value directly, i.e.
scan_image_type = 'image/jpeg'
allow_image_download = True
This definition allows the values to be configured in settings
, but why? That would make sense if, for instance, image download would need to be disabled in some environments but not in others.
remove locale switching from global environment PR feedback
add skip logic for ublad test install language pack in test update test settings
d7ce070
to
78d64bc
Compare
Adds the U-Blad corpus. This corpus is already indexed and ready to use on the test server! It should not be a lot of work to review this since it is not adding any new functionality. Potentially interesting note is that in this corpus definition the
soup_transform_func
is used not to just extract some text from a node but instead is used to format the strings inside of the node, which could potentially be later expanded with styling features, as hocr contains stylistic classes such as bold/italic etc and font size.