Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

U-Blad Corpus #1590

Merged
merged 19 commits into from
Jun 19, 2024
Merged

U-Blad Corpus #1590

merged 19 commits into from
Jun 19, 2024

Conversation

Meesch
Copy link
Contributor

@Meesch Meesch commented May 28, 2024

Adds the U-Blad corpus. This corpus is already indexed and ready to use on the test server! It should not be a lot of work to review this since it is not adding any new functionality. Potentially interesting note is that in this corpus definition the soup_transform_func is used not to just extract some text from a node but instead is used to format the strings inside of the node, which could potentially be later expanded with styling features, as hocr contains stylistic classes such as bold/italic etc and font size.

Copy link
Contributor

@lukavdplas lukavdplas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

A few fields are missing an appropriate mapping and I'm not sure about the locale setting.

Comment on lines 20 to 21
import locale
locale.setlocale(locale.LC_ALL, 'nl_NL.UTF-8')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the corpus is imported when starting up the server, this may have unintended side effects. I would recommend restoring the locale setting when it's no longer needed.

It's not clear from the code when that would be, by the way; please document what this setting is intended to achieve.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Luka, this should not live in runtime code. My guess it is used in indexing?

Copy link
Contributor Author

@Meesch Meesch May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used for the indexing of the date; where would I put it then? Or what should I restore the locale to exactly?

Comment on lines 23 to 43
def transform_content(soup):
"""
Transforms the text contents of a page node (soup) into a string consisting
of blocks of text, foregoing the column structure of the OCR'ed material.
"""
page_text = ""
for child in soup.children:
if isinstance(child, Tag) and 'ocr_carea' in child.get('class', []):
paragraph_text = ""
paragraph_list = child.get_text().split('\n')
for item in paragraph_list[1:]:
if not item:
pass
elif item.endswith('-'):
paragraph_text += item.strip('-')
else:
paragraph_text += item + ' '
if paragraph_text:
page_text += paragraph_text + '\n\n'
page_node = BeautifulSoup(page_text, 'html.parser')
return page_node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat!

Since the return value is just a node with the string content you want, why not use extract_soup_func to return the string directly?

Comment on lines 128 to 149
FieldDefinition(
name='volume_id',
display_name='Volume ID',
description='Unique identifier for this volume',
hidden=True,
extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'identifier_ocn'
}
)
),
FieldDefinition(
name='id',
display_name='Page ID',
description='Unique identifier for this page',
hidden=True,
extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'identifier_indexid'
}
)
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These fields are missing es_mapping=keyword_mapping()

Comment on lines 150 to 160
FieldDefinition(
name='edition',
display_name='Edition',
description='The number of the edition in this volume. Every year starts at 1.',
sortable=True,
extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'aflevering'
}
)
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

es_mapping=int_mapping(), I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keyword_mapping cause sometimes it is a phrase instead of a number

Comment on lines 56 to 57
scan_image_type = getattr(settings, 'UBLAD_SCAN_IMAGE_TYPE', 'image/jpeg')
allow_image_download = getattr(settings, 'UBLAD_ALLOW_IMAGE_DOWNLOAD', True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do these attributes need to be configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scan_image_type is used in GetMediaView to open the image type correctly, allow_image_download is used in the frontend to allow downlaods and the default is false. Or am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the values make sense, but you could just write the value directly, i.e.

scan_image_type = 'image/jpeg'
allow_image_download = True

This definition allows the values to be configured in settings, but why? That would make sense if, for instance, image download would need to be disabled in some environments but not in others.

Copy link
Contributor

@JeltevanBoheemen JeltevanBoheemen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few typos/language use suggestions. See Luka's review for a few code issues that should be resolved. Good work, nice that it's (almost) done!

backend/corpora/ublad/description/ublad.md Outdated Show resolved Hide resolved
backend/corpora/ublad/description/ublad.md Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leuke afbeelding, nice.

Comment on lines 20 to 21
import locale
locale.setlocale(locale.LC_ALL, 'nl_NL.UTF-8')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Luka, this should not live in runtime code. My guess it is used in indexing?

@Meesch
Copy link
Contributor Author

Meesch commented May 31, 2024

I think I have addressed all your comments, see if you agree with my solutions. No worries if this does not make it into the next release!

@Meesch Meesch requested a review from lukavdplas May 31, 2024 12:20
Comment on lines 100 to 105
try:
corpus_locale = settings.CORPORA_LOCALES[corpus_name]
if corpus_locale:
locale.setlocale(locale.LC_ALL, corpus_locale)
except:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a catch-all except (i.e. it's not checking for specific error types); using pass as the body means that the script will not only move past the error but it will be completely undetectable. This is a big no-no, as it can create hidden bugs. In this case, if an unexpected error occurs during the indexing setup, you want to know what happened (and you want the script to break so you can fix it first).

In most cases, a try/except block should look for specific errors that are expected under the circumstances. If it's a handler for any kind of error (usually because you're evaluating code that is not under the control of the current script, e.g. imported code, user code, etc.), the program should communicate what happened; usually by writing to the log or stderr.

In this case, it looks the block is meant to catch a AttributeError or KeyError, so a more specific except could be used. Both of those errors could also be avoided by providing appropriate fallbacks with getattr / dict.get during the lookup on line 101; so you would not need an except clause at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really agree with this solution. If locale switching is implemented in the application-wide indexing procedure, it increases the complexity of how indexing works generally; unless it's a common problem, it would be better to solve this problem in the corpus class, to keep the application more modular.

In this case, I expect it's possible to set and reset the locale in the documents() method of the corpus class instead. Something like this:

def documents(self, sources=None):
    #set locale
    for doc in super().documents(sources):
        yield doc
    # reset locale

If it is is necessary to do this here, please make sure it's properly documented:

  • This change adds a new setting, which should be documented in the settings documentation
  • If this option is useful for other corpora, it should be included in the documentation on writing Python corpora
  • There should be a unit test (or several) for the new functionality. As it is, it's quite likely that the functionality would be inadvertently broken or deleted at some point, since it will serve no apparent function; that is, no test will break if you remove it.

Comment on lines 56 to 57
scan_image_type = getattr(settings, 'UBLAD_SCAN_IMAGE_TYPE', 'image/jpeg')
allow_image_download = getattr(settings, 'UBLAD_ALLOW_IMAGE_DOWNLOAD', True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the values make sense, but you could just write the value directly, i.e.

scan_image_type = 'image/jpeg'
allow_image_download = True

This definition allows the values to be configured in settings, but why? That would make sense if, for instance, image download would need to be disabled in some environments but not in others.

Meesch added 2 commits June 18, 2024 14:46
remove locale switching from global environment

PR feedback
add skip logic for ublad test

install language pack in test

update test settings
@Meesch Meesch force-pushed the feature/ublad-corpus branch from d7ce070 to 78d64bc Compare June 18, 2024 13:58
@Meesch Meesch merged commit 97d5bcd into develop Jun 19, 2024
2 checks passed
@Meesch Meesch deleted the feature/ublad-corpus branch June 19, 2024 09:07
@lukavdplas lukavdplas mentioned this pull request Jul 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants