U-Blad Corpus #1590

Meesch · 2024-05-28T13:52:00Z

Adds the U-Blad corpus. The corpus is already indexed and ready to use on the test server! Reviewing this should not be much work, since it adds no new functionality. One potentially interesting note: in this corpus definition, soup_transform_func is used not just to extract text from a node but to format the strings inside the node. This could later be expanded with styling features, since hOCR contains stylistic classes such as bold/italic and font size.

lukavdplas

Nice!

A few fields are missing an appropriate mapping and I'm not sure about the locale setting.

lukavdplas · 2024-05-28T14:14:02Z

backend/corpora/ublad/ublad.py

+import locale
+locale.setlocale(locale.LC_ALL, 'nl_NL.UTF-8')


Since the corpus is imported when starting up the server, this may have unintended side effects. I would recommend restoring the locale setting when it's no longer needed.

It's not clear from the code when that would be, by the way; please document what this setting is intended to achieve.

I agree with Luka, this should not live in runtime code. My guess is that it is used in indexing?

It is used for the indexing of the date; where would I put it then? Or what should I restore the locale to exactly?

lukavdplas · 2024-05-28T14:16:35Z

backend/corpora/ublad/ublad.py

+def transform_content(soup):
+    """
+    Transforms the text contents of a page node (soup) into a string consisting
+    of blocks of text, foregoing the column structure of the OCR'ed material.
+    """
+    page_text = ""
+    for child in soup.children:
+        if isinstance(child, Tag) and 'ocr_carea' in child.get('class', []):
+            paragraph_text = ""
+            paragraph_list = child.get_text().split('\n')
+            for item in paragraph_list[1:]:
+                if not item:
+                    pass
+                elif item.endswith('-'):
+                    paragraph_text += item.strip('-')
+                else:
+                    paragraph_text += item + ' '
+            if paragraph_text:
+                page_text += paragraph_text + '\n\n'
+    page_node = BeautifulSoup(page_text, 'html.parser')
+    return page_node


Neat!

Since the return value is just a node with the string content you want, why not use extract_soup_func to return the string directly?
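A minimal sketch of that suggestion, not the code that was merged: the line-joining logic can build and return a plain string, which is what a string-returning extraction callback would hand back. To stay runnable without BeautifulSoup, the hypothetical helpers below (join_ocr_lines, join_blocks) take the text of each ocr_carea block as a string. Unlike the quoted code, the sketch skips empty lines rather than dropping the first item, and removes only the trailing hyphen.

```python
def join_ocr_lines(block_text):
    """Join the lines of one OCR text block into a single paragraph string."""
    parts = []
    for line in block_text.split('\n'):
        if not line:
            continue  # ignore empty lines left over from the markup
        if line.endswith('-'):
            parts.append(line[:-1])  # word broken across lines: drop the hyphen
        else:
            parts.append(line + ' ')
    return ''.join(parts)


def join_blocks(block_texts):
    """Return the page text as a plain string, one paragraph per non-empty block."""
    paragraphs = (join_ocr_lines(text) for text in block_texts)
    return ''.join(paragraph + '\n\n' for paragraph in paragraphs if paragraph)
```

In the corpus definition, the block texts would presumably come from child.get_text() for each ocr_carea child of the page node, so no second BeautifulSoup object needs to be constructed.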

lukavdplas · 2024-05-28T14:23:59Z

backend/corpora/ublad/ublad.py

+        FieldDefinition(
+            name='volume_id',
+            display_name='Volume ID',
+            description='Unique identifier for this volume',
+            hidden=True,
+            extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={
+                'attribute': 'name',
+                'value': 'identifier_ocn'
+                }
+            )
+        ),
+        FieldDefinition(
+            name='id',
+            display_name='Page ID',
+            description='Unique identifier for this page',
+            hidden=True,
+            extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={
+                'attribute': 'name',
+                'value': 'identifier_indexid'
+                }
+            )
+        ),


These fields are missing es_mapping=keyword_mapping()

lukavdplas · 2024-05-28T14:24:39Z

backend/corpora/ublad/ublad.py

+        FieldDefinition(
+            name='edition',
+            display_name='Edition',
+            description='The number of the edition in this volume. Every year starts at 1.',
+            sortable=True,
+            extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={
+                'attribute': 'name',
+                'value': 'aflevering'
+                }
+            )
+        ),


es_mapping=int_mapping(), I think?

keyword_mapping, because sometimes it is a phrase instead of a number
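To illustrate the reasoning in this exchange, a minimal sketch: a non-numeric phrase cannot be indexed into an Elasticsearch integer field, while a keyword field stores the exact string. The helper choose_mapping and the sample values are hypothetical; the dicts are plain Elasticsearch field mappings, written out here instead of calling the project's own mapping helpers.

```python
def choose_mapping(values):
    """Pick an integer mapping only if every sample value parses as an int."""
    try:
        for value in values:
            int(value)
    except ValueError:
        # At least one value is a phrase, so store the exact strings instead.
        return {'type': 'keyword'}
    return {'type': 'integer'}
```

With edition values that are sometimes phrases, this check falls back to the keyword mapping, which matches the choice made in the reply above.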

lukavdplas · 2024-05-28T14:26:47Z

backend/corpora/ublad/ublad.py

+    scan_image_type = getattr(settings, 'UBLAD_SCAN_IMAGE_TYPE', 'image/jpeg')
+    allow_image_download = getattr(settings, 'UBLAD_ALLOW_IMAGE_DOWNLOAD', True)


Why do these attributes need to be configurable?

scan_image_type is used in GetMediaView to open the image type correctly; allow_image_download is used in the frontend to allow downloads, and the default is false. Or am I missing something?

Yes, the values make sense, but you could just write the value directly, i.e.

    scan_image_type = 'image/jpeg'
    allow_image_download = True

This definition allows the values to be configured in settings, but why? That would make sense if, for instance, image download would need to be disabled in some environments but not in others.

JeltevanBoheemen

A few typo and language-use suggestions. See Luka's review for a few code issues that should be resolved. Good work, nice that it's (almost) done!

backend/corpora/ublad/description/ublad.md

JeltevanBoheemen · 2024-05-30T11:31:30Z

backend/corpora/ublad/images/ublad.jpg

Fun image, nice.

JeltevanBoheemen · 2024-05-30T11:32:06Z

backend/corpora/ublad/ublad.py

+import locale
+locale.setlocale(locale.LC_ALL, 'nl_NL.UTF-8')


I agree with Luka, this should not live in runtime code. My guess is that it is used in indexing?

Meesch · 2024-05-31T12:20:18Z

I think I have addressed all your comments, see if you agree with my solutions. No worries if this does not make it into the next release!

lukavdplas · 2024-06-04T12:16:54Z

backend/es/es_index.py

+    try:
+        corpus_locale = settings.CORPORA_LOCALES[corpus_name]
+        if corpus_locale:
+            locale.setlocale(locale.LC_ALL, corpus_locale)
+    except:
+        pass


This is a catch-all except (i.e. it's not checking for specific error types); using pass as the body means that the script will not only move past the error but it will be completely undetectable. This is a big no-no, as it can create hidden bugs. In this case, if an unexpected error occurs during the indexing setup, you want to know what happened (and you want the script to break so you can fix it first).

In most cases, a try/except block should look for specific errors that are expected under the circumstances. If it's a handler for any kind of error (usually because you're evaluating code that is not under the control of the current script, e.g. imported code, user code, etc.), the program should communicate what happened; usually by writing to the log or stderr.

In this case, it looks like the block is meant to catch an AttributeError or KeyError, so a more specific except could be used. Both of those errors could also be avoided by providing appropriate fallbacks with getattr / dict.get during the lookup on line 101, so you would not need an except clause at all.
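A minimal sketch of the fallback approach described above, assuming the setting is a dict keyed by corpus name as in the quoted diff. The function name lookup_corpus_locale is hypothetical, and SimpleNamespace objects stand in for the Django settings module.

```python
from types import SimpleNamespace


def lookup_corpus_locale(settings, corpus_name):
    """Return the configured locale for a corpus, or None if there is none."""
    # getattr supplies an empty dict when the setting does not exist,
    # so no AttributeError can be raised here.
    locales = getattr(settings, 'CORPORA_LOCALES', {})
    # dict.get returns None when the corpus has no entry, so no KeyError either.
    return locales.get(corpus_name)


# Stand-ins for a configured and an unconfigured settings module.
configured = SimpleNamespace(CORPORA_LOCALES={'ublad': 'nl_NL.UTF-8'})
unconfigured = SimpleNamespace()
```

The caller can then test the result and call locale.setlocale only when a locale was found. Any error raised by setlocale itself, such as an uninstalled locale, is no longer swallowed and will surface during indexing.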

lukavdplas · 2024-06-04T12:47:24Z

backend/es/es_index.py

I don't really agree with this solution. If locale switching is implemented in the application-wide indexing procedure, it increases the complexity of how indexing works generally; unless it's a common problem, it would be better to solve this problem in the corpus class, to keep the application more modular.

In this case, I expect it's possible to set and reset the locale in the documents() method of the corpus class instead. Something like this:

    def documents(self, sources=None):
        # set locale
        for doc in super().documents(sources):
            yield doc
        # reset locale

If it is necessary to do this here, please make sure it's properly documented:

This change adds a new setting, which should be documented in the settings documentation

If this option is useful for other corpora, it should be included in the documentation on writing Python corpora

There should be a unit test (or several) for the new functionality. As it is, it's quite likely that the functionality would be inadvertently broken or deleted at some point, since it will serve no apparent function; that is, no test will break if you remove it.
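The set-and-reset idea above can be sketched with a context manager. This is an illustration under stated assumptions, not the merged implementation: temporary_locale, BaseCorpusSketch and UBladSketch are hypothetical names, the sketch switches only LC_TIME (on the assumption that date parsing is the only locale-dependent step), and it uses the 'C' locale so that it runs on any machine.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_locale(name, category=locale.LC_TIME):
    """Switch one locale category and restore the previous value on exit."""
    previous = locale.setlocale(category)  # no second argument: query only
    locale.setlocale(category, name)
    try:
        yield
    finally:
        locale.setlocale(category, previous)


class BaseCorpusSketch:
    """Hypothetical stand-in for the parent corpus class."""

    def documents(self, sources=None):
        for source in sources or []:
            yield {'source': source}


class UBladSketch(BaseCorpusSketch):
    corpus_locale = 'C'  # placeholder; the PR used 'nl_NL.UTF-8'

    def documents(self, sources=None):
        # The locale is restored once the generator finishes or is closed.
        with temporary_locale(self.corpus_locale):
            yield from super().documents(sources)
```

One caveat of wrapping a generator this way: the locale stays switched while the generator is suspended between documents, so code consuming the documents also runs under it. Scoping the switch to the date parsing alone, as the later commit "change locales within date_transform method" suggests, limits that exposure.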

lukavdplas · 2024-06-04T12:52:44Z

backend/corpora/ublad/ublad.py

+    scan_image_type = getattr(settings, 'UBLAD_SCAN_IMAGE_TYPE', 'image/jpeg')
+    allow_image_download = getattr(settings, 'UBLAD_ALLOW_IMAGE_DOWNLOAD', True)


Yes, the values make sense, but you could just write the value directly, i.e.

    scan_image_type = 'image/jpeg'
    allow_image_download = True

This definition allows the values to be configured in settings, but why? That would make sense if, for instance, image download would need to be disabled in some environments but not in others.

Meesch added 10 commits March 19, 2024 16:21

addition preliminary ublad corpus

c4a35c2

add document context

eac1369

add year variable to ublad

e4d4df0

add ublad description

6c5e023

Merge branch 'develop' into feature/ublad-corpus

2478951

remove citation page from ublad

cd29f43

rename id field for ublad

0dc2659

Merge branch 'develop' into feature/ublad-corpus

136d325

ublad cleanup and tweaks

633be4d

remove the Year field from ublad; add content_transform func for ublad content; add display_type for urls to Ublad; add csv_core fields for Ublad

skip snapshots for ublad corpus

c4e1e08

lukavdplas requested changes May 28, 2024

JeltevanBoheemen reviewed May 30, 2024

Meesch and others added 6 commits May 31, 2024 12:38

move setlocale to the index script

c42aa8c

Update backend/corpora/ublad/description/ublad.md

1850050

Co-authored-by: Jelte van Boheemen <[email protected]>

Update backend/corpora/ublad/description/ublad.md

40a6ca7

Co-authored-by: Jelte van Boheemen <[email protected]>

use extract instead of transform_soup_func for ublad

4432df9

add es_mappings

fd1d4df

Merge branch 'feature/ublad-corpus' of github.com:UUDigitalHumanitieslab/I-analyzer into feature/ublad-corpus

9dff0d8

Meesch requested a review from lukavdplas May 31, 2024 12:20

lukavdplas requested changes Jun 4, 2024

Meesch added 2 commits June 18, 2024 14:46

change locales within date_transform method

212ed79

remove locale switching from global environment; PR feedback

add test for transform_date

78d64bc

add skip logic for ublad test; install language pack in test; update test settings

Meesch force-pushed the feature/ublad-corpus branch from d7ce070 to 78d64bc Compare June 18, 2024 13:58

lukavdplas approved these changes Jun 18, 2024

Merge branch 'develop' into feature/ublad-corpus

4333b44

Meesch merged commit 97d5bcd into develop Jun 19, 2024
2 checks passed

Meesch deleted the feature/ublad-corpus branch June 19, 2024 09:07

lukavdplas mentioned this pull request Jul 18, 2024

Impure locale setting in U-blad? #1634

Closed

lukavdplas mentioned this pull request Jul 30, 2024

New corpus: u-blad #1323

Closed
