U-Blad Corpus #1590

Merged Jun 19, 2024 · 19 commits
Changes from 10 commits
7 changes: 7 additions & 0 deletions backend/corpora/ublad/description/ublad.md
@@ -0,0 +1,7 @@
On 5 September 1969, Utrecht University got an independent paper for the first time: _U utrechtse universitaire reflexen_. It grew out of a merger of two other periodicals: _Sol Iustitiae_, which was aimed mainly at students, and _Solaire Reflexen_, which was intended more for staff. U utrechtse universitaire reflexen was meant for all parts of the university community.

In 1974 the name changed to the _Ublad_. That remained the case until the university decided to take the printed Ublad digital. Amid loud protest the printed Ublad disappeared, and in April 2010 _DUB_, the digital university paper, took its place.

To make all of this historical information accessible, we scanned the old volumes together with the University Library. In I-analyzer you can find and search all volumes of U utrechtse universitaire reflexen and the Ublad.

The independent Ublad gives a colourful account of what was going on at the university, in the city, and in student life, through articles, photos and cartoons. The image that was used for OCR is included for every page, so that you can always consult the original source material.
Binary file added backend/corpora/ublad/images/ublad.jpg
Contributor
Nice image, nice.

253 changes: 253 additions & 0 deletions backend/corpora/ublad/ublad.py
@@ -0,0 +1,253 @@
from datetime import datetime
import os
from os.path import join, splitext


from django.conf import settings
from addcorpus.python_corpora.corpus import HTMLCorpusDefinition, FieldDefinition
from addcorpus.python_corpora.extract import FilterAttribute
from addcorpus.es_mappings import *
from addcorpus.python_corpora.filters import DateFilter
from addcorpus.es_settings import es_settings


from ianalyzer_readers.readers.html import HTMLReader
from ianalyzer_readers.readers.core import Field
from ianalyzer_readers.extract import html, Constant

from bs4 import BeautifulSoup, Tag

import locale
locale.setlocale(locale.LC_ALL, 'nl_NL.UTF-8')
Contributor
Since the corpus is imported when starting up the server, this may have unintended side effects. I would recommend restoring the locale setting when it's no longer needed.

It's not clear from the code when that would be, by the way; please document what this setting is intended to achieve.

Contributor
I agree with Luka, this should not live in runtime code. My guess is that it is used during indexing?

Contributor Author
@Meesch May 30, 2024

It is used for the indexing of the date; where would I put it then? Or what should I restore the locale to exactly?
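A minimal sketch of what that could look like (not part of this PR; the helper names are illustrative, and it assumes the locale is only needed so strptime can parse the Dutch month names in the date metadata): wrap the setting in a context manager that restores the previous LC_TIME, and use it inside the date transform instead of calling locale.setlocale at import time. Note that setlocale is still process-wide, so this only narrows the window in which the Dutch locale is active.

import locale
from contextlib import contextmanager
from datetime import datetime

@contextmanager
def dutch_dates():
    # remember the current LC_TIME setting, switch to Dutch, restore on exit
    saved = locale.setlocale(locale.LC_TIME)
    try:
        locale.setlocale(locale.LC_TIME, 'nl_NL.UTF-8')
        yield
    finally:
        locale.setlocale(locale.LC_TIME, saved)

def transform_date(datestring):
    # e.g. '5 september 1969' -> '1969-09-05'
    with dutch_dates():
        return datetime.strptime(datestring, '%d %B %Y').strftime('%Y-%m-%d')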


def transform_content(soup):
"""
Transforms the text contents of a page node (soup) into a string consisting
of blocks of text, foregoing the column structure of the OCR'ed material.
"""
page_text = ""
for child in soup.children:
if isinstance(child, Tag) and 'ocr_carea' in child.get('class', []):
paragraph_text = ""
paragraph_list = child.get_text().split('\n')
for item in paragraph_list[1:]:
if not item:
pass
elif item.endswith('-'):
paragraph_text += item.strip('-')
else:
paragraph_text += item + ' '
if paragraph_text:
page_text += paragraph_text + '\n\n'
page_node = BeautifulSoup(page_text, 'html.parser')
return page_node
Contributor
Neat!

Since the return value is just a node with the string content you want, why not use extract_soup_func to return the string directly?
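For illustration, a hedged sketch of that suggestion: the same logic as transform_content, but returning the string directly, so it could be passed to a hook such as extract_soup_func (the hook name is taken from the comment above and not verified here; extract_content is an illustrative name).

def extract_content(soup):
    """Return the OCR'ed page text as plain paragraph blocks (a string)."""
    blocks = []
    for child in soup.children:
        if isinstance(child, Tag) and 'ocr_carea' in child.get('class', []):
            paragraph = ''
            for line in child.get_text().split('\n')[1:]:
                if not line:
                    continue
                elif line.endswith('-'):
                    # join words hyphenated across a line break
                    paragraph += line.rstrip('-')
                else:
                    paragraph += line + ' '
            if paragraph:
                blocks.append(paragraph)
    return '\n\n'.join(blocks)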



class UBlad(HTMLCorpusDefinition):
title = 'U-Blad'
description = 'The print editions of the Utrecht University paper from 1969 until 2010.'
description_page = 'ublad.md'
min_date = datetime(year=1969, month=1, day=1)
max_date = datetime(year=2010, month=12, day=31)

data_directory = settings.UBLAD_DATA
es_index = getattr(settings, 'UBLAD_ES_INDEX', 'ublad')
image = 'ublad.jpg'
scan_image_type = getattr(settings, 'UBLAD_SCAN_IMAGE_TYPE', 'image/jpeg')
allow_image_download = getattr(settings, 'UBLAD_ALLOW_IMAGE_DOWNLOAD', True)
Contributor
Why do these attributes need to be configurable?

Contributor Author
scan_image_type is used in GetMediaView to open the image type correctly; allow_image_download is used in the frontend to allow downloads, and the default is false. Or am I missing something?

Contributor
Yes, the values make sense, but you could just write the value directly, i.e.

scan_image_type = 'image/jpeg'
allow_image_download = True

This definition allows the values to be configured in settings, but why? That would make sense if, for instance, image download would need to be disabled in some environments but not in others.


document_context = {
'context_fields': ['volume_id'],
'sort_field': 'sequence',
'sort_direction': 'asc',
'context_display_name': 'volume'
}

languages = ['nl']
category = 'periodical'

@property
def es_settings(self):
return es_settings(self.languages[:1], stopword_analysis=True, stemming_analysis=True)

    def sources(self, start=min_date, end=max_date):
        for directory, subdirs, filenames in os.walk(self.data_directory):
            # do not descend into .snapshot directories; skip the directory containing them as well
            if '.snapshot' in subdirs:
                subdirs.remove('.snapshot')
                continue
            for filename in filenames:
                if filename != '.DS_Store':
                    full_path = join(directory, filename)
                    yield full_path, {'filename': filename}


fields = [
FieldDefinition(
name = 'content',
display_name='Content',
display_type='text_content',
description='Text content of the page, generated by OCR',
results_overview=True,
csv_core=True,
search_field_core=True,
visualizations=['ngram', 'wordcloud'],
es_mapping = main_content_mapping(True, True, True, 'nl'),
extractor= FilterAttribute(tag='div',
recursive=True,
multiple=False,
flatten=False,
transform_soup_func=transform_content,
attribute_filter={
'attribute': 'class',
'value': 'ocr_page'
})
),
FieldDefinition(
name='pagenum',
display_name='Page number',
description='Page number',
csv_core=True,
es_mapping = int_mapping(),
extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'pagenum'
}
)
),
FieldDefinition(
name='journal_title',
display_name='Publication Title',
description='Title of the publication',
extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'journal_title'
}
)
),
FieldDefinition(
name='volume_id',
display_name='Volume ID',
description='Unique identifier for this volume',
hidden=True,
extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'identifier_ocn'
}
)
),
FieldDefinition(
name='id',
display_name='Page ID',
description='Unique identifier for this page',
hidden=True,
extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'identifier_indexid'
}
)
),
Contributor
These fields are missing es_mapping=keyword_mapping()
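For example, applied to the journal_title field above (the same change would apply to volume_id and id):

FieldDefinition(
    name='journal_title',
    display_name='Publication Title',
    description='Title of the publication',
    es_mapping=keyword_mapping(),
    extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
        'attribute': 'name',
        'value': 'journal_title'
    })
),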

FieldDefinition(
name='edition',
display_name='Edition',
description='The number of the edition in this volume. Every year starts at 1.',
sortable=True,
extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'aflevering'
}
)
),
Contributor
es_mapping=int_mapping(), I think?

Contributor Author
keyword_mapping, because sometimes it is a phrase instead of a number

FieldDefinition(
name='volume',
display_name='Volume',
sortable=True,
results_overview=True,
csv_core=True,
description='The volume number of this publication. There is one volume per year.',
extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'yearstring'
}
),
),
FieldDefinition(
name='date',
display_name='Date',
description='The publication date of this edition',
es_mapping={'type': 'date', 'format': 'yyyy-MM-dd'},
visualizations=['resultscount', 'termfrequency'],
sortable=True,
results_overview=True,
search_filter=DateFilter(
min_date,
max_date,
description=(
'Accept only articles with publication date in this range.'
)
),
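# the 'datestring' metadata uses Dutch month names (e.g. '5 september 1969'),
# which is why the nl_NL locale is set at module level above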
extractor = FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'datestring',
},
transform=lambda x: datetime.strptime(
x, '%d %B %Y').strftime('%Y-%m-%d')
)
),
FieldDefinition(
name='repo_url',
display_name='Repository URL',
description='URL to the DSpace repository entry of this volume',
es_mapping=keyword_mapping(),
display_type='url',
searchable=False,
extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'link_repository'
}
)
),
FieldDefinition(
name='reader_url',
display_name='Reader URL',
description='URL to the UB reader view of this page',
es_mapping=keyword_mapping(),
display_type='url',
searchable=False,
extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'link_objects_image'
}
)
),
FieldDefinition(
name='jpg_url',
display_name='Image URL',
description='URL to the jpg file of this page',
es_mapping=keyword_mapping(),
display_type='url',
searchable=False,
extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'link_objects_jpg'
}
)
),
FieldDefinition(
name='worldcat_url',
display_name='Worldcat URL',
description='URL to the Worldcat entry of this volume',
es_mapping=keyword_mapping(),
display_type='url',
searchable=False,
extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
'attribute': 'name',
'value': 'link_worldcat'
}
)
)
]

def request_media(self, document, corpus_name):
image_list = [document['fieldValues']['jpg_url']]
return {'media': image_list}