U-Blad Corpus #1590

Merged: 19 commits, Jun 19, 2024
7 changes: 7 additions & 0 deletions backend/corpora/ublad/description/ublad.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
On 5 September 1969, Utrecht University got its first independent paper: _U utrechtse universitaire reflexen_. It emerged from a merger of two other periodicals: _Sol Iustitiae_, which was aimed mainly at students, and _Solaire Reflexen_, which was intended more for staff. U utrechtse universitaire reflexen was meant for all parts of the university community.

In 1974 the name changed to the _Ublad_. That remained the case until the university decided to take the printed Ublad digital. Amid loud protest the printed Ublad disappeared, and in April 2010 _DUB_, the digital university paper, was born.

To make all of this historical material accessible, the Centre for Digital Humanities, together with the University Library, digitised the old volumes. In I-analyzer you can find and search all volumes of U utrechtse universitaire reflexen and the Ublad.

Through articles, photos and cartoons, the independent Ublad gives a colourful account of what was going on at the university, in the city and in student life. The image used for OCR is attached for every page, so you can always consult the original source material.
Binary file added backend/corpora/ublad/images/ublad.jpg
14 changes: 14 additions & 0 deletions backend/corpora/ublad/tests/test_ublad.py
@@ -0,0 +1,14 @@
import locale
import pytest
from corpora.ublad.ublad import transform_date
import datetime


def test_transform_date():
    datestring = '6 september 2007'
    goal_date = datetime.date(2007, 9, 6)
    try:
        date = transform_date(datestring)
    except locale.Error:
        pytest.skip('Dutch locale not installed in environment')
    assert date == str(goal_date)
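The test above has to skip when the Dutch locale is unavailable. A locale-independent alternative is possible; this `parse_dutch_date` sketch (a hypothetical helper, not part of the PR) maps Dutch month names directly and needs no system locale:

```python
import datetime

# Dutch month names, lowercased, mapped to month numbers
DUTCH_MONTHS = {
    'januari': 1, 'februari': 2, 'maart': 3, 'april': 4,
    'mei': 5, 'juni': 6, 'juli': 7, 'augustus': 8,
    'september': 9, 'oktober': 10, 'november': 11, 'december': 12,
}

def parse_dutch_date(date_string):
    """Parse a date like '6 september 2007' without touching the system locale."""
    day, month_name, year = date_string.split()
    month = DUTCH_MONTHS[month_name.lower()]
    return datetime.date(int(year), month, int(day)).isoformat()
```

Such a helper would also make the skip in the unit test unnecessary, since it behaves the same on every machine.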
264 changes: 264 additions & 0 deletions backend/corpora/ublad/ublad.py
@@ -0,0 +1,264 @@
from datetime import datetime
import os
from os.path import join, splitext
import locale
import logging

from django.conf import settings
from addcorpus.python_corpora.corpus import HTMLCorpusDefinition, FieldDefinition
from addcorpus.python_corpora.extract import FilterAttribute
from addcorpus.es_mappings import *
from addcorpus.python_corpora.filters import DateFilter
from addcorpus.es_settings import es_settings


from ianalyzer_readers.readers.html import HTMLReader
from ianalyzer_readers.readers.core import Field
from ianalyzer_readers.extract import html, Constant

from bs4 import BeautifulSoup, Tag

def transform_content(soup):
    """
    Transform the text contents of a page node (soup) into a string consisting
    of blocks of text, foregoing the column structure of the OCR'ed material.
    """
    page_text = ""
    for child in soup.children:
        if isinstance(child, Tag) and 'ocr_carea' in child.get('class', []):
            paragraph_text = ""
            paragraph_list = child.get_text().split('\n')
            for item in paragraph_list[1:]:
                if not item:
                    continue
                elif item.endswith('-'):
                    # rejoin a word that was hyphenated across a line break
                    paragraph_text += item.rstrip('-')
                else:
                    paragraph_text += item + ' '
            if paragraph_text:
                page_text += paragraph_text + '\n\n'
    return page_text
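The line-joining rule inside transform_content can be illustrated in isolation. This `join_ocr_lines` helper (a hypothetical name, sketching the same dehyphenation logic without BeautifulSoup) shows how lines are merged and hyphenated words glued back together:

```python
def join_ocr_lines(lines):
    """Join OCR'ed lines into one block of text, rejoining words
    that were hyphenated across line breaks."""
    text = ""
    for item in lines:
        if not item:
            continue  # skip empty lines
        elif item.endswith('-'):
            text += item.rstrip('-')  # glue the split word back together
        else:
            text += item + ' '
    return text.strip()
```

For example, `join_ocr_lines(['Dit is de univer-', 'siteit Utrecht'])` yields `'Dit is de universiteit Utrecht'`.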

def transform_date(date_string):
    try:
        locale.setlocale(locale.LC_ALL, 'nl_NL.UTF-8')
        try:
            return datetime.strptime(date_string, '%d %B %Y').strftime('%Y-%m-%d')
        finally:
            # restore the default locale even if parsing fails
            locale.setlocale(locale.LC_ALL, '')
    except ValueError:
        logger.error("Unable to get date from {}".format(date_string))
        return None


logger = logging.getLogger('indexing')

class UBlad(HTMLCorpusDefinition):
    title = 'U-Blad'
    description = 'The print editions of the Utrecht University paper from 1969 until 2010.'
    description_page = 'ublad.md'
    min_date = datetime(year=1969, month=1, day=1)
    max_date = datetime(year=2010, month=12, day=31)

    data_directory = settings.UBLAD_DATA
    es_index = getattr(settings, 'UBLAD_ES_INDEX', 'ublad')
    image = 'ublad.jpg'
    scan_image_type = 'image/jpeg'
    allow_image_download = True

    document_context = {
        'context_fields': ['volume_id'],
        'sort_field': 'sequence',
        'sort_direction': 'asc',
        'context_display_name': 'volume'
    }

    languages = ['nl']
    category = 'periodical'

    @property
    def es_settings(self):
        return es_settings(self.languages[:1], stopword_analysis=True, stemming_analysis=True)

    def sources(self, start=min_date, end=max_date):
        for directory, subdirs, filenames in os.walk(self.data_directory):
            # prune snapshot directories so os.walk does not descend into them
            if '.snapshot' in subdirs:
                subdirs.remove('.snapshot')
                continue
            for filename in filenames:
                if filename != '.DS_Store':
                    full_path = join(directory, filename)
                    yield full_path, {'filename': filename}

    fields = [
        FieldDefinition(
            name='content',
            display_name='Content',
            display_type='text_content',
            description='Text content of the page, generated by OCR',
            results_overview=True,
            csv_core=True,
            search_field_core=True,
            visualizations=['ngram', 'wordcloud'],
            es_mapping=main_content_mapping(True, True, True, 'nl'),
            extractor=FilterAttribute(
                tag='div',
                recursive=True,
                multiple=False,
                flatten=False,
                extract_soup_func=transform_content,
                attribute_filter={
                    'attribute': 'class',
                    'value': 'ocr_page'
                }
            )
        ),
        FieldDefinition(
            name='pagenum',
            display_name='Page number',
            description='Page number',
            csv_core=True,
            es_mapping=int_mapping(),
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'pagenum'
            })
        ),
        FieldDefinition(
            name='journal_title',
            display_name='Publication Title',
            description='Title of the publication',
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'journal_title'
            })
        ),
        FieldDefinition(
            name='volume_id',
            display_name='Volume ID',
            description='Unique identifier for this volume',
            hidden=True,
            es_mapping=keyword_mapping(),
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'identifier_ocn'
            })
        ),
        FieldDefinition(
            name='id',
            display_name='Page ID',
            description='Unique identifier for this page',
            hidden=True,
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'identifier_indexid'
            })
        ),
        FieldDefinition(
            name='edition',
            display_name='Edition',
            description='The number of the edition in this volume. Every year starts at 1.',
            sortable=True,
            es_mapping=keyword_mapping(),
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'aflevering'
            })
        ),
        FieldDefinition(
            name='volume',
            display_name='Volume',
            sortable=True,
            results_overview=True,
            csv_core=True,
            description='The volume number of this publication. There is one volume per year.',
            es_mapping=keyword_mapping(),
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'yearstring'
            })
        ),
        FieldDefinition(
            name='date',
            display_name='Date',
            description='The publication date of this edition',
            es_mapping={'type': 'date', 'format': 'yyyy-MM-dd'},
            visualizations=['resultscount', 'termfrequency'],
            sortable=True,
            results_overview=True,
            search_filter=DateFilter(
                min_date,
                max_date,
                description=(
                    'Accept only articles with publication date in this range.'
                )
            ),
            extractor=FilterAttribute(
                tag='meta', attribute='content',
                attribute_filter={
                    'attribute': 'name',
                    'value': 'datestring',
                },
                transform=transform_date
            )
        ),
        FieldDefinition(
            name='repo_url',
            display_name='Repository URL',
            description='URL to the DSpace repository entry of this volume',
            es_mapping=keyword_mapping(),
            display_type='url',
            searchable=False,
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'link_repository'
            })
        ),
        FieldDefinition(
            name='reader_url',
            display_name='Reader URL',
            description='URL to the UB reader view of this page',
            es_mapping=keyword_mapping(),
            display_type='url',
            searchable=False,
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'link_objects_image'
            })
        ),
        FieldDefinition(
            name='jpg_url',
            display_name='Image URL',
            description='URL to the jpg file of this page',
            es_mapping=keyword_mapping(),
            display_type='url',
            searchable=False,
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'link_objects_jpg'
            })
        ),
        FieldDefinition(
            name='worldcat_url',
            display_name='Worldcat URL',
            description='URL to the Worldcat entry of this volume',
            es_mapping=keyword_mapping(),
            display_type='url',
            searchable=False,
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'link_worldcat'
            })
        )
    ]

    def request_media(self, document, corpus_name):
        image_list = [document['fieldValues']['jpg_url']]
        return {'media': image_list}
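The many metadata fields above all use the same extractor pattern: read the `content` attribute of a `<meta>` tag with a given `name`. A stdlib-only sketch of that behavior (an assumption about what the FilterAttribute extractor does, not the library's actual code):

```python
from html.parser import HTMLParser

class MetaContentExtractor(HTMLParser):
    """Collect the content attribute of a <meta> tag with a given name."""
    def __init__(self, name):
        super().__init__()
        self.name = name
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            attributes = dict(attrs)
            if attributes.get('name') == self.name:
                self.value = attributes.get('content')

def extract_meta(html_doc, name):
    """Return the content of <meta name=...> in html_doc, or None."""
    parser = MetaContentExtractor(name)
    parser.feed(html_doc)
    return parser.value
```

For example, `extract_meta('<meta name="pagenum" content="3">', 'pagenum')` returns `'3'`; each FieldDefinition simply varies the `name` it filters on.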
1 change: 1 addition & 0 deletions backend/es/es_index.py
Contributor review comment on backend/es/es_index.py:
I don't really agree with this solution. If locale switching is implemented in the application-wide indexing procedure, it increases the complexity of how indexing works generally; unless it's a common problem, it would be better to solve this problem in the corpus class, to keep the application more modular.

In this case, I expect it's possible to set and reset the locale in the documents() method of the corpus class instead. Something like this:

def documents(self, sources=None):
    #set locale
    for doc in super().documents(sources):
        yield doc
    # reset locale

If it is necessary to do this here, please make sure it's properly documented:

  • This change adds a new setting, which should be documented in the settings documentation
  • If this option is useful for other corpora, it should be included in the documentation on writing Python corpora
  • There should be a unit test (or several) for the new functionality. As it is, it's quite likely that the functionality would be inadvertently broken or deleted at some point, since it will serve no apparent function; that is, no test will break if you remove it.
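The reviewer's set-and-reset suggestion can be sketched more concretely. This `override_locale` context manager (a hypothetical helper illustrating the suggested pattern, not code from the PR) restores the previous locale even if iteration fails partway:

```python
import locale
from contextlib import contextmanager

@contextmanager
def override_locale(category, new_locale):
    """Temporarily switch a locale category, restoring it afterwards."""
    saved = locale.setlocale(category)  # query the current setting
    try:
        yield locale.setlocale(category, new_locale)
    finally:
        locale.setlocale(category, saved)

# In a corpus class, documents() could then wrap the parent generator:
#
# def documents(self, sources=None):
#     with override_locale(locale.LC_TIME, 'nl_NL.UTF-8'):
#         yield from super().documents(sources)
```

Keeping the locale switch inside the corpus class this way avoids touching the application-wide indexing procedure at all.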

@@ -130,6 +130,7 @@ def populate(client: Elasticsearch, corpus: Corpus, start=None, end=None):

    corpus_server = settings.SERVERS[
        settings.CORPUS_SERVER_NAMES.get(corpus_name, 'default')]

    # Do bulk operation
    for success, info in es_helpers.streaming_bulk(
        client,
2 changes: 2 additions & 0 deletions backend/ianalyzer/settings.py
@@ -76,6 +76,8 @@

CORPUS_SERVER_NAMES = {}

CORPORA_LOCALES = {}

CORPORA = {}

WORDCLOUD_LIMIT = 1000
2 changes: 2 additions & 0 deletions backend/ianalyzer/settings_test.py
@@ -17,4 +17,6 @@ def test_corpus_path(*path):
TIMES_DATA = os.path.join(BASE_DIR, 'addcorpus', 'python_corpora', 'tests')
TIMES_ES_INDEX = 'times-test'

UBLAD_DATA = ''  # required so the ublad corpus definition can be loaded during tests

SERVERS['default']['index_prefix'] = 'test'