U-Blad Corpus #1590
Merged
Changes from all commits (19 commits)
c4a35c2 addition preliminary ublad corpus (Meesch)
eac1369 add document context (Meesch)
e4d4df0 add year variable to ublad (Meesch)
6c5e023 add ublad description (Meesch)
2478951 Merge branch 'develop' into feature/ublad-corpus (Meesch)
cd29f43 remove citation page from ublad (Meesch)
0dc2659 rename id field for ublad (Meesch)
136d325 Merge branch 'develop' into feature/ublad-corpus (Meesch)
633be4d ublad cleanup and tweaks (Meesch)
c4e1e08 skip snapshots for ublad corpus (Meesch)
c42aa8c move setlocale to the index script (Meesch)
1850050 Update backend/corpora/ublad/description/ublad.md (Meesch)
40a6ca7 Update backend/corpora/ublad/description/ublad.md (Meesch)
4432df9 use extract instead of transform_soup_func for ublad (Meesch)
fd1d4df add es_mappings (Meesch)
9dff0d8 Merge branch 'feature/ublad-corpus' of github.com:UUDigitalHumanities… (Meesch)
212ed79 change locales within date_transform method (Meesch)
78d64bc add test for transform_date (Meesch)
4333b44 Merge branch 'develop' into feature/ublad-corpus (Meesch)
backend/corpora/ublad/description/ublad.md (new file)
@@ -0,0 +1,7 @@
On 5 September 1969, Utrecht University got its first independent paper: _U utrechtse universitaire reflexen_. This paper grew out of a merger of two other periodicals: _Sol Iustitiae_, aimed mainly at students, and _Solaire Reflexen_, intended more for staff. U utrechtse universitaire reflexen was intended for all parts of the university community.

In 1974 the name changed to the _Ublad_. This remained the case until the university decided to take the printed Ublad digital. Amid loud protest the printed Ublad disappeared, and in April 2010 _DUB_, the digital university paper, was born.

To make all of this historical material accessible, the Centre for Digital Humanities, together with the University Library, digitised the old volumes. In I-analyzer you can find and search all volumes of U utrechtse universitaire reflexen and the Ublad.

The independent Ublad gives a colourful account of what was going on at the university, in the city, and in student life, through articles, photos, and cartoons. The image used for OCR is included for every page, so you can always consult the original source material.
Test for `transform_date` (new file)
@@ -0,0 +1,14 @@
import locale
import pytest
from corpora.ublad.ublad import transform_date
import datetime


def test_transform_date():
    datestring = '6 september 2007'
    goal_date = datetime.date(2007, 9, 6)
    try:
        date = transform_date(datestring)
    except locale.Error:
        pytest.skip('Dutch locale not installed in environment')
    assert date == str(goal_date)
backend/corpora/ublad/ublad.py (new file)
@@ -0,0 +1,264 @@
from datetime import datetime
import os
from os.path import join, splitext
import locale
import logging

from django.conf import settings
from addcorpus.python_corpora.corpus import HTMLCorpusDefinition, FieldDefinition
from addcorpus.python_corpora.extract import FilterAttribute
from addcorpus.es_mappings import *
from addcorpus.python_corpora.filters import DateFilter
from addcorpus.es_settings import es_settings

from ianalyzer_readers.readers.html import HTMLReader
from ianalyzer_readers.readers.core import Field
from ianalyzer_readers.extract import html, Constant

from bs4 import BeautifulSoup, Tag
def transform_content(soup):
    """
    Transforms the text contents of a page node (soup) into a string consisting
    of blocks of text, foregoing the column structure of the OCR'ed material.
    """
    page_text = ""
    for child in soup.children:
        if isinstance(child, Tag) and 'ocr_carea' in child.get('class', []):
            paragraph_text = ""
            paragraph_list = child.get_text().split('\n')
            for item in paragraph_list[1:]:
                if not item:
                    pass
                elif item.endswith('-'):
                    paragraph_text += item.strip('-')
                else:
                    paragraph_text += item + ' '
            if paragraph_text:
                page_text += paragraph_text + '\n\n'
    return page_text
def transform_date(date_string):
    try:
        locale.setlocale(locale.LC_ALL, 'nl_NL.UTF-8')
        date = datetime.strptime(date_string, '%d %B %Y').strftime('%Y-%m-%d')
        locale.setlocale(locale.LC_ALL, '')
        return date
    except ValueError:
        logger.error("Unable to get date from {}".format(date_string))
        return None


logger = logging.getLogger('indexing')
class UBlad(HTMLCorpusDefinition):
    title = 'U-Blad'
    description = 'The print editions of the Utrecht University paper from 1969 until 2010.'
    description_page = 'ublad.md'
    min_date = datetime(year=1969, month=1, day=1)
    max_date = datetime(year=2010, month=12, day=31)

    data_directory = settings.UBLAD_DATA
    es_index = getattr(settings, 'UBLAD_ES_INDEX', 'ublad')
    image = 'ublad.jpg'
    scan_image_type = 'image/jpeg'
    allow_image_download = True

    document_context = {
        'context_fields': ['volume_id'],
        'sort_field': 'sequence',
        'sort_direction': 'asc',
        'context_display_name': 'volume'
    }

    languages = ['nl']
    category = 'periodical'

    @property
    def es_settings(self):
        return es_settings(self.languages[:1], stopword_analysis=True, stemming_analysis=True)
    def sources(self, start=min_date, end=max_date):
        for directory, subdirs, filenames in os.walk(self.data_directory):
            # Prune snapshot directories so os.walk does not descend into them
            if '.snapshot' in subdirs:
                subdirs.remove('.snapshot')
            for filename in filenames:
                if filename != '.DS_Store':
                    full_path = join(directory, filename)
                    yield full_path, {'filename': filename}
    fields = [
        FieldDefinition(
            name='content',
            display_name='Content',
            display_type='text_content',
            description='Text content of the page, generated by OCR',
            results_overview=True,
            csv_core=True,
            search_field_core=True,
            visualizations=['ngram', 'wordcloud'],
            es_mapping=main_content_mapping(True, True, True, 'nl'),
            extractor=FilterAttribute(tag='div',
                                      recursive=True,
                                      multiple=False,
                                      flatten=False,
                                      extract_soup_func=transform_content,
                                      attribute_filter={
                                          'attribute': 'class',
                                          'value': 'ocr_page'
                                      })
        ),
        FieldDefinition(
            name='pagenum',
            display_name='Page number',
            description='Page number',
            csv_core=True,
            es_mapping=int_mapping(),
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'pagenum'
            })
        ),
        FieldDefinition(
            name='journal_title',
            display_name='Publication Title',
            description='Title of the publication',
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'journal_title'
            })
        ),
        FieldDefinition(
            name='volume_id',
            display_name='Volume ID',
            description='Unique identifier for this volume',
            hidden=True,
            es_mapping=keyword_mapping(),
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'identifier_ocn'
            })
        ),
        FieldDefinition(
            name='id',
            display_name='Page ID',
            description='Unique identifier for this page',
            hidden=True,
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'identifier_indexid'
            })
        ),
        FieldDefinition(
            name='edition',
            display_name='Edition',
            description='The number of the edition in this volume. Every year starts at 1.',
            sortable=True,
            es_mapping=keyword_mapping(),
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'aflevering'
            })
        ),
        FieldDefinition(
            name='volume',
            display_name='Volume',
            sortable=True,
            results_overview=True,
            csv_core=True,
            description='The volume number of this publication. There is one volume per year.',
            es_mapping=keyword_mapping(),
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'yearstring'
            }),
        ),
        FieldDefinition(
            name='date',
            display_name='Date',
            description='The publication date of this edition',
            es_mapping={'type': 'date', 'format': 'yyyy-MM-dd'},
            visualizations=['resultscount', 'termfrequency'],
            sortable=True,
            results_overview=True,
            search_filter=DateFilter(
                min_date,
                max_date,
                description=(
                    'Accept only articles with publication date in this range.'
                )
            ),
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'datestring',
            },
                transform=transform_date
            )
        ),
        FieldDefinition(
            name='repo_url',
            display_name='Repository URL',
            description='URL to the DSpace repository entry of this volume',
            es_mapping=keyword_mapping(),
            display_type='url',
            searchable=False,
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'link_repository'
            })
        ),
        FieldDefinition(
            name='reader_url',
            display_name='Reader URL',
            description='URL to the UB reader view of this page',
            es_mapping=keyword_mapping(),
            display_type='url',
            searchable=False,
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'link_objects_image'
            })
        ),
        FieldDefinition(
            name='jpg_url',
            display_name='Image URL',
            description='URL to the jpg file of this page',
            es_mapping=keyword_mapping(),
            display_type='url',
            searchable=False,
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'link_objects_jpg'
            })
        ),
        FieldDefinition(
            name='worldcat_url',
            display_name='Worldcat URL',
            description='URL to the Worldcat entry of this volume',
            es_mapping=keyword_mapping(),
            display_type='url',
            searchable=False,
            extractor=FilterAttribute(tag='meta', attribute='content', attribute_filter={
                'attribute': 'name',
                'value': 'link_worldcat'
            })
        )
    ]

    def request_media(self, document, corpus_name):
        image_list = [document['fieldValues']['jpg_url']]
        return {'media': image_list}
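As a quick illustration of the dehyphenation step inside `transform_content` above: OCR output breaks words across line ends with a trailing hyphen, and the loop glues those fragments back together. Below is a standalone sketch of that same line-joining logic; `join_ocr_lines` is a hypothetical helper written for this example, not part of the PR.

```python
def join_ocr_lines(lines):
    """Join OCR line fragments into one paragraph string.

    A trailing hyphen marks a word broken across lines, so such a
    fragment is concatenated to the next one without a space, mirroring
    the inner loop of transform_content.
    """
    paragraph_text = ""
    for item in lines:
        if not item:
            continue  # skip empty fragments
        elif item.endswith('-'):
            paragraph_text += item.strip('-')  # re-join hyphenated word
        else:
            paragraph_text += item + ' '
    return paragraph_text

print(join_ocr_lines(['de universi-', 'teit Utrecht']))
```

Note that `strip('-')` (as in the original code) removes hyphens from both ends of the fragment; `rstrip('-')` would be the more conservative choice if leading hyphens can legitimately occur.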
Settings file
@@ -76,6 +76,8 @@

CORPUS_SERVER_NAMES = {}

CORPORA_LOCALES = {}

CORPORA = {}

WORDCLOUD_LIMIT = 1000
I don't really agree with this solution. If locale switching is implemented in the application-wide indexing procedure, it increases the complexity of how indexing works generally; unless it's a common problem, it would be better to solve this in the corpus class, to keep the application more modular.

In this case, I expect it's possible to set and reset the locale in the `documents()` method of the corpus class instead. If it is necessary to do this here, please make sure it's properly documented.
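The reviewer's code snippet did not survive extraction. Purely as a hedged illustration of the idea being discussed (not the reviewer's actual suggestion), setting and resetting the locale around `documents()` could be done with a small context manager like this; `switched_locale` is a hypothetical helper:

```python
import locale
from contextlib import contextmanager

@contextmanager
def switched_locale(name):
    """Temporarily switch LC_TIME, restoring the previous setting on exit,
    even if the body raises."""
    saved = locale.setlocale(locale.LC_TIME)  # query current setting
    try:
        locale.setlocale(locale.LC_TIME, name)
        yield
    finally:
        locale.setlocale(locale.LC_TIME, saved)

# Hypothetical use inside a corpus class:
#     def documents(self, *args, **kwargs):
#         with switched_locale('nl_NL.UTF-8'):
#             yield from super().documents(*args, **kwargs)
```

A caveat worth noting: the locale is process-global, so wrapping a generator this way leaves the switched locale in effect for any other code that runs between yields, which is part of why this design discussion exists at all.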