Merge branch 'release/4.1.0'
lukavdplas committed Jun 28, 2023
2 parents fb80497 + 37aaf2c commit 56f9d39
Showing 254 changed files with 5,506 additions and 2,456 deletions.
14 changes: 9 additions & 5 deletions CITATION.cff
@@ -8,9 +8,13 @@ message: >-
metadata from this file.
type: software
authors:
- name: Research Software Lab
email: [email protected]
affiliation: 'Centre for Digital Humanities, Utrecht University'
- name: 'Research Software Lab, Centre for Digital Humanities, Utrecht University'
website: 'https://cdh.uu.nl/centre-for-digital-humanities/research-software-lab/'
city: Utrecht
country: NL
identifiers:
- type: doi
value: 10.5281/zenodo.8064133
repository-code: 'https://github.com/UUDigitalHumanitieslab/I-analyzer'
url: 'https://ianalyzer.hum.uu.nl'
abstract: >-
@@ -31,6 +35,6 @@ keywords:
- elasticsearch
- natural language processing
license: MIT
commit: 96b9585
version: 4.0.2
commit: fb80497
version: 4.0.3
date-released: '2023-06-21'
61 changes: 7 additions & 54 deletions README.md
@@ -1,34 +1,14 @@
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8064133.svg)](https://doi.org/10.5281/zenodo.8064133)
[![Actions Status](https://github.com/UUDigitalHumanitiesLab/I-analyzer/workflows/Unit%20tests/badge.svg)](https://github.com/UUDigitalHumanitiesLab/I-analyzer/actions)


# I-analyzer

The text mining tool that obviates all others.

I-analyzer is a web application that allows users to search through large text corpora, requiring no experience in text mining or technical know-how.

## Directory structure

The I-analyzer backend (`/backend`) is a python/Django app that provides the following functionality:

- A 'users' module that defines user accounts.

- A 'corpora' module containing corpus definitions and metadata of the currently implemented corpora. For each corpus added in I-analyzer, this module defines how to extract document contents from its source files and sets parameters for displaying the corpus in the interface, such as sorting options.

- An 'addcorpus' module which manages the functionality to extract data from corpus source files (given the definition) and save this in an elasticsearch index. Source files can be XML or HTML format (which are parsed with `beautifulsoup4` + `lxml`) or CSV. This module also provides the basic data structure for corpora.

- An 'es' module which handles the communication with elasticsearch. The data is passed through to the index using the `elasticsearch` package for Python (note that `elasticsearch-dsl` is not used, since its [documentation](https://elasticsearch-dsl.readthedocs.io/en/latest) at the time seemed less immediately accessible than the [low-level](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html) version).

- An 'api' module that enables users to search through an ElasticSearch index of a text corpus and stream search results into a CSV file. The module also performs more complex analysis of search results for visualisations.

- A 'visualizations' module that does the analysis for several types of text-based visualisations.

- A 'downloads' module that collects results into csv files.

- A 'wordmodels' module that handles functionality related to word embeddings.

`ianalyzer/frontend` is an [Angular 13](https://angular.io/) web interface.

See the documentation for [a more extensive overview](./documentation/Overview.md)
See the documentation for [an overview of the repository](./documentation/Overview.md)

## Prerequisites

@@ -38,8 +18,6 @@ See the documentation for [a more extensive overview](./documentation/Overview.m
* [Redis](https://www.redis.io/) (used by [Celery](http://www.celeryproject.org/)). Recommended installation is [installing from source](https://redis.io/docs/getting-started/installation/install-redis-from-source/)
* Yarn

If you wish to have email functionality, also make sure you have an email server set up, such as [maildev](https://maildev.github.io/maildev/).

The documentation includes a [recipe for installing the prerequisites on Debian 10](./documentation/Local-Debian-I-Analyzer-setup.md)

## First-time setup
@@ -77,49 +55,24 @@ yarn postinstall
The backend readme provides more details on these steps.
8. Set up the database and migrations by running `yarn django migrate`.
9. Make a superuser account with `yarn django createsuperuser`
10. In `frontend/src/environments`, create a file `environment.private.ts` with the following settings:
```
privateEnvironment = {
appName: 'I-Analyzer',
aboutPage: 'ianalyzer'
}
```

## Adding corpora

To include corpora in your environment, you need to index them from their source files. The source files are not included in this repository; ask another developer about their availability. If you have (a sample of) the source files for a corpus, you can add it to your environment as follows:

_Note:_ these instructions are for adding a corpus that already has a corpus definition. For adding new corpus definitions, see [How to add a new corpus to I-analyzer](./documentation/How-to-add-a-new-corpus-to-Ianalyzer.md).
_Note:_ these instructions are for indexing a corpus that already has a corpus definition. For adding new corpus definitions, see [How to add a new corpus to I-analyzer](./documentation/How-to-add-a-new-corpus-to-Ianalyzer.md).

1. Add the corpus to the `CORPORA` dictionary in your local settings file. The key should match the class name of the corpus definition. This match is not case-sensitive, and your key may include extra non-alphabetic characters (they will be ignored when matching). The value should be the absolute path to the corpus definition file (e.g. `.../backend/corpora/times/times.py`).
2. Set configurations for your corpus. Check the definition file to see which variables it expects to find in the configuration. Some of these may already be set in settings.py, but you will at least need to define the name of the elasticsearch index and the (absolute) path to your source files.
3. Activate your python virtual environment. Create an ElasticSearch index from the source files by running, e.g., `yarn django index dutchannualreports -s 1785-01-01 -e 2010-12-31` to index the Dutch Annual Reports corpus starting in 1785 and ending in 2010. The dates are optional and default to the specified minimum and maximum dates of the corpus. Note that new indices are created with `number_of_replicas` set to 0, to make index creation easier and lighter. In production, you can automatically update this setting after index creation by adding the `--prod` flag (e.g. `yarn django index goodreads --prod`). Note, though, that the `--prod` flag creates a _versioned_ index name, which needs an alias to actually work as `name_of_index_without_version` (see below for more details).
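Steps 1 and 2 above can be sketched as a fragment of a local settings file. The corpus name, paths, and corpus-specific setting names here are hypothetical; check the corpus definition file for the variables it actually expects:

```python
# settings_local.py -- hypothetical sketch of steps 1 and 2 above
CORPORA = {
    # key matches the corpus definition's class name (case-insensitive);
    # value is the absolute path to the definition file
    'dutchannualreports': '/home/me/I-analyzer/backend/corpora/dutchannualreports/dutchannualreports.py',
}

# corpus-specific settings; these names are illustrative --
# the definition file determines what is actually read
DUTCHANNUALREPORTS_DATA = '/data/dutchannualreports/source'
DUTCHANNUALREPORTS_ES_INDEX = 'dutchannualreports'
```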

#### Flags of indexing script
- `--prod` / `-p`: create a versioned index name
- `--mappings_only` / `-m`: only create an index with mappings and settings, without adding data to it (useful before reindexing from another index or another server)
- `--add` / `-a`: add documents to an existing index (skips index creation)
- `--update` / `-u`: add or change fields in the documents. This requires an `update_body` or `update_script` to be set in the corpus definition; see the [example for `update_body` in dutchnewspapers](backend/corpora/dutchnewspapers/dutchnewspapers_all.py) and the [example for `update_script` in goodreads](backend/corpora/goodreads/goodreads.py)
- `--delete` / `-d`: delete an existing index with the `corpus.es_index` name. Note that in production, `corpus.es_index` will usually be an *alias*; use `yarn django es alias -c corpus-name --clean` to achieve the same thing
- `--rollover` / `-r`: only applies in production: roll over a versioned index to the newest version. This *will not* delete the old index (so you have a chance to check the new index and roll back, if necessary)

#### Production

On the servers, we work with aliases. Indices created with the `--prod` flag will have a version number (e.g. `indexname-1`), and as such will not be recognized by the corpus definition (which looks for `indexname`). Create an alias for the index using the `alias` command: `yarn django alias -c corpusname`. This script ensures that an alias is present for the index with the highest version number, and absent for all older versions. The advantage of this approach is that an old version of the index can be kept in place as long as needed, for example while a new version of the index is being created. Note that removing an alias does not remove the index itself.

Once you have an alias in place, you might want to remove old versions of the index. The `alias` command can be used for this as well: `yarn django alias -c corpusname --clean` removes every version of the index except the newest. Note that removing an index also removes any existing aliases for it. You might want to perform this as a separate operation (i.e. after completing step 14), so that the old index stays in place for a while, giving you time to check that everything works.

See the documentation for more information about [indexing on the server](./documentation/Indexing-on-server.md).
2. Set configurations for your corpus. Check the definition file to see which variables it expects to find in the configuration. Some of these may already be set in settings.py, but you will at least need to define the (absolute) path to your source files.
3. Activate your python virtual environment. Create an ElasticSearch index from the source files by running, e.g., `yarn django index dutchannualreports`, for indexing the Dutch Annual Reports corpus in a development environment. See [Indexing](documentation/Indexing-corpora.md) for more information.

## Running a dev environment

1. Start your local elasticsearch server. If you installed from .zip or .tar.gz, this can be done by running `{path to your elasticsearch folder}/bin/elasticsearch`
2. Activate your python environment. Start the backend server with `yarn start-back`. This creates an instance of the Django server at `127.0.0.1:8000`.
3. (optional) If you want to use celery, start your local redis server by running `redis-server` in a separate terminal.
4. (optional) If you want to use celery, activate your python environment. Run `yarn celery worker`. Celery is used for long downloads and the word cloud and ngrams visualisations.
5. (optional) If you want to use email functionality, start your local email server.
6. Start the frontend by running `yarn start-front`.
5. Start the frontend by running `yarn start-front`.

## Notes for development

5 changes: 5 additions & 0 deletions backend/.gitignore
@@ -37,9 +37,14 @@ flask_sql_data/
# Local settings file
ianalyzer/settings_local.py

# legacy config
ianalyzer/config.py

# csv downloads
download/csv_files/

# word models
corpora/*/wm/*
!corpora/*/wm/documentation.md


11 changes: 2 additions & 9 deletions backend/addcorpus/conftest.py
@@ -1,11 +1,10 @@
import pytest
import os
from django.contrib.auth.models import Group
from addcorpus.load_corpus import load_all_corpora
from addcorpus.models import Corpus

@pytest.fixture()
def group_with_access(db, mock_corpus, mock_corpora_in_db):
def group_with_access(db, mock_corpus):
'''Create a group with access to the mock corpus'''
group = Group.objects.create(name='nice-users')
corpus = Corpus.objects.get(name=mock_corpus)
@@ -17,11 +16,5 @@ here = os.path.abspath(os.path.dirname(__file__))
here = os.path.abspath(os.path.dirname(__file__))

@pytest.fixture()
def mock_corpus(db):
def mock_corpus():
return 'mock-csv-corpus'

@pytest.fixture()
def mock_corpus_user(auth_user, group_with_access):
auth_user.groups.add(group_with_access)
auth_user.save()
return auth_user
14 changes: 14 additions & 0 deletions backend/addcorpus/constants.py
@@ -0,0 +1,14 @@
CATEGORIES = [
('newspaper', 'Newspapers'),
('parliament', 'Parliamentary debates'),
('periodical', 'Periodicals'),
('finance', 'Financial reports'),
('ruling', 'Court rulings'),
('review', 'Online reviews'),
('inscription', 'Funerary inscriptions'),
('oration', 'Orations'),
('book', 'Books'),
]
'''
Types of data
'''
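Lists of `(value, label)` pairs like this one are the conventional shape for Django field choices. A minimal sketch of how such a constant can be consumed (the lookup helper below is illustrative, not part of this diff):

```python
# same shape as backend/addcorpus/constants.py (abbreviated here)
CATEGORIES = [
    ('newspaper', 'Newspapers'),
    ('parliament', 'Parliamentary debates'),
    ('periodical', 'Periodicals'),
]

# turn the choices into a machine-name -> display-label lookup
category_labels = dict(CATEGORIES)

def label_for(category: str) -> str:
    """Return the human-readable label for a category value,
    falling back to the raw value when it is unknown."""
    return category_labels.get(category, category)
```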
