Update readme
p-goulart committed Jan 18, 2024
1 parent 1f2b122 commit 5c3f177
Showing 1 changed file with 70 additions and 2 deletions.
72 changes: 70 additions & 2 deletions README.md
@@ -9,9 +9,55 @@ This repository contains tools for compiling and deploying dictionaries for [Lan
The owner, maintainer, and main developer of this repository is @p-goulart. Any shell and Perl components may be
better explained by @jaumeortola, though.

## Setup

### Python dependencies

This is set up as a Poetry project, so you must have [Poetry](https://python-poetry.org/docs/) installed and ready to go.

Make sure you are using a [virtual environment](https://python-poetry.org/docs/managing-environments/) and then:

```bash
poetry install --with test,dev
```

### System dependencies

In addition to the Python dependencies, you will also need to have [Hunspell](https://github.com/hunspell/hunspell)
binaries installed on your system.

The most important one is `unmunch`. Check if it's installed:

```bash
which unmunch
# should return a path to a bin directory, like
# /opt/homebrew/bin/unmunch
```
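
The same check can be done from Python, for instance before running a build. This is only a minimal sketch: the helper name `find_unmunch` is hypothetical and is not part of this repository.

```python
import shutil
from typing import Optional


def find_unmunch() -> Optional[str]:
    """Return the path to the unmunch binary, or None if it is not on PATH.

    Hypothetical helper for illustration; not part of this repository.
    """
    return shutil.which("unmunch")


unmunch_path = find_unmunch()
if unmunch_path is None:
    print("unmunch not found on PATH; Hunspell may need to be installed or compiled")
else:
    print(f"unmunch found at {unmunch_path}")
```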

If it's not installed, you may need to compile Hunspell from source. Clone the [Hunspell repo](https://github.com/hunspell/hunspell)
and then, from inside it, these steps should work on Ubuntu:

```bash
# install a bunch of dependencies needed for compilation
sudo apt-get install autoconf automake autopoint libtool
autoreconf -vfi
./configure
make
sudo make install
sudo ldconfig
```

### LT dependencies

The scripts here also depend on the `languagetool` Java codebase (for word tokenisation).

Make sure you have LT cloned locally, and export the following environment variable in your shell configuration:

```bash
export LT_HOME=/path/to/languagetool
```

If this variable is not set, the code in this project defaults it to `../languagetool` (that is, one
directory up from wherever this repo is cloned).
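
The fallback described above could look roughly like the sketch below. The function name `resolve_lt_home` and its `repo_dir` parameter are assumptions made for illustration; the actual code in this project may resolve the variable differently.

```python
import os
from pathlib import Path


def resolve_lt_home(repo_dir: Path) -> Path:
    """Return LT_HOME from the environment, else ../languagetool relative to repo_dir.

    Hypothetical helper for illustration; not the project's actual implementation.
    """
    env_value = os.environ.get("LT_HOME")
    if env_value:
        return Path(env_value)
    # One directory up from the repo, then into "languagetool".
    return repo_dir.parent / "languagetool"


print(resolve_lt_home(Path.cwd()))
```

Exporting `LT_HOME` explicitly is still the safer option, since it does not depend on where the repositories happen to be cloned.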

## Usage

@@ -20,3 +66,25 @@ This repository should be a submodule of language-specific repositories. For exa
⚠️ Note that the name of this repository is in **kebab-case**, but Python modules should be imported in **snake_case**.
Therefore, when importing this as a submodule, make sure to set the path to `dict_tools`, which uses the underscore.
If you don't do this, you may fail to import it as a module.
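
The reason is that Python module names must be valid identifiers, and a hyphen is not allowed in one. A quick way to see this is sketched below; the kebab-case name `dict-tools` is assumed here for illustration, and `is_importable_name` is a hypothetical helper.

```python
import keyword


def is_importable_name(name: str) -> bool:
    """Check whether a directory name can be used in a plain `import` statement."""
    return name.isidentifier() and not keyword.iskeyword(name)


print(is_importable_name("dict-tools"))  # False: the hyphen is not valid in an identifier
print(is_importable_name("dict_tools"))  # True
```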

### `build_tagger_dicts.py`

This is the script that compiles source files into a binary dictionary to be used by the LT POS tagger, Word
Tokeniser, and Synthesiser.

You can check the usage parameters by invoking it with `--help`:

```bash
poetry run python scripts/build_tagger_dicts.py --help
```

### `build_spelling_dicts.py`

This is the script that takes all the Hunspell and helper files as input and outputs binary files to be used
by the Morfologik speller.

You can check the usage parameters by invoking it with `--help`:

```bash
poetry run python scripts/build_spelling_dicts.py --help
```
