isaacus-dev/emubert-creator

EmuBert Creator

EmuBert is the largest open-source masked language model for Australian law. This repository preserves the code used to create EmuBert.

If you're looking to download EmuBert, you may do so on Hugging Face.

Setup 🛠️

The EmuBert Creator has only been tested on Python 3.11 but should work for later versions and may also work for earlier versions.

To set up the Creator, start by running the following commands:

git clone https://github.com/umarbutler/emubert-creator.git
cd emubert-creator
pip install -r requirements.txt

Next, download the version of the Open Australian Legal Corpus you'd like to train EmuBert on by navigating to its changelog, clicking on the version number you'd like to use, clicking on the file named corpus.jsonl and finally hitting 'download'. Any version of the Corpus that begins with the number 4 should be compatible with the Creator. The specific version of the Corpus used to produce EmuBert is 4.2.1 and can be downloaded here.

Finally, you can either place the Corpus in a directory named data in the root of the repository, define an environment variable named OALC that points to the Corpus or override the corpus_path variable in scripts/config.py.
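The three options above could be resolved along the following lines. This is only a sketch, not the code in scripts/config.py: the helper name resolve_corpus_path, its parameters, the order of precedence (explicit override, then the OALC environment variable, then the data directory) and the assumption that the file keeps the name corpus.jsonl are all illustrative.

```python
import os
from pathlib import Path


def resolve_corpus_path(repo_root, override=None, env=None):
    """Hypothetical helper: pick the Corpus location from the three options.

    Assumed precedence (not confirmed by the repository):
    1. an explicit override, standing in for editing corpus_path in config.py;
    2. the OALC environment variable;
    3. data/corpus.jsonl in the root of the repository.
    """
    env = os.environ if env is None else env

    if override is not None:
        return Path(override)

    if env.get("OALC"):
        return Path(env["OALC"])

    # Assumes the downloaded file is still named corpus.jsonl.
    return Path(repo_root) / "data" / "corpus.jsonl"


if __name__ == "__main__":
    print(resolve_corpus_path(Path.cwd()))
```

Running the snippet from the root of the repository prints the path that would be used under these assumptions, which can help confirm that the OALC variable is set as intended.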

Usage 👩‍💻

To train EmuBert, run the following scripts in the scripts directory in order:

  1. preprocess.py, which cleans documents, splits them into training, validation and test sets, filters out short documents from the training set, deduplicates the training set, trains a tokeniser and finally saves the resulting data.
  2. block.py, which splits texts into blocks of the same size as EmuBert's context window and saves them.
  3. train.py, which trains EmuBert and saves it to a directory named model (unless the model_dir variable in config.py is overridden). If training is interrupted at any point, set the script's RESUME variable to True before running it again.
  4. convert.py, which converts EmuBert from a Better Transformer into a vanilla Transformer.
  5. benchmark.py, which benchmarks EmuBert against other popular masked language models.
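To illustrate the blocking step in particular, here is a minimal sketch of splitting a tokenised text into fixed-size blocks. It is not the code in block.py: the function name split_into_blocks is hypothetical, block_size stands in for the size of EmuBert's context window (which is not stated here) and keeping the final, shorter block rather than dropping or padding it is an assumption.

```python
def split_into_blocks(token_ids, block_size):
    """Hypothetical helper: split a list of token ids into consecutive blocks.

    Every block holds block_size ids except possibly the last, which holds
    whatever remains. Whether block.py keeps, drops or pads that final block
    is not specified; keeping it is an assumption of this sketch.
    """
    if block_size <= 0:
        raise ValueError("block_size must be a positive integer")

    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids), block_size)
    ]


if __name__ == "__main__":
    # Ten dummy token ids split into blocks of four.
    print(split_into_blocks(list(range(10)), 4))
```

Blocking texts ahead of training means each training example already fits the context window, so nothing needs to be truncated when batches are built.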

Licence 📜

The Creator is licensed under the MIT License.