Folder structure and file naming conventions
Isaac Schifferer edited this page Nov 17, 2023 · 2 revisions
silnlp uses the SIL_NLP_DATA_PATH environment variable to specify the path of a root folder (e.g., SIL_NLP_DATA_PATH="C:/silnlp"). All of the reference data files and experiment files (configuration, models, predictions, etc.) expected by the NMT scripts are located under this root folder.
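As a minimal sketch of how a script might resolve this root folder, the snippet below reads SIL_NLP_DATA_PATH and returns it as a path. The function name `get_data_root` and its error handling are assumptions made for illustration; this is not necessarily how silnlp itself reads the variable.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def get_data_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the root data folder named by SIL_NLP_DATA_PATH.

    Hypothetical helper: `env` defaults to the process environment and
    can be replaced with a plain dict for testing.
    """
    env = os.environ if env is None else env
    value = env.get("SIL_NLP_DATA_PATH", "")
    if not value:
        # Fail early with a clear message instead of silently using ".".
        raise RuntimeError("SIL_NLP_DATA_PATH is not set")
    return Path(value)
```

Calling `get_data_root({"SIL_NLP_DATA_PATH": "C:/silnlp"})` returns `Path("C:/silnlp")`; with the variable unset, the helper raises instead of guessing a location.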
The subfolder structure that silnlp requires under this root folder is described in the table below.
| Folder | Description |
|---|---|
| | Data and experiments subfolder supporting Alignment experiments. |
| | Experiments subfolder with multiple subfolders, one per experiment. |
| | Subfolder for a single experiment. |
| | Data and experiments subfolder supporting Machine Translation experiments. |
| | Non-Scripture training data files (WMT '20, NewsTest, MultiCCAligned, etc.). Refer to the next section for information on the naming conventions for the files in this subfolder. |
| | Experiments subfolder with multiple subfolders, one per experiment. |
| | Subfolder for a single experiment. |
| | Scripture training data files. When the extract_corpora script is run on a Paratext project, the extracted Scripture content is written to a file in this subfolder. Refer to the next section for information on the naming conventions for the files in this subfolder. |
| | Canonical list of verse references (e.g., "GEN 1:1"), in order, for all Scripture training data files extracted from the Paratext project. The verse references appear in this file in the same order as the verse text appears in all Scripture training data files. This file can be generated by running the extract_corpora script on the Ref project (see below). |
| | Key Biblical Terms (KBT) data files. When the extract_corpora script is run on a Paratext project with populated KBTs, the extracted KBTs are written to one or more files in this subfolder. Refer to the next section for information on the naming conventions for the files in this subfolder. |
| | Subfolder with Paratext projects and related Paratext supporting data. |
| | Subfolder with one or more Paratext projects. |
| | Subfolder with the files from an unzipped Paratext project. |
| | Subfolder containing a Reference Paratext project with versification that all other Paratext projects are aligned to when they are extracted. |
| | Reference files for processing Paratext KBTs. |
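Because the verse reference list and the Scripture training data files share the same ordering, a reference can be matched to its verse text by position. The sketch below illustrates the idea under two assumptions that the text above does not state: each file holds one entry per line, and a verse missing from a project is represented by an empty line. The function name `pair_refs_with_text` is hypothetical.

```python
from typing import Iterable, List, Tuple


def pair_refs_with_text(
    refs: Iterable[str], verses: Iterable[str]
) -> List[Tuple[str, str]]:
    """Pair each verse reference with the verse text at the same position.

    Assumes one entry per line in both inputs (an assumption, see above).
    """
    ref_list = [line.rstrip("\n") for line in refs]
    verse_list = [line.rstrip("\n") for line in verses]
    if len(ref_list) != len(verse_list):
        # Positional alignment only makes sense if the lengths match.
        raise ValueError(
            f"{len(ref_list)} references but {len(verse_list)} verse lines"
        )
    return list(zip(ref_list, verse_list))
```

Both arguments can be open file handles or plain lists of lines; a length mismatch raises an error rather than silently truncating the pairing.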
To be provided ...