Merge pull request hplt-project#96 from alvations/pipeline
New pipeline feature!
alvations authored May 4, 2020
2 parents cc0617f + 9708bc2 commit 6df9f03
Showing 3 changed files with 223 additions and 249 deletions.
124 changes: 58 additions & 66 deletions README.md
@@ -3,24 +3,21 @@
[![Build Status](https://travis-ci.org/alvations/sacremoses.svg?branch=master)](https://travis-ci.org/alvations/sacremoses)
[![Build status](https://ci.appveyor.com/api/projects/status/bwgmj4axw9pdk1oq?svg=true)](https://ci.appveyor.com/project/alvations/sacremoses)

License
====
# License

GNU Lesser General Public License version 2.1 or, at your option, any later version.

Install
====
# Install

```
pip install -U sacremoses
```

NOTE: Sacremoses only supports Python 3 now (`sacremoses>=0.0.41`). If you're using Python 2, the last possible version is `sacremoses==0.0.40`.
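
The version rule in the note above can be written down as a small helper, for instance when generating a requirements file for mixed environments. This is a minimal sketch: the function name `sacremoses_requirement` is hypothetical and not part of sacremoses; only the two version specifiers come from the note.

```python
import sys


def sacremoses_requirement(python_major: int) -> str:
    """Return a pip requirement string for the given Python major version.

    Hypothetical helper: it only encodes the rule stated in the README note.
    """
    if python_major >= 3:
        # Python 3 is supported from sacremoses 0.0.41 onwards.
        return "sacremoses>=0.0.41"
    # Python 2 users have to stay on the last compatible release.
    return "sacremoses==0.0.40"


# Print the requirement that matches the running interpreter.
print(sacremoses_requirement(sys.version_info.major))
```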

Usage (Python)
====
# Usage (Python)

**Tokenizer and Detokenizer**
## Tokenizer and Detokenizer

```python
>>> from sacremoses import MosesTokenizer, MosesDetokenizer
@@ -43,7 +40,7 @@ True
```


**Truecaser**
## Truecaser

```python
>>> from sacremoses import MosesTruecaser, MosesTokenizer
@@ -78,7 +75,7 @@ True
'the adventures of Sherlock Holmes'
```

**Normalizer**
## Normalizer

```python
>>> from sacremoses import MosesPunctNormalizer
@@ -87,113 +84,112 @@ True
'THIS EBOOK IS OTHERWISE PROVIDED TO YOU "AS-IS."'
```

# Usage (CLI)

Usage (CLI)
====

Version `0.0.42` introduces a pipeline feature for the CLI, so the following
global options must be set before the first command:

- language
- processes
- encoding
- quiet

```shell
$ pip install -U "sacremoses>=0.0.38"
$ pip install -U "sacremoses>=0.0.42"

$ sacremoses --help
Usage: sacremoses [OPTIONS] COMMAND [ARGS]...
Usage: sacremoses [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

Options:
--version Show the version and exit.
-h, --help Show this message and exit.
-l, --language TEXT Use language specific rules when tokenizing
-j, --processes INTEGER No. of processes.
-e, --encoding TEXT Specify encoding of file.
-q, --quiet Disable progress bar.
--version Show the version and exit.
-h, --help Show this message and exit.

Commands:
detokenize
detruecase
normalize
tokenize
train-truecase
truecase
```
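
Because `-l`, `-j` and `-e` now belong to the top-level `sacremoses` command, an invocation written in the older per-command style (for example `sacremoses tokenize -j 4`) has to be reordered. The sketch below shows one way to do that on an argument list. The helper `hoist_global_options` is hypothetical and not part of sacremoses. It handles only the space-separated form (`-j 4`, not `--processes=4`), and it deliberately leaves `-q` untouched, since `normalize` also uses `-q` for `--normalize-quote-commas`.

```python
# Hypothetical helper, not part of sacremoses: move the options that are now
# global so that they sit directly after the program name.
VALUE_OPTIONS = {"-l", "--language", "-j", "--processes", "-e", "--encoding"}


def hoist_global_options(argv):
    """Return a copy of argv with value-taking global options placed first."""
    program, rest = argv[0], argv[1:]
    hoisted, remaining = [], []
    i = 0
    while i < len(rest):
        token = rest[i]
        if token in VALUE_OPTIONS and i + 1 < len(rest):
            # Keep the option together with its value.
            hoisted.extend([token, rest[i + 1]])
            i += 2
        else:
            remaining.append(token)
            i += 1
    return [program] + hoisted + remaining


old_style = ["sacremoses", "tokenize", "-j", "4", "-p", "basic-protected-patterns"]
print(hoist_global_options(old_style))
# ['sacremoses', '-j', '4', 'tokenize', '-p', 'basic-protected-patterns']
```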

## Pipeline

The example below chains the following commands:

- `normalize` with the `-c` option to remove control characters.
- `tokenize` with the `-a` option for aggressive dash split rules.
- `truecase` with the `-a` option to indicate that the model is for ASR, and the `-m` option to save the model to the `big.truemodel` file.
- redirect the console output to the `big.txt.norm.tok.true` file.

```shell
cat big.txt | sacremoses -l en -j 4 \
normalize -c tokenize -a truecase -a -m big.truemodel \
> big.txt.norm.tok.true
```
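
When the pipeline is assembled from a script, the same ordering rule applies: global options first, then each command followed by its own arguments. The following is a minimal sketch that only builds the argument list. The helper name `build_pipeline_argv` is an assumption for illustration and is not provided by sacremoses; the options and file names are the ones from the example above.

```python
import shlex


def build_pipeline_argv(global_options, stages):
    """Build an argv list: global options first, then each command with its arguments.

    Hypothetical helper for illustration; `stages` is a list of
    (command_name, argument_list) pairs in the order they should run.
    """
    argv = ["sacremoses", *global_options]
    for command, args in stages:
        argv.append(command)
        argv.extend(args)
    return argv


argv = build_pipeline_argv(
    ["-l", "en", "-j", "4"],
    [
        ("normalize", ["-c"]),
        ("tokenize", ["-a"]),
        ("truecase", ["-a", "-m", "big.truemodel"]),
    ],
)
print(shlex.join(argv))
# sacremoses -l en -j 4 normalize -c tokenize -a truecase -a -m big.truemodel
```

The resulting list can be passed to `subprocess.run` with `big.txt` opened as `stdin` and the output file as `stdout`, which mirrors the shell redirection in the example.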

**Tokenizer**
## Tokenizer

```shell
$ sacremoses tokenize --help
Usage: sacremoses tokenize [OPTIONS]

Options:
-l, --language TEXT Use language specific rules when tokenizing
-j, --processes INTEGER No. of processes.
-a, --aggressive-dash-splits Triggers dash split rules.
-x, --xml-escape Escape special characters for XML.
-p, --protected-patterns TEXT Specify file with patters to be protected in
tokenisation.
-c, --custom-nb-prefixes TEXT Specify a custom non-breaking prefixes file,
add prefixes to the default ones from the
specified language.
-e, --encoding TEXT Specify encoding of file.
-h, --help Show this message and exit.


$ sacremoses tokenize -j 4 < big.txt > big.txt.tok
$ sacremoses -l en -j 4 tokenize < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 24363.39it/s

$ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns
$ sacremoses tokenize -j 4 -p basic-protected-patterns < big.txt > big.txt.tok
$ sacremoses -l en -j 4 tokenize -p basic-protected-patterns < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 22183.94it/s
```
**Detokenizer**
## Detokenizer
```shell
$ sacremoses detokenize --help
Usage: sacremoses detokenize [OPTIONS]

Options:
-l, --language TEXT Use language specific rules when tokenizing
-j, --processes INTEGER No. of processes.
-x, --xml-unescape Unescape special characters for XML.
-e, --encoding TEXT Specify encoding of file.
-h, --help Show this message and exit.

-x, --xml-unescape Unescape special characters for XML.
-h, --help Show this message and exit.

$ sacremoses detokenize -j 4 < big.txt.tok > big.txt.tok.detok
128457it [00:23, 5355.88it/s]
$ sacremoses -l en -j 4 detokenize < big.txt.tok > big.txt.tok.detok
100%|██████████████████████████████████| 128457/128457 [00:16<00:00, 7931.26it/s]
```
**Train Truecaser**
```shell
$ sacremoses train-truecase --help
Usage: sacremoses train-truecase [OPTIONS]

Options:
-m, --modelfile TEXT Filename to save the modelfile. [required]
-j, --processes INTEGER No. of processes.
-a, --is-asr A flag to indicate that model is for ASR.
-p, --possibly-use-first-token Use the first token as part of truecasing.
-e, --encoding TEXT Specify encoding of file.
-h, --help Show this message and exit.

$ sacremoses train-truecase -m big.model -j 4 < big.txt.tok
128457it [00:12, 10049.23it/s]
```
**Truecase**
## Truecase
```shell
$ sacremoses truecase --help
Usage: sacremoses truecase [OPTIONS]

Options:
-m, --modelfile TEXT The trucaser modelfile to use. [required]
-j, --processes INTEGER No. of processes.
-a, --is-asr A flag to indicate that model is for ASR.
-e, --encoding TEXT Specify encoding of file.
-h, --help Show this message and exit.
-m, --modelfile TEXT Filename to save/load the modelfile.
[required]
-a, --is-asr A flag to indicate that model is for ASR.
-p, --possibly-use-first-token Use the first token as part of truecase
training.
-h, --help Show this message and exit.

$ sacremoses truecase -m big.model -j 4 < big.txt.tok > big.txt.tok.true
128457it [00:11, 11411.07it/s]
$ sacremoses -j 4 truecase -m big.model < big.txt.tok > big.txt.tok.true
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 14257.27it/s]
```
**Detruecase**
## Detruecase
```shell
$ sacremoses detruecase --help
@@ -205,28 +201,24 @@ Options:
-e, --encoding TEXT Specify encoding of file.
-h, --help Show this message and exit.

$ sacremoses detruecase -j 4 < big.txt.tok.true > big.txt.tok.true.detrue
$ sacremoses -j 4 detruecase < big.txt.tok.true > big.txt.tok.true.detrue
100%|█████████████████████████████████| 128457/128457 [00:04<00:00, 26945.16it/s]
```
**Normalize**
## Normalize
```shell
$ sacremoses normalize --help
Usage: sacremoses normalize [OPTIONS]

Options:
-l, --language TEXT Use language specific rules when normalizing.
-j, --processes INTEGER No. of processes.
-q, --normalize-quote-commas Normalize quotations and commas.
-d, --normalize-numbers Normalize number.
-p, --replace-unicode-puncts Replace unicode punctuations BEFORE
normalization.
-c, --remove-control-chars Remove control characters AFTER normalization.
-e, --encoding TEXT Specify encoding of file.
-q, --quiet Disable progress bar.
-h, --help Show this message and exit.

$ sacremoses normalize -j 4 < big.txt > big.txt.norm.cli
$ sacremoses -j 4 normalize < big.txt > big.txt.norm
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 13096.23it/s]
```