John Gamboa edited this page Sep 13, 2017 · 3 revisions

In the tools folder you will find some standalone scripts. These scripts perform useful tasks that are orthogonal to the other Ovation functionalities. Each script is independent and receives different parameters. The sections below describe them in detail.

fix_gersen.sh

The GerSEN dataset comes in several files. Most of them are encoded in UTF-8, but some are unfortunately encoded in ISO-8859-1. This script converts each of the ISO-8859-1 files to UTF-8. It receives the path where the GerSEN dataset was downloaded and changes the relevant files in place.

Make sure, however, not to run this more than once: the script does not detect that a file is already UTF-8, so a second run re-converts it and corrupts the content of the dataset.

To run it, just do:

./fix_gersen.sh /path/to/gersen/
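
To see why a second run is destructive, here is a minimal Python sketch of this kind of conversion. It is not the script's actual code, and the helper name `latin1_to_utf8` is made up for illustration. It re-encodes bytes from ISO-8859-1 to UTF-8 and shows what happens when the conversion is applied twice.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Reinterpret ISO-8859-1 bytes as text and re-encode them as UTF-8."""
    # Every byte sequence is valid ISO-8859-1, so this decode never fails.
    # That is why already-converted input is not rejected here.
    return raw.decode("iso-8859-1").encode("utf-8")


original = "sch\u00f6n".encode("iso-8859-1")  # bytes of an unconverted file
once = latin1_to_utf8(original)                # correct UTF-8
twice = latin1_to_utf8(once)                   # UTF-8 bytes misread as ISO-8859-1

print(once.decode("utf-8"))   # the intended word
print(twice.decode("utf-8"))  # the same word with a garbled umlaut
```

After the second pass, the two UTF-8 bytes of the umlaut are each treated as a separate ISO-8859-1 character, so the single letter becomes two unrelated characters.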

generate_vocabulary.py

Given a tokenized file, the script counts all tokens and generates a vocabulary file. Just like the create_vocabulary() function of the Dataset API, the script allows you to restrict the size of the vocabulary and to only consider tokens that appear a minimum number of times in the input file. It is also possible to downcase the words (to make the vocabulary case insensitive) and to use a different delimiter (by default, it uses spaces as the token delimiter). To use the script, just do:

python generate_vocabulary.py input_file

Or, if you want to set the other parameters:

# This call will
# * read `input_file` as input,
# * only add to the vocabulary words that appear at least 5 times in `input_file`,
# * discard the least frequent words if the vocabulary exceeds 10000 words,
# * treat words as case insensitive (by downcasing them),
# * treat anything delimited by spaces as a word.
python generate_vocabulary.py input_file --min_frequency=5 --max_vocab_size=10000 --downcase=True --delimiter=' '
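
The options above can be illustrated with a short Python sketch. This is an assumption about the general technique, not the script's real implementation: the function name `build_vocabulary` and its return format are hypothetical, and the actual output file layout is not shown here.

```python
from collections import Counter


def build_vocabulary(lines, min_frequency=1, max_vocab_size=None,
                     downcase=False, delimiter=" "):
    """Return (token, count) pairs sorted by descending frequency."""
    counts = Counter()
    for line in lines:
        if downcase:
            line = line.lower()
        # Skip empty strings produced by repeated delimiters.
        counts.update(tok for tok in line.strip().split(delimiter) if tok)

    # most_common() sorts by count, highest first; ties keep first-seen order.
    frequent = [(tok, n) for tok, n in counts.most_common() if n >= min_frequency]
    # Truncating the sorted list discards the least frequent tokens.
    return frequent[:max_vocab_size]


sample = ["The cat sat", "the cat ran", "A dog sat"]
vocab = build_vocabulary(sample, min_frequency=2, max_vocab_size=2, downcase=True)
print(vocab)
```

With `downcase=True`, "The" and "the" are counted as one token. The frequency filter is applied before the size cap, so the cap only ever removes tokens that already passed the minimum count.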

load_w2v.py

Given a vocabulary file, this script generates a new file called w2v.npy with a set of vectors corresponding to the given vocabulary. The values in the vectors depend on spaCy's vectors. If a word is present among spaCy's vectors, its vector is used; otherwise, a randomly initialized vector is used.

To use it, just do:

load_w2v.py input_vocabulary.txt
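
The lookup-or-random behaviour described above might look roughly like the following sketch. It makes several assumptions: a plain dict named `pretrained` stands in for spaCy's vector table, the helper `build_embedding_matrix` is hypothetical, and the uniform range used for random initialization is arbitrary, since the script's actual distribution is not documented here.

```python
import numpy as np


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Stack one vector per vocabulary word, in vocabulary order."""
    rng = np.random.default_rng(seed)
    rows = []
    for word in vocab:
        if word in pretrained:
            rows.append(np.asarray(pretrained[word], dtype=np.float32))
        else:
            # Unknown word: fall back to a small random vector (assumed range).
            rows.append(rng.uniform(-0.1, 0.1, size=dim).astype(np.float32))
    return np.stack(rows)


# `pretrained` is a stand-in for spaCy's vectors, not a spaCy object.
pretrained = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
matrix = build_embedding_matrix(["cat", "zyzzyva", "dog"], pretrained, dim=3)
print(matrix.shape)
# np.save("w2v.npy", matrix) would then write the matrix to disk.
```

Row `i` of the resulting matrix corresponds to line `i` of the vocabulary, which is what makes the saved array usable as an embedding table.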

vocabulary_expansion.py

This is an adaptation of the vocabulary expansion script in TensorFlow's skip-thoughts model. It takes more or less the same parameters as the original script:

  • --model_checkpoint: A path to a checkpoint file (created with a TensorFlow Saver() object).
  • --embedding_tensor_name: Name of the TensorFlow variable containing the embedding matrix to be extended.
  • --vocab: A vocabulary file containing the vocabulary to be extended.
  • --word2vec_model: A word2vec model created using gensim. This is the "big" vocabulary that we want to use to extend the vocabulary present in --vocab and --embedding_tensor_name.
  • --output_dir: The output directory where the results will be written.

This script will generate two files (stored in --output_dir):

  • vocab.txt: The new vocabulary file containing the expanded vocabulary
  • embeddings.npy: The new embeddings file containing the vectors of the expanded vocabulary
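
For orientation, the original skip-thoughts script fits a linear regression from the word2vec space to the model's embedding space using the words both vocabularies share, then applies that mapping to every word2vec word. The sketch below illustrates the idea with NumPy least squares. It assumes this adaptation follows the same approach, omits the intercept term for brevity, and uses a hypothetical function name and toy data rather than real checkpoints.

```python
import numpy as np


def expand_embeddings(model_vocab, model_emb, w2v_vocab, w2v_emb):
    """Map every word2vec word into the model's embedding space."""
    w2v_index = {w: i for i, w in enumerate(w2v_vocab)}
    model_index = {w: i for i, w in enumerate(model_vocab)}
    shared = [w for w in model_vocab if w in w2v_index]

    # Fit `mapping` so that w2v_vector @ mapping approximates the model's
    # vector for each shared word (least squares, no intercept).
    x = w2v_emb[[w2v_index[w] for w in shared]]
    y = model_emb[[model_index[w] for w in shared]]
    mapping, *_ = np.linalg.lstsq(x, y, rcond=None)

    # Project the whole word2vec vocabulary, then restore the trained
    # vectors for words the model already knew.
    expanded = w2v_emb @ mapping
    for w in shared:
        expanded[w2v_index[w]] = model_emb[model_index[w]]
    return list(w2v_vocab), expanded


# Toy data: the model's vectors are an exact linear image of word2vec's.
rng = np.random.default_rng(0)
w2v_vocab = ["a", "b", "c", "d", "e"]
w2v_emb = rng.normal(size=(5, 3))
true_map = rng.normal(size=(3, 2))
model_vocab = ["a", "b", "c", "d"]
model_emb = w2v_emb[:4] @ true_map

words, expanded = expand_embeddings(model_vocab, model_emb, w2v_vocab, w2v_emb)
print(expanded.shape)
```

In this toy setup the word "e" is unknown to the model, yet it receives a vector in the model's space through the learned mapping. The returned word list and matrix play the roles of vocab.txt and embeddings.npy.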