- create a `corpus.txt` file with space-delimited sentences, one sentence per line, using `prepare.py`
- create the pickled data files that are needed for training using `preprocess.py`
  - params: `--ngrams --unk "|" --rand_window`
- run `train.py` to train the word embeddings
  - params: `--ngrams --n_negs 5 --weights --ss_t "1e-5"`
  - use additional params for batch size (`--mb`) and epochs (`--epoch`) as you like
- run without `--ngrams` to train the reference model.
What does `preprocess.py` do?
- the `convert` method generates a skipgram for each word:
  - `owords` is the context from left to right
  - the context is padded with `<|>` before and after where words are missing, so that the fixed window size is always filled.
- this is then saved into binaries.
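A minimal sketch of what such a `convert` step could look like; the function name follows the note above, but the window size, the unknown token, and the exact output format are assumptions rather than the repository's actual code:

```python
UNK = "<|>"  # assumed unknown / padding token (cf. the --unk "|" parameter)

def convert(sentence, window=3, unk=UNK):
    """Generate one (iword, owords) skipgram per word, padding the context
    with the unknown token so it always contains 2 * window entries."""
    skipgrams = []
    for i, iword in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        owords = ([unk] * (window - len(left)) + left
                  + right + [unk] * (window - len(right)))
        skipgrams.append((iword, owords))
    return skipgrams

# convert(["the", "cat", "sat"], window=2)[1]
# -> ("cat", ["<|>", "the", "sat", "<|>"])
```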
What are the `iwords` and `owords`?
- pairs loaded from `PermutatedSubsampledCorpus` -> they already come in that form from the file.
- `iword`: current word $w_t$
- `owords`: context words $w_c$ with $c \in \mathcal{C}_t$
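To make the notation concrete, roughly how such pairs could be loaded for training; the class name is taken from the note above, while the file format and the subsampling detail are assumptions:

```python
import pickle
import random
from torch.utils.data import Dataset

class PermutatedSubsampledCorpus(Dataset):
    """Hedged sketch: yields (iword, owords) pairs, which the pickled
    file produced by preprocess.py is assumed to store in that shape."""

    def __init__(self, datapath, ws=None):
        with open(datapath, "rb") as f:
            data = pickle.load(f)
        if ws is not None:
            # drop frequent words with probability ws[iword] (subsampling)
            data = [(iw, ow) for iw, ow in data if random.random() > ws[iw]]
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        iword, owords = self.data[idx]
        return iword, owords  # w_t and its context words w_c, c ∈ C_t
```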
some effort
- `window` -> should be sampled uniformly between 1 and 5 instead of just using a fixed size
  - I'll keep it fixed, otherwise batching becomes unnecessarily difficult... using a window of size 3 instead.
- „probability proportional to the square root of the uni-gram frequency”
  - can specify `weights` when initialising SGNS! adjust the function there. - done. (see the sketch after this list)
- Implementation of the n-grams and the adjusted scoring function (see the sketch after this list):
  - preprocessing
    - additional step to compute the n-grams and add them to the dictionary; hashing, building the representation...
    - done, but not using hashing
  - training
    - instead of just retrieving the vectors for the current word and the context from `Word2Vec` during loss computation:
      - retrieve all n-gram vectors and the context vectors
      - compute the loss based on those
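On the square-root item: a minimal sketch of how negative-sampling weights proportional to the square root of the uni-gram frequency could be built and then passed as the `weights` argument when initialising SGNS; the helper name and the normalisation are assumptions:

```python
import numpy as np

def sqrt_unigram_weights(word_counts, idx2word):
    """Hypothetical helper: weights proportional to the square root
    of each word's uni-gram frequency, normalised to sum to one."""
    freqs = np.array([word_counts[w] for w in idx2word], dtype=np.float64)
    weights = np.sqrt(freqs)
    return weights / weights.sum()
```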
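On the n-gram item: a sketch of extracting the character n-grams of a word and scoring a (word, context) pair as the sum of dot products between the word's n-gram vectors and the context vector, as described in the fastText paper. The function names and the 3-to-6-gram range are illustrative assumptions; as noted above, no hashing is used, so every n-gram simply gets its own dictionary entry:

```python
import torch

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary symbols,
    plus the full (wrapped) word itself."""
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(wrapped) - n + 1)]
    return grams + [wrapped]

def score(ngram_vecs, context_vec):
    """s(w, c) = sum over the word's n-gram vectors z_g of z_g^T v_c."""
    # ngram_vecs: (num_ngrams, dim), context_vec: (dim,)
    return (ngram_vecs @ context_vec).sum()
```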
Still todo (maybe)
- „When building the word dictionary, we keep the words that appear at least 5 times in the training set“
  - need to reduce that threshold, depending on our corpus. The discarded words are set to unknown (`<|>`) and will have no effect on the computation of the embedding vectors. It might be a better approach to use a different character, though. Or, in the case of tagged corpora and if the same tags could be assigned to new sentences on which the embeddings are used, to use the tags.
- „we solve our optimization problem by performing SGD“, „we use a linear decay on the step size”
  - currently using Adam.
- use the logistic loss function $\ell: x \mapsto \log(1+e^{-x})$ instead of `LogSigmoid`! (see the sketch after this list)
- do not compute gradient on the unknown char!
  - Actually, this is already implemented. Passing a `padding_idx` to an `nn.Embedding` has exactly that effect (see the `padding_idx` sketch further below).
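On the logistic-loss item: $\log(1+e^{-x})$ equals $-\log\sigma(x)$, so switching away from `LogSigmoid` only changes the sign convention, not the objective. A small check (plain PyTorch, not the repository's code):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4)

logistic = F.softplus(-x)          # l(x) = log(1 + exp(-x))
neg_logsigmoid = -F.logsigmoid(x)  # the loss used so far, negated

assert torch.allclose(logistic, neg_logsigmoid)
```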
just a parameter
- `ss_t`: 1e-4 (rejection threshold)
- `n_negs`: 5
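For reference, a common way a rejection threshold like `ss_t` is used to subsample frequent words (following Mikolov et al.); the exact formula in this repository may differ, so treat this as a hedged sketch:

```python
import numpy as np

def subsample_keep_prob(word_freq, ss_t=1e-4):
    """Probability of keeping a word with relative frequency word_freq;
    words rarer than the threshold are always kept."""
    return np.minimum(1.0, np.sqrt(ss_t / word_freq))

print(subsample_keep_prob(0.01))  # a word making up 1% of the corpus: ~0.1
```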
First impressions
Takes a very long time to train. One iteration needs approximately 25 seconds vs 9-10 iterations/sec with the reference implementation. Will try to optimise a bit by sorting the corpus.
Unfortunately, this didn't help much. In fact, it's not speeding up the training at all!
FIXED - the problem was a for-loop over the forward passes to the embeddings layer. Building the index tensor for the whole batch instead made the training faster.
NOTE: It's important that the first entry, index 0, is the unknown/nonexistent character!!!
- And it shouldn't be trained...
- And its vector should be initialised as a zero-vector...
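Both notes (and the earlier todo about not training the unknown char) are handled by `padding_idx`, and the speed fix amounts to one embedding call over a full index tensor instead of a Python loop. A minimal sketch with illustrative sizes and indices:

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 100  # illustrative sizes
emb = nn.Embedding(vocab_size, dim, padding_idx=0)
# index 0 (the unknown / nonexistent character) is initialised to the
# zero vector and its row never receives gradient updates.

indices = torch.tensor([[5, 17, 0, 0],     # padded with 0 = unknown
                        [8, 23, 42, 7]])   # shape (batch, max_ngrams)

# slow: one forward pass per row inside a Python for-loop
# vecs = torch.stack([emb(row) for row in indices])

# fast: a single forward pass for the whole batch
vecs = emb(indices)  # shape (batch, max_ngrams, dim)
```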
**Development notes**
computation of loss:
# what exactly do I want???
# 1. for each word w_t, t ∈ batch_size:
#    compute the sum of scores for each context VEC (oloss) / negative word VEC (nloss),
#    where the score is the sum (over all n-gram vectors) of the DOT product between the n-gram vector and that VEC
#    -> already have these 59 values.
#    -> and return the mean of the whole thing. Need to unsqueeze if it is just one element.
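A sketch of what the comment above describes, written as batched tensor operations rather than a per-word loop; the shapes, the reduction by mean, and the variable names are assumptions, not the repository's actual forward pass:

```python
import torch
import torch.nn.functional as F

def sgns_ngram_loss(ngram_vecs, ovecs, nvecs):
    """ngram_vecs: (batch, n_grams, dim)   n-gram vectors of each input word
    ovecs:       (batch, n_context, dim) context word vectors
    nvecs:       (batch, n_negs, dim)    negative sample vectors"""
    # score of w_t against a context/negative vector = sum over its
    # n-gram vectors of the dot products = dot product with their sum
    wvec = ngram_vecs.sum(dim=1, keepdim=True)                   # (batch, 1, dim)
    oscore = torch.bmm(ovecs, wvec.transpose(1, 2)).squeeze(-1)  # (batch, n_context)
    nscore = torch.bmm(nvecs, wvec.transpose(1, 2)).squeeze(-1)  # (batch, n_negs)
    # logistic loss: log(1 + exp(-x)) on positives, log(1 + exp(x)) on negatives
    oloss = F.softplus(-oscore).mean(dim=1)
    nloss = F.softplus(nscore).mean(dim=1)
    return (oloss + nloss).mean()
```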
„Imagine a distribution of words based on how many times each word appeared in a corpus, denoted as