Paper, Tags: #nlp, #embeddings
A new type of deep contextualized word representation that models:
- complex characteristics of word use
- how these uses vary across linguistic contexts (i.e., polysemy).
The word vectors are learned functions of the internal states of a biLM.
How this differs from traditional word embeddings:
- Unlike traditional word type embeddings, each token is assigned a representation that is a function of the entire input sentence. Past approaches only allow a single, context-independent representation for each word.
- This approach uses vectors derived from a bidirectional LSTM trained with a coupled LM objective on a large text corpus (objective below) -> ELMo, Embeddings from Language Models. Unlike Peters et al., 2017, ELMo representations are deep: a function of all internal layers of the biLM. The system learns a task-specific linear combination of the vectors stacked above each input word, while Peters et al. use only the top layer of the LSTM.
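The coupled LM objective jointly maximizes the log likelihood of the forward and backward language models (as given in the paper):

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)$$

where the token representation parameters $\Theta_x$ and the softmax parameters $\Theta_s$ are shared between directions, while the forward and backward LSTMs keep separate parameters.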
The biLM is trained on a 30M-sentence corpus. After pretraining the biLM on unlabeled data, the weights are fixed and additional task-specific model capacity is added. The biLM is able to disambiguate both the part of speech and the word sense of a word in the source sentence.
Higher-level LSTM states capture context-dependent aspects of word meaning, and lower-level states model aspects of syntax and can be used for POS tagging. Simultaneously exposing all these signals allows the learned model to select the types of semi-supervision that are most useful for each end task.
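Concretely, for each end task the model collapses all $L+1$ biLM layers for token $k$ into one vector using softmax-normalized layer weights $s^{task}$ and a scalar $\gamma^{task}$, both learned jointly with the task model (as given in the paper):

$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}$$

where $\mathbf{h}_{k,0}^{LM}$ is the context-independent token layer and $\mathbf{h}_{k,j}^{LM}$ for $j \ge 1$ are the biLSTM layers.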
To add ELMo to the supervised model, freeze the weights of the biLM, concatenate the ELMo vector with the token representation, and pass the ELMo-enhanced representation to the task RNN.
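A minimal PyTorch sketch of this integration step (not the paper's code): `frozen_bilm`, the dimensions, and the tagging head are hypothetical stand-ins, assuming the pretrained biLM returns all of its layer activations.

```python
import torch
import torch.nn as nn

class ElmoEnhancedTagger(nn.Module):
    """Sketch: freeze the biLM, weight its layers, concatenate with token embeddings."""
    def __init__(self, frozen_bilm, vocab_size, token_dim=100, elmo_dim=1024,
                 hidden_dim=256, num_tags=17):
        super().__init__()
        self.bilm = frozen_bilm                          # pretrained biLM (hypothetical module)
        for p in self.bilm.parameters():                 # freeze the biLM weights
            p.requires_grad = False
        self.token_emb = nn.Embedding(vocab_size, token_dim)
        # task-specific softmax-normalized layer weights s_j and scalar gamma
        self.layer_weights = nn.Parameter(torch.zeros(3))  # token layer + 2 biLSTM layers
        self.gamma = nn.Parameter(torch.ones(1))
        self.task_rnn = nn.LSTM(token_dim + elmo_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        # assume the frozen biLM returns all layer activations:
        # shape (num_layers=3, batch, seq_len, elmo_dim)
        with torch.no_grad():
            layers = self.bilm(token_ids)
        s = torch.softmax(self.layer_weights, dim=0)
        elmo = self.gamma * torch.einsum("l,lbtd->btd", s, layers)
        # ELMo-enhanced representation: [token embedding ; ELMo vector]
        x = torch.cat([self.token_emb(token_ids), elmo], dim=-1)
        out, _ = self.task_rnn(x)
        return self.proj(out)
```

The layer weighting sits inside the task model because $s^{task}$ and $\gamma^{task}$ are trained with the end task while the biLM itself stays frozen.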
Fine-tuning the biLM on domain-specific data leads to significant drops in perplexity and an increase in downstream task performance --> domain transfer.
Illustrations and more explanations in this article: NLP history.