Paper, Tags: #nlp, #embeddings
This paper is a detailed empirical study of how the choice of neural architecture influences both end-task accuracy and the qualitative properties of the learned representations.
Previous work has used LSTM-based biLMs, but this might not be the best possible architecture; alternatives include gated CNNs and feed-forward self-attention approaches such as the Transformer. Each architecture represents information in a different manner.
There is a tradeoff between speed and accuracy: the LSTMs are the most accurate, while the alternative architectures are faster.
- Word embedding layer: focuses exclusively on word morphology
- Lowest contextual layers of biLM: local syntax
- Upper layers: semantic content
- LSTM
  - Large hidden state while reducing the total # parameters (see the sketch after this list)
- Transformer
  - Feed-forward, self-attention-based architecture. Good for MT, constituency parsing, and semantic role labeling
- Gated CNN
  - Good for sequence-to-sequence MT
- Pre-trained biLMs
  - While it's possible to reduce perplexities for all models by scaling up, the goal is to compare representations across biLM architectures at approximately equal skill, as measured by perplexity (the exponentiated average per-token negative log-likelihood)
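As a concrete reference point, here is a minimal PyTorch sketch of an LSTM biLM (sizes and names are hypothetical, not the paper's code) where projection layers keep the hidden state large while shrinking the recurrent parameter count:

```python
import torch
import torch.nn as nn

class BiLM(nn.Module):
    """Sketch of a bidirectional LM: two independent LSTMs, one per direction."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=4096, proj_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # proj_size projects the large hidden state back down to proj_dim,
        # keeping capacity high while reducing the total parameter count.
        self.fwd = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                           proj_size=proj_dim, batch_first=True)
        self.bwd = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                           proj_size=proj_dim, batch_first=True)
        self.out = nn.Linear(proj_dim, vocab_size)  # softmax over the vocab

    def forward(self, tokens):                      # tokens: (batch, seq)
        x = self.embed(tokens)
        h_fwd, _ = self.fwd(x)                      # left-to-right context
        h_bwd, _ = self.bwd(x.flip(1))              # right-to-left context
        h_bwd = h_bwd.flip(1)                       # realign to original order
        return self.out(h_fwd), self.out(h_bwd)    # next/previous-token logits
```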
We evaluate the quality of the pre-trained biLM representations as ELMo-like contextual word vectors in state-of-the-art models across a suite of four benchmark NLP tasks (Table 2 in the paper). Across all tasks, the LSTM architectures perform best.
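An "ELMo-like" contextual vector is a task-learned, softmax-weighted combination of the biLM layers scaled by a scalar γ; a minimal sketch of that scalar mix (shapes and names are my assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style mix: gamma * sum_j softmax(w)_j * h_j over biLM layers."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_layers))  # one logit per layer
        self.gamma = nn.Parameter(torch.ones(()))       # global task scale

    def forward(self, layer_reps):
        # layer_reps: (num_layers, batch, seq, dim) stacked biLM activations
        s = torch.softmax(self.w, dim=0)
        return self.gamma * torch.einsum('l,lbsd->bsd', s, layer_reps)
```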
We consider the intra-sentence contextual similarity of words and phrases, and show that the biLM word embeddings capture little semantic information; it is instead represented in the contextual layers.
Intra-sentence similarity
- The lower layers capture mostly local information
- Top layer represents longer range relationships
- At the lowest layers, the biLM tends to place words from the same syntactic constituents in similar parts of the vector space
- The biLM implicitly learns other linguistic information in the upper layers: all verbs have high similarity to one another, suggesting the biLM captures POS information, and the model also implicitly learns to perform coreference resolution (see the sketch below)
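These observations come from similarity maps that can be reproduced with plain cosine similarity over one layer's contextual vectors; a short sketch (the input tensor is an assumption):

```python
import torch
import torch.nn.functional as F

def intra_sentence_similarity(reps: torch.Tensor) -> torch.Tensor:
    # reps: (seq_len, dim) contextual vectors for one sentence from one layer
    normed = F.normalize(reps, dim=-1)  # unit-length rows
    return normed @ normed.t()          # (seq_len, seq_len) cosine matrix
```

Plotting this matrix per layer shows the shift from local, constituent-level structure at the bottom to longer-range relationships at the top.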
The layers best suited for constituency parsing are at or above the layers with maximum POS accuracies, and the layers most transferable to parsing are at or below the layers with maximum pronominal coreference accuracy in all models.
The maximum layer weights occur below the top layers: the most transferable contextual representations tend to occur in the middle layers, while the top layers specialize for language modeling.
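Layer-wise accuracies like these are typically measured with a linear probe trained on frozen activations; a hedged sketch of that setup (my assumption, not the paper's exact protocol):

```python
import torch
import torch.nn as nn

def probe_layer(reps, labels, num_tags, epochs=100, lr=1e-2):
    # reps: (num_tokens, dim) frozen activations from one biLM layer
    # labels: (num_tokens,) gold tag ids (e.g., POS)
    clf = nn.Linear(reps.size(1), num_tags)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(clf(reps), labels).backward()
        opt.step()
    with torch.no_grad():  # for a real probe, score held-out tokens instead
        return (clf(reps).argmax(-1) == labels).float().mean().item()
```

Running the same probe on each layer traces out where POS, coreference, and parsing information peak in the network.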
biLM representations are far from perfect; during training, they have access to only surface forms of words and their order, meaning deeper linguistic phenomena must be learned "tabula rasa". Infusing models with explicit syntactic structure or other linguistically motivated inductive biases may overcome some of the limitations of sequential biLMs.