Paper, Tags: #nlp, #embeddings
This paper is a detailed empirical study of how the choice of neural architecture influences both end-task accuracy and the qualitative properties of the learned representations.
Previous work has used LSTM-based biLMs, but this might not be the best possible architecture; alternatives include gated CNNs and feed-forward self-attention approaches such as the Transformer. Each architecture represents information in a different manner.
There is a tradeoff between speed and accuracy: the LSTMs are the most accurate, while the alternative architectures are faster.
- Word embedding layer: focuses exclusively on word morphology
- Lowest contextual layers of biLM: local syntax
- Upper layers: semantic content
- LSTM
  - Large hidden state while reducing the total # parameters (see the sketch after this list)
- Transformer
  - Feed-forward, self-attention-based architecture. Good for MT, constituency parsing, and semantic role labeling
- Gated CNN
  - Good for sequence-to-sequence MT
- Pre-trained biLMs
  - While it's possible to reduce perplexities for all models by scaling up, the goal is to compare representations across biLM architectures at approximately equal skill, as measured by perplexity (the exponentiated average per-token negative log-likelihood)
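As a concrete reference point, here is a minimal PyTorch sketch of an LSTM biLM (sizes and names are hypothetical, not the paper's code) where projection layers keep the hidden state large while shrinking the recurrent parameter count:

```python
import torch
import torch.nn as nn

class BiLM(nn.Module):
    """Sketch of a bidirectional LM: two independent LSTMs, one per direction."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=4096, proj_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # proj_size projects the large hidden state back down to proj_dim,
        # keeping capacity high while reducing the total parameter count.
        self.fwd = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                           proj_size=proj_dim, batch_first=True)
        self.bwd = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                           proj_size=proj_dim, batch_first=True)
        self.out = nn.Linear(proj_dim, vocab_size)  # softmax over the vocab

    def forward(self, tokens):                      # tokens: (batch, seq)
        x = self.embed(tokens)
        h_fwd, _ = self.fwd(x)                      # left-to-right context
        h_bwd, _ = self.bwd(x.flip(1))              # right-to-left context
        h_bwd = h_bwd.flip(1)                       # realign to original order
        return self.out(h_fwd), self.out(h_bwd)    # next/previous-token logits
```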
We evaluate the quality of the pre-trained biLM representations as ELMo-like contextual word vectors in state-of-the-art models across a suite of four benchmark NLP tasks (Table 2 in the paper). Across all tasks, the LSTM architectures perform best.
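An "ELMo-like" contextual vector is a task-learned, softmax-weighted combination of the biLM layers scaled by a scalar γ; a minimal sketch of that scalar mix (shapes and names are my assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style mix: gamma * sum_j softmax(w)_j * h_j over biLM layers."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_layers))  # one logit per layer
        self.gamma = nn.Parameter(torch.ones(()))       # global task scale

    def forward(self, layer_reps):
        # layer_reps: (num_layers, batch, seq, dim) stacked biLM activations
        s = torch.softmax(self.w, dim=0)
        return self.gamma * torch.einsum('l,lbsd->bsd', s, layer_reps)
```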
We consider the intra-sentence contextual similarity of words and phrases, and show that the biLM word embeddings capture little semantic information; it is instead represented in the contextual layers.
Intra-sentence similarity
- The lower layers capture mostly local information
- Top layer represents longer range relationships
- At the lowest layers, the biLM tends to place words from the same syntactic constituents in similar parts of the vector space
- The biLM implicitly learns other linguistic information in the upper layers: all verbs have high similarity to one another, suggesting the biLM captures POS information, and the model also implicitly learns to perform coreference resolution (see the sketch below)
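These observations come from similarity maps that can be reproduced with plain cosine similarity over one layer's contextual vectors; a short sketch (the input tensor is an assumption):

```python
import torch
import torch.nn.functional as F

def intra_sentence_similarity(reps: torch.Tensor) -> torch.Tensor:
    # reps: (seq_len, dim) contextual vectors for one sentence from one layer
    normed = F.normalize(reps, dim=-1)  # unit-length rows
    return normed @ normed.t()          # (seq_len, seq_len) cosine matrix
```

Plotting this matrix per layer shows the shift from local, constituent-level structure at the bottom to longer-range relationships at the top.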
The layers best suited for constituency parsing are at or above the layers with maximum POS accuracies, and the layers most transferable to parsing are at or below the layers with maximum pronominal coreference accuracy in all models.
The maximum layer weights occur below the top layers: the most transferable contextual representations tend to occur in the middle layers, while the top layers specialize for language modeling.
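Layer-wise accuracies like these are typically measured with a linear probe trained on frozen activations; a hedged sketch of that setup (my assumption, not the paper's exact protocol):

```python
import torch
import torch.nn as nn

def probe_layer(reps, labels, num_tags, epochs=100, lr=1e-2):
    # reps: (num_tokens, dim) frozen activations from one biLM layer
    # labels: (num_tokens,) gold tag ids (e.g., POS)
    clf = nn.Linear(reps.size(1), num_tags)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(clf(reps), labels).backward()
        opt.step()
    with torch.no_grad():  # for a real probe, score held-out tokens instead
        return (clf(reps).argmax(-1) == labels).float().mean().item()
```

Running the same probe on each layer traces out where POS, coreference, and parsing information peak in the network.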
biLM representations are far from perfect; during training, they have access to only surface forms of words and their order, meaning deeper linguistic phenomena must be learned "tabula rasa". Infusing models with explicit syntactic structure or other linguistically motivated inductive biases may overcome some of the limitations of sequential biLMs.