This is the repository for the experiments and collected corpora from the paper "The Global Anchor Method for Quantifying Linguistic Shifts and Domain Adaptation" (NeurIPS 2018).
Paper: https://papers.nips.cc/paper/8152-the-global-anchor-method-for-quantifying-linguistic-shifts-and-domain-adaptation
arXiv Category Corpora: https://gitlab.com/vinsachi/arxiv-category-corpora
```
@inproceedings{
title={The Global Anchor Method for Quantifying Linguistic Shifts and Domain Adaptation},
author={Yin, Zi and Sachidananda, Vin and Prabhakar, Balaji},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2018}
}
```
The global anchor method is a powerful tool for comparing language usage between different corpora through word vectors (a minimal sketch of the metric is given after this list). It can be used for:
- Transfer learning: determining whether a model trained on one corpus will transfer to another. If the corpora differ substantially in their language usage, transfer learning may not perform well.
- Discovering linguistic shifts: determining the rate at which language changes over time.
- Discovering domain variations: discovering how language usage deviates across domains.
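At its core, the method represents every word by its inner products with all words of a shared vocabulary (the "anchors") and compares two corpora through the resulting matrices. Below is a minimal sketch, assuming two embedding matrices `E1` and `E2` whose rows are aligned to a common vocabulary; the function and variable names are illustrative, and the repository's scripts remain the reference implementation.

```python
import numpy as np

def global_anchor_distance(E1, E2):
    """Anchor-based dissimilarity between two embeddings sharing a vocabulary.

    E1 and E2 are (n_words x d1) and (n_words x d2) arrays whose rows
    correspond to the same words in the same order; d1 and d2 may differ,
    which is one reason the method is easier to apply than alignment.
    """
    # Represent each word by its inner products with every word in the
    # vocabulary (the anchors), then compare the resulting Gram matrices.
    G1 = E1 @ E1.T
    G2 = E2 @ E2.T
    return np.linalg.norm(G1 - G2, ord="fro")
```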
In particular, we showed that the global anchor method
- is theoretically as powerful as the alignment method (a numerical comparison of the two is sketched after this list),
- is more widely applicable and easier to implement in practice than the alignment method (e.g., it can compare embeddings with different dimensionalities), and
- reveals finer structure than frequency-based methods (e.g., Pechenick et al., "Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution").
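For comparison, here is a minimal sketch of the alignment (orthogonal Procrustes) loss alongside the anchor loss on toy data. The paper proves the two are equivalent as metrics; this snippet only illustrates how such a numerical check could look, and the random test data and names are assumptions rather than part of the repository.

```python
import numpy as np

def alignment_distance(E1, E2):
    """Alignment loss: minimum over orthogonal Q of ||E1 Q - E2||_F.

    Unlike the anchor method, this requires E1 and E2 to have the same
    dimensionality.  The optimal Q has a closed form via the SVD of
    E1^T E2 (orthogonal Procrustes).
    """
    U, _, Vt = np.linalg.svd(E1.T @ E2)
    Q = U @ Vt
    return np.linalg.norm(E1 @ Q - E2, ord="fro")

# Toy check: perturb one embedding and compare the two losses.
rng = np.random.default_rng(0)
E1 = rng.standard_normal((500, 50))
E2 = E1 + 0.05 * rng.standard_normal((500, 50))
anchor = np.linalg.norm(E1 @ E1.T - E2 @ E2.T, ord="fro")
print(f"alignment loss: {alignment_distance(E1, E2):.3f}, anchor loss: {anchor:.3f}")
```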
Here is a short overview of what is in this directory.
File | What's in it? |
---|---|
equivalence.py | In the paper we showed that the alignment and global anchor methods, when viewed as metrics, are equivalent. This script provides numerical verification of that claim. |
jsd_loss.ipynb | Computes the Jensen-Shannon divergence for the Google Ngram corpus. We demonstrate that the JSD method does not provide the fine-grained detail that our method does; in particular, it does not capture the effect of wars on English language and literature. (A minimal JSD sketch appears after this table.) |
laplacian.ipynb | The Laplacian method for language-evolution trajectories and topic clustering. (A minimal sketch appears after this table.) |
pip_loss.ipynb | Computes the PIP loss between every pair of years of the Google Ngram corpus. |
plot.ipynb | Plots the effect of wars on the evolution of the English language. |
validate_equivalence.ipynb | Empirical validation of the equivalence between the global anchor method and the alignment method. |
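As a reference for the frequency-based baseline used in `jsd_loss.ipynb`, here is a minimal sketch of the Jensen-Shannon divergence between unigram distributions of two corpora; the toy corpora and helper names are illustrative assumptions, and the notebook's exact preprocessing may differ.

```python
import numpy as np
from collections import Counter

def jensen_shannon_divergence(p, q):
    """JSD between two aligned discrete probability vectors (base-2 logs)."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def unigram_distribution(tokens, vocab):
    counts = Counter(tokens)
    freqs = np.array([counts[w] for w in vocab], dtype=float)
    return freqs / freqs.sum()

# Toy example with two tiny "corpora".
corpus_a = "the war changed the language of the decade".split()
corpus_b = "the language of the decade changed slowly".split()
vocab = sorted(set(corpus_a) | set(corpus_b))
p = unigram_distribution(corpus_a, vocab)
q = unigram_distribution(corpus_b, vocab)
print(jensen_shannon_divergence(p, q))
```

And a minimal sketch of a Laplacian (spectral) embedding of a dissimilarity matrix, in the spirit of `laplacian.ipynb`. The affinity construction here (Gaussian kernel with a median-bandwidth heuristic) is an assumption for illustration and may not match the notebook exactly.

```python
import numpy as np

def laplacian_embedding(D, n_components=2, sigma=None):
    """Embed the rows of a symmetric dissimilarity matrix D (e.g. pairwise
    anchor distances between yearly embeddings) into a low-dimensional space,
    useful for plotting evolution trajectories or clustering topics."""
    if sigma is None:
        sigma = np.median(D[D > 0])                   # bandwidth heuristic
    W = np.exp(-(D ** 2) / (2 * sigma ** 2))          # affinity matrix
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)              # ascending eigenvalues
    # Drop the trivial eigenvector associated with eigenvalue ~ 0.
    return eigvecs[:, 1:n_components + 1]
```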
We also provide a set of processed corpora:
Dataset name | Download |
---|---|
Google Books | Google Books Ngram Dataset (we have trained a set of word vectors for the years 1800-2008, which can be found here) |
arXiv Category Corpora | Repository. This repository contains text corpora of academic papers submitted to arXiv between January 2007 and December 2017, separated by category. |