This directory provides several baseline methods for conversational response selection, to help benchmark performance on the various datasets. Baselines are intended to be relatively fast to run locally, and are not intended to be highly competitive with state-of-the-art methods. As such, they are limited to using a small portion of the training set, typically ten thousand randomly sampled examples.
Note that baselines only use the `context` feature to rank the response, and do not take into account the `extra_contexts`.
The keyword-based baselines use keyword similarity metrics to rank responses given a context. The `TF_IDF` method computes inverse document frequency statistics on the training set, and scores responses using their tf-idf cosine similarity to the context.
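For intuition, here is a minimal sketch of tf-idf ranking using scikit-learn; the texts are made up, and the baseline's actual tokenization and weighting details may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# IDF statistics are computed on the training set.
train_texts = ["hi how are you", "fine thanks and you", "what is the weather"]
vectorizer = TfidfVectorizer()
vectorizer.fit(train_texts)

# Rank candidate responses by tf-idf cosine similarity to the context.
context = ["how are you doing"]
responses = ["fine thanks and you", "the weather is sunny"]
scores = cosine_similarity(
    vectorizer.transform(context), vectorizer.transform(responses))[0]
print(sorted(zip(scores, responses), reverse=True))  # best response first
```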
The `BM25` method builds on top of the tf-idf similarity, applying an adjustment to the term weights. See *Okapi BM25: a non-binary model* for further discussion of the approach.
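For orientation, the standard Okapi weighting scores a response R for a context C as (the exact parameter values used by the baseline may differ):

    score(C, R) = Σ_{t ∈ C} IDF(t) · f(t, R) · (k1 + 1)
                  / (f(t, R) + k1 · (1 − b + b · |R| / avgdl))

where f(t, R) is the frequency of term t in R, |R| is the length of R in tokens, avgdl is the average document length, and k1 and b are free parameters, typically k1 ∈ [1.2, 2.0] and b = 0.75.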
The vector-based methods use publicly available neural network embedding models, all loaded from TensorFlow Hub, to embed contexts and responses into a vector space (a minimal embedding example is given after the list). The models currently implemented are:

- `USE` - the universal sentence encoder
- `USE_LARGE` - a larger version of the universal sentence encoder
- `ELMO` - the Embeddings from Language Models approach
- `BERT_SMALL` - the Bidirectional Encoder Representations from Transformers approach
- `BERT_LARGE` - a larger version of BERT
- `USE_QA` - the dual question/answer encoder version of the universal sentence encoder. Note this encodes contexts and responses using separate subnetworks, and `USE_QA_SIM` amounts to ranking with the pre-trained dot-product score.
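As a point of reference, here is a minimal sketch of embedding sentences with the universal sentence encoder from TensorFlow Hub. This uses the public module URL and the TF2-style API; the baselines' own loading code may differ:

```python
import tensorflow_hub as hub

# Load the universal sentence encoder from TensorFlow Hub. The first call
# downloads the module; later calls reuse the cache (see TFHUB_CACHE_DIR below).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

vectors = embed(["how are you doing?", "fine, thanks!"])
print(vectors.shape)  # (2, 512): one 512-dimensional vector per sentence
```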
There are two vector-based baseline methods that can be applied to each of the above models. The `SIM` method ranks responses according to their cosine similarity with the context vector; it does not use the training set at all.
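A minimal sketch of the `SIM` scoring rule, in plain numpy for illustration only:

```python
import numpy as np

def sim_scores(context_vec, response_vecs):
    """Cosine similarity of one context vector with each candidate response."""
    context_vec = context_vec / np.linalg.norm(context_vec)
    response_vecs = response_vecs / np.linalg.norm(
        response_vecs, axis=1, keepdims=True)
    # With unit-norm vectors, cosine similarity is just a dot product.
    return response_vecs @ context_vec
```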
The `MAP` method learns a linear mapping on top of the response vector. The final score of a response with vector y, given a context with vector x, is the cosine similarity of the context vector with the mapped response vector:

    score(x, y) = cos(x, y + α · W y)

where W and α are learned parameters. This allows for learning an arbitrary linear mapping on the response side, while making it easy for the model to interpolate with the `SIM` baseline using the residual connection gated by α. Vectors are L2-normalized before being fed to the `MAP` method, so that the method is invariant to scaling.
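Under the scoring rule above, the computation looks roughly as follows (numpy sketch; `W` and `alpha` stand in for the learned parameters):

```python
import numpy as np

def map_scores(context_vec, response_vecs, W, alpha):
    """cos(x, y + alpha * W y) for a context x and unit-norm responses y."""
    # Residual connection: the identity path preserves the SIM behaviour,
    # while the alpha-gated linear map lets the model deviate from it.
    mapped = response_vecs + alpha * response_vecs @ W.T
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    return mapped @ (context_vec / np.linalg.norm(context_vec))
```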
The parameters are learned on the training set, using the dot-product loss from Henderson et al. (2017). A sweep over learning rate and regularization parameters is performed using a held-out dev set, and the final learned parameters are used on the evaluation set.
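The dot-product loss treats the other responses in a batch as negatives: scores are computed for every (context, response) pair in the batch, and a softmax over each row is trained to pick out the true response. A numpy sketch of the objective (the baselines implement this in TensorFlow):

```python
import numpy as np

def batch_dot_product_loss(scores):
    """Mean negative log-probability of the matching response.

    scores[i, j] is the score of response j against context i; the diagonal
    entries are the true (context, response) pairs, and all other entries
    in a row act as sampled negatives.
    """
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()
```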
The combination of these embedding models with the two vector-based methods gives the baseline methods `USE_SIM`, `USE_MAP`, `USE_LARGE_SIM`, `USE_LARGE_MAP`, `ELMO_SIM`, `ELMO_MAP`, `BERT_SMALL_SIM`, `BERT_SMALL_MAP`, `BERT_LARGE_SIM` and `BERT_LARGE_MAP`.
To get the standard random sampling of the train and test sets, please get in touch with Matt. You can also generate the data yourself and copy it locally, though this may give slightly different results:
    mkdir data
    gsutil cp ${DATADIR?}/train-00001-* data/
    gsutil cp ${DATADIR?}/test-00001-* data/
For Amazon QA data, you will need to copy two shards of the test set to get enough examples.
This provides a random subset of the train and test set to use for the baselines. Recall that conversational datasets are always randomly shuffled and sharded.
We recommend using `run_baselines.ipynb` to run the baselines on Google Colab, using a free GPU.
When running vector-based methods, make use of TensorFlow Hub's caching to speed up repeated runs:

    export TFHUB_CACHE_DIR=~/.tfhub_cache
Then run an individual baseline with:

    python baselines/run_baseline.py \
      --method TF_IDF \
      --train_dataset data/train-* \
      --test_dataset data/test-*
Note that the `USE_LARGE`, `ELMO` and `BERT`-based baselines are slow, and may benefit from faster hardware. For these methods, set `--eval_num_batches 100`.