Code for generating and training sentence embeddings with semantic features. Two main goals:
- increase interpretability of sentence embeddings and explain similarity
- effective aspectual clustering and semantic search
For more information, background, and a demonstration, please check our AACL paper.
Please make sure to have at least the following packages installed:
Package | Version tested |
---|---|
torch | 1.11.0 |
transformers | 4.16.1 |
sentence-transformers | 2.1.0 |
numpy | 1.21.2 |
scipy | 1.7.3 |
huggingface-hub | 0.10.0 |
python | 3.8.12 |
Command for installing all needed PyPI packages:
pip install \
torch==1.11.0+cu113 \
transformers==4.16.1 \
sentence-transformers==2.1.0 \
numpy==1.21.2 \
scipy==1.7.3 \
huggingface-hub==0.10.0 \
--extra-index-url https://download.pytorch.org/whl/cu113
The Dockerfile can be built by executing docker build -t s3bert . in the project's root directory. This will build a Docker container based on Ubuntu 20.04 with CUDA version 11.4.3, including all necessary Python packages and the default training data. If you do not want the training data included in your container, comment out the last three lines of the Dockerfile by adding a # at the beginning of each line.
To work with the locally built container, run docker run -it --gpus all s3bert. Attention: this will allocate all available GPUs to the container. If you want to allocate only one device, replace "all" with, e.g., "device=0".
The script src/check_cuda.py can be used to check GPU capabilities after starting the container.
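If you want to verify GPU visibility by hand, the following is a minimal sketch of such a check (illustrative only; this is an assumption about what a GPU check looks like, not the actual contents of src/check_cuda.py):

```python
# Minimal GPU check (illustrative sketch; not the actual src/check_cuda.py).
import torch

if torch.cuda.is_available():
    print("CUDA available, version:", torch.version.cuda)
    for i in range(torch.cuda.device_count()):
        print(f"device {i}:", torch.cuda.get_device_name(i))
else:
    print("No GPU visible to PyTorch.")
```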
The basic idea is simple:
- Define/apply metrics that measure similarity with regard to aspects or topics that you're interested in.
- Assign a specific sub-embedding to each metric.
- During training, the model learns to route information into the assigned sub-embeddings so that they reflect your metrics of interest. The power of the overall embedding is preserved with a consistency control.
- At inference time, you can see how the aspects have modulated the overall text similarity decision (see the sketch below).
Note that the (possibly costly) computation of the metrics from the first step is not needed at inference time.
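To make the interpretability claim concrete: if the full embedding is the concatenation of the aspect sub-embeddings plus a residual, the dot product of two full embeddings splits exactly into per-aspect contributions. Below is a minimal sketch of this decomposition; the dimension layout and names are assumptions for illustration (the real layout is defined in config.py):

```python
import numpy as np

def aspect_contributions(emb_a, emb_b, feadim, n):
    """Split the full cosine similarity into per-aspect contributions.

    Assumes the first n * feadim dimensions hold the n aspect sub-embeddings
    and the remaining dimensions hold the residual (hypothetical layout).
    """
    norm = np.linalg.norm(emb_a) * np.linalg.norm(emb_b)
    contributions = []
    for k in range(n):
        a_k = emb_a[k * feadim:(k + 1) * feadim]
        b_k = emb_b[k * feadim:(k + 1) * feadim]
        contributions.append(float(a_k @ b_k) / norm)
    residual = float(emb_a[n * feadim:] @ emb_b[n * feadim:]) / norm
    # The aspect contributions plus the residual sum to the full cosine similarity.
    return contributions, residual
```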
Rule of thumb for size of feature dimensions: From experience with different models that use 15 similarity aspect metrics, about 1/3 of the embedding may be reserved for the residual.
- edim: size of the sentence embedding
- n: number of custom metrics
- feadim: size of a sentence feature (sentence sub-embedding)
Then feadim can be set approximately to (edim - edim / 3) / n.
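A worked example with hypothetical numbers (a 768-dimensional embedding and 15 metrics):

```python
# Hypothetical numbers: a 768-dim sentence embedding and 15 aspect metrics.
edim, n = 768, 15
residual_dim = edim // 3             # ~1/3 reserved for the residual -> 256
feadim = (edim - residual_dim) // n  # (768 - 256) / 15 -> 34
print(feadim)                        # 34 dimensions per sub-embedding
```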
In our paper, we define metrics between abstract meaning representations (AMRs) such that we can measure, e.g., coreference or quantification similarity of sentences and see how these sub-similarities modulate the overall similarity.
The data contains the sentences and AMRs with AMR metric scores (note: we only need the metric scores and sentences; the AMR graphs are attached only for potential further experimentation).
Download and extract data:
wget https://cl.uni-heidelberg.de/~opitz/data/amr_data_set.tar.gz
tar -xvzf amr_data_set.tar.gz
To see what the format of the training data should look like, run:
cd src/
python data_helpers.py
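Conceptually, each training example pairs two sentences with their precomputed metric scores. The snippet below is a hypothetical illustration only (the field names are made up); the authoritative format is what data_helpers.py prints:

```python
# Hypothetical illustration only -- run data_helpers.py to see the real format.
example = {
    "sentence_a": "A man is playing a guitar.",
    "sentence_b": "Someone plays an instrument.",
    # one precomputed score per aspect metric (here: AMR metric scores)
    "scores": {"concepts": 0.71, "frames": 0.66, "negation": 1.0},  # ...
}
```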
Simply run
cd src/
python s3bert_train.py
Some settings can be adjusted in config.py. For other settings, the source code must be consulted.
We have prepared an example script:
cd src/
python s3bert_infer.py
Check out its content for info on how to obtain and use the embeddings.
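As a rough orientation only (the script itself is the authoritative reference), obtaining embeddings and per-aspect similarities conceptually looks like the sketch below; the loading route via sentence-transformers, the model path, and the slicing constants are assumptions, not the actual code of src/s3bert_infer.py:

```python
# Illustrative sketch only -- see src/s3bert_infer.py for the actual usage.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed loading route

model = SentenceTransformer("src/s3bert_all-MiniLM-L12-v2")  # hypothetical path
emb = model.encode(["A man plays a guitar.", "Someone plays an instrument."])

def cosine(x, y):
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Full-embedding similarity (what a standard SBERT model would report).
print("overall:", cosine(emb[0], emb[1]))

# Per-aspect similarities from the dedicated sub-embeddings; FEADIM and the
# aspect ordering must match the config.py shipped with the model.
FEADIM = 16  # hypothetical value
for k in range(15):
    a = emb[0][k * FEADIM:(k + 1) * FEADIM]
    b = emb[1][k * FEADIM:(k + 1) * FEADIM]
    print(f"aspect {k}:", cosine(a, b))
```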
We provide pre-trained models here:
Model name | model link | s3bert config |
---|---|---|
s3bert_all-mpnet-base-v2 | model | config |
s3bert_all-MiniLM-L12-v2 | model | config |
Downloaded S3BERT models can be unpacked into src/:
tar -xvzf s3bert_all-MiniLM-L12-v2 -C src/
To use a pre-trained model, see above (S3BERT embeddings: inference). Use the model-specific config.py (see table above); it is needed so that we know which feature dimensions are assigned to which metric.
All numbers are Spearman's r.
Model | STSB | SICKR | UKPASPECT | Concepts | Frames | Named Ent. | Negations | Coreference | SRL | Smatch | Unlabeled | max_indegree_sim | max_outdegree_sim | max_degree_sim | root_sim | quant_sim | score_wlk | score_wwlk |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
s3bert_all-mpnet-base-v2 | 83.5 | 81.1 | 57.9 | 79.8 | 73.0 | 54.5 | 34.9 | 54.9 | 69.8 | 74.7 | 72.0 | 36.2 | 49.6 | 35.3 | 52.3 | 75.3 | 80.8 | 80.3 |
all-mpnet-base-v2 | 83.4 | 80.5 | 56.2 | 74.3 | 41.5 | -12.7 | -0.3 | 9.0 | 42.8 | 57.6 | 52.1 | 23.6 | 21.1 | 17.7 | 22.9 | 10.8 | 68.3 | 66.6 |
s3bert_all-MiniLM-L12-v2 | 83.7 | 78.9 | 56.6 | 74.3 | 66.3 | 51.0 | 33.4 | 44.1 | 61.4 | 67.5 | 65.1 | 31.9 | 42.4 | 29.5 | 43.6 | 73.6 | 74.6 | 74.2 |
all-MiniLM-L12-v2 | 83.1 | 78.9 | 54.2 | 76.7 | 37.3 | -12.8 | -3.8 | 7.7 | 42.1 | 56.3 | 51.5 | 23.8 | 19.0 | 19.0 | 20.1 | 9.4 | 66.3 | 63.5 |
For both SBERT and S3BERT, the similarity for every pair is calculated on the full embeddings (cosine similarity).
- STSB: results on the human sentence similarity benchmark STS Benchmark
- SICKR: results on the human relatedness benchmark SICK
- UKPASPECT: results on the human argument similarity benchmark UKP ASPECT
For non-S3BERT models, the aspect similarity is calculated via the full embedding (i.e., they give the same similarity for every aspect). For S3BERT models, the aspect similarities are calculated from the dedicated sub-embeddings.
- Concepts: Similarity w.r.t. the concepts in the sentences
- Frames: Similarity w.r.t. the predicates in the sentences
- Named Ent.: Similarity w.r.t. the named entities in the sentences
- Negations: Similarity w.r.t. the negation structure of the sentences
- Coreference: Similarity w.r.t. the coreference structure of the sentences
- SRL: Similarity w.r.t. the semantic role structure of the sentences
- Smatch: Similarity w.r.t. the overall similarity of the sentences' semantic meaning structures
- Unlabeled: Similarity w.r.t. the overall similarity of the sentences' semantic meaning structures, ignoring relation labels
- max_indegree_sim / max_outdegree_sim / max_degree_sim / root_sim: Similarity w.r.t. the most connected nodes in the meaning space ("Focus")
- quant_sim: Similarity w.r.t. the quantificational structure of the sentences (three vs. four, a vs. all, etc.)
- score_wlk: See Smatch, but measured with a contextual Weisfeiler-Leman kernel instead of Smatch
- score_wwlk: See Smatch, but measured with a Wasserstein Weisfeiler-Leman kernel instead of Smatch
If you find the work interesting, consider citing:
@article{opitz2022sbert,
title={SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features},
author={Opitz, Juri and Frank, Anette},
journal={arXiv preprint arXiv:2206.07023},
year={2022}
}