Multitask-learning of BERT model for sentiment analysis, textual similarity and paraphrase detection tasks. Training of the BART model on paraphrase detection and paraphrased sentence generation. State-of-the-art methods were implemented to enhance the performance of models.


DNLP SS24 Final Project – BERT for Multitask Learning and BART for Paraphrasing

Introduction

Python 3.10 PyTorch 2.0 Apache License 2.0 Final Black Code Style AI-Usage Card

This repository is our official implementation of the Multitask BERT project for the Deep Learning for Natural Language Processing course at the University of Göttingen.

A pre-trained BERT model (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding) was used as the basis for our experiments. It was fine-tuned on three tasks (sentiment analysis, textual similarity, and paraphrase detection) using a multitask learning approach: the model is trained on all three tasks simultaneously, with a single shared BERT encoder and three separate task-specific classifiers.

Requirements

To install requirements and all dependencies using conda, run:

bash setup_gwdg.sh

This will download and install Miniconda on your machine, create the project's conda environment, and activate it. It will also download the models used throughout the project as well as the POS and NER tags from spaCy.

Data

We describe the datasets we used in the following table.

| Dataset | Task | Description | Size |
| --- | --- | --- | --- |
| Quora Dataset (QQP) | Paraphrase detection | Two sentences are given as input and a binary label (0, 1) is output, indicating whether the sentences are paraphrases of one another | Train: 121,620 / Dev: 40,540 / Test: 40,540 |
| Stanford Sentiment Treebank (SST) | Sentiment analysis (classification) | One sentence is given as input to be classified on a scale from 0 (most negative) to 4 (most positive) | Train: 7,111 / Dev: 2,365 / Test: 2,371 |
| SemEval STS Benchmark Dataset | Textual similarity (regression) | Two sentences are given as input and their mutual relation is evaluated on continuous labels from 0 (least similar) to 5 (most similar) | Train: 5,149 / Dev: 1,709 / Test: 1,721 |
| Extended Typology Paraphrase Corpus (ETPC) | Paraphrase detection / generation | Two sentences are given as input and it is decided whether the pair are paraphrases of each other, either "yes" (1) or "no" (0), according to the MRPC and ETPC annotation schemas | Train: 2,020 / Dev: 505 / Test: 573 (detection), 701 (generation) |

Training

To train the multitask BERT and BART models, activate the environment and run

python -u train_multitask_pal.py --option finetune --use_gpu --local_files_only --smoketest --no_tensorboard --no_train_classifier --use_smart_regularization --use_pal --use_amp

Alternatively, you can run the run_train.sh script, which also has the option to run on the GWDG cluster and submit training jobs. You can also choose different scripts to run from that file and set their parameters.

There are many parameters to set. To see all of them, run python <filename> --help. These are the most important ones in the train_multitask.py and train_multitask_pal.py files. Please note that parameters from branches other than main are included as well:

| Parameter | Description |
| --- | --- |
| --additional_input | Activates the use of POS and NER tags as additional input to BERT |
| --batch_size | Batch size |
| --clip | Gradient clipping value |
| --epochs | Number of epochs |
| --hidden_dropout_prob | Dropout probability for hidden layers |
| --hpo_trials | Number of trials for hyperparameter optimization |
| --hpo | Activate hyperparameter optimization |
| --lr | Learning rate |
| --optimizer | Optimizer to use. Options are AdamW and SophiaH |
| --option | Determines whether BERT parameters are frozen (pretrain) or updated (finetune) |
| --samples_per_epoch | Number of samples per epoch |
| --scheduler | Learning rate scheduler to use. Options are plateau, cosine, linear_warmup, or None for no scheduler |
| --unfreeze_interval | Number of epochs until the next BERT layer is unfrozen |
| --use_gpu | Whether to use the GPU |
| --weight_decay | Weight decay for the optimizer |
| --smoketest | Test and debug the code implementation before the actual run |
| --pooling | Add a pooling layer before the classifier |
| --layers | Select more layers of the model to train |
| --add_layers | Add more linear layers to train before the classifier for the SST task |
| --combined_models | Use combined BERT models to train |
| --train_mode | Specifies whether to train on the last hidden state, the pooler output, or certain layers of the model. Options are all_pooled, last_hidden_state, single_layers |
| --max_length | Maximum length of tokens to chunk |
| --n_hidden_layers | Number of hidden layers to use |
| --task_scheduler | How to schedule the tasks during training. Options are random, round_robin, pal, para, sts, sst |
| --projection | How to handle competing gradients from different tasks. Options are pcgrad, vaccine, none |
| --combine_strategy | Needed when using projection. Options are encourage, force, none |
| --use_pal | Add projected attention layers on top of BERT |
| --patience | Number of epochs to wait without improvement before training stops |
| --use_smart_regularization | Use the implemented PyTorch package to regularize the weights and reduce overfitting |
| --write_summary or --no_tensorboard | Whether to save training logs for later viewing in TensorBoard |
| --model_name | Choose the model to pre-train or fine-tune |

The parameters for bart_generation.py are the following:

| Parameter | Description |
| --- | --- |
| --seed | Seed for random numbers |
| --use_gpu | Whether to use the GPU |
| --batch_size | Batch size |
| --epochs | Number of epochs |
| --lr | Learning rate |
| --etpc_train | Train dataset |
| --etpc_dev | Dev dataset |
| --etpc_test | Test dataset |
| --similarity_weight | Weight for the similarity loss (default: 0.0) |
| --dissimilarity_weight | Weight for the dissimilarity loss (default: 0.0) |
| --copy_penalty_weight | Weight for the copy penalty loss (default: 0.0) |
| --noise | Proportion of words to modify (swap, delete, replace) (default: 0.0) |
| --synonym_prob | Probability of replacing a word with a synonym (default: 0.0) |

The parameters for bart_detection.py are the following:

| Parameter | Description |
| --- | --- |
| --seed | Seed for random numbers |
| --use_gpu | Whether to use the GPU |
| --batch_size | Batch size |
| --epochs | Number of epochs |
| --lr | Learning rate |
| --etpc_train | Train dataset |
| --etpc_dev | Dev dataset |
| --etpc_test | Test dataset |
| --noise | Proportion of words to modify (swap, delete, replace) (default: 0.0) |
| --synonym_prob | Probability of replacing a word with a synonym (default: 0.0) |
| --type_2 | Weight multiplier for paraphrase type 2 (default: 0.3) |
| --type_6 | Weight multiplier for paraphrase type 6 (default: 0.3) |
| --type_7 | Weight multiplier for paraphrase type 7 (default: 0.3) |

Evaluation

The model is evaluated after each epoch on the validation set. The results are printed to the console and, unless --no_tensorboard is set, saved in the --logdir directory. The model checkpoints are saved after each epoch in the filepath directory. After training finishes, the model is loaded and evaluated on the test set. The model's predictions are then saved in the predictions_path.

Results

In the first phase of the project, we aimed to establish a baseline score for each task by training the model on each task separately with the following parameters:

  • option: finetune
  • epochs: 10
  • learning rate: 1e-5
  • hidden dropout prob: 0.1
  • batch size: 64

In the second phase, we used multitask training to let the model train on the data of all tasks, but with a different classifier for each task. This helped the model learn and improved performance, since the tasks are related. In addition, we performed single-task training for the tasks that did not improve through multitasking and tried other approaches to improve their scores.

For the multitask training, we used the following parameters:

  • option: finetune
  • epochs: 20
  • learning rate: 8e-5
  • hidden dropout prob: 0.3
  • batch size: 64
  • optimizer: AdamW
  • clip: 0.25
  • weight decay: 0.05
  • samples per epoch: 20000

For bart_generation, we used the following parameters:

  • use_gpu
  • epochs 5
  • lr 1e-5
  • similarity_weight 0.2
  • dissimilarity_weight 0.6
  • copy_penalty_weight 1.5
  • noise 0.2
  • synonym_prob 0.3

For bart_detection, we used the following parameters:

  • use_gpu
  • epochs 12
  • lr 1e-5
  • noise 0.2
  • synonym_prob 0.3

This gave us a standard training framework from which to try other options for further improvement. The training parameters used, as well as the branch where the related scripts exist, can be found in the corresponding slurm file.


We achieved the following results.

Paraphrase Detection is the task of finding paraphrases of texts in a large corpus of passages. Paraphrases are “rewordings of something written or spoken by someone else”; paraphrase detection thus essentially seeks to determine whether particular words or phrases convey the same semantic meaning. This task measures how well systems can understand fine-grained notions of semantic meaning.

| Model name | Description | Accuracy | Link to Slurm |
| --- | --- | --- | --- |
| Pcgrad projection | Projected gradient descent with round-robin scheduler | 86.4% | pcgrad |
| Vaccine projection | Surgery of the gradient with round-robin scheduler | 85.9% | vaccine |
| Combined BERT models | 4 BERT models combined (3 sub-models and a gating network) | 85.7% | combined BERT |
| Augmented Attention Multitask | Attention layer on top of BERT | 84.4% | attention multitask |
| BiLSTM-Multitask | BiLSTM layer on top of BERT | 83.9% | bilstm multitask |
| Pal scheduler without vaccine | Apply the pal scheduler only | 83.1% | pal no vaccine |
| Sophia with additional inputs | Sophia optimizer with POS and NER inputs | 82.5% | sophia with additional inputs |
| Bert-large | Use the bert-large model for multitasking | 82.4% | bert-large |
| Max pooling | Max of the last hidden states' sequence | 81.9% | max pooling |
| Baseline (single task) | Single-task training | 80.6% | baseline |
| Hierarchical-BERT | Chunking the input up to a segment length | 75.6% | hierarchical bert |

A basic task in understanding a given text is classifying its polarity (i.e., whether the expressed opinion in a text is positive, negative, or neutral). Sentiment analysis can be utilized to determine individual feelings towards particular products, politicians, or within news reports. Each phrase has a label of negative, somewhat negative, neutral, somewhat positive, or positive.

| Model name | Description | Accuracy | Link to Slurm |
| --- | --- | --- | --- |
| BiLSTM-Multitask | BiLSTM layer on top of BERT | 53.0% | bilstm multitask |
| Max pooling | Max of the last hidden states' sequence | 52.8% | max pooling |
| Baseline (single task) | Single-task training | 52.2% | baseline |
| Bert-large | Use the bert-large model for multitasking | 51.2% | bert-large |
| Vaccine projection | Surgery of the gradient with round-robin scheduler | 50.4% | vaccine |
| Augmented Attention Multitask | Attention layer on top of BERT | 50.2% | attention multitask |
| Combined BERT models | 4 BERT models combined (3 sub-models and a gating network) | 49.1% | combined BERT |
| Sophia with additional inputs | Sophia optimizer with POS and NER inputs | 48.8% | sophia with additional inputs |
| Hierarchical-BERT | Chunking the input up to a segment length | 48.7% | hierarchical bert |
| Pcgrad projection | Projected gradient descent with round-robin scheduler | 47.9% | pcgrad |
| Pal scheduler without vaccine | Apply the pal scheduler only | 45.0% | pal no vaccine |

The semantic textual similarity (STS) task seeks to capture the notion that some texts are more similar than others; STS seeks to measure the degree of semantic equivalence [Agirre et al., 2013]. STS differs from paraphrase detection in that it is not a yes-or-no decision; rather, it allows degrees of similarity from 5 (same meaning) to 0 (not at all related).

| Model name | Description | Pearson correlation | Link to Slurm |
| --- | --- | --- | --- |
| Vaccine projection | Surgery of the gradient with round-robin scheduler | 89.3% | vaccine |
| Pcgrad projection | Projected gradient descent with round-robin scheduler | 89.0% | pcgrad |
| Pal scheduler without vaccine | Apply the pal scheduler only | 86.5% | pal no vaccine |
| Augmented Attention Multitask | Attention layer on top of BERT | 85.8% | attention multitask |
| Bert-large | Use the bert-large model for multitasking | 85.3% | bert-large |
| Combined BERT models | 4 BERT models combined (3 sub-models and a gating network) | 85.1% | combined BERT |
| Sophia with additional inputs | Sophia optimizer with POS and NER inputs | 84.3% | sophia with additional inputs |
| BiLSTM-Multitask | BiLSTM layer on top of BERT | 80.7% | bilstm multitask |
| Hierarchical-BERT | Chunking the input up to a segment length | 71.4% | hierarchical bert |
| Max pooling | Max of the last hidden states' sequence | 53.4% | max pooling |
| Baseline (single task) | Single-task training | 41.4% | baseline |

The paraphrase generation task is closely related to the paraphrase type detection task. Instead of detecting the paraphrase types, the model is given one sentence and a list of paraphrase types as input and should generate a paraphrased version of the sentence using the given paraphrase types. The data for this task is the same as the data for the paraphrase type detection task.

| Paraphrase Type Generation (PTG) version | Epoch 1 Loss (Train / Dev) | Epoch 2 Loss (Train / Dev) | Epoch 3 Loss (Train / Dev) | Epoch 4 Loss (Train / Dev) | Epoch 5 Loss (Train / Dev) | BLEU Score | Negative BLEU Score | Penalized BLEU Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 1.0534 / 0.7594 | 0.4610 / 0.7924 | 0.2781 / 0.8906 | 0.1774 / 1.0377 | 0.1486 / 1.0930 | 43.990% | 21.245% | 17.972% |
| Custom Loss | 1.7079 / 1.2498 | 1.1150 / 0.7403 | 0.5095 / 0.8056 | 0.7873 / 1.5433 | 0.7170 / 1.6292 | 42.832% | 23.125% | 19.048% |
| Noise | 1.2543 / 0.7036 | 0.6770 / 0.7924 | 0.2781 / 0.8906 | 0.3845 / 0.8993 | 0.2993 / 1.0415 | 38.224% | 33.796% | 24.843% |
| Synonym | 1.0890 / 0.7034 | 0.4985 / 0.7895 | 0.3096 / 0.8880 | 0.1967 / 1.0476 | 0.1390 / 1.1058 | 43.643% | 20.845% | 17.495% |
| All improvements | 1.8247 / 1.2030 | 1.2837 / 1.2078 | 1.0804 / 1.3721 | 0.9594 / 1.4304 | 0.8729 / 1.5008 | 39.772% | 30.850% | 23.596% |

While the Penalized BLEU score is highest with only the noise function active, only about half of the dev-set sentences actually changed, whereas ~85% of the sentences changed with all improvements combined. These scores are all after only 5 epochs. We also ran a test with 12 epochs and all improvements, which gave the following scores:

  • BLEU Score: 30.926384502652546
  • Negative BLEU Score with input: 48.47821009206475
  • Penalized BLEU Score: 28.831841640530108

While the QQP dataset only differentiates between two sentences being paraphrased or not, the paraphrase type detection task is a bit more advanced. Instead of asking if two sentences are paraphrases, the task is to find the correct type of paraphrase. There are seven different types of paraphrases. Note that one sentence pair can belong to multiple paraphrase types. The two columns sentence1(2)_segment_location contain information about which token belongs to which paraphrase type. The two columns sentence1(2)_tokenized contain each token of the sentence in the original sentence ordering.

| Paraphrase Type Detection version | Epoch 12 Accuracy (Train / Dev) | MCC Score |
| --- | --- | --- |
| Baseline | 0.9320 / 0.7567 | 13.3% |
| Custom Loss | 0.9320 / 0.7567 | 63.5% |
| Noise + Custom Loss | 0.8804 / 0.7253 | 55.3% |
| Synonym + Custom Loss | 0.9178 / 0.7661 | 64.8% |
| All improvements + Custom Loss | 0.8770 / 0.7443 | 56.2% |

Methodology

In this section we will describe the methods we used to obtain our results for the given tasks.

POS and NER Tag Embeddings

Enriching the corpus with subword embeddings can enhance the model's understanding of language, as reported by T. Mikolov et al. (Distributed Representations of Words and Phrases and their Compositionality) and P. Bojanowski et al. (Enriching Word Vectors with Subword Information). We used the spaCy package to extract the POS and NER tags from the input text and feed their embeddings to the model together with the input sequence.

  • POS: Identifying the grammatical category of words (noun, verb, adjective, etc.) helps the model understand the structure of a sentence and disambiguate meanings, which in turn helps it capture the relationships between words.

  • NER: Recognizing named entities (names, organizations, etc.) helps the model focus on relevant words, leading to better sentiment analysis of the text.

Adding such semantic information can make the model more robust and therefore improve generalization to unseen data. A minimal sketch of how the tags can be extracted and embedded is shown below.
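
For illustration only, here is a minimal sketch of how POS and NER tag ids could be extracted with spaCy and turned into embeddings of BERT's hidden size. The `en_core_web_sm` model, the embedding sizes, and the additive fusion are assumptions for this sketch (not our exact pipeline), and the alignment between spaCy tokens and BERT word pieces is omitted.

```python
# Sketch: extract POS/NER tag ids with spaCy and embed them so they can be
# combined with the BERT token embeddings (assumptions noted above).
import spacy
import torch
import torch.nn as nn

nlp = spacy.load("en_core_web_sm")  # assumed spaCy model

pos2id = {t: i + 1 for i, t in enumerate(nlp.get_pipe("tagger").labels)}  # 0 = padding / no tag
ner2id = {t: i + 1 for i, t in enumerate(nlp.get_pipe("ner").labels)}

def tag_ids(sentence: str, max_len: int = 128):
    """Return padded POS and NER id tensors for one sentence."""
    doc = nlp(sentence)
    pos = [pos2id.get(tok.tag_, 0) for tok in doc][:max_len]
    ner = [ner2id.get(tok.ent_type_, 0) for tok in doc][:max_len]
    pad = max_len - len(pos)
    return torch.tensor(pos + [0] * pad), torch.tensor(ner + [0] * pad)

class TagEmbeddings(nn.Module):
    """Embeds POS and NER ids into the BERT hidden size so they can be
    summed with the word-piece embeddings (hypothetical fusion strategy)."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.pos_emb = nn.Embedding(len(pos2id) + 1, hidden_size, padding_idx=0)
        self.ner_emb = nn.Embedding(len(ner2id) + 1, hidden_size, padding_idx=0)

    def forward(self, pos_ids, ner_ids):
        return self.pos_emb(pos_ids) + self.ner_emb(ner_ids)

pos_ids, ner_ids = tag_ids("Google released a new model in Berlin.")
tag_features = TagEmbeddings()(pos_ids.unsqueeze(0), ner_ids.unsqueeze(0))
```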

Experimental Results

Contrary to our expectations, adding POS and NER tags to the model combined with the baseline AdamW optimizer did not improve performance much; instead, it slowed training considerably and consumed a lot of computational resources without an actual benefit. The reason for the high training cost is that we train on all the available data, and extracting the tags for each word, computing their embeddings, and feeding them to the model is quite expensive.

Explanation of Results

The reason why including additional inputs did not enhance performance is that large language models (LLMs) like BERT are pre-trained on massive corpora and their embeddings are contextual, meaning that the representation of a word depends on the context around it in the specified window, which already allows the model to capture POS- and NER-relevant information from the text. Researchers have conducted studies that probe the hidden layers to assess the amount of syntactic or semantic information encoded in them (A Structural Probe for Finding Syntax in Word Representations, BERT Rediscovers the Classical NLP Pipeline).


Optimizers

Sophia

Sophia is a second-order clipped stochastic optimization method. It uses an estimate of the Hessian as second-order information to represent the curvature of the loss function, which can lead to faster convergence and better generalization. Sophia incorporates an adaptive learning-rate mechanism, adjusting the step size based on the curvature of the loss landscape, and uses clipping to control the worst-case update size in all directions, safeguarding against the negative impact of inaccurate Hessian estimates. Thanks to the lightweight diagonal Hessian estimate, the speed-up in the number of steps translates to a speed-up in total wall-clock compute time.

Implementation

We used Hutchinson's unbiased estimator of the Hessian diagonal in our implementation, which samples from a spherical Gaussian distribution. It can be activated in the training script by setting the --optimizer option to sophiah. The estimator can be used for all tasks, including classification and regression, and requires only a Hessian-vector product instead of the full Hessian matrix. It also has efficient implementations in PyTorch and JAX, as mentioned in the original paper by H. Liu et al. (Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training).
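
The following is a minimal PyTorch sketch of the Hutchinson estimator itself, not the repo's SophiaH code: draw u from a standard Gaussian, form the Hessian-vector product Hu with autograd, and use u ⊙ Hu as an unbiased estimate of the Hessian diagonal.

```python
import torch

def hutchinson_diag_hessian(loss, params):
    """One-sample Hutchinson estimate of the Hessian diagonal:
    draw u ~ N(0, I), form the Hessian-vector product Hu via autograd,
    and return u * (Hu), whose expectation is diag(H)."""
    params = list(params)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    us = [torch.randn_like(p) for p in params]
    # <grad, u> is a scalar; differentiating it w.r.t. the parameters gives Hu.
    dot = sum((g * u).sum() for g, u in zip(grads, us))
    hvps = torch.autograd.grad(dot, params)
    return [u * hvp for u, hvp in zip(us, hvps)]

# Tiny usage example on a quadratic loss (Hessian = 2*I, so the
# expectation of each printed entry is 2; a single sample is noisy).
w = torch.randn(5, requires_grad=True)
loss = (w ** 2).sum()
print(hutchinson_diag_hessian(loss, [w])[0])
```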

Experimental Results

The implementation of SophiaH with additional inputs from spaCy did improve performance on the QQP and STS tasks compared to the AdamW baseline; however, it did not improve the SST task. Running SophiaH without POS and NER tags, we observed that convergence is slower than with AdamW.

Explanation of Results

The Sophia optimizer is primarily designed to be computationally feasible for pre-training large-scale language models, which differs from the setting in which we used it (fine-tuning on a relatively small corpus). J. Kaddour et al. (No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models) applied the Sophia optimizer using the Gauss-Newton-Bartlett method to downstream tasks and observed that it achieves performance comparable to AdamW.


Gradient Optimization (Projectors)

Gradient projection is an optimization technique for handling conflicting gradients during multitask training. Since we train the model on the data of all tasks simultaneously, the gradients of the individual tasks may point in different directions, which can lead to suboptimal performance and slow down training.

Pcgrad

T. Yu et al. (Gradient Surgery for Multi-Task Learning) proposed the Projected Gradient Descent (PCGrad) method, a form of gradient surgery. When the gradients of two tasks conflict, the gradient of one task is projected onto the normal plane of the other task's gradient, removing the conflicting component. Finally, the projected gradients are summed up and used to update the model. This method has been shown to effectively reduce the conflict between gradients and improve the model's convergence speed. A simplified sketch of the projection step is shown below.
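
A simplified sketch of the projection step, operating on flattened per-task gradient vectors; the actual implementation in the training scripts may differ in how gradients are collected and applied.

```python
import random
import torch

def pcgrad(task_grads):
    """Simplified PCGrad: each task's gradient is projected away from the
    component that conflicts (negative dot product) with another task's
    gradient; the projected gradients are then summed."""
    projected = []
    for i, g_i in enumerate(task_grads):
        g = g_i.clone()
        others = task_grads[:i] + task_grads[i + 1:]
        random.shuffle(others)
        for g_j in others:
            dot = torch.dot(g, g_j)
            if dot < 0:  # conflicting direction: remove the projection onto g_j
                g = g - dot / (g_j.norm() ** 2 + 1e-12) * g_j
        projected.append(g)
    return torch.stack(projected).sum(dim=0)

# Example with flattened per-task gradient vectors:
g_sst, g_qqp, g_sts = torch.randn(10), torch.randn(10), torch.randn(10)
update_direction = pcgrad([g_sst, g_qqp, g_sts])
```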

Experimental results

PCGrad achieved high performance on the QQP (first rank) and STS (second rank) tasks; however, it worsened the SST result (below the baseline).

Explanation of Results

Our hypothesis is that the sentence-pair tasks (STS, QQP) are more complex and have conflicting gradients, which is why they benefited from the PCGrad technique. The SST task, by contrast, involves only a single sequence and its gradients are less likely to conflict, which is why the benefits of surgery are less pronounced for that task. In line with this hypothesis, D. Saykin and K. Dolev (Gradient Descent in Multi-Task Learning) reported that training the SST task separately with PCGrad surgery achieves better results than in a multitask-training framework.

GradVac

GradVac stands for Gradient Vaccine and was originally proposed by Z. Wang et al. (Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models). This technique was developed to mitigate the negative effects of conflicting gradients in multitask training. It works similarly to PCGrad but first clips the gradients before aligning them, which adds computational overhead; it nevertheless remains efficient and scalable to large models and corpora.

Experimental results

It achieved results roughly comparable to PCGrad and additionally performed better on the SST task.

Explanation of Results

The additional clipping introduced in GradVac can prevent the model from making overly aggressive updates that could lead to instability. It therefore helps stabilize training, especially in the case of noisy gradients due to class imbalance, as in the SST data, which could explain the improvement we obtained.


Schedulers

Schedulers are mechanisms that adjust the learning rate dynamically during training, which can lead to better performance and faster convergence. Schedulers are useful for escaping local minima by momentarily increasing the learning rate, for preventing overfitting, and for efficient training. There are many scheduling algorithms; we consider some of them below.

Linear Warmup

The idea is to start with a small learning rate so that the basic features of the data are learned, then gradually increase it linearly over a predefined number of steps or epochs. After the warmup period, the learning rate transitions to another schedule such as a constant rate or a decay. Linear warmup helps stabilize training and avoid exploding gradients when using large batch sizes, large models, or large datasets. It has therefore been used in state-of-the-art LLM research (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Attention Is All You Need). We adopted the scheduler hyperparameters from the first reference (Devlin et al.).

Experimental results

We applied the linear warmup scheduler on the single SST task combined with a Bidirectional LSTM (BiLSTM) layer on top of BERT. We set the scheduler hyper-parameters num_train_steps = len(sst_train_dataloader) * num_train_epochs and num_warmup_steps = 0 as the default. This achieved a relatively higher performance than the baseline.
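
A minimal sketch of this setup, assuming the Hugging Face transformers helper get_linear_schedule_with_warmup and toy stand-ins for the SST model, dataloader, and optimizer (illustrative only, not our training script):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from transformers import get_linear_schedule_with_warmup

# Toy stand-ins for the real SST model and dataloader.
model = nn.Linear(768, 5)
sst_train_dataloader = DataLoader(
    TensorDataset(torch.randn(64, 768), torch.randint(0, 5, (64,))), batch_size=8
)
num_train_epochs = 10
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

num_train_steps = len(sst_train_dataloader) * num_train_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_train_steps
)

loss_fn = nn.CrossEntropyLoss()
for epoch in range(num_train_epochs):
    for x, y in sst_train_dataloader:
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()        # advance the warmup/decay schedule once per step
        optimizer.zero_grad()
```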

Plateau

The plateau scheduler reduces the learning rate when a specific metric (validation loss or accuracy) plateaus, i.e. remains unchanged for a specific number of training steps or epochs. This helps fine-tune the model by keeping the learning rate high while the model is still learning effectively and reducing it once improvement stops and the model has to fit the more complex structures in the data. The learning rate is reduced by a predetermined factor that defaults to 0.1. The number of epochs to wait on the plateau is set by the patience parameter, which we set to 2 for faster training, and we monitor the validation accuracy.
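
A minimal sketch with PyTorch's ReduceLROnPlateau using the settings described above; the toy optimizer and the placeholder dev accuracy are illustrative only.

```python
import torch

# Toy optimizer standing in for the real one over the model parameters.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=8e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=2  # we monitor dev accuracy
)

for epoch in range(20):
    dev_accuracy = 0.5            # placeholder: evaluate on the dev set here
    scheduler.step(dev_accuracy)  # LR is multiplied by 0.1 after 2 flat epochs
```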

Cosine

The basic idea behind the cosine annealing scheduler is to reduce the learning rate following a cosine function rather than a step-wise or exponential decay. This can lead to more effective training because it smoothly decreases the learning rate in a way that can help the model escape local minima and explore the loss landscape more effectively. I. Loshchilov and F. Hutter (SGDR: Stochastic Gradient Descent with Warm Restarts) proposed a warm-restart technique, in which the learning rate is periodically reset to its maximum value and then follows the cosine annealing schedule again. They applied their method to CIFAR-10 and CIFAR-100 in the computer vision domain.
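
A minimal sketch using PyTorch's CosineAnnealingWarmRestarts; the cycle length T_0 and the multiplier are illustrative choices, not our tuned values.

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=8e-5)
# Restart the cosine cycle every T_0 epochs; T_mult=2 doubles the cycle length at each restart.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=5, T_mult=2)

for epoch in range(20):
    # ... train one epoch ...
    scheduler.step()  # LR follows a cosine decay and jumps back to the maximum at each restart
```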

Round Robin

This algorithm is originally used by process and network schedulers in computing, where time slices (from 10 to 100 milliseconds) are assigned to each process in equal portions and in circular order, i.e. cycling through the different tasks one after the other. The problem with the round-robin scheduler is that if one task has far more instances than another (e.g. QQP vs. SST), the scheduler repeats all the examples of the smaller task several times before it has gone once through all the examples of the larger task. This leads to overfitting on the task with few data points and underfitting on the other. To this end, Stickland and Murray (BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning) proposed a method for scheduling training in which tasks are at first sampled proportionally to their training-set size, and the weighting is then reduced over training so that tasks are sampled more uniformly and interference is avoided. A sketch of this annealed sampling is shown below.
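
A sketch of the proportional-then-uniform idea; the linear annealing of the exponent is an illustrative assumption, not necessarily the exact schedule used in the repo.

```python
import random

def task_sampling_probs(dataset_sizes, epoch, total_epochs,
                        alpha_start=1.0, alpha_end=0.0):
    """Sample tasks proportionally to N_i ** alpha, annealing alpha from 1
    (proportional to dataset size) toward 0 (uniform) over training."""
    alpha = alpha_start + (alpha_end - alpha_start) * epoch / max(total_epochs - 1, 1)
    weights = {task: n ** alpha for task, n in dataset_sizes.items()}
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}

sizes = {"qqp": 121620, "sst": 7111, "sts": 5149}
probs = task_sampling_probs(sizes, epoch=0, total_epochs=20)
task = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(probs, task)
```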

Experimental results

We implemented the round-robin scheduler with the gradient surgery algorithms and it showed high performance. However, we would need to turn off the other optimization methods to get a rough estimate of the scheduler's own effect on training in terms of score and convergence speed.

PAL

The PAL (Performance and Locality) scheduler was developed by R. Jain et al. (PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters) to address performance variability in large-scale computing environments and to handle machine learning workloads in GPU clusters. The method aims to harness performance variability by redesigning scheduling policies for GPU clusters to take that variability into account. The authors first identify which GPUs exhibit similar performance variability and use K-Means clustering to group them. They then extend the algorithm to also account for the locality of GPUs that are widely distributed across the cluster.

Experimental results

We used the PAL scheduler without vaccine and it showed relatively good performance on the STS and QQP tasks. However, we could not run it together with the gradient surgery algorithms due to memory issues on the GWDG cluster.

Random

The random scheduler is included for single-task training: at each training step, the data of the selected task is sampled randomly, making training more robust against overfitting.


Regularization

Due to the limited data from the target tasks/domain and the extremely high complexity of the pre-trained open-domain models, aggressive fine-tuning often makes the adapted model overfit the training data of the target task/domain and therefore generalize poorly to unseen data. Regularization mitigates this issue by introducing additional constraints or penalties on the model's parameters during training, which helps simplify the model and improves its generalization.

SMART

Sharpness-Aware Minimization with Robustness Techniques (SMART) is an efficient regularization technique developed by H. Jiang et al. (SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization). SMART is built on top of SAM, introduced by P. Foret et al. (Sharpness-Aware Minimization for Efficiently Improving Generalization), which was originally proposed to simultaneously minimize loss value and loss sharpness. Sharpness refers to how much the loss value changes under small perturbations of the model parameters; the goal of SAM is to find parameters that both minimize the loss and produce a flat loss landscape around the minimum. SMART adds further robustness techniques for better generalization. The first is smoothness-inducing adversarial regularization, which manages the complexity of the model: it encourages the output of the model not to change much when a small perturbation is injected into the input, thereby enforcing the smoothness of the model and controlling its capacity. The second is Bregman proximal point optimization, which prevents aggressive updating by only updating the model within a small neighborhood of the previous iterate, thereby stabilizing training. A simplified sketch of the smoothness-inducing term is shown below.
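
Below is a heavily simplified, single-step sketch of the smoothness-inducing adversarial term only; the Bregman proximal point update is omitted, and the perturbation size, step size, and the toy classifier are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def smoothness_regularizer(forward_fn, embeddings, epsilon=1e-5, step_size=1e-3):
    """Single-step approximation of SMART's smoothness-inducing adversarial term:
    perturb the input embeddings in the direction that most changes the output
    distribution, then penalize that change with a symmetric KL divergence."""
    with torch.no_grad():
        clean_probs = F.softmax(forward_fn(embeddings), dim=-1)

    noise = (torch.randn_like(embeddings) * epsilon).requires_grad_()
    adv_logits = forward_fn(embeddings + noise)
    adv_loss = F.kl_div(F.log_softmax(adv_logits, dim=-1), clean_probs, reduction="batchmean")
    grad, = torch.autograd.grad(adv_loss, noise)
    noise = (noise + step_size * grad / (grad.norm() + 1e-12)).detach()  # ascent step

    adv_probs = F.softmax(forward_fn(embeddings + noise), dim=-1)
    sym_kl = (F.kl_div(adv_probs.log(), clean_probs, reduction="batchmean")
              + F.kl_div(clean_probs.log(), adv_probs, reduction="batchmean"))
    return sym_kl  # add lambda * sym_kl to the task loss

# Toy usage: a linear classifier over mean-pooled token embeddings.
classifier = nn.Linear(768, 5)
token_embeddings = torch.randn(4, 16, 768)
print(smoothness_regularizer(lambda e: classifier(e.mean(dim=1)), token_embeddings))
```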

Experimental results

We used SMART regularization in our gradient surgery training (PCGrad and vaccine) as well as with the PAL scheduler, where performance was high, especially for the STS and QQP tasks. To isolate the effect of SMART, we would need to switch off the other optimization methods and estimate its impact relative to the baseline.


Custom Loss for BART generation

The loss function improves the performance of the model by penalizing it for simply copying input sentences while still encouraging meaningful similarity. It incorporates several components that work together to create a more nuanced and effective training process (a minimal sketch follows the list below):

  1. Cross-Entropy Loss: calculated just like in the default BART model we are using.
  2. Cosine Similarity Loss:
    • Similarity Loss: This measures how similar the embeddings of the input and output sentences are. The function computes the cosine similarity between the mean embeddings of the input (input_embeddings) and output sentences (output_embeddings). A similarity loss is calculated as 1 - cosine_sim.mean(), which encourages the model to generate output sentences that retain some similarity to the input.
    • Dissimilarity Loss: This is simply the mean cosine similarity itself (cosine_sim.mean()). It acts as a counterbalance, ensuring that the output is not overly similar to the input. By tuning this component, the model is encouraged to create variations that still make sense contextually.
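
As referenced above, here is a hedged sketch of how these components could be combined with the cross-entropy term; the mean-pooling of encoder states and the default weights (mirroring the command-line defaults) are assumptions, and the copy-penalty term is omitted.

```python
import torch
import torch.nn.functional as F

def generation_loss(ce_loss, input_embeddings, output_embeddings,
                    similarity_weight=0.2, dissimilarity_weight=0.6):
    """ce_loss: the standard token-level cross-entropy from BART.
    input_embeddings / output_embeddings: (batch, seq_len, hidden) tensors,
    e.g. encoder states of the source sentence and of the generated sentence."""
    src = input_embeddings.mean(dim=1)    # mean-pool over tokens
    gen = output_embeddings.mean(dim=1)
    cosine_sim = F.cosine_similarity(src, gen, dim=-1)

    similarity_loss = 1.0 - cosine_sim.mean()   # keep some semantic overlap...
    dissimilarity_loss = cosine_sim.mean()      # ...but discourage plain copying
    return (ce_loss
            + similarity_weight * similarity_loss
            + dissimilarity_weight * dissimilarity_loss)

# Toy usage with random tensors:
loss = generation_loss(torch.tensor(2.3), torch.randn(4, 20, 768), torch.randn(4, 20, 768))
print(loss)
```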

Score: slightly better scores and around 15-20% more sentences changed.


Noise for training for BART

The noise function introduces noise into a sentence by randomly modifying its content. This noise can be in the form of word swaps, deletions, or replacements with a special token (e.g., a mask token). The primary goal of this function is to simulate noisy or imperfect input data, which can be useful in training models that are better at handling variations in text.
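
A minimal sketch of such a noise function; the mask-token string and the exact sampling of operations are assumptions for illustration.

```python
import random

def add_noise(sentence: str, noise: float = 0.2, mask_token: str = "<mask>") -> str:
    """Randomly swap, delete, or mask roughly `noise` * len(words) words."""
    words = sentence.split()
    n_changes = max(1, int(len(words) * noise)) if words else 0
    for _ in range(n_changes):
        if len(words) < 2:
            break
        op = random.choice(["swap", "delete", "replace"])
        i = random.randrange(len(words))
        if op == "swap":
            j = random.randrange(len(words))
            words[i], words[j] = words[j], words[i]
        elif op == "delete":
            del words[i]
        else:  # replace with the mask token
            words[i] = mask_token
    return " ".join(words)

print(add_noise("the quick brown fox jumps over the lazy dog", noise=0.2))
```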

Generation score: a much better score than the baseline and around 35% more sentences changed; this was the biggest improvement. Detection score: slightly worse, probably because every word is important for detecting the paraphrases.


Synonyms for training for BART

The synonym function augments the dataset by generating additional sentences with synonyms substituted for some words in the original sentences. The function processes each sentence in the dataframe, identifies synonyms for specific words using the NLTK WordNet library, and appends these new sentences to the dataframe. This helps to expand the training dataset, improving the robustness and generalization of a language model by exposing it to a wider variety of sentence structures.
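
A minimal sketch of WordNet-based synonym substitution with NLTK; it requires the wordnet corpus, and the whitespace tokenization and per-word replacement probability (mirroring --synonym_prob) are simplifications.

```python
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def synonym_augment(sentence: str, synonym_prob: float = 0.3) -> str:
    """Replace each word with a WordNet synonym with probability synonym_prob."""
    augmented = []
    for word in sentence.split():
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(word) for l in s.lemmas()}
        lemmas.discard(word)
        if lemmas and random.random() < synonym_prob:
            augmented.append(random.choice(sorted(lemmas)))
        else:
            augmented.append(word)
    return " ".join(augmented)

print(synonym_augment("the movie was surprisingly good"))
```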

Generation score: a slightly worse score (more or less equal, depending on the random seed), but more sentences changed and overall less overfitting; it would likely beat the baseline with more epochs. Detection score: slightly worse, probably because every word is important for detecting the paraphrases.


Generally: BART generation

While it looks like every model is overfitting, not only do the resulting sentences look much better, but the BLEU scores are also better. Therefore, the losses are not that important for our metrics.


Custom Loss for BART detection

We implemented a new custom loss function that combines the Matthews Correlation Coefficient (MCC) with the traditional BCEWithLogitsLoss (binary cross-entropy with logits). This was done to balance the strengths of both metrics, with the weight of each term determined through a grid search to find the most effective combination. A sketch of one possible formulation is shown below.
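
As noted above, here is a sketch of one possible formulation: because MCC over hard predictions is not differentiable, a "soft" MCC computed from sigmoid probabilities is used as a surrogate and mixed with BCEWithLogitsLoss; the mixing weights stand in for the grid-searched ones.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def soft_mcc(probs: torch.Tensor, labels: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Differentiable MCC surrogate computed from probabilities in [0, 1]."""
    tp = (probs * labels).sum()
    tn = ((1 - probs) * (1 - labels)).sum()
    fp = (probs * (1 - labels)).sum()
    fn = ((1 - probs) * labels).sum()
    denom = torch.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn) + eps)
    return (tp * tn - fp * fn) / denom

def detection_loss(logits, labels, mcc_weight=0.5, bce_weight=0.5):
    labels = labels.float()
    probs = torch.sigmoid(logits)
    return bce_weight * bce(logits, labels) + mcc_weight * (1.0 - soft_mcc(probs, labels))

logits = torch.randn(8, 7)                      # 7 paraphrase types
labels = torch.randint(0, 2, (8, 7))
print(detection_loss(logits, labels))
```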

Score: The scores improved and there is less overfitting now.


Details

Data Imbalance

The datasets we worked with are imbalanced, either within an individual dataset, such as the imbalanced classes of SST, or in terms of the relative sizes of the datasets of different tasks when it comes to multitasking. Such data imbalance can bias the model towards the majority classes at the expense of the minor ones and therefore lead to incorrect predictions on test data. There are two ways to overcome this:

  • Fixed samples per epoch: Select a fixed number of samples randomly from the dataloader of each task in each epoch. We ran our experiments with 10,000 and 20,000 samples but did not find a significant difference in performance; in that case, 10,000 samples is preferable for computational reasons. This strategy is rather simple and does not actually solve the problem, since we still cannot balance the classes to mitigate overfitting, especially for the SST task.

  • Data augmentation: J. Wei and K. Zou (EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks) came up with the idea of doing Easy Data Augmentation (EDA) for boosting performance on text classification tasks. EDA consists of four operations:

    • Synonym replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
    • Random insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
    • Random swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
    • Random deletion: Randomly remove each word in the sentence with probability $p$

    It is adequate for small datasets. It also saves computation costs since the authors reported that they achieved the same accuracy with only 50% of the training set as with the full available data. There is also another spaCy integrated package of data augmentation (augmenty) that was developed by K. Enevoldsen (AUGMENTY: A PYTHON LIBRARY FOR STRUCTURED TEXT AUGMENTATION) that serves the same purpose.

Although the data augmentation approach seems more promising, we unfortunately came across the idea rather late in the project schedule, which is why we only implemented the first approach and leave the latter for further investigation.

Classifier Architecture

We share the BERT layers among all the datasets and build a classifier on top that is specific to each task. We tried many different architectures to improve the performance of both single-task and multitask training. We use the first token of the sequence ([CLS]) from the last hidden state, which encodes what the model learned during the open-domain pre-training phase, and feed it to the classifiers, as described in the literature (Devlin et al., Sun et al., Karimi et al.). The classifiers we built are listed below; a sketch of the paraphrase detection head follows the list.

  • Sentiment Analysis Classifier: We first built three fully connected layers of size 768 × 768 each, with a ReLU activation applied after each layer, followed by a final layer of size hidden_size × num_classes. These layers refine the BERT embeddings into logits corresponding to each class. We take the softmax and compute the loss between the predicted labels and the ground truth using the cross-entropy function from PyTorch.

  • Semantic Textual Similarity Regressor: We compute the embeddings for the sequence pair and the cosine similarity between the embedding vectors. The cosine similarity is an estimate of how semantically similar the two sentences are. The output is then multiplied by 5 to scale it to the range of the true labels. Since this is a regression task, we use the Mean Squared Error (MSE) to compute the loss.

  • Paraphrase Detection Classifier: We first generate the embeddings for each pair, then apply a fully connected layer of size 768 × 768 to both embedding vectors. We take the absolute sum and difference of the two vectors and concatenate them. The concatenated vector of dimension 2 × 768 is fed to two linear layers and then to the binary classifier, with a ReLU activation applied after each of the connected layers. At the end, we apply the sigmoid function to the logits to get the predicted probabilities and compare them with the true labels through the cross-entropy loss function.
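
As referenced above, a sketch of the paraphrase detection head; the 768-dimensional widths follow the description, while the exact widths of the intermediate layers and the single binary logit are assumptions.

```python
import torch
import torch.nn as nn

class ParaphraseHead(nn.Module):
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.project = nn.Linear(hidden, hidden)   # shared 768x768 projection
        self.fc1 = nn.Linear(2 * hidden, hidden)   # on |a + b| concat |a - b|
        self.fc2 = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(hidden, 1)     # binary logit

    def forward(self, cls_a: torch.Tensor, cls_b: torch.Tensor) -> torch.Tensor:
        a = torch.relu(self.project(cls_a))
        b = torch.relu(self.project(cls_b))
        features = torch.cat([torch.abs(a + b), torch.abs(a - b)], dim=-1)
        x = torch.relu(self.fc1(features))
        x = torch.relu(self.fc2(x))
        return self.classifier(x)  # use a sigmoid / BCE-style loss downstream

head = ParaphraseHead()
logit = head(torch.randn(4, 768), torch.randn(4, 768))
```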

Augmented Attention

We add an attention layer on top of the BERT model to refine the model's focus on certain parts of the input sequence, improving its ability to capture relevant contextual information for the downstream tasks. It is important for the model to give more attention to certain tokens in the input sequence (great, amazing, terrible, bad, etc.) to classify it better, or to the relations between sequence pairs to decide whether they are actually paraphrases. For that purpose we apply an attention mechanism inspired by A. Vaswani et al. (Attention Is All You Need) that takes the last hidden state of the model and applies a weighted sum to emphasize the important tokens and downweight the others. In addition, G. Limaye et al. (BertNet: Combining BERT language representation with Attention and CNN for Reading Comprehension) implemented a Query-Context attention layer for the question answering task (English-Arabic-English) and demonstrated significantly better performance on "difficult" questions.

BiLSTM

We added three Bidirectional Long Short-Term Memory (BiLSTM) layers on top of BERT to further capture long sequential dependencies and richer contextual information. A BiLSTM processes the sequence in both forward and backward directions to capture relations across time steps. We used a hidden size of 128 over the sequence of BERT's last hidden states. We also added a dropout and an average pooling layer to avoid overfitting. The model was still overfitting, however, so we reduced the number of BiLSTM layers to one.

Feed more Sequences

We added a train_mode option to include contextual information from some of the topmost BERT layers instead of the last hidden state only. We first tried the sequences of all 12 layers, but that worsened performance. Inspired by (Adversarial Training for Aspect-Based Sentiment Analysis with BERT), we then took the [CLS] output from the last four layers and applied average pooling over them. That approach reduced overfitting and enhanced performance, especially for the SST task.

Hierarchical BERT

We tried to exploit the hierarchical structure in the input sequences by chunking the input with a segment length of 34 and investigating the relation between the chunks and the sentence as a whole. This method is designed to process text at multiple levels of granularity to capture both local and global context. Although this approach improved the STS score, it is primarily designed to handle long text sequences, as reported by J. Lu et al. (A Sentence-level Hierarchical BERT Model for Document Classification with Limited Labelled Data), which does not fit our short-sequence data well. The results for the QQP and SST tasks fell below the baseline, as we suspected. Another reason is that we do not actually know where the sequence should be split: the sentence may lose its meaning when it is split, and the resulting chunks may become meaningless on their own. As explained above, this approach is therefore more helpful for long sequences, where we want to relate sentences on a local level to the global document.

CNN BERT

Seeking improvements, we tried to add convolutional neural network (CNN) layers on top of BERT. CNNs are known for their ability to extract deep features and capture complex relations in the input, which we thought could be useful for our tasks. We added three 1-dimensional CNN layers with an input channel size of 768 (the embedding size, i.e. the number of features per input token) and an output dimension of 512 (the size of the feature map for each token). We used convolutional filters of kernel size 5, meaning that each filter considers five adjacent tokens at a time as it slides along the sequence, and a padding of one to keep the output sequence close to the length of the input. Unexpectedly, this approach degraded performance, so we decided to drop it.

Combined Models

Here we use a combination of BERT models called Mixture of Experts (MoE), which is composed of several sub-models, or experts, and a gating mechanism that dynamically selects which experts to use for a given input. This approach is particularly useful when different subsets of the data benefit from specialized models. State-of-the-art models such as Switch Transformer and GShard use MoE. We implemented three BERT sub-models and one BERT model that functions as a gating network, and applied a softmax to weight the contribution of each expert. Our experiment showed a boost in performance compared to other approaches. A sketch of the gating mechanism is shown below.
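
A sketch of the gating mechanism; small linear stand-ins replace the BERT experts and the BERT gate to keep the example self-contained.

```python
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    """The gating network produces softmax weights over the experts;
    the output is the weighted sum of the expert representations."""
    def __init__(self, experts: nn.ModuleList, gate: nn.Module, hidden: int = 768):
        super().__init__()
        self.experts = experts
        self.gate = gate
        self.gate_head = nn.Linear(hidden, len(experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate_head(self.gate(x)), dim=-1)   # (batch, n_experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, n_experts, hidden)
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=1)

# Stand-ins for the three BERT experts and the BERT gating network:
experts = nn.ModuleList([nn.Linear(768, 768) for _ in range(3)])
gate = nn.Linear(768, 768)
moe = MixtureOfExperts(experts, gate)
pooled = moe(torch.randn(4, 768))
```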

BERT-Large

We tried to run BERT-Large in our multitask training; it has a larger embedding dimension than minBERT and is therefore able to encode more contextual information. However, we did not perform pre-training, which needs a much larger corpus than we have. That is why the model's performance was only comparable and did not show any improvement.

PALs

Projected Attention Layers (PALs) are a technique introduced in (BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning) to efficiently adapt LLMs to new tasks without the need to pre-train the model. This is done by introducing multi-head attention layers in parallel with the BERT layers; these additional layers can be fine-tuned separately from the base model. The attention layers project the input into a task-specific subspace, and only the parameters of the PALs get updated while the parameters of the base model stay frozen, which saves computational cost. PALs also allow sharing parameters among layers in multitask learning, which could have a strong positive impact on performance. Although we integrated the PALs functions into our framework in the BERT_PALs-Aly branch, we have not measured their performance yet, since we still have to finish and debug the implementation. We leave that part for further investigation.


Hyperparameter Search

  • We used Ray Tune to perform hyperparameter tuning for the multitasking and to select the parameters that best reduce overfitting. We used Optuna to search the hyperparameter space and AsyncHyperBandScheduler as the scheduler (a sketch of the setup follows this list). The hyperparameters were searched for the whole model on all training data, not for each task individually, to avoid overfitting to a single task. The trained models were evaluated on the validation data and the best one was selected based on the validation results. The metrics were accuracy for sentiment analysis and paraphrase detection, and the Pearson correlation for semantic textual similarity.
  • For BART_generation:
    • Loss function: There are three parameters: similarity weight, dissimilarity weight, and copy penalty weight. We started with higher values for the dissimilarity weight and copy penalty weight and a lower value for the similarity weight, because we wanted to penalize sentences that were too similar. We got good results and only fine-tuned a little more with grid search to get good dev scores.
    • Noise: We only wanted a little noise, so that the sentences would still make sense. We therefore started with 0.1 (10% of words changed through noise) and later went up to 0.2, because 0.1 did not change much.
    • Synonym: For the synonyms, we originally started with a high value of 0.5 (50% of words changed), but this changed too much of the sentences and the original meaning was partly lost. We therefore went down to 0.3, which helped increase our scores.
  • For BART_detection: A mixture of grid search for some hyperparameters, intuition for others, and rules of thumb found in the PyTorch documentation.
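
As referenced in the first bullet, a sketch of the search setup. The search space, the dummy trainable, and the reporting call are placeholders rather than our actual training function, and the exact Ray Tune entry points (e.g. tune.report vs. ray.train.report) differ between Ray versions; this follows the classic tune.run interface.

```python
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.search.optuna import OptunaSearch

def trainable(config):
    # Placeholder: train the multitask model with config["lr"], config["dropout"], ...
    # and report the dev metric after each epoch so ASHA can stop weak trials early.
    for epoch in range(3):
        dev_accuracy = 0.5 + 0.01 * epoch * config["lr"] * 1e4  # dummy metric
        tune.report(dev_accuracy=dev_accuracy)

search_space = {
    "lr": tune.loguniform(1e-5, 1e-4),
    "dropout": tune.uniform(0.1, 0.5),
    "batch_size": tune.choice([32, 64]),
}

analysis = tune.run(
    trainable,
    config=search_space,
    num_samples=20,
    metric="dev_accuracy",
    mode="max",
    search_alg=OptunaSearch(),
    scheduler=AsyncHyperBandScheduler(),
)
print(analysis.best_config)
```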

Computation Resources

We trained our models on the GWDG HPC clusters of the Georg-August University of Göttingen. We used different setups depending on the implemented approach and its computational cost. We mainly used Grete and Emmy cluster nodes and trained the basic multitasking approach (with augmented attention) with the following configuration:

  • GPU cores: 4
  • Model: NVIDIA A100-SXM4-40GB
  • NVIDIA driver version: 535.104.12
  • XNNPACK available: True
  • OS: Rocky Linux 8.8 (Green Obsidian) (x86_64)
  • CPU cores per socket: 32

More details about the computation utilities can be found in the slurm file corresponding to each training.


Contributors

| Task | Contributor 1 | Contributor 2 | Contributor 3 |
| --- | --- | --- | --- |
| BERT Tasks (SST, STS, QQP) Basic | Mohamed Aly | Frederik Hennecke | Arne Tillmann |
| BERT Tasks (SST, STS, QQP) Improvement | Mohamed Aly | | |
| BART Tasks Generation Basic | Mohamed Aly | Frederik Hennecke | Arne Tillmann |
| BART Tasks Generation Improvement | Frederik Hennecke | | |
| BART Tasks Detection Basic | Mohamed Aly | Frederik Hennecke | Arne Tillmann |
| BART Tasks Detection Improvement | Arne Tillmann | Frederik Hennecke | |
| Writing the README file | Mohamed Aly | Frederik Hennecke | Arne Tillmann |
| Bash Scripts | Mohamed Aly | Frederik Hennecke | Arne Tillmann |

Licence

The project involves the creation of software and documentation to be released under an open-source licence, namely the Apache License 2.0, a permissive licence that allows the use of the software for commercial purposes. The licence is also compatible with the licences of the libraries used in the project. Parts of the project were inspired by other open-source code (see Acknowledgement). Using or modifying these parts in this project should adhere to the licence (copyright) of the original authors.


Usage Guidelines

To run or contribute to this project, please follow these steps:

Clone the repository to your local machine

git clone git@github.com:FrederikHennecke/DeepLearning4NLP.git

Add the upstream repository as a remote and disable pushing to it (only pull but not push)

git remote add upstream https://github.com/FrederikHennecke/DeepLearning4NLP.git
git remote set-url --push upstream DISABLE

You can pull the latest updates of the project with

git fetch upstream
git merge upstream/main

or you can specify a branch to pull from after fetching the changes

git pull upstream <branch-name>

Pre-commit Hooks

The code quality and integrity are checked with pre-commit hooks. This ensures that the code quality is consistent and that the code is formatted uniformly. To install the pre-commit hooks in your local repository, run the following

pip install pre-commit
pre-commit install

The pre-commit hooks will run automatically before each commit. If the hooks fail, the commit is aborted. You can skip the pre-commit hooks by adding the --no-verify flag to your commit command.

The installed pre-commit hooks are:

  • black - Code formatter (Line length 100)
  • flake8 - Code linter (Selected rules)
  • isort - Import sorter

GWDG Cluster

You can run the project on the GWDG cluster if you are a member of the GWDG community; you need to specify the GPU resources you require. To start an interactive session for testing the project, execute the following

srun -p grete:shared --pty -n 1 -C inet -c 32 -G A100:1 --interactive /bin/bash

This command allocates one node with 32 CPU cores and one NVIDIA A100 GPU on the grete:shared partition. For the actual run of the script and higher computational power, you need to submit a job. The job configuration is found in the run_train.sh script, where you can select the file to run, the options you would like to activate, and your preferred parameters. You can also adjust the resources you want to use (nodes per task, GPU cores, memory, etc.) and add your email to receive notifications about the job's status. You can submit a job with the following command

sbatch run_train.sh

You can check the status of your submitted jobs with

squeue --me

This shows the job ID, whether the job is running or still pending allocation, the time elapsed since submission, and the GPU node assigned to it.

The job details are saved to the slurm_files directory, where you can see a summary of the resources used, the installed packages, the executed command, and the git-related information (current branch, latest commit, etc.). If the checkpoint and write_summary options are activated, the training logs are saved to the logdir directory. The saved results can be viewed in TensorBoard. To create a tunnel to your local machine and start TensorBoard on the cluster, run the following

ssh -L localhost:16006:localhost:6006 <username>@glogin.hlrn.de
module load anaconda3
conda activate dnlp2
tensorboard --logdir logdir

AI-Usage Card

Artificial Intelligence (AI) aided the development of this project. For transparency, we provide our AI-Usage Card at the top. The card is based on https://ai-cards.org/.


Acknowledgement

  • The project description, partial implementation, and scripts were adapted from the default final project for the Stanford CS 224N class developed by Gabriel Poesia, John Hewitt, Amelie Byun, John Cho, and their (large) team (Thank you!)

  • The BERT implementation part of the project was adapted from the "minbert" assignment developed at Carnegie Mellon University's CS11-711 Advanced NLP, created by Shuyan Zhou, Zhengbao Jiang, Ritam Dutt, Brendon Boldt, Aditya Veerubhotla, and Graham Neubig (Thank you!)

  • Multitask-training with augmented attention is inspired from Lars Kaesberg and Niklas Bauer under Apache License 2.0 (Thank you!).

  • PALs and schedulers implementations are inspired from Josselin Roberts, Marie Huynh and Tom Pritsky under Apache License 2.0 (Thank you!).

  • The structure of the README file is adapted from good_example (Thank you!)


Disclaimer

We did not have enough time to clean the repository and arrange the scripts in a perfectly engineered way. We also could not merge all the implemented approaches of the BERT part into main. We therefore advise the user to refer to the original branches, which can be found in the slurm files referenced in the tables of the results section.
