Research on LLM Hyper‐parameters
- Batch Size: the number of training samples used in one training iteration. Smaller batch sizes produce noisier, more stochastic updates, while larger batch sizes give smoother, more stable gradient estimates. A common heuristic is to use the largest batch size that fits in GPU memory. If you increase the batch size, you usually need to increase the learning rate (or the number of epochs) to keep training balanced.
  Range: 16, 32, 64, ...
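The batch-size/learning-rate trade-off above is often handled with the linear scaling heuristic. A minimal sketch (the base values here are hypothetical, and linear scaling is one common rule of thumb, not the only option):

```python
# Sketch of the linear scaling heuristic: scale the learning rate
# proportionally to the batch size. Base values are illustrative.
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Return a learning rate scaled linearly with the batch size."""
    return base_lr * new_batch_size / base_batch_size

# Doubling the batch size from 16 to 32 doubles the learning rate.
print(scale_learning_rate(2e-5, 16, 32))  # 4e-05
```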
- Epochs: the number of times the model trains on the whole dataset. Too few epochs may result in underfitting, while too many may lead to overfitting. For fine-tuning in our case:
  Range: [3, 15]
- Learning Rate: the step size taken during training to update the model's weights, i.e. how much the model learns in each step.
  Range: [1e-6, 1e-4]
- Early stopping:
  - Training loss usually keeps decreasing.
  - Keep an eye on the validation loss and stop training when it starts to increase.
  - The epoch where validation loss is lowest gives the best-performing model.
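The early-stopping rule above can be sketched as a simple patience loop (the loss values and `patience` setting are illustrative):

```python
# Minimal sketch of early stopping with patience: stop once validation
# loss has not improved for `patience` consecutive epochs, and report
# the epoch whose checkpoint was best.
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch with the lowest validation loss."""
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch  # stop: use the checkpoint from this epoch
    return best_epoch

# Validation loss drops, then rises: the best model is from epoch 2.
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7]))  # 2
```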
- Weight decay (WD): a regularization technique that penalizes large weights to avoid overfitting. The bigger the WD, the stronger the penalty: the model learns more slowly and overfits less.
  - Recommended values: [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
  - Tune it on a logarithmic scale.
  - Weight decay and learning rate interact: higher weight decay values may require a lower learning rate, and vice versa.
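Mechanically, decoupled weight decay (as in AdamW) shrinks each weight toward zero a little every step, independently of the gradient. A minimal sketch, simplified to plain SGD with illustrative values:

```python
# Sketch of decoupled weight decay, simplified to plain SGD: each step
# the weight is shrunk by lr * wd before the usual gradient update.
def sgd_step_with_weight_decay(w, grad, lr=1e-4, weight_decay=1e-2):
    w = w - lr * weight_decay * w   # decay term: penalizes large weights
    w = w - lr * grad               # usual gradient step
    return w

# Even with zero gradient, the weight slowly shrinks toward zero.
w = 1.0
for _ in range(1000):
    w = sgd_step_with_weight_decay(w, grad=0.0)
print(w)  # roughly 0.999 after 1000 steps at lr=1e-4, wd=1e-2
```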
- Gradient Clipping: prevents the gradients from becoming too large during training, which can cause unstable training and the "exploding gradient" problem. By capping the gradients at a maximum norm, gradient clipping keeps the updates to the model's weights within a reasonable range.
  Range: [0.1, 0.5]
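Clipping by global norm (what PyTorch's `torch.nn.utils.clip_grad_norm_` implements) can be sketched on a plain list of gradient values:

```python
import math

# Sketch of gradient clipping by global norm: if the norm of the gradient
# vector exceeds max_norm, rescale it so its norm equals max_norm.
def clip_by_global_norm(grads, max_norm=0.5):
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]  # direction preserved, norm capped
    return grads

clipped = clip_by_global_norm([3.0, 4.0], max_norm=0.5)  # norm was 5.0
print(clipped)  # approximately [0.3, 0.4]; norm is now 0.5
```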
- Learning rate scheduler: the learning rate should not be constant during training. It should be higher at the beginning, so the optimizer can take large steps toward the global minimum, and lower in the final epochs, so it can settle into the minimum with small steps.
  Recommended choice: cosine scheduler
  Source: Prof. A. Maier, "Deep Learning" lecture slides, FAU
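A cosine schedule implements exactly this behavior: the rate starts at its maximum and decays to a minimum along half a cosine wave. A minimal sketch with illustrative values (real trainers often add a warmup phase on top):

```python
import math

# Sketch of a cosine learning-rate schedule: decays from max_lr to min_lr
# over total_steps following half a cosine wave.
def cosine_lr(step, total_steps, max_lr=1e-4, min_lr=0.0):
    progress = step / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 100))    # max_lr at the start
print(cosine_lr(50, 100))   # half of max_lr midway
print(cosine_lr(100, 100))  # min_lr (here 0.0) at the end
```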
With Optuna!
Libraries such as Optuna and Ray systematically search the hyper-parameter space for our model, so we don't need to try values manually. Optuna does this using methods like grid search, Bayesian optimization, and the Tree-structured Parzen Estimator (TPE).
Recommended hyper-parameter search library: Optuna
Recommended optimization method: TPE
How to use Optuna with Hugging Face: https://huggingface.co/docs/transformers/hpo_train
- Rank (r): controls the dimensionality of the low-rank matrices used to update the weights. A larger r allows more expressive updates, but recent papers suggest that when LoRA is applied to all layers (usually the case), any r greater than 8 gives similar results. Higher r also means more computational burden.
  Range: [8, 16, 32, 64]
- Alpha: when the weight changes are added back into the original model weights, they are multiplied by a scaling factor calculated as alpha divided by rank (alpha / r). This ratio controls the influence of fine-tuning: to increase the effect of fine-tuning, increase alpha.
- Dropout: nothing new, just like dropout in neural networks: it randomly drops a portion of the LoRA weights during training to regularize the model.
  Range: [0, 0.1]
  Source: https://www.entrypointai.com/blog/lora-fine-tuning/
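The alpha/rank scaling described above can be sketched with scalars standing in for the weight matrices (all values illustrative):

```python
# Sketch of LoRA merging: the low-rank update is scaled by alpha / r
# before being added back into the original weight. Scalars stand in
# for the actual weight matrices here.
def merged_weight(w0, lora_update, alpha=32, r=16):
    scaling = alpha / r          # the alpha-divided-by-rank factor
    return w0 + scaling * lora_update

# Doubling alpha doubles the influence of the fine-tuned update:
print(merged_weight(1.0, 0.1, alpha=16, r=16))  # 1.1  (scaling = 1.0)
print(merged_weight(1.0, 0.1, alpha=32, r=16))  # 1.2  (scaling = 2.0)
```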