# FMS Acceleration for Accelerated PeFT Techniques

Currently only LoRA-related techniques are supported, but more are in the pipeline:

## Plugins

| Plugin | Description | Depends | Loading | Augmentation | Callbacks |
|---|---|---|---|---|---|
| autogptq | Loads 4bit GPTQ-LoRA with quantized GPTQ as base | AutoGPTQ | | | |
| bnb | Loads 4bit QLoRA with quantized bitsandbytes Linear4 | Huggingface<br>bitsandbytes | | | |
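
For orientation, the sketch below shows the kind of 4bit QLoRA setup that the `bnb` plugin automates, using the standard Huggingface `BitsAndBytesConfig` and `peft` APIs directly. The model id and LoRA hyperparameters are placeholder assumptions, and the plugin itself wires this up through the acceleration framework rather than by hand.

```python
# Minimal sketch of a generic 4bit QLoRA setup (what the bnb plugin automates).
# The model id and LoRA hyperparameters are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used during matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",                    # placeholder model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # placeholder target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters are trainable
```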

## Key Points

- Fixes the upcasting issue (which caused slowdowns) in the bnb plugin, originally discovered by the inventors of Unsloth.
- bnb is properly configured to work with FSDP, following this guide (a generic sketch follows this list).
- triton_v2 kernels are not yet properly integrated into huggingface optimum.
- triton_v2 kernels are the only 4bit kernels that work for training.
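
As a generic illustration of the FSDP point above, combining bnb 4bit quantization with FSDP typically requires the quantized weights to be *stored* in the same dtype as the rest of the parameters so FSDP can flat-shard them. This is a general FSDP-QLoRA pattern, not necessarily the exact mechanism the plugin uses; the model id is a placeholder.

```python
# Sketch of the usual bitsandbytes setting needed for FSDP compatibility:
# store the 4bit quantized weights in the same dtype as the other parameters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # must match torch_dtype below for FSDP
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",                    # placeholder model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,             # keep non-quantized params in bf16 too
)
# The model would then be wrapped by FSDP (e.g. via accelerate) as usual.
```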

## Known Issues

- Models with sliding windows (e.g., Mistral, Mixtral) will have memory and throughput issues.
- GPTQ-LoRA is sometimes observed to have NaN grad norms at the beginning of training, but training otherwise proceeds well.
- `low_cpu_mem_usage` is temporarily disabled for AutoGPTQ until the bug with `make_sure_no_tensor_in_meta_device` is resolved.
- Requires a nightly AutoGPTQ until a package version > 0.7.1 becomes available: `pip install git+https://github.com/AutoGPTQ/AutoGPTQ.git`
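
To confirm that the nightly build is what actually got installed, the reported version can be checked from Python; this assumes the distribution registers itself under the name `auto_gptq`, as the released builds do.

```python
# Check the installed AutoGPTQ version; a nightly build should report > 0.7.1.
from importlib.metadata import version

print(version("auto_gptq"))
```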