Currently only LoRA-related techniques are supported, but more are in the pipeline to be added:
Plugin | Description | Depends | Loading | Augmentation | Callbacks |
---|---|---|---|---|---|
autogptq | Loads 4bit GPTQ-LoRA with quantized GPTQ as base | AutoGPTQ | ✅ | ✅ | ✅ |
bnb | Loads 4bit QLoRA with quantized bitsandbytes Linear4 | Huggingface bitsandbytes | ✅ | ✅ | ✅ |
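
For context, the `bnb` plugin automates the standard QLoRA pattern of loading a 4bit bitsandbytes-quantized base model and attaching LoRA adapters on top of it. Below is a minimal sketch of that pattern using plain Hugging Face `transformers` and `peft` (this is not the plugin's internal code; the model name and LoRA hyperparameters are illustrative placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base weights to 4bit (bitsandbytes Linear4bit)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16; see the mixed precision note below
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder checkpoint
    quantization_config=bnb_config,
)

# Attach trainable LoRA adapters on top of the quantized base
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```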
- Fixed an upcasting issue (which resulted in slowdowns) for the `bnb` plugin, originally discovered by the inventors of Unsloth. NOTE: we recommend using mixed precision when using 4bit quant for better performance, as per our benchmarks (see the sketch after this list).
- `bnb` is properly configured to work with FSDP following this guide.
- `triton_v2` kernels are not yet properly integrated into huggingface optimum.
- `triton_v2` kernels are the only 4bit kernels that work for training.
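
As an illustration of the mixed precision recommendation above, bf16 can be enabled through the standard `transformers` training arguments (a minimal sketch; the remaining hyperparameters are placeholders):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # placeholder path
    bf16=True,                       # mixed precision, recommended with 4bit quant
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
)
```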
GPTQ-LoRA depends on an AutoGPTQ backend to run. There are two backend options:
- Current Implementation
    - This is an extracted local subset from ModelCloud's refactored fork.
    - It removes redundant code to simplify the build and installation of the plugin.
- Legacy Implementation
    - This requires building the package from the official AutoGPTQ repository.
    - To replicate this implementation, follow the installation below.
    - The legacy implementation of GPTQ-LoRA uses an external AutoGPTQ package; you must ensure the specific commit is installed:

      ```
      pip install git+https://github.com/AutoGPTQ/AutoGPTQ.git@ea829c7bbe83561c2b1de26795b6592992373ef7
      ```
    - To construct the plugin, in the configuration object that is passed to the plugin, set `use_external_lib: True` (otherwise it defaults to using the local AutoGPTQ package):

      ```yaml
      peft:
        quantization:
          auto_gptq:
            kernel: triton_v2
            from_quantized: True
            use_external_lib: True
      ```
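
For reference, the `from_quantized: True` setting above corresponds roughly to AutoGPTQ's `from_quantized` loading path. Below is a minimal sketch of that call with the external package (not the plugin's internal code; the checkpoint name is a placeholder, and the exact kernel-selection flag varies across AutoGPTQ versions, which is why the plugin exposes it separately as the `kernel` setting):

```python
from auto_gptq import AutoGPTQForCausalLM

# Load an already-quantized GPTQ checkpoint; `trainable=True` sets up the 4bit
# layers so that LoRA adapters can be trained on top of them.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-v0.1-GPTQ",  # placeholder: any pre-quantized GPTQ checkpoint
    device="cuda:0",
    trainable=True,
)
```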
- GPTQ-LoRA is sometimes observed to have `nan` grad norms at the beginning of training, but training proceeds well otherwise.
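
If you want to keep an eye on this, one option is a small `transformers` callback that checks the logged grad norm (an illustrative sketch, not part of the plugin; the `grad_norm` key is only present in the trainer logs of recent `transformers` versions):

```python
import math
from transformers import TrainerCallback

class GradNormWatcher(TrainerCallback):
    """Warn when the logged grad norm is NaN, e.g. early in GPTQ-LoRA training."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        grad_norm = (logs or {}).get("grad_norm")
        if grad_norm is not None and math.isnan(grad_norm):
            print(f"nan grad norm at step {state.global_step}; "
                  "this is usually transient at the start of training.")
```

An instance can be passed to the trainer via `Trainer(callbacks=[GradNormWatcher()])`.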