Added Support for Apple Silicon #1289

Open
wants to merge 5 commits into
base: main

@shashikanth-a shashikanth-a commented Nov 14, 2024

  • Unoptimized
  • No gguf support yet.
  • Build Triton and bitsandbytes from source
  • cmake -DCOMPUTE_BACKEND=mps -S . for bitsandbytes building
  • pip install unsloth-zoo==2024.11.4
  • pip install xformers==0.0.25

- No gguf support yet.
- Build Triton and bitsandbytes from source
- `cmake -DCOMPUTE_BACKEND=hip -S .` for bitsandbytes building
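A quick preflight check along these lines can catch a mismatched environment before building anything. This is a minimal sketch using only the Python standard library; the helper names (`is_apple_silicon`, `installed_version`, `check_pins`) are hypothetical and not part of this PR, and the pins are the ones listed in the description.

```python
import platform
from importlib import metadata

# Version pins taken from the PR description.
PINS = {"unsloth-zoo": "2024.11.4", "xformers": "0.0.25"}


def is_apple_silicon() -> bool:
    """True on macOS running natively on an arm64 (M-series) CPU."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Map each package whose version differs from its pin to (wanted, found)."""
    problems = {}
    for package, wanted in pins.items():
        found = lookup(package)
        if found != wanted:
            problems[package] = (wanted, found)
    return problems


if __name__ == "__main__":
    print("Apple Silicon:", is_apple_silicon())
    for package, (wanted, found) in check_pins(PINS).items():
        print(f"{package}: want {wanted}, found {found}")
```

The `lookup` parameter exists only so the check can be exercised without the packages installed; in normal use the default is enough.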
@yukiarimo

Is this working?

@shimmyshimmer
Collaborator

Hi there, thank you for this. We will need a bit more time to review! :)

@mkemka

mkemka commented Nov 21, 2024

Hi @shashikanth-a - thank you for this. Could you please provide information about the environment and package versions you used for development?

@yukiarimo

Hey, does this work with the newly released vision support?

@mkemka

mkemka commented Nov 23, 2024

Currently I can run this if:

  • Decorators mentioning "@torch.compile(fullgraph = False, dynamic = True, options = torch_compile_options)" are removed in llama and Gemma files.
  • Fine-tune llama-3-8b (3.2 1b and 3b throw an error due to RoPE for some reason).
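
Rather than deleting the decorators by hand, one option is to make compilation conditional. The sketch below is an assumption about how that could look, not what this PR does: `maybe_compile` is a hypothetical helper, and `fake_compile` stands in for `torch.compile` so the example runs without PyTorch installed.

```python
def maybe_compile(enabled, compile_fn, **compile_kwargs):
    """Return a decorator that applies compile_fn only when enabled is True.

    In real use compile_fn would be torch.compile, and enabled would be False
    on backends where compilation fails (an assumption, not the PR's code).
    """
    def decorator(fn):
        if not enabled:
            return fn  # leave the function untouched, as if undecorated
        return compile_fn(fn, **compile_kwargs)
    return decorator


def fake_compile(fn, **options):
    """Stand-in for torch.compile: records the options instead of compiling."""
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    wrapper.compiled_with = options
    return wrapper


@maybe_compile(enabled=False, compile_fn=fake_compile, fullgraph=False, dynamic=True)
def forward_plain(x):
    return x * 2


@maybe_compile(enabled=True, compile_fn=fake_compile, fullgraph=False, dynamic=True)
def forward_compiled(x):
    return x * 2
```

With `enabled=False` the function is returned unchanged, which has the same effect as removing the decorator; the flag could be driven by a device check at import time.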

- lazy loading of model
- minor refactoring
- optimizers and lr schedulers
- gc
- should improve memory consumption
@mkemka

mkemka commented Nov 26, 2024

With the changes I can run this out of the box with the steps outlined above:

  • Build Triton from source and pip install -e .
  • Build bnb with cmake -DCOMPUTE_BACKEND=mps -S . and pip install -e .

On an M4 Pro I'm getting around 100 t/s for llama3-8b. I can confirm it now also works with llama-3.2-3b.

@shimmyshimmer
Collaborator

Thanks a lot - would anyone be so kind as to benchmark this against MLX itself and share results?

Time taken, amount of VRAM, context length, and whether the losses match. Of course that's a lot, so just the time and a check that the losses match would be more than helpful. Thank you so much! :)

@mkemka

mkemka commented Jan 3, 2025

Sorry for the delay.
The test compares fine-tuning with the above PR against an out-of-the-box MLX LoRA fine-tune, using the same model and the same dataset.
Hardware: M4 Mac Pro - 48GB model.
The dataset is mlx-community/wikisql, which I converted from the MLX format back to the normal HF format for Unsloth.

Unsloth run

python unsloth-cli.py \
  --model_name "unsloth/llama-3-8b" \
  --max_seq_length 8192 \
  --dtype None \
  --load_in_4bit \
  --r 4 \
  --lora_alpha 4 \
  --lora_dropout 0.1 \
  --bias "none" \
  --use_gradient_checkpointing "unsloth" \
  --random_state 3407 \
  --use_rslora \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --warmup_steps 5 \
  --max_steps 100 \
  --learning_rate 2e-6 \
  --logging_steps 1 \
  --optim "adamw_8bit" \
  --weight_decay 0.005 \
  --lr_scheduler_type "linear" \
  --seed 3407 \
  --output_dir "outputs" \
  --report_to "tensorboard" \
  --save_model \
  --save_path "model" \
  --dataset data/

Data is formatted and ready!
Trainable parameters: 0.021% (1.704M/8030.261M)
Starting training..., iters: 100
Iter 1: Val loss 1.889, Val took 24.562s
Iter 10: Train loss 1.848, Learning Rate 1.200e-06, It/sec 0.474, Tokens/sec 131.368, Trained Tokens 2769, Peak mem 17.353 GB
Iter 20: Train loss 1.827, Learning Rate 2.000e-06, It/sec 0.472, Tokens/sec 128.186, Trained Tokens 5483, Peak mem 17.353 GB
Iter 30: Train loss 1.875, Learning Rate 2.000e-06, It/sec 0.492, Tokens/sec 134.175, Trained Tokens 8212, Peak mem 17.353 GB
Iter 40: Train loss 1.841, Learning Rate 2.000e-06, It/sec 0.494, Tokens/sec 132.973, Trained Tokens 10903, Peak mem 17.353 GB
Iter 50: Train loss 1.810, Learning Rate 2.000e-06, It/sec 0.478, Tokens/sec 131.516, Trained Tokens 13654, Peak mem 17.353 GB
Iter 60: Train loss 1.804, Learning Rate 2.000e-06, It/sec 0.437, Tokens/sec 119.466, Trained Tokens 16387, Peak mem 17.353 GB
Iter 70: Train loss 1.835, Learning Rate 2.000e-06, It/sec 0.480, Tokens/sec 126.941, Trained Tokens 19030, Peak mem 17.353 GB
Iter 80: Train loss 1.723, Learning Rate 2.000e-06, It/sec 0.435, Tokens/sec 115.940, Trained Tokens 21693, Peak mem 17.353 GB
Iter 90: Train loss 1.743, Learning Rate 2.000e-06, It/sec 0.427, Tokens/sec 115.289, Trained Tokens 24393, Peak mem 17.353 GB
Iter 100: Val loss 1.600, Val took 26.121s
Iter 100: Train loss 1.724, Learning Rate 2.000e-06, It/sec 2.737, Tokens/sec 709.761, Trained Tokens 26986, Peak mem 17.353 GB

MLX Run

mlx_lm.lora \
  --model "unsloth/llama-3-8b" \
  --train \
  --data "mlx-community/wikisql" \
  --iters 100 \
  --batch-size 1 \
  --learning-rate 2e-6 \
  --weight-decay 0.005 \
  --seed 3407 \
  --adapter-path "outputs" \
  --grad-checkpoint \
  --max-seq-length 8192
Loading datasets
Loading Hugging Face dataset mlx-community/wikisql.
Training
Trainable parameters: 0.042% (3.408M/8030.261M)
Starting training..., iters: 100
Iter 1: Val loss 2.931, Val took 9.261s
Iter 10: Train loss 3.096, Learning Rate 2.000e-06, It/sec 1.238, Tokens/sec 92.346, Trained Tokens 746, Peak mem 15.299 GB
Iter 20: Train loss 3.045, Learning Rate 2.000e-06, It/sec 1.341, Tokens/sec 99.536, Trained Tokens 1488, Peak mem 15.326 GB
Iter 30: Train loss 2.504, Learning Rate 2.000e-06, It/sec 1.217, Tokens/sec 97.619, Trained Tokens 2290, Peak mem 15.330 GB
Iter 40: Train loss 2.347, Learning Rate 2.000e-06, It/sec 1.330, Tokens/sec 105.073, Trained Tokens 3080, Peak mem 15.330 GB
Iter 50: Train loss 2.430, Learning Rate 2.000e-06, It/sec 1.282, Tokens/sec 99.348, Trained Tokens 3855, Peak mem 15.330 GB
Iter 60: Train loss 2.148, Learning Rate 2.000e-06, It/sec 1.185, Tokens/sec 103.256, Trained Tokens 4726, Peak mem 15.330 GB
Iter 70: Train loss 1.879, Learning Rate 2.000e-06, It/sec 1.173, Tokens/sec 104.301, Trained Tokens 5615, Peak mem 15.571 GB
Iter 80: Train loss 1.972, Learning Rate 2.000e-06, It/sec 1.229, Tokens/sec 94.750, Trained Tokens 6386, Peak mem 15.571 GB
Iter 90: Train loss 1.845, Learning Rate 2.000e-06, It/sec 1.234, Tokens/sec 103.314, Trained Tokens 7223, Peak mem 15.571 GB
Iter 100: Val loss 1.641, Val took 7.520s
Iter 100: Train loss 1.715, Learning Rate 2.000e-06, It/sec 17.545, Tokens/sec 1336.898, Trained Tokens 7985, Peak mem 15.571 GB
Iter 100: Saved adapter weights to outputs/adapters.safetensors and outputs/0000100_adapters.safetensors.
Saved final weights to outputs/adapters.safetensors.
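
To compare two runs like these without eyeballing the logs, the periodic training lines can be parsed and summarized. This is a minimal sketch that assumes the log format shown above (`Iter N: Train loss ..., Tokens/sec ..., Peak mem ... GB`); the function names are hypothetical and the sample embeds only two lines from the Unsloth run.

```python
import re
from statistics import mean

# Matches the periodic training lines in the logs above (not the Val lines).
TRAIN_LINE = re.compile(
    r"Iter (?P<iter>\d+): Train loss (?P<loss>[\d.]+), "
    r"Learning Rate (?P<lr>[\d.e+-]+), It/sec (?P<its>[\d.]+), "
    r"Tokens/sec (?P<tps>[\d.]+), Trained Tokens (?P<tokens>\d+), "
    r"Peak mem (?P<mem>[\d.]+) GB"
)


def parse_train_lines(log_text):
    """Extract one dict of numeric fields per 'Iter N: Train loss ...' line."""
    rows = []
    for match in TRAIN_LINE.finditer(log_text):
        rows.append({
            "iter": int(match["iter"]),
            "loss": float(match["loss"]),
            "tokens_per_sec": float(match["tps"]),
            "trained_tokens": int(match["tokens"]),
            "peak_mem_gb": float(match["mem"]),
        })
    return rows


def summarize(rows):
    """Summarize a run: final loss, mean throughput, total tokens, peak memory."""
    return {
        "final_loss": rows[-1]["loss"],
        "mean_tokens_per_sec": mean(r["tokens_per_sec"] for r in rows),
        "trained_tokens": rows[-1]["trained_tokens"],
        "peak_mem_gb": max(r["peak_mem_gb"] for r in rows),
    }


SAMPLE = """\
Iter 10: Train loss 1.848, Learning Rate 1.200e-06, It/sec 0.474, Tokens/sec 131.368, Trained Tokens 2769, Peak mem 17.353 GB
Iter 20: Train loss 1.827, Learning Rate 2.000e-06, It/sec 0.472, Tokens/sec 128.186, Trained Tokens 5483, Peak mem 17.353 GB
"""

if __name__ == "__main__":
    print(summarize(parse_train_lines(SAMPLE)))
```

The two logs report different token totals after 100 iterations (26986 for Unsloth, 7985 for MLX), so tokens/sec is probably a fairer basis for comparison than it/sec.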

I can already see that the parameters need to be reviewed, since the trainable percentage differs between the two runs (0.021% for Unsloth vs. 0.042% for MLX).
If this direction is useful I can keep looking at it.
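
The 2x gap in trainable parameters (1.704M vs. 3.408M) is what a doubled LoRA rank would produce, since an adapter on a `d_in x d_out` weight adds `r * (d_in + d_out)` parameters. The sketch below checks one configuration that happens to reproduce both logged numbers: adapters on `q_proj` and `v_proj` in all 32 layers of Llama-3-8B, at rank 4 and rank 8. Which modules and rank each run actually used is an assumption here, not something the logs confirm.

```python
# Llama-3-8B shapes: hidden size 4096, 8 KV heads x 128 head dim = 1024, 32 layers.
HIDDEN = 4096
KV_DIM = 1024
NUM_LAYERS = 32

# (d_in, d_out) of the adapted projections; q_proj and v_proj are an assumption.
ADAPTED = {"q_proj": (HIDDEN, HIDDEN), "v_proj": (HIDDEN, KV_DIM)}


def lora_param_count(rank, adapted=ADAPTED, num_layers=NUM_LAYERS):
    """Trainable LoRA parameters: rank * (d_in + d_out) per adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in adapted.values())
    return per_layer * num_layers


if __name__ == "__main__":
    for rank in (4, 8):
        millions = lora_param_count(rank) / 1e6
        print(f"rank {rank}: {millions:.3f}M trainable parameters")
```

Under these assumptions rank 4 gives 1,703,936 parameters and rank 8 gives 3,407,872, so aligning the rank (and the set of adapted layers) between the two commands would be a reasonable first thing to check.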

@noaebbot

noaebbot commented Jan 8, 2025

Was able to make this work! Thanks for this! But unsloth-zoo==2024.11.4 did not work for me; some functions were missing. I was able to make it run with version 2024.11.6.

@mitchross

Any plan to get this to merge soon? I really need this feature.

@shimmyshimmer
Collaborator

shimmyshimmer commented Jan 11, 2025

Hey y'all, thanks a lot for all the tests, and thanks once again @shashikanth-a for the PR. We'll be doing a PR review and benchmarking tests, hopefully next week! Thanks @mkemka as well for the test, we appreciate it.

@yukiarimo

Cool stuff! Also, if you've tried KoboldCPP (based on llama.cpp), there's stuff like prompt caching. When I text the model, it only processes the new tokens because everything above them has already been processed, so it goes fast. How can I get that with Unsloth?
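
The behavior described here is usually called prompt (prefix) caching: the runtime keeps the state for tokens it has already processed and only runs the model on the part of the new prompt that differs. The toy sketch below illustrates just the bookkeeping, using a hypothetical `PrefixCache` class; it is not KoboldCPP's or Unsloth's API, and a real implementation would store attention key/value tensors rather than token ids.

```python
class PrefixCache:
    """Track already-processed tokens and report which ones still need work."""

    def __init__(self):
        self.tokens = []  # token ids whose state is considered cached

    def tokens_to_process(self, prompt):
        """Return the suffix of prompt not covered by the cached prefix."""
        shared = 0
        for cached, new in zip(self.tokens, prompt):
            if cached != new:
                break
            shared += 1
        # Anything cached beyond the shared prefix is stale and is dropped.
        self.tokens = list(prompt)
        return list(prompt[shared:])


cache = PrefixCache()
first = cache.tokens_to_process([1, 2, 3, 4])         # nothing cached yet
second = cache.tokens_to_process([1, 2, 3, 4, 5, 6])  # conversation extended
third = cache.tokens_to_process([1, 2, 9])            # earlier text was edited
```

Extending the conversation only costs the new tokens, while editing earlier text invalidates everything after the first changed token.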


6 participants