ReIntroduce Package for FMS Accel #223

Merged
3 commits merged into foundation-model-stack:main on Jul 9, 2024

Conversation

@fabianlim (Collaborator) commented Jun 30, 2024

Description of the change

@tedhtchang @anhuong
This reintroduces the fms-acceleration package, which will become available once foundation-model-stack/fms-acceleration#45 is merged.

Related issue number

#219

How to verify the PR

  1. Install and check that it is properly installed:
pip install "fms-hf-tuning[fms-accel] @ git+https://github.com/fabianlim/fms-hf-tuning.git@fix/accel-ref"

Then verify the install output:

Successfully installed fms-acceleration-0.1.0 fms-hf-tuning-0.1.dev208+g3157e6f simpleeval-0.9.13 tokenizers-0.15.2 transformers-4.39.3
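
As an extra sanity check (a minimal sketch, not part of the PR's documented steps), you can also confirm the package imports and that pip sees it:

$ python -c "import fms_acceleration"
$ pip show fms-acceleration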

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@fabianlim fabianlim requested review from tedhtchang and anhuong and removed request for alex-jw-brooks, Ssukriti and anhuong June 30, 2024 15:58
@fabianlim fabianlim self-assigned this Jun 30, 2024
@fabianlim fabianlim marked this pull request as draft June 30, 2024 15:58
@fabianlim fabianlim marked this pull request as ready for review July 1, 2024 07:18
tedhtchang previously approved these changes Jul 5, 2024

@tedhtchang (Collaborator) left a comment:

/LGTM

Signed-off-by: Yu Chin Fabian Lim <[email protected]>
@fabianlim (Collaborator, Author) commented:

@tedhtchang is there anybody else who should review this, or can we merge?

@anhuong (Collaborator) commented Jul 8, 2024

@fabianlim I am reviewing and testing; will get my review in today.

anhuong previously approved these changes Jul 8, 2024

@anhuong (Collaborator) left a comment:

Looks good. I had a question on the usage of the fms-acceleration library: is it assumed that if one installs fms-acceleration, one must configure an acceleration framework? Because if I install it and try to run sft_trainer normally, it fails with the error ValueError: No plugins could be configured. Please check the acceleration framework configuration file.
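
For reference, a run that does configure the framework would point sft_trainer at an acceleration config file. This is a hypothetical sketch; the --acceleration_framework_config_file flag name and the config path are assumptions inferred from the error message, not verified in this PR:

# Hypothetical invocation; flag name and ./framework.yaml are assumptions.
$ python tuning/sft_trainer.py \
    --acceleration_framework_config_file ./framework.yaml \
    ...usual training arguments...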

@anhuong (Collaborator) commented Jul 8, 2024

Also, when I try installing the plugin, I get an error that torch isn't installed when it is.

$ pip install fms-acceleration
...
Successfully installed fms-acceleration-0.1.1 tokenizers-0.15.2 transformers-4.39.3

$ python -m fms_acceleration.cli install fms_acceleration_peft
/usr/local/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
Collecting git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/accelerated-peft
  Cloning https://github.com/foundation-model-stack/fms-acceleration.git to /tmp/pip-req-build-c7fg3o71
  Running command git clone --filter=blob:none --quiet https://github.com/foundation-model-stack/fms-acceleration.git /tmp/pip-req-build-c7fg3o71
  Resolved https://github.com/foundation-model-stack/fms-acceleration.git to commit 06dcc4d967cb13842660eed0ebf403a191748ce8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting auto-gptq@ git+https://github.com/AutoGPTQ/AutoGPTQ.git@ea829c7bbe83561c2b1de26795b6592992373ef7 (from fms-acceleration-peft==0.0.1)
  Cloning https://github.com/AutoGPTQ/AutoGPTQ.git (to revision ea829c7bbe83561c2b1de26795b6592992373ef7) to /tmp/pip-install-91u6z775/auto-gptq_65c3bf33f0094c9295adc1efa87da6b0
  Running command git clone --filter=blob:none --quiet https://github.com/AutoGPTQ/AutoGPTQ.git /tmp/pip-install-91u6z775/auto-gptq_65c3bf33f0094c9295adc1efa87da6b0
  Running command git rev-parse -q --verify 'sha^ea829c7bbe83561c2b1de26795b6592992373ef7'
  Running command git fetch -q https://github.com/AutoGPTQ/AutoGPTQ.git ea829c7bbe83561c2b1de26795b6592992373ef7
  Running command git checkout -q ea829c7bbe83561c2b1de26795b6592992373ef7
  Resolved https://github.com/AutoGPTQ/AutoGPTQ.git to commit ea829c7bbe83561c2b1de26795b6592992373ef7
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [2 lines of output]
      Building PyTorch CUDA extension requires PyTorch being installed, please install PyTorch first: No module named 'torch'.
       NOTE: This issue may be raised due to pip build isolation system (ignoring local packages). Please use `--no-build-isolation` when installing with pip, and refer to https://github.com/AutoGPTQ/AutoGPTQ/pull/620 for more details.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

# but I see torch does exist
$ pip show torch
Name: torch
Version: 2.3.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /home/tuning/.local/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, flash-attn, fms-acceleration, fms-hf-tuning, peft, trl

# and the following works as well
$ python 
Python 3.11.7 (main, May 16 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import fms_acceleration

Also verified that flash-attn and packaging are installed.

@fabianlim fabianlim dismissed stale reviews from anhuong and tedhtchang via 1d2b016 July 9, 2024 02:58
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
@fabianlim (Collaborator, Author) commented Jul 9, 2024

@anhuong thanks for reviewing. We have pushed another commit 3ef14a2 to clean up one of the previous commits.

> Looks good, I had a question on the usage of fms-acceleration library. It is assumed that if one installs fms-acceleration, that one must configure an acceleration framework correct? Because if I install and try to run sft_trainer normally, it fails with error ValueError: No plugins could be configured. Please check the acceleration framework configuration file.

This is a little strange; the code should already handle this case. As you can see below, I have fms-accel installed:

fms-acceleration==0.1.1
fms-hf-tuning @ file:///data/repos/fms-hf-tuning

And if I run the following, without any framework config arguments, it runs as normal. Maybe you can test on the latest commit.

TRANSFORMERS_VERBOSITY=info \
	python \
	tuning/sft_trainer.py \
	--training_data_path $DATA_PATH \
	--output_dir ./results \
	--num_train_epochs 1 \
	--torch_dtype float16 \
	--model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
	--per_device_train_batch_size 4 \
	--per_device_eval_batch_size 1 \
	--gradient_accumulation_steps 1 \
	--gradient_checkpointing True \
	--evaluation_strategy "no" \
	--save_strategy "no" \
	--learning_rate 2e-4 \
	--weight_decay 0.01 \
	--warmup_steps 10 \
	--adam_epsilon 1e-4 \
	--lr_scheduler_type "linear" \
	--logging_strategy steps \
	--logging_steps 10 \
	--include_tokens_per_second \
	--packing True \
	--use_flash_attn True \
	--response_template "\n### Response:" \
	--dataset_text_field "output" \
	--max_steps 200 \
	--peft_method lora \
	--r 16 --lora_alpha 16 --lora_dropout 0.1 \
	--target_modules q_proj k_proj v_proj o_proj

As you can see, no fms-acceleration framework plugins are activated:

***** Running training *****
  Num examples = 3,251
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 200
  Number of trainable parameters = 4,505,600
  0%|                                                                                                                                                                                                    | 0/200 [00:00<?, ?it/s]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
/home/flim/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
{'loss': 1.2318, 'grad_norm': 0.342529296875, 'learning_rate': 0.0002, 'epoch': 0.01}

> Also when I try installing the plugin, I get an error that torch isn't installed when it is.

The error is not a torch installation error, but rather that AutoGPTQ requires the CUDA toolkit to be installed.

  • In our README, we provide sample NVIDIA images that have the CUDA toolkit already installed.
  • This is a common thing with libraries that ship kernels. For example, flash-attention used to have this problem (before they published their prebuilt binaries).
  • And we have an issue to remove the CUDA toolkit dependency by extracting out only the Triton portions of AutoGPTQ. This will be merged and published soon, making the installation process more seamless for future users. In the meantime, the workaround suggested in the error output may help; see the sketch below.
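
As the AutoGPTQ note in the error output suggests, one interim workaround (a sketch, not verified here) is to install the plugin with pip's build isolation disabled, so the build can see the torch already in the environment. The URL below is the same one the fms_acceleration CLI resolves to in the log above:

# Sketch: install the accelerated-peft plugin directly, without build isolation.
$ pip install --no-build-isolation "git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/accelerated-peft"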

If all looks good to you again, do you mind giving me another approval so I can merge?

@anhuong (Collaborator) left a comment:

Thanks for the details, Fabian. Running on this PR, I no longer get the initial error where, with fms-acceleration installed, I was forced to configure it in order to run sft_trainer.py.

@anhuong anhuong merged commit bf22a2f into foundation-model-stack:main Jul 9, 2024
7 checks passed