OOM when batch size is automatically calculated #6

Open
seungduk-yanolja opened this issue Nov 13, 2024 · 0 comments

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 20.55it/s]
2024-11-13 06:31:53.00986: FMO INFO: Finding the optimal batch size with offloading set to False
2024-11-13 06:31:54.00091: FMO INFO: Test Batch Size: 32
2024-11-13 06:32:30.00441: FMO INFO: Test Batch Size: 16
2024-11-13 06:32:48.00488: FMO INFO: Finished finding the optimal batch size: batch size: 16
Collect Activation Statistics:   2%|██▏                                                                                                                                      | 1/64 [00:39<41:05, 39.14s/it]
2024-11-13 06:33:28,997.00997: fmo.main ERROR: OOM: The process cannot run on the current device due to insufficient memory. Refer to the FAQ in README.md for handling out-of-memory errors.

However, with the batch size specified manually (a lower value, 8), the following command

(fmo) seungduk@h100-1:~/friendli-model-optimizer$ fmo quantize --model-name-or-path /data/nas-2/seungduk/sanitizer/sanitizer_7b_full/checkpoint-33/ --dataset-name-or-path /data/nas-2/seungduk/fmo/fmo_samples.jsonl --dataset-max-length 4086 --dataset-num-samples 1024 --dataset-target-column-name text --dataset-split-name train  --output-dir /data/nas-2/seungduk/fmo/sanitizer-qwen-7b-fp8 --mode fp8 --pedantic-level 2 --dataset-batch-size 8
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 20.35it/s]
Collect Activation Statistics:  11%|██████████████▊                                                                                                                        | 14/128 [02:15<19:07, 10.07s/it

works without any issues. I think choosing the batch size more conservatively would help avoid this issue.
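To illustrate what "conservatively" could mean here, below is a minimal sketch. It is not FMO's actual implementation: the function name `pick_conservative_batch_size`, the `fits` probe callback, and the `safety_factor` parameter are all hypothetical. The idea is to keep the existing halving search, but back off one more step from the first batch size that passes the probe, since a single probe batch may not reflect peak memory during the full statistics collection.

```python
from typing import Callable


def pick_conservative_batch_size(
    fits: Callable[[int], bool],
    start: int = 32,
    safety_factor: int = 2,
) -> int:
    """Halve from `start` until `fits` succeeds, then back off further.

    `fits` is a hypothetical probe that returns True when a trial run with
    the given batch size completes without an out-of-memory error.
    `safety_factor` divides the first passing batch size to leave headroom.
    """
    batch_size = start
    while batch_size >= 1:
        if fits(batch_size):
            # Do not use the first passing size directly; leave headroom.
            return max(1, batch_size // safety_factor)
        batch_size //= 2
    raise RuntimeError("even batch size 1 does not fit on this device")


if __name__ == "__main__":
    # Stand-in probe mirroring the log above: 32 fails, 16 passes.
    print(pick_conservative_batch_size(lambda b: b <= 16))
```

With a probe that behaves like the log above (32 fails, 16 passes), this sketch returns 8, which is the value that worked when I set `--dataset-batch-size` manually.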
