
Inconsistency in Padding-Free Benchmarks with Different Transformers Versions #70

Open
achew010 opened this issue Aug 19, 2024 · 3 comments


achew010 commented Aug 19, 2024

Description

We observe no improvement with PaddingFree on QLoRA and GPTQ-LoRA when running benchmarks on OrcaMath.

However,

  • Additionally applying FOAK along with PaddingFree shows a significant improvement.
  • The benchmarks on FLAN also show an improvement when PaddingFree is used with QLoRA and GPTQ-LoRA.
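
For context, PaddingFree avoids computing over pad tokens by flattening all the examples in a batch into a single row and supplying position ids that restart at each example boundary, so the benefit depends on how much padding the dataset would otherwise need. Below is a rough sketch of the idea only, not the plugin's actual collator (`flatten_batch` is just an illustrative name):

```python
import torch

def flatten_batch(examples):
    """Illustrative padding-free collation. `examples` is a list of dicts
    holding 1-D `input_ids` and `labels` tensors of varying lengths."""
    # concatenate all sequences into one row instead of padding each to max length
    input_ids = torch.cat([ex["input_ids"] for ex in examples]).unsqueeze(0)
    labels = torch.cat([ex["labels"] for ex in examples]).unsqueeze(0)
    # position ids restart at 0 for every example so that variable-length
    # attention kernels can recover sequence boundaries without a padding mask
    position_ids = torch.cat(
        [torch.arange(len(ex["input_ids"])) for ex in examples]
    ).unsqueeze(0)
    return {"input_ids": input_ids, "labels": labels, "position_ids": position_ids}
```

If the two datasets have different length distributions, the amount of padding removed (and hence the speedup) will also differ.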

Mistral7B (OrcaMath) with transformers==4.42.4

| Framework Type | Num Devices | Per-Device Batch Size | Train Runtime (sec) | Throughput (toks/sec) |
| --- | --- | --- | --- | --- |
| BNB | 1 | 4 | 346 | 1586 |
| BNB + PF | 1 | 4 | 340 | 1595 |
| BNB + FOAK | 1 | 4 | 314 | 1748 |
| BNB + FOAK + PF | 1 | 4 | 245 | 2229 |

Mistral7B (FLAN) with transformers==4.42.4

| Framework Type | Num Devices | Per-Device Batch Size | Train Runtime (sec) | Throughput (toks/sec) |
| --- | --- | --- | --- | --- |
| BNB | 1 | 4 | 1888 | 1500 |
| BNB + PF | 1 | 4 | 1225 | 2314 |

NOTE: There is some variability across transformers versions; the results change when transformers is upgraded.

Mistral7B (OrcaMath) with transformers==4.44.0

| Framework Type | Num Devices | Per-Device Batch Size | Train Runtime (sec) | Throughput (toks/sec) |
| --- | --- | --- | --- | --- |
| BNB | 1 | 4 | 347 | 1586 |
| BNB + PF | 1 | 4 | 318 | 1726 |
@fabianlim fabianlim changed the title Inconsistency in Benchmarks with Different Transformers Versions Inconsistency in Padding-Free Benchmarks with Different Transformers Versions Aug 19, 2024

achew010 commented Sep 3, 2024

Description

We are again observing some variation with PaddingFree in the benchmarks for #78. Comparing against the published benchmarks, there are some instances where the improvement in training speed is significantly smaller.

Using transformers==4.44,

FLAN (6000 samples)

For FLAN, we expected both the pretokenized and untokenized runs to be similar to the currently published values, but notice that the current runs for both datasets show a smaller improvement than before.

| Experiment | Framework Config | Num Devices | Batch Size | Grad Acc. | Runtime (s) | Throughput (toks/s) | Improvement |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Current Published | GPTQ-LoRA + FOAK | 2 | 4 | 2 | 1034 | 1372 | baseline |
| Current Published | GPTQ-LoRA + FOAK + PF | 2 | 4 | 2 | 587 | 2472 | 43% |
| Pretokenized | GPTQ-LoRA + FOAK | 2 | 4 | 2 | 1029 | 1378 | baseline |
| Pretokenized | GPTQ-LoRA + FOAK + PF | 2 | 4 | 2 | 666 | 2181 | 35% |
| Untokenized | GPTQ-LoRA + FOAK | 2 | 4 | 2 | 1034 | 1372 | baseline |
| Untokenized | GPTQ-LoRA + FOAK + PF | 2 | 4 | 2 | 670 | 2162 | 35% |
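
For reference, the improvement column appears to be the percentage reduction in runtime relative to the corresponding baseline row (my assumption, but it reproduces the 43% and 35% figures above):

```python
# Assumption: "improvement" = percentage reduction in runtime vs the baseline row.
def improvement(baseline_runtime_s: float, runtime_s: float) -> float:
    return (baseline_runtime_s - runtime_s) / baseline_runtime_s * 100

print(f"{improvement(1034, 587):.0f}%")  # ~43%  (Current Published, FOAK + PF)
print(f"{improvement(1029, 666):.0f}%")  # ~35%  (Pretokenized, FOAK + PF)
```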

Orca-Math (2000 samples)

For Orca-Math, we also notice different degrees of improvement between the pretokenized and untokenized datasets compared to before.

| Experiment | Framework Config | Num Devices | Batch Size | Grad Acc. | Runtime (s) | Throughput (toks/s) | Improvement |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Current Published | GPTQ-LoRA | 2 | 4 | 2 | 388 | 704 | baseline |
| Current Published | GPTQ-LoRA + PF | 2 | 4 | 2 | 386 | 708 | 0% |
| Current Published | GPTQ-LoRA + FOAK | 2 | 4 | 2 | 186 | 1359 | baseline |
| Current Published | GPTQ-LoRA + FOAK + PF | 2 | 4 | 2 | 158 | 1771 | 15% |
| Pretokenized | GPTQ-LoRA | 2 | 4 | 2 | 388 | 708 | baseline |
| Pretokenized | GPTQ-LoRA + PF | 2 | 4 | 2 | 386 | 708 | 0% |
| Pretokenized | GPTQ-LoRA + FOAK | 2 | 4 | 2 | 204 | 1340 | baseline |
| Pretokenized | GPTQ-LoRA + FOAK + PF | 2 | 4 | 2 | 177 | 1579 | 13.2% |
| Untokenized | GPTQ-LoRA | 2 | 4 | 2 | 386 | 1478 | baseline |
| Untokenized | GPTQ-LoRA + PF | 2 | 4 | 2 | 365 | 1588 | 5% |
| Untokenized | GPTQ-LoRA + FOAK | 2 | 4 | 2 | 185 | 1478 | baseline |
| Untokenized | GPTQ-LoRA + FOAK + PF | 2 | 4 | 2 | 178 | 1588 | 3% |

Update: new benchmarks with a larger 8000-sample subset of Orca-Math were pushed to #78 with slightly more consistent values, but adding PF on top of FOAK still shows only a minimal improvement.

fabianlim commented

@achew010

> We are again observing some variation with PaddingFree in the benchmarks for #78. Comparing against the published benchmarks, there are some instances where the improvement in training speed is significantly smaller.

  • I thought we were also seeing a smaller improvement for GPTQ-LoRA + FOAK vs GPTQ-LoRA. Has this been resolved?
  • Can you list the transformers versions for the above results (e.g. captured as in the snippet below)?
  • One issue I can see is that your runtimes have a lot of variation. For example, GPTQ-LoRA + FOAK was essentially run twice (pretokenized vs untokenized) and the throughputs differ by over 100 toks/s (1340 vs 1478). This suggests that we need to raise the number of samples for the Orca-Math bench, maybe to 8000 or so.
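
For the version listing, something like the snippet below (just one illustrative way to capture it) is enough:

```python
import torch
import transformers

# record the library versions used for each benchmark run
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
```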

achew010 commented

We fixed a multiple-import_and_reload bug in #79 so that FastCrossEntropy is now correctly patched when PaddingFree is used (previously FastCrossEntropy was not patched correctly). We expect some improvement in the FOAK + PF numbers reported here.
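
As a generic illustration of this class of bug (not the actual fms-acceleration code; see #79 for the real fix): reloading a module after one of its attributes has been monkey-patched re-executes the module body and silently restores the original attribute, dropping the patch.

```python
import importlib
import json

json.dumps = lambda *args, **kwargs: "patched"   # monkey-patch an attribute
print(json.dumps({}))                            # -> "patched"

importlib.reload(json)                           # a later import-and-reload...
print(json.dumps({}) == "patched")               # -> False: the patch was silently dropped
```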

@fabianlim fabianlim added the question Further information is requested label Nov 4, 2024