
Inconsistency in Padding-Free Benchmarks with Different Transformers Versions #70

Open
achew010 opened this issue Aug 19, 2024 · 3 comments


achew010 commented Aug 19, 2024

Description

We observe no improvement with PaddingFree on QLoRA and GPTQ-LoRA when running benchmarks on OrcaMath.

However,

  • Additionally applying FOAK along with PaddingFree shows a significant improvement.
  • The benchmarks on FLAN also show an improvement when PaddingFree is used with QLoRA and GPTQ-LoRA.
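
For context, PaddingFree avoids computing over pad tokens by flattening all the examples in a batch into a single row and supplying position ids that restart at each example boundary, so the benefit depends on how much padding the dataset would otherwise need. Below is a rough sketch of the idea only, not the plugin's actual collator (`flatten_batch` is just an illustrative name):

```python
import torch

def flatten_batch(examples):
    """Illustrative padding-free collation. `examples` is a list of dicts
    holding 1-D `input_ids` and `labels` tensors of varying lengths."""
    # concatenate all sequences into one row instead of padding each to max length
    input_ids = torch.cat([ex["input_ids"] for ex in examples]).unsqueeze(0)
    labels = torch.cat([ex["labels"] for ex in examples]).unsqueeze(0)
    # position ids restart at 0 for every example so that variable-length
    # attention kernels can recover sequence boundaries without a padding mask
    position_ids = torch.cat(
        [torch.arange(len(ex["input_ids"])) for ex in examples]
    ).unsqueeze(0)
    return {"input_ids": input_ids, "labels": labels, "position_ids": position_ids}
```

If the two datasets have different length distributions, the amount of padding removed (and hence the speedup) will also differ.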

Mistral7B (OrcaMath) with transformers==4.42.4

| Framework Type | Num Devices | Per-Device Batch Size | Train Runtime (sec) | Throughput (toks/sec) |
| --- | --- | --- | --- | --- |
| BNB | 1 | 4 | 346 | 1586 |
| BNB + PF | 1 | 4 | 340 | 1595 |
| BNB + FOAK | 1 | 4 | 314 | 1748 |
| BNB + FOAK + PF | 1 | 4 | 245 | 2229 |

Mistral7B (FLAN) with transformers==4.42.4

| Framework Type | Num Devices | Per-Device Batch Size | Train Runtime (sec) | Throughput (toks/sec) |
| --- | --- | --- | --- | --- |
| BNB | 1 | 4 | 1888 | 1500 |
| BNB + PF | 1 | 4 | 1225 | 2314 |

NOTE: There is some variability across transformers versions; the results change when transformers is upgraded.

Mistral7B (OrcaMath) with transformers==4.44.0

| Framework Type | Num Devices | Per-Device Batch Size | Train Runtime (sec) | Throughput (toks/sec) |
| --- | --- | --- | --- | --- |
| BNB | 1 | 4 | 347 | 1586 |
| BNB + PF | 1 | 4 | 318 | 1726 |
@fabianlim fabianlim changed the title Inconsistency in Benchmarks with Different Transformers Versions Inconsistency in Padding-Free Benchmarks with Different Transformers Versions Aug 19, 2024

achew010 commented Sep 3, 2024

Description

We are again observing some variation with PaddingFree in the benchmarks for #78. Comparing against the published benchmarks, there are some instances where the improvement in training speed is significantly smaller.

Using transformers==4.44,

FLAN (6000 samples)

For FLAN, we expected both the pretokenized and untokenized runs to be similar to the currently published values, but notice that the current runs for both datasets show a smaller improvement than before.

| Experiment | Framework Config | Num Devices | Batch Size | Grad Acc. | Runtime (s) | Throughput (toks/s) | Improvement |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Current Published | GPTQ-LoRA + FOAK | 2 | 4 | 2 | 1034 | 1372 | baseline |
| Current Published | GPTQ-LoRA + FOAK + PF | 2 | 4 | 2 | 587 | 2472 | 43% |
| Pretokenized | GPTQ-LoRA + FOAK | 2 | 4 | 2 | 1029 | 1378 | baseline |
| Pretokenized | GPTQ-LoRA + FOAK + PF | 2 | 4 | 2 | 666 | 2181 | 35% |
| Untokenized | GPTQ-LoRA + FOAK | 2 | 4 | 2 | 1034 | 1372 | baseline |
| Untokenized | GPTQ-LoRA + FOAK + PF | 2 | 4 | 2 | 670 | 2162 | 35% |
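
For reference, the improvement column appears to be the percentage reduction in runtime relative to the corresponding baseline row (my assumption, but it reproduces the 43% and 35% figures above):

```python
# Assumption: "improvement" = percentage reduction in runtime vs the baseline row.
def improvement(baseline_runtime_s: float, runtime_s: float) -> float:
    return (baseline_runtime_s - runtime_s) / baseline_runtime_s * 100

print(f"{improvement(1034, 587):.0f}%")  # ~43%  (Current Published, FOAK + PF)
print(f"{improvement(1029, 666):.0f}%")  # ~35%  (Pretokenized, FOAK + PF)
```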

Orca-Math (2000 samples)

For Orca-Math, we also notice different degrees of improvement between the pretokenized and untokenized datasets compared to before.

| Experiment | Framework Config | Num Devices | Batch Size | Grad Acc. | Runtime (s) | Throughput (toks/s) | Improvement |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Current Published | GPTQ-LoRA | 2 | 4 | 2 | 388 | 704 | baseline |
| Current Published | GPTQ-LoRA + PF | 2 | 4 | 2 | 386 | 708 | 0% |
| Current Published | GPTQ-LoRA + FOAK | 2 | 4 | 2 | 186 | 1359 | baseline |
| Current Published | GPTQ-LoRA + FOAK + PF | 2 | 4 | 2 | 158 | 1771 | 15% |
| Pretokenized | GPTQ-LoRA | 2 | 4 | 2 | 388 | 708 | baseline |
| Pretokenized | GPTQ-LoRA + PF | 2 | 4 | 2 | 386 | 708 | 0% |
| Pretokenized | GPTQ-LoRA + FOAK | 2 | 4 | 2 | 204 | 1340 | baseline |
| Pretokenized | GPTQ-LoRA + FOAK + PF | 2 | 4 | 2 | 177 | 1579 | 13.2% |
| Untokenized | GPTQ-LoRA | 2 | 4 | 2 | 386 | 1478 | baseline |
| Untokenized | GPTQ-LoRA + PF | 2 | 4 | 2 | 365 | 1588 | 5% |
| Untokenized | GPTQ-LoRA + FOAK | 2 | 4 | 2 | 185 | 1478 | baseline |
| Untokenized | GPTQ-LoRA + FOAK + PF | 2 | 4 | 2 | 178 | 1588 | 3% |

Update: new benchmarks with a larger 8000-sample subset of Orca-Math were pushed to #78 with slightly more consistent values, but adding PF on top of FOAK still shows only a minimal improvement.

fabianlim commented

@achew010

> We are again observing some variation with PaddingFree in the benchmarks for #78. Comparing against the published benchmarks, there are some instances where the improvement in training speed is significantly smaller.

  • I thought we were also seeing a smaller improvement for GPTQ-LoRA + FOAK vs GPTQ-LoRA. Has this been resolved?
  • Can you list the transformers versions for the above results (e.g. captured as in the snippet below)?
  • One issue I can see is that your runtimes have a lot of variation. For example, GPTQ-LoRA + FOAK was essentially run twice (pretokenized vs untokenized) and the throughputs differ by over 100 toks/s (1340 vs 1478). This suggests that we need to raise the number of samples for the Orca-Math bench, maybe to 8000 or so.
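
For the version listing, something like the snippet below (just one illustrative way to capture it) is enough:

```python
import torch
import transformers

# record the library versions used for each benchmark run
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
```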

achew010 commented

We fixed a multiple-import_and_reload bug in #79 so that FastCrossEntropy is now correctly patched when PaddingFree is used (previously FastCrossEntropy was not patched correctly). We expect some improvement in the FOAK + PF numbers reported here.
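
As a generic illustration of this class of bug (not the actual fms-acceleration code; see #79 for the real fix): reloading a module after one of its attributes has been monkey-patched re-executes the module body and silently restores the original attribute, dropping the patch.

```python
import importlib
import json

json.dumps = lambda *args, **kwargs: "patched"   # monkey-patch an attribute
print(json.dumps({}))                            # -> "patched"

importlib.reload(json)                           # a later import-and-reload...
print(json.dumps({}) == "patched")               # -> False: the patch was silently dropped
```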

@fabianlim fabianlim added the question Further information is requested label Nov 4, 2024