# Performance

The following two tables compare the performance of LightSeq and Faster Transformer (FT), tested on a Tesla T4 with a Transformer-base model. We also provide a TensorFlow (TF) baseline whose code comes from Faster Transformer. In both tables, latencies are in milliseconds and each speedup is the TF baseline latency divided by the corresponding engine latency (a small sanity-check sketch follows the sampling table).

## Beam search

| batch_size | beam_size | seq_len | TF (ms) | FT (ms) | LightSeq (ms) | FT speedup | LightSeq speedup |
|---|---|---|---|---|---|---|---|
| 1 | 4 | 32 | 419.53 | 26.25 | 29.66 | 15.98 | 14.14 |
| 1 | 4 | 64 | 806.38 | 54.02 | 63.04 | 14.93 | 12.79 |
| 8 | 4 | 32 | 439.64 | 35.99 | 34.77 | 12.22 | 12.64 |
| 8 | 4 | 64 | 891.54 | 79.82 | 79.43 | 11.17 | 11.22 |
| 32 | 4 | 32 | 536 | 82.82 | 59.49 | 6.47 | 9.01 |
| 32 | 4 | 64 | 1116.74 | 198.95 | 155.08 | 5.61 | 7.20 |
| 64 | 4 | 32 | 668.45 | 144.53 | 101.54 | 4.62 | 6.58 |
| 64 | 4 | 64 | 1476.17 | 351.14 | 277.4 | 4.20 | 5.32 |
| 128 | 4 | 32 | 996.88 | 271.8 | 200.49 | 3.67 | 4.97 |
| 128 | 4 | 64 | 2157.85 | 671.76 | 502.91 | 3.21 | 4.29 |

## Sampling

| batch_size | topk/topp | seq_len | FT (ms) | LightSeq (ms) | LightSeq speedup |
|---|---|---|---|---|---|
| 1 | 0.75 | 32 | 34.4 | 29.66 | 1.16 |
| 1 | 0.75 | 64 | 71.45 | 59.72 | 1.20 |
| 32 | 0.75 | 32 | 56.61 | 40.40 | 1.40 |
| 32 | 0.75 | 64 | 120.39 | 100.36 | 1.20 |
| 128 | 0.75 | 32 | 111.4 | 94.68 | 1.18 |
| 128 | 0.75 | 64 | 246.97 | 270.55 | 0.91 |
| 1 | 32 | 32 | 34.35 | 28.06 | 1.22 |
| 1 | 32 | 64 | 72.48 | 56.4 | 1.29 |
| 32 | 32 | 32 | 40.15 | 39.23 | 1.02 |
| 32 | 32 | 64 | 87.46 | 98.62 | 0.89 |
| 128 | 32 | 32 | 99 | 90.83 | 1.09 |
| 128 | 32 | 64 | 222.62 | 262 | 0.85 |
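As a sanity check on the speedup columns, the short sketch below recomputes a few of them from the raw latencies in the beam search table. It is a minimal, hand-written snippet: the numbers are copied from the table above, and the function and variable names are ours, not part of the LightSeq or Faster Transformer APIs.

```python
# Minimal sketch: recompute the speedup columns from the raw latencies.
# Row values are copied from the beam search table above; names are ours.

def speedup(baseline_ms: float, engine_ms: float) -> float:
    """Speedup = baseline latency / engine latency."""
    return baseline_ms / engine_ms

# (batch_size, beam_size, seq_len, TF ms, FT ms, LightSeq ms)
beam_search_rows = [
    (1, 4, 32, 419.53, 26.25, 29.66),
    (8, 4, 64, 891.54, 79.82, 79.43),
    (128, 4, 64, 2157.85, 671.76, 502.91),
]

for bs, beam, seq, tf_ms, ft_ms, ls_ms in beam_search_rows:
    print(f"batch={bs:>3} seq_len={seq:>2} "
          f"FT speedup={speedup(tf_ms, ft_ms):.2f} "
          f"LightSeq speedup={speedup(tf_ms, ls_ms):.2f}")
# Expected output matches the table, e.g. 15.98 and 14.14 for the first row.
```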

The following table is a comparison on a fr2en translation model, a Transformer-big with a beam size of 4 and a target vocabulary of approximately 30k tokens. FP32 models are tested on a Tesla P4, and FP16 models are tested on a Tesla T4.

| batch_size | seq_len | tf-fp32 (ms) | lightseq-fp32 (ms) | lightseq-fp16 (ms) | lightseq-fp32 speedup vs tf-fp32 | lightseq-fp16 speedup vs lightseq-fp32 | lightseq-fp16 speedup vs tf-fp32 |
|---|---|---|---|---|---|---|---|
| 1 | 6 | 303 | 47 | 27 | 6.44 | 1.74 | 11.22 |
| 1 | 12 | 399 | 63 | 38 | 6.33 | 1.66 | 10.5 |
| 1 | 18 | 702 | 108 | 59 | 6.5 | 1.83 | 11.9 |
| 1 | 24 | 1071 | 167 | 82 | 6.41 | 2.04 | 13.06 |
| 1 | 36 | 1234 | 192 | 105 | 6.42 | 1.83 | 11.75 |
| 1 | 46 | 1445 | 227 | 110 | 6.36 | 2.06 | 13.14 |
| 1 | 58 | 1887 | 303 | 142 | 6.22 | 2.13 | 13.29 |
| 1 | 70 | 2771 | 428 | 197 | 6.47 | 2.17 | 14.07 |
| 2 | 6 | 317 | 57 | 32 | 5.56 | 1.78 | 9.91 |
| 2 | 12 | 418 | 73 | 39 | 5.72 | 1.87 | 10.72 |
| 2 | 18 | 723 | 131 | 66 | 5.51 | 1.98 | 10.95 |
| 2 | 24 | 1113 | 201 | 91 | 5.53 | 2.21 | 12.23 |
| 2 | 36 | 1276 | 234 | 104 | 5.45 | 2.25 | 12.27 |
| 2 | 46 | 1521 | 282 | 121 | 5.39 | 2.33 | 12.57 |
| 2 | 58 | 2004 | 371 | 159 | 5.4 | 2.33 | 12.6 |
| 2 | 70 | 2965 | 542 | 221 | 5.47 | 2.45 | 13.42 |
| 4 | 6 | 326 | 61 | 39 | 5.34 | 1.56 | 8.36 |
| 4 | 12 | 433 | 85 | 47 | 5.09 | 1.81 | 9.21 |
| 4 | 18 | 761 | 154 | 77 | 4.94 | 2 | 9.88 |
| 4 | 24 | 1195 | 245 | 113 | 4.87 | 2.17 | 10.58 |
| 4 | 36 | 1391 | 282 | 128 | 4.93 | 2.2 | 10.87 |
| 4 | 46 | 1679 | 339 | 153 | 4.95 | 2.22 | 10.97 |
| 4 | 58 | 2232 | 455 | 199 | 4.9 | 2.29 | 11.22 |
| 4 | 70 | 3406 | 673 | 285 | 5.06 | 2.36 | 11.95 |
| 8 | 6 | 364 | 76 | 43 | 4.78 | 1.77 | 8.47 |
| 8 | 12 | 470 | 110 | 56 | 4.27 | 1.96 | 8.39 |
| 8 | 18 | 854 | 205 | 91 | 4.16 | 2.25 | 9.38 |
| 8 | 24 | 1381 | 318 | 139 | 4.34 | 2.29 | 9.94 |
| 8 | 36 | 1628 | 378 | 156 | 4.3 | 2.42 | 10.44 |
| 8 | 46 | 1989 | 459 | 193 | 4.33 | 2.38 | 10.31 |
| 8 | 58 | 2683 | 617 | 254 | 4.34 | 2.43 | 10.56 |
| 8 | 70 | 4251 | 949 | 382 | 4.47 | 2.48 | 11.13 |
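The three speedup columns above are likewise ratios of the latency columns, and the end-to-end fp16 speedup is the product of the fp32 engine speedup and the fp16-over-fp32 gain. A minimal sketch using the first row of the table (values copied by hand, up to rounding of the reported latencies; names are ours):

```python
# Sketch: how the three speedup columns of the fr2en table relate.
# Values are the batch_size=1, seq_len=6 row; variable names are ours.
tf_fp32_ms, ls_fp32_ms, ls_fp16_ms = 303.0, 47.0, 27.0

fp32_speedup = tf_fp32_ms / ls_fp32_ms  # lightseq-fp32 vs tf-fp32
fp16_gain    = ls_fp32_ms / ls_fp16_ms  # lightseq-fp16 vs lightseq-fp32
fp16_speedup = tf_fp32_ms / ls_fp16_ms  # lightseq-fp16 vs tf-fp32

# The end-to-end fp16 speedup factors into the other two ratios.
assert abs(fp16_speedup - fp32_speedup * fp16_gain) < 1e-9
print(f"{fp32_speedup:.2f} x {fp16_gain:.2f} = {fp16_speedup:.2f}")
```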

The following table is a comparison on an en2zh translation model, a Transformer-deep (compared with Transformer-big, it has a 16-layer encoder; the other configurations remain the same) with a beam size of 4 and a target vocabulary of approximately 30k tokens. FP32 models are tested on a Tesla P4, and FP16 models are tested on a Tesla T4. A rough configuration sketch follows the table.

| batch_size | seq_len | tf-fp32 (ms) | lightseq-fp32 (ms) | lightseq-fp16 (ms) | lightseq-fp32 speedup vs tf-fp32 | lightseq-fp16 speedup vs lightseq-fp32 | lightseq-fp16 speedup vs tf-fp32 |
|---|---|---|---|---|---|---|---|
| 1 | 12 | 544 | 86 | 43 | 6.32 | 2 | 12.65 |
| 1 | 24 | 914 | 131 | 66 | 6.97 | 1.98 | 13.85 |
| 1 | 36 | 1290 | 200 | 93 | 6.45 | 2.15 | 13.87 |
| 1 | 48 | 1836 | 233 | 106 | 7.89 | 2.2 | 17.32 |
| 1 | 72 | 3456 | 482 | 212 | 7.17 | 2.27 | 16.3 |
| 1 | 84 | 2626 | 431 | 193 | 6.09 | 2.23 | 13.61 |
| 2 | 12 | 566 | 100 | 50 | 5.66 | 2 | 11.32 |
| 2 | 24 | 842 | 158 | 70 | 5.32 | 2.26 | 12.03 |
| 2 | 36 | 1287 | 247 | 103 | 5.21 | 2.4 | 12.5 |
| 2 | 48 | 1504 | 288 | 118 | 5.22 | 2.44 | 12.75 |
| 2 | 72 | 3131 | 611 | 240 | 5.12 | 2.55 | 13.05 |
| 2 | 84 | 2789 | 546 | 217 | 5.1 | 2.52 | 12.85 |
| 4 | 12 | 590 | 118 | 58 | 5 | 2.03 | 10.17 |
| 4 | 24 | 885 | 187 | 89 | 4.73 | 2.1 | 9.94 |
| 4 | 36 | 1380 | 301 | 127 | 4.58 | 2.37 | 10.87 |
| 4 | 48 | 1622 | 352 | 149 | 4.6 | 2.36 | 10.89 |
| 4 | 72 | 3492 | 763 | 311 | 4.57 | 2.45 | 11.23 |
| 4 | 84 | 3145 | 687 | 282 | 4.57 | 2.44 | 11.15 |
| 8 | 12 | 631 | 150 | 66 | 4.2 | 2.27 | 9.56 |
| 8 | 24 | 979 | 248 | 103 | 3.94 | 2.41 | 9.5 |
| 8 | 36 | 1584 | 412 | 156 | 3.84 | 2.64 | 10.15 |
| 8 | 48 | 1880 | 477 | 186 | 3.94 | 2.56 | 10.11 |
| 8 | 72 | 4218 | 1069 | 404 | 3.94 | 2.65 | 10.44 |
| 8 | 84 | 3831 | 976 | 373 | 3.92 | 2.62 | 10.27 |
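For reference, the sketch below spells out the en2zh Transformer-deep configuration described above as a plain Python dict. Only the 16-layer encoder, the beam size of 4, and the roughly 30k target vocabulary come from this page; the remaining hyperparameters are the standard Transformer-big values (Vaswani et al., 2017) and should be read as assumptions, not as something this document specifies.

```python
# Rough sketch of the en2zh Transformer-deep setup described above.
# Only num_encoder_layers, beam_size, and the ~30k target vocabulary come
# from this page; the other values are the standard Transformer-big
# hyperparameters (Vaswani et al., 2017) and are assumptions.
transformer_deep_en2zh = {
    "num_encoder_layers": 16,    # Transformer-deep: deeper encoder
    "num_decoder_layers": 6,     # assumed unchanged from Transformer-big
    "hidden_size": 1024,         # d_model of Transformer-big (assumed)
    "ffn_inner_size": 4096,      # feed-forward inner dim (assumed)
    "num_attention_heads": 16,   # Transformer-big default (assumed)
    "beam_size": 4,              # from this page
    "target_vocab_size": 30000,  # "approximately 30k"
}
```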