[WIP] Allow for attention caching during CoCa generation #502

sramshetty · 2023-04-20T04:23:27Z

This PR currently lets users pass a caching argument to model.generate() to make longer generations more efficient. For shorter generations, the speed-up does not yet seem to outweigh the caching overhead, though that may be a problem with the implementation.
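For readers unfamiliar with the idea, here is a minimal sketch of attention key/value caching during autoregressive decoding. It is a toy single-head attention in plain Python, not the code in this PR: `ToyAttention`, `last_token_full`, `last_token_cached`, and `compare` are hypothetical names chosen for illustration. It shows that projecting only the newest token and reusing cached keys and values gives the same output as re-projecting the whole sequence at every step.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(a * b for a, b in zip(query, k)) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class ToyAttention:
    """Hypothetical toy layer; weights are random but fixed by the seed."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def make():
            return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

        self.wq, self.wk, self.wv = make(), make(), make()
        self.projections = 0  # number of tokens projected to keys/values

    def last_token_full(self, tokens):
        """No cache: re-project every token, then attend from the last one."""
        keys = [matvec(self.wk, t) for t in tokens]
        values = [matvec(self.wv, t) for t in tokens]
        self.projections += len(tokens)
        return attend(matvec(self.wq, tokens[-1]), keys, values)

    def last_token_cached(self, token, cache):
        """With cache: project only the new token and append it to the cache."""
        cache["k"].append(matvec(self.wk, token))
        cache["v"].append(matvec(self.wv, token))
        self.projections += 1
        return attend(matvec(self.wq, token), cache["k"], cache["v"])


def compare(num_steps=6, dim=4):
    """Decode step by step both ways; return (max difference, work counts)."""
    rng = random.Random(1)
    tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_steps)]
    full, cached = ToyAttention(dim), ToyAttention(dim)  # same seed, same weights
    cache = {"k": [], "v": []}
    max_diff = 0.0
    for step in range(1, num_steps + 1):
        a = full.last_token_full(tokens[:step])
        b = cached.last_token_cached(tokens[step - 1], cache)
        max_diff = max(max_diff, max(abs(x - y) for x, y in zip(a, b)))
    return max_diff, full.projections, cached.projections
```

In this toy setup the cached path does one key/value projection per step, while the uncached path redoes all of them. The cache itself still has to be stored and grown at every step, which is one plausible source of the overhead seen at short lengths.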

- Test speed-up for longer generations
- If possible, add caching for text encoding
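As a rough guide to what the first item above might show, the following sketch counts how many token positions are fed through a decoder over a whole generation, with and without a cache. It is an idealized count under stated assumptions (one pass over the prompt, one new token per step) and says nothing about the actual timings of this PR; `tokens_processed` is a hypothetical helper.

```python
def tokens_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the decoder across all steps.

    Idealized model: ignores cache bookkeeping (allocation, concatenation),
    which is exactly the overhead that can dominate short generations.
    """
    total = 0
    for step in range(new_tokens):
        if use_cache and step > 0:
            total += 1  # only the newest token is processed
        else:
            total += prompt_len + step  # the whole sequence so far is processed
    return total


for length in (10, 100):
    without = tokens_processed(1, length, use_cache=False)
    with_cache = tokens_processed(1, length, use_cache=True)
    print(length, without, with_cache)
```

Under this model the uncached cost grows quadratically with the number of generated tokens and the cached cost grows linearly, so any measured speed-up would be expected to widen as generations get longer.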

sramshetty and others added 8 commits April 15, 2023 21:50

initial caching for generation

ecb93d9

remove timing

8ad9240

fix beamsearch caching

71d01f5

WIP Setup base for text encoder caching

09e3fec

Merge branch 'mlfoundations:main' into inference_caching

df85d0c

Fix transformer caching default

2c961e6

avoid passing cache when not necessary

69a936a

simplify caching argument

8dc14ed

sramshetty marked this pull request as draft April 21, 2023 05:04

sramshetty added 3 commits April 20, 2023 22:07

fix transformer cache typing for true branch

5ed4cf2

reorder typing to address list invariance

c0686de

remove unnecessary placeholder lists

e561c8f
