Feature request
We'd like to run the Alibaba-NLP/gte-large-en-v1.5 model on a CPU text-embedding-router server, but are hitting:

Caused by:
    Could not start backend: GTE is only supported on Cuda devices in fp16 with flash attention enabled

Is there any way to implement or enable running this model on CPU?
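For context, the model itself does appear to run on CPU outside the router. A minimal sketch using sentence-transformers (assuming trust_remote_code=True is acceptable, since the model ships custom modeling code; 1024 is this model's embedding dimension):

```python
# Minimal CPU smoke test for Alibaba-NLP/gte-large-en-v1.5 via
# sentence-transformers, independent of text-embedding-router.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Alibaba-NLP/gte-large-en-v1.5",
    trust_remote_code=True,  # required: the model uses custom modeling code
    device="cpu",
)

embeddings = model.encode(["a quick smoke-test sentence"])
print(embeddings.shape)  # (1, 1024)
```

This suggests a CPU backend is feasible in principle, even if the current GTE backend in the router is gated to CUDA with fp16 and flash attention.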
Motivation
For some of our clients we need to support a CPU embedding server, and we would like to use the Alibaba-NLP/gte-large-en-v1.5 model to avail ourselves of its long 8192-token context length.

Your contribution
We'd be happy to test and run performance benchmarks if needed.