Feature request
We'd like to run the Alibaba-NLP/gte-large-en-v1.5 model on a CPU text-embedding-router server, but are hitting:

Caused by:
    Could not start backend: GTE is only supported on Cuda devices in fp16 with flash attention enabled

Is there any way to implement or enable running this model on CPU?
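For context, the model itself does appear to run on CPU outside the router. A minimal sketch using sentence-transformers (assuming trust_remote_code=True is acceptable, since the model ships custom modeling code; 1024 is this model's embedding dimension):

```python
# Minimal CPU smoke test for Alibaba-NLP/gte-large-en-v1.5 via
# sentence-transformers, independent of text-embedding-router.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Alibaba-NLP/gte-large-en-v1.5",
    trust_remote_code=True,  # required: the model uses custom modeling code
    device="cpu",
)

embeddings = model.encode(["a quick smoke-test sentence"])
print(embeddings.shape)  # (1, 1024)
```

This suggests a CPU backend is feasible in principle, even if the current GTE backend in the router is gated to CUDA with fp16 and flash attention.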
Motivation
For some of our clients we need to support a CPU embedding server, and we would like to use the Alibaba-NLP/gte-large-en-v1.5 model to avail ourselves of its long 8192-token context length.

Your contribution
We'd be happy to test and run performance benchmarks if needed.