Note: Now that BERT support has been merged into llama.cpp, this repo is semi-defunct. The implementation in llama.cpp is substantially faster and has much better model support. Still happy to accept PRs if they do come along though.
This is a ggml implementation of the BERT embedding architecture. It supports inference on CPU, CUDA and Metal in floating point and a wide variety of quantization schemes. Includes Python bindings for batched inference.
This repo is a fork of the original bert.cpp as well as embeddings.cpp. Thanks to both of you!
Fetch this repository, then download the submodules and install the Python packages with
git submodule update --init
pip install -r requirements.txt
To fetch models from huggingface and convert them to gguf format, run something like the following (after creating the models directory)
python bert_cpp/convert.py BAAI/bge-base-en-v1.5 models/bge-base-en-v1.5-f16.gguf
This will convert to float16 by default. To convert to float32 instead, add f32 to the end of the command.
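If you want both precisions on disk, the two invocations can be scripted. The following is a minimal sketch: the helper name convert_command is hypothetical, it assumes you run it from the repository root, and the output naming simply copies the pattern used in the command above. It only prints the commands so you can inspect them before running anything.

```python
import shlex
import sys
from pathlib import Path


def convert_command(model_id, out_dir="models", dtype="f16"):
    """Build the argv list for bert_cpp/convert.py (hypothetical helper).

    dtype is "f16" (the converter's default) or "f32".
    """
    if dtype not in ("f16", "f32"):
        raise ValueError(f"unsupported dtype: {dtype}")
    # "BAAI/bge-base-en-v1.5" -> "bge-base-en-v1.5"
    name = model_id.split("/")[-1]
    out_path = (Path(out_dir) / f"{name}-{dtype}.gguf").as_posix()
    cmd = [sys.executable, "bert_cpp/convert.py", model_id, out_path]
    if dtype == "f32":
        # float32 is selected by a trailing "f32" argument
        cmd.append("f32")
    return cmd


if __name__ == "__main__":
    for dtype in ("f16", "f32"):
        print(shlex.join(convert_command("BAAI/bge-base-en-v1.5", dtype=dtype)))
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the models directory exists.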
To build the C++ library for CPU/CUDA/Metal, run the following
# CPU
cmake -B build . && make -C build -j
# CUDA
cmake -DGGML_CUBLAS=ON -B build . && make -C build -j
# Metal
cmake -DGGML_METAL=ON -B build . && make -C build -j
On some distros, when compiling with CUDA, you also need to specify the host C++ compiler. To do this, I suggest setting the CUDAHOSTCXX environment variable to the path of your host C++ compiler before running cmake.
All executables are placed in build/bin. To run inference on a given text, run
# CPU / CUDA
build/bin/main -m models/bge-base-en-v1.5-f16.gguf -p "Hello world"
# Metal
GGML_METAL_PATH_RESOURCES=build/bin/ build/bin/main -m models/bge-base-en-v1.5-f16.gguf -p "Hello world"
To force CPU usage, add the flag -c.
You can also run everything through Python, which is particularly useful for batch inference. For instance,
from bert_cpp import BertModel
mod = BertModel('models/bge-base-en-v1.5-f16.gguf')
emb = mod.embed(batch)
where batch is a list of strings and emb is a numpy array of embedding vectors.
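A common next step is comparing the embedded texts with each other. Here is a minimal sketch of pairwise cosine similarity in numpy. It assumes emb is a 2-D array with one row per input string; the function name is hypothetical, and the random array only stands in for the output of mod.embed(batch) so the snippet runs without a model file. It normalizes the rows itself, since whether the returned vectors are already unit length is not stated here.

```python
import numpy as np


def cosine_similarity_matrix(emb):
    """Return the pairwise cosine similarity between the rows of emb."""
    emb = np.asarray(emb, dtype=np.float32)
    # Normalize each row to unit length; the clip guards against zero vectors
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Stand-in for `emb = mod.embed(batch)`: 3 vectors of dimension 8 (assumed shape)
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 8))

sim = cosine_similarity_matrix(emb)
print(sim.shape)
```

Entry sim[i, j] is the similarity between batch[i] and batch[j], so the diagonal is 1 up to floating point error.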
You can quantize models with the command (using the f32 model as a base seems to work better)
build/bin/quantize models/bge-base-en-v1.5-f32.gguf models/bge-base-en-v1.5-q8_0.gguf q8_0
or whatever your desired quantization level is. Currently supported values are: q8_0, q5_0, q5_1, q4_0, and q4_1. You can then pass these model files directly to main as above.
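To give a rough idea of what a level like q8_0 does to the weights, here is a simplified numpy sketch of block-wise 8-bit quantization. It is an illustration, not the code in this repo: it assumes the ggml q8_0 layout of 32 values per block with one scale per block, keeps the scale in float32 (ggml stores it as float16), and uses numpy's rounding. The function names are hypothetical.

```python
import numpy as np

BLOCK_SIZE = 32  # values per block, assumed to match ggml's q8_0 layout


def quantize_q8_0(x):
    """Quantize a 1-D float array to int8 with one scale per block."""
    x = np.asarray(x, dtype=np.float32)
    if x.size % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of BLOCK_SIZE")
    blocks = x.reshape(-1, BLOCK_SIZE)
    # The scale maps the largest magnitude in each block to 127
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = (amax / 127.0).astype(np.float32)
    # An all-zero block has scale 0; use inverse 0 to avoid dividing by zero
    inv = np.where(scale > 0, 1.0 / np.where(scale > 0, scale, 1.0), 0.0)
    q = np.round(blocks * inv).astype(np.int8)
    return q, scale


def dequantize_q8_0(q, scale):
    """Reconstruct approximate float values from int8 blocks and scales."""
    return (q.astype(np.float32) * scale).reshape(-1)


rng = np.random.default_rng(0)
weights = rng.normal(size=64).astype(np.float32)
q, scale = quantize_q8_0(weights)
restored = dequantize_q8_0(q, scale)
print(float(np.max(np.abs(weights - restored))))  # worst-case rounding error
```

The per-value error is at most about half of the block's scale. The q4 and q5 variants use fewer bits per value, so they give smaller files at the cost of larger rounding error.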