Releases · finalfusion/finalfusion-python
Finalfusion in Python
This release marks a major change to `finalfusion-python`: the entire package has been rewritten in Python and is no longer a wrapper around `finalfusion-rust`.

The API is now almost on par with `finalfusion-rust` and in some places even goes beyond it:
- `Vocab`, `Storage`, `Metadata`, and `Norms` are now accessible as properties on `Embeddings`
- Any of the chunks above can be loaded by themselves from a finalfusion file
- All chunks can be constructed from within Python
- It's possible to add, remove, or change embeddings
- `Storage` types integrate directly with `numpy` arrays
- Reading and writing of all common embedding formats (word2vec, GloVe, fastText) is supported
- The API for vocabularies and subword indexers has been made more ergonomic:
  - vocab words and the word -> index mapping are accessible as properties
  - `SubwordVocab`s expose the subword indexer through `vocab.subword_indexer`
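As a rough sketch of how the rewritten API fits together (the loader name `finalfusion.load_finalfusion`, the lowercase property names, and the file path are assumptions based on the description above, not taken verbatim from these notes):

```python
import numpy as np
import finalfusion

# Load a finalfusion file (the loader name is an assumption).
embeddings = finalfusion.load_finalfusion("embeddings.fifu")

# Chunks are exposed as properties on Embeddings.
vocab = embeddings.vocab
storage = embeddings.storage
norms = embeddings.norms
metadata = embeddings.metadata

# Storage integrates with numpy arrays.
matrix = np.asarray(storage)

# Vocab words and the word -> index mapping are accessible as properties
# (the exact property names here are assumptions).
words = vocab.words
index = vocab.word_index.get("berlin")
```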
In addition to the overhauled API, `finalfusion-python` now comes with executables:

- `ffp-convert` to convert between embedding formats
- `ffp-similar` and `ffp-analogy` for similarity and analogy queries
- `ffp-bucket-to-explicit` to convert from bucket subword to explicit subword embeddings
Check out the documentation at https://finalfusion-python.readthedocs.io for more information!
0.6.2
0.6.1
0.6.0
Support for fastText, word2vec, and text embeddings
The largest change in this release is support for reading fastText, word2vec, and text embeddings, in addition to finalfusion embeddings.
- Add support for reading fastText (`Embeddings.read_fasttext()`), text (`Embeddings.read_text()`), text dims (`Embeddings.read_text_dims()`), and word2vec (`Embeddings.read_word2vec()`) formats.
- Each of these newly-supported formats provides a `lossy` keyword argument. If set, the embeddings are read lossily, permitting invalid UTF-8 in words.
- Add the `embedding_similarity` method, which looks up words that are similar to a given embedding. The method for traditional word-based lookups has been renamed from `similarity` to `word_similarity`.
- Iteration over embeddings returned `(word, embedding)` tuples in previous releases. Now instances of the `Embedding` class are returned, which provide `word`, `embedding`, and `norm` properties. `norm` is the norm of the embedding before it was normalized with its l2 norm.
- Add support for memory mapping quantized embedding matrices.
- Add the `ngram_indices` and `subword_indices` methods to the `Vocab` class. These methods return the subword indices for a given word, which can be used to retrieve the subword embeddings individually. The `ngram_indices` method returns each subword with its index, whereas `subword_indices` only returns the indices.
- Update to pyo3 0.8.
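For illustration, a hedged sketch of the calls described above (the file name, the query word, and the exact call signatures beyond the listed method names are assumptions):

```python
from finalfusion import Embeddings

# Read fastText embeddings; lossy=True permits invalid UTF-8 in words.
embeddings = Embeddings.read_fasttext("cc.en.300.bin", lossy=True)

# Word-based similarity lookup (renamed from `similarity`).
print(embeddings.word_similarity("berlin"))

# Similarity lookup for an arbitrary embedding vector.
query = embeddings.embedding("berlin")
print(embeddings.embedding_similarity(query))

# Iteration now yields Embedding instances with word, embedding, and norm.
for item in embeddings:
    print(item.word, item.norm)
    break

# Subword indices via the vocabulary.
vocab = embeddings.vocab()
print(vocab.ngram_indices("berlin"))
print(vocab.subword_indices("berlin"))
```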
travis-0.5.0-rebuild
CI: Fix crate name in Travis-CI builds
0.4.0
0.3.1
New convenience methods
This release has the following changes:
- Add the `matrix_copy` method to get a numpy array copy of the embedding matrix.
- Add the `vocab` method to get a `Vocab` instance, which provides the `item_to_indices` method to get the indices or subword indices of a word. `Vocab` also provides indexing to look up the word corresponding to an index (e.g. `vocab[3823]`).
- Upgrade to finalfusion 0.6.
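Roughly how these convenience methods combine (the constructor call and file name are assumptions; only the method names come from the notes above):

```python
import finalfusion

# Loading the embeddings this way is an assumption for this release.
embeddings = finalfusion.Embeddings("embeddings.fifu")

matrix = embeddings.matrix_copy()          # numpy copy of the embedding matrix
vocab = embeddings.vocab()                 # Vocab instance

indices = vocab.item_to_indices("berlin")  # (subword) indices of a word
word = vocab[3823]                         # reverse lookup: index -> word
```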
Switch to numpy arrays
- Return `numpy` arrays rather than Python lists.
- Update to `pyo3` 0.6.
- Switch from `rust2vec` to the `finalfusion` crate.