Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow passing an empty docs [] to from_documents #205

Conversation

dudanogueira
Copy link
Collaborator

passing an empty list as docs was failing for some embedders:

embeddings2 = CohereEmbeddings(
    model="embed-english-v3.0",
)
weaviate_client = weaviate.connect_to_local()
db = WeaviateVectorStore.from_documents([], embeddings2, client=weaviate_client)

this will prevent embedding empty texts.

it was raising this error:

File "/Users/I747411/ai/venv/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 565, in encode
if all_embeddings[0].dtype == torch.bfloat16:
IndexError: list index out of range

@hsm207
Copy link
Collaborator

hsm207 commented Aug 23, 2024

Thanks for your contribution! What will happen now if a user passed an empty list? Seems odd to call from_documents and not provide any documents and not expect it to fail...😀

What's the use case for such a call?

@dudanogueira
Copy link
Collaborator Author

Hi!

The idea is to have consistent behavior.

if you pass [] as docs parameter and the embedder is OpenAI, it will work: you'll get a instantiated vectorstore returned.

However, it was failing for Cohere (and probably other embedders).

If you have for example a pipeline and for some reason you don't have docs to pass, it shouldn't fail, but exit gracefully.

I have seen some users in our forums having issues on understanding how to instantiate a vectorstore without passing docs. Passing docs as [] was being used as a good enough solution or the next try instead of not passing docs at all.

With this PR, having docs or just [] will always return a instantiated vectorstore :)

Let me know if this makes sense.

Thanks!

@hsm207
Copy link
Collaborator

hsm207 commented Aug 27, 2024

if you pass [] as docs parameter and the embedder is OpenAI, it will work: you'll get a instantiated vectorstore returned.

However, it was failing for Cohere (and probably other embedders).

If it worked for OpenAI but fail for Cohere (and probably other embeddeders), it sounds like there's a bug in either the open ai embedder or the other embedders. @efriis what are your thoughts on this?

I have seen some users in our forums having issues on understanding how to instantiate a vectorstore without passing docs.

Users can call the [WeaviateVectorStore](https://github.com/langchain-ai/langchain-weaviate/blob/a76c606089bf464a30ecaa74dccfd738f0d09183/libs/weaviate/langchain_weaviate/vectorstores.py#L72-L73) class directly to instantiate a vector store without passing docs.

This is also discussed elsewhere in the langchain docs.

By the way, the link you shared directed to the main page of the forum. Would you happen to recall the questions about this issue? I'd like to understand the context more.

@dudanogueira
Copy link
Collaborator Author

Oh... I am not sure where I got this from (pretty sure it was from a user code), but it seems I may be the one spreading this pattern around 😬

It goes back to here, where you and I even interacted at some point to help a user:
https://forum.weaviate.io/t/how-to-hide-the-client-v3-deprecation-warning/2165/11

I replicated this pattern also in one of our recipes:
https://github.com/weaviate/recipes/blob/main/integrations/llm-frameworks/langchain/loading-data/langchain-simple-pdf.ipynb

I will go around changing that code where I find so we keep to the expected pattern,
but I believe it should not break if you pass an empty list as docs.

Thanks!

@efriis
Copy link
Member

efriis commented Aug 28, 2024

Good catch - we can start testing embedding embed_documents to make sure they support [] in standard tests.

In the meantime, if this is a common pattern for weaviate (not sure this should be an encouraged pattern - why not instantiate WeaviateVectorStore() directly?) would recommend handling this case in from_documents

@hsm207
Copy link
Collaborator

hsm207 commented Aug 29, 2024

we can start testing embedding embed_documents to make sure they support [] in standard tests.

@efriis this means vector store implementors don't need to test for an empty list getting passed to from_documents, right?

@efriis
Copy link
Member

efriis commented Aug 30, 2024

if you want to support from_documents([]), you should test it

this will be a slow change because it doesn't really make sense to call embeddings.embed_documents([]). Just doing this for resilience.

@hsm207
Copy link
Collaborator

hsm207 commented Aug 30, 2024

Closing since we will tell users to instantiate the vector store directly if they don't want to pass a list of documents.

@hsm207 hsm207 closed this Aug 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants