Fix: recursively search files by default, fix file ingestion with edit #29
Signed-off-by: Daishan Peng <[email protected]>
9850761 to b4674e3
@@ -92,7 +92,7 @@ func (s *Datastore) Ingest(ctx context.Context, datasetID string, content []byte
 ingestionFlow.FillDefaults(filetype, opts.TextSplitterOpts)

 // Mandatory Transformation: Add filename to metadata
-em := &transformers.ExtraMetadata{Metadata: map[string]any{"filename": filename}}
+em := &transformers.ExtraMetadata{Metadata: map[string]any{"filename": filename, "absPath": opts.FileMetadata.AbsolutePath}}
Note: absPath may not exist if the content comes in via the API (server mode), which may be negligible at the moment since we're focusing on standalone mode 🤔
Hmm yeah, good point. At some point I just want to identify the file. I guess in server mode we can just use the name because you can't upload two files with the same name :)
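A rough sketch of that fallback, in case it helps the follow-up: prefer the absolute path when it is set (standalone mode) and fall back to the file name otherwise (server mode). `FileMetadata` here is a hypothetical stand-in with only the two fields discussed, and `FileIdentity` is an illustrative name, not something in the codebase.

```go
package main

// FileMetadata is a hypothetical stand-in for the metadata passed to Ingest.
// Only the two fields discussed in this thread are modelled.
type FileMetadata struct {
	Name         string
	AbsolutePath string
}

// FileIdentity returns the key used to identify an ingested file.
// Assumption: AbsolutePath is empty when content arrives via the API
// (server mode), in which case the file name is treated as unique.
func FileIdentity(m FileMetadata) string {
	if m.AbsolutePath != "" {
		return m.AbsolutePath
	}
	return m.Name
}
```

With this, `FileIdentity(FileMetadata{Name: "a.txt"})` yields the bare name, while a populated `AbsolutePath` takes precedence.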
@@ -9,5 +9,5 @@ type VectorStore interface {
 AddDocuments(ctx context.Context, docs []Document, collection string) ([]string, error) // @return documentIDs, error
 SimilaritySearch(ctx context.Context, query string, numDocuments int, collection string) ([]Document, error) //nolint:lll
 RemoveCollection(ctx context.Context, collection string) error
-RemoveDocument(ctx context.Context, documentID string, collection string) error
+RemoveDocument(ctx context.Context, documentID string, collection string, where, whereDocument map[string]string) error
I left out the metadata filters because I wasn't sure if potential other VectorDBs would support them the same way. This is not really important to us right now, so I don't care too much.
On the other hand, we have the file metadata information and the file to documents mapping in the sqlite DB as well and thus could look up all document IDs there and then remove them directly by ID.
Both are valid ways.
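For reference, the lookup-then-remove-by-ID route could look roughly like the sketch below. It keeps the pre-change `RemoveDocument` signature (no metadata filters). `FileIndex` is a hypothetical in-memory stand-in for the file-to-documents mapping in the sqlite DB, and `memoryStore` and `RemoveFileDocuments` are illustrative names only.

```go
package main

import (
	"context"
	"fmt"
)

// DocumentRemover is the subset of VectorStore needed here, using the
// original RemoveDocument signature without where/whereDocument filters.
type DocumentRemover interface {
	RemoveDocument(ctx context.Context, documentID string, collection string) error
}

// FileIndex is a hypothetical stand-in for the file-to-documents mapping
// kept in the sqlite DB: file key -> document IDs.
type FileIndex map[string][]string

// memoryStore is a toy vector store used only to make the sketch self-contained.
type memoryStore struct {
	docs map[string]bool
}

func (m *memoryStore) RemoveDocument(_ context.Context, documentID string, _ string) error {
	if !m.docs[documentID] {
		return fmt.Errorf("document %q not found", documentID)
	}
	delete(m.docs, documentID)
	return nil
}

// RemoveFileDocuments looks up all document IDs for a file and removes them
// one by one. It returns the number of documents removed before any error.
func RemoveFileDocuments(ctx context.Context, store DocumentRemover, index FileIndex, fileKey, collection string) (int, error) {
	ids := index[fileKey]
	for i, id := range ids {
		if err := store.RemoveDocument(ctx, id, collection); err != nil {
			return i, fmt.Errorf("remove document %q: %w", id, err)
		}
	}
	// Drop the mapping only after every document has been removed.
	delete(index, fileKey)
	return len(ids), nil
}
```

The upside of this route is that the `VectorStore` interface stays free of backend-specific filter semantics; the cost is one removal call per document.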
Approving and merging for now to get the recurse fix in - we can follow up on the rest later 👍
This PR fixes two things: