Fix: recursively search files by default, fix file ingestion with edit #29

StrongMonkey · 2024-06-19T02:07:27Z

This PR fixes two things:

Recursively run askDir/ingest by default.
Prune old content when re-ingesting a file that has changed. Previously the old content was kept, so it reappeared in the dataset and skewed query results.
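For readers skimming the thread, the recursive default can be pictured roughly as below. This is only a sketch built on the standard library's `filepath.WalkDir`; `collectFiles`, `makeDemoTree` and `runWalkDemo` are hypothetical names made up for illustration and are not the functions changed in this PR.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// collectFiles (hypothetical helper) returns the regular files under root.
// With recursive=false it only returns root's direct children.
func collectFiles(root string, recursive bool) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			// Always enter root itself; skip nested directories when not recursing.
			if path != root && !recursive {
				return fs.SkipDir
			}
			return nil
		}
		if d.Type().IsRegular() {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

// makeDemoTree creates <tmp>/a.txt and <tmp>/sub/b.txt and returns <tmp>.
func makeDemoTree() (string, error) {
	root, err := os.MkdirTemp("", "ingest-demo")
	if err != nil {
		return "", err
	}
	if err := os.MkdirAll(filepath.Join(root, "sub"), 0o755); err != nil {
		return "", err
	}
	for _, name := range []string{"a.txt", filepath.Join("sub", "b.txt")} {
		if err := os.WriteFile(filepath.Join(root, name), []byte("demo"), 0o644); err != nil {
			return "", err
		}
	}
	return root, nil
}

// runWalkDemo returns how many files are found with and without recursion.
func runWalkDemo() (recursiveCount, flatCount int, err error) {
	root, err := makeDemoTree()
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(root)

	all, err := collectFiles(root, true)
	if err != nil {
		return 0, 0, err
	}
	top, err := collectFiles(root, false)
	if err != nil {
		return 0, 0, err
	}
	return len(all), len(top), nil
}

func main() {
	all, top, err := runWalkDemo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("recursive: %d files, non-recursive: %d files\n", all, top)
}
```

With recursion off, the file in `sub/` is missed, which matches the symptom described for files in subdirectories.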

Signed-off-by: Daishan Peng <[email protected]>

iwilltry42 · 2024-06-19T04:07:13Z

pkg/datastore/ingest.go

@@ -92,7 +92,7 @@ func (s *Datastore) Ingest(ctx context.Context, datasetID string, content []byte
 	ingestionFlow.FillDefaults(filetype, opts.TextSplitterOpts)

 	// Mandatory Transformation: Add filename to metadata
-	em := &transformers.ExtraMetadata{Metadata: map[string]any{"filename": filename}}
+	em := &transformers.ExtraMetadata{Metadata: map[string]any{"filename": filename, "absPath": opts.FileMetadata.AbsolutePath}}


Note: `absPath` may not exist if the content comes in via the API (server mode), which may be negligible at the moment since we're focusing on standalone mode 🤔

Hmm yeah, good point. At some point I just want to identify the file. I guess in server mode we can just use the name because you can't upload two files with the same name :)
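One way to read the outcome of this exchange is a fallback rule: identify a file by its absolute path when there is one, otherwise by its name. The sketch below assumes exactly that; `fileMetadata` and `fileKey` are hypothetical names, and only the `AbsolutePath` field name is taken from the diff above.

```go
package main

import "fmt"

// fileMetadata is a hypothetical stand-in for the metadata known at ingestion time.
type fileMetadata struct {
	Name         string
	AbsolutePath string // assumed to be empty when content arrives via the API (server mode)
}

// fileKey (hypothetical) picks the identifier used to recognize a file on re-ingestion:
// the absolute path when available, otherwise the bare filename.
func fileKey(m fileMetadata) string {
	if m.AbsolutePath != "" {
		return m.AbsolutePath
	}
	return m.Name
}

func main() {
	standalone := fileMetadata{Name: "notes.md", AbsolutePath: "/home/user/docs/notes.md"}
	server := fileMetadata{Name: "notes.md"}

	fmt.Println(fileKey(standalone))
	fmt.Println(fileKey(server))
}
```

The fallback is only safe under the assumption stated in the reply, namely that server mode rejects two uploads with the same name.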

iwilltry42 · 2024-06-19T04:15:27Z

pkg/vectorstore/vectorstores.go

@@ -9,5 +9,5 @@ type VectorStore interface {
 	AddDocuments(ctx context.Context, docs []Document, collection string) ([]string, error)                      // @return documentIDs, error
 	SimilaritySearch(ctx context.Context, query string, numDocuments int, collection string) ([]Document, error) //nolint:lll
 	RemoveCollection(ctx context.Context, collection string) error
-	RemoveDocument(ctx context.Context, documentID string, collection string) error
+	RemoveDocument(ctx context.Context, documentID string, collection string, where, whereDocument map[string]string) error


I left out the metadata filters because I wasn't sure whether other potential VectorDBs would support them the same way. This isn't really important to us right now, so I don't care too much.
On the other hand, we also have the file metadata and the file-to-documents mapping in the sqlite DB, so we could look up all document IDs there and then remove them directly by ID.
Both are valid approaches.
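The second option (look up the document IDs, then remove by ID) could look roughly like this. It is a sketch, not the project's code: `documentRemover` mirrors only the pre-change `RemoveDocument` signature from the diff, while `fileIndex` (an in-memory stand-in for the sqlite mapping), `pruneFile`, `memStore` and `runPruneDemo` are hypothetical.

```go
package main

import (
	"context"
	"fmt"
)

// documentRemover is the subset of the VectorStore interface this sketch needs
// (the signature without metadata filters).
type documentRemover interface {
	RemoveDocument(ctx context.Context, documentID string, collection string) error
}

// fileIndex is a hypothetical in-memory stand-in for the file-to-documents
// mapping kept in the sqlite DB: file key -> document IDs.
type fileIndex map[string][]string

// pruneFile (hypothetical) removes every document that belongs to file and
// then forgets the mapping. It returns the number of removed documents.
func pruneFile(ctx context.Context, store documentRemover, idx fileIndex, file, collection string) (int, error) {
	ids := idx[file]
	for _, id := range ids {
		if err := store.RemoveDocument(ctx, id, collection); err != nil {
			return 0, fmt.Errorf("remove document %q: %w", id, err)
		}
	}
	delete(idx, file)
	return len(ids), nil
}

// memStore is a fake vector store used only for this demo.
type memStore struct{ docs map[string]bool }

func (m *memStore) RemoveDocument(_ context.Context, documentID string, _ string) error {
	delete(m.docs, documentID)
	return nil
}

// runPruneDemo prunes one of two files and reports removed and remaining documents.
func runPruneDemo() (removed, remaining int, err error) {
	store := &memStore{docs: map[string]bool{"d1": true, "d2": true, "d3": true}}
	idx := fileIndex{
		"/notes/a.md": {"d1", "d2"},
		"/notes/b.md": {"d3"},
	}
	removed, err = pruneFile(context.Background(), store, idx, "/notes/a.md", "default")
	return removed, len(store.docs), err
}

func main() {
	removed, remaining, err := runPruneDemo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("removed %d documents, %d remaining\n", removed, remaining)
}
```

Removing by ID keeps the store interface independent of how each vector DB handles metadata filters, at the cost of relying on the mapping being complete.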

iwilltry42

Approving and merging for now to get the recurse fix in - we can follow up on the rest later 👍

Fix: recuisively search file by default, fix file ingestion with edit

b4674e3

Signed-off-by: Daishan Peng <[email protected]>

StrongMonkey force-pushed the fix-recuisively branch from 9850761 to b4674e3 Compare June 19, 2024 02:09


StrongMonkey requested review from iwilltry42, thedadams, njhale, g-linville and tylerslaton June 19, 2024 02:11

g-linville approved these changes Jun 19, 2024

iwilltry42 reviewed Jun 19, 2024

iwilltry42 approved these changes Jun 21, 2024

iwilltry42 changed the title ~~Fix: recuisively search file by default, fix file ingestion with edit~~ Fix: recursively search files by default, fix file ingestion with edit Jun 21, 2024

iwilltry42 merged commit b30a58d into gptscript-ai:main Jun 21, 2024
1 check passed

iwilltry42 mentioned this pull request Jun 27, 2024

Files in sub directory are not ingested when knowledge tool is accessed as remote tool. #32

Closed
