This repository has been archived by the owner on Oct 30, 2024. It is now read-only.

Fix: recursively search files by default, fix file ingestion with edit #29

Merged 1 commit on Jun 21, 2024

Conversation

StrongMonkey
Contributor

This PR fixes two things:

  1. Recursively run askDir/ingest by default.
  2. Prune old content when re-ingesting a file that has changed. Previously the old content was not pruned, so stale content remained in the dataset and affected query results.
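
As a rough illustration of item 1, the sketch below walks a directory tree with the standard library's `filepath.WalkDir` and skips subdirectories only when recursion is turned off. The names `collectFiles` and `makeDemoTree` and the `recursive` flag are hypothetical and are not taken from this PR's code.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// collectFiles returns the regular files under root. When recursive is false,
// subdirectories below root are skipped. Hypothetical helper for illustration.
func collectFiles(root string, recursive bool) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			// Always descend into root itself; descend further only if recursive.
			if path != root && !recursive {
				return filepath.SkipDir
			}
			return nil
		}
		if d.Type().IsRegular() {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

// makeDemoTree creates a temp directory holding a.txt and sub/b.txt and
// returns its path together with a cleanup function.
func makeDemoTree() (string, func(), error) {
	root, err := os.MkdirTemp("", "ingest-demo")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(root) }
	if err := os.MkdirAll(filepath.Join(root, "sub"), 0o755); err != nil {
		cleanup()
		return "", nil, err
	}
	for _, p := range []string{"a.txt", filepath.Join("sub", "b.txt")} {
		if err := os.WriteFile(filepath.Join(root, p), []byte("demo"), 0o644); err != nil {
			cleanup()
			return "", nil, err
		}
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := makeDemoTree()
	if err != nil {
		panic(err)
	}
	defer cleanup()

	all, err := collectFiles(root, true) // recursive, the new default in this PR
	if err != nil {
		panic(err)
	}
	top, err := collectFiles(root, false)
	if err != nil {
		panic(err)
	}
	fmt.Printf("recursive: %d files, top-level only: %d files\n", len(all), len(top))
}
```

Making recursion the default means nested files such as `sub/b.txt` are picked up without an extra flag.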

@@ -92,7 +92,7 @@ func (s *Datastore) Ingest(ctx context.Context, datasetID string, content []byte
ingestionFlow.FillDefaults(filetype, opts.TextSplitterOpts)

// Mandatory Transformation: Add filename to metadata
- em := &transformers.ExtraMetadata{Metadata: map[string]any{"filename": filename}}
+ em := &transformers.ExtraMetadata{Metadata: map[string]any{"filename": filename, "absPath": opts.FileMetadata.AbsolutePath}}
Collaborator

@iwilltry42 Jun 19, 2024

Note: absPath may not exist if the content comes in via the API (server mode), which may be negligible at the moment since we're focusing on standalone mode 🤔

Contributor Author

Hmm yeah, good point. At some point I just want to identify the file. I guess in server mode we can just use the name because you can't upload two files with the same name :)
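
The fallback discussed here (absolute path in standalone mode, plain name in server mode) might be sketched as follows. The `FileMetadata` struct below is a simplified stand-in, and both the `Name` field and the `fileKey` helper are hypothetical. Only `AbsolutePath` appears in the diff above.

```go
package main

import "fmt"

// FileMetadata is a simplified stand-in for the metadata passed to Ingest.
// Only AbsolutePath is taken from the diff; Name is an assumption.
type FileMetadata struct {
	Name         string
	AbsolutePath string
}

// fileKey returns the identifier used to recognize a file on re-ingestion:
// the absolute path when it is known, otherwise the file name.
func fileKey(m FileMetadata) string {
	if m.AbsolutePath != "" {
		return m.AbsolutePath
	}
	return m.Name
}

func main() {
	// Standalone mode: the file comes from disk, so the path is known.
	fmt.Println(fileKey(FileMetadata{Name: "notes.md", AbsolutePath: "/data/notes.md"}))
	// Server mode: content arrives via the API without a path.
	fmt.Println(fileKey(FileMetadata{Name: "notes.md"}))
}
```

This only works in server mode under the assumption stated in the comment, namely that two uploads cannot share the same name.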

pkg/datastore/ingest.go
@@ -9,5 +9,5 @@ type VectorStore interface {
AddDocuments(ctx context.Context, docs []Document, collection string) ([]string, error) // @return documentIDs, error
SimilaritySearch(ctx context.Context, query string, numDocuments int, collection string) ([]Document, error) //nolint:lll
RemoveCollection(ctx context.Context, collection string) error
- RemoveDocument(ctx context.Context, documentID string, collection string) error
+ RemoveDocument(ctx context.Context, documentID string, collection string, where, whereDocument map[string]string) error
Collaborator

I left out the metadata filters because I wasn't sure if potential other VectorDBs would support them the same way. This is not really important to us right now, so I don't care too much.
On the other hand, we have the file metadata information and the file to documents mapping in the sqlite DB as well and thus could look up all document IDs there and then remove them directly by ID.
Both are valid ways.
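
To make the two options concrete, here is a minimal sketch using a toy in-memory store rather than a real vector database. All type and function names (`memStore`, `removeWhere`, `removeByIDs`, `newDemoStore`) are hypothetical, and the file-to-documents index is a plain map standing in for the lookup in the sqlite DB.

```go
package main

import "fmt"

// doc is a stored chunk with its metadata.
type doc struct {
	ID       string
	Metadata map[string]string
}

// memStore is a toy in-memory stand-in for one vector store collection.
type memStore struct {
	docs map[string]doc
}

// matches reports whether meta contains every key/value pair in where.
func matches(meta, where map[string]string) bool {
	for k, v := range where {
		if got, ok := meta[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// removeWhere is option 1: delete by metadata filter. An empty filter is
// rejected so it cannot wipe the whole collection by accident.
func (s *memStore) removeWhere(where map[string]string) int {
	if len(where) == 0 {
		return 0
	}
	removed := 0
	for id, d := range s.docs {
		if matches(d.Metadata, where) {
			delete(s.docs, id)
			removed++
		}
	}
	return removed
}

// removeByIDs is option 2: delete by document IDs that were looked up
// elsewhere, for example in a file-to-documents index.
func (s *memStore) removeByIDs(ids []string) int {
	removed := 0
	for _, id := range ids {
		if _, ok := s.docs[id]; ok {
			delete(s.docs, id)
			removed++
		}
	}
	return removed
}

// newDemoStore returns a store with two chunks of a.txt and one of b.txt.
func newDemoStore() *memStore {
	return &memStore{docs: map[string]doc{
		"d1": {ID: "d1", Metadata: map[string]string{"absPath": "/data/a.txt"}},
		"d2": {ID: "d2", Metadata: map[string]string{"absPath": "/data/a.txt"}},
		"d3": {ID: "d3", Metadata: map[string]string{"absPath": "/data/b.txt"}},
	}}
}

func main() {
	// Option 1: filter on the metadata attached at ingestion time.
	s := newDemoStore()
	n := s.removeWhere(map[string]string{"absPath": "/data/a.txt"})
	fmt.Println("removed by filter:", n, "remaining:", len(s.docs))

	// Option 2: resolve IDs from an index first, then remove by ID.
	s = newDemoStore()
	index := map[string][]string{"/data/a.txt": {"d1", "d2"}}
	n = s.removeByIDs(index["/data/a.txt"])
	fmt.Println("removed by ID:", n, "remaining:", len(s.docs))
}
```

Option 2 keeps the store interface free of filter semantics, which matters if other vector databases handle metadata filters differently. Option 1 avoids the extra lookup.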

Collaborator

@iwilltry42 left a comment

Approving and merging for now to get the recurse fix in - we can follow-up on the rest later on 👍

@iwilltry42 changed the title from "Fix: recuisively search file by default, fix file ingestion with edit" to "Fix: recursively search files by default, fix file ingestion with edit" on Jun 21, 2024
@iwilltry42 merged commit b30a58d into gptscript-ai:main on Jun 21, 2024
1 check passed
3 participants