This repository has been archived by the owner on Oct 30, 2024. It is now read-only.

Commit

add: docs on ignore file and remote loaders

iwilltry42 committed Jul 30, 2024
1 parent b42f8d6 commit a568332
Showing 6 changed files with 81 additions and 5 deletions.
60 changes: 59 additions & 1 deletion docs/docs/02-usage.md
@@ -45,4 +45,62 @@ This mode is useful when you want to share the data with multiple clients or whe
```bash
knowledge server
```
## Ingestion
To ingest a document, you can use the `knowledge ingest` command:
```bash
knowledge ingest --dataset my-dataset ./path/to/my-document.txt
```
:::note
By default, the dataset will be created if it doesn't exist.
If you don't want that, you can use the `--no-create-dataset` flag.
:::
### Ignoring Files
You can ignore files by providing an ignore file, similar to `.gitignore`:
```bash
knowledge ingest --dataset my-dataset --ignore-file .knowledgeignore ./path/to/my-documents
```
Here's an example ignore file that tells knowledge to consider only Markdown files and nothing else:
```gitignore
# Ignore everything
*
# Except Markdown files in any directory
!**/*.md
```
:::note
Alternatively, you can use the `--ignore-extensions` flag to ignore files with specific extensions.
```bash
knowledge ingest --dataset my-dataset --ignore-extensions=.txt ./path/to/my-documents
```
:::
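The pattern semantics of such an ignore file can be illustrated with a small sketch. This is not knowledge's actual matcher, just a simplified Python model (using `fnmatch`, which approximates but does not fully replicate gitignore's `**` handling): patterns apply in order, the last matching pattern wins, and a leading `!` re-includes a file.

```python
from fnmatch import fnmatch

def is_ignored(path, patterns):
    """Simplified gitignore-style check: the last matching pattern decides;
    a leading '!' negates (re-includes). Hypothetical helper for illustration."""
    ignored = False
    for pattern in patterns:
        negate = pattern.startswith("!")
        if negate:
            pattern = pattern[1:]
        if fnmatch(path, pattern):
            ignored = not negate
    return ignored

# "Ignore everything, except Markdown files in any directory"
patterns = ["*", "!**/*.md"]
print(is_ignored("docs/guide.md", patterns))   # False: re-included by !**/*.md
print(is_ignored("assets/logo.png", patterns)) # True: caught by *
```

With these two patterns, every file is first excluded by `*`, and Markdown files are then re-included by the negated pattern.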
### Remote Files
You can ingest remote files by providing a URL. Currently, only Git repositories are supported:
```bash
knowledge ingest --dataset my-dataset https://github.com/gptscript-ai/knowledge
```
:::note
Here, it's advisable to use an [ignore file](#ignoring-files) to avoid ingesting Git metadata and any vendor files that may be present.
:::
1 change: 0 additions & 1 deletion docs/docs/99-cmd/knowledge.md
@@ -27,6 +27,5 @@ knowledge [flags]
* [knowledge ingest](knowledge_ingest.md) - Ingest a file/directory into a dataset
* [knowledge list-datasets](knowledge_list-datasets.md) - List existing datasets
* [knowledge retrieve](knowledge_retrieve.md) - Retrieve sources for a query from a dataset
* [knowledge server](knowledge_server.md) -
* [knowledge version](knowledge_version.md) -

3 changes: 3 additions & 0 deletions docs/docs/99-cmd/knowledge_askdir.md
@@ -21,6 +21,9 @@ knowledge askdir [--path <path>] <query> [flags]
--flows-file string Path to a YAML/JSON file containing ingestion/retrieval flows ($KNOW_FLOWS_FILE)
-h, --help help for askdir
--ignore-extensions string Comma-separated list of file extensions to ignore ($KNOW_INGEST_IGNORE_EXTENSIONS)
--ignore-file string Path to a .gitignore style file ($KNOW_INGEST_IGNORE_FILE)
--include-hidden Include hidden files and directories ($KNOW_INGEST_INCLUDE_HIDDEN)
--no-create-dataset Do NOT create the dataset if it doesn't exist ($KNOW_INGEST_NO_CREATE_DATASET)
--no-recursive Don't recursively ingest directories ($KNOW_NO_INGEST_RECURSIVE)
-p, --path string Path to the directory to query ($KNOWLEDGE_CLIENT_ASK_DIR_PATH) (default ".")
--server string URL of the Knowledge API Server ($KNOW_SERVER_URL)
5 changes: 3 additions & 2 deletions docs/docs/99-cmd/knowledge_import.md
@@ -9,8 +9,9 @@ Import one or more datasets from an archive (zip) (default: all datasets)

Import one or more datasets from an archive (zip) (default: all datasets).
## IMPORTANT: Embedding functions
Embedding functions are not part of exported knowledge base archives, so you'll have to know which embedding function was used in order to import the archive.
This primarily concerns the choice of the embeddings provider (model).
When someone first ingests some data into a dataset, the embedding provider configured at that time will be attached to the dataset.
Upon subsequent ingestion actions, the same embedding provider must be used to ensure that the embeddings are consistent.
Most of the time, the only field that has to match is the model, as that usually determines the embedding dimensionality.
Note: This is only relevant if you plan to add more documents to the dataset after importing it.


15 changes: 15 additions & 0 deletions docs/docs/99-cmd/knowledge_ingest.md
@@ -5,6 +5,18 @@ title: "knowledge ingest"

Ingest a file/directory into a dataset

### Synopsis

Ingest a file or directory into a dataset.

## Important Note

The first time you ingest something into a dataset, the embedding function (model provider) you chose will be attached to that dataset.
After that, the client must always use that same embedding function to ingest into this dataset.
Usually, this only concerns the choice of model, as that commonly determines the embedding dimensionality.
This is a constraint of the vector database and similarity search: different models yield embedding vectors of different sizes and may represent semantics differently.


```
knowledge ingest [--dataset <dataset-id>] <path> [flags]
```
@@ -22,6 +34,9 @@
--flows-file string Path to a YAML/JSON file containing ingestion/retrieval flows ($KNOW_FLOWS_FILE)
-h, --help help for ingest
--ignore-extensions string Comma-separated list of file extensions to ignore ($KNOW_INGEST_IGNORE_EXTENSIONS)
--ignore-file string Path to a .gitignore style file ($KNOW_INGEST_IGNORE_FILE)
--include-hidden Include hidden files and directories ($KNOW_INGEST_INCLUDE_HIDDEN)
--no-create-dataset Do NOT create the dataset if it doesn't exist ($KNOW_INGEST_NO_CREATE_DATASET)
--no-recursive Don't recursively ingest directories ($KNOW_NO_INGEST_RECURSIVE)
--server string URL of the Knowledge API Server ($KNOW_SERVER_URL)
--textsplitter-chunk-overlap int Textsplitter Chunk Overlap ($KNOW_TEXTSPLITTER_CHUNK_OVERLAP) (default 256)
2 changes: 1 addition & 1 deletion pkg/datastore/documentloader/remote/github.go
@@ -10,7 +10,7 @@ import (
)

// CloneRepo clones a git repository to a target directory
// @param repo the repository to clone - may contain an @ symbol to specify a commit, tag or branch (prioritized in that order)
// repo is the repository to clone - may contain an @ symbol to specify a commit, tag or branch (prioritized in that order)
func CloneRepo(repo, target string) error {

atSplit := strings.Split(repo, "@")
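The `@` handling described in the comment can be sketched as follows. This is a hypothetical Python mirror of the parsing step above, not the actual Go implementation, and the tag `v0.4.0` is only an example:

```python
def split_repo_ref(repo):
    """Split 'url@ref' into (url, ref); ref is '' when no '@' is present.
    Mirrors the strings.Split(repo, "@") step, rejecting ambiguous input."""
    parts = repo.split("@")
    if len(parts) == 1:
        return parts[0], ""
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError("invalid repo reference: " + repo)

url, ref = split_repo_ref("https://github.com/gptscript-ai/knowledge@v0.4.0")
print(url)  # https://github.com/gptscript-ai/knowledge
print(ref)  # v0.4.0
```

Per the comment above, the resolved ref is then tried as a commit, tag, or branch, in that order of priority.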
