This repository has been archived by the owner on Oct 30, 2024. It is now read-only.

Commit

add: docs on ignore file and remote loaders

iwilltry42 committed Jul 30, 2024
1 parent b42f8d6 commit a568332
Showing 6 changed files with 81 additions and 5 deletions.
60 changes: 59 additions & 1 deletion docs/docs/02-usage.md
@@ -45,4 +45,62 @@ This mode is useful when you want to share the data with multiple clients or whe
```bash
knowledge server
```
## Ingestion
To ingest a document, you can use the `knowledge ingest` command:
```bash
knowledge ingest --dataset my-dataset ./path/to/my-document.txt
```
:::note
By default, the dataset will be created if it doesn't exist.
If you don't want that, you can use the `--no-create-dataset` flag.
:::
### Ignoring Files
You can ignore files by providing an ignore file, similar to `.gitignore`:
```bash
knowledge ingest --dataset my-dataset --ignore-file .knowledgeignore ./path/to/my-documents
```
Here's an example ignore file that tells knowledge to consider only Markdown files and nothing else:
```gitignore
# Ignore everything
*
# Except Markdown files in any directory
!**/*.md
```
:::note
Alternatively, you can use the `--ignore-extensions` flag to ignore files with specific extensions.
```bash
knowledge ingest --dataset my-dataset --ignore-extensions=.txt ./path/to/my-documents
```
:::
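The pattern semantics of such an ignore file can be illustrated with a small sketch. This is not knowledge's actual matcher, just a simplified Python model (using `fnmatch`, which approximates but does not fully replicate gitignore's `**` handling): patterns apply in order, the last matching pattern wins, and a leading `!` re-includes a file.

```python
from fnmatch import fnmatch

def is_ignored(path, patterns):
    """Simplified gitignore-style check: the last matching pattern decides;
    a leading '!' negates (re-includes). Hypothetical helper for illustration."""
    ignored = False
    for pattern in patterns:
        negate = pattern.startswith("!")
        if negate:
            pattern = pattern[1:]
        if fnmatch(path, pattern):
            ignored = not negate
    return ignored

# "Ignore everything, except Markdown files in any directory"
patterns = ["*", "!**/*.md"]
print(is_ignored("docs/guide.md", patterns))   # False: re-included by !**/*.md
print(is_ignored("assets/logo.png", patterns)) # True: caught by *
```

With these two patterns, every file is first excluded by `*`, and Markdown files are then re-included by the negated pattern.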
### Remote Files
You can ingest remote files by providing a URL. Currently, only Git repositories are supported:
```bash
knowledge ingest --dataset my-dataset https://github.com/gptscript-ai/knowledge
```
:::note
Here, it's advisable to use an [ignore file](#ignoring-files) to avoid ingesting Git metadata and any vendor files that may be present.
:::
1 change: 0 additions & 1 deletion docs/docs/99-cmd/knowledge.md
@@ -27,6 +27,5 @@ knowledge [flags]
* [knowledge ingest](knowledge_ingest.md) - Ingest a file/directory into a dataset
* [knowledge list-datasets](knowledge_list-datasets.md) - List existing datasets
* [knowledge retrieve](knowledge_retrieve.md) - Retrieve sources for a query from a dataset
* [knowledge server](knowledge_server.md) -
* [knowledge version](knowledge_version.md) -

3 changes: 3 additions & 0 deletions docs/docs/99-cmd/knowledge_askdir.md
@@ -21,6 +21,9 @@ knowledge askdir [--path <path>] <query> [flags]
--flows-file string Path to a YAML/JSON file containing ingestion/retrieval flows ($KNOW_FLOWS_FILE)
-h, --help help for askdir
--ignore-extensions string Comma-separated list of file extensions to ignore ($KNOW_INGEST_IGNORE_EXTENSIONS)
--ignore-file string Path to a .gitignore style file ($KNOW_INGEST_IGNORE_FILE)
--include-hidden Include hidden files and directories ($KNOW_INGEST_INCLUDE_HIDDEN)
--no-create-dataset Do NOT create the dataset if it doesn't exist ($KNOW_INGEST_NO_CREATE_DATASET)
--no-recursive Don't recursively ingest directories ($KNOW_NO_INGEST_RECURSIVE)
-p, --path string Path to the directory to query ($KNOWLEDGE_CLIENT_ASK_DIR_PATH) (default ".")
--server string URL of the Knowledge API Server ($KNOW_SERVER_URL)
5 changes: 3 additions & 2 deletions docs/docs/99-cmd/knowledge_import.md
@@ -9,8 +9,9 @@ Import one or more datasets from an archive (zip) (default: all datasets)

Import one or more datasets from an archive (zip) (default: all datasets).
## IMPORTANT: Embedding functions
Embedding functions are not part of exported knowledge base archives, so you'll have to know which embedding function was used in order to import the archive.
This primarily concerns the choice of the embeddings provider (model).
When someone first ingests some data into a dataset, the embedding provider configured at that time will be attached to the dataset.
Upon subsequent ingestion actions, the same embedding provider must be used to ensure that the embeddings are consistent.
Most of the time, the only field that has to match is the model, as that usually determines the embedding dimensionality.
Note: This is only relevant if you plan to add more documents to the dataset after importing it.


15 changes: 15 additions & 0 deletions docs/docs/99-cmd/knowledge_ingest.md
@@ -5,6 +5,18 @@ title: "knowledge ingest"

Ingest a file/directory into a dataset

### Synopsis

Ingest a file or directory into a dataset.

## Important Note

The first time you ingest something into a dataset, the embedding function (model provider) you chose will be attached to that dataset.
After that, the client must always use that same embedding function to ingest into this dataset.
Usually, this only concerns the choice of model, as that commonly determines the embedding dimensionality.
This is a constraint of the vector database and similarity search: different models yield embedding vectors of different sizes and may represent semantics differently.


```
knowledge ingest [--dataset <dataset-id>] <path> [flags]
```
@@ -22,6 +34,9 @@
--flows-file string Path to a YAML/JSON file containing ingestion/retrieval flows ($KNOW_FLOWS_FILE)
-h, --help help for ingest
--ignore-extensions string Comma-separated list of file extensions to ignore ($KNOW_INGEST_IGNORE_EXTENSIONS)
--ignore-file string Path to a .gitignore style file ($KNOW_INGEST_IGNORE_FILE)
--include-hidden Include hidden files and directories ($KNOW_INGEST_INCLUDE_HIDDEN)
--no-create-dataset Do NOT create the dataset if it doesn't exist ($KNOW_INGEST_NO_CREATE_DATASET)
--no-recursive Don't recursively ingest directories ($KNOW_NO_INGEST_RECURSIVE)
--server string URL of the Knowledge API Server ($KNOW_SERVER_URL)
--textsplitter-chunk-overlap int Textsplitter Chunk Overlap ($KNOW_TEXTSPLITTER_CHUNK_OVERLAP) (default 256)
2 changes: 1 addition & 1 deletion pkg/datastore/documentloader/remote/github.go
@@ -10,7 +10,7 @@ import (
)

// CloneRepo clones a git repository to a target directory
// @param repo the repository to clone - may contain an @ symbol to specify a commit, tag or branch (prioritized in that order)
// repo is the repository to clone - may contain an @ symbol to specify a commit, tag or branch (prioritized in that order)
func CloneRepo(repo, target string) error {

atSplit := strings.Split(repo, "@")
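The `@` handling described in the comment can be sketched as follows. This is a hypothetical Python mirror of the parsing step above, not the actual Go implementation, and the tag `v0.4.0` is only an example:

```python
def split_repo_ref(repo):
    """Split 'url@ref' into (url, ref); ref is '' when no '@' is present.
    Mirrors the strings.Split(repo, "@") step, rejecting ambiguous input."""
    parts = repo.split("@")
    if len(parts) == 1:
        return parts[0], ""
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError("invalid repo reference: " + repo)

url, ref = split_repo_ref("https://github.com/gptscript-ai/knowledge@v0.4.0")
print(url)  # https://github.com/gptscript-ai/knowledge
print(ref)  # v0.4.0
```

Per the comment above, the resolved ref is then tried as a commit, tag, or branch, in that order of priority.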
