feat: bring it all together
timonv committed Jun 26, 2024
1 parent 91cb322 commit e9ba36e
Showing 28 changed files with 437 additions and 90 deletions.
10 changes: 7 additions & 3 deletions astro.config.mjs
@@ -12,6 +12,9 @@ export default defineConfig({
editLink: {
baseUrl: "https://github.com/bosun-ai/swiftide-website/edit/master",
},
tableOfContents: {
minHeadingLevel: 2,
},
customCss: [
// Fontsource files for regular and semi-bold font weights.
"@fontsource/fira-code/400.css",
@@ -25,11 +28,12 @@ export default defineConfig({
},
social: {
github: "https://github.com/bosun-ai/swiftide",
linkedin: "https://www.linkedin.com/company/bosun-ai/",
},
sidebar: [
{
label: "Introduction",
link: "/introduction/",
label: "What is swiftide?",
link: "/what-is-swiftide/",
},
{
label: "Getting Started",
@@ -40,7 +44,7 @@ export default defineConfig({
{
label: "In depth",
autogenerate: {
directory: "concepts",
directory: "in-depth",
},
},
{
5 changes: 0 additions & 5 deletions src/content/docs/concepts/caching-and-filtering.md

This file was deleted.

5 changes: 0 additions & 5 deletions src/content/docs/concepts/extendability.md

This file was deleted.

5 changes: 0 additions & 5 deletions src/content/docs/concepts/loading-data.md

This file was deleted.

5 changes: 0 additions & 5 deletions src/content/docs/concepts/storing-results.md

This file was deleted.

5 changes: 0 additions & 5 deletions src/content/docs/concepts/streaming-and-concurrency.md

This file was deleted.

71 changes: 67 additions & 4 deletions src/content/docs/examples/hello-world.md
@@ -3,9 +3,72 @@ title: Hello World
description: A simple example of an ingestion pipeline
---

Guides lead a user through a specific task they want to accomplish, often with a sequence of steps.
Writing a good guide requires thinking about what your users are trying to do.
## Ingesting code into Qdrant

## Further reading
This example demonstrates how to ingest the Swiftide codebase itself.
Note that for it to work correctly you need `OPENAI_API_KEY` set, and Redis and Qdrant
running.

- Read [about how-to guides](https://diataxis.fr/how-to-guides/) in the Diátaxis framework
The pipeline will:

- Load all `.rs` files from the current directory
- Skip any nodes previously processed; hashes are based on the path and chunk (not the
metadata!)
- Run metadata QA on each chunk, generating questions and answers and adding them as metadata
- Chunk the code into pieces of 10 to 2048 bytes
- Embed the chunks in batches of 10; metadata is embedded by default
- Store the nodes in Qdrant

Note that metadata is copied over to smaller chunks when chunking. When making LLM requests
with lots of small chunks, consider the rate limits of the API.

```rust
use swiftide::{
    ingestion,
    integrations::{self, qdrant::Qdrant, redis::Redis},
    loaders::FileLoader,
    transformers::{ChunkCode, Embed, MetadataQACode},
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    tracing_subscriber::fmt::init();

    // One OpenAI client serves both the prompt model (metadata QA)
    // and the embedding model.
    let openai_client = integrations::openai::OpenAI::builder()
        .default_embed_model("text-embedding-3-small")
        .default_prompt_model("gpt-3.5-turbo")
        .build()?;

    let redis_url = std::env::var("REDIS_URL")
        .as_deref()
        .unwrap_or("redis://localhost:6379")
        .to_owned();

    let qdrant_url = std::env::var("QDRANT_URL")
        .as_deref()
        .unwrap_or("http://localhost:6334")
        .to_owned();

    ingestion::IngestionPipeline::from_loader(FileLoader::new(".").with_extensions(&["rs"]))
        // Skip nodes that were already processed in a previous run.
        .filter_cached(Redis::try_from_url(redis_url, "swiftide-examples")?)
        .then(MetadataQACode::new(openai_client.clone()))
        .then_chunk(ChunkCode::try_for_language_and_chunk_size("rust", 10..2048)?)
        .then_in_batch(10, Embed::new(openai_client.clone()))
        .then_store_with(
            Qdrant::try_from_url(qdrant_url)?
                .batch_size(50)
                .vector_size(1536)
                .collection_name("swiftide-examples".to_string())
                .build()?,
        )
        .run()
        .await?;
    Ok(())
}
```

Find more examples in [our repository](https://github.com/bosun-ai/swiftide/blob/master/examples).
41 changes: 41 additions & 0 deletions src/content/docs/getting-started/architecture-and-design.mdx
@@ -0,0 +1,41 @@
---
title: Architecture and Design
description: The architecture and design principles of the Swiftide project.
---

## Design principles

- **Modular**: The pipeline is built from small, composable parts.
- **Extensible**: It is easy to add new parts to the pipeline by extending straightforward traits.
- **Performance**: Performance and ease of use are the main goals of the library. Performance always has priority.
- **Traceable**: `tracing` is used throughout the pipeline.

### When designing integrations, transformers, and chunkers

- **Simple**: The API should be simple and easy to use.
- **Sane defaults, fully configurable**: The library should have sane defaults that are easy to override.
- **Builder pattern**: The builder pattern is used to create new instances of the pipeline.

## The things we talk about

- **IngestionPipeline**: The main struct that holds the pipeline. It is a stream of IngestionNodes.
- **IngestionNode**: The main struct that holds the data. It has a path, chunk and metadata.
- **IngestionStream**: The internal stream of IngestionNodes in the pipeline.
- **Loader**: The starting point of the stream; creates and emits IngestionNodes.
- **Transformers**: Behaviour that modifies IngestionNodes.
- **BatchTransformers**: Transformers that transform multiple nodes.
- **Chunkers**: Transformers that split a node into multiple nodes.
- **Storages**: Persist the IngestionNodes.
- **NodeCache**: Filters cached nodes.
- **Integrations**: External libraries that can be used with the pipeline.

### Pipeline structure and traits

- `from_loader` (impl `Loader`): the starting point of the stream; creates and emits IngestionNodes
- `filter_cached` (impl `NodeCache`): filters cached nodes
- `then` (impl `Transformer`): transforms a node and puts it on the stream
- `then_in_batch` (impl `BatchTransformer`): transforms multiple nodes and puts them on the stream
- `then_chunk` (impl `ChunkerTransformer`): transforms a single node and emits multiple nodes
- `then_store_with` (impl `Persist`): stores the nodes in a storage backend; this can be chained

Additionally, several generic transformers are implemented. They take implementers of `SimplePrompt` and `EmbeddingModel` to do their work.
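
As a sketch, the chain above fits together as follows (inside an async function). Every `My*` type here is a hypothetical implementer of the trait named in the comment, not a type shipped with Swiftide:

```rust
// Structural sketch only; the `My*` types are stand-ins for your own
// implementations of the corresponding traits.
ingestion::IngestionPipeline::from_loader(MyLoader::new()) // impl Loader
    .filter_cached(MyCache::new())                         // impl NodeCache
    .then(MyTransformer::new())                            // impl Transformer
    .then_in_batch(10, MyBatchTransformer::new())          // impl BatchTransformer
    .then_chunk(MyChunker::new())                          // impl ChunkerTransformer
    .then_store_with(MyStorage::new())                     // impl Persist
    .run()
    .await?;
```
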
1 change: 1 addition & 0 deletions src/content/docs/getting-started/changelog.mdx
@@ -1,5 +1,6 @@
---
title: Changelog
description: The changelog of the Swiftide project.
---

import Changelog from "../../../components/Changelog.astro";
1 change: 1 addition & 0 deletions src/content/docs/getting-started/feature-flags.mdx
@@ -1,5 +1,6 @@
---
title: Feature Flags
description: Available features and integrations in Swiftide.
sidebar:
order: 1
---
1 change: 1 addition & 0 deletions src/content/docs/getting-started/installation.mdx
@@ -1,5 +1,6 @@
---
title: Installation
description: Installation instructions for Swiftide.
sidebar:
order: 0
---
31 changes: 31 additions & 0 deletions src/content/docs/in-depth/caching-and-filtering.md
@@ -0,0 +1,31 @@
---
title: Caching and filtering nodes
description: How to cache and filter nodes in the pipeline.
sidebar:
order: 3
---

When nodes have already been processed by the pipeline, they can often be skipped, speeding up the pipeline and saving costs. A node cache implements the `NodeCache` trait.

## The `NodeCache` trait

Which is defined as follows:

```rust
pub trait NodeCache: Send + Sync + Debug {
    async fn get(&self, node: &IngestionNode) -> bool;
    async fn set(&self, node: &IngestionNode);
}
```

Or in human language: "Given a Node, provide methods to set and get from the cache".
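
As a minimal sketch, an in-memory cache could look like this. It assumes the trait is implemented via `async_trait` and that the import paths below match docs.rs (both are assumptions; the built-in Redis cache is the production option):

```rust
use std::{collections::HashSet, sync::Mutex};

use async_trait::async_trait;
// Import paths assumed; check docs.rs for the authoritative ones.
use swiftide::ingestion::IngestionNode;

#[derive(Debug, Default)]
struct MemoryCache {
    // Keyed by path and chunk, mirroring how nodes are hashed.
    seen: Mutex<HashSet<String>>,
}

#[async_trait]
impl NodeCache for MemoryCache {
    // True if the node was seen before, so the pipeline can skip it.
    async fn get(&self, node: &IngestionNode) -> bool {
        let key = format!("{}:{}", node.path.display(), node.chunk);
        self.seen.lock().unwrap().contains(&key)
    }

    // Mark the node as processed.
    async fn set(&self, node: &IngestionNode) {
        let key = format!("{}:{}", node.path.display(), node.chunk);
        self.seen.lock().unwrap().insert(key);
    }
}
```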

## Built in caches

<small>

| Name | Description | Feature Flag |
| ----- | --------------------------------------------------- | ------------ |
| Redis | Can get and set nodes using multiplexed connections | redis |

</small>
@@ -1,11 +1,14 @@
---
title: Chunking
description: How to chunk nodes in the pipeline.
sidebar:
order: 2
---

For quality metadata it can be important to break up text into smaller parts for both better metadata and retrieval. A chunker implements the `ChunkerTransformer` trait.

## The `ChunkerTransformer` trait

Which is defined as follows:

```rust
// … (remainder of this hunk is collapsed in the diff)
```
@@ -1,5 +1,6 @@
---
title: Overview
title: Step-by-step Introduction
description: A step-by-step introduction on how to use swiftide as a data ingestion pipeline in your project.
sidebar:
order: 0
---
@@ -8,6 +9,8 @@ Swiftide provides a pipeline model. Throughout a pipeline, `IngestionNodes` are t

import { Steps } from "@astrojs/starlight/components";

### A pipeline step-by-step

<Steps>

1. The pipeline starts with a loader:
@@ -86,3 +89,7 @@ import { Steps } from "@astrojs/starlight/components";
```

</Steps>

### Read more

[Reference documentation on docs.rs](https://docs.rs/swiftide/latest/swiftide/)
31 changes: 31 additions & 0 deletions src/content/docs/in-depth/loading-data.md
@@ -0,0 +1,31 @@
---
title: Loading Data
description: How to load data into the pipeline.
sidebar:
order: 1
---

A pipeline starts with data and is only as good as the data it ingests. A loader implements the `Loader` trait.

## The `Loader` trait

Which is defined as follows:

```rust
pub trait Loader {
    fn into_stream(self) -> IngestionStream;
}
```

Or in human language: "I can be turned into a stream". The assumption under the hood is that loaders will yield the data they load as a stream of `IngestionNodes`. These can be files, messages, web pages, and so on.
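
As a sketch, a loader that emits a fixed list of strings might look as follows. The import paths, the `IngestionStream::iter` constructor, and the `Default` implementation on `IngestionNode` are assumptions to verify against docs.rs:

```rust
// Import paths assumed; check docs.rs for the authoritative ones.
use swiftide::ingestion::{IngestionNode, IngestionStream};

struct StaticLoader {
    texts: Vec<String>,
}

impl Loader for StaticLoader {
    fn into_stream(self) -> IngestionStream {
        // Wrap each text in a node; everything else keeps its default value.
        let nodes = self.texts.into_iter().map(|chunk| {
            Ok(IngestionNode {
                chunk,
                ..Default::default()
            })
        });
        IngestionStream::iter(nodes)
    }
}
```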

## Built in loaders

<small>

| Name | Description | Feature Flag |
| -------------- | ------------------------------------------------------------------- | ------------ |
| FileLoader | Loads files with an optional extension filter, respecting gitignore | |
| ScrapingLoader | Scrapes a website using the `spider` crate | scraping |

</small>
43 changes: 43 additions & 0 deletions src/content/docs/in-depth/prompting-embedding.md
@@ -0,0 +1,43 @@
---
title: Prompting and Embedding
description: How to prompt and embed data in the pipeline.
sidebar:
order: 2
---

Our metadata transformers are generic over the `SimplePrompt` trait. This enables different models to be used for different use cases. Similarly, the embedding transformer is generic over the `EmbeddingModel` trait.

## The `SimplePrompt` trait

Which is defined as follows:

```rust
pub trait SimplePrompt: Debug + Send + Sync {
    async fn prompt(&self, prompt: &str) -> Result<String>;
}
```

Or in human language: "Given a Prompt, give me a response".
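
As a sketch, a stub model lets you test a metadata transformer without calling a real LLM; this assumes the trait is implemented via `async_trait`:

```rust
use anyhow::Result;
use async_trait::async_trait;

#[derive(Debug, Clone)]
struct EchoPrompt;

#[async_trait]
impl SimplePrompt for EchoPrompt {
    // Simply echoes the prompt back as the response.
    async fn prompt(&self, prompt: &str) -> Result<String> {
        Ok(format!("echo: {prompt}"))
    }
}
```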

## The `EmbeddingModel` trait

Which is defined as follows:

```rust
pub trait EmbeddingModel: Send + Sync {
    async fn embed(&self, input: Vec<String>) -> Result<Embeddings>;
}
```

Or in human language: "Given a list of things to Embed, give me embeddings". The embedding transformer will link back the embeddings to the original nodes by _order_.
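
A matching sketch for embeddings, assuming `Embeddings` is a list of vectors, one per input and in the same order:

```rust
use anyhow::Result;
use async_trait::async_trait;

#[derive(Debug, Clone)]
struct ZeroEmbedder;

#[async_trait]
impl EmbeddingModel for ZeroEmbedder {
    // One zero vector per input, preserving order so the transformer can
    // link each embedding back to its node.
    async fn embed(&self, input: Vec<String>) -> Result<Embeddings> {
        Ok(input.iter().map(|_| vec![0.0; 1536]).collect())
    }
}
```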

## Built in inference and embedding models

<small>

| Name | Description | Feature Flag |
| --------- | --------------------------------------------------------- | ------------ |
| OpenAI | Implements both SimplePrompt and Embed via `async_openai` | openai |
| FastEmbed | Implements Embed via `fastembed-rs` | fastembed |

</small>
39 changes: 39 additions & 0 deletions src/content/docs/in-depth/storing-results.md
@@ -0,0 +1,39 @@
---
title: Storing the results
description: How to store the results of the pipeline.
sidebar:
order: 5
---

After processing nodes in the pipeline, you probably want to store the results. Pipelines support multiple storage steps, but need at least one. A storage implements the `Persist` trait.

## The `Persist` trait

Which is defined as follows:

```rust
pub trait Persist: Debug + Send + Sync {
    async fn setup(&self) -> Result<()>;
    async fn store(&self, node: IngestionNode) -> Result<IngestionNode>;
    async fn batch_store(&self, nodes: Vec<IngestionNode>) -> IngestionStream;

    fn batch_size(&self) -> Option<usize> {
        None
    }
}
```

Setup functions run right away, asynchronously, when the pipeline starts. This could include setting up collections, tables, connections, and so on. Because more might happen after storing, both `store` and `batch_store` are expected to return the nodes they processed.

If `batch_size` is implemented for the storage, the stream will always prefer `batch_store`.
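
As a sketch, an in-memory store along the lines of the built-in `MemoryStorage`; it assumes `async_trait` and an `IngestionStream::iter` constructor (both assumptions to verify against docs.rs):

```rust
use std::sync::Mutex;

use anyhow::Result;
use async_trait::async_trait;
// Import paths assumed; check docs.rs for the authoritative ones.
use swiftide::ingestion::{IngestionNode, IngestionStream};

#[derive(Debug, Default)]
struct VecStorage {
    nodes: Mutex<Vec<IngestionNode>>,
}

#[async_trait]
impl Persist for VecStorage {
    // Nothing to create for an in-memory store.
    async fn setup(&self) -> Result<()> {
        Ok(())
    }

    // Hand the node back so later steps can keep working on it.
    async fn store(&self, node: IngestionNode) -> Result<IngestionNode> {
        self.nodes.lock().unwrap().push(node.clone());
        Ok(node)
    }

    async fn batch_store(&self, nodes: Vec<IngestionNode>) -> IngestionStream {
        self.nodes.lock().unwrap().extend(nodes.clone());
        IngestionStream::iter(nodes.into_iter().map(Ok))
    }

    // Opt in to batching so the stream prefers `batch_store`.
    fn batch_size(&self) -> Option<usize> {
        Some(64)
    }
}
```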

## Built in storage

<small>

| Name | Description | Feature Flag |
| ------------- | ---------------------------------------------------- | ------------ |
| Redis         | Persists nodes by default as JSON                     | redis        |
| Qdrant        | Persists nodes in Qdrant; expects a vector to be set  | qdrant       |
| MemoryStorage | Persists nodes in memory; great for debugging | |

</small>