feat: bring it all together
timonv committed Jun 26, 2024
1 parent 91cb322 commit e9ba36e
Showing 28 changed files with 437 additions and 90 deletions.
10 changes: 7 additions & 3 deletions astro.config.mjs
@@ -12,6 +12,9 @@ export default defineConfig({
editLink: {
baseUrl: "https://github.com/bosun-ai/swiftide-website/edit/master",
},
tableOfContents: {
minHeadingLevel: 2,
},
customCss: [
// Fontsource files for regular and semi-bold font weights.
"@fontsource/fira-code/400.css",
@@ -25,11 +28,12 @@ export default defineConfig({
},
social: {
github: "https://github.com/bosun-ai/swiftide",
linkedin: "https://www.linkedin.com/company/bosun-ai/",
},
sidebar: [
{
label: "Introduction",
link: "/introduction/",
label: "What is swiftide?",
link: "/what-is-swiftide/",
},
{
label: "Getting Started",
@@ -40,7 +44,7 @@ export default defineConfig({
{
label: "In depth",
autogenerate: {
directory: "concepts",
directory: "in-depth",
},
},
{
5 changes: 0 additions & 5 deletions src/content/docs/concepts/caching-and-filtering.md

This file was deleted.

5 changes: 0 additions & 5 deletions src/content/docs/concepts/extendability.md

This file was deleted.

5 changes: 0 additions & 5 deletions src/content/docs/concepts/loading-data.md

This file was deleted.

5 changes: 0 additions & 5 deletions src/content/docs/concepts/storing-results.md

This file was deleted.

5 changes: 0 additions & 5 deletions src/content/docs/concepts/streaming-and-concurrency.md

This file was deleted.

71 changes: 67 additions & 4 deletions src/content/docs/examples/hello-world.md
@@ -3,9 +3,72 @@ title: Hello World
description: A simple example of an ingestion pipeline
---

Guides lead a user through a specific task they want to accomplish, often with a sequence of steps.
Writing a good guide requires thinking about what your users are trying to do.
## Ingesting code into Qdrant

## Further reading
This example demonstrates how to ingest the Swiftide codebase itself.
Note that for it to work correctly you need `OPENAI_API_KEY` set, and Redis and Qdrant
running.

- Read [about how-to guides](https://diataxis.fr/how-to-guides/) in the Diátaxis framework
The pipeline will:

- Load all `.rs` files from the current directory
- Skip any nodes previously processed; hashes are based on the path and chunk (not the
metadata!)
- Run metadata QA on each chunk, generating questions and answers and adding them as metadata
- Chunk the code into pieces of 10 to 2048 bytes
- Embed the chunks in batches of 10; metadata is embedded by default
- Store the nodes in Qdrant

Note that metadata is copied over to smaller chunks when chunking. When making LLM requests
with lots of small chunks, consider the rate limits of the API.

```rust
use swiftide::{
    ingestion,
    integrations::{self, qdrant::Qdrant, redis::Redis},
    loaders::FileLoader,
    transformers::{ChunkCode, Embed, MetadataQACode},
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    tracing_subscriber::fmt::init();

    // One OpenAI client serves both the prompt model (metadata QA)
    // and the embedding model.
    let openai_client = integrations::openai::OpenAI::builder()
        .default_embed_model("text-embedding-3-small")
        .default_prompt_model("gpt-3.5-turbo")
        .build()?;

    let redis_url = std::env::var("REDIS_URL")
        .as_deref()
        .unwrap_or("redis://localhost:6379")
        .to_owned();

    let qdrant_url = std::env::var("QDRANT_URL")
        .as_deref()
        .unwrap_or("http://localhost:6334")
        .to_owned();

    ingestion::IngestionPipeline::from_loader(FileLoader::new(".").with_extensions(&["rs"]))
        // Skip nodes that were already processed in a previous run.
        .filter_cached(Redis::try_from_url(redis_url, "swiftide-examples")?)
        .then(MetadataQACode::new(openai_client.clone()))
        .then_chunk(ChunkCode::try_for_language_and_chunk_size("rust", 10..2048)?)
        .then_in_batch(10, Embed::new(openai_client.clone()))
        .then_store_with(
            Qdrant::try_from_url(qdrant_url)?
                .batch_size(50)
                .vector_size(1536)
                .collection_name("swiftide-examples".to_string())
                .build()?,
        )
        .run()
        .await?;
    Ok(())
}
```

Find more examples in [our repository](https://github.com/bosun-ai/swiftide/blob/master/examples).
41 changes: 41 additions & 0 deletions src/content/docs/getting-started/architecture-and-design.mdx
@@ -0,0 +1,41 @@
---
title: Architecture and Design
description: The architecture and design principles of the Swiftide project.
---

## Design principles

- **Modular**: The pipeline is built from small, composable parts.
- **Extensible**: It is easy to add new parts to the pipeline by extending straightforward traits.
- **Performance**: Performance and ease of use are the main goals of the library. Performance always has priority.
- **Traceable**: `tracing` is used throughout the pipeline.

### When designing integrations, transformers, and chunkers

- **Simple**: The API should be simple and easy to use.
- **Sane defaults, fully configurable**: The library should have sane defaults that are easy to override.
- **Builder pattern**: The builder pattern is used to create new instances of the pipeline.

## The things we talk about

- **IngestionPipeline**: The main struct that holds the pipeline. It is a stream of IngestionNodes.
- **IngestionNode**: The main struct that holds the data. It has a path, chunk and metadata.
- **IngestionStream**: The internal stream of IngestionNodes in the pipeline.
- **Loader**: The starting point of the stream; creates and emits IngestionNodes.
- **Transformers**: Behaviour that modifies IngestionNodes.
- **BatchTransformers**: Transformers that transform multiple nodes.
- **Chunkers**: Transformers that split a node into multiple nodes.
- **Storages**: Persist the IngestionNodes.
- **NodeCache**: Filters cached nodes.
- **Integrations**: External libraries that can be used with the pipeline.

### Pipeline structure and traits

- `from_loader` (impl `Loader`): the starting point of the stream; creates and emits IngestionNodes
- `filter_cached` (impl `NodeCache`): filters cached nodes
- `then` (impl `Transformer`): transforms a node and puts it on the stream
- `then_in_batch` (impl `BatchTransformer`): transforms multiple nodes and puts them on the stream
- `then_chunk` (impl `ChunkerTransformer`): transforms a single node and emits multiple nodes
- `then_store_with` (impl `Persist`): stores the nodes in a storage backend; this can be chained

Additionally, several generic transformers are implemented. They take implementers of `SimplePrompt` and `EmbeddingModel` to do their work.
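
As a sketch, the chain above fits together as follows (inside an async function). Every `My*` type here is a hypothetical implementer of the trait named in the comment, not a type shipped with Swiftide:

```rust
// Structural sketch only; the `My*` types are stand-ins for your own
// implementations of the corresponding traits.
ingestion::IngestionPipeline::from_loader(MyLoader::new()) // impl Loader
    .filter_cached(MyCache::new())                         // impl NodeCache
    .then(MyTransformer::new())                            // impl Transformer
    .then_in_batch(10, MyBatchTransformer::new())          // impl BatchTransformer
    .then_chunk(MyChunker::new())                          // impl ChunkerTransformer
    .then_store_with(MyStorage::new())                     // impl Persist
    .run()
    .await?;
```
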
1 change: 1 addition & 0 deletions src/content/docs/getting-started/changelog.mdx
@@ -1,5 +1,6 @@
---
title: Changelog
description: The changelog of the Swiftide project.
---

import Changelog from "../../../components/Changelog.astro";
1 change: 1 addition & 0 deletions src/content/docs/getting-started/feature-flags.mdx
@@ -1,5 +1,6 @@
---
title: Feature Flags
description: Available features and integrations in Swiftide.
sidebar:
order: 1
---
1 change: 1 addition & 0 deletions src/content/docs/getting-started/installation.mdx
@@ -1,5 +1,6 @@
---
title: Installation
description: Installation instructions for Swiftide.
sidebar:
order: 0
---
31 changes: 31 additions & 0 deletions src/content/docs/in-depth/caching-and-filtering.md
@@ -0,0 +1,31 @@
---
title: Caching and filtering nodes
description: How to cache and filter nodes in the pipeline.
sidebar:
order: 3
---

When nodes have already been processed by the pipeline, they can often be skipped, speeding up the pipeline and saving costs. A node cache implements the `NodeCache` trait.

## The `NodeCache` trait

Which is defined as follows:

```rust
pub trait NodeCache: Send + Sync + Debug {
    async fn get(&self, node: &IngestionNode) -> bool;
    async fn set(&self, node: &IngestionNode);
}
```

Or in human language: "Given a Node, provide methods to set and get from the cache".
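
As a minimal sketch, an in-memory cache could look like this. It assumes the trait is implemented via `async_trait` and that the import paths below match docs.rs (both are assumptions; the built-in Redis cache is the production option):

```rust
use std::{collections::HashSet, sync::Mutex};

use async_trait::async_trait;
// Import paths assumed; check docs.rs for the authoritative ones.
use swiftide::ingestion::IngestionNode;

#[derive(Debug, Default)]
struct MemoryCache {
    // Keyed by path and chunk, mirroring how nodes are hashed.
    seen: Mutex<HashSet<String>>,
}

#[async_trait]
impl NodeCache for MemoryCache {
    // True if the node was seen before, so the pipeline can skip it.
    async fn get(&self, node: &IngestionNode) -> bool {
        let key = format!("{}:{}", node.path.display(), node.chunk);
        self.seen.lock().unwrap().contains(&key)
    }

    // Mark the node as processed.
    async fn set(&self, node: &IngestionNode) {
        let key = format!("{}:{}", node.path.display(), node.chunk);
        self.seen.lock().unwrap().insert(key);
    }
}
```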

## Built in caches

<small>

| Name | Description | Feature Flag |
| ----- | --------------------------------------------------- | ------------ |
| Redis | Can get and set nodes using multiplexed connections | redis |

</small>
@@ -1,11 +1,14 @@
---
title: Chunking
description: How to chunk nodes in the pipeline.
sidebar:
order: 2
---

For quality metadata it can be important to break up text into smaller parts for both better metadata and retrieval. A chunker implements the `ChunkerTransformer` trait.

## The `ChunkerTransformer` trait

Which is defined as follows:

```rust
// … (remainder of this hunk is collapsed in the diff)
```
@@ -1,5 +1,6 @@
---
title: Overview
title: Step-by-step Introduction
description: A step-by-step introduction on how to use swiftide as a data ingestion pipeline in your project.
sidebar:
order: 0
---
@@ -8,6 +9,8 @@ Swiftide provides a pipeline model. Throughout a pipeline, `IngestionNodes` are t

import { Steps } from "@astrojs/starlight/components";

### A pipeline step-by-step

<Steps>

1. The pipeline starts with a loader:
@@ -86,3 +89,7 @@ import { Steps } from "@astrojs/starlight/components";
```

</Steps>

### Read more

[Reference documentation on docs.rs](https://docs.rs/swiftide/latest/swiftide/)
31 changes: 31 additions & 0 deletions src/content/docs/in-depth/loading-data.md
@@ -0,0 +1,31 @@
---
title: Loading Data
description: How to load data into the pipeline.
sidebar:
order: 1
---

A pipeline starts with data and is only as good as the data it ingests. A loader implements the `Loader` trait.

## The `Loader` trait

Which is defined as follows:

```rust
pub trait Loader {
    fn into_stream(self) -> IngestionStream;
}
```

Or in human language: "I can be turned into a stream". The assumption under the hood is that loaders will yield the data they load as a stream of `IngestionNodes`. These can be files, messages, web pages, and so on.
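
As a sketch, a loader that emits a fixed list of strings might look as follows. The import paths, the `IngestionStream::iter` constructor, and the `Default` implementation on `IngestionNode` are assumptions to verify against docs.rs:

```rust
// Import paths assumed; check docs.rs for the authoritative ones.
use swiftide::ingestion::{IngestionNode, IngestionStream};

struct StaticLoader {
    texts: Vec<String>,
}

impl Loader for StaticLoader {
    fn into_stream(self) -> IngestionStream {
        // Wrap each text in a node; everything else keeps its default value.
        let nodes = self.texts.into_iter().map(|chunk| {
            Ok(IngestionNode {
                chunk,
                ..Default::default()
            })
        });
        IngestionStream::iter(nodes)
    }
}
```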

## Built in loaders

<small>

| Name | Description | Feature Flag |
| -------------- | ------------------------------------------------------------------- | ------------ |
| FileLoader | Loads files with an optional extension filter, respecting gitignore | |
| ScrapingLoader | Scrapes a website using the `spider` crate | scraping |

</small>
43 changes: 43 additions & 0 deletions src/content/docs/in-depth/prompting-embedding.md
@@ -0,0 +1,43 @@
---
title: Prompting and Embedding
description: How to prompt and embed data in the pipeline.
sidebar:
order: 2
---

Our metadata transformers are generic over the `SimplePrompt` trait. This enables different models to be used for different use cases. Similarly, the embedding transformer is generic over the `EmbeddingModel` trait.

## The `SimplePrompt` trait

Which is defined as follows:

```rust
pub trait SimplePrompt: Debug + Send + Sync {
    async fn prompt(&self, prompt: &str) -> Result<String>;
}
```

Or in human language: "Given a Prompt, give me a response".
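
As a sketch, a stub model lets you test a metadata transformer without calling a real LLM; this assumes the trait is implemented via `async_trait`:

```rust
use anyhow::Result;
use async_trait::async_trait;

#[derive(Debug, Clone)]
struct EchoPrompt;

#[async_trait]
impl SimplePrompt for EchoPrompt {
    // Simply echoes the prompt back as the response.
    async fn prompt(&self, prompt: &str) -> Result<String> {
        Ok(format!("echo: {prompt}"))
    }
}
```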

## The `EmbeddingModel` trait

Which is defined as follows:

```rust
pub trait EmbeddingModel: Send + Sync {
    async fn embed(&self, input: Vec<String>) -> Result<Embeddings>;
}
```

Or in human language: "Given a list of things to Embed, give me embeddings". The embedding transformer will link back the embeddings to the original nodes by _order_.
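
A matching sketch for embeddings, assuming `Embeddings` is a list of vectors, one per input and in the same order:

```rust
use anyhow::Result;
use async_trait::async_trait;

#[derive(Debug, Clone)]
struct ZeroEmbedder;

#[async_trait]
impl EmbeddingModel for ZeroEmbedder {
    // One zero vector per input, preserving order so the transformer can
    // link each embedding back to its node.
    async fn embed(&self, input: Vec<String>) -> Result<Embeddings> {
        Ok(input.iter().map(|_| vec![0.0; 1536]).collect())
    }
}
```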

## Built in inference and embedding models

<small>

| Name | Description | Feature Flag |
| --------- | --------------------------------------------------------- | ------------ |
| OpenAI | Implements both SimplePrompt and Embed via `async_openai` | openai |
| FastEmbed | Implements Embed via `fastembed-rs` | fastembed |

</small>
39 changes: 39 additions & 0 deletions src/content/docs/in-depth/storing-results.md
@@ -0,0 +1,39 @@
---
title: Storing the results
description: How to store the results of the pipeline.
sidebar:
order: 5
---

After processing nodes in the pipeline, you probably want to store the results. Pipelines support multiple storage steps, but need at least one. A storage implements the `Persist` trait.

## The `Persist` trait

Which is defined as follows:

```rust
pub trait Persist: Debug + Send + Sync {
    async fn setup(&self) -> Result<()>;
    async fn store(&self, node: IngestionNode) -> Result<IngestionNode>;
    async fn batch_store(&self, nodes: Vec<IngestionNode>) -> IngestionStream;

    fn batch_size(&self) -> Option<usize> {
        None
    }
}
```

Setup functions run right away, asynchronously, when the pipeline starts. This could include setting up collections, tables, connections, and so on. Because more might happen after storing, both `store` and `batch_store` are expected to return the nodes they processed.

If `batch_size` is implemented for the storage, the stream will always prefer `batch_store`.
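
As a sketch, an in-memory store along the lines of the built-in `MemoryStorage`; it assumes `async_trait` and an `IngestionStream::iter` constructor (both assumptions to verify against docs.rs):

```rust
use std::sync::Mutex;

use anyhow::Result;
use async_trait::async_trait;
// Import paths assumed; check docs.rs for the authoritative ones.
use swiftide::ingestion::{IngestionNode, IngestionStream};

#[derive(Debug, Default)]
struct VecStorage {
    nodes: Mutex<Vec<IngestionNode>>,
}

#[async_trait]
impl Persist for VecStorage {
    // Nothing to create for an in-memory store.
    async fn setup(&self) -> Result<()> {
        Ok(())
    }

    // Hand the node back so later steps can keep working on it.
    async fn store(&self, node: IngestionNode) -> Result<IngestionNode> {
        self.nodes.lock().unwrap().push(node.clone());
        Ok(node)
    }

    async fn batch_store(&self, nodes: Vec<IngestionNode>) -> IngestionStream {
        self.nodes.lock().unwrap().extend(nodes.clone());
        IngestionStream::iter(nodes.into_iter().map(Ok))
    }

    // Opt in to batching so the stream prefers `batch_store`.
    fn batch_size(&self) -> Option<usize> {
        Some(64)
    }
}
```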

## Built in storage

<small>

| Name | Description | Feature Flag |
| ------------- | ---------------------------------------------------- | ------------ |
| Redis         | Persists nodes by default as JSON                     | redis        |
| Qdrant        | Persists nodes in Qdrant; expects a vector to be set  | qdrant       |
| MemoryStorage | Persists nodes in memory; great for debugging | |

</small>