
Merge pull request #15 from LangStream/impl/blog-ollama
Blog post about Ollama.ai
cdbartholomew authored Nov 21, 2023
2 parents 40c2c6f + c93addf commit 7c831ca
Showing 5 changed files with 212 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -3,3 +3,5 @@ _site/
.jekyll-metadata/
.jekyll-cache/
**/.DS_Store

vendor/bundle/
2 changes: 2 additions & 0 deletions _posts/2023-10-04-introducing-mmr-rerank.md
@@ -15,6 +15,7 @@ The default implementation of MMR in LangStream is based on the CMU paper “[Ma
Now, let’s look at an example of how to use the re-rank agent in a LangStream pipeline. Here’s the full pipeline:

```yaml
{% raw %}
pipeline:
  - name: "convert-to-structure"
    type: "document-to-json"
@@ -50,6 +51,7 @@ pipeline:
      lambda: 0.5
      k1: 1.2
      b: 0.75
{% endraw %}
```

Let’s break it down. We start by converting the input document to a JSON structure using the `convert-to-structure` agent. Next, we compute the embeddings of the input text using the `compute-ai-embeddings` agent and store them in a field called “value.question_embeddings”. We then use the `query-vector-db` agent to query the vector database for the top 20 documents related to the input text, and store them in a field called “value.related_documents”.
2 changes: 2 additions & 0 deletions _posts/2023-11-08-scalable-pipelines.md
@@ -21,6 +21,7 @@ A typical pipeline that brings your data to the vector database is like this:
Now, let’s look at an example of all these steps in a LangStream pipeline:

```yaml
{% raw %}
name: "Extract and manipulate text"
assets:
- name: "documents-table"
@@ -110,6 +111,7 @@ pipeline:
expression: "value.text"
- name: "num_tokens"
expression: "value.chunk_num_tokens"
{% endraw %}
```


206 changes: 206 additions & 0 deletions _posts/2023-11-21-ollama.md
@@ -0,0 +1,206 @@
---
title: Using Ollama in LangStream
categories:
image: /images/ollama.png
author_staff_member: enrico
date: November 21, 2023
---


[Ollama](https://ollama.ai) is a tool that lets you easily deploy and manage open source LLMs. You can even run a Large Language Model locally and augment it using the Retrieval Augmented Generation (RAG) technique.

In this article we will show you how to use Ollama in LangStream to build a chatbot, combining the power of Ollama models with the RAG pattern.

To build a Generative AI application you need a Large Language Model (LLM), and you have two main options:
- use an LLM as a service, such as OpenAI, Google Vertex, or AWS Bedrock
- run an open source model, such as Llama2 or other models from Hugging Face

Running an open source model is great for local development and testing. For production, you may want to use an LLM as a service, especially since LLM services now offer open source models.

## Starting an Ollama model

You can download the Ollama runtime from the [Ollama website](https://ollama.ai) and follow the instructions to run it locally.

Running a model locally requires a lot of resources, depending on the size of the model.

Let's start with the Llama2 13b model, a relatively small model that can run on a laptop with 16GB of RAM.

Once you have Ollama installed, you can start the LLM with this command:

```bash
ollama run llama2:13b
```

You can start chatting with the model from the CLI, but you won't be able to use RAG to augment the model's knowledge.

The Ollama runtime exposes an HTTP API that allows you to interact with the model programmatically.
This is what we use to integrate it with LangStream.
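
Before wiring the model into a pipeline, you can sanity-check it by calling the API directly. Here is a minimal sketch, assuming Ollama is listening on its default port on localhost (the prompt is just an illustration):

```bash
# Query the local Ollama HTTP API directly (default port 11434)
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "What is Retrieval Augmented Generation?",
  "stream": false
}'
```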


## Connecting to the model


To connect to the LLM, declare an `ollama-configuration` resource in `configuration.yaml`:

```yaml
resources:
- type: "ollama-configuration"
name: "ollama"
id: "ollama"
configuration:
url: "http://host.docker.internal:11434"
```

In this example we use `http://host.docker.internal:11434` as the URL because, during local development, your LangStream application typically runs in a Docker container, and the special hostname `host.docker.internal` lets the container reach the host machine. Port 11434 is the default port used by the Ollama runtime.

In LangStream it is very easy to swap the LLM: it is usually only a matter of changing the declaration of the LLM resource in `configuration.yaml`. So if you are already trying LangStream with OpenAI, you can just add the Ollama configuration and start using it.
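
As an illustration, a `configuration.yaml` that declares both providers side by side could look roughly like this (a hypothetical sketch: the exact fields of the OpenAI resource and the secret names depend on your setup). Each agent then picks a provider via its `ai-service` setting:

```yaml
resources:
  # Sketch of an OpenAI resource; field names and secrets are assumptions for illustration
  - type: "open-ai-configuration"
    name: "OpenAI"
    id: "openai"
    configuration:
      access-key: "${secrets.open-ai.access-key}"
  # Ollama resource, as shown above
  - type: "ollama-configuration"
    name: "ollama"
    id: "ollama"
    configuration:
      url: "http://host.docker.internal:11434"
```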

Please note that the prompt may need to change from what you use with OpenAI. The results depend on how the model has been trained, and some models (especially the Llama2 models) tend to be chattier than OpenAI's models in their responses and may require different prompt engineering.

## Using the Chat Completion API

Now, to query the LLM, it is enough to use the `ai-chat-completions` agent like this:

```yaml
{% raw %}
- name: "ai-chat-completions"
type: "ai-chat-completions"
configuration:
ai-service: "ollama"
model: "${secrets.ollama.model}"
completion-field: "value.answer"
log-field: "value.prompt"
stream-to-topic: "answers-topic"
stream-response-completion-field: "value"
min-chunks-per-message: 20
messages:
- content: |
I have this list of documents that may help you in answering to my question below.
Documents:
{{# value.related_documents}}
{{ text}}
{{/ value.related_documents}}
Please try to leverage the content above in your answer as much as possible.
Take into consideration that I am always asking questions about the LangStream project.
Do not provide information that is not related to the LangStream project.
This is my question:
{{ value.question}}
{% endraw %}
```
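
Note that the model name is not hard-coded: `${secrets.ollama.model}` is resolved from the application secrets, so you can switch models without editing the pipeline. A minimal sketch of the corresponding `secrets.yaml` entry, assuming you want to use the Llama2 13b model started earlier:

```yaml
secrets:
  # Sketch of a secret entry; adjust the layout to your application
  - name: "ollama"
    id: "ollama"
    data:
      model: "llama2:13b"
```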

## Implementing RAG

RAG is all about adding relevant documents to the prompt that you send to the LLM. This additional context is used to improve the quality of the responses.

To feed more context to the LLM in the prompt, you populate the `value.related_documents` field with the results of a query against a vector database.

In the case of a chatbot, you usually start from a question from the user, then you perform a similarity search on a vector database and load more context about the user and the
current conversation. Finally, you put all the relevant data into the prompt and send it to the LLM.

This is a typical RAG pipeline:


```yaml
{% raw %}
topics:
  - name: "questions-topic"
    creation-mode: create-if-not-exists
  - name: "answers-topic"
    creation-mode: create-if-not-exists
  - name: "log-topic"
    creation-mode: create-if-not-exists
errors:
  on-failure: "skip"
pipeline:
  - name: "convert-to-structure"
    type: "document-to-json"
    input: "questions-topic"
    configuration:
      text-field: "question"
  - name: "compute-embeddings"
    type: "compute-ai-embeddings"
    configuration:
      ai-service: "huggingface"
      model: "${secrets.hugging-face.embeddings-model}" # This is the id of the model
      model-url: "${secrets.hugging-face.embeddings-model-url}" # This is the URL of the repository containing the model
      embeddings-field: "value.question_embeddings"
      text: "{{ value.question }}"
      flush-interval: 0
  - name: "lookup-related-documents"
    type: "query-vector-db"
    configuration:
      datasource: "JdbcDatasource"
      query: "SELECT text,embeddings_vector FROM documents ORDER BY cosine_similarity(embeddings_vector, CAST(? as FLOAT ARRAY)) DESC LIMIT 20"
      fields:
        - "value.question_embeddings"
      output-field: "value.related_documents"
  - name: "re-rank documents with MMR"
    type: "re-rank"
    configuration:
      max: 10 # keep only the top 10 documents, because we have a hard limit on the prompt size
      field: "value.related_documents"
      query-text: "value.question"
      query-embeddings: "value.question_embeddings"
      output-field: "value.related_documents"
      text-field: "record.text"
      embeddings-field: "record.embeddings_vector"
      algorithm: "MMR"
      lambda: 0.5
      k1: 1.2
      b: 0.75
  - name: "log documents"
    type: "log-event"
  - name: "ai-chat-completions"
    type: "ai-chat-completions"
    configuration:
      ai-service: "ollama"
      model: "${secrets.ollama.model}"
      # on the log-topic we add a field with the answer
      completion-field: "value.answer"
      # we also log the prompt we sent to the LLM
      log-field: "value.prompt"
      # here we configure the streaming behavior:
      # as soon as the LLM answers with a chunk we send it to the answers-topic
      stream-to-topic: "answers-topic"
      # on the streaming answer we send the answer as a whole message
      # the 'value' syntax is used to refer to the whole value of the message
      stream-response-completion-field: "value"
      # we want to stream the answer as soon as we have 20 chunks;
      # to reduce latency, the agent sends the first message with 1 chunk,
      # then with 2 chunks... up to the min-chunks-per-message value;
      # eventually we send bigger messages to reduce the per-message overhead on the topic
      min-chunks-per-message: 20
      messages:
        - content: |
            I have this list of documents that may help you in answering to my question below.
            Documents:
            {{# value.related_documents}}
            {{ text}}
            {{/ value.related_documents}}
            Please try to leverage the content above in your answer as much as possible.
            Take into consideration that I am always asking questions about the LangStream project.
            Do not provide information that is not related to the LangStream project.
            This is my question:
            {{ value.question}}
{% endraw %}
```

With this pipeline definition you can experiment with the Llama2 13b model and augment its knowledge with your own data right from your laptop. The pipeline also uses an embedding model downloaded from Hugging Face that runs locally. The entire pipeline can run on your laptop while flying, without having to pay outrageous fees for in-flight WiFi (or to the LLM service providers).

Please refer to the LangStream documentation to learn the details of building a [RAG pipeline](https://docs.langstream.ai/patterns/rag-pattern).
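
To try the whole flow end to end, you can run the application locally in Docker with the LangStream CLI. Here is a sketch, assuming the CLI is installed and that your application files and secrets live in the (hypothetical) paths shown:

```bash
# Run the application locally in Docker; the paths are placeholders for your own layout
langstream docker run test \
  -app ./ollama-rag-app \
  -s ./secrets.yaml
```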


## Conclusion

Using open source LLMs is a great way to build AI applications, but it requires some effort to manage the models and the infrastructure.
With LangStream and Ollama you can seamlessly develop and test open source models on your laptop and build sophisticated AI applications.


Please send us feedback in [Slack](https://join.slack.com/t/langstream/shared_invite/zt-21leloc9c-lNaGLdiecHuWU5N31L2AeQ){:target="_blank"} or [Linen](https://www.linen.dev/invite/langstream){:target="_blank"}. If you find a bug, please open a [GitHub issue](https://github.com/LangStream/langstream/issues){:target="_blank"}.
Binary file added images/ollama.png
