
Merge pull request #15 from LangStream/impl/blog-ollama
Blog post about Ollama.ai
cdbartholomew authored Nov 21, 2023
2 parents 40c2c6f + c93addf commit 7c831ca
Showing 5 changed files with 212 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -3,3 +3,5 @@ _site/
.jekyll-metadata/
.jekyll-cache/
**/.DS_Store

vendor/bundle/
2 changes: 2 additions & 0 deletions _posts/2023-10-04-introducing-mmr-rerank.md
@@ -15,6 +15,7 @@ The default implementation of MMR in LangStream is based on the CMU paper “[Ma
Now, let’s look at an example of how to use the re-rank agent in a LangStream pipeline. Here’s the full pipeline:

```yaml
{% raw %}
pipeline:
  - name: "convert-to-structure"
    type: "document-to-json"
@@ -50,6 +51,7 @@ pipeline:
      lambda: 0.5
      k1: 1.2
      b: 0.75
{% endraw %}
```

Let’s break it down. We start by converting the input document to a JSON structure using the `convert-to-structure` agent. Next, we compute the embeddings of the input text using the `compute-ai-embeddings` agent and store them in a field called “value.question_embeddings”. We then use the `query-vector-db` agent to query the vector database for the top 20 documents related to the input text, and store them in a field called “value.related_documents”.
2 changes: 2 additions & 0 deletions _posts/2023-11-08-scalable-pipelines.md
@@ -21,6 +21,7 @@ A typical pipeline that brings your data to the vector database is like this:
Now, let’s look at an example of all these steps in a LangStream pipeline:

```yaml
{% raw %}
name: "Extract and manipulate text"
assets:
- name: "documents-table"
@@ -110,6 +111,7 @@ pipeline:
expression: "value.text"
- name: "num_tokens"
expression: "value.chunk_num_tokens"
{% endraw %}
```


206 changes: 206 additions & 0 deletions _posts/2023-11-21-ollama.md
@@ -0,0 +1,206 @@
---
title: Using Ollama in LangStream
categories:
image: /images/ollama.png
author_staff_member: enrico
date: November 21, 2023
---


[Ollama](https://ollama.ai) is a tool that lets you easily deploy and manage open source LLMs. You can even run a Large Language Model locally and augment it using the Retrieval Augmented Generation (RAG) technique.

In this article we will show you how to use Ollama in LangStream to build a chatbot, combining the power of Ollama models with the RAG pattern.

To build a Generative AI application you need a Large Language Model (LLM), and you have two main options:
- use an LLM as a service, such as OpenAI, Google Vertex, or AWS Bedrock
- run an open source model, such as Llama2 or other models from Hugging Face

Running an open source model is great for local development and testing. For production, you may want to use an LLM as a service, especially since LLM services now offer open source models.

## Starting an Ollama model

You can download the Ollama runtime from the [Ollama website](https://ollama.ai) and follow the instructions to run it locally.

Running a model locally requires a lot of resources, depending on the size of the model.

Let's start with the Llama2 13b model, a relatively small model that can run on a laptop with 16GB of RAM.

Once you have Ollama installed, you can start the LLM with this command:

```bash
ollama run llama2:13b
```

You can start chatting with the model from the CLI, but you won't be able to use RAG to augment the model's knowledge.

The Ollama runtime exposes an HTTP API that allows you to interact with the model programmatically.
This is what we use to integrate it with LangStream.
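
Before wiring the model into a pipeline, you can sanity-check it by calling the API directly. Here is a minimal sketch, assuming Ollama is listening on its default port on localhost (the prompt is just an illustration):

```bash
# Query the local Ollama HTTP API directly (default port 11434)
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "What is Retrieval Augmented Generation?",
  "stream": false
}'
```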


## Connecting to the model


To connect to the LLM, declare an `ollama-configuration` resource in `configuration.yaml`:

```yaml
resources:
- type: "ollama-configuration"
name: "ollama"
id: "ollama"
configuration:
url: "http://host.docker.internal:11434"
```

In this example we use `http://host.docker.internal:11434` as the URL because, during local development, your LangStream application typically runs in a Docker container, and the special hostname `host.docker.internal` lets the container reach the host machine. Port 11434 is the default port used by the Ollama runtime.

In LangStream it is very easy to swap the LLM: it is usually only a matter of changing the declaration of the LLM resource in `configuration.yaml`. So if you are already trying LangStream with OpenAI, you can just add the Ollama configuration and start using it.
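
As an illustration, a `configuration.yaml` that declares both providers side by side could look roughly like this (a hypothetical sketch: the exact fields of the OpenAI resource and the secret names depend on your setup). Each agent then picks a provider via its `ai-service` setting:

```yaml
resources:
  # Sketch of an OpenAI resource; field names and secrets are assumptions for illustration
  - type: "open-ai-configuration"
    name: "OpenAI"
    id: "openai"
    configuration:
      access-key: "${secrets.open-ai.access-key}"
  # Ollama resource, as shown above
  - type: "ollama-configuration"
    name: "ollama"
    id: "ollama"
    configuration:
      url: "http://host.docker.internal:11434"
```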

Please note that the prompt may need to change from what you use with OpenAI. The results depend on how the model has been trained, and some models (especially the Llama2 models) tend to be chattier than OpenAI's models in their responses and may require different prompt engineering.

## Using the Chat Completion API

Now, to query the LLM, it is enough to use the `ai-chat-completions` agent like this:

```yaml
{% raw %}
- name: "ai-chat-completions"
type: "ai-chat-completions"
configuration:
ai-service: "ollama"
model: "${secrets.ollama.model}"
completion-field: "value.answer"
log-field: "value.prompt"
stream-to-topic: "answers-topic"
stream-response-completion-field: "value"
min-chunks-per-message: 20
messages:
- content: |
I have this list of documents that may help you in answering to my question below.
Documents:
{{# value.related_documents}}
{{ text}}
{{/ value.related_documents}}
Please try to leverage the content above in your answer as much as possible.
Take into consideration that I am always asking questions about the LangStream project.
Do not provide information that is not related to the LangStream project.
This is my question:
{{ value.question}}
{% endraw %}
```
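
Note that the model name is not hard-coded: `${secrets.ollama.model}` is resolved from the application secrets, so you can switch models without editing the pipeline. A minimal sketch of the corresponding `secrets.yaml` entry, assuming you want to use the Llama2 13b model started earlier:

```yaml
secrets:
  # Sketch of a secret entry; adjust the layout to your application
  - name: "ollama"
    id: "ollama"
    data:
      model: "llama2:13b"
```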

## Implementing RAG

RAG is all about adding relevant documents to the prompt that you send to the LLM. This additional context is used to improve the quality of the responses.

To feed more context to the LLM in the prompt, you populate the `value.related_documents` field with the results of a query against a vector database.

In the case of a chatbot, you usually start from a question from the user, then you perform a similarity search on a vector database and load more context about the user and the
current conversation. Finally, you put all the relevant data into the prompt and send it to the LLM.

This is a typical RAG pipeline:


```yaml
{% raw %}
topics:
  - name: "questions-topic"
    creation-mode: create-if-not-exists
  - name: "answers-topic"
    creation-mode: create-if-not-exists
  - name: "log-topic"
    creation-mode: create-if-not-exists
errors:
  on-failure: "skip"
pipeline:
  - name: "convert-to-structure"
    type: "document-to-json"
    input: "questions-topic"
    configuration:
      text-field: "question"
  - name: "compute-embeddings"
    type: "compute-ai-embeddings"
    configuration:
      ai-service: "huggingface"
      model: "${secrets.hugging-face.embeddings-model}" # This is the id of the model
      model-url: "${secrets.hugging-face.embeddings-model-url}" # This is the URL of the repository containing the model
      embeddings-field: "value.question_embeddings"
      text: "{{ value.question }}"
      flush-interval: 0
  - name: "lookup-related-documents"
    type: "query-vector-db"
    configuration:
      datasource: "JdbcDatasource"
      query: "SELECT text,embeddings_vector FROM documents ORDER BY cosine_similarity(embeddings_vector, CAST(? as FLOAT ARRAY)) DESC LIMIT 20"
      fields:
        - "value.question_embeddings"
      output-field: "value.related_documents"
  - name: "re-rank documents with MMR"
    type: "re-rank"
    configuration:
      max: 10 # keep only the top 10 documents, because we have a hard limit on the prompt size
      field: "value.related_documents"
      query-text: "value.question"
      query-embeddings: "value.question_embeddings"
      output-field: "value.related_documents"
      text-field: "record.text"
      embeddings-field: "record.embeddings_vector"
      algorithm: "MMR"
      lambda: 0.5
      k1: 1.2
      b: 0.75
  - name: "log documents"
    type: "log-event"
  - name: "ai-chat-completions"
    type: "ai-chat-completions"
    configuration:
      ai-service: "ollama"
      model: "${secrets.ollama.model}"
      # on the log-topic we add a field with the answer
      completion-field: "value.answer"
      # we also log the prompt we sent to the LLM
      log-field: "value.prompt"
      # here we configure the streaming behavior:
      # as soon as the LLM answers with a chunk we send it to the answers-topic
      stream-to-topic: "answers-topic"
      # on the streaming answer we send the answer as a whole message
      # the 'value' syntax is used to refer to the whole value of the message
      stream-response-completion-field: "value"
      # we want to stream the answer as soon as we have 20 chunks;
      # to reduce latency, the agent sends the first message with 1 chunk,
      # then with 2 chunks... up to the min-chunks-per-message value;
      # eventually we send bigger messages to reduce the per-message overhead on the topic
      min-chunks-per-message: 20
      messages:
        - content: |
            I have this list of documents that may help you in answering to my question below.
            Documents:
            {{# value.related_documents}}
            {{ text}}
            {{/ value.related_documents}}
            Please try to leverage the content above in your answer as much as possible.
            Take into consideration that I am always asking questions about the LangStream project.
            Do not provide information that is not related to the LangStream project.
            This is my question:
            {{ value.question}}
{% endraw %}
```

With this pipeline definition you can experiment with the Llama2 13b model and augment its knowledge with your own data right from your laptop. The pipeline also uses an embedding model downloaded from Hugging Face that runs locally. The entire pipeline can run on your laptop while flying, without having to pay outrageous fees for in-flight WiFi (or to the LLM service providers).

Please refer to the LangStream documentation to learn the details of building a [RAG pipeline](https://docs.langstream.ai/patterns/rag-pattern).
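
To try the whole flow end to end, you can run the application locally in Docker with the LangStream CLI. Here is a sketch, assuming the CLI is installed and that your application files and secrets live in the (hypothetical) paths shown:

```bash
# Run the application locally in Docker; the paths are placeholders for your own layout
langstream docker run test \
  -app ./ollama-rag-app \
  -s ./secrets.yaml
```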


## Conclusion

Using open source LLMs is a great way to build AI applications, but it requires some effort to manage the models and the infrastructure.
With LangStream and Ollama you can seamlessly develop and test open source models on your laptop and build sophisticated AI applications.


Please send us feedback in [Slack](https://join.slack.com/t/langstream/shared_invite/zt-21leloc9c-lNaGLdiecHuWU5N31L2AeQ){:target="_blank"} or [Linen](https://www.linen.dev/invite/langstream){:target="_blank"}. If you find a bug, please open a [GitHub issue](https://github.com/LangStream/langstream/issues){:target="_blank"}.
Binary file added images/ollama.png
