v1 of structured_output chapter
souzatharsis committed Nov 22, 2024
1 parent 1af8ea8 commit 5c36c3f
Showing 42 changed files with 5,290 additions and 239 deletions.
929 changes: 928 additions & 1 deletion poetry.lock

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions pyproject.toml
@@ -27,6 +27,9 @@ sphinx-tojupyter = "^0.3.0"
sphinx-togglebutton = "^0.3.2"
insipid-sphinx-theme = "^0.4.2"
sphinx-press-theme = "^0.9.1"
langchain-openai = "^0.2.9"
outlines = "^0.1.5"
transformers = "^4.46.3"


[build-system]
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file modified tamingllms/_build/.doctrees/markdown/toc.doctree
Binary file modified tamingllms/_build/.doctrees/notebooks/structured_output.doctree
2 changes: 1 addition & 1 deletion tamingllms/_build/html/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 36945de024d591eca72cb43977f35263
config: d5b2da7e1983f3da43c754f98d2e7fad
tags: 645f666f9bcd5a90fca523b33c5a78b7
53 changes: 53 additions & 0 deletions tamingllms/_build/html/_sources/markdown/markdown-notebooks.md
@@ -0,0 +1,53 @@
---
jupytext:
formats: md:myst
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.11.5
kernelspec:
display_name: Python 3
language: python
name: python3
---

# Notebooks with MyST Markdown

Jupyter Book also lets you write text-based notebooks using MyST Markdown.
See [the Notebooks with MyST Markdown documentation](https://jupyterbook.org/file-types/myst-notebooks.html) for more detailed instructions.
This page shows off a notebook written in MyST Markdown.

## An example cell

With MyST Markdown, you can define code cells with a directive like so:

```{code-cell}
print(2 + 2)
```

When your book is built, the contents of any `{code-cell}` blocks will be
executed with your default Jupyter kernel, and their outputs will be displayed
in-line with the rest of your content.

```{seealso}
Jupyter Book uses [Jupytext](https://jupytext.readthedocs.io/en/latest/) to convert text-based files to notebooks, and can support [many other text-based notebook files](https://jupyterbook.org/file-types/jupytext.html).
```

## Create a notebook with MyST Markdown

MyST Markdown notebooks are defined by two things:

1. YAML metadata that is needed to understand if / how it should convert text files to notebooks (including information about the kernel needed).
See the YAML at the top of this page for example.
2. The presence of `{code-cell}` directives, which will be executed with your book.

That's all that is needed to get started!

## Quickly add YAML metadata for MyST Notebooks

If you have a markdown file and you'd like to quickly add YAML metadata to it, so that Jupyter Book will treat it as a MyST Markdown Notebook, run the following command:

```
jupyter-book myst init path/to/markdownfile.md
```
55 changes: 55 additions & 0 deletions tamingllms/_build/html/_sources/markdown/markdown.md
@@ -0,0 +1,55 @@
# Markdown Files

Whether you write your book's content in Jupyter Notebooks (`.ipynb`) or
in regular markdown files (`.md`), you'll write in the same flavor of markdown
called **MyST Markdown**.
This is a simple file to help you get started and show off some syntax.

## What is MyST?

MyST stands for "Markedly Structured Text". It
is a slight variation on a flavor of markdown called "CommonMark" markdown,
with small syntax extensions to allow you to write **roles** and **directives**
in the Sphinx ecosystem.

For more about MyST, see [the MyST Markdown Overview](https://jupyterbook.org/content/myst.html).

## Sample Roles and Directives

Roles and directives are two of the most powerful tools in Jupyter Book. They
are like functions, but written in a markup language. They both
serve a similar purpose, but **roles are written in one line**, whereas
**directives span many lines**. They both accept different kinds of inputs,
and what they do with those inputs depends on the specific role or directive
that is being called.

Here is a "note" directive:

```{note}
Here is a note
```

It will be rendered in a special box when you build your book.

Here is an inline directive to refer to a document: {doc}`markdown-notebooks`.


## Citations

You can also cite references that are stored in a `bibtex` file. For example,
the following syntax: `` {cite}`holdgraf_evidence_2014` `` will render like
this: {cite}`holdgraf_evidence_2014`.

Moreover, you can insert a bibliography into your page with this syntax:
The `{bibliography}` directive must be used for all the `{cite}` roles to
render properly.
For example, if the references for your book are stored in `references.bib`,
then the bibliography is inserted with:

```{bibliography}
```

## Learn more

This is just a simple starter to get you started.
You can learn a lot more at [jupyterbook.org](https://jupyterbook.org).
187 changes: 187 additions & 0 deletions tamingllms/_build/html/_sources/markdown/output_size_limit-NOTES.md
@@ -0,0 +1,187 @@
# Output Size Limitations

## What are Token Limits?

Tokens are the basic units in which LLMs process text. A token can be as short as a single character or as long as a complete word. In English, a general rule of thumb is that 1 token ≈ 4 characters, or about ¾ of a word.
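
The exact character-to-token ratio depends on the tokenizer used by each model family. As a quick illustration, the sketch below counts tokens with the `tiktoken` package (an assumption here; it is not among the dependencies added in this commit) using the `cl100k_base` encoding:

```python
import tiktoken  # assumed extra dependency: pip install tiktoken

text = "Tokens are the basic units in which LLMs process text."
encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by many recent OpenAI models
tokens = encoding.encode(text)

print(f"{len(text)} characters -> {len(tokens)} tokens")
# Roughly 4 characters per token for typical English prose.
```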

`max_output_tokens` is a parameter, available in most modern LLM APIs, that determines the maximum length of text the model can generate in a single response. Contrary to what one might expect, the model does not summarize its answer so that it fits within the `max_output_tokens` limit. Instead, it simply stops generating once the limit is reached, even mid-sentence, i.e. the response may be truncated.

**Token Cost and Length Limitation Comparison Across Key Models** (costs in USD per token)

| Model | max_output_tokens | max_input_tokens | input_cost_per_token | output_cost_per_token |
|------------------------------|-------------------|------------------|----------------------|-----------------------|
| meta.llama3-2-11b-instruct-v1:0 | 4096 | 128000 | 3.5e-7 | 3.5e-7 |
| claude-3-5-sonnet-20241022 | 8192 | 200000 | 3e-6 | 1.5e-5 |
| gpt-4-0613 | 4096 | 8192 | 3e-5 | 6e-5 |
| gpt-4-turbo-2024-04-09 | 4096 | 128000 | 1e-5 | 3e-5 |
| gpt-4o-mini | 16384 | 128000 | 1.5e-7 | 6e-7 |
| gemini/gemini-1.5-flash-002 | 8192 | 1048576 | 7.5e-8 | 3e-7 |
| gemini/gemini-1.5-pro-002 | 8192 | 2097152 | 3.5e-6 | 1.05e-5 |
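
When the limit is reached, the API reports that generation stopped early. A minimal sketch, assuming the official `openai` Python client and an `OPENAI_API_KEY` in the environment (the model name and prompt below are illustrative), requests a deliberately small `max_tokens` and checks whether the response was cut off:

```python
from openai import OpenAI  # assumed dependency: pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": "Write a detailed report on renewable energy."}],
    max_tokens=50,  # deliberately small output budget
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # The model stopped because it hit max_tokens, not because it finished its answer.
    print("Truncated output:", choice.message.content)
```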

## Problem Statement

The `max_output_tokens` limit in LLMs poses a significant challenge for users who need to generate long outputs, as it may result in truncated content and/or incomplete information.

1. **Truncated Content**: Users aiming to generate extensive content, such as detailed reports or comprehensive articles, may find their outputs abruptly cut off due to the `max_output_tokens` limit. This truncation can result in incomplete information and disrupt the flow of the content.

2. **Shallow Responses**: When users expect a complete and thorough response but receive only a partial output, it can lead to dissatisfaction and frustration. This is especially true in applications where the completeness of information is critical, such as in educational tools or content creation platforms.

To effectively address these challenges, developers need to implement robust solutions that balance user expectations with technical and cost constraints, ensuring that long-form content generation remains feasible and efficient.

### Content Chunking with Contextual Linking

Content chunking with contextual linking is a technique used to manage the `max_output_tokens` limitation by breaking down long-form content into smaller, manageable chunks. This approach allows the LLM to focus on smaller sections of the input, enabling it to generate more complete and detailed responses for each chunk while maintaining coherence and context across the entire output.

1. **Chunking the Content**: The input content is split into smaller chunks. This allows the LLM to process each chunk individually, focusing on generating a complete and detailed response for that specific section of the input.

2. **Maintaining Context**: Each chunk is linked with contextual information from the previous chunks. This helps in maintaining the flow and coherence of the content across multiple chunks.

3. **Generating Linked Prompts**: For each chunk, a prompt is generated that includes the chunk's content and its context. This prompt is then used to generate the output for that chunk.

4. **Combining the Outputs**: The outputs of all chunks are combined to form the final long-form content.

By following these steps, developers can effectively manage the `max_output_tokens` limitation and generate coherent long-form content without truncation.

Let's examine a minimal implementation of this chunking approach:

```python
from typing import List, Dict


class ConversationGenerator:
    def __init__(self, api_client):
        self.api_client = api_client

    def chunk_content(self, input_content: str, chunk_size: int = 1000) -> List[str]:
        """Split input content into manageable chunks while preserving context."""
        sentences = input_content.split('. ')
        chunks = []
        current_chunk = []
        current_length = 0

        for sentence in sentences:
            # Skip empty fragments left over from trailing separators.
            sentence = sentence.strip()
            if not sentence:
                continue
            sentence_length = len(sentence)
            if current_length + sentence_length > chunk_size and current_chunk:
                chunks.append('. '.join(current_chunk) + '.')
                current_chunk = []
                current_length = 0
            current_chunk.append(sentence)
            current_length += sentence_length

        if current_chunk:
            chunks.append('. '.join(current_chunk) + '.')
        return chunks

    def generate_conversation_prompts(self, content_chunks: List[str]) -> List[Dict]:
        """Generate linked conversation prompts."""
        prompts = []
        for i, chunk in enumerate(content_chunks):
            prompt = {
                "part": i + 1,
                "total_parts": len(content_chunks),
                "content": chunk,
                "context": self._get_context(i, prompts)
            }
            prompts.append(prompt)
        return prompts

    def _get_context(self, part_index: int, previous_prompts: List[Dict]) -> str:
        """Generate context from previous parts."""
        if part_index == 0:
            return "Start of conversation"
        # Carry a short tail of the previous chunk forward to preserve continuity.
        previous_content = previous_prompts[-1]["content"]
        return f"Continuing from part {part_index}. Previous part ended with: ...{previous_content[-200:]}"
```
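
The class above covers chunking and prompt generation (steps 1 to 3); step 4, generating the per-chunk outputs and combining them, could look like the sketch below. It uses `ChatOpenAI` from the `langchain-openai` package added to `pyproject.toml` in this commit; the model name, prompt wording, and simple concatenation strategy are illustrative assumptions rather than part of the original design:

```python
from langchain_openai import ChatOpenAI


def generate_long_form(generator: ConversationGenerator, input_content: str) -> str:
    """Run every linked prompt through the model and stitch the outputs together."""
    llm = ChatOpenAI(model="gpt-4o-mini")  # illustrative model choice
    chunks = generator.chunk_content(input_content)
    prompts = generator.generate_conversation_prompts(chunks)

    outputs = []
    for prompt in prompts:
        # Fold the linking context into the request so each call stays coherent.
        message = (
            f"{prompt['context']}\n\n"
            f"Part {prompt['part']} of {prompt['total_parts']}:\n{prompt['content']}\n\n"
            "Write the long-form answer for this part."
        )
        outputs.append(llm.invoke(message).content)

    # Step 4: combine the per-chunk outputs into the final long-form content.
    return "\n\n".join(outputs)
```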

### Testing the Implementation

Here's how to test the chunking functionality:

```python
import pytest
from conversation_generator import ConversationGenerator


def test_content_chunking():
    generator = ConversationGenerator(None)
    long_content = "First sentence. Second sentence. " * 100
    chunks = generator.chunk_content(long_content, chunk_size=100)

    assert len(chunks) > 1
    for chunk in chunks:
        assert len(chunk) <= 100
        assert chunk.endswith('.')


def test_prompt_generation():
    generator = ConversationGenerator(None)
    chunks = ["Chunk 1.", "Chunk 2.", "Chunk 3."]
    prompts = generator.generate_conversation_prompts(chunks)

    assert len(prompts) == 3
    assert prompts[0]["part"] == 1
    assert prompts[1]["context"] != prompts[0]["context"]
```

## Best Practices

1. **Always Monitor Token Usage**
```python
def estimate_tokens(text: str) -> int:
    """Rough token count estimation (~1.3 tokens per English word)."""
    return int(len(text.split()) * 1.3)
```

2. **Implement Graceful Fallbacks**
```python
def generate_with_fallback(prompt: str, max_retries: int = 3):
    for attempt in range(max_retries):
        try:
            return generate_full_response(prompt)
        except TokenLimitError:
            prompt = truncate_prompt(prompt)
    return generate_summary(prompt)
```

3. **Use Efficient Prompt Templates** (see the sketch below)
- Keep system prompts concise
- Remove redundant context
- Use compression techniques for long contexts
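
As an illustration of that last point, a compact template keeps the fixed system text short and injects only the variables that are actually needed. The sketch below uses `ChatPromptTemplate` from `langchain_core` (installed transitively by the `langchain-openai` dependency added in this commit); the template wording is illustrative:

```python
from langchain_core.prompts import ChatPromptTemplate

# Short, reusable system prompt; only the variables that matter are interpolated.
concise_template = ChatPromptTemplate.from_messages([
    ("system", "You are a concise technical writer. Answer in at most {max_words} words."),
    ("human", "{question}"),
])

messages = concise_template.format_messages(
    max_words=150,
    question="Explain what max_output_tokens does.",
)
```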

## Implications

Implementing content chunking with contextual linking is a practical solution to manage the output size limitations of LLMs. However, this approach comes with its own set of implications that developers must consider.

1. **Increased Development Complexity**: Implementing strategies to overcome the maximum output token length introduces additional layers of complexity to the application design. It necessitates meticulous management of context across multiple outputs to maintain coherence. Ensuring that each chunk retains the necessary context for the conversation or document can be challenging and often requires advanced logic to handle transitions seamlessly.

2. **Cost Implications**: Attempting to circumvent the `max_output_tokens` limitation by making multiple requests can increase the number of tokens processed, thereby raising the operational costs associated with using LLM services. Each additional request contributes to the overall token usage, which can quickly escalate costs, especially for applications with high-frequency interactions or large volumes of data.

3. **Performance Bottlenecks**: Generating long outputs in segments can lead to performance bottlenecks, as each segment may require additional processing time and resources, impacting the overall efficiency of the application. The need to manage and link multiple chunks can introduce latency and reduce the responsiveness of the system, which is critical for real-time applications.

By understanding these implications, developers can better prepare for the challenges associated with context chunking and contextual linking, ensuring that their applications remain efficient, cost-effective, and user-friendly.


## Future Considerations

As models evolve, we can expect several advancements that will significantly impact how we handle output size limitations:

1. **Contextual Awareness**: Future LLMs will likely have improved contextual awareness, or what Mustafa Suleyman calls "infinite memory", enabling them to better understand and manage the context of a conversation or document over long interactions. This will reduce the need for repetitive context setting and improve the overall user experience.

2. **More Efficient Token Usage**: Advances in model architecture and tokenization strategies will lead to more efficient token usage. This means that models will be able to convey the same amount of information using fewer tokens, reducing costs and improving performance.

3. **Improved Compression Techniques**: As research progresses, we can expect the development of more sophisticated compression techniques that allow models to retain essential information while reducing the number of tokens required. This will be particularly useful for applications that need to summarize or condense large amounts of data.

4. **Adaptive Token Limits**: Future models may implement adaptive token limits that dynamically adjust based on the complexity and requirements of the task at hand. This will provide more flexibility and efficiency in handling diverse use cases.

5. **Enhanced Memory Management**: Innovations in memory management will allow models to handle larger outputs without a significant increase in computational resources. This will make it feasible to deploy advanced LLMs in resource-constrained environments.

These advancements will collectively enhance the capabilities of LLMs, making them more powerful and versatile tools for a wide range of applications. However, they will also introduce new challenges and considerations that developers and researchers will need to address to fully harness their potential.


## Conclusion

In conclusion, while managing output size limitations in LLMs presents significant challenges, it also drives innovation in application design and optimization strategies. By implementing techniques such as context chunking, efficient prompt templates, and graceful fallbacks, developers can mitigate these limitations and enhance the performance and cost-effectiveness of their applications. As the technology evolves, advancements in contextual awareness, token efficiency, and memory management will further empower developers to build more robust and scalable LLM-powered systems. It is crucial to stay informed about these developments and continuously adapt to leverage the full potential of LLMs while addressing their inherent constraints.


