-
Notifications
You must be signed in to change notification settings - Fork 18
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
souzatharsis
committed
Nov 22, 2024
1 parent
1af8ea8
commit 5c36c3f
Showing
42 changed files
with
5,290 additions
and
239 deletions.
There are no files selected for viewing
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file modified
BIN
+18.6 KB
(130%)
tamingllms/_build/.doctrees/notebooks/structured_output.doctree
Binary file not shown.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: 36945de024d591eca72cb43977f35263 | ||
config: d5b2da7e1983f3da43c754f98d2e7fad | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
53 changes: 53 additions & 0 deletions
53
tamingllms/_build/html/_sources/markdown/markdown-notebooks.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
--- | ||
jupytext: | ||
formats: md:myst | ||
text_representation: | ||
extension: .md | ||
format_name: myst | ||
format_version: 0.13 | ||
jupytext_version: 1.11.5 | ||
kernelspec: | ||
display_name: Python 3 | ||
language: python | ||
name: python3 | ||
--- | ||
|
||
# Notebooks with MyST Markdown | ||
|
||
Jupyter Book also lets you write text-based notebooks using MyST Markdown. | ||
See [the Notebooks with MyST Markdown documentation](https://jupyterbook.org/file-types/myst-notebooks.html) for more detailed instructions. | ||
This page shows off a notebook written in MyST Markdown. | ||
|
||
## An example cell | ||
|
||
With MyST Markdown, you can define code cells with a directive like so: | ||
|
||
```{code-cell} | ||
print(2 + 2) | ||
``` | ||
|
||
When your book is built, the contents of any `{code-cell}` blocks will be | ||
executed with your default Jupyter kernel, and their outputs will be displayed | ||
in-line with the rest of your content. | ||
|
||
```{seealso} | ||
Jupyter Book uses [Jupytext](https://jupytext.readthedocs.io/en/latest/) to convert text-based files to notebooks, and can support [many other text-based notebook files](https://jupyterbook.org/file-types/jupytext.html). | ||
``` | ||
|
||
## Create a notebook with MyST Markdown | ||
|
||
MyST Markdown notebooks are defined by two things: | ||
|
||
1. YAML metadata that is needed to understand if / how it should convert text files to notebooks (including information about the kernel needed). | ||
See the YAML at the top of this page for example. | ||
2. The presence of `{code-cell}` directives, which will be executed with your book. | ||
|
||
That's all that is needed to get started! | ||
|
||
## Quickly add YAML metadata for MyST Notebooks | ||
|
||
If you have a markdown file and you'd like to quickly add YAML metadata to it, so that Jupyter Book will treat it as a MyST Markdown Notebook, run the following command: | ||
|
||
``` | ||
jupyter-book myst init path/to/markdownfile.md | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
# Markdown Files | ||
|
||
Whether you write your book's content in Jupyter Notebooks (`.ipynb`) or | ||
in regular markdown files (`.md`), you'll write in the same flavor of markdown | ||
called **MyST Markdown**. | ||
This is a simple file to help you get started and show off some syntax. | ||
|
||
## What is MyST? | ||
|
||
MyST stands for "Markedly Structured Text". It | ||
is a slight variation on a flavor of markdown called "CommonMark" markdown, | ||
with small syntax extensions to allow you to write **roles** and **directives** | ||
in the Sphinx ecosystem. | ||
|
||
For more about MyST, see [the MyST Markdown Overview](https://jupyterbook.org/content/myst.html). | ||
|
||
## Sample Roles and Directives | ||
|
||
Roles and directives are two of the most powerful tools in Jupyter Book. They | ||
are like functions, but written in a markup language. They both | ||
serve a similar purpose, but **roles are written in one line**, whereas | ||
**directives span many lines**. They both accept different kinds of inputs, | ||
and what they do with those inputs depends on the specific role or directive | ||
that is being called. | ||
|
||
Here is a "note" directive: | ||
|
||
```{note} | ||
Here is a note | ||
``` | ||
|
||
It will be rendered in a special box when you build your book. | ||
|
||
Here is an inline directive to refer to a document: {doc}`markdown-notebooks`. | ||
|
||
|
||
## Citations | ||
|
||
You can also cite references that are stored in a `bibtex` file. For example, | ||
the following syntax: `` {cite}`holdgraf_evidence_2014` `` will render like | ||
this: {cite}`holdgraf_evidence_2014`. | ||
|
||
Moreover, you can insert a bibliography into your page with this syntax: | ||
The `{bibliography}` directive must be used for all the `{cite}` roles to | ||
render properly. | ||
For example, if the references for your book are stored in `references.bib`, | ||
then the bibliography is inserted with: | ||
|
||
```{bibliography} | ||
``` | ||
|
||
## Learn more | ||
|
||
This is just a simple starter to get you started. | ||
You can learn a lot more at [jupyterbook.org](https://jupyterbook.org). |
187 changes: 187 additions & 0 deletions
187
tamingllms/_build/html/_sources/markdown/output_size_limit-NOTES.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,187 @@ | ||
# Output Size Limitations | ||
|
||
## What are Token Limits? | ||
|
||
Tokens are the basic units that LLMs process text with. A token can be as short as a single character or as long as a complete word. In English, a general rule of thumb is that 1 token ≈ 4 characters or ¾ of a word. | ||
|
||
The `max_output_tokens` is parameter often available in modern LLMs that determines the maximum length of text that an LLM can generate in a single response. Contrary to what one might expect, the model does not "summarizes the answer" such that it does not surpass `max_output_tokens` limit. Instead, it will stop once it reaches this limit, even mid-sentence, i.e. the response may be truncated. | ||
|
||
| **Token Cost and Length Limitation Comparison Across Key Models** | | | | | | ||
| Model | max_output_tokens | max_input_tokens | input_cost_per_token | output_cost_per_token | | ||
|------------------------------|-------------------|------------------|----------------------|-----------------------| | ||
| meta.llama3-2-11b-instruct-v1:0 | 4096 | 128000 | 3.5e-7 | 3.5e-7 | | ||
| claude-3-5-sonnet-20241022 | 8192 | 200000 | 3e-6 | 1.5e-5 | | ||
| gpt-4-0613 | 4096 | 8192 | 3e-5 | 6e-5 | | ||
| gpt-4-turbo-2024-04-09 | 4096 | 128000 | 1e-5 | 3e-5 | | ||
| gpt-4o-mini | 16384 | 128000 | 1.5e-7 | 6e-7 | | ||
| gemini/gemini-1.5-flash-002 | 8192 | 1048576 | 7.5e-8 | 3e-7 | | ||
| gemini/gemini-1.5-pro-002 | 8192 | 2097152 | 3.5e-6 | 1.05e-5 | | ||
|
||
## Problem Statement | ||
|
||
The `max_output_tokens` limit in LLMs poses a significant challenge for users who need to generate long outputs, as it may result in truncated content and/or incomplete information. | ||
|
||
1. **Truncated Content**: Users aiming to generate extensive content, such as detailed reports or comprehensive articles, may find their outputs abruptly cut off due to the `max_output_tokens` limit. This truncation can result in incomplete information and disrupt the flow of the content. | ||
|
||
2. **Shallow Responses**: When users expect a complete and thorough response but receive only a partial output, it can lead to dissatisfaction and frustration. This is especially true in applications where the completeness of information is critical, such as in educational tools or content creation platforms. | ||
|
||
To effectively address these challenges, developers need to implement robust solutions that balance user expectations with technical and cost constraints, ensuring that long-form content generation remains feasible and efficient. | ||
|
||
### Content Chunking with Contextual Linking | ||
|
||
Content chunking with contextual linking is a technique used to manage the `max_output_tokens` limitation by breaking down long-form content into smaller, manageable chunks. This approach allows the LLM to focus on smaller sections of the input, enabling it to generate more complete and detailed responses for each chunk while maintaining coherence and context across the entire output. | ||
|
||
1. **Chunking the Content**: The input content is split into smaller chunks. This allows the LLM to process each chunk individually, focusing on generating a complete and detailed response for that specific section of the input. | ||
|
||
2. **Maintaining Context**: Each chunk is linked with contextual information from the previous chunks. This helps in maintaining the flow and coherence of the content across multiple chunks. | ||
|
||
3. **Generating Linked Prompts**: For each chunk, a prompt is generated that includes the chunk's content and its context. This prompt is then used to generate the output for that chunk. | ||
|
||
4. **Combining the Outputs**: The outputs of all chunks are combined to form the final long-form content. | ||
|
||
By following these steps, developers can effectively manage the `max_output_tokens` limitation and generate coherent long-form content without truncation. | ||
|
||
Let's examine a robust solution for handling long-form content generation: | ||
|
||
```python | ||
from typing import List, Dict | ||
import json | ||
|
||
class ConversationGenerator: | ||
def __init__(self, api_client): | ||
self.api_client = api_client | ||
|
||
def chunk_content(self, input_content: str, chunk_size: int = 1000) -> List[str]: | ||
"""Split input content into manageable chunks while preserving context.""" | ||
sentences = input_content.split('. ') | ||
chunks = [] | ||
current_chunk = [] | ||
current_length = 0 | ||
|
||
for sentence in sentences: | ||
sentence_length = len(sentence) | ||
if current_length + sentence_length > chunk_size and current_chunk: | ||
chunks.append('. '.join(current_chunk) + '.') | ||
current_chunk = [] | ||
current_length = 0 | ||
current_chunk.append(sentence) | ||
current_length += sentence_length | ||
|
||
if current_chunk: | ||
chunks.append('. '.join(current_chunk) + '.') | ||
return chunks | ||
|
||
def generate_conversation_prompts(self, content_chunks: List[str]) -> List[Dict]: | ||
"""Generate linked conversation prompts.""" | ||
prompts = [] | ||
for i, chunk in enumerate(content_chunks): | ||
prompt = { | ||
"part": i + 1, | ||
"total_parts": len(content_chunks), | ||
"content": chunk, | ||
"context": self._get_context(i, prompts) | ||
} | ||
prompts.append(prompt) | ||
return prompts | ||
|
||
def _get_context(self, part_index: int, previous_prompts: List[Dict]) -> str: | ||
"""Generate context from previous parts.""" | ||
if part_index == 0: | ||
return "Start of conversation" | ||
return f"Continuing from part {part_index}" | ||
``` | ||
|
||
### Testing the Implementation | ||
|
||
Here's how to test the chunking functionality: | ||
|
||
```python | ||
import pytest | ||
from conversation_generator import ConversationGenerator | ||
|
||
def test_content_chunking(): | ||
generator = ConversationGenerator(None) | ||
long_content = "First sentence. Second sentence. " * 100 | ||
chunks = generator.chunk_content(long_content, chunk_size=100) | ||
|
||
assert len(chunks) > 1 | ||
for chunk in chunks: | ||
assert len(chunk) <= 100 | ||
assert chunk.endswith('.') | ||
|
||
def test_prompt_generation(): | ||
generator = ConversationGenerator(None) | ||
chunks = ["Chunk 1.", "Chunk 2.", "Chunk 3."] | ||
prompts = generator.generate_conversation_prompts(chunks) | ||
|
||
assert len(prompts) == 3 | ||
assert prompts[0]["part"] == 1 | ||
assert prompts[1]["context"] != prompts[0]["context"] | ||
``` | ||
|
||
## Best Practices | ||
|
||
1. **Always Monitor Token Usage** | ||
```python | ||
def estimate_tokens(text: str) -> int: | ||
"""Rough token count estimation.""" | ||
return len(text.split()) * 1.3 | ||
``` | ||
|
||
2. **Implement Graceful Fallbacks** | ||
```python | ||
def generate_with_fallback(prompt: str, max_retries: int = 3): | ||
for attempt in range(max_retries): | ||
try: | ||
return generate_full_response(prompt) | ||
except TokenLimitError: | ||
prompt = truncate_prompt(prompt) | ||
return generate_summary(prompt) | ||
``` | ||
|
||
3. **Use Efficient Prompt Templates** | ||
- Keep system prompts concise | ||
- Remove redundant context | ||
- Use compression techniques for long contexts | ||
|
||
## Implications | ||
|
||
Implementing context chunking with contextual linking is a practical solution to manage the output size limitations of LLMs. However, this approach comes with its own set of implications that developers must consider. | ||
|
||
1. **Increased Development Complexity**: Implementing strategies to overcome the maximum output token length introduces additional layers of complexity to the application design. It necessitates meticulous management of context across multiple outputs to maintain coherence. Ensuring that each chunk retains the necessary context for the conversation or document can be challenging and often requires advanced logic to handle transitions seamlessly. | ||
|
||
2. **Cost Implications**: Attempting to circumvent the `max_output_tokens` limitation by making multiple requests can increase the number of tokens processed, thereby raising the operational costs associated with using LLM services. Each additional request contributes to the overall token usage, which can quickly escalate costs, especially for applications with high-frequency interactions or large volumes of data. | ||
|
||
3. **Performance Bottlenecks**: Generating long outputs in segments can lead to performance bottlenecks, as each segment may require additional processing time and resources, impacting the overall efficiency of the application. The need to manage and link multiple chunks can introduce latency and reduce the responsiveness of the system, which is critical for real-time applications. | ||
|
||
By understanding these implications, developers can better prepare for the challenges associated with context chunking and contextual linking, ensuring that their applications remain efficient, cost-effective, and user-friendly. | ||
|
||
|
||
## Future Considerations | ||
|
||
As models evolve, we can expect several advancements that will significantly impact how we handle output size limitations: | ||
|
||
1. **Contextual Awareness**: Future LLMs will likely have improved contextual awareness - or as Mustafa Suleyman would call "infinite memory", enabling them to better understand and manage the context of a conversation or document over long interactions. This will reduce the need for repetitive context setting and improve the overall user experience. | ||
|
||
2. **More Efficient Token Usage**: Advances in model architecture and tokenization strategies will lead to more efficient token usage. This means that models will be able to convey the same amount of information using fewer tokens, reducing costs and improving performance. | ||
|
||
3. **Improved Compression Techniques**: As research progresses, we can expect the development of more sophisticated compression techniques that allow models to retain essential information while reducing the number of tokens required. This will be particularly useful for applications that need to summarize or condense large amounts of data. | ||
|
||
4. **Adaptive Token Limits**: Future models may implement adaptive token limits that dynamically adjust based on the complexity and requirements of the task at hand. This will provide more flexibility and efficiency in handling diverse use cases. | ||
|
||
5. **Enhanced Memory Management**: Innovations in memory management will allow models to handle larger outputs without a significant increase in computational resources. This will make it feasible to deploy advanced LLMs in resource-constrained environments. | ||
|
||
These advancements will collectively enhance the capabilities of LLMs, making them more powerful and versatile tools for a wide range of applications. However, they will also introduce new challenges and considerations that developers and researchers will need to address to fully harness their potential. | ||
|
||
|
||
## Conclusion | ||
|
||
In conclusion, while managing output size limitations in LLMs presents significant challenges, it also drives innovation in application design and optimization strategies. By implementing techniques such as context chunking, efficient prompt templates, and graceful fallbacks, developers can mitigate these limitations and enhance the performance and cost-effectiveness of their applications. As the technology evolves, advancements in contextual awareness, token efficiency, and memory management will further empower developers to build more robust and scalable LLM-powered systems. It is crucial to stay informed about these developments and continuously adapt to leverage the full potential of LLMs while addressing their inherent constraints. | ||
|
||
|
||
## References | ||
|
||
1. OpenAI Token Limits Documentation | ||
2. Anthropic Claude Documentation | ||
3. Google PaLM 2 Technical Specifications | ||
4. Research papers on context window management |
Oops, something went wrong.