
Commit

update output size limit
souzatharsis committed Dec 7, 2024
1 parent 06ddf6e commit 44de3e0
Showing 13 changed files with 141 additions and 126 deletions.
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/evals.doctree
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/output_size_limit.doctree
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/structured_output.doctree
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -90,7 +90,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we will utilize `langchain` for a content-aware sentence-splitting strategy for chunking. We will use the `CharacterTextSplitter` with `tiktoken` as our tokenizer to count the number of tokens per chunk which we can use to ensure that we do not surpass the input token limit of our model."
"Here, we will utilize `langchain` for a content-aware, sentence-based chunking strategy. LangChain offers several text splitters {cite}`langchain_text_splitters`, including JSON-, Markdown-, and HTML-based splitters as well as token-based splitting. We will use the `CharacterTextSplitter` with `tiktoken` as our tokenizer to count the tokens in each chunk, ensuring that we do not surpass our model's input token limit."
]
},
{
@@ -471,8 +471,9 @@
"\n",
"\n",
"## References\n",
"\n",
"- [LangChain Text Splitter](https://langchain.readthedocs.io/en/latest/modules/text_splitter.html)."
"```{bibliography}\n",
":filter: docname in docnames\n",
"```"
]
},
{
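The token-bounded chunking strategy described in the cell above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's actual code: the whitespace word counter is a hypothetical stand-in for `tiktoken`, and `chunk_sentences` is a hypothetical stand-in for LangChain's `CharacterTextSplitter`.

```python
def count_tokens(text: str) -> int:
    # Hypothetical stand-in for a tiktoken encoder: approximate
    # the token count by the number of whitespace-separated words.
    return len(text.split())

def chunk_sentences(sentences, max_tokens: int):
    # Greedily pack whole sentences into chunks so that no chunk
    # exceeds max_tokens; sentences are never split mid-way.
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        tokens = count_tokens(sentence)
        if current and current_tokens + tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += tokens
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = ["One two three.", "Four five.", "Six seven eight nine."]
print(chunk_sentences(doc, max_tokens=5))
```

The greedy packing keeps chunk boundaries on sentence edges, which is the "content-aware" property: each chunk stays under the token budget without cutting a sentence in half.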
80 changes: 40 additions & 40 deletions tamingllms/_build/html/notebooks/evals.html

Large diffs are not rendered by default.

72 changes: 38 additions & 34 deletions tamingllms/_build/html/notebooks/output_size_limit.html

Large diffs are not rendered by default.

80 changes: 40 additions & 40 deletions tamingllms/_build/html/notebooks/structured_output.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion tamingllms/_build/html/searchindex.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion tamingllms/_build/jupyter_execute/markdown/intro.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
"id": "8e40cf5d",
"id": "cc6b3dc4",
"metadata": {},
"source": [
"(intro)=\n",
@@ -90,7 +90,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we will utilize `langchain` for a content-aware sentence-splitting strategy for chunking. We will use the `CharacterTextSplitter` with `tiktoken` as our tokenizer to count the number of tokens per chunk which we can use to ensure that we do not surpass the input token limit of our model."
"Here, we will utilize `langchain` for a content-aware, sentence-based chunking strategy. LangChain offers several text splitters {cite}`langchain_text_splitters`, including JSON-, Markdown-, and HTML-based splitters as well as token-based splitting. We will use the `CharacterTextSplitter` with `tiktoken` as our tokenizer to count the tokens in each chunk, ensuring that we do not surpass our model's input token limit."
]
},
{
@@ -471,8 +471,9 @@
"\n",
"\n",
"## References\n",
"\n",
"- [LangChain Text Splitter](https://langchain.readthedocs.io/en/latest/modules/text_splitter.html)."
"```{bibliography}\n",
":filter: docname in docnames\n",
"```"
]
},
{
7 changes: 4 additions & 3 deletions tamingllms/notebooks/output_size_limit.ipynb
@@ -90,7 +90,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we will utilize `langchain` for a content-aware sentence-splitting strategy for chunking. We will use the `CharacterTextSplitter` with `tiktoken` as our tokenizer to count the number of tokens per chunk which we can use to ensure that we do not surpass the input token limit of our model."
"Here, we will utilize `langchain` for a content-aware, sentence-based chunking strategy. LangChain offers several text splitters {cite}`langchain_text_splitters`, including JSON-, Markdown-, and HTML-based splitters as well as token-based splitting. We will use the `CharacterTextSplitter` with `tiktoken` as our tokenizer to count the tokens in each chunk, ensuring that we do not surpass our model's input token limit."
]
},
{
@@ -471,8 +471,9 @@
"\n",
"\n",
"## References\n",
"\n",
"- [LangChain Text Splitter](https://langchain.readthedocs.io/en/latest/modules/text_splitter.html)."
"```{bibliography}\n",
":filter: docname in docnames\n",
"```"
]
},
{
10 changes: 9 additions & 1 deletion tamingllms/references.bib
@@ -227,4 +227,12 @@ @article{long2024llms
author={Long, Do Xuan and Ngoc, Hai Nguyen and Sim, Tiviatis and Dao, Hieu and Joty, Shafiq and Kawaguchi, Kenji and Chen, Nancy F and Kan, Min-Yen},
journal={arXiv preprint arXiv:2408.08656},
year={2024}
}
}

@misc{langchain_text_splitters,
title={Text Splitters - LangChain Documentation},
author={{LangChain}},
year={2024},
howpublished={\url{https://python.langchain.com/docs/how_to/#text-splitters}},
note={Accessed: 12/07/2024}
}

