Commit 32a3769: final slide updates

Kubus42 committed Apr 23, 2024
1 parent 8222545

Showing 13 changed files with 152 additions and 158 deletions.
2 changes: 1 addition & 1 deletion _freeze/site_libs/revealjs/dist/theme/quarto.css

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions _freeze/slides/about/projects/execute-results/html.json
@@ -1,8 +1,8 @@
{
"hash": "5ce34d344b5301714789bde7fc69ad92",
"hash": "37dac4cf62bc929bdd72e10a28cd9c91",
"result": {
"engine": "jupyter",
"markdown": "---\ntitle: \"Projects: Large Language Models\"\nformat: \n revealjs:\n theme: default\n chalkboard: true\n footer: \"Sprint: LLM, 2024\"\n logo: ../../assets/logo.svg\n---\n\n# How to develop an app with a language model\n\n## What do I have to keep in mind?\n\n## What can go wrong? \n\n## What do I need?\n\n\n# Project ideas\n\n## Question-Answering Chatbot \nBuild a chatbot that can answer questions posed by users on a specific topic provided in form of documents. Users input their questions, the chatbot retrieves relevant information from a pre-defined set of documents, and uses the information to answer the question.\n\n## Document tagging / classification \nUse GPT and its tools (e.g., function calls) and/or embeddings to classify documents or assign tags to them. Example: Sort bug reports or complaints into categories depending on the problem.\n\n## Clustering of text-based entities \nCreate a small tool that can cluster text-based entities based on embeddings, for example, groups of texts or keywords. Example: Structure a folder of text files based on their content.\n\n## Text-based RPG Game\nDevelop a text-based role-playing game where players interact with characters and navigate through a story generated by GPT. Players make choices that influence the direction of the narrative.\n\n## Sentiment Analysis Tool\nBuild an app that analyzes the sentiment of text inputs (e.g., social media posts, customer reviews) using GPT. Users can input text, and the app provides insights into the overall sentiment expressed in the text.\n\n## Text Summarization Tool \nCreate an application that summarizes long blocks of text into shorter, concise summaries. Users can input articles, essays, or documents, and the tool generates a summarized version.\n\n## Language Translation Tool \nBuild a simple translation app that utilizes GPT to translate text between different languages. Users can input text in one language, and the app outputs the translated text in the desired language. Has to include some nice tweaks.\n\n## Personalized Recipe Generator \nDevelop an app that generates personalized recipes based on user preferences and dietary restrictions. Users input their preferred ingredients and dietary needs, and the app generates custom recipes using GPT.\n\n## Lyrics Generator \nCreate a lyrics generation tool that generates lyrics based on user input such as themes, music style, emotions, or keywords. 
Users can explore different poetic styles and themes generated by GPT.\n\n# How to build you app\n\n## Tools \n\n- You can use everything in the Jupyterlab (put `pip list` in a terminal to see all Python packages)\n- If there are specific packages you need, we can organize them\n- You can simply build your application in a Jupyter notebook!\n- Or: Use **Dash**!\n\n\n## Dash \nPut the following files into your home in the Jupyterlab: \n\n`my_layout.py`\n\n::: {#8de0cf53 .cell execution_count=1}\n``` {.python .cell-code}\nfrom dash import html\nfrom dash import dcc\n\n\nlayout = html.Div([\n html.H1(\"Yeay, my app!\"),\n html.Div([\n html.Label(\"Enter your text:\"),\n dcc.Input(id='input-text', type='text', value=''),\n html.Button('Submit', id='submit-btn', n_clicks=0),\n ]),\n html.Div(id='output-container-button')\n])\n```\n:::\n\n\n--- \n\n`my_callbacks.py`\n\n::: {#bafd904a .cell execution_count=2}\n``` {.python .cell-code}\nfrom dash.dependencies import (\n Input, \n Output\n)\nfrom dash import html\n\n\ndef register_callbacks(app):\n @app.callback(\n Output('output-container-button', 'children'),\n [Input('submit-btn', 'n_clicks')],\n [Input('input-text', 'value')]\n )\n def update_output(n_clicks, input_value):\n if n_clicks > 0:\n return html.Div([\n html.Label(\"You entered:\"),\n html.P(input_value)\n ])\n else:\n return ''\n\n```\n:::\n\n\n--- \n\nNow you can run your own app in the Jupyterlab here: \n\n![MyApp Launcher](../../assets/my_app.png)\n\n",
"markdown": "---\ntitle: \"Projects: Large Language Models\"\nformat: \n revealjs:\n theme: default\n chalkboard: true\n footer: \"Sprint: LLM, 2024\"\n logo: ../../assets/logo.svg\n---\n\n# How to develop an app with a language model: DEMO\n\n\n# Project ideas\n\n## Question-Answering Chatbot \nBuild a chatbot that can answer questions posed by users on a specific topic provided in form of documents. Users input their questions, the chatbot retrieves relevant information from a pre-defined set of documents, and uses the information to answer the question.\n\n## Document tagging / classification \nUse GPT and its tools (e.g., function calls) and/or embeddings to classify documents or assign tags to them. Example: Sort bug reports or complaints into categories depending on the problem.\n\n## Clustering of text-based entities \nCreate a small tool that can cluster text-based entities based on embeddings, for example, groups of texts or keywords. Example: Structure a folder of text files based on their content.\n\n## Text-based RPG Game\nDevelop a text-based role-playing game where players interact with characters and navigate through a story generated by GPT. Players make choices that influence the direction of the narrative.\n\n## Sentiment Analysis Tool\nBuild an app that analyzes the sentiment of text inputs (e.g., social media posts, customer reviews) using GPT. Users can input text, and the app provides insights into the overall sentiment expressed in the text.\n\n## Text Summarization Tool \nCreate an application that summarizes long blocks of text into shorter, concise summaries. Users can input articles, essays, or documents, and the tool generates a summarized version.\n\n## Language Translation Tool \nBuild a simple translation app that utilizes GPT to translate text between different languages. Users can input text in one language, and the app outputs the translated text in the desired language. Has to include some nice tweaks.\n\n## Personalized Recipe Generator \nDevelop an app that generates personalized recipes based on user preferences and dietary restrictions. Users input their preferred ingredients and dietary needs, and the app generates custom recipes using GPT.\n\n## Lyrics Generator \nCreate a lyrics generation tool that generates lyrics based on user input such as themes, music style, emotions, or keywords. 
Users can explore different poetic styles and themes generated by GPT.\n\n# How to build you app\n\n## Tools \n\n- You can use everything in the Jupyterlab (put `pip list` in a terminal to see all Python packages)\n- If there are specific packages you need, we can organize them\n- You can simply build your application in a Jupyter notebook!\n- Or: Use **Dash**!\n\n\n## Dash \nPut the following files into your home in the Jupyterlab: \n\n`my_layout.py`\n\n::: {#8ad990b5 .cell execution_count=1}\n``` {.python .cell-code}\nfrom dash import html\nfrom dash import dcc\n\n\nlayout = html.Div([\n html.H1(\"Yeay, my app!\"),\n html.Div([\n html.Label(\"Enter your text:\"),\n dcc.Input(id='input-text', type='text', value=''),\n html.Button('Submit', id='submit-btn', n_clicks=0),\n ]),\n html.Div(id='output-container-button')\n])\n```\n:::\n\n\n--- \n\n`my_callbacks.py`\n\n::: {#8c56167f .cell execution_count=2}\n``` {.python .cell-code}\nfrom dash.dependencies import (\n Input, \n Output\n)\nfrom dash import html\n\n\ndef register_callbacks(app):\n @app.callback(\n Output('output-container-button', 'children'),\n [Input('submit-btn', 'n_clicks')],\n [Input('input-text', 'value')]\n )\n def update_output(n_clicks, input_value):\n if n_clicks > 0:\n return html.Div([\n html.Label(\"You entered:\"),\n html.P(input_value)\n ])\n else:\n return ''\n\n```\n:::\n\n\n--- \n\nNow you can run your own app in the Jupyterlab here: \n\n![MyApp Launcher](../../assets/my_app.png)\n\n",
"supporting": [
"projects_files"
],
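The slides in this file introduce `my_layout.py` and `my_callbacks.py` but never show the entry point behind the MyApp launcher. A minimal sketch of that wiring, assuming a conventional Dash setup — the file name `app.py`, the module-level wiring, and the `debug` flag are assumptions, not part of this commit:

```python
# app.py -- hypothetical entry point; the slides only show my_layout.py and
# my_callbacks.py, so this wiring is an assumption about the launcher setup.
from dash import Dash

from my_layout import layout
from my_callbacks import register_callbacks

app = Dash(__name__)
app.layout = layout        # attach the layout defined in my_layout.py
register_callbacks(app)    # register the submit-button callback

if __name__ == "__main__":
    # Older Dash releases use app.run_server() instead of app.run().
    app.run(debug=True)
```

With these three files in place, `python app.py` serves the app locally; inside JupyterLab a proxy extension (such as jupyter-server-proxy) typically exposes it, which is presumably what the MyApp launcher tile shown in the slides does.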
4 changes: 2 additions & 2 deletions _freeze/slides/nlp/short_history/execute-results/html.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions _freeze/slides/nlp/tokenization/execute-results/html.json
@@ -1,8 +1,8 @@
{
"hash": "0c8226cb8426d19aabf971e84d458934",
"hash": "433b9e337499419240ec902dd611850a",
"result": {
"engine": "jupyter",
"markdown": "---\ntitle: \"Tokenization\"\nformat: \n revealjs:\n theme: default\n chalkboard: true\n footer: \"Sprint: LLM, 2024\"\n logo: ../../assets/logo.svg\n fig-align: center\n---\n\n## Tokenization\n\n::: {#f6b61e6b .cell execution_count=1}\n``` {.python .cell-code}\nsentence = \"I love reading science fiction books or books about science.\"\n```\n:::\n\n\n \n\n::: {.fragment}\n::: {.callout-note title=\"Definition\"}\nTokenization is the process of breaking down a text into smaller units called tokens. \n:::\n:::\n\n \n\n::: {.fragment}\n\n::: {#fdcdadae .cell execution_count=2}\n``` {.python .cell-code}\ntokenized_sentence = sentence.split(\" \")\nprint(tokenized_sentence)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science.']\n```\n:::\n:::\n\n\n:::\n\n\n## Counting token\n\n::: {#f3b143c8 .cell execution_count=3}\n``` {.python .cell-code}\nfrom collections import Counter\n\ntoken_counter = Counter(tokenized_sentence)\nprint(token_counter.most_common(3))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[('books', 2), ('I', 1), ('love', 1)]\n```\n:::\n:::\n\n\n \n\n::: {.fragment}\n\n::: {#1948dc77 .cell execution_count=4}\n``` {.python .cell-code}\ntokenized_sentence = sentence.replace(\".\", \" \").split(\" \")\n\ntoken_counter = Counter(tokenized_sentence)\nprint(token_counter.most_common(2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[('science', 2), ('books', 2)]\n```\n:::\n:::\n\n\n:::\n\n\n## NLTK tokenization\n\n::: {#001d43fa .cell execution_count=5}\n``` {.python .cell-code}\nfrom nltk.tokenize import wordpunct_tokenize\nfrom string import punctuation\n\ntokenized_sentence = wordpunct_tokenize(sentence)\ntokenized_sentence = [t for t in tokenized_sentence if t not in punctuation]\nprint(tokenized_sentence)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science']\n```\n:::\n:::\n\n\n## Lemmatization\n\n- Reduce words to their base or canonical form\n- Represents the dictionary form of a word (lemma)\n- Standardizes words for better text analysis accuracy\n- Example: `meeting` --> `meet` (verb)\n\n---\n\n- Helps in tasks such as text classification, information retrieval, and sentiment analysis\n- Considers context and linguistic rules\n- Retains semantic meaning of words\n- Has to involve part-of-speech tagging (see example below)\n- Determines correct lemma based on word's role in sentence\n\n\n```{mermaid}\nflowchart LR\n A(meeting)\n A --> B(\"meet (verb)\")\n A --> C(\"meeting (noun)\")\n```\n\n\n\n## Lemmatization with WordNet: Nouns\n\n::: {#0c63f98a .cell output-location='fragment' execution_count=6}\n``` {.python .cell-code}\nfrom nltk.stem import WordNetLemmatizer\n\nsentence = \"The three brothers went over three big bridges\"\n\nwnl = WordNetLemmatizer()\n\nlemmatized_sentence_token = [\n wnl.lemmatize(w, pos=\"n\") for w in sentence.split(\" \")\n]\n\nprint(lemmatized_sentence_token)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['The', 'three', 'brother', 'went', 'over', 'three', 'big', 'bridge']\n```\n:::\n:::\n\n\n## Lemmatization with WordNet: Verbs\n\n::: {#f787a80c .cell output-location='fragment' execution_count=7}\n``` {.python .cell-code}\nlemmatized_sentence_token = [\n wnl.lemmatize(w, pos=\"v\") for w in sentence.split(\" \")\n]\n\nprint(lemmatized_sentence_token)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['The', 'three', 'brothers', 'go', 
'over', 'three', 'big', 'bridge']\n```\n:::\n:::\n\n\n## Lemmatization with WordNet and POS-tagging\n\n::: {#7e6031b2 .cell output-location='fragment' execution_count=8}\n``` {.python .cell-code}\npos_dict = {\n \"brothers\": \"n\", \n \"went\": \"v\",\n \"big\": \"a\",\n \"bridges\": \"n\"\n}\n\nlemmatized_sentence_token = []\nfor token in sentence.split(\" \"):\n if token in pos_dict:\n lemma = wnl.lemmatize(token, pos=pos_dict[token])\n else: \n lemma = token # leave as it is\n\n lemmatized_sentence_token.append(lemma)\n\nprint(lemmatized_sentence_token)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['The', 'three', 'brother', 'go', 'over', 'three', 'big', 'bridge']\n```\n:::\n:::\n\n\n# Bit Pair Encoding\n\n## Bit Pair Encoding: Why?\n\n- Tokenization: Breaking text into smaller chunks (tokens)\n- Traditional vocabularies: Fixed-size, memory-intensive\n- Bit pair encoding: Compression technique for large vocabularies\n\n## Bit Pair Encoding: How?\n- Pair Identification: Identifies frequent pairs of characters\n- Replacement with Single Token: Replaces pairs with single token\n- Iterative Process: Continues until stopping criterion met\n- Vocabulary Construction: Construct vocabulary with single tokens\n- Encoding and Decoding: Text encoded and decoded using constructed vocabulary\n\n## Bit Pair Encoding: Pros and Cons\n- Efficient Memory Usage\n- Retains Information\n- Flexibility\n- Computational Overhead\n- Loss of Granularity\n\n::: {.notes}\n- Reduces vocabulary size, efficient memory usage\n- Captures frequent character pairs, retains linguistic information\n- Adaptable to different tokenization strategies and corpus characteristics\n- Iterative nature can be computationally intensive\n- May lead to loss of granularity, especially for rare words\n- Effectiveness depends on tokenization strategy and corpus characteristics\n:::\n\n",
"markdown": "---\ntitle: \"Tokenization\"\nformat: \n revealjs:\n theme: default\n chalkboard: true\n footer: \"Sprint: LLM, 2024\"\n logo: ../../assets/logo.svg\n fig-align: center\n---\n\n## Tokenization\n\n::: {#55f1d863 .cell execution_count=1}\n``` {.python .cell-code}\nsentence = \"I love reading science fiction books or books about science.\"\n```\n:::\n\n\n \n\n::: {.fragment}\n::: {.callout-note title=\"Definition\"}\nTokenization is the process of breaking down a text into smaller units called tokens. \n:::\n:::\n\n \n\n::: {.fragment}\n\n::: {#a5b3e8ec .cell execution_count=2}\n``` {.python .cell-code}\ntokenized_sentence = sentence.split(\" \")\nprint(tokenized_sentence)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science.']\n```\n:::\n:::\n\n\n:::\n\n\n## Counting token\n\n::: {#a414031c .cell execution_count=3}\n``` {.python .cell-code}\nfrom collections import Counter\n\ntoken_counter = Counter(tokenized_sentence)\nprint(token_counter.most_common(3))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[('books', 2), ('I', 1), ('love', 1)]\n```\n:::\n:::\n\n\n \n\n::: {.fragment}\n\n::: {#dfe01ad2 .cell execution_count=4}\n``` {.python .cell-code}\ntokenized_sentence = sentence.replace(\".\", \" \").split(\" \")\n\ntoken_counter = Counter(tokenized_sentence)\nprint(token_counter.most_common(2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[('science', 2), ('books', 2)]\n```\n:::\n:::\n\n\n:::\n\n\n## NLTK tokenization\n\n::: {#1319f1bc .cell execution_count=5}\n``` {.python .cell-code}\nfrom nltk.tokenize import wordpunct_tokenize\nfrom string import punctuation\n\ntokenized_sentence = wordpunct_tokenize(sentence)\ntokenized_sentence = [t for t in tokenized_sentence if t not in punctuation]\nprint(tokenized_sentence)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science']\n```\n:::\n:::\n\n\n## Lemmatization\n\n- Reduce words to their base or canonical form\n- Represents the dictionary form of a word (lemma)\n- Standardizes words for better text analysis accuracy\n- Example: `meeting` --> `meet` (verb)\n\n---\n\n- Helps in tasks such as text classification, information retrieval, and sentiment analysis\n- Considers context and linguistic rules\n- Retains semantic meaning of words\n- Has to involve part-of-speech tagging (see example below)\n- Determines correct lemma based on word's role in sentence\n\n\n```{mermaid}\nflowchart LR\n A(meeting)\n A --> B(\"meet (verb)\")\n A --> C(\"meeting (noun)\")\n```\n\n\n\n## Lemmatization with WordNet: Nouns\n\n::: {#0648adde .cell output-location='fragment' execution_count=6}\n``` {.python .cell-code}\nfrom nltk.stem import WordNetLemmatizer\n\nsentence = \"The three brothers went over three big bridges\"\n\nwnl = WordNetLemmatizer()\n\nlemmatized_sentence_token = [\n wnl.lemmatize(w, pos=\"n\") for w in sentence.split(\" \")\n]\n\nprint(lemmatized_sentence_token)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['The', 'three', 'brother', 'went', 'over', 'three', 'big', 'bridge']\n```\n:::\n:::\n\n\n## Lemmatization with WordNet: Verbs\n\n::: {#b08e7d39 .cell output-location='fragment' execution_count=7}\n``` {.python .cell-code}\nlemmatized_sentence_token = [\n wnl.lemmatize(w, pos=\"v\") for w in sentence.split(\" \")\n]\n\nprint(lemmatized_sentence_token)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['The', 'three', 'brothers', 'go', 
'over', 'three', 'big', 'bridge']\n```\n:::\n:::\n\n\n## Lemmatization with WordNet and POS-tagging\n\n::: {#268ae1e0 .cell output-location='fragment' execution_count=8}\n``` {.python .cell-code}\npos_dict = {\n \"brothers\": \"n\", \n \"went\": \"v\",\n \"big\": \"a\",\n \"bridges\": \"n\"\n}\n\nlemmatized_sentence_token = []\nfor token in sentence.split(\" \"):\n if token in pos_dict:\n lemma = wnl.lemmatize(token, pos=pos_dict[token])\n else: \n lemma = token # leave as it is\n\n lemmatized_sentence_token.append(lemma)\n\nprint(lemmatized_sentence_token)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['The', 'three', 'brother', 'go', 'over', 'three', 'big', 'bridge']\n```\n:::\n:::\n\n\n# Bit Pair Encoding\n\n## Bit Pair Encoding: Why?\n\n- Tokenization: Breaking text into smaller chunks (tokens)\n- Traditional vocabularies: Fixed-size, memory-intensive\n- Bit pair encoding: Compression technique for large vocabularies\n\n## Bit Pair Encoding: How?\n- Pair Identification: Identifies frequent pairs of characters\n- Replacement with Single Token: Replaces pairs with single token\n- Iterative Process: Continues until stopping criterion met\n- Vocabulary Construction: Construct vocabulary with single tokens\n- Encoding and Decoding: Text encoded and decoded using constructed vocabulary\n\n# [OpenAI Tokenizer](https://platform.openai.com/tokenizer){.external}\n\n## Bit Pair Encoding: Pros and Cons\n- Efficient Memory Usage\n- Retains Information\n- Flexibility\n- Computational Overhead\n- Loss of Granularity\n\n::: {.notes}\n- Reduces vocabulary size, efficient memory usage\n- Captures frequent character pairs, retains linguistic information\n- Adaptable to different tokenization strategies and corpus characteristics\n- Iterative nature can be computationally intensive\n- May lead to loss of granularity, especially for rare words\n- Effectiveness depends on tokenization strategy and corpus characteristics\n:::\n\n",
"supporting": [
"tokenization_files"
],
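The lemmatization slides in this file note that part-of-speech tagging has to be involved, demonstrating it with a hand-written `pos_dict`. A sketch of automating that lookup with NLTK's tagger; the Treebank-to-WordNet mapping below is a common convention rather than something shown in the slides, and it assumes the `averaged_perceptron_tagger` and `wordnet` NLTK data are downloaded:

```python
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def to_wordnet_pos(treebank_tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS constant."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # noun as fallback, like the slide's default branch

wnl = WordNetLemmatizer()
sentence = "The three brothers went over three big bridges"

lemmas = [
    wnl.lemmatize(token, pos=to_wordnet_pos(tag))
    for token, tag in pos_tag(sentence.split(" "))
]
# expected: ['The', 'three', 'brother', 'go', 'over', 'three', 'big', 'bridge']
print(lemmas)
```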
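The "Byte Pair Encoding: How?" slide describes the merge loop only in prose. A toy sketch of that loop in the classic subword formulation (the byte-level variant behind the linked OpenAI tokenizer works on raw bytes, but the iteration is the same idea); the sample vocabulary and the merge count are made up for illustration:

```python
import re
from collections import Counter

def most_frequent_pair(vocab):
    """Pair identification: count adjacent symbol pairs across the vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(pair, vocab):
    """Replacement: fuse every occurrence of the pair into a single token."""
    # match the bigram only at symbol boundaries, not inside longer symbols
    bigram = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {bigram.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words spelled as space-separated symbols, weighted by corpus frequency.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for _ in range(4):                   # stopping criterion: fixed number of merges
    pair = most_frequent_pair(vocab)
    if pair is None:
        break
    print("merge:", pair, "->", "".join(pair))
    vocab = merge_pair(pair, vocab)  # iterative process

print(vocab)
```

Each merged pair becomes an entry in the learned vocabulary; encoding new text replays the merges in order, and decoding simply concatenates the tokens back together.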
2 changes: 1 addition & 1 deletion docs/search.json
@@ -1174,7 +1174,7 @@
"href": "slides/nlp/short_history.html#machine-translation",
"title": "A Short History of Natural Language Processing",
"section": "Machine Translation",
"text": "Machine Translation\n\nAutomatically translating text from one language to another\nFacilitates communication across language barriers\n\n\n\n\n\nSprint: LLM, 2024"
"text": "Machine Translation\n\nAutomatically translating text from one language to another\nFacilitates communication across language barriers"
},
{
"objectID": "slides/nlp/statistics.html#term-frequency-token-counting",
2 changes: 1 addition & 1 deletion docs/site_libs/revealjs/dist/theme/quarto.css

Large diffs are not rendered by default.
