
Commit

added missing slides
Kubus42 committed Apr 21, 2024
1 parent b436a8e commit 8222545
Showing 32 changed files with 2,373 additions and 1,120 deletions.
6 changes: 3 additions & 3 deletions _freeze/llm/parameterization/execute-results/html.json

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions _freeze/nlp/overview/execute-results/html.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion _freeze/site_libs/revealjs/dist/theme/quarto.css

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions _freeze/slides/about/projects/execute-results/html.json
@@ -0,0 +1,12 @@
{
"hash": "5ce34d344b5301714789bde7fc69ad92",
"result": {
"engine": "jupyter",
"markdown": "---\ntitle: \"Projects: Large Language Models\"\nformat: \n revealjs:\n theme: default\n chalkboard: true\n footer: \"Sprint: LLM, 2024\"\n logo: ../../assets/logo.svg\n---\n\n# How to develop an app with a language model\n\n## What do I have to keep in mind?\n\n## What can go wrong? \n\n## What do I need?\n\n\n# Project ideas\n\n## Question-Answering Chatbot \nBuild a chatbot that can answer questions posed by users on a specific topic provided in form of documents. Users input their questions, the chatbot retrieves relevant information from a pre-defined set of documents, and uses the information to answer the question.\n\n## Document tagging / classification \nUse GPT and its tools (e.g., function calls) and/or embeddings to classify documents or assign tags to them. Example: Sort bug reports or complaints into categories depending on the problem.\n\n## Clustering of text-based entities \nCreate a small tool that can cluster text-based entities based on embeddings, for example, groups of texts or keywords. Example: Structure a folder of text files based on their content.\n\n## Text-based RPG Game\nDevelop a text-based role-playing game where players interact with characters and navigate through a story generated by GPT. Players make choices that influence the direction of the narrative.\n\n## Sentiment Analysis Tool\nBuild an app that analyzes the sentiment of text inputs (e.g., social media posts, customer reviews) using GPT. Users can input text, and the app provides insights into the overall sentiment expressed in the text.\n\n## Text Summarization Tool \nCreate an application that summarizes long blocks of text into shorter, concise summaries. Users can input articles, essays, or documents, and the tool generates a summarized version.\n\n## Language Translation Tool \nBuild a simple translation app that utilizes GPT to translate text between different languages. Users can input text in one language, and the app outputs the translated text in the desired language. Has to include some nice tweaks.\n\n## Personalized Recipe Generator \nDevelop an app that generates personalized recipes based on user preferences and dietary restrictions. Users input their preferred ingredients and dietary needs, and the app generates custom recipes using GPT.\n\n## Lyrics Generator \nCreate a lyrics generation tool that generates lyrics based on user input such as themes, music style, emotions, or keywords. 
Users can explore different poetic styles and themes generated by GPT.\n\n# How to build your app\n\n## Tools \n\n- You can use everything in the Jupyterlab (put `pip list` in a terminal to see all Python packages)\n- If there are specific packages you need, we can organize them\n- You can simply build your application in a Jupyter notebook!\n- Or: Use **Dash**!\n\n\n## Dash \nPut the following files into your home in the Jupyterlab: \n\n`my_layout.py`\n\n::: {#8de0cf53 .cell execution_count=1}\n``` {.python .cell-code}\nfrom dash import html\nfrom dash import dcc\n\n\nlayout = html.Div([\n html.H1(\"Yeay, my app!\"),\n html.Div([\n html.Label(\"Enter your text:\"),\n dcc.Input(id='input-text', type='text', value=''),\n html.Button('Submit', id='submit-btn', n_clicks=0),\n ]),\n html.Div(id='output-container-button')\n])\n```\n:::\n\n\n--- \n\n`my_callbacks.py`\n\n::: {#bafd904a .cell execution_count=2}\n``` {.python .cell-code}\nfrom dash.dependencies import (\n Input, \n Output\n)\nfrom dash import html\n\n\ndef register_callbacks(app):\n @app.callback(\n Output('output-container-button', 'children'),\n [Input('submit-btn', 'n_clicks')],\n [Input('input-text', 'value')]\n )\n def update_output(n_clicks, input_value):\n if n_clicks > 0:\n return html.Div([\n html.Label(\"You entered:\"),\n html.P(input_value)\n ])\n else:\n return ''\n\n```\n:::\n\n\n--- \n\nNow you can run your own app in the Jupyterlab here: \n\n![MyApp Launcher](../../assets/my_app.png)\n\n",
"supporting": [
"projects_files"
],
"filters": [],
"includes": {}
}
}
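
Editor's note: the projects slides above define `my_layout.py` and `my_callbacks.py` for Dash but do not show the entry point that wires them into a running app. Below is a minimal sketch of such a file; the filename `my_app.py`, the run options, and the port are assumptions for illustration and are not part of this commit.

```python
# my_app.py -- hypothetical entry point combining the two files from the slides
from dash import Dash

from my_layout import layout                  # layout object defined in my_layout.py
from my_callbacks import register_callbacks   # callback registration from my_callbacks.py

app = Dash(__name__)
app.layout = layout        # attach the layout to the app
register_callbacks(app)    # wire up the submit-button callback

if __name__ == "__main__":
    # newer Dash releases expose the same call as app.run(...);
    # the debug flag and default port 8050 are illustrative choices
    app.run_server(debug=True)
```

This is only one way to launch the app; the JupyterLab launcher shown on the slide ("MyApp Launcher") may handle this step differently.
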
_freeze/slides/llm/embeddings/execute-results/html.json
@@ -2,7 +2,7 @@
"hash": "b2e6d98ed4c34c193f24df3e7aa76795",
"result": {
"engine": "jupyter",
"markdown": "---\ntitle: \"Embeddings\"\nformat: \n revealjs:\n theme: default\n chalkboard: true\n footer: \"Sprint: LLM, 2024\"\n logo: ../../assets/logo.svg\n---\n\n## Revisiting what we know \n\nEmbeddings ... \n\n- transform text into numerical vectors\n- are used in neural network architectures\n- Key benefit: Capture **semantic** similarities and relationships between words\n\n \n\n- Already seen: Bag of Words \n- Issue: These embeddings do not compress!\n\n\n## What are embeddings?\n\n- Represent words and text as **dense**, numerical vectors\n- Capture rich semantic information\n- Context-aware, based on surrounding text\n- Capture subtle semantic relationships\n- Compact representation compared to simple techniques such as bag of words\n\n\n## Approaches to generate embeddings:\n- Word2Vec, GloVe, FastText\n - Train neural network to predict surrounding words\n - CBOW or skip-gram architectures\n - Learns semantic relationships in continuous vector space\n\n- Transformer architectures like GPT\n- Word embeddings provided by OpenAI\n\n\n## What does it look like? \n\nTrain a model to: \n\n- predict the target word based on the (surrounding) context words, **or** \n- predict the context words given a target word\n\n\n```{mermaid}\nflowchart LR\n A[\"Input Layer (One Hot)\"] \n A --> B[\"Embedding Layer\"]\n B --> C[\"Sum/Average Layer\"]\n C --> D[\"Output Layer\"]\n```\n\n\n::: {.fragment}\n### Use of the model\nThrow away the parts after the embedding layer!\n\n```{mermaid}\nflowchart LR\n A[\"Input Layer (One Hot)\"] \n A --> B[\"Embedding Layer\"]\n```\n\n:::\n\n\n# Matching with embeddings\n\n## Task: Find the matching document for a prompt\n\n::: {#618626c5 .cell execution_count=1}\n``` {.python .cell-code}\ntexts = [\n \"This is the first document.\",\n \"This document is the second document.\",\n \"And this is the third one.\"\n]\n\nprompt = \"Is this the first document?\"\n```\n:::\n\n\n## Get the OpenAI client\n\n::: {#d7259430 .cell execution_count=2}\n``` {.python .cell-code}\n# prerequisites\n\nimport os\nfrom llm_utils.client import get_openai_client, OpenAIModels\n\nMODEL = OpenAIModels.EMBED.value # choose the embedding model\n\n# get the OpenAI client\nclient = get_openai_client(\n model=MODEL,\n config_path=os.environ.get(\"CONFIG_PATH\")\n)\n```\n:::\n\n\n## Get the embeddings\n\n::: {#2e816842 .cell execution_count=3}\n``` {.python .cell-code}\n# get the embeddings\nresponse = client.embeddings.create(\n input=texts,\n model=MODEL\n)\n\ntext_embeddings = [emb.embedding for emb in response.data]\n\nresponse = client.embeddings.create(\n input=[prompt],\n model=MODEL\n)\n\nprompt_embedding = response.data[0].embedding\n```\n:::\n\n\n## Compute the similarity \n\n::: {#0aa416fe .cell output-location='fragment' execution_count=4}\n``` {.python .cell-code}\nimport numpy as np \n\ndef cosine_similarity(vec1: np.array, vec2: np.array) -> float: \n return np.dot(vec1, vec2) / ( np.linalg.norm(vec1) * np.linalg.norm(vec2) )\n\n\nfor text, text_embedding in zip(texts, text_embeddings):\n similarity = cosine_similarity(text_embedding, prompt_embedding)\n print(f\"{text}: {round(similarity, 2)}\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nThis is the first document.: 0.95\nThis document is the second document.: 0.88\nAnd this is the third one.: 0.8\n```\n:::\n:::\n\n\n# Visualization and clustering\n\n## Define some words to visualize\n\n::: {#26caab2c .cell execution_count=5}\n``` {.python .cell-code}\n# Define a list of words to visualize\nwords = [\n \"king\", 
\"queen\", \"man\", \"woman\", \"apple\", \"banana\", \n \"grapes\", \"cat\", \"dog\", \"happy\", \"sad\"\n]\n\n# Get embeddings for the words\nresponse = client.embeddings.create(\n input=words,\n model=MODEL\n)\n\nembeddings = [emb.embedding for emb in response.data]\n```\n:::\n\n\n## Apply T-SNE to the embedding vectors \n\n::: {#aad23d37 .cell output-location='slide' execution_count=6}\n``` {.python .cell-code}\nimport numpy as np\nimport matplotlib.pyplot as plt\n\nfrom sklearn.manifold import TSNE\n\n# Apply t-SNE dimensionality reduction\ntsne = TSNE(\n n_components=2, \n random_state=42,\n perplexity=5 # see documentation to set this correctly\n)\nembeddings_2d = tsne.fit_transform(np.array(embeddings))\n\n# Plot the embeddings in a two-dimensional scatter plot\nplt.figure(figsize=(9, 7))\nfor i, word in enumerate(words):\n x, y = embeddings_2d[i]\n plt.scatter(x, y, marker='o', color='red')\n plt.text(x, y, word, fontsize=9)\n\nplt.xlabel(\"t-SNE dimension 1\")\nplt.ylabel(\"t-SNE dimension 2\")\nplt.grid(True)\nplt.xticks([])\nplt.yticks([])\nplt.show()\n```\n\n::: {.cell-output .cell-output-display}\n![](embeddings_files/figure-revealjs/cell-7-output-1.png){width=718 height=555 fig-align='center'}\n:::\n:::\n\n\n## Cluster the embeddings \n\n::: {#35151cbd .cell execution_count=7}\n``` {.python .cell-code}\n# do the clus#| tering\nimport numpy as np\nfrom sklearn.cluster import KMeans\n\nn_clusters = 5\n\n# define the model\nkmeans = KMeans(\n n_clusters=n_clusters,\n n_init=\"auto\",\n random_state=2 # do this to get the same output\n)\n\n# fit the model to the data\nkmeans.fit(np.array(embeddings))\n\n# get the cluster labels\ncluster_labels = kmeans.labels_\n```\n:::\n\n\n## Visualize with T-SNE \n\n::: {#6bed121c .cell output-location='slide' execution_count=8}\n``` {.python .cell-code}\nimport matplotlib.pyplot as plt\n\nfrom sklearn.manifold import TSNE\n\n# Apply t-SNE dimensionality reduction\ntsne = TSNE(\n n_components=2, \n random_state=42,\n perplexity=5 # see documentation to set this correctly\n)\nembeddings_2d = tsne.fit_transform(np.array(embeddings))\n\n# Define a color map for clusters\ncolors = plt.cm.viridis(np.linspace(0, 1, n_clusters))\n\n# Plot the embeddings in a two-dimensional scatter plot\nplt.figure(figsize=(9, 7))\nfor i, word in enumerate(words):\n x, y = embeddings_2d[i]\n cluster_label = cluster_labels[i]\n color = colors[cluster_label]\n plt.scatter(x, y, marker='o', color=color)\n plt.text(x, y, word, fontsize=9)\n\nplt.xlabel(\"t-SNE dimension 1\")\nplt.ylabel(\"t-SNE dimension 2\")\nplt.grid(True)\nplt.xticks([])\nplt.yticks([])\nplt.show()\n```\n\n::: {.cell-output .cell-output-display}\n![](embeddings_files/figure-revealjs/cell-9-output-1.png){width=718 height=555 fig-align='center'}\n:::\n:::\n\n\n",
"markdown": "---\ntitle: \"Embeddings\"\nformat: \n revealjs:\n theme: default\n chalkboard: true\n footer: \"Sprint: LLM, 2024\"\n logo: ../../assets/logo.svg\n---\n\n## Revisiting what we know \n\nEmbeddings ... \n\n- transform text into numerical vectors\n- are used in neural network architectures\n- Key benefit: Capture **semantic** similarities and relationships between words\n\n \n\n- Already seen: Bag of Words \n- Issue: These embeddings do not compress!\n\n\n## What are embeddings?\n\n- Represent words and text as **dense**, numerical vectors\n- Capture rich semantic information\n- Context-aware, based on surrounding text\n- Capture subtle semantic relationships\n- Compact representation compared to simple techniques such as bag of words\n\n\n## Approaches to generate embeddings:\n- Word2Vec, GloVe, FastText\n - Train neural network to predict surrounding words\n - CBOW or skip-gram architectures\n - Learns semantic relationships in continuous vector space\n\n- Transformer architectures like GPT\n- Word embeddings provided by OpenAI\n\n\n## What does it look like? \n\nTrain a model to: \n\n- predict the target word based on the (surrounding) context words, **or** \n- predict the context words given a target word\n\n\n```{mermaid}\nflowchart LR\n A[\"Input Layer (One Hot)\"] \n A --> B[\"Embedding Layer\"]\n B --> C[\"Sum/Average Layer\"]\n C --> D[\"Output Layer\"]\n```\n\n\n::: {.fragment}\n### Use of the model\nThrow away the parts after the embedding layer!\n\n```{mermaid}\nflowchart LR\n A[\"Input Layer (One Hot)\"] \n A --> B[\"Embedding Layer\"]\n```\n\n:::\n\n\n# Matching with embeddings\n\n## Task: Find the matching document for a prompt\n\n::: {#f2e2f7c7 .cell execution_count=1}\n``` {.python .cell-code}\ntexts = [\n \"This is the first document.\",\n \"This document is the second document.\",\n \"And this is the third one.\"\n]\n\nprompt = \"Is this the first document?\"\n```\n:::\n\n\n## Get the OpenAI client\n\n::: {#45714e17 .cell execution_count=2}\n``` {.python .cell-code}\n# prerequisites\n\nimport os\nfrom llm_utils.client import get_openai_client, OpenAIModels\n\nMODEL = OpenAIModels.EMBED.value # choose the embedding model\n\n# get the OpenAI client\nclient = get_openai_client(\n model=MODEL,\n config_path=os.environ.get(\"CONFIG_PATH\")\n)\n```\n:::\n\n\n## Get the embeddings\n\n::: {#8c247e09 .cell execution_count=3}\n``` {.python .cell-code}\n# get the embeddings\nresponse = client.embeddings.create(\n input=texts,\n model=MODEL\n)\n\ntext_embeddings = [emb.embedding for emb in response.data]\n\nresponse = client.embeddings.create(\n input=[prompt],\n model=MODEL\n)\n\nprompt_embedding = response.data[0].embedding\n```\n:::\n\n\n## Compute the similarity \n\n::: {#0dbffcf7 .cell output-location='fragment' execution_count=4}\n``` {.python .cell-code}\nimport numpy as np \n\ndef cosine_similarity(vec1: np.array, vec2: np.array) -> float: \n return np.dot(vec1, vec2) / ( np.linalg.norm(vec1) * np.linalg.norm(vec2) )\n\n\nfor text, text_embedding in zip(texts, text_embeddings):\n similarity = cosine_similarity(text_embedding, prompt_embedding)\n print(f\"{text}: {round(similarity, 2)}\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nThis is the first document.: 0.95\nThis document is the second document.: 0.88\nAnd this is the third one.: 0.8\n```\n:::\n:::\n\n\n# Visualization and clustering\n\n## Define some words to visualize\n\n::: {#02a80e22 .cell execution_count=5}\n``` {.python .cell-code}\n# Define a list of words to visualize\nwords = [\n \"king\", 
\"queen\", \"man\", \"woman\", \"apple\", \"banana\", \n \"grapes\", \"cat\", \"dog\", \"happy\", \"sad\"\n]\n\n# Get embeddings for the words\nresponse = client.embeddings.create(\n input=words,\n model=MODEL\n)\n\nembeddings = [emb.embedding for emb in response.data]\n```\n:::\n\n\n## Apply T-SNE to the embedding vectors \n\n::: {#9785ce16 .cell output-location='slide' execution_count=6}\n``` {.python .cell-code}\nimport numpy as np\nimport matplotlib.pyplot as plt\n\nfrom sklearn.manifold import TSNE\n\n# Apply t-SNE dimensionality reduction\ntsne = TSNE(\n n_components=2, \n random_state=42,\n perplexity=5 # see documentation to set this correctly\n)\nembeddings_2d = tsne.fit_transform(np.array(embeddings))\n\n# Plot the embeddings in a two-dimensional scatter plot\nplt.figure(figsize=(9, 7))\nfor i, word in enumerate(words):\n x, y = embeddings_2d[i]\n plt.scatter(x, y, marker='o', color='red')\n plt.text(x, y, word, fontsize=9)\n\nplt.xlabel(\"t-SNE dimension 1\")\nplt.ylabel(\"t-SNE dimension 2\")\nplt.grid(True)\nplt.xticks([])\nplt.yticks([])\nplt.show()\n```\n\n::: {.cell-output .cell-output-display}\n![](embeddings_files/figure-revealjs/cell-7-output-1.png){width=718 height=555 fig-align='center'}\n:::\n:::\n\n\n## Cluster the embeddings \n\n::: {#77ac9d72 .cell execution_count=7}\n``` {.python .cell-code}\n# do the clustering\nimport numpy as np\nfrom sklearn.cluster import KMeans\n\nn_clusters = 5\n\n# define the model\nkmeans = KMeans(\n n_clusters=n_clusters,\n n_init=\"auto\",\n random_state=2 # do this to get the same output\n)\n\n# fit the model to the data\nkmeans.fit(np.array(embeddings))\n\n# get the cluster labels\ncluster_labels = kmeans.labels_\n```\n:::\n\n\n## Visualize with T-SNE \n\n::: {#80a31df2 .cell output-location='slide' execution_count=8}\n``` {.python .cell-code}\nimport matplotlib.pyplot as plt\n\nfrom sklearn.manifold import TSNE\n\n# Apply t-SNE dimensionality reduction\ntsne = TSNE(\n n_components=2, \n random_state=42,\n perplexity=5 # see documentation to set this correctly\n)\nembeddings_2d = tsne.fit_transform(np.array(embeddings))\n\n# Define a color map for clusters\ncolors = plt.cm.viridis(np.linspace(0, 1, n_clusters))\n\n# Plot the embeddings in a two-dimensional scatter plot\nplt.figure(figsize=(9, 7))\nfor i, word in enumerate(words):\n x, y = embeddings_2d[i]\n cluster_label = cluster_labels[i]\n color = colors[cluster_label]\n plt.scatter(x, y, marker='o', color=color)\n plt.text(x, y, word, fontsize=9)\n\nplt.xlabel(\"t-SNE dimension 1\")\nplt.ylabel(\"t-SNE dimension 2\")\nplt.grid(True)\nplt.xticks([])\nplt.yticks([])\nplt.show()\n```\n\n::: {.cell-output .cell-output-display}\n![](embeddings_files/figure-revealjs/cell-9-output-1.png){width=718 height=555 fig-align='center'}\n:::\n:::\n\n\n",
"supporting": [
"embeddings_files"
],
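
Editor's note: the embeddings slides above print a cosine similarity for each document but stop short of actually selecting the closest match. Below is a small, self-contained sketch of that selection step; the toy vectors and the helper name `best_match` are illustrative stand-ins, not values from the slides or from the OpenAI embedding model.

```python
import numpy as np

def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    # same formula as on the slide
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def best_match(prompt_embedding, text_embeddings, texts):
    # return the text whose embedding is most similar to the prompt embedding
    similarities = [
        cosine_similarity(np.array(emb), np.array(prompt_embedding))
        for emb in text_embeddings
    ]
    idx = int(np.argmax(similarities))
    return texts[idx], similarities[idx]

# toy vectors standing in for real embeddings returned by the API
texts = ["first document", "second document", "third one"]
text_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
prompt_embedding = [0.95, 0.05]

text, score = best_match(prompt_embedding, text_embeddings, texts)
print(f"Best match: {text} (cosine similarity {score:.2f})")
```

In practice the embeddings would come from `client.embeddings.create(...)` as shown on the slides; only the argmax step is new here.
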
12 changes: 12 additions & 0 deletions _freeze/slides/llm/openai_api/execute-results/html.json
@@ -0,0 +1,12 @@
{
"hash": "633cc26588919f4ac288bf0fa8d4c236",
"result": {
"engine": "jupyter",
"markdown": "---\ntitle: \"The OpenAI API\"\nformat: \n revealjs:\n theme: default\n chalkboard: true\n footer: \"Sprint: LLM, 2024\"\n logo: ../../assets/logo.svg\n fig-align: center\n---\n\n## Let's get started \n\nThe great thing about APIs is that we can start right away without too much preparation! \n\nIn this sprint, we will use the OpenAI API for completions and embeddings.\n\nResource: [OpenAI API docs](https://platform.openai.com/docs/introduction){.external}\n\n## Authentication\n\nTypically, it's as simple as this:\n\n::: {#7c51b865 .cell execution_count=1}\n``` {.python .cell-code}\n# setting up the client in Python\nimport os\nfrom openai import OpenAI\n\nclient = OpenAI(\n api_key=os.environ.get(\"OPENAI_API_KEY\")\n)\n```\n:::\n\n\n## Authentication for the seminar\nFor the sprint, we have hosted some models in Azure. \n\n::: {#1a2b7eb7 .cell execution_count=2}\n``` {.python .cell-code}\nimport os\nfrom llm_utils.client import get_openai_client, OpenAIModels\n\nprint(f\"GPT3: {OpenAIModels.GPT_3.value}\")\nprint(f\"GPT4: {OpenAIModels.GPT_4.value}\")\nprint(f\"Embedding model: {OpenAIModels.EMBED.value}\")\n\nMODEL = OpenAIModels.GPT_4.value\n\nclient = get_openai_client(\n model=MODEL,\n config_path=os.environ.get(\"CONFIG_PATH\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nGPT3: gpt3\nGPT4: gpt4\nEmbedding model: embed\n```\n:::\n:::\n\n\n## Creating a completion\n\n::: {#f926b01f .cell execution_count=3}\n``` {.python .cell-code}\nchat_completion = client.chat.completions.create(\n messages=[\n {\n \"role\": \"user\",\n \"content\": \"How old is the earth?\",\n }\n ],\n model=MODEL \n)\n\n# check out the type of the response\n\nprint(f\"Response: {type(chat_completion)}\") # a ChatCompletion object\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nResponse: <class 'openai.types.chat.chat_completion.ChatCompletion'>\n```\n:::\n:::\n\n\n## Retrieving the response \n\n::: {#20e3f8f6 .cell execution_count=4}\n``` {.python .cell-code}\n# print the message we want\nprint(f\"\\nResponse message: {chat_completion.choices[0].message.content}\")\n\n# check the tokens used \nprint(f\"\\nTotal tokens used: {chat_completion.usage.total_tokens}\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nResponse message: The Earth is approximately 4.54 billion years old.\n\nTotal tokens used: 25\n```\n:::\n:::\n\n\n",
"supporting": [
"openai_api_files"
],
"filters": [],
"includes": {}
}
}
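
Editor's note: the OpenAI API slides above demonstrate a single-turn completion. Below is a sketch of a follow-up turn in the same conversation, reusing the `get_openai_client` setup shown on the slides; the follow-up question itself is invented for illustration.

```python
import os
from llm_utils.client import get_openai_client, OpenAIModels  # helper used on the slides

MODEL = OpenAIModels.GPT_4.value
client = get_openai_client(model=MODEL, config_path=os.environ.get("CONFIG_PATH"))

messages = [{"role": "user", "content": "How old is the earth?"}]
first = client.chat.completions.create(messages=messages, model=MODEL)
answer = first.choices[0].message.content

# feed the assistant's reply back in and ask a follow-up question in context
messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user", "content": "And how old is the sun?"})

second = client.chat.completions.create(messages=messages, model=MODEL)
print(second.choices[0].message.content)
print(f"Total tokens for the follow-up call: {second.usage.total_tokens}")
```

The chat endpoint is stateless, so carrying the conversation forward simply means resending the growing `messages` list with each call.
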
