@@ -493,6 +520,7 @@ Exercise: Embedding similarity
Cosine similarity between king and pawn: 0.829.
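The 0.829 figure above is the cosine similarity of the 'king' and 'pawn' embeddings. As a minimal sketch of the `cosine_similarity` helper that the exercise code further below relies on (the repository's actual implementation is not shown here, so this is an assumption), it could be written with numpy like this:

```python
# Sketch of a cosine-similarity helper such as the one the exercise output
# above appears to use; the course repo's actual implementation may differ.
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their Euclidean norms.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```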
diff --git a/_freeze/llm/gpt/execute-results/html.json b/_freeze/llm/gpt/execute-results/html.json index 8070b26..dc0ed74 100644 --- a/_freeze/llm/gpt/execute-results/html.json +++ b/_freeze/llm/gpt/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "ed968d8cc94d9558b0bf965f6a3a3035", + "hash": "73c0bdfe90047738a7b587281309f0d8", "result": { "engine": "jupyter", - "markdown": "---\ntitle: GPT\nformat:\n html:\n code-fold: true\n---\n\n", + "markdown": "---\ntitle: GPT\nformat:\n html:\n code-fold: false\n---\n\n- **Definition of GPT**: GPT is a state-of-the-art large language model developed by OpenAI. It belongs to the family of Transformer-based architectures and is renowned for its ability to generate coherent and contextually relevant text across a wide range of tasks.\n- **Key Features of GPT**: Highlight the key features that distinguish GPT from other LLMs, such as its autoregressive nature, the use of self-attention mechanisms, and the ability to generate text of variable length.\n- **Pre-training Objective**: GPT is pre-trained using an unsupervised learning objective known as language modeling. During pre-training, it learns to predict the next word in a sequence based on the preceding context, which enables it to capture the statistical properties of natural language.\n- **Architecture of GPT**: Provide an overview of the architecture of GPT, which consists of multiple layers of Transformer blocks. Each block includes self-attention layers, feed-forward neural networks, and layer normalization, allowing GPT to process input sequences and generate output sequences effectively.\n- **Fine-tuning and Adaptation**: GPT can be fine-tuned on specific tasks or domains with labeled data to adapt its pre-trained knowledge to new tasks. This fine-tuning process allows GPT to achieve state-of-the-art performance on a wide range of natural language processing tasks.\n- **Applications of GPT**: Discuss the diverse applications of GPT across various domains, including text generation, summarization, translation, question-answering, conversation generation, and more. Highlight real-world examples and use cases where GPT has been deployed successfully.\n- **Recent Advancements and Versions**: Mention the evolution of GPT over time, including the release of different versions such as GPT-1, GPT-2, GPT-3, and any subsequent versions or variants. Discuss the improvements and advancements introduced in each iteration.\n- **Challenges and Limitations:** Acknowledge the challenges and limitations associated with GPT, such as the potential for generating biased or inappropriate content, the need for large-scale computational resources, and the difficulty of fine-tuning for specific tasks without overfitting.\n\n\n\n## Completions and how they work\n\n### 1. Prompt:\n\nThe prompt serves as the cornerstone of completion generation, acting as the initial input or context upon which the model bases its predictions and generates completions. Its significance lies in its ability to set the tone, theme, and direction for the subsequent text generation process. Prompts can vary widely in length and complexity, ranging from concise prompts that elicit specific responses to more extensive prompts that allow for nuanced and detailed completions. The effectiveness of the prompt in guiding the completion generation process depends on its clarity, relevance, and specificity to the desired task or objective.\n\n### 2. 
Model Architecture:\n\nCompletions derive their power from sophisticated machine learning models, with transformer-based architectures like GPT (Generative Pre-trained Transformer) leading the forefront. These models undergo extensive training on vast amounts of text data, spanning diverse domains and languages, to develop a deep understanding of human language. Through this training process, the models learn to capture the intricacies of grammar, syntax, semantics, and context inherent in natural language. The architecture of these models is designed to efficiently process and analyze input text, enabling them to capture long-range dependencies within text and generate coherent completions that align with the provided prompt.\n\n### 3. Tokenization:\n\nBefore processing the prompt and generating completions, the input text undergoes tokenization, a crucial preprocessing step that breaks it down into smaller units known as tokens. These tokens typically represent words or subwords and serve as the fundamental building blocks for the model's understanding of the text. Tokenization enables the model to analyze the underlying structure of the text at a granular level, facilitating more effective learning and prediction. Each token encapsulates a discrete unit of meaning within the text and serves as input to the model during the completion generation process.\n\n### 4. Probability Distribution:\n\nCentral to the completion generation process is the prediction of the likelihood of each possible token that could follow the prompt. This prediction is based on the model's learned parameters and contextual understanding of the input text. The model computes a probability distribution over the vocabulary of tokens, assigning a probability score to each token to indicate its likelihood of occurrence given the context provided by the prompt. This probability distribution guides the selection of tokens during the completion generation process, ensuring that the generated completions are coherent and contextually relevant.\n\n### 5. Sampling Strategy:\n\nTo generate completions, the model employs various sampling strategies to select tokens from the probability distribution. Greedy sampling, for example, selects the token with the highest probability at each step, favoring the most probable tokens but potentially leading to repetitive or predictable completions. In contrast, random sampling randomly selects tokens according to their probabilities, introducing variability and unpredictability into the generated completions. Top-k sampling restricts token selection to the top-k most probable tokens, striking a balance between diversity and coherence in the completions. Each sampling strategy offers unique trade-offs in terms of diversity, coherence, and computational efficiency, allowing users to tailor the completion generation process to their specific needs and preferences.\n\n### Conclusion:\n\nCompletions represent a sophisticated approach to natural language processing, leveraging advanced machine learning models and algorithms to generate coherent and contextually relevant text based on given input. By understanding the underlying components and mechanisms of completions, users can harness their power to develop innovative applications and solutions across a wide range of domains and use cases. 
As research in NLP continues to advance, the capabilities and applications of completions are expected to evolve, driving further innovation and exploration in the field of human-computer interaction.\n\n", "supporting": [ "gpt_files" ], diff --git a/_freeze/llm/gpt_api/execute-results/html.json b/_freeze/llm/gpt_api/execute-results/html.json new file mode 100644 index 0000000..556f51d --- /dev/null +++ b/_freeze/llm/gpt_api/execute-results/html.json @@ -0,0 +1,12 @@ +{ + "hash": "dc7e1b2db90c90886a913f41ce9fd215", + "result": { + "engine": "jupyter", + "markdown": "---\ntitle: The OpenAI API\nformat:\n html:\n code-fold: false\n---\n\nResource: [OpenAI API docs](https://platform.openai.com/docs/introduction){.external}\n\n\nLet's get started with the OpenAI API for GPT. \n\n\n### Authentication\n\nGetting started with the OpenAI Chat Completions API requires signing up for an account on the OpenAI platform. \nOnce you've registered, you'll gain access to an API key, which serves as a unique identifier for your application to authenticate requests to the API. \nThis key is essential for ensuring secure communication between your application and OpenAI's servers. \nWithout proper authentication, your requests will be rejected.\nYou can create your own account, but for the seminar we will provide the client with the credential within the Jupyterlab (TODO: Link).\n\n::: {#a9cb3d89 .cell execution_count=1}\n``` {.python .cell-code}\n# setting up the client in Python\n\nimport os\nfrom openai import OpenAI\n\nclient = OpenAI(\n api_key=os.environ.get(\"OPENAI_API_KEY\")\n)\n```\n:::\n\n\n### Requesting Completions\n\nMost interaction with GPT and other models consist in generating completions for certain tasks (TODO: Link to completions)\n\nTo request completions from the OpenAI API, we use Python to send HTTP requests to the designated API endpoint. \nThese requests are structured to include various parameters that guide the generation of text completions. \nThe most fundamental parameter is the prompt text, which sets the context for the completion. \nAdditionally, you can specify the desired model configuration, such as the engine to use (e.g., \"gpt-4\"), as well as any constraints or preferences for the generated completions, such as the maximum number of tokens or the temperature for controlling creativity (TODO: Link parameterization)\n\n::: {#662daa55 .cell execution_count=2}\n``` {.python .cell-code}\n# creating a completion\nchat_completion = client.chat.completions.create(\n messages=[\n {\n \"role\": \"user\",\n \"content\": \"How old is the earth?\",\n }\n ],\n model=\"gpt-3.5-turbo\"\n)\n```\n:::\n\n\n### Processing\n\nOnce the OpenAI API receives your request, it proceeds to process the provided prompt using the specified model. \nThis process involves analyzing the context provided by the prompt and leveraging the model's pre-trained knowledge to generate text completions. \nThe model employs advanced natural language processing techniques to ensure that the generated completions are coherent and contextually relevant. \nBy drawing from its extensive training data and understanding of human language, the model aims to produce responses that closely align with human-like communication.\n\n### Response\n\nAfter processing your request, the OpenAI API returns a JSON-formatted response containing the generated text completions. 
\nDepending on the specifics of your request, you may receive multiple completions, each accompanied by additional information such as a confidence score indicating the model's level of certainty in the generated text. \nThis response provides valuable insights into the quality and relevance of the completions, allowing you to tailor your application's behavior accordingly.\n\n### Error Handling\n\nWhile interacting with the OpenAI API, it's crucial to implement robust error handling mechanisms to gracefully manage any potential issues that may arise. \nCommon errors include providing invalid parameters, experiencing authentication failures due to an incorrect API key, or encountering rate limiting restrictions. B\ny handling errors effectively, you can ensure the reliability and resilience of your application, minimizing disruptions to the user experience and maintaining smooth operation under varying conditions. \nImplementing proper error handling practices is essential for building robust and dependable applications that leverage the capabilities of the OpenAI Chat Completions API effectively.\n\n", + "supporting": [ + "gpt_api_files" + ], + "filters": [], + "includes": {} + } +} \ No newline at end of file diff --git a/_freeze/llm/intro/execute-results/html.json b/_freeze/llm/intro/execute-results/html.json index 0849fa2..2d075b8 100644 --- a/_freeze/llm/intro/execute-results/html.json +++ b/_freeze/llm/intro/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "8dca43e41c2fe7ee9d13d53651464ebf", + "hash": "5e7211dcd0b9a5f67a0d2f337e328ba7", "result": { "engine": "jupyter", - "markdown": "---\ntitle: Introduction to LLM\nformat:\n html:\n code-fold: true\n---\n\n", + "markdown": "---\ntitle: Introduction to LLM\nformat:\n html:\n code-fold: true\n---\n\n- **Definition of Large Language Models**: Large Language Models (LLMs) are deep learning models trained on vast amounts of text data to understand and generate human-like text. They use advanced techniques such as Transformers and self-attention mechanisms to process and generate sequences of words.\n- **Pre-training and Fine-tuning**: LLMs are typically pre-trained on large text corpora using unsupervised learning techniques, where they learn the statistical properties of natural language. After pre-training, they can be fine-tuned on specific tasks or domains with labeled data to adapt their knowledge and capabilities.\n- **Transformer Architecture**: Transformers are the backbone of LLMs, consisting of multiple layers of self-attention mechanisms and feed-forward neural networks. They excel at capturing long-range dependencies in sequential data, making them well-suited for NLP tasks.\n- **Self-Attention Mechanism**: Self-attention allows LLMs to weigh the importance of each word in a sequence based on its relationship with other words in the sequence. 
This mechanism enables them to capture contextual information effectively and generate coherent text.\n\n", "supporting": [ "intro_files" ], diff --git a/_freeze/llm/parameterization/execute-results/html.json b/_freeze/llm/parameterization/execute-results/html.json index 03d31e8..ffb6c73 100644 --- a/_freeze/llm/parameterization/execute-results/html.json +++ b/_freeze/llm/parameterization/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "0a40f87ac1e0fa3422b741e047ac6703", + "hash": "bc2adb1e48802eac607894bda896dc2e", "result": { "engine": "jupyter", - "markdown": "---\ntitle: Parameterization of GPT\nformat:\n html:\n code-fold: true\n---\n\n- **Temperature**: Temperature is a parameter that controls the randomness of the generated text. Lower temperatures result in more deterministic outputs, where the model tends to choose the most likely tokens at each step. Higher temperatures introduce more randomness, allowing the model to explore less likely tokens and produce more creative outputs. It's often used to balance between generating safe, conservative responses and more novel, imaginative ones.\n\n- **Max Tokens**: Max Tokens limits the maximum length of the generated text by specifying the maximum number of tokens (words or subwords) allowed in the output. This parameter helps to control the length of the response and prevent the model from generating overly long or verbose outputs, which may not be suitable for certain applications or contexts.\n\n- **Top P (Nucleus Sampling)**: Top P, also known as nucleus sampling, dynamically selects a subset of the most likely tokens based on their cumulative probability until the cumulative probability exceeds a certain threshold (specified by the parameter). This approach ensures diversity in the generated text while still prioritizing tokens with higher probabilities. It's particularly useful for generating diverse and contextually relevant responses.\n\n- **Frequency Penalty**: Frequency Penalty penalizes tokens based on their frequency in the generated text. Tokens that appear more frequently are assigned higher penalties, discouraging the model from repeatedly generating common or redundant tokens. This helps to promote diversity in the generated text and prevent the model from producing overly repetitive outputs.\n\n- **Presence Penalty**: Presence Penalty penalizes tokens that are already present in the input prompt. By discouraging the model from simply echoing or replicating the input text, this parameter encourages the generation of responses that go beyond the provided context. It's useful for generating more creative and novel outputs that are not directly predictable from the input.\n\n- **Stop Sequence**: Stop Sequence specifies a sequence of tokens that, if generated by the model, signals it to stop generating further text. This parameter is commonly used to indicate the desired ending or conclusion of the generated text. It helps to control the length of the response and ensure that the model generates text that aligns with specific requirements or constraints.\n\n- **Repetition Penalty**: Repetition Penalty penalizes repeated tokens in the generated text by assigning higher penalties to tokens that appear multiple times within a short context window. This encourages the model to produce more varied outputs by avoiding unnecessary repetition of tokens. 
It's particularly useful for generating coherent and diverse text without excessive redundancy.\n\n- **Length Penalty**: Length Penalty penalizes the length of the generated text by applying a penalty factor to longer sequences. This helps to balance between generating concise and informative responses while avoiding excessively long or verbose outputs. Length Penalty is often used to control the length of the generated text and ensure that it remains coherent and contextually relevant.\n\n", + "markdown": "---\ntitle: Parameterization of GPT\nformat:\n html:\n code-fold: true\n---\n\n- **Temperature**: Temperature is a parameter that controls the randomness of the generated text. Lower temperatures result in more deterministic outputs, where the model tends to choose the most likely tokens at each step. Higher temperatures introduce more randomness, allowing the model to explore less likely tokens and produce more creative outputs. It's often used to balance between generating safe, conservative responses and more novel, imaginative ones.\n\n- **Max Tokens**: Max Tokens limits the maximum length of the generated text by specifying the maximum number of tokens (words or subwords) allowed in the output. This parameter helps to control the length of the response and prevent the model from generating overly long or verbose outputs, which may not be suitable for certain applications or contexts.\n\n- **Top P (Nucleus Sampling)**: Top P, also known as nucleus sampling, dynamically selects a subset of the most likely tokens based on their cumulative probability until the cumulative probability exceeds a certain threshold (specified by the parameter). This approach ensures diversity in the generated text while still prioritizing tokens with higher probabilities. It's particularly useful for generating diverse and contextually relevant responses.\n\n- **Frequency Penalty**: Frequency Penalty penalizes tokens based on their frequency in the generated text. Tokens that appear more frequently are assigned higher penalties, discouraging the model from repeatedly generating common or redundant tokens. This helps to promote diversity in the generated text and prevent the model from producing overly repetitive outputs.\n\n- **Presence Penalty**: Presence Penalty penalizes tokens that are already present in the input prompt. By discouraging the model from simply echoing or replicating the input text, this parameter encourages the generation of responses that go beyond the provided context. It's useful for generating more creative and novel outputs that are not directly predictable from the input.\n\n- **Stop Sequence**: Stop Sequence specifies a sequence of tokens that, if generated by the model, signals it to stop generating further text. This parameter is commonly used to indicate the desired ending or conclusion of the generated text. It helps to control the length of the response and ensure that the model generates text that aligns with specific requirements or constraints.\n\n- **Repetition Penalty**: Repetition Penalty penalizes repeated tokens in the generated text by assigning higher penalties to tokens that appear multiple times within a short context window. This encourages the model to produce more varied outputs by avoiding unnecessary repetition of tokens. It's particularly useful for generating coherent and diverse text without excessive redundancy.\n\n- **Length Penalty**: Length Penalty penalizes the length of the generated text by applying a penalty factor to longer sequences. 
This helps to balance between generating concise and informative responses while avoiding excessively long or verbose outputs. Length Penalty is often used to control the length of the generated text and ensure that it remains coherent and contextually relevant.\n\n\n\n## Roles: \n\n::: {#3a526819 .cell execution_count=1}\n``` {.python .cell-code}\nfrom openai import OpenAI\nclient = OpenAI()\n\ncompletion = client.chat.completions.create(\n model=\"gpt-3.5-turbo\",\n messages=[\n {\"role\": \"system\", \"content\": \"You are a poetic assistant, skilled in explaining complex programming concepts with creative flair.\"},\n {\"role\": \"user\", \"content\": \"Compose a poem that explains the concept of recursion in programming.\"}\n ]\n)\n\nprint(completion.choices[0].message)\n```\n:::\n\n\n## Function calling: \nhttps://platform.openai.com/docs/guides/function-calling\n\n", "supporting": [ "parameterization_files" ], diff --git a/_freeze/llm/prompting/execute-results/html.json b/_freeze/llm/prompting/execute-results/html.json new file mode 100644 index 0000000..bb0b0f3 --- /dev/null +++ b/_freeze/llm/prompting/execute-results/html.json @@ -0,0 +1,12 @@ +{ + "hash": "d550bb699b724b045b6d42c1602ebf1c", + "result": { + "engine": "jupyter", + "markdown": "---\ntitle: Prompting\nformat:\n html:\n code-fold: true\n---\n\n**Resources:** \n- https://platform.openai.com/docs/guides/prompt-engineering\n- \n\n", + "supporting": [ + "prompting_files" + ], + "filters": [], + "includes": {} + } +} \ No newline at end of file diff --git a/_freeze/nlp/tokenization/execute-results/html.json b/_freeze/nlp/tokenization/execute-results/html.json index 64fee11..9be1827 100644 --- a/_freeze/nlp/tokenization/execute-results/html.json +++ b/_freeze/nlp/tokenization/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "63d3c15b4de64fd875f71925b65296bf", + "hash": "015b9b99e573753e3289e6d5f046eca5", "result": { "engine": "jupyter", - "markdown": "---\ntitle: Tokenization\nformat:\n html:\n code-fold: false\n---\n\nTODO: Some introductory sentence.\n\n## Simple word tokenization\nA key element for a computer to understand the words we speak or type is the concept of word tokenization. \nFor a human, the sentence \n\n::: {#712dd934 .cell execution_count=1}\n``` {.python .cell-code}\nsentence = \"I love reading science fiction books or books about science.\"\n```\n:::\n\n\nis easy to understand since we are able to split the sentence into its individual parts in order to figure out the meaning of the full sentence.\nFor a computer, the sentence is just a simple string of characters, like any other word or longer text.\nIn order to make a computer understand the meaning of a sentence, we need to help break it down into its relevant parts.\n\nSimply put, word tokenization is the process of breaking down a piece of text into individual words or so-called tokens. \nIt is like taking a sentence and splitting it into smaller pieces, where each piece represents a word.\nWord tokenization involves analyzing the text character by character and identifying boundaries between words. \nIt uses various rules and techniques to decide where one word ends and the next one begins. 
\nFor example, spaces, punctuation marks, and special characters often serve as natural boundaries between words.\n\nSo let's start breaking down the sentence into its individual parts.\n\n::: {#44e2a40a .cell execution_count=2}\n``` {.python .cell-code}\ntokenized_sentence = sentence.split(\" \")\nprint(tokenized_sentence)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science.']\n```\n:::\n:::\n\n\nOnce we have tokenized the sentence, we can start anaylzing it with some simple statistical methods. \nFor example, in order to figure out what the sentence might be about, we could count the most frequent words. \n\n::: {#345bd0ce .cell execution_count=3}\n``` {.python .cell-code}\nfrom collections import Counter\n\ntoken_counter = Counter(tokenized_sentence)\nprint(token_counter.most_common(2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[('books', 2), ('I', 1)]\n```\n:::\n:::\n\n\nUnfortunately, we already realize that we have not done the best job with our \"tokenizer\": The second occurence of the word `science` is missing do to the punctuation. \nWhile this is great as it holds information about the ending of a sentence, it disturbs our analysis here, so let's get rid of it. \n\n::: {#d420f19e .cell execution_count=4}\n``` {.python .cell-code}\ntokenized_sentence = sentence.replace(\".\", \" \").split(\" \")\n\ntoken_counter = Counter(tokenized_sentence)\nprint(token_counter.most_common(2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[('science', 2), ('books', 2)]\n```\n:::\n:::\n\n\nSo that worked.\nAs you can imagine, tokenization can get increasingly difficult when we have to deal with all sorts of situations in larger corpora of texts (see also the exercise). \nSo it is great that there are already all sorts of libraries available that can help us with this process. \n\n::: {#cd707ce2 .cell execution_count=5}\n``` {.python .cell-code}\nfrom nltk.tokenize import wordpunct_tokenize\nfrom string import punctuation\n\ntokenized_sentence = wordpunct_tokenize(sentence)\ntokenized_sentence = [t for t in tokenized_sentence if t not in punctuation]\nprint(tokenized_sentence)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science']\n```\n:::\n:::\n\n\n## Advanced word tokenization\n\nTODO: Write\n\n", + "markdown": "---\ntitle: Tokenization\nformat:\n html:\n code-fold: false\n---\n\nTODO: Some introductory sentence.\n\n## Simple word tokenization\nA key element for a computer to understand the words we speak or type is the concept of word tokenization. \nFor a human, the sentence \n\n::: {#4561108d .cell execution_count=1}\n``` {.python .cell-code}\nsentence = \"I love reading science fiction books or books about science.\"\n```\n:::\n\n\nis easy to understand since we are able to split the sentence into its individual parts in order to figure out the meaning of the full sentence.\nFor a computer, the sentence is just a simple string of characters, like any other word or longer text.\nIn order to make a computer understand the meaning of a sentence, we need to help break it down into its relevant parts.\n\nSimply put, word tokenization is the process of breaking down a piece of text into individual words or so-called tokens. 
\nIt is like taking a sentence and splitting it into smaller pieces, where each piece represents a word.\nWord tokenization involves analyzing the text character by character and identifying boundaries between words. \nIt uses various rules and techniques to decide where one word ends and the next one begins. \nFor example, spaces, punctuation marks, and special characters often serve as natural boundaries between words.\n\nSo let's start breaking down the sentence into its individual parts.\n\n::: {#36ccc15c .cell execution_count=2}\n``` {.python .cell-code}\ntokenized_sentence = sentence.split(\" \")\nprint(tokenized_sentence)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science.']\n```\n:::\n:::\n\n\nOnce we have tokenized the sentence, we can start anaylzing it with some simple statistical methods. \nFor example, in order to figure out what the sentence might be about, we could count the most frequent words. \n\n::: {#4b5bb732 .cell execution_count=3}\n``` {.python .cell-code}\nfrom collections import Counter\n\ntoken_counter = Counter(tokenized_sentence)\nprint(token_counter.most_common(2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[('books', 2), ('I', 1)]\n```\n:::\n:::\n\n\nUnfortunately, we already realize that we have not done the best job with our \"tokenizer\": The second occurence of the word `science` is missing do to the punctuation. \nWhile this is great as it holds information about the ending of a sentence, it disturbs our analysis here, so let's get rid of it. \n\n::: {#54b1654a .cell execution_count=4}\n``` {.python .cell-code}\ntokenized_sentence = sentence.replace(\".\", \" \").split(\" \")\n\ntoken_counter = Counter(tokenized_sentence)\nprint(token_counter.most_common(2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[('science', 2), ('books', 2)]\n```\n:::\n:::\n\n\nSo that worked.\nAs you can imagine, tokenization can get increasingly difficult when we have to deal with all sorts of situations in larger corpora of texts (see also the exercise). \nSo it is great that there are already all sorts of libraries available that can help us with this process. \n\n::: {#7a06e431 .cell execution_count=5}\n``` {.python .cell-code}\nfrom nltk.tokenize import wordpunct_tokenize\nfrom string import punctuation\n\ntokenized_sentence = wordpunct_tokenize(sentence)\ntokenized_sentence = [t for t in tokenized_sentence if t not in punctuation]\nprint(tokenized_sentence)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science']\n```\n:::\n:::\n\n\n## Advanced word tokenization\n\nTODO: Write\n\n\nFrom the docs: \n\nhttps://platform.openai.com/tokenizer\n\nA helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. 
This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).\n\n", "supporting": [ "tokenization_files" ], diff --git a/_quarto.yml b/_quarto.yml index 2ac1b49..5a734cd 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -12,7 +12,7 @@ format: website: page-navigation: true back-to-top-navigation: true - + navbar: tools: - icon: github @@ -56,6 +56,10 @@ website: contents: - llm/intro.qmd - llm/gpt.qmd + - llm/gpt_api.qmd + - llm/exercises/ex_gpt_start.ipynb + - llm/exercises/ex_gpt_chatbot.ipynb + - llm/prompting.qmd - llm/parameterization.qmd - llm/exercises/ex_gpt_parameterization.ipynb diff --git a/docs/about/assignment.html b/docs/about/assignment.html index 04f19b8..b9c2e54 100644 --- a/docs/about/assignment.html +++ b/docs/about/assignment.html @@ -249,6 +249,30 @@ + +
Task: Use the OpenAI embeddings API and simple embedding arithmetic to introduce more context to word similarities.
Instructions:
1. Get the embedding for `reptile` and add it to the embedding for `python`. You can use `numpy` for this.
2. Compare the similarity of `snake` with `python` and with this sum. What do you notice?

Solution
-= ["python", "snake", "javascript", "reptile"]
@@ -510,7 +541,7 @@ Exercise: Embedding similarity
embeddings = [emb.embedding for emb in response.data]
print(f"Similarity between '{words[0]}' and '{words[1]}': {round(cosine_similarity(embeddings[0], embeddings[1]), 3)}.")
@@ -523,6 +554,7 @@ Exercise: Embedding similarity
Similarity between 'python + reptile' and 'snake': 0.894.
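Putting the solution hunks above together, a runnable sketch of the whole exercise might look as follows. This is a reconstruction, not the repository's verbatim code: the authenticated `client`, the embedding model name `text-embedding-3-small`, and the inline `cosine_similarity` helper are assumptions, while the word list, the `embeddings` comprehension, and the two printed comparisons come from the hunks above.

```python
# Hedged reconstruction of the embedding-similarity exercise solution.
# Assumptions: OPENAI_API_KEY is set in the environment, and the embedding
# model name is a guess; the course repo may use a different one.
import numpy as np
from openai import OpenAI

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

client = OpenAI()

words = ["python", "snake", "javascript", "reptile"]
response = client.embeddings.create(input=words, model="text-embedding-3-small")
embeddings = [emb.embedding for emb in response.data]

# Baseline: how similar are 'python' and 'snake' on their own?
print(f"Similarity between '{words[0]}' and '{words[1]}': "
      f"{round(cosine_similarity(embeddings[0], embeddings[1]), 3)}.")

# Add the 'reptile' embedding to 'python' to push it toward the animal sense,
# then compare the shifted vector with 'snake'.
python_plus_reptile = np.array(embeddings[0]) + np.array(embeddings[3])
print(f"Similarity between 'python + reptile' and 'snake': "
      f"{round(cosine_similarity(python_plus_reptile, embeddings[1]), 3)}.")
```

If this sketch is close to the actual solution, the second similarity (0.894 in the output above) should exceed the plain 'python' vs. 'snake' similarity, since adding the `reptile` vector nudges the ambiguous `python` embedding toward its animal sense.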