Gymnasium Q-learning on CartPole environment #209

Open
wants to merge 1 commit into base: master
24 changes: 24 additions & 0 deletions 011/exercise/readme.md
@@ -0,0 +1,24 @@
# Problem Statement
In this problem, you will train a simple Q-learning agent from scratch in the Gymnasium CartPole-v1 environment. The goal is to implement tabular Q-learning so that the agent improves its performance through repeated interaction with the environment.

### You will:
- Set up the Gymnasium environment.
- Implement the Q-learning algorithm.
- Train the Q-learning agent.
- Evaluate the agent's performance.

### Methods to Use:
- Gymnasium Environment: Set up and use the CartPole-v1 environment.
- Q-Learning Algorithm: Implement the Q-learning algorithm, which involves updating a Q-table based on the agent's experiences (the update rule is shown just after this list).
- State Discretization: Discretize the continuous state space into a finite set of discrete states.
- Training Loop: Train the Q-learning agent over multiple episodes, updating the Q-table based on the rewards received.
- Evaluation: Evaluate the performance of the trained agent by measuring its average reward over a set of episodes.
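
For reference, the tabular Q-learning update rule referred to above (in its standard form) is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where $s$ and $a$ are the current (discretized) state and action, $r$ is the reward received, $s'$ is the next state, $\alpha$ is the learning rate and $\gamma$ is the discount factor.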

### Exercise Steps:
- Set Up the Gymnasium Environment: Initialize the CartPole-v1 environment.
- Define the state space and action space.
- Initialize the Q-table with zeros.
- Define the hyperparameters: learning rate (alpha), discount factor (gamma), exploration rate (epsilon), and exploration decay rate.
- Create a function to discretize the continuous state space into discrete bins.
- Implement the training loop where the agent interacts with the environment, updates the Q-table, and decays the exploration rate (a skeleton of this loop is sketched after this list).
- Measure the agent's performance by calculating the average reward over a set of evaluation episodes.
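
Below is one possible skeleton of the set-up, discretization and training loop described above. The bin ranges, hyperparameter values and helper names are illustrative assumptions, not a required solution.

```python
import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1")
n_bins = 30                                               # bins per observation dimension (assumption)
q_table = np.zeros([n_bins] * 4 + [env.action_space.n])
alpha, gamma, epsilon, eps_decay, eps_min = 0.1, 0.99, 1.0, 0.995, 0.01

# Bin edges per dimension; the velocity dimensions are unbounded, so out-of-range
# values simply fall into the outermost bins returned by np.digitize.
edges = [np.linspace(-4.8, 4.8, n_bins - 1),              # cart position
         np.linspace(-4.0, 4.0, n_bins - 1),              # cart velocity
         np.linspace(-0.418, 0.418, n_bins - 1),          # pole angle (radians)
         np.linspace(-4.0, 4.0, n_bins - 1)]              # pole angular velocity

def discretize_state(obs):
    """Map a continuous observation to a tuple of bin indices."""
    return tuple(np.digitize(obs[i], edges[i]) for i in range(len(obs)))

for episode in range(10_000):
    state = discretize_state(env.reset()[0])
    done = trunc = False
    while not done and not trunc:
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])
        obs, reward, done, trunc, _ = env.step(action)
        next_state = discretize_state(obs)
        # Tabular Q-learning update toward the bootstrapped target
        q_table[state][action] += alpha * (reward + gamma * np.max(q_table[next_state]) - q_table[state][action])
        state = next_state
    epsilon = max(eps_min, epsilon * eps_decay)           # decay exploration once per episode
```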
143 changes: 143 additions & 0 deletions 011/solutions/q_learning.ipynb
@@ -0,0 +1,143 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import gymnasium as gym\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Initialize the Gym environment\n",
"env = gym.make(\"CartPole-v1\")\n",
"\n",
"# Set up the Q-table\n",
"num_features = env.observation_space.shape[0]\n",
"state_space = [30] * num_features\n",
"q_table = np.zeros(state_space + [env.action_space.n])\n",
"\n",
"# Define hyperparameters\n",
"alpha = 0.1 # Learning rate\n",
"gamma = 0.99 # Discount factor\n",
"epsilon = 1.0 # Exploration rate\n",
"epsilon_decay = 0.995 \n",
"min_epsilon = 0.01"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Discretize the state space\n",
"def discretize_state(state):\n",
"    # state_space[i] - 1 bin edges give indices 0..29, matching the 30-bin axes of the Q-table\n",
"    bins = [np.linspace(-4.8, 4.8, state_space[0] - 1),     # cart position\n",
"            np.linspace(-4, 4, state_space[1] - 1),         # cart velocity (unbounded; outliers land in the edge bins)\n",
"            np.linspace(-0.418, 0.418, state_space[2] - 1), # pole angle (radians)\n",
"            np.linspace(-4, 4, state_space[3] - 1)]         # pole angular velocity\n",
"    return tuple(np.digitize(state[i], bins[i]) for i in range(len(state)))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Training the Q-learning agent\n",
"num_episodes = 10000\n",
"for episode in range(num_episodes):\n",
"    # Discretise the initial state\n",
"    state = discretize_state(env.reset()[0])\n",
"    done = trunc = False\n",
"\n",
"    while not done and not trunc:\n",
"        # Epsilon-greedy action selection\n",
"        if np.random.random() < epsilon:\n",
"            action = env.action_space.sample()\n",
"        else:\n",
"            action = np.argmax(q_table[state])\n",
"\n",
"        next_state, reward, done, trunc, _ = env.step(action)\n",
"        next_state = discretize_state(next_state)\n",
"\n",
"        # Penalise termination (pole fallen); CartPole's step reward is always 1, so test done directly\n",
"        if done:\n",
"            reward = -100\n",
"\n",
"        # Q-learning update\n",
"        q_table[state][action] = q_table[state][action] + alpha * (reward + gamma * np.max(q_table[next_state]) - q_table[state][action])\n",
"        state = next_state\n",
"\n",
"    # Decay epsilon after each episode until it reaches min_epsilon\n",
"    if epsilon > min_epsilon:\n",
"        epsilon *= epsilon_decay"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average reward over 100 episodes: 151.74\n"
]
}
],
"source": [
"# Evaluate the agent\n",
"total_rewards = 0\n",
"for episode in range(100):\n",
"    state = discretize_state(env.reset()[0])\n",
"    done = trunc = False\n",
"    # Track truncation as well, so the loop ends at the 500-step limit\n",
"    while not done and not trunc:\n",
"        action = np.argmax(q_table[state])\n",
"        next_state, reward, done, trunc, _ = env.step(action)\n",
"        state = discretize_state(next_state)\n",
"        total_rewards += reward\n",
"\n",
"print(f\"Average reward over 100 episodes: {total_rewards / 100}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
14 changes: 14 additions & 0 deletions 011/solutions/readme.md
@@ -0,0 +1,14 @@
# My solution

See the following Jupyter notebook:

- `q_learning.ipynb`: Jupyter notebook with the implementation of the Q-learning algorithm.

[![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/011/solutions/q_learning.ipynb)
[![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/011/solutions/q_learning.ipynb)

The first part of the notebook initialises the Gymnasium environment, the Q-table and the training hyperparameters.

Then, a function is defined to discretise the continuous state space into discrete bins.

Following this, the Q-table is updated through the training loop. Lastly, the agent is evaluated by its average reward over 100 evaluation episodes.
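
As an optional extra (not part of the committed notebook), a single greedy episode of the trained agent can be rendered with Gymnasium's `render_mode="human"`. The sketch below assumes it is run as an additional cell at the end of `q_learning.ipynb`, where `q_table` and `discretize_state` are already defined.

```python
# Watch one greedy episode of the trained agent
# (assumes q_table and discretize_state from the notebook are in scope).
import gymnasium as gym
import numpy as np

render_env = gym.make("CartPole-v1", render_mode="human")
state = discretize_state(render_env.reset()[0])
done = trunc = False
episode_return = 0
while not done and not trunc:
    action = np.argmax(q_table[state])        # greedy action from the learned Q-table
    obs, reward, done, trunc, _ = render_env.step(action)
    state = discretize_state(obs)
    episode_return += reward
print(f"Episode return: {episode_return}")
render_env.close()
```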