Gymnasium Q-learning on CartPole environment #209

Open
wants to merge 1 commit into base: master
24 changes: 24 additions & 0 deletions 011/exercise/readme.md
@@ -0,0 +1,24 @@
# Problem Statement
In this problem, you will train a simple Q-learning agent from scratch in the Gymnasium CartPole-v1 environment. The goal is to implement tabular Q-learning so that the agent improves its performance through repeated interaction with the environment.

### You will:
- Set up the Gymnasium environment.
- Implement the Q-learning algorithm.
- Train the Q-learning agent.
- Evaluate the agent's performance.

### Methods to Use:
- Gymnasium Environment: Set up and use the CartPole-v1 environment.
- Q-Learning Algorithm: Implement the Q-learning algorithm, which involves updating a Q-table based on the agent's experiences (the update rule is shown just after this list).
- State Discretization: Discretize the continuous state space into a finite set of discrete states.
- Training Loop: Train the Q-learning agent over multiple episodes, updating the Q-table based on the rewards received.
- Evaluation: Evaluate the performance of the trained agent by measuring its average reward over a set of episodes.
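
For reference, the tabular Q-learning update rule referred to above (in its standard form) is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where $s$ and $a$ are the current (discretized) state and action, $r$ is the reward received, $s'$ is the next state, $\alpha$ is the learning rate and $\gamma$ is the discount factor.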

### Exercise Steps:
- Set Up the Gymnasium Environment: Initialize the CartPole-v1 environment.
- Define the state space and action space.
- Initialize the Q-table with zeros.
- Define the hyperparameters: learning rate (alpha), discount factor (gamma), exploration rate (epsilon), and exploration decay rate.
- Create a function to discretize the continuous state space into discrete bins.
- Implement the training loop where the agent interacts with the environment, updates the Q-table, and decays the exploration rate (a skeleton of this loop is sketched after this list).
- Measure the agent's performance by calculating the average reward over a set of evaluation episodes.
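
Below is one possible skeleton of the set-up, discretization and training loop described above. The bin ranges, hyperparameter values and helper names are illustrative assumptions, not a required solution.

```python
import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1")
n_bins = 30                                               # bins per observation dimension (assumption)
q_table = np.zeros([n_bins] * 4 + [env.action_space.n])
alpha, gamma, epsilon, eps_decay, eps_min = 0.1, 0.99, 1.0, 0.995, 0.01

# Bin edges per dimension; the velocity dimensions are unbounded, so out-of-range
# values simply fall into the outermost bins returned by np.digitize.
edges = [np.linspace(-4.8, 4.8, n_bins - 1),              # cart position
         np.linspace(-4.0, 4.0, n_bins - 1),              # cart velocity
         np.linspace(-0.418, 0.418, n_bins - 1),          # pole angle (radians)
         np.linspace(-4.0, 4.0, n_bins - 1)]              # pole angular velocity

def discretize_state(obs):
    """Map a continuous observation to a tuple of bin indices."""
    return tuple(np.digitize(obs[i], edges[i]) for i in range(len(obs)))

for episode in range(10_000):
    state = discretize_state(env.reset()[0])
    done = trunc = False
    while not done and not trunc:
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])
        obs, reward, done, trunc, _ = env.step(action)
        next_state = discretize_state(obs)
        # Tabular Q-learning update toward the bootstrapped target
        q_table[state][action] += alpha * (reward + gamma * np.max(q_table[next_state]) - q_table[state][action])
        state = next_state
    epsilon = max(eps_min, epsilon * eps_decay)           # decay exploration once per episode
```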
143 changes: 143 additions & 0 deletions 011/solutions/q_learning.ipynb
@@ -0,0 +1,143 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import gymnasium as gym\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Initialize the Gym environment\n",
"env = gym.make(\"CartPole-v1\")\n",
"\n",
"# Set up the Q-table\n",
"num_features = env.observation_space.shape[0]\n",
"state_space = [30] * num_features\n",
"q_table = np.zeros(state_space + [env.action_space.n])\n",
"\n",
"# Define hyperparameters\n",
"alpha = 0.1 # Learning rate\n",
"gamma = 0.99 # Discount factor\n",
"epsilon = 1.0 # Exploration rate\n",
"epsilon_decay = 0.995 \n",
"min_epsilon = 0.01"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Discretize the state space\n",
"def discretize_state(state):\n",
"    # state_space[i] - 1 bin edges give indices 0..29, matching the 30-bin axes of the Q-table\n",
"    bins = [np.linspace(-4.8, 4.8, state_space[0] - 1),     # cart position\n",
"            np.linspace(-4, 4, state_space[1] - 1),         # cart velocity (unbounded; outliers land in the edge bins)\n",
"            np.linspace(-0.418, 0.418, state_space[2] - 1), # pole angle (radians)\n",
"            np.linspace(-4, 4, state_space[3] - 1)]         # pole angular velocity\n",
"    return tuple(np.digitize(state[i], bins[i]) for i in range(len(state)))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Training the Q-learning agent\n",
"num_episodes = 10000\n",
"for episode in range(num_episodes):\n",
"    # Discretise the initial state\n",
"    state = discretize_state(env.reset()[0])\n",
"    done = trunc = False\n",
"\n",
"    while not done and not trunc:\n",
"        # Epsilon-greedy action selection\n",
"        if np.random.random() < epsilon:\n",
"            action = env.action_space.sample()\n",
"        else:\n",
"            action = np.argmax(q_table[state])\n",
"\n",
"        next_state, reward, done, trunc, _ = env.step(action)\n",
"        next_state = discretize_state(next_state)\n",
"\n",
"        # Penalise termination (pole fallen); CartPole's step reward is always 1, so test done directly\n",
"        if done:\n",
"            reward = -100\n",
"\n",
"        # Q-learning update\n",
"        q_table[state][action] = q_table[state][action] + alpha * (reward + gamma * np.max(q_table[next_state]) - q_table[state][action])\n",
"        state = next_state\n",
"\n",
"    # Decay epsilon after each episode until it reaches min_epsilon\n",
"    if epsilon > min_epsilon:\n",
"        epsilon *= epsilon_decay"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average reward over 100 episodes: 151.74\n"
]
}
],
"source": [
"# Evaluate the agent\n",
"total_rewards = 0\n",
"for episode in range(100):\n",
"    state = discretize_state(env.reset()[0])\n",
"    done = trunc = False\n",
"    # Track truncation as well, so the loop ends at the 500-step limit\n",
"    while not done and not trunc:\n",
"        action = np.argmax(q_table[state])\n",
"        next_state, reward, done, trunc, _ = env.step(action)\n",
"        state = discretize_state(next_state)\n",
"        total_rewards += reward\n",
"\n",
"print(f\"Average reward over 100 episodes: {total_rewards / 100}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
14 changes: 14 additions & 0 deletions 011/solutions/readme.md
@@ -0,0 +1,14 @@
# My solution

See the following Jupyter notebook:

- `q_learning.ipynb`: Jupyter notebook with the implementation of the Q-learning algorithm.

[![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/011/solutions/q_learning.ipynb)
[![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/011/solutions/q_learning.ipynb)

The first part of the notebook initialises the Gymnasium environment, the Q-table and the training hyperparameters.

Then, a function is defined to discretise the continuous state space into discrete bins.

Following this, the Q-table is updated through the training loop. Lastly, the agent is evaluated by its average reward over 100 evaluation episodes.
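
As an optional extra (not part of the committed notebook), a single greedy episode of the trained agent can be rendered with Gymnasium's `render_mode="human"`. The sketch below assumes it is run as an additional cell at the end of `q_learning.ipynb`, where `q_table` and `discretize_state` are already defined.

```python
# Watch one greedy episode of the trained agent
# (assumes q_table and discretize_state from the notebook are in scope).
import gymnasium as gym
import numpy as np

render_env = gym.make("CartPole-v1", render_mode="human")
state = discretize_state(render_env.reset()[0])
done = trunc = False
episode_return = 0
while not done and not trunc:
    action = np.argmax(q_table[state])        # greedy action from the learned Q-table
    obs, reward, done, trunc, _ = render_env.step(action)
    state = discretize_state(obs)
    episode_return += reward
print(f"Episode return: {episode_return}")
render_env.close()
```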