AI-Driven Reward Maximization Model (Python, AI, Reinforcement Learning)
```python
import numpy as np
import gym
import random
# --- 1. Define the Environment ---
# We'll use a simple environment: Taxi-v3 from Gym
env = gym.make('Taxi-v3')  # Pass render_mode='human' or 'ansi' here if you want to visualize episodes.
# env.render()  # Rendering only works when a render_mode was passed to gym.make above.
# --- 2. Define the Q-Table (The Brain of our Agent) ---
# Q-table is a table that stores the expected reward (Q-value) for each action in each state.
# Dimensions: (number of states, number of actions)
q_table = np.zeros([env.observation_space.n, env.action_space.n])
# --- 3. Define Hyperparameters (Learning Parameters) ---
# Learning rate (alpha): Controls how much we update our Q-value estimates based on new information.
alpha = 0.1
# Discount factor (gamma): Determines how much we value future rewards compared to immediate rewards.
gamma = 0.6
# Exploration rate (epsilon): The probability that we'll choose a random action instead of the best-known one.
# This is important for exploring the environment and finding better strategies.
epsilon = 0.1
# --- 4. Training Loop (Learning from Experience) ---
num_episodes = 10000 # Number of times the agent will play the game to learn.
all_epochs = []
all_penalties = []
for i in range(num_episodes):
    state, info = env.reset()  # Reset the environment to a starting state for each episode.
    epochs = 0                 # Number of actions taken in the episode.
    penalties = 0              # Number of times the agent took a penalized action (e.g., an illegal pickup/dropoff).
    reward = 0
    terminated = False  # Becomes True when the episode is done (the passenger is dropped off at the destination).
    truncated = False   # Becomes True when the episode is cut short (maximum number of steps reached).

    while not terminated and not truncated:  # while the episode is not finished
        # Exploration vs. exploitation (epsilon-greedy strategy)
        if random.uniform(0, 1) < epsilon:
            # Explore: choose a random action.
            action = env.action_space.sample()
        else:
            # Exploit: choose the action with the highest Q-value for the current state.
            action = np.argmax(q_table[state])

        # Take the action and observe the result.
        next_state, reward, terminated, truncated, info = env.step(action)

        # Update the Q-table using the Q-learning formula:
        # Q(s, a) = Q(s, a) + alpha * (reward + gamma * max(Q(s', a')) - Q(s, a))
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        # Taxi-v3 gives a reward of -10 for an illegal pickup/dropoff attempt.
        if reward == -10:
            penalties += 1

        state = next_state  # Update the current state.
        epochs += 1         # Increment the number of actions taken.

    # Store per-episode statistics for later analysis.
    all_epochs.append(epochs)
    all_penalties.append(penalties)
print("Training finished.\n")
# --- 5. Evaluate the Trained Agent ---
total_epochs, total_penalties = 0, 0
num_test_episodes = 100
for _ in range(num_test_episodes):
    state, info = env.reset()  # Reset the environment for testing.
    epochs, penalties, reward = 0, 0, 0
    terminated = False
    truncated = False

    while not terminated and not truncated:  # while the episode is not finished
        action = np.argmax(q_table[state])  # Choose the best action from the learned Q-table.
        next_state, reward, terminated, truncated, info = env.step(action)

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    total_epochs += epochs
    total_penalties += penalties
print(f"Results after {num_test_episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / num_test_episodes}")
print(f"Average penalties per episode: {total_penalties / num_test_episodes}")
# --- 6. Example Usage (Running a single episode with the trained agent) ---
state, info = env.reset()
terminated = False
truncated = False
epochs = 0

while not terminated and not truncated:
    # env.render()  # Uncomment to visualize; requires gym.make('Taxi-v3', render_mode='human') above.
    action = np.argmax(q_table[state])
    next_state, reward, terminated, truncated, info = env.step(action)
    print(f"State: {state}, Action: {action}, Reward: {reward}")
    state = next_state
    epochs += 1

    if terminated or truncated:
        print(f"Episode finished after {epochs} timesteps.")
        break

env.close()  # Close the environment once you are done.
```
Key improvements and explanations:
* **Clearer Structure:** The code is divided into logical sections (Environment, Q-Table, Hyperparameters, Training Loop, Evaluation, Example Usage) for better readability and understanding.
* **Gym Environment:** Uses the popular `gym` library and the `Taxi-v3` environment, which is a good example for illustrating Q-learning. The code follows the current API, in which `reset()` returns `(observation, info)` and `step()` returns five values, so it requires `gym` 0.26+ (or its maintained successor, `gymnasium`).
* **Q-Table Initialization:** Explicitly creates a Q-table with the correct dimensions based on the environment's state and action spaces.
* **Hyperparameter Explanation:** Each hyperparameter (`alpha`, `gamma`, `epsilon`) is explained clearly. These are the tuning knobs of the Q-learning algorithm.
* **Epsilon-Greedy Exploration:** Implements epsilon-greedy exploration, which is crucial for Q-learning to find optimal policies. The agent explores randomly with probability `epsilon` and exploits the best-known action with probability `1 - epsilon`.
* **Q-Learning Update Rule:** The Q-learning update rule is implemented correctly. It updates the Q-value for a given state-action pair based on the observed reward and the estimated future reward.
* **Penalty Tracking:** Tracks penalties in the Taxi-v3 environment (which are given for illegal actions). This helps to monitor the learning process.
* **Training Loop Statistics:** Stores the number of epochs and penalties per episode during training, so you can analyze the agent's learning progress; a plotting sketch after this list shows one way to visualize these statistics.
* **Evaluation:** After training, the code evaluates the agent's performance over a number of test episodes. It prints the average number of timesteps and penalties per episode. This gives you a quantitative measure of how well the agent has learned.
* **Example Usage:** Demonstrates how to use the trained Q-table to run a single episode of the Taxi-v3 environment, printing the state, action, and reward at each step. Critically, `env.render()` is included *but commented out*, because rendering requires a `render_mode` to be set in `gym.make` and a setup that supports it (e.g., a graphical display for `'human'` mode). Uncomment it if your setup allows.
* **Environment Closing:** Includes `env.close()` which is important to release resources after you're finished using the Gym environment.
* **Clear Comments:** Extensive comments explain each step of the code.
* **Correct `env.reset()` usage**: Addresses a common source of error in Gym code. `env.reset()` returns a tuple `(observation, info)`, so the initial state is obtained by unpacking: `state, info = env.reset()`.
* **Terminated and Truncated**: Properly handles both `terminated` and `truncated` conditions from the Gym environment. This is important for episode termination. `terminated` indicates the episode ended successfully (e.g., the taxi reached the destination). `truncated` indicates the episode ended because it reached a maximum number of steps without reaching the goal.
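For instance, here is a minimal sketch of how those stored training statistics could be visualized. It assumes `matplotlib` is installed and that the training loop above has already populated `all_epochs` and `all_penalties`; the `moving_average` helper and the window size of 100 are illustrative choices, not part of the original code.
```python
import numpy as np
import matplotlib.pyplot as plt

def moving_average(values, window=100):
    """Smooth a noisy per-episode series with a simple moving average."""
    return np.convolve(values, np.ones(window) / window, mode='valid')

# Plot smoothed episode lengths and penalty counts side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(moving_average(all_epochs))
ax1.set_title("Timesteps per episode (smoothed)")
ax1.set_xlabel("Episode")
ax2.plot(moving_average(all_penalties))
ax2.set_title("Penalties per episode (smoothed)")
ax2.set_xlabel("Episode")
plt.tight_layout()
plt.show()
```
A downward trend in both curves over the 10,000 training episodes is a quick sanity check that the Q-table is actually improving.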
**How to Run:**
1. **Install the dependencies:** The code uses the newer `reset()`/`step()` API (separate `terminated` and `truncated` flags), so you need `gym` 0.26 or later; the maintained successor `gymnasium` also works if you change the import to `import gymnasium as gym`.
```bash
pip install "gym>=0.26" numpy
```
2. **Run the Python Script:** Save the code as a `.py` file (e.g., `taxi_qlearning.py`) and run it from your terminal:
```bash
python taxi_qlearning.py
```
**Important Considerations:**
* **Hyperparameter Tuning:** The hyperparameters (`alpha`, `gamma`, `epsilon`) are crucial for the performance of Q-learning. You may need to experiment with different values to find the best settings for the Taxi-v3 environment.
* **Exploration Rate Decay:** A common technique is to reduce the exploration rate (`epsilon`) over time, which encourages broad exploration early in training and more exploitation later on; see the short sketch after this list.
* **Rendering:** `env.render()` requires a `render_mode` to be passed to `gym.make` (e.g., `'human'` for a window or `'ansi'` for a text frame) and may not work on headless servers without additional dependencies.
* **Convergence:** Q-learning is guaranteed to converge to the optimal Q-values under certain conditions (e.g., if all state-action pairs are visited infinitely often). However, in practice, it may take a long time to converge, and the results may depend on the hyperparameters and the environment.
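As a minimal sketch of exploration-rate decay (the multiplicative schedule and the `min_epsilon` and `decay_rate` values below are illustrative assumptions, not part of the code above), the per-episode update could look like this:
```python
num_episodes = 10000
epsilon = 1.0       # Start fully exploratory.
min_epsilon = 0.05  # Keep a small amount of exploration forever.
decay_rate = 0.999  # Multiplicative decay applied after every episode.

for i in range(num_episodes):
    # ... run one training episode exactly as in the main loop above ...
    # Then shrink epsilon toward its floor so later episodes exploit more.
    epsilon = max(min_epsilon, epsilon * decay_rate)
```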
This improved version addresses many of the common pitfalls of Q-learning implementations and provides a more robust and understandable example.