Reinforcement Learning
Reinforcement Learning (RL) studies how an agent can learn good behavior by acting in an environment and receiving rewards or penalties.
Unlike supervised learning, the agent is not handed the correct answer for each state. It often receives delayed feedback, so the goal is to learn a policy: a rule for choosing actions that maximizes expected cumulative reward over time.
Core Components
- The Agent: The learner or decision-maker.
- The Environment: The physical or virtual world the agent interacts with.
- The State ($s$): The current configuration of the environment.
- The Action ($a$): The choices available to the agent.
- The Reward ($r$): The numerical feedback signaling success or failure.
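The components listed above fit together in a simple interaction loop. The sketch below is only an illustration of that loop: `CorridorEnv`, its `reset` and `step` methods, and the reward values are hypothetical names and numbers chosen for this example, not part of any particular library.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D corridor: states 0..4, goal at state 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right (clipped at the walls).
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1  # small step penalty, bonus at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total (undiscounted) reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent picks an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


random.seed(0)
random_return = run_episode(CorridorEnv(), lambda s: random.choice([0, 1]))
always_right_return = run_episode(CorridorEnv(), lambda s: 1)
print(random_return, always_right_return)
```

The policy that always moves right reaches the goal in four steps, so it can never score lower than the random policy in this toy environment. Learning, in this framing, means finding such a policy from the reward signal alone.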
Intuition
How to think about this algorithm
Imagine training a robot to cross a room. It tries actions, sees whether it moves closer to the goal, hits obstacles, or reaches the target, and updates its behavior from those outcomes.
An RL agent starts by exploring, then gradually exploits actions that have produced better long-term results. The hard part is that an action can look bad immediately but be useful several steps later.
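One common way to balance exploring and exploiting is epsilon-greedy action selection. The sketch below assumes the agent keeps a list of value estimates per action; the function name `epsilon_greedy` and the example numbers are illustrative choices, not taken from the text above.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    best = max(q_values)
    return q_values.index(best)              # exploit (first best on ties)


q_values = [0.1, 0.5, 0.2, 0.0]  # hypothetical estimates for 4 actions
rng = random.Random(0)
picks = [epsilon_greedy(q_values, epsilon=0.2, rng=rng) for _ in range(1000)]
greedy_share = picks.count(1) / len(picks)
print(f"greedy action chosen {greedy_share:.0%} of the time")
```

With `epsilon=0.2` and four actions, the greedy action is expected to be chosen about 85% of the time: 80% from exploitation plus a quarter of the 20% random picks.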
This is the credit assignment problem: if a reward arrives at the end of a sequence, which earlier actions deserve credit? Temporal-difference learning handles this by updating estimates one transition at a time.
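To see how one-transition-at-a-time updates spread credit backward, consider a small sketch. It assumes a fixed three-step path with a reward only on the last transition and uses state-value TD(0); the step size, discount, and chain layout are illustrative assumptions.

```python
# Chain of states 0 -> 1 -> 2 -> 3 (terminal); only the last transition pays.
gamma, alpha = 0.9, 0.5
V = [0.0, 0.0, 0.0, 0.0]  # value estimates; V[3] stays 0 (terminal)
transitions = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3)]  # (state, reward, next_state)

history = []
for episode in range(3):
    for state, reward, next_state in transitions:
        td_error = reward + gamma * V[next_state] - V[state]
        V[state] += alpha * td_error  # one-step TD(0) update
    history.append(list(V))

for episode, values in enumerate(history, start=1):
    print(episode, [round(v, 3) for v in values])
```

After the first episode only state 2 has a non-zero estimate. State 1 picks up credit in the second episode and state 0 in the third, so the final reward gradually reaches the earlier decisions.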
The Logic
Mathematical core for reinforcement learning
1. Markov Decision Processes (MDP)
An RL problem is formally modeled as a Markov Decision Process, defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where:
- $\mathcal{S}$ is the state space.
- $\mathcal{A}$ is the action space.
- $P(s' \mid s, a)$ is the probability of transitioning to state $s'$ given state $s$ and action $a$.
- $R(s, a)$ is the reward function.
- $\gamma \in [0, 1]$ is the discount factor, which reduces the value of future rewards compared to immediate ones.
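The effect of the discount factor is easiest to see on a concrete reward sequence. The sketch below computes the discounted return, the sum of `gamma**t * r_t` over time steps `t`, for a made-up episode; the helper name `discounted_return` and the reward values are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence, computed back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [0.0, 0.0, 0.0, 1.0]  # hypothetical episode: reward only at the end
returns = {gamma: discounted_return(rewards, gamma) for gamma in (1.0, 0.9, 0.5)}
for gamma, g in returns.items():
    print(gamma, round(g, 4))
```

The same final reward is worth less from the start of the episode as the discount factor shrinks: it is multiplied by `gamma**3` here, because it arrives three steps after the first reward.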
2. The Bellman Optimality Equation
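The value of taking action $a$ in state $s$ under an optimal policy is defined by the Q-value, $Q^*(s, a)$:
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$$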
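When the transition probabilities and rewards are known, the Bellman optimality relation can be applied repeatedly as an update until the Q-values stop changing (Q-value iteration). The sketch below assumes a made-up MDP with two states and two actions; the dictionaries `P` and `R` and all their numbers are hypothetical.

```python
# Hypothetical MDP with 2 states and 2 actions.
# P[s][a] is a list of (probability, next_state); R[s][a] is the expected reward.
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 0)], 1: [(1.0, 1)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}
gamma = 0.9

Q = {s: {a: 0.0 for a in P[s]} for s in P}
for _ in range(1000):
    # Bellman optimality backup: reward plus discounted expected best next value.
    new_Q = {
        s: {
            a: R[s][a] + gamma * sum(p * max(Q[s2].values()) for p, s2 in P[s][a])
            for a in P[s]
        }
        for s in P
    }
    delta = max(abs(new_Q[s][a] - Q[s][a]) for s in P for a in P[s])
    Q = new_Q
    if delta < 1e-10:
        break

greedy_policy = {s: max(Q[s], key=Q[s].get) for s in Q}
print(Q, greedy_policy)
```

In this toy problem, staying in state 1 with action 1 earns a reward of 1 on every step, so its value converges to 1 / (1 - 0.9) = 10. The greedy policy chooses action 1 in both states.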
3. Q-Learning Algorithm
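In the model-free setting (where the transition probabilities are unknown), the agent estimates $Q^*(s, a)$ iteratively by exploring:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$$
Where $\alpha$ is the learning rate, $r$ is the reward observed after taking action $a$ in state $s$, $s'$ is the resulting next state, and the term in parentheses is the Temporal Difference (TD) Error.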
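A single update is easy to follow with concrete numbers. The values below (current estimate, observed reward, next-state estimates, learning rate, and discount factor) are invented for illustration only.

```python
alpha, gamma = 0.1, 0.95       # learning rate and discount factor (assumed values)
q_sa = 0.5                     # current estimate for the (state, action) pair
reward = 1.0                   # reward observed after taking the action
next_q = [0.2, 2.0, 1.5, 0.0]  # current estimates for each action in the next state

td_target = reward + gamma * max(next_q)  # 1.0 + 0.95 * 2.0 = 2.9
td_error = td_target - q_sa               # 2.9 - 0.5 = 2.4
q_sa_new = q_sa + alpha * td_error        # 0.5 + 0.1 * 2.4 = 0.74
print(round(td_target, 4), round(td_error, 4), round(q_sa_new, 4))
```

The estimate moves only a tenth of the way toward the target because the learning rate is 0.1. Repeated visits to the same state-action pair keep closing the gap.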
Code Example
reinforcement_learning.py · reference implementation
import numpy as np

# Simple Q-table update step for Q-learning
states_n, actions_n = 16, 4
Q_table = np.zeros((states_n, actions_n))

alpha = 0.1    # Learning rate
gamma = 0.95   # Discount factor
epsilon = 0.2  # Exploration rate (for action selection; not used in this update step)

def update_q(state, action, reward, next_state):
    # Temporal-difference target: reward plus discounted value of the best next action
    best_next_action = np.argmax(Q_table[next_state])
    td_target = reward + gamma * Q_table[next_state, best_next_action]

    # Move Q(state, action) a fraction alpha toward the target
    Q_table[state, action] += alpha * (td_target - Q_table[state, action])

Strengths
No labeled dataset is required; the agent learns directly from interacting with the environment.
Capable of solving complex sequential planning tasks that require long-term strategy (e.g., Chess, Go, robotic locomotion).
Can adapt to non-stationary environments, because the agent keeps updating its estimates as new feedback arrives.
Limitations
Often sample-inefficient; agents can require millions of environment steps to converge, even on tasks that seem simple.
Sensitive to hyperparameter settings (learning rate, discount factor, exploration rate) and reward design.
The exploration vs. exploitation trade-off is hard to balance; too little exploration can leave the agent stuck in suboptimal behavior.
Key Assumptions
Scope conditions and interpretation notes
1. The environment satisfies the Markov property: the next state depends only on the current state and action, not on the earlier history.
2. The reward function is designed to incentivize the intended behavior without creating loopholes the agent can exploit, such as loops that collect reward without completing the task.
References
Books and papers for deeper study
Sutton, R. S. and Barto, A. G. (2018) Reinforcement Learning: An Introduction. 2nd edn. Cambridge, MA: MIT Press.