Reinforcement Learning: From Theory to Practice - A Mathematical Journey


Introduction

Remember learning to ride a bike? You probably fell a few times, adjusted your balance, and eventually figured it out through trial and error. That's essentially what reinforcement learning does - but for machines.

Unlike supervised learning where we show the model labeled examples ("this is a cat, this is a dog"), RL agents learn by doing. They take actions, see what happens, get rewards or penalties, and gradually figure out what works. It's how AlphaGo learned to beat world champions and how robots learn to walk.

In this guide, we'll explore the math behind RL. I promise to keep it practical - we'll focus on the core concepts you actually need to understand, without drowning in excessive formalism. Let's dive in!

What is Reinforcement Learning?

Think of RL as teaching a dog new tricks. You don't explain in words what to do - instead, you reward good behavior (treats!) and discourage bad behavior. Over time, the dog learns what actions lead to treats.

In RL, we have:

  • Agent: The learner (like our dog, or an AI)
  • Environment: The world it interacts with
  • Actions: What it can do
  • Rewards: Feedback signals (positive or negative)

The agent tries different actions, sees what rewards it gets, and learns to choose actions that maximize total reward over time. Simple concept, powerful results!

MDPs: Making it Mathematical

Here's where we get a bit formal (but I promise it's worth it). The mathematical framework behind RL is called a Markov Decision Process (MDP).

Formal Definition

An MDP is defined by a 5-tuple, M = (S, A, P, R, \gamma), whose pieces are listed below (with a small code sketch right after the list):

  • S: Set of states (all possible situations the agent can be in)
  • A: Set of actions (all possible choices the agent can make)
  • P: Transition probability function P(s' \mid s,a) - the probability of reaching state s' from state s by taking action a. You may sometimes encounter it written as T(s,a,s') in the literature.
  • R: Reward function R(s,a,s') - immediate reward for transitioning from s to s' via action a
  • \gamma: Discount factor (0 \le \gamma \le 1) - determines the importance of future rewards
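
To make these pieces concrete, here is a minimal sketch of how a tiny, made-up MDP could be written down as plain Python dictionaries (the states, actions, and numbers are purely illustrative):

```python
# A tiny, hypothetical two-state MDP spelled out as plain dictionaries.
states = ["home", "work"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor

# P[s][a] maps each possible next state s' to P(s' | s, a)
P = {
    "home": {"stay": {"home": 1.0}, "move": {"work": 0.8, "home": 0.2}},
    "work": {"stay": {"work": 1.0}, "move": {"home": 0.8, "work": 0.2}},
}

# R[s][a][s'] is the immediate reward for the transition (s, a, s')
R = {
    "home": {"stay": {"home": 0.0}, "move": {"work": 1.0, "home": 0.0}},
    "work": {"stay": {"work": 1.0}, "move": {"home": 0.0, "work": 1.0}},
}
```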

The Key Idea: The Markov Property

The "Markov" part means the future only depends on where you are now, not how you got there. If you're playing chess, the current board position tells you everything - you don't need to remember every move that led to this position.

Mathematically:

P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)

This "memoryless" property makes the math tractable while still capturing most real-world problems surprisingly well.

Quick Recap: An MDP is just a formal way to describe a decision-making problem where outcomes are partially random and partially controllable. It's defined by states, actions, transition probabilities, rewards, and a discount factor.

Policies: Your Strategy for Success

A policy is just your strategy - it tells you what action to take in each situation.

You can have:

  • Deterministic policy: Always do the same thing in the same situation (e.g., "always turn left at the red door"). This can be encoded simply as a dictionary mapping states to actions.
\pi: S \to A
  • Stochastic policy: Sometimes do different things (e.g., "70% of the time turn left, 30% turn right"). This can be represented as a probability distribution over actions for each state - say, a dict of action probabilities that you sample from. Both flavors are sketched in code below.
\pi(a \mid s): S \times A \to [0,1]
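
Here is a small illustration of both kinds of policy (the state and action names are made up for the example):

```python
import random

# Deterministic policy: a plain lookup table from state to action.
deterministic_policy = {"red_door": "turn_left", "blue_door": "turn_right"}

# Stochastic policy: for each state, a distribution over actions that we sample from.
stochastic_policy = {"red_door": {"turn_left": 0.7, "turn_right": 0.3}}

def sample_action(policy, state):
    """Draw an action according to pi(a | s)."""
    acts, probs = zip(*policy[state].items())
    return random.choices(acts, weights=probs, k=1)[0]

print(deterministic_policy["red_door"])              # always "turn_left"
print(sample_action(stochastic_policy, "red_door"))  # "turn_left" about 70% of the time
```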

The whole point of RL is finding the optimal policy - the strategy that gets you the most reward in the long run. But how do we know which policy is best? That's where value functions come in.

Value Functions: How Good Is the Current Situation in the Long Run?

Value functions answer the question: "How good is this situation for me?" They let us quantify the expected future rewards.

State-Value Function V(s)

"If I'm in state s and follow my policy, what's my expected total reward?"

V^{\pi}(s) = E_{\pi}[G_t \mid S_t = s]

where the total return G_t is just the sum of all future rewards (discounted for mathematical convergence):

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots

The discount factor \gamma makes future rewards worth less (a dollar today is better than a dollar tomorrow).
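
To see the discounting at work, here's a tiny sketch that computes the return G_t for a made-up list of rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of future rewards, each discounted by how far away it is."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Three hypothetical rewards received at t+1, t+2, t+3
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```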

Action-Value Function Q(s,a)

"If I'm in state s, take action a, then follow my policy, what's my expected total reward?"

Q^{\pi}(s,a) = E_{\pi}[G_t \mid S_t = s, A_t = a]

The Q-function is super important - once you know it, finding the best action is trivial: just pick the action with the highest Q-value!

The relationship between V and Q is straightforward - the value of a state is just the policy-weighted average of the Q-values of the actions you might take:

V^{\pi}(s) = \sum_a \pi(a \mid s) Q^{\pi}(s,a)
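
In code, that relationship is just a weighted sum (the Q-values and action probabilities below are hypothetical):

```python
# Hypothetical Q-values and policy probabilities for a single state s.
q_values = {"turn_left": 2.0, "turn_right": 1.0}
policy_probs = {"turn_left": 0.7, "turn_right": 0.3}

# V(s) = sum over actions of pi(a|s) * Q(s, a)
v_of_s = sum(policy_probs[a] * q_values[a] for a in q_values)
print(v_of_s)  # 0.7 * 2.0 + 0.3 * 1.0 = 1.7
```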

The Bellman Equations: The Heart of RL

Here's the big insight that makes RL work: the value of where you are now equals the immediate reward you get plus the value of where you'll end up (discounted).

Think about it - if you're deciding whether to take a job, you consider both the immediate salary and the future career prospects. That's the Bellman equation!

For the State-Value Function

V^{\pi}(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s,a) [R(s,a,s') + \gamma V^{\pi}(s')]

Translation: "The value of state s = expected immediate reward + discounted value of next state"

For the Optimal Policy

For the best possible policy, we don't average over actions - we take the best one:

V^*(s) = \max_a \sum_{s'} P(s' \mid s,a) [R(s,a,s') + \gamma V^*(s')]
Q^*(s,a) = \sum_{s'} P(s' \mid s,a) [R(s,a,s') + \gamma \max_{a'} Q^*(s',a')]

These equations are beautiful because they're recursive - they break a complex problem (find optimal behavior forever) into simple steps (get immediate reward + solve remaining problem).

Why this matters: Most RL algorithms are just different ways of solving these equations!

Solving MDPs: Two Approaches

Now that we have the theory, how do we actually find optimal policies? There are two main philosophies:

1. Value-Based Methods

"Figure out how good each state/action is, then choose the best actions"

  • First, compute the value function (how good is each state?)
  • Then, act greedily (always pick the action leading to the best state)

This is like planning a road trip by first figuring out how far each city is from your destination, then always driving toward the closest city - no clever route optimization, no skipping stops, no re-prioritizing checkpoints further down the road.

2. Policy-Based Methods

"Directly search for the best strategy"

  • Skip the value function entirely
  • Directly optimize the policy to maximize reward

This is like trying different driving strategies and keeping what works, without necessarily understanding why.

Most modern algorithms actually combine both approaches (actor-critic methods), but that's beyond our scope here.

When You Know Everything: Dynamic Programming

If you know the complete MDP (all transition probabilities and rewards), you can solve it exactly. This rarely happens in real life, but it's conceptually important.

Value Iteration

Start with random guesses for each state's value, then repeatedly apply the Bellman equation:

V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s,a) [R(s,a,s') + \gamma V_k(s')]

Each iteration gets you closer to the true optimal values. Once you have the optimal values, the optimal policy is just "pick the action that leads to the highest-value state."
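
Here's a compact sketch of value iteration on a tiny, made-up two-state MDP, using the same dictionary layout as the earlier MDP sketch (all names and numbers are illustrative):

```python
# Toy MDP: P[s][a] = {s': prob}, R[s][a][s'] = reward.
P = {
    "s0": {"a": {"s0": 0.5, "s1": 0.5}, "b": {"s1": 1.0}},
    "s1": {"a": {"s1": 1.0}, "b": {"s0": 1.0}},
}
R = {
    "s0": {"a": {"s0": 0.0, "s1": 1.0}, "b": {"s1": 2.0}},
    "s1": {"a": {"s1": 0.0}, "b": {"s0": 0.0}},
}
gamma = 0.9

V = {s: 0.0 for s in P}  # start from arbitrary (here zero) value estimates
for _ in range(100):     # repeatedly apply the Bellman optimality backup
    V = {
        s: max(
            sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())
            for a in P[s]
        )
        for s in P
    }

# Greedy policy: in each state, pick the action whose backup value is largest.
policy = {
    s: max(P[s], key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                   for s2, p in P[s][a].items()))
    for s in P
}
print(V, policy)
```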

Policy Iteration

A slightly different approach - alternate between:

  1. Evaluate your current policy (compute how good it is)
  2. Improve it (update to be greedy with respect to current values)

Both converge to the optimal policy, but policy iteration often gets there faster.
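
A minimal sketch of that loop, assuming the same kind of toy P, R, and gamma dictionaries used in the value-iteration example:

```python
def policy_iteration(P, R, gamma=0.9, eval_sweeps=50):
    """Alternate policy evaluation and greedy improvement until the policy stops changing."""
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy
    while True:
        # 1. Policy evaluation: iterative sweeps of the Bellman expectation backup.
        V = {s: 0.0 for s in P}
        for _ in range(eval_sweeps):
            V = {s: sum(p * (R[s][policy[s]][s2] + gamma * V[s2])
                        for s2, p in P[s][policy[s]].items())
                 for s in P}
        # 2. Policy improvement: act greedily with respect to the evaluated values.
        new_policy = {
            s: max(P[s], key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                           for s2, p in P[s][a].items()))
            for s in P
        }
        if new_policy == policy:  # stable policy -> we're done
            return policy, V
        policy = new_policy
```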

Reality check: These methods need complete knowledge of the environment, which we rarely have. That's where the next section comes in!

Learning from Experience: Model-Free RL

Here's where things get practical. In the real world, we usually don't know P(s' \mid s,a) (what happens when I do X?) or R(s,a,s') (how much reward will I get?). We have to learn by doing.

Monte Carlo: Learning from Complete Episodes

The simplest approach:

  1. Run an episode from start to finish
  2. See what total reward you got
  3. Update your value estimates based on actual experience

It's like learning by reflection - you try something, see how it went, and adjust. The downside? You have to wait until the end of an episode, and you learn slowly.
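
Here's a small sketch of every-visit Monte Carlo value estimation. The `run_episode` function is a hypothetical stand-in for whatever environment you're using, assumed to return a finished episode as a list of (state, reward) pairs, where the reward is the one received after acting in that state:

```python
from collections import defaultdict

def mc_value_estimate(run_episode, num_episodes=1000, gamma=0.9):
    """Every-visit Monte Carlo: average the observed returns from each visited state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(num_episodes):
        episode = run_episode()  # assumed interface: [(state, reward), ...]
        G = 0.0
        # Walk backwards so G is the discounted return from each step onward.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```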

Temporal Difference (TD) Learning: Learn as You Go

TD methods are smarter - they learn from each step:

V(s) \leftarrow V(s) + \alpha [R + \gamma V(s') - V(s)]

The term R + \gamma V(s') - V(s) is called the TD error - it's the difference between what you predicted and what you actually observed.

Think of it like adjusting your expectations after each new piece of information, rather than waiting for the full story.
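
As a minimal sketch, here's that update written as a function, where V is a dict of value estimates and the arguments describe one observed transition (the numbers in the example are made up):

```python
def td0_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
    """Nudge V(s) toward the one-step target: reward + gamma * V(s')."""
    td_error = reward + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

# One observed transition with made-up numbers.
V = {"s0": 0.0, "s1": 1.0}
print(td0_update(V, "s0", reward=0.5, s_next="s1"))  # TD error = 0.5 + 0.9*1.0 - 0.0 = 1.4
print(V["s0"])                                       # 0.0 + 0.1 * 1.4 ≈ 0.14
```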

Putting It All Together

Let's take a breath and recap what we've covered:

  1. MDPs give us a mathematical framework for decision-making problems
  2. Policies are strategies that tell us what to do
  3. Value functions tell us how good states or actions are
  4. Bellman equations give us a recursive way to compute values
  5. Dynamic programming solves MDPs when we know everything about the problem at hand
  6. Model-free methods learn from experience when we don't

The beautiful part? All these concepts build on each other. Value functions use the Bellman equations, policies are derived from value functions, and learning algorithms approximate these mathematical objects through experience.

Now, let's see how these ideas come together in one of the most important RL algorithms: Q-Learning!

Q-Learning: The Algorithm That Started It All

Q-Learning is elegant in its simplicity. Remember the Q-function we talked about? Q-Learning learns it directly from experience, without needing to know the transition probabilities.

The Q-Learning Update Rule

Q(s,a) \leftarrow Q(s,a) + \alpha [R + \gamma \max_{a'} Q(s',a') - Q(s,a)]

Let's break this down:

  • Q(s,a) is your current estimate
  • R + \gamma \max_{a'} Q(s',a') is what you actually observed (reward + best future value)
  • The difference is your prediction error
  • \alpha is the learning rate (how fast you update)

The algorithm is simple:

  1. Start in some state s
  2. Pick an action a (usually with some randomness to explore)
  3. Observe reward R and next state s'
  4. Update Q(s,a) using the equation above
  5. Repeat

The magic: This simple update, repeated many times, provably converges to the optimal Q-function! Once you have that, the optimal policy is just "always pick the action with the highest Q-value."
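
Here is a compact sketch of that loop in code. The `env` object is a hypothetical environment assumed to expose `reset() -> state` and `step(action) -> (next_state, reward, done)`; nothing here is tied to a specific library:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(float)  # Q[(state, action)], unseen pairs default to 0.0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Explore with probability epsilon, otherwise act greedily.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, reward, done = env.step(a)
            # Q-learning update: move Q(s,a) toward reward + gamma * max_a' Q(s',a')
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```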

Exploration vs Exploitation

"The eternal struggle: Play it safe with guaranteed paycheck, or risk it all for buried treasure?" 🗺️💰

[Image: Exploration vs Exploitation trade-off cartoon]

One crucial detail: should you always pick the best action (exploit), or try random actions to discover better ones (explore)?

A common solution is ε-greedy: with probability ε, pick a random action; otherwise, pick the best one. Start with high ε (explore a lot), gradually decrease it as you learn.
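
A minimal sketch of that schedule (the decay rate and floor below are arbitrary choices, not canonical values):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run one episode, choosing actions with epsilon_greedy(per_state_q, epsilon) ...
    epsilon = max(eps_min, epsilon * eps_decay)  # explore less as learning progresses
```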

Where Do We Go From Here?

We've covered the core theory, but modern RL goes much further:

  • Deep Q-Networks (DQN): Use neural networks to approximate Q for huge state spaces (how Atari games were conquered)
  • Policy Gradients: Directly optimize policies instead of values
  • Actor-Critic: Combine the best of both worlds
  • Multi-Agent RL: Multiple agents learning together (or competing!)

The fundamentals we covered are the foundation for all of these. Master the basics - MDPs, value functions, Bellman equations - and the advanced stuff becomes much more approachable.

Conclusion: The Path Forward with Reinforcement Learning

Reinforcement learning brings together elegant theory and practical experimentation. By thinking in terms of states, actions, and rewards, and then using value functions and the Bellman equations to formalize future returns, we get a flexible framework that scales from toy gridworlds to real-world problems.

Who benefits from learning these ideas?

  • Data scientists and researchers exploring decision-making and RL algorithms
  • Software engineers embedding adaptive behaviors or analytics into products
  • Students and instructors learning a mathematically grounded approach to learning from interaction

What this foundation gives you:

  • A clear vocabulary (MDPs, policies, value functions) so you can read and implement research
  • Practical algorithms (Q-Learning, TD methods) that work without a full model of the environment
  • A path to scale: once you understand the fundamentals, you can move to function approximation and deep RL

The best part is practical: you can start experimenting with small environments today and steadily build toward larger systems.


Final Encouragement

The fundamentals you learned here are the key that unlocks the rest of the field. Start small, run experiments, and iterate - RL rewards patience and careful tuning. Whether you're building research experiments, game agents, or production systems, these core ideas will keep you grounded.

Happy learning and experimenting!