Optimization

General Optimization Methods

Analytical Methods

Gradient Descent

The idea behind Gradient Descent is intuitive: say we have a scalar-valued function $$f:\mathbb{R}^n\to \mathbb{R}$$ and we are standing at a point $x$ in the domain $\mathbb{R}^n$, about to take a step of size $\delta$ in the direction of some unit vector $v$.

If we want to increase the output as much as possible, we should step in the direction of the gradient, $v=\nabla f(x)$ (normalized to unit length).

Similarly, if we want to minimize the output, we step in the direction opposite to the gradient, $-\nabla f(x)$. For the mathematical justification, see my notes at 🖇️.
For a function $f:\mathbb{R}^n\to \mathbb{R}$, the Gradient Descent algorithm seeks a local minimum of $f$ through the following iterative procedure:
  1. Start with some initial point $x_0$
  2. At step $i$: set $$ x_i = x_{i-1} - \alpha\cdot \nabla f(x_{i-1}) $$ where $\alpha>0$ is the step size (learning rate)
  3. Repeat the previous part, until $\alpha \cdot \nabla f(x_{i-1})$ is sufficiently small.
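A minimal NumPy sketch of this loop (the quadratic example, step size, and stopping tolerance below are illustrative choices, not prescribed by these notes):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Minimize f by following -grad_f with a fixed step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = alpha * grad_f(x)
        x = x - step
        if np.linalg.norm(step) < tol:  # stop once the update is sufficiently small
            break
    return x

# Example: f(x) = ||x||^2 has gradient 2x and minimum at the origin.
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0]))
```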

Derivative Tests

When $f$ can be restricted to a bounded domain, or when $f$ is concave or convex, a global optimum can be obtained via derivative tests.
Let $f:\mathbb{R}^n\to\mathbb{R}$ with $f\in C^2$, and let $H_f$ be the Hessian matrix of $f$. Suppose $$\nabla f(x_0) =\mathbf{0}$$
  • if $H_f(x_0)$ is positive definite, then $x_0$ is a local minimum,
  • if $H_f(x_0)$ is negative definite, then $x_0$ is a local maximum,
  • if $H_f(x_0)$ is indefinite, then $x_0$ is a saddle point (if $H_f(x_0)$ is only semidefinite, the test is inconclusive).
See my notes at 🖇️.
  • If $f$ is convex, then any local minimum of $f$ is a global minimum.
  • If $f$ is concave, then any local maximum of $f$ is a global maximum.
For definitions and discussion, see my notes at 🖇️.

Newton's Method

To find critical points $x_0$, where $$\nabla f(x_0) =\mathbf{0}$$ one can use Newton's Method. Let $g:\mathbb{R}^n\to \mathbb{R}^m$. Newton's Method is an algorithm for finding a root of $g$ (if one exists). We go over two different views of Newton's Method.

Linear Approximation

To derive Newton's Method, we start with the first-order linear approximation $$\vec{0}\approx J_g(x_0)(x-x_0)+g(x_0)$$ where $x$ is the root we are seeking, $x_0$ is a point near the root, and $J_g$ is the Jacobian $$J_g(x_0)=\left[\begin{array}{ccc} \frac{\partial g_{1}}{\partial x_{1}} & \cdots & \frac{\partial g_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial g_{m}}{\partial x_{1}} & \cdots & \frac{\partial g_{m}}{\partial x_{n}} \end{array}\right]$$ When $m=n$ and the Jacobian is invertible, the linear approximation implies $$x \approx x_0-J^{-1}_g(x_0)g(x_0)$$ This gives us Newton's Method
  1. Start with an initial guess $x_0$
  2. Repeat until convergence $$x_i = x_{i-1}- J^{-1}_g(x_{i-1})g(x_{i-1})$$
If $g(\mathbf{x}):\mathbb{R}^n\to \mathbb{R}^n$ is affine, i.e. it is of the form $$g(x)=Mx+b$$ then the first-order linear approximation will be exact, and Newton's method requires only one step to get the root $$x_1 = x_0 - M^{-1} (Mx_0+b) = -M^{-1}b $$
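A minimal NumPy sketch of the iteration for the square case $g:\mathbb{R}^n\to\mathbb{R}^n$ (the example system, starting point, and tolerances are illustrative):

```python
import numpy as np

def newton(g, jacobian, x0, tol=1e-10, max_iter=50):
    """Find a root of g by iterating x <- x - J_g(x)^{-1} g(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # Solve J_g(x) * delta = g(x) instead of forming the inverse explicitly.
        delta = np.linalg.solve(jacobian(x), g(x))
        x = x - delta
        if np.linalg.norm(delta) < tol:
            break
    return x

# Example: g(x, y) = (x^2 + y^2 - 1, x - y) has a root at (1/sqrt(2), 1/sqrt(2)).
g = lambda v: np.array([v[0]**2 + v[1]**2 - 1, v[0] - v[1]])
J = lambda v: np.array([[2 * v[0], 2 * v[1]], [1.0, -1.0]])
print(newton(g, J, x0=[1.0, 0.5]))
```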

Fixed Point Equation

A fixed point equation is an equation of the form $$x=f(x)$$ When $f$ is a contraction, a solution to the equation can be found efficiently through an iterative procedure. This is the content of the Banach fixed-point theorem:
Let $(X, d)$ be a non-empty complete metric space with a contraction mapping $$T: X \rightarrow X$$ Then $T$ admits a unique fixed-point $x^{*}$ in $X$ (i.e. $T\left(x^{*}\right)=x^{*}$ ). Furthermore, $x^{*}$ can be found as follows:
  1. start with an arbitrary element $x_{0} \in X$ and define a sequence $\left(x_{n}\right)_{n \in \mathbb{N}}$ by $$x_{n}=T\left(x_{n-1}\right)$$ for $n \geq 1$.
  2. Then $$\lim _{n \rightarrow \infty} x_{n}=x^{*}$$
Newton's Method is a special case of fixed point iteration with the fixed point equation $$x=x-J^{-1}_g(x)\,g(x)$$ whose fixed points are exactly the roots of $g$.
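A minimal sketch of fixed point iteration; the contraction $T(x)=\cos x$ on $[0,1]$ is an illustrative choice:

```python
import math

def fixed_point(T, x0, tol=1e-12, max_iter=1000):
    """Iterate x <- T(x) until successive iterates are sufficiently close."""
    x = x0
    for _ in range(max_iter):
        x_next = T(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# cos is a contraction on [0, 1]; the iteration converges to x* with cos(x*) = x*.
x_star = fixed_point(math.cos, x0=1.0)
print(x_star, math.cos(x_star))
```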

Derivative-Free Methods

Random Search

  1. Initialize $x$ with a random position
  2. Until a termination condition is met
    1. Generate a random direction $d$ and a random step size $\alpha$
    2. Evaluate the candidate point $x+\alpha d$
    3. If $f(x+\alpha d) < f(x)$, then set $x=x+\alpha d$
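A minimal NumPy sketch of random search (the objective, iteration budget, and step-size distribution are illustrative):

```python
import numpy as np

def random_search(f, x0, n_iters=5000, step_scale=1.0, seed=0):
    """Accept a random step only when it improves the objective."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(n_iters):
        d = rng.normal(size=x.shape)
        d /= np.linalg.norm(d)                 # random unit direction
        alpha = rng.uniform(0.0, step_scale)   # random step size
        candidate = x + alpha * d
        f_candidate = f(candidate)
        if f_candidate < fx:
            x, fx = candidate, f_candidate
    return x, fx

# Example: minimize the 2-D sphere function.
print(random_search(lambda v: np.sum(v**2), x0=[3.0, -2.0]))
```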

Genetic Algorithm

  1. Initialize a population of $N$ individuals $$P_0 = \{x_0^1,\cdots,x_0^N\}$$ with fitness values $$\{f(x_0^i)\}$$
  2. Until a termination condition is met
    1. Select a pair of parents based on fitness $$S(P_k)\to\left(x_k^i,x_k^j\right),\quad x_k^i,x_k^j\in P_k$$
    2. Produce a descendant according to some crossover rule $$ D(x_k^i,x_k^j)\to d$$
    3. With probability $p$, mutate $d$: $$ M(d)\to d$$
    4. Add $d$ to the population $$P_{k+1} = P_k\cup\{d\}$$
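A minimal sketch of this loop for a continuous objective; the tournament selection, averaging crossover, and Gaussian mutation are common but illustrative choices, not the only ones:

```python
import numpy as np

def genetic_algorithm(f, dim=2, pop_size=30, n_gens=200, p_mutate=0.2, seed=0):
    """Evolve a population toward low values of f (lower value = fitter)."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5, 5, size=(pop_size, dim))

    def select(pop):
        # Tournament selection: keep the fitter of two random individuals.
        i, j = rng.integers(len(pop), size=2)
        return pop[i] if f(pop[i]) < f(pop[j]) else pop[j]

    for _ in range(n_gens):
        parent_a, parent_b = select(pop), select(pop)
        child = 0.5 * (parent_a + parent_b)                       # crossover: average the parents
        if rng.random() < p_mutate:
            child = child + rng.normal(scale=0.5, size=dim)       # mutation
        # Add the child, then drop the worst individual to keep the population size fixed.
        pop = np.vstack([pop, child])
        pop = pop[np.argsort([f(x) for x in pop])][:pop_size]

    best = min(pop, key=f)
    return best, f(best)

print(genetic_algorithm(lambda v: np.sum(v**2)))
```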

Nelder-Mead

Given a function $f:\mathbb{R}^n\to \mathbb{R}$,
  1. Initialize a simplex $S_0$ with $n+1$ vertices $$S_0 = \{x_0,\cdots,x_{n}\}$$ and choose an expansion parameter $\delta_{\text{e}}\in (1,\infty)$, a contraction parameter $\delta_{\text{c}}\in (0,1)$, and a shrinkage parameter $\gamma\in (0,1)$
  2. Order $S_k$ so that $f(x_0)\leq \cdots \leq f(x_{n})$ and set $f_{\text{best}}^k = f(x_0)$
  3. Compute the centroid of the $n$ best vertices, the reflected point, and its value: $$x_{\text{o}} = \frac{1}{n}\sum_{i=0}^{n-1} x_i$$ $$x_{\text{r}} = x_{\text{o}} + (x_{\text{o}}-x_n)$$ $$ f_{\text{r}} = f(x_{\text{r}})$$
    • Expansion : If $f_{\text{r}} < f_{\text{best}}^k$, then $$x_{\text{e}}=x_{\text{o}} + \delta_{e}(x_\text{r}-x_{\text{o}})$$ $$ f_{\text{e}}=f(x_{\text{e}})$$ If $f_{\text{e}} < f_{\text{r}}$, then $$x_n=x_{\text{e}}$$ else $$x_n=x_{\text{r}}$$
    • Reflection : If $f_{\text{best}}^k \le f_{\text{r}} \le f(x_{n-1}) $, then $$x_n = x_{\text{r}}$$
    • Contraction :
      • If $f(x_{n-1}) \le f_{\text{r}} \le f(x_n)$, then $$x_{\text{c}} = x_{\text{o}} + \delta_{\text{c}}(x_{\text{r}}-x_{\text{o}} )$$ $$ f_{\text{c}} = f(x_{\text{c}})$$ If $f_{\text{c}} < f_{\text{r}} $, then $$x_n=x_{\text{c}}$$ else go to step 4
      • If $f(x_n)\le f_{\text{r}}$, then $$x_{\text{c}} = x_{\text{o}} + \delta_{\text{c}}(x_n-x_{\text{o}})$$ $$ f_{\text{c}} = f(x_{\text{c}})$$ If $f_{\text{c}} < f(x_n)$, then $$x_n=x_{\text{c}}$$ else go to step 4
    • Increase k by 1 and go to step 2
  4. Shrink: set $S_{k+1}= \{x_0,\ x_0+\gamma(x_1-x_0),\cdots,x_0+\gamma(x_n-x_0)\}$, increase $k$ by 1, and go to step 2
Consider the simple example $$f(x)=x^2$$ with $$\delta_{e}=1.5$$
  1. Case 1 : If we start with the initial simplex (ordered) $$S_0 = \{(-4),(-5) \}$$ Then $$f^0_{\text{best}} = f(-4) = 16$$ Since $n=1$, the centroid of the non-worst vertices is just $$x_{\text{o}} = x_0 = -4$$ $$x_{\text{r}} = -4 + (-4+5) = -3$$ $$f_{\text{r}} = f(-3) = 9$$ Since $f_{\text{r}} < f_{\text{best}}^0 $, we attempt an expansion $$x_{\text{e}}=-4 + 1.5(-3+4)=-2.5$$ $$f_{\text{e}}=f(-2.5)=6.25 < f_{\text{r}} $$ So we set $$x_1=x_{\text{e}}=-2.5$$
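In practice one rarely hand-rolls the bookkeeping above; SciPy ships a Nelder-Mead implementation. A minimal sketch (the Rosenbrock objective and starting point are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Minimize the 2-D Rosenbrock function with Nelder-Mead (derivative-free).
f = lambda v: (1 - v[0])**2 + 100 * (v[1] - v[0]**2)**2
result = minimize(f, x0=np.array([-1.5, 2.0]), method="Nelder-Mead")
print(result.x, result.fun)
```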

Linear Programming

Standard Form

A linear program in standard form is written as: $$\begin{align} \text{minimize} \quad & \mathbf{c}^T\mathbf{x} \\ \text{subject to} \quad & \mathbf{A}\mathbf{x} = \mathbf{b} \\ & \mathbf{x} \geq \mathbf{0} \end{align}$$ where:
  • $\mathbf{x} \in \mathbb{R}^n$ is the vector of decision variables
  • $\mathbf{c} \in \mathbb{R}^n$ is the vector of objective function coefficients
  • $\mathbf{A} \in \mathbb{R}^{m \times n}$ is the constraint coefficient matrix
  • $\mathbf{b} \in \mathbb{R}^m$ is the right-hand side of the constraints
Key properties:
  • The feasible region of a linear program is a convex polyhedron.
  • If an optimal solution exists, it occurs at a vertex of the feasible region.
  • If multiple optimal solutions exist, then at least one of them is at a vertex of the feasible region.
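A minimal sketch of solving a small standard-form LP with SciPy's HiGHS backend (the problem data is illustrative):

```python
import numpy as np
from scipy.optimize import linprog

# minimize c^T x  subject to  A x = b,  x >= 0
c = np.array([1.0, 2.0, 3.0])
A_eq = np.array([[1.0, 1.0, 1.0],
                 [2.0, 1.0, 0.0]])
b_eq = np.array([4.0, 5.0])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 3, method="highs")
print(res.x, res.fun)
```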

Simplex Method

  1. Convert the linear program to augmented form via slack variables (turning inequality constraints $\mathbf{A}\mathbf{x} \leq \mathbf{b}$ into equalities): $$\begin{align} \text{minimize} \quad & \mathbf{c}^T\mathbf{x} \\ \text{subject to} \quad & \mathbf{A}\mathbf{x} + \mathbf{s} = \mathbf{b} \\ & \mathbf{c}^T\mathbf{x} - z = 0 \\ & \mathbf{x}, \mathbf{s} \geq \mathbf{0} \end{align}$$ where $\mathbf{s}$ are slack variables and $z$ is the objective value
  2. Find an initial basic feasible solution
  3. While the solution can be improved:
    1. Choose an entering variable (most negative reduced cost)
    2. Choose a leaving variable (minimum ratio test)
    3. Pivot to update the basis
  4. If no negative reduced costs exist, the current solution is optimal

Duality

For every linear program (the primal), there exists a corresponding dual linear program:

Primal: $$\begin{align} \text{minimize} \quad & \mathbf{c}^T\mathbf{x} \\ \text{subject to} \quad & \mathbf{A}\mathbf{x} = \mathbf{b} \\ & \mathbf{x} \geq \mathbf{0} \end{align}$$

Dual: $$\begin{align} \text{maximize} \quad & \mathbf{b}^T\mathbf{y} \\ \text{subject to} \quad & \mathbf{A}^T\mathbf{y} \leq \mathbf{c} \\ \end{align}$$

where $\mathbf{y} \in \mathbb{R}^m$ is the vector of dual variables.
If the primal problem has an optimal solution $\mathbf{x}^*$, then the dual problem also has an optimal solution $\mathbf{y}^*$, and $$\mathbf{c}^T\mathbf{x}^* = \mathbf{b}^T\mathbf{y}^*$$
If $\mathbf{x}^*$ and $\mathbf{y}^*$ are optimal solutions to the primal and dual problems, then: $$x_j^* \cdot (c_j - \mathbf{a}_j^T\mathbf{y}^*) = 0 \quad \forall j$$ $$y_i^* \cdot (b_i - \mathbf{A}_i\mathbf{x}^*) = 0 \quad \forall i$$ where $\mathbf{a}_j$ is the $j$-th column of $\mathbf{A}$ and $\mathbf{A}_i$ is the $i$-th row of $\mathbf{A}$.
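A small numerical check of strong duality with CVXPY, solving the primal and the dual as two separate problems and comparing optimal values (the problem data is illustrative):

```python
import cvxpy as cp
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [2.0, 1.0, 0.0]])
b = np.array([4.0, 5.0])
c = np.array([1.0, 2.0, 3.0])

# Primal: minimize c^T x  subject to  A x = b,  x >= 0
x = cp.Variable(3, nonneg=True)
primal = cp.Problem(cp.Minimize(c @ x), [A @ x == b])
primal.solve()

# Dual: maximize b^T y  subject to  A^T y <= c
y = cp.Variable(2)
dual = cp.Problem(cp.Maximize(b @ y), [A.T @ y <= c])
dual.solve()

print(primal.value, dual.value)  # equal up to solver tolerance, by strong duality
```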

Interior Point Methods

While the simplex method moves along the boundary of the feasible region, interior point methods traverse the interior of the feasible region. These methods are particularly effective for large-scale problems.

  1. Start with a strictly feasible point (in the interior of the feasible region)
  2. Until convergence:
    1. Compute a search direction that improves the objective while staying away from the boundary
    2. Determine an appropriate step size
    3. Move in the computed direction by the determined step size
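A rough log-barrier sketch in NumPy for an inequality-form LP, illustrating the "stay in the interior" idea; the barrier schedule, Newton loop, and example problem are illustrative choices, not a production-quality method:

```python
import numpy as np

def barrier_lp(c, A, b, x0, mu=10.0, shrink=0.5, mu_min=1e-8, newton_iters=50):
    """Rough log-barrier method for: minimize c^T x  subject to  A x <= b.

    x0 must be strictly feasible (A x0 < b). Each outer iteration approximately
    minimizes c^T x - mu * sum(log(b - A x)) with Newton steps, then shrinks mu.
    """
    x = np.asarray(x0, dtype=float)
    while mu > mu_min:
        for _ in range(newton_iters):
            slack = b - A @ x                               # strictly positive in the interior
            grad = c + mu * A.T @ (1.0 / slack)
            hess = mu * A.T @ np.diag(1.0 / slack**2) @ A
            step = np.linalg.solve(hess, -grad)
            t = 1.0
            # Backtrack until the step stays strictly inside the feasible region.
            while np.any(b - A @ (x + t * step) <= 0):
                t *= 0.5
            x = x + t * step
        mu *= shrink
    return x

# Illustrative problem: minimize -x - y subject to x + y <= 1, x >= 0, y >= 0.
c = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 0.0])
print(barrier_lp(c, A, b, x0=[0.2, 0.2]))
```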

Modern interior point methods include:

Python Implementations

CVXPY : Modeling language for convex optimization
OR-Tools : Solving combinatorial optimization problems
Pyomo : A modeling language for large-scale optimization
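For example, a tiny integer program in OR-Tools might look like the following sketch (the model and solver choice are illustrative):

```python
from ortools.linear_solver import pywraplp

solver = pywraplp.Solver.CreateSolver("SCIP")  # SCIP backend handles mixed-integer programs
x = solver.IntVar(0, 10, "x")
y = solver.IntVar(0, 10, "y")
solver.Add(x + 2 * y <= 14)
solver.Add(3 * x - y >= 0)
solver.Maximize(3 * x + 4 * y)

if solver.Solve() == pywraplp.Solver.OPTIMAL:
    print(x.solution_value(), y.solution_value(), solver.Objective().Value())
```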

Multi-Stage Stochastic Optimization

Dynamic Programming

Dynamic Programming solves complex optimization problems by breaking them down into simpler subproblems. The key idea is to:

  1. Define a sequence of value functions $V_1, V_2, \ldots, V_n$, where $V_i(y)$ represents the optimal value at stage $i$ given state $y$
  2. Use the Bellman equation to recursively relate these value functions:
$$V_{i-1}(y) = \max_x[f_i(y,x) + V_i(g_i(y,x))]$$ where:
  • $V_i(y)$ is the value function at stage $i$ and state $y$
  • $f_i(y,x)$ represents the immediate reward/cost from decision $x$ at state $y$
  • $g_i(y,x)$ describes the system transition from state $y$ under decision $x$

Key principles of Dynamic Programming:

  • Optimal Substructure: The optimal solution contains optimal solutions to subproblems
  • Overlapping Subproblems: The same subproblems are encountered multiple times
  • Backward Induction: Starting from the last stage and moving backward, we solve each stage sequentially until reaching the initial stage
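A minimal backward-induction sketch for a toy deterministic problem with a small state grid; the stage reward $f_i$ and transition $g_i$ below are illustrative stand-ins:

```python
# Toy multi-stage problem: states 0..4, decisions -1/0/+1, 3 stages.
states = range(5)
decisions = (-1, 0, 1)
n_stages = 3

def reward(stage, state, decision):          # immediate reward f_i(y, x)
    return state - abs(decision)

def transition(stage, state, decision):      # next state g_i(y, x), clipped to the grid
    return min(max(state + decision, 0), 4)

# Backward induction: V[stage][state], with V = 0 after the last stage.
V = {n_stages: {s: 0.0 for s in states}}
policy = {}
for stage in range(n_stages - 1, -1, -1):
    V[stage], policy[stage] = {}, {}
    for s in states:
        values = {d: reward(stage, s, d) + V[stage + 1][transition(stage, s, d)]
                  for d in decisions}
        best = max(values, key=values.get)
        V[stage][s], policy[stage][s] = values[best], best

print(V[0], policy[0])
```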

Stochastic Dual Dynamic Programming

Stochastic Dual Dynamic Programming (SDDP) is an advanced optimization technique that extends dynamic programming to handle multi-stage stochastic optimization problems with high-dimensional state spaces.

SDDP addresses problems of the form: $$\begin{align} \min_{u_1, \ldots, u_T} \quad & \mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t, u_t)\right] \\ \text{subject to} \quad & u_t \in \mathcal{U}_t(x_t), \quad t = 1, \ldots, T \end{align}$$ where:
  • $T$ is the number of time stages
  • $u_t$ is the decision variable at stage $t$
  • $x_t$ is the state variable at stage $t$, which evolves from the previous state and decision under the random uncertainty realized at stage $t$
  • $c_t$ is the cost function at stage $t$
  • $\mathcal{U}_t(x_t)$ is the feasible region at stage $t$, which depends on the current state (and hence on the previous decision and the current uncertainty)
The key equation in SDDP is the Bellman equation: $$V_t(x_t) = \min_{u_t \in \mathcal{U}_t(x_t)} \left\{c_t(x_t, u_t) + \mathbb{E}[V_{t+1}(x_{t+1})|x_t, u_t]\right\}, \quad t = 1,\ldots,T$$ where:
  • $V_t(x_t)$ is the optimal cost-to-go function at stage $t$ given state $x_t$
  • $c_t(x_t, u_t)$ is the immediate cost incurred at stage $t$
  • $u_t$ are decision variables at stage $t$
  • $\mathcal{U}_t(x_t)$ defines feasible decisions based on current state $x_t$
  • The expectation operator $\mathbb{E}[\cdot]$ accounts for uncertainty about future states
  • Terminal condition: $V_{T+1}(x_{T+1}) = 0$
This recursive formulation is the Bellman equation that we have seen earlier.

Algorithm Structure

  1. Initialization:
    1. Build initial approximations of the cost-to-go functions for all stages
    2. Set iteration counter $k = 0$
  2. Forward Pass:
    1. Generate sample paths of the random variables $\{x_t^k\}_{t=1}^T$
    2. For each stage $t = 1, \ldots, T$:
      1. Solve the stage problem using the current approximation of the cost-to-go function $$u_t^k = \arg \min_{u_t \in \mathcal{U}_t(x_t)} \left[c_t(x_t, u_t) + \tilde{V}_{t+1}(x_{t+1})\right]$$ where $\tilde{V}_{t+1}(\cdot)$ is approximated by existing cuts
      2. Store the optimal solutions $\{u_t^k\}$ and state variables
  3. Backward Pass:
    1. For each stage $t = T, \ldots, 2$ (working backwards):
      1. For each sampled state from the forward pass:
      2. Compute the dual variables (shadow prices) of the constraints
      3. Generate and add cutting planes to improve the approximation of the cost-to-go function. The cutting planes take the form: $$V_{t+1}(x_{t+1}) \geqslant a_{t+1}^Tx_{t+1} + b_{t+1}$$ where $a_{t+1}$ are the dual variables and $b_{t+1}$ is the intercept term
  4. Convergence Check:
    1. Calculate upper and lower bounds on the optimal value
    2. If bounds are sufficiently close or maximum iterations reached, stop
    3. Otherwise, increment $k$ and return to step 2
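As a small illustration of the central data structure, the approximate cost-to-go function can be stored as a collection of cuts and evaluated as their pointwise maximum; the cut values below are hypothetical:

```python
import numpy as np

class CutApproximation:
    """Piecewise-linear lower approximation V(x) ~ max_k (a_k^T x + b_k)."""

    def __init__(self):
        self.slopes = []      # a_k vectors (built from dual variables in the backward pass)
        self.intercepts = []  # b_k scalars

    def add_cut(self, a, b):
        self.slopes.append(np.asarray(a, dtype=float))
        self.intercepts.append(float(b))

    def value(self, x):
        if not self.slopes:
            return 0.0  # no cuts yet: trivial lower bound
        x = np.asarray(x, dtype=float)
        return max(a @ x + b for a, b in zip(self.slopes, self.intercepts))

# Hypothetical cuts for a one-dimensional state (e.g., a reservoir level).
V_tilde = CutApproximation()
V_tilde.add_cut(a=[-2.0], b=10.0)
V_tilde.add_cut(a=[-0.5], b=4.0)
print(V_tilde.value([3.0]))   # max(-6 + 10, -1.5 + 4) = 4.0
```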

Example: Hydropower Planning

Consider a multi-reservoir hydropower system planning problem:

  • State variables: Water levels in reservoirs
  • Decision variables: Water releases and power generation
  • Uncertainty: Inflows to reservoirs (dependent on rainfall)
  • Objective: Minimize expected cost of operation and thermal generation
  • Constraints: Water balance, generation limits, transmission constraints

This problem has continuous state and action spaces, multiple stages (e.g., weekly planning for a year), and significant uncertainty, making it suitable for SDDP.

In the hydropower planning context, SDDP would:

  1. In the forward pass: Simulate operation decisions for sample inflow scenarios
  2. In the backward pass: Build cutting planes that approximate the future cost function
  3. Iteratively refine the approximation until convergence

The resulting policy provides operation decisions (water releases) for any reservoir level and inflow scenario, balancing immediate generation with future water value.

Advantages and Limitations

Advantages:

Limitations:

Extensions

Several extensions to the basic SDDP algorithm have been developed:

Reinforcement Learning

Reinforcement Learning problems are typically formalized as Markov Decision Processes (MDPs), defined by the tuple $(S, A, P, R, \gamma)$:
  • $S$ is the state space
  • $A$ is the action space
  • $P$ is the transition probability: $P(s'|s,a)$
  • $R$ is the reward function: $R(s,a,s')$
  • $\gamma \in [0,1]$ is the discount factor, balancing immediate vs. future rewards

The objective is to find an optimal policy $\pi^*$ that maximizes the expected cumulative discounted reward: $$\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

Value Functions and Optimality

The state-value function $V^{\pi}(s)$ represents the expected return starting from state $s$ and following policy $\pi$:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s\right]$$

The action-value function $Q^{\pi}(s, a)$ represents the expected return after taking action $a$ in state $s$ and following policy $\pi$:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a\right]$$

The optimal value functions satisfy the Bellman Equation:

$$V^*(s) = \max_a \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V^*(s')]$$ $$Q^*(s,a) = \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma \max_{a'} Q^*(s',a')]$$

Once we know $Q^*$, the optimal policy can be derived as:

$$\pi^*(s) = \arg\max_a Q^*(s,a)$$

Dynamic Programming Approaches

Value Iteration:
  1. Initialize $V(s)$ arbitrarily for all states
  2. Repeat until convergence:
    1. For each state $s \in S$: $$V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V(s')]$$
  3. Compute the optimal policy: $$\pi(s) \leftarrow \arg\max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V(s')]$$
Policy Iteration:
  1. Initialize policy $\pi$ arbitrarily
  2. Repeat until convergence:
    1. Policy Evaluation: Compute $V^\pi$ by solving: $$V^\pi(s) = \sum_{s'} P(s'|s,\pi(s))[R(s,\pi(s),s') + \gamma V^\pi(s')]$$
    2. Policy Improvement: Update policy using: $$\pi(s) \leftarrow \arg\max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')]$$
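A minimal tabular value-iteration sketch on a tiny hypothetical MDP (two states, two actions; the transition and reward tables are made up for illustration):

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# P[s, a, s'] = transition probability, R[s, a, s'] = reward (illustrative numbers).
P = np.array([[[0.8, 0.2], [0.2, 0.8]],
              [[0.9, 0.1], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.0], [0.0, 1.0]]])

V = np.zeros(n_states)
while True:
    # Q[s, a] = sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V(s')]
    Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)
print(V, policy)
```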

Model-Free Methods

Unlike dynamic programming approaches, model-free methods don't require knowledge of the transition probabilities and reward function. They learn directly from experience.

Temporal Difference Learning

Q-Learning:
  1. Initialize $Q(s,a)$ arbitrarily for all state-action pairs
  2. For each episode:
    1. Initialize state $s$
    2. For each step of the episode:
      1. Choose action $a$ from $s$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)
      2. Take action $a$, observe reward $r$ and next state $s'$
      3. Update Q-value: $$Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$$
      4. $s \leftarrow s'$
SARSA:
  1. Initialize $Q(s,a)$ arbitrarily for all state-action pairs
  2. For each episode:
    1. Initialize state $s$
    2. Choose action $a$ from $s$ using policy derived from $Q$
    3. For each step of the episode:
      1. Take action $a$, observe reward $r$ and next state $s'$
      2. Choose action $a'$ from $s'$ using policy derived from $Q$
      3. Update Q-value: $$Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma Q(s',a') - Q(s,a)]$$
      4. $s \leftarrow s'$; $a \leftarrow a'$

The key difference between Q-Learning and SARSA is that Q-Learning is an off-policy method (learns the value of the optimal policy), while SARSA is an on-policy method (learns the value of the policy being followed).
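A minimal tabular Q-learning sketch; the tiny chain environment and the `reset()`/`step()` interface are illustrative assumptions, not something defined in these notes:

```python
import numpy as np

class ChainEnv:
    """Tiny episodic chain: states 0..4, start at 2, episode ends at either end.
    Reaching state 4 gives reward +1; reaching state 0 gives reward 0."""
    n_states, n_actions = 5, 2   # actions: 0 = move left, 1 = move right

    def reset(self):
        self.s = 2
        return self.s

    def step(self, a):
        self.s += 1 if a == 1 else -1
        done = self.s in (0, 4)
        reward = 1.0 if self.s == 4 else 0.0
        return self.s, reward, done

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection.
            a = rng.integers(env.n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy TD target: greedy value of the next state (zero at terminal states).
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

print(q_learning(ChainEnv()))
```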

Policy Gradient Methods

Policy gradient methods directly optimize the policy without using a value function. The policy is typically parameterized by a vector $\theta$: $$\pi_\theta(a|s) = P(a|s,\theta)$$

The objective is to maximize the expected return:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

The policy gradient theorem gives the gradient of this objective:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s,a)\right]$$
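A minimal REINFORCE-style sketch for a softmax policy on a one-step problem (effectively a bandit), using the sampled return in place of $Q^{\pi_\theta}(s,a)$; the reward means and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # expected reward per action (unknown to the agent)
theta = np.zeros(3)                       # policy parameters (softmax preferences)
lr = 0.05

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for t in range(5000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    r = rng.normal(true_means[a], 0.1)       # sampled return
    grad_log_pi = -pi                        # d/d theta_j log pi(a) = -pi_j for j != a ...
    grad_log_pi[a] += 1.0                    # ... and 1 - pi_a for j = a
    theta += lr * grad_log_pi * r            # stochastic gradient ascent on E[return]

print(softmax(theta))   # probability mass should concentrate on the best action
```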

Deep Reinforcement Learning

Deep Q-Network (DQN) combines Q-learning with deep neural networks to approximate the Q-function. Key innovations include:
  • Experience replay: Store and randomly sample transitions to break correlation between consecutive samples
  • Target networks: Use a separate network for generating targets to improve stability
  • Double DQN: Address overestimation bias by decoupling selection and evaluation of actions

Other notable deep RL algorithms include A3C (Asynchronous Advantage Actor-Critic), PPO (Proximal Policy Optimization), and SAC (Soft Actor-Critic).

  1. Initialize replay memory D with capacity N
  2. Initialize Q-network with random weights θ
  3. Initialize target Q-network with weights θ⁻ = θ
  4. For each episode:
    1. Initialize state $s$
    2. For each step of the episode:
      1. Select action $a$ using ε-greedy policy based on Q
      2. Execute action $a$, observe reward $r$ and next state $s'$
      3. Store transition $(s, a, r, s')$ in D
      4. Sample random minibatch of transitions from D
      5. Set target: $y_i = r_i + \gamma \max_{a'} Q(s'_i, a'; \theta^-)$
      6. Perform gradient descent step on $(y_i - Q(s_i, a_i; \theta))^2$
      7. Every C steps, update θ⁻ = θ
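A sketch of the core update (steps 5-6) in PyTorch, computing the TD targets from the target network and taking one gradient step; the network sizes and the fake minibatch are illustrative assumptions:

```python
import torch
import torch.nn as nn

gamma = 0.99
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # online network (theta)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # target network (theta-)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A hypothetical minibatch sampled from the replay memory D.
states      = torch.randn(32, 4)
actions     = torch.randint(0, 2, (32,))
rewards     = torch.randn(32)
next_states = torch.randn(32, 4)
dones       = torch.zeros(32)

# y_i = r_i + gamma * max_a' Q(s'_i, a'; theta-), with no bootstrap at terminal states.
with torch.no_grad():
    targets = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values

q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s_i, a_i; theta)
loss = nn.functional.mse_loss(q_values, targets)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```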

Applications

Reinforcement learning has been successfully applied to numerous optimization problems:

Comparison with Other Optimization Methods

RL vs. Dynamic Programming:

RL vs. Supervised Learning:

RL vs. Traditional Optimization:

Challenges in RL:

Multi-Armed Bandits

The multi-armed bandit problem is a simplified reinforcement learning setting that focuses purely on the exploration-exploitation tradeoff. It represents the core challenge of deciding whether to exploit current knowledge for immediate reward or explore to gather more information.

In the classic multi-armed bandit problem:

  • There are $K$ actions (arms)
  • At each time step $t$, the agent selects an action $a_t \in \{1, 2, \ldots, K\}$
  • After selecting action $a_t$, the agent receives a reward $r_t \sim R(a_t)$
  • The goal is to maximize the cumulative reward $\sum_{t=1}^T r_t$ over $T$ steps
  • Each arm has an unknown expected reward $\mu_a = \mathbb{E}[R(a)]$

This differs from full RL in that there's no state transition—the environment doesn't change based on the agent's actions.

Performance in bandit algorithms is often measured using regret, which quantifies the difference between the rewards obtained by the agent and the rewards that could have been obtained by always selecting the optimal arm:

$$\text{Regret}(T) = T\mu^* - \mathbb{E}\left[\sum_{t=1}^T r_t\right]$$ where $\mu^* = \max_a \mu_a$ is the expected reward of the optimal arm.

The goal is to minimize regret, which means effectively balancing exploration (trying different arms to learn their expected rewards) and exploitation (selecting the arm currently believed to be best).

Common Strategies

  1. Initialize estimated values $Q(a) = 0$ for all actions $a$
  2. Initialize action counts $N(a) = 0$ for all actions $a$
  3. For each time step $t = 1, 2, \ldots, T$:
    1. With probability $\epsilon$:
      • Select a random action $a_t$ (exploration)
    2. Otherwise:
      • Select $a_t = \arg\max_a Q(a)$ (exploitation)
    3. Receive reward $r_t$
    4. Update:
      • $N(a_t) = N(a_t) + 1$
      • $Q(a_t) = Q(a_t) + \frac{1}{N(a_t)}(r_t - Q(a_t))$

Epsilon-greedy is simple but effective. The parameter $\epsilon$ controls the exploration rate—higher values lead to more exploration. A common variant is to decrease $\epsilon$ over time.
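A minimal simulation of epsilon-greedy on a hypothetical Bernoulli bandit (the arm means match the three-arm example discussed further below):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # unknown to the agent
K, T, epsilon = len(true_means), 10_000, 0.1

Q = np.zeros(K)   # estimated values
N = np.zeros(K)   # selection counts

for t in range(T):
    if rng.random() < epsilon:
        a = rng.integers(K)                       # explore
    else:
        a = int(np.argmax(Q))                     # exploit
    r = float(rng.random() < true_means[a])       # Bernoulli reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                     # incremental mean update

print(Q, N)
```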

  1. Initialize estimated values $Q(a) = 0$ for all actions $a$
  2. Initialize action counts $N(a) = 0$ for all actions $a$
  3. For each time step $t = 1, 2, \ldots, T$:
    1. For each action that has been selected at least once, compute the UCB score: $$\text{UCB}(a) = Q(a) + c\sqrt{\frac{\ln t}{N(a)}}$$ where $c$ is an exploration parameter
    2. Select $a_t = \arg\max_a \text{UCB}(a)$, or if an action has never been selected, choose it
    3. Receive reward $r_t$
    4. Update:
      • $N(a_t) = N(a_t) + 1$
      • $Q(a_t) = Q(a_t) + \frac{1}{N(a_t)}(r_t - Q(a_t))$

UCB adds an exploration bonus to actions that have been selected less frequently, with the bonus decreasing as the action is selected more often. This approach is optimal in certain settings and has theoretical guarantees on regret.
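The same simulation with UCB scores; the exploration parameter $c$ is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
K, T, c = len(true_means), 10_000, 2.0

Q = np.zeros(K)
N = np.zeros(K)

for t in range(1, T + 1):
    if np.any(N == 0):
        a = int(np.argmin(N))                        # play each arm once first
    else:
        ucb = Q + c * np.sqrt(np.log(t) / N)         # exploration bonus shrinks with N(a)
        a = int(np.argmax(ucb))
    r = float(rng.random() < true_means[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print(Q, N)
```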

  1. For each action $a$, initialize prior distribution parameters (e.g., $\alpha_a = 1, \beta_a = 1$ for Beta distribution)
  2. For each time step $t = 1, 2, \ldots, T$:
    1. For each action $a$, sample a value $\theta_a$ from its posterior distribution
    2. Select $a_t = \arg\max_a \theta_a$
    3. Receive reward $r_t$
    4. Update the posterior distribution for action $a_t$ based on $r_t$
      • For Bernoulli rewards: $\alpha_{a_t} = \alpha_{a_t} + r_t$, $\beta_{a_t} = \beta_{a_t} + (1 - r_t)$
      • For Gaussian rewards: Update mean and precision based on the new observation

Thompson sampling is a Bayesian approach that maintains a probability distribution over the expected reward of each arm. It naturally balances exploration and exploitation by sampling from these distributions.
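And Thompson sampling with Beta posteriors for Bernoulli rewards, on the same hypothetical arms:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
K, T = len(true_means), 10_000

alpha = np.ones(K)   # Beta prior parameters
beta = np.ones(K)

for t in range(T):
    theta = rng.beta(alpha, beta)                 # one posterior sample per arm
    a = int(np.argmax(theta))
    r = float(rng.random() < true_means[a])
    alpha[a] += r                                 # posterior update for Bernoulli rewards
    beta[a] += 1 - r

print(alpha / (alpha + beta))                     # posterior mean estimates
```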

Consider a scenario with 3 slot machines (arms), where the true expected rewards are $\mu_1 = 0.3$, $\mu_2 = 0.5$, and $\mu_3 = 0.7$, but these values are unknown to the agent.
  • With epsilon-greedy ($\epsilon = 0.1$), the agent would mostly select arm 3 once it discovers its high reward, but would still occasionally explore arms 1 and 2.
  • With UCB, the agent would initially try all arms, then focus on arm 3, but periodically check the other arms with decreasing frequency.
  • With Thompson sampling, the agent would maintain Beta distributions for each arm. As it collects more data, the distribution for arm 3 would concentrate around 0.7, making it increasingly likely to be selected.

As $T$ increases, all these strategies would converge to primarily selecting arm 3, but they differ in how quickly they identify the best arm and how they handle exploration.

Contextual Bandits

Contextual bandits extend the multi-armed bandit problem by introducing context or state information. In this setting, the expected reward of each arm depends on the current context.

In the contextual bandit problem:

  • At each time step $t$, the agent observes a context $x_t \in \mathcal{X}$
  • Based on the context, the agent selects an action $a_t \in \{1, 2, \ldots, K\}$
  • The agent receives a reward $r_t \sim R(x_t, a_t)$
  • The goal is to learn a policy $\pi: \mathcal{X} \rightarrow \{1, 2, \ldots, K\}$ that maximizes the expected reward

This formulation bridges the gap between the stateless multi-armed bandit problem and the full RL problem with state transitions.
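A minimal LinUCB-style sketch (one ridge-regression model per arm plus a confidence bonus); the feature dimension, exploration parameter, and simulated reward model are illustrative assumptions:

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: a ridge-regression model per arm with a UCB bonus."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]     # X^T X + I per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]   # X^T r per arm

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # confidence width for this arm
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Hypothetical usage with random contexts and a linear reward model per arm.
rng = np.random.default_rng(0)
true_theta = np.array([[0.1, 0.8], [0.6, 0.2], [0.4, 0.4]])
agent = LinUCB(n_arms=3, dim=2)
for t in range(2000):
    x = rng.random(2)                                     # observed context
    a = agent.select(x)
    r = true_theta[a] @ x + rng.normal(scale=0.05)
    agent.update(a, x, r)

print([np.linalg.solve(A, b) for A, b in zip(agent.A, agent.b)])  # ridge estimates per arm
```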

Common approaches to contextual bandits include:

Applications

Multi-armed bandit algorithms are widely used in scenarios where exploration must be balanced with exploitation:

Contextual bandits, in particular, have found success in personalized recommendations and news article selection, where the context includes user information and past behavior.
