The goal of the example is to find a minimal point.
On the interactive demo, the red line points in the direction of the gradient, $\nabla f$ (a fancy word for derivative),
with the magnitude of the gradient indicated by its length.
The green line points in the opposite direction of the gradient,
with its length given by $\|\alpha\cdot\nabla f\|_2$.
In each step of the algorithm, we move in the direction of the green line, with
step size given by its length.
(We will say more about "why" below.)
You may adjust $\alpha$ by giving it a new value.
You may get a new starting point by clicking the button below.
To proceed, click the button :)
The Gradient Descent Algorithm is also implemented for Linear Regression
on the page Linear Regression.
Click & Drag to change the view of the 3D surface.
Scroll on the plot to zoom.
The goal of the example is to find a minimal point.
On the interactive demo, the blue arrow points in the opposite direction of the gradient, $\nabla f$ (a fancy word for derivative),
with its size proportional to the magnitude of the gradient, $\|\nabla f\|_2$.
The Gradient Descent Algorithm is also implemented for Linear Regression
on the page Linear Regression.
The idea behind Gradient Descent is intuitive: say we have a scalar-valued function
$$f:\mathbb{R}^n\to \mathbb{R}$$
and we are standing at a point $x$ in the domain $\mathbb{R}^n$, about to take a step in the
direction of some unit vector $v$ with step size $\delta$.
When we want to maximize the output, we step in the direction of the gradient, $v=\nabla f(x)/\|\nabla f(x)\|_2$.
Similarly, if we want to minimize the output, we step in the opposite direction of the gradient,
$-\nabla f(x)$. For the mathematical justification, see my notes at .
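In brief (a sketch of the argument, not the full proof): the first-order approximation together with the Cauchy-Schwarz inequality gives, for a unit vector $v$,
$$f(x+\delta v)\approx f(x) + \delta\,\nabla f(x)^T v, \qquad \left|\nabla f(x)^T v\right| \le \|\nabla f(x)\|_2\,\|v\|_2 = \|\nabla f(x)\|_2,$$
with equality exactly when $v$ is parallel to $\nabla f(x)$; so the gradient direction increases $f$ fastest, and its negative decreases $f$ fastest.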
For a function $f:\mathbb{R}^n\to \mathbb{R}$, the Gradient Descent Algorithm seeks a local
minimum of $f$ through the following iterative procedure
Start with some initial point $x_0$
At step $i$ : set $$ x_i = x_{i-1} - \alpha\cdot \nabla f(x_{i-1}) $$
Repeat the previous step until $\|\alpha \cdot \nabla f(x_{i-1})\|_2$ is sufficiently small.
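As a minimal sketch (not the demo's actual code), the iteration can be written with NumPy; the gradient `grad_f`, step size `alpha`, tolerance `tol`, and the example function are illustrative choices:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, tol=1e-6, max_iter=10_000):
    """Minimize f by stepping against its gradient (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = alpha * grad_f(x)
        x = x - step                      # x_i = x_{i-1} - alpha * grad f(x_{i-1})
        if np.linalg.norm(step) < tol:    # stop when the step is sufficiently small
            break
    return x

# Example: f(x, y) = x^2 + y^2, whose gradient is (2x, 2y); the minimum is at the origin.
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0], alpha=0.1))
```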
Derivative Tests
When $f$ can be restricted to a bounded domain, or when $f$ is concave / convex, a global optimum can be obtained via
derivative tests.
Let $f:\mathbb{R}^n\to\mathbb{R}$ with $f\in C^2$, and let $H_f$ be the
Hessian
matrix of $f$.
Suppose
$$\nabla f(x_0) =\mathbf{0}$$
If $H_f(x_0)$ is positive definite, then $x_0$ is a local minimum;
if
$H_f(x_0)$ is negative definite, then $x_0$ is a local maximum.
If $f$ is convex, then $x_0$ is a global minimum.
If $f$ is concave, then $x_0$ is a global maximum.
For definitions and discussion, see my notes at 🖇️.
Newton's Method
To find critical points $x_0$, where $$\nabla f(x_0) =\mathbf{0}$$ one can use Newton's Method.
Let $g(\mathbf{x}):\mathbb{R}^n\to \mathbb{R}^m$. Newton's Method is an algorithm
for finding a root of $g$ (if one exists).
We go over two different views of Newton's Method.
Linear Approximation
To derive Newton's Method, we start with the first-order linear approximation
$$\vec{0}\approx J_g(x_0)(x-x_0)+g(x_0)$$ where $x$ is a root we are seeking, $x_0$ is a point near the root,
and $J_g$ is the Jacobian
$$J_g(x_0)=\left[\begin{array}{ccc}
\frac{\partial g_{1}}{\partial x_{1}} & \cdots & \frac{\partial g_{1}}{\partial x_{n}} \\
\vdots & \ddots & \vdots \\
\frac{\partial g_{m}}{\partial x_{1}} & \cdots & \frac{\partial g_{m}}{\partial x_{n}}
\end{array}\right]$$
When $m=n$ and $J_g(x_0)$ is invertible, the linear approximation implies $$x \approx x_0-J^{-1}_g(x_0)g(x_0)$$
This gives us Newton's Method
Start with an initial guess $x_0$
Repeat until convergence
$$x_i = x_{i-1}- J^{-1}_g(x_{i-1})g(x_{i-1})$$
If $g(\mathbf{x}):\mathbb{R}^n\to \mathbb{R}^n$ is affine, i.e. it is of the form $$g(x)=Mx+b$$ then the
first-order
linear approximation will be exact,
and Newton's method requires only one step to get the root
$$x_1 = x_0 - M^{-1} (Mx_0+b) = -M^{-1}b $$
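A hedged Python sketch of the iteration for $g:\mathbb{R}^n\to\mathbb{R}^n$, solving a linear system instead of forming $J^{-1}_g$ explicitly; the example system, Jacobian, and tolerance are illustrative:

```python
import numpy as np

def newton(g, jac, x0, tol=1e-10, max_iter=50):
    """Find a root of g via x_i = x_{i-1} - J^{-1} g(x_{i-1}) (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        gx = g(x)
        if np.linalg.norm(gx) < tol:
            break
        x = x - np.linalg.solve(jac(x), gx)   # solve J dx = g(x) instead of inverting J
    return x

# Example: g(x, y) = (x^2 + y^2 - 1, x - y); roots lie on the unit circle with x = y.
g = lambda v: np.array([v[0]**2 + v[1]**2 - 1, v[0] - v[1]])
jac = lambda v: np.array([[2 * v[0], 2 * v[1]], [1.0, -1.0]])
print(newton(g, jac, x0=[1.0, 0.5]))   # approximately (0.7071, 0.7071)
```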
Fixed Point Equation
A fixed point equation is an equation of the form
$$x=f(x)$$
When $f(x)$ is a contraction, a solution to the equation can be found efficiently through an iterative
procedure.
Let $(X, d)$ be a non-empty complete metric space with a contraction mapping $$T: X
\rightarrow X$$ Then $T$ admits a unique fixed-point $x^{*}$ in $X$ (i.e.
$T\left(x^{*}\right)=x^{*}$ ).
Furthermore, $x^{*}$ can be found as follows:
start with an arbitrary element $x_{0} \in X$ and define a
sequence $\left(x_{n}\right)_{n \in \mathbb{N}}$ by $$x_{n}=T\left(x_{n-1}\right)$$ for $n
\geq
1$.
Then $$\lim _{n
\rightarrow \infty} x_{n}=x^{*}$$
Newton's Method is a special case of fixed-point iteration, with the fixed-point equation
$$x=x-J^{-1}_g(x)g(x)$$
that is, $T(x)=x-J^{-1}_g(x)g(x)$.
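A small Python sketch of the iteration $x_n = T(x_{n-1})$; the contraction $T(x)=\cos x$ (a contraction on $[\cos 1, 1]$, into which the iteration falls immediately) is a standard illustrative choice:

```python
import math

def fixed_point(T, x0, tol=1e-12, max_iter=10_000):
    """Iterate x_n = T(x_{n-1}) until successive iterates are close (illustrative sketch)."""
    x = x0
    for _ in range(max_iter):
        x_next = T(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# The iteration converges to the unique fixed point x* with cos(x*) = x* (about 0.739085).
print(fixed_point(math.cos, x0=0.5))
```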
Derivative-Free Methods
Random Search
Initialize $x$ with a random position
Until termination condition is met
Generate a random direction $d$
Generate a random step size $\alpha$
If $f(x+\alpha d) < f(x)$, then set $x=x+\alpha d$
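A minimal Python sketch of this random-search loop; the Gaussian direction, the uniform step-size distribution, and the fixed iteration budget are illustrative assumptions:

```python
import numpy as np

def random_search(f, x0, n_iter=5_000, max_step=0.5, seed=0):
    """Accept a random step only when it decreases f (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        d = rng.normal(size=x.shape)
        d /= np.linalg.norm(d)                 # random unit direction
        alpha = rng.uniform(0.0, max_step)     # random step size
        candidate = x + alpha * d
        if f(candidate) < f(x):                # keep the move only if it improves f
            x = candidate
    return x

# Example: minimize f(x, y) = x^2 + y^2.
print(random_search(lambda v: np.sum(v**2), x0=[2.0, -3.0]))
```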
Genetic Algorithm
Initialize a population of $N$ individuals
$$P_0 = \{x_0^1,\cdots,x_0^N\}$$
With fitness values
$$\{f(x_0^i)\}$$
Until termination condition is met
Select a pair based on fitness $$S(P_k)\to\left(x_k^i,x_k^j\right)\in P_k$$
Produce a descendant $d$ according to some crossover rule
$$ D(x_k^i,x_k^j)\to d$$
With probability $p$, mutate $d$,
$$ M(d)\to d$$
add $d$ to the population
$$P_{k+1} = P_k\cup\{d\}$$
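A toy real-valued version of this loop in Python. The size-2 tournament selection, averaging crossover, Gaussian mutation, and the fixed population size (replacing the worst individual rather than letting $P_k$ grow) are illustrative choices, not prescribed by the description above:

```python
import numpy as np

def genetic_minimize(f, dim, pop_size=30, n_iter=500, p_mutate=0.2, seed=0):
    """Toy genetic algorithm for minimizing f (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5.0, 5.0, size=(pop_size, dim))    # initial population P_0
    fit = np.array([f(x) for x in pop])                   # fitness values f(x^i)
    for _ in range(n_iter):
        # Selection: two parents, each the better of a random pair (lower f is fitter here).
        idx = [min(rng.choice(pop_size, 2, replace=False), key=lambda i: fit[i])
               for _ in range(2)]
        # Crossover: average the two parents (one simple crossover rule).
        child = pop[idx].mean(axis=0)
        # Mutation: with probability p, add Gaussian noise.
        if rng.random() < p_mutate:
            child = child + rng.normal(scale=0.5, size=dim)
        # Insert the child by replacing the worst individual, if it improves on it.
        worst = np.argmax(fit)
        if f(child) < fit[worst]:
            pop[worst], fit[worst] = child, f(child)
    return pop[np.argmin(fit)]

# Example: minimize f(x, y) = x^2 + y^2.
print(genetic_minimize(lambda v: float(np.sum(v**2)), dim=2))
```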
Nelder-Mead
Given a function $f:\mathbb{R}^n\to \mathbb{R}$,
Initialize a simplex $S_0$ with $n+1$ vertices
$$S_0 = \{x_0,\cdots,x_{n}\}$$
an expansion coefficient $\delta_{e}\in (1,\infty)$
a contraction coefficient $\delta_{\text{c}}\in (0,1) $
a shrink coefficient $\gamma\in (0,1) $
Order $S_k$ by $f(x_0)\leq \cdots \leq f(x_{n})$
and set $f_{\text{best}}^k = f(x_0)$.
Compute the centroid of all vertices except the worst, $$x_{\text{o}}=\frac{1}{n}\sum_{i=0}^{n-1}x_i$$ the reflected point $$x_{\text{r}}=x_{\text{o}}+(x_{\text{o}}-x_n)$$ and $$f_{\text{r}}=f(x_{\text{r}})$$
Expansion : If $f_{\text{r}} < f_{\text{best}}^k$, then
$$x_{\text{e}}=x_{\text{o}} + \delta_{e}(x_\text{r}-x_{\text{o}})$$ $$
f_{\text{e}}=f(x_{\text{e}})$$ If $f_{\text{e}} < f_{\text{r}}$, then
$$x_n=x_{\text{e}}$$ else $$x_n=x_{\text{r}}$$
Reflection :
If $f_{\text{best}}^k \le f_{\text{r}} \le f(x_{n-1}) $, then
$$x_n = x_{\text{r}}$$
Contraction :
If $f(x_{n-1}) \le f_{\text{r}} \le f(x_n)$,
then
$$x_{\text{c}} = x_{\text{o}} +
\delta_{\text{c}}(x_{\text{r}}-x_{\text{o}} )$$ $$
f_{\text{c}} = f(x_{\text{c}})$$
If $f_{\text{c}} < f_{\text{r}} $, then $$x_n=x_{\text{c}}$$ else go to the Shrink step
If $ f(x_n)\le f_{\text{r}} $, then
$$x_{\text{c}} = x_{\text{o}} +
\delta_{\text{c}}(x_n-x_{\text{o}})$$ $$
f_{\text{c}} = f(x_{\text{c}})$$
If $f_{\text{c}} < f(x_n)$, then $$x_n=x_{\text{c}}$$ else go to the Shrink step
Increase $k$ by 1 and go back to the ordering step
Shrink
Set $S_{k+1}= \{x_0, x_0+\gamma(x_1-x_0),\cdots,x_0+\gamma(x_n-x_0)\}$ and go back to the ordering step
Consider the simple example
$$f(x)=x^2$$
with
$$\delta_{e}=1.5$$
Case 1 : If we start with the initial simplex (ordered)
$$S_0 = \{(-4),(-5) \}$$
Then
$$f^0_{\text{best}} = f(-4) = 16$$
$$x_{\text{o}} = -4 \qquad \text{(the centroid of all vertices except the worst)}$$
$$x_{\text{r}} = -4 + (-4+5) = -3$$
$$f_{\text{r}} = f(-3) = 9$$
Since $f_{\text{r}} < f_{\text{best}}^0 $ $$x_{\text{e}}=-4 + 1.5(-3+4)=-2.5$$
$$f_{\text{e}}=f(-2.5)=6.25 < f_{\text{r}} $$ So we set $$x_1=x_{\text{e}}=-2.5$$
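In practice one rarely codes Nelder-Mead by hand; SciPy ships an implementation (its coefficient conventions may differ slightly from the write-up above). A quick check on the same $f(x)=x^2$ example, started near the initial simplex used above:

```python
import numpy as np
from scipy.optimize import minimize

# Nelder-Mead on f(x) = x^2.
result = minimize(lambda x: float(np.sum(x**2)), x0=[-4.0], method="Nelder-Mead",
                  options={"xatol": 1e-8, "fatol": 1e-8})
print(result.x, result.fun)   # approximately [0.0], 0.0
```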
Linear Programming
Standard Form
A linear program in standard form is written as:
$$\begin{align}
\text{minimize} \quad & \mathbf{c}^T\mathbf{x} \\
\text{subject to} \quad & \mathbf{A}\mathbf{x} = \mathbf{b} \\
& \mathbf{x} \geq \mathbf{0}
\end{align}$$
where:
$\mathbf{x} \in \mathbb{R}^n$ is the vector of decision variables
$\mathbf{c} \in \mathbb{R}^n$ is the vector of objective function coefficients
$\mathbf{A} \in \mathbb{R}^{m \times n}$ is the constraint coefficient matrix
$\mathbf{b} \in \mathbb{R}^m$ is the right-hand side of the constraints
The feasible region of a linear program is a convex polyhedron.
If an optimal solution exists, it occurs at a vertex of the feasible region.
If multiple optimal solutions exist, then at least one of them is at a vertex of the feasible
region.
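A hedged sketch of solving a standard-form LP with SciPy's `linprog` (HiGHS backend); the data $\mathbf{c},\mathbf{A},\mathbf{b}$ below are an illustrative toy instance, not taken from the text:

```python
import numpy as np
from scipy.optimize import linprog

# Toy standard-form LP: minimize c^T x subject to A x = b, x >= 0.
c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])    # single equality constraint x1 + x2 + x3 = 1
b = np.array([1.0])

res = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3, method="highs")
print(res.x, res.fun)   # optimal at a vertex: x = (0, 0, 1), objective 0
```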
Simplex Method
Convert a linear program with inequality constraints $\mathbf{A}\mathbf{x} \leq \mathbf{b}$ to augmented form via slack variables:
$$\begin{align}
\text{minimize} \quad & \mathbf{c}^T\mathbf{x} \\
\text{subject to} \quad & \mathbf{A}\mathbf{x} + \mathbf{s} = \mathbf{b} \\
& \mathbf{c}^T\mathbf{x} - z = 0 \\
& \mathbf{x}, \mathbf{s} \geq \mathbf{0}
\end{align}$$
where $\mathbf{s}$ are slack variables and $z$ is the objective value
Find an initial basic feasible solution
While the solution can be improved:
Choose an entering variable (most negative reduced cost)
Choose a leaving variable (minimum ratio test)
Pivot to update the basis
If no negative reduced costs exist, the current solution is optimal
Duality
For every linear program (the primal), there exists a corresponding dual linear program. For the standard form above, the dual is:
$$\begin{align}
\text{maximize} \quad & \mathbf{b}^T\mathbf{y} \\
\text{subject to} \quad & \mathbf{A}^T\mathbf{y} \leq \mathbf{c}
\end{align}$$
where $\mathbf{y} \in \mathbb{R}^m$ is the vector of dual variables.
If the primal problem has an optimal solution $\mathbf{x}^*$, then the dual problem also has an optimal
solution $\mathbf{y}^*$, and
$$\mathbf{c}^T\mathbf{x}^* = \mathbf{b}^T\mathbf{y}^*$$
If $\mathbf{x}^*$ and $\mathbf{y}^*$ are optimal solutions to the primal and dual problems, then:
$$x_j^* \cdot (c_j - \mathbf{a}_j^T\mathbf{y}^*) = 0 \quad \forall j$$
$$y_i^* \cdot (b_i - \mathbf{A}_i\mathbf{x}^*) = 0 \quad \forall i$$
where $\mathbf{a}_j$ is the $j$-th column of $\mathbf{A}$ and $\mathbf{A}_i$ is the $i$-th row of
$\mathbf{A}$.
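Strong duality can be checked numerically by solving both problems; a sketch reusing the toy instance from the previous snippet (the dual is written as a minimization of $-\mathbf{b}^T\mathbf{y}$ so that `linprog` applies):

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

# Primal: min c^T x  s.t.  A x = b, x >= 0.
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3, method="highs")

# Dual: max b^T y  s.t.  A^T y <= c, y free.
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)], method="highs")

print(primal.fun, -dual.fun)   # both equal the shared optimal value (here 0)
```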
Interior Point Methods
While the simplex method moves along the boundary of the feasible region, interior point methods traverse the
interior of the feasible region. These methods are particularly effective for large-scale problems.
Start with a strictly feasible point (in the interior of the feasible region)
Until convergence:
Compute a search direction that improves the objective while staying away from the
boundary
Determine an appropriate step size
Move in the computed direction by the determined step size
Modern interior point methods include:
Path-following methods: Follow the central path to the optimal solution
Potential reduction methods: Minimize a potential function that combines the objective and a
barrier term
Affine-scaling methods: Scale the variables to stay away from boundaries
The dynamic programming recursion takes the form
$$V_i(y) = \max_{x}\left\{ f_i(y,x) + V_{i+1}\big(g_i(y,x)\big)\right\}$$
(with $\max$ replaced by $\min$ when $f_i$ is a cost), where:
$V_i(y)$ is the value function at stage $i$ and state $y$
$f_i(y,x)$ represents the immediate reward/cost from decision $x$ at state $y$
$g_i(y,x)$ describes the system transition from state $y$ under decision $x$
Key principles of Dynamic Programming:
Optimal Substructure: The optimal solution contains optimal solutions to subproblems
Overlapping Subproblems: The same subproblems are encountered multiple times
Backward Induction: Starting from the last stage and moving backward, we solve each stage
sequentially until reaching the initial stage
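A small Python sketch of backward induction in the notation above; the toy problem (inventory states $0,\dots,4$, decision "sell 0 or 1 unit", three stages with made-up prices) is entirely illustrative:

```python
# Backward induction for a toy finite-horizon problem:
#   state y  = units of inventory remaining (0..4)
#   decision x in {0, 1} = units sold at stage i (only feasible if x <= y)
#   immediate reward f_i(y, x) = prices[i] * x
#   transition       g_i(y, x) = y - x
prices = [3.0, 5.0, 2.0]          # illustrative stage prices
n_stages, max_inv = len(prices), 4

V = [[0.0] * (max_inv + 1) for _ in range(n_stages + 1)]   # V[n_stages][y] = 0 (terminal)
policy = [[0] * (max_inv + 1) for _ in range(n_stages)]

for i in reversed(range(n_stages)):                        # backward induction
    for y in range(max_inv + 1):
        best_value, best_x = float("-inf"), 0
        for x in (0, 1):
            if x > y:
                continue
            value = prices[i] * x + V[i + 1][y - x]        # f_i(y,x) + V_{i+1}(g_i(y,x))
            if value > best_value:
                best_value, best_x = value, x
        V[i][y] = best_value
        policy[i][y] = best_x

print(V[0][2])   # 8.0: with 2 units, it is optimal to sell at the two highest-price stages
```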
Stochastic Dual Dynamic Programming
Stochastic Dual Dynamic Programming (SDDP) is an advanced optimization technique that extends dynamic
programming to handle multi-stage stochastic optimization problems with high-dimensional state spaces.
SDDP addresses problems of the form:
$$\begin{align}
\min_{u_1, \ldots, u_T} \quad & \mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t, u_t)\right] \\
\text{subject to} \quad & u_t \in \mathcal{U}_t(x_t), \quad t = 1, \ldots, T
\end{align}$$
where:
$T$ is the number of time stages
$u_t$ is the decision variable at stage $t$
$x_t$ is the random variable (uncertainty) at stage $t$
$c_t$ is the cost function at stage $t$
$\mathcal{U}_t(x_t)$ is the feasible region at stage $t$, which depends on the current state $x_t$ (and hence on the
previous decision and the current uncertainty)
The key equation in SDDP is the Bellman equation:
$$V_t(x_t) = \min_{u_t \in \mathcal{U}_t(x_t)} \left\{c_t(x_t, u_t) + \mathbb{E}[V_{t+1}(x_{t+1})|x_t,
u_t]\right\}, \quad t = 1,\ldots,T$$
where:
$V_t(x_t)$ is the optimal cost-to-go function at stage $t$ given state $x_t$
$c_t(x_t, u_t)$ is immediate cost incurred at stage $t$
$u_t$ are decision variables at stage $t$
$\mathcal{U}_t(x_t)$ defines feasible decisions based on current state $x_t$
The expectation operator $\mathbb{E}[\cdot]$ accounts for uncertainty about future states
Terminal condition: $V_{T+1}(x_{T+1}) = 0$
This recursive formulation is the Bellman equation that we have seen earlier.
Algorithm Structure
Initialization:
Build initial approximations of the cost-to-go functions for all stages
Set iteration counter $k = 0$
Forward Pass:
Generate sample paths of the random variables $\{x_t^k\}_{t=1}^T$
For each stage $t = 1, \ldots, T$:
Solve the stage problem using the current approximation of the cost-to-go function
$$u_t^k = \arg \min_{u_t \in \mathcal{U}_t(x_t)} \left[c_t(x_t, u_t) +
\tilde{V}_{t+1}(x_{t+1})\right]$$
where $\tilde{V}_{t+1}(\cdot)$ is approximated by existing cuts
Store the optimal solutions $\{u_t^k\}$ and state variables
Backward Pass:
For each stage $t = T, \ldots, 2$ (working backwards):
For each sampled state from the forward pass:
Compute the dual variables (shadow prices) of the constraints
Generate and add cutting planes to improve the approximation of the cost-to-go
function
The cutting planes take the form:
$$V_{t+1}(x_{t+1}) \geqslant a_{t+1}^Tx_{t+1} + b_{t+1}$$
where $a_{t+1}$ are the dual variables and $b_{t+1}$ is the intercept term
Convergence Check:
Calculate upper and lower bounds on the optimal value
If bounds are sufficiently close or maximum iterations reached, stop
Otherwise, increment $k$ and return to step 2
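The backward pass's piecewise-linear approximation is simply a maximum of affine cuts. A minimal Python sketch of evaluating such an outer (lower) approximation; the cut data below are made up for illustration and are not the output of an actual SDDP run:

```python
import numpy as np

# Each cut (a, b) asserts V(x) >= a^T x + b; together they give an outer
# approximation of the convex cost-to-go function:  V_tilde(x) = max_k (a_k^T x + b_k).
cuts = [(np.array([-1.0]), 5.0),     # illustrative cut data
        (np.array([-0.2]), 2.0),
        (np.array([0.5]), 0.0)]

def cost_to_go_approx(x):
    x = np.atleast_1d(x)
    return max(float(a @ x + b) for a, b in cuts)

print(cost_to_go_approx(1.0))   # 4.0, the tightest of the three cuts at x = 1
```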
Example: Hydropower Planning
Consider a multi-reservoir hydropower system planning problem:
State variables: Water levels in reservoirs
Decision variables: Water releases and power generation
Uncertainty: Inflows to reservoirs (dependent on rainfall)
Objective: Minimize expected cost of operation and thermal generation
Constraints: Water balance, generation limits, transmission constraints
This problem has continuous state and action spaces, multiple stages (e.g., weekly planning for a
year), and significant uncertainty, making it suitable for SDDP.
In the hydropower planning context, SDDP would:
In the forward pass: Simulate operation decisions for sample inflow scenarios
In the backward pass: Build cutting planes that approximate the future cost function
Iteratively refine the approximation until convergence
The resulting policy provides operation decisions (water releases) for any reservoir level and inflow
scenario, balancing immediate generation with future water value.
Advantages and Limitations
Advantages:
Can handle high-dimensional continuous state spaces where traditional DP suffers from the curse of
dimensionality
Provides both primal solutions (decisions) and dual information (marginal values)
Delivers upper and lower bounds on the optimal solution value
Handles complex constraints and objective functions
Limitations:
Requires convexity of the problem (for the cuts to be valid)
Computational intensity increases with the number of stages and random variables
May struggle with very rare events or "black swan" scenarios
Classical version assumes stagewise independence of random variables
Extensions
Several extensions to the basic SDDP algorithm have been developed:
Risk-averse SDDP: Incorporates risk measures like CVaR (Conditional Value-at-Risk) instead of
expectation
Inexact SDDP: Uses approximate solutions in the backward pass to reduce computational burden
Multistage stochastic decomposition: Combines SDDP with stochastic approximation techniques
SDDP with auto-regressive random variables: Handles temporal dependencies in the uncertain parameters
Reinforcement Learning
Reinforcement Learning problems are typically formalized as Markov Decision Processes (MDPs), defined by
the tuple $(S, A, P, R, \gamma)$:
$S$ is the state space
$A$ is the action space
$P$ is the transition probability: $P(s'|s,a)$
$R$ is the reward function: $R(s,a,s')$
$\gamma \in [0,1]$ is the discount factor, balancing immediate vs. future rewards
The objective is to find an optimal policy $\pi^*$ that maximizes the
expected cumulative discounted
reward:
$$\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
Value Functions and Optimality
The state-value function $V^{\pi}(s)$ represents the expected return
starting
from state $s$
and following policy $\pi$:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$
Unlike dynamic programming approaches, model-free methods don't require knowledge of the transition
probabilities and reward function. They learn directly from experience.
Temporal Difference Learning
Initialize $Q(s,a)$ arbitrarily for all state-action pairs
For each episode:
Initialize state $s$
For each step of the episode:
Choose action $a$ from $s$ using policy derived from $Q$ (e.g.,
$\epsilon$-greedy)
Take action $a$, observe reward $r$ and next state $s'$
Update $Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$ (Q-Learning), or $Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma Q(s',a') - Q(s,a)\right]$ where $a'$ is the action actually chosen in $s'$ (SARSA)
Set $s \leftarrow s'$ and continue until the episode terminates
The key difference between Q-Learning and SARSA is that Q-Learning is an off-policy method (learns the
value of the optimal policy), while SARSA is an on-policy method (learns the value of the policy being
followed).
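A tabular Q-learning sketch in Python. The environment interface (`env.actions`, `env.reset()`, `env.step(a)` returning `(s', r, done)`), the learning rate, and the exploration rate are assumptions for illustration, not a specific library's API:

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning (off-policy TD control), illustrative sketch.

    Assumes a hypothetical env with: env.actions (list of actions),
    env.reset() -> s, and env.step(a) -> (s_next, reward, done).
    """
    Q = defaultdict(float)                        # Q[(s, a)], initialized to 0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Q-learning update: bootstrap on the greedy value of the next state
            target = r + (0.0 if done else gamma * max(Q[(s_next, act)] for act in env.actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```

For SARSA, the target would instead use the value of the action actually selected in $s'$, matching the on-policy/off-policy distinction described above.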
Policy Gradient Methods
Policy gradient methods directly optimize the policy without using a value function. The policy is
typically parameterized by a vector $\theta$:
$$\pi_\theta(a|s) = P(a|s,\theta)$$
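For completeness, the basic policy-gradient (REINFORCE) estimator updates $\theta$ along the gradient of the expected return; a standard statement (not derived on this page) is
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\right], \qquad G_t = \sum_{k=t}^{\infty} \gamma^{k-t} r_k,$$
so that in practice one takes steps $\theta \leftarrow \theta + \eta\, \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t$ for some learning rate $\eta$.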
RL vs. Dynamic Programming:
RL can handle much larger state spaces through function approximation
RL vs. Supervised Learning:
SL requires labeled training data
RL learns from reward signals, not explicit labels
RL addresses the credit assignment problem in sequential decisions
RL vs. Traditional Optimization:
Traditional optimization typically requires an explicit objective function
RL can learn from sparse, delayed rewards
RL naturally handles stochastic environments
RL can discover novel solutions not anticipated by human designers
Challenges in RL:
Sample efficiency: May require many interactions to learn effectively
Exploration-exploitation tradeoff: Balancing learning new information vs. exploiting current
knowledge
Stability: Training can be unstable, especially with function approximation
Reward design: Crafting reward functions that lead to desired behavior
Multi-Armed Bandits
The multi-armed bandit problem is a simplified reinforcement learning setting that focuses purely on the
exploration-exploitation tradeoff. It represents the core challenge of deciding whether to exploit current
knowledge for immediate reward or explore to gather more information.
In the classic multi-armed bandit problem:
There are $K$ actions (arms)
At each time step $t$, the agent selects an action $a_t \in \{1, 2, \ldots, K\}$
After selecting action $a_t$, the agent receives a reward $r_t \sim R(a_t)$
The goal is to maximize the cumulative reward $\sum_{t=1}^T r_t$ over $T$ steps
Each arm has an unknown expected reward $\mu_a = \mathbb{E}[R(a)]$
This differs from full RL in that there's no state transition—the environment doesn't change based on
the agent's actions.
Performance in bandit algorithms is often measured using regret, which
quantifies the difference between the rewards obtained by the agent and the rewards that could have
been obtained by always selecting the optimal arm:
$$\text{Regret}(T) = T\mu^* - \mathbb{E}\left[\sum_{t=1}^T r_t\right]$$
where $\mu^* = \max_a \mu_a$ is the expected reward of the optimal arm.
The goal is to minimize regret, which means effectively balancing exploration (trying different arms
to learn their expected rewards) and exploitation (selecting the arm currently believed to be best).
Common Strategies
Initialize estimated values $Q(a) = 0$ for all actions $a$
Initialize action counts $N(a) = 0$ for all actions $a$
For each time step $t = 1, 2, \ldots, T$:
With probability $\epsilon$, select a random action $a_t$; otherwise select $a_t = \arg\max_a Q(a)$
Receive reward $r_t$
Update $N(a_t) \leftarrow N(a_t) + 1$ and $Q(a_t) \leftarrow Q(a_t) + \frac{r_t - Q(a_t)}{N(a_t)}$
Epsilon-greedy is simple but effective. The parameter $\epsilon$ controls the exploration rate—higher
values lead to more exploration. A common variant is to decrease $\epsilon$ over time.
Initialize estimated values $Q(a) = 0$ for all actions $a$
Initialize action counts $N(a) = 0$ for all actions $a$
For each time step $t = 1, 2, \ldots, T$:
For each action that has been selected at least once, compute the UCB score:
$$\text{UCB}(a) = Q(a) + c\sqrt{\frac{\ln t}{N(a)}}$$
where $c$ is an exploration parameter
Select $a_t = \arg\max_a \text{UCB}(a)$, or if an action has never been selected, choose
it
UCB adds an exploration bonus to actions that have been selected less frequently, with the bonus
decreasing as the action is selected more often. This approach is optimal in certain settings and has
theoretical guarantees on regret.
For each action $a$, initialize prior distribution parameters (e.g., $\alpha_a = 1, \beta_a = 1$
for Beta distribution)
For each time step $t = 1, 2, \ldots, T$:
For each action $a$, sample a value $\theta_a$ from its posterior distribution
Select $a_t = \arg\max_a \theta_a$
Receive reward $r_t$
Update the posterior distribution for action $a_t$ based on $r_t$
For Bernoulli rewards with a Beta prior: $\alpha_{a_t} \leftarrow \alpha_{a_t} + r_t$, $\beta_{a_t} \leftarrow \beta_{a_t} + (1 - r_t)$
For Gaussian rewards: Update mean and precision based on the new observation
Thompson sampling is a Bayesian approach that maintains a probability distribution over the expected
reward of each arm. It naturally balances exploration and exploitation by sampling from these
distributions.
Consider a scenario with 3 slot machines (arms), where the true expected rewards are $\mu_1 = 0.3$,
$\mu_2 = 0.5$, and $\mu_3 = 0.7$, but these values are unknown to the agent.
With epsilon-greedy ($\epsilon = 0.1$), the agent would mostly select arm 3 once it discovers
its high reward, but would still occasionally explore arms 1 and 2.
With UCB, the agent would initially try all arms, then focus on arm 3, but periodically check
the other arms with decreasing frequency.
With Thompson sampling, the agent would maintain Beta distributions for each arm. As it collects
more data, the distribution for arm 3 would concentrate around 0.7, making it increasingly
likely to be selected.
As $T$ increases, all these strategies would converge to primarily selecting arm 3, but they differ
in how quickly they identify the best arm and how they handle exploration.
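A small simulation of epsilon-greedy on exactly this three-armed Bernoulli example ($\mu = 0.3, 0.5, 0.7$); the horizon and the value of $\epsilon$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])        # true (unknown to the agent) arm means
K, T, eps = len(mu), 10_000, 0.1

Q = np.zeros(K)                       # estimated values
N = np.zeros(K)                       # pull counts
total_reward = 0.0

for t in range(T):
    # epsilon-greedy: explore with probability eps, otherwise exploit argmax Q
    a = rng.integers(K) if rng.random() < eps else int(np.argmax(Q))
    r = float(rng.random() < mu[a])   # Bernoulli reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]         # incremental mean update
    total_reward += r

regret = T * mu.max() - total_reward
print(Q, N, regret)                   # arm 3 (index 2) receives most of the pulls
```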
Contextual Bandits
Contextual bandits extend the multi-armed bandit problem by introducing context or state information. In this
setting, the expected reward of each arm depends on the current context.
In the contextual bandit problem:
At each time step $t$, the agent observes a context $x_t \in \mathcal{X}$
Based on the context, the agent selects an action $a_t \in \{1, 2, \ldots, K\}$
The agent receives a reward $r_t \sim R(x_t, a_t)$
The goal is to learn a policy $\pi: \mathcal{X} \rightarrow \{1, 2, \ldots, K\}$ that maximizes
the expected reward
This formulation bridges the gap between the stateless multi-armed bandit problem and the full RL
problem with state transitions.
Common approaches to contextual bandits include:
LinUCB: Assumes a linear relationship between contexts and rewards
Neural Bandits: Uses neural networks to model the relationship between contexts and rewards
Random Forests: Employs tree-based methods to handle complex, non-linear contexts
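A compact sketch of (disjoint) LinUCB under the linear-reward assumption; the context dimension, exploration parameter $\alpha$, and how contexts and rewards are generated are left to the caller and are illustrative assumptions:

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm (illustrative sketch)."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]       # X^T X + I per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]     # X^T r per arm

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                               # ridge estimate of arm weights
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

The score mirrors the UCB rule above: a point estimate plus an uncertainty bonus that shrinks as an arm accumulates data in directions similar to the current context.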
Applications
Multi-armed bandit algorithms are widely used in scenarios where exploration must be balanced with
exploitation:
A/B testing: Testing different website designs or marketing strategies
Clinical trials: Allocating patients to treatments that appear most effective
Recommendation systems: Suggesting content that users are likely to engage with
Online advertising: Selecting ads that maximize click-through rates
Dynamic pricing: Adjusting prices to maximize revenue
Resource allocation: Distributing computing resources among different tasks
Contextual bandits, in particular, have found success in personalized recommendations and news article
selection, where the context includes user information and past behavior.
Dynamic Programming Approaches
Initialize $V(s)$ arbitrarily for all states
Repeat until convergence:
For each state $s \in S$:
$$V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V(s')]$$
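A direct Python translation of this value-iteration loop on a tiny two-state, two-action MDP; the transition and reward tables are made-up illustrative data:

```python
import numpy as np

# Tiny MDP: 2 states, 2 actions.  P[s, a, s'] are transition probabilities,
# R[s, a, s'] are rewards, gamma is the discount factor.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 3.0]]])
gamma, tol = 0.9, 1e-8

V = np.zeros(2)
while True:
    # V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V(s')]
    V_new = np.max(np.sum(P * (R + gamma * V), axis=2), axis=1)
    if np.max(np.abs(V_new - V)) < tol:
        break
    V = V_new

print(V)   # converged state values
```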