Primal: $$\begin{align} \text{minimize} \quad & \mathbf{c}^T\mathbf{x} \\ \text{subject to} \quad & \mathbf{A}\mathbf{x} = \mathbf{b} \\ & \mathbf{x} \geq \mathbf{0} \end{align}$$
Dual: $$\begin{align} \text{maximize} \quad & \mathbf{b}^T\mathbf{y} \\ \text{subject to} \quad & \mathbf{A}^T\mathbf{y} \leq \mathbf{c} \end{align}$$
where $\mathbf{y} \in \mathbb{R}^m$ is the vector of dual variables.

While the simplex method moves along the boundary of the feasible region, interior point methods traverse its interior. They are particularly effective for large-scale problems.
Modern interior point methods include:

In Pyomo, for example, the HiGHS solver (which provides both interior point and dual simplex algorithms for linear programs) can be selected through the APPSI interface:

```python
from pyomo.environ import SolverFactory
solver = SolverFactory('appsi_highs')  # HiGHS solver via Pyomo's APPSI interface
```
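To make the primal formulation above concrete, here is a minimal sketch that builds and solves a small instance with Pyomo and HiGHS; the data $\mathbf{A}$, $\mathbf{b}$, $\mathbf{c}$ and all model names are made up for illustration, and Pyomo together with the `highspy` package is assumed to be installed.

```python
import pyomo.environ as pyo

# Hypothetical problem data for: min c^T x  s.t.  A x = b, x >= 0.
c = [1.0, 2.0, 0.0]
A = [[1.0, 1.0, 1.0],
     [2.0, 0.5, 0.0]]
b = [10.0, 6.0]
n, m = len(c), len(b)

model = pyo.ConcreteModel()
model.x = pyo.Var(range(n), domain=pyo.NonNegativeReals)                  # x >= 0
model.obj = pyo.Objective(expr=sum(c[j] * model.x[j] for j in range(n)))  # minimize c^T x
model.eq = pyo.Constraint(
    range(m),
    rule=lambda mdl, i: sum(A[i][j] * mdl.x[j] for j in range(n)) == b[i])  # A x = b

pyo.SolverFactory('appsi_highs').solve(model)
print([pyo.value(model.x[j]) for j in range(n)], pyo.value(model.obj))
```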
Dynamic Programming solves complex optimization problems by breaking them down into simpler subproblems. The key idea is to solve each subproblem only once and reuse its stored solution whenever the same subproblem reappears.
Key principles of Dynamic Programming:

- Optimal substructure: an optimal solution can be assembled from optimal solutions of its subproblems.
- Overlapping subproblems: the same subproblems recur many times, so their solutions are worth caching (memoization) or building up in a table (tabulation).
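As a concrete, self-contained illustration of these principles (not taken from the original text), the sketch below solves a small 0/1 knapsack instance by memoized recursion; the item values, weights, and capacity are made up.

```python
from functools import lru_cache

items = [(10, 5), (40, 4), (30, 6), (50, 3)]   # hypothetical (value, weight) pairs
capacity = 10

@lru_cache(maxsize=None)                        # memoization: each subproblem solved once
def best(i, cap):
    """Maximum value achievable with items[i:] and remaining capacity cap."""
    if i == len(items) or cap == 0:
        return 0
    value, weight = items[i]
    skip = best(i + 1, cap)                                           # subproblem: skip item i
    take = value + best(i + 1, cap - weight) if weight <= cap else 0  # subproblem: take item i
    return max(skip, take)                                            # optimal substructure

print(best(0, capacity))   # 90: taking the items worth 40 and 50
```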
Stochastic Dual Dynamic Programming (SDDP) is an advanced optimization technique that extends dynamic programming to handle multi-stage stochastic optimization problems with high-dimensional state spaces.
Consider a multi-reservoir hydropower system planning problem:
This problem has continuous state and action spaces, multiple stages (e.g., weekly planning for a year), and significant uncertainty, making it suitable for SDDP.
In the hydropower planning context, SDDP would:
The resulting policy provides operation decisions (water releases) for any reservoir level and inflow scenario, balancing immediate generation with future water value.
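The central object SDDP builds is a piecewise-linear outer approximation of the future value of stored water, assembled from Benders-style cuts. The sketch below is a simplified, single-reservoir illustration of how such cuts would be used at one stage; the cut coefficients, prices, and bounds are made up, and the brute-force search stands in for the stage LP that SDDP would actually solve.

```python
# Hypothetical cuts (intercept, slope): each over-estimates the concave value-to-go
# V(storage), so the approximation is their lower envelope.
cuts = [(0.0, 3.0), (50.0, 1.5), (110.0, 0.2)]

def future_value(storage):
    """Piecewise-linear approximation of the value of water left in the reservoir."""
    return min(a + b * storage for a, b in cuts)

def stage_decision(storage, inflow, price, max_release=40.0):
    """Pick a release balancing immediate revenue against the approximate future value."""
    best_release, best_total = 0.0, float('-inf')
    for step in range(101):
        release = min(max_release * step / 100, storage + inflow)  # cannot release more water than available
        total = price * release + future_value(storage + inflow - release)
        if total > best_total:
            best_release, best_total = release, total
    return best_release, best_total

print(stage_decision(storage=60.0, inflow=10.0, price=2.0))
```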
Advantages:
Limitations:
Several extensions to the basic SDDP algorithm have been developed:
The objective is to find an optimal policy $\pi^*$ that maximizes the expected cumulative discounted reward: $$\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
The state-value function $V^{\pi}(s)$ represents the expected return starting from state $s$ and following policy $\pi$:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s\right]$$

The action-value function $Q^{\pi}(s, a)$ represents the expected return after taking action $a$ in state $s$ and following policy $\pi$:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a\right]$$

The optimal value functions satisfy the Bellman optimality equations:
$$V^*(s) = \max_a \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V^*(s')]$$

$$Q^*(s,a) = \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma \max_{a'} Q^*(s',a')]$$

Once we know $Q^*$, the optimal policy can be derived as:
$$\pi^*(s) = \arg\max_a Q^*(s,a)$$

Unlike dynamic programming approaches, model-free methods don't require knowledge of the transition probabilities and reward function. They learn directly from experience.
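For contrast with the model-free methods discussed next, the sketch below applies the Bellman optimality update directly to a known model (value iteration) on a tiny, made-up two-state MDP; all transition probabilities, rewards, and the discount factor are illustrative.

```python
import numpy as np

# P[a, s, s'] = transition probability, R[a, s, s'] = reward (all values made up).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.5, 0.5], [0.1, 0.9]]])   # action 1
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.5], [0.0, 3.0]]])
gamma, V = 0.95, np.zeros(2)

for _ in range(10_000):
    # Bellman optimality update: Q(s,a) = sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]
    Q = (P * (R + gamma * V)).sum(axis=2)   # shape (n_actions, n_states)
    V_new = Q.max(axis=0)                   # V(s) = max_a Q(s,a)
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

pi_star = Q.argmax(axis=0)                  # pi*(s) = argmax_a Q*(s,a)
print(V, pi_star)
```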
The key difference between Q-Learning and SARSA is that Q-Learning is an off-policy method (learns the value of the optimal policy), while SARSA is an on-policy method (learns the value of the policy being followed).
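The difference is easiest to see in the update rules themselves. Below is a minimal tabular sketch; the state/action counts, step size, and exploration rate are arbitrary illustrative choices, and the surrounding environment loop is omitted.

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.95, 0.1        # step size, discount, exploration rate
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def epsilon_greedy(Q, s):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(Q[s].argmax())                 # exploit

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: bootstrap on the greedy action in s_next, regardless of what is taken next.
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: bootstrap on the action a_next actually chosen by the behaviour policy.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```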
The objective is to maximize the expected return:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

The policy gradient theorem gives the gradient of this objective:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s,a)\right]$$
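A minimal way to turn this theorem into an algorithm is REINFORCE, which replaces $Q^{\pi_\theta}(s,a)$ with a sampled return and follows the resulting stochastic gradient. The sketch below does this for a made-up two-action bandit (a single-state MDP) with a softmax policy; the reward means, learning rate, and episode count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)            # policy parameters, one per action
alpha = 0.05                   # learning rate
true_means = [1.0, 2.0]        # hypothetical expected rewards of the two actions

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()         # softmax probabilities pi_theta(a)

for episode in range(2000):
    probs = policy(theta)
    a = rng.choice(2, p=probs)              # sample a ~ pi_theta(.|s)
    G = rng.normal(true_means[a], 1.0)      # sampled return, standing in for Q(s, a)
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                   # grad_theta log softmax = e_a - pi_theta
    theta += alpha * G * grad_log_pi        # stochastic gradient ascent on J(theta)

print(policy(theta))   # most probability mass should end up on the better action (index 1)
```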