Primal: $$\begin{align} \text{minimize} \quad & \mathbf{c}^T\mathbf{x} \\ \text{subject to} \quad & \mathbf{A}\mathbf{x} = \mathbf{b} \\ & \mathbf{x} \geq \mathbf{0} \end{align}$$
Dual: $$\begin{align} \text{maximize} \quad & \mathbf{b}^T\mathbf{y} \\ \text{subject to} \quad & \mathbf{A}^T\mathbf{y} \leq \mathbf{c} \end{align}$$
where $\mathbf{y} \in \mathbb{R}^m$ is the vector of dual variables.

While the simplex method moves along the boundary of the feasible region, interior point methods traverse its interior. They are particularly effective for large-scale problems.
Modern interior point methods include:

In Pyomo, for example, the HiGHS solver (which provides both interior point and dual simplex algorithms for linear programs) can be selected through the APPSI interface:

```python
from pyomo.environ import SolverFactory
solver = SolverFactory('appsi_highs')  # HiGHS solver via Pyomo's APPSI interface
```
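To make the primal formulation above concrete, here is a minimal sketch that builds and solves a small instance with Pyomo and HiGHS; the data $\mathbf{A}$, $\mathbf{b}$, $\mathbf{c}$ and all model names are made up for illustration, and Pyomo together with the `highspy` package is assumed to be installed.

```python
import pyomo.environ as pyo

# Hypothetical problem data for: min c^T x  s.t.  A x = b, x >= 0.
c = [1.0, 2.0, 0.0]
A = [[1.0, 1.0, 1.0],
     [2.0, 0.5, 0.0]]
b = [10.0, 6.0]
n, m = len(c), len(b)

model = pyo.ConcreteModel()
model.x = pyo.Var(range(n), domain=pyo.NonNegativeReals)                  # x >= 0
model.obj = pyo.Objective(expr=sum(c[j] * model.x[j] for j in range(n)))  # minimize c^T x
model.eq = pyo.Constraint(
    range(m),
    rule=lambda mdl, i: sum(A[i][j] * mdl.x[j] for j in range(n)) == b[i])  # A x = b

pyo.SolverFactory('appsi_highs').solve(model)
print([pyo.value(model.x[j]) for j in range(n)], pyo.value(model.obj))
```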
Dynamic Programming solves complex optimization problems by breaking them down into simpler subproblems. The key idea is to solve each subproblem only once and reuse its stored solution whenever the same subproblem reappears.
Key principles of Dynamic Programming:

- Optimal substructure: an optimal solution can be assembled from optimal solutions of its subproblems.
- Overlapping subproblems: the same subproblems recur many times, so their solutions are worth caching (memoization) or building up in a table (tabulation).
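As a concrete, self-contained illustration of these principles (not taken from the original text), the sketch below solves a small 0/1 knapsack instance by memoized recursion; the item values, weights, and capacity are made up.

```python
from functools import lru_cache

items = [(10, 5), (40, 4), (30, 6), (50, 3)]   # hypothetical (value, weight) pairs
capacity = 10

@lru_cache(maxsize=None)                        # memoization: each subproblem solved once
def best(i, cap):
    """Maximum value achievable with items[i:] and remaining capacity cap."""
    if i == len(items) or cap == 0:
        return 0
    value, weight = items[i]
    skip = best(i + 1, cap)                                           # subproblem: skip item i
    take = value + best(i + 1, cap - weight) if weight <= cap else 0  # subproblem: take item i
    return max(skip, take)                                            # optimal substructure

print(best(0, capacity))   # 90: taking the items worth 40 and 50
```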
Stochastic Dual Dynamic Programming (SDDP) is an advanced optimization technique that extends dynamic programming to handle multi-stage stochastic optimization problems with high-dimensional state spaces.
Consider a multi-reservoir hydropower system planning problem:
This problem has continuous state and action spaces, multiple stages (e.g., weekly planning for a year), and significant uncertainty, making it suitable for SDDP.
In the hydropower planning context, SDDP would:
The resulting policy provides operation decisions (water releases) for any reservoir level and inflow scenario, balancing immediate generation with future water value.
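The central object SDDP builds is a piecewise-linear outer approximation of the future value of stored water, assembled from Benders-style cuts. The sketch below is a simplified, single-reservoir illustration of how such cuts would be used at one stage; the cut coefficients, prices, and bounds are made up, and the brute-force search stands in for the stage LP that SDDP would actually solve.

```python
# Hypothetical cuts (intercept, slope): each over-estimates the concave value-to-go
# V(storage), so the approximation is their lower envelope.
cuts = [(0.0, 3.0), (50.0, 1.5), (110.0, 0.2)]

def future_value(storage):
    """Piecewise-linear approximation of the value of water left in the reservoir."""
    return min(a + b * storage for a, b in cuts)

def stage_decision(storage, inflow, price, max_release=40.0):
    """Pick a release balancing immediate revenue against the approximate future value."""
    best_release, best_total = 0.0, float('-inf')
    for step in range(101):
        release = min(max_release * step / 100, storage + inflow)  # cannot release more water than available
        total = price * release + future_value(storage + inflow - release)
        if total > best_total:
            best_release, best_total = release, total
    return best_release, best_total

print(stage_decision(storage=60.0, inflow=10.0, price=2.0))
```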
Advantages:
Limitations:
Several extensions to the basic SDDP algorithm have been developed:
The objective is to find an optimal policy $\pi^*$ that maximizes the expected cumulative discounted reward: $$\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
The state-value function $V^{\pi}(s)$ represents the expected return starting from state $s$ and following policy $\pi$:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s\right]$$

The action-value function $Q^{\pi}(s, a)$ represents the expected return after taking action $a$ in state $s$ and following policy $\pi$:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a\right]$$

The optimal value functions satisfy the Bellman optimality equations:
$$V^*(s) = \max_a \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V^*(s')]$$

$$Q^*(s,a) = \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma \max_{a'} Q^*(s',a')]$$

Once we know $Q^*$, the optimal policy can be derived as:
$$\pi^*(s) = \arg\max_a Q^*(s,a)$$

Unlike dynamic programming approaches, model-free methods don't require knowledge of the transition probabilities and reward function. They learn directly from experience.
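For contrast with the model-free methods discussed next, the sketch below applies the Bellman optimality update directly to a known model (value iteration) on a tiny, made-up two-state MDP; all transition probabilities, rewards, and the discount factor are illustrative.

```python
import numpy as np

# P[a, s, s'] = transition probability, R[a, s, s'] = reward (all values made up).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.5, 0.5], [0.1, 0.9]]])   # action 1
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.5], [0.0, 3.0]]])
gamma, V = 0.95, np.zeros(2)

for _ in range(10_000):
    # Bellman optimality update: Q(s,a) = sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]
    Q = (P * (R + gamma * V)).sum(axis=2)   # shape (n_actions, n_states)
    V_new = Q.max(axis=0)                   # V(s) = max_a Q(s,a)
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

pi_star = Q.argmax(axis=0)                  # pi*(s) = argmax_a Q*(s,a)
print(V, pi_star)
```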
The key difference between Q-Learning and SARSA is that Q-Learning is an off-policy method (learns the value of the optimal policy), while SARSA is an on-policy method (learns the value of the policy being followed).
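The difference is easiest to see in the update rules themselves. Below is a minimal tabular sketch; the state/action counts, step size, and exploration rate are arbitrary illustrative choices, and the surrounding environment loop is omitted.

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.95, 0.1        # step size, discount, exploration rate
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def epsilon_greedy(Q, s):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(Q[s].argmax())                 # exploit

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: bootstrap on the greedy action in s_next, regardless of what is taken next.
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: bootstrap on the action a_next actually chosen by the behaviour policy.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```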
The objective is to maximize the expected return:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

The policy gradient theorem gives the gradient of this objective:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s,a)\right]$$
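A minimal way to turn this theorem into an algorithm is REINFORCE, which replaces $Q^{\pi_\theta}(s,a)$ with a sampled return and follows the resulting stochastic gradient. The sketch below does this for a made-up two-action bandit (a single-state MDP) with a softmax policy; the reward means, learning rate, and episode count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)            # policy parameters, one per action
alpha = 0.05                   # learning rate
true_means = [1.0, 2.0]        # hypothetical expected rewards of the two actions

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()         # softmax probabilities pi_theta(a)

for episode in range(2000):
    probs = policy(theta)
    a = rng.choice(2, p=probs)              # sample a ~ pi_theta(.|s)
    G = rng.normal(true_means[a], 1.0)      # sampled return, standing in for Q(s, a)
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                   # grad_theta log softmax = e_a - pi_theta
    theta += alpha * G * grad_log_pi        # stochastic gradient ascent on J(theta)

print(policy(theta))   # most probability mass should end up on the better action (index 1)
```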