Probability & Statistics

This page started as notes for my students in Elementary Statistics, Spring 2022.

I'm planning to update this page to serve as a general probability & statistics reference.

Probability

Probability Space

A Probability Space is a triple $(\Omega, \mathcal{F}, P)$ where
  1. $\Omega$ is a sample space (a set of outcomes)
  2. $\mathcal{F}$ is a sigma-algebra (a collection of subsets of $\Omega$)
  3. $P$ is a probability measure (a function from $\mathcal{F}$ to $[0,1]$)
For events $E_1, E_2\in \mathcal{F}$, $E_1$ and $E_2$ are independent if $$P(E_1\cap E_2)=P(E_1)P(E_2)$$ The conditional probability $P(E_1|E_2)$ is defined as $$P(E_1|E_2)= \frac{P(E_1\cap E_2)}{P(E_2)}$$ assuming $E_2$ has non-zero probability.
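
As a quick sanity check of these definitions, here is a minimal sketch (an illustrative example, not part of the original notes) that enumerates a fair-die sample space and verifies independence and conditional probability exactly:

from fractions import Fraction

# Sample space: one roll of a fair six-sided die, with the uniform measure
omega = {1, 2, 3, 4, 5, 6}
P = lambda E: Fraction(len(E), len(omega))

E1 = {2, 4, 6}     # "the roll is even"
E2 = {1, 2, 3, 4}  # "the roll is at most 4"

# Independence: P(E1 ∩ E2) == P(E1) P(E2)
print(P(E1 & E2) == P(E1) * P(E2))  # True: 1/3 == (1/2)(2/3)
# Conditional probability: P(E1 | E2) = P(E1 ∩ E2) / P(E2)
print(P(E1 & E2) / P(E2))           # 1/2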

Random Variables

A Random Variable is a measurable function $$X: \Omega \to (R,\mathcal{R}) $$ where $(R,\mathcal{R})$ is a measurable space.

The probability that $X$ takes on a value in a measurable set $E\in \mc{R}$ is written as $$ \P{E}= P(X^{-1} E) $$



Moments

The set of all random variables $$ \{X :\Omega \to \R \} $$ is a vector space under pointwise addition and scalar multiplication. We can always enlarge $\Omega$ to accommodate new random variables, so $\Omega$ is often omitted when talking about random variables. Denote $$ \ms{L}_2= \{ X:\Omega \to \R \mid \E[X^2]< \infty \} $$ then $\ms{L}_2$ is a Hilbert space with the inner product $$ \langle X,Y \rangle = \E[XY]= \iint_{\mathbb{R^2}} xy\ f_{X,Y}(x,y)\ d x\, d y $$ $$ \norm{X}^2=\langle X,X \rangle $$ where $f_{X,Y}$ is the joint density of $(X,Y)$. This also allows us to view probability problems geometrically.
The $k$-th moment of $X$ is defined as $$\mu_k=\E[X^k]:= \int_\Omega X(\omega)^k P(d\omega)=\int_\R x^k \P{dx}= \int_{\mathbb{R}} x^k f_X d x$$
  • $\mu_1$ is the mean of $X$, also denoted $\mu_X$.
  • $\mu_2$ is the second moment of $X$. The variance of $X$, denoted $\sigma_X^2$, is the second central moment $\E[(X-\mu_X)^2]=\mu_2-\mu_1^2$.
    $\sigma_X$ is the standard deviation of $X$.
Let $X:\Omega\to \R^{\ge 0}$. Then $$\E[X]= \int_0^\infty \P{(X\ge x)}\ d x$$ More generally, for any $n\in \N$ and $X:\Omega\to \R$, $$\E[|X|^n] = \int_0^\infty \P{(|X|^n\ge x)}\ d x$$ To see this, write the expectation as an integral of indicators and swap the order of integration (Tonelli): $$ \begin{aligned} \E[X]&= \int_\Omega X \ dP \\ &= \int_\Omega\left(\int_0^\infty 1[X>y] \ dy\right) \ dP\\ &= \int_0^\infty \int_\Omega 1[X>y] \ d P \ dy\\ &= \int_0^\infty \P{(X\ge y)}\ d y \end{aligned} $$
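
The layer-cake formula is easy to check numerically. A minimal sketch, assuming $X\sim\text{Exponential}(1)$ so that both sides should be close to $1$:

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)  # X >= 0 with E[X] = 1

lhs = x.mean()  # direct Monte Carlo estimate of E[X]

# Empirical tail P(X >= t) on a grid, then a trapezoid-rule integral
t = np.linspace(0, 20, 2001)
xs = np.sort(x)
tail = 1 - np.searchsorted(xs, t) / xs.size
rhs = ((tail[:-1] + tail[1:]) / 2 * np.diff(t)).sum()

print(lhs, rhs)  # both ≈ 1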

Joint Distribution

Let $X=(X_1,X_2)$ where $X_1, X_2$ are real-valued random variables; the joint distribution of $X$ is characterized by $$ \P[X]{ [a,b]\times [c,d]} = \P{X_1\in [a,b],\ X_2\in [c,d]} $$ The expectation of $f(X)$ is defined as $$ \E[f(X)]=\int_{\R^2} f(x_1,x_2) \P[X]{dx_1 dx_2} $$ The covariance of two random variables $X$ and $Y$ is defined as $$\mathtt{Cov}(X, Y)=\E[(X-\mu_X)(Y-\mu_Y)]=\Inn{X-\mu_X}{Y-\mu_Y}$$ The correlation of $X$ and $Y$ is defined as $$\rho(X,Y) =\frac{\mathtt{Cov}(X, Y)}{\norm{X-\mu_X}\norm{Y-\mu_Y}} \color{blue}= \cos( \angle(X-\mu_X,Y-\mu_Y))$$

Statistics 1

Moments

Central tendency

Mean : $$\mu(X) = \frac{1}{n} \sum_{i=1}^n x_i$$ Median : assume the data is sorted $$\mathtt{Med}(X)= \begin{cases}X_{\frac{n+1}{2}} & \text { if } n \text { is odd } \\ \frac{X_{\frac{n}{2}}+X_{\frac{n}{2}+1}}{2} & \text { if } n \text { is even }\end{cases}$$
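
numpy implements both directly; a quick check on a small hypothetical dataset with $n=8$ (even):

import numpy as np

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])  # sorted: 1 1 2 3 4 5 6 9
print(np.mean(x))    # 3.875
print(np.median(x))  # (3 + 4) / 2 = 3.5, the average of X_{n/2} and X_{n/2+1}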

Dispersion

Variance : $$\sigma^2(X) = \frac{1}{n} \sum_{i=1}^n (x_i-\mu)^2$$ Standard Deviation : $$\sigma(X) :=\sqrt{\sigma^2} $$

Symmetry

Skewness : $$\gamma_1(X) = \frac{1}{n}\sum_{i=1}^n (\frac{x_i-\mu}{\sigma})^3$$

Shape

Kurtosis : $$\gamma_2(X) = \frac{1}{n}\sum_{i=1}^n (\frac{x_i-\mu}{\sigma})^4$$
For a standard normal distribution,
  1. The mean is 0
  2. The variance is 1
  3. The skewness is 0
  4. The kurtosis is 3
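
These four facts can be checked empirically on a large normal sample. One caveat: scipy.stats.kurtosis returns excess kurtosis (kurtosis minus 3) by default, so fisher=False is needed to recover the value 3:

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)

print(z.mean())                   # ≈ 0
print(z.var())                    # ≈ 1
print(skew(z))                    # ≈ 0
print(kurtosis(z, fisher=False))  # ≈ 3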

Transformations to Normal

A dataset is said to be right-skewed if its skewness is positive, and left-skewed if its skewness is negative.
Box-Cox Transformation
The Box-Cox Transformation is defined as $$ Y = \begin{cases} \frac{X^\lambda-1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(X) & \text{if } \lambda = 0 \end{cases} $$
  1. where $\lambda$ is a parameter, and $X$ is positive.
  2. $\lambda$ is chosen to maximize the log-likelihood function.
  3. If $X$ is not positive, then a shift $$X\mapsto X-\min(X)+1$$ is needed first.
import numpy as np
from scipy.stats import boxcox
import matplotlib.pyplot as plt

# Sample data (positive values only)
data = np.random.exponential(scale=0.1, size=1000)

# Apply Box-Cox transformation
transformed_data, lambda_opt = boxcox(data)

print(f"Optimal lambda: {lambda_opt}")

# Plot original vs transformed data
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].hist(data, bins=30, color='skyblue', edgecolor='black')
ax[0].set_title("Original Data")
ax[1].hist(transformed_data, bins=30, color='lightgreen', edgecolor='black')
ax[1].set_title("Box-Cox Transformed Data")
plt.show()

Normal Approximation

$$\begin{cases} \mathtt{P}(-1< Z < 1) \simeq 0.68\\ \mathtt{P}(-2 < Z < 2) \simeq 0.95\\ \mathtt{P}(-3 < Z < 3) \simeq 0.997 \end{cases}$$ The probability of getting a value between $-1$ and $1$ is roughly two thirds,
and it is very unlikely to be outside of $(-2, 2)$.
Any computation for $ \mathscr{N}(\mu,\sigma^2)$ can be converted to a computation for $ \mathscr{N}(0,1)$ by $$Z = \frac{X-\mu}{\sigma}$$ Results can be converted back to $ \mathscr{N}(\mu,\sigma^2)$ by $$X = Z\sigma + \mu$$
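
Both the three approximations and the standardization can be reproduced with scipy.stats.norm (the $\mathscr{N}(100, 15^2)$ example below is hypothetical):

from scipy.stats import norm

# The 68-95-99.7 rule
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))  # 0.6827, 0.9545, 0.9973

# P(X < 110) for X ~ N(100, 15^2), computed via Z = (X - mu) / sigma
mu, sigma = 100, 15
z = (110 - mu) / sigma
print(norm.cdf(z))  # equals norm.cdf(110, loc=mu, scale=sigma)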

Covariance & Correlation

In the book, the correlation coefficient is defined by the following formula $$\rho(X,Y)=\E\left[ \frac{X-\mu_X}{\sigma_X}\, \frac{Y-\mu_Y}{\sigma_Y}\right] = \frac{1}{n} \sum_{i=1}^{n}\left(\frac{X_{i}-\mu_X}{\sigma_{X}}\right)\left(\frac{Y_{i}-\mu_Y}{\sigma_{Y}}\right) $$ A different but equivalent way to introduce the correlation coefficient is to first define the covariance and treat correlation as a normalized version of it.
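
Both routes give the same number. A sketch with hypothetical data, computing the normalized covariance by hand and comparing it to numpy's built-in:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)  # y positively correlated with x

cov = np.cov(x, y, ddof=0)[0, 1]     # covariance with the 1/n convention
rho = cov / (x.std() * y.std())      # normalize by both standard deviations
print(rho, np.corrcoef(x, y)[0, 1])  # identical up to floating point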

Regression

Regression Line

The regression line is the line passing through $(\mu_X,\mu_Y)$ with slope $$\frac{\rho \cdot \sigma_Y}{\sigma_X}$$ Since this is a 100-level introductory class for students from all majors, we will not go into the technical details. If interested, see the page on Linear Regression for the general case.

RMS Error for Regression

Root Mean Square Error (RMS) is a measure of the error between the predicted values and the actual values. $$\mathtt{RMS} = \sqrt{\frac{\sum_{i=1}^n(\hat{y}_i-y_i)^2}{n}}$$ Among all lines, the one that makes the smallest RMS error in predicting $y$ from $x$ is the regression line.
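
A sketch of both facts with hypothetical data: the line built from the five summary statistics ($\mu_X,\mu_Y,\sigma_X,\sigma_Y,\rho$) attains the same RMS error as numpy's least-squares fit:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3 + 0.5 * x + rng.normal(scale=0.5, size=200)

rho = np.corrcoef(x, y)[0, 1]
slope = rho * y.std() / x.std()          # slope = rho * sigma_Y / sigma_X
intercept = y.mean() - slope * x.mean()  # line passes through (mu_X, mu_Y)

rms = lambda yhat: np.sqrt(np.mean((yhat - y) ** 2))
print(rms(intercept + slope * x))               # RMS error of the regression line
print(rms(np.polyval(np.polyfit(x, y, 1), x)))  # least-squares line: same value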

Hypothesis Testing

Normal & Student-T distribution

Given a set of points $$x_1,\cdots,x_n$$ drawn from the normal distribution $\ms{N}(\mu,\sigma^2)$, we have $$\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim\ms{N}(0,1) $$ and $$\frac{\bar{X}-\mu}{S/\sqrt{n}}\sim\mathtt{T}(n-1)$$ where $S^{2}=\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} $ is the sample variance. The second result is especially useful: in practice $\sigma$ is often unknown and must be estimated by the sample standard deviation $S$.

When $n>30$, the Student-T distribution is very close to the standard normal. Thus in practice, if $n>30$, the $Z$-test is used regardless of whether $\sigma$ is known.

t-test : One sample

Given a set of points, $$x_1,\cdots,x_n$$ drawn from the normal distribution $\ms{N}(\mu,\sigma^2)$, we want to test the hypothesis $$H_0: \mu=\mu_0\qquad \color{lightgray} H_0: \mu\le \mu_0$$ $$H_1: \mu\neq \mu_0 \qquad \color{lightgray} H_1: \mu> \mu_0$$ The test statistic is $$T=\frac{\bar{X}-\mu_0}{S/\sqrt{n}}$$ If $H_0$ is true, then $T\sim \mathtt{T}(n-1)$

The p-value is the probability of getting a value more extreme than the observed statistic $t$ $$ \text{p} = \P{|T|> |t|} = 2(1-\P{T\le |t|})\qquad \color{lightgray} \text{p} = \P{T> t} = 1-\P{T\le t} $$
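
scipy implements the one-sample t-test directly; a minimal sketch with hypothetical data and $\mu_0=0$, checked against the formulas above:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=25)  # hypothetical sample, true mean 0.3

# Two-sided test of H0: mu = 0
t, p = stats.ttest_1samp(x, popmean=0.0)
print(t, p)

# By hand: T = (xbar - mu0) / (S / sqrt(n)), p = 2(1 - P(T <= |t|))
T = x.mean() / (x.std(ddof=1) / np.sqrt(x.size))
print(T, 2 * (1 - stats.t.cdf(abs(T), df=x.size - 1)))  # same two numbers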

Bootstrap Method for Hypothesis Testing

Given a set of points, $$x_1,\cdots,x_n$$ we want to test the hypothesis $$H_0: T=T_0\qquad \color{lightgray} H_0: T\le T_0$$ $$H_1: T\neq T_0 \qquad \color{lightgray} H_1: T> T_0$$
  1. Shift the data so that the null hypothesis is true, e.g. for $T=\mu$ and $\mu_0=0$, subtract $\bar{x}$ from every point to get $$x_1',\cdots,x_n'$$ after this step, the null hypothesis is true, i.e. $\bar{x}'=0$
  2. Resample with replacement from the shifted data to get a new set of points, $$x_1^*,\cdots,x_n^*$$ repeat $B$ times.
  3. Calculate the test statistic for each of the $B$ sets of points, $$T_1,\cdots,T_B$$
  4. Use the $T_1,\cdots,T_B$ to create a sampling distribution for $T$.
    This is the distribution of $T$ under the assumption that $H_0$ is true.
  5. Compare the observed value of $T$ with the sampling distribution of $T$ to get the p-value.

    e.g. for $H_1: T> T_0$, the p-value is $$\text{p} = \frac{\#\{i\ :\ T_i > T_{\text{obs}}\}}{B}$$ where $T_{\text{obs}}$ is the statistic computed from the original data, as sketched in the code below.
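
A minimal sketch of these five steps for $T=\mu$ with $\mu_0=0$ and a one-sided alternative, using hypothetical data:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.4, size=30)  # hypothetical sample
B = 10_000

T_obs = x.mean()          # observed statistic
x_shifted = x - x.mean()  # step 1: shift so that H0 (mu = 0) is true

# Steps 2-4: resample with replacement, build the null distribution of T
T_boot = np.array([rng.choice(x_shifted, size=x.size, replace=True).mean()
                   for _ in range(B)])

# Step 5: one-sided p-value, the fraction of resamples at least as extreme
print((T_boot >= T_obs).mean())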

Normality

The Shapiro-Wilk test is a test of normality. The null hypothesis is that the data is normally distributed.
The test statistic is $$W=\frac{\left(\sum_{i=1}^{n} a_{i} x_{(i)}\right)^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}$$ where $x_{(i)}$ is the $i$-th smallest value of $x$, and the coefficients $a_i$ are derived from the expected order statistics of a standard normal sample.
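
In practice the coefficients $a_i$ are not computed by hand; scipy.stats.shapiro returns $W$ together with a p-value:

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
W, p = shapiro(rng.standard_normal(200))   # normal data: W near 1, large p
print(W, p)
W, p = shapiro(rng.exponential(size=200))  # skewed data: small p, reject normality
print(W, p)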

Random Processes

Wiener process

$\left(W_t,\ t\in \R_+\right)$ is called a Wiener process if
  1. $W_0=0$
  2. it has independent increments: for $s<t$, $W_t-W_s$ is independent of $(W_u,\ u\le s)$
  3. the increments are Gaussian: $W_t-W_s\sim \mathscr{N}(0,\ t-s)$ for $s<t$
  4. $t\mapsto W_t$ is continuous almost surely
A Wiener process is continuous everywhere but nowhere differentiable.
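
Properties 1–3 give a standard simulation scheme on a time grid; a minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 1000
dt = T / n

# Increments W_{t+dt} - W_t ~ N(0, dt), independent of the past
dW = rng.normal(scale=np.sqrt(dt), size=n)
W = np.concatenate([[0.0], np.cumsum(dW)])  # W_0 = 0
print(W[-1])  # a single draw of W_T ~ N(0, T)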

Martingale

A martingale is a stochastic process $X_t$ such that $$\E[X_t\mid \mathcal{F}_s] = X_s$$ for all $t>s$, where $\mathcal{F}_s$ denotes the information available up to time $s$. A Wiener process is a martingale.

Stock Price

$$\color{royalblue}\bf dS_t = \mu S_t dt + \underbrace{\overbrace{\sigma}^{fluctuation} S_t dW_t}_{uncertainty}, \qquad S_0>0 \implies \color{blue} S_t = S_0 + \int_0^t \mu S_u\, du + \underbrace{\int_0^t \sigma S_u\, dW_u}_{\text{Itô integral}} \implies \color{darkblue}\bf S_t = S_0 e^{(\mu-\frac{1}{2}\sigma^2)t + \sigma W_t} \sim \mathrm{Lognormal}(\ln S_0+(\mu - \tfrac{1}{2}\sigma^2) t,\ \sigma^2 t) $$ If $B_t$ is the value of a savings account at time $t$, then $$B_t = B_0 \exp(\mu t), \qquad dB_t = \mu B_t dt$$ where $\mu$ is the interest rate.

Discounted Stock Price

$$\tilde{S}_t = S_t/B_t = \exp(-\mu t)S_t$$
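
The closed-form solution yields an exact simulation scheme for $S_t$, and discounting is a single multiplication. A sketch with illustrative parameters:

import numpy as np

rng = np.random.default_rng(0)
S0, mu, sigma, T, n = 100.0, 0.05, 0.2, 1.0, 252  # hypothetical parameters
t = np.linspace(0, T, n + 1)

# Exact solution: S_t = S_0 exp((mu - sigma^2/2) t + sigma W_t)
W = np.concatenate([[0.0], np.cumsum(rng.normal(scale=np.sqrt(T / n), size=n))])
S = S0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * W)

S_disc = np.exp(-mu * t) * S  # discounted price: S_t / B_t
print(S[-1], S_disc[-1])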

Bayesian Statistics

Bayes' Theorem

Bayes' Theorem is the following formula $$ \begin{align*} \underbrace{\P{A|D}}_{\text{Posterior}} = \frac{\overbrace{\P{D|A}}^{\text{Likelihood}}\overbrace{\P{A}}^{\text{Prior}}}{\underbrace{\P{D}}_{\text{Evidence}}} &\propto \P{D|A}\P{A} \qquad \small \color{blue} D=\{d_1,\cdots,d_n\} \quad (n\text{ data points}) \\& = \P{d_1|A}\P{d_2|A,d_1}\cdots\P{d_n|A,d_1,\cdots,d_{n-1}}\P{A} \\ {\small \color{red} (\text{naive independence assumption})} &= \P{d_1|A}\P{d_2|A}\cdots\P{d_n|A}\P{A} \end{align*}$$
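
A worked numerical example (a hypothetical coin-bias problem): compute the posterior over a grid of biases, multiplying in one likelihood factor per data point exactly as in the factorization above.

import numpy as np

theta = np.linspace(0.01, 0.99, 99)       # candidate values of A = coin bias
prior = np.ones_like(theta) / theta.size  # flat prior P(A)

data = [1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical flips (1 = heads)

# One factor P(d_i | A) per flip, assuming conditional independence given A
posterior = prior.copy()
for d in data:
    posterior *= theta if d == 1 else (1 - theta)
posterior /= posterior.sum()  # normalize by the evidence P(D)

print(theta[np.argmax(posterior)])  # posterior mode ≈ 6/8 = 0.75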

Applications to Finance

Portfolio Theory

CAPM

Let $R_p, \sigma_p$ be the return and volatility of a portfolio, $R_m, \sigma_m$ the return and volatility of the market, and $R_f$ the risk-free rate. The Capital Asset Pricing Model (CAPM) is the following linear regression model: $$ (R_p-R_f)=\alpha_p+\beta_{p}(R_m-R_f) $$

The beta of portfolio $p$ with respect to the market is given by $$\beta_{p}=\frac{\mathrm{Cov}(R_p,R_m)}{\sigma_m^2}= \frac{\rho_{p,m}\sigma_p}{\sigma_m}$$
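
Estimating $\beta_p$ (and $\alpha_p$) from excess-return series is an ordinary least-squares fit; a sketch with simulated, hypothetical returns:

import numpy as np

rng = np.random.default_rng(0)
rm = rng.normal(0.006, 0.04, size=120)            # market excess returns
rp = 0.001 + 1.3 * rm + rng.normal(0, 0.02, 120)  # portfolio, true beta 1.3

beta = np.cov(rp, rm, ddof=0)[0, 1] / rm.var()  # Cov(R_p, R_m) / sigma_m^2
alpha = rp.mean() - beta * rm.mean()
print(alpha, beta)  # ≈ 0.001 and ≈ 1.3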

Sharpe Ratio

Let $R_a, \sigma_a$ be the return and volatility of an asset, and $R_f$ be the mean return of the risk-free asset. Then the Sharpe ratio of the asset is defined as $$S_a=\frac{\E[R_a-R_f]}{\sigma_a}$$ The Sharpe ratio can be viewed as a standardized measure of expected return. Two common variants replace the denominator or the benchmark: $$ \text{Treynor ratio} = \frac{\E[R_a-R_f]}{\beta_a} \qquad \text{Generalized Sharpe ratio} = \frac{\E[R_a-R_b]}{\sigma_a} $$ where $R_b$ is the return of a benchmark asset.
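
Given a return series and a per-period risk-free rate (both hypothetical below), the computation is a one-liner:

import numpy as np

r = np.array([0.02, -0.01, 0.03, 0.015, -0.005, 0.01])  # hypothetical returns
rf = 0.002                                              # risk-free rate, same period

excess = r - rf
print(excess.mean() / excess.std(ddof=1))  # Sharpe ratio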

Efficient Frontier

The Efficient Frontier is the collection of risk-return pairs $$ \{(\sigma_P,\E R_P) \mid \nexists P'\ :\ \E R_{P'} = \E R_P \wedge \sigma_{P'}<\sigma_P \}$$
Let $P$ be a risky portfolio, and $R_f$ be the return of the risk-free asset.
Let $C$ be a combination of $P$ and the risk-free asset.

The collection of all risk-return pairs $$ (\sigma_C, \E(R_C) )$$ over all possible combinations $C$ gives the Capital Allocation Line (CAL).
For a given risky portfolio $P$, the CAL is given by the line $$\E(R_C)=R_f+\sigma_C S_P$$ Let $CAL_T$ be the CAL tangent to the efficient frontier.
Let $P$ be the corresponding portfolio.
Then $P$ is the Tangency portfolio, and $P$ has the highest Sharpe ratio among all portfolios.
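
A Monte Carlo sketch of the frontier and the tangency portfolio, assuming hypothetical expected returns and covariances for three assets: sample random long-only weights and keep the portfolio with the highest Sharpe ratio.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.06, 0.10, 0.14])     # hypothetical expected returns
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.16]])  # hypothetical covariance matrix
rf = 0.02

w = rng.dirichlet(np.ones(3), size=50_000)  # random long-only weight vectors
ret = w @ mu
vol = np.sqrt(np.einsum('ij,jk,ik->i', w, cov, w))
sharpe = (ret - rf) / vol

best = sharpe.argmax()  # approximates the tangency portfolio
print(w[best], ret[best], vol[best], sharpe[best])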

Risks

VaR and CVaR

For a portfolio loss $L$, the value at risk at confidence level $\alpha$, $\VaR_{\alpha}(L)$, is:

$$\text{VaR}_\alpha = \inf\{l \in \mathbb{R} : \mathbb{P}(L > l) \leq 1 - \alpha\}$$

$\CVaR$ (Conditional Value at Risk / Expected Shortfall):

$$\text{CVaR}_\alpha = \mathbb{E}[L \mid L \geq \text{VaR}_\alpha]$$
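
Historical (empirical) estimates follow directly from the two definitions; a sketch at $\alpha=0.95$ with simulated heavy-tailed losses:

import numpy as np

rng = np.random.default_rng(0)
L = rng.standard_t(df=4, size=100_000)  # hypothetical heavy-tailed losses
alpha = 0.95

var = np.quantile(L, alpha)  # empirical alpha-quantile of the loss
cvar = L[L >= var].mean()    # average loss beyond VaR
print(var, cvar)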

Partial Moments

Upside and Downside return and volatility

$$ \mu^+=\E[X\mid X\geq \tau] \qquad \mu^-=\E[X\mid X\leq \tau] $$ where $\tau$ is a threshold (e.g. $0$ or the risk-free rate). $$\sigma_+(X,Y) = \E [ \max(X-\mu_X,0) \max(Y-\mu_Y,0) ] $$ $$\sigma_-(X,Y) = \E [ \min(X-\mu_X,0) \min(Y-\mu_Y,0) ] $$ $$\sigma^2_+(X) = \E [ \max(X-\mu_X,0)^2 ]\qquad \sigma^2_-(X) = \E [ \min(X-\mu_X,0)^2 ] $$ $$\rho_+(X,Y) = \frac{\sigma_+(X,Y)}{\sigma_+(X)\sigma_+(Y)}\qquad \rho_-(X,Y) = \frac{\sigma_-(X,Y)}{\sigma_-(X)\sigma_-(Y)} $$ Downside mean and standard deviation are measures of "risks".

Upside mean and standard deviation are measures of "rewards".

Upside and Downside Beta

$$ \beta_+ = \frac{\sigma_+(Y)}{\sigma_+(X)}\rho_+(X,Y) $$ $$ \beta_- = \frac{\sigma_-(Y)}{\sigma_-(X)}\rho_-(X,Y) $$
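
A sketch computing both betas from two hypothetical return series, with $X$ as the market and $Y$ as the asset:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.005, 0.04, size=500)        # market returns (hypothetical)
y = 0.8 * x + rng.normal(0, 0.02, size=500)  # asset returns

xm, ym = x - x.mean(), y - y.mean()

def beta(clip):  # clip = np.maximum for beta_+, np.minimum for beta_-
    cxy = (clip(xm, 0) * clip(ym, 0)).mean()  # sigma_+/-(X, Y)
    sx = np.sqrt((clip(xm, 0) ** 2).mean())   # sigma_+/-(X)
    sy = np.sqrt((clip(ym, 0) ** 2).mean())   # sigma_+/-(Y)
    return (sy / sx) * (cxy / (sx * sy))      # (sigma(Y)/sigma(X)) * rho

print(beta(np.maximum), beta(np.minimum))  # beta_+ and beta_-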