This page started as notes for my students in Elementary Statistics, Spring 2022.
I'm planning to update this page to serve as a general probability & statistics reference.
Probability
Probability Space
A Probability Space is a triple $(\Omega, \mathcal{F}, P)$ where
$\Omega$ is a sample space (a set of outcomes)
$\mathcal{F}$ is a sigma-algebra (a collection of subsets of $\Omega$)
$P$ is a probability measure (a function from $\mathcal{F}$ to $[0,1]$)
For events $E_1, E_2\in \mc{F}$, $E_1$ and $E_2$ are independent if
$$P(E_1\cap E_2)=P(E_1)P(E_2)$$
The conditional probability $P(E_1|E_2)$ is defined as
$$P(E_1|E_2)= \frac{P(E_1\cap E_2)}{P(E_2)}$$
assuming $E_2$ has non-zero probability.
If $$|\Omega|<\infty ,\quad \mathcal{F}=2^{\Omega}, \quad P(E)=\frac{|E|}{|\Omega|}$$ then $(\Omega,
\mathcal{F}, P)$ is a probability space.
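As a small illustration of the finite uniform space above, the following sketch builds $\Omega$ for two fair dice and evaluates $P(E)=|E|/|\Omega|$, independence, and conditional probability directly from the definitions. The dice example and the helper names `prob` and `cond_prob` are illustrative choices, not part of the notes.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of two fair six-sided dice (illustrative choice).
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    """P(E) = |E| / |Omega| for an event E given as a set of outcomes."""
    return Fraction(len(event & omega), len(omega))

def cond_prob(e1, e2):
    """P(E1 | E2) = P(E1 and E2) / P(E2), assuming P(E2) > 0."""
    return prob(e1 & e2) / prob(e2)

first_is_even = {w for w in omega if w[0] % 2 == 0}
sum_is_seven = {w for w in omega if w[0] + w[1] == 7}

# Independence check: P(E1 and E2) == P(E1) * P(E2).
p_joint = prob(first_is_even & sum_is_seven)
independent = p_joint == prob(first_is_even) * prob(sum_is_seven)
print(prob(first_is_even), prob(sum_is_seven), p_joint, independent)
print(cond_prob(sum_is_seven, first_is_even))
```

Exact fractions are used so that the independence check is an equality test rather than a floating-point comparison.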
Let $$\Omega=\{x_1x_2\cdots x_n :x_i\in \{0,1\} \},\quad\ \mc{F}=2^{\Omega}$$
$$ P(x_1x_2\cdots x_n)=p^j(1-p)^{n-j},\quad j = |\{i:x_i=1 \}|$$
then $(\Omega, \mathcal{F}, P)$ is a probability space.
Random Variables
A Random Variable is a measurable function
$$X: \Omega \to (R,\mc{R}) $$
where $(R,\mc{R})$ is a measurable space.
The probability that $X$ takes on a value in a measurable set $E\in \mc{R}$ is written as
$$
\P{E}= P(X^{-1} E)
$$
If $R$ is countable, then $X$ is a discrete random variable and
$$\P{E}=\sum_{x\in E} \P{X=x}$$
If $R=\R$ and the distribution of $X$ has a density, then $X$ is a
continuous random variable.
The density of $X$ is a measurable function $f_X$ such that
$$\P{X \in E}=\int_{X^{-1} E} d P=\int_E f_X(x)\ d x$$
If $$\Omega=\{1,2,\cdots,n\}^2,\quad\P{E}=\dfrac{|E|}{n^2} $$ $$
X(\omega)=\sum_{i=1}^2 \omega_i$$ then $X$ is a random variable.
$$\P{X\in (-\pi,\pi)} = \frac{|\{(1,1),(1,2),(2,1) \}|}{n^2}=\dfrac{3}{n^2}$$
Let $$\Omega=\{x_1x_2\cdots x_n :x_i\in \{0,1\} \},\quad\ \mc{F}=2^{\Omega}$$
$$ P(x_1x_2\cdots x_n)=p^j(1-p)^{n-j},\quad j = |\{i:x_i=1 \}|$$
$$
X(\omega)=\sum_{i=1}^n \omega_i$$ then $X$ is a random variable, known as a binomial random variable, with
$$\P{X=k}=\left(\begin{array}{l}
n \\
k
\end{array}\right) p^k (1-p)^{n-k}, \qquad k\in\{0,1,\cdots,n\}$$
The probability density function
$$f(x)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad x\in \R$$
defines the Gaussian Random Variable with mean $\mu$ and variance $\sigma^2$.
Moments
The set of all random variables
$$
\{X :\Omega \to \R \}
$$ is a vector space under pointwise addition and scalar multiplication.
We can always enlarge $\Omega$ to accommodate new random variables.
Thus $\Omega$ is often omitted when we talk about random variables.
Denote
$$
\ms{L}_2= \{ X:\Omega \to \R \mid \E[X^2]< \infty \} $$ then $\ms{L}_2$ is a Hilbert Space with the inner
product
$$
\langle X,Y \rangle = \E[XY]:= \iint_{\mathbb{R}^2} xy\ f_{X,Y}(x,y)\ d x d y
$$
$$
\norm{X}^2=\langle X,X \rangle
$$
This also allows us to view probability problems geometrically.
The $k$-th moment of $X$ is defined as
$$\mu_k=\E[X^k]:= \int_\Omega X(\omega)^k P(d\omega)=\int_\R x^k \P{dx}= \int_{\mathbb{R}} x^k f_X d x$$
$\mu_1$ is the mean of $X$, also denoted $\mu_X$.
The variance of $X$ is the second central moment $\E[(X-\mu_X)^2]=\mu_2-\mu_1^2$, also denoted $\sigma_X^2$.
$\sigma_X$ is the standard deviation of $X$.
If $X\sim \operatorname{Bin}(n,p)$, then $$\mb{E}(X)=np$$
$$\sigma^2(X)=npq, \qquad q=1-p$$
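The binomial mean and variance can be checked numerically from the probability mass function. The sketch below is one way to do it, with illustrative parameters $n=10$, $p=3/10$ and a hypothetical helper `binom_pmf`; exact fractions avoid rounding error.

```python
from fractions import Fraction
from math import comb

def binom_pmf(n, p):
    """Return [P(X=0), ..., P(X=n)] for X ~ Bin(n, p)."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

n, p = 10, Fraction(3, 10)  # illustrative parameters
pmf = binom_pmf(n, p)

# First and second moments computed directly from the pmf.
mean = sum(k * pk for k, pk in enumerate(pmf))
second_moment = sum(k**2 * pk for k, pk in enumerate(pmf))
variance = second_moment - mean**2  # second central moment

print(sum(pmf), mean, variance)
```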
Let $X:\Omega\to \R^{\ge 0}$, then
$$\E[X]= \int_0^\infty \P{(X\ge x)}\ d x$$
More generally, for any $n\in \N$, $X:\Omega\to \R$
$$\E[|X|^n] = \int_0^\infty \P{(|X|^n\ge x)}\ d x$$
$$
\begin{aligned}
\E[X]&= \int_\Omega X \ dP \\
&= \int_\Omega\left(\int_0^\infty 1[X\ge y] \ dy\right) \ dP\\
&= \int_0^\infty \int_\Omega 1[X\ge y] \ d P \ dy\\
&= \int_0^\infty \P{(X\ge y)}\ d y
\end{aligned}
$$
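For a non-negative integer-valued $X$, the tail formula reduces to a sum, because $x\mapsto$ probability of $X\ge x$ is constant on each interval $(k-1,k]$. The following sketch checks this for a fair die; the die and the helper `tail` are illustrative assumptions.

```python
from fractions import Fraction

# Illustrative example: X is the outcome of a fair six-sided die.
values = range(1, 7)
pmf = {x: Fraction(1, 6) for x in values}

# Direct definition of the mean.
mean_direct = sum(x * p for x, p in pmf.items())

def tail(k):
    """P(X >= k)."""
    return sum(p for x, p in pmf.items() if x >= k)

# The integral of P(X >= x) over [0, inf) equals the sum of P(X >= k), k = 1, 2, ...
mean_tail = sum(tail(k) for k in range(1, max(values) + 1))
print(mean_direct, mean_tail)
```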
Joint Distribution
Let $X=(x_1,x_2)$ where $x_1, x_2$ are two real-valued random variables,
then the joint distribution of $X$ is characterized by
$$
\P[X]{ [a,b]\times [c,d]} = \P{x_1\in [a,b], x_2\in [c,d]}
$$
The expectation of $f(X)$ is defined as
$$
\E[f(X)]=\int_{\R^2} f(x_1,x_2) \P[X]{dx_1 dx_2}
$$
The covariance of $X$ and $Y$ is defined as
$$\mathtt{Cov}(X, Y)=\mb{E}[(X-\mu_X)(Y-\mu_Y)]=\Inn{X-\mu_X}{Y-\mu_Y}$$
The correlation of $X$ and $Y$ is defined as
$$\rho(X,Y) =\frac{\mathtt{Cov}(X, Y)}{\norm{X-\mu_X}\norm{Y-\mu_Y}} \color{blue}= \cos( \angle(X-\mu_X,Y-\mu_Y))$$
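The geometric reading of correlation, as the cosine of the angle between the centered variables, can be checked on a sample by replacing expectations with sample averages. This is a sketch on synthetic data; the coefficients 0.6 and 0.8 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.6 * x + 0.8 * rng.normal(size=1000)  # synthetic, illustrative data

# Center the samples; the empirical inner product is <U, V> = mean(U * V).
xc, yc = x - x.mean(), y - y.mean()
cov = np.mean(xc * yc)
cos_angle = cov / (np.sqrt(np.mean(xc**2)) * np.sqrt(np.mean(yc**2)))

# np.corrcoef computes the same quantity.
print(cos_angle, np.corrcoef(x, y)[0, 1])
```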
Statistics 1
Moments
Central tendency
Mean :
$$\mu(X) = \frac{1}{n} \sum_{i=1}^n x_i$$
Median : assume the data is sorted
$$\mathtt{Med}(X)= \begin{cases}X_{\lceil \frac{n}{2}\rceil} & \text { if } n \text { is odd } \\
\frac{X_{\frac{n}{2}}+X_{\frac{n}{2}+1}}{2} & \text { if } n \text { is even }\end{cases}$$
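A minimal sketch of the median rule, assuming the usual convention that an even-sized sample uses the average of its two central values. The function name `median` is chosen here for illustration, and the result is compared with the standard library.

```python
import statistics

def median(data):
    """Median of a non-empty sequence; sorts a copy first."""
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                   # middle value (1-based index ceil(n/2))
    return (xs[mid - 1] + xs[mid]) / 2   # average of the two central values

print(median([3, 1, 2]), median([4, 1, 3, 2]))
print(statistics.median([4, 1, 3, 2]))   # library reference for comparison
```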
Box-Cox Transformation
import numpy as np
from scipy.stats import boxcox
import matplotlib.pyplot as plt
# Sample data (positive values only)
data = np.random.exponential(scale=0.1, size=1000)
# Apply Box-Cox transformation
transformed_data, lambda_opt = boxcox(data)
print(f"Optimal lambda: {lambda_opt}")
# Plot original vs transformed data
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].hist(data, bins=30, color='skyblue', edgecolor='black')
ax[0].set_title("Original Data")
ax[1].hist(transformed_data, bins=30, color='lightgreen', edgecolor='black')
ax[1].set_title("Box-Cox Transformed Data")
plt.show()
Normal Approximation
$$\begin{cases} \mathtt{P}(-1< Z < 1) \simeq 0.68\\ \mathtt{P}(-2 < Z < 2) \simeq 0.95\\ \mathtt{P}(-3 < Z < 3) \simeq 0.997 \end{cases}$$
The probability of getting a value between $-1$ and $1$ is about two thirds, and only about $5\%$ of values fall outside of $(-2, 2)$.
Any computation for $ \mathscr{N}(\mu,\sigma)$ can be converted to a computation for $ \mathscr{N}(0,1)$ by
$$Z = \frac{X-\mu}{\sigma}$$
Results can be converted back to $ \mathscr{N}(\mu,\sigma)$ by
$$X = Z\sigma + \mu$$
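The standardization step can be sketched with the standard library's `statistics.NormalDist`. The helper `normal_interval_prob` and the height numbers (mean 170, standard deviation 10) are made up for illustration.

```python
from statistics import NormalDist

def normal_interval_prob(a, b, mu, sigma):
    """P(a < X < b) for X ~ N(mu, sigma), computed via z-scores."""
    std = NormalDist(0.0, 1.0)
    z_a, z_b = (a - mu) / sigma, (b - mu) / sigma
    return std.cdf(z_b) - std.cdf(z_a)

# Illustrative numbers: the interval is one standard deviation either side of the mean.
p = normal_interval_prob(160, 180, mu=170, sigma=10)
print(round(p, 4))

# Same probability computed without standardizing, for comparison.
direct = NormalDist(170, 10).cdf(180) - NormalDist(170, 10).cdf(160)
print(round(direct, 4))
```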
Covariance & Correlation
In the book, the correlation coefficient is defined by the following formula
$$\rho(X,Y)=\mb{E}\{ \frac{X-\mu_X}{\sigma_X} \frac{Y-\mu_Y}{\sigma_Y}\} = \frac{1}{n}
\sum_{i=1}^{n}\left(\frac{X_{i}-\mu_X}{\sigma_{X}}\right)\left(\frac{Y_{i}-\mu_Y}{\sigma_{Y}}\right) $$
A different, but equivalent, way to introduce the correlation coefficient, is to first define
covariance and consider correlation as a normalized version of it.
Regression
Regression Line
The regression line is the line passing through $(\mu_X,\mu_Y)$ with slope
$$\frac{\rho \cdot \sigma_Y}{\sigma_X}$$
Since this is a 100-level introductory class for students from all majors, we will not go into the technical details.
If interested, see the page on Linear Regression for the general case.
RMS Error for Regression
Root Mean Square (RMS) Error is a measure of the error between the predicted value and the actual value.
$$\mathtt{RMS} = \sqrt{\frac{\sum_{i=1}^n(\hat{y}_i-y_i)^2}{n}}$$
Among all lines, the one that makes the smallest RMS error in predicting $y$ from $x$ is the regression line.
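The two statements about the regression line (slope $\rho\,\sigma_Y/\sigma_X$ through the point of averages, and minimal RMS error) can be checked numerically. This is a sketch on synthetic data; the helper `rms_error` and the perturbation sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10, 2, size=200)
y = 3.0 + 1.5 * x + rng.normal(0, 1, size=200)  # synthetic data

# Slope rho * sigma_Y / sigma_X, line through the point of averages.
rho = np.corrcoef(x, y)[0, 1]
slope = rho * y.std() / x.std()
intercept = y.mean() - slope * x.mean()

def rms_error(a, b):
    """RMS error of the line y = a + b * x on the sample."""
    return np.sqrt(np.mean((a + b * x - y) ** 2))

best = rms_error(intercept, slope)
# Perturbing the line should not reduce the RMS error.
worse = min(rms_error(intercept + 0.1, slope), rms_error(intercept, slope + 0.05))

# Least-squares fit for comparison; polyfit returns (slope, intercept) for degree 1.
ls_slope, ls_intercept = np.polyfit(x, y, 1)
print(slope, ls_slope, best, worse)
```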
Hypothesis Testing
Normal & Student-T distribution
Given a set of points, $$x_1,\cdots,x_n$$ drawn from the normal distribution
$\ms{N}(\mu,\sigma^2)$,
then $$\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim\ms{N}(0,1) $$
and $$\frac{\bar{X}-\mu}{S/\sqrt{n}}\sim\mathtt{T}(n-1)$$ where $S^{2}=\frac{1}{n-1}
\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} $ is the sample variance.
The second result is especially useful, since in practice $\sigma$ is often unknown, and we have to estimate it with the sample standard deviation, $S$.
As can be seen from the demo above, when $n>30$, the Student-T distribution is very close to the Standard Normal. Thus in practice, if $n>30$ then we use the $Z$-test regardless of whether $\sigma$ is known.
t-test : One sample
Given a set of points, $$x_1,\cdots,x_n$$ drawn from the normal distribution
$\ms{N}(\mu,\sigma^2)$,
we want to test the hypothesis
$$H_0: \mu=\mu_0\qquad \color{lightgray} H_0: \mu\le \mu_0$$
$$H_1: \mu\neq \mu_0 \qquad \color{lightgray} H_1: \mu> \mu_0$$
The test statistic is
$$T=\frac{\bar{X}-\mu_0}{S/\sqrt{n}}$$
If $H_0$ is true, then $T\sim \mathtt{T}(n-1)$.
The p-value is the probability, under $H_0$, of getting a value more extreme than the observed value $t$ of $T$
$$
\text{p} = \P{|T|> |t|} = 2(1-\P{T\le |t|})\qquad \color{lightgray} \text{p} = \P{T> t} = 1-\P{T\le t}
$$
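The test statistic and both p-values can be computed by hand and compared with `scipy.stats.ttest_1samp`. This is a sketch on a synthetic sample; the sample size, true mean, and $\mu_0=0$ are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=0.3, scale=1.0, size=25)  # synthetic sample
mu0 = 0.0

n = len(x)
# T = (mean - mu0) / (S / sqrt(n)), with S the sample standard deviation (ddof=1).
t_obs = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))

# Two-sided p-value: 2 * (1 - P(T <= |t|)) with T ~ T(n-1).
p_two_sided = 2 * (1 - stats.t.cdf(abs(t_obs), df=n - 1))
# One-sided p-value for H1: mu > mu0.
p_one_sided = 1 - stats.t.cdf(t_obs, df=n - 1)

ref = stats.ttest_1samp(x, popmean=mu0)  # two-sided by default
print(t_obs, p_two_sided, p_one_sided)
print(ref.statistic, ref.pvalue)
```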
Bootstrap Method for Hypothesis Testing
Given a set of points, $$x_1,\cdots,x_n$$
we want to test the hypothesis
$$H_0: T=T_0\qquad \color{lightgray} H_0: T\le T_0$$
$$H_1: T\neq T_0 \qquad \color{lightgray} H_1: T> T_0$$
Shift the data so that the null hypothesis is true, e.g. for $T=\mu$, $\mu_0=0$ we subtract $\bar{x}$ from every point to get
$$x_1',\cdots,x_n'$$
After this step, the null hypothesis is true for the shifted data, i.e. $\bar{x}'=0$.
Resample with replacement from the shifted data to get a new set of points, $$x_1^*,\cdots,x_n^*$$ and repeat $B$ times.
Calculate the test statistic for each of the $B$ sets of points, $$T_1,\cdots,T_B$$
Use the $T_1,\cdots,T_B$ to create a sampling distribution for $T$.
This is the distribution of $T$ under the assumption that $H_0$ is true.
Compare the observed value of $T$ with the sampling distribution of $T$ to get the p-value.
e.g. for $H_1: T> T_0$, with $t$ the value of the statistic on the original data, the p-value is
$$\text{p} = \P{T> t} \approx \frac{|\{i : T_i > t\}|}{B}$$
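The steps above can be sketched for the mean as follows. The function name `bootstrap_mean_test`, the default $B=5000$, and the synthetic sample are assumptions made for illustration. The p-value is the fraction of bootstrap statistics, computed under the shifted data, that exceed the statistic observed on the original data.

```python
import numpy as np

def bootstrap_mean_test(x, mu0, B=5000, seed=0):
    """One-sided bootstrap p-value for H0: mu = mu0 against H1: mu > mu0."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    t_obs = x.mean()                  # statistic on the original data
    shifted = x - x.mean() + mu0      # shifted sample has mean mu0, so H0 holds
    # Draw B resamples of size n with replacement and compute each mean.
    idx = rng.integers(0, len(x), size=(B, len(x)))
    t_boot = shifted[idx].mean(axis=1)
    # Fraction of bootstrap statistics exceeding the observed statistic.
    return np.mean(t_boot > t_obs)

rng = np.random.default_rng(3)
sample = rng.normal(loc=1.0, scale=1.0, size=50)  # synthetic data
print(bootstrap_mean_test(sample, mu0=0.0))
```

Shifting before resampling is what makes the bootstrap distribution a distribution under $H_0$; resampling the unshifted data would instead approximate the sampling distribution around $\bar{x}$.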
Normality
The Shapiro-Wilk test is a test of normality.
The null hypothesis is that the data is normally distributed.
The test statistic is given by $$W=\frac{\left(\sum_{i=1}^{n} a_{i}
x_{(i)}\right)^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}$$
where $x_{(i)}$ is the $i$-th smallest value of $x$, and the $a_i$ are constants computed from the means and covariances of the order statistics of a standard normal sample.
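In practice the coefficients $a_i$ are not computed by hand. A sketch using `scipy.stats.shapiro` on two synthetic samples, one normal and one exponential (both chosen here for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
normal_sample = rng.normal(size=500)
skewed_sample = rng.exponential(size=500)

# shapiro returns the statistic W and the p-value of the test.
w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)
print(w_norm, p_norm)
print(w_skew, p_skew)
```

Values of $W$ close to 1 are consistent with normality; a small p-value is evidence against the null hypothesis.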
Random Processes
Wiener process
$\left(W_t,t\in \R_+\right)$ is called a Wiener process if
$W_0=0$
$W_t-W_s\sim \ms{N}(0,t-s)$ for all $t>s$
$W_t$ has independent increments
$W_t$ has continuous paths
The paths of a Wiener process are almost surely continuous everywhere but nowhere differentiable.
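On a time grid, the defining properties suggest a direct simulation: start at 0 and accumulate independent $\ms{N}(0,\Delta t)$ increments. The function `simulate_wiener` and the grid sizes below are illustrative choices, and a discretized path only approximates the continuous process.

```python
import numpy as np

def simulate_wiener(n_paths, n_steps, T=1.0, seed=0):
    """Simulate Wiener paths on a uniform grid of [0, T].

    Returns an array of shape (n_paths, n_steps + 1) with W_0 = 0.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    # Independent increments W_{t+dt} - W_t ~ N(0, dt).
    increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    paths = np.zeros((n_paths, n_steps + 1))
    paths[:, 1:] = np.cumsum(increments, axis=1)
    return paths

W = simulate_wiener(n_paths=20000, n_steps=100)
# Sample mean and variance of W_T; these should be close to 0 and T = 1.
print(W[:, -1].mean(), W[:, -1].var())
```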
Martingale
A martingale is a stochastic process $X_t$ with $\E|X_t|<\infty$ such that
$$\E\{X_t\mid \mc{F}_s\} = X_s$$
for all $t>s$, where $\mc{F}_s$ is the information available up to time $s$.
A Wiener process is a martingale.
Stock Price
$$\color{royalblue}\bf
dS_t = \mu S_t dt + \underbrace{\overbrace{\sigma}^{\text{fluctuation}} S_t dW_t}_{\text{uncertainty}}, \qquad S_0>0
\implies \color{blue} S_t = S_0 + \int_0^t \mu S_u du + \underbrace{\int_0^t \sigma S_u dW_u}_{\text{Ito integral}}
\implies \color{darkblue}\bf S_t = S_0 e^{(\mu-\frac{1}{2}\sigma^2)t + \sigma W_t} \sim \text{lognormal}(\ln S_0+\mu t -
\frac{1}{2}\sigma^2 t, \sigma^2 t)
$$
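The closed-form solution can be sampled directly, since $W_t\sim\ms{N}(0,t)$. The sketch below checks the lognormal parameters and the standard identity $\E[S_t]=S_0e^{\mu t}$ by Monte Carlo. The function `gbm_terminal` and the parameter values are assumptions for illustration.

```python
import numpy as np

def gbm_terminal(s0, mu, sigma, T, n, seed=0):
    """Sample S_T = S_0 exp((mu - sigma^2 / 2) T + sigma W_T), with W_T ~ N(0, T)."""
    rng = np.random.default_rng(seed)
    w_T = rng.normal(0.0, np.sqrt(T), size=n)
    return s0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * w_T)

s0, mu, sigma, T = 100.0, 0.05, 0.2, 1.0  # illustrative parameters
s_T = gbm_terminal(s0, mu, sigma, T, n=200000)

log_s = np.log(s_T)
print(log_s.mean(), np.log(s0) + (mu - 0.5 * sigma**2) * T)  # mean of ln S_T
print(log_s.var(), sigma**2 * T)                             # variance of ln S_T
print(s_T.mean(), s0 * np.exp(mu * T))                       # E[S_T] = S_0 e^{mu T}
```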
If $B_t$ is the value of a savings account at time $t$, then
$$B_t = B_0 \exp(\mu t), \qquad dB_t = \mu B_t dt$$
where $\mu$ is the interest rate.
Discounted Stock Price
$$\tilde{S}_t = \frac{S_t}{B_t} = \frac{\exp(-\mu t)S_t}{B_0}$$
Bayesian Statistics
Bayes' Theorem
Bayes' Theorem is the following formula
$$ \begin{align*}
\underbrace{\P{A|D}}_{\text{Posterior}} =
\frac{\overbrace{\P{D|A}}^{\text{Likelihood}}\overbrace{\P{A}}^{\text{Prior}}}{\underbrace{\P{D}}_{\text{Evidence}}}
&\propto \P{D|A}\P{A} \qquad \small \color{blue} D=\{d_1,\cdots,d_n\} \quad (n\text{ data points})
\\& = \P{d_1|A}\P{d_2|A,d_1}\cdots\P{d_n|A,d_1,\cdots,d_{n-1}}\P{A}
\\ {\small \color{red} (\text{Naive independence assumption})} &= \P{d_1|A}\P{d_2|A}\cdots\P{d_n|A}\P{A}
\end{align*}$$
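A small worked instance of the formula under the naive independence assumption: two hypotheses about a coin and a short sequence of flips. The hypotheses, priors, and head probabilities are made up for illustration.

```python
from fractions import Fraction

# Two hypotheses about a coin (illustrative): fair, or biased towards heads.
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
p_heads = {"fair": Fraction(1, 2), "biased": Fraction(3, 4)}

def posterior(data):
    """P(A | D) for D a string of 'H'/'T', assuming flips are independent given A."""
    unnormalized = {}
    for a in prior:
        likelihood = Fraction(1)
        for d in data:
            likelihood *= p_heads[a] if d == "H" else 1 - p_heads[a]
        unnormalized[a] = likelihood * prior[a]   # P(D|A) P(A)
    evidence = sum(unnormalized.values())         # P(D)
    return {a: u / evidence for a, u in unnormalized.items()}

post = posterior("HHT")
print(post)
```

The evidence term is just the normalizing constant, which is why the posterior is often written only up to proportionality.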
Applications to Finance
Portfolio Theory
CAPM
The Capital Asset Pricing Model (CAPM) is the following linear regression model:
$$
(R_p-R_f)=\alpha_p+\beta_{p}(R_m-R_f)
$$
$R_p$ is the return of a portfolio $p$
$R_f, R_m$ are the returns of the risk-free asset and the market portfolio, respectively
$\alpha_p$ is the alpha of the portfolio $p$.
$\beta_{p}$ is the beta of the portfolio $p$.
Let $R_p, \sigma_p$ be the return and volatility of a portfolio.
The beta of portfolio $p$ with respect to the market is given by
$$\beta_{p}=\frac{\mathrm{Cov}(R_p,R_m)}{\sigma_m^2}= \frac{\rho_{p,m}\sigma_p}{\sigma_m}$$
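The two expressions for beta, and its interpretation as a regression slope, can be compared on synthetic return series. This sketch assumes a constant risk-free rate, so that subtracting it from both series would not change the slope; all numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
r_m = rng.normal(0.01, 0.04, size=120)                   # synthetic market returns
r_p = 0.002 + 1.3 * r_m + rng.normal(0, 0.02, size=120)  # synthetic portfolio returns

# beta = Cov(R_p, R_m) / Var(R_m), using the same normalization (ddof=1) for both.
beta_cov = np.cov(r_p, r_m)[0, 1] / np.var(r_m, ddof=1)

# Equivalent form: rho * sigma_p / sigma_m.
rho = np.corrcoef(r_p, r_m)[0, 1]
beta_rho = rho * np.std(r_p, ddof=1) / np.std(r_m, ddof=1)

# Slope and intercept of the least-squares regression of R_p on R_m.
beta_ols, alpha_ols = np.polyfit(r_m, r_p, 1)
print(beta_cov, beta_rho, beta_ols)
```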
Sharpe Ratio
Let $R_a, \sigma_a$ be the return and volatility of an asset, and $R_f$ be the mean return of the risk-free asset.
Then the Sharpe ratio of the asset is defined as
$$S_a=\frac{\E[R_a-R_f]}{\sigma_a}$$
The Sharpe ratio can be viewed as a standardized measure of expected excess return.
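$$
\text{Treynor ratio} = \frac{\E[R_a-R_f]}{\beta_a}
$$
$$
\text{Generalized Sharpe ratio} = \frac{\E[R_a-R_b]}{\sigma_a}
$$
where $R_b$ is the return of a benchmark asset and $\sigma_a$ here denotes the standard deviation of $R_a-R_b$.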
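Sample versions of the Sharpe and Treynor ratios can be sketched as below, assuming a constant risk-free rate per period and estimating the expectation by the sample mean. The function names, the returns, and the beta value are made up for illustration.

```python
import numpy as np

def sharpe_ratio(returns, risk_free):
    """Sample Sharpe ratio: mean excess return over the sample volatility."""
    returns = np.asarray(returns, dtype=float)
    return (returns.mean() - risk_free) / returns.std(ddof=1)

def treynor_ratio(returns, risk_free, beta):
    """Sample Treynor ratio: mean excess return per unit of beta."""
    returns = np.asarray(returns, dtype=float)
    return (returns.mean() - risk_free) / beta

r_a = [0.02, 0.04, 0.06]  # made-up periodic returns
print(sharpe_ratio(r_a, risk_free=0.01))
print(treynor_ratio(r_a, risk_free=0.01, beta=1.2))
```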
Efficient Frontier
The Efficient Frontier is the collection of risk-return pairs $$
\{(\sigma_P,\E R_P) \mid\ \nexists P'\ :\ \E R_{P'} = \E R_P \wedge \sigma_{P'}<\sigma_P \}$$
Let $P$ be a risky portfolio, and $R_f$ be the return of the risk-free asset.
Let $C$ be a combination of $P$ and the risk-free asset.
The collection of all risk-return pairs $$ (\sigma_C, \E(R_C) )$$
for all possible combinations $C$ gives the Capital Allocation Line (CAL).
For a given risky portfolio $P$, the CAL is given by the line
$$\E(R_C)=R_f+\sigma_C S_P$$
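One way to see where this line comes from, assuming $C$ puts a weight $w\ge 0$ in $P$ and $1-w$ in the risk-free asset (the weight $w$ is notation introduced here, and the risk-free return is taken to have zero variance):
$$
\begin{aligned}
\E(R_C) &= w\,\E(R_P) + (1-w) R_f = R_f + w\,(\E(R_P)-R_f) \\
\sigma_C &= w\,\sigma_P \\
\E(R_C) &= R_f + \sigma_C\, \frac{\E(R_P)-R_f}{\sigma_P} = R_f+\sigma_C S_P
\end{aligned}
$$
The slope of the CAL is therefore the Sharpe ratio $S_P$ of the risky portfolio.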
Let $CAL_T$ be the CAL tangent to the efficient frontier.
Let $P$ be the corresponding portfolio.
Then $P$ is the Tangency portfolio, and
$P$ has the highest Sharpe ratio among all portfolios.
Risks
VaR and CVaR
For a portfolio loss $L$, the value at risk $\VaR_{\alpha}(L)$ at confidence level $\alpha$ is: