R4. Probability review
This recitation was adapted from a probability review document by Rosita Fu. The Jupyter Notebook cheatsheet was written by Grace Liu.
You can download Rosita’s notes. They go into maximum likelihood estimation as well, which we will cover later in the term.
Contents:
In this recitation we will cover:
Axioms of Probability
Conditional Probability and Marginalization
Random Variables, PDFs, PMFs, and CDFs
Expectation and Moments
Common Distributions
This notebook contains only the important formulas and definitions; please see Rosita’s notes for a more thorough breakdown of the material.
Axioms of Probability
To get started with probability, we can define a few terms:
Sample space \(\Omega\): the set of all possible outcomes
Event: a subset of \(\Omega\), \(A \subseteq \Omega\)
Probability (P): a function that assigns a likelihood to the occurrence of every event in \(\Omega\)
There are a number of axioms that P must satisfy:
\(\sum_{A_i\in \Omega}P(A_i) = 1\)
\(P(\emptyset) = 0\)
\(P(A) + P(\overline{A}) = 1\)
\(P(A_i) \in [0,1]\)
If \(A_1, A_2, \ldots, A_n\) are disjoint (non-overlapping), then \(P(\cup_i A_i) = \sum_i P(A_i)\)
\(P(A\cup B) = P(A) + P(B) - P(A\cap B)\)
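As a quick numerical sanity check, here is a minimal sketch using a fair six-sided die (a made-up example, not from the notes) that verifies the total-probability, complement, and inclusion-exclusion rules:

```python
# Sample space for a fair six-sided die; every outcome is equally likely.
omega = set(range(1, 7))
prob = {outcome: 1 / 6 for outcome in omega}

def P(event):
    """Probability of an event: the sum of its outcome probabilities."""
    return sum(prob[o] for o in event)

# Two events: A = "roll is even", B = "roll is at least 4".
A = {2, 4, 6}
B = {4, 5, 6}

print(P(omega))              # 1.0  (total probability)
print(P(A) + P(omega - A))   # 1.0  (complement rule)
# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
print(P(A | B), P(A) + P(B) - P(A & B))  # both 0.666...
```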
Conditional Probability and Marginalization
If two events are independent, then
\begin{align} P(A,B) = P(A)P(B) \end{align}
In general, the probability of two events both happening is
\begin{align} P(A,B) = P(A\vert B)P(B) \end{align}
where \(P(A\vert B)\) is the conditional probability of A occurring given that B has occurred.
Rearranging the above equation, we have that
\begin{align} P(A\vert B) = \frac{P(A,B)}{P(B)} \end{align}
Bayes’s Theorem follows from the fact that \(P(A,B) = P(B,A)\) to give
\begin{align} &P(A\vert B) P(B) = P(B\vert A) P(A) \\[1em] &\Rightarrow P(A\vert B) = \frac{P(B\vert A)P(A)}{P(B)} \end{align}
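As a worked example (with hypothetical numbers, not from the notes), the sketch below applies Bayes’s Theorem to a diagnostic test, computing \(P(B)\) by summing over the two possibilities for \(A\):

```python
# Hypothetical numbers: A = "has condition", B = "test is positive".
P_A = 0.01             # prior P(A)
P_B_given_A = 0.95     # sensitivity, P(B | A)
P_B_given_notA = 0.05  # false-positive rate, P(B | not A)

# Total probability of a positive test, summing over A and not-A.
P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)

# Bayes's Theorem: P(A | B) = P(B | A) P(A) / P(B)
P_A_given_B = P_B_given_A * P_A / P_B
print(P_A_given_B)  # ≈ 0.16, much smaller than the sensitivity suggests
```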
When you have probabilities expressed in terms of two variables, \(P(X\vert Y)\), you can marginalize out one of them by summing (or integrating) over all of its possible values. In the discrete case,
\begin{align} P(X) = \sum_y P(X\vert Y=y)\,P(Y=y) \end{align}
and in the continuous case,
\begin{align} P(X) = \int_{-\infty}^{\infty} P(X\vert Y=y)\,P(Y=y)\, dy \end{align}
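A minimal NumPy sketch of marginalization, assuming a made-up \(2\times 3\) joint probability table:

```python
import numpy as np

# Made-up joint distribution P(X, Y): rows index x, columns index y.
P_XY = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])

# Marginalize out Y by summing over its axis: P(X) = sum_y P(X, Y=y)
P_X = P_XY.sum(axis=1)
# Marginalize out X: P(Y) = sum_x P(X=x, Y)
P_Y = P_XY.sum(axis=0)

print(P_X)                   # [0.5 0.5]
print(P_Y)                   # [0.3 0.3 0.4]
print(P_X.sum(), P_Y.sum())  # both 1.0
```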
Random Variables
A random variable (R.V.) is a function that maps outcomes in the sample space to real numbers. The probability distribution on \(\Omega\) induces a probability distribution for the R.V. on the real numbers.
A discrete R.V. can only take on a countable set of values (e.g. integers). Its probability distribution is defined by its probability mass function (PMF), \(p(x_i) = P(X=x_i)\).
A continuous R.V. can take on any value in a continuous range, and its probability density function (PDF) defines the likelihood for the value of the R.V. to be near a given value.
Properties of the probability density function \(f(x)\):
1. \(f(x) \geq 0\)
2. \(\int_{-\infty}^{\infty} f(x)\, dx = 1\)
3. \(P(X\in [a,b]) = \int_{a}^{b}f(x)\,dx\)
4. \(P(X=a) = 0\)
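A short sketch checking properties 2 and 3 numerically for a standard Normal density (assuming scipy is available):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Standard Normal density as a concrete example of a PDF.
f = stats.norm(loc=0, scale=1).pdf

# Property 2: the density integrates to 1 over the whole real line.
total, _ = quad(f, -np.inf, np.inf)
print(total)  # ≈ 1.0

# Property 3: P(X in [a, b]) is the integral of f over [a, b].
a, b = -1, 1
prob, _ = quad(f, a, b)
print(prob)  # ≈ 0.683, the familiar "one sigma" probability
```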
The cumulative distribution function (CDF) describes the likelihood that the value of the random variable is less than or equal to a given value.
It is the running sum of the PMF or the integral of the PDF. In the discrete case,
\[F(x) = P(X \leq x) = \sum_{x_i \leq x} p(x_i)\]
and in the continuous case,
\[F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t)\, dt\]
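A sketch comparing the two cases using scipy.stats, with made-up parameter choices:

```python
from scipy import stats
from scipy.integrate import quad

# Discrete case: the CDF of a Binomial(n=10, p=0.3) at x is the sum of its
# PMF over all values less than or equal to x.
binom = stats.binom(n=10, p=0.3)
x = 4
print(sum(binom.pmf(k) for k in range(x + 1)))  # manual sum of the PMF
print(binom.cdf(x))                             # same value from the built-in CDF

# Continuous case: the CDF of an Exponential with rate 2 at x is the integral
# of its PDF from 0 up to x (scipy parametrizes by scale = 1/lambda).
expon = stats.expon(scale=1 / 2)
x = 1.5
print(quad(expon.pdf, 0, x)[0])  # numerical integral of the PDF
print(expon.cdf(x))              # same value from the built-in CDF
```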
Expectation and Moments
The expectation of a random variable is the weighted average of all possible values. If you sample from a distribution a large number of times, the mean of the samples will be close to the expectation; this is the Law of Large Numbers.
discrete:
\[E[X] = \sum_i x_i\cdot p(x_i)\]
continuous:
\[E[X] = \int_{-\infty}^{\infty} x\cdot f(x) dx\]
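A minimal sketch of the Law of Large Numbers at work, again using a fair die as a made-up example:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Fair six-sided die: E[X] = sum_i x_i * p(x_i) = 3.5
faces = np.arange(1, 7)
expectation = np.sum(faces * (1 / 6))

# Law of Large Numbers: the sample mean approaches E[X] as n grows.
for n in (10, 1_000, 100_000):
    samples = rng.integers(1, 7, size=n)  # uniform draws from {1, ..., 6}
    print(n, samples.mean())
print("E[X] =", expectation)
```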
The expectation of a random variable is also known as its first moment. Higher moments are calculated as:
discrete:
\[E[X^n] = \sum_i x_i^n\cdot p(x_i)\]
continuous:
\[E[X^n] = \int_{-\infty}^{\infty} x^n\cdot f(x) dx\]
Moments can also be centered around the mean: \(E[(X-\mu)^n]\), where \(\mu = E[X]\).
Notably, the second centered moment is what we call the variance.
The expression for the variance is
\[\mathrm{Var}(X) = E[(X-\mu)^2] = E[X^2] - (E[X])^2\]
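A short sketch computing the variance both ways for a small made-up discrete distribution:

```python
import numpy as np

# A small discrete distribution (made-up values and probabilities).
x = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.1, 0.4, 0.3, 0.2])

mu = np.sum(x * p)                # first moment, E[X]
second_moment = np.sum(x**2 * p)  # E[X^2]

# Variance two equivalent ways:
var_centered = np.sum((x - mu) ** 2 * p)  # E[(X - mu)^2]
var_shortcut = second_moment - mu**2      # E[X^2] - (E[X])^2
print(var_centered, var_shortcut)         # both 0.84
```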
Common Distributions
I highly recommend looking at Justin’s Distribution Explorer and playing with the different distributions yourself, but here are some of the common distributions with stories that you might come across in this class.
Discrete distributions
Bernoulli : There are only two outcomes: success and failure, where the probability of success is \(p\).
Binomial : After performing \(n\) independent Bernoulli trials with probability of success \(p\), the number of successes is Binomially distributed.
Geometric : If we perform Bernoulli trials with probability of success \(p\), the number of failures before the first success is Geometrically distributed.
Poisson: A Poisson process is a memoryless arrival process with arrival rate \(\lambda\). The number of arrivals in one unit of time is Poisson distributed.
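A sketch mapping these stories onto scipy.stats objects with made-up parameters (note that scipy’s geom counts the trial on which the first success occurs, rather than the number of failures before it):

```python
from scipy import stats

p, n, lam = 0.3, 10, 2.0  # made-up parameters for illustration

bern = stats.bernoulli(p)  # single trial: P(success) = p
binom = stats.binom(n, p)  # number of successes in n trials
geom = stats.geom(p)       # trials up to and including the first success
pois = stats.poisson(lam)  # number of arrivals in one unit of time

print(bern.pmf(1))   # 0.3, probability of success
print(binom.pmf(3))  # probability of exactly 3 successes in 10 trials
print(geom.pmf(4))   # probability the first success is on the 4th trial
print(pois.pmf(0))   # probability of no arrivals: e^(-lambda)
```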
Continuous distributions
Uniform: There is a limited range of possible values \([a,b]\). Within that range, every value has equal probability.
Exponential: The inter-arrival time for a Poisson process with rate of arrival \(\lambda\) is Exponentially distributed.
Gamma: The amount of time it takes to accumulate \(\alpha\) arrivals of a Poisson process with arrival rate \(\lambda\) is Gamma distributed.
Gaussian/Normal: Parametrized by the mean \(\mu\) and variance \(\sigma^2\). By the Central Limit Theorem , the sum (or mean) of many independent random variables tends toward a Normal distribution, regardless of the underlying distribution. Most processes that are a sum of many individual subprocesses will therefore be approximately Normally distributed.
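A quick sketch of the Central Limit Theorem in action (assuming numpy and scipy are available): sums of i.i.d. Uniform draws look increasingly Normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Sum n i.i.d. Uniform(0, 1) draws, many times over.
n, trials = 30, 100_000
sums = rng.uniform(0, 1, size=(trials, n)).sum(axis=1)

# CLT: the sums are approximately Normal with mean n/2 and variance n/12.
print(sums.mean(), n / 2)  # sample mean vs. theoretical mean
print(sums.var(), n / 12)  # sample variance vs. theoretical variance

# Compare the empirical CDF against the Normal CDF at one point.
mu, sigma = n / 2, np.sqrt(n / 12)
print(np.mean(sums <= mu + sigma), stats.norm(mu, sigma).cdf(mu + sigma))  # both ≈ 0.84
```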