In probability, a probability space $(\Omega, \Sigma, \Prb)$ dictates the likelihood of certain events occurring. The events we may measure are the elements of $\Sigma$, and the likelihood of an event $A \in \Sigma$ is measured by $\Prb A \in [0,1]$. Our events of interest are usually those in which some undetermined quantity $\X$ lands in some target set. $$ \begin{gathered} \Prb\big(160{\rm cm} \leq \operatorname{height} \leq 187{\rm cm}\big) = 0.95 \\ \Prb\big(60{\rm kg} \leq \operatorname{mass} \leq 80{\rm kg}\big) = 0.8 \\ \Prb\big(22 \leq \operatorname{age} \leq 28\big) = 0.01 \\ \Prb\big( \X \in A\big) = 0.0001 \\ \end{gathered} $$ Explicitly, these quantities $\X$ are measurable functions $\X: \Omega \rightarrow \stateSpace$ from $(\Omega, \Sigma)$ to their respective measurable codomains $(\stateSpace, \stateAlgebra)$, and the event sets are preimages. $$ \big\{ \X \in A \big\} = \X^{-1}A \in \Sigma $$ This perspective of $\Sigma/\stateAlgebra$-measurable functions $\X$ is nice, in that we may mentally replace $\X: \Omega \rightarrow \stateSpace$ with a point $\stateVar \in \stateSpace$, in the sense that any operation $f(\stateVar)$ we perform on $\stateVar$ may be replaced with the function composition $f(\X) = f \circ \X: \Omega \rightarrow \bbF$.
Example. Just as we can calculate body mass index (BMI) from quantities $\operatorname{height}, \operatorname{mass} \in \bbR$ $$ \operatorname{BMI} = \frac{\operatorname{mass}}{\operatorname{height}^2} \in \bbR, $$ we can construct an undetermined quantity $\operatorname{BMI}: \Omega \rightarrow \bbR$ from those $\operatorname{height}, \operatorname{mass}: \Omega \rightarrow \bbR$ via the same operation $$ \operatorname{BMI}(\omega) = \frac{\operatorname{mass}(\omega)}{\operatorname{height}(\omega)^2} $$ and simply write $\operatorname{BMI} = \operatorname{mass}/\operatorname{height}^2$.
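For instance, here is a minimal numpy sketch of this replacement; the sampling distributions and parameters below are illustrative assumptions, not part of the example.

import numpy
# Illustrative draws of the undetermined quantities height and mass;
# the normal parameters here are assumptions chosen for demonstration.
height = numpy.random.normal(1.7, 0.1, size=10000)   # meters
mass = numpy.random.normal(70.0, 10.0, size=10000)   # kilograms
# BMI is the same deterministic operation applied draw-by-draw,
# mirroring the composition BMI = mass / height**2.
bmi = mass / height ** 2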
Caveat. The output set $\bbF$ of our operation $f$ must carry some associated $\sigma$-algebra $\scrF$ for which the operation $f: \stateSpace \rightarrow \bbF$ is $\stateAlgebra/\scrF$-measurable.
The benefit of mentally replacing quantities $\stateVar \in \stateSpace$ with undetermined versions $\X: \Omega \rightarrow \stateSpace$ is our ability to apply operations like those in the example above. The downside of this approach is that it abstracts all the information out of $(\Omega, \Sigma, \Prb)$, making it hard for newcomers to probability to understand how probability spaces are constructed to satisfy the properties we study. Take, for instance, a Python program which generates samples from distributions like so.
import numpy
X = numpy.random.exponential(scale=1/69.0)  # X ~ Exponential(rate 69)
Y = numpy.random.normal(0.0, 1.0)           # Y ~ Normal(0, 1)
Z = numpy.random.normal(Y, X)               # Z drawn using the values of X and Y
We can quickly model this code mentally in terms of undetermined quantities, declared in the same order as in the code. $$\begin{gathered} X \sim \operatorname{Exponential}(69) \\ Y \sim \operatorname{Normal}(0, 1) \\ Z \sim \operatorname{Normal}(Y, X) \\ \end{gathered}$$ However, the notation above hides the structure of the probability spaces $(\Omega, \Sigma, \Prb)$ which equip such undetermined quantities $X, Y, Z$. This abstraction may leave people unfulfilled: without explicit spaces to tinker with, learners may struggle to build familiarity with the theory.
This post seeks to address how we can think of probability generatively; that is, we seek to demonstrate how one may construct probability spaces $(\Omega, \Sigma, \Prb)$ which correspond to generative sampling algorithms like the Python code above.
If I want some probability space $(\Omega, \Sigma, \Prb)$ with a single undetermined quantity $\X: \Omega \rightarrow \stateSpace$ with some distribution $\mu$, $$ \Prb(\X \in A) = \mu A, \quad A \in \stateAlgebra, $$ the construction is very easy. $$ (\Omega, \Sigma, \Prb, \X) = (\stateSpace, \stateAlgebra, \mu, \operatorname{identity}) $$ This clearly yields the desired distribution. $$ \Prb(\X \in A) = \Prb(\X^{-1}A) = \Prb(A) = \mu(A) $$
Example. Letting $\mu$ be an $\operatorname{Exponential}(69)$ probability measure on the Borel sets $\scrR$ of $\bbR$, $$ \mu(A) = \int_{A \cap [0,\infty)} 69 e^{-69x} \rmd x, $$ we can construct our space like so. $$ (\Omega, \Sigma, \Prb, \X) = (\bbR, \scrR, \mu, \operatorname{identity}) $$ Note that any $f: \Omega \rightarrow \bbF$ that is $\Sigma/\scrF$-measurable necessarily satisfies $$ f(\stateVar) = f(\X(\stateVar)) $$ so all undetermined quantities $f$ on the space $(\Omega, \Sigma, \Prb)$ are effectively deterministic operations of $\X$.
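In sampling terms, this identity construction says that a draw of $\omega$ from $\mu$ already is a draw of $\X$; a minimal sketch, reusing the program from before.

import numpy
# Sampling omega from mu and applying X = identity is the same as
# sampling X directly (scale=1/69 is numpy's parameterization of rate 69).
omega = numpy.random.exponential(scale=1/69.0)
X = omega  # the identity map on Omega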
This is a silly edge case, as we typically deal with undetermined quantities that relate to each other in some way. In the next section, we will discuss how we build our probability space through successive enlargements when consecutively introducing new undetermined quantities.
In our Python program, we were able to consecutively initialize the variables X, Y, Z by sequentially calling functions in the numpy.random module. Each function call allocated more memory for the program, like in the diagram below.
M1 = [X]
M2 = [X, Y]
M3 = [X, Y, Z]
Just as the allocated memory enlarges, M1 ⊆ M2 ⊆ ⋯ ⊆ Mn, as we consecutively initialize variables X1, X2, $\ldots$, Xn in a program, we can enlarge our probability spaces with the declaration of undetermined quantities $X_1, X_2, \ldots, X_n$. I see two ways to perform such enlargements, which I will call independent sampling and conditional sampling. Each type of enlargement leverages some probability space $(\Omega, \Sigma, \Prb)$ to create a new space $(\tilde\Omega,\tilde\Sigma,\tilde\Prb)$ with the following properties: every undetermined quantity on the original space lifts to a counterpart on the new space with the same distribution, and the new space carries one additional undetermined quantity with a prescribed distribution.
With these properties, we may behave as though the enlargement never occurred, treating our original abstract $(\Omega, \Sigma, \Prb)$ as if it were the enlarged $(\tilde\Omega,\tilde\Sigma,\tilde\Prb)$, now carrying one new undetermined quantity in addition to our previous ones.
Provided two probability spaces $(\Omega, \Sigma, \Prb)$ and $(\bbY, \scrY, \mu)$, we may define a new space $(\tilde\Omega, \tilde\Sigma,\tilde\Prb)$ via the product operations. $$\begin{aligned} \tilde\Omega &= \Omega \times \bbY = \big\{ (\omega, y) : \omega \in \Omega, y \in \bbY \big\} \\ \tilde\Sigma &= \Sigma \otimes \scrY = \sigma\big(A \times B : A \in \Sigma, B \in \scrY \big) \\ \tilde\Prb &= \Prb \otimes \mu \\ &\quad (\Prb\otimes\mu)(A \times B) = \Prb(A)\cdot \mu(B) \end{aligned}$$ Note that we only defined $\Prb\otimes\mu$ on the rectangles $\big\{ A \times B : A \in \Sigma, B \in \scrY \big\}$, but this partial definition on the rectangles extends uniquely to a measure on $\tilde\Sigma$. Now, we may define $Y: \tilde\Omega \rightarrow \bbY$ via $$ Y(\omega, y) = y. $$ We immediately have that $Y$ is $\tilde\Sigma/\scrY$-measurable; for all $A \in \scrY$, we get $$ Y^{-1}A = \Omega \times A \in \tilde\Sigma. $$ Its distribution is also $\mu$. $$\begin{aligned} \tilde\Prb(Y \in A) &= \tilde\Prb(Y^{-1}A) \\ &= \tilde\Prb(\Omega \times A) \\ &= (\Prb\otimes\mu)(\Omega\times A) \\ &= \Prb\Omega\cdot\mu A \\ &=\mu A \end{aligned}$$ Further, any undetermined quantity $\X: \Omega \rightarrow \stateSpace$ we had on our previous space $(\Omega, \Sigma, \Prb)$ can be extended to the new space via the following definition of $\tilde\X: \tilde\Omega \rightarrow \stateSpace$. $$ \tilde\X(\omega, y) = \X(\omega) $$ The $\Sigma/\stateAlgebra$-measurability that held for $\X$ lifts to $\tilde\Sigma/\stateAlgebra$-measurability of $\tilde\X$; indeed, each $A \in \stateAlgebra$ is such that $\X^{-1}A \in \Sigma$, so $$ \tilde\X^{-1}A = \X^{-1}A \times \bbY \in \tilde\Sigma. $$ The $\tilde\Prb$-distribution of $\tilde\X$ is also exactly the same as the $\Prb$-distribution of $\X$. $$\begin{aligned} \tilde\Prb(\tilde \X \in A) &= \tilde\Prb(\tilde \X^{-1}A) \\ &= \tilde\Prb(\X^{-1}A \times \bbY) \\ &= (\Prb\otimes\mu)(\X^{-1}A \times \bbY) \\ &= \Prb(\X^{-1}A) \cdot \mu\bbY \\ &= \Prb(\X \in A) \end{aligned}$$
Lastly, the phrase independent sampling comes from the fact that counterparts $\tilde\X$ of undetermined quantities $\X$ in our original space are all independent of our new quantity $Y$. $$\begin{aligned} \tilde\Prb\big(\tilde\X \in A, Y \in B\big) &= \tilde\Prb\big(\tilde\X^{-1}A \cap Y^{-1}B\big) \\ &= \tilde\Prb\big((\X^{-1}A \times \bbY) \cap (\Omega \times B) \big) \\ &= \tilde\Prb\big(\X^{-1}A \times B \big) \\ &= (\Prb\otimes\mu)\big(\X^{-1}A \times B \big) \\ &= \Prb(\X^{-1}A) \cdot \mu B \\ &= \Prb(\X \in A) \cdot \tilde\Prb(Y \in B) \\ &= \tilde\Prb(\tilde\X \in A) \cdot \tilde\Prb(Y \in B) \end{aligned}$$ Example. Consider $\mu$ as in our previous example. $$\mu(A) = \int_{A \cap [0,\infty)} 69 e^{-69\stateVar} \rmd\stateVar$$ By introducing a $\operatorname{Normal}(0, 1)$ measure $\nu$ on $(\bbR, \scrR)$ $$\nu(A) = \int_A (2\pi)^{-1/2} \exp\Big(-\frac{y^2}{2} \Big)\rmd y, $$ we may construct a probability space as follows. $$ (\Omega, \Sigma, \Prb) = (\bbR\times\bbR, \scrR\otimes\scrR, \mu\otimes\nu) $$ Define $\X, Y: \Omega \rightarrow \bbR$ by $\X(\stateVar, y) = \stateVar$ and $Y(\stateVar, y) = y$; we see the following joint distribution. $$\begin{aligned} \Prb(\X \in A, Y \in B) &=\Prb(\X^{-1}A \cap Y^{-1}B) \\ &=\Prb\big((A \times \bbR) \cap (\bbR \times B)\big) \\ &=\Prb(A \times B) \\ &=(\mu \otimes \nu)(A \times B) \\ &= \mu(A) \cdot \nu(B) \end{aligned}$$ In other words, $\X$ is $\operatorname{Exponential}(69)$ distributed, while $Y$ is independently $\operatorname{Normal}(0,1)$ distributed. Note that any $\Sigma/\scrF$-measurable function $f: \Omega \rightarrow \bbF$ is immediately of the form $$ f(\stateVar, y) = f\big(\X(\stateVar, y), Y(\stateVar, y) \big) $$ so all undetermined quantities $f$ on the space $(\Omega, \Sigma, \Prb)$ are effectively deterministic operations of the quantities $\X$ and $Y$.
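As a sanity check against the program, the product construction matches its first two sampling lines: a point of the enlarged space is a pair, each coordinate drawn from its own measure. A minimal sketch:

import numpy
# A sample point of Omega = R x R is a pair (x, y); the coordinates are
# drawn from mu and nu separately, which is exactly what makes the
# component maps X(x, y) = x and Y(x, y) = y independent.
omega = (numpy.random.exponential(scale=1/69.0),
         numpy.random.normal(0.0, 1.0))
X, Y = omega  # the component maps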
Fix a probability space $(\Omega, \Sigma, \Prb)$ and a measurable space $(\bbY, \scrY)$. A transition kernel $\kappa: \Omega \times \scrY \rightarrow \bbR$ is an object satisfying the following: for each $\omega \in \Omega$, the map $B \mapsto \kappa(\omega, B)$ is a probability measure on $(\bbY, \scrY)$, and for each $B \in \scrY$, the map $\omega \mapsto \kappa(\omega, B)$ is $\Sigma/\scrR$-measurable.
With these objects, we may construct a probability space $(\tilde\Omega, \tilde\Sigma, \tilde\Prb)$, defined as follows.
$$\begin{aligned} \tilde\Omega &= \Omega \times \bbY \\ \tilde\Sigma &= \Sigma \otimes \scrY \\ \tilde\Prb &= \Prb \ast \kappa \\ &\quad (\Prb\ast\kappa)(A \times B) = \int_A \kappa(\omega, B) \Prb(\rmd\omega) \end{aligned}$$ Again, we may define a $\tilde\Sigma/\scrY$-measurable function $Y: \tilde\Omega \rightarrow \bbY$ via $$ Y(\omega, y) = y. $$ Although the measurable space $(\tilde\Omega, \tilde\Sigma)$ is the same as before, the new measure $\tilde\Prb$ this time makes the distribution of $Y$ rather complicated. $$\begin{aligned} \tilde\Prb(Y \in A) &= \tilde\Prb(Y^{-1}A) \\ &= \tilde\Prb(\Omega \times A) \\ &= \int_\Omega \kappa(\omega, A) \Prb(\rmd\omega) \end{aligned}$$ Intuitively, we may think of the last expression as a $\Prb$-weighted average over a family of likelihoods $\{\kappa(\omega, A)\}_{\omega\in\Omega}$. Each $\kappa(\omega, A)$ is the likelihood of the event $\{ Y \in A \}$ for a specific instance $\omega \in \Omega$. To this end, $\kappa(\omega, \cdot)$ effectively serves as a conditional distribution for $Y$, given a specific $\omega \in \Omega$.
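This $\omega$-then-$y$ mechanism is exactly two consecutive lines of a sampling program. Below is a minimal sketch in which, purely as illustrative assumptions, $\Prb$ is a standard normal measure and $\kappa(\omega, \cdot)$ is a $\operatorname{Normal}(\omega, 1)$ measure; the $\Prb$-weighted average of the kernel then makes $Y$ marginally $\operatorname{Normal}(0, \sqrt2)$.

import numpy
# Sampling under P * kappa: draw omega from P, then y from kappa(omega, .).
omega = numpy.random.normal(0.0, 1.0, size=100000)  # omega ~ P
y = numpy.random.normal(omega, 1.0)                 # y ~ kappa(omega, .)
print(y.std())  # approximately sqrt(2) =~ 1.414, the mixture's spread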
Note that if we curry the first argument of a transition kernel $\kappa$ so that it is a function of the form $\omega \mapsto (A \mapsto \kappa(\omega, A))$, we may recognize $\kappa: \Omega \rightarrow \bbM_1(\scrY)$, where $\bbM_1(\scrY)$ is the set of probability measures on $(\bbY, \scrY)$. We may equip $\bbM_1(\scrY)$ with a $\sigma$-algebra $\scrM_1(\scrY)$, weakly determined by integration maps. That is, letting $\bbR_+ = [0,\infty)$, $\scrR_+$ be the relative algebra of $\scrR$ on $\bbR_+$, $$ \scrR_+ = \Big\{ A \cap \bbR_+ : A \in \scrR \Big\}, $$ and associating each bounded $\scrY/\scrR_+$-measurable function $f: \bbY \rightarrow \bbR_+$ with an integration map $I_f: \bbM_1(\scrY) \rightarrow \bbR$ by $$ I_f(\mu) = \int_\bbY f(y) \mu(\rmd y), $$ the following is a $\sigma$-algebra on $\bbM_1(\scrY)$. $$ \scrM_1(\scrY) = \sigma\Big( I_f \;|\; f \text{ is bounded and } \scrY/\scrR_+\text{-measurable}\Big) $$
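Computationally, one can caricature this picture by representing a probability measure as a sampler and approximating an integration map $I_f$ by a Monte Carlo average. A hedged sketch, with all choices illustrative:

import numpy
# I_f(mu) = integral of f against mu, approximated by averaging f over
# draws from a sampler representing mu.
def I(f, sampler, n=100000):
    return f(sampler(n)).mean()
I(numpy.cos, lambda n: numpy.random.normal(0.0, 1.0, size=n))
# approximately exp(-1/2) =~ 0.607 when mu = Normal(0, 1)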
Result 0. Given a transition kernel $\kappa$, the map $\omega \mapsto (A \mapsto \kappa(\omega, A))$ is $\Sigma/\scrM_1(\scrY)$-measurable.
Result 1. Given a $\Sigma/\scrM_1(\scrY)$-measurable map $\kappa: \Omega \rightarrow \bbM_1(\scrY)$, the map $(\omega, A) \mapsto \kappa(\omega)(A)$ is a transition kernel.
With this equivalence, $\kappa$ is an undetermined measure that distributes according to $\Prb$, and through the algebraic operation $(\Prb, \kappa) \mapsto \Prb\ast\kappa$, we are able to introduce a new source of indeterminism on top of that from $(\Omega, \Sigma, \Prb)$.
The integral structure of the distribution of the new undetermined quantity $Y$ depends on each of these sources $\kappa, \Prb$.
However, this property is unique to the new undetermined quantity $Y$.
That is, for each $\X: \Omega \rightarrow \stateSpace$, the lifted $\tilde\X: \tilde\Omega \rightarrow \stateSpace$ preserves distribution, acting independently of $\kappa$.
$$\begin{aligned}
\tilde\Prb(\tilde\X \in A)
&=\tilde\Prb(\tilde\X^{-1}A) \\
&=\tilde\Prb(\X^{-1}A \times \bbY) \\
&=\int_{\X^{-1}A} \kappa(\omega, \bbY) \Prb(\rmd\omega) \\
&=\int_{\X^{-1}A} \Prb(\rmd\omega) \\
&=\Prb(\X^{-1}A) \\
&=\Prb(\X \in A)
\end{aligned}$$
We may again compare this to the Python program: the variables X and Y are declared at lines 2 and 3, before Z is declared at line 4, so their values should not depend on the later allocation. Conversely, Z is declared in a fashion that depends on X and Y, so its value should depend on the earlier allocations.
Example. Consider the probability space $(\bbR \times \bbR, \scrR \otimes \scrR, \mu \otimes \nu)$ as in our previous example. $$\begin{gathered} \mu(A) = \int_{A \cap [0,\infty)} 69 e^{-69\stateVar} \rmd\stateVar \\ \nu(B) = \int_B (2\pi)^{-1/2} \exp\Big(-\frac{y^2}{2}\Big) \rmd y \end{gathered}$$ Define a kernel $\kappa: (\bbR\times\bbR) \times \scrR \rightarrow \bbR$ as follows. $$ \kappa((\stateVar,y), A) = \int_A (2\pi\stateVar)^{-1/2} \exp\Big(-\frac{(z-y)^2}{2\stateVar} \Big) \rmd z $$ This allows us to declare our probability space. $$\begin{aligned} \Omega &= \bbR^3, \\ \Sigma &= \scrR \otimes \scrR \otimes \scrR,\\ \Prb &= (\mu \otimes \nu) \ast\kappa \end{aligned}$$ Now let $\X, Y, Z: \Omega \rightarrow \bbR$ be the component maps. The joint distribution of $(\X, Y)$ is $\mu\otimes\nu$. $$\begin{aligned} \Prb(\X \in A, Y \in B) &= \big((\mu\otimes\nu)\ast\kappa\big)(A \times B \times \bbR) \\ &= \int_{A\times B} \kappa\big((\stateVar, y), \bbR\big) (\mu\otimes\nu)(\rmd\stateVar,\rmd y) \\ &= \int_{A\times B} (\mu\otimes\nu)(\rmd\stateVar,\rmd y) \\ &= (\mu\otimes\nu)(A \times B) \end{aligned}$$ While conditional distributions have numerous characterizations, we can think of taking a conditional probability $$\begin{aligned} \Prb(Z \in C \mid \X \in A, Y \in B) &= \frac{\Prb(\X \in A, Y \in B, Z \in C)}{\Prb(\X \in A, Y \in B)} \\ &= \frac{\big((\mu\otimes\nu)\ast\kappa\big)(A \times B \times C)}{(\mu\otimes\nu)(A\times B)} \\ &= \frac{1}{(\mu\otimes\nu)(A\times B)} \int_{A \times B} \kappa\big((\stateVar, y), C\big) (\mu\otimes\nu)(\rmd\stateVar,\rmd y) \end{aligned}$$ and shrinking $A \times B \rightarrow \{(\stateVar, y)\}$ to claim $Z$ is conditionally $\operatorname{Normal}(Y, \X)$. $$\begin{aligned} \Prb(Z \in C \mid \X =\stateVar, Y = y) &= \lim_{A \times B \rightarrow \{(\stateVar, y)\}} \Prb(Z \in C \mid \X \in A, Y \in B) \\ &= \kappa\big((\stateVar, y), C\big) \\ &= \int_C (2\pi\stateVar)^{-1/2} \exp\Big(-\frac{(z-y)^2}{2\stateVar} \Big) \rmd z \end{aligned}$$
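Here is a hedged simulation of this space: draw $(\stateVar, y)$ from $\mu\otimes\nu$, then $z$ from $\kappa((\stateVar, y), \cdot)$. One caveat: the kernel above uses $\stateVar$ as a variance, while numpy's scale argument is a standard deviation, hence the square root below.

import numpy
n = 100000
x = numpy.random.exponential(scale=1/69.0, size=n)  # x ~ mu
y = numpy.random.normal(0.0, 1.0, size=n)           # y ~ nu
# z ~ kappa((x, y), .): normal with mean y and variance x, so the
# standard deviation handed to numpy is sqrt(x).
z = numpy.random.normal(y, numpy.sqrt(x))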
If we have a measure $\mu$ and a transition kernel $\kappa$ that is constant in its first coordinate, i.e., there is a probability measure $\nu$ such that $$ \kappa(\omega, A) = \nu(A), $$ it is easy to see that $\mu \ast\kappa= \mu\otimes\nu$. $$\begin{aligned} (\mu\ast\kappa)(A \times B) &= \int_A \kappa(\omega, B) \mu(\rmd\omega) \\ &= \int_A \nu(B) \mu(\rmd\omega) \\ &= \mu(A)\nu(B) \\ &= (\mu\otimes\nu)(A\times B) \end{aligned}$$
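In sampling terms, a constant kernel means the second draw never reads the first, which is precisely independent sampling; a two-line sketch with illustrative measures.

import numpy
omega = numpy.random.exponential(scale=1/69.0)  # omega ~ mu
y = numpy.random.normal(0.0, 1.0)               # y ~ kappa(omega, .) = nu, ignoring omega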
A trick that is sometimes exploited is what I would like to call latent sampling, in which we construct a probability space $(\Omega, \Sigma, \Prb)$ with undetermined quantities $\X: \Omega \rightarrow \stateSpace$ and $Y: \Omega \rightarrow \bbY$ which are correlated through a deterministic operation of a latent, independent $W: \Omega \rightarrow \bbW$. $$ Y = f(\X, W) $$ We can easily construct this in two ways. The first makes $(\Omega, \Sigma, \Prb)$ the independent sampling of $\X, W$. $$\begin{aligned} \Omega &= \stateSpace \times \bbW \\ \Sigma &= \stateAlgebra \otimes \scrW \\ \Prb &= \mu \otimes \nu \\ \X(\stateVar, w) &= \stateVar \\ W(\stateVar, w) &= w \\ Y(\stateVar, w) &= f(\stateVar, w) \end{aligned}$$ The second directly makes it a conditional sampling of $\X, Y$. $$\begin{aligned} \Omega &= \stateSpace \times \bbY \\ \Sigma &= \stateAlgebra \otimes \scrY \\ \Prb &= \mu \ast \kappa \\ \kappa(\stateVar, B) &= \int_\bbW 1_B\big(f(\stateVar, w)\big) \nu(\rmd w) \\ \X(\stateVar, y) &= \stateVar \\ Y(\stateVar, y) &= y \end{aligned}$$ Note that the second approach is more economical, in that the quantities $\X$ and $Y$ of interest are exactly the component maps. This comes at the cost of being less generative, in that the latent variable $W$ does not exist on the space. In fact, the second approach can be seen as the most efficient construction of a model bearing $\X$ and $Y$, in that it embeds in any such model, as seen in the distributions. $$\begin{aligned} \Prb(\X \in A, Y \in B) &= \Prb(\X \in A, f(\X, W) \in B) \\ &= \int_{\stateSpace \times \bbW} 1_A(\stateVar) 1_B\big(f(\stateVar, w)\big) (\mu\otimes\nu)(\rmd\stateVar,\rmd w) \\ &= \int_A \int_\bbW 1_B\big(f(\stateVar, w)\big) \nu(\rmd w) \mu(\rmd\stateVar) \\ &= \int_A \kappa(\stateVar, B) \mu(\rmd\stateVar) \\ &= (\mu\ast\kappa)(A \times B) \end{aligned}$$
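A hedged sketch of both constructions, with the illustrative choices $f(\stateVar, w) = \sqrt\stateVar \cdot w$, $\mu$ an $\operatorname{Exponential}(69)$ measure, and $\nu$ a standard normal measure; either sampler induces the same joint distribution of $(\X, Y)$.

import numpy
n = 100000
x = numpy.random.exponential(scale=1/69.0, size=n)   # x ~ mu
# First construction: carry the latent W explicitly and set Y = f(X, W).
w = numpy.random.normal(0.0, 1.0, size=n)            # w ~ nu
y_latent = numpy.sqrt(x) * w                         # y = f(x, w)
# Second construction: draw y from kappa(x, .) directly; W never exists.
y_kernel = numpy.random.normal(0.0, numpy.sqrt(x))   # same law as f(x, W)
# Compare, e.g., the marginal spreads of the two versions of Y.
print(y_latent.std(), y_kernel.std())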
We now see how to generate probability spaces $(\Omega, \Sigma, \Prb)$ with undetermined quantities $X_1, \ldots, X_n$ exhibiting interesting relations. We start by picking a distribution for $X_1$, set $(\Omega_1, \Sigma_1, \Prb_1)$ as the target space with said distribution, and then inductively perform enlargements $(\Omega_i, \Sigma_i, \Prb_i) \rightarrow (\Omega_{i+1}, \Sigma_{i+1}, \Prb_{i+1})$ like above, each time adding a new undetermined quantity $X_{i+1}$, until we end at $(\Omega, \Sigma, \Prb) = (\Omega_n, \Sigma_n, \Prb_n)$ equipped with $X_1, \ldots, X_n$ (up to identification).
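In program form, this inductive construction is a finite loop of declarations, each new variable drawn from a kernel that may read the variables declared so far. A minimal sketch in which, as an illustrative choice, each kernel reads only the latest variable:

import numpy
# X_1 from its own distribution, then nine conditional enlargements,
# each X_{i+1} ~ kappa(X_i, .) = Normal(X_i, 1) (illustrative kernel).
xs = [numpy.random.normal(0.0, 1.0)]
for _ in range(9):
    xs.append(numpy.random.normal(xs[-1], 1.0))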
As we saw in the examples, our space $(\Omega, \Sigma)$ will be a product space with $n$ factors, each $\X_i$ is the $i$-th component map, and each undetermined quantity will take the form $f(\X_1, \ldots, \X_n)$. To this end, our spaces are constructed so that $\X_1, \ldots, \X_n$ determine the entire space. This matches our programming analogy, in which the memory of the program consists of the variables X1, $\ldots$, Xn.
Note that when we say extend inductively, we mean to suggest a finite number of enlargements. As it stands, we have only proven enlargements that add a single undetermined quantity at a time. This means we still don't have probability spaces $(\Omega, \Sigma, \Prb)$ which equip stochastic processes $(X_i)_{i\in I}$ for infinite $I$. This still aligns with our computer-program analogy: for countably infinite $I$, we would need a never-ending for-loop to get all $(X_i)_{i\in I}$, and for uncountably infinite $I$, the analogy would not even make sense.
To remedy this, we think of stochastic processes slightly differently than as a collection of undetermined quantities $(\X_i)_{i\in I}$. We instead think of the entire map $\X=(\X_i)_{i\in I}$ at once, and show how to construct measures associated with processes $\X: \Omega \rightarrow \bbX^I$. These measures will again be defined on product algebras $\scrX^I$ over $\bbX^I$, and we will be able to describe them through finite-dimensional projections (even when $I$ is uncountable). To read more on this, consider the following post, which discusses how to construct stochastic processes.