Conditional Probability
\[ \def\E{\mathsf E} \def\A{\mathscr A} \def\B{\mathscr B} % for general sigma alg \def\C{\mathscr C} \def\EE{\mathscr E} \def\F{\mathscr F} \def\G{\mathscr G} \def\H{\mathscr H} \def\S{\mathscr S} \def\BB{\mathcal B} % for borel \def\P{\mathsf P} \def\Q{\mathsf Q} \def\R{\mathsf R} \def\RR{\mathbb R} \def\px{\phantom{X}} \def\tF{\tilde{\mathscr F}} \def\dint{\displaystyle\int} \]
0.1 Nutshell Summary
Let \((\Omega,\F,\P)\) be a probability space and \(\G\subset\F\) a sub-sigma algebra. The conditional probability of a set \(A\in\F\) given the information in \(\G\) is a \(\G\)-measurable function \(\P(A\mid\G)\) on \(\Omega\) such that \[ \P(B\cap A) = \int_B \P(A\mid \G)\,d\P \] for all \(B\in\G\). Any two such functions are equal a.e. \([\P]\).
If \(X\) is a real-valued random variable on \(\Omega\) then the conditional expectation of \(X\) given \(\G\) is a \(\G\)-measurable function \(\E[X\mid\G]\) such that \[ \int_B \E[X\mid \G](\omega)\,\P(d\omega) = \int_B X(\omega)\, \P(d\omega) \] for all \(B\in\G\). Again, any two such functions are equal a.e. \([\P]\).
If \(T\) is another real-valued random variable on \(\Omega\), then \(\E[X\mid T]\) means \(\E[X\mid\sigma(T)]\). Since it is \(\sigma(T)\)-measurable, it factors through \(T\), giving a function \(\phi:\RR\to\RR\) so that \(\E[X\mid T] = \phi(T)\). Then \(\E[X\mid T=t]\) equals \(\phi(t)\). Here, \(\E[X\mid T=t]\) is defined as a function with the “right” integrals against real Borel sets: \(\forall B\in\BB(\RR),\ \int_B \E[X\mid T=t]\,\P_T(dt)=\int_{T^{-1}(B)} X\,d\P\).
Conditional probabilities are simply conditional expectations of indicator functions \(1_A\). Using Pollard’s all-in-one notation, conditional probabilities and expectations satisfy \[ \P(BA) = \P(B\,\P(A\mid \G)), \quad \P(BX) = \P(B\,\E[X\mid \G]). \] We try to write the \(\G\)-measurable function first.
It is usually possible to combine the functions \(\P(A\mid\G)\) for different \(A\) into a function \(p:\Omega\times \F \to [0,1]\) called a regular conditional probability satisfying:
- MEAS: \(\forall\omega\in\Omega\), \(B\mapsto p(\omega, B)\) is a probability measure.
- VER: \(\forall A\in\F\), the function \(\omega\mapsto p(\omega, A)\) is a version of \(\P(A\mid\G)\).
If the condition in MEAS only holds a.e., then on the exceptional set of \(\omega\) the values can be replaced by an arbitrary probability measure so that MEAS holds for all \(\omega\) and condition VER remains true. Thus “for all” and “a.e.” can be used interchangeably in MEAS. Since \(\P(A\mid\G)\) is \(\G\)-measurable, so is the function in VER.
The conditional and regular conditional probabilities are connected via \[ \begin{gather} \P(A\cap B) = \int_B p(\omega, A)\,\P(d\omega), \qquad p(\omega, A) = \int_\Omega 1_A(\omega')\,p(\omega,d\omega') = p_\omega A \\ \E[X\mid\G](\omega) = \int_\Omega X(\omega')\,p(\omega,d\omega') = p_\omega X. \end{gather} \]
There is a chain of increasing generality: from working on \((\Omega, \F)\) with \(\P\) and \(\G\subset\F\), through \(\G=\sigma(T)\), to working on the range space \((M,\B)\) with \(\P_T\) and the distribution \(\P_S\) of a second random element.
It was one of Kolmogorov’s many great ideas to use the Radon-Nikodým theorem (RNT) to define conditional probabilities and conditional expectations (Kolmogorov and Bharucha-Reid 2013). The idea can be applied in several flavors, which we now explore.
0.2 Notation
- \((\Omega,\F,\P)\) is a probability space and \((L,\A)\) and \((M,\B)\) are measurable spaces
- \(\G\subset\F\) is a sub-sigma algebra
- \(A\in\F\) and \(B\in\G\) are measurable subsets
- \(X:(\Omega,\F)\to(\RR,\BB(\RR))\) is an \(\F\)-measurable random variable
- \(H:(\Omega,\G)\to(\RR,\BB(\RR))\) is a \(\G\)-measurable random variable
- \(S:(\Omega,\F)\to(L,\A)\) is an \(\F\)-measurable random element, taking values in \((L,\A)\)
- \(T:(\Omega,\F)\to(M,\B)\) is an \(\F\)-measurable random element, taking values in \((M,\B)\)
- \(\P_T\) is a measure on \(\B\) defined by \(\P_T(B) = \P(T^{-1}(B)) = \P(T\in B)\)
The (Lebesgue) integral used in probability is written in many different ways: \[ \int_B X \,d\P = \int_B X(\omega)\, \P(d\omega) = \int 1_BX \,d\P = \E[1_B X] = \P1_B X. \] The last notation, suggested by Pollard (2002), emphasizes the role of expectation as a linear operator and makes the connection to \(\P\) especially clear.
0.3 Conditional Probabilities and Conditional Expectations
We can define a measure \(\nu\) on \(\G\) by \[ \nu(B) = \int_B X \,d\P. \] The measure \(\nu\) is absolutely continuous with respect to \(\P\), \(\nu\ll\P\), meaning \[ \P(B)=0\implies \nu(B)=0. \] Therefore, RNT gives a \(\G\)-measurable function \(\xi:\Omega\to \RR\) so that \[ \nu(B) = \int_B \xi \,d\P = \P1_B\xi, \] i.e., \(\xi\) and \(X\) have the same integrals over \(\G\)-measurable subsets of \(\Omega\). Since any \(\G\)-measurable function \(H\) can be built from indicator functions by linearity and monotone limits, we can extend the identity \(\nu(B)=\P 1_B\xi\) to any \(H\) to obtain the fundamental identity of conditional probability \[ \P H\xi = \P HX. \] The function \(\xi\) is given special names and notations in different situations, the most common of which are conditional probability and conditional expectation.
Define \(\E[X\mid T=t]\) to be any \(\B\)-measurable function so that \[ \forall B\in\B, \ \ \dint_B \E[X\mid T=t] \,\P_T(dt) = \dint_{T^{-1}(B)} X\,d\P. \] Generalizing to \(H\), we could use Pollard’s notation to write \[ \P_T^t H(t) \E[X\mid T=t] = \P H(T)X. \] The conditional expectation of \(X\) given \(T=t\) exists by the standard RNT argument (applied to the measure on \(\B\) defined by the integral on the right) combined with the transformation theorem REF.
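To make the defining property concrete, here is a minimal numerical sketch on a finite probability space (the space, weights, and variables are all illustrative). It builds \(\xi=\E[X\mid T]\) by averaging \(X\) over the level sets of \(T\), reads off \(\varphi(t)=\E[X\mid T=t]\), and checks the fundamental identity \(\P H\xi = \P HX\) for \(H=1_B(T)\).

```python
import numpy as np

# Finite probability space Omega = {0,...,5} with weights p (illustrative values).
p = np.array([0.1, 0.2, 0.15, 0.25, 0.2, 0.1])
X = np.array([3.0, -1.0, 2.0, 5.0, 0.0, 4.0])   # a random variable
T = np.array([0, 0, 1, 1, 2, 2])                # a discrete random element

# xi = E[X | T]: on each level set {T = t}, average X with weights p.
xi = np.empty_like(X)
phi = {}                                        # phi(t) = E[X | T = t]
for t in np.unique(T):
    level = (T == t)
    phi[t] = np.sum(p[level] * X[level]) / np.sum(p[level])
    xi[level] = phi[t]

# Fundamental identity P(H xi) = P(H X) for H = 1_B(T) and any B.
for B in [{0}, {1, 2}, {0, 1, 2}]:
    H = np.isin(T, list(B)).astype(float)
    assert np.isclose(np.sum(p * H * xi), np.sum(p * H * X))
print(phi)
```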
The next table summarizes particular instances of these definitions, illustrating their connection to the fundamental equality \(\P H\xi=\P HX\).
| \(\G\) | \(H\) | \(X\) | \(\xi\) notation | \(\P H\xi\) | \(\P HX\) |
|---|---|---|---|---|---|
| **General case** | | | fundamental equality: \(\P H\xi = \P HX\) | | |
| \(\G\) | \(H\) | \(X\) | \(\xi\) | \(\dint H\xi\,d\P\) | \(\dint HX \,d\P\) |
| **Specific examples** | | | | | |
| \(\G\) | \(1_B\) | \(1_A\) | \(\P(A\mid\G)\) | \(\dint_B \P(A\mid\G)\,d\P\) | \(\P(A\cap B)\) |
| \(\G\) | \(1_B\) | \(X\) | \(\E[X\mid\G]\) | \(\dint_B \E[X\mid\G]\,d\P\) | \(\dint_B X \,d\P\) |
| \(\sigma(T)\) | \(1_{T^{-1}(B)}\) | \(X\) | \(\E[X\mid T]=\varphi(T)\) (factorization theorem); \(\E[X\mid T=t]=\varphi(t)\) (uniqueness) | \(\dint_{T^{-1}(B)} \E[X\mid T]\,d\P =\dint_{T^{-1}(B)} \varphi(T) \,d\P\) (factorization) \(=\dint_B \varphi \,d\P_T\) (transformation) \(=\dint_B \E[X\mid T=t] \,\P_T(dt)\) (definition) | \(\dint_{T^{-1}(B)} X \,d\P\) |
0.4 Regular Conditional Probability
The function \(\xi\) given by RNT depends on \(\G\) and \(X\) and is a function of \(\omega\in\Omega\). In applications, \(\G\) is usually fixed while different \(X\) are considered. To reflect these dependencies, write \(\xi(\omega)=\xi(X \mid \G)(\omega)\). In the case \(X=1_A\), we can write \(\xi(A\mid \G)(\omega)\) using Pollard’s identification of a set with its indicator function. It would be very handy if the different \(\xi\) (one for each \(A\)) could be stitched together consistently to create a function \(P:\F\times\Omega\to[0,1]\) with
- \(\forall A\in\F,\ \omega\mapsto P(A, \omega)\) is a \(\G\)-measurable function
- \(\forall \omega\in\Omega,\ A\mapsto P(A, \omega)\) is a probability measure
- the function in item 1 is a version of the function \(\xi(A\mid\G)\)
Notice that item 3 implies the function in item 1 is \(\G\)-measurable because \(\xi\) is \(\G\)-measurable. Such a \(P\) is called a regular conditional probability (rcp) function, because we are stitching together all the functions in item 1 consistently to form a regular probability measure in item 2. See Section 0.8 for other names given by different textbooks.
When \(\G=\sigma(T)\) then \(P:\F\times M\to [0,1]\) and some authors (Pollard 2002; Dellacherie and Meyer 1978) write \(P(A,t)=P_t(A)\). Hoffman-Jørgensen (2017a) writes \(P^T(A\mid t)\). In these cases we have a decomposition of \(\P\) \[ \P H(T)X = \P_T^t P_t^\omega H(T(\omega))X(\omega). \]
Conditional expectations can be computed using an rcp \(P\): \[ \E[X\mid \G](\cdot) = \int X(\omega)\,P(d\omega, \cdot). \] To see this, start with an indicator \(X=1_A\). Then, by rcp definition item 3, \[ \E[1_A\mid \G](\cdot)=\P(A\mid\G)(\cdot)=P(A,\cdot)=\int 1_A(x)P(dx, \cdot). \] The result follows for general \(X\) by approximating with simple functions and using monotone convergence; see Billingsley (2017), Thm 34.5.
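On a finite space an rcp is just a row-stochastic matrix whose rows are constant on the atoms of \(\G\), and the integral above becomes a matrix-vector product. A minimal sketch (the partition and weights are illustrative):

```python
import numpy as np

# Omega = {0,1,2,3} with weights p; G is generated by the partition {0,1}, {2,3}.
p = np.array([0.1, 0.3, 0.4, 0.2])
atoms = [[0, 1], [2, 3]]

# rcp kernel: row omega is P conditioned on the atom containing omega.
P = np.zeros((4, 4))
for atom in atoms:
    cond = np.zeros(4)
    cond[atom] = p[atom] / p[atom].sum()   # a probability measure (MEAS)
    P[atom, :] = cond                      # same row on the whole atom (G-measurable, VER)

X = np.array([1.0, 2.0, 3.0, 4.0])
EX_given_G = P @ X                         # E[X | G](omega) = integral of X against P(omega, .)
print(EX_given_G)
```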
A function satisfying conditions 1 and 2 above is called a probability kernel (or Markov kernel); here, from \((\Omega,\G)\) to \((\Omega,\F)\), or from \((M,\B)\) to \((\Omega,\F)\) when \(\G=\sigma(T)\). If the measures in item 2 are merely measures rather than probabilities, then \(P\) is called a kernel.
More generally, a second random element \(S\) is used to create different distributions \(\P_S\) on \((L,\A)\). A regular conditional distribution of \(S\) given \(T\) is a Markov kernel \(P^T_S\) from \((M,\B)\) to \((L,\A)\) so that \(P^T_S(A\mid t)\) is a conditional distribution of \(S\) given \(T\). The fundamental equality for \(P^T_S\) is \[ \forall A\in\A,\ \forall B\in\B,\quad \P(S\in A, T\in B) = \int_B P^T_S(A\mid t)\,\P_T(dt). \] This decomposition can be understood as a generalization \[ \P = \P_T P \quad \leftrightarrow \quad \P = \P_T P^T_S. \] (Apologies for the hodge-podge notation!) A special case is \(S=X\), a (real-valued) random variable.
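For a discrete joint distribution the rcd is obtained by normalizing the columns of the joint pmf, and the fundamental equality can be checked directly. A small sketch with illustrative numbers:

```python
import numpy as np

# joint[s, t] = P(S = s, T = t) on {0,1,2} x {0,1}; values are illustrative.
joint = np.array([[0.10, 0.05],
                  [0.20, 0.25],
                  [0.15, 0.25]])
P_T = joint.sum(axis=0)          # marginal distribution of T
K = joint / P_T                  # K[s, t] = P^T_S({s} | t); each column is a probability

# Fundamental equality: P(S in A, T in B) = sum_{t in B} P^T_S(A | t) P_T(t).
A, B = [0, 2], [1]
lhs = joint[np.ix_(A, B)].sum()
rhs = sum(K[A, t].sum() * P_T[t] for t in B)
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```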
0.5 Disintegrations
In addition, if \(P_t\{T\not=t\}=0\) then the decomposition of \(\P\) further simplifies: \[ \P H(T)X = \P_T^t P_t^\omega H(T(\omega))X(\omega)= \P_T^t H(t) P_t^\omega X(\omega) = \P_T^t H(t)\varphi(t) \] where, again, \(\varphi(t) = P^\omega_t X(\omega) = \E[X\mid T=t]\). In this case, Pollard calls \(\{ P_t\}_t\) a disintegration of \(\P\) with respect to \(\P_T\). Following Dellacherie and Meyer (1978), he proves existence using a general disintegration theorem that starts with a measure on a product space and decomposes it into a kernel between the factors and the marginal distribution on one factor. The general theorem is applied to the image of \(\P\) under the map \(\omega\mapsto (\omega, T\omega)\in \Omega\times M\), assuming the graph of this embedding is \(\F\otimes\B\)-measurable. Graphs are measurable when \(\B\) is separable (countably generated), which is the assumption in Rogers and Williams (1994), Theorem II.89.
0.6 Existence Results
In the general case, an rcd for \(S\) given \(T\) exists if \((L,\A)\) is a Borel space (Kallenberg 2001a, Theorem 6.3). Hoffman-Jørgensen (2017a) 10.29 describes a more complicated set of assumptions, which are satisfied for Borel \((L,\A)\) and \(\P_S\) a Radon measure.
Rogers and Williams (1994) II.89.1 works with probabilities rather than distributions (i.e., no \(S\)) on a Lusin space \(\Omega\) and relative to \(\G\subset\F\). The proof first assumes \(\Omega\) is compact metrizable and works with a countable dense subset of the continuous functions on \(\Omega\), making a version of the conditional expectation of each element using the Riesz representation theorem. From there it bootstraps: to all continuous functions by uniform convergence, to bounded measurable functions by a monotone class argument, and finally to integrable functions by truncation. To get a proper (correct support, “lives where it should”) conditional expectation, they assume that \(\G\) is a countably generated sub-sigma algebra. The latter assumption shows that the set with correct support is \(\G\)-measurable with measure 1; see also the discussion of Chang and Pollard (1997) below. Their results highlight that two assumptions drive different aspects of the conclusion:
- Existence of an rcd depends on topological properties of \(\Omega\): they essentially limit measurable functions; and
- Correct support of the rcd depends on separability of \(\G\), which makes the graph measurable and allows application of the disintegration theorem.
Dellacherie and Meyer (1978) also start with a compact metric space and its Borel sigma-field. They extend to any separable metrizable space with Borel sigma-field, assuming that \(\P\) is tight (inner regular with respect to compact sets). An rcp then exists relative to any sub-sigma algebra.
Pollard (2002) proves a disintegration theorem on a metric space with its Borel sigma-algebra and a function \(T\) into \((M,\B)\). He assumes that the graph of \(T\) is \(\F\otimes\B\)-measurable. The metric space assumption limits the topological properties of \(\Omega\), and the measurable graph assumption yields a proper rcd. Chang and Pollard (1997) Theorem 1 (existence) is similar: it works on a metric space with a sigma-finite Radon measure, where \(T\) takes values in a space with a countably generated sigma algebra that contains all singleton sets.
Chang and Pollard (1997) Theorem 2 states that if \(\lambda\), on a metric space \(\Omega\) with its Borel sigma algebra, has a \((T,\mu)\)-disintegration \(\{\lambda_t\}_t\), with \(\lambda\) and \(\mu\) each sigma-finite, then
- \(T\lambda \ll \mu\) with density \(\lambda_t \Omega\)
- \(\{\lambda_t\}\) are finite for \(\mu\)-a.e. \(t\) iff \(T\lambda\) is sigma-finite
- \(\{\lambda_t\}\) are probabilities for \(\mu\)-a.e. \(t\) iff \(\mu=T\lambda\)
- If \(T\lambda\) is sigma-finite then \((T\lambda)\{\lambda_t\Omega=0\}=0\) and \((T\lambda)\{\lambda_t\Omega=\infty\}=0\). For \(T\lambda\)-a.e. \(t\) the measures \[ \bar\lambda_t(\cdot) = \frac{\lambda_t(\cdot)}{\lambda_t\Omega}\{ 0 <\lambda_t\Omega < \infty\} \] are probabilities that give a \(T\)-disintegration of \(\lambda\).
Proof. Let \(l(t)=\lambda_t \Omega\). For non-negative measurable \(g\) \[ (T\lambda)g = \lambda^x g(Tx) = \mu^t\lambda^x_t g(Tx) = \mu^t g(t)l(t) \tag{1}\] because \(g(Tx)=g(t)\) mod \(\lambda_t\).
- If \(g\ge 0\) and \(\mu g=0\) then \(g(t)=0\) mod \(\mu\), so \(g(t)l(t)=0\) mod \(\mu\) and \((T\lambda)g=0\); thus \(T\lambda\ll\mu\), and Equation 1 exactly expresses that \(l(t)\) is the density(!).
- A measure is sigma-finite iff there exists a strictly positive real-valued function with finite integral; since \(\mu\) is sigma-finite there exists \(h > 0\) with \(\mu h<\infty\). If \(l(t)<\infty\) mod \(\mu\), the function \(g(t)=h(t) / (1+l(t))\) is strictly positive mod \(\mu\) and \((T\lambda)g = \mu^t g(t)l(t) \le \mu h < \infty\), so \(T\lambda\) is sigma-finite. Conversely, if \(T\lambda\) is sigma-finite pick \(k>0\) with \((T\lambda)k = \mu^t k(t)l(t) < \infty\) (using Equation 1); then \(k(t)l(t)<\infty\) mod \(\mu\) and, since \(k>0\), \(l(t)<\infty\) mod \(\mu\).
- If \(l(t)=1\) mod \(\mu\) then Equation 1 shows that \((T\lambda)g=\mu g\). Conversely, let \(h>0\) have \(\mu h<\infty\) (by the sigma-finite assumption). Choosing \(g(t)=h(t)\{ l(t) < 1\}\) in Equation 1 and using the assumption that \(T\lambda=\mu\) gives \[ \infty > \mu^t h(t)\{ l(t) < 1 \} = \mu^t h(t)l(t)\{ l(t) < 1 \}, \] so \(\mu^t h(t)(1-l(t))\{l(t)<1\}=0\). Since \(h>0\) and \(1-l>0\) on \(\{l<1\}\), this implies \(\mu\{ l<1\}=0\). Similarly \(\mu\{ l>1\}=0\).
- By (a), \(l\) is the density of \(T\lambda\) with respect to \(\mu\), so \((T\lambda)\{l=0\}=\mu^t l(t)\{l(t)=0\}=0\); and by (b), \(l<\infty\) mod \(\mu\), so \((T\lambda)\{l=\infty\}=\mu^t l(t)\{l(t)=\infty\}=0\). For non-negative \(f\) and \(g\), \(\lambda^x f(x)g(Tx) = \mu^t l(t)g(t)\bar\lambda^x_t f(x) = (T\lambda)^t g(t)\bar\lambda^x_t f(x)\), which is exactly the statement that \(\{\bar\lambda_t\}\) is a \((T, T\lambda)\)-disintegration of \(\lambda\).
Chang and Pollard (1997), in their appendix, prove the existence part of the disintegration theorem, mirroring Rogers and Williams (1994). EXPAND.
Breiman (1992a) uses a random element to move the problem to a “nice” space and describes this as “like passing over to representation space” to get rid of a difficulty.
0.7 Blackwell and Proper rcps
Blackwell (1956) - Lusin spaces and three weird examples (Doob’s example).
Blackwell and Ryll-Nardzewski (1963) - everywhere proper cds
Blackwell and Dubins (1975): for countably generated \(\F\), if \(\G\subset\F\) is not countably generated then no rcp given \(\G\) is proper (lives in the right place). Failure to be countably generated is implied by the existence of an extreme (taking only the values 0 and 1) probability measure on \(\G\) that is supported by no \(\G\)-atom. READ THIS ONE!
0.8 Common Notation
Table 1 describes the notation used by different authors for (r)cps, (r)cds, and disintegrations (an rcp with extra assumptions). Here we use the abbreviations regular, conditional, probability, and distribution in various combinations. Rcps apply to \(\P\), rcds apply to \(\P_X\), so \(X\) allows you to vary the probability. Per Breiman (1992a), \(X\) can be used to move to a “nice” space, ensuring that an rcd exists. Existence is determined by properties of the range measure space; see Section 0.6. In short:
- rc probability \(\leftrightarrow\) \(\P\) on \(\Omega\)
- rc distributions \(\leftrightarrow\) conditional distribution of \(X\) on \(\Omega\)
| Reference | Nomenclature | Disintegration |
|---|---|---|
| Billingsley (2017) | cd, Thm 33.3 | n/a |
| Breiman (1992a) | rcp, 4.3 Def 4.27; rcd for rv | n/a |
| Dellacherie and Meyer (1978) | c “laws” | Ch III.70 w support |
| Dudley (2018) | rcp, 10.2; rcd p. 342 | n/a |
| Hoffman-Jørgensen (2017a) | rcd, II.10.1 | II.10.30 |
| Kallenberg (2001a) | rcp Ch. 6; rcd \(S\mid \B\) | Thm 6.4 |
| Parthasarathy (1967) | rcpd given \(T\), V.8 p. 146 | |
| Pollard (2002) | rcd, remark w/o | cpd Ch 5.2<4> w support |
| Rogers and Williams (1994) | rcp, II.42.3 | see II.89.1 iii, iv |
| Williams (1991a) | rcp 9.9 | n/a |
https://stats.stackexchange.com/questions/531705/regular-conditional-distribution-vs-conditional-distribution
If \(X\) is \(\F\)-measurable and \(Y\) is \(\G\)-measurable and \(f(X,Y)\) is jointly measurable and integrable then \[ \E[f(X,Y)\mid \G](\omega) = \int f(x, Y(\omega))\,\P_\omega(dx), \] where \(\P_\omega\) is an rcd of \(X\) given \(\G\); compare the “useful rules” in Section 5.4.
| Concept | Notation | Conditions |
|---|---|---|
| conditional probability | \(\P(A\mid\G)(\cdot)\) | \(\nu(B)=\P(B\cap A)\) for \(B\in\G\), \(\nu\ll\P\); \(\nu(B)=\int_B \P(A\mid\G)\,d\P\) for \(B\in\G\) |
| regular conditional probability defined by \(\G\) | \(p(\cdot, A)\) | MEAS: \(\forall\omega\in\Omega\), \(B\mapsto p(\omega, B)\) is a probability measure on \(\F\). VER\((\F)\): \(\forall A\in\F\), \(\omega\mapsto p(\omega, A)\) is a version of \(\P(A\mid\G)\). |
| regular conditional probability defined by \(\G=\sigma(T)\) | \(p_t(A)\), \(P^T(A\mid t)\) | \(T:(\Omega, \F)\to(E,\EE)\) random element. MEAS: \(\forall t\in E\), \(B\mapsto p(t, B)\) is a probability measure on \(\F\). VER\((\P,\sigma(T))\): \(\forall A\in\F\), \(t\mapsto p(t, A)\) is a version of \(\P(A\mid\sigma(T))\). |
| (regular) conditional distribution for rv \(X\) and \(\P_X\) | \(\mathrm{rcdf}(\P,\G)\) | \(X:(\Omega,\F)\to(\RR,\BB(\RR))\) is a rv. MEAS: as rcp. VER\((\P_X, \sigma(T))\): as rcp. |
| (regular) conditional distribution for random element \(S:(\Omega,\F)\to(L,\A)\) defining \(\P_S\) | \(\mathrm{rcd}(\P_S,\sigma(T))\) | \(T:(\Omega,\F)\to(M,\B)\) random element. MEAS: \(\forall t\in M\), \(A\mapsto p(t, A)\) is a probability measure on \(\A\). VER\((\P_S,\sigma(T))\): \(\forall A\in\A\), \(t\mapsto p(t, A)\) is a version of \(\P_S(A\mid T)\). |
| disintegration: as above, \(\mu=T\P\) | \(\P=\mu P_t\) | \(P_t\{T\not=t\}=0\); \(\P^x f(x, Tx) = \mu^tP^x_t\,f(x,t)\) |
1 Notation
Pith. These notes follow Hoffman-Jørgensen (2017b). Thus, the general set up is a probability space \((\Omega,\F,\P)\) with random elements (variables) \(S:(\Omega,\F)\to(L,\A)\) and \(T:(\Omega,\F)\to(M,\B)\). The variable \(S\) is used to create a distribution \(S\P=\P_S\) and \(T\) is the observed information. We are interested in predicting \(S\) given \(T\). The most common special case is \(S=T=\) the identity, \((L,\A)=(\Omega,\F)\) and \(M=\Omega\) but \(\B\subset\F\).
Periphery. Probabilists, analysts, set theorists, and statisticians all write about probability, and they tend to use slightly different notation. These differences can make it hard to learn probability.
Compounding these differences, probability needs notation for spaces, sigma-algebras, measures, subsets, and random variables. The one thing everyone agrees on is that \(\BB\) denotes the Borel sets, the sigma algebra generated by the open sets when the sample space has a topology. Table 3 provides a summary of notation across some common textbooks.
We mix and match notation for expectations and rcps. Hoffman-Jørgensen’s notation \(P^T_S(\cdot\mid\cdot)\) makes all the components of the rcp clear. However, Pollard’s de Finetti-inspired notation, which dispenses with the distinctions between sets and indicator functions and between probabilities and expectations, is very concise and helps emphasize the parallels between probabilities and expectations (Pollard 2002). Pollard uses \(P_t\) for \(P^T(\cdot\mid t)\). Conditional probabilities/expectations become \(P_t f\) or \(P_t B\), and the decomposition becomes \(\mu^t P^x_t f(x,t)\).
Throughout, we use \(\mu f=\int f\,d\mu=\int f(x)\,\mu(dx)\) and allow \(\mu A=\mu 1_A\).
2 Motivation
The general theory of conditional probability builds on the elementary discrete theory, which conditions on sets of positive measure.
S1. Let \((\Omega, \F, \P)\) be a probability space, and let \(A\in\F\) be an event with \(\P(A)>0\). The conditional probability of \(B\) with respect to \(A\) is
\[ \P(B\mid A) = \frac{\P(AB)}{\P(A)}. \]
If \(X\) is an integrable random variable then
\[ \E[X\mid A]=\frac{\E[X1_A]}{\P(A)}. \]
S2. Let \(\A=\{A_1,A_2,\dots\}\) be a countable decomposition of \(\Omega\) into sets with \(\P(A_i)>0\). The conditional probability with respect to \(\A\) is the random variable \(\P(B\mid\A)(\cdot)\) defined by
\[ \P(B\mid \A)(\omega) = \sum_i \P(B\mid A_i)1_{A_i}(\omega). \]
Thus
\[ \P(B\mid \A)(\omega) = \frac{\P(BA_\omega)}{\P(A_\omega)} \]
where \(A_\omega\) is the member of the decomposition containing \(\omega\). This definition makes it clear that if \(\A=\{\Omega\}\) then \(\P(B\mid\A)=\P(B)\). Clearly
\[ \E\P(B\mid\A) = \P(B). \]
The random variable \(\P(B\mid\A)\) is measurable wrt \(\G=\sigma(\A)\) and is also denoted \(\P(B\mid \G)\).
S3. Countable decompositions are naturally generated by random variables taking a finite or countably infinite number of different values. If \(X\) takes the values \(x_i\), define \(A_i=\{X=x_i\}\). It is natural to define the conditional probability wrt \(X\) using \(\P(\cdot\mid\A)\). It is denoted \(\P(\cdot\mid X)\).
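A quick Monte Carlo sketch of S2–S3 (the space and events are illustrative): condition on the countable decomposition generated by a discrete \(X\) and check the tower property \(\E\,\P(B\mid X) = \P(B)\).

```python
import numpy as np

rng = np.random.default_rng(0)
omega = rng.uniform(size=100_000)        # Omega = [0,1] with uniform P, sampled

X = np.floor(3 * omega)                  # discrete rv taking values 0, 1, 2
B = omega < 0.5                          # an event B

# P(B | X)(omega) = P(B & A_i) / P(A_i) on each atom A_i = {X = x_i}.
PB_given_X = np.empty_like(omega)
for x in np.unique(X):
    Ai = (X == x)
    PB_given_X[Ai] = (B & Ai).mean() / Ai.mean()

# Tower property: E[ P(B | X) ] = P(B); both are approximately 0.5.
print(PB_given_X.mean(), B.mean())
```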
S4. What about conditional probabilities wrt sets of measure zero? These occur naturally when conditioning on the outcome of a continuous random variable.
H1. Think about varying the conditioning set \(A\) systematically. Let \(Y=\{y_1,\dots,y_n\}\) be a discrete space with the subset sigma algebra. Let \(T:\Omega\to Y\) map \(A_i\) to \(y_i\). The values of \(Y\) split up \(\Omega\), providing the sets over which we wish to condition. Now let \(B\in\F\) and \(A\subset Y\), and define a measure on \(Y\) by \(\nu_B(A)=\P(B\cap T^{-1}(A))\). Clearly \(\nu_B\ll T\P\) as measures on \(Y\), using Pollard notation \(T\P(A)=\P(T^{-1}(A))\). Therefore RNT gives an integrable function \(p_B\) on \(Y\) so that \[ \nu_B(A)=\P(B\cap T^{-1}(A)) = \int_A p_B(y)d(T\P)(y). \] Call \(p_B(y)=p(B, y)\) the conditional probability of \(B\) given \(y\) or given that \(T=y\).
If \(A\) has \(T\P(A)=\P(T^{-1}(A))>0\) we can divide to get \[ \P(B\mid T^{-1}(A))=\frac{\P(B\cap T^{-1}(A))}{\P(T^{-1}(A))} =\frac{1}{\P(T^{-1}(A))}\,\int_A p(B, y)\,d(T\P)(y) \] Halmos then says
Since the left term is the conditional probability of \(B\) given \(T^{-1}(A)\), it is formally possible that as “\(A\) shrinks to \(y\)” the left term should tend to the conditional probability of \(B\) given \(y\) and the right term should tend to the integrand \(p(B, y)\). The RNT is a rigorous substitute for this rather shaky “difference quotient” approach.
He then shows that the collection of functions \(p(\cdot, y)\) acts like a measure for each \(y\); e.g., they lie between 0 and 1 and are countably additive for \(T\P\)-a.e. \(y\). For example, if \(B_n\) is a disjoint sequence in \(\Omega\) then \[ \begin{align} \int_A p(\bigcup_n B_n, y)\,d(T\P)(y) &= \P((\bigcup_n B_n) \cap T^{-1}(A)) \\ &= \sum_n \P( B_n \cap T^{-1}(A)) \\ &= \sum_n \int_A p(B_n, y)\,d(T\P)(y) \\ &= \int_A \sum_n p(B_n, y)\,d(T\P)(y) \end{align} \] and apply uniqueness from RNT.
H2. Thus \(p(\cdot, y)\) behaves like a measure. Similar proofs show it is monotone, increases from zero to 1, etc. In all these proofs the exceptional sets depend on the particular \(B_n\) under consideration, and “it is in general incorrect to conclude that \(p(\cdot, y)\) is a measure a.e. \(y\).”
H3. Extending \[ \begin{align} \int_B p(A,y)d(T\P)(y) &= \P(A\cap T^{-1}(B)) \\ &= \int_{T^{-1}(B)} 1_A(x)\,d\P(x) \end{align} \] from \(1_A\) to any integrable \(f\) on \(\Omega\), we can consider the integral \[ \nu(B)=\int_{T^{-1}(B)} f(x)\,d\P(x) \] for measurable \(B\subset Y\) as a measure on subsets of \(Y\). Since \(\nu\ll T\P\) (a \(T\P\)-null set pulls back to a \(\P\)-null set), we can apply RNT to get an integrable function \(e(f, y)\) on \(Y\) so that \[ \int_{T^{-1}(B)} f(x)\,d\P(x) = \int_{B} e(f, y)\,d(T\P)(y) \] for measurable \(B\). Call \(e(f, y)\) the conditional expectation of \(f\) given \(y\). We would like \[ e(f, y) = \int f(x)\, p(dx, y) \] but \(p(dx,y)\) isn’t a measure for a.e. \(y\). We can say that if \(f\) is integrable on \(Y\), then \(f\circ T\) is integrable on \(\Omega\) and \(e(f\circ T, y) = f(y)\) for \(T\P\)-a.e. \(y\). That is, the conditional expectation lives where it is supposed to.
Notice that Halmos starts with conditional probability by considering \(f=1_A\) and then considers general \(f\) to get conditional expectations. In contrast, other approaches start with general \(f\) to get conditional expectations immediately and then specialize to \(f=1_A\) for conditional probabilities.
3 Kolmogorov conditional probability
Kolmogorov’s axiomatic treatment of probability defines probability based on three axioms: non-negativity, normalization, and countable additivity. Kolmogorov’s axioms apply to a wide range of probabilistic scenarios, including both discrete and continuous cases, allowing for generalization and application in various fields.
3.1 Conditional Expectation wrt Sets of Measure Zero
S5. Let \(\G\subset\F\) be a sub-sigma algebra. As usual splitting \(X=X^+-X^-\) lets us reduce to the case of non-negative integrable variables. The conditional expectation of \(X\) wrt \(\G\) is a non-negative (extended) random variable \(\E[X\mid\G]\) such that
- \(\E[X\mid\G]\) is \(\G\)-measurable;
- for every \(A\in\G\)
\[ \int_A X(\omega)\P(d\omega) = \int_A \E[X\mid\G](\omega)\P(d\omega). \]
We generally write the integral on the left simply as \(\int_A Xd\P\).
5.R-W. Rogers and Williams p. 138 say
An experiment has been performed. The only information available to you regarding which sample point \(\omega\) has been chosen is the set of values \(Z(\omega)\) for every \(\G\)-measurable random variable \(Z\), or, equivalently, the values \(1_G(\omega)\) for every \(G\in\G\). Then \(\E[X\mid \G](\omega)\) is regarded as (a.s. equal to) the “expected value of \(X(\omega)\) given this information.”
6. Existence. The set function on \(\G\) \[ \Q(A) = \int_A Xd\P \] is a measure on \((\Omega, \G)\) and is absolutely continuous wrt \(\P\) (if \(\P(A)=0\) then \(\Q(A)=0\)). Therefore the Radon-Nikodým theorem gives a non-negative random variable \(\E[X\mid\G]\) such that \[ \Q(A) = \int_A \E[X\mid\G] d\P. \] It is unique up to sets of \(\P\)-measure zero. Thus there are many “variants” of the conditional expectation. Conditional expectations are Radon-Nikodým derivatives \[ \E[X\mid\G] = \frac{d\Q}{d\P}. \]
3.2 Conditional Probability
S7. The conditional probability of an event \(A\in\F\) with respect to \(\G\) is \(\P(A\mid\G) = \E[1_A\mid\G]\). It inherits two properties: it is \(\G\)-measurable and \[ \P(A\cap B) = \int_B \P(A\mid\G)d\P \] for all \(B\in\G\).
S8. If \(\G=\sigma(Y)\) is generated by a random variable \(Y\) then \(\E[\cdot\mid\G]\) (resp. \(\P(\cdot\mid\G)\)) is written \(\E[\cdot\mid Y]\) ( \(\P[\cdot\mid Y]\)) and is called the conditional expectation (probability) wrt \(Y\).
S9. These definitions extend those in the finite case.
S10. Conditional expectations have all the “usual properties”: monotone, Jensen, linear, tower. In addition
- If \(\G_1\subset \G_2\) then \(\E[\E[X\mid \G_2]\mid \G_1]=\E[X\mid \G_1]\) and \(\E[\E[X\mid \G_1]\mid \G_2]=\E[X\mid \G_1]\).
- If \(X\) is independent of \(\G\) (i.e., independent of \(1_B\) for \(B\in\G\)) then \(\E[X\mid\G]=\E[X]\) (a.s.).
- If \(X\) is \(\G\)-measurable then \(\E[XY\mid\G]=X\E[Y\mid \G]\) (a.s.).
- Bounded convergence, monotone convergence, Fatou, countably additive.
10.1. If \(A\in\G\) then \(\P(A\mid\G) = 1_A\): knowing the outcome of the “experiment” \(\G\) tells you whether \(A\) occurs. Note that \(1_A\) is \(\G\)-measurable and satisfies \(\P(A\cap B)= \int_B 1_A\,d\P\) for all \(B\in\G\), showing it satisfies the requirements to be a conditional probability.
10.2. If \(\G=\{\emptyset, \Omega\}\) then \(\P(A\mid\G)\) is a constant \(c\), and \(\P(A)=\P(A\cap\Omega)=\int_\Omega \P(A\mid\G)\,d\P=c\), so the constant must equal \(\P(A)\). Nothing is learnt from this \(\G\)-experiment.
10.3. If the set \(A\) is independent of \(\G\) then \(\P(A\cap G)=\P(A)\P(G)\) for each \(G\in\G\). Since then \(\P(A\cap G)= \int_G \P(A)\,d\P\), it follows that \(\P(A\mid\G)=\P(A)\) with probability 1.
3.3 Conditional probability and conditional expectation comparison
The next table compares the Kolmogorov RNT approach to conditional probabilities and expectations showing the transition via \(A\leftrightarrow 1_A\).
| Step | Probability | Transition | Expectation |
|---|---|---|---|
| Given ingredients: a sub-sigma algebra and a set or rv | \(\G\subset\F\), \(A\in \F\) | \(\G\subset\F\), \(1_A:\Omega\to\{0,1\}\) | \(\G\subset\F\), \(X:\Omega\to\RR\) |
| Consider the measure \(\nu\) on \(B\in\G\), \(\nu\ll\P\) | \(\nu(B)=\P(B\cap A)\) | \(\nu(B)=\int_B 1_A\,d\P\) | \(\nu(B)=\int_B X\,d\P=\E[X1_B]\) |
| RNT gives a function \(\xi\) where, \(\forall B\in\G\), \(\int_B \xi\,d\P\) equals | \(\P(B\cap A)\) | both | \(\int_B X\,d\P\) |
| and \(\xi\) is called | \(\P(A\mid\G)\) | \(\E[1_A\mid\G]\) | \(\E(X\mid\G)\) |
| Using \(\xi\), \(\nu(B)=\) | \(\int_B \P(A\mid\G)(\omega)\,\P(d\omega)\) | same | \(\int_B \E(X\mid\G)(\omega)\,\P(d\omega)\) |
| Countable additivity is proved by converting back to \(\P\) and appealing to RNT uniqueness. Other properties are handled similarly. | For disjoint \(A_i\), \(A=\bigcup_i A_i\): \(\int_B \P(\bigcup_i A_i\mid\G)\,d\P =\P(B\cap \bigcup_i A_i) =\P(\bigcup_i (B\cap A_i)) =\sum_i \P(B\cap A_i) =\P(B\cap A) =\int_B \P(A\mid\G)\,d\P\) | | For \(X_i\ge 0\), \(X=\sum_i X_i\) integrable: \(\int_B \E(\sum_i X_i\mid\G)\,d\P =\int_B \sum_i X_i\,d\P =\sum_i\int_B X_i\,d\P =\sum_i\int_B \E[X_i\mid \G]\,d\P =\int_B \sum_i\E[X_i\mid \G]\,d\P\) |
3.4 Expectation Conditional on Another Random Variable
S11. Since the function \(\E[X\mid Y]\) is \(\G_Y=\sigma(Y)\) measurable, it factors through \(Y\) (Thm II.4.3) giving a Borel function \(m\) so that
\[ \E[X\mid Y](\omega) = m(Y(\omega)) \]
for all \(\omega\). Denote \(m(y)\) by \(\E[X\mid Y=y]\) and call it the conditional expectation of \(X\) wrt the event \(\{Y=y\}\). Thus for \(A\in\G_Y\)
\[ \int_A Xd\P = \int_A \E[X\mid Y]d\P = \int_A m(Y)d\P \]
Note shorthand notation: \(\int_A m(Y)d\P=\int_A m(Y(\omega))\P(d\omega)\).
S12. Let \(P_Y(y)=\P(Y\le y)\) be the distribution function of \(Y\). Then by the usual change of variable formula
\[ \int_{\{Y\in B\}} X\,d\P = \int_B m(y)\,dP_Y(y) \]
for \(B\) Borel in \(\RR\).
S13. The conditional probability of \(A\in\F\) given \(Y=y\) is \(\P(A\mid Y=y)=\E[1_A\mid Y=y]\). This notation extends the discrete case and agrees with calculations performed with a bivariate density.
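When \((X,Y)\) has a joint density, \(m(y)=\E[X\mid Y=y]\) can be computed by integrating against the normalized slice of the density at \(y\). A numerical sketch using a discretized bivariate normal (grid size, truncation, and \(\rho\) are arbitrary choices), for which the exact answer is \(\E[X\mid Y=y]=\rho y\):

```python
import numpy as np

# Unnormalized bivariate normal density with correlation rho on a grid.
rho = 0.6
xs = np.linspace(-5, 5, 401)
ys = np.linspace(-5, 5, 401)
XX, YY = np.meshgrid(xs, ys, indexing="ij")
f = np.exp(-(XX**2 - 2 * rho * XX * YY + YY**2) / (2 * (1 - rho**2)))

# m(y) = E[X | Y = y] = (int x f(x,y) dx) / (int f(x,y) dx), via Riemann sums.
dx = xs[1] - xs[0]
m = ((XX * f).sum(axis=0) * dx) / (f.sum(axis=0) * dx)

y0 = 1.5
i = np.argmin(np.abs(ys - y0))
print(m[i], rho * y0)   # numerically close: E[X | Y = y] = rho * y here
```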
3.5 The problem of uncountably many measure zero sets
S14. If \(B_1,B_2,\dots\) is a sequence of pairwise disjoint sets then
\[ \P(\sum B_i\mid \G) = \sum \P(B_i\mid \G)\quad\text{(a.s.)}. \]
However, we cannot assume that \(\P(\cdot\mid \G)(\omega)\) is a measure on \(\F\), because the equation only holds a.s. and there are uncountably many sequences of \(B_i\). For each sequence countable additivity holds a.e., but we would like it to hold a.e. simultaneously over all sequences. That is, we want to interpret \(\P(B\mid \G)(\omega)\) as a function \(P(\omega, B)\) on \(\Omega\times \F\) giving a measure for each \(\omega\) and a \(\G\)-measurable function for each \(B\). If that were so, then we could write
\[ \E[X\mid \G](\cdot) = \int_\Omega X(\omega) P(\cdot, d\omega). \]
4 Kernels
\((\Omega,\F,\P)\) is a probability space. \(\G\) is a sub-sigma algebra of \(\F\). A kernel combines two random elements
- \(T:(\Omega, \F)\to(E,\EE)\), and
- \(X:(\Omega,\F)\to(S,\S)\).
\(T\) defines what you observe and it gives a partition of \(\Omega\) into measurable subsets \(\{T=t\}\). \(X\) is what you care about or are trying to estimate.
A kernel or random measure from \(E\) to \(S\) is a function \(p:E\times \S\to [0,1]\) satisfying
- MF: for fixed \(A\in\S\), the function \(t\mapsto p(t, A)\) is \(\EE\)-measurable.
- MEAS: for each \(t\in E\), \(A\mapsto p(t, A)\) is a (probability) measure on \(\S\). In applications \(t\mapsto p(t,A)\) will be a version of a conditional probability.
If the measures in MEAS are probabilities then the kernel is a probability kernel or Markov kernel.
A kernel determines an operator on suitable functions on \(S\) via \[ f \mapsto pf(t) = \int f(s)\, p(t, ds) = \E[f(X)\mid T=t] . \]
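On finite spaces this operator is a matrix-vector product: the kernel is a row-stochastic matrix and \(pf\) is one number per \(t\). A minimal sketch with illustrative values:

```python
import numpy as np

# A Markov kernel p from E = {0,1,2} to S = {0,1}: row t is the measure p(t, .).
p = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])      # each row sums to 1 (probability kernel)

f = np.array([10.0, -2.0])      # a function on S
pf = p @ f                      # (pf)(t) = integral of f against p(t, .)
print(pf)                       # read as E[f(X) | T = t] for t = 0, 1, 2
```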
5 Regular conditional probabilities
5.1 Regular Conditional Probability (rcp)
S15. A function \(P(\omega, B)\) on \(\Omega\times \F\) is a regular conditional probability wrt \(\G\) if it is a probability kernel from \((\Omega,\G)\) to \((\Omega,\F)\) where for each \(B\in\F\) the function \(\omega\mapsto P(\omega, B)\) is a variant of the conditional probability \(\P(B\mid\G)\), i.e., \(P(\omega, B)=\P(B\mid\G)(\omega)\) a.s.
The next theorem shows why rcps are useful.
16. Theorem. Let \(P\) be a regular conditional probability wrt \(\G\), and let \(X\) be an integrable random variable. Then
\[ \E[X\mid \G](\omega) = \int_\Omega X(\nu) P(\omega, d\nu) \quad\text{(a.s.)}. \]
Proof. If \(X=1_B\) for \(B\in\F\), unraveling the definitions shows the claim becomes simply
\[ \begin{align*} \E[1_B\mid \G](\omega) &= \P(B\mid\G)(\omega) \\ &= \int_\Omega 1_B(\nu) \P(d\nu\mid \G)(\omega) \\ &= \int_\Omega 1_B(\nu) P(\omega, d\nu) \end{align*} \]
by part 2 of the rcp definition. The general result holds by approximating using simple functions and using the limit properties of conditional expectations.
5.2 Regular Conditional Distributions and Regular Distribution Functions
S17. A regular conditional probability is determined by a probability and a sub-sigma algebra. Often we want to adjust the probability to the distribution of a random variable (retaining the same sub-sigma algebra). The next two definitions handle these cases.
- Let \((E, \EE)\) be a measurable space, \(X\) a random variable on \(\Omega\) with values in \(E\), and \(\G\subset \F\) a sub-sigma algebra. A function \(Q:\Omega\times \EE\to\RR\) is a regular conditional distribution (rcd) of \(X\) wrt \(\G\) if it is a probability kernel from \((\Omega,\G)\) to \((E,\EE)\) where for each \(B\in\EE\) the function \(\omega\mapsto Q(\omega, B)\) is a variant of the conditional probability \(\omega\mapsto \P(X\in B\mid \G)(\omega)\), i.e.,
\[ Q(\omega, B) = \P(X\in B\mid\G)(\omega)\quad\text{(a.s.)}. \]
- Let \(X\) be a random variable. A function \(F:\Omega\times\RR\to\RR\) is a regular distribution function (rdf) for \(X\) with respect to \(\G\) if
- \(F(\omega, x)\) is a distribution function on \(\RR\) for each \(\omega\);
- \(F(\omega, x) = \P(X\le x\mid\G)(\omega)\) a.s. for each \(x\in\RR\).
The distribution function only measures events \(\{X\le x\}\) whereas the distribution measures any \(B\in\EE\).
S18. Theorem. Let \(X\) be a random variable. Then \(X\) has a regular distribution function and a regular conditional distribution function with respect to \(\G\).
Proof. It is important that \(X\) takes values in \(\RR\), which has a countable dense subset, the rationals. It is also important that a distribution function is increasing and continuous from the right. This means it is specified by its values at the rationals. The proof starts by constructing a regular distribution function. For each rational, it uses any variant of the conditional probability to define distribution functions depending on \(\omega\). These functions fail to be monotone, have right limits, start at 0, or end at 1 only on sets of measure zero. There are only countably many of these sets (they are indexed by rationals), so their union is also of measure zero, and on its complement we have a distribution function. On the exceptional set, we just select any fixed distribution function (to satisfy 2.i). Condition 2.ii follows by continuity and the fact that conditional probabilities respect limits. To extend to a regular conditional distribution, integrate the rdf just constructed over \(B\). Since all Borel sets can be constructed from \(\{X\le x\}\), where the rdf gives the right answer, the rcd will too, by a monotone class argument.
Here are a few more details to give the flavor. For each rational \(r\in\RR\), define
\[ F_r(\omega) = \P(X\le r\mid\G)(\omega) \]
where \(\P(X\le r\mid\G)=\E[1_{X\le r}\mid\G]\) is any variant of the conditional probability wrt \(\G\) of the event \(\{X\le r\}\). Let \(\{r_i\}\) be an enumeration of the rationals. If \(r_i<r_j\) then
\[ \P(X\le r_i\mid\G) \le \P(X\le r_j\mid\G)\quad\text{(a.e.)} \]
and therefore if \(A_{ij}=\{\omega\mid F_{r_j}(\omega) < F_{r_i}(\omega) \}\), \(A=\bigcup A_{ij}\), we have \(\P(A)=0\). Off this set we have a monotonic function. Play a similar game with
\[ B_i=\left\{ \omega\mid\lim_{n\to\infty} F_{r_i+1/n}(\omega)\not=F_{r_i}(\omega) \right\}, \quad B=\bigcup_i B_i \]
to ensure right limits and
\[ C = \left\{ \omega\mid \lim_{n\to\infty} F_n(\omega) \not=1 \right\} \bigcup \left\{ \omega\mid \lim_{n\to-\infty} F_n(\omega) >0 \right\} \]
to ensure appropriate left/right tail behavior. Then define
\[ F(\omega, x) = \begin{cases} \lim_{r\downarrow x} F_r(\omega) & \omega\not\in A\cup B\cup C \\ G(x)& \omega\in A\cup B\cup C \end{cases} \]
where \(G\) is any distribution function. \(F(\omega, \cdot)\) is the distribution function required by 2.i. It takes the correct values: 2.ii holds by construction at the rationals, and then for all \(x\) because both \(F\) and the underlying conditional probability are right-continuous.
Using \(F\), define a contender rcd as a Lebesgue-Stieltjes integral
\[ Q(\omega, B) = \int_B F(\omega, dx). \]
Standard nonsense (a monotone class argument) shows this satisfies the rcd conditions.
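The countable-rationals construction can be mimicked numerically: with \(\G\) generated by a finite partition, \(F(\omega,x)=\P(X\le x\mid\G)(\omega)\) is an atom-by-atom empirical distribution function, and monotonicity in \(x\) holds exactly on every atom. A Monte Carlo sketch (the partition and the variable are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
omega = rng.uniform(size=200_000)          # Omega = [0,1] with uniform P, sampled
X = omega**2                               # a random variable
atom = (omega > 0.5).astype(int)           # G generated by {omega <= 1/2}, {omega > 1/2}

def F(a, x):
    """P(X <= x | G) evaluated on atom a: an average over that atom."""
    A = (atom == a)
    return ((X <= x) & A).mean() / A.mean()

xs = np.linspace(0, 1, 11)
for a in (0, 1):
    Fs = [F(a, x) for x in xs]
    assert all(u <= v for u, v in zip(Fs, Fs[1:]))   # monotone in x, exactly
    print(a, np.round(Fs, 3))                        # increases from 0 to 1
```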
Billingsley (2017) (p.460) calls this a conditional distribution of \(X\) given \(\G\). The proof is the same.
5.3 Extension to Borel Spaces
S19. A measurable space \((E, \EE)\) is a Borel space if there is a bi-measurable injection \(\phi:(E, \EE)\to(\RR, \BB(\RR))\). (Equivalently, it is isomorphic to a Borel subset of a complete separable metric space, i.e., a Polish space.) Complete separable metric spaces, \(\RR^n\) and \(\RR^\infty\) are all Borel spaces.
20. Theorem. Let \(X\) be a random element with values in a Borel space \((E, \EE)\). Then there is a rcd of \(X\) wrt \(\G\).
Proof. Apply the previous result to the real-valued variable \(\phi\circ X\) and pull back.
Br.1. Breiman (1992b) explains that by moving the problem to \((E, \EE)\) you get rid of the piling up of measure zero sets, and can guarantee the existence of an rcd. (It is called a distribution because \(X\P=\P(X^{-1}(\cdot))\) is the distribution of \(X\). You need the random variable in addition to the original measure \(\P\). See the discussion in Kallenberg p. 47, where \(X\) is called a random element (or variable if real-valued).)
5.4 Hoffmann-Jørgensen’s “Useful Rules”
Hoffman-Jørgensen (2017a), p.121 shows conditional expectations generally behave as you expect, provided an rcp of \(S\) given \(T\) exists.
Let \((\Omega, \F,\P)\) be a probability space and let
- \(S,T\) be random elements on \(\Omega\) with values in \((L,\A)\), and \((M,\B)\) respectively.
- \(\psi:(L\times M, \A\otimes \B)\to\RR\) so that \(\psi(S,T)\) is \(\P\)-integrable.
- \(\phi:(L,\A)\to \RR\) is measurable with \(\phi(S)\) \(\P\)-integrable.
- \(X\) be a random variable on \(\Omega\).
Let \(P^T_S(\cdot\mid\cdot):\A\times M\to [0,1]\) be an rcp for \(S\) given \(T\). Thus \(P^T_S(A\mid t)\) is
- a probability measure in \(A\) for all \(t\in M\),
- is \(\B\)-measurable as a function of \(t\) for all \(A\in \A\) (probability kernel), and
- satisfies the rcp condition \[ \P(S\in A, T\in B) = \int_B P^T_S(A\mid t)\,\P_T(dt) \tag{2}\] for all \(A\in\A\) and \(B\in\B\).
Then, the useful rules say that \[ \begin{align} a.\quad & \P_S=\P_T P^T_S \label{eq-31} \\ b.\quad & \E[\psi(S,T)] = \int_M \P_T(dt)\int_L\psi(s,t)P^T_S(ds\mid t) \label{eq-32} \\ c.\quad & \E[X\mid T=t] = \int_L \E[X\mid S=s,T=t]P^T_S(ds\mid t) \label{eq33} \\ d.\quad & \E[\psi(S,T)\mid T=t] = \E[\psi(S,t) \mid T=t] \label{eq34} \\ e.\quad & \E[\phi(S) \mid T=t] = (P^T_S(\phi))(t) = \int_L \phi(s)P^T_S(ds\mid t) \label{eq-35} \\ f.\quad & \E[\psi(S,T)\mid T=t] = \E[\psi(S,t)\mid T=t] \label{eq36} \end{align} \tag{3}\]
provided that we use the rcp to compute conditional expectations in the last formula. Let \(\Gamma(S,T)=\{ (S(\omega), T(\omega))\mid \omega\in\Omega \}\) be the graph of \(T\) against \(S\). If \(\Gamma(S,T)\in\A\otimes\B\) then \[ \begin{align} g.\quad & P^T_S(S(T^{-1}(t))\mid T=t)=1\ P_T\text{\ a.e.\ } t\in M. \end{align} \]
In particular, if \(P^T(F\mid t)\) is an rcp of \(\P\) given \(T\), and if the graph \(\Gamma(T)=\{ (\omega, T\omega) \mid \omega\in\Omega \}\) is \(\F\otimes\B\)-measurable, then \[ \begin{align} h.\quad & P^T(T=t\mid t) = 1 \end{align} \] for \(P_T\) a.e. \(t\in M\).
The measurability requirement is discussed further in XX; it cannot be dropped, see Section 7.2.2.
Proof.
- This is just Equation 2 applied to \(B=M\).
- By Equation 2, this holds for \(\psi=1_{A\times B}\) for \(A\in\A\), \(B\in\B\). Then use standard integration arguments.
- Given \(B\in \B\), set \(\psi(s,t)=1_B(t)\E[X\mid S=s,T=t]\) in b. Then \[ \begin{align} \int_{T^{-1}(B)}X\,d\P &= \E[1_B(T)X] \\ &= \E[\E[1_B(T)X \mid S,T]] \\ &= \E[1_B(T)\E[X \mid S,T]] \\ &= \E[\psi(S,T)] \\ &= \int_B \P_T(dt)\int_L \E[X\mid S=s,T=t]P^T_S(ds\mid t) \end{align} \] so the result follows from the defining property of a conditional expectation.
- By definition of rcp, \(\psi(s,t)=\E[\psi(S,T) \mid S=s,T=t]\) so d follows from c.
- Same from d.
- Same.
- Let \(L(t)=S(T^{-1}(t))=\{ S(\omega)\mid T(\omega)=t \}\) and \(\psi=1_{\Gamma(S,T)}\). Then \(\psi(S(\omega), T(\omega))=1\) and \(\psi(S(\omega), t)=1_{L(t)}(S(\omega))\), \(\forall\,\omega,\ t\in M\). Hence g follows from e. Notice that you need the graph to be measurable in order that \(\psi\) is \(\A\otimes \B\)-measurable.
- Follows from g setting \(S(\omega)=\omega\).
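Rules a. and e. can be sanity-checked numerically for a discrete pair \((S,T)\), where the rcp is the column-normalized joint pmf. A sketch with illustrative numbers:

```python
import numpy as np

joint = np.array([[0.12, 0.18],
                  [0.28, 0.02],
                  [0.10, 0.30]])       # joint[s, t] = P(S = s, T = t), illustrative
P_T = joint.sum(axis=0)                # marginal of T
P_S = joint.sum(axis=1)                # marginal of S
K = joint / P_T                        # K[s, t] = P^T_S({s} | t)

# Rule a: P_S = P_T P^T_S, i.e. P_S(s) = sum_t K[s, t] P_T(t).
assert np.allclose(P_S, K @ P_T)

# Rule e: E[phi(S) | T = t] = integral of phi against P^T_S(. | t).
phi = np.array([1.0, 4.0, 9.0])        # phi(s) = (s + 1)^2, say
E_phi_given_T = phi @ K                # one value per t
# Check against E[phi(S) 1_{T=t}] = E[phi(S) | T = t] P_T(t).
assert np.allclose(E_phi_given_T * P_T, (phi[:, None] * joint).sum(axis=0))
print(E_phi_given_T)
```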
6 Disintegrations
Given a measure \(\mu\) on \(M\) and a kernel \(t\mapsto\lambda_t\) of measures on \(\Omega\) (a constant kernel \(\lambda_t=\lambda\) gives the product measure), we can form a measure \(\Lambda\) on \(\Omega\times M\). We have \[ \Lambda f = \mu^t \lambda^x_t f(x, t). \] Disintegrations are the opposite: start with a measure on the product, decompose it into the marginal on one projection and a kernel on the other. The connection to rcps comes from embedding \(\Omega\) into the product \(\Omega\times M\) by \(\iota: x\mapsto (x, Tx)\). The image \(\iota\lambda\) of a measure \(\lambda\) on \(\Omega\) becomes a measure on the product space which we can then decompose. Again, under certain regularity conditions, everything is hunky dory.
6.1 Pachl Disintegrations
Pachl (1978) defines a disintegration as follows.
Setup.
Measure spaces \((X,\A)\), \((Y,\B)\). \(\R\) a probability on \(\sigma(\A\otimes\B)\) and put \(\Q=\pi_Y(\R)\) so \(\Q(F)=\R(X\times F),\ \forall F\in\B\).
Result.
Suppose \(\forall y\in Y\) there is a sigma-algebra \(\A_y\) on \(X\) and a probability \(P_y\) on \(\A_y\) such that
- \(\forall E\in\A,\ \exists N\in\B\) with \(\Q N=0\), such that \(E\in\A_y\ \forall y\in Y\setminus N\) and \(y\mapsto P_y(E)\) is \((\B, Y\setminus N)\)-measurable
- \(\forall E\in\A,\ \forall F\in \B\) \[ \int_F P_y(E)\,\Q(dy) = \R(E\times F). \]
The family \(\{(\A_y, P_y)\}_{y\in Y}\) is called a \(\Q\)-disintegration of \(\R\).
The definition does not assume \(\A_y \supset \A\) so that complete Lebesgue probability is disintegrable, see Section 7.3.
6.2 Hoffmann-Jørgensen Disintegrations
6.2.1 RCP
Setup.
- \((\Omega, \F, \P)\)
- \((L,\A)\)
- \((M,\B)\)
- \(S:(\Omega, \F) \to (L,\A)\)
- \(T:(\Omega, \F) \to (M,\B)\)
Result.
RCP \(P^T_S(\cdot\mid\cdot):\A\times M\to[0,1]\) (kernel) s.t.
- \(P^T_S(A\mid\cdot)\) is \(\B\)-measurable \(\forall A\in\A\)
- \(P^T_S(\cdot\mid t)\) is a measure on \(\A\) for all (equivalently, a.e.) \(t\in M\)
- \(P^T_S\) is a version of conditional probability \[ \P(S\in A, T\in B) = \int_B P^T_S(A\mid t)\,\P_T(dt). \]
6.2.2 Disintegration
Setup.
- \((L,\A, \mu)\)
- \((M,\B, \nu)\)
- \(\phi:(L, \A) \to (M, \B)\)
- \(\phi\mu \ll \nu\)
Result.
Disintegration \(\xi:\A\times M\to \RR\):
- \(\xi(A\mid\cdot)\) is \(\B\)-measurable \(\forall A\in\A\)
- \(\xi(\cdot\mid t)\) is a measure on \(\A\) for all (equivalently, a.e.) \(t\in M\)
- \(\xi\) is a version of conditional probability \[ \mu(A \cap \phi^{-1}(B)) = \int_B \xi(A\mid t)\nu(dt). \]
If the graph of \(\phi\) is \(\A\otimes\B\)-measurable then \(\xi(\phi^{-1}(t)\mid t) = 1,\ \forall t\in M\).
| RCP | Disintegration |
|---|---|
| \(S\P\) | \(\mu\) |
| \(T\P\) | \(\nu\) |
| \(\Gamma\) graph of \(S\) against \(T\) | graph of the function \(\phi\) |
\(S\P(A) = \P(S\in A)\)
6.2.3 Pollard Disintegrations
Pollard (2002) 5.2<4> defines an RCP for \(S\) the identity, i.e., for \(\P\) rather than \(S\P\), and works with \(T\P\) on \((M,\B)\). He also requires \(P^T(T\not=t\mid t)=0\) a.e. \(t\). In a remark, he compares this definition to an rcp, which has the requirement that \(\E[h(T)X]=\E[h(T)\E[X\mid T]]\), and points out the difference is just the diagonal measurability requirement.
Given a kernel \(P\) and distribution \(\nu\) on \((M,\B)\) (Pollard’s \(\mathbb Q\)) you get a measure \(\nu\otimes P\) on the product \((\Omega\times M, \F\otimes\B)\) by (Pollard and standard notation) \[ \begin{align} (\nu\otimes P)(g) &= \nu^t P^\omega_t g(\omega,t) \\ &= \nu^t P^\omega_t g(\omega,T\omega) \\ &= \P^\omega g(\omega,T\omega), \end{align} \] where the second equality uses \(P_t\{T\not=t\}=0\) and the third the disintegration \(\P=\nu P_t\).
The disintegration goes the other way. If you start with a measure \(\Gamma\) on \(\F\otimes\B\) and a measure \(\mu\) on \(\B\) can you decompose \(\Gamma\) into a kernel \(\Lambda\) so that \(\Gamma=\mu\otimes\Lambda\)?
7 Examples
?Others from Chang and Pollard?
7.1 Duplication Map
Let \(\Omega=[0,1]\) and \(T(x)=\{2x\}\) where \(\{x\}=x-\lfloor x\rfloor\) is the fractional part of \(x\).
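For this map the disintegration can be written down explicitly. Under Lebesgue measure on \([0,1]\) (a sketch of the standard computation, not taken from a cited source): \(T\lambda\) is again Lebesgue measure on \([0,1]\), the fibre over \(t\) is the two-point set \(\{t/2, (t+1)/2\}\), and \(\lambda_t = \tfrac12(\delta_{t/2}+\delta_{(t+1)/2})\). The snippet checks \(\lambda f = (T\lambda)^t \lambda_t f\) numerically:

```python
import numpy as np

# Duplication map T(x) = {2x} on [0,1] with Lebesgue measure.
# Claimed disintegration: lambda_t = (delta_{t/2} + delta_{(t+1)/2}) / 2.
f = lambda x: np.exp(x)                  # any integrable test function

ts = np.linspace(0, 1, 100_001)
lambda_t_f = 0.5 * (f(ts / 2) + f((ts + 1) / 2))   # lambda_t(f) = E[f | T = t]

lhs = np.mean(f(ts))                     # integral of f d(lambda), Riemann approx
rhs = np.mean(lambda_t_f)                # integral of lambda_t(f) d(T lambda)(t)
print(lhs, rhs)                          # both approximately e - 1
```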
7.2 Coco Examples / Diagonal not measurable
7.2.1 Billingsley
Following Billingsley (2017), Section 33.
B.1. There are pathological examples showing that the interpretation of conditional probability in terms of an observer with partial information breaks down in certain cases.
Let \((\Omega, \mathscr F, \mathsf P)\) be the unit interval with Lebesgue measure on the \(\sigma\)-field of Borel subsets of \(\Omega\). Take \(\mathscr G\) to be the \(\sigma\)-field of sets that are either countable or cocountable. Then the function identically equal to \(\mathsf P(A)\) is a version of \(\mathsf P(A\mid \mathscr G)\). The function is \(\mathscr G\)-measurable and integrable, and has the “right” integral over \(G\in\mathscr G\) because \(\mathsf P(G)\) is either 0 or 1 for every G in \(\mathscr G\) . Therefore, \[ \mathsf P(A \mid \mathscr G)_\omega = \mathsf P(A) \tag{4}\] with probability 1. But since \(\mathscr G\) contains all one-point sets, to know which elements of \(\mathscr G\) contain \(\omega\) is to know \(\omega\) itself. Thus \(\mathscr G\) viewed as an experiment should be completely informative—the observer given the information in \(\mathscr G\) should know \(\omega\) exactly—and so one might expect that \[ \mathsf P(A \mid\mathscr G)_\omega= \begin{cases} 1 & \text{if\ } \omega\in A, \\ 0 & \text{if\ } \omega\not\in A. \end{cases} \tag{5}\]
The mathematical definition gives Equation 4; the heuristic considerations lead to Equation 5. Of course, Equation 4 is right and Equation 5 is wrong. The heuristic view breaks down in certain cases but is nonetheless illuminating and cannot, since it does not intervene in proofs, lead to any difficulties.
B.2. Switch perspective from fixing \(A\), and getting a function \(\omega\mapsto \P(A\mid \G)(\omega)\), to fixing \(\omega\) and looking for a measure. Suppose that \(A_0,\dots,A_r\in\F\) partition \(\Omega\) and set \(\G=\sigma(A_1,\dots,A_r)\). If \(\P(A_0)=0\) but the other \(\P(A_i)>0\), then one version of \(\P(B\mid\G)\) is \[ \P(B\mid\G)(\omega) = \begin{cases} 17 & \omega\in A_0 \\ \dfrac{\P(B\cap A_i)}{\P(A_i)} & \omega\in A_i,\ i=1,\dots,r. \end{cases} \]
Now, for fixed \(\omega\), \(B\mapsto\P(B\mid\G)(\omega)\) is a probability measure on \(\F\) for \(\omega\in A_1\cup\dots\cup A_r\) but not if \(\omega\in A_0\). This is the “wrong version”. If \(17\) is replaced by \(\P(B)\) then we get a probability measure in \(B\) for every \(\omega\).
B.3. Suppose \(\omega_0\) is an atom of positive probability (any set containing \(\omega_0\) has positive measure). Fix any versions of \(\P(B\mid\G)\). For each \(B\), the set \(\{ \omega \mid \P(B\mid\G)(\omega) < 0 \}\in\G\), since the conditional probability is \(\G\)-measurable, and it must have measure 0. Therefore it cannot contain \(\omega_0\), and thus \(\P(B\mid\G)_{\omega_0}\ge 0\). Similarly \(\P(\Omega\mid\G)_{\omega_0}=1\) and \(\P(\bigcup_i B_i \mid\G)_{\omega_0} = \sum_i\P(B_i\mid\G)_{\omega_0}\) over disjoint unions. Therefore \(\P(B\mid\G)_{\omega_0}\) is a probability measure as \(B\) ranges over \(\F\).
Thus conditional probabilities behave like probabilities at points of positive probability. That they may not do so at points of probability \(0\) causes no problem, because individual such points have no effect on the probabilities of sets.
7.2.2 Hoffmann-Jørgensen
Let \(\Omega\) be diffuse (uncountable, all points measurable with measure zero). \(S\) is the identity, but \(T\) is the identity into \((\Omega, \B)\), where \(\B\subset\F\) is the sigma algebra of countable and cocountable sets (every \(B\in\B\) has \(\P(B)\in\{0,1\}\)).
Claim. \(P^T(F\mid t)=\P(F)\) is an rcp for \(\P\) given \(T\). (Pf. It is clearly a measure in \(F\) for all \(t\), and is a measurable function in \(t\) for all \(F\) because it is constant. Finally, consider \(\P(A \cap B)\) for \(B\in\B\). If \(\P(B)=0\) then \(\P(A\cap B)=0\) because \(A\cap B\subset B\). If \(\P(B)=1\) then \(\P(A\cap B)=\P(A)\) by complements. In both cases \(\P(A\cap B)=\P(A)\P(B)=\int_B \P(A)\,\P_T(dt)\).)
Since \(\{T=t\}=\{t\}\), \(P^T(T=t\mid t)=\P(\{t\})=0\) for all \(t\in\Omega\). If \(\Gamma = \{(\omega,\omega)\}\) is the diagonal in \(\Omega\times\Omega\) and \(\psi=1_\Gamma\), then \(\psi(S,T)=1\) identically and \(\psi(S,t)=1_{\{t\}}\). Hence \[ \E[\psi(S,T)\mid T=t]=1,\quad \E[\psi(S,t)\mid T=t]=0 \quad\forall t. \] In this case \(\Gamma\not\in\F\otimes\B\) and \(\psi\) is not \(\F\otimes\B\)-measurable. This is the same as Billingsley’s coco example, Section 7.2.1.
7.3 Lebesgue measure example
Example (Pachl (1978)). Assume the continuum hypothesis. Put \(X=[0,1]^2\), \(Y=[0,1]\). \(\P\), \(\Q\) are ordinary Lebesgue measures on the sigma-algebras \(\A\), \(\B\) of Lebesgue measurable subsets of \(X\) and \(Y\). Define the probability \(\R\) on \(\sigma(\A\otimes \B)\) (NOT complete!) by \[ \R(G) = \P\{ (x_1,x_2)\in X\mid ((x_1,x_2), x_2)\in G \} \] for \(G\in \sigma(\A\otimes \B)\). There is no \(\Q\)-disintegration with \(\A_y\supset \A\) for all \(y\in Y\). If \(\{(\A_y, P_y)\}_{y\in Y}\) is a disintegration (so \(P_y\) is a measure on \(X=[0,1]^2\)) then \(P_y\) extends one-dimensional Lebesgue measure on \([0,1] \times \{y\} \subset X\) to \(\A_y\). But \(\A\) contains all subsets of \([0,1]\times \{y\}\) (it is a null set). Thus if \(\A_y\supset \A\) we obtain an extension of Lebesgue probability on \([0,1]\) to the sigma-algebra of all subsets of \([0,1]\), which is inconsistent with the CH per Ulam’s theorem. (This is the “Strange Example” from Hoffmann-Jørgensen (1994) vol II p.150-51.)
Note. This is an example of why Simon says the LM sets are evil.
7.4 Doob’s example
7.4.1 Stoyanov
Following Stoyanov (2013).
Let \((\Omega, \F, \P)\) be a probability space and \(\F_1\) a σ-field such that \(\F_1 \subset \F\). Recall that the conditional probability \(\P(A\mid \F_1)\) is defined \(\P\)-a.s. as an \(\F_1\)-measurable function of \(\omega\) such that \[ \P(AB) = \int_B\P(A\mid \F_1)d\P(\omega)\quad\forall B\in\F_1. \] The conditional probability \(\P(A\mid \F_1)\), \(A\in\F\) is said to be regular if there exists a function \(P(A, ω)\), \(A \in \F, \omega\in \Omega\) (notation \(P\) not \(\P\)), which satisfies the following two properties:
- \(P(A, \cdot) = \P(A\mid \F_1)(\cdot)\) \(\P\)-a.s. for an arbitrary \(A \in\F\);
- for each \(\omega\), \(P(·, \omega)\) is a probability measure on \(\F\).
If condition (ii) is satisfied and condition (i) holds for all \(\omega\) (not only for \(\P\)-almost all \(\omega\)), then \(\P(A\mid \F_1)\) is called a proper regular conditional probability. (In terms of distributions we speak about regular and proper regular conditional distributions.) Regular conditional probabilities exist in many cases, but proper regular conditional probabilities do not always exist, as can be seen below.
Let \((\Omega, \F, λ)\) be a probability space with \(\Omega = [0, 1]\), \(\F\) the \(σ\)-field of the Lebesgue measurable sets in \([0, 1]\) and \(λ\) the Lebesgue measure. It is well known that in the interval \([0, 1]\) there is a non-measurable (in the Lebesgue sense) set, say \(N\), such that its outer measure is \(λ^*(N)=1\) and its inner measure is \(λ_*(N)=0\). For details see Halmos (1974) and my note on the Halmos set. \(N\) is “half of \(\Omega\)” in the sense that \(\Omega=N\cup (N+1)\), so it “should have” measure \(1/2\).
Define a new \(\sigma\)-field \(\tF\) which is generated by \(\F\) and the set \(N\). Thus \(\tF\) consists of sets of the form \(N B_1 ∪ N^c B_2\) where \(B_1,B_2\in\F\). Define also the measure \(\P\) on the measurable space \(([0, 1], \tF)\) by \[ \P(N B_1 ∪ N^c B_2) = \frac{1}{2}(\lambda(B_1)+\lambda(B_2)). \]
It is easy to check that \(\P\) is well defined and is a probability on \(\tF\), so the triplet \(([0, 1], \tF,\P)\) is a probability space. For every \(B \in \F\) we have \[ \P(N B ∪ N^c B) = \P(B) = \lambda(B) \] and hence \(\P\) coincides with \(λ\) on \(\F\), that is \(\P\vert_\F = λ\). Moreover, \[ \P(N) = \frac{1}{2}. \] Now we shall prove the following statement: on the probability space \(([0, 1], \tF,\P)\) there is no regular conditional probability \(P(A, ω)\), \(A\in\tF\), with respect to the σ-field \(\F\).
Suppose such a probability exists: that is, there is a function, say \(P(A, ω)\), which satisfies the above conditions (i) and (ii). If so, then for any Borel (and Lebesgue) set \(A\), \(P(A, ω) = 1_A(ω)\). Therefore if \(A\) is a one-point set, \(A = \{ω\}\), then \(P(\{ω\}, ω) = 1\). Now take the set \(N\). From the definition of a conditional probability and the equality \(\P(N) = 1/2\) we get \[ \frac{1}{2} = \P(N) = \int_\Omega P(N,\omega)\,\lambda(d\omega). \tag{6}\] On the other hand, if \(P(·, ω)\) is a measure for each \(ω\), then \[ P(N, ω) \ge P(\{ω\}, ω) \ \forall ω\in N\implies P(N,ω)=1\ \forall ω\in N. \]
Consider the set \(C = \{ω:P(\{ω\}, ω) = 1\}\). Since \(P(·, ω)\) is a Borel function in \(ω\), the set \(C\) is Borel measurable with \(\P(C) = 1\). Let \(D = \{ω : P(N, ω) = 1\}\). It is clear that \(D\) is Borel-measurable and \(D ⊃ C \cap N\), which implies that \(D ∪ C^c ⊃ N\). However, the set \(D∪ C^c\) is Borel and covers the (non-measurable!) set \(N\) which has \(λ^*(N) = 1\). Therefore \(\P(D ∪ C^c) = 1\) and \(\P(D) = 1\). In other words, for almost all \(ω\) we get \(P(N, ω) = 1\), which implies the equality \[ \int_\Omega P(N,\omega)\,\lambda(d\omega) = 1. \] However, this contradicts Equation 6.
Therefore a regular conditional probability need not always exist.
Let us note that in this counterexample the role of the non-measurable set \(N\) is essential. Recall that the construction of \(N\) relies on the axiom of choice. Using a weakened form of the axiom of choice, Solovay (1970) derived several interesting results concerning the measurability of sets, measures and their properties.
General results on the existence of regular conditional probabilities can be found in the works of J. Pfanzagl (1969), Blackwell and Dubins (1975) and Faden (1985).
7.4.2 Rogers and Williams
Here is the version from Rogers and Williams (1994), page 141. Notation shift: \(N\) becomes \(Z\) and \(\F\) becomes \(\G\) the Borel sigma-algebra. \(\tF\) becomes \(\F\). \(\lambda\) becomes \(\mu\), Lebesgue measure.
The setup is the same modulo pointing out that outer measure 1 is used to confirm the extended measure \(\P\) as defined is a measure.
Again, suppose a rcp \(P\) exists for \(\G\subset\F\) as a function \(\F\times\Omega\to [0,1]\). First think about what this means. Knowing the outcome in \(\G\) actually tells you which point occurred because \(\{\omega\}\in\G\ \forall\omega\in\Omega\). The extra information from \(\F\) is whether or not \(Z\) occurred. By definition \(Z\) is “half of \(\Omega\)”, so the probability \(P(\Gamma\cap Z, \omega)\) should be \(\frac{1}{2}1_\Gamma(\omega)\) for \(\Gamma\in\G\) and the set \(J=\{\omega\mid P(\Gamma\cap Z, \omega) = \frac{1}{2}1_\Gamma(\omega) \ \forall\Gamma\in\G\}\) should have \(\P\) and \(\mu\) measure 1. But then if \(\omega\in J\) \[ P(Z\cap J, \omega) =\frac{1}{2}1_J(\omega)=1\not=\frac{1}{2}1_{J\setminus\{\omega\}}(\omega) = P(Z\cap (J\setminus\{\omega\}), \omega) \] so that \(Z\cap J\not=Z\cap (J\setminus\{\omega\})\), i.e., \(\omega\in Z\). But then \(J\), a set of inner measure zero, contains a \(\G\)-measurable set of measure 1, which is a contradiction.
The detailed argument uses the fact that \(\G\) is generated by a countable \(\pi\)-system \(\mathscr I\) to replace \(J\) by \(\{\omega\mid P(\Gamma\cap Z, \omega) = \frac{1}{2}1_\Gamma(\omega) \ \forall\Gamma\in\mathscr I\}\), which shows \(J\) is in \(\G\) and therefore that its \(\P\) and \(\mu\) measures agree.
Notice that \(Z\) is independent of \(\G\) since \(\P(Z\cap \Gamma) = \frac{1}{2}\mu(\Gamma) = \P(Z)\P(\Gamma)\) for all \(\Gamma\in\G\).
This example is consistent with the general existence theorem for rcps, which assumes \(\G\) is a sub-sigma algebra of the Borel sigma-algebra on \([0,1]\); here the ambient sigma-algebra \(\F\) is strictly larger than the Borel sets.
7.5 Borel’s paradox
Borel’s paradox is a famous example illustrating the counterintuitive nature of conditional probabilities on continuous probability spaces.
Consider a uniform distribution on the surface of a sphere. If we condition on a point on the sphere having a particular latitude, the resulting conditional distribution of longitudes is uniform. However, if we instead condition on the point having a specific longitude, the conditional distribution of latitudes is not uniform. This leads to a paradoxical situation where the method of conditioning affects the outcome, despite the symmetry of the problem.
The paradox highlights the importance of carefully defining conditional probabilities in continuous settings and demonstrates that intuitive ideas from discrete probability do not always directly translate to the continuous case.
It is sometimes described in terms of a meteorite strike. Given that it hits the equator, the distribution is uniform. But given that it hits along a line of longitude through London, it is not.
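A sketch of the standard computation shows where the asymmetry comes from. In latitude-longitude coordinates (\(\phi\) latitude, \(\theta\) longitude) the uniform distribution on the sphere has joint density \(\frac{\cos\phi}{4\pi}\), so \[ \begin{gather} f(\theta\mid\phi) = \frac{1}{2\pi} \quad\text{(uniform on a circle of latitude)},\\ f(\phi\mid\theta) = \frac{\cos\phi}{2} \quad\text{(not uniform on a meridian)}. \end{gather} \] Each conditional density is the limit of conditioning on a shrinking band around the relevant great circle, and a latitude band and a longitude lune shrink in different ways; the \(\cos\phi\) factor survives in the second limit but not the first.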
7.6 Other awkward sub-sigma algebras
- \(\F\) in its completion
- Tail sigma algebras (JS)
8 Background theory
8.1 A very short introduction to measure theory
8.1.1 Basic definitions
Generally need a sigma-finite assumption!
A ring \(R\) of sets is a nonempty class of subsets of a set \(X\) that is closed under union and difference: \(\forall\ E,F\in R\), \(E\cup F\in R\) and \(E\setminus F\in R\). The empty set \(\emptyset = E\setminus E\) is in \(R\), as are \(E\triangle F=(E\setminus F)\cup (F\setminus E)\) and \(E\cap F=(E\cup F) \setminus (E\triangle F)\).
An algebra is a ring that contains \(X\). Hence it is closed under complements \(E'=X\setminus E\).
A sigma ring/algebra is closed under countable unions.
Given any class of sets \(\C\) there is a unique smallest ring (algebra, sigma-ring, sigma-algebra) containing \(\C\). Each element in the sigma-ring generated by \(\C\) is in a sigma-ring generated by countably many elements of \(\C\). (Because \(\{E\subset\Omega\mid E\in\sigma(\C'), \C'\subset\sigma(\C),\text{ countable}\}\) is a sigma algebra contained in \(\sigma(\C)\)!)
A measure is a non-negative, countably additive set function taking value \(0\) on the empty set. Measures are monotone. The length of an interval is a set function on the class of (semiclosed) intervals.
A class \(\H\) is hereditary if \(F\subset E\in \H\implies F\in \H\). An outer measure is a non-negative, monotone, countably subadditive set function on a hereditary sigma-ring, vanishing on the empty set.
Given a measure \(\mu\) on a ring \(R\), \[ \mu^*(E)=\inf\left\{ \sum_{n\ge 1} \mu(E_n) \mid E_n\in R,\ E\subset\bigcup_n E_n \right\} \] defines an outer measure on the hereditary sigma-ring \(\H(R)\) generated by \(R\).
A set \(E\) in a hereditary sigma-ring \(\H\) is \(\mu^*\)-measurable if \[ \forall A\in\H,\ \mu^*(A)=\mu^*(A\cap E) + \mu^*(A\cap E'). \]
The class \(S\) of \(\mu^*\)-measurable sets is a sigma-ring and \(\mu^*\) restricted to \(S\) is a (complete) measure.
Let \(\mu\) be a measure on a ring \(R\), with induced outer measure \(\mu^*\) on \(\H(R)\) and induced measure \(\bar\mu\) on the class \(\bar S\) of \(\mu^*\)-measurable sets. Then:
- Every set in \(S(R)\) is \(\mu^*\)-measurable
- \(\mu^*(E)=\inf\left\{ \sum_{n\ge 1} \bar\mu(E_n) \mid E_n\in \bar S,\ E\subset\bigcup_n E_n \right\}\)
- \(\mu^*(E)=\inf\left\{ \sum_{n\ge 1} \bar\mu(E_n) \mid E_n\in S(R),\ E\subset\bigcup_n E_n \right\}\)
- On \(\bar S\), the outer measure \(\mu^*\) coincides with the completion of the extension of \(\mu\) to \(S(R)\)
Starting with \(R\) the ring generated by the (semiclosed) intervals and \(\mu\) = length, \(S(R)\) is the Borel sets, \(\bar S\) is the Lebesgue measurable sets, and \(\bar\mu\) is Lebesgue measure.
8.1.2 Atoms
Hoffman-Jørgensen (2017b) Ch 6.2-3.
Let \(\G\) be a sigma-algebra on \(\Omega\). Define the equivalence relation \[ \omega'\equiv \omega\mod\G \iff 1_G(\omega)=1_G(\omega')\quad \forall G\in\G. \] Equivalence classes are called atoms, and the atom containing \(\omega_0\) is \[ \G(\omega_0) = \{\omega\mid \omega\equiv\omega_0\mod\G\}. \] The atoms form a disjoint partition of \(\Omega\). If \(\G(\omega)=\{\omega\}\) for every \(\omega\) then \(\G\) separates points, and \(\G\) gives complete information about outcomes.
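For a sigma-algebra generated by finitely many subsets of a finite \(\Omega\), the atoms are exactly the classes of points sharing the same indicator signature. A minimal Python sketch (the set \(\Omega\) and the generators are made up for illustration):

```python
# Atoms of sigma(generators) on a finite Omega: two points are equivalent
# iff every generator contains both or neither, i.e. iff their indicator
# signatures (1_G(w) for each generator G) agree.

def atoms(omega, generators):
    classes = {}
    for w in omega:
        signature = tuple(w in g for g in generators)
        classes.setdefault(signature, set()).add(w)
    return list(classes.values())

omega = range(8)
gens = [{0, 1, 2, 3}, {2, 3, 4, 5}]
print(atoms(omega, gens))  # [{0, 1}, {2, 3}, {4, 5}, {6, 7}]
```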
There follows a long list of things you expect to be true that are true.
8.1.3 Product Measures
The product sigma algebra is generated by products of measurable sets. Sections of a measurable set are measurable.
The class of finite, disjoint unions of rectangles from two rings is itself a ring.
The product sigma-algebra is the smallest sigma-algebra that makes projections measurable.
A rectangle is the intersection of two inverse images under the projections: \(E_1\times E_2 = \pi_1^{-1}(E_1)\cap \pi_2^{-1}(E_2)\). The area of the rectangle is \(\P_1(E_1)\P_2(E_2)\).
Let \(X=Y\) be an uncountable set and \(\F=\G\) the class of countable subsets (the sigma-ring generated by the one-point sets). If \(D=\{ (x,y) \mid x=y \}\) is the diagonal in \(X\times Y\) then every section of \(D\) is measurable (a one-point set) but \(D\) is not. (Because every set in the product sigma-ring is contained in a countable union of rectangles with countable sides, hence in \(C\times C'\) with \(C, C'\) countable, while \(D\) projects onto all of the uncountable set \(X\).)
Given measure spaces \((X,\F,\P)\) and \((Y,\G,\Q)\) the product measure \(\lambda = \P \times \Q\) is defined by \[ \lambda(E) = \int \Q(E_x)\P(dx) = \int \P(E_y)\Q(dy). \]
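On a finite space the defining identity is just an iterated sum, and the two orders of summation agree. A quick numerical sketch (the weights and the set \(E\) are made up):

```python
import numpy as np

# Product measure on a finite space: lambda(E) = int Q(E_x) P(dx)
#                                             = int P(E_y) Q(dy).
P = np.array([0.2, 0.5, 0.3])          # probability on X = {0, 1, 2}
Q = np.array([0.1, 0.4, 0.25, 0.25])   # probability on Y = {0, 1, 2, 3}
E = np.random.default_rng(0).integers(0, 2, size=(3, 4))  # indicator of E

x_first = sum(P[x] * Q[E[x] == 1].sum() for x in range(3))     # int Q(E_x) P(dx)
y_first = sum(Q[y] * P[E[:, y] == 1].sum() for y in range(4))  # int P(E_y) Q(dy)
assert np.isclose(x_first, y_first)
print(x_first)
```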
The product of two complete measure spaces need not be complete. (Take \(A\subset[0,1]\) not Lebesgue measurable and any \(x\in[0,1]\). Then \(\{x\}\times A\subset [0,1]^2\) is a subset of the null set \(\{x\}\times[0,1]\), but it is not in the product sigma-algebra: its section at \(x\) is the non-measurable set \(A\).) This leads Simon (2015a) p. 210 to say that
passing from Borel to Lebesgue measurable functions is the work of the devil. Don’t even consider it!
8.1.4 The Diagonal
From Hoffman-Jørgensen (2017b), exercises to Chapter 1 and …
Let \((\Omega,\F)\) be a measurable space and \(\Delta=\{(\omega,\omega) \mid \omega\in\Omega\}\) be the diagonal in \(\Omega\times \Omega\).
\(\F\) is countably separating (also called countably separable below) if there exists a countable paving \(\G\subseteq \F\) such that \(\forall \omega\not=\omega',\ \exists G\in\G:\ 1_G(\omega)\not=1_G(\omega')\)
Here are the important facts about the diagonal and countably separable spaces.
- Let \(f:\Lambda\to\Omega\) be an injective function. If \(\F\) is countably separable then so is \(f^{-1}(\F)\).
- Obvious
- Let \((S,d)\) be a separable metric space. Then \(\BB(S)\) is countably separable.
- Let \(\{a_n\}\) be a countable dense subset. Then \(\G=\{ b(a_i,2^{-j}) \}_{i,j}\) countably separates. (For \(\RR\) you can use half intervals at the rationals.)
- (Marczewski function) Let \(\{A_n\}\) be a sequence of subsets of \(\Omega\). Define a function \(f:\Omega\to\RR\) by \[ f(\omega) = \sum_n 10^{-n}1_{A_n}(\omega). \] Then \(f^{-1}(\BB(\RR)) = \sigma(A_n)\), and \(f\) is injective iff \(\{A_n\}\) separates points. (See the sketch at the end of this list.)
- \(\F\) is countably separable iff there exists an injective measurable function \(\Omega\to\RR\).
- (only if, \(\implies\)): use 3, Marczewski function.
- (if \(\impliedby\)): follows from 1 and 2 (applied to \(\RR\)).
- \(\F\) is countably separating iff \(\Delta\in\F\otimes\F\).
- (only if, \(\implies\)): if \(\F\) is countably separable then there is an injective measurable function \(f\) to \(\RR\). Set \(g:\Omega\times\Omega\to\RR^2\), \(g(\omega_1,\omega_2)=(f(\omega_1), f(\omega_2))\). Then \(\Delta = g^{-1}(\Delta_{\RR^2})\), and the diagonal \(\Delta_{\RR^2}\) in \(\RR^2\) is closed, hence Borel.
- (if \(\impliedby\)): if \(\Delta\in\F\otimes\F\) then \(\Delta\) is in a sigma-algebra generated by a countable set \(\G\subset\F\) (see potted summary). Suppose the elements of \(\G\) do not separate points, so there are \(\omega\not=\omega'\) such that every set in \(\G\) contains both or neither of them. Since the same holds for unions, complements and intersections of sets in \(\G\), it holds for all sets in \(\sigma(\G)\). The section \(\Delta_\omega=\{w \mid (\omega,w)\in\Delta\}\) is \(\sigma(\G)\)-measurable and contains \(\omega\), and therefore \(\omega'\in\Delta_\omega\), meaning \((\omega,\omega')\in\Delta\): a contradiction since \(\omega\not=\omega'\). Therefore \(\G\) separates points.
- Suppose \(\Omega\) has cardinality strictly greater than that of \(\RR\). Then the diagonal is not in the product sigma algebra.
- There is no injective function to \(\RR\).
- Let \((L,\A)\) and \((M, \B)\) be measurable spaces, \(\nu\) a measure on \((M,\B)\), and \(f:L\to M\) a function such that \(\A=f^{-1}(\B)\) and \(f(L)\in \B\). Then
- \(\forall A\in \A,\ f(A) \in \B\)
- Let \(A=f^{-1}(B)\), then \(f(A)=B\cap f(L)\).
- \(\exists \mu\) on \((L,\A)\) so that \(\nu=f\mu\) iff \(\nu(M\setminus f(L))=0\).
- \(\mu(A) = \nu(f(A))\) has the required properties.
- Let \(f:(L,\A)\to(M, \B)\) be a measurable function and suppose \((M,\B)\) is countably separable (equivalently, has a measurable diagonal). Then the graph of \(f\), \[
\Gamma = \{ (x,f(x)) \mid x\in L \},
\] belongs to \(\A\otimes\B\).
- By assumption, there exists a measurable injection \(\phi:M\to\RR\). Define \(\Phi(a, b)=(\phi(f(a)), \phi(b))\). Then \(\Phi\) is measurable (projections are measurable) and \(\Gamma=\Phi^{-1}(D)\), \(D\) the diagonal in \(\RR^2\).
- Thus measurable graph or diagonal only depends on the range space being countably separable.
- Let \(S\) be a separable metric space and \((L, \A)\) a measurable space. Suppose that \(f:S\times L\to \RR\) is a function with \(f(\cdot, t)\) continuous for all \(t\in L\) and \(f(s, \cdot)\) \(\A\)-measurable for all \(s\in S\). Then \(f\) is \(\BB(S)\otimes\A\)-measurable.
- Let \(\{ a_n\}\) be a countable dense sequence in \(S\) and set \[ \begin{gather} D_1^n = b(a_1,2^{-n}),\quad D^n_j = b(a_j,2^{-n})\setminus\bigcup_{1\le i\le j-1} b(a_i, 2^{-n}) \\ f_n(s,t) = \sum_j f(a_j, t) 1_{D_j^n}(s). \end{gather} \] For fixed \(n\) the \(D_j^n\) are disjoint by construction and \(S=\bigcup_j D^n_j\). Since \(f(\cdot, t)\) is continuous and the \(a_j\) are dense, \(f_n(s,t)\to f(s,t)\). Each \(f_n\) is \(\BB(S)\otimes \A\)-measurable, and the pointwise limit of measurable functions is measurable.
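As a toy illustration of the Marczewski function referenced in the list above (the four-point space and its subsets are made up), the decimal digits of \(f(\omega)\) record the indicator signature, so \(f\) is injective exactly when the sets separate points:

```python
# Marczewski function f(w) = sum_n 10^{-n} 1_{A_n}(w) for a finite list of sets.
def marczewski(w, sets):
    return sum(10.0 ** -(n + 1) for n, A in enumerate(sets) if w in A)

omega = ["a", "b", "c", "d"]
separating = [{"a", "b"}, {"a", "c"}]      # distinct signatures: f injective
not_separating = [{"a", "b"}, {"a", "b"}]  # a,b (and c,d) get the same value

print([marczewski(w, separating) for w in omega])      # [0.11, 0.1, 0.01, 0.0]
print([marczewski(w, not_separating) for w in omega])  # [0.11, 0.11, 0.0, 0.0]
```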
The sigma-algebra of countable and co-countable sets on an uncountable \(\Omega\) is not countably generated. If it were, we could take the generators to be countable sets (replacing any co-countable generator by its complement), involving only countably many points in total, say the set \(P\). Every set in the generated sigma-algebra then contains either none or all of \(\Omega\setminus P\), so a singleton \(\{x\}\) with \(x\notin P\) is not in it, although it is countable/co-countable.
Therefore, if \(\Omega\) is uncountable and \(\F\) is the countable/co-countable sigma algebra, then the diagonal is not measurable in the product. (No countable subfamily of \(\F\) can separate points: it would generate a countably generated sub-sigma-algebra separating points, which the argument above rules out.) However, the diagonal has measurable sections.
Interestingly, a sigma-algebra cannot be countably infinite. Suppose \(\F\) is countably infinite; since \(\F\subset\mathscr P(\Omega)\) (power set), \(\Omega\) must be infinite. Define \(A_x=\bigcap \{A\in\F\mid x\in A\}\in\F\) (it is a countable intersection). It is the smallest set in \(\F\) containing \(x\). If \(y\in A_x\) then \(A_y=A_x\): either \(x\in A_y\), forcing \(A_x\subseteq A_y\subseteq A_x\), or \(x\notin A_y\) and \(A_x\setminus A_y\) is a smaller set in \(\F\) containing \(x\). Hence distinct \(A_x\) are disjoint, and every set in \(\F\) is a union of the atoms \(A_x\). Since \(\F\) is infinite there must be infinitely many distinct atoms, and distinct countable collections of atoms have distinct unions, all lying in \(\F\); so \(\F\) has at least \(2^{\aleph_0}\) elements and is uncountable.
8.2 Radon-Nikodým theorem
From Halmos measure theory, Theorem 31.N.
Theorem 1 (Radon-Nikodým Theorem) If \((\Omega,\F,\mu)\) is a totally \(\sigma\)-finite measure space and if a \(\sigma\)-finite measure \(\nu\) on \(\F\) is absolutely continuous wrt \(\mu\), then there exists a finite-valued measurable function \(f\) on \(\Omega\) such that \[ \nu(A) = \int_A f\,d\mu \] for every measurable \(A\). The function \(f\) is unique a.e. \((\mu)\).
Proof. Reduce to finite \(\mu,\nu\). Uniqueness follows from \[ \int_A (f - g)\,d\mu = 0\ \ \forall A\implies \mu\{f>g\} = 0,\ \mu\{f<g\} = 0. \] Let \(\mathcal K\) be the class of all non-negative \(\mu\)-integrable functions \(f\) such that \[ \int_A f\,d\mu \le \nu(A) \] for every measurable \(A\), and write \[ \alpha = \sup \left\{ \int_\Omega f\,d\mu \mid f\in\mathcal K \right\}. \] Let \(f_n\) be a sequence in \(\mathcal K\) such that \[ \lim_n \int_\Omega f_n\,d\mu = \alpha \] and let \(g_n = \max \{f_1,\dots,f_n\}\). Any \(A\) can be written as a disjoint union \(A=A_1\cup\cdots\cup A_n\) so that \(g_n(x)=f_j(x)\) on \(A_j\). Consequently \[ \int_A g_n\,d\mu = \sum_{j=1}^n \int_{A_j} f_j\, d\mu \le \sum_{j=1}^n \nu(A_j) = \nu(A). \] If we write \(f_0(x) = \sup_n f_n(x)\), then \(f_0(x) = \lim_n g_n(x)\) and (monotone convergence) \(f_0\in\mathcal K\) and \(\int_\Omega f_0\,d\mu=\alpha\). Since \(f_0\) is integrable we can find a finite-valued function \(f=f_0\) a.e. \((\mu)\). Then \(f\) is our guy! We show that if \[ \nu_0(A) = \nu(A) - \int_A f\,d\mu \] then \(\nu_0\) is identically zero. It is non-negative (definition of \(\mathcal K\)), and if it is not zero there are \(\epsilon>0\) and a set \(B\) of positive \(\mu\)-measure so that \(\epsilon\mu(A\cap B)\le \nu_0(A\cap B)\) for all \(A\) (take \(B\) a positive set in the Hahn decomposition for \(\nu_0-\epsilon\mu\)). But then \(g=f+\epsilon 1_B>f\) is in \(\mathcal K\), contradicting the maximality of \(\int f\,d\mu\).
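On a finite space the Radon-Nikodým derivative is just a pointwise ratio, which makes the theorem easy to sanity-check numerically. A sketch with made-up measures \(\mu,\nu\) satisfying \(\nu\ll\mu\):

```python
import numpy as np

# Radon-Nikodym derivative on a finite space: f = dnu/dmu is the pointwise
# ratio, and nu(A) = sum_{x in A} f(x) mu(x) for every subset A.
mu = np.array([0.5, 0.25, 0.25, 0.0])
nu = np.array([0.1, 0.6, 0.3, 0.0])   # nu(x) = 0 wherever mu(x) = 0
f = np.divide(nu, mu, out=np.zeros_like(nu), where=mu > 0)  # dnu/dmu

A = [0, 2]                            # an arbitrary measurable set
assert np.isclose(nu[A].sum(), (f[A] * mu[A]).sum())
print(f)  # [0.2, 2.4, 1.2, 0.0]
```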
8.3 Riesz Representation Theorem
The Riesz Representation Theorem for measure spaces provides a powerful result that connects linear functionals on spaces of continuous functions to measures. There are two main forms of this theorem, one for compact Hausdorff spaces and one for locally compact Hausdorff spaces. Here, we will focus on the form relevant to compact Hausdorff spaces, which is often the more straightforward case.
Theorem 2 (Riesz Representation Theorem for Compact Hausdorff Spaces) Let \(X\) be a compact Hausdorff space, and let \(C(X)\) denote the space of continuous real-valued functions on \(X\). For every positive linear functional \(L\) on \(C(X)\), there exists a unique regular Borel measure \(\mu\) on \(X\) such that for all \(f \in C(X)\): \[ L(f) = \int_X f \, d\mu \] (A general continuous linear functional is represented in the same way by a regular signed Borel measure.)
The measure \(\mu\) is a regular Borel measure: it is defined on the Borel \(\sigma\)-algebra of \(X\), and the measure of any Borel set can be approximated from within by compact subsets and from above by open supersets.
Sketch proof. (See also Rudin (1987), theorem 2.14, over seven pages!)
For each open set \(V\subseteq X\), define \[ \mu(V) = \sup \{ L(f) : f \in C(X),\ 0 \le f \le 1,\ \operatorname{supp} f \subset V \}, \] and for an arbitrary subset \(A\) define the outer measure \[ \mu^*(A) = \inf \{ \mu(V) : A \subseteq V,\ V \text{ open} \}. \] Show that \(\mu^*\) is countably subadditive, hence an outer measure, and that every Borel set is \(\mu^*\)-measurable in the Carathéodory sense, so that \(\mu^*\) restricts to a Borel measure \(\mu\). Regularity is built into the construction: approximation from above by open sets is the definition, and approximation from below by compact sets follows from compactness of \(X\) and Urysohn’s lemma. Finally, \(L(f) = \int_X f\,d\mu\) is proved by slicing the range of \(f\) into thin levels and sandwiching \(f\) between simple functions built from a partition of unity subordinate to open sets on which \(f\) is nearly constant.
Uniqueness of the Measure: Suppose two measures \(\mu_1\) and \(\mu_2\) both represent \(L\). For compact \(K\), Urysohn’s lemma gives continuous \(f_n \downarrow 1_K\), so \(\mu_1(K) = \lim_n L(f_n) = \mu_2(K)\) by dominated convergence; by regularity the two measures then agree on all Borel sets.
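Two sanity-check instances of the correspondence, standard examples with \(X=[0,1]\), \(x_0\in X\), and \(w\ge 0\) continuous: \[ L(f) = f(x_0) \iff \mu = \delta_{x_0}, \qquad L(f) = \int_0^1 f(x)\,w(x)\,dx \iff \mu(dx) = w(x)\,dx. \]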
8.4 Borel and Polish Spaces
From Rogers and Williams (1994).
A topological space \(S\) is a Polish space if the topology on \(S\) arises from a metric with respect to which \(S\) is complete and separable. \(S\) is a Lusin space if \(S\) is homeomorphic to a Borel subset of a compact metric space \(J\) (which can be taken to be \([0,1]^\infty\)).
All probabilities on a Lusin space are inner regular (the measure of any set is the \(\sup\) of the measures of its compact subsets).
\(S\) is Polish iff it is homeomorphic to a \(G_\delta\) subset of a compact metric space. Every Polish space is a Lusin space.
8.5 Sufficient Statistics
A sufficient statistic is a function of the data that provides as much information about a parameter of interest as the entire data set does. More formally, for a given parameter \(\theta\) and data set \(X\), a statistic \(T(X)\) is said to be sufficient for \(\theta\) if the conditional distribution of the data \(X\) given \(T(X)\) does not depend on \(\theta\).
Given a random sample \(X = (X_1, X_2, \ldots, X_n)\) from a probability distribution with parameter \(\theta\), a statistic \(T(X)\) is sufficient for \(\theta\) if the conditional distribution of the sample given the statistic is independent of \(\theta\). Formally, \(T(X)\) is sufficient for \(\theta\) if: \[ P(X \mid T(X), \theta) = P(X \mid T(X)) \]
This can also be understood via the Factorization Theorem. A statistic \(T(X)\) is sufficient for the parameter \(\theta\) if and only if the joint probability (or probability density) function of \(X\) can be factored as: \[ f(X \mid \theta) = g(T(X) \mid \theta)h(X) \] where \(g\) is a function that depends on the data \(X\) only through the statistic \(T(X)\), and \(h\) is a function that does not depend on \(\theta\).
Example: Binomial Distribution
Consider a random sample \(X_1, X_2, \ldots, X_n\) from a binomial distribution with parameters \(n\) (number of trials) and \(p\) (probability of success). The probability mass function is: \[ f(x_i \mid p) = \binom{n}{x_i} p^{x_i} (1 - p)^{n - x_i} \]
For a sample of size \(k\), the joint probability function is: \[ f(x_1, x_2, \ldots, x_k \mid p) = \prod_{i=1}^{k} \binom{n}{x_i} p^{x_i} (1 - p)^{n - x_i} \]
This can be rewritten as: \[ f(x_1, x_2, \ldots, x_k \mid p) = \left[ p^{\sum_{i=1}^{k} x_i} (1 - p)^{kn - \sum_{i=1}^{k} x_i} \right] \left[ \prod_{i=1}^{k} \binom{n}{x_i} \right] \]
Here, \(\sum_{i=1}^{k} x_i\) is a sufficient statistic for \(p\).
Example: Poisson Distribution
A Poisson distribution with parameter \(\lambda\) (rate parameter) has the probability mass function: \[ P(X = x\mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \] for \(x = 0, 1, 2, \ldots\).
Consider a random sample \(X_1, X_2, \ldots, X_n\) from a Poisson distribution with parameter \(\lambda\). The joint probability mass function of the sample is: \[ P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\mid \lambda) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} \]
We can rewrite the joint probability mass function as \[ P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\mid \lambda) = \left( \prod_{i=1}^{n} \frac{1}{x_i!} \right) \lambda^{\sum_{i=1}^{n} x_i} e^{-n\lambda} \]
In this factorized form, we have
- \(g(T(X)\mid \lambda) = \lambda^{\sum_{i=1}^{n} x_i} e^{-n\lambda}\)
- \(h(X) = \prod_{i=1}^{n} \frac{1}{x_i!}\)
Here, \(T(X) = \sum_{i=1}^{n} X_i\) is the statistic that appears in \(g(T(X)\mid \lambda)\).
According to the factorization theorem, \(T(X) = \sum_{i=1}^{n} X_i\) is a sufficient statistic for the parameter \(\lambda\). This is because the joint probability mass function can be written in the form: \[ f(X|\lambda) = g(T(X)|\lambda)h(X) \] where: \[ g(T(X)|\lambda) = \lambda^{\sum_{i=1}^n x_i} e^{-n\lambda} \] and \[ h(X) = \prod_{i=1}^n \frac{1}{x_i!}. \]
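A small Monte Carlo sketch of what sufficiency of \(T=\sum_i X_i\) means here (the values of \(n\), \(s\), and \(\lambda\) are made up for illustration): the conditional law of the sample given \(T=s\) is the same for very different \(\lambda\). In fact \(X_1\mid T=s\) is \(\mathrm{Binomial}(s, 1/n)\) regardless of \(\lambda\).

```python
import numpy as np

# Sufficiency of T = sum(X_i) for a Poisson(lam) sample: the conditional law
# of X given T = s carries no information about lam. Checked by rejection
# sampling at two different values of lam.
rng = np.random.default_rng(1)
n, s = 3, 6  # sample size and conditioning value of the sufficient statistic

def conditional_first_coordinate(lam, draws=200_000):
    """Empirical law of X_1 given sum(X) == s, under Poisson(lam)."""
    X = rng.poisson(lam, size=(draws, n))
    hits = X[X.sum(axis=1) == s]
    return np.bincount(hits[:, 0], minlength=s + 1) / len(hits)

# Both print Binomial(s, 1/n) up to Monte Carlo error, although the two
# values of lambda differ by a factor of 4.
print(conditional_first_coordinate(0.5))
print(conditional_first_coordinate(2.0))
```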
9 Book treatments
Domain | Treatments |
---|---|
Probability | |
Ash (2014) | |
Billingsley (2017) | |
Bobrowski (2005) | |
Borovkov (2003), Borovkov (2003) | |
Breiman (1992b) | |
Chung (2001) | |
Durrett (2013) | |
Dellacherie and Meyer (1978) | |
Doob (1953), Doob (1984) | |
Dudley (2018) | |
Fristedt and Gray (2013) | |
Gray (2009) | |
Grimmett and Stirzaker (2020a), Grimmett and Stirzaker (2020b) | |
Gut (2006) | |
Hoffman-Jørgensen (2017b), Hoffman-Jørgensen (2017a) | |
Itô (1984) | |
Kallenberg (2001b) | |
Klenke (2014) | |
Loève | |
Malliavin (2012) | |
Moran (1968) | |
Meyer | |
Parthasarathy (1967) | |
Pollard (2002) | |
Rogers and Williams (1994) | |
Shiryaev (1996) | |
Stoyanov (2013) | |
Stromberg (1994) | |
Stroock (2010) | |
Tucker (2013) | |
Williams (1991b) | |
Wise and Hall (1993) | |
Analysis | |
Folland (1999) | |
König (2009) | |
Doob (2012) | |
Tao (2011) | |
Stein and Shakarchi (2009) | |
Simon (2015b) | |
Kelley and Srinivasan (2012) | |
Halmos (1974) | |
Papers | |
Hoffmann-Jørgensen (1971) | |
P. Pfanzagl (1979), J. Pfanzagl (1969) | |
Pachl (1978) | |
Faden (1985) | |
Chang and Pollard (1997) | |
Domain, author | Spaces | Sigma Alg | Subsets | RVs | Probabilities |
---|---|---|---|---|---|
Probability | |||||
Hoffmann-J | \(x\in\Omega,L, t\in M\) | \(\F,\A,\B\) | \(A, B\) | \(S, T\) | \(\P,\mu,P^T_S(\cdot\mid\cdot)\) |
Kallenberg | \(\Omega\) | \(\psi\) | |||
Billingsley | \(\Omega\) | \(\psi\) | |||
Shiryaev | \(\Omega\) | \(\psi\) | |||
Pollard | \(\Omega\) probability, \(\mathcal X\) metric(izable) | | | | \(\lambda,\mu\) |
Analysis | |||||
Halmos | \(X,Y\) | | \(E, F\) | \(f\) | n/a? |
Statistics | |||||
Generic | \(\Omega\) | ? | \(X\) | \(X,Y,T\) | |
Item | Conditional Probability | Regular Conditional Probability | (Regular) Conditional Probability Distribution | Disintegration |
---|---|---|---|---|
Abbreviation | cp | rcp | cpd | dis |
Ingredients | \(A\in\F\), \(\G\subset\F\) | Same as cp | Same as cp, plus \(T:(\Omega,\F)\to (E,\EE)\) | \(T\), \(Q=T\P\) |
Results per Shiryaev (Sh) | A \(\G\)-measurable function \(\P(A\mid \G):\Omega\to [0,1]\) a.e. so that \(\forall B\in\G\), \(\P(B\cap A)=\int_B \P(A\mid\G)\,d\P\) | \(p:\Omega\times\F\to [0,1]\) s.t. (i) \(\forall A\in\F\), \(p(\cdot, A)\) is a version of \(\P(A\mid\G)(\cdot)\), and (ii) \(\forall\omega\), \(p(\omega, \cdot)\) is a probability measure on \(\F\) | \(p:\Omega\times\EE\to [0,1]\) s.t. (i) \(\forall A\in\EE\), \(p(\cdot, A)\) is a version of \(\P(T\in A\mid\G)(\cdot)\), and (ii) \(\forall\omega\), \(p(\omega, \cdot)\) is a probability measure on \(\EE\) | |
Existence | Always (Radon-Nikodým theorem) | | | |
Books | | | | |
Billingsley, Sec 33 | ⁓ | As Sh for \(X\) a rv | | |
Breiman 4.2, 4.3 | As Sh; shows \(\E[X\mid\G](\omega)=\int X(\nu)\,p(\omega, d\nu)\), i.e., conditional expectation is taken wrt the rcp (Sh note 16 above) | As Sh, notation \(\mathcal D\subset\F\); \((E, \EE)\) a Borel space; same as Sh, works via rdf | | |
Dudley | Sec 10.1 | In (ii) can replace every with a.e. \(\omega\): if it holds a.e. then pick any prob. meas. on the set of measure zero where it fails; this does not affect (i) | A cpd with \(T\) the identity on \(E=\Omega\) is a rcp | |
Halmos | Sec 48 | ⁓ | | |
Hoffmann-Jørgensen | | | | |
Kallenberg Ch 6, Thm 6.3, 6.4 | CHECK! | In (i) requires measurability in the definition of a kernel. Adds a random element \(X\) with values in \((S,\S)\); the rcd is a random measure, aka kernel from \(E\) to \(S\), \(p(t, B)=\P(X\in B\mid T):E\times \S\to [0,1]\); \(S\) Borel for existence | | |
Meyer-D v. 1 | | | | |
Rogers & Williams (II.42, 89) | \(\P\) measurable, integrable | As Sh | In Sec 42 (ii) only requires \(p(\omega, \cdot)\) to be a measure for a.e. \(\omega\). Existence: \(\Omega\) a Lusin space, and proves \(p\) is a measure \(\forall\omega\); in addition, if \(\G\) is countably generated then (iii) \(\{ \omega\mid p(\omega, A)=1_A(\omega)\ \forall A\in\G \}\) is a \(\G\)-set of \(\P\)-measure 1; (iv) if \(A(\omega)\) denotes the atom of \(\G\) containing \(\omega\) then \(\{\omega\mid p(\omega, A(\omega))=1 \}\) is in \(\G\) and has \(\P\)-measure 1 | |
Shiryaev (II.7) | \(\Omega\) is a Borel space (isomorphic to a Borel subset of a complete, separable metric space with the Borel sigma-algebra CHECK) | \((E,\EE)\) a Borel space; if \(T\) is a random variable, \(F:\Omega\times \RR\to[0,1]\) is an rdf if (i) \(\forall x\), \(F(\cdot, x)=\P(T\le x\mid \G)(\cdot)\) a.s. and (ii) \(\forall\omega\), \(F(\omega, \cdot)\) is a distribution function on \(\RR\); prepends “regular” | | |
10 Abbreviations
Abbreviation | Meaning | Characteristics |
---|---|---|
ce | conditional expectations | integral of \(X\) and \(\E[X\mid\G]\) over \(\G\)-sets agree, unique a.e.; produces a different function for each \(X\) from RNT |
cp | conditional probability | any version of \(\E[1_B\mid \G]\), \(B\in\F\); produces different function for each \(B\in\F\) from RNT |
rcp | regular conditional probability | a single function on \(\Omega\times\F\) that is a measure \(\forall\omega\) and a.e. equal to the cp for each \(B\) |
(r)cd | regular conditional distribution | synonyms and specializations |
rdf | regular distribution function | |
dis | disintegration | decomposition |