Conditional Probability
0.1 Nutshell Summary
Let
If
If
Conditional probabilities are simply conditional expectations of indicator functions
It is usually possible to combine the functions
- MEAS:
, is a probability measure. - VER:
, the function is a version of .
If the condition in MEAS only holds a.e., then the missing values of
The conditional and regular conditional probabilities are connected via
There is a chain of increasing generality, from working on
It was one of Kolmogorov’s many great ideas to use the Radon-Nikodým theorem (RNT) to define conditional probabilities and conditional expectations (Kolmogorov and Bharucha-Reid 2013). The idea can be applied in several flavors, which we now explore.
0.2 Notation
- is a probability space
- and and is are measure spaces
- is a sub-sigma algebra
- and are measurable subsets
- is an -measurable random variable
- is a -measurable random variable
- is a measurable random element, taking values in
- is a measurable random element, taking values in
- is a measure of defined by
The (Lebesgue) integral used in probability is written in many different ways:
0.3 Conditional Probabilities and Conditional Expectations
We can define a measure
Define
The next table summarizes particular instances of these definitions, illustrating their connection to the fundamental equality
General case | fundamental equality: |
||||
Specific examples | |||||
factorization theorem uniqueness, right |
0.4 Regular Conditional Probability
The function
a -measurable function is a probability measure- The function in item 1 is a version of the function
Notice that item 3 implies the function in item 1 is
When
Conditional expectations can be computed using an rcp
A function satisfying conditions 1 and 2 above is called a probability kernel (or Markov kernel) from
More generally, a second random element
0.5 Disintegrations
In addition, if
0.6 Existence Results
In the general case, an rcd for
Rogers and Williams (1994) II.89.1 works with probabilities rather than distributions (i.e., no
- Existence of an rcd depends on topological properties of
: they essentially limit measurable functions; and - Correct support of the rcd depends on separability of
, which makes the graph measurable and allows application of the disintegration theorem.
Dellacherie and Meyer (1978) also start with a compact metric space and Borel sigma-field. They extend to any separable metrizable space with Borel sigma-field and assume that
Pollard (2002) proves a disintegration theorem on a metric space with its Borel sigma-algebra and a function
Chang and Pollard (1997) Theorem 2 states that if
with density are finite for -a.e. iff is sigma-finite are probabilities for -a.e. iff- If
is sigma-finite then and . For -a.e. the measures are probabilities that give a -disintegration of .
Proof. Let
- If
and then mod and so mod . Equation 1 exactly expresses that is the density(!). - Sigma-finite iff there exists a strictly positive real-valued function with finite integral, so there exists
with , so by assumption exist. If mod , the function mod and so is sigma-finite. Conversely, if for then mod (by 1? why needed), which gives finiteness of mod . - If
mod then Equation 1 shows that . Conversely, let have (by sigma finite assumption). Choose in Equation 1 and using the assumption that gives which implies . Similarly . - …SORT OUT.
Chang and Pollard (1997) appendix proves the existence of the disintegration theorem, mirroring Rogers and Williams (1994). EXPAND.
Breiman (1992a) uses a random element to move the problem to a “nice” space and describes this as “like passing over to representation space” to get rid of a difficulty.
0.7 Blackwell and Proper rcps
Blackwell (1956) - Lusin spaces and three weird examples (Doob’s example).
Blackwell and Ryll-Nardzewski (1963) - everywhere proper cds
Blackwell and Dubins (1975): for countably generated
0.8 Common Notation
Table 1 describes the notation used by different authors for (r)cps, (r)cds, and disintegrations (an rcp with extra assumptions). Here we use the abbreviations regular, conditional, probability, and distribution in various combinations. Rcps apply to
- rc probability
on - rc distributions
conditional distribution of on
Reference | Nomenclature | Disintegration |
---|---|---|
Billingsley (2017) | cd, Thm 33.3 | n/a |
Breiman (1992a) | rcp, 4.3 Def 4.27 rcd for rv |
n/a |
Dellacherie and Meyer (1978) | c “laws” | Ch III.70 w support |
Dudley (2018) | rcp, 10.2; rcd p.342 | n/a |
Hoffman-Jørgensen (2017a) | rcd, II.10.1 | II.10.30 |
Kallenberg (2001a) | rcp Ch.6; rcd |
Thm 6.4 |
Parthasarathy (1967) | rcpd given |
|
Pollard (2002) | rcd, remark w/o | cpd Ch 5.2<4> w support |
Rogers and Williams (1994) | rcp, II.42.3 | see II.89.1 iii, iv |
Williams (1991a) | rcp 9.9 | n/a |
https://stats.stackexchange.com/questions/531705/regular-conditional-distribution-vs-conditional-distribution
If
Concept | Notation | Conditions |
---|---|---|
conditional probability | ||
regular conditional probability defined by |
MEAS: VER |
|
regular conditional probability defined by |
MEAS: VER |
|
(regular) conditional distribution for rv |
MEAS: as rcp VER |
|
(regular) conditional distribution for random element |
MEAS: VER |
|
disintegration as above, |
1 Notation
Pith. These notes follow Hoffman-Jørgensen (2017b). Thus, the general set up is a measure space
Periphery. Probabilists, analysts, set theorists, and statisticians all write about probability, and they tend to use slightly different notation. These differences can make it hard to learn probability.
Compounding these differences, probability needs notation for spaces, sigma-algebras, measures, subsets, and random variables. The one thing everyone agrees that
We mix and match notation for expectations and rcps. Hoffman-Jørgensen’s notation
Always we use
2 Motivation
The general theory of conditional probability builds on the elementary discrete theory, which conditions on sets of positive measure.
S1.6 Let
If
S2. Let
Thus
where
The random variable
S3. Countable decompositions are naturally generated by random variables taking a finite or countably infinite number of different values. If
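The elementary construction in S1 to S3 can be sketched on a finite space. The example below is only an illustration under stated assumptions: the two-dice sample space and the names `omega`, `prob`, `X`, `Y`, `cond_exp`, and `integral` are hypothetical and do not come from the notes. It averages `X` over each level set of `Y` and checks that the result has the same integral as `X` over every set of the decomposition.

```python
from fractions import Fraction

# Hypothetical finite sample space: two fair dice, uniform probability.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
prob = {w: Fraction(1, 36) for w in omega}


def X(w):
    """Random variable to be averaged: the sum of the two dice."""
    return w[0] + w[1]


def Y(w):
    """Conditioning variable: the first die. Its level sets partition omega."""
    return w[0]


def cond_exp(X, Y):
    """Return E[X | Y] as a function on omega, constant on each set {Y = y}."""
    mass, total = {}, {}
    for w in omega:
        y = Y(w)
        mass[y] = mass.get(y, 0) + prob[w]
        total[y] = total.get(y, 0) + X(w) * prob[w]
    return lambda w: total[Y(w)] / mass[Y(w)]


E_X_given_Y = cond_exp(X, Y)


def integral(f, event):
    """Integral of f over an event (a list of sample points) with respect to prob."""
    return sum(f(w) * prob[w] for w in event)


# Defining property: on every set G = {Y = y} the integrals of X and E[X | Y] agree.
for y in range(1, 7):
    G = [w for w in omega if Y(w) == y]
    assert integral(X, G) == integral(E_X_given_Y, G)

print(E_X_given_Y((3, 5)))  # value on the block {Y = 3}: 13/2
```

Exact rational arithmetic is used so that the defining equality can be tested with `==` rather than a floating point tolerance.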
S4. What about conditional probabilities wrt sets of measure zero? These occur naturally when conditioning on the outcome of a continuous random variable.
H1. Think about varying the conditioning set
If
Since the left term is the conditional probability of
given , it is formally possible that as “ shrinks to ” the left term should tend to the conditional probability of given and the right term should tend to the integrand . The RNT is a rigorous substitute for this rather shaky “difference quotient” approach.
He then shows that the collection of functions
H2. Thus
H3. Extending
Notice that Halmos starts with conditional probability by considering
3 Kolmogorov conditional probability
Kolmogorov’s axiomatic treatment defines probability using three axioms: non-negativity, normalization, and countable additivity. Kolmogorov’s axioms apply to a wide range of probabilistic scenarios, including both discrete and continuous cases, allowing for generalization and application in various fields.
3.1 Conditional Expectation wrt Sets of Measure Zero
S5. Let
is -measurable;- for every
We generally write the integral on the left simply as
5.R-W. Rogers and Williams p. 138 say
An experiment has been performed. The only information available to you regarding which sample point
has been chosen is the set of values for every -measurable random variable , or, equivalently, the values for every . Then is regarded as (a.s. equal to) the “expected value of given this information.”
6. Existence. The set function on
3.2 Conditional Probability
S7. The conditional probability of an event
S8. If
S9. These definitions extend those in the finite case.
S10. Conditional probabilities have all the “usual properties”: monotone, Jensen, linear, tower. In addition
- If
then and . - If
is independent of (i.e., independent of for ) then (a.s.). - If
is -measurable then (a.s.). - Bounded convergence, monotone convergence, Fatou, countably additive.
10.1. If
10.2. If
10.3. If the set
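Some of the properties in S10 can be checked mechanically on a finite space. The sketch below assumes a hypothetical three-coin-toss space and sub-sigma algebras generated by the labelling functions `fine` and `coarse` (all names are illustrative). It tests the tower property, the independence rule, and the fact that conditioning leaves a measurable variable unchanged.

```python
from fractions import Fraction
from itertools import product

# Hypothetical sample space: three fair coin tosses, uniform probability.
omega = list(product([0, 1], repeat=3))
prob = {w: Fraction(1, 8) for w in omega}


def cond_exp(f, label):
    """E[f | sigma(label)]: average f over each level set of `label`."""
    mass, total = {}, {}
    for w in omega:
        k = label(w)
        mass[k] = mass.get(k, 0) + prob[w]
        total[k] = total.get(k, 0) + f(w) * prob[w]
    return lambda w: total[label(w)] / mass[label(w)]


X = lambda w: w[0] + 2 * w[1] + 4 * w[2]   # integrand
fine = lambda w: (w[0], w[1])              # generates the larger sigma-algebra
coarse = lambda w: w[0]                    # generates a sub-sigma-algebra of it

# Tower property: E[ E[X | fine] | coarse ] = E[X | coarse].
inner = cond_exp(X, fine)
lhs = cond_exp(inner, coarse)
rhs = cond_exp(X, coarse)
tower_holds = all(lhs(w) == rhs(w) for w in omega)

# Independence: the third toss is independent of the first two,
# so conditioning on `fine` returns its unconditional mean.
third = lambda w: w[2]
mean_third = sum(third(w) * prob[w] for w in omega)
cond_third = cond_exp(third, fine)
indep_holds = all(cond_third(w) == mean_third for w in omega)

# Measurability: `coarse` is measurable wrt sigma(fine), so it is unchanged.
cond_coarse = cond_exp(coarse, fine)
known_holds = all(cond_coarse(w) == coarse(w) for w in omega)

print(tower_holds, indep_holds, known_holds)
```

On a finite space every null set is empty, so the almost sure qualifiers in the general statements play no role here.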
3.3 Conditional probability and conditional expectation comparison
The next table compares the Kolmogorov RNT approach to conditional probabilities and expectations showing the transition via
Step | Probability | Transition | Expectation |
---|---|---|---|
Given ingredients a sub-sigma algebra and set or rv |
|||
Consider the measure on |
|||
RNT gives function where |
both | ||
and |
|||
Using |
same | ||
Countable additivity is proved by converting back to and appealing to RNT uniqueness. Other properties are handled similarly. |
For disjoint |
For and |
3.4 Expectation Conditional on Another Random Variable
S11. Since the function
for all
Note shorthand notation:
S12. Let
for
S13. The conditional probability of
3.5 The problem of uncountably many measure zero sets
S14. If
However, we cannot assume that
4 Kernels
, and .
A kernel or random measure from
- MF: For fixed
the function is measurable. - MEAS: For each
, is a (probability) measure. In applications it will be a version of conditional probability.
If the measure in 1. is a probability then the kernel is a probability kernel or Markov kernel.
A kernel determines an operator on suitable functions on
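On finite spaces a probability kernel is just a row-stochastic table, and the operator it determines is a weighted sum. The following sketch assumes hypothetical spaces `{0, 1}` and `{"a", "b", "c"}`, and the helper names are illustrative only. Measurability in the first argument is automatic in the finite case, so only the probability-measure condition is checked.

```python
from fractions import Fraction

F = Fraction

# Hypothetical finite spaces: source points {0, 1}, target points {"a", "b", "c"}.
# kernel[x] is the probability measure mu_x on the target space.
kernel = {
    0: {"a": F(1, 2), "b": F(1, 2), "c": F(0)},
    1: {"a": F(1, 4), "b": F(1, 4), "c": F(1, 2)},
}


def is_markov(kernel):
    """Each mu_x must be non-negative with total mass one."""
    return all(
        all(p >= 0 for p in mu.values()) and sum(mu.values()) == 1
        for mu in kernel.values()
    )


def apply_to_function(kernel, f):
    """(Kf)(x) = sum_y f(y) mu_x({y}): functions on the target become functions on the source."""
    return {x: sum(f[y] * p for y, p in mu.items()) for x, mu in kernel.items()}


def apply_to_measure(nu, kernel):
    """(nu K)({y}) = sum_x nu({x}) mu_x({y}): measures on the source become measures on the target."""
    out = {}
    for x, mu in kernel.items():
        for y, p in mu.items():
            out[y] = out.get(y, 0) + nu[x] * p
    return out


f = {"a": 1, "b": 2, "c": 4}
nu = {0: F(1, 3), 1: F(2, 3)}

Kf = apply_to_function(kernel, f)
nuK = apply_to_measure(nu, kernel)

# Duality check: integrating Kf against nu equals integrating f against nu K.
left = sum(Kf[x] * nu[x] for x in nu)
right = sum(f[y] * nuK[y] for y in f)
print(is_markov(kernel), Kf, nuK, left == right)
```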
5 Regular conditional probabilities
5.1 Regular Conditional Probability (rcp)
S15. A function
The next theorem shows why rcps are useful.
16. Theorem. Let
Proof. If
by part 2 of the rcp definition. The general result holds by approximating using simple functions and using the limit properties of conditional expectations.
5.2 Regular Conditional Distributions and Regular Distribution Functions
S17. A regular conditional probability is determined by a probability and a sub-sigma algebra. Often we want to adjust the probability to the distribution of a random variable (retaining the same sub-sigma algebra). The next two definitions handle these cases.
- Let
be a measurable space, a random variable on with values in , and a sub-sigma algebra. A function is a regular conditional distribution (rcd) of wrt if it is a probability kernel from to where for each the function is a version of the conditional probability , i.e.,
- Let
be a random variable[1]. A function is a regular distribution function (rdf) for with respect to if is a distribution function on for each ; a.s. for each .
The distribution function only measures events
S18. Theorem. Let
Proof. It is important that
Here are a few more details to give the flavor. For each rational
where
and therefore if
to ensure right limits and
to ensure appropriate left/right tail behavior. Then define
where
Using
Standard nonsense shows this satisfies 1.i and ii.
Billingsley (2017) (p.460) calls this a conditional distribution of
5.3 Extension to Borel Spaces
S19. A measure space
20. Theorem. Let
Proof. Apply the previous result to the real-valued variable
Br.1. Breiman (1992b) explains that by moving the problem to
5.4 Hoffmann-Jørgensen’s “Useful Rules”
Hoffman-Jørgensen (2017a), p.121 shows conditional expectations generally behave as you expect, provided CONDITION ON
Let
be random elements on with values in , and respectively. so that is -integrable. is -measurable. be a random variable on .
Let
- a probability measure in
for all , - is
-measurable as a function of for all (probability kernel), and - satisfies the rcp condition
for a.e. .
a.
Then, the useful rules say that
provided that we use the rcp to compute conditional expectations in the last formula. Let
In particular, if
The measurability requirement is discussed further in XX and cannot be dropped; see Section 7.2.2.
Proof.
- This is just Equation 2 applied to
. - By Equation 2, this holds for
for , . Then use standard integration arguments. - Given
, set in b. Then so the result follows from the defining property of a conditional expectation. - By definition of rcp,
so d follows from c. - Same from d.
- Same.
- Let
and . Then and , . Hence g follows from e. Notice that you need the diagonal to be measurable in order that is -measurable. - Follows from g setting
.
6 Disintegrations
Given measures
6.1 Pachl Disintegrations
Pachl (1978) defines a disintegration as follows.
Setup.
Measure spaces
Result.
Suppose
, and is -measurable
The family
The definition does not assume
6.2 Hoffmann-Jørgensen Disintegrations
6.2.1 RCP
Setup.
Result.
RCP
is -measurable is a measure on for ?ae is a version of conditional probability
RCP | Disintegration |
---|---|
Graph of function |
6.2.3 Pollard Disintegrations
Pollard (2002) 5.2<4> defines an RCP for
Given a kernel
The disintegration goes the other way. If you start with a measure
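In the discrete case both directions are elementary, which makes the relationship easy to see. The sketch below assumes a hypothetical joint distribution on pairs `(t, x)` and illustrative function names. `disintegrate` splits the joint measure into a marginal and a kernel, and `recombine` rebuilds the joint measure from them.

```python
from fractions import Fraction

F = Fraction

# Hypothetical joint distribution on pairs (t, x), with t the conditioning coordinate.
joint = {
    ("t1", 0): F(1, 8), ("t1", 1): F(3, 8),
    ("t2", 0): F(1, 4), ("t2", 1): F(1, 4),
}


def disintegrate(joint):
    """Split a joint pmf into the marginal of t and a kernel t -> distribution of x."""
    marginal = {}
    for (t, _), p in joint.items():
        marginal[t] = marginal.get(t, 0) + p
    kernel = {t: {} for t in marginal}
    for (t, x), p in joint.items():
        if marginal[t] > 0:             # the kernel is only determined off null sets
            kernel[t][x] = p / marginal[t]
    return marginal, kernel


def recombine(marginal, kernel):
    """Go the other way: build the joint measure from a marginal and a kernel."""
    return {(t, x): marginal[t] * q for t, mu in kernel.items() for x, q in mu.items()}


marginal, kernel = disintegrate(joint)
rebuilt = recombine(marginal, kernel)
print(marginal, kernel["t1"], rebuilt == joint)
```

Each `kernel[t]` lives on the fibre over `t`, which is the discrete counterpart of the support condition. In the general theory this step is exactly where the existence results are needed.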
7 Examples
?Others from Chang and Pollard?
7.1 Duplication Map
Let
7.2 Coco Examples / Diagonal not measurable
7.2.1 Billingsley
Following Billingsley (2017), Section 33.
B.1. There are pathological examples showing that the interpretation of conditional probability in terms of an observer with partial information breaks down in certain cases.
Let
B.2. Switch perspective from fixing
The mathematical definition gives Equation 4; the heuristic considerations lead to Equation 5. Of course, Equation 4 is right and Equation 5 is wrong. The heuristic view breaks down in certain cases but is nonetheless illuminating and cannot, since it does not intervene in proofs, lead to any difficulties.
Now, for each
B.3. If
Thus conditional probabilities behave like probabilities at points of positive probability. That they may not do so at points of probability
causes no problem because individual such points have no effect on the probabilities of sets.
7.2.2 Hoffmann-Jørgensen
Let
Claim.
Since
7.3 Lebesgue measure example
Example (Pachl (1978)). Assume the continuum hypothesis. Put
Note. This is an example of why Simon says the LM sets are evil.
7.4 Doob’s example
7.4.1 Stoyanov
Following Stoyanov (2013).
Let
-a.s. for an arbitrary ;- for each
, is a probability measure on .
If condition (ii) is satisfied and condition (i) holds for all
Let
Define a new
It is easy to check that
Suppose such a probability exists: that is, there is a function, say
Consider the set
Therefore a regular conditional probability need not always exist.
Let us note that in this counterexample the role of the non-measurable set
General results on the existence of regular conditional probabilities can be found in the works of J. Pfanzagl (1969), Blackwell and Dubins (1975) and Faden (1985).
7.4.2 Rogers and Williams
Here is the version from Rogers and Williams (1994), page 141. Notation shift:
The setup is the same modulo pointing out that outer measure 1 is used to confirm the extended measure
Again, suppose an rcp
The detailed argument uses the fact that
Notice that
This example is consistent with the general statement on the existence of rcps which assumes
7.5 Borel’s paradox
Borel’s Paradox is a famous example in the field of probability theory that illustrates the counterintuitive nature of conditional probabilities in certain contexts. The paradox arises when dealing with conditional probabilities on continuous probability spaces.
Consider a uniform distribution on the surface of a sphere. If we condition on a point on the sphere having a particular latitude, the resulting conditional distribution of longitudes is uniform. However, if we instead condition on the point having a specific longitude, the conditional distribution of latitudes is not uniform. This leads to a paradoxical situation where the method of conditioning affects the outcome, despite the symmetry of the problem.
The paradox highlights the importance of carefully defining conditional probabilities in continuous settings and demonstrates that intuitive ideas from discrete probability do not always directly translate to the continuous case.
It is sometimes described in terms of a meteorite strike. Given that it hits on the equator, the distribution of the impact point is uniform. But given that it hits along a line of longitude through London, it is not.
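A small Monte Carlo experiment makes the asymmetry visible. This is a sketch, not a derivation: it replaces the measure-zero events by a thin band around the equator and a thin wedge around one meridian, with the width `eps`, the sample size `N`, and all function names chosen arbitrarily for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible


def uniform_on_sphere():
    """Uniform point on the unit sphere via a normalised Gaussian vector."""
    while True:
        x, y, z = (random.gauss(0.0, 1.0) for _ in range(3))
        r = math.sqrt(x * x + y * y + z * z)
        if r > 0.0:
            return x / r, y / r, z / r


def lat_lon(p):
    """Latitude in [-pi/2, pi/2] and longitude in [-pi, pi]."""
    x, y, z = p
    z = max(-1.0, min(1.0, z))  # guard asin against rounding just outside [-1, 1]
    return math.asin(z), math.atan2(y, x)


N, eps = 200_000, 0.05
near_equator, near_meridian = [], []
for _ in range(N):
    lat, lon = lat_lon(uniform_on_sphere())
    if abs(lat) < eps:
        near_equator.append(lon)      # thin band around latitude 0
    if abs(lon) < eps:
        near_meridian.append(lat)     # thin wedge around longitude 0

# Share of each conditioned sample falling in the arc of half-width pi/6.
arc = math.pi / 6
share_lon = sum(abs(v) < arc for v in near_equator) / len(near_equator)
share_lat = sum(abs(v) < arc for v in near_meridian) / len(near_meridian)

print(f"equator band:   share = {share_lon:.3f}, uniform arc length predicts {1/6:.3f}")
print(f"meridian wedge: share = {share_lat:.3f}, uniform arc length predicts {1/3:.3f}, "
      f"cos-weighted density predicts {math.sin(arc):.3f}")
```

The band gives a longitude share close to the uniform prediction, while the wedge gives a latitude share close to the cosine-weighted value rather than the uniform one. The wedge is wider near the equator than near the poles, and that shape survives in the limit.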
7.6 Other awkward sub sigma algebras
in its completion- Tail sigma algebras (JS)
8 Background theory
8.1 A very short introduction to measure theory
8.1.1 Basic definitions
Generally need a sigma-finite assumption!
A ring
An algebra is a ring that contains
A sigma ring/algebra is closed under countable unions.
Given any class of sets
A measure is a non-negative, countably additive set function taking value
A class
Given a measure
A set
The class
is a measure on a ring with induced measure on and on of measurable sets.
- Every set in
is -measurable - The outer measure is identical to the completion of the extension of
to on
Starting with
8.1.2 Atoms
Hoffman-Jørgensen (2017b) Ch 6.2-3.
Let
There follows a long list of things you expect to be true that are true.
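For a finite set the atoms can be computed directly, which gives a concrete picture of what they are. The sketch below uses a hypothetical six-point universe and two hypothetical generating sets. Two points share an atom exactly when no generator separates them, and the generated sigma-algebra consists of all unions of atoms.

```python
from itertools import combinations


def atoms(universe, generators):
    """Atoms of the sigma-algebra generated by `generators` on a finite universe.

    Two points lie in the same atom iff they belong to exactly the same generators.
    """
    blocks = {}
    for point in universe:
        signature = tuple(point in g for g in generators)
        blocks.setdefault(signature, set()).add(point)
    return [frozenset(b) for b in blocks.values()]


def generated_sigma_algebra(universe, generators):
    """All unions of atoms; on a finite universe this is the generated sigma-algebra."""
    parts = atoms(universe, generators)
    sets = set()
    for r in range(len(parts) + 1):
        for chosen in combinations(parts, r):
            sets.add(frozenset().union(*chosen))
    return sets


universe = range(6)                      # hypothetical six-point space
generators = [{0, 1, 2}, {2, 3}]         # hypothetical generating sets
parts = atoms(universe, generators)
sigma = generated_sigma_algebra(universe, generators)
print(sorted(map(sorted, parts)), len(sigma))  # [[0, 1], [2], [3], [4, 5]] 16
```

With four atoms the sigma-algebra has 2^4 = 16 members, so a finite sigma-algebra always has a power-of-two number of sets.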
8.1.3 Product Measures
The product sigma algebra is generated by products of measurable sets. Sections of a measurable set are measurable.
The class of finite, disjoint unions of rectangles from two rings is itself a ring.
The product sigma-algebra is the smallest sigma-algebra that makes projections measurable.
A rectangle is the intersection of two inverse images of projections. The area of the rectangle
Let
Given measure spaces
The product of two complete measure spaces need not be complete. (Take
passing from Borel to Lebesgue measurable functions is the work of the devil. Don’t even consider it!
8.1.4 The Diagonal
From Hoffman-Jørgensen (2017b), exercises to Chapter 1 and …
Let
Here are the important facts about the diagonal and countably separable spaces.
- Let
be an injective function. If is countably separable then so is .- Obvious
- Let
be a separable metric space. Then is countably separable.- Let
be a countable dense subset. Then countably separates. (For you can use half intervals at the rationals.)
- Let
- (Marczewski function) The
be a sequence of subsets of . Define a function by Then and is injective iff separates points. is countably separable iff there exists an injective measurable function .- (only if,
): use 3, Marczewski function. - (if
): follows from 1 and 2 (applied to ).
- (only if,
is countably separating iff .- (only if,
): if is countably separable then there is an injective measurable function to . Set , . Then , and the diagonal in is measurable by 2. - (if
): if then is in a sigma-algebra generated by countable set (see potted summary). Suppose the elements of do not separate points, so there are where both or neither of which is in each set in . Since the same holds for unions, complements and intersections of sets in , it holds for all sets in . The section is -measurable and contains , and therefore meaning , which is a contradiction. Therefore separates points.
- (only if,
- Suppose
has cardinality strictly greater than . Then the diagonal is not in the product sigma algebra.- There is no injective function to
.
- There is no injective function to
- Let
and be measure spaces, a measure on and be a function so that and . Then- Let
, then .
on so that iff .
has the required properties.
- Let
is a measurable function and suppose is countably separable (iff has a measurable diagonal). Then, the graph of- By assumption, there exists a measurable injection
. Define . Then is measurable (projections are measurable) and , the diagonal in . - Thus measurable graph or diagonal only depends on the range space being countably separable.
- By assumption, there exists a measurable injection
- Let
be a separable metric space, and let be a measure space. Suppose that is a function with continuous for all and -measurable for all . Then is -measurable.- Let
be a countable dense sequence in and set The are disjoint (by construction) for fixed and ( fixed critical) and . The are -measurable and the pointwise limit of measurable functions is measurable.
- Let
The sigma-algebra of countable and co-countable sets on an uncountable
Therefore, if
Interestingly, a sigma-algebra cannot be countably infinite. Suppose
8.2 Radon-Nikodým theorem
From Halmos measure theory, Theorem 31.N.
Theorem 1 (Radon-Nikodým Theorem) If
Proof. Reduce to finite
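On a finite space the theorem reduces to pointwise division, which is a useful sanity check on the statement. The sketch below assumes two hypothetical measures `mu` and `nu` on a four-point space, with `nu` absolutely continuous with respect to `mu`. The function names are illustrative.

```python
from fractions import Fraction

F = Fraction

# Hypothetical finite measure space: mu is the reference measure, nu << mu.
mu = {"a": F(1, 2), "b": F(1, 4), "c": F(1, 4), "d": F(0)}
nu = {"a": F(1, 8), "b": F(3, 8), "c": F(1, 2), "d": F(0)}


def is_absolutely_continuous(nu, mu):
    """nu << mu on a finite space: every mu-null point is nu-null."""
    return all(nu[w] == 0 for w in mu if mu[w] == 0)


def density(nu, mu):
    """A version of d(nu)/d(mu); the value on mu-null points is arbitrary (0 here)."""
    return {w: (nu[w] / mu[w] if mu[w] > 0 else F(0)) for w in mu}


def integrate(f, measure, event):
    """Integral of f over `event` with respect to `measure`."""
    return sum(f[w] * measure[w] for w in event)


assert is_absolutely_continuous(nu, mu)
f = density(nu, mu)

# Defining property: nu(E) equals the integral of f over E with respect to mu.
E = {"b", "c", "d"}
print(f, integrate(f, mu, E) == sum(nu[w] for w in E))
```

The arbitrary value assigned on the `mu`-null point `"d"` is the finite analogue of the density being unique only up to null sets.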
8.3 Riesz Representation Theorem
The Riesz Representation Theorem for measure spaces provides a powerful result that connects linear functionals on spaces of continuous functions to measures. There are two main forms of this theorem, one for compact Hausdorff spaces and one for locally compact Hausdorff spaces. Here, we will focus on the form relevant to compact Hausdorff spaces, which is often the more straightforward case.
Theorem 2 (Riesz Representation Theorem for Compact Hausdorff Spaces) Let
The measure
Sketch proof. (See also Rudin (1987), theorem 2.14, over seven pages!)
For each non-negative continuous function
Show that
Regularity of the Measure: Define an outer measure
Uniqueness of the Measure: Suppose there are two measures
8.4 Borel and Polish Spaces
From Rogers and Williams (1994).
A topological space
All probabilities on a Lusin space are inner regular (equals
8.5 Sufficient Statistics
A sufficient statistic is a function of the data that provides as much information about a parameter of interest as the entire data set does. More formally, for a given parameter
Given a random sample
This can also be understood via the Factorization Theorem. A statistic
Example: Binomial Distribution
Consider a random sample
For a sample of size
This can be rewritten as: \[ f(x_1, x_2, \ldots, x_k \mid p) = \left[ p^{\sum_{i=1}^{k} x_i} (1 - p)^{kn - \sum_{i=1}^{k} x_i} \right] \left[ \prod_{i=1}^{k} \binom{n}{x_i} \right] \]
Here,
Example: Poisson Distribution
A Poisson distribution with parameter
Consider a random sample
We can rewrite the joint probability mass function as
In this factorized form, we have
Here,
According to the factorization theorem,
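Sufficiency of the sum can also be checked numerically from the definition: the conditional law of the sample given the sum should not depend on the parameter. The sketch below assumes a hypothetical sample size `n = 3`, observed sum `t = 4`, and two arbitrary parameter values. It compares the two conditional distributions computed from the joint probability mass function.

```python
import math
from itertools import product


def joint_pmf(xs, lam):
    """Joint pmf of an i.i.d. Poisson(lam) sample xs."""
    return math.prod(math.exp(-lam) * lam ** x / math.factorial(x) for x in xs)


def conditional_given_sum(n, t, lam):
    """P(sample = xs | sum = t) for every xs with that sum, computed from the joint pmf."""
    support = [xs for xs in product(range(t + 1), repeat=n) if sum(xs) == t]
    weights = {xs: joint_pmf(xs, lam) for xs in support}
    total = sum(weights.values())       # this is P(sum = t)
    return {xs: w / total for xs, w in weights.items()}


n, t = 3, 4                              # hypothetical sample size and observed sum
cond_a = conditional_given_sum(n, t, lam=0.7)
cond_b = conditional_given_sum(n, t, lam=5.0)

# Sufficiency: the conditional law given the sum is the same for both parameter values.
max_gap = max(abs(cond_a[xs] - cond_b[xs]) for xs in cond_a)
print(len(cond_a), max_gap < 1e-12)
```

The common conditional law is multinomial with equal cell probabilities, so the parameter has dropped out entirely once the sum is known.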
9 Book treatments
Domain | Treatments |
---|---|
Probability | |
Ash (2014) | |
Billingsley (2017) | |
Bobrowski (2005) | |
Borovkov (2003), Borovkov (2003) | |
Breiman (1992b) | |
Chung (2001) | |
Durrett (2013) | |
Dellacherie and Meyer (1978) | |
Doob (1953), Doob (1984) | |
Dudley (2018) | |
Fristedt and Gray (2013) | |
Gray (2009) | |
Grimmett and Stirzaker (2020a), Grimmett and Stirzaker (2020b) | |
Gut (2006) | |
Hoffman-Jørgensen (2017b), Hoffman-Jørgensen (2017a) | |
Itô (1984) | |
Kallenberg (2001b) | |
Klenke (2014) | |
Loève | |
Malliavin (2012) | |
Moran (1968) | |
Meyer | |
Parthasarathy (1967) | |
Pollard (2002) | |
Rogers and Williams (1994) | |
Shiryaev (1996) | |
Stoyanov (2013) | |
Stromberg (1994) | |
Stroock (2010) | |
Tucker (2013) | |
Williams (1991b) | |
Wise and Hall (1993) | |
Analysis | |
Folland (1999) | |
König (2009) | |
Doob (2012) | |
Tao (2011) | |
Stein and Shakarchi (2009) | |
Simon (2015b) | |
Kelley and Srinivasan (2012) | |
Halmos (1974) | |
Papers | |
Hoffmann-Jørgensen (1971) | |
P. Pfanzagl (1979), J. Pfanzagl (1969) | |
Pachl (1978) | |
Faden (1985) | |
Chang and Pollard (1997) |
Domain, author | Spaces | Sigma Alg | Subsets | RVs | Probabilities |
---|---|---|---|---|---|
Probability | |||||
Hoffmann-J | |||||
Kallenberg | |||||
Billingsley | |||||
Shiryaev | |||||
Pollard | |||||
Analysis | |||||
Halmos | n/a? | ||||
Statistics | |||||
Generic | ? |
Item | Conditional Probability | Regular Conditional Probability | (Regular) Conditional Probability Distribution | Disintegration |
---|---|---|---|---|
Abbreviation | cp | rcp | cpd | dis |
Ingredients | Same as cp | Same as cp, plus |
||
Results per Shiryaev (Sh) | A |
(i) (ii) |
(i) (ii) |
|
Existence | Always, by the Radon-Nikodým theorem | |||
Books | ||||
Billingsley, Sec 33 | ⁓ | As Sh for |
||
Breiman 4.2, 4.3 | As Sh.; shows |
As Sh, notation |
||
Dudley | Sec 10.1 | In (ii) can replace every with a.e. |
A cpd with |
|
Halmos | Sec 48 | ⁓ | ||
Hoffmann-Jørgensen | ||||
Kallenberg Ch 6. Thm 6.3, 6.4 |
CHECK! | In (i) requires measurable in def of kernel. Adds random element |
||
Meyer-D v. 1 | ||||
Rogers & Williams (II.42,89) | As Sh | In Sec 42 (ii) only require Existence: in addition, if (iii) (iv) if |
||
Shiryaev (II.7) | (i) (ii) Prepends regular |
10 Abbreviations
Abbreviation | Meaning | Characteristics |
---|---|---|
ce | conditional expectations | integral of |
cp | conditional probability | any version of |
rcp | regular conditional probability | a single function on |
(r)cd | regular conditional distribution | synonyms and specializations |
rdf | regular distribution function | |
decomposition |