What Did Bayes Really Say? — Propositions 1 – 7

© 19 June 2022 by Michael A. Kohn


Introduction | Problem and Definitions | Propositions 1 – 7 | Bayes’s Billiards | Endnotes | References

Propositions 1 – 7

Comment: It is interesting to note that Bayes covers the now standard probability axioms either implicitly or explicitly.

Prop 1 [First Axiom: Addition Rule for Disjoint Events]

Comment: The first part of Bayes’s first proposition, which I will call the “addition rule for disjoint events”, is that the probability of one or the other of several mutually exclusive events is the sum of their individual probabilities. He introduces three disjoint (“inconsistent”) events A (1st event), B (2nd event), and C (3rd event), each of which results in receiving the value N, with expected values \textbf{a}, \textbf{b}, and \textbf{c}, respectively. Per Definition 5, he defines probability as the ratio of expected value to the amount to be received: P(A) = \textbf{a}/N, P(B) = \textbf{b}/N, and P(C) = \textbf{c}/N.

Original Text

When several events are inconsistent the probability of the happening of one or other of them is the sum of the probabilities of each of them.

Suppose there be three such [inconsistent] events, and whichever of them happens I am to receive N, and that the probability of the 1st, 2nd, and 3rd are respectively a/N, b/N, c/N. Then (by the definition of probability) the value of my expectation from the 1st will be a, from the 2nd b, and from the 3rd c. Wherefore the value of my expectations from all three will be a + b + c. But the sum of my expectations from all three is in this case an expectation of receiving N upon the happening of one or other of them. Wherefore (by definition 5) the probability of one or other of them is (a + b + c)/N or a/N + b/N + c/N. The sum of the probabilities of each of them.

Modern Equivalent

If A_{1}, A_{2}, …, A_{n} are disjoint events, then

$$P\left(\bigcup\limits_{j=1}^{n} A_{j} \right) =\sum_{j=1}^{n}P(A_{j}) $$

Saying that these events are disjoint means that they are mutually exclusive: A_{i} \cap A_{j} = \emptyset for i \neq j. Bayes used “inconsistent” instead of “disjoint”.

Comment: I call this the addition rule for disjoint events to distinguish it from the complement rule (below), which Jaynes (p. 33), at least, refers to as “the sum rule”. Although conventional expositions present the addition rule for disjoint events as an axiom, Jaynes (p. 38) deduces it from “simple qualitative conditions of consistency”.
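As a quick numerical illustration of the addition rule (a sketch of mine, not anything in the essay), the following Python snippet estimates, by simulation, the probability that a fair die shows 1, 2, or 3 and compares it with the sum of the three individual probabilities. The die and the Monte Carlo check are assumptions chosen only to make three disjoint events concrete.

```python
import random

random.seed(1)
TRIALS = 200_000

# Three mutually exclusive ("inconsistent") events on one roll of a fair die:
# A = {1}, B = {2}, C = {3}.  They cannot happen together, so the probability
# that one or other of them happens should equal P(A) + P(B) + P(C) = 1/2.
union_count = 0
singleton_counts = {1: 0, 2: 0, 3: 0}

for _ in range(TRIALS):
    roll = random.randint(1, 6)
    if roll in (1, 2, 3):
        union_count += 1
        singleton_counts[roll] += 1

p_union = union_count / TRIALS
p_sum = sum(count / TRIALS for count in singleton_counts.values())

print(f"P(A or B or C) estimated directly:   {p_union:.4f}")  # ~0.5
print(f"Sum of the individual probabilities: {p_sum:.4f}")    # ~0.5
```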

Prop. 1 (continued) [Second Axiom: \textbf{P(S) = 1}]

Comment: The next part of Prop. 1 is that at least one of all possible events must occur, and therefore the union of all possible events has probability 1.

Original Text

Corollary. If it be certain that one or other of the three events must happen, then a + b + c = N. For in this case all the expectations together amounting to a certain expectation of receiving N, their values together must be equal to N.

Modern Equivalent

If S is the sample space, which contains all possible outcomes, then P(S) = 1.

Comment: It is awkward that Bayes presents this using expectations instead of probabilities, but, as noted above, P(A) = \textbf{a}/N, P(B) = \textbf{b}/N, and P(C) = \textbf{c}/N, so P(A) + P(B) + P(C) = 1. This identifies certainty with a probability of 1.

Implicit in the above axioms is that the probability of the empty set \emptyset is 0: since S and \emptyset are disjoint and S \cup \emptyset = S, the addition rule gives P(S) = P(S) + P(\emptyset), and therefore

$$P(\emptyset) = 0 $$

Prop 2 [Complement Rule]

Again, Bayes’s exposition parallels what we now see in probability textbooks, which introduce the complement rule right after the axioms and then move to the definition of conditional probability and the multiplication rule.

Original Text

And from hence it is plain that the probability of an event added to the probability of its failure (or of its contrary) is the ratio of equality. If a person has an expectation depending on the happening of an event, the probability of the event is to the probability of its failure as his loss if it fails to his gain if it happens.

Modern Equivalent

If A and A^{c} are complementary events (i.e., A^{c} = Not(A)), then

$$P(A^{c}) = 1 - P(A)$$

Prop 3 [Multiplication Rule]

Original Text

The probability that two subsequent events will both happen is a ratio compounded of the probability of the 1st, and the probability of the 2nd on supposition the 1st happens.

Modern Equivalent
$$P(A \cap B) = P(A)P(B|A) $$

Comment: This is the multiplication rule. Bayes presents this first and then the definition of conditional probability as a corollary. Most modern textbooks (again, except for Jaynes) present them the other way around.

Original Text

COROLLARY. Hence if of two subsequent events the probability of the 1st be a/N, and the probability of both together be P/N, then the probability of the 2nd on supposition the 1st happens is P/a.

Modern Equivalent

This is the definition of conditional probability. If A and B are any two events in the sample space S and P(A) \neq 0, then

$$P(B|A) = \frac{P(A \cap B)}{P(A)}$$

Comment: Bayes presents a temporal sequence with A being “determined” (occurring or failing to occur) before B. This is important in Props 4 and 5.
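As a concrete check on the multiplication rule and its corollary (my example, not Bayes’s), take two cards drawn without replacement from a standard 52-card deck, with A = “the first card is a heart” and B = “the second card is a heart”:

```python
from fractions import Fraction

# A = first card is a heart, B = second card is a heart,
# drawing two cards without replacement from a 52-card deck (13 hearts).
p_A = Fraction(13, 52)           # P(A)
p_B_given_A = Fraction(12, 51)   # P(B|A): one heart already gone
p_A_and_B = p_A * p_B_given_A    # multiplication rule: P(A ∩ B) = P(A) P(B|A)

print(p_A_and_B)                        # 1/17
# Corollary (definition of conditional probability): recover P(B|A).
print(p_A_and_B / p_A == p_B_given_A)   # True
```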

Prop 4 [One that doesn’t fit in]

Comment: This cryptic proposition with its even more cryptic footnote covers 2 pages of the published essay. It does not have a modern textbook equivalent. Here is the first sentence:

Original Text

If there be two subsequent events to be determined every day, and each day the probability of the 2nd is b/N and the probability of both P/N, and I am to receive N if both the events happen the first day on which the 2nd does; I say, according to these conditions, the probability of my obtaining N is P/b.

Modern Equivalent

Assume A and B are two events that can occur daily. I am to receive N if, on the first day that B occurs, A also occurs. Let W be the event of receiving N, i.e., the event of A occurring on the first day that B occurs. P(W) = P( A \cap B)/P(B).

Comment: Here is one explanation of Prop. 4. (Endnote #3 gives an alternative explanation.) Let E(W) be the expected value of this situation, which I refer to as a “game”. On each day, there are four possible outcomes: A \cap B, A^c \cap B, A \cap B^c, A^c \cap B^c. On Day 1, if both A and B occur (A \cap B), I receive N and the game is over. If B occurs but A doesn’t (A^c \cap B), I receive 0 and the game is over. If B doesn’t occur (A \cap B^c or A^c \cap B^c), then I’m back to where I started. Bayes refers to this as “being reinstated in my former circumstances”; implicitly, each day is an independent repetition with the same probabilities, so the expected value going forward is again E(W). This translates into the following equation for the expected value E(W):

$$ E(W) = P(A \cap B) \times N + P(A^c \cap B) \times 0 + (1 - P(B)) \times E(W) $$

Let P(B) = b/N and P( A \cap B) = \textbf{P}/N.

\begin{align*} E(W) &= (\textbf{P}/N) \times N + 0 + (1 - b/N) \times E(W) \\ E(W) - (1 - b/N) E(W) &= \textbf{P}\\ \frac{bE(W)}{N} &= \textbf{P}\\ E(W) &= \frac{\textbf{P}N}{b} \end{align*}

Bayes defines the probability of receiving N as the ratio of E(W) to N, so if P(W) is the probability of receiving N,

$$P(W) = \frac{\textbf{P}}{b} $$
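Prop 4 is easier to accept after simulating the “game”. The sketch below is mine: it assumes an arbitrary joint distribution for A and B and independent days, plays the game many times, and compares the fraction of games won with P(A \cap B)/P(B).

```python
import random

random.seed(2)

# An arbitrary joint distribution over a single day's relevant outcomes;
# days are assumed to be independent repetitions with the same probabilities.
p_A_and_B = 0.12    # P(A ∩ B): both happen, I receive N
p_B_only = 0.18     # P(A^c ∩ B): B happens without A, I receive 0
                    # with probability 0.70, B does not happen and the game continues
p_B = p_A_and_B + p_B_only

def play_game() -> bool:
    """Return True if A happens on the first day that B happens."""
    while True:
        u = random.random()
        if u < p_A_and_B:
            return True     # win N
        if u < p_A_and_B + p_B_only:
            return False    # B happened without A
        # otherwise: "reinstated in my former circumstances"; play another day

GAMES = 100_000
wins = sum(play_game() for _ in range(GAMES))

print(f"Simulated P(W):  {wins / GAMES:.4f}")
print(f"P(A ∩ B)/P(B):   {p_A_and_B / p_B:.4f}")   # 0.12 / 0.30 = 0.4
```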

Prop 5 [Same as Prop 3 switching A and B]

Comment: Recall that Bayes thought of event A as being “determined” (occurring or failing to occur) before event B. By repeating Prop 3 and switching A and B, he gives us the probability that the earlier event A occurred based only on knowing whether the later event B occurred. This is one step away from what we now call Bayes’s Rule.

Original Text

If there be two subsequent events, the probability of the 2nd b/N and the probability of both together P/N, and it being first discovered that the 2nd event has happened, from hence I guess that the 1st event has also happened, the probability I am in the right is P/b.

Modern Equivalent

If A and B are any two events in the sample space S and P(B) \neq 0, then

$$P(A|B) = \frac{P(B \cap A)}{P(B)}$$

From Prop 3’s multiplication rule, we know that P(B \cap A) = P(B|A)P(A), so

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

Although this is what we think of as Bayes’s Rule, Prop 5 is as close as he comes to saying it. It is just one of 7 propositions in this introductory section on “the general laws of chance” before addressing the problem stated at the beginning: inferring the probability of a binary event by observing the number of times it happened and failed to happen.

Since P(B) = P(B|A)P(A) + P(B|A^c)P(A^c), we could also write

$$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^c)P(A^c)} .$$
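A worked example may help here. The numbers below are my own hypothetical diagnostic-test illustration, not anything in the essay: A is “disease present” with prior P(A) = 0.01, and B is a positive test with P(B|A) = 0.90 and P(B|A^c) = 0.05.

```python
# Hypothetical numbers chosen only to illustrate the formula above.
p_A = 0.01            # P(A): prior probability of disease
p_B_given_A = 0.90    # P(B|A): probability of a positive test given disease
p_B_given_Ac = 0.05   # P(B|A^c): probability of a positive test given no disease

# Denominator by the law of total probability, then Bayes's rule.
p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)
p_A_given_B = p_B_given_A * p_A / p_B

print(f"P(B)   = {p_B:.4f}")          # 0.0585
print(f"P(A|B) = {p_A_given_B:.4f}")  # 0.1538
```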

I prefer the odds form of Bayes’s rule, which is derived as follows:

\begin{align*} P(A \cap B) &= P(B \cap A)\\ P(A|B)P(B) &= P(B|A)P(A) \\ \end{align*}

Similarly,

\begin{align*} P(A^c \cap B) &= P(B \cap A^c)\\ P(A^c|B)P(B) &= P(B|A^c)P(A^c) \\ \end{align*}

Dividing,

\begin{align*} \frac{P(A|B)P(B)}{P(A^c|B)P(B)} &= \frac{P(B|A)}{P(B|A^c)}\frac{P(A)}{P(A^c)} \\ \frac{P(A|B)}{P(A^c|B)} &= \frac{P(B|A)}{P(B|A^c)}\frac{P(A)}{P(A^c)} \\ \end{align*}

Some terminology:

\begin{align*} \frac{P(A)}{P(A^c)} &= Odds(A) = \text{prior odds}\\ \frac{P(A|B)}{P(A^c|B)} &= Odds(A|B) = \text{posterior odds}\\ \frac{P(B|A)}{P(B|A^c)} &= LR_A(B) = \text{likelihood ratio for } A \text{ of } B\\ \end{align*}

So,

\begin{align*} \frac{P(A|B)}{P(A^c|B)} &= \frac{P(B|A)}{P(B|A^c)}\frac{P(A)}{P(A^c)} \\ Odds(A|B) &= LR_A(B) \times Odds(A)\\ \text{posterior odds} &= \text{likelihood ratio} \times \text{prior odds}\\ \end{align*}

According to Dale (see references), John Maynard Keynes presented this as Bayes’s Rule in his Treatise on Probability (1921). It nicely displays “what we think now” (posterior odds) as the product of “what we thought before” (prior odds) and “what we learned” (likelihood ratio). (For more on the odds form of Bayes’s Rule, see Endnote #4.)
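With the same hypothetical test numbers as in the snippet above, the odds form gives the identical answer, which is a useful check on the derivation: prior odds of 0.01/0.99, a likelihood ratio of 0.90/0.05 = 18, and posterior odds that convert back to P(A|B) of about 0.154.

```python
# Same hypothetical numbers as in the previous snippet.
p_A, p_B_given_A, p_B_given_Ac = 0.01, 0.90, 0.05

prior_odds = p_A / (1 - p_A)                    # Odds(A)
likelihood_ratio = p_B_given_A / p_B_given_Ac   # LR_A(B)
posterior_odds = likelihood_ratio * prior_odds  # Odds(A|B) = LR × prior odds

# Convert odds back to a probability: p = odds / (1 + odds).
p_A_given_B = posterior_odds / (1 + posterior_odds)

print(f"prior odds       = {prior_odds:.4f}")        # 0.0101
print(f"likelihood ratio = {likelihood_ratio:.1f}")  # 18.0
print(f"posterior odds   = {posterior_odds:.4f}")    # 0.1818
print(f"P(A|B)           = {p_A_given_B:.4f}")       # 0.1538, matching the probability form
```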

Prop 6 [Product Rule for Independent Events]

Comment: Going from axioms to the complement rule to the multiplication rule to Bayes’s Rule to the product rule for independent events (below) is the way we often present probability theory today. Bayes differs by including Prop. 4 and by never stating the rule that bears his name.

Original Text

The probability that several independent events shall all happen is a ratio compounded of the probabilities of each.

Modern Equivalent

If events A and B are independent,

$$P(A \cap B) = P(A)P(B)$$

Original Text

If there be several independent events, and the probability of each one be a, and that of its failing be b, the probability that the 1st happens and the 2nd fails, and the 3rd fails and the 4th happens, etc. will be abba, etc. For, according to the algebraic way of notation, if a denote any ratio and b another, abba denotes the ratio compounded of the ratios a, b, b, a. This corollary therefore is only a particular case of the foregoing.

Modern Equivalent

In a sequence of independent binary trials with success probability p and failure probability q = 1 - p, the probability of a sequence of successes and failures is given by multiplying all the individual probabilities. If the 1st succeeds, the 2nd fails, the 3rd fails, and the 4th succeeds, the probability of the sequence will be pqqp. This is a specific example of the general definition of independence.
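A quick simulation (mine; the success probability 0.3 and the particular pattern are arbitrary choices) checks this: the observed frequency of the pattern success, fail, fail, success over many sets of four independent trials should approach pqqp.

```python
import random

random.seed(3)
p = 0.3          # success probability for each independent trial
q = 1 - p        # failure probability
TRIALS = 200_000

target = (True, False, False, True)   # success, fail, fail, success
hits = 0
for _ in range(TRIALS):
    outcome = tuple(random.random() < p for _ in range(4))
    if outcome == target:
        hits += 1

print(f"Simulated probability: {hits / TRIALS:.4f}")
print(f"p*q*q*p:               {p * q * q * p:.4f}")   # 0.3*0.7*0.7*0.3 = 0.0441
```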

Comment: By introducing an independent event that can either occur or fail with a given probability, Bayes has covered what we now call the Bernoulli distribution, so it is natural for him to proceed to the binomial distribution.

Prop 7 [Binomial Distribution]

Original Text

If the probability of an event be a, and that of its failure be b in each single trial, the probability of its happening p times, and failing q times in p + q trials is Ea^{p}b^{q} if E be the coefficient of the term in which occurs a^{p}b^{q} when the binomial (a + b)^{p+q} is expanded.

Comment: Here, Bayes uses a for the probability of success and b = 1-a for the probability of failure. Later, he will use x and r = 1-x. Today, we commonly use p and q = 1 - p, but as I mentioned in the introduction, Bayes uses p for the number of successes and q for the number of failures. Instead of Bayes’s a and b, his later x and r, or our common p and q, I will use \theta and \gamma = 1 -\theta for the probabilities of success and failure, and I will use k and n-k for the number of successes and failures. Also as mentioned in the introduction, I use \binom{n}{k} as “the coefficient of the term in which occurs a^{k}b^{n-k} when the binomial (a + b)^{n} is expanded.”

Modern Equivalent

In a sequence of n independent binary trials with success probability \theta and failure probability \gamma = 1 - \theta, the probability of k successes and n - k failures is given as follows:

$$ P(k; n, \theta) = \binom{n}{k}\theta^{k}\gamma^{n-k}$$

Comment: This is the probability mass function (PMF) of the binomial distribution.

$$ BinomPMF(k; n, \theta) = \binom{n}{k}\theta^{k}(1-\theta)^{n-k}$$

Bayes does not discuss the cumulative distribution function (CDF) for the binomial distribution, but it’s just the sum of all the probabilities up to k.

\begin{align*} P(K \leq k ; n, \theta) &= \sum_{i=0}^k \binom{n}{i}\theta^{i}(1-\theta)^{n-i}\\ BinomCDF(k; n,\theta) &= \sum_{i=0}^k \binom{n}{i}\theta^{i}(1-\theta)^{n-i}\\ \end{align*}

We will need this further on.
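For readers who want to compute these, here is a minimal sketch (mine, using only the Python standard library; math.comb supplies the binomial coefficient \binom{n}{k}):

```python
from math import comb

def binom_pmf(k: int, n: int, theta: float) -> float:
    """P(exactly k successes in n independent trials with success probability theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def binom_cdf(k: int, n: int, theta: float) -> float:
    """P(at most k successes): the PMF summed from 0 through k."""
    return sum(binom_pmf(i, n, theta) for i in range(k + 1))

# Example: 10 trials with success probability 0.4.
print(f"{binom_pmf(3, 10, 0.4):.4f}")  # 0.2150
print(f"{binom_cdf(3, 10, 0.4):.4f}")  # 0.3823
```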

The first section of the essay ends with the binomial distribution. The language seems stilted to modern readers. However, except for Prop 4, this first section is reasonably clear as what Price calls “a brief demonstration of the general laws of chance”. According to Price, this “brief demonstration” may not have been available elsewhere, although clearly the material was already known and not being presented for the first time. The second section starts with the “billiards” table.

(next) Bayes’s “Billiards”
