What Did Bayes Really Say? — Introduction

© 19 June 2022 by Michael A. Kohn

Link to the pdf of this article

Introduction | Problem and Definitions | Propositions 1 – 7 | Bayes’s Billiards | Endnotes | References

Introduction

The Reverend Thomas Bayes (1701-1761) is famous for “An Essay Towards Solving A Problem in the Doctrine of Chances”, which was published in the Royal Society of London’s Philosophical Transactions on 23 December 1763, more than two and a half years after Bayes’s death. (If you think I should be forming the possessive of Bayes in some way other than “Bayes’s”, see Endnote #1.) His friend Richard Price found the essay among Bayes’s papers and sent it to the Royal Society along with an introductory letter, footnotes, an abridgment of the latter part of the essay, and an appendix containing numerical examples.

The title of the essay and the first sentence of Richard Price’s introductory letter. John Canton was secretary of the Royal Society of London. F.R.S. means Fellow of the Royal Society.

The essay as originally published is 49 pages — 24 written by Bayes and 25 by Price: the introductory letter (6 pages), the abridged conclusion (4 pages), and the appendix (15 pages). It is difficult to read today because, to us, the 18th-century English seems stilted and the mathematical notation is unfamiliar. So, I have tried to “translate” it into modern language and mathematical notation.

The essay is not focused on what we now call Bayes’s Rule:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

It is about the more specific problem of how to draw inferences about the probability of a binary event by observing the number of times it does and doesn’t happen.
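Before moving on, it may help to see the rule above in action. Here is a minimal Python sketch; the probabilities are arbitrary illustrative values I have chosen, not numbers from the essay:

```python
from fractions import Fraction

# Arbitrary illustrative probabilities (not from the essay):
p_A = Fraction(1, 100)            # P(A), the prior
p_B_given_A = Fraction(9, 10)     # P(B|A)
p_B_given_notA = Fraction(1, 20)  # P(B|not A)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes's Rule: P(A|B) = P(B|A)P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # 2/13
```

Using exact fractions rather than floats keeps the arithmetic transparent: the posterior comes out as 2/13, noticeably larger than the 1/100 prior but far from certainty.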

Bayes starts with a statement of the problem. Next, he provides 7 definitions, including the definition of “inconsistent” (disjoint) events, “contrary” (complementary) events, and independent events. In Definition 5, he defines the probability of an event as the ratio of its expected value to the value realized if the event occurs.

After the definitions, Bayes presents 7 propositions. The first 6 include what most modern textbooks (following Kolmogorov) present as axioms, as well as the complement rule, the multiplication rule, and the product rule for independent events, but he does not explicitly state the rule that now bears his name. In Prop. 7, he proceeds to the binomial distribution.

This introductory material might be considered the first textbook coverage of the definitions, axioms, and basic rules of probability theory. Richard Price’s introductory letter says, “Mr Bayes has thought fit to begin his work with a brief demonstration of the general laws of chance. His reason for doing this… was not merely that his reader might not have the trouble of searching elsewhere for the principles on which he has argued, but because he did not know whither to refer him for a clear demonstration of them.” De Moivre’s “Doctrine of Chances” (1718, 1738, and 1756) is sometimes called the first probability textbook, but if Bayes thought it provided a clear demonstration of the “general laws of chance”, he would have known “whither to refer” the reader.

One minor problem is that the notation \binom{n}{k} for “n choose k” did not exist, so Bayes relies on the well-known binomial expansion of (a+b)^n,

$$a^n + na^{n-1}b + \frac{n(n-1)}{2}a^{n-2}{b^2} + \frac{n(n-1)(n-2)}{3 \cdot 2}a^{n-3}{b^3} +… + nab^{n-1} + b^n ,$$

which we now write as

$$\binom{n}{n}a^nb^0+ \binom{n}{n-1}a^{n-1}b^1 + \binom{n}{n-2}a^{n-2}{b^2} + \binom{n}{n-3}a^{n-3}{b^3} +… + \binom{n}{1}a^1b^{n-1} + \binom{n}{0}a^0b^n $$ or $$\sum_{k=0}^{n} \binom{n}{k}a^kb^{n-k} .$$
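As a quick sanity check on the identity above, the sum form can be evaluated and compared with (a+b)^n directly. This short Python sketch uses arbitrary sample values (a = 2, b = 3, n = 10 are my choices):

```python
from math import comb

# Verify that sum_{k=0}^{n} C(n, k) a^k b^(n-k) equals (a + b)^n
# for arbitrary sample values.
a, b, n = 2, 3, 10
expansion = sum(comb(n, k) * a**k * b**(n - k) for k in range(n + 1))
print(expansion, (a + b) ** n)  # both are 9765625
```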

Bayes refers to \binom{n}{k} as “the coefficient of the term in which occurs a^{k}b^{n-k} when the binomial (a + b)^{n} is expanded”. In this quoted phrase, I have already substituted k for p and n-k for q, because of another potential source of confusion for modern readers….

Today, we often present the binomial distribution this way:

If k is the number of successes in n binary trials with success probability p and failure probability q = 1 - p, then the probability mass function (PMF) of k is given as follows:

$$\text{binomPMF}(k; n, p) = P(k; n, p) = \binom{n}{k}p^{k}q^{n-k}, \quad k = 0, 1, 2, …, n$$

Unfortunately for those of us accustomed to p as the probability of a success and q = 1 - p as the probability of a failure, Bayes used p as the number of successes, where we now often use k, and q as the number of failures, where we now often use n-k. When I translate Bayes’s original text, I use \theta for the probability of a success and, when I need it, \gamma = 1 - \theta for the probability of a failure. I still use k for the number of successes and, when I need it, j = n-k for the number of failures. The binomial distribution completes Section 1 of the essay.
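A short Python sketch of this PMF may make the modern notation concrete; the parameter values n = 10 and \theta = 0.4 are chosen to match the k = 4, j = 6 example that appears later in this article:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(k successes in n trials), with q = 1 - p as in the text."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

# n = 10 and theta = 0.4 match the k = 4, j = 6 example
# drawn later in the article.
n, theta = 10, 0.4
pmf = [binom_pmf(k, n, theta) for k in range(n + 1)]

assert abs(sum(pmf) - 1) < 1e-12                    # PMF sums to 1
assert max(range(n + 1), key=pmf.__getitem__) == 4  # mode at k = 4
```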

In the second section, Bayes describes a hypothetical square table onto which he imagines throwing first ball W and then ball O repeatedly. I will follow many others and call his table a billiards table, although Bayes never mentions billiards. He measures the distance of ball W from the right side of his table, so I will assume that balls are thrown onto the billiards table from the right end, not the left end.

Yet another potential source of confusion is that the horizontal axis in Bayes’s figures goes from 0 on the right to 1 on the left, which is the reverse of the way we usually draw it now. For example, he made a freehand drawing of the function x^k(1-x)^j for x from 0 to 1. Here is how it would look with k=4 and j=6.

In Bayes’s figures, the horizontal axis goes from 0 on the right to 1 on the left. This is x^4(1-x)^6. The maximum is at x = 0.4, which is to the right of the midline.

Bayes’s key insight was that the area under the curve x^k(1-x)^j for x from 0 to 1 is

$$\frac{1}{\binom{k+j}{k}(k+j+1)}.$$

For example, the area under the curve with k=4 and j=6 in figure above is

$$\frac{1}{\binom{10}{4}(11)} = \frac{1}{(210 \times 11)} = \frac{1}{2310} = 0.0004329 .$$
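Both the closed form and the arithmetic above can be checked numerically. This sketch compares Bayes’s formula with a simple midpoint Riemann sum over [0, 1] (the Riemann sum is my verification device, not anything in the essay):

```python
from fractions import Fraction
from math import comb

k, j = 4, 6

# Bayes's closed form for the area: 1 / (C(k+j, k) * (k+j+1))
exact = Fraction(1, comb(k + j, k) * (k + j + 1))
print(exact)  # 1/2310

# Compare with the integral of x^k (1-x)^j over [0, 1],
# approximated by a midpoint Riemann sum with m subintervals.
m = 100_000
area = 0.0
for i in range(m):
    x = (i + 0.5) / m
    area += x**k * (1 - x) ** j
area /= m

assert abs(area - float(exact)) < 1e-9
```

The agreement to nine decimal places is, of course, no substitute for the proof; it just confirms that the k = 4, j = 6 arithmetic above is right.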

As we shall see, Bayes first gives what Harvard statistics professor Joseph Blitzstein calls a “story proof” (Blitzstein page 382) and then a derivation using algebra and calculus, which Bayes calls “fluxions”.

Towards the end of the second section, Richard Price wrote, “Thus far Mr. Bayes’s essay.” The rest of the section is Price’s abridgment of what Bayes wrote. Price also added an appendix with numerical examples.

Read on to find out what Thomas Bayes really said and what Richard Price added. I present the essay in small sections of the original text, followed by my translation into the modern equivalent. I also intersperse explanatory comments. If a comment seemed too long, I moved it to an endnote. I finish with annotated references.

(next) Problem and Definitions
