What Did Bayes Really Say? — Introduction

© 19 June 2022 by Michael A. Kohn

Link to the pdf of this article

Introduction | Problem and Definitions | Propositions 1 – 7 | Bayes’s Billiards | Endnotes | References

Introduction

The Reverend Thomas Bayes (1701-1761) is famous for “An Essay Towards Solving A Problem in the Doctrine of Chances”, which was published in the Royal Society of London’s Philosophical Transactions on 23 December 1763, more than two and a half years after Bayes’s death. (If you think I should be forming the possessive of Bayes in some way other than “Bayes’s”, see Endnote #1.) His friend Richard Price found the essay among Bayes’s papers and sent it to the Royal Society along with an introductory letter, footnotes, an abridgment of the latter part of the essay, and an appendix containing numerical examples.

The title of the essay and the first sentence of Richard Price’s introductory letter. John Canton was secretary of the Royal Society of London. F.R.S. means Fellow of the Royal Society.

The essay as originally published is 49 pages — 24 written by Bayes and 25 by Price: the introductory letter (6 pages), the abridged conclusion (4 pages), and the appendix (15 pages). It is difficult to read today because, to us, the 18th-century English seems stilted and the mathematical notation is unfamiliar. So, I have tried to “translate” it into modern language and mathematical notation.

The essay is not focused on what we now call Bayes’s Rule:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

It is about the more specific problem of how to draw inferences about the probability of a binary event by observing the number of times it does and doesn’t happen.
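Before moving on, it may help to see the rule above in action. Here is a minimal Python sketch; the probabilities are arbitrary illustrative values I have chosen, not numbers from the essay:

```python
from fractions import Fraction

# Arbitrary illustrative probabilities (not from the essay):
p_A = Fraction(1, 100)            # P(A), the prior
p_B_given_A = Fraction(9, 10)     # P(B|A)
p_B_given_notA = Fraction(1, 20)  # P(B|not A)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes's Rule: P(A|B) = P(B|A)P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # 2/13
```

Using exact fractions rather than floats keeps the arithmetic transparent: the posterior comes out as 2/13, noticeably larger than the 1/100 prior but far from certainty.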

Bayes starts with a statement of the problem. Next, he provides 7 definitions, including the definition of “inconsistent” (disjoint) events, “contrary” (complementary) events, and independent events. In Definition 5, he defines the probability of an event as the ratio of its expected value to the value realized if the event occurs.

After the definitions, Bayes presents 7 propositions. The first 6 include what most modern textbooks (following Kolmogorov) present as axioms, as well as the complement rule, the multiplication rule, and the product rule for independent events, but he does not explicitly state the rule that now bears his name. In Prop. 7, he proceeds to the binomial distribution.

This introductory material might be considered the first textbook coverage of the definitions, axioms, and basic rules of probability theory. Richard Price’s introductory letter says, “Mr Bayes has thought fit to begin his work with a brief demonstration of the general laws of chance. His reason for doing this… was not merely that his reader might not have the trouble of searching elsewhere for the principles on which he has argued, but because he did not know whither to refer him for a clear demonstration of them.” De Moivre’s “Doctrine of Chances” (1718, 1738, and 1756) is sometimes called the first probability textbook, but if Bayes thought it provided a clear demonstration of the “general laws of chance”, he would have known “whither to refer” the reader.

One minor problem is that the notation \binom{n}{k} for “n choose k” did not exist, so Bayes relies on the well-known binomial expansion of (a+b)^n,

$$a^n + na^{n-1}b + \frac{n(n-1)}{2}a^{n-2}{b^2} + \frac{n(n-1)(n-2)}{3 \cdot 2}a^{n-3}{b^3} +… + nab^{n-1} + b^n ,$$

which we now write as

$$\binom{n}{n}a^nb^0+ \binom{n}{n-1}a^{n-1}b^1 + \binom{n}{n-2}a^{n-2}{b^2} + \binom{n}{n-3}a^{n-3}{b^3} +… + \binom{n}{1}a^1b^{n-1} + \binom{n}{0}a^0b^n $$ or $$\sum_{k=0}^{n} \binom{n}{k}a^kb^{n-k} .$$
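As a quick sanity check on the identity above, the sum form can be evaluated and compared with (a+b)^n directly. This short Python sketch uses arbitrary sample values (a = 2, b = 3, n = 10 are my choices):

```python
from math import comb

# Verify that sum_{k=0}^{n} C(n, k) a^k b^(n-k) equals (a + b)^n
# for arbitrary sample values.
a, b, n = 2, 3, 10
expansion = sum(comb(n, k) * a**k * b**(n - k) for k in range(n + 1))
print(expansion, (a + b) ** n)  # both are 9765625
```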

Bayes refers to \binom{n}{k} as “the coefficient of the term in which occurs a^{k}b^{n-k} when the binomial (a + b)^{n} is expanded”. In this quoted phrase, I have already substituted k for p and n-k for q, because of another potential source of confusion for modern readers….

Today, we often present the binomial distribution this way:

If k is the number of successes in n binary trials with success probability p and failure probability q = 1 - p, then the probability mass function (PMF) of k is given as follows:

$$\text{binomPMF}(k; n, p) = P(k; n, p) = \binom{n}{k}p^{k}q^{n-k}, \quad k = 0, 1, 2, …, n$$

Unfortunately for those of us accustomed to p as the probability of a success and q = 1 - p as the probability of a failure, Bayes used p as the number of successes, where we now often use k, and q as the number of failures, where we now often use n-k. When I translate Bayes’s original text, I use \theta for the probability of a success and, when I need it, \gamma = 1 - \theta for the probability of a failure. I still use k for the number of successes and, when I need it, j = n-k for the number of failures. The binomial distribution completes Section 1 of the essay.
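A short Python sketch of this PMF may make the modern notation concrete; the parameter values n = 10 and \theta = 0.4 are chosen to match the k = 4, j = 6 example that appears later in this article:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(k successes in n trials), with q = 1 - p as in the text."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

# n = 10 and theta = 0.4 match the k = 4, j = 6 example
# drawn later in the article.
n, theta = 10, 0.4
pmf = [binom_pmf(k, n, theta) for k in range(n + 1)]

assert abs(sum(pmf) - 1) < 1e-12                    # PMF sums to 1
assert max(range(n + 1), key=pmf.__getitem__) == 4  # mode at k = 4
```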

In the second section, Bayes describes a hypothetical square table onto which he imagines throwing first ball W and then ball O repeatedly. I will follow many others and call his table a billiards table, although Bayes never mentions billiards. He measures the distance of ball W from the right side of his table, so I will assume that balls are thrown onto the billiards table from the right end, not the left end.

Yet another potential source of confusion is that the horizontal axis in Bayes’s figures goes from 0 on the right to 1 on the left, which is the reverse of the way we usually draw it now. For example, he made a freehand drawing of the function x^k(1-x)^j for x from 0 to 1. Here is how it would look with k=4 and j=6.

In Bayes’s figures, the horizontal axis goes from 0 on the right to 1 on the left. This is x^4(1-x)^6. The maximum is at x = 0.4, which is to the right of the midline.

Bayes’s key insight was that the area under the curve x^k(1-x)^j for x from 0 to 1 is

$$\frac{1}{\binom{k+j}{k}(k+j+1)}.$$

For example, the area under the curve with k=4 and j=6 in figure above is

$$\frac{1}{\binom{10}{4}(11)} = \frac{1}{(210 \times 11)} = \frac{1}{2310} = 0.0004329 .$$
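Both the closed form and the arithmetic above can be checked numerically. This sketch compares Bayes’s formula with a simple midpoint Riemann sum over [0, 1] (the Riemann sum is my verification device, not anything in the essay):

```python
from fractions import Fraction
from math import comb

k, j = 4, 6

# Bayes's closed form for the area: 1 / (C(k+j, k) * (k+j+1))
exact = Fraction(1, comb(k + j, k) * (k + j + 1))
print(exact)  # 1/2310

# Compare with the integral of x^k (1-x)^j over [0, 1],
# approximated by a midpoint Riemann sum with m subintervals.
m = 100_000
area = 0.0
for i in range(m):
    x = (i + 0.5) / m
    area += x**k * (1 - x) ** j
area /= m

assert abs(area - float(exact)) < 1e-9
```

The agreement to nine decimal places is, of course, no substitute for the proof; it just confirms that the k = 4, j = 6 arithmetic above is right.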

As we shall see, Bayes first gives what Harvard statistics professor Joseph Blitzstein calls a “story proof” (Blitzstein page 382) and then a derivation using algebra and calculus, which Bayes calls “fluxions”.

Towards the end of the second section, Richard Price wrote, “Thus far Mr. Bayes’s essay.” The rest of the section is Price’s abridgment of what Bayes wrote. Price also added an appendix with numerical examples.

Read on to find out what Thomas Bayes really said and what Richard Price added. I present the essay in small sections of the original text, followed by my translation into the modern equivalent. I also intersperse explanatory comments. If a comment seemed too long, I moved it to an endnote. I finish with annotated references.

(next) Problem and Definitions
