Bayes' theorem

Bayes' theorem is a result in probability theory, which gives the conditional probability distribution of a random variable A given B in terms of the conditional probability distribution of variable B given A and the marginal probability distribution of A alone.
In the context of Bayesian probability theory and statistical inference, the marginal probability distribution of A alone is usually called the prior probability distribution or simply the prior. The conditional distribution of A given the "data" B is called the posterior probability distribution or just the posterior.
As a mathematical theorem, Bayes' theorem is valid regardless of whether one adopts a frequentist or a Bayesian interpretation of probability. However, there is disagreement as to what kinds of variables can be substituted for A and B in the theorem; this topic is treated at greater length in the articles on Bayesian probability and frequentist probability.
Historical remarks
Bayes' theorem is named after the Reverend Thomas Bayes (1702–1761). Bayes worked on the problem of computing a distribution for the parameter of a binomial distribution (to use modern terminology); his work was edited and presented posthumously (1763) by his friend Richard Price, in An Essay towards solving a Problem in the Doctrine of Chances. Bayes' results were replicated and extended by Pierre-Simon Laplace, who was apparently unaware of Bayes' work, in an essay of 1774.
One of Bayes' results (Proposition 5) gives a simple description of conditional probability, and shows that it does not depend on the order in which things occur:
 If there be two subsequent events, the probability of the second b/N and the probability of both together P/N, and it being first discovered that the second event has also happened, the probability I am right [i.e., the conditional probability of the first event being true given that the second has happened] is P/b.
The main result (Proposition 9 in the essay) derived by Bayes is the following: assuming a uniform distribution for the prior distribution of the binomial parameter p, the probability that p is between two values a and b is
 <math>
\frac {\int_a^b {n+m \choose m} p^m (1-p)^n\,dp}
{\int_0^1 {n+m \choose m} p^m (1-p)^n\,dp}
</math>
where m is the number of observed successes and n the number of observed failures. His preliminary results, in particular Propositions 3, 4, and 5, imply the result now called Bayes' Theorem (as described below), but it does not appear that Bayes himself emphasized or focused on that result.
What is "Bayesian" about Proposition 9 is that Bayes presented it as a probability for the parameter p. That is, not only can one compute probabilities for experimental outcomes, but also for the parameter which governs them, and the same algebra is used to make inferences of either kind. Interestingly, Bayes actually states his question in a way that might make the idea of assigning a probability distribution to a parameter palatable to a frequentist. He supposes that a billiard ball is thrown at random onto a billiard table, and that the probabilities p and q are the probabilities that subsequent billiard balls will fall above or below the first ball. By making the binomial parameter p depend on a random event, he cleverly escapes a philosophical quagmire that he most likely was not even aware was an issue.
Statement of Bayes' theorem
Bayes' theorem is a relation among conditional and marginal probabilities. It can be viewed as a means of incorporating information, from an observation, for example, to produce a modified or updated probability distribution. To derive Bayes' theorem, note first that from the definition of conditional probability
 <math>P(A|B)\, P(B) = P(A,\ B) = P(B|A)\, P(A)\,</math>
where P(A, B) is the joint probability of A and B.
It reads: The probability of A given B times the probability of B is equal to the probability of both event A and B occurring together and is also equal to the probability of B given A times the probability of A.
Dividing the left- and right-hand sides by P(B), provided that it is non-zero, we obtain
 <math>P(A|B) = \frac{P(B|A)\, P(A)}{P(B)}</math>
which is the theorem conventionally known as Bayes' theorem.
It reads: the probability of A given B is equal to the probability of B given A times the probability of A, divided by the probability of B.
Each term in Bayes' theorem has a conventional name. The term P(A) is called the prior probability of A. It is "prior" in the sense that it precedes any information about B. P(A) is also the marginal probability of A. The term P(A|B) is called the posterior probability of A, given B. It is "posterior" in the sense that it is derived from or entailed by the specified value of B. The term P(B|A), for a specific value of B, is called the likelihood function for A given B and can also be written as L(A|B). The term P(B) is the prior or marginal probability of B, and acts as the normalizing constant. With this terminology, the theorem may be paraphrased as
 <math> \mbox{posterior} = \frac{\mbox{likelihood} \times \mbox{prior}} {\mbox{normalizing constant}}. </math>
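The paraphrase above can be checked numerically. The following sketch uses hypothetical diagnostic-test numbers (the prevalence, sensitivity, and false-positive rate are illustrative choices, not from the text), with the normalizing constant P(B) obtained from the law of total probability:

```python
# A minimal sketch of the discrete theorem, with hypothetical numbers:
# event A = "patient has a disease", event B = "diagnostic test is positive".
prior = 0.01            # P(A): assumed prevalence of the disease
likelihood = 0.99       # P(B|A): assumed probability of a positive test if diseased
false_positive = 0.05   # P(B|A^C): assumed probability of a positive test if healthy

# Normalizing constant P(B) = P(B|A)P(A) + P(B|A^C)P(A^C)
marginal = likelihood * prior + false_positive * (1 - prior)

# Bayes' theorem: posterior = likelihood * prior / normalizing constant
posterior = likelihood * prior / marginal
print(round(posterior, 3))  # 0.167: even a positive test leaves P(A|B) modest
```

The small posterior despite a sensitive test illustrates why the prior matters: when A is rare, most positive results come from the much larger complementary event.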
Alternative forms of Bayes' theorem
Bayes' theorem is often embellished by noting that
 <math>P(B) = P(A, B) + P(A^C, B) = P(B|A)\, P(A) + P(B|A^C)\, P(A^C)\,</math>
so the theorem can be restated as
 <math>P(A|B) = \frac{P(B|A)\, P(A)}{P(B|A)\,P(A) + P(B|A^C)\,P(A^C)}\, ,</math>
where A^C is the complementary event of A. More generally, where {A_i} forms a partition of the event space,
 <math>P(A_i|B) = \frac{P(B|A_i)\, P(A_i)}{\sum_j P(B|A_j)\,P(A_j)}\, ,</math>
for any A_i in the partition.
It can also be written neatly in terms of a likelihood ratio and odds as
 <math>O(A|B)=O(A)\,\Lambda(A|B) \,</math>
where <math>O(A)={P(A) \over P(A^C)}, \; O(A|B)={P(A|B) \over P(A^C|B)}, \; \Lambda(A|B) = {L(A|B) \over L(A^C|B)} = {P(B|A) \over P(B|A^C)}.</math>
See also the law of total probability.
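A short sketch can confirm that the odds form and the direct form agree. The probabilities below are arbitrary illustrative values, not taken from the text:

```python
# Hypothetical values: P(A) = 0.01, P(B|A) = 0.99, P(B|A^C) = 0.05.
p_a, p_b_given_a, p_b_given_ac = 0.01, 0.99, 0.05

prior_odds = p_a / (1 - p_a)                    # O(A) = P(A) / P(A^C)
likelihood_ratio = p_b_given_a / p_b_given_ac   # Lambda(A|B) = P(B|A) / P(B|A^C)
posterior_odds = prior_odds * likelihood_ratio  # O(A|B) = O(A) * Lambda(A|B)

# Convert posterior odds back to a probability and compare with the direct form
p_via_odds = posterior_odds / (1 + posterior_odds)
p_direct = p_b_given_a * p_a / (p_b_given_a * p_a + p_b_given_ac * (1 - p_a))
print(round(p_via_odds, 6), round(p_direct, 6))  # 0.166667 0.166667
```

The odds form is often convenient in practice because successive independent observations multiply the likelihood ratios, leaving the odds update as a running product.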
Bayes' theorem for probability densities
There is also a version of Bayes' theorem for continuous distributions. It is somewhat harder to derive, since probability densities, strictly speaking, are not probabilities, so Bayes' theorem has to be established by a limit process; see Papoulis (citation below), Section 7.3 for an elementary derivation. Bayes' theorem for probability densities is formally similar to the theorem for probabilities:
 <math> f(x|y) = \frac{f(y|x)\,f(x)}{f(y)} </math>
and there is an analogous statement of the law of total probability:
 <math> f(x|y) = \frac{f(y|x)\,f(x)}{\int_{-\infty}^{\infty} f(y|x)\,f(x)\,dx}.
</math>
As in the discrete case, the terms have standard names. f(x, y) is the joint distribution of X and Y, f(x|y) is the posterior distribution of X given Y=y, f(y|x) = L(x|y) is (as a function of x) the likelihood function of X given Y=y, and f(x) and f(y) are the marginal distributions of X and Y respectively, with f(x) being the prior distribution of X.
Here we have indulged in a conventional abuse of notation, using f for each one of these terms, although each one is really a different function; the functions are distinguished by the names of their arguments.
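The density version can be illustrated by discretizing the integrals on a grid. In this sketch the prior and the likelihood are both normal densities (an illustrative choice, not from the text); with prior N(0, 1), likelihood f(y|x) = N(x, 1), and observation y = 1, standard conjugate-prior algebra gives a posterior mean of y/2 = 0.5, which the grid computation should reproduce:

```python
import math

def normal_pdf(z, mu, sigma):
    """Density of a Normal(mu, sigma^2) at z."""
    return math.exp(-(z - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

y = 1.0                                        # the observed value
h = 0.001
xs = [i * h for i in range(-6000, 6001)]       # grid on [-6, 6]

# Numerator of Bayes' theorem for densities: f(y|x) * f(x) at each grid point
unnorm = [normal_pdf(y, x, 1.0) * normal_pdf(x, 0.0, 1.0) for x in xs]

# Denominator f(y) = integral of f(y|x) f(x) dx, approximated as a Riemann sum
norm = sum(unnorm) * h
post = [u / norm for u in unnorm]              # discretized posterior density f(x|y)

mean = sum(x * p for x, p in zip(xs, post)) * h
print(round(mean, 3))  # ≈ 0.5, matching the conjugate-prior result
```

The grid approach is crude but makes the structure of the continuous theorem visible: pointwise multiplication of likelihood and prior, followed by a single normalizing integral.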
Extensions of Bayes' theorem
Theorems analogous to Bayes' theorem hold in problems with more than two variables. These theorems are not given distinct names, as they may be massproduced by applying the laws of probability. The general strategy is to work with a decomposition of the joint probability, and to marginalize (integrate) over the variables that are not of interest. Depending on the form of the decomposition, it may be possible to prove that some integrals must be 1, and thus they fall out of the decomposition; exploiting this property can reduce the computations very substantially. A Bayesian network is essentially a mechanism for automatically generating the extensions of Bayes' theorem that are appropriate for a given decomposition of the joint probability.
Example
Typical examples that use Bayes' theorem assume the philosophy underlying Bayesian probability: that uncertainty and degrees of belief can be measured as probabilities. One such example follows. For additional worked-out examples, please see the article on the examples of Bayesian inference.
We wish to know the proportion r of voters in a large population who will vote "yes" in a referendum. Let n be the number of voters in a random sample (chosen with replacement, so that we have statistical independence) and let m be the number of voters in that sample who will vote "yes". Suppose that we observe n = 10 voters, of whom m = 7 say they will vote "yes". From Bayes' theorem we can calculate the probability density function of r using
 <math> f(r | n=10, m=7) =
\frac {f(m=7 | r, n=10) \, f(r)} {\int_0^1 f(m=7|r, n=10) \, f(r) \, dr}. </math>
From this we see that once we have in hand the prior probability density function f(r) and the likelihood function L(r) = P(m = 7|r, n = 10), we can compute the posterior probability density function f(r|n = 10, m = 7).
The prior summarizes what we know about the distribution of r in the absence of any observation. We will assume in this case that the prior distribution of r is uniform over the interval [0, 1]. That is, f(r) = 1. That assumption should be considered provisional: if some additional background information is found, we should modify the prior accordingly.
Under the assumption of random sampling, choosing voters is just like choosing balls from an urn. The likelihood function for such a problem is just the probability of 7 successes in 10 trials for a binomial distribution.
 <math> P( m=7 | r, n=10) = {10 \choose 7} \, r^7 \, (1-r)^3. </math>
As with the prior, the likelihood is open to revision: more complex assumptions will yield more complex likelihood functions. Maintaining the current assumptions, we compute the normalizing factor,
 <math> \int_0^1 P( m=7|r, n=10) \, f(r) \, dr = \int_0^1 {10 \choose 7} \, r^7 \, (1-r)^3 \, 1 \, dr = {10 \choose 7} \, \frac{1}{1320} </math>
and the posterior distribution for r is then
 <math> f(r | n=10, m=7) =
\frac{{10 \choose 7} \, r^7 \, (1-r)^3 \, 1} {{10 \choose 7} \, \frac{1}{1320}} = 1320 \, r^7 \, (1-r)^3 </math>
for r between 0 and 1, inclusive.
One may be interested in the probability that more than half the voters will vote "yes". The prior probability that more than half the voters will vote "yes" is 1/2, by the symmetry of the uniform distribution. In comparison, the posterior probability that more than half the voters will vote "yes", i.e., the conditional probability given the outcome of the opinion poll (that seven of the 10 voters questioned will vote "yes"), is
 <math>1320\int_{1/2}^1 r^7(1-r)^3\,dr \approx 0.887</math>
which is about an "89% chance".
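The final integral can be checked numerically. The sketch below evaluates the posterior density derived above with composite Simpson's rule:

```python
# Posterior density from the example: f(r | n=10, m=7) = 1320 * r^7 * (1-r)^3
def posterior(r):
    return 1320 * r**7 * (1 - r)**3

# Composite Simpson's rule for the integral of the posterior over [1/2, 1]
a, b, N = 0.5, 1.0, 1000          # N must be even
h = (b - a) / N
s = posterior(a) + posterior(b)
for i in range(1, N):
    s += (4 if i % 2 else 2) * posterior(a + i * h)
prob = s * h / 3
print(round(prob, 3))  # 0.887, matching the value quoted in the text
```

Because the posterior is a Beta(8, 4) density, the integral can also be evaluated exactly; it comes to 1816/2048 ≈ 0.8867, consistent with the numerical result.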
References
Versions of the essay
 Thomas Bayes (1763), "An Essay towards solving a Problem in the Doctrine of Chances", Philosophical Transactions of the Royal Society of London, 53.
 Thomas Bayes (1763/1958) "Studies in the History of Probability and Statistics: IX. Thomas Bayes's Essay Towards Solving a Problem in the Doctrine of Chances", Biometrika 45:296–315 (Bayes's essay in modernized notation)
 Thomas Bayes "An essay towards solving a Problem in the Doctrine of Chances" (http://www.stat.ucla.edu/history/essay.pdf) (Bayes's essay in the original notation)
Commentaries
 G.A. Barnard. (1958) "Studies in the History of Probability and Statistics: IX. Thomas Bayes's Essay Towards Solving a Problem in the Doctrine of Chances", Biometrika 45:293–295 (biographical remarks)
 Daniel Covarrubias "An Essay Towards Solving a Problem in the Doctrine of Chances" (http://www.stat.rice.edu/~blairc/seminar/Files/danTalk.pdf) (an outline and exposition of Bayes's essay)
 Stephen M. Stigler (1982) "Thomas Bayes' Bayesian Inference," Journal of the Royal Statistical Society, Series A, 145:250–258 (Stigler argues for a revised interpretation of the essay; recommended)
 Isaac Todhunter (1865) A History of the Mathematical Theory of Probability from the time of Pascal to that of Laplace, Macmillan. Reprinted 1949, 1956 by Chelsea and 2001 by Thoemmes.
Additional material
 Pierre-Simon Laplace (1774), "Mémoire sur la Probabilité des Causes par les Événements," Savants Étranges 6:621–656, also Oeuvres 8:27–65.
 Pierre-Simon Laplace (1774/1986), "Memoir on the Probability of the Causes of Events", Statistical Science, 1(3):364–378.
 Stephen M. Stigler (1986), "Laplace's 1774 memoir on inverse probability," Statistical Science, 1(3):359–378.
 Stephen M. Stigler (1983), "Who Discovered Bayes's Theorem?" The American Statistician, 37(4):290–296.
 Jeff Miller. Earliest Known Uses of Some of the Words of Mathematics (B) (http://members.aol.com/jeff570/mathword.html) (very informative; recommended)
 Athanasios Papoulis (1984), Probability, Random Variables, and Stochastic Processes, second edition. New York: McGraw-Hill.
 James Joyce. "Bayes' Theorem" (http://plato.stanford.edu/entries/bayestheorem/), in the Stanford Encyclopedia of Philosophy.
See also
 Raven paradox
 Prosecutor's fallacy
 Revising opinions in statistics
 Occam's razor
 Bayesian inference