Peter Norvig, 3 Oct 2015, revised 8 Jan 2017

Probability, Paradox, and the Reasonable Person Principle

In another notebook, I introduced the basics of probability theory. A quick glossary of terms:

  • P(E, D): the probability of event E, given probability distribution D describing the complete sample space of possible outcomes.
  • outcome: What actually happens (like a die rolling a 6). Also called an atomic event.
  • event: A description of possibly several atomic events (like rolling an even number).
  • sample space: The set of possible outcomes.
  • probability distribution: A mapping from every possible outcome to a number in the range 0 to 1 saying how likely the outcome is.
  • uniform distribution: A probability distribution in which every outcome is equally probable.

Here I refactor the earlier code, keeping only the parts that are most important for this notebook, which will show how to solve some particularly perplexing paradoxical probability problems.

In [1]:
from fractions import Fraction

def P(predicate, dist): 
    "The probability that `predicate` is true, given the probability distribution `dist`."
    return sum(dist[e] for e in dist if predicate(e))

class ProbDist(dict):
    "A Probability Distribution; an {outcome: probability} mapping where probabilities sum to 1."
    def __init__(self, mapping=(), **kwargs):
        self.update(mapping, **kwargs)
        total = sum(self.values())
        if isinstance(total, int): 
            total = Fraction(total, 1)
        for key in self: # Make probabilities sum to 1.
            self[key] = self[key] / total
    def __and__(self, predicate): # Call this method by writing `probdist & predicate`
        "A new ProbDist, restricted to the outcomes of this ProbDist for which the predicate is true."
        return ProbDist({e:self[e] for e in self if predicate(e)})

def Uniform(outcomes): return ProbDist({e: 1 for e in outcomes})

def joint(A, B, sep=''):
    """The joint distribution of two independent probability distributions. 
    Result is all entries of the form {a+sep+b: P(a)*P(b)}"""
    return ProbDist({a + sep + b: A[a] * B[b]
                    for a in A
                    for b in B})
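As a quick sanity check of this machinery, here is a minimal standalone sketch applied to a fair die (restating `P` so the snippet runs on its own; in the notebook you would just use the definitions above):

```python
from fractions import Fraction

def P(predicate, dist):
    "Probability that `predicate` is true under distribution `dist` (as defined above)."
    return sum(dist[e] for e in dist if predicate(e))

# A fair six-sided die as an {outcome: probability} mapping.
die = {r: Fraction(1, 6) for r in range(1, 7)}

def even(outcome): return outcome % 2 == 0

print(P(even, die))             # Fraction(1, 2)
print(P(lambda r: r > 4, die))  # Fraction(1, 3)
```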

Child Paradoxes

In 1959, Martin Gardner posed these two problems:

  • Child Problem 1. Mr. Jones has two children. The older child is a boy. What is the probability that both children are boys?

  • Child Problem 2. Mr. Smith has two children. At least one of them is a boy. What is the probability that both children are boys?

Then in 2006, Mike & Tom Starbird came up with a variant, which Gary Foshee introduced to Gardner fans in 2010:

  • Child Problem 3. I have two children. At least one of them is a boy born on Tuesday. What is the probability that both children are boys?

Problems 2 and 3 are considered paradoxes because they have surprising answers that people argue about.

(Assume the probability of a boy is exactly 1/2, and is independent of any siblings.)

Martin Gardner

Child Problem 1: Older child is a boy. What is the probability both are boys?

We use 'BG' to denote the outcome in which the older child is a boy and the younger a girl. The prior probability distribution, two_kids, has four equi-probable outcomes:

In [2]:
two_kids = Uniform({'BG', 'BB', 'GB', 'GG'})

Let's define predicates for the conditions of having two boys, and of the older child being a boy:

In [3]:
def two_boys(outcome): return outcome.count('B') == 2

def older_is_a_boy(outcome): return outcome.startswith('B')

Now we can answer Problem 1: "What is the probability of two boys, given two kids of which the older is a boy?"

In [4]:
P(two_boys, two_kids & older_is_a_boy)
Fraction(1, 2)

You're probably thinking that was a lot of mechanism just to get the obvious answer. But in the next problems, what is obvious becomes less obvious.

Child Problem 2: At least one is a boy. What is the probability both are boys?

Implementing this problem and finding the answer is easy:

In [5]:
def at_least_one_boy(outcome): return 'B' in outcome
In [6]:
P(two_boys, two_kids & at_least_one_boy)
Fraction(1, 3)

Understanding the answer is tougher. Some people think the answer should be 1/2. Can we justify the answer 1/3? We can see there are three equiprobable outcomes in which there is at least one boy:

In [7]:
two_kids & at_least_one_boy
{'BB': Fraction(1, 3), 'BG': Fraction(1, 3), 'GB': Fraction(1, 3)}

Of those three outcomes, only one has two boys, so the answer of 1/3 is indeed justified.

But some people still think the answer should be 1/2. Their reasoning is "If one child is a boy, then there are two equiprobable outcomes for the other child, so the probability that the other child is a boy, and thus that there are two boys, is 1/2."

When two methods of reasoning give two different answers, we have a paradox. Here are three responses to a paradox:

  1. The very fundamentals of mathematics must be incomplete, and this problem reveals it!!!
  2. I'm right, and anyone who disagrees with me is an idiot!!!
  3. I have the right answer for one interpretation of the problem, and you have the right answer for a different interpretation of the problem.

If you're Bertrand Russell or Georg Cantor, you might very well uncover a fundamental flaw in mathematics; for the rest of us, I recommend Response 3. When I believe the answer is 1/3, and I hear someone say the answer is 1/2, my response is not "You're wrong!", rather it is "How interesting! You must have a different interpretation of the problem; I should try to discover what your interpretation is, and why your answer is correct for your interpretation." The first step is to be more precise in my wording of the experiment:

  • Child Experiment 2a. Mr. Smith is chosen at random from families with two children. He is asked if at least one of his children is a boy. He replies "yes."

The next step is to envision another possible interpretation of the experiment:

  • Child Experiment 2b. Mr. Smith is chosen at random from families with two children. He is observed at a time when he is accompanied by one of his children, chosen at random. The child is observed to be a boy.

Experiment 2b needs a different sample space, which we will call two_kids_b. It consists of 8 outcomes, not just 4; for each of the 4 outcomes in two_kids, we have a choice of observing either the older child or the younger child. We will use the notation 'GB/g?' to mean that the older child is a girl, the younger a boy, the older child was observed to be a girl, and the younger was not observed. So we have:

In [8]:
two_kids_b = Uniform({'BB/b?', 'BB/?b', 
                      'BG/b?', 'BG/?g', 
                      'GB/g?', 'GB/?b', 
                      'GG/g?', 'GG/?g'})

Now we can restrict ourselves to outcomes in which we observe Mr. Smith with a boy:

In [9]:
def observed_boy(outcome): return 'b' in outcome

two_kids_b & observed_boy
{'BB/?b': Fraction(1, 4),
 'BB/b?': Fraction(1, 4),
 'BG/b?': Fraction(1, 4),
 'GB/?b': Fraction(1, 4)}

And finally we can determine the probability that he has two boys, given that we observed him with a boy:

In [10]:
P(two_boys, two_kids_b & observed_boy)
Fraction(1, 2)

The paradox is resolved. Two reasonable people can have different interpretations of the problem, and can each reason flawlessly to reach different conclusions, 1/3 for Experiment 2a, or 1/2 for Experiment 2b.

Which interpretation of the problem is "better"? We could debate that, or we could just agree to use unambiguous wording (that is, use the language of Experiment 2a or Experiment 2b, not the ambiguous language of Problem 2).

The Reasonable Person Principle

It is an unfortunate fact of human nature that we often assume the other person is an idiot. As George Carlin puts it: "Have you ever noticed when you're driving that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac?"

George Carlin

The opposite assumption, that other people are more likely to be reasonable than idiots, is known as the reasonable person principle. It is a guiding principle at Carnegie Mellon University's School of Computer Science, and is a principle I try to live by as well: when I see people reason to a different conclusion than mine, I assume it is most likely that they are reasonably and rationally solving a different problem, not that they are idiots.

Now let's return to an even more paradoxical problem.

Child Problem 3. One is a boy born on Tuesday. What's the probability both are boys?

Most people cannot imagine how the boy's day of birth could be relevant, and feel the answer should be the same as in Problem 2. But to be sure, we need to clearly describe the experiment, define the sample space, and calculate. First:

  • Child Experiment 3a. A parent is chosen at random from families with two children. She is asked if at least one of her children is a boy born on Tuesday. She replies "yes."

Next we'll define a sample space. We'll use the notation "G1B3" to mean the older child is a girl born on the first day of the week (Sunday) and the younger a boy born on the third day of the week (Tuesday). We'll call the resulting distribution over all such outcomes two_kids_w.

In [11]:
one_kid_w  = joint(Uniform('BG'), Uniform('1234567'))
two_kids_w = joint(one_kid_w, one_kid_w)

That's too many to print, but we can sample them:

In [12]:
import random
random.sample(list(two_kids_w), 8)
['B2G4', 'G3B4', 'G2B6', 'B4B5', 'B4G6', 'B5B2', 'G7B7', 'B5B3']

We determine below that the probability of having at least one boy is 3/4, both in two_kids_w (where we keep track of the birth day of week) and in two_kids (where we don't):

In [13]:
P(at_least_one_boy, two_kids_w)
Fraction(3, 4)
In [14]:
P(at_least_one_boy, two_kids)
Fraction(3, 4)

The probability of two boys is 1/4 in either sample space:

In [15]:
P(two_boys, two_kids_w)
Fraction(1, 4)
In [16]:
P(two_boys, two_kids)
Fraction(1, 4)

And the probability of two boys given at least one boy is 1/3 in either sample space:

In [17]:
P(two_boys, two_kids_w & at_least_one_boy)
Fraction(1, 3)
In [18]:
P(two_boys, two_kids & at_least_one_boy)
Fraction(1, 3)

We will define a predicate for the event of at least one boy born on Tuesday:

In [19]:
def at_least_one_boy_tues(outcome): return 'B3' in outcome

We are now ready to answer Problem 3:

In [20]:
P(two_boys, two_kids_w & at_least_one_boy_tues)
Fraction(13, 27)

What? The answer is 13/27? How many of you saw that coming?

13/27 is quite different from 1/3, but rather close to 1/2. So "at least one boy born on Tuesday" is quite different from "at least one boy." Are you surprised? Do you accept the answer, or do you think we did something wrong? Are there other interpretations of the experiment that lead to other answers?

Here is one alternative interpretation:

  • Child Experiment 3b. A parent is chosen at random from families with two children. She is observed at a time when she is accompanied by one of her children, chosen at random. The child is observed to be a boy who reports that his birth day is Tuesday.

We can represent outcomes in this sample space with the notation G1B3/??b3, meaning the older child is a girl born on Sunday, the younger a boy born on Tuesday, the older was not observed, and the younger was.

In [21]:
def observed_boy_tues(outcome): return 'b3' in outcome

two_kids_wb = Uniform({children + '/' + observation
                       for children in two_kids_w
                       for observation in (children[:2].lower() + '??',
                                           '??' + children[-2:].lower())})
In [22]:
random.sample(list(two_kids_wb), 5)
['G3G2/g3??', 'B6B2/b6??', 'B2G6/??g6', 'G6G7/g6??', 'B6B1/??b1']

Now we can answer this version of Child Problem 3:

In [23]:
P(two_boys, two_kids_wb & observed_boy_tues)
Fraction(1, 2)

So with the wording of Child Experiment 3b, the answer is the same as 2b.

Still confused? Let's build a visualization tool to make things more concrete.


We'll display the results as a two dimensional grid of outcomes. An outcome will be colored white if it does not satisfy the condition stated in the problem; green if the outcome contains two boys; and yellow if it does satisfy the condition, but does not have two boys. Every cell in a row has the same older child, and every cell in a column has the same younger child. Here's the code to display a table:

In [24]:
from IPython.display import HTML

def Pgrid(event, dist, condition):
    """Display sample space in a grid, color-coded: green if event and condition is true; 
    yellow if only condition is true; white otherwise."""
    def first_half(s): return s[:len(s)//2]
    olders = sorted(set(map(first_half, dist)))
    return HTML('<table>' +
                cat(row(older, event, dist, condition) for older in olders) +
                '</table>' + 
                '<tt>P({} | {}) = {}</tt>'.format(
                event.__name__, condition.__name__, 
                P(event, dist & condition)))

def row(older, event, dist, condition):
    "Display a row where an older child is paired with each of the possible younger children."
    thisrow = sorted(outcome for outcome in dist if outcome.startswith(older))
    return '<tr>' + cat(cell(outcome, event, condition) for outcome in thisrow) + '</tr>'

def cell(outcome, event, condition): 
    "Display outcome in appropriate color."
    color = ('lightgreen' if event(outcome) and condition(outcome) else
             'yellow' if condition(outcome) else
             'white')
    return '<td style="background-color: {}">{}</td>'.format(color, outcome)

cat = ''.join

We can use this visualization tool to see that in Child Problem 1, there is one outcome with two boys (green) out of a total of two outcomes where the older is a boy (green and yellow) so the probability of two boys given that the older is a boy is 1/2.

In [25]:
# Child Problem 1
Pgrid(two_boys, two_kids, older_is_a_boy)
P(two_boys | older_is_a_boy) = 1/2

For Child Problem 2, we see the probability of two boys (green) given at least one boy (green and yellow) is 1/3.

In [26]:
# Child Problem 2
Pgrid(two_boys, two_kids, at_least_one_boy)
P(two_boys | at_least_one_boy) = 1/3

The answer is still 1/3 when we consider the day of the week of each birth.

In [27]:
# Child Problem 2, with days of week enumerated
Pgrid(two_boys, two_kids_w, at_least_one_boy)
P(two_boys | at_least_one_boy) = 1/3

Now for the "paradox" of Child Problem 3:

In [28]:
# Child Problem 3
Pgrid(two_boys, two_kids_w, at_least_one_boy_tues)
P(two_boys | at_least_one_boy_tues) = 13/27

We see there are 27 relevant outcomes, of which 13 are green. So 13/27 really does seem to be the right answer. This picture also gives us a way to think about why the answer is not 1/3. Think of the yellow-plus-green area as a horizontal stripe and a vertical stripe, with an overlap. Each stripe is half yellow and half green, so if there were no overlap at all, the probability of green would be 1/2. When each stripe takes up half the sample space and the overlap is maximal, the probability is 1/3. And in the Problem 3 table, where the overlap is small, the probability is close to 1/2 (but slightly smaller).

One way to look at it is that if I tell you very specific information (such as a boy born on Tuesday), it is unlikely that this applies to both children, so we have smaller overlap and a probability closer to 1/2, but if I give you broad information (a boy), this is more likely to apply to either child, resulting in a larger overlap, and a probability closer to 1/3.
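The overlap argument can be checked by direct enumeration. The following is a standalone sketch (my addition, restating the sample-space construction with plain tuples) that computes P(two boys | at least one boy with a 1-in-n property): with n = 1 the property is just "is a boy" (answer 1/3), with n = 7 it is "a boy born on Tuesday" (answer 13/27), and as n grows the answer approaches 1/2:

```python
from fractions import Fraction
from itertools import product

def p_two_boys_given_boy_with_property(n_days):
    """P(two boys | at least one boy with a 1-in-n_days property),
    by direct enumeration of the sex x day sample space."""
    kids = list(product('BG', range(n_days)))
    outcomes = list(product(kids, kids))   # (older, younger); all equiprobable
    def boy_day0(kid): return kid == ('B', 0)
    matching  = [(a, b) for a, b in outcomes if boy_day0(a) or boy_day0(b)]
    both_boys = [(a, b) for a, b in matching if a[0] == b[0] == 'B']
    return Fraction(len(both_boys), len(matching))

for n in (1, 7, 365):
    print(n, p_two_boys_given_boy_with_property(n))
```

In general the count works out to (2n - 1)/(4n - 1), which starts at 1/3 for n = 1 and tends to 1/2 as the property becomes more specific.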

You can read some more discussions of the problem by (in alphabetical order) Alex Bellos, Alexander Bogomolny, Andrew Gelman, David Bigelow, Julie Rehmeyer, Keith Devlin, Peter Lynch, Tanya Khovanova, and Wendy Taylor & Kaye Stacey.

The Sleeping Beauty Paradox

The Sleeping Beauty Paradox is another tricky one:

Sleeping Beauty volunteers to undergo the following experiment and is told all of the following details: On Sunday she will be put to sleep. Then a fair coin will be tossed, to determine which experimental procedure to undertake:

  • Heads: Beauty will be awakened and interviewed on Monday only.
  • Tails: Beauty will be awakened and interviewed on Monday and Tuesday only.

In all cases she is put back to sleep with an amnesia-inducing drug that makes her forget that awakening and sleep until the next one. In any case, she will be awakened on Wednesday without interview and the experiment ends. Any time Beauty is awakened and interviewed, she is asked, "What is your belief now for the proposition that the coin landed heads?"

What should Sleeping Beauty say when she is interviewed? First, she should define the sample space. She should realize that heads and tails are equally likely, and no matter which comes up, there will be a Monday and a Tuesday, so that gives 4 equally-probable checkpoints. We will say for each checkpoint whether she is interviewed, using the notation 'heads/Monday/interviewed' to mean the situation where the coin flip came up heads, it is Monday, and she is interviewed. Here are the 4 equiprobable atomic events:

In [29]:
beauty = Uniform({'heads/Monday/interviewed', 'heads/Tuesday/sleep',
                  'tails/Monday/interviewed', 'tails/Tuesday/interviewed'})

At this point, you're probably expecting me to define predicates like this:

 def heads(outcome): return 'heads' in outcome
 def interviewed(outcome): return 'interviewed' in outcome

We've seen a lot of predicates like this. I think it is time to heed the "don't repeat yourself" principle, so I will define a predicate-defining function, t. Think of "t" as standing for "it is true that":

In [30]:
def t(property):
    "Return a predicate that is true of all outcomes that have 'property' as a substring."
    return lambda outcome: property in outcome

Now we can get the answer, the probability of heads given that Beauty is interviewed:

In [31]:
P(t("heads"), beauty & t("interviewed"))
Fraction(1, 3)

This problem is considered a paradox because there are people who argue that the answer should be 1/2, not 1/3. I admit I'm having difficulty coming up with an alternative interpretation that supports the "halfer" position.

I do know of a question that has the answer 1/2:

In [32]:
P(t("heads"), beauty) 
Fraction(1, 2)

But that seems like the wrong question; we want the probability of heads given that Sleeping Beauty was interviewed, not the unconditional probability of heads.

The "halfers" argue that before Sleeping Beauty goes to sleep, her unconditional probability for heads should be 1/2. When she is interviewed, she doesn't know anything more than before she went to sleep, so nothing has changed, so the probability of heads should still be 1/2. I find two flaws with this argument. First, if you want to convince me, show me a sample space; don't just make philosophical arguments. (Although a philosophical argument can be employed to help you define the right sample space.) Second, while I agree that before she goes to sleep, Beauty's unconditional probability for heads should be 1/2, I would say that both before she goes to sleep and when she is awakened, her conditional probability of heads given that she is being interviewed should be 1/3, as shown by the sample space.

The Monty Hall Paradox

This is one of the most famous probability paradoxes. It can be stated as follows:

Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to switch your choice to door No. 2?" Is it to your advantage to switch your choice?

Monty Hall

Much has been written about this problem, but to solve it all we have to do is be careful about how we understand the problem, and about defining our sample space. I will define outcomes of the form 'Car1/Lo/Pick1/Open2', which means:

  • Car1: The producers of the show randomly placed the car behind door 1.
  • Lo: The host randomly commits to the strategy of opening the lowest-numbered allowable door. A door is allowable if it does not contain the car and was not picked by the contestant. Alternatively, the host could have chosen to open the highest-numbered allowable door (Hi).
  • Pick1: The contestant picks door 1. Our sample space will only consider cases where the contestant picks door 1, but by symmetry, the same arguments could be used if the contestant picked door 2 or 3.
  • Open2: After hearing the contestant's choice, and following the strategy, the host opens a door; in this case door 2.

We can see that the sample space has 6 equiprobable outcomes involving Pick1:

In [33]:
monty = Uniform({'Car1/Lo/Pick1/Open2', 'Car1/Hi/Pick1/Open3',
                 'Car2/Lo/Pick1/Open3', 'Car2/Hi/Pick1/Open3',
                 'Car3/Lo/Pick1/Open2', 'Car3/Hi/Pick1/Open2'})

Now, assuming the contestant picks door 1 and the host opens door 3, we can ask:

  • What are the possible outcomes remaining?
  • What is the probability that the car is behind door 1?
  • Or door 2?
In [34]:
monty & t("Open3")
{'Car1/Hi/Pick1/Open3': Fraction(1, 3),
 'Car2/Hi/Pick1/Open3': Fraction(1, 3),
 'Car2/Lo/Pick1/Open3': Fraction(1, 3)}
In [35]:
P(t("Car1"), monty & t("Open3"))
Fraction(1, 3)
In [36]:
P(t("Car2"), monty & t("Open3"))
Fraction(2, 3)

We see that the car is behind door 1 only 1/3 of the time, and behind door 2 for 2/3 of the time, so the strategy of switching to door 2 is better than sticking with door 1. But don't feel bad if you got this one wrong; it turns out that Monty Hall himself, who opened many doors while hosting Let's Make a Deal for 13 years, didn't know the answer either, as revealed in this letter from Monty to Prof. Lawrence Denenberg, when Denenberg asked for permission to use the problem in his textbook:

If you were Denenberg, how would you answer Monty, in non-mathematical terms? I would try something like this:

When the contestant makes her initial pick, she has 1/3 chance of picking the car, and there is a 2/3 chance the car is behind one of the other doors. That's still true after you open a door, but now the 2/3 chance for either other door becomes concentrated as 2/3 behind one other door, so the contestant should switch.

But that type of argument was not persuasive to everyone. Marilyn vos Savant reports that many of her readers (including, she is pleased to point out, many Ph.D.s) still insist the answer is that it doesn't matter if the contestant switches; the odds are 1/2 either way. Let's try to discover what problem and what sample space those people are dealing with. Perhaps they are reasoning like this:

They define outcomes of the form 'Car1/Pick1/Open2/Goat', which means:

  • Car1: First the car is randomly placed behind door 1.
  • Pick1: The contestant picks door 1.
  • Open2: The host opens one of the two other doors at random (so the host might open the door with the car).
  • Goat: We observe there is a goat behind the opened door.

Under this interpretation, the sample space of all outcomes involving Pick1 is:

In [37]:
monty2 = Uniform({'Car1/Pick1/Open2/Goat', 'Car1/Pick1/Open3/Goat',
                  'Car2/Pick1/Open2/Car',  'Car2/Pick1/Open3/Goat',
                  'Car3/Pick1/Open2/Goat', 'Car3/Pick1/Open3/Car'})

And we can calculate the probability of the car being behind each door, given that the contestant picks door 1 and the host opens door 3 to reveal a goat:

In [38]:
P(t("Car1"), monty2 & t("Open3/Goat"))
Fraction(1, 2)
In [39]:
P(t("Car2"), monty2 & t("Open3/Goat"))
Fraction(1, 2)
In [40]:
P(t("Car3"), monty2 & t("Open3/Goat"))

So we see that under this interpretation it doesn't matter if you switch or not: either way you win half the time (just don't switch to door 3, unless you really like goats).

Is this a valid interpretation? I agree that the wording of the problem can be seen as ambiguous. However, this interpretation has a serious problem: in all the history of Let's Make a Deal, the host never opened a door to reveal the grand prize. This strongly suggests (but does not prove) that monty is the correct interpretation, not monty2.

Simulating the Monty Hall Problem

Some people might be more convinced by a simulation than by a probability argument. Here is code for a simulation which goes through this sequence of events:

  1. The host randomly chooses a door for the 'car'
  2. The contestant randomly makes a 'pick' of one of the doors
  3. The host randomly selects a non-car, non-pick door to be 'opened.'
  4. If strategy == 'switch', contestant changes 'pick' to the other unopened door
  5. Return true if the pick is the door with the car.
In [41]:
import random

def simulate_monty(strategy, doors=(1, 2, 3)):
    "Randomly place car, and given a strategy of 'switch' or 'stick', return True iff the strategy wins."
    car    = random.choice(doors)
    pick   = random.choice(doors)
    opened = random.choice([d for d in doors if d != car and d != pick])
    if strategy == 'switch':
        pick = next(d for d in doors if d != pick and d != opened)
    return (pick == car)

We can confirm that the contestant wins about 2/3 of the time with the switch strategy, and only wins about 1/3 of the time with the stick strategy:

In [42]:
from collections import Counter

Counter(simulate_monty('switch') for _ in range(10 ** 5))
Counter({False: 33595, True: 66405})
In [43]:
Counter(simulate_monty('stick') for _ in range(10 ** 5))
Counter({False: 66871, True: 33129})

Reasoning with Non-Uniform Probability Distributions

In all our "children" problems so far, we have assumed that the probability of a child being a boy is exactly 1/2. That's not quite right. An article gives the following counts for a sample of two-child families in Denmark:

GG: 121801    GB: 126840
BG: 127123    BB: 135138

Let's implement that:

In [44]:
DK = ProbDist(GG=121801., GB=126840.,
              BG=127123., BB=135138.)
DK
{'BB': 0.2645086533229465,
 'BG': 0.24882071317004043,
 'GB': 0.24826679089140383,
 'GG': 0.23840384261560926}

Now let's try the first two Child Problems with the probability distribution DK. Since boys are slightly more probable than girls, we expect a little over 1/2 for Problem 1, and a little over 1/3 for problem 2:

In [45]:
# Child Problem 1 in DK
P(two_boys, DK & older_is_a_boy)
In [46]:
# Child Problem 2 in DK
P(two_boys, DK & at_least_one_boy)

It all looks good. Now let's leave Denmark behind and try a new problem:

Child Problem 4. One is a boy born on Feb. 29. What is the probability both are boys?

  • Child Problem 4. I have two children. At least one of them is a boy born on leap day, February 29. What is the probability that both children are boys? Assume that 51.5% of births are boys and that birth days are distributed evenly across the 4×365 + 1 days in a 4-year cycle.

We will use the notation GLBN to mean an older girl born on leap day (L) and a younger boy born on a non-leap day (N).

In [47]:
sexes      = ProbDist(B=51.5, G=48.5)   # Probability distribution over sexes
days       = ProbDist(L=1, N=4*365)     # Probability distribution over Leap days and Non-leap days
child      = joint(sexes, days)         # Probability distribution for one child family
two_kids_L = joint(child, child)        # Probability distribution for two-child family

Let's check out these last two probability distributions:

In [48]:
child
{'BL': 0.00035249828884325804,
 'BN': 0.5146475017111568,
 'GL': 0.0003319644079397673,
 'GN': 0.48466803559206023}
In [49]:
two_kids_L
{'BLBL': 1.2425504363742495e-07,
 'BLBN': 0.00018141236371064045,
 'BLGL': 1.170168857556332e-07,
 'BLGN': 0.0001708446532032245,
 'BNBL': 0.00018141236371064045,
 'BNBN': 0.26486205101753507,
 'BNGL': 0.00017084465320322452,
 'BNGN': 0.24943319367670777,
 'GLBL': 1.170168857556332e-07,
 'GLBN': 0.00017084465320322452,
 'GLGL': 1.102003681388002e-07,
 'GLGN': 0.0001608925374826483,
 'GNBL': 0.0001708446532032245,
 'GNBN': 0.24943319367670777,
 'GNGL': 0.0001608925374826483,
 'GNGN': 0.2349031047246665}

Now we can solve the problem. Since "boy born on a leap day" applies to so few children, we expect the probability of two boys to be just ever so slightly below the baseline rate for boys, 51.5%.

In [50]:
# Child Problem 4: Probability of 2 boys, given two kids, one of which is born on leap day.

P(two_boys, two_kids_L & t('BL'))
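We can sanity-check this claim analytically. In the standalone sketch below (my addition), p is the probability of a boy and L the probability of a leap-day birth; the conditional probability reduces to p(2 - L)/(2 - pL), which is just barely below p:

```python
p = 0.515      # probability of a boy (assumption from the problem statement)
L = 1 / 1461   # probability of a leap-day birth in a 4-year cycle

both   = p * p * (1 - (1 - L)**2)   # two boys, at least one born on leap day
any_BL = 1 - (1 - p * L)**2         # at least one leap-day boy
answer = both / any_BL
print(answer)  # just under 0.515, as expected
```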

The St. Petersburg Paradox

The St. Petersburg paradox was invented in 1713, named for the home town of the Bernoullis, and introduced by Daniel Bernoulli, the nephew of Jacob Bernoulli (the urn guy):

A casino offers a game of chance for a single player in which a fair coin is tossed at each stage. The pot starts at 2 dollars and is doubled every time a head appears. The first time a tail appears, the game ends and the player wins whatever is in the pot. Thus the player wins 2 dollars if a tail appears on the first toss, 4 dollars if a head appears on the first toss and a tail on the second, etc. What is the expected value of this game to the player?

To calculate the expected value, we see there is a 1/2 chance of a tail on the first toss (yielding a pot of \$2) and, if not that, a 1/2 × 1/2 = 1/4 chance of a tail on the second toss (yielding a pot of \$4), and so on. So in total, the expected value is:

$$\frac{1}{2}\cdot 2 + \frac{1}{4}\cdot 4 + \frac{1}{8}\cdot 8 + \frac{1}{16} \cdot 16 + \cdots = 1 + 1 + 1 + 1 + \cdots = \infty$$

The expected value is infinite! But anyone playing the game would not expect to win an infinite amount; thus the paradox. How can we resolve the paradox?

Response 1: Limited Resources

The first major response to the paradox is that the casino's resources are limited. Once you break their bank, they can't pay out any more, and thus the expected return is finite. Let's create a probability distribution that reflects this reality. We keep doubling the pot and halving the probability of winning the amount in the pot (half because you get the pot on a tail but not a head), until we reach the bank's limit:

In [51]:
def st_pete(limit):
    "Return the probability distribution for the St. Petersburg Paradox with a limited bank."
    P = {}     # The probability distribution
    pot = 2    # Amount of money in the pot
    pr  = 1/2. # Probability that you end up with the amount in pot
    while pot < limit:
        P[pot] = pr
        pot    = pot * 2
        pr     = pr / 2
    P[limit] = pr * 2     # pr * 2 because you get limit for heads or tails
    return ProbDist(P)

Let's try with the casino limited to 100 million dollars:

In [52]:
StP = st_pete(limit=10**8)
StP
{2: 0.5,
 4: 0.25,
 8: 0.125,
 16: 0.0625,
 32: 0.03125,
 64: 0.015625,
 128: 0.0078125,
 256: 0.00390625,
 512: 0.001953125,
 1024: 0.0009765625,
 2048: 0.00048828125,
 4096: 0.000244140625,
 8192: 0.0001220703125,
 16384: 6.103515625e-05,
 32768: 3.0517578125e-05,
 65536: 1.52587890625e-05,
 131072: 7.62939453125e-06,
 262144: 3.814697265625e-06,
 524288: 1.9073486328125e-06,
 1048576: 9.5367431640625e-07,
 2097152: 4.76837158203125e-07,
 4194304: 2.384185791015625e-07,
 8388608: 1.1920928955078125e-07,
 16777216: 5.960464477539063e-08,
 33554432: 2.9802322387695312e-08,
 67108864: 1.4901161193847656e-08,
 100000000: 1.4901161193847656e-08}

Now we define the function EV to compute the expected value of a probability distribution:

In [53]:
def EV(P):
    "The expected value of a probability distribution."
    return sum(P[v] * v 
               for v in P)
In [54]:
EV(StP)
27.490116119384766
This says that for a casino with a bankroll of 100 million dollars, if you want to maximize your expected value, you should be willing to pay up to \$27.49 to play the game. Would you pay that much? I wouldn't, and neither would Daniel Bernoulli.
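To see why the bankroll matters so little, here is a standalone sketch (my addition, restating `st_pete` and `EV` with exact fractions) showing that each doubling of the casino's bankroll adds only about one dollar of expected value:

```python
from fractions import Fraction

def st_pete(limit):
    "St. Petersburg distribution for a bank limited to `limit` (as defined above)."
    P, pot, pr = {}, 2, Fraction(1, 2)
    while pot < limit:
        P[pot] = pr
        pot, pr = pot * 2, pr / 2
    P[limit] = pr * 2  # the limit is paid whether the next flip is heads or tails
    return P

def EV(P):
    "Expected value of an {amount: probability} distribution."
    return sum(p * v for v, p in P.items())

for limit in (10**6, 10**8, 10**10):
    print(limit, float(EV(st_pete(limit))))
```

Multiplying the bankroll by 100 adds only about log2(100) ≈ 6.6 dollars of expected value, so the fair price grows logarithmically in the casino's resources.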

Response 2: Value of Money

Daniel Bernoulli came up with a second response to the paradox based on the idea that if you have a lot of money, then additional money becomes less valuable to you. If I had nothing, and I won \$1000, I would be very happy. But if I already had a million dollars and I won \$1000, it would be less valuable. How much less valuable? Bernoulli proposed, and experiments confirm, that the value of money is roughly logarithmic. That is, rational bettors don't try to maximize their expected monetary value, they try to maximize their expected utility: the amount of "happiness" that the money is worth. I'll write the function util to describe what a dollar amount is worth to a hypothetical gambler. util says that a dollar is worth a dollar, until the amount is "enough" money. After that point, each additional dollar is worth half as much (only brings half as much happiness). Value keeps accumulating at this rate until we reach the next threshold of "enough," when the utility of additional dollars is halved again. The exact details of util are not critical; what matters is that, overall, money becomes less valuable after we have won a lot of it.

In [55]:
def util(dollars, enough=1000): 
    "The value of money: only half as valuable after you already have enough."
    if dollars < enough:
        return dollars
    else:
        additional = dollars - enough
        return enough + util(additional / 2, enough * 2)

A table and a plot will give a feel for the util function. Notice the characteristic concave-down shape of the plot.

In [56]:
for d in range(2, 7):
    m = 10 ** d
    print('{:15,d} $ = {:6,d} util'.format(m, int(util(m))))
            100 $ =    100 util
          1,000 $ =  1,000 util
         10,000 $ =  4,250 util
        100,000 $ = 15,937 util
      1,000,000 $ = 51,593 util
In [57]:
%matplotlib inline 
import matplotlib.pyplot as plt

X = list(range(1000, 1000000, 10000))
plt.plot(X, [util(x) for x in X])
print('Y axis is util(x); x axis is in thousands of dollars.')
Y axis is util(x); x axis is in thousands of dollars.

Now I will define the function EU, which computes the expected utility of the game:

In [58]:
def EU(P, U):
    "The expected utility of a probability distribution, given a utility function."
    return sum(P[e] * U(e) 
               for e in P)
In [59]:
EU(StP, util)

That says we should pay up to \$13.10 to play the game, which sounds more reasonable than \$27.49.

Understanding St. Petersburg through Simulation

Before I plunk down my \$13, I'd like to understand the game better. I'll write a simulation of the game:

In [60]:
import random

def flip(): return random.choice(('head', 'tail'))

def simulate_st_pete(limit=10**9):
    "Simulate one round of the St. Petersburg game, and return the payoff."
    pot = 2
    while flip() == 'head':
        pot = pot * 2
        if pot > limit:
            return limit
    return pot

I will run the simulation 100,000 times (with a random seed specified for reproducibility):

In [61]:
from collections import Counter
import random

random.seed(42)  # any fixed seed makes the run reproducible (this particular value is illustrative)
c = Counter(simulate_st_pete() for _ in range(100000))
c
Counter({2: 49755,
         4: 25060,
         8: 12590,
         16: 6322,
         32: 3151,
         64: 1607,
         128: 751,
         256: 370,
         512: 191,
         1024: 106,
         2048: 45,
         4096: 29,
         8192: 10,
         16384: 6,
         32768: 5,
         65536: 1,
         1048576: 1})

The results are about what you would expect: about half the pots are 2, a quarter are 4, an eighth are 8, and higher pots are more and more unlikely. Let's check expected utility and expected value:

In [62]:
results = ProbDist(c)

EU(results, util), float(EV(results))
(13.2477575, 26.71606)

These are not too far off from the theoretical values.

To see better how things unfold, I will define a function to plot the running average of repeated rounds:

In [63]:
def running_averages(iterable):
    "For each element in the iterable, yield the mean of all elements seen so far." 
    total, n = 0, 0
    for x in iterable:
        total, n = total + x, n + 1
        yield total / n

def plot_running_averages(fn, n):
    "Plot the running average of calling the function n times."
    plt.plot(list(running_averages(fn() for _ in range(n))))
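A tiny check of running_averages on a hand-computable input (the function is redefined here so the snippet stands alone):

```python
def running_averages(iterable):
    "For each element in the iterable, yield the mean of all elements seen so far."
    total, n = 0, 0
    for x in iterable:
        total, n = total + x, n + 1
        yield total / n

# Means of the prefixes [1], [1, 3], [1, 3, 5]:
print(list(running_averages([1, 3, 5])))  # [1.0, 2.0, 3.0]
```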

Let's do ten repetitions of plotting the running averages of 100,000 rounds:

In [64]:

for i in range(10):
    plot_running_averages(simulate_st_pete, 100000);

What can we see from this? Nine of the 10 repetitions have a final expected value payoff (after 100,000 rounds) between 10 and 35. So a price around \$13 still seems reasonable. One outlier has an average payoff just over 100, so if you are feeling lucky you might be willing to pay more than \$13.

The Ellsberg Paradox

The Ellsberg Paradox has it all: an urn problem; a paradox; a conclusion that can only be resolved through psychology, not mathematics alone; and a colorful history with an inventor, Daniel Ellsberg, who went on to leak the Pentagon Papers. The paradox is as follows:

An urn contains 33 red balls and 67 other balls that are either black or yellow. You don't know the mix of black and yellow, just that they total 67. A single ball is drawn at random. You are given a choice between these two gambles:

  • R: Win \$100 for a red ball.
  • B: Win \$100 for a black ball.

You are also given a choice between these two gambles:

  • RY: Win \$100 for a red or yellow ball.
  • BY: Win \$100 for a black or yellow ball.

Many people reason as follows:

  • R: I win 1/3 of the time
  • B: I win somewhere between 0 and 2/3 of the time, but I'm not sure of the probability.
  • RY: I win at least 1/3 of the time and maybe up to 100% of the time; I'm not sure.
  • BY: I win 2/3 of the time.
  • Overall, I prefer the relative certainty of R over B and of BY over RY.

The paradox is that, from an expected utility point of view, that reasoning is inconsistent, no matter what the mix of black and yellow balls is (or no matter what you believe the mix might be). RY and BY are just the same gambles as R and B, but with an additional \$100 for a yellow ball. So if you prefer R over B, you should prefer RY over BY (and if you prefer B over R you should prefer BY over RY), for any possible mix of black and yellow balls.
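The arithmetic behind that claim is easy to check (my own sketch, using 33 red and 67 black-or-yellow balls, matching the urns defined in the code below):

```python
# With b black balls (and therefore 67 - b yellow), the expected number of
# winning balls out of 100 for each of the four gambles:
diffs = []
for b in range(68):
    R,  B  = 33, b
    RY, BY = 33 + (67 - b), b + (67 - b)
    diffs.append((R - B) - (RY - BY))

# Adding the yellow payoff shifts both gambles by the same amount,
# so the gap between the paired gambles never changes:
assert all(d == 0 for d in diffs)
```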

Let's demonstrate. For each possible number of black balls (on the x axis), we'll plot the expected value of each of the four gambles; R as a solid red line, B as a solid black line, RY as a dashed red line, and BY as a dashed black line:

In [65]:
def score(colors, urn): return sum(urn[c] for c in colors)

def show(colors, line):
    scores = [score(colors, urn) for urn in urns]
    plt.plot(blacks, scores, line)

blacks = list(range(68))
urns   = [Counter(R=33, B=b, Y=67-b) for b in blacks]

def ellsberg():
    show('R', 'r')
    show('B', 'k')
    show('RY', 'r--')
    show('BY', 'k--')
    plt.xlabel('Number of black balls')
    plt.ylabel('Expected value of each gamble')

ellsberg()


We see that for any number of black balls up to 33, the solid red line is above the solid black line, which means R is better than B. The two gambles are equal with 33 black balls, and from there on, B is better than R.

Similarly, up to 33 black balls, the dashed red line is above the dashed black line, so RY is better than BY. They are equal at 33, and after that, BY is better than RY. So in summary, R > B if and only if RY > BY.

It is pretty clear that this holds for every possible mix of black and yellow balls, taken one at a time. But what if you believe that the mix might be one of several possibilities? We'll define avgscore to give the score for a gamble (as specified by the colors in it), averaged over a collection of possible urns, each with a different black/yellow mix. Then we'll define compare to compare the four gambles on the collection of possible urns:

In [66]:
def avgscore(colors, urns): 
    return sum(score(colors, urn) for urn in urns) / len(urns)

def compare(urns):
    for colors in ('R', 'B', 'RY', 'BY'):
        print(colors.ljust(2), avgscore(colors, urns))
R  33.0
B  33.5
RY 66.5
BY 67.0

The above says that if you think any number of black balls is possible and they are all equally likely, then you should slightly prefer B > R and BY > RY.

Now imagine that for some reason you believe that any mix is possible, but that a majority of black balls is more likely (i.e. the urns in the second half of the list of urns are twice as likely as those in the first half). Then we will see that the same preferences hold, but more strongly:

In [67]:
compare(urns[:33] + 2 * urns[33:])
R  33.0
B  39.10679611650485
RY 60.89320388349515
BY 67.0

If we believe the first half of the list (with fewer black balls) is twice as likely, we get this:

In [68]:
compare(2 * urns[:33] + urns[33:])
R  33.0
B  27.782178217821784
RY 72.21782178217822
BY 67.0

This time the preferences are reversed for both gambles, R > B and RY > BY.

Now let's try another approach. Imagine there are two urns, each as described before, and the ball will be drawn from one or the other. We will plot the expected value of each of the four gambles, over all possible pairs of two different urns (sorted by the number of black balls in the pair):

In [69]:
def show2(colors, line):
    urnpairs = [(u1, u2) for u1 in urns for u2 in urns]
    urnpairs.sort(key=lambda urns: avgscore('B', urns))
    X = list(range(len(urnpairs)))
    plt.plot(X, [avgscore(colors, urns) for urns in urnpairs], line)

def ellsberg2():
    show2('R', 'r')
    show2('B', 'k')
    show2('RY', 'r--')
    show2('BY', 'k--')
    plt.xlabel('Different combinations of two urns')
    plt.ylabel('Expected value of each gamble')

ellsberg2()

The curves are different, but the results are the same: R > B if and only if RY > BY.

So why do many people prefer R > B and BY > RY? One explanation is ambiguity aversion, a cousin of risk aversion: it feels safer to take a definite 1/3 chance of winning than a gamble that might be as good as 2/3 but might be as bad as 0. This is irrational thinking (in the sense that those who follow this strategy will win less), but people are sometimes irrational.

Simpson's Paradox

This has nothing to do with the TV show. D'oh! In 1951, statistician Edward Simpson (who worked with Alan Turing at Bletchley Park during World War II), noted that it is possible to take a sample space in which A is better than B, and split it into two groups, such that B is better than A in both groups.

For example, here is data from trials of two treatments for kidney stones, A and B, separated into two subgroups or cases: first, for small kidney stones, and second for large ones. In all cases we record the number of good and bad outcomes of the treatment:

In [70]:
# Good and bad outcomes for kidney stone treatments A and B,
# each in two cases: [small_stones, large_stones]
A = dict(small=Counter(good=81.,  bad=6.),  large=Counter(good=192., bad=71.))
B = dict(small=Counter(good=234., bad=36.), large=Counter(good=55., bad=25.))

def success(case): return ProbDist(case)['good']

Let's compare probabilities of success:

In [71]:
success(A['small']), success(B['small'])
(0.9310344827586207, 0.8666666666666667)
In [72]:
success(A['large']), success(B['large'])
(0.7300380228136882, 0.6875)

We see that for small stones, A is better, 93% to 87%, and for large stones, A is also better, 73% to 69%. So A is better no matter what, right?

Not so fast.

We can add up Counters to get the overall success rate for A and B, over all cases:

In [73]:
success(A['small'] + A['large'])
In [74]:
success(B['small'] + B['large'])

Overall, B is more successful, 83% to 78%, even though A is better in both cases. So if you had kidney stones, and you wanted the highest chance of success, which treatment would you prefer? If you knew you had small stones (or large stones), the evidence supports A. But if the size is unknown, does that mean you should prefer B? Analysts agree that the answer is no: you should stick with A. The only reason B has a higher overall success rate is that doctors choose to do B more often on the easier, small-stone cases, and reserve A for the harder, large-stone cases. A is better, but it shows a lower overall percentage because it is given the more difficult patients.
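The flip in the aggregate can be verified by hand (my own arithmetic check, using the trial counts from the cell above; the success function computes the same ratios):

```python
# Within each subgroup, A's success rate beats B's...
A_small, B_small = 81 / 87,   234 / 270   # 0.931 vs 0.867
A_large, B_large = 192 / 263, 55 / 80     # 0.730 vs 0.688
# ...yet pooled over all patients, B comes out ahead:
A_all = (81 + 192) / (87 + 263)           # 273/350 = 0.78
B_all = (234 + 55) / (270 + 80)           # 289/350 ~= 0.826
assert A_small > B_small and A_large > B_large
assert B_all > A_all
```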

Here's another example, showing the batting averages for two baseball players, Derek Jeter and David Justice, for the years 1995 and 1996:

In [75]:
Jeter   = {1995: Counter(hit=12,  out=36),  1996: Counter(hit=183, out=399)}
Justice = {1995: Counter(hit=104, out=307), 1996: Counter(hit=45,  out=95)}

def BA(case): "Batting average"; return round(float(ProbDist(case)['hit']), 3)
In [76]:
BA(Jeter[1995]), BA(Justice[1995])
(0.25, 0.253)
In [77]:
BA(Jeter[1996]), BA(Justice[1996])
(0.314, 0.321)

So Justice had a higher batting average than Jeter in both 1995 and in 1996. Let's check the overall averages:

In [78]:
BA(Jeter[1995] + Jeter[1996])
In [79]:
BA(Justice[1995] + Justice[1996])

Overall, Jeter had a significantly higher batting average. How did Jeter manage to do worse both years, then? Because in 1995, Jeter was injured for much of the year, and his low batting average was over a small number of at-bats. In 1996, Jeter was healthy, but Justice was injured for much of the year; however he managed a high average in his few at-bats.
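Again the pooled numbers check out by hand (my own arithmetic, from the hit/out counts above; a batting average is hits divided by total at-bats):

```python
# Justice leads in each year...
assert 104 / 411 > 12 / 48     # 1995: .253 vs .250
assert 45 / 140 > 183 / 582    # 1996: .321 vs .314
# ...but Jeter leads overall:
jeter_overall   = (12 + 183) / (48 + 582)    # 195/630 ~= .310
justice_overall = (104 + 45) / (411 + 140)   # 149/551 ~= .270
assert jeter_overall > justice_overall
```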

For the kidney stone data, we trust the individual cases, not the overall, because there are biases that lead to patients being assigned to one treatment or the other. For the batting average data, we trust the overall numbers, because there are no such biases, and because larger numbers lead to a closer approximation to a true value. The data alone can't tell you what to believe; you need the story (and the model) behind the data as well.


We've seen how to manage probability paradoxes. Just be explicit about what the problem says, and then methodical about defining the sample space, and finally be careful in counting the number of outcomes in the numerator and denominator. But the bigger lesson is: treat those around you as reasonable people, and when they have different opinions, try to discover what problem they are solving.

Note: Mohammed El-Beltagy created a very nice translation of an earlier version of this page to Julia.