Let's explore some math and statistics today and make sense of differences that are "statistically significant" but small.
import math
import random
First let's play with some basic sampling functions, starting with integers.
random.randint(0, 100)
24
random_ints = [random.randint(0, 100) for x in range(10000)]
print(random_ints[:10])
[35, 14, 33, 2, 15, 23, 23, 99, 20, 47]
What does this look like? We need a histogram. Histograms are your friend.
import matplotlib.pyplot as mp
mp.hist(random_ints)
(a histogram appears: 10 bins across 0-100, each holding roughly 1000 of our 10000 draws)
In case your histogram didn't appear inline, type "mp.show()". Plus, this histogram by default has only 10 bars, which is not enough. Let's ask for more.
mp.hist(random_ints, 101)
(a histogram appears: 101 bins, each holding roughly 100 draws, reassuringly flat)
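Since these are integers, you can also tabulate exact counts without a plot. Here's a quick sketch using the standard library's `collections.Counter` (not something we used above, just an alternative; the seed is only there to make the sketch reproducible):

```python
import random
from collections import Counter

random.seed(0)  # reproducibility for this sketch only
draws = [random.randint(0, 100) for _ in range(10000)]
counts = Counter(draws)

# With 10000 draws over 101 values, each value should show up
# roughly 10000 / 101 ≈ 99 times.
print(len(counts))           # should be 101: every value appears
print(max(counts.values()))  # the most popular value's count
```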
Now let me tell you a story about my friend Jordan. He wanted to simulate the lottery, or at least a baby version of it, but all he had were dice. How much of a problem is this? Let's do some experiments.
def make_roll():
    return [random.randint(1, 6) for c in range(3)]
rolls = [make_roll() for x in range(100)]
print(rolls[0:4])
[[3, 6, 5], [2, 4, 6], [1, 2, 4], [4, 2, 6]]
Let's develop a test for there being a double. There are lots of ways to do this.
def has_double(roll):
    if roll[0] == roll[1] or roll[0] == roll[2] or roll[1] == roll[2]:
        return 1
    else:
        return 0
print(has_double([1, 1, 2]))
print(has_double([0, 1, 2]))
1
0
len(set([1, 2, 3, 2]))
3
set([1, 2, 3, 2])
{1, 2, 3}
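That set trick suggests a shorter version of the test. Here's a sketch of an equivalent `has_double` built on it (the name `has_double_set` is mine, not from the session above):

```python
def has_double_set(roll):
    # A double (or triple) means fewer distinct values than dice.
    return 1 if len(set(roll)) < len(roll) else 0

print(has_double_set([1, 1, 2]))  # 1
print(has_double_set([0, 1, 2]))  # 0
```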
filtered_rolls = [has_double(roll) for roll in rolls]
print(filtered_rolls[:10])
[0, 0, 0, 0, 1, 0, 1, 0, 1, 0]
Now we'd like to figure out "how often" doubles happen. If you think about it, that means adding up all the "1"s and dividing by the number of tries we did. But that formula is just the average value, so our choice of "1"s and "0"s was a good one. Of course we could have gone the exact opposite way, but whatevs.
def our_mean(values):  # don't call the parameter "list": that shadows the builtin
    return sum(values) / float(len(values))
our_mean(filtered_rolls)
0.52
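For what it's worth, the standard library already ships this computation. A quick check against `statistics.mean` (which we don't otherwise use in this post):

```python
import statistics

values = [0, 0, 1, 1, 1]
# Mean of an indicator list is exactly the proportion of 1s.
print(statistics.mean(values))  # 0.6
```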
Remember, that's just an estimate. It could easily have been different with a different sample. Let's vary the one parameter we have here, namely the number of draws in our sample.
def compute_percentage_of_doubles(n):
    rolls = [make_roll() for x in range(n)]
    filtered_rolls = [has_double(roll) for roll in rolls]
    return our_mean(filtered_rolls)
compute_percentage_of_doubles(1000000)
0.444608
After a bunch of experiments, we have some idea of the "actual value". Is there another way we can derive it? Let's use our noggins: a roll has no double exactly when all three dice differ, which happens in 6*5*4 of the 6*6*6 ordered outcomes, so doubles happen the rest of the time.
1-(6*5*4)/(6*6*6.0)
0.4444444444444444
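We can also double-check that derivation by brute force, enumerating all 216 ordered rolls. A sketch using `itertools.product` (not part of the session above):

```python
from itertools import product

all_rolls = list(product(range(1, 7), repeat=3))  # all 6**3 = 216 ordered outcomes
doubles = sum(1 for r in all_rolls if len(set(r)) < 3)  # at least two dice match

print(doubles)                   # 216 - 6*5*4 = 96
print(doubles / len(all_rolls))  # 96/216 = 0.444...
```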
A lot of what we just did carries over to everyday data analysis. Except that, in general, it is not possible to formally derive the answer.
Oh and plus, there's another function that will give you "samples without replacement," which is actually what Jordan wanted.
Oh and plus plus, I have to reimport stuff from random because two different things have the same name now.
from random import *
[sample(range(1, 7), 3) for x in range(10)]
[[2, 5, 3], [1, 2, 3], [1, 5, 4], [5, 1, 3], [4, 5, 1], [2, 1, 4], [2, 5, 3], [5, 1, 4], [5, 4, 1], [2, 5, 1]]
Just for fun, and now that we know the "true value" of the percentage of doubles, let's see how close we get when we take samples of 400. Let's draw a histogram of 1000 estimates of that true value.
just_for_fun = [compute_percentage_of_doubles(400) for x in range(1000)]
mp.hist(just_for_fun, 30)
(a histogram appears: the estimates pile up around 0.44, ranging from about 0.38 to 0.52)
What other kind of distributions do we have that don't involve integers?
uni_samples = [uniform(0, 1) for x in range(1000000)]
uni_samples[:10]
[0.26639958361763505, 0.38163880239537396, 0.04999522393547573, 0.8981225243250924, 0.334938312639219, 0.9050525677371757, 0.8310748478239941, 0.20260983773385222, 0.8555161375071846, 0.7208789533182086]
mp.hist(uni_samples, 30)
(a histogram appears: 30 bins of roughly 33,000 samples each, flat as a uniform should be)
tri_samples = [triangular(0, 1, 0.4) for x in range(100000)]
mp.hist(tri_samples, 30)
(a histogram appears: a triangle rising to a peak near 0.4 and falling back to zero at 1)
Look at the Wikipedia entry for the beta distribution to check out the two parameters we need.
beta_samples = [betavariate(0.5, 0.5) for x in range(100000)]
mp.hist(beta_samples, 30)
(a histogram appears: the U shape of Beta(0.5, 0.5), with most of the mass piled up near 0 and 1)
What do we think the Facebook sentiment score distribution looked like? I'ma go with Gaussian, where there were actually two distributions with a very small difference in their means. Let's explore this situation.
gauss_samples_1 = [gauss(0.01, 1) for x in range(1000)]
print(our_mean(gauss_samples_1))
gauss_samples_2 = [gauss(-0.01, 1) for x in range(1000)]
print(our_mean(gauss_samples_2))
0.00845756841145
0.00658068542127
mp.hist(gauss_samples_1, 50)
(a histogram appears: a bell curve centered near 0, spanning roughly -3.4 to 3.3)
mp.hist(gauss_samples_2, 50)
(a histogram appears: another bell curve centered near 0, spanning roughly -3.1 to 3.7)
So far we really can't see the difference, can we? But what if we had way more data?
gauss_samples_1 = [gauss(0.01, 1) for x in range(100000)]
print(our_mean(gauss_samples_1))
gauss_samples_2 = [gauss(-0.01, 1) for x in range(100000)]
print(our_mean(gauss_samples_2))
0.00740574896649
-0.0128683599903
That's much more like it. But was it a fluke to get closer estimates? Let's do it a bunch of times and see. We'll do it 1000 times with samples of size 1000. That's a total of 1,000,000 draws.
D = [our_mean([gauss(0.01, 1) for x in range(1000)]) for x in range(1000)]
print(our_mean(D))
mp.hist(D, 30)
0.0107593259445
(a histogram appears: the 1000 means cluster around 0.01, spanning roughly -0.08 to 0.12)
Pretty good, but not good enough, actually. If we have four times as many "people" in each group, and do that 4 times fewer times, we still have 1,000,000 draws. It's a general rule (and not just for the normal distribution) that quadrupling the sample size shrinks the error by a factor of 2. Let's try it and see!
D2 = [our_mean([gauss(0.01, 1) for x in range(4000)]) for x in range(250)]
print(our_mean(D2))
mp.hist(D2, 30)
0.00964399175911
(a histogram appears: the 250 means span roughly -0.04 to 0.05, a noticeably tighter cluster than before)
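We can check the factor-of-2 claim directly by comparing the spread (standard deviation) of the two batches of estimates. This is a fresh simulation, not the exact numbers above, and the seed and batch count are choices I made for the sketch:

```python
import random
import statistics

random.seed(1)  # reproducibility for this sketch only

def batch_of_means(sample_size, batches):
    # Each entry is the mean of one sample of the given size.
    return [statistics.mean(random.gauss(0.01, 1) for _ in range(sample_size))
            for _ in range(batches)]

sd_small = statistics.pstdev(batch_of_means(1000, 200))
sd_large = statistics.pstdev(batch_of_means(4000, 200))
print(sd_small / sd_large)  # should land near 2
```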
We still have about a 0.03 or 0.04 spread in each direction, and we want it to be more like a 0.01 spread. Cutting the spread by a factor of 4 takes 16 times more data in our sample.
4000*16
64000
In other words, if we have 100,000 people in each group, we're probably fine.
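The arithmetic behind that last step can be written down directly. Since the spread shrinks like the square root of the sample size, cutting it by a factor of k costs k squared times the data:

```python
# Reducing the spread by a factor of k takes k**2 times the data.
n_current = 4000
k = 4  # we want the ~0.04 spread down to ~0.01
n_needed = n_current * k ** 2
print(n_needed)  # 64000
```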