========================================================================================================
This data science project was inspired by problems in ThinkStats by Allen Downey's http://greenteapress.com/thinkstats/
% matplotlib inline
import numpy as np
import survey
import Pmf
from matplotlib import pyplot as plt
import pylab as P
Class survey and Pmf are available here: http://greenteapress.com/thinkstats/data.html
# import CDC data on pregnancies (http://www.icpsr.umich.edu/nsfg6/)
table = survey.Pregnancies()
table.ReadRecords()
print 'Number of pregnancies in records:', len(table.records)
Number of pregnancies in records: 13593
Next, we make subrecords of First babies and Other babies (babies born second, third etc.). We filter for only live births.
def part(table):
firstb = survey.Pregnancies()
otherb = survey.Pregnancies()
for ii in table.records:
if ii.outcome !=1:
continue
if ii.birthord ==1:
firstb.AddRecord(ii)
else:
otherb.AddRecord(ii)
return firstb, otherb
FB, OB = part(table)
print 'number of first babies', len(FB), 'number of other babies ', len(OB)
number of first babies 4413 number of other babies 4735
Now let's calculate the mean gestation period:
def Meanprg(list):
"""returns the mean of the pregnancy term of the
first babies or other babies
"""
item=[]
for elem in list.records:
item.append(elem.prglength*1.0)
itemmean=sum(item)/len(list.records)
return itemmean
fmean = Meanprg(FB)
omean = Meanprg(OB)
print 'Mean pregnancy term for first babies ', fmean, '[weeks]'
print 'Mean pregnancy term for other babies ', omean, '[weeks]'
print 'Difference in Mean [hours] ', (fmean-omean)*7*24.0, '[hours]'
Mean pregnancy term for first babies 38.6009517335 [weeks] Mean pregnancy term for other babies 38.5229144667 [weeks] Difference in Mean [hours] 13.1102608186 [hours]
First babies are born on average 13 hours earlier!!! Now let's look at the variation in the samples by calculating the standard deviation:
def stdprg(list):
"""return the standard deviation of the pregnancy
term for first babies or other biabies
"""
items=[]
for elem in list.records:
items.append(elem.prglength*1.0)
mean = sum(items)/len(list.records)
varterm=[]
for elem in list.records:
vari = (mean*1.0 - elem.prglength)**2.0
varterm.append(vari)
stddev = (sum(varterm)**0.5)/len(list.records)
return stddev
Fstd = stdprg(FB)
Ostd = stdprg(OB)
print 'StdDev term for first babies ', Fstd , ' weeks'
print 'StdDev term for other babies ', Ostd, ' weeks'
print 'StdDev term for first babies ', Fstd*7.0*24 , ' hours'
print 'StdDev term for other babies ', Ostd*7.0*24, ' hours'
print '\nRelative StdDev (StDev/mean):'
print 'first babies [stdev/mean] ', Fstd/fmean
print 'other babies [stdev/mean] ', Ostd/omean
StdDev term for first babies 0.0420226951999 weeks StdDev term for other babies 0.0380108315324 weeks StdDev term for first babies 7.05981279359 hours StdDev term for other babies 6.38581969745 hours Relative StdDev (StDev/mean): first babies [stdev/mean] 0.00108864401816 other babies [stdev/mean] 0.000986707056271
The difference in the mean pregnancy duration betweem first babies and other babies is roughly two standard deviations apart. Thus the difference of about 13 hours seems to be significant.
A picture is worth a thousand words. Let's visualize the data using a histogram:
def makehist(list): """ Returns a histogram for survey objects: myhist and mybin """ val=[] for ii in list.records: val.append(ii.prglength*1.0) myhist, mybin = np.histogram(val,bins=22,range=(22,47)) center = (mybin[1:]+mybin[:-1])/2.0 return myhist, mybin, center, val
Fhist, Fbin, center, Fval = makehist(FB)
Ohist, Obin, center, Oval = makehist(OB)
width=0.354
plt.bar(center, Fhist, width, color='pink',label='first babies')
plt.bar(center+width, Ohist, width, color='blue',label='other babies')
plt.ylabel("frequency")
plt.xlabel("weeks")
plt.legend(loc=2)
plt.show()
Now imagine that you're a physician and your patient, a first time mom, is pregnant and in her 40th week. She wants to know the chances of her baby being born on the 40th week of her pregnancy. You can answer this question by calculating the conditional probability of the baby being born on the 40th week provided the mother is still pregnant. Consider the following:
def BornatWeek(table,week):
bornweek = survey.Pregnancies()
for x in table.records:
if x.prglength >= week:
bornweek.AddRecord(x)
return bornweek
FB40 = BornatWeek(FB,40)
def pmfBornWeek(table,week):
val=[]
for x in table.records:
val.append(x.prglength*1.0)
weekpmf = Pmf.MakePmfFromList(val)
return weekpmf
FB40pmf = pmfBornWeek(FB40,40)
print 'chances of mom giving birth to first child, on 40th week into her pregnancy:', FB40pmf.Prob(40)
chances of mom giving birth to first child, on 40th week into her pregnancy: 0.438625204583
It is interesting to compare First babies wtih Other babies in this instance:
OB40=BornatWeek(OB,40)
OB40pmf = pmfBornWeek(OB40,40)
print 'chances of mom giving birth (to Other baby), on 40th week into her pregnancy:', OB40pmf.Prob(40)
chances of mom giving birth (to Other baby), on 40th week into her pregnancy: 0.56640625
It's interesting to find that the probabilities seem substantially different if the bother is giving birth to her first child or if she's given birth before.