https://www.khanacademy.org/math/probability/descriptive-statistics
%pylab inline
%load_ext sympyprinting
Statistics deals with data.
Descriptive statistics: can we somehow describe the data with a smaller set of numbers?
Inferential statistics: use the data to draw conclusions (inferences) about a larger population.
import numpy as np
How can we describe data?
We have a set of numbers, e.g. heights of plants in a garden, in inches:
a = np.array([23, 29, 20, 32, 23, 21, 33, 25])
print a.sum()
print a.size
206
8
Average - "typical" or "middle" => central tendency
somehow represent the "center" of the numbers in the set.
Arithmetic Mean - sum of all numbers divided by the number (count) of numbers.
(4 + 3 + 1 + 6 + 1 + 7) / 6 = 22 / 6 = 3 4/6 = 3 2/3 ≈ 3.67
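The worked example above can be checked in code (Python 3 syntax, unlike the Python 2 session in this notebook):

```python
# Mean of the worked example: sum divided by the count.
data = [4, 3, 1, 6, 1, 7]
mean = sum(data) / len(data)  # true division in Python 3
print(round(mean, 2))  # 3.67
```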
mean = a.sum(dtype=float) / a.size
print mean
print np.mean(a)
assert mean == np.mean(a)
25.75
25.75
Median - the middle number. If you have an even count of numbers, take the arithmetic mean of the two middle numbers.
[4, 3, 1, 6, 1, 7] => sorted: [1, 1, 3, 4, 6, 7] => (3 + 4) / 2 = 3.5
[0, 7, 50, 10,000, 1,000,000] => 50 (the median is robust to extreme values)
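A minimal from-scratch median (Python 3 syntax) covering both the odd and even cases above:

```python
def median(values):
    """Middle value of a sorted copy; mean of the two middle values if the count is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([4, 3, 1, 6, 1, 7]))          # 3.5
print(median([0, 7, 50, 10000, 1000000]))  # 50
```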
a.sort()  # sorts in place and returns None, so don't print the result
print np.median(a)
24.0
Mode - Most common number in the data set.
[4, 3, 1, 6, 1, 7] => 1
np.bincount(a).argmax()
23
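Note that `np.bincount` only works for non-negative integers. A more general sketch (Python 3 syntax) uses `collections.Counter`:

```python
from collections import Counter

def mode(values):
    """Most common value in the data set (first encountered wins on ties)."""
    return Counter(values).most_common(1)[0][0]

print(mode([4, 3, 1, 6, 1, 7]))  # 1
```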
c = np.array([3, 2, 7, 9, 5, 1, 2])
print np.median(c)
3.0
c = np.array([5, 6, 6, 2, 9])
print np.mean(c)
5.6
c = np.array([2, 5, 10, 9, 2, 9, 4, 9])
np.bincount(c).argmax()
9
c = np.array([9, 7, 6, 6, 6, 8, 8, 4, 6, 2])
print np.mean(c)
6.2
c = np.array([10, 3, 2, 5, 1, 8, 1, 9, 7])
print np.median(c)
5.0
c = np.array([8, 6, 5, 9, 10, 1, 2, 4, 10])
np.mean(c)
6.1111111111111107
c = np.array([6, 2, 2, 5, 1, 2, 8, 8])
np.bincount(c).argmax()
2
c = np.array([4, 7, 1, 9, 8, 6, 1])
np.median(c)
6.0
b = (81 + 81 + 81 + 81 + 91) / 5
b
83
c = np.array([85, 77, 94, 88, 91])
np.mean(c)
87.0
c = np.array([82, 82, 82, 82, 100, 100])
np.mean(c)
88.0
c = np.array([83, 98, 80, 81, 91, 95])
np.mean(c)
88.0
c = np.array([82, 82, 82, 90])
np.mean(c)
84.0
c = np.array([92, 88, 86, 95, 97, 76])
np.mean(c)
89.0
c = np.array([85, 85, 85, 100, 100])
np.mean(c)
91.0
c = np.array([84, 84, 84, 96])
np.mean(c)
87.0
d = np.array([14, 6, 3, 2, 4, 15, 11, 8, 1, 7, 2, 1, 3, 4, 10, 22, 20], dtype=float)
boxplot(d, vert=False)
[boxplot of d displayed]
d = np.array([3, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8, 9, 9, 10, 11], dtype=float)
boxplot(d, vert=False)
[boxplot of d displayed]
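The box in a boxplot spans the middle 50% of the data. A sketch (Python 3 syntax) of the quartiles behind it using `np.percentile`; note matplotlib's boxplot may use a slightly different quartile convention:

```python
import numpy as np

d = np.array([3, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8, 9, 9, 10, 11], dtype=float)
# Default method is linear interpolation between order statistics.
q1, med, q3 = np.percentile(d, [25, 50, 75])
print(q1, med, q3)  # 6.5 7.0 8.5
```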
Variance measures how far, on average, the data points in a population are from the population mean. 'N' is the size of the total population.
$$ mean = \mu = \frac{\displaystyle\sum_{i=1}^{N} x_i} {N} = \frac{x_1 + x_2 + x_3 + x_4 + x_5} {5} $$
sample = [1, 3, 5, 7, 14]
population_mean = sum(sample)/len(sample)
print population_mean
6
variance $\begin{align*}= \sigma^2 = \frac{\displaystyle\sum_{i=1}^{N} (x_i - \mu)^2} {N} \end{align*}$
$\mu =$ population mean
def population_variance(population):
    mean = np.mean(population)
    pv = sum([(float(i) - mean)**2 for i in population]) / len(population)
    return pv
population_variance(sample)
# Hours of television
sample = [1.5, 4, 1, 2.5, 2, 1]
print 'mean: %s' % np.mean(sample)
print 'variance: %s' % population_variance(sample)
mean: 2.0
variance: 1.08333333333
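The hand-rolled `population_variance` should agree with NumPy's built-in, whose default `ddof=0` divides by N (Python 3 syntax):

```python
import numpy as np

sample = [1.5, 4, 1, 2.5, 2, 1]
# np.var with the default ddof=0 is the population variance.
print(round(np.var(sample), 4))  # 1.0833
```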
Is this variance the best estimate we can make given the data we have?
$$ s^2_n = \frac{\displaystyle\sum_{i=1}^{n} (x_i - {\bar x})^2} {n} $$
Unbiased sample variance:
$$ s^2 = s^2_{n-1} = \frac{\displaystyle\sum_{i=1}^{n} (x_i - {\bar x})^2} {n-1} $$
def unbiased_variance(sample):
    mean = np.mean(sample)
    pv = sum([(float(i) - mean)**2 for i in sample]) / (len(sample) - 1)
    return pv
print 'sample variance: %s' % unbiased_variance(sample)
sample variance: 1.3
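Equivalently, `np.var(..., ddof=1)` divides by n-1 and should match `unbiased_variance` (Python 3 syntax):

```python
import numpy as np

sample = [1.5, 4, 1, 2.5, 2, 1]
# ddof=1 subtracts one degree of freedom: divide by n-1 instead of n.
print(round(np.var(sample, ddof=1), 2))  # 1.3
```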
N = total population size
n = sample size
Population Mean (parameter): $$ \mu = \frac{\displaystyle\sum_{i=1}^{N} x_i} {N} $$
Sample Mean (statistic): $$ {\bar x} = \frac{\displaystyle\sum_{i=1}^{n} x_i} {n} $$
Population variance (parameter): $$ \sigma^2 = \frac{\displaystyle\sum_{i=1}^{N} (x_i - \mu)^2} {N} $$
Sample variance estimate (statistic): $$ s^2_n = \frac{\displaystyle\sum_{i=1}^{n} (x_i - {\bar x})^2} {n} $$
Unbiased sample variance estimate: $$ s^2 = s^2_{n-1} = \frac{\displaystyle\sum_{i=1}^{n} (x_i - {\bar x})^2} {n-1} $$
When you divide by the smaller number n-1 instead of n, you get a larger value. If a problem just says "sample variance", it usually means the unbiased estimate. With a larger sample, decreasing the denominator by 1 has a smaller effect.
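A small Monte Carlo sketch of the bias (Python 3 syntax; assumed setup: standard normal draws, so the true population variance is 1). Dividing by n underestimates on average, while dividing by n-1 is approximately unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 20000
samples = rng.standard_normal((trials, n))   # true population variance = 1
biased = np.var(samples, axis=1).mean()            # divide by n
unbiased = np.var(samples, axis=1, ddof=1).mean()  # divide by n-1
print(biased < unbiased)  # True: the biased estimate averages near (n-1)/n = 0.8
```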
The biased sample variance s^2_n systematically underestimates the population variance; its expected value is:
$$ E[s^2_n] = \frac{(n-1)\sigma^2} {n} < \sigma^2 $$
Dividing by n-1 instead corrects the bias:
$$ variance = s^2_{n-1} = \frac{\displaystyle\sum_{i=1}^{n} (x_i - {\bar x})^2} {n-1} $$
sample = [8,10,16,2,23,4]
print '%.2f years old' % np.mean(sample)
print '%.2f years^2' % population_variance(sample)
10.50 years old
51.25 years^2
sample = [30,17,1,3,11]
print '%.2f years old' % np.mean(sample)
print '%.2f years^2' % unbiased_variance(sample)
12.40 years old
137.80 years^2
sample = [1, 2, 3, 8, 7]
from IPython.display import display, Math, Latex
from sympy.printing.python import python
import sympy as sym
display(Math(r'\mu = %.2f' % np.mean(sample)))
display(Math(r'\sigma^2 = %.2f' % population_variance(sample)))
display(Math(r's^2_{n-1} = %.2f' % unbiased_variance(sample)))
Standard deviation is simply the square root of the variance.
import math
math.sqrt(7.76)
2.7856776554368237
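`np.std` applies the square root for you; with the default `ddof=0` it is the population standard deviation (Python 3 syntax):

```python
import numpy as np

sample = [1, 2, 3, 8, 7]
sigma = np.std(sample)  # sqrt of the population variance (ddof=0)
print(round(sigma, 4))  # 2.7857
```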
sample = [30,17,1,3,11]
mean = np.mean([float(i) for i in sample])
print '%.2f' % mean
v = unbiased_variance([float(i) for i in sample])
print '%.2f' % v
print '%.2f' % math.sqrt(v)
12.40
137.80
11.74
sample = [8,10,16,2,23,4]
mean = np.mean([float(i) for i in sample])
print '%.2f' % mean
v = population_variance([float(i) for i in sample])
print '%.2f' % v
print '%.2f' % math.sqrt(v)
10.50
51.25
7.16
Now, we look at just the numerator, ignoring the N, as we simplify the equation. $$ \begin{align*} (x_i - \mu)^2 = \\ (x_i - \mu)(x_i - \mu) = \\ (x_i^2 - 2\mu{x_i} + \mu^2) \end{align*} $$
And picking up the full expression again $$ \begin{align*} \sigma^2 = \\ \frac{\displaystyle\sum_{i=1}^{N} (x_i - \mu)^2} {N} = \\ \frac{\displaystyle\sum_{i=1}^{N}(x_i^2 - 2\mu{x_i} + \mu^2)} {N} = \\ \frac{\displaystyle\sum_{i=1}^{N}x_i^2 - 2{\mu}\displaystyle\sum_{i=1}^{N}x_i + \mu^2\displaystyle\sum_{i=1}^{N}1} {N} = \\\\ \frac{\sum_{i=1}^{N}x_i^2} {N} - \frac{2\mu\sum_{i=1}^{N}x_i} {N} + \frac{\mu^2 N} {N} = \\ \frac{\sum_{i=1}^{N}x_i^2} {N} - 2\mu^2 + \mu^2 = \\ \frac{\sum_{i=1}^{N}x_i^2} {N} - \mu^2 \end{align*} $$
Substituting $\mu = \frac{\sum x_i}{N}$, this is the "Raw Score Method": \begin{align*} \sigma^2 = \frac{\displaystyle\sum_{i=1}^{N}x_i^2} {N} - \frac{\left(\displaystyle\sum_{i=1}^{N}{x_i}\right)^2} {N^2} \end{align*}
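A numeric check (Python 3 syntax) that the simplified form equals the definition, using the [1, 3, 5, 7, 14] sample from earlier:

```python
import numpy as np

x = np.array([1, 3, 5, 7, 14], dtype=float)
N, mu = x.size, x.mean()
direct = ((x - mu) ** 2).sum() / N         # definition of sigma^2
raw_score = (x ** 2).sum() / N - mu ** 2   # simplified "raw score" form
print(np.isclose(direct, raw_score))  # True
```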
display(Math(r'F(k) = \int_{-\infty}^{\infty} f(x) e^{2\pi i k x} dx'))
x, y, z = sym.symbols("x y z")
print sym.latex(sym.Rational(3,2)*sym.pi + sym.exp(sym.I*x) / (x**2 + y))
\frac{3}{2} \pi + \frac{e^{\mathbf{\imath} x}}{x^{2} + y}
display(Math(sym.latex(sym.Symbol("s_n")**2)))