https://www.khanacademy.org/math/probability/descriptive-statistics
%pylab inline
%load_ext sympyprinting
Statistics deals with data.
Descriptive statistics: can we somehow describe the data with a smaller set of numbers?
Inferential statistics: use the data to draw conclusions (inferences) about a larger population.
import numpy as np
How can we describe data?
We have a set of numbers, e.g. heights of plants in a garden, in inches:
a = np.array([23, 29, 20, 32, 23, 21, 33, 25])
print a.sum()
print a.size
206
8
Average - "typical" or "middle" => central tendency
somehow represent the "center" of the numbers in the set.
Arithmetic Mean - sum of all numbers divided by the number (count) of numbers.
(4 + 3 + 1 + 6 + 1 + 7) / 6 = 22 / 6 = 3 4/6 = 3 2/3 ≈ 3.67
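The worked example above can be checked in code (Python 3 syntax, unlike the Python 2 session in this notebook):

```python
# Mean of the worked example: sum divided by the count.
data = [4, 3, 1, 6, 1, 7]
mean = sum(data) / len(data)  # true division in Python 3
print(round(mean, 2))  # 3.67
```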
mean = a.sum(dtype=float) / a.size
print mean
print np.mean(a)
assert mean == np.mean(a)
25.75
25.75
Median - the middle number. If you have an even count of numbers, take the arithmetic mean of the two middle numbers.
[4, 3, 1, 6, 1, 7] => sorted: [1, 1, 3, 4, 6, 7] => (3 + 4) / 2 = 3.5
[0, 7, 50, 10,000, 1,000,000] => 50 (the median is robust to extreme values)
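A minimal from-scratch median (Python 3 syntax) covering both the odd and even cases above:

```python
def median(values):
    """Middle value of a sorted copy; mean of the two middle values if the count is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([4, 3, 1, 6, 1, 7]))          # 3.5
print(median([0, 7, 50, 10000, 1000000]))  # 50
```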
a.sort()  # sorts in place and returns None, so don't print the result
print np.median(a)
24.0
Mode - Most common number in the data set.
[4, 3, 1, 6, 1, 7] => 1
np.bincount(a).argmax()
23
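Note that `np.bincount` only works for non-negative integers. A more general sketch (Python 3 syntax) uses `collections.Counter`:

```python
from collections import Counter

def mode(values):
    """Most common value in the data set (first encountered wins on ties)."""
    return Counter(values).most_common(1)[0][0]

print(mode([4, 3, 1, 6, 1, 7]))  # 1
```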
c = np.array([3, 2, 7, 9, 5, 1, 2])
print np.median(c)
3.0
c = np.array([5, 6, 6, 2, 9])
print np.mean(c)
5.6
c = np.array([2, 5, 10, 9, 2, 9, 4, 9])
np.bincount(c).argmax()
9
c = np.array([9, 7, 6, 6, 6, 8, 8, 4, 6, 2])
print np.mean(c)
6.2
c = np.array([10, 3, 2, 5, 1, 8, 1, 9, 7])
print np.median(c)
5.0
c = np.array([8, 6, 5, 9, 10, 1, 2, 4, 10])
np.mean(c)
6.1111111111111107
c = np.array([6, 2, 2, 5, 1, 2, 8, 8])
np.bincount(c).argmax()
2
c = np.array([4, 7, 1, 9, 8, 6, 1])
np.median(c)
6.0
b = (81 + 81 + 81 + 81 + 91) / 5
b
83
c = np.array([85, 77, 94, 88, 91])
np.mean(c)
87.0
c = np.array([82, 82, 82, 82, 100, 100])
np.mean(c)
88.0
c = np.array([83, 98, 80, 81, 91, 95])
np.mean(c)
88.0
c = np.array([82, 82, 82, 90])
np.mean(c)
84.0
c = np.array([92, 88, 86, 95, 97, 76])
np.mean(c)
89.0
c = np.array([85, 85, 85, 100, 100])
np.mean(c)
91.0
c = np.array([84, 84, 84, 96])
np.mean(c)
87.0
d = np.array([14, 6, 3, 2, 4, 15, 11, 8, 1, 7, 2, 1, 3, 4, 10, 22, 20], dtype=float)
boxplot(d, vert=False)
[boxplot of d displayed]
d = np.array([3, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8, 9, 9, 10, 11], dtype=float)
boxplot(d, vert=False)
[boxplot of d displayed]
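The box in a boxplot spans the middle 50% of the data. A sketch (Python 3 syntax) of the quartiles behind it using `np.percentile`; note matplotlib's boxplot may use a slightly different quartile convention:

```python
import numpy as np

d = np.array([3, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8, 9, 9, 10, 11], dtype=float)
# Default method is linear interpolation between order statistics.
q1, med, q3 = np.percentile(d, [25, 50, 75])
print(q1, med, q3)  # 6.5 7.0 8.5
```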
Variance measures how far, on average, the data points in a population are from the population mean. 'N' is the size of the total population.
$$ mean = \mu = \frac{\displaystyle\sum_{i=1}^{N} x_i} {N} = \frac{x_1 + x_2 + x_3 + x_4 + x_5} {5} $$
sample = [1, 3, 5, 7, 14]
population_mean = sum(sample)/len(sample)
print population_mean
6
variance $\begin{align*}= \sigma^2 = \frac{\displaystyle\sum_{i=1}^{N} (x_i - \mu)^2} {N} \end{align*}$
$\mu =$ population mean
def population_variance(population):
    mean = np.mean(population)
    pv = sum([(float(i) - mean)**2 for i in population]) / len(population)
    return pv
population_variance(sample)
# Hours of television
sample = [1.5, 4, 1, 2.5, 2, 1]
print 'mean: %s' % np.mean(sample)
print 'variance: %s' % population_variance(sample)
mean: 2.0
variance: 1.08333333333
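The hand-rolled `population_variance` should agree with NumPy's built-in, whose default `ddof=0` divides by N (Python 3 syntax):

```python
import numpy as np

sample = [1.5, 4, 1, 2.5, 2, 1]
# np.var with the default ddof=0 is the population variance.
print(round(np.var(sample), 4))  # 1.0833
```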
Is this variance the best estimate we can make given the data we have?
$$ s^2_n = \frac{\displaystyle\sum_{i=1}^{n} (x_i - {\bar x})^2} {n} $$
Unbiased sample variance:
$$ s^2 = s^2_{n-1} = \frac{\displaystyle\sum_{i=1}^{n} (x_i - {\bar x})^2} {n-1} $$
def unbiased_variance(sample):
    mean = np.mean(sample)
    pv = sum([(float(i) - mean)**2 for i in sample]) / (len(sample) - 1)
    return pv
print 'sample variance: %s' % unbiased_variance(sample)
sample variance: 1.3
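Equivalently, `np.var(..., ddof=1)` divides by n-1 and should match `unbiased_variance` (Python 3 syntax):

```python
import numpy as np

sample = [1.5, 4, 1, 2.5, 2, 1]
# ddof=1 subtracts one degree of freedom: divide by n-1 instead of n.
print(round(np.var(sample, ddof=1), 2))  # 1.3
```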
N = total population size
n = sample size
Population Mean (parameter): $$ \mu = \frac{\displaystyle\sum_{i=1}^{N} x_i} {N} $$
Sample Mean (statistic): $$ {\bar x} = \frac{\displaystyle\sum_{i=1}^{n} x_i} {n} $$
Population variance (parameter): $$ \sigma^2 = \frac{\displaystyle\sum_{i=1}^{N} (x_i - \mu)^2} {N} $$
Sample variance estimate (statistic): $$ s^2_n = \frac{\displaystyle\sum_{i=1}^{n} (x_i - {\bar x})^2} {n} $$
Unbiased sample variance estimate: $$ s^2 = s^2_{n-1} = \frac{\displaystyle\sum_{i=1}^{n} (x_i - {\bar x})^2} {n-1} $$
When you divide by the smaller number n-1 instead of n, you get a larger value. If a problem just says "sample variance", it usually means the unbiased estimate. With a larger sample, decreasing the denominator by 1 has a smaller effect.
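A small Monte Carlo sketch of the bias (Python 3 syntax; assumed setup: standard normal draws, so the true population variance is 1). Dividing by n underestimates on average, while dividing by n-1 is approximately unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 20000
samples = rng.standard_normal((trials, n))   # true population variance = 1
biased = np.var(samples, axis=1).mean()            # divide by n
unbiased = np.var(samples, axis=1, ddof=1).mean()  # divide by n-1
print(biased < unbiased)  # True: the biased estimate averages near (n-1)/n = 0.8
```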
The biased sample variance s^2_n systematically underestimates the population variance; its expected value is:
$$ E[s^2_n] = \frac{(n-1)\sigma^2} {n} < \sigma^2 $$
Dividing by n-1 instead corrects the bias:
$$ variance = s^2_{n-1} = \frac{\displaystyle\sum_{i=1}^{n} (x_i - {\bar x})^2} {n-1} $$
sample = [8,10,16,2,23,4]
print '%.2f years old' % np.mean(sample)
print '%.2f years^2' % population_variance(sample)
10.50 years old
51.25 years^2
sample = [30,17,1,3,11]
print '%.2f years old' % np.mean(sample)
print '%.2f years^2' % unbiased_variance(sample)
12.40 years old
137.80 years^2
sample = [1, 2, 3, 8, 7]
from IPython.display import display, Math, Latex
from sympy.printing.python import python
import sympy as sym
display(Math(r'\mu = %.2f' % np.mean(sample)))
display(Math(r'\sigma^2 = %.2f' % population_variance(sample)))
display(Math(r's^2_{n-1} = %.2f' % unbiased_variance(sample)))
Standard deviation is simply the square root of the variance.
import math
math.sqrt(7.76)
2.7856776554368237
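`np.std` applies the square root for you; with the default `ddof=0` it is the population standard deviation (Python 3 syntax):

```python
import numpy as np

sample = [1, 2, 3, 8, 7]
sigma = np.std(sample)  # sqrt of the population variance (ddof=0)
print(round(sigma, 4))  # 2.7857
```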
sample = [30,17,1,3,11]
mean = np.mean([float(i) for i in sample])
print '%.2f' % mean
v = unbiased_variance([float(i) for i in sample])
print '%.2f' % v
print '%.2f' % math.sqrt(v)
12.40
137.80
11.74
sample = [8,10,16,2,23,4]
mean = np.mean([float(i) for i in sample])
print '%.2f' % mean
v = population_variance([float(i) for i in sample])
print '%.2f' % v
print '%.2f' % math.sqrt(v)
10.50
51.25
7.16
Now, we look at just the numerator, ignoring the N, as we simplify the equation. $$ \begin{align*} (x_i - \mu)^2 = \\ (x_i - \mu)(x_i - \mu) = \\ (x_i^2 - 2\mu{x_i} + \mu^2) \end{align*} $$
And picking up the full expression again $$ \begin{align*} \sigma^2 = \\ \frac{\displaystyle\sum_{i=1}^{N} (x_i - \mu)^2} {N} = \\ \frac{\displaystyle\sum_{i=1}^{N}(x_i^2 - 2\mu{x_i} + \mu^2)} {N} = \\ \frac{\displaystyle\sum_{i=1}^{N}x_i^2 - 2{\mu}\displaystyle\sum_{i=1}^{N}x_i + \mu^2\displaystyle\sum_{i=1}^{N}1} {N} = \\\\ \frac{\sum_{i=1}^{N}x_i^2} {N} - \frac{2\mu\sum_{i=1}^{N}x_i} {N} + \frac{\mu^2 N} {N} = \\ \frac{\sum_{i=1}^{N}x_i^2} {N} - 2\mu^2 + \mu^2 = \\ \frac{\sum_{i=1}^{N}x_i^2} {N} - \mu^2 \end{align*} $$
Substituting $\mu = \frac{\sum x_i}{N}$, this is the "Raw Score Method": \begin{align*} \sigma^2 = \frac{\displaystyle\sum_{i=1}^{N}x_i^2} {N} - \frac{\left(\displaystyle\sum_{i=1}^{N}{x_i}\right)^2} {N^2} \end{align*}
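A numeric check (Python 3 syntax) that the simplified form equals the definition, using the [1, 3, 5, 7, 14] sample from earlier:

```python
import numpy as np

x = np.array([1, 3, 5, 7, 14], dtype=float)
N, mu = x.size, x.mean()
direct = ((x - mu) ** 2).sum() / N         # definition of sigma^2
raw_score = (x ** 2).sum() / N - mu ** 2   # simplified "raw score" form
print(np.isclose(direct, raw_score))  # True
```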
display(Math(r'F(k) = \int_{-\infty}^{\infty} f(x) e^{2\pi i k x} dx'))
x, y, z = sym.symbols("x y z")
print sym.latex(sym.Rational(3,2)*sym.pi + sym.exp(sym.I*x) / (x**2 + y))
\frac{3}{2} \pi + \frac{e^{\mathbf{\imath} x}}{x^{2} + y}
display(Math(sym.latex(sym.Symbol("s_n")**2)))