import numpy as np
from matplotlib.pyplot import plot, xlabel, ylabel, legend

s = np.random.normal(size=(50000, 100))
n = np.arange(2, 100)
# To estimate the population SD from a sample, use ddof=1 (the n-1 denominator)
# rather than the default ddof=0. With ddof=0 the SD's asymptotic bias is roughly
# three times larger (~3/(4n) vs ~1/(4n) of sigma).
sd_est = np.array([s[:, 0:ii].std(axis=1, ddof=1).mean() for ii in n])
plot(n, sd_est)
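The ddof comment above can be checked directly. A quick sketch (seed and sample counts are arbitrary) comparing the empirical bias of the two estimators at a fixed $n$:

```python
import numpy as np

# Compare the SD bias with ddof=1 vs ddof=0 at n = 20 (true sigma = 1):
# the ddof=0 estimator is roughly three times as biased.
rng = np.random.default_rng(2)
n = 20
x = rng.normal(size=(500000, n))
print(1.0 - x.std(axis=1, ddof=1).mean())  # roughly 1/(4n) = 0.0125
print(1.0 - x.std(axis=1, ddof=0).mean())  # roughly 3/(4n) = 0.0375
```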
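The derivation below leans on the fact that $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$ for normal samples. Before using it, a quick numerical sanity check (sample sizes chosen arbitrarily) that the scaled statistic has the mean ($n-1$) and variance ($2(n-1)$) of a $\chi^2_{n-1}$ variable:

```python
import numpy as np

# For normal data, (n-1) s^2 / sigma^2 should behave like a chi^2_{n-1}
# variable: mean n-1, variance 2(n-1).
rng = np.random.default_rng(0)
n, sigma = 10, 2.0
x = rng.normal(scale=sigma, size=(200000, n))
q = (n - 1) * x.var(axis=1, ddof=1) / sigma**2
print(q.mean())  # close to n-1 = 9
print(q.var())   # close to 2(n-1) = 18
```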
Here I will explicitly calculate the expectation of the sample standard deviation (the original poster's second question) for a normally distributed sample, which makes the bias explicit.
The unbiased sample variance of a set of points $x_1, ..., x_n$ is
$$ s^{2} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline{x})^2 $$
If the $x_i$'s are normally distributed, it is a fact that
$$ \frac{(n-1)s^2}{\sigma^2} \sim \chi^{2}_{n-1} $$
where $\sigma^2$ is the true variance. The $\chi^2_{k}$ distribution has probability density
$$ p(x) = \frac{(1/2)^{k/2}}{\Gamma(k/2)} x^{k/2 - 1}e^{-x/2} $$
Using this, we can derive the expected value of $s$:
$$ \begin{align} E(s) &= \sqrt{\frac{\sigma^2}{n-1}} E \left( \sqrt{\frac{s^2(n-1)}{\sigma^2}} \right) \\ &= \sqrt{\frac{\sigma^2}{n-1}} \int_{0}^{\infty} \sqrt{x} \frac{(1/2)^{(n-1)/2}}{\Gamma((n-1)/2)} x^{((n-1)/2) - 1}e^{-x/2} \ dx \end{align} $$
which follows from the definition of expected value and the fact that $ \sqrt{\frac{s^2(n-1)}{\sigma^2}}$ is the square root of a $\chi^2_{n-1}$-distributed variable. The trick now is to rearrange terms so that the integrand becomes another $\chi^2$ density:
$$ \begin{align} E(s) &= \sqrt{\frac{\sigma^2}{n-1}} \int_{0}^{\infty} \frac{(1/2)^{(n-1)/2}}{\Gamma(\frac{n-1}{2})} x^{(n/2) - 1}e^{-x/2} \ dx \\ &= \sqrt{\frac{\sigma^2}{n-1}} \cdot \frac{ \Gamma(n/2) }{ \Gamma( \frac{n-1}{2} ) } \int_{0}^{\infty} \frac{(1/2)^{(n-1)/2}}{\Gamma(n/2)} x^{(n/2) - 1}e^{-x/2} \ dx \\ &= \sqrt{\frac{\sigma^2}{n-1}} \cdot \frac{ \Gamma(n/2) }{ \Gamma( \frac{n-1}{2} ) } \cdot \frac{ (1/2)^{(n-1)/2} }{ (1/2)^{n/2} } \underbrace{ \int_{0}^{\infty} \frac{(1/2)^{n/2}}{\Gamma(n/2)} x^{(n/2) - 1}e^{-x/2} \ dx}_{\chi^2_n \ {\rm density} } \end{align} $$
The integral in the last line is equal to 1, since its integrand is a $\chi^2_{n}$ density. Simplifying the constants a bit gives
$$ E(s) = \sigma \cdot \sqrt{ \frac{2}{n-1} } \cdot \frac{ \Gamma(n/2) }{ \Gamma( \frac{n-1}{2} ) } $$
Therefore the bias of $s$ is
$$ \sigma - E(s) = \sigma \bigg(1 - \sqrt{ \frac{2}{n-1} } \cdot \frac{ \Gamma(n/2) }{ \Gamma( \frac{n-1}{2} ) } \bigg) \sim \frac{\sigma}{4n} $$
as $n \to \infty$.
This bias is nonzero for every finite $n$, which proves that the sample standard deviation is biased. Below, the bias is plotted as a function of $n$ for $\sigma = 1$ (red), along with the $1/(4n)$ approximation (blue):
from scipy.special import gamma

# Bias of s: sigma * (1 - sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)),
# using sd_est as a plug-in for sigma (the true sigma is 1 here).
bias = sd_est * (1. - np.sqrt(2. / (n - 1)) * (gamma(n / 2.) / gamma((n - 1) / 2.)))
plot(n, bias, 'r-', n, sd_est / (4 * n), 'b-')
xlabel('Sample size')
ylabel('Standard deviation estimate bias')
legend(('Full estimate', 'SD/4n approximation'))
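As a cross-check on the closed form, $E(s) = \sigma \sqrt{2/(n-1)} \, \Gamma(n/2)/\Gamma(\frac{n-1}{2})$ can be compared against a direct Monte Carlo average of the sample SD (a sketch using `gammaln` to keep the gamma ratio numerically stable for larger $n$):

```python
import numpy as np
from scipy.special import gammaln  # log-gamma avoids overflow for large n

def expected_sd(n, sigma=1.0):
    """Closed-form E[s] for a normal sample of size n."""
    return sigma * np.sqrt(2.0 / (n - 1)) * np.exp(gammaln(n / 2.0) - gammaln((n - 1) / 2.0))

rng = np.random.default_rng(1)
for m in (2, 5, 30):
    mc = rng.normal(size=(100000, m)).std(axis=1, ddof=1).mean()
    print(m, expected_sd(m), mc)  # the two columns agree to ~2-3 decimals
```

At $n=2$ the formula reduces to $\sqrt{2/\pi} \approx 0.798$, a well-known special case.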
plot([0,100],[1,1],'k--', n,sd_est,'b-', n,(sd_est+sd_est/(4*n)),'c-', n,(sd_est+bias),'m-')
xlabel('Sample size')
ylabel('Standard deviation estimate')
legend(('True population SD', 'Uncorrected estimate', 'Corrected (SD/4n approx.)', 'Corrected (full model)'),loc='lower right')
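Finally, the $\sigma/(4n)$ asymptotic itself can be verified numerically by watching $n \cdot (\sigma - E(s))$ approach $1/4$ as $n$ grows:

```python
import numpy as np
from scipy.special import gammaln

def sd_bias(n, sigma=1.0):
    """Exact bias sigma - E[s] for a normal sample of size n."""
    c4 = np.sqrt(2.0 / (n - 1)) * np.exp(gammaln(n / 2.0) - gammaln((n - 1) / 2.0))
    return sigma * (1.0 - c4)

for m in (10, 100, 1000, 10000):
    print(m, m * sd_bias(m))  # approaches 0.25 from above
```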