#!/usr/bin/env python
# coding: utf-8

# # A Graduate Introduction to Probability and Statistics for Scientists and Engineers
#
# ## [Philip B. Stark](http://www.stat.berkeley.edu/~stark), Department of Statistics, University of California, Berkeley
#
# ## First offering: a 10-hour short course at the University of Tokyo, August 2015
#
# ## Software requirements
# + Jupyter: http://continuum.io/downloads and a Python 2 kernel for Jupyter; see https://ipython.org/install.html
#
# ## Supplemental Texts
# + Stark, P.B., 1997–2015. [_SticiGui: Statistical Tools for Internet and Classroom Instruction with a Graphical User Interface_](http://www.stat.berkeley.edu/~stark/SticiGui/index.htm).
# + Stark, P.B., 1990–2010. Lecture notes for Nonparametrics, [Statistics 240](https://www.stat.berkeley.edu/~stark/Teach/S240/Notes/index.htm).

# # Index
#
# **These notes are in draft form, with large gaps.**
# I'm happy to hear about any errors, and I hope eventually to fill in some of the missing pieces.
#
# 1. [Overview](overview.ipynb)
# 1. [Introduction to Jupyter and Python](jupyter.ipynb)
# 1. [Sets, Combinatorics, & Probability](prob.ipynb)
# 1. [Theories of Probability](probTheory.ipynb)
# 1. [Random Variables, Expectation, Random Vectors, and Stochastic Processes](rv.ipynb)
# 1. [Probability Inequalities](ineq.ipynb)
# 1. [Inference](inference.ipynb)
# 1. [Confidence Sets](conf.ipynb)

# # Rough Syllabus for Tokyo Short Course
#
# ## [Preamble: Introduction to Jupyter and Python](jupyter.ipynb)
# 1. The Jupyter notebook
#     + Cells, markdown, MathJax
# 1. Less Python than you need
#
# ## [Lecture 1: Probability](prob.ipynb)
# 1. What's the difference between Probability and Statistics?
# 1. Counting and combinatorics
#     + Sets: unions, intersections, partitions
#     + De Morgan's Laws
#     + The Inclusion-Exclusion principle
#     + The Fundamental Rule of Counting
#     + Combinations
#     + Permutations
#     + Strategies for counting
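# An illustrative aside (not part of the original outline): the Inclusion-Exclusion
# principle gives the number of *derangements* (permutations with no fixed point) of
# $n$ items as $D(n) = n! \sum_{k=0}^{n} (-1)^k/k!$; brute-force enumeration
# confirms the formula for small $n$.

```python
# Inclusion-exclusion count of derangements, checked against brute force.
from itertools import permutations
from math import factorial

def derangements_ie(n):
    """D(n) via inclusion-exclusion: n! * sum_{k=0}^{n} (-1)^k / k!."""
    return round(factorial(n) * sum((-1) ** k / factorial(k) for k in range(n + 1)))

def derangements_brute(n):
    """Enumerate all permutations of n items; keep those with no fixed point."""
    return sum(1 for p in permutations(range(n))
               if all(p[i] != i for i in range(n)))

for n in range(1, 8):
    assert derangements_ie(n) == derangements_brute(n)

print([derangements_ie(n) for n in range(1, 8)])  # → [0, 1, 2, 9, 44, 265, 1854]
```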
# 2. Axiomatic Probability
#     + Outcome space and events, events as sets
#     + Kolmogorov's axioms (finite and countable)
#     + Analogies between probability and area or mass
#     + Consequences of the axioms
#         - Probabilities of unions and intersections
#         - Bounds on probabilities
#         - Bonferroni's inequality
#         - The inclusion-exclusion rule for probabilities
#     + Conditional probability
#         - The Multiplication Rule
#         - Independence
#         - Bayes' Rule
#
# ## Lecture 2: Probability, continued
# 3. Theories of probability
#     + Equally likely outcomes
#     + Frequency Theory
#     + Subjective Theory
#     + Shortcomings of the theories
#     + Rates versus probabilities
#     + Measurement error
#     + Where does probability come from in physical problems?
#     + Making sense of geophysical probabilities
#         - Earthquake probabilities
#         - Probability of magnetic reversals
#         - Probability that Earth is more than 5B years old
# 4. Random variables
#     + Probability distributions of real-valued random variables
#     + Cumulative distribution functions
#     + Discrete random variables
#         - Probability mass functions
#         - The uniform distribution on a finite set
#         - Bernoulli random variables
#         - Random variables derived from the Bernoulli
#             * Binomial random variables
#             * Geometric
#             * Negative binomial
#         - Hypergeometric random variables
#         - Poisson random variables: countably infinite outcome spaces
#
# ## Lecture 3: Random variables, continued
# 5. Random variables, continued
#     + Continuous and "mixed" random variables
#     + Probability densities
#         - The uniform distribution on an interval
#         - The Gaussian distribution
#     + The CDF of discrete, continuous, and mixed distributions
#     + Distribution of measurement errors
#         - The box model for random error
#         - Systematic and stochastic error
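# A quick sketch (illustrative only; the box of error values is hypothetical): the
# box model treats each measurement error as a draw with replacement from a "box"
# of numbers. The empirical mean of many simulated draws is compared with the
# average of the box.

```python
# Box model for random measurement error: draws with replacement from a box.
import random
from math import sqrt

random.seed(12345)          # reproducible pseudo-random draws
box = [-2, -1, 0, 1, 2]     # hypothetical box of equally likely error values

avg_box = sum(box) / len(box)
sd_box = sqrt(sum((x - avg_box) ** 2 for x in box) / len(box))

draws = [random.choice(box) for _ in range(100_000)]
avg_draws = sum(draws) / len(draws)

print(avg_box, round(sd_box, 3))   # → 0.0 1.414 (SD of the box is sqrt(2))
print(round(avg_draws, 2))         # close to 0: the draws average out the box
```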
# 6. Independence of random variables
#     + Events derived from random variables
#     + Definitions of independence
#     + Independence and "informativeness"
#     + Examples of independent and dependent random variables
#     + IID random variables
#     + Exchangeability of random variables
# 7. Marginal distributions
# 8. Point processes
#     + Poisson processes
#         - Homogeneous and inhomogeneous Poisson processes
#         - Spatially heterogeneous, temporally homogeneous Poisson processes as a model for seismicity
#         - The conditional distribution of Poisson processes given N
#     + Marked point processes
#     + Inter-arrival times and inter-arrival distributions
#     + Branching processes
#         - ETAS
#
# ## Lecture 4: Expectation, Probability Inequalities, and Simulation
# 9. Expectation
#     + The Law of Large Numbers
#     + The Expected Value
#         - Expected value of a discrete univariate distribution
#             * Special cases: Bernoulli, Binomial, Geometric, Hypergeometric, Poisson
#         - Expected value of a continuous univariate distribution
#             * Special cases: uniform, exponential, normal
#         - Expected value of a multivariate distribution
#     + Standard Error and Variance
#         - Discrete examples
#         - Continuous examples
#         - The square-root law
#         - Standardization and Studentization
#         - The Central Limit Theorem
#     + The tail-sum formula for the expected value
#     + Conditional expectation
#         - The conditional expectation is a random variable
#         - The expectation of the conditional expectation is the unconditional expectation
#     + Useful probability inequalities
#         - Markov's Inequality
#         - Chebychev's Inequality
#         - Hoeffding's Inequality
#         - Jensen's inequality
# 10. Simulation
#     + Pseudo-random number generation
#         - Importance of the PRNG: period, DIEHARD
#     + Assumptions
#     + Uncertainties
#     + Sampling distributions
#
# ## Lecture 5: Testing
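# An illustrative check (uniform draws on [0, 1] are an assumption for the example,
# not from the original notes): Chebychev's inequality says
# $P(|X - E(X)| \ge k \cdot SD(X)) \le 1/k^2$. Comparing the bound with the
# empirical frequency for simulated draws shows how conservative it can be.

```python
# Empirical check of Chebychev's inequality for 100,000 uniform draws.
import random
from math import sqrt

random.seed(2015)                               # reproducible draws
n = 100_000
draws = [random.random() for _ in range(n)]     # uniform on [0, 1]
mu, sd = 0.5, sqrt(1 / 12)                      # exact mean and SD of U[0, 1]

bounds = {}
for k in (1.5, 2, 3):
    freq = sum(1 for x in draws if abs(x - mu) >= k * sd) / n
    bounds[k] = (freq, 1 / k ** 2)
    print(k, round(freq, 4), '<=', round(1 / k ** 2, 4))
```

# For k = 2 and k = 3 the empirical frequency is exactly 0 here, since 2 SD of a
# uniform variable already exceeds its half-range; the bound still holds.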
# 11. Hypothesis tests
#     + Null and alternative hypotheses, "omnibus" hypotheses
#     + Type I and Type II errors
#     + Significance level and power
#     + Approximate, exact, and conservative tests
#     + Families of tests
#     + P-values
#         - Estimating P-values by simulation
#     + Test statistics
#         - Selecting a test statistic
#         - The null distribution of a test statistic
#         - One-sided and two-sided tests
#     + Null hypotheses involving actual, hypothetical, and counterfactual randomness
#     + Multiplicity
#         - Per-comparison error rate (PCER)
#         - Familywise error rate (FWER)
#         - The False Discovery Rate (FDR)
#
# ## Lecture 6: Tests and Confidence Sets
# 12. Tests, continued
#     + Parametric and nonparametric tests
#         - The Kolmogorov-Smirnov test and the MDKW inequality
#         - Example: testing for uniformity
#         - Conditional test for Poisson behavior
#     + Permutation and randomization tests
#         - Invariances of distributions
#         - Exchangeability
#         - The permutation distribution of test statistics
#         - Approximating permutation distributions by simulation
#         - The two-sample problem
#     + Testing when there are nuisance parameters
# 13. Confidence sets
#     + Definition
#     + Interpretation
#     + Duality between hypothesis tests and confidence sets
#     + Tests and confidence sets for Binomial p
#     + Pivoting
#         - Confidence sets for a normal mean
#             * known variance
#             * unknown variance; Student's t distribution
#     + Approximate confidence intervals using the normal approximation
#         - Empirical coverage
#         - Failures
#     + Nonparametric confidence bounds for the mean of a nonnegative population
#     + Multiplicity
#         - Simultaneous coverage
#         - Selective coverage

# # Rough Syllabus for the complete 45-hour course
#
# ---
# ### Descriptive Statistics
#
# 1. Summarizing data
#     1. Types of data: categorical, ordinal, quantitative
#     1. Univariate data
#         1. Measures of location and spread: mean, median, mode, quantiles, inter-quartile range, range, standard deviation, RMS
#         1. Markov's and Chebychev's inequalities for quantitative lists
#         1. Ranks and ordinal categorical data
#         1. Frequency tables and histograms
#         1. Bar charts
#     1. Multivariate data
#         1. Scatterplots
#         1. Measures of association: Pearson and Spearman correlation coefficients
#         1. Linear regression
#             1. The Least Squares principle
#             1. The Projection Theorem
#             1. The Normal Equations
#             1. Numerical solution of the normal equations
#                 1. Numerical linear algebra is not the same as abstract linear algebra
#                 1. Condition number
#                 1. Do not invert matrices to solve linear systems: use backsubstitution or factorization
#             1. Errors in regression: RMS error of linear regression
#         1. Least Absolute Value regression
#         1. Principal components and approximation by subspaces: another application of the Projection Theorem
#         1. Clustering
#             1. Distance functions
#             1. Hierarchical methods, tree-based methods
#             1. Centroid methods: K-means
#             1. Density-based clustering: kernel methods, DBSCAN
#
# ---
# ### Probability
#
# 1. Counting and combinatorics
#     1. Sets: unions, intersections, partitions
#     1. De Morgan's Laws
#     1. The Inclusion-Exclusion principle
#     1. The Fundamental Rule of Counting
#     1. Combinations. Application (using the Inclusion-Exclusion Principle): counting derangements
#     1. Permutations
#     1. Strategies for complex counting problems
#
# 1. Theories of probability
#     1. Equally likely outcomes
#     1. Frequency Theory
#     1. Subjective Theory
#     1. Shortcomings of the theories
#
# 1. Axiomatic Probability
#     1. Outcome space and events, events as sets
#     1. Kolmogorov's axioms (finite and countable)
#     1. Analogies between probability and area or mass
#     1. Consequences of the axioms
#         1. Probabilities of unions and intersections
#         1. Bounds on probabilities
#         1. Bonferroni's inequality
#         1. The inclusion-exclusion rule for probabilities
#     1. Conditional probability
#         1. The Multiplication Rule
#         1. Independence
#         1. Bayes' Rule
#
# 1. Random variables
#     1. Probability distributions
#     1. Cumulative distribution functions for real-valued random variables
#     1. Discrete random variables
#         1. Probability mass functions
#         1. The uniform distribution on a finite set
#         1. Bernoulli random variables
#         1. Random variables derived from the Bernoulli
#             1. Binomial random variables
#             1. Geometric
#             1. Negative binomial
#         1. Poisson random variables: countably infinite outcome spaces
#         1. Hypergeometric random variables
#         1. Examples of other discrete random variables
#     1. Continuous and "mixed" random variables
#         1. Probability densities
#             1. The uniform distribution on an interval
#             1. The exponential distribution and double-exponential distributions
#             1. The Gaussian distribution
#         1. The CDF of discrete, continuous, and mixed distributions
#         1. Survival functions and hazard functions
#     1. Counting processes
#     1. Joint distributions of collections of random variables, random vectors
#         1. The multivariate uniform distribution
#         1. The multivariate normal distribution
#     1. Independence of random variables
#         1. Events derived from random variables
#         1. Definitions of independence
#     1. Marginal distributions
#     1. Conditional distributions
#         1. The "memoryless property" of the exponential distribution
#     1. The Central Limit Theorem
#     1. Stochastic processes
#         1. Point processes
#             1. Intensity functions and conditional intensity functions
#             1. Poisson processes
#                 1. Homogeneous and inhomogeneous Poisson processes
#                 1. The conditional distribution of Poisson processes given N
#             1. Marked point processes
#             1. Inter-arrival times and inter-arrival distributions
#             1. The conditional distribution of a Poisson process
#         1. Random walks
#         1. Markov chains
#         1. Brownian motion
#
# 1. Expectation
#     1. The Law of Large Numbers
#     1. The Expected Value
#         1. Expected value of a discrete univariate distribution
#             1. Special cases: Bernoulli, Binomial, Geometric, Hypergeometric, Poisson
#         1. Expected value of a continuous univariate distribution
#             1. Special cases: uniform, exponential, normal
#             1. (Aside: measurability, Lebesgue integration, and the CDF as a measure)
#         1. Expected value of a multivariate distribution
#         1. Expected values of functions of a random variable
#             1. Change-of-variables formulas for probability mass functions and densities
#     1. Standard Error and Variance
#         1. Discrete examples
#         1. Continuous examples
#         1. The square-root law
#     1. The tail-sum formula for the expected value
#     1. Conditional expectation
#         1. The expectation of the conditional expectation is the unconditional expectation
#     1. Useful probability inequalities
#         1. Markov's Inequality
#         1. Chebychev's Inequality
#         1. Hoeffding's Inequality
#
# ---
# ### Sampling
#
# 1. Empirical distributions
#     1. The ECDF for univariate distributions
#     1. The Kolmogorov-Smirnov statistic and the Massart-Dvoretzky-Kiefer-Wolfowitz inequality
#     1. Inference: inverting the MDKW inequality
#     1. Q-Q plots
#
# 1. Random sampling
#     1. Types of samples
#         1. Samples of convenience
#         1. Quota sampling
#         1. Systematic sampling
#         1. The importance of random sampling: stirring the soup
#         1. Systematic random sampling
#         1. Random sampling with replacement
#         1. Simple random sampling
#         1. Stratified random sampling
#         1. Cluster sampling
#         1. Multistage sampling
#         1. Weighted random samples
#             1. Sampling with probability proportional to size
#     1. Sampling frames
#     1. Nonresponse and missing data
#     1. Sampling bias
#
# 1. Simulation
#     1. Pseudo-random number generators
#         1. Why the PRNG matters
#         1. Uniformity, period, independence
#         1. Assessing PRNGs: DIEHARD and other tests
#         1. Linear congruential PRNGs, including the Wichmann-Hill. Group-induced patterns
#         1. Statistically "adequate" PRNGs, including the Mersenne Twister
#         1. Cryptographic-quality PRNGs, including cryptographic hashes
#     1. Generating pseudorandom permutations
#     1. Taking pseudorandom samples
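# A short sketch (the population of 1, ..., 100 is hypothetical; not from the
# original notes): draw many simple random samples with `random.sample` and look
# at the simulated sampling distribution of the sample mean, which centers on the
# population mean.

```python
# Simulating the sampling distribution of the sample mean under SRS.
import random

random.seed(7)
population = list(range(1, 101))            # hypothetical population: 1, ..., 100
pop_mean = sum(population) / len(population)

n, reps = 25, 10_000
means = [sum(random.sample(population, n)) / n for _ in range(reps)]
avg_of_means = sum(means) / reps

print(pop_mean)                  # → 50.5
print(round(avg_of_means, 1))    # close to 50.5: the sample mean is unbiased
```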
#     1. Simulating sampling distributions
#
# ---
# ### Estimation and Inference
#
# 1. Estimating parameters using random samples
#     1. Sampling distributions
#         1. The Central Limit Theorem
#     1. Measures of accuracy: mean squared error, median absolute deviation, etc.
#     1. Maximum likelihood
#     1. Loss functions, Risk, and decision theory
#         1. Minimax estimates
#         1. Bayes estimates
#     1. The Bootstrap
#     1. Shrinkage and regularization
#
# 1. Inference
#     1. Hypothesis tests
#         1. Null and alternative hypotheses, "omnibus" hypotheses
#         1. Type I and Type II errors
#         1. Significance level and power
#         1. Approximate, exact, and conservative tests
#         1. Families of tests
#         1. P-values
#             1. Estimating P-values by simulation
#         1. Test statistics
#             1. Selecting a test statistic
#             1. The null distribution of a test statistic
#             1. One-sided and two-sided tests
#         1. Null hypotheses involving actual, hypothetical, and counterfactual randomness
#         1. Multiplicity
#             1. Per-comparison error rate
#             1. Familywise error rate
#             1. The False Discovery Rate
#     1. Approaches to testing
#         1. Parametric and nonparametric tests
#         1. Likelihood ratio tests
#         1. Permutation and randomization tests
#             1. Invariances of distributions
#                 1. Exchangeability
#                 1. Other symmetries
#             1. The permutation distribution of test statistics
#             1. Approximating permutation distributions by simulation
#     1. Confidence sets
#         1. Duality between hypothesis tests and confidence sets
#     1. Conditional tests, conditional and unconditional significance levels
#
# 1. Tests of particular hypotheses
#     1. The Neyman model of a randomized experiment
#         1. Strong and weak null hypotheses
#         1. Testing the strong null hypothesis
#             1. The distribution of a test statistic under the strong null
#         1. "Interference"
#         1. Blocking and other designs
#         1. Ensuring that the null hypothesis matches the experiment
#     1. Tests for Binomial p
#     1. The Sign test
#         1. The sign test for the median; tests for other quantiles
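# An illustration (the data are hypothetical; not from the original notes): the
# sign test for H0: median = m counts the data points above m; ignoring ties, that
# count is Binomial(n, 1/2) under the null, so the one-sided P-value is a binomial
# upper-tail probability.

```python
# Exact sign test for a hypothesized median, via the Binomial(n, 1/2) null.
from math import comb

def sign_test_p(data, m):
    """One-sided P-value for H0: median = m vs H1: median > m, discarding ties."""
    kept = [x for x in data if x != m]
    n = len(kept)
    s = sum(1 for x in kept if x > m)              # number of values above m
    # P(Binomial(n, 1/2) >= s)
    return sum(comb(n, k) for k in range(s, n + 1)) / 2 ** n

data = [3.1, 4.2, 5.0, 6.3, 2.8, 7.1, 5.9, 6.6]    # hypothetical measurements
print(round(sign_test_p(data, 3.0), 4))            # → 0.0352, i.e. 9/256
```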
#         1. The sign test for a difference in medians
#     1. Tests based on the normal approximation
#         1. The Z statistic and the Z test
#         1. The t statistic and the t test
#         1. Two-sample problems, paired and unpaired tests
#     1. Tests based on ranks
#         1. The Wilcoxon test
#         1. The Wilcoxon signed rank test
#     1. Tests using actual values
#     1. Tests of association
#         1. The hypothesis of exchangeability
#         1. The Spearman test
#         1. The permutation distribution of the Pearson correlation
#     1. Tests of randomness and independence
#         1. The runs test
#     1. Tests of symmetry
#         1. Tests of exchangeability
#         1. Tests of spherical symmetry
#     1. The two-sample problem
#         1. Selecting the test statistic: what's the alternative?
#             1. Mean, sum, Student t
#             1. Smirnov statistic
#             1. Other choices
#         1. The permutation distribution of the test statistic
#     1. The two-sample problem for complex data
#         1. Test statistics
#     1. The k-sample problem
#     1. Stratified permutation tests
#     1. Fisher's Exact Test
#     1. Tests of homogeneity and ANOVA
#         1. The F statistic
#         1. The permutation distribution of the F statistic
#         1. Other statistics
#         1. Ordered alternatives
#     1. Tests based on the distribution function: the Kolmogorov-Smirnov Test
#         1. The universality of the null distribution for continuous variables
#         1. Using the K-S test to test for Poisson behavior
#     1. Sequential tests and Wald's SPRT
#         1. Random walks and Gambler's ruin
#         1. Wald's Theorem
#
# 1. Confidence intervals for particular parameters
#     1. Confidence intervals for a shift in the Neyman model
#     1. Confidence intervals for Binomial p
#         1. Application: confidence bounds for P-values estimated by simulation
#         1. Application: intervals for quantiles by inverting binomial tests
#     1. Confidence intervals for a Normal mean using the Z and t distributions
#     1. Confidence intervals for the mean
#         1. Nonparametric confidence bounds for a population mean
#             1. The need for a priori bounds
#             1. Nonnegative random variables
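# A minimal sketch (the grid search and the example of 7 successes in 20 trials are
# hypothetical choices; not from the original notes): a conservative confidence
# interval for Binomial p can be found by inverting exact binomial tests — keep
# every p0 that an equal-tailed exact test does not reject at level alpha.

```python
# Confidence interval for Binomial p by inverting exact equal-tailed tests.
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))

def invert_binom_test(x, n, alpha=0.05, grid=2000):
    """Grid-search the values p0 not rejected by an equal-tailed exact test."""
    kept = [p0 for p0 in (i / grid for i in range(grid + 1))
            if binom_cdf(x, n, p0) > alpha / 2
            and 1 - binom_cdf(x - 1, n, p0) > alpha / 2]
    return min(kept), max(kept)

lo, hi = invert_binom_test(7, 20)      # observed 7 successes in 20 trials
print(round(lo, 3), round(hi, 3))      # brackets the observed proportion 0.35
```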
#             1. Bounded random variables
#     1. Confidence sets for multivariate parameters
#
# 1. Density estimation
#     1. Histogram estimates
#     1. Kernel estimates
#     1. Confidence bounds for monotone and shape-restricted densities
#     1. Lower confidence bounds on the number of modes
#
# 1. Function estimation
#     1. Splines and penalized splines
#         1. Polynomial splines
#         1. Periodic splines
#         1. Smoothing splines as least-squares
#         1. B-splines
#         1. L1 splines
#     1. Constraints
#         1. Balls and ellipsoids
#         1. Smoothness and norms
#             1. Lipschitz conditions
#             1. Sobolev conditions
#         1. Cones
#             1. Nonnegativity
#             1. Shape restrictions
#                 1. Monotonicity
#                 1. Convexity
#                 1. Star-shaped constraints
#         1. Sparsity and minimum L1 methods
#
# ---
# ### *Sketchy from here down*
#
# ### Experiments
#
# 1. Experiments versus observational studies
#     1. Controls and the Method of Comparison
#     1. Randomization
#     1. Blinding
#
# 1. Experimental design
#     1. Blocking
#     1. Orthogonal designs
#     1. Latin hypercube design

# In[1]:

# Version information
get_ipython().run_line_magic('load_ext', 'version_information')
get_ipython().run_line_magic('version_information', 'scipy, numpy, pandas, matplotlib')

# In[ ]: