Welcome to the data science school in Nyeri, Kenya.
This notebook is your guide to the lab classes for the entire school. The lab classes are intended to familiarize you with modeling as well as the principles of probabilistic inference.
The lab classes make use of our pods software for 'open data science', which gives access to data sets and other resources. It can be installed with:
pip install --pre pods
The first day will review probability and introduce regression models.
Jupyter and Probability Review: Introduction to the Jupyter notebook and a review of probability.
Introduction to Regression and Linear Algebra Review: Regression is the mainstay of many approaches to machine learning. Here we also motivate linear algebra through solving the regression problem.
Introduction to Basis Functions: Linear models can be limiting. Basis functions allow us to go nonlinear in our predictions while staying linear in our parameters.
Model Validation: Validation of model predictions is one of the most general and important concepts in machine learning, statistics and data science. Here we teach the general concepts and apply them to polynomial regression.
Probabilistic classification and naive Bayes: Classification of data is a mainstay of machine learning and data science. Here we consider the naive Bayes approach from the perspective of probabilistic modeling.
Logistic regression and Generalized Linear Models: Naive Bayes classifies the data by modeling the entire joint distribution of the labels and inputs. This can be useful when there is missing data, but it requires a very rich class of models. Logistic regression instead models the conditional distribution of the label given the data. It also leads to a very general class of models known as 'generalized linear models'.
Dimensionality reduction: Unsupervised learning is an exploratory approach to understanding a data set. Here we consider dimensionality reduction as an approach to unsupervised learning.
Clustering: Clustering is a discrete exploratory approach to understanding a data set. In this lab we consider clustering as an approach to unsupervised learning.
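To give a flavour of the regression and basis function labs, here is a minimal sketch of a model that is nonlinear in its input but linear in its parameters. It is not taken from the lab notebooks: the polynomial basis, the toy data and the helper names polynomial_basis and fit_least_squares are all assumptions made for illustration.

```python
import numpy as np


def polynomial_basis(x, degree):
    """Return the design matrix with columns 1, x, x^2, ..., x^degree."""
    x = np.asarray(x, dtype=float).reshape(-1)
    return np.vander(x, degree + 1, increasing=True)


def fit_least_squares(Phi, y):
    """Solve min_w ||y - Phi w||^2 for the weight vector w."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w


# Hypothetical toy data: a noisy quadratic (not one of the lab data sets).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.05 * rng.standard_normal(x.shape)

# The prediction is a curve in x, yet it is a linear function of w,
# so ordinary least squares still applies.
Phi = polynomial_basis(x, degree=2)
w = fit_least_squares(Phi, y)
y_pred = Phi @ w
```

Changing degree changes the flexibility of the fit, which is the kind of choice the model validation lab addresses.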
We don't have time to cover everything we'd like, but we've included a couple of other resources here in case you would like to learn more.
Matrix Factorization for Collaborative Filtering: An example of using pandas for data analysis and of representing opinions on a computer through collaborative filtering and matrix factorization.
Bayesian Regression: Bayesian averaging over models is one way of improving performance by reducing variance without increasing bias.
Nonlinear Dimensionality Reduction (TODO): Nonlinear dimensionality reduction seeks to find a low-dimensional space that is nonlinearly related to our observed data.
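As a rough illustration of the Bayesian regression idea, the sketch below computes the posterior over the weights of a linear model and averages predictions over it. It assumes a zero-mean Gaussian prior with variance alpha and Gaussian noise with variance sigma2; these choices, the toy data and the function names are illustrative assumptions, not the contents of the lab notebook.

```python
import numpy as np


def posterior_weights(Phi, y, alpha, sigma2):
    """Posterior mean and covariance of w for y = Phi w + noise.

    Assumes prior w ~ N(0, alpha * I) and noise ~ N(0, sigma2 * I).
    """
    n_features = Phi.shape[1]
    precision = Phi.T @ Phi / sigma2 + np.eye(n_features) / alpha
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / sigma2
    return mean, cov


def predict(Phi_new, mean, cov, sigma2):
    """Predictive mean and variance after averaging over the posterior."""
    f_mean = Phi_new @ mean
    # Diagonal of Phi_new cov Phi_new^T, plus the observation noise.
    f_var = np.sum((Phi_new @ cov) * Phi_new, axis=1) + sigma2
    return f_mean, f_var


# Hypothetical toy data: a noisy straight line.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 30)
Phi = np.column_stack([np.ones_like(x), x])
y = 0.5 + 2.0 * x + 0.1 * rng.standard_normal(x.shape)

mean, cov = posterior_weights(Phi, y, alpha=10.0, sigma2=0.01)
f_mean, f_var = predict(Phi, mean, cov, sigma2=0.01)
```

The predictive variance f_var is never smaller than the noise variance, because uncertainty about the weights is added on top of it.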