Data Science School, Nyeri, Kenya

15th-17th June 2015

Neil D. Lawrence

Welcome to the data science school in Nyeri, Kenya.

This notebook is your guide to the lab classes for the entire school. The lab classes are intended to familiarize you with modeling and with the principles of probabilistic inference.

The lab classes make use of our pods software for 'open data science' for access to data sets and other resources.

pip install --pre pods

Day 1: Introduction and Regression

The first day will review probability and introduce regression models.

  • Jupyter and Probability Review Introduction to the Jupyter notebook and a review of probability.

  • Introduction to Regression and Linear Algebra Review Regression is the mainstay of many approaches to machine learning. Here we also motivate linear algebra through solving the regression problem.

  • Introduction to Basis Functions Linear models can be limiting. Basis functions allow our predictions to be nonlinear in the inputs while the model stays linear in its parameters.

  • Model Validation Validation of model predictions is one of the most general and important concepts in machine learning, statistics and data science. Here we teach the general concepts and apply them to polynomial regression.
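As a minimal sketch of the basis function and validation ideas covered on Day 1, the code below fits polynomials of increasing degree by least squares and compares them on held-out points. The synthetic sine data, the helper names and the every-fourth-point split are illustrative assumptions, not taken from the lab notebooks.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_least_squares(Phi, y):
    """Weights w minimising the squared error ||y - Phi w||^2."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(x.shape)

# Hold out every fourth point for validation; train on the rest.
val_mask = np.arange(x.size) % 4 == 0
x_train, y_train = x[~val_mask], y[~val_mask]
x_val, y_val = x[val_mask], y[val_mask]

def validation_error(degree):
    """Mean squared error on the held-out points for a given degree."""
    w = fit_least_squares(polynomial_basis(x_train, degree), y_train)
    residual = y_val - polynomial_basis(x_val, degree) @ w
    return float(np.mean(residual ** 2))

errors = {degree: validation_error(degree) for degree in range(8)}
best_degree = min(errors, key=errors.get)
print("validation error by degree:", errors)
print("degree with lowest validation error:", best_degree)
```

The prediction is nonlinear in x because of the polynomial columns, yet the weights are still found by solving a linear least squares problem, which is the point made in the basis functions lab.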

Day 2: Classification

  • Probabilistic classification and naive Bayes Classification of data is a mainstay of machine learning and data science. Here we consider the naive Bayes approach from the perspective of probabilistic modeling.

  • Logistic regression and Generalized Linear Models Naive Bayes classifies the data by modeling the entire joint distribution of the labels and inputs. This can be useful when there is missing data, but it requires a very rich class of models. Logistic regression instead models only the conditional distribution of the label given the inputs. It also leads to a very general class of models known as 'generalized linear models'.
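To make the conditional modeling idea concrete, here is a minimal sketch of logistic regression fitted by gradient descent on the negative log likelihood. The two-cluster synthetic data, the function names, the learning rate and the iteration count are all assumptions chosen for illustration; the lab itself may use a different optimizer and data set.

```python
import numpy as np

def sigmoid(a):
    """Logistic function mapping real values to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def add_bias(X):
    """Prepend a column of ones so the first weight acts as a bias."""
    return np.column_stack([np.ones(X.shape[0]), X])

def fit_logistic_regression(X, y, learning_rate=0.1, iterations=2000):
    """Gradient descent on the average negative log likelihood."""
    X_aug = add_bias(X)
    w = np.zeros(X_aug.shape[1])
    for _ in range(iterations):
        p = sigmoid(X_aug @ w)                 # p(y = 1 | x, w)
        gradient = X_aug.T @ (p - y) / y.size  # gradient of the average loss
        w -= learning_rate * gradient
    return w

def predict_proba(X, w):
    """Conditional probability of the positive label given the inputs."""
    return sigmoid(add_bias(X) @ w)

# Synthetic data (an assumption): two Gaussian clusters, one per class.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w = fit_logistic_regression(X, y)
accuracy = float(np.mean((predict_proba(X, w) > 0.5) == y))
print("weights:", w)
print("training accuracy:", accuracy)
```

Note that nothing here models the distribution of the inputs themselves, in contrast to naive Bayes: only the probability of the label given the inputs is parameterized.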

Day 3: Unsupervised Learning

  • Dimensionality reduction Unsupervised learning is an exploratory approach to understanding a data set. Here we consider dimensionality reduction as an approach to unsupervised learning.

  • Clustering Clustering is a discrete exploratory approach to understanding a data set. In this lab we consider clustering as an approach to unsupervised learning.
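As one possible illustration of the dimensionality reduction item above, the sketch below implements principal component analysis through the singular value decomposition. PCA is used here as an assumed example technique; the source does not say which methods the lab covers, and the synthetic five-dimensional data and function name are hypothetical.

```python
import numpy as np

def pca(Y, n_components):
    """Project centred data onto its leading principal directions."""
    Y_centred = Y - Y.mean(axis=0)
    # Rows of Vt are orthonormal directions ordered by decreasing variance.
    U, s, Vt = np.linalg.svd(Y_centred, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = s[:n_components] ** 2 / (Y.shape[0] - 1)
    latent = Y_centred @ components.T
    return latent, components, explained_variance

# Synthetic data (an assumption): 5-dimensional observations generated
# from 2 underlying latent dimensions plus a small amount of noise.
rng = np.random.default_rng(2)
latent_true = rng.standard_normal((200, 2))
mixing = rng.standard_normal((2, 5))
Y = latent_true @ mixing + 0.01 * rng.standard_normal((200, 5))

X, W, variances = pca(Y, n_components=2)

# Map the 2-dimensional representation back to the original space.
reconstruction = X @ W + Y.mean(axis=0)
relative_error = np.linalg.norm(Y - reconstruction) / np.linalg.norm(Y)
print("explained variances:", variances)
print("relative reconstruction error:", relative_error)
```

Because the data were generated from two latent dimensions, two components are enough to reconstruct them almost exactly, which is the kind of structure an exploratory analysis hopes to reveal.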

Other Material

We don't have time to cover everything we'd like, but we've included a couple of other resources here in case you would like to learn more.
