#!/usr/bin/env python
# coding: utf-8

# # Chapter 10: Cross validation, regularization, and basis functions

# # Exercises

# ## Exercise 1 (50 points)
#
# We want to use a smoothing (cubic) spline to fit a good curve to the data set `ratdiet_fields.dat`, fitting `trt` vs. `t`, which has $N$ data points.
#
# We will control the amount of smoothing through the effective degrees of freedom, dof, in the fit.
#
# Vary the dof from $2$ to $35$ in steps of $0.2$. For each value of dof, use leave-one-out cross validation to calculate the RMS of the fits. (That is, for each value of dof, you will fit $N$ cubic splines, each using a different set of $N-1$ data points, and you will calculate the residual on the one point left out of each fit.)
# Plot RMS vs. dof. Use this plot to identify the value of the dof with the smallest RMS (call this dof$_{\rm min}$) and report these two values.
#
# Plot the data, and overplot the three cubic spline solutions with dof equal to dof$_{\rm min}$, half this value, and twice it.

# ## Exercise 2 (50 points)
#
# The data file `cars93sel_MASS.dat` lists $14$ variables for $93$ cars (I'll leave you to guess the units).
#
# We want to find out the relative importance of the $J=13$ variables in determining the price, $y$ (in the first column).
# For this we will use a simple multidimensional linear model to predict the price of the $i^{\rm th}$ car
# \begin{equation}
# f(x_i) = \sum_{j=1}^{J} \beta_j x_{i,j} \ .
# \end{equation}
#
# Do the following:
#
# 1. Plot the price against each of the 13 variables separately (i.e. plot 13 panels). Which variables look as though they may be most significant (and most irrelevant) in determining the price? Do any of the variables look correlated?
# 2. _Standardize_ your data. This means shift each of the variables (including the price) to have zero mean, and rescale each to have unit standard deviation. We do this so that each variable covers a similar range, and so that their fit coefficients can be more easily compared.
# 3. Do a linear least squares fit to determine the coefficients of the model.
# 4. Use ridge regression with $\lambda=100$ to determine the coefficients of the model.
# 5. For both solutions, calculate the RMS of the residuals as well as the sum-of-squares of the coefficients, and plot the price residuals against the true price. Compare the coefficients found by your two fits. Did ridge regression preferentially "shrink" some coefficients rather than others?
# 6. Identify the single most significant variable, and perform a cubic least squares regression (no regularization) of price against it. What is the RMS, and how does it compare to that of the linear regression on all the variables?
# 7. Re-do the ridge regression for $50$ values of $\lambda$ equally spaced logarithmically between $\lambda=10^{-1}$ and $\lambda=10^5$. Plot $\beta$ vs. $\log_{10}\lambda$ for each of the $13$ coefficients, all on a single panel, so you can see how the relative size of each coefficient varies (I suggest connecting the points for each coefficient with lines, rather than plotting isolated points). For each value of $\lambda$ compute the RMS and plot it against $\lambda$. Compare with the linear regression result.
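A minimal sketch of the leave-one-out cross-validation loop at the heart of Exercise 1. The exercise itself calls for smoothing splines whose flexibility is set by the effective degrees of freedom; since the data file is not reproduced here, this sketch uses hypothetical synthetic data in place of `ratdiet_fields.dat`, and a polynomial degree as a stand-in for the dof grid, so that the CV machinery is self-contained and runnable. The loop structure, grid search, and held-out RMS are exactly what the exercise asks for.

```python
# Hypothetical stand-in for the Exercise 1 LOOCV loop: synthetic data replaces
# ratdiet_fields.dat, and polynomial degree plays the role of the spline dof.
import numpy as np

rng = np.random.default_rng(42)
t = np.linspace(0.0, 10.0, 40)                 # stand-in for the `t` column
y = np.sin(t) + rng.normal(0.0, 0.3, t.size)   # stand-in for the `trt` column

degrees = np.arange(1, 9)       # flexibility grid (analogue of the dof grid)
rms = np.empty(degrees.size)

for k, d in enumerate(degrees):
    resid = np.empty(t.size)
    for i in range(t.size):     # leave point i out, fit on the remaining N-1
        mask = np.ones(t.size, dtype=bool)
        mask[i] = False
        coeffs = np.polyfit(t[mask], y[mask], d)
        resid[i] = y[i] - np.polyval(coeffs, t[i])  # residual on held-out point
    rms[k] = np.sqrt(np.mean(resid**2))       # CV RMS for this flexibility

best = degrees[np.argmin(rms)]  # analogue of dof_min
print(best, rms.min())
```

For the actual exercise, replace the `np.polyfit`/`np.polyval` pair with a cubic smoothing spline fit (e.g. via `scipy.interpolate`) whose smoothing level is tuned to the target dof, and plot `rms` against the dof grid.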
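Steps 2, 3, 4, and 7 of Exercise 2 can be sketched in a few lines of NumPy. This is not the exercise solution: the design matrix below is hypothetical synthetic data standing in for `cars93sel_MASS.dat` (with the same $93 \times 13$ shape), and the sizes of the "true" coefficients are made up for illustration. The standardization, the least squares solve, and the closed-form ridge solution $\beta = (X^{\rm T}X + \lambda I)^{-1} X^{\rm T} y$ are the generic techniques the exercise names.

```python
# Sketch of standardization, OLS, and ridge regression on hypothetical data
# standing in for cars93sel_MASS.dat (93 cars, 13 predictors).
import numpy as np

rng = np.random.default_rng(0)
n, J = 93, 13
X = rng.normal(size=(n, J))
beta_true = np.zeros(J)
beta_true[:3] = [3.0, -2.0, 1.0]   # made-up: a few relevant variables
y = X @ beta_true + rng.normal(0.0, 1.0, n)

# Step 2: standardize each column (and the price) to zero mean, unit std.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()

# Step 3: ordinary least squares.
beta_ols, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

# Step 4: ridge regression with lambda = 100.
lam = 100.0
beta_ridge = np.linalg.solve(Xs.T @ Xs + lam * np.eye(J), Xs.T @ ys)

def rms(beta):
    """RMS of the (standardized) price residuals."""
    return np.sqrt(np.mean((ys - Xs @ beta) ** 2))

# Step 7 in miniature: coefficient paths over a logarithmic lambda grid.
lams = np.logspace(-1, 5, 50)
paths = np.array([np.linalg.solve(Xs.T @ Xs + l * np.eye(J), Xs.T @ ys)
                  for l in lams])   # shape (50, 13): one row per lambda

print(rms(beta_ols), rms(beta_ridge))
```

Note the two inequalities you should expect to see in step 5: the OLS training RMS is never larger than the ridge training RMS (OLS minimizes it by construction), while the ridge sum-of-squares of coefficients is never larger than the OLS one (that is the shrinkage the penalty buys).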