We want to use a smoothing (cubic) spline to fit a good curve to the data set ratdiet_fields.dat, fitting trt vs. t, which has $N$ data points.
We will control the amount of smoothing through the effective degrees of freedom, dof, in the fit.
Vary the dof from $2$ to $35$ in steps of $0.2$. For each value of dof, use leave-one-out cross-validation to calculate the RMS of the fits. (That is, for each dof, you will fit $N$ cubic splines, each using a different set of $N-1$ data points, and for each fit you will calculate the residual on the one point left out.) Plot RMS vs. dof. Use this plot to identify the value of dof with the smallest RMS (call this dof$_{\rm min}$) and report these two values.
Plot the data, and overplot the three cubic spline solutions with dof equal to dof$_{\rm min}$, half this, and twice this.
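The leave-one-out procedure above can be sketched as follows. This is a hedged illustration, not the required solution: it uses synthetic data in place of ratdiet_fields.dat, and it controls smoothing through scipy's smoothing factor `s` rather than the effective degrees of freedom (tools such as R's `smooth.spline` expose a `df` argument directly; scipy does not). The cross-validation logic is the same whichever knob you vary.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Synthetic stand-in for the (t, trt) data; replace with the real file.
rng = np.random.default_rng(0)
N = 60
x = np.sort(rng.uniform(0.0, 10.0, N))
y = np.sin(x) + rng.normal(0.0, 0.3, N)

def loocv_rms(x, y, s):
    """Leave-one-out RMS for a cubic smoothing spline with factor s."""
    resid = []
    for i in range(len(x)):
        mask = np.ones(len(x), dtype=bool)
        mask[i] = False                       # drop the i-th point
        spl = UnivariateSpline(x[mask], y[mask], s=s)  # cubic by default (k=3)
        resid.append(y[i] - spl(x[i]))        # residual on the left-out point
    return float(np.sqrt(np.mean(np.square(resid))))

# Sweep the smoothing parameter (stand-in for the dof grid 2 to 35).
s_grid = np.linspace(1.0, 20.0, 20)
rms = [loocv_rms(x, y, s) for s in s_grid]
s_min = s_grid[int(np.argmin(rms))]           # analogue of dof_min
```

Plotting `rms` against `s_grid` then gives the cross-validation curve from which the minimum is read off.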
The data file cars93sel_MASS.dat
lists $14$ variables for $93$ cars (I'll leave you to guess the units).
We want to find out the relative importance of the $J=13$ variables in determining the price, $y$ (given in the first column). For this we will use a simple multidimensional linear model to predict the price of the $i^{th}$ car \begin{equation} f(x_i) = \sum_{j=1}^{J} \beta_j x_{i,j} \ . \end{equation}
Do the following: fit a ridge regression, varying the regularization parameter $\lambda$ on a logarithmic grid up to $\lambda=10^5$. Plot $\beta_j$ vs. $\log_{10} \lambda$ for each of the $13$ coefficients. Plot them all on a single panel so you can see how the relative size of each coefficient varies (I suggest using lines to connect the points for each coefficient, rather than plotting with points). For each value of $\lambda$ compute the RMS of the residuals and plot it against $\lambda$. Compare with ordinary (unregularized) linear regression.
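A minimal sketch of the $\lambda$ sweep, using the closed-form ridge solution $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$. The data here are synthetic placeholders (the grid's lower end and the preprocessing of cars93sel_MASS.dat are assumptions, not taken from the assignment); setting $\lambda=0$ recovers ordinary least squares for the comparison.

```python
import numpy as np

# Synthetic stand-in for the 93 cars x 13 predictors; replace with the real file.
rng = np.random.default_rng(1)
n, J = 93, 13
X = rng.normal(size=(n, J))
beta_true = rng.normal(size=J)
y = X @ beta_true + rng.normal(0.0, 0.5, n)

def ridge(X, y, lam):
    """Closed-form ridge coefficients: (X^T X + lam*I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Logarithmic grid of lambda values (lower bound 1e-2 is an assumption).
lams = np.logspace(-2, 5, 50)
betas = np.array([ridge(X, y, lam) for lam in lams])   # shape (50, 13)

# Training RMS of the residuals at each lambda.
rms = np.sqrt(np.mean((X @ betas.T - y[:, None]) ** 2, axis=0))

# lambda = 0 gives the ordinary least-squares fit for comparison.
beta_ols = ridge(X, y, 0.0)
```

Each column of `betas` traces one coefficient's path against $\log_{10}\lambda$, which is exactly the single-panel plot requested; as $\lambda$ grows the coefficients shrink toward zero and the RMS rises away from the least-squares value.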