# Weight shrinkage competition

In this competition you will work on a real-world dataset. The objective is to predict the weight of a person using historical data collected over several years.

The competition is hosted on the [Kaggle website](https://inclass.kaggle.com/c/weightshrinkage). This competition is by invitation only; we will send out invitations to all emails registered on the Google mailing list. You can participate individually or in groups of two, and Kaggle explains how to set up a group.

In order to get full marks for this competition you should beat the benchmark evaluation. This sheet gives you an idea of how to do that.

# Interpolation techniques

The dataset you will be working with is a time series. Time series differ from what you typically see in machine learning problems. The most important distinction between a time series and other datasets is that the data points are no longer iid (**i**ndependent and **i**dentically **d**istributed).

Analysis of time series is a rich field of study. Here we show you one very simple method for working with time series. The dataset you download from the website contains many NA (**N**ot **A**vailable) values. Such missing values arise naturally in many datasets: sometimes measurements are not made for every record, and sometimes the data is corrupted. In this case, most NA values are dates on which no measurement was made.

Since the dataset is a time series, one simple approach is to fill in the NA values from neighboring values. This works particularly well when the missing values are scattered and do not occur in long continuous ranges. Unfortunately our dataset does contain long runs of missing values, but we will use this technique as a first step anyway. Feel free to use other methods to improve your result. Our objective is to fill in the missing values of the *Weight* column, our target variable, via interpolation.

## Interpolation in Python

You can do interpolation in Python using many different packages. Here we use [Pandas](http://pandas.pydata.org/), a common library for data analysis. The main difference between Pandas and NumPy is the way data is laid out in memory, which makes many operations efficient and easy.

As usual we start by loading the package:

```python
import pandas as pd
```

Pandas provides similar, and sometimes more advanced, functionality for loading different kinds of data. One of its particular strengths is working with time series data.

```python
data = pd.read_csv("./datasets/kag_train.csv", index_col='Date', parse_dates=['Date'])
data[400:410]
```

This loads the data, parses the *Date* field as datetimes, and sets the *Date* column as the index of the loaded table (note that $\texttt{parse\_dates}$ expects a list of column names). The standard data structure in Pandas is the *DataFrame*; a DataFrame contains several *Series* sharing the same index. If the index consists of dates, Pandas automatically constructs the appropriate datetime indexing. Also note that Pandas detects NA values automatically.
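Before interpolating, it is worth quantifying how much of *Weight* is missing and how long the gaps are, since long continuous gaps are exactly where simple interpolation struggles. The following is a minimal, self-contained sketch, not part of the benchmark recipe; it only assumes the *Weight* column loaded above.

```python
import pandas as pd

data = pd.read_csv("./datasets/kag_train.csv", index_col='Date', parse_dates=['Date'])

# How many Weight measurements are missing in total?
missing = data['Weight'].isnull()
print(missing.sum())

# How long is the longest continuous run of missing values?
# Label each run of consecutive equal values, then count the NAs per run.
run_id = (missing != missing.shift()).cumsum()
print(missing.groupby(run_id).sum().max())
```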
Let us first plot the reported weights:

```python
%matplotlib inline
data['Weight'].plot(marker='x', title='reported Weight')
```

Next, we would like to interpolate the missing values in the *Weight* column. To do so we first select the column we want to interpolate and then apply the $\texttt{interpolate}$ function to the selected Series.

```python
data['WeightInter'] = data['Weight'].interpolate()
data[400:410]
```

Comparing the values in *Weight* and *WeightInter*, we can see that Pandas used linear interpolation to fill in the null values in our dataset. We can now store the results and make a submission on Kaggle.

# Kaggle submission

In order to make a submission we have to provide predictions for specific dates, indexed by *ID*. The first step is to load these indices from the *kag_test.csv* file.

```python
test = pd.read_csv("./datasets/kag_test.csv")
test[:10]
```

Note that we did not use *ID* as the index, so Pandas created an index for us. Now we use the $\texttt{join}$ operation to join these indices to our original dataset. In order to join the two so-called frames we need to set the index of our training set to *ID*; this allows us to join the two frames on that column.

```python
data = data.set_index('ID')
predictions = test.join(data, on='ID')
predictions[:10]
```

We can now observe that *WeightInter* contains a prediction for every *ID*. The only thing left is to save the results and make a submission.

```python
# The header argument renames WeightInter to Weight, as the submission format expects.
predictions[['ID', 'WeightInter']].to_csv('sampleSubmission.csv',
                                          header=['ID', 'Weight'], index=False)
```

The first 5 lines of the file will look like this:

```python
!head -5 sampleSubmission.csv
```

# Remarks

This was only a first step towards a model for predicting the weight. What else can be done?

In order to predict time series values, one often computes a trendline using the target variable only (as you have just done for the weight), and then considers the residuals between the observed values and this trendline. Those residuals are then regressed on the predictor variables (here: calories, proteins, etc.).

That is, if you want to obtain better predictions, you may either use more sophisticated interpolation methods that yield better trendlines, or use regression methods (e.g. those from the lecture) to find a good model for the residuals in terms of the predictor variables.

# Working with categorical data

If you want to use the *Remark* column to make predictions, you will have to find a way to handle categorical data. There are several ways to do so; in the following we present two of them (both are sketched in code after the lists below):

1. Predict the mean of the output variable (for example the *Weight*, or its residual), conditional on the value of *Remark*.
2. Convert the categorical data to numerical data, and use any standard regression method (for example linear regression).

To convert categorical data to numerical data, again, there are several options. Here are two of them:

1. If you have only 2 categories, you are fine just assigning one number to the first category and another to the second: for example 0 and 1. However, if you have more than 2 categories, it is usually a bad idea to assign each category an arbitrarily chosen number. This is because 1 is nearer to 2 than it is to 3, but it may not make any sense to say that the first category is more similar to the second than it is to the third.
2. Make a vector whose length is the number of different categories, where all entries are set to 0 except that of the active category, which is set to 1 (this is often called one-hot encoding). For example, to model data with 3 categories, say "VisitOf", "VisitTo", "NoRemark", you would use a vector $v$ of length 3, where $v=(0,0,1)$, $v=(0,1,0)$ and $v=(1,0,0)$ encode "NoRemark", "VisitTo" and "VisitOf" respectively. Note that some would prefer an encoding like $v=(0,0)$, $v=(0,1)$ and $v=(1,0)$.
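As an illustration of the first approach, here is a minimal sketch of predicting by conditional means. It assumes the *Remark* and *Weight* columns discussed above; the "NoRemark" label for missing remarks is a hypothetical choice.

```python
import pandas as pd

data = pd.read_csv("./datasets/kag_train.csv")

# Treat missing remarks as their own category (label is a hypothetical choice).
data['Remark'] = data['Remark'].fillna('NoRemark')

# Mean weight conditional on the value of Remark.
conditional_means = data.groupby('Remark')['Weight'].mean()

# Predict each row's weight (or residual) by its category's mean.
data['WeightPred'] = data['Remark'].map(conditional_means)
```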
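For the second approach, Pandas offers $\texttt{get\_dummies}$, which produces exactly the one-hot encoding described in option 2. A minimal sketch, again assuming the *Remark* column:

```python
import pandas as pd

data = pd.read_csv("./datasets/kag_train.csv")

# One 0/1 indicator column per category of Remark.
dummies = pd.get_dummies(data['Remark'], prefix='Remark')
data = data.join(dummies)

# Passing drop_first=True to get_dummies instead yields the shorter
# (k-1)-column encoding mentioned in the note above.
```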
Finally, note that you are free to build new categories and, more generally, new features. You may for example merge different categories into a single one, or create new categories like "isNA" and "isNotNA", or create new numerical features such as the average food weight during the past 7 days (see the sketch at the end of this sheet).

# A final piece of advice

There are as many models for predicting the weight as you can think of. However, very simple models often yield some of the best results when used with the right input features. Thus, before starting to think about which (sophisticated) prediction method to use, have a very close look at the data and think twice about which features could be relevant to your problem. And if they are not provided, do construct these features!
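To close, here is a minimal sketch of one such constructed feature, the 7-day average suggested above. The *Calories* column name is hypothetical; substitute whichever food-related columns the dataset actually provides. It assumes the index is sorted by date, as in the training data loaded earlier.

```python
import pandas as pd

data = pd.read_csv("./datasets/kag_train.csv", index_col='Date', parse_dates=['Date'])

# Average calorie intake over the past 7 days (column name hypothetical).
# A time-based window ('7D') handles gaps between dates correctly,
# and NA values are skipped when computing the mean.
data['Calories7d'] = data['Calories'].rolling('7D', min_periods=1).mean()
```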