#!/usr/bin/env python
# coding: utf-8

# ##Image Prediction w/ scikit-learn
# For the purposes of this post our problem statement is as follows: If we are given a training set of _n_ faces belonging to a class _y_, can we accurately compose or predict the rest of a partially obscured or missing faces? A data array representing an image depends on the image's format, which dictates the number of channels and channel depth. This excercise requires a constrained and standardized input space, for example 1024 x 768 images with red, green, and blue (RGB) channels and one 8-bit alpha channel. The alpha channel acts as a mask and specifies how the pixel's colors should be merged with another pixel when the two are overlaid. In most instances, you wouldn't define the alpha channel on a pixel-by-pixel basis, but rather per object or layer (think Photoshop).
# 
# A conventional image detection pipeline consists of the following stages:
# 1. Detect
# 2. Align
# 3. Represent
# 4. Classify/Predict
# 
# We can consider an "Extract" step to exist partly in step 1 and 2. Given the sum of my knowledge in computational photography/vison and the scope of this post, I'll be skipping the alignment/extractions steps and assume we're working with an image set of centered and cropped faces with a threshold of quality. Quality extractions in the wild along with all the affine, scale, rotational, etc. transformations could be it's own post but I want to focus more on the **Classify/Predict** portion of the pipeline.

# In[1]:


import numpy as np
from matplotlib import pyplot as plt


# In[2]:


# Note: Don't use `--pylab inline`
get_ipython().run_line_magic('matplotlib', 'inline')


# ##Our Dataset
# I am using the AT&T Laboratories Cambridge Olivetti faces dataset. Although it looks like the orignal higher-res images were lost, Sam Rowel's "Data for MATLAB hackers" at NYU has the complete dataset in smaller form: 400 64x64 grayscale 8-bit images. **40 subjects, 10 photos from various angles each.** http://www.cs.nyu.edu/~roweis/data.html
# 
# In this section we'll be doing our data munging as the data in this original MATLAB file isn't quite in the format we'll need. We'll get it in usable shape via a handful of transformations.
# 
# Luckily scipy has a helpful MATLAB file loader which returns a dict of numpy arrays:

# In[3]:


import scipy.io as sio

mat_data = sio.loadmat("data/olivettifaces.mat")

# Let's examine the arrays in our dict:
for key in mat_data.keys():
    if type(mat_data[key]).__module__ == 'numpy':
        print  key + ':', mat_data[key].shape
    else: None


# We want the `faces` array of this dictionary, let's transpose this 4096 x 400 matrix so we can have 400 rows (individual photos) with 4096 features [(raveled)](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html).

# In[4]:


faces = mat_data['faces'].T.copy() # Chaining numpy's transpose and copy method.
faces = np.float32(faces) # Convert to float for our estimators (could have used .astype


# 4096 = 64x64, thus we can unravel or reshape the values then transpose (or rotate if we wanted a different view) the numpy array as follows and examine the result

# In[5]:


# Diff image composition but gets the photo the right way up
# x = np.rot90(x, k =-1)
test_face = faces[90,:].reshape(64,64).T
plt.imshow(test_face, cmap=plt.cm.gray) # Manual grayscale spec

# If wanted to do the same for the entire set and create a 3D matrix
# faces = faces.reshape((400, 64, 64)).transpose(0, 2, 1)


# This is not a sophisticated matplotlib subplot, but a hacky way to quickly churn out multiple images for examination in matplotlib:
# 
# ```python
# for img in range(9):
#     plt.figure()
#     plt.imshow(faces[img,:].reshape(64,64).T, cmap=plt.cm.gray)
# ```
# I used the above code to discover that the dataset consisted of multiple unique 10-picture sets of each photo subject. Each photo is the subject with slight variations in angles (3/4 or headon) and expressions (smiling, stoic, etc.)
# 

# Let's examine the historgram of pixel intensities in this image. I see near black values but not white. We imported seaborn so we should expect an aesthetically pleasing override of matplotlib defaults.

# In[6]:


import seaborn as sns


# In[7]:


# calculate histogram
counts, bins = np.histogram(test_face, range(257))
# plot histogram centered on values 0..255
plt.bar(bins[:-1] - 0.5, counts, width=1)
plt.xlim([-0.5, 255.5])
plt.show()


# Our values are still in terms of pixel intensity. Let's standardize or scale our data as objective functions will not work properly without normalization and gradient descent converges much faster when the features are scaled.

# In[8]:


faces = faces - faces.min()
faces /= faces.max()
print faces.min(), faces.max()


# Let's save a 3D matrix of raveled data as image matrices for later use as we won't be feeding this format of array through our ML pipeline.

# In[9]:


faces_img_frmt = faces.reshape((400, 64, 64)).transpose(0, 2, 1)
faces_img_frmt.shape


# Let's get our data back in the original (row = 1 image, columns = features) format.

# In[10]:


# # the unspecified value is inferred to be 4096 (64*64)
faces = faces_img_frmt.reshape(len(faces_img_frmt), -1)
faces.shape


# If we sequentially look through all the images, we'll find that the pictures of our subjects are placed in contiguous blocks of 10. The only thing permutating within a class (set of photos from one subject) is the expression or angle of the face. We can create our label or class vector like so:

# In[11]:


target = np.array([i // 10 for i in range(400)])
faces.shape, target.shape, target[:20]


# ## Prediction
# ###Creating the train and test sets
# We now have our feature space and labels and can move on to prediction. I would usually utilize numpy.random.shuffle, numpy.random.permutation, or scikit's train_test_split function like so:   
# 
# ```python
# from sklearn.cross_validation import train_test_split
# 
# X_train, X_test, y_train, y_test = train_test_split(faces, target, test_size=0.33) #random_state=1)
# ```
# 
# This would create a randomized train and test set, in this instance however I want to ensure that my training model never sees a subset of subjects. This could mean my test set might have a face with a beard but my estimator was trained on a dataset that lacked a bearded subject. Given that the only thing permutating within a class (set of photos from one subject) is the expression or angle of the face
# 
# Labels are in contiguous blocks of ten, so we'll save the last ten faces for the test set.

# In[12]:


train = faces[target < 30]
test = faces[target >= 30]

print 'target: {}'.format(target.shape)
print 'train: {}'.format(train.shape)
print 'test: {}'.format(test.shape) 


# In[13]:


n_faces = test.shape[0]/10 #10
class_ids = range(30,41)

print 'class_ids: {}'.format(class_ids)
print 'n_faces: {}'.format(n_faces)


# Creating our train & test X and y:

# In[14]:


test = test[class_ids, :]

n_pixels = faces.shape[1] # of columns

# Upper half of the faces
X_train = train[:, :np.ceil(0.5 * n_pixels)]  
X_test = test[:, :np.ceil(0.5 * n_pixels)]

# Lower half of the faces
y_train = train[:, np.floor(0.5 * n_pixels):]  
y_test = test[:, np.floor(0.5 * n_pixels):]


# ###Choosing and Training Estimators

# Now that we have our training and test data we can try a variety of estimators from the least to the most complex. 
# 
# **Linear Regression:**
# You might not have known that the scikit package includes a Linear Regression implementation. In this case it's plain Ordinary Least Squares (scipy.linalg.lstsq) wrapped in scikit's excellent predictor object.
# 
# **RidgeCV:**
# Not sure why, but during my career I've mostly seen core statisticians, which I'm defining as individuals with classic stats education, use Ridge or LASSO implementations and almost never those with pure computer science or ML backgrounds. The gist behind Ridge regression is that it imposes a penalty on the size of coefficients in OLS. The ridge coefficients minimize a penalized residual sum of squares. A nice summary of Ridge vs OLS can be found here: http://www.stat.cmu.edu/~ryantibs/datamining/lectures/16-modr1.pdf
# 
# 
# **K-Nearest Neighbors:** 
# KNN came up a lot during my computational photography coursework although I've never had a reason to implement it in the wild, it's application for images is intuitive. KNN is nonparametric as it fits the data in the absence of any guidance or constraints, it also works well if data is bounded and we already have observations at those bounds. In the case of images matrices we know that the range for any given element or pixel is 0-255.
# 
# **Random Forests:** The bread and butter of the "data kiddie" (funny yet poignant: http://www.datatau.com/item?id=468). I'm never sure who my audience is for this blog and if they are more interested in practical application or all the underlying math and concepts behind the functions I apply to my toy problems. I'm thinking I should do a dedicated post to some of the more common estimators in ML and link back to those posts whenever I mention the estimator in future writing. 
# 
# Given a large enough machine, the aforementioned data kiddies can throw a multitude of estimators at a problem set and use grid or random search to tune all the hyper-parameters to find a better than default fit. This of course doesn't help model generalization for the next unseen dataset, bias/variation considerations, distilling or communicating the methods or findings to stakeholders and end-users, etc.

# In[15]:


from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor


# In[22]:


estimators_dict = { 
    "Linear regression": LinearRegression(),
    "Ridge": RidgeCV(),
    "K-nn": KNeighborsRegressor(),
    "Random Forest": RandomForestRegressor(n_estimators=10,
                                           criterion='mse',
                                           max_features=32,
                                           n_jobs=1),
    "Extra trees": ExtraTreesRegressor(n_estimators=10,
                                       max_features=32,
                                       random_state=0,
                                       n_jobs=1)
}

y_test_predictions = dict()


# I prefer to keep all the outputs of my estimators in a dictionary for more structured and readable code. Let's store a predicition and score for each estimator:

# In[23]:


get_ipython().run_line_magic('time', '')
for name, estimator in estimators_dict.items():
    estimator.fit(X_train, y_train)
    y_test_predictions[name]= {}
    y_test_predictions[name]['prediction'] = estimator.predict(X_test)
    y_test_predictions[name]['score'] = estimator.score(X_test, y_test)


# In[24]:


scores_list = []
for name, estimator in y_test_predictions.items():
    scores_list.append((name, estimator['score']))

# Sort our list of tuples
scores_list.sort(key=lambda list_tup: list_tup[1])
scores_list


# Ridge did the best out of all the estimators. Random Forest might be great off the shelf estimator, but in this case we see the superiority of more simple methods in specific applications.
# 
# Let's take a look at the plots and examine if we can qualitatively infer the same.

# ### Plotting the Results of the Prediction

# Below we'll use matplotlib to create grid plot of the various attempts each estimator made for a given test subject.
# A `figure` in matplotlib refers to the whole window being plotted, think of it as the "canvas". This figure can contain subplots within it. These subplots are positioned in a grid within the figure (3x3, 5x5, etc.)

# In[38]:


# Setting our plot config
n_cols = 1 + len(estimators_dict) # of columns in subplot = count of estimators
image_shape = (64, 64)
plt.figure(figsize=(2.0 * n_cols, 1.5 * n_faces)) # Odd that we specify figure size via a w,h tuple in inches

# Add a centered title to the figure.
# The y here was trial and error to get just right considering I am using plt.tight_layout()
plt.suptitle("Multiple Estimators on Face Completion", fontsize=16, y=1.02) 

# Compose our sub plot
for face in range(n_faces):
    
    # Plot left most actual face column
    actual_face = np.hstack((X_test[face], y_test[face]))
    sub = plt.subplot(n_faces, n_cols, face * n_cols + 1,
                          title="Actual Faces")
    sub.axis("off")
    sub.imshow(actual_face.reshape(image_shape),cmap=plt.cm.gray)
    
    # Plot results of our estimators
    for dict_index, estimator in enumerate(estimators_dict):
        predicted_face = np.hstack((X_test[face], y_test_predictions[estimator]['prediction'][face]))    
        sub = plt.subplot(n_faces, n_cols, face * n_cols + 2 + dict_index,
                              title=estimator + 'Score:' + str(np.float16(y_test_predictions[estimator]['score']))[:5] )
        sub.axis("off")
        sub.imshow(predicted_face.reshape(image_shape),cmap=plt.cm.gray)

plt.tight_layout()
plt.show()


# It looks like KNN had strong bias to a specifc implementation in the toothed smile we see. Ridge seems to generalize well in this case, the decision tree pair seem to undercommit and potentially have too much variance and less bias, and linear regression won't be used by the NSA anytime soon :)
# 
# 
# I'm curious about this untuned random forest's poor performance. More trees is generally always better with diminishing returns. Growing deeper trees is generally better and less computationally intensive than growing more trees. Deeper trees reduces the bias, whereas more trees reduces the variance. I don't believe there was enough features or observations to see a performance difference between Random Forests and Extra Trees as we couldn't exhaustively split down a given feature versus create random cut off points for depth.
# 
# I gave Random Forests another go with no arbitrary feature cut offs or tree depth and acheived slightly worse results. The next attempt would involve a random or grid search of the estimator's parameters.