Welcome to the Machine Learning project for hunting planets and supernovae. In this project you are challenged to find exoplanets from the Kepler Planet Hunting Project and supernovae from the Catalina Survey.
READ THE INSTRUCTIONS CAREFULLY
You can use the tutorial notebook as a guideline for putting pieces of code into your project notebook, and sufficient hints will be provided along the way. Feel free to ask for help if you have any trouble navigating the project.
This notebook will be your worksheet for the project. You are also asked to justify your decisions and record your results on the paper provided.
All necessary functions that you will need are already imported into the workspace, so you need not worry about them :)
# All the basic stuff
import numpy as np # For some numerical stuff
import matplotlib.pyplot as plt # For making beautiful plots
from datasets import load_planets, evaluate_my_results
import warnings
warnings.filterwarnings('ignore')
# Machine Learning Models
from sklearn.neighbors import KNeighborsClassifier # KNN Algorithm
from sklearn.tree import DecisionTreeClassifier # Decision Tree Classifier
from sklearn.linear_model import SGDClassifier # Stochastic Gradient Classifier
# Evaluation tools
from sklearn.model_selection import train_test_split # A utility to split data
from sklearn.metrics import precision_score
%pylab inline
Populating the interactive namespace from numpy and matplotlib
# The dataset that you need to train and evaluate will be preloaded. You just need to start putting your code
# Variables are as described as below
# feature_names = The features of the light curves in the dataset
# features = The actual features for training and testing
# labels = Class labels of the objects
feature_names,features,labels,validation_data = load_planets()
Successfully Loaded the data ! There are total 632 samples in your dataset. There are 119 Planets, 196 RRLyrae and 317 Supernovaes in your data. Your dataset has 21 features in total. Your validation data has 271 samples.
Key | Description |
---|---|
[0] amplitude (B) | Amplitude from the Fourier decomposition |
[1] cusum (B) | Cumulative sum index |
[2] hl_amp_ratio (B) | Ratio of higher and lower magnitudes than the average |
[3] kurtosis (B) | Kurtosis |
[4] period (P-V) | Period |
[5] period_SNR (P-V) | SNR of period derived using a periodogram |
[6] period_uncertainty (V) | Period uncertainty based on a periodogram |
[7] phase_cusum (V) | Cumulative sum index over a phase-folded light curve |
[8] phase_eta (V) | Eta over a phase-folded light curve |
[9] phi21 (V) | 2nd and 1st phase difference from the Fourier decomposition |
[10] phi31 (V) | 3rd and 1st phase difference from the Fourier decomposition |
[11] quartile31 (B) | 3rd quartile - 1st quartile |
[12] r21 (P-V) | 2nd and 1st amplitude difference from the Fourier decomposition |
[13] r31 (P-V) | 3rd and 1st amplitude difference from the Fourier decomposition |
[14] shapiro_w (V) | Shapiro-Wilk test statistics |
[15] skewness (B,V) | Skewness |
[16] slope_per10 (B) | 10th percentile of slopes of a phase-folded light curve |
[17] slope_per90 (B) | 90th percentile of slopes of a phase-folded light curve |
[18] stetson_k (V) | Stetson K |
[19] weighted_mean (B) | Weighted mean magnitude |
[20] weighted_std (B) | Weighted standard deviation of magnitudes |
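Several of the features above are simple statistics of the light-curve magnitudes. As a minimal sketch of what a few of them mean, the snippet below computes skewness [15], kurtosis [3], the interquartile range [11], and the weighted mean [19] for a synthetic toy light curve (the data and error values are made up for illustration; the real survey pipeline computes these from observed magnitudes):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
# A toy "light curve": sinusoidal variability plus Gaussian noise (illustration only).
mags = 15.0 + 0.3 * np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.05, 500)
errs = np.full_like(mags, 0.05)          # assumed uniform measurement errors

feat_skewness = skew(mags)               # feature [15]: skewness
feat_kurtosis = kurtosis(mags)           # feature [3]:  kurtosis
q1, q3 = np.percentile(mags, [25, 75])
feat_quartile31 = q3 - q1                # feature [11]: 3rd quartile - 1st quartile
w = 1.0 / errs**2                        # inverse-variance weights
feat_weighted_mean = np.sum(w * mags) / np.sum(w)  # feature [19]: weighted mean magnitude
```

Features like these let a classifier separate, say, a periodic RR Lyrae (symmetric, repeating) from a one-off supernova brightening (strongly skewed).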
Sometimes not all features are useful for building a good model, so it can be better to remove some features, or use a particular combination of them, for better training and model selection. In the following piece of code you can choose all of the features or just a subset of them.
# The numbers are the corresponding indices of the features as in the previous table. You can delete/add indices
features_selected =[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
## Do not touch this piece of code
features = features[:,features_selected]
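The line above uses NumPy fancy indexing: `[:, features_selected]` keeps every row (sample) but only the listed columns (features). A tiny self-contained sketch of the same operation:

```python
import numpy as np

X = np.arange(20).reshape(4, 5)   # 4 samples, 5 features
keep = [0, 2, 4]                  # indices of the features to keep
X_sel = X[:, keep]                # all rows, only columns 0, 2 and 4
print(X_sel.shape)                # (4, 3)
```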
# Split the data
train_data,test_data,train_labels,test_labels = train_test_split(features,labels,test_size=0.3,random_state=0)
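With `test_size=0.3`, 70% of the samples go to training and 30% are held out for testing; `random_state=0` makes the split reproducible. On an imbalanced dataset like this one (119 planets vs. 317 supernovae), it can also help to pass `stratify=labels` so each split keeps the same class proportions. A small sketch with synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)                 # 100 synthetic samples, 3 features
y = np.array([0] * 70 + [1] * 30)          # imbalanced labels: 70% vs 30%
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
# The 30-sample test set keeps the 70/30 class ratio: 21 zeros, 9 ones.
```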
# Train KNN
mymodel = KNeighborsClassifier(n_neighbors=5) # Create the classifier object in a variable 'mymodel'
mymodel = mymodel.fit(train_data,train_labels) # Train the algorithm and save the model mymodel
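The choice `n_neighbors=5` is just a starting point. One way to pick `k` more systematically is cross-validation on the training data; a minimal sketch (using a synthetic dataset from `make_classification`, not the project data) looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem standing in for the light-curve data.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

for k in (1, 3, 5, 9):
    # 5-fold cross-validated accuracy for each candidate k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())
```

Very small `k` tends to overfit noise; very large `k` over-smooths the class boundaries, so the cross-validated score usually peaks somewhere in between.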
# Test model accuracy on test set
predictions = mymodel.predict(test_data)
score = precision_score(test_labels,predictions,average='micro')
print('The precision score is %f'%(score*100))
The precision score is 91.052632
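Note that `average='micro'` pools all classes into a single score, which can hide weak performance on a minority class (here, the planets). Passing `average=None` returns one precision value per class instead. A self-contained sketch with toy labels:

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1, 2, 2]   # toy ground-truth labels
y_pred = [0, 1, 1, 1, 2, 0]   # toy predictions

micro = precision_score(y_true, y_pred, average='micro')     # one pooled score
per_class = precision_score(y_true, y_pred, average=None)    # one score per class
print(micro)      # 4 of 6 predictions correct -> 0.666...
print(per_class)  # [0.5, 0.666..., 1.0]
```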
# Make predictions on unseen data
my_predictions = mymodel.predict(validation_data)
#### Test the model against the benchmark validations
evaluate_my_results(my_predictions)
Number of labels 271
The precision score is 88.191882
The recall is 88.191882
You found 25 Exoplanets
You found 84 RRLyraes
You found 130 Supernova
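Beyond the overall precision and recall, a confusion matrix shows which classes the model mixes up (e.g. planets misclassified as supernovae). A minimal sketch with made-up labels, since the benchmark's true validation labels are hidden inside `evaluate_my_results`:

```python
from sklearn.metrics import confusion_matrix

# Toy true/predicted labels (illustration only).
y_true = ['Planet', 'Planet', 'RRLyrae', 'SN', 'SN', 'RRLyrae']
y_pred = ['Planet', 'SN',     'RRLyrae', 'SN', 'SN', 'Planet']

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=['Planet', 'RRLyrae', 'SN'])
print(cm)
```

The diagonal counts correct predictions; off-diagonal entries show exactly which class pairs get confused, which is useful when deciding which features to add or drop.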