In some problems, the response variable is not normally distributed.
In binary classification, for example, it follows a Bernoulli distribution: the probability distribution of a random variable that takes the positive case with probability P and the negative case with probability 1 - P.
If the response variable represents a probability, it must be constrained to the interval [0, 1].
Linear regression assumes that a constant change in the value of an explanatory variable results in a constant change in the value of the response variable, an assumption that cannot hold if the response variable represents a probability.
Generalized linear models remove this assumption by relating a linear combination of the explanatory variables to the response variable through a link function.
Ordinary linear regression is a special case of the generalized linear model: it relates a linear combination of the explanatory variables to a normally distributed response variable using the identity link function.
We can use a different link function to relate a linear combination of the explanatory variables to a response variable that is not normally distributed.
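For logistic regression, the link function is the logit; its inverse, the logistic (sigmoid) function, squashes any real-valued linear combination of the explanatory variables into the interval (0, 1), so the output can be read as a probability. The following is a minimal sketch of that mapping; the helper name sigmoid is ours, not a library function.

import numpy as np

def sigmoid(z):
    # Inverse of the logit link: maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # approx. [0.0000454 0.5 0.9999546]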
import pandas as pd
df = pd.read_csv('data/SMSSpamCollection', delimiter='\t', header=None)
print(df.head())
      0                                                  1
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
df = pd.read_csv('data/SMSSpamCollection', delimiter='\t', header=None)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
for i, prediction in enumerate(predictions[:5]):
    # Use .iloc for positional indexing; X_test_raw keeps its original index after the split
    print('Prediction: %s. Message: %s' % (prediction, X_test_raw.iloc[i]))
Prediction: ham. Message: Hi juan. Im coming home on fri hey. Of course i expect a welcome party and lots of presents. Ill phone u when i get back. Loads of love nicky x x x x x x x x x
Prediction: ham. Message: Jason says it's cool if we pick some up from his place in like an hour
Prediction: ham. Message: Can not use foreign stamps in this country.
Prediction: ham. Message: Night has ended for another day, morning has come in a special way. May you smile like the sunny rays and leaves your worries at the blue blue bay. Gud mrng
Prediction: ham. Message: Ma head dey swell oh. Thanks for making my day
|                 | Predicted Negative | Predicted Positive |
|-----------------|--------------------|--------------------|
| Actual Negative | TN                 | FP                 |
| Actual Positive | FN                 | TP                 |
%matplotlib inline
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
y_test = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
# Rows are the true classes; columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
[[4 1]
 [2 3]]
from sklearn.metrics import accuracy_score
y_pred, y_true = [0, 1, 1, 0], [1, 1, 1, 1]
print('Accuracy:', accuracy_score(y_true, y_pred))
Accuracy: 0.5
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
df = pd.read_csv('data/SMSSpamCollection', delimiter='\t', header=None)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
scores = cross_val_score(classifier, X_train, y_train, cv=5)
print(np.mean(scores), scores)
0.95596992395138047 [0.94504182 0.96774194 0.9497006  0.96047904 0.95688623]
Precision and recall are both computed from the entries of the confusion matrix.
Precision is the fraction of positive predictions that are correct, and recall is the fraction of truly positive instances that the classifier identifies:

|                 | Predicted Negative | Predicted Positive |
|-----------------|--------------------|--------------------|
| Actual Negative | TN                 | FP                 |
| Actual Positive | FN                 | TP                 |

$$P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN}$$
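To make the formulas concrete, here is a quick hand computation (our own worked example, not part of the SMS pipeline) using the counts from the small confusion matrix shown earlier, [[4 1], [2 3]], that is TN=4, FP=1, FN=2, TP=3.

tn, fp, fn, tp = 4, 1, 2, 3   # counts from the earlier confusion matrix example
precision = tp / (tp + fp)    # 3 / 4 = 0.75
recall = tp / (tp + fn)       # 3 / 5 = 0.6
print('Precision:', precision, 'Recall:', recall)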
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
df = pd.read_csv('data/SMSSpamCollection', delimiter='\t', header=None)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
lb = LabelBinarizer()
print(y_train)
print(lb.fit_transform(y_train))
# LabelBinarizer returns a column vector; flatten it into a 1-D array of 0/1 labels
y_train2 = lb.fit_transform(y_train).ravel()
classifier = LogisticRegression()
classifier.fit(X_train, y_train2)
precisions = cross_val_score(classifier, X_train, y_train2, cv=5, scoring='precision')
print('Precision', np.mean(precisions), precisions)
recalls = cross_val_score(classifier, X_train, y_train2, cv=5, scoring='recall')
print('Recalls', np.mean(recalls), recalls)
['ham' 'spam' 'ham' ... 'ham' 'spam' 'ham']
[[0]
 [1]
 [0]
 ...
 [0]
 [1]
 [0]]
Precision 0.99002164502164514 [1.         1.         0.975      0.98701299 0.98809524]
Recalls 0.6660869565217391 [0.67826087 0.59130435 0.67826087 0.66086957 0.72173913]
The F1 measure is the harmonic mean of precision and recall, both of which are derived from the confusion matrix entries:

|                 | Predicted Negative | Predicted Positive |
|-----------------|--------------------|--------------------|
| Actual Negative | TN                 | FP                 |
| Actual Positive | FN                 | TP                 |

$$F1 = 2 \cdot \frac{P \cdot R}{P + R}$$
f1s = cross_val_score(classifier, X_train, y_train2, cv=5, scoring='f1')
print('F1', np.mean(f1s), f1s)
F1 0.79545941505710827 [0.80829016 0.7431694  0.8        0.79166667 0.83417085]
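As a quick sanity check (our own example, not from the SMS data), the F1 measure can be computed directly from any precision and recall pair; using the hand-computed values from the worked example above:

precision, recall = 0.75, 0.6
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R
print('F1:', f1)  # approximately 0.667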
- The fall-out, or false positive rate, is the number of false positives divided by the total number of negatives: $F = \frac{FP}{TN + FP}$.
- Equivalently, it is the fraction of actual negative instances that are misclassified, as computed in the sketch below.
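Continuing the same hand-worked confusion matrix (TN=4, FP=1), the fall-out works out as follows:

tn, fp = 4, 1
fallout = fp / (fp + tn)   # 1 / 5 = 0.2
print('Fall-out:', fallout)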
%matplotlib inline
import matplotlib.pyplot as plt
# Hand-picked illustrative points: as the decision threshold is lowered,
# recall rises while negative recall (the true negative rate) falls
plt.title('Receiver Operating Characteristic')
negative_recall = [1, 0.98, 0.95, 0.9, 0.8, 0.4, 0]
recall = [0, 0.2, 0.4, 0.55, 0.62, 0.8, 1]
plt.plot(negative_recall, recall)
plt.plot([0, 1], [1, 0], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Negative Recall')
plt.show()
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_curve, auc
df = pd.read_csv('data/SMSSpamCollection', delimiter='\t', header=None)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
lb = LabelBinarizer()
y_train2 = lb.fit_transform(y_train).ravel()
# Use transform, not fit_transform, so the test labels get the same encoding as the training labels
y_test2 = lb.transform(y_test).ravel()
classifier = LogisticRegression()
classifier.fit(X_train, y_train2)
predictions = classifier.predict_proba(X_test)
false_positive_rate, recall, thresholds = roc_curve(y_test2, predictions[:, 1])
roc_auc = auc(false_positive_rate, recall)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
print(y_test2)
print(predictions[:,1])
print(false_positive_rate)
print(recall)
print(thresholds)
[0 0 0 ... 0 1 0]
[0.03919853 0.0401668  0.03758962 ... 0.05738717 0.89730153 0.04664608]
[0.         0.         0.         ... 0.99834711 0.99917355 1.        ]
[0.00546448 0.01092896 0.01639344 ... 1.         1.         1.        ]
[0.97599692 0.97542541 0.96781251 ... 0.00508785 0.00473233 0.00411287]
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score
pipeline = Pipeline([
('vect', TfidfVectorizer(stop_words='english')),
('clf', LogisticRegression())
])
parameters = {
'vect__max_df': (0.25, 0.5, 0.75),
'vect__stop_words': ('english', None),
'vect__max_features': (2500, 5000, 10000, None),
'vect__ngram_range': ((1, 1), (1, 2)),
# 'vect__use_idf': (True, False),
# 'vect__norm': ('l1', 'l2'),
'clf__penalty': ('l1', 'l2'),
# 'clf__C': (0.01, 0.1, 1, 10),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
df = pd.read_csv('data/SMSSpamCollection', delimiter='\t', header=None)
X, y = df[1], df[0]
X_train, X_test, y_train, y_test = train_test_split(X, y)
lb = LabelBinarizer()
y_train2 = lb.fit_transform(y_train).ravel()
y_test2 = lb.transform(y_test).ravel()
grid_search.fit(X_train, y_train2)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
predictions = grid_search.predict(X_test)
print('Accuracy:', accuracy_score(y_test2, predictions))
print('Precision:', precision_score(y_test2, predictions))
print('Recall:', recall_score(y_test2, predictions))
Fitting 3 folds for each of 96 candidates, totalling 288 fits
[Parallel(n_jobs=-1)]: Done   1 jobs | elapsed: 0.2s
[Parallel(n_jobs=-1)]: Done  50 jobs | elapsed: 2.6s
[Parallel(n_jobs=-1)]: Done 200 jobs | elapsed: 10.5s
[Parallel(n_jobs=-1)]: Done 274 out of 288 | elapsed: 14.3s remaining: 0.7s
[Parallel(n_jobs=-1)]: Done 288 out of 288 | elapsed: 14.8s finished
Best score: 0.966
Best parameters set:
    clf__penalty: 'l2'
    vect__max_df: 0.25
    vect__max_features: 2500
    vect__ngram_range: (1, 2)
    vect__stop_words: None
Accuracy: 0.977027997128
Precision: 1.0
Recall: 0.835051546392
import pandas as pd
df = pd.read_csv('movie-reviews/train.tsv', header=0, delimiter='\t')
print(df.count())

PhraseId      156060
SentenceId    156060
Phrase        156060
Sentiment     156060
dtype: int64

print(df.head())

   PhraseId  SentenceId                                             Phrase  Sentiment
0         1           1  A series of escapades demonstrating the adage ...          1
1         2           1  A series of escapades demonstrating the adage ...          2
2         3           1                                           A series          2
3         4           1                                                  A          2
4         5           1                                             series          2

print(df['Phrase'].head(10))

0    A series of escapades demonstrating the adage ...
1    A series of escapades demonstrating the adage ...
2                                             A series
3                                                    A
4                                               series
5    of escapades demonstrating the adage that what...
6                                                   of
7    escapades demonstrating the adage that what is...
8                                            escapades
9    demonstrating the adage that what is good for ...
Name: Phrase, dtype: object

print(df['Sentiment'].describe())

count    156060.000000
mean          2.063578
std           0.893832
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: Sentiment, dtype: float64

print(df['Sentiment'].value_counts())

2    79582
3    32927
1    27273
4     9206
0     7072
dtype: int64

print(df['Sentiment'].value_counts() / df['Sentiment'].count())

2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
dtype: float64

The most common class, 2 (neutral), accounts for about 51 percent of the instances, so a useful classifier must beat that baseline accuracy.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('vect', TfidfVectorizer(stop_words='english')),
('clf', LogisticRegression())
])
parameters = {
'vect__max_df': (0.25, 0.5),
'vect__ngram_range': ((1, 1), (1, 2)),
'vect__use_idf': (True, False),
'clf__C': (0.1, 1, 10),
}
df = pd.read_csv('movie-reviews/train.tsv', header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
Fitting 3 folds for each of 24 candidates, totalling 72 fits
[Parallel(n_jobs=3)]: Done  1 jobs | elapsed: 2.2s
[Parallel(n_jobs=3)]: Done 50 jobs | elapsed: 51.0s
[Parallel(n_jobs=3)]: Done 68 out of 72 | elapsed: 1.3min remaining: 4.8s
[Parallel(n_jobs=3)]: Done 72 out of 72 | elapsed: 1.5min finished
Best score: 0.624
Best parameters set:
    clf__C: 10
    vect__max_df: 0.25
    vect__ngram_range: (1, 2)
    vect__use_idf: False
predictions = grid_search.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
print('Confusion Matrix:', confusion_matrix(y_test, predictions))
print('Classification Report:', classification_report(y_test, predictions))
print(df['Sentiment'].values)
Accuracy: 0.639305395361
Confusion Matrix:
[[ 1134  1693   676    55     2]
 [  907  6020  6019   558    20]
 [  223  3147 32775  3545   173]
 [   30   419  6350  8267  1373]
 [    4    35   464  2452  1689]]
Classification Report:
             precision    recall  f1-score   support

          0       0.49      0.32      0.39      3560
          1       0.53      0.45      0.48     13524
          2       0.71      0.82      0.76     39863
          3       0.56      0.50      0.53     16439
          4       0.52      0.36      0.43      4644

avg / total       0.62      0.64      0.63     78030

[1 2 2 ... 3 2 2]
Two common performance measures for multi-label classification are:

- Hamming loss: the average fraction of individual labels that are incorrectly predicted.
- Jaccard similarity (or the Jaccard index): the size of the intersection of the predicted and true label sets divided by the size of their union.
import numpy as np
from sklearn.metrics import hamming_loss
print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]),
                   np.array([[0.0, 1.0], [1.0, 1.0]])))
0.0
print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]])))
0.25
print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]])))
0.5
print(hamming_loss(np.array([[0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]), np.array([[1.0, 1.0, 1.0], [0.0, 1.0, 1.0]])))
0.333333333333
from sklearn.metrics import jaccard_score
# jaccard_score replaces the older jaccard_similarity_score; average='samples'
# computes the Jaccard index per sample and averages it, matching the old behavior
print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[0.0, 1.0], [1.0, 1.0]]), average='samples'))
1.0
print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]]), average='samples'))
0.75
print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]]), average='samples'))
0.5
print(jaccard_score(np.array([[0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]), np.array([[1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]), average='samples'))
0.666666666667