Let's use np.random.normal to draw random samples from a normal (Gaussian) distribution.
np.random.normal takes three arguments: the mean, the standard deviation, and the number of samples to draw:
mu, sigma = 0, 0.1
s = np.random.normal(mu, sigma, 10000)
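As a quick sanity check, the statistics of the sample should land close to the parameters passed in. The sketch below assumes NumPy's newer `default_rng` interface and an arbitrary seed so the numbers are reproducible; the variable names are illustrative only.

```python
import numpy as np

mu, sigma = 0, 0.1
rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility
sample = rng.normal(loc=mu, scale=sigma, size=10000)

# With 10,000 draws, the empirical mean and standard deviation
# should sit close to the requested mu and sigma.
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)
print(sample_mean, sample_std)
```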
We can compare the efficiency of NumPy's sort with that of Python's built-in one!
cpython_s = list(s)
%timeit sorted(cpython_s)
100 loops, best of 3: 5.1 ms per loop
%timeit s.sort()
1000 loops, best of 3: 182 µs per loop
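One caveat about the measurement above: `s.sort()` sorts in place, so every loop after the first one receives an already sorted array. A possible fairer comparison, sketched here with the standard `timeit` module and `np.sort` (which returns a sorted copy), could look like this. The seed and repetition counts are arbitrary choices, and timings will differ from machine to machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
s = rng.normal(0, 0.1, 10000)
cpython_s = list(s)

# np.sort returns a sorted copy, so every repetition starts from unsorted data.
numpy_time = min(timeit.repeat(lambda: np.sort(s), number=20, repeat=3)) / 20
python_time = min(timeit.repeat(lambda: sorted(cpython_s), number=20, repeat=3)) / 20
print("np.sort: %.0f us, sorted(): %.0f us" % (numpy_time * 1e6, python_time * 1e6))

# Both approaches must agree on the result.
same_result = np.array_equal(np.sort(s), np.array(sorted(cpython_s)))
```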
Now a fancy histogram!
count, bins, _ = hist(s, bins=30, density=True)  # `density` replaces the `normed` argument removed from recent Matplotlib
Now, let's draw the probability density function on top of the histogram. The probability density can be expressed as: $$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
So let's define the corresponding Python function:
def probability_density(bins, mu, sigma):
    return 1/(sigma * np.sqrt(2 * np.pi)) * np.exp( - (bins - mu)**2 / (2 * sigma**2))
count, bins, _ = hist(s, bins=30, density=True)  # `density` replaces the `normed` argument removed from recent Matplotlib
plot(bins, probability_density(bins, mu, sigma), "r")
[<matplotlib.lines.Line2D at 0x10f41ab50>]
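A density function should integrate to 1, which gives a simple way to check the formula. The sketch below re-declares the function so it runs on its own, and approximates the integral with a plain Riemann sum over ±6 standard deviations; the grid size is an arbitrary choice.

```python
import numpy as np

def probability_density(x, mu, sigma):
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

mu, sigma = 0, 0.1
# Evaluate on a fine grid spanning +/- 6 standard deviations.
x = np.linspace(mu - 6 * sigma, mu + 6 * sigma, 2001)
dx = x[1] - x[0]
area = np.sum(probability_density(x, mu, sigma)) * dx  # simple Riemann sum
peak = probability_density(mu, mu, sigma)              # maximum, reached at x = mu
print(area, peak)
```

The peak value equals 1 / (sigma * sqrt(2 * pi)), which is why a narrow distribution (small sigma) produces a tall curve.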
from scipy.misc import lena
from scipy.ndimage.filters import sobel
imshow(lena(), cmap=cm.gray)
<matplotlib.image.AxesImage at 0x110e73110>
Let's apply a basic Sobel filter to Lena. (This filter emphasizes the edges and transitions in the previous image.)
imshow(sobel(lena()), cmap=cm.gray)
<matplotlib.image.AxesImage at 0x10f35edd0>
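To see why the filter highlights edges, here is a hand-written NumPy sketch of the horizontal Sobel kernel applied to a synthetic image with a single vertical edge. This is not the SciPy implementation: the helper name `sobel_horizontal` is hypothetical, and it only computes interior pixels instead of handling the borders.

```python
import numpy as np

def sobel_horizontal(image):
    """Horizontal Sobel response for the interior pixels of a 2-D array."""
    # Smooth with weights [1, 2, 1] across rows...
    smoothed = image[:-2, :] + 2 * image[1:-1, :] + image[2:, :]
    # ...then take the difference [-1, 0, 1] across columns.
    return smoothed[:, 2:] - smoothed[:, :-2]

# Synthetic 8x8 image: dark left half, bright right half.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

edges = sobel_horizontal(image)
print(edges)
```

The response is zero wherever the image is flat and non-zero only in the two columns that straddle the jump from 0 to 1.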
This example is taken from the scikit-learn documentation: http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html
from sklearn import linear_model, datasets
iris = datasets.load_iris()
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.decomposition import PCA
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
plt.figure(2)
plt.clf()
# Plot the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
# To get a better understanding of the interaction of the dimensions
# plot the first three PCA dimensions
fig = plt.figure(1)
ax = Axes3D(fig, elev=-150, azim=110)
X_reduced = PCA(n_components=3).fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y,
           cmap=plt.cm.Paired)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.w_xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.w_yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.w_zaxis.set_ticklabels([])
plt.show()
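The PCA projection used above can be illustrated with plain NumPy: center the data, take a singular value decomposition, and project onto the leading directions. The sketch below uses synthetic data as a stand-in for `iris.data` (an assumption, to keep it self-contained); the result should match a library PCA only up to the sign of each axis.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed; synthetic stand-in for iris.data
data = rng.normal(size=(150, 4)) * np.array([3.0, 1.0, 0.5, 0.1])

centered = data - data.mean(axis=0)
# Rows of vt are the principal directions, ordered by decreasing variance.
u, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:3].T  # project onto the first three directions

# Share of the total variance carried by each direction.
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
print(reduced.shape, explained_ratio)
```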
iris
{'DESCR': 'Iris Plants Database\n\nNotes\n-----\nData Set Characteristics:\n :Number of Instances: 150 (50 in each of three classes)\n :Number of Attributes: 4 numeric, predictive attributes and the class\n :Attribute Information:\n - sepal length in cm\n - sepal width in cm\n - petal length in cm\n - petal width in cm\n - class:\n - Iris-Setosa\n - Iris-Versicolour\n - Iris-Virginica\n :Summary Statistics:\n ============== ==== ==== ======= ===== ====================\n Min Max Mean SD Class Correlation\n ============== ==== ==== ======= ===== ====================\n sepal length: 4.3 7.9 5.84 0.83 0.7826\n sepal width: 2.0 4.4 3.05 0.43 -0.4194\n petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)\n petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)\n ============== ==== ==== ======= ===== ====================\n :Missing Attribute Values: None\n :Class Distribution: 33.3% for each of 3 classes.\n :Creator: R.A. Fisher\n :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n :Date: July, 1988\n\nThis is a copy of UCI ML iris datasets.\nhttp://archive.ics.uci.edu/ml/datasets/Iris\n\nThe famous Iris database, first used by Sir R.A Fisher\n\nThis is perhaps the best known database to be found in the\npattern recognition literature. Fisher\'s paper is a classic in the field and\nis referenced frequently to this day. (See Duda & Hart, for example.) The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant. One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\nReferences\n----------\n - Fisher,R.A. "The use of multiple measurements in taxonomic problems"\n Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n Mathematical Statistics" (John Wiley, NY, 1950).\n - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.\n (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.\n - Dasarathy, B.V. 
(1980) "Nosing Around the Neighborhood: A New System\n Structure and Classification Rule for Recognition in Partially Exposed\n Environments". IEEE Transactions on Pattern Analysis and Machine\n Intelligence, Vol. PAMI-2, No. 1, 67-71.\n - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions\n on Information Theory, May 1972, 431-433.\n - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II\n conceptual clustering system finds 3 classes in the data.\n - Many, many more ...\n', 'data': array([[ 5.1, 3.5, 1.4, 0.2], [ 4.9, 3. , 1.4, 0.2], [ 4.7, 3.2, 1.3, 0.2], [ 4.6, 3.1, 1.5, 0.2], [ 5. , 3.6, 1.4, 0.2], [ 5.4, 3.9, 1.7, 0.4], [ 4.6, 3.4, 1.4, 0.3], [ 5. , 3.4, 1.5, 0.2], [ 4.4, 2.9, 1.4, 0.2], [ 4.9, 3.1, 1.5, 0.1], [ 5.4, 3.7, 1.5, 0.2], [ 4.8, 3.4, 1.6, 0.2], [ 4.8, 3. , 1.4, 0.1], [ 4.3, 3. , 1.1, 0.1], [ 5.8, 4. , 1.2, 0.2], [ 5.7, 4.4, 1.5, 0.4], [ 5.4, 3.9, 1.3, 0.4], [ 5.1, 3.5, 1.4, 0.3], [ 5.7, 3.8, 1.7, 0.3], [ 5.1, 3.8, 1.5, 0.3], [ 5.4, 3.4, 1.7, 0.2], [ 5.1, 3.7, 1.5, 0.4], [ 4.6, 3.6, 1. , 0.2], [ 5.1, 3.3, 1.7, 0.5], [ 4.8, 3.4, 1.9, 0.2], [ 5. , 3. , 1.6, 0.2], [ 5. , 3.4, 1.6, 0.4], [ 5.2, 3.5, 1.5, 0.2], [ 5.2, 3.4, 1.4, 0.2], [ 4.7, 3.2, 1.6, 0.2], [ 4.8, 3.1, 1.6, 0.2], [ 5.4, 3.4, 1.5, 0.4], [ 5.2, 4.1, 1.5, 0.1], [ 5.5, 4.2, 1.4, 0.2], [ 4.9, 3.1, 1.5, 0.1], [ 5. , 3.2, 1.2, 0.2], [ 5.5, 3.5, 1.3, 0.2], [ 4.9, 3.1, 1.5, 0.1], [ 4.4, 3. , 1.3, 0.2], [ 5.1, 3.4, 1.5, 0.2], [ 5. , 3.5, 1.3, 0.3], [ 4.5, 2.3, 1.3, 0.3], [ 4.4, 3.2, 1.3, 0.2], [ 5. , 3.5, 1.6, 0.6], [ 5.1, 3.8, 1.9, 0.4], [ 4.8, 3. , 1.4, 0.3], [ 5.1, 3.8, 1.6, 0.2], [ 4.6, 3.2, 1.4, 0.2], [ 5.3, 3.7, 1.5, 0.2], [ 5. , 3.3, 1.4, 0.2], [ 7. , 3.2, 4.7, 1.4], [ 6.4, 3.2, 4.5, 1.5], [ 6.9, 3.1, 4.9, 1.5], [ 5.5, 2.3, 4. , 1.3], [ 6.5, 2.8, 4.6, 1.5], [ 5.7, 2.8, 4.5, 1.3], [ 6.3, 3.3, 4.7, 1.6], [ 4.9, 2.4, 3.3, 1. ], [ 6.6, 2.9, 4.6, 1.3], [ 5.2, 2.7, 3.9, 1.4], [ 5. , 2. , 3.5, 1. ], [ 5.9, 3. , 4.2, 1.5], [ 6. , 2.2, 4. , 1. 
], [ 6.1, 2.9, 4.7, 1.4], [ 5.6, 2.9, 3.6, 1.3], [ 6.7, 3.1, 4.4, 1.4], [ 5.6, 3. , 4.5, 1.5], [ 5.8, 2.7, 4.1, 1. ], [ 6.2, 2.2, 4.5, 1.5], [ 5.6, 2.5, 3.9, 1.1], [ 5.9, 3.2, 4.8, 1.8], [ 6.1, 2.8, 4. , 1.3], [ 6.3, 2.5, 4.9, 1.5], [ 6.1, 2.8, 4.7, 1.2], [ 6.4, 2.9, 4.3, 1.3], [ 6.6, 3. , 4.4, 1.4], [ 6.8, 2.8, 4.8, 1.4], [ 6.7, 3. , 5. , 1.7], [ 6. , 2.9, 4.5, 1.5], [ 5.7, 2.6, 3.5, 1. ], [ 5.5, 2.4, 3.8, 1.1], [ 5.5, 2.4, 3.7, 1. ], [ 5.8, 2.7, 3.9, 1.2], [ 6. , 2.7, 5.1, 1.6], [ 5.4, 3. , 4.5, 1.5], [ 6. , 3.4, 4.5, 1.6], [ 6.7, 3.1, 4.7, 1.5], [ 6.3, 2.3, 4.4, 1.3], [ 5.6, 3. , 4.1, 1.3], [ 5.5, 2.5, 4. , 1.3], [ 5.5, 2.6, 4.4, 1.2], [ 6.1, 3. , 4.6, 1.4], [ 5.8, 2.6, 4. , 1.2], [ 5. , 2.3, 3.3, 1. ], [ 5.6, 2.7, 4.2, 1.3], [ 5.7, 3. , 4.2, 1.2], [ 5.7, 2.9, 4.2, 1.3], [ 6.2, 2.9, 4.3, 1.3], [ 5.1, 2.5, 3. , 1.1], [ 5.7, 2.8, 4.1, 1.3], [ 6.3, 3.3, 6. , 2.5], [ 5.8, 2.7, 5.1, 1.9], [ 7.1, 3. , 5.9, 2.1], [ 6.3, 2.9, 5.6, 1.8], [ 6.5, 3. , 5.8, 2.2], [ 7.6, 3. , 6.6, 2.1], [ 4.9, 2.5, 4.5, 1.7], [ 7.3, 2.9, 6.3, 1.8], [ 6.7, 2.5, 5.8, 1.8], [ 7.2, 3.6, 6.1, 2.5], [ 6.5, 3.2, 5.1, 2. ], [ 6.4, 2.7, 5.3, 1.9], [ 6.8, 3. , 5.5, 2.1], [ 5.7, 2.5, 5. , 2. ], [ 5.8, 2.8, 5.1, 2.4], [ 6.4, 3.2, 5.3, 2.3], [ 6.5, 3. , 5.5, 1.8], [ 7.7, 3.8, 6.7, 2.2], [ 7.7, 2.6, 6.9, 2.3], [ 6. , 2.2, 5. , 1.5], [ 6.9, 3.2, 5.7, 2.3], [ 5.6, 2.8, 4.9, 2. ], [ 7.7, 2.8, 6.7, 2. ], [ 6.3, 2.7, 4.9, 1.8], [ 6.7, 3.3, 5.7, 2.1], [ 7.2, 3.2, 6. , 1.8], [ 6.2, 2.8, 4.8, 1.8], [ 6.1, 3. , 4.9, 1.8], [ 6.4, 2.8, 5.6, 2.1], [ 7.2, 3. , 5.8, 1.6], [ 7.4, 2.8, 6.1, 1.9], [ 7.9, 3.8, 6.4, 2. ], [ 6.4, 2.8, 5.6, 2.2], [ 6.3, 2.8, 5.1, 1.5], [ 6.1, 2.6, 5.6, 1.4], [ 7.7, 3. , 6.1, 2.3], [ 6.3, 3.4, 5.6, 2.4], [ 6.4, 3.1, 5.5, 1.8], [ 6. , 3. , 4.8, 1.8], [ 6.9, 3.1, 5.4, 2.1], [ 6.7, 3.1, 5.6, 2.4], [ 6.9, 3.1, 5.1, 2.3], [ 5.8, 2.7, 5.1, 1.9], [ 6.8, 3.2, 5.9, 2.3], [ 6.7, 3.3, 5.7, 2.5], [ 6.7, 3. , 5.2, 2.3], [ 6.3, 2.5, 5. , 1.9], [ 6.5, 3. , 5.2, 2. ], [ 6.2, 3.4, 5.4, 2.3], [ 5.9, 3. 
, 5.1, 1.8]]), 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='|S10')}
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
iris_full = np.c_[iris.data, iris.target]
iris_train, iris_test = train_test_split(iris_full, test_size=10)
len(iris_train), len(iris_test)
(140, 10)
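As an alternative to stacking features and labels into one array, `train_test_split` also accepts several arrays and splits them consistently. A minimal sketch, assuming a recent scikit-learn (the `sklearn.model_selection` module) and an arbitrary `random_state` for reproducibility:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Passing X and y separately keeps features and labels aligned
# without concatenating them first.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=10, random_state=0  # arbitrary seed
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

This also keeps the labels as integers, whereas stacking them next to the float features converts them to floats.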
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(iris_train[:,:-1], iris_train[:,-1])
LogisticRegression(C=100000.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
logreg.predict(iris_test[:, :-1])
array([ 2., 2., 2., 2., 2., 2., 0., 1., 0., 0.])
iris_test[:, -1]
array([ 2., 2., 2., 2., 2., 2., 0., 2., 0., 0.])
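Comparing the two arrays by eye works for ten samples, but the same check can be computed. The sketch below copies the predicted and actual labels from the outputs above and derives the accuracy and the position of the misclassified sample.

```python
import numpy as np

predicted = np.array([2., 2., 2., 2., 2., 2., 0., 1., 0., 0.])
actual = np.array([2., 2., 2., 2., 2., 2., 0., 2., 0., 0.])

accuracy = np.mean(predicted == actual)      # fraction of matching labels
wrong = np.flatnonzero(predicted != actual)  # positions of the mismatches
print(accuracy, wrong)  # 0.9 [7]
```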
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
h = .02 # step size in the mesh
logreg = linear_model.LogisticRegression(C=1e5)
# we created an instance of the logistic regression classifier; now fit the data.
logreg.fit(X, Y)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()
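The `meshgrid` / `ravel` / `reshape` pattern used for the decision boundary is easier to follow on a tiny grid. In this sketch the "classifier" is a stand-in that simply labels each point by its x coordinate (an assumption for illustration, not a real model).

```python
import numpy as np

xx, yy = np.meshgrid(np.arange(0, 3), np.arange(0, 2))
grid_points = np.c_[xx.ravel(), yy.ravel()]  # one (x, y) row per mesh node

# Stand-in for logreg.predict: label each point by its x coordinate.
labels = grid_points[:, 0]
Z = labels.reshape(xx.shape)  # back to the 2-D layout expected by pcolormesh
print(grid_points)
print(Z)
```

Flattening turns the grid into the `(n_samples, n_features)` layout a classifier expects, and reshaping restores the grid so each prediction lines up with its mesh node.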
Some pandas possibilities, based on the MovieLens dataset (http://www.grouplens.org/datasets/movielens/).
See Greg Reda's article for deeper explanations (http://www.gregreda.com/2013/10/26/using-pandas-on-the-movielens-dataset/).
import pandas as pd
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols)
users
| | user_id | age | sex | occupation | zip_code |
---|---|---|---|---|---|
0 | 1 | 24 | M | technician | 85711 |
1 | 2 | 53 | F | other | 94043 |
2 | 3 | 23 | M | writer | 32067 |
3 | 4 | 24 | M | technician | 43537 |
4 | 5 | 33 | F | other | 15213 |
5 | 6 | 42 | M | executive | 98101 |
6 | 7 | 57 | M | administrator | 91344 |
7 | 8 | 36 | M | administrator | 05201 |
8 | 9 | 29 | M | student | 01002 |
9 | 10 | 53 | M | lawyer | 90703 |
10 | 11 | 39 | F | other | 30329 |
11 | 12 | 28 | F | other | 06405 |
12 | 13 | 47 | M | educator | 29206 |
13 | 14 | 45 | M | scientist | 55106 |
14 | 15 | 49 | F | educator | 97301 |
15 | 16 | 21 | M | entertainment | 10309 |
16 | 17 | 30 | M | programmer | 06355 |
17 | 18 | 35 | F | other | 37212 |
18 | 19 | 40 | M | librarian | 02138 |
19 | 20 | 42 | F | homemaker | 95660 |
20 | 21 | 26 | M | writer | 30068 |
21 | 22 | 25 | M | writer | 40206 |
22 | 23 | 30 | F | artist | 48197 |
23 | 24 | 21 | F | artist | 94533 |
24 | 25 | 39 | M | engineer | 55107 |
25 | 26 | 49 | M | engineer | 21044 |
26 | 27 | 40 | F | librarian | 30030 |
27 | 28 | 32 | M | writer | 55369 |
28 | 29 | 41 | M | programmer | 94043 |
29 | 30 | 7 | M | student | 55436 |
... | ... | ... | ... | ... | ... |
913 | 914 | 44 | F | other | 08105 |
914 | 915 | 50 | M | entertainment | 60614 |
915 | 916 | 27 | M | engineer | N2L5N |
916 | 917 | 22 | F | student | 20006 |
917 | 918 | 40 | M | scientist | 70116 |
918 | 919 | 25 | M | other | 14216 |
919 | 920 | 30 | F | artist | 90008 |
920 | 921 | 20 | F | student | 98801 |
921 | 922 | 29 | F | administrator | 21114 |
922 | 923 | 21 | M | student | E2E3R |
923 | 924 | 29 | M | other | 11753 |
924 | 925 | 18 | F | salesman | 49036 |
925 | 926 | 49 | M | entertainment | 01701 |
926 | 927 | 23 | M | programmer | 55428 |
927 | 928 | 21 | M | student | 55408 |
928 | 929 | 44 | M | scientist | 53711 |
929 | 930 | 28 | F | scientist | 07310 |
930 | 931 | 60 | M | educator | 33556 |
931 | 932 | 58 | M | educator | 06437 |
932 | 933 | 28 | M | student | 48105 |
933 | 934 | 61 | M | engineer | 22902 |
934 | 935 | 42 | M | doctor | 66221 |
935 | 936 | 24 | M | other | 32789 |
936 | 937 | 48 | M | educator | 98072 |
937 | 938 | 38 | F | technician | 55038 |
938 | 939 | 26 | F | student | 33319 |
939 | 940 | 32 | M | administrator | 02215 |
940 | 941 | 20 | M | student | 97229 |
941 | 942 | 48 | F | librarian | 78209 |
942 | 943 | 22 | M | student | 77841 |
943 rows × 5 columns
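The same `read_csv` call can be tried without downloading the dataset by reading from an in-memory string. The three rows below are copied from the table above. The explicit `dtype` for `zip_code` is an assumption added here: in a small, all-numeric sample pandas would otherwise parse the column as integers and drop the leading zero of `05201`.

```python
import io
import pandas as pd

raw = (
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "8|36|M|administrator|05201\n"
)
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

# names= supplies the header because the file itself has none.
users_demo = pd.read_csv(io.StringIO(raw), sep='|', names=u_cols,
                         dtype={'zip_code': str})
print(users_demo)
```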
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols)
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(5))
movies
| | movie_id | title | release_date | video_release_date | imdb_url |
---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Toy%20Story%2... |
1 | 2 | GoldenEye (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?GoldenEye%20(... |
2 | 3 | Four Rooms (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Four%20Rooms%... |
3 | 4 | Get Shorty (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Get%20Shorty%... |
4 | 5 | Copycat (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Copycat%20(1995) |
5 | 6 | Shanghai Triad (Yao a yao yao dao waipo qiao) ... | 01-Jan-1995 | NaN | http://us.imdb.com/Title?Yao+a+yao+yao+dao+wai... |
6 | 7 | Twelve Monkeys (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Twelve%20Monk... |
7 | 8 | Babe (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Babe%20(1995) |
8 | 9 | Dead Man Walking (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Dead%20Man%20... |
9 | 10 | Richard III (1995) | 22-Jan-1996 | NaN | http://us.imdb.com/M/title-exact?Richard%20III... |
10 | 11 | Seven (Se7en) (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Se7en%20(1995) |
11 | 12 | Usual Suspects, The (1995) | 14-Aug-1995 | NaN | http://us.imdb.com/M/title-exact?Usual%20Suspe... |
12 | 13 | Mighty Aphrodite (1995) | 30-Oct-1995 | NaN | http://us.imdb.com/M/title-exact?Mighty%20Aphr... |
13 | 14 | Postino, Il (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Postino,%20Il... |
14 | 15 | Mr. Holland's Opus (1995) | 29-Jan-1996 | NaN | http://us.imdb.com/M/title-exact?Mr.%20Holland... |
15 | 16 | French Twist (Gazon maudit) (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Gazon%20maudi... |
16 | 17 | From Dusk Till Dawn (1996) | 05-Feb-1996 | NaN | http://us.imdb.com/M/title-exact?From%20Dusk%2... |
17 | 18 | White Balloon, The (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Badkonake%20S... |
18 | 19 | Antonia's Line (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Antonia%20(1995) |
19 | 20 | Angels and Insects (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Angels%20and%... |
20 | 21 | Muppet Treasure Island (1996) | 16-Feb-1996 | NaN | http://us.imdb.com/M/title-exact?Muppet%20Trea... |
21 | 22 | Braveheart (1995) | 16-Feb-1996 | NaN | http://us.imdb.com/M/title-exact?Braveheart%20... |
22 | 23 | Taxi Driver (1976) | 16-Feb-1996 | NaN | http://us.imdb.com/M/title-exact?Taxi%20Driver... |
23 | 24 | Rumble in the Bronx (1995) | 23-Feb-1996 | NaN | http://us.imdb.com/M/title-exact?Hong%20Faan%2... |
24 | 25 | Birdcage, The (1996) | 08-Mar-1996 | NaN | http://us.imdb.com/M/title-exact?Birdcage,%20T... |
25 | 26 | Brothers McMullen, The (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Brothers%20Mc... |
26 | 27 | Bad Boys (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Bad%20Boys%20... |
27 | 28 | Apollo 13 (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Apollo%2013%2... |
28 | 29 | Batman Forever (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Batman%20Fore... |
29 | 30 | Belle de jour (1967) | 01-Jan-1967 | NaN | http://us.imdb.com/M/title-exact?Belle%20de%20... |
... | ... | ... | ... | ... | ... |
1652 | 1653 | Entertaining Angels: The Dorothy Day Story (1996) | 27-Sep-1996 | NaN | http://us.imdb.com/M/title-exact?Entertaining%... |
1653 | 1654 | Chairman of the Board (1998) | 01-Jan-1998 | NaN | http://us.imdb.com/Title?Chairman+of+the+Board... |
1654 | 1655 | Favor, The (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Favor,%20The%... |
1655 | 1656 | Little City (1998) | 20-Feb-1998 | NaN | http://us.imdb.com/M/title-exact?Little+City+(... |
1656 | 1657 | Target (1995) | 28-Feb-1996 | NaN | http://us.imdb.com/M/title-exact?Target%20(1995) |
1657 | 1658 | Substance of Fire, The (1996) | 06-Dec-1996 | NaN | http://us.imdb.com/M/title-exact?Substance%20o... |
1658 | 1659 | Getting Away With Murder (1996) | 12-Apr-1996 | NaN | http://us.imdb.com/Title?Getting+Away+With+Mur... |
1659 | 1660 | Small Faces (1995) | 09-Aug-1996 | NaN | http://us.imdb.com/M/title-exact?Small%20Faces... |
1660 | 1661 | New Age, The (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?New%20Age,%20... |
1661 | 1662 | Rough Magic (1995) | 30-May-1997 | NaN | http://us.imdb.com/M/title-exact?Rough%20Magic... |
1662 | 1663 | Nothing Personal (1995) | 30-Apr-1997 | NaN | http://us.imdb.com/M/title-exact?Nothing%20Per... |
1663 | 1664 | 8 Heads in a Duffel Bag (1997) | 18-Apr-1997 | NaN | http://us.imdb.com/Title?8+Heads+in+a+Duffel+B... |
1664 | 1665 | Brother's Kiss, A (1997) | 25-Apr-1997 | NaN | http://us.imdb.com/M/title-exact?Brother%27s%2... |
1665 | 1666 | Ripe (1996) | 02-May-1997 | NaN | http://us.imdb.com/M/title-exact?Ripe%20%28199... |
1666 | 1667 | Next Step, The (1995) | 13-Jun-1997 | NaN | http://us.imdb.com/M/title-exact?Next%20Step%2... |
1667 | 1668 | Wedding Bell Blues (1996) | 13-Jun-1997 | NaN | http://us.imdb.com/M/title-exact?Wedding%20Bel... |
1668 | 1669 | MURDER and murder (1996) | 20-Jun-1997 | NaN | http://us.imdb.com/M/title-exact?MURDER+and+mu... |
1669 | 1670 | Tainted (1998) | 01-Feb-1998 | NaN | http://us.imdb.com/M/title-exact?Tainted+(1998) |
1670 | 1671 | Further Gesture, A (1996) | 20-Feb-1998 | NaN | http://us.imdb.com/M/title-exact?Further+Gestu... |
1671 | 1672 | Kika (1993) | 01-Jan-1993 | NaN | http://us.imdb.com/M/title-exact?Kika%20(1993) |
1672 | 1673 | Mirage (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Mirage%20(1995) |
1673 | 1674 | Mamma Roma (1962) | 01-Jan-1962 | NaN | http://us.imdb.com/M/title-exact?Mamma%20Roma%... |
1674 | 1675 | Sunchaser, The (1996) | 25-Oct-1996 | NaN | http://us.imdb.com/M/title-exact?Sunchaser,%20... |
1675 | 1676 | War at Home, The (1996) | 01-Jan-1996 | NaN | http://us.imdb.com/M/title-exact?War%20at%20Ho... |
1676 | 1677 | Sweet Nothing (1995) | 20-Sep-1996 | NaN | http://us.imdb.com/M/title-exact?Sweet%20Nothi... |
1677 | 1678 | Mat' i syn (1997) | 06-Feb-1998 | NaN | http://us.imdb.com/M/title-exact?Mat%27+i+syn+... |
1678 | 1679 | B. Monkey (1998) | 06-Feb-1998 | NaN | http://us.imdb.com/M/title-exact?B%2E+Monkey+(... |
1679 | 1680 | Sliding Doors (1998) | 01-Jan-1998 | NaN | http://us.imdb.com/Title?Sliding+Doors+(1998) |
1680 | 1681 | You So Crazy (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?You%20So%20Cr... |
1681 | 1682 | Scream of Stone (Schrei aus Stein) (1991) | 08-Mar-1996 | NaN | http://us.imdb.com/M/title-exact?Schrei%20aus%... |
1682 rows × 5 columns
ratings
| | user_id | movie_id | rating | unix_timestamp |
---|---|---|---|---|
0 | 196 | 242 | 3 | 881250949 |
1 | 186 | 302 | 3 | 891717742 |
2 | 22 | 377 | 1 | 878887116 |
3 | 244 | 51 | 2 | 880606923 |
4 | 166 | 346 | 1 | 886397596 |
5 | 298 | 474 | 4 | 884182806 |
6 | 115 | 265 | 2 | 881171488 |
7 | 253 | 465 | 5 | 891628467 |
8 | 305 | 451 | 3 | 886324817 |
9 | 6 | 86 | 3 | 883603013 |
10 | 62 | 257 | 2 | 879372434 |
11 | 286 | 1014 | 5 | 879781125 |
12 | 200 | 222 | 5 | 876042340 |
13 | 210 | 40 | 3 | 891035994 |
14 | 224 | 29 | 3 | 888104457 |
15 | 303 | 785 | 3 | 879485318 |
16 | 122 | 387 | 5 | 879270459 |
17 | 194 | 274 | 2 | 879539794 |
18 | 291 | 1042 | 4 | 874834944 |
19 | 234 | 1184 | 2 | 892079237 |
20 | 119 | 392 | 4 | 886176814 |
21 | 167 | 486 | 4 | 892738452 |
22 | 299 | 144 | 4 | 877881320 |
23 | 291 | 118 | 2 | 874833878 |
24 | 308 | 1 | 4 | 887736532 |
25 | 95 | 546 | 2 | 879196566 |
26 | 38 | 95 | 5 | 892430094 |
27 | 102 | 768 | 2 | 883748450 |
28 | 63 | 277 | 4 | 875747401 |
29 | 160 | 234 | 5 | 876861185 |
... | ... | ... | ... | ... |
99970 | 449 | 120 | 1 | 879959573 |
99971 | 661 | 762 | 2 | 876037121 |
99972 | 721 | 874 | 3 | 877137447 |
99973 | 821 | 151 | 4 | 874792889 |
99974 | 764 | 596 | 3 | 876243046 |
99975 | 537 | 443 | 3 | 886031752 |
99976 | 618 | 628 | 2 | 891308019 |
99977 | 487 | 291 | 3 | 883445079 |
99978 | 113 | 975 | 5 | 875936424 |
99979 | 943 | 391 | 2 | 888640291 |
99980 | 864 | 685 | 4 | 888891900 |
99981 | 750 | 323 | 3 | 879445877 |
99982 | 279 | 64 | 1 | 875308510 |
99983 | 646 | 750 | 3 | 888528902 |
99984 | 654 | 370 | 2 | 887863914 |
99985 | 617 | 582 | 4 | 883789294 |
99986 | 913 | 690 | 3 | 880824288 |
99987 | 660 | 229 | 2 | 891406212 |
99988 | 421 | 498 | 4 | 892241344 |
99989 | 495 | 1091 | 4 | 888637503 |
99990 | 806 | 421 | 4 | 882388897 |
99991 | 676 | 538 | 4 | 892685437 |
99992 | 721 | 262 | 3 | 877137285 |
99993 | 913 | 209 | 2 | 881367150 |
99994 | 378 | 78 | 3 | 880056976 |
99995 | 880 | 476 | 3 | 880175444 |
99996 | 716 | 204 | 5 | 879795543 |
99997 | 276 | 1090 | 1 | 874795795 |
99998 | 13 | 225 | 2 | 882399156 |
99999 | 12 | 203 | 3 | 879959583 |
100000 rows × 4 columns
movie_ratings = pd.merge(movies, ratings, on="movie_id") # if `on` is None, it can be inferred!
lens = pd.merge(movie_ratings, users)
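To see what the two merges do, here is a sketch on toy frames. The ratings and user attributes are made-up values for illustration, not taken from MovieLens. When `on` is omitted, `pd.merge` joins on the column names the two frames share.

```python
import pandas as pd

# Toy data, invented for illustration.
movies_demo = pd.DataFrame({'movie_id': [1, 2],
                            'title': ['Toy Story (1995)', 'GoldenEye (1995)']})
ratings_demo = pd.DataFrame({'user_id': [1, 1, 2],
                             'movie_id': [1, 2, 1],
                             'rating': [5, 3, 4]})
users_demo = pd.DataFrame({'user_id': [1, 2], 'sex': ['M', 'F']})

movie_ratings_demo = pd.merge(movies_demo, ratings_demo, on='movie_id')
lens_demo = pd.merge(movie_ratings_demo, users_demo)  # joins on the shared 'user_id'
print(lens_demo)
```

The result has one row per rating, enriched with the movie title and the user's attributes.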
lens.groupby('title').size().sort_values(ascending=False)  # Series.order() was removed from pandas; sort_values() replaces it
title
Star Wars (1977)                             583
Contact (1997)                               509
Fargo (1996)                                 508
Return of the Jedi (1983)                    507
Liar Liar (1997)                             485
English Patient, The (1996)                  481
Scream (1996)                                478
Toy Story (1995)                             452
Air Force One (1997)                         431
Independence Day (ID4) (1996)                429
Raiders of the Lost Ark (1981)               420
Godfather, The (1972)                        413
Pulp Fiction (1994)                          394
Twelve Monkeys (1995)                        392
Silence of the Lambs, The (1991)             390
...
Liebelei (1933)                                1
Bird of Prey (1996)                            1
Lotto Land (1995)                              1
Love Is All There Is (1996)                    1
Low Life, The (1994)                           1
Coldblooded (1995)                             1
MURDER and murder (1996)                       1
Big Bang Theory, The (1994)                    1
Mad Dog Time (1996)                            1
Mamma Roma (1962)                              1
Man from Down Under, The (1943)                1
Marlene Dietrich: Shadow and Light (1996)      1
Mat' i syn (1997)                              1
Mille bolle blu (1993)                         1
Á köldum klaka (Cold Fever) (1994)             1
Length: 1664, dtype: int64
movies_stats = lens.groupby('title').agg({'rating': [np.size, np.mean]}).sort_values([('rating', 'mean')], ascending=False)
movies_stats.head()
title | rating (size) | rating (mean) |
---|---|---|
They Made Me a Criminal (1939) | 1 | 5 |
Marlene Dietrich: Shadow and Light (1996) | 1 | 5 |
Saint of Fort Washington, The (1993) | 2 | 5 |
Someone Else's America (1995) | 1 | 5 |
Star Kid (1997) | 3 | 5 |
atleast_100 = movies_stats['rating']['size'] >= 100  # index the column explicitly: `.size` can resolve to the DataFrame attribute
movies_stats[atleast_100].head()
title | rating (size) | rating (mean) |
---|---|---|
Close Shave, A (1995) | 112 | 4.491071 |
Schindler's List (1993) | 298 | 4.466443 |
Wrong Trousers, The (1993) | 118 | 4.466102 |
Casablanca (1942) | 243 | 4.456790 |
Shawshank Redemption, The (1994) | 283 | 4.445230 |
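The reason for the minimum-count filter is that a single enthusiastic rating can push an obscure title to the top. The sketch below reproduces the idea on toy data (titles and ratings invented for illustration), using `'count'` instead of `np.size`, which gives the same number when no rating is missing.

```python
import pandas as pd

# Toy data, invented for illustration.
lens_demo = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [4, 5, 3, 5, 2, 4],
})

stats = lens_demo.groupby('title')['rating'].agg(['count', 'mean'])
# Keep only titles with at least 2 ratings, then rank by mean rating.
popular = stats[stats['count'] >= 2].sort_values('mean', ascending=False)
print(popular)
```

Title `B` has the highest mean but only one rating, so it drops out once the threshold is applied.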
users.age.hist(bins=30)
<matplotlib.axes._subplots.AxesSubplot at 0x10b86d910>
most_50 = lens.groupby('movie_id').size().sort_values(ascending=False)[:50]
# `rows`/`cols` were renamed to `index`/`columns` in later pandas releases
pivoted = lens.pivot_table(index=['movie_id', 'title'],
                           columns=['sex'],
                           values='rating',
                           fill_value=0)
pivoted.head()
movie_id | title | F | M |
---|---|---|---|
1 | Toy Story (1995) | 3.789916 | 3.909910 |
2 | GoldenEye (1995) | 3.368421 | 3.178571 |
3 | Four Rooms (1995) | 2.687500 | 3.108108 |
4 | Get Shorty (1995) | 3.400000 | 3.591463 |
5 | Copycat (1995) | 3.772727 | 3.140625 |
pivoted['diff'] = pivoted.M - pivoted.F
pivoted.head()
movie_id | title | F | M | diff |
---|---|---|---|---|
1 | Toy Story (1995) | 3.789916 | 3.909910 | 0.119994 |
2 | GoldenEye (1995) | 3.368421 | 3.178571 | -0.189850 |
3 | Four Rooms (1995) | 2.687500 | 3.108108 | 0.420608 |
4 | Get Shorty (1995) | 3.400000 | 3.591463 | 0.191463 |
5 | Copycat (1995) | 3.772727 | 3.140625 | -0.632102 |
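The pivot step can be checked by hand on a small frame. The sketch below uses invented ratings and the `index`/`columns` keyword names of current pandas; `pivot_table` averages the values by default, so each cell is the mean rating for that title and sex.

```python
import pandas as pd

# Toy data, invented for illustration.
lens_demo = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'sex': ['F', 'M', 'M', 'F', 'M'],
    'rating': [4, 2, 4, 5, 3],
})

pivoted_demo = lens_demo.pivot_table(index='title', columns='sex',
                                     values='rating', fill_value=0)
# Positive diff: rated higher by men; negative diff: rated higher by women.
pivoted_demo['diff'] = pivoted_demo['M'] - pivoted_demo['F']
print(pivoted_demo)
```

For title `A` the male mean is (2 + 4) / 2 = 3 and the female mean is 4, giving a diff of -1.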
pivoted.reset_index('movie_id', inplace=True)
disagreements = pivoted[pivoted.movie_id.isin(most_50.index)]['diff']
disagreements.sort_values().plot(kind='barh', figsize=[9, 15])
<matplotlib.axes._subplots.AxesSubplot at 0x10894b450>