데이터 분석은 일반적으로 다음과 같은 단계로 나뉘어 집니다.¶

외부 데이터와 연동
준비
변환
모델링과 계산
제시

이 장은 파이썬과 Pandas를 통해 어떤 식으로 분석하는지 맛보기입니다. 어렵거나 이해가 잘 안되어도 너무 걱정하실 필요는 없습니다.¶

미정부 페이지(.gov 또는 .mil) 방문자 분석¶

In [340]:

path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
open(path).readline()

Out[340]:

'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'

json파서를 통해 파이썬 딕셔너리로 읽어들입니다.¶

In [341]:

import json
records = [json.loads(line) for line in open(path)]
records[0]

Out[341]:

{u'a': u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
 u'al': u'en-US,en;q=0.8',
 u'c': u'US',
 u'cy': u'Danvers',
 u'g': u'A6qOVH',
 u'gr': u'MA',
 u'h': u'wfLQtf',
 u'hc': 1331822918,
 u'hh': u'1.usa.gov',
 u'l': u'orofrog',
 u'll': [42.576698, -70.954903],
 u'nk': 1,
 u'r': u'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 u't': 1331923247,
 u'tz': u'America/New_York',
 u'u': u'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}

In [342]:

records[0]['tz']

Out[342]:

u'America/New_York'

파이썬으로 타임존 세기¶

In [343]:

time_zones = [rec['tz'] for rec in records]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-343-db4fbd348da9> in <module>()
----> 1 time_zones = [rec['tz'] for rec in records]

KeyError: 'tz'

타임존 정보가 없는 경우도 있어서 에러가 발생합니다. 그런 경우는 무시하도록 합시다.¶

In [344]:

time_zones = [rec['tz'] for rec in records if 'tz' in rec]

이제 타임존을 세기위한 함수를 작성해 봅니다.¶

In [345]:

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

파이썬 내장 모듈에서 제공하는 defaultdict함수를 이용하면 좀 더 간단해 집니다.¶

In [346]:

def get_counts2(sequence):
    counts = defaultdict(int)  # 기본 값이 0으로 초기화 됩니다.
    for x in sequence:
        counts[x] += 1
    return counts

아래와 같이 잘 동작하네요¶

In [347]:

counts = get_counts(time_zones)
counts['America/New_York']

Out[347]:

In [348]:

len(time_zones)

Out[348]:

가장 많은 방문지역 10개를 찾으려면¶

In [349]:

def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()] 
    value_key_pairs.sort()
    return value_key_pairs[-n:]

top_counts(counts)

Out[349]:

[(33, u'America/Sao_Paulo'),
 (35, u'Europe/Madrid'),
 (36, u'Pacific/Honolulu'),
 (37, u'Asia/Tokyo'),
 (74, u'Europe/London'),
 (191, u'America/Denver'),
 (382, u'America/Los_Angeles'),
 (400, u'America/Chicago'),
 (521, u''),
 (1251, u'America/New_York')]

파이썬 라이브러리를 잘 찾아보면 다음과 같이 할 수도 있습니다.¶

In [350]:

from collections import Counter
counter = Counter(time_zones)
counter.most_common(10)

Out[350]:

[(u'America/New_York', 1251),
 (u'', 521),
 (u'America/Chicago', 400),
 (u'America/Los_Angeles', 382),
 (u'America/Denver', 191),
 (u'Europe/London', 74),
 (u'Asia/Tokyo', 37),
 (u'Pacific/Honolulu', 36),
 (u'Europe/Madrid', 35),
 (u'America/Sao_Paulo', 33)]

Pandas로 타임존 세기¶

데이터프레임은 판다스에서 가장 중요한 구조체입니다. 파이썬 딕셔너리를 인자로 데이터프레임을 간단히 만들수 있습니다.¶

In [351]:

from pandas import DataFrame, Series
import pandas as pd; import numpy as np

frame = DataFrame(records)
frame

Out[351]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3560 entries, 0 to 3559
Data columns (total 18 columns):
_heartbeat_    120  non-null values
a              3440  non-null values
al             3094  non-null values
c              2919  non-null values
cy             2919  non-null values
g              3440  non-null values
gr             2919  non-null values
h              3440  non-null values
hc             3440  non-null values
hh             3440  non-null values
kw             93  non-null values
l              3440  non-null values
ll             2919  non-null values
nk             3440  non-null values
r              3440  non-null values
t              3440  non-null values
tz             3440  non-null values
u              3440  non-null values
dtypes: float64(4), object(14)

데이터 프레임의 내용이 많은 경우, 위처럼 Summary View형식으로 나오며, 간단하면 다 출력됩니다.¶

In [352]:

frame['tz'][:10]

Out[352]:

0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7                     
8                     
9                     
Name: tz, dtype: object

value_counts로 각 값이 몇 번 나왔는지 알 수 있습니다.¶

In [353]:

tz_counts = frame['tz'].value_counts()
tz_counts[:10]

Out[353]:

America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
America/Sao_Paulo        33
dtype: int64

fillna는 빈 값을 채워줍니다.¶

In [354]:

clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'

tz_counts = clean_tz.value_counts()
tz_counts[:10]

Out[354]:

America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
dtype: int64

plot을 사용하면 간단히 그래프를 그릴 수 있습니다.¶

In [355]:

tz_counts[:10].plot(kind='barh', rot=0)

Out[355]:

<matplotlib.axes.AxesSubplot at 0x11f7165d0>

방문자의 에이전트 정보를 살펴보면 정말 다양합니다.¶

In [356]:

frame['a'][1]

Out[356]:

u'GoogleMaps/RochesterNY'

In [357]:

frame['a'][50]

Out[357]:

u'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'

In [358]:

frame['a'][51]

Out[358]:

u'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'

파이썬으로 필요한 부분만 추려낼 수 있습니다.¶

In [359]:

results = Series([x.split()[0] for x in frame.a.dropna()])

results[:5]

Out[359]:

0               Mozilla/5.0
1    GoogleMaps/RochesterNY
2               Mozilla/4.0
3               Mozilla/5.0
4               Mozilla/5.0
dtype: object

In [360]:

results.value_counts()[:8]

Out[360]:

Mozilla/5.0                 2594
Mozilla/4.0                  601
GoogleMaps/RochesterNY       121
Opera/9.80                    34
TEST_INTERNET_AGENT           24
GoogleProducer                21
Mozilla/6.0                    5
BlackBerry8520/5.0.0.681       4
dtype: int64

상위 타임존을 윈도우 권과 비 윈도우 권으로 나누려면 어떻게 해야할까요?¶

In [279]:

cframe = frame[frame.a.notnull()]  # 에이전트 정보가 있는 것만 추려냄
    
operating_system = np.where(cframe['a'].str.contains('Windows'),
                            'Windows', 'Not Windows')

operating_system[:5]

Out[279]:

0        Windows
1    Not Windows
2        Windows
3    Not Windows
4        Windows
Name: a, dtype: object

이제 데이터를 타임존과 OS정보로 그룹짓습니다.¶

In [280]:

by_tz_os = cframe.groupby(['tz', operating_system])

size()로 개수를 세고, unstack으로 옆으로 펴줍니다.

In [281]:

agg_counts = by_tz_os.size().unstack().fillna(0)
agg_counts[:10]

Out[281]:

a	Not Windows	Windows
tz
	245	276
Africa/Cairo	0	3
Africa/Casablanca	0	1
Africa/Ceuta	0	2
Africa/Johannesburg	0	1
Africa/Lusaka	0	1
America/Anchorage	4	1
America/Argentina/Buenos_Aires	1	0
America/Argentina/Cordoba	0	1
America/Argentina/Mendoza	0	1

끝으로, 방문자 수로 상위 타임존을 구하고, 소팅용 간접 인덱스를 구합니다.¶

In [282]:

indexer = agg_counts.sum(1).argsort()
indexer[:10]

Out[282]:

tz
                                  24
Africa/Cairo                      20
Africa/Casablanca                 21
Africa/Ceuta                      92
Africa/Johannesburg               87
Africa/Lusaka                     53
America/Anchorage                 54
America/Argentina/Buenos_Aires    57
America/Argentina/Cordoba         26
America/Argentina/Mendoza         55
dtype: int64

take함수를 사용해 인덱스 순서대로의 데이터프레임을 구합니다.¶

In [283]:

count_subset = agg_counts.take(indexer)[-10:]

count_subset

Out[283]:

a	Not Windows	Windows
tz
America/Sao_Paulo	13	20
Europe/Madrid	16	19
Pacific/Honolulu	0	36
Asia/Tokyo	2	35
Europe/London	43	31
America/Denver	132	59
America/Los_Angeles	130	252
America/Chicago	115	285
	245	276
America/New_York	339	912

이제 plot으로 그립니다. stacked=True를 하면 중첩 막대 그래프가 그려집니다.¶

In [284]:

count_subset.plot(kind='barh', stacked=True)

Out[284]:

<matplotlib.axes.AxesSubplot at 0x127e30910>

위의 그래프로 전체적인 윈도우 사용자의 비율을 보기에는 적합하지 않습니다. 이를 위해 정규화한 후 다시 그려봅니다.¶

In [285]:

normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind='barh', stacked=True)

Out[285]:

<matplotlib.axes.AxesSubplot at 0x127e71410>

무비렌즈 1M 데이터 셋¶

1990년부터 2000년 초반까지 무비렌즈의 6000명의 유저가 4000개의 영화에 대해 매긴 100만개 평점정보를 가지고 분석을 해봅시다¶

평점, 유저 정보, 영화 정보의 3개 테이블로 나누어져 있습니다.¶

In [286]:

!head ch02/movielens/movies.dat

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller

In [287]:

import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip'] 
users = pd.read_table('ch02/movielens/users.dat', sep='::', header=None,
                       names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ch02/movielens/ratings.dat', sep='::', header=None,
                        names=rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ch02/movielens/movies.dat', sep='::', header=None,
                        names=mnames)

In [288]:

users[:5]

Out[288]:

	user_id	gender	age	occupation	zip
0	1	F	1	10	48067
1	2	M	56	16	70072
2	3	M	25	15	55117
3	4	M	45	7	02460
4	5	M	25	20	55455

In [289]:

ratings[:5]

Out[289]:

	user_id	movie_id	rating	timestamp
0	1	1193	5	978300760
1	1	661	3	978302109
2	1	914	3	978301968
3	1	3408	4	978300275
4	1	2355	5	978824291

In [290]:

movies[:5]

Out[290]:

	movie_id	title	genres
0	1	Toy Story (1995)	Animation\|Children's\|Comedy
1	2	Jumanji (1995)	Adventure\|Children's\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama
4	5	Father of the Bride Part II (1995)	Comedy

이렇게 3개로 나뉘어진 데이터테이블을 다루는 것은 쉽지 않습니다. merge를 이용해 합쳐 봅니다.¶

In [291]:

data = pd.merge(pd.merge(ratings, users), movies)

테이블간 같은 이름의 컬럼이 있으면 그것을 기준으로 합쳐 줍니다.¶

In [292]:

data

Out[292]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 10 columns):
user_id       1000209  non-null values
movie_id      1000209  non-null values
rating        1000209  non-null values
timestamp     1000209  non-null values
gender        1000209  non-null values
age           1000209  non-null values
occupation    1000209  non-null values
zip           1000209  non-null values
title         1000209  non-null values
genres        1000209  non-null values
dtypes: int64(6), object(4)

In [293]:

data.ix[0]

Out[293]:

user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object

이렇게 해 두면, 유저들이나 영화의 속성으로 평점을 통합하기가 용이합니다. 이제 성별에 의한 평균 평점을 구하기 위해 pivot_table을 사용합니다.¶

In [294]:

mean_ratings = data.pivot_table('rating', rows='title',
                                cols='gender', aggfunc='mean')

mean_ratings[:5]

Out[294]:

gender	F	M
title
$1,000,000 Duck (1971)	3.375000	2.761905
'Night Mother (1986)	3.388889	3.352941
'Til There Was You (1997)	2.675676	2.733333
'burbs, The (1989)	2.793478	2.962085
...And Justice for All (1979)	3.828571	3.689024

이제 최소한 250개 이상의 평가를 받은 영화들을 추려보겠습니다. 이를 위해 size()를 호출해 타이틀별 평점 개수를 구합니다.¶

In [295]:

ratings_by_title = data.groupby('title').size()

ratings_by_title[:10]

Out[295]:

title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64

유효한(=일정 횟수 이상의 평가가 있는) 타이틀들의 인덱스를 구합니다.¶

In [296]:

active_titles = ratings_by_title.index[ratings_by_title >= 250]

이제 유효한 영화에 대한 성별별 평균 테이블을 추려낼 수 있습니다.¶

In [297]:

mean_ratings = mean_ratings.ix[active_titles]

mean_ratings

Out[297]:

<class 'pandas.core.frame.DataFrame'>
Index: 1216 entries, 'burbs, The (1989) to eXistenZ (1999)
Data columns (total 2 columns):
F    1216  non-null values
M    1216  non-null values
dtypes: float64(2)

이것으로 여성들에게 좋은 평가를 받은 영화를 찾아보겠습니다.¶

In [298]:

top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)

top_female_ratings[:10]

Out[298]:

gender	F	M
title
Close Shave, A (1995)	4.644444	4.473795
Wrong Trousers, The (1993)	4.588235	4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)	4.572650	4.464589
Wallace & Gromit: The Best of Aardman Animation (1996)	4.563107	4.385075
Schindler's List (1993)	4.562602	4.491415
Shawshank Redemption, The (1994)	4.539075	4.560625
Grand Day Out, A (1992)	4.537879	4.293255
To Kill a Mockingbird (1962)	4.536667	4.372611
Creature Comforts (1990)	4.513889	4.272277
Usual Suspects, The (1995)	4.513317	4.518248

성별에 따른 평점 차이 측정¶

남녀간 평점 차이가 큰 영화를 알아볼까요? 평균 테이블에 새로운 컬럼 diff를 만듧니다.¶

In [299]:

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

그리고 이를 기준으로 소팅하면 남녀간 견해 차이가 큰 영화들도 알 수 있겠네요.¶

In [300]:

sorted_by_diff = mean_ratings.sort_index(by='diff')

sorted_by_diff[:15]

Out[300]:

gender	F	M	diff
title
Dirty Dancing (1987)	3.790378	2.959596	-0.830782
Jumpin' Jack Flash (1986)	3.254717	2.578358	-0.676359
Grease (1978)	3.975265	3.367041	-0.608224
Little Women (1994)	3.870588	3.321739	-0.548849
Steel Magnolias (1989)	3.901734	3.365957	-0.535777
Anastasia (1997)	3.800000	3.281609	-0.518391
Rocky Horror Picture Show, The (1975)	3.673016	3.160131	-0.512885
Color Purple, The (1985)	4.158192	3.659341	-0.498851
Age of Innocence, The (1993)	3.827068	3.339506	-0.487561
Free Willy (1993)	2.921348	2.438776	-0.482573
French Kiss (1995)	3.535714	3.056962	-0.478752
Little Shop of Horrors, The (1960)	3.650000	3.179688	-0.470312
Guys and Dolls (1955)	4.051724	3.583333	-0.468391
Mary Poppins (1964)	4.197740	3.730594	-0.467147
Patch Adams (1998)	3.473282	3.008746	-0.464536

이것을 역으로 정렬해서 보면, 남녀의 평가가 비슷한 영화들이 나옵니다.¶

In [301]:

sorted_by_diff[::-1][:15]

Out[301]:

gender	F	M	diff
title
Good, The Bad and The Ugly, The (1966)	3.494949	4.221300	0.726351
Kentucky Fried Movie, The (1977)	2.878788	3.555147	0.676359
Dumb & Dumber (1994)	2.697987	3.336595	0.638608
Longest Day, The (1962)	3.411765	4.031447	0.619682
Cable Guy, The (1996)	2.250000	2.863787	0.613787
Evil Dead II (Dead By Dawn) (1987)	3.297297	3.909283	0.611985
Hidden, The (1987)	3.137931	3.745098	0.607167
Rocky III (1982)	2.361702	2.943503	0.581801
Caddyshack (1980)	3.396135	3.969737	0.573602
For a Few Dollars More (1965)	3.409091	3.953795	0.544704
Porky's (1981)	2.296875	2.836364	0.539489
Animal House (1978)	3.628906	4.167192	0.538286
Exorcist, The (1973)	3.537634	4.067239	0.529605
Fright Night (1985)	2.973684	3.500000	0.526316
Barb Wire (1996)	1.585366	2.100386	0.515020

만약, 성별을 무시하고 평가의 차이가 큰 영화를 찾는다면 변화율이나 표준편차로 측정할 수 있습니다.¶

In [302]:

# 타이틀 별 평점의 표준편차
rating_std_by_title = data.groupby('title')['rating'].std()

# 유효한 타이틀에 대해 좁힌다
rating_std_by_title = rating_std_by_title.ix[active_titles]

# 상위 10개만 본다
rating_std_by_title.order(ascending=False)[:10]

Out[302]:

title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64

장르 정보는 '|'로 구분되어 있는데, 장르별 분석을 위해서는 이것도 사용이 용이한 형태로 바꾸어야 할 것입니다.¶

1880년부터 2010년까지 미국 아기 이름 분석¶

이제, 미국 사회보장국(SSA)가 제공하는 1880년부터 현재까지 출생하는 아기들의 이름 데이터를 분석해보겠습니다.

유닉스의 head명령으로 파일의 윗부분을 살펴보면 다음과 같습니다.¶

In [303]:

!head ch02/names/yob1880.txt

, 로 구분된 텍스트 데이터는 read_csv로 데이터프레임을 만들 수 있습니다.¶

In [304]:

import pandas as pd

names1880 = pd.read_csv('ch02/names/yob1880.txt', names=['name', 'sex', 'births'])

names1880

Out[304]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 3 columns):
name      2000  non-null values
sex       2000  non-null values
births    2000  non-null values
dtypes: int64(1), object(2)

성별로 그룹짓고 이름별 출생을 다 더하면 그 해의 신생아 수를 알 수 있습니다.¶

In [362]:

names1880.groupby('sex').births.sum()

Out[362]:

sex
F       90993
M      110493
Name: births, dtype: int64

데이터가 연도별 파일로 나누어져 있기에, 각 데이터프레임을 concat으로 하나의 데이터프레임으로 합치겠습니다.¶

In [306]:

years = range(1880, 2011)

pieces = []
columns = ['name', 'sex', 'births']

for year in years:
    path = 'ch02/names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year 
    pieces.append(frame)
    
# 모든 데이터프레임들을 합친다.
names = pd.concat(pieces, ignore_index=True)

concat은 테이블을 행단위로 합칩니다. 여기에서 원래 인덱스는 관심이 없기에 ignore_index=True를 했습니다.¶

In [307]:

names

Out[307]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
name      1690784  non-null values
sex       1690784  non-null values
births    1690784  non-null values
year      1690784  non-null values
dtypes: int64(2), object(2)

여기에서 출생 연도별로 나뉘어진 성별 출생수 피봇테이블을 만듧니다.¶

In [308]:

total_births = names.pivot_table('births', rows='year',
                                 cols='sex', aggfunc=sum)

In [309]:

total_births.tail()

Out[309]:

sex	F	M
year
2006	1896468	2050234
2007	1916888	2069242
2008	1883645	2032310
2009	1827643	1973359
2010	1759010	1898382

이제 각 이름이 그해의 전체 신생아중 얼마의 비율을 차지하는지를 계산해 prop컬럼에 넣겠습니다.¶

In [310]:

def add_prop(group):
    # 실수형으로 나눕니다.
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)

In [311]:

names

Out[311]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1690784 entries, 0 to 1690783
Data columns (total 5 columns):
name      1690784  non-null values
sex       1690784  non-null values
births    1690784  non-null values
year      1690784  non-null values
prop      1690784  non-null values
dtypes: float64(1), int64(2), object(2)

In [312]:

names[:6]

Out[312]:

	name	sex	births	year	prop
0	Mary	F	7065	1880	0.077643
1	Anna	F	2604	1880	0.028618
2	Emma	F	2003	1880	0.022013
3	Elizabeth	F	1939	1880	0.021309
4	Minnie	F	1746	1880	0.019188
5	Margaret	F	1578	1880	0.017342

이런 그룹 연산에는 합산 결과를 확인해 보는 것이 좋습니다.¶

In [363]:

np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)

Out[363]:

True

분석 대상이 될 이름을 추려내겠습니다.¶

In [364]:

def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex']) 
top1000 = grouped.apply(get_top1000)

이름 경향 분석하기¶

이제 이름에 대한 다양한 분석을 시작할 수 있습니다.¶

In [315]:

boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']

이름별로 연도별 경향을 보기 위해서 좀 더 개조가 필요합니다.¶

In [367]:

total_births = top1000.pivot_table('births', rows='year', cols='name',
                                   aggfunc=sum)

total_births

Out[367]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 1880 to 2010
Columns: 6865 entries, Aaden to Zuri
dtypes: float64(6865)

이제 몇몇 이름들에 대한 경향을 그래프로 그려봅니다.¶

In [368]:

subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]

subset.plot(subplots=True, figsize=(12, 10), grid=False,
            title="Number of births per year")

Out[368]:

array([<matplotlib.axes.AxesSubplot object at 0x127d56dd0>,
       <matplotlib.axes.AxesSubplot object at 0x117ae1850>,
       <matplotlib.axes.AxesSubplot object at 0x127c61450>,
       <matplotlib.axes.AxesSubplot object at 0x127c7f410>], dtype=object)

위의 그래프에서 이 이름들이 점점 인기가 없는 것 같다는 결론이 나올 수도 있지만, 사실은 좀 더 복잡합니다.¶

작명의 다양성을 측정하기¶

위의 그래프에서 평범한 이름을 선택하는 경우가 줄어든다고 해석할 수도 있는데, 데이터 분석으로 접근해 보겠습니다.¶

In [318]:

table = top1000.pivot_table('prop', rows='year',
                            cols='sex', aggfunc=sum)

table.plot(title=u'Sum of table1000.prop by year and sex',
           yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))

Out[318]:

<matplotlib.axes.AxesSubplot at 0x13ad505d0>

상위 1000개의 이름이 차지하는 비율이 줄어든다는 것은, 이름의 다양성이 증가한다는 것을 말해줍니다.¶

위에서부터 몇 개의 이름을 선택해야 50%가 될지 우선 알아봅니다.¶

In [371]:

prop_cumsum = df.sort_index(by='prop', ascending=False).prop.cumsum()

prop_cumsum[:10]

Out[371]:

year  sex         
2010  M    1676644    0.011523
           1676645    0.020934
           1676646    0.029959
           1676647    0.038930
           1676648    0.047817
           1676649    0.056579
           1676650    0.065155
           1676651    0.073414
           1676652    0.081528
           1676653    0.089621
dtype: float64

for 루프로 순회하면서 계산해도 되겠지만, 벡터화된 NumPy방식이 더 좋은 방식입니다.¶

In [372]:

prop_cumsum.searchsorted(0.5) + 1

Out[372]:

1900년도에는 훨씬 더 적은 개수가 필요했습니다.¶

In [373]:

df = boys[boys.year == 1900]

in1900 = df.sort_index(by='prop', ascending=False).prop.cumsum()

in1900.searchsorted(0.5) + 1

Out[373]:

이제 이 계산을 모든 연도와 성별에 대해서 적용해보겠습니다.¶

In [374]:

def get_quantile_count(group, q=0.5):
    group = group.sort_index(by='prop', ascending=False) 
    return group.prop.cumsum().searchsorted(q) + 1

diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')

이 diversity에는 연도별로 각 성별에 대한 50%가 되기위한 개수가 있습니다.¶

In [375]:

diversity.head()

Out[375]:

sex	F	M
year
1880	38	14
1881	38	14
1882	38	15
1883	39	15
1884	39	16

In [376]:

diversity.plot(title="Number of popular names in top 50%")

Out[376]:

<matplotlib.axes.AxesSubplot at 0x117cb4190>

위의 결과에서 여자 이름이 항상 남자보다 더 다양했다는 것을 알 수 있습니다.¶

'마지막 글자'의 변화¶

Laura Wattenberg는 지난 100년간 남자 이름의 끝자가 많이 변했다는 것을 지적했습니다. 이것을 확인하기 위해 원래 데이터에서 끝자만 추출합니다.¶

In [378]:

# 이름 컬럼에서 마지막 글자 추출
get_last_letter = lambda x: x[-1]
last_letters = names.name.map(get_last_letter) 
last_letters.name = 'last_letter'

table = names.pivot_table('births', rows=last_letters, cols=['sex', 'year'], aggfunc=sum)

여기에서 시기별 3개 연도에 대해서 앞의 몇 줄을 출력해 보겠습니다.¶

In [379]:

subtable = table.reindex(columns=[1910, 1960, 2010], level='year')

subtable.head()

Out[379]:

sex	F			M
year	1910	1960	2010	1910	1960	2010
last_letter
a	108376	691247	670605	977	5204	28438
b	NaN	694	450	411	3912	38859
c	5	49	946	482	15476	23125
d	6750	3729	2607	22111	262112	44398
e	133569	435013	313833	28655	178823	129012

성별 출생 비율을 구하기위해 전체 출생수로 정규화 합니다.¶

In [380]:

subtable.sum()

Out[380]:

sex  year
F    1910     396416
     1960    2022062
     2010    1759010
M    1910     194198
     1960    2132588
     2010    1898382
dtype: float64

In [381]:

letter_prop = subtable / subtable.sum().astype(float)

이제 글자 비율을 이용해 성별별 그래프를 연도로 구분해 그릴 수 있습니다.¶

In [382]:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 8)) 
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male') 
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female',
                      legend=False)

Out[382]:

<matplotlib.axes.AxesSubplot at 0x11f8eb190>

결과에서 보이듯 남자이름의 마지막 'n'은 1960년 이후로 크게 증가했습니다.¶

이번에는 특정 몇 글자에 대해서만 집중적으로 보겠습니다.¶

In [383]:

letter_prop = table / table.sum().astype(float)

dny_ts = letter_prop.ix[['d', 'n', 'y'], 'M'].T

dny_ts.head()

Out[383]:

	d	n	y
year
1880	0.083055	0.153213	0.075760
1881	0.083247	0.153214	0.077451
1882	0.085340	0.149560	0.077537
1883	0.084066	0.151646	0.079144
1884	0.086120	0.149915	0.080405

이것으로는 연도별 경향을 그래프로 그릴 수 있습니다.¶

In [384]:

dny_ts.plot(figsize=(10, 5))

Out[384]:

<matplotlib.axes.AxesSubplot at 0x117c9c790>

남자 이름이 여자 이름으로 된 경우 (Lesley 또는 Leslie에 대해)¶

유니크한 모든 이름 리스트를 구하고, 거기에서 'lesl'을 포함하는 이름만 선택합니다.¶

In [385]:

all_names = top1000.name.unique()

mask = np.array(['lesl' in x.lower() for x in all_names])

lesley_like = all_names[mask]

lesley_like

Out[385]:

array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)

In [388]:

mask

Out[388]:

array([False, False, False, ..., False, False, False], dtype=bool)

In [386]:

filtered = top1000[top1000.name.isin(lesley_like)]

filtered.groupby('name').births.sum()

Out[386]:

name
Leslee      1082
Lesley     35022
Lesli        929
Leslie    370429
Lesly      10067
Name: births, dtype: int64

이것에서 성별과 연도에 대해 피봇테이블을 만들고 연도별로 정규화합니다.¶

In [389]:

table = filtered.pivot_table('births', rows='year', cols='sex', aggfunc='sum')

table = table.div(table.sum(1), axis=0)

table.tail()

Out[389]:

sex	F	M
year
2006	1	NaN
2007	1	NaN
2008	1	NaN
2009	1	NaN
2010	1	NaN

최종적으로, 시간의 경과에 따른 성별 분석을 그래프로 그립니다.¶

In [336]:

table.plot(style={'M': 'k-', 'F': 'k--'}, figsize=(10, 5))

Out[336]:

<matplotlib.axes.AxesSubplot at 0x13addb350>