# IPython/PySpark session transcript: prime-counting benchmark, then Titanic
# survival modeling with Spark RDDs.
# NOTE(review): this is a flattened Python 2 notebook dump — IPython magics
# (%%timeit, !head) and captured cell outputs appear inline, so the file is
# not directly runnable as a .py script.
from pyspark import SparkContext
# Local Spark context using 4 worker threads.
sc = SparkContext( 'local[4]')
def isprime(n):
    """Return True if the integer n is prime, False otherwise.

    The input is coerced with abs(int(n)), so negative values are
    tested via their absolute value.
    """
    n = abs(int(n))
    # 0 and 1 are not prime
    if n < 2:
        return False
    # 2 is the only even prime
    if n == 2:
        return True
    # every other even number is composite
    if n % 2 == 0:
        return False
    # trial division by odd candidates up to sqrt(n)
    candidate = 3
    while candidate * candidate <= n:
        if n % candidate == 0:
            return False
        candidate += 2
    return True
# Benchmark: count primes below 1e6 in plain Python/NumPy vs. Spark.
# %%timeit is an IPython cell magic; the bare "78498 ..." lines are captured
# cell output, not code.
%%timeit
import numpy as np
nums = xrange(1000000)
print np.sum([1 for x in nums if isprime(x)])
78498 78498 78498 78498 1 loops, best of 3: 4.81 s per loop
%%timeit
nums = sc.parallelize(xrange(1000000))
print nums.filter(isprime).count()
78498 78498 78498 78498 1 loops, best of 3: 2.71 s per loop
# Shell magic: read the CSV header line to get the variable names.
vname = !head -1 titanic.csv
vname = vname[0].split(',')
# One-off shell command that produced the header-less file (kept for the record).
#!sed 1d titanic.csv > titanic_noheader.csv
raw = sc.textFile('titanic_noheader.csv')
raw.first() # raw data
u'0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S'
# Extract the passenger's title from the name field
def extract_name(x):
    """Return the double-quoted name field from a raw CSV line."""
    import re
    # Greedy match spans from the first double quote to the last one on the line.
    match = re.search("\"(.*)\"", x)
    return match.group(1)
names = raw.map(extract_name)
names.take(4)
[u'Braund, Mr. Owen Harris', u'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', u'Heikkinen, Miss. Laina', u'Futrelle, Mrs. Jacques Heath (Lily May Peel)']
import re
# The title is the token between ", " and ". " in the name,
# e.g. "Braund, Mr. Owen Harris" -> "Mr".
title = names.map(lambda x: re.search(r", (.*?)\. ", x).group(1))
# Title frequency table, most common first
# (Python 2 only: dict.iteritems and tuple-unpacking lambda).
sorted(title.countByValue().iteritems(),key=lambda (k,v): v,reverse=True)
[(u'Mr', 517), (u'Miss', 182), (u'Mrs', 125), (u'Master', 40), (u'Dr', 7), (u'Rev', 6), (u'Major', 2), (u'Mlle', 2), (u'Col', 2), (u'Sir', 1), (u'the Countess', 1), (u'Don', 1), (u'Capt', 1), (u'Lady', 1), (u'Jonkheer', 1), (u'Ms', 1), (u'Mme', 1)]
# Keep only the four most frequent titles.
top_title = [x[0] for x in sorted(title.countByValue().iteritems(),key=lambda (k,v): v,reverse=True)[:4]]
top_title
[u'Mr', u'Miss', u'Mrs', u'Master']
def assign_title(x):
    """Collapse rare titles: keep x if it is a frequent title, else u'other'.

    Relies on the module-level list ``top_title`` (the four most common titles).
    """
    return x if x in top_title else u'other'
# Collapse the title column to the four common titles plus 'other'.
title_less = title.map(assign_title)
title_less.take(4)
[u'Mr', u'Mrs', u'Miss', u'Mrs']
# Process the remaining (non-name) columns
def split_rest(x):
    """Drop the quoted name field (with its trailing comma) and split on commas."""
    import re
    # Remove the entire quoted section so embedded commas don't break the split.
    without_name = re.sub("\"(.*)\",", '', x)
    return without_name.split(',')
df = raw.map(split_rest)
df.first()
[u'0', u'3', u'male', u'22', u'1', u'0', u'A/5 21171', u'7.25', u'', u'S']
# Inspect the data
# The name column was stripped by split_rest, so remove it from the header too.
vname.remove('name')
# Number of distinct values per variable
m = len(df.first())
# NOTE(review): indentation of the loop bodies below was lost in the notebook dump.
for i in range(m):
print '%dth variable:%s distinct value: %s' %(i, vname[i],df.map(lambda row: row[i]).distinct().count())
0th variable:survived distinct value: 2 1th variable:pclass distinct value: 3 2th variable:sex distinct value: 2 3th variable:age distinct value: 89 4th variable:sibsp distinct value: 7 5th variable:parch distinct value: 7 6th variable:ticket distinct value: 681 7th variable:fare distinct value: 248 8th variable:cabin distinct value: 148 9th variable:embarked distinct value: 4
# Number of missing (empty-string) values per variable
for i in range(m):
print '%dth variable:%s miss value: %s' %(i, vname[i],df.map(lambda row: row[i]=='').sum())
0th variable:survived miss value: 0 1th variable:pclass miss value: 0 2th variable:sex miss value: 0 3th variable:age miss value: 177 4th variable:sibsp miss value: 0 5th variable:parch miss value: 0 6th variable:ticket miss value: 0 7th variable:fare miss value: 0 8th variable:cabin miss value: 687 9th variable:embarked miss value: 2
# Impute missing ages
age = df.map(lambda x: x[3])
# Pair each passenger's title with the age string.
title_age = title.zip(age)
# Parse ages to float; encode missing ('') with the sentinel -1.
title_age = title_age.mapValues(lambda x: float(x) if x!='' else -1)
import numpy as np
def miss_mean(data):
    """Mean of data, ignoring entries equal to the missing-value sentinel -1."""
    observed = [value for value in data if value != -1]
    return np.mean(observed)
# Mean observed age per title; miss_mean reads the grouped values via v.data
# (Python 2 tuple-unpacking lambda).
age_dict = dict(title_age.groupByKey().map(lambda (k,v): (k, miss_mean(v.data))).collect())
age_dict
{u'Capt': 70.0, u'Col': 58.0, u'Don': 40.0, u'Dr': 42.0, u'Jonkheer': 38.0, u'Lady': 48.0, u'Major': 48.5, u'Master': 4.5741666666666667, u'Miss': 21.773972602739725, u'Mlle': 24.0, u'Mme': 24.0, u'Mr': 32.368090452261306, u'Mrs': 35.898148148148145, u'Ms': 28.0, u'Rev': 43.166666666666664, u'Sir': 49.0, u'the Countess': 33.0}
def age_func(pair):
    """Replace a missing age (-1) with the mean age for the passenger's title.

    pair is a (title, age) tuple; returns a (title, age) tuple.
    Relies on the module-level ``age_dict`` of per-title mean ages.
    """
    title, age = pair
    if age == -1:
        return (title, age_dict[title])
    return (title, age)
# Replace sentinel ages with the per-title mean.
title_age = title_age.map(age_func)
age_imputed = title_age.values()
age_imputed.take(4)
[22.0, 38.0, 26.0, 35.0]
# Impute missing embarked values
# Column 9 is embarked; two rows are missing (empty string).
df.map(lambda record: record[9]).countByValue()
defaultdict(<type 'int'>, {u'Q': 77, u'': 2, u'S': 644, u'C': 168})
def embarked_func(record):
    """Return the embarked field (index 9), imputing missing values with u'S'.

    u'S' is the most frequent port in this dataset (644 of 891 rows).
    """
    port = record[9]
    return u'S' if port == '' else port
embarked= df.map(embarked_func)
# Convert the four categorical variables into 0-1 dummy variables
# Stable mapping from each distinct title to a column index.
title_dict = title_less.distinct().zipWithIndex().collectAsMap()
title_dict
{u'Master': 1, u'Miss': 0, u'Mr': 3, u'Mrs': 4, u'other': 2}
def create_vector(term, term_dict):
    """One-hot encode term against term_dict.

    Returns a list of len(term_dict) zeros with a single 1 at the index
    term_dict[term]. Raises KeyError if term is not in term_dict.
    """
    vector = [0 for _ in range(len(term_dict))]
    vector[term_dict[term]] = 1
    return vector
create_vector(u'Master',title_dict)
[0, 1, 0, 0, 0]
# One-hot encode the collapsed title column.
title_ind = title_less.map(lambda x: create_vector(x,title_dict))
title_ind.take(4)
[[0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [0, 0, 0, 0, 1]]
# One-hot encode passenger class (column 1).
pclass_dict = df.map(lambda x: x[1]).distinct().zipWithIndex().collectAsMap()
pclass_dict
{u'1': 0, u'2': 2, u'3': 1}
pclass_ind = df.map(lambda x: create_vector(x[1],pclass_dict))
pclass_ind.take(4)
[[0, 1, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
# One-hot encode the (imputed) embarked column.
embarked_dict = embarked.distinct().zipWithIndex().collectAsMap()
embarked_dict
{u'C': 2, u'Q': 0, u'S': 1}
embarked_ind = embarked.map(lambda x: create_vector(x,embarked_dict))
embarked_ind.take(4)
[[0, 1, 0], [0, 0, 1], [0, 1, 0], [0, 1, 0]]
# Sex needs no dictionary: encode male as 1, female as 0.
gender_ind = df.map(lambda x: 1 if x[2]==u'male' else 0)
# Merge the feature columns
# Key every per-column RDD by row index so the columns can be joined back
# into one feature list per row (Python 2 tuple-unpacking lambdas).
restdf = df.map(lambda x: [int(x[0]),int(x[4]), int(x[5]), float(x[7])]).zipWithIndex().map(lambda (v,k): (k,v))
restdf.take(4)
[(0, [0, 1, 0, 7.25]), (1, [1, 1, 0, 71.2833]), (2, [1, 0, 0, 7.925]), (3, [1, 1, 0, 53.1])]
title_ind = title_ind.zipWithIndex().map(lambda (v,k): (k,v))
title_ind.take(4)
[(0, [0, 0, 0, 1, 0]), (1, [0, 0, 0, 0, 1]), (2, [1, 0, 0, 0, 0]), (3, [0, 0, 0, 0, 1])]
pclass_ind = pclass_ind.zipWithIndex().map(lambda (v,k): (k,v))
pclass_ind.take(4)
[(0, [0, 1, 0]), (1, [1, 0, 0]), (2, [0, 1, 0]), (3, [1, 0, 0])]
embarked_ind = embarked_ind.zipWithIndex().map(lambda (v,k): (k,v))
embarked_ind.take(4)
[(0, [0, 1, 0]), (1, [0, 0, 1]), (2, [0, 1, 0]), (3, [0, 1, 0])]
# Scalars are wrapped in one-element lists so list '+' can concatenate them below.
gender_ind = gender_ind.zipWithIndex().map(lambda (v,k): (k,[v]))
gender_ind.take(4)
[(0, [1]), (1, [0]), (2, [0]), (3, [0])]
age_imputed = age_imputed.zipWithIndex().map(lambda (v,k): (k,[v]))
age_imputed.take(4)
[(0, [22.0]), (1, [38.0]), (2, [26.0]), (3, [35.0])]
# union + reduceByKey concatenates the per-row feature lists across the RDDs
# (each key appears once per RDD, so reduce just joins the two lists).
# NOTE(review): row order is not preserved after reduceByKey — see the keys in
# the take(4) output below.
finaldf = restdf.union(embarked_ind).reduceByKey(lambda x,y: x + y)
finaldf = finaldf.union(age_imputed).reduceByKey(lambda x,y: x + y)
finaldf = finaldf.union(gender_ind).reduceByKey(lambda x,y: x + y)
finaldf = finaldf.union(title_ind).reduceByKey(lambda x,y: x + y)
finaldf = finaldf.union(pclass_ind).reduceByKey(lambda x,y: x + y)
finaldf.take(4)
[(0, [0, 1, 0, 7.25, 0, 1, 0, 22.0, 1, 0, 0, 0, 1, 0, 0, 1, 0]), (384, [0, 0, 0, 7.8958, 0, 1, 0, 32.368090452261306, 1, 0, 0, 0, 1, 0, 0, 1, 0]), (132, [0, 1, 0, 14.5, 0, 1, 0, 47.0, 0, 0, 0, 0, 0, 1, 0, 1, 0]), (588, [0, 0, 0, 8.05, 0, 1, 0, 22.0, 1, 0, 0, 0, 1, 0, 0, 1, 0])]
# Prepare the format needed for modeling
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
def parsePoint(line):
    """Convert a (row_index, [target, feature, ...]) pair into a LabeledPoint.

    The first value in the merged feature list is the survival label;
    the remaining values are the features.
    """
    _, values = line
    return LabeledPoint(values[0], values[1:])
modeldata = finaldf.map(parsePoint)
modeldata.first()
LabeledPoint(0.0, [1.0,0.0,7.25,0.0,1.0,0.0,22.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0])
# Train/test split
train, test = modeldata.randomSplit([0.75,0.25])
# Fit the model
model = LogisticRegressionWithSGD.train(train,iterations =1000,regType='l2')
# Evaluate
labelsAndPreds = test.map(lambda p: (p.label, model.predict(p.features)))
# Fraction of misclassified rows (Python 2 tuple-unpacking lambda).
testErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(test.count())
# NOTE(review): the message says "Training Error" but testErr is computed on
# the held-out test split — the label is misleading.
print("Training Error = " + str(testErr))
Training Error = 0.308056872038