A First Try at Web Scraping with Python

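Web scraping has two main stages: fetching the page and parsing it.
In R, scraping mainly relies on two packages: RCurl for fetching and XML for parsing.
Python has a matching pair: requests for fetching and BeautifulSoup for parsing.
Below are three practice scenarios:
- Grabbing a ready-made table directly
- Simulating a login
- Scraping the Douban Top 250 movies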

Grabbing a table directly

In [28]:

import pandas as pd
url = "http://zh.wikipedia.org/wiki/%E7%9C%81%E4%BC%9A"
dfs = pd.read_html(url, attrs={'class': 'wikitable'})
dfs[0].head()

Out[28]:

	0	1	2	3	4	5	6	7	8	9	10
0	編號	行政區	簡稱	省會或首府	地區	NaN	編號	行政區	簡稱	省會或首府	地區
1	01	江蘇省	蘇	鎮江	華中	20	甘肅省	隴	蘭州	華北	NaN
2	02	浙江省	浙	杭州	華中	21	寧夏省	寧	銀川	塞北	NaN
3	03	安徽省	皖	合肥	華中	22	青海省	青	西寧	西部	NaN
4	04	江西省	贛	南昌	華中	23	綏遠省	綏	歸綏（今呼和浩特）	塞北	NaN
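In the output above, the real column names ended up in row 0 while the columns themselves are numbered 0 to 10. A minimal sketch of promoting that first row to the header follows; `raw` is a hand-typed excerpt of the first three columns, and `promote_first_row` is an illustrative helper name rather than a pandas function.

```python
import pandas as pd

# Hand-typed stand-in for dfs[0]: the header text sits in row 0.
raw = pd.DataFrame([
    ["編號", "行政區", "簡稱"],
    ["01", "江蘇省", "蘇"],
    ["02", "浙江省", "浙"],
])

def promote_first_row(df):
    """Return a new frame whose column names come from the first row."""
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = df.iloc[0].tolist()
    return out

table = promote_first_row(raw)
print(table)
```

`pd.read_html` also accepts a `header` argument, so passing `header=0` may give the same result at read time.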

Simulating a login

To log in to Zhihu, use Chrome's developer tools (Inspect) to observe the data that is sent during login.
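The login code itself is not shown here. As a rough sketch of the usual pattern, the form fields seen in the browser's Network panel are copied into a dictionary and POSTed through a session so that cookies persist across requests. The URL, the field names (`email`, `password`, `csrf_token`) and the helpers `build_payload` and `login` are all placeholders, not Zhihu's actual interface; substitute whatever the developer tools show.

```python
# Hypothetical endpoint and field names: replace them with what the
# browser's developer tools show for the real login request.
LOGIN_URL = "https://example.com/login"

def build_payload(email, password, csrf_token):
    """Assemble the form fields observed in the login request."""
    return {"email": email, "password": password, "csrf_token": csrf_token}

def login(session, email, password, csrf_token):
    """POST the login form through `session` so its cookies are kept.

    `session` is expected to behave like requests.Session.
    """
    payload = build_payload(email, password, csrf_token)
    response = session.post(LOGIN_URL, data=payload, timeout=10)
    response.raise_for_status()
    return response

# Usage (performs a real network request, so it is left commented out):
# import requests
# with requests.Session() as s:
#     login(s, "me@example.com", "secret", "token-from-page")
#     page = s.get("https://example.com/profile")  # sent with login cookies
```

Passing the session in as an argument keeps the helper easy to try out with a stub object before pointing it at a real site.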

Scraping the Douban Top 250 movies

Scrape the rating and the number of ratings for each of the 250 top movies,
then plot them with R.

In [30]:

import requests
from bs4 import BeautifulSoup
import re
url = "http://movie.douban.com/top250"
r = requests.get(url)
soup_packtpage = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly

In [31]:

# Extraction helpers tailored to the page's markup
def namefunc(movie):
    # The first <span class="title"> in each entry holds the movie title
    names = [x.find('span', attrs={'class': 'title'}).string for x in movie]
    return names
def scorefunc(movie):
    # The rating is the text of the <em> tag
    scores = [float(str(x.find('em').string)) for x in movie]
    return scores
def numfunc(movie):
    # attrs=None is intended to pick the first <span> without a class attribute
    num = [x.find('span', attrs=None).string for x in movie]
    # Drop every non-digit character, keeping only the count
    num = [int(re.sub(r'\D', '', x)) for x in num]
    return num
def getinfo(url):
    # Fetch one page and extract the fields from each <div class="info">
    r = requests.get(url)
    soup_packtpage = BeautifulSoup(r.text, 'html.parser')
    movie = soup_packtpage.find_all('div', attrs={'class': 'info'})
    names = namefunc(movie)
    scores = scorefunc(movie)
    num = numfunc(movie)
    res = {'names': names, 'scores': scores, 'num': num}
    return res
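The count extraction in `numfunc` works by deleting every non-digit character and converting what is left to an integer. Here is a small standalone sketch of that step; `extract_count` is an illustrative name and the sample strings are made up to resemble the page text.

```python
import re

def extract_count(text):
    """Strip every non-digit character and return the rest as an int."""
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)

print(extract_count("573416人评价"))      # 573416
print(extract_count("  1,234 ratings "))  # 1234
```

Because all digits are joined together, this only suits text that contains a single number.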

In [32]:

# Build the URL for each results page
urls = []
index = range(0,250,25)
for x in index:
    urls.append('http://movie.douban.com/top250?start='+str(x)+'&filter=&type=')
urls

Out[32]:

['http://movie.douban.com/top250?start=0&filter=&type=',
 'http://movie.douban.com/top250?start=25&filter=&type=',
 'http://movie.douban.com/top250?start=50&filter=&type=',
 'http://movie.douban.com/top250?start=75&filter=&type=',
 'http://movie.douban.com/top250?start=100&filter=&type=',
 'http://movie.douban.com/top250?start=125&filter=&type=',
 'http://movie.douban.com/top250?start=150&filter=&type=',
 'http://movie.douban.com/top250?start=175&filter=&type=',
 'http://movie.douban.com/top250?start=200&filter=&type=',
 'http://movie.douban.com/top250?start=225&filter=&type=']
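The same list can be built without string concatenation. This is a minimal sketch using the standard library's `urlencode`; `page_urls` is an illustrative helper name, and the query parameters simply mirror the URLs shown above.

```python
from urllib.parse import urlencode

BASE = "http://movie.douban.com/top250"

def page_urls(total=250, per_page=25):
    """Build one URL per results page, stepping `start` by `per_page`."""
    return [
        BASE + "?" + urlencode({"start": start, "filter": "", "type": ""})
        for start in range(0, total, per_page)
    ]

urls = page_urls()
print(len(urls))  # 10
print(urls[1])    # http://movie.douban.com/top250?start=25&filter=&type=
```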

In [33]:

# Scrape each URL and merge the results
res = {'names': [], 'scores': [], 'num': []}
for url in urls:
    new = getinfo(url)
    res['names'].extend(new['names'])
    res['scores'].extend(new['scores'])
    res['num'].extend(new['num'])
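The loop above merges per-page dictionaries of lists into one. Below is a sketch of the same merge as a reusable function, with an optional pause between requests. `collect`, `fetch` and `delay` are illustrative names, and the stub fetcher stands in for `getinfo` so the example runs offline with made-up data.

```python
import time

def collect(urls, fetch, delay=0.0):
    """Call fetch(url) for each URL and concatenate the per-key lists."""
    merged = {}
    for i, url in enumerate(urls):
        if i and delay:
            time.sleep(delay)  # pause between requests to be polite
        for key, values in fetch(url).items():
            merged.setdefault(key, []).extend(values)
    return merged

# Stub standing in for getinfo(): returns made-up data, no network access.
def fake_fetch(url):
    n = int(url.rsplit("=", 1)[1])
    return {"names": [f"movie{n}"], "scores": [9.0], "num": [n]}

res = collect(["x?start=0", "x?start=25"], fake_fetch)
print(res)
```

With the real scraper, the call would presumably look like `collect(urls, getinfo, delay=1.0)`.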

In [34]:

import pandas as pd
df = pd.DataFrame(res)
df.head()

Out[34]:

	names	num	scores
0	肖申克的救赎	573416	9.6
1	这个杀手不太冷	543434	9.4
2	阿甘正传	484866	9.4
3	霸王别姬	389376	9.4
4	美丽人生	265644	9.4

Visualization

In [37]:

%load_ext rpy2.ipython

In [38]:

%%R -i df -w 500 -h 300 
library(ggplot2)
p = ggplot(df,aes(x = num, y = scores)) + geom_point(size=4,alpha=0.5) + stat_smooth()
print(p)

geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.