This is a work in progress
%matplotlib inline
# Note: later cells also use names such as arange, figsize, hist and plot,
# which come from IPython's pylab mode (%pylab inline) rather than from these imports.
import flickrapi
import pandas as pd
import json
Prerequisites:
In order to follow this tutorial you need some knowledge about a few things:
Happy googling! :)
First we create a flickrapi object with our Flickr API key
api_key = ''
flickr = flickrapi.FlickrAPI(api_key, format='json')
Then we retrieve the faves (favorites) of a selected photo.
Here we choose to use the JSON format because it is readable by pandas.
fav = flickr.photos_getFavorites(photo_id='14368887810');
By printing the content of the string, we see that some characters have to be removed to obtain a valid JSON string
print 'head of fav: ',fav[:50]
print 'tail of fav: ',fav[-50:],'\n'
# to return a valid json string, just remove 14 first characters and the last one
print 'head of fav (valid): ',fav[14:50]
print 'tail of fav (valid): ',fav[-50:-1]
head of fav:  jsonFlickrApi({"photo":{"person":[{"nsid":"1208257
tail of fav:  s":12, "perpage":10, "total":"120"}, "stat":"ok"})
head of fav (valid):  {"photo":{"person":[{"nsid":"1208257
tail of fav (valid):  s":12, "perpage":10, "total":"120"}, "stat":"ok"}
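The response is wrapped in a jsonFlickrApi(...) callback, and len('jsonFlickrApi(') is exactly 14, which is why the slicing above works. A small helper (hypothetical, not part of the flickrapi package) makes the intent explicit:

```python
def strip_flickr_wrapper(resp):
    """Strip the jsonFlickrApi( ... ) wrapper around the JSON payload."""
    prefix = 'jsonFlickrApi('
    if resp.startswith(prefix) and resp.endswith(')'):
        return resp[len(prefix):-1]
    return resp  # already plain JSON

raw = 'jsonFlickrApi({"stat": "ok"})'  # miniature response for illustration
print(strip_flickr_wrapper(raw))       # {"stat": "ok"}
```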
Then we parse the JSON string, which essentially transforms it into a Python dict
s_fav = json.loads(fav[14:-1])
print str(s_fav)[:200] # printing only the beginning
{u'photo': {u'perpage': 10, u'farm': 3, u'pages': 12, u'id': u'14368887810', u'person': [{u'username': u'Lukasan2014', u'family': 0, u'realname': u'', u'nsid': u'120825773@N05', u'iconserver': u'3908'
We can see that the content related to favorites is contained in the entry 'person' in the form of a list
print str(s_fav['photo']['person'])[:200], '\n' # display the fav list
print s_fav['photo']['person'][5], '\n' # display info about one fav
print s_fav['photo']['person'][5]['favedate'], '\n' # display fav date (UNIX timestamp)
[{u'username': u'Lukasan2014', u'family': 0, u'realname': u'', u'nsid': u'120825773@N05', u'iconserver': u'3908', u'favedate': u'1406519587', u'contact': 0, u'friend': 0, u'iconfarm': 4}, {u'username'

{u'username': u'christilou1', u'family': 0, u'realname': u'', u'nsid': u'38030824@N04', u'iconserver': u'4108', u'favedate': u'1406116570', u'contact': 0, u'friend': 0, u'iconfarm': 5}

1406116570
We can import the information into a pandas data frame
# read only valid json sub string into data frame
df_fav = pd.read_json(json.dumps(s_fav['photo']['person']));
Now we can check that we have a data frame containing only the information about favorites
df_fav
| | contact | family | favedate | friend | iconfarm | iconserver | nsid | realname | username |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1406519587 | 0 | 4 | 3908 | 120825773@N05 | | Lukasan2014 |
| 1 | 0 | 0 | 1406309214 | 0 | 3 | 2927 | 91007561@N05 | Cường Euro | Photography |
| 2 | 0 | 0 | 1406293554 | 0 | 4 | 3673 | 104486340@N03 | JM Fikko | jm.fikko |
| 3 | 0 | 0 | 1406259611 | 0 | 3 | 2939 | 43738607@N07 | | .brianday |
| 4 | 0 | 0 | 1406239666 | 0 | 4 | 3679 | 60142567@N05 | | Paxtorino |
| 5 | 0 | 0 | 1406116570 | 0 | 5 | 4108 | 38030824@N04 | | christilou1 |
| 6 | 0 | 0 | 1405960894 | 0 | 3 | 2907 | 96274428@N05 | Fabio Luzzi | faio33 |
| 7 | 0 | 0 | 1405825932 | 0 | 9 | 8060 | 43295905@N05 | Marvett Smith | Marvett Smith (Savoring the Simple) |
| 8 | 0 | 0 | 1405535709 | 0 | 3 | 2888 | 117312075@N06 | Abhijit Roy | c4roy |
| 9 | 0 | 0 | 1405488932 | 0 | 4 | 3794 | 28465812@N07 | | Monica & Cebolinha |
To summarize, here is what we do to access our data of interest and put it into a data frame:

1. json.loads the (trimmed) response string. Why do that? Because the information about the faves is easier to extract by using my_json_var['field1']['field2'] type queries (extracting directly from the string would require regular expressions and we don't want that).
2. json.dumps the sub-structure of interest, so that we can read it into a pandas data frame.

_Edit: I also tried it by using only data frames (without creating dict objects) but it required more manipulation._
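This loads → select → dumps round trip can be sketched with a toy payload (values made up for illustration):

```python
import json

payload = '{"photo": {"person": [{"nsid": "1@N00", "favedate": "1406519587"}]}}'
parsed = json.loads(payload)        # JSON string -> dict
person = parsed['photo']['person']  # easy nested access, no regex needed
back = json.dumps(person)           # list of dicts -> JSON string for read_json
print(back)                         # [{"nsid": "1@N00", "favedate": "1406519587"}]
```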
Next we can view the dates as a time series
df_fav['favedate']
0    1406519587
1    1406309214
2    1406293554
3    1406259611
4    1406239666
5    1406116570
6    1405960894
7    1405825932
8    1405535709
9    1405488932
Name: favedate, dtype: int64
Then we want to convert the dates into a more readable format.
For this we use the to_datetime
function and specify that the UNIX timestamp unit is seconds (the default is nanoseconds).
Additionally we may want to create a new data frame that only contains our date information.
df2 = pd.DataFrame(pd.to_datetime(df_fav['favedate'], unit='s'))
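As a sanity check of the unit='s' conversion, the first favedate can also be decoded with the standard library:

```python
from datetime import datetime

ts = 1406519587                       # first favedate in the series above
print(datetime.utcfromtimestamp(ts))  # 2014-07-28 03:53:07
```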
If we are interested in the day of the week, we have to create a DatetimeIndex
object, from which we can extract the weekday
and store it in a new column of the data frame
df2['weekday'] = pd.DatetimeIndex(df2['favedate']).weekday
df2
| | favedate | weekday |
|---|---|---|
| 0 | 2014-07-28 03:53:07 | 0 |
| 1 | 2014-07-25 17:26:54 | 4 |
| 2 | 2014-07-25 13:05:54 | 4 |
| 3 | 2014-07-25 03:40:11 | 4 |
| 4 | 2014-07-24 22:07:46 | 3 |
| 5 | 2014-07-23 11:56:10 | 2 |
| 6 | 2014-07-21 16:41:34 | 0 |
| 7 | 2014-07-20 03:12:12 | 6 |
| 8 | 2014-07-16 18:35:09 | 2 |
| 9 | 2014-07-16 05:35:32 | 2 |
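The weekday numbering follows the Monday=0 through Sunday=6 convention, the same as Python's datetime.weekday(); the first row can be cross-checked:

```python
from datetime import datetime

d = datetime(2014, 7, 28)  # favedate of row 0 above
print(d.weekday())         # 0, i.e. a Monday
```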
OK, now that we know these basic tools, we can start to visualize some stuff.
Let's first see if we can visualize the number of favorites as a function of the time after posting a picture.
In the following we define the activity as the number of faves + comments.
Things that may be interesting to look at:
Here we chose the photo that was ranked first on the explore page, and we try to see the evolution of the number of favorites with time.
# picture that is 1st on explore page (date Jul 29th 2014)
pp = 50;
p_id = '14778520992'
str_tmp = flickr.photos_getFavorites(photo_id=p_id, per_page=pp);
json_tmp = json.loads(str_tmp[14:-1])
df_tmp = pd.read_json(json.dumps(json_tmp['photo']['person']));
df_fav = pd.DataFrame(pd.to_datetime(df_tmp['favedate'], unit='s'))
num_p = json_tmp['photo']['pages']
num_p
20
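The number of pages reported by the API is simply the ceiling of total/per_page: the first photo had 120 faves at 10 per page, hence 12 pages, and the 20 pages at per_page=50 here mean between 951 and 1000 faves. A quick check:

```python
def num_pages(total, per_page):
    return -(-total // per_page)  # ceiling division without floats

print(num_pages(120, 10))   # 12 pages, as for the first photo
print(num_pages(1000, 50))  # 20 pages, as reported here
```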
We also need the time at which the photo was posted, which we obtain as a pandas.Timestamp
object.
str_tmp = flickr.photos_getInfo(photo_id=p_id);
tt = int(json.loads(str_tmp[14:-1])['photo']['dates']['posted'])
posted = pd.to_datetime(tt, unit='s')
posted
Timestamp('2014-07-29 18:44:12')
for i in arange(2, num_p+1):
str_tmp = \
flickr.photos_getFavorites(photo_id=p_id, page=i, per_page=pp)
json_tmp = json.loads(str_tmp[14:-1]);
df_tmp = pd.read_json(json.dumps(json_tmp['photo']['person']))
df_fav = df_fav.append(pd.DataFrame(pd.to_datetime(df_tmp['favedate'],\
unit='s')), ignore_index=True)
#print "data frame\n", df_fav
df_fav['diff'] = df_fav['favedate'] - posted
ts=(df_fav['diff'].astype('timedelta64[s]').astype(float)/3600);
figsize(17,5)
ts.hist(bins=50, color='slategrey', histtype='stepfilled')
xlabel('hours'); ylabel('Number of favorites')
<matplotlib.text.Text at 0x7f1551003450>
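The timedelta-to-hours conversion used for this histogram can be illustrated with plain datetime arithmetic (the fave time below is made up):

```python
from datetime import datetime

posted = datetime(2014, 7, 29, 18, 44, 12)  # posting time from above
faved = datetime(2014, 7, 29, 21, 14, 12)   # hypothetical fave 2.5 h later
hours = (faved - posted).total_seconds() / 3600.0
print(hours)  # 2.5
```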
To make the plots nicer, we can use the seaborn
package, which is built on top of matplotlib
for statistical visualization.
import seaborn as sb
figsize(17,5)
ts.hist(bins=50, color='slategrey', histtype='stepfilled')
xlabel('hours'); ylabel('Number of favorites')
<matplotlib.text.Text at 0x7f15462ebd10>
# Extracts for a photo the times and types of events (fave or comment)
def getFlickrActivity(df_list):
pp = 50;
df = pd.DataFrame(columns=['time','event_type','photo_id'])
for p_id in df_list:
# parse first page (necessary to know the total number of pages)
str_tmp = flickr.photos_getFavorites(photo_id=p_id, per_page=pp)
json_tmp = json.loads(str_tmp[14:-1])
num_p = json_tmp['photo']['pages']
df_tmp = pd.read_json(json.dumps(json_tmp['photo']['person']))
dt = pd.Series(pd.to_datetime(df_tmp['favedate'], unit='s'))
df = df.append(pd.DataFrame({ 'time' : dt,
'event_type' : 'f',
'photo_id' : p_id}))
# parse other pages
for i in arange(2, num_p+1):
str_tmp = \
flickr.photos_getFavorites(photo_id=p_id, page=i, per_page=pp);
json_tmp = json.loads(str_tmp[14:-1])
df_tmp = pd.read_json(json.dumps(json_tmp['photo']['person']));
dt = pd.Series(pd.to_datetime(df_tmp['favedate'], unit='s'))
df = df.append(pd.DataFrame({ 'time' : dt,
'event_type' : 'f',
'photo_id' : p_id}), ignore_index=True)
# do the same with comments (no page parsing required)
str_tmp = json.loads(flickr.photos_comments_getList(photo_id=p_id)[14:-1])
df_tmp = pd.read_json(json.dumps(str_tmp['comments']['comment']))
dt = pd.Series(pd.to_datetime(df_tmp['datecreate'], unit='s'))
df = df.append(pd.DataFrame({ 'time' : dt,
'event_type' : 'c',
'photo_id' : p_id}), ignore_index=True)
return df
# Takes a data frame containing a list of photos and extracts information for each of them:
# time posted, views (to be completed in future versions)
def getFlickrInfo(df_list):
pp = 50;
df_out = pd.DataFrame(index = arange(df_list.size), columns = ['photo_id', 'posted', 'views'])
for idx, p_id in enumerate(df_list):
str_tmp = flickr.photos_getInfo(photo_id=p_id)
tt = int(json.loads(str_tmp[14:-1])['photo']['dates']['posted'])
posted = pd.to_datetime(tt, unit='s')
views = int(json.loads(str_tmp[14:-1])['photo']['views'])
df_out.posted.ix[idx] = posted;
df_out.views.ix[idx] = views;
df_out.photo_id.ix[idx] = p_id;
return df_out
def getFlickrExploreList():
str_tmp = flickr.interestingness_getList()
json_tmp = json.loads(str_tmp[14:-1])
df_tmp = pd.read_json(json.dumps(json_tmp['photos']['photo']))
df = df_tmp['id']
return df # list of photo ids
# Making a data frame with pictures from explore
df_list = getFlickrExploreList()
df_info = getFlickrInfo(df_list)
df_event = getFlickrActivity(df_list)
Now we have a data frame df_info
that contains, for each picture, the time posted and the number of views (which we could also count as part of the activity),
df_info.tail()
photo_id | posted | views | |
---|---|---|---|
95 | 14944614512 | 2014-08-17 11:04:56 | 10970 |
96 | 14924191006 | 2014-08-17 15:15:35 | 9145 |
97 | 14764195088 | 2014-08-17 20:56:48 | 10851 |
98 | 14758390478 | 2014-08-17 11:10:59 | 12199 |
99 | 14955273955 | 2014-08-18 06:02:39 | 1665 |
and another one, df_event,
that contains the list of events (faves and comments) for our picture list
df_event.tail()
event_type | photo_id | time | |
---|---|---|---|
35718 | c | 14955273955 | 2014-08-18 15:35:11 |
35719 | c | 14955273955 | 2014-08-18 15:43:48 |
35720 | c | 14955273955 | 2014-08-18 18:02:35 |
35721 | c | 14955273955 | 2014-08-18 18:56:43 |
35722 | c | 14955273955 | 2014-08-18 19:50:20 |
figsize(8,6)
tt=pd.DatetimeIndex(df_event['time']);
hist(tt.hour, bins=24, histtype='stepfilled', alpha=0.7);
figsize(8,6)
tt=pd.DatetimeIndex(df_info['posted']);
hist(tt.hour, bins=24, histtype='stepfilled', alpha=0.7);
grouped = df_event.groupby('photo_id')
figsize(15,7)
b=arange(0, 24)
acc = zeros(len(b)-1)
for k, grp in grouped:
tt=pd.DatetimeIndex(grp['time'])
h,hh=histogram(tt.hour, bins=b)
acc += h
plot(b[:-1],h, '.r', alpha=0.25, markersize=10)
acc /= len(grouped)
plot(b[:-1],acc)
gca().invert_yaxis()
gca().xaxis.tick_top()
figsize(15,7)
b=arange(0, 48)
acc = zeros(len(b)-1)
for k, grp in grouped:
tt=grp['time']-df_info.posted[df_info.photo_id==k].iloc[0]
#print tt
ttt=(tt.astype('timedelta64[s]').astype(float)/3600);
h,hh=histogram(ttt, bins=b)
acc += h
plot(b[:-1], h, '.r', alpha=0.25, markersize=10)
acc /= len(grouped)
plot(b[:-1], acc)
gca().invert_yaxis()
gca().xaxis.tick_top()
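The per-photo binning and averaging done above with histogram can be sketched without matplotlib or numpy, using collections.Counter (the event hours below are made up for illustration):

```python
from collections import Counter

# hypothetical event hours (0-23) for two photos
photos = {'a': [3, 3, 4, 15], 'b': [3, 16, 16]}

acc = Counter()
for hours in photos.values():
    acc.update(hours)  # accumulate counts over all photos

# average number of events per photo in each hour bin
avg = {h: acc[h] / float(len(photos)) for h in sorted(acc)}
print(avg)  # {3: 1.5, 4: 0.5, 15: 0.5, 16: 1.0}
```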