name = "2020-09-10-github-scrape"
title = "Scraping GitHub after a hackweek"
tags = "requests, github, webscrape, pandas, dataviz, text processing"
author = "Callum Rollo"
from nb_tools import connect_notebook_to_post
from IPython.core.display import HTML
html = connect_notebook_to_post(name, title, tags, author)
I put this notebook together after attending the excellent Ocean Hack Week 2020 (OHW) event. You can read my blog post about it here. A key part of the event was creating collaborative projects on GitHub. I wanted to see if attending the hackweek changed participants' patterns of activity on GitHub. I thought the simplest way would be to get info on all the commits by OHW participants to OHW projects.
It goes without saying that commits ≠ work done on a project. I freely admit that many of my commits are nonsense! This is just a fun side project for me to explore the GitHub API and try some simple analysis methods.
First off, to access the GitHub API you'll need to edit the credentials file credentials.json
to supply your username and a GitHub access token.
{
"username": "<your-username>",
"token": "<your-access-token>"
}
Once you have supplied these creds, you are still limited to 5,000 requests per hour, so if you get bounced by the API, leave it some time to cool off. Without credentials the limit is far lower and you would soon generate an error message like this:
Obviously I have not provided my own credentials! I'm not sure what action GitHub would take if a DDoS on their API originated from my account, but I'm not willing to find out.
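Before scraping in earnest, it's worth checking how much of your hourly allowance remains. The API exposes this at the https://api.github.com/rate_limit endpoint; the helper below is a minimal sketch that just parses the relevant field out of that response (the sample dict mimics the shape of the real response, with made-up values).

```python
def remaining_requests(rate_limit_json):
    """Pull the remaining core-API request count out of a /rate_limit response."""
    return rate_limit_json['resources']['core']['remaining']

# In practice you would fetch the response first (authenticated requests
# get the higher 5,000/hour limit):
# rate_limit_json = requests.get('https://api.github.com/rate_limit',
#                                auth=(username, token)).json()

# Sample with made-up numbers, matching the shape of the real response
sample = {'resources': {'core': {'limit': 5000, 'remaining': 4987,
                                 'reset': 1602862980}}}
print(remaining_requests(sample))
```

If the remaining count is low, sleep until the `reset` timestamp before continuing.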
import json
import requests
from collections import Counter
import pandas as pd
import numpy as np
credentials = json.loads(open('credentials-secret.json').read()) #don't forget to add your creds here!
username = credentials['username']
token = credentials['token']
For a start, let's use the API to get some details on my account
user_data = requests.get('https://api.github.com/users/' + credentials['username'],auth = (username,token)).json()
user_data
{'login': 'callumrollo', 'id': 28703282, 'node_id': 'MDQ6VXNlcjI4NzAzMjgy', 'avatar_url': 'https://avatars0.githubusercontent.com/u/28703282?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/callumrollo', 'html_url': 'https://github.com/callumrollo', 'followers_url': 'https://api.github.com/users/callumrollo/followers', 'following_url': 'https://api.github.com/users/callumrollo/following{/other_user}', 'gists_url': 'https://api.github.com/users/callumrollo/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/callumrollo/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/callumrollo/subscriptions', 'organizations_url': 'https://api.github.com/users/callumrollo/orgs', 'repos_url': 'https://api.github.com/users/callumrollo/repos', 'events_url': 'https://api.github.com/users/callumrollo/events{/privacy}', 'received_events_url': 'https://api.github.com/users/callumrollo/received_events', 'type': 'User', 'site_admin': False, 'name': 'Callum Rollo', 'company': None, 'blog': 'https://callumrollo.github.io/', 'location': None, 'email': None, 'hireable': None, 'bio': 'Oceanographer, Pythonista and vim user. Regularly breaks things on Fridays', 'twitter_username': 'callum_rollo', 'public_repos': 36, 'public_gists': 0, 'followers': 18, 'following': 3, 'created_at': '2017-05-15T09:23:12Z', 'updated_at': '2020-10-09T14:36:52Z', 'private_gists': 0, 'total_private_repos': 7, 'owned_private_repos': 7, 'disk_usage': 389086, 'collaborators': 0, 'two_factor_authentication': True, 'plan': {'name': 'pro', 'space': 976562499, 'collaborators': 0, 'private_repos': 9999}}
This returns a lot of info. Now, I'll try the account of one of my collaborators, ocefpaf
. You can specify any user, though less information is returned than when you look at your own account.
data = requests.get('https://api.github.com/users/' + 'ocefpaf',auth = (username,token)).json()
data
{'login': 'ocefpaf', 'id': 950575, 'node_id': 'MDQ6VXNlcjk1MDU3NQ==', 'avatar_url': 'https://avatars1.githubusercontent.com/u/950575?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/ocefpaf', 'html_url': 'https://github.com/ocefpaf', 'followers_url': 'https://api.github.com/users/ocefpaf/followers', 'following_url': 'https://api.github.com/users/ocefpaf/following{/other_user}', 'gists_url': 'https://api.github.com/users/ocefpaf/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/ocefpaf/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/ocefpaf/subscriptions', 'organizations_url': 'https://api.github.com/users/ocefpaf/orgs', 'repos_url': 'https://api.github.com/users/ocefpaf/repos', 'events_url': 'https://api.github.com/users/ocefpaf/events{/privacy}', 'received_events_url': 'https://api.github.com/users/ocefpaf/received_events', 'type': 'User', 'site_admin': False, 'name': 'Filipe', 'company': None, 'blog': 'http://ocefpaf.github.io/python4oceanographers', 'location': 'Florianópolis, SC', 'email': 'ocefpaf@gmail.com', 'hireable': True, 'bio': 'Physical oceanographer turned research software engineer, and software packaging hobbyist.', 'twitter_username': 'ocefpaf', 'public_repos': 1190, 'public_gists': 133, 'followers': 395, 'following': 4, 'created_at': '2011-07-31T23:10:26Z', 'updated_at': '2020-10-10T15:23:18Z'}
We can see a user's core stats. How about their commits and other actions? Simply append /events
to the request query.
data = requests.get('https://api.github.com/users/' + 'callumrollo' +'/events',auth = (username,token)).json()
data[0]
{'id': '13873691257', 'type': 'PushEvent', 'actor': {'id': 28703282, 'login': 'callumrollo', 'display_login': 'callumrollo', 'gravatar_id': '', 'url': 'https://api.github.com/users/callumrollo', 'avatar_url': 'https://avatars.githubusercontent.com/u/28703282?'}, 'repo': {'id': 48320187, 'name': 'ueapy/ueapy.github.io', 'url': 'https://api.github.com/repos/ueapy/ueapy.github.io'}, 'payload': {'push_id': 5868625550, 'size': 1, 'distinct_size': 1, 'ref': 'refs/heads/master', 'head': '0b70258f3668dbd25fe8e9681b9324af62e3066f', 'before': 'f9c3596cef28b24cfd4d7c27ba91a1c038c943b6', 'commits': [{'sha': '0b70258f3668dbd25fe8e9681b9324af62e3066f', 'author': {'email': 'c.rollo@outlook.com', 'name': 'Callum Rollo'}, 'message': 'Generate Pelican site', 'distinct': True, 'url': 'https://api.github.com/repos/ueapy/ueapy.github.io/commits/0b70258f3668dbd25fe8e9681b9324af62e3066f'}]}, 'public': True, 'created_at': '2020-10-16T17:42:00Z', 'org': {'id': 15898025, 'login': 'ueapy', 'gravatar_id': '', 'url': 'https://api.github.com/orgs/ueapy', 'avatar_url': 'https://avatars.githubusercontent.com/u/15898025?'}}
Note that the GitHub API returns only the 30 most recent events per request
len(data)
30
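As an aside, you can cut down on round trips with the per_page query parameter, which the events API accepts up to a value of 100 (the overall event cap still applies). A small helper for building the paginated URLs might look like this — the function name is mine, not from the original code:

```python
def events_url(user, page=1, per_page=30):
    """Build a paginated events URL for a given GitHub user."""
    return (f'https://api.github.com/users/{user}/events'
            f'?per_page={per_page}&page={page}')

print(events_url('callumrollo', page=2, per_page=100))
# → https://api.github.com/users/callumrollo/events?per_page=100&page=2
```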
To get more events, we use a short loop to request subsequent pages of results. I found out the hard way that the API restricts you to 10 pages.
tgt_user = 'callumrollo'
base_url = 'https://api.github.com/users/' + tgt_user +'/events'
url = base_url
url_list = [base_url]
data = []
page_no = 1
repos_data = []
total_fetched = 0
while (True):
response = requests.get(url,auth = (username,token)).json()
data = data + response
events_fetched = len(response)
total_fetched = events_fetched + total_fetched
print(f"Page: {page_no} total events fetched: {total_fetched}")
if total_fetched == 300:
print(f"\nAPI maxed out! https://docs.github.com/v3/#pagination\n\
returning only most recent 300 events by {tgt_user}")
print(f"\nevents span the range \n{data[-1]['created_at']}\n{data[0]['created_at']}")
break
if (events_fetched == 30):
page_no = page_no + 1
url = base_url + '?page=' + str(page_no)
url_list.append(url)
else:
print(f"\n{tgt_user}: all your events are belong to us now")
print(f"\nevents span the range \n{data[-1]['created_at']}\n{data[0]['created_at']}")
break
Page: 1 total events fetched: 30 Page: 2 total events fetched: 60 Page: 3 total events fetched: 90 Page: 4 total events fetched: 120 Page: 5 total events fetched: 150 Page: 6 total events fetched: 180 Page: 7 total events fetched: 210 Page: 8 total events fetched: 240 Page: 9 total events fetched: 270 Page: 10 total events fetched: 299 callumrollo: all your events are belong to us now events span the range 2020-07-22T15:01:25Z 2020-10-16T17:42:00Z
This log includes all event types: commits, issues, PRs, forks, stars etc. We are only interested in commits, which are recorded with type PushEvent
in each JSON entry.
for event in data[-10:]:
print(event["type"])
CreateEvent CreateEvent WatchEvent WatchEvent WatchEvent CreateEvent PushEvent PushEvent PushEvent PushEvent
commit_events = []
for event in data:
if event["type"] == "PushEvent":
commit_events.append(event)
len(commit_events)
186
There are some complications. Not all of the commits in these events are by the GitHub user we are querying. For instance, some are commits by other users that our target user has merged in.
To work around this, we look through the payload of each PushEvent
and retain only the commits associated with the user we are interested in.
N.B. this approach will only work if the GitHub account's display name exactly matches the name the author uses for their git commits. We apply a workaround for when this is not the case later.
tgt_username = requests.get('https://api.github.com/users/' + credentials['username'],
auth = (username,token)).json()["name"]
tgt_username
'Callum Rollo'
user_commits = []
for event in commit_events:
commit_list = event["payload"]["commits"]
commit_list_author = []
if len(commit_list)>0:
for com_n in range(len(commit_list)):
commit_username = commit_list[com_n]["author"]["name"]
#print(commit_username)
if commit_username == tgt_username:
commit_list_author.append(commit_list[com_n])
if commit_list_author:
event["payload"]["commits"] = commit_list_author
user_commits.append(event)
len(user_commits)
182
We can print out the names associated with the commits we have selected to confirm:
commit_names = []
for event in user_commits:
commit_list = event["payload"]["commits"]
for com_n in range(len(commit_list)):
commit_username = commit_list[com_n]["author"]["name"]
commit_names.append(commit_username)
print("git usernames and number of commits:")
Counter(commit_names).most_common()
git usernames and number of commits:
[('Callum Rollo', 232)]
print(f"From {len(data)} events we have extracted {len(commit_names)} commits by {tgt_username}")
From 299 events we have extracted 232 commits by Callum Rollo
The final step of this (almost certainly imperfect) data cleaning is to get info on all the commits by this user. We will pull the author, message, SHA, URL, repo and date into a pandas dataframe.
df = pd.DataFrame()
for event in user_commits:
for commit in event["payload"]["commits"]:
commit_subset = {"id": event["id"],
"datetime" : event["created_at"],
"sha" : commit["sha"],
"message" : commit["message"],
"author" : commit["author"]["name"],
"url": commit["url"],
"repo": event["repo"]["name"]}
df = df.append(commit_subset, ignore_index=True)
We index by datetime and have a look at our dataframe
df.index = pd.DatetimeIndex(df.datetime)
df = df.drop("datetime", axis=1)
df.head()
author | id | message | repo | sha | url | |
---|---|---|---|---|---|---|
datetime | ||||||
2020-10-16 17:42:00+00:00 | Callum Rollo | 13873691257 | Generate Pelican site | ueapy/ueapy.github.io | 0b70258f3668dbd25fe8e9681b9324af62e3066f | https://api.github.com/repos/ueapy/ueapy.githu... |
2020-10-16 17:41:23+00:00 | Callum Rollo | 13873684842 | Added notebook on webscraping | ueapy/ueapy.github.io | 3364fd1e786612c16b4e999cbb832e9584818804 | https://api.github.com/repos/ueapy/ueapy.githu... |
2020-10-16 16:47:44+00:00 | Callum Rollo | 13873114207 | added beam miss figure and subplot labelling | callumrollo/adcp-glider | ad7aeeac2527b92cb5a98aa8cfc134bf355571e6 | https://api.github.com/repos/callumrollo/adcp-... |
2020-10-16 16:43:06+00:00 | Callum Rollo | 13873061051 | Generate Pelican site | callumrollo/callumrollo.github.io | f5e76561f2e5886db6a413e079209d039f0bac74 | https://api.github.com/repos/callumrollo/callu... |
2020-10-16 16:42:30+00:00 | Callum Rollo | 13873054646 | added MATS poster | callumrollo/callumrollo.github.io | 30a6802786ba5b479e9cf09594e51e3e5ebcef94 | https://api.github.com/repos/callumrollo/callu... |
We can see an uncharacteristically helpful set of commit messages, and the meta event of a commit I made to this very notebook.
Now we remove any repeated commits that may have snuck in, by deduplicating on the SHA checksum.
Side note: the SHA checksum uniquely identifies each commit. Even if you had commits by the same author to the same repo with the same message ("added stuff" or something similarly helpful), the SHA will differentiate the two. See more here
df = df.drop_duplicates(subset=['sha'])
df.head()
author | id | message | repo | sha | url | |
---|---|---|---|---|---|---|
datetime | ||||||
2020-10-16 17:42:00+00:00 | Callum Rollo | 13873691257 | Generate Pelican site | ueapy/ueapy.github.io | 0b70258f3668dbd25fe8e9681b9324af62e3066f | https://api.github.com/repos/ueapy/ueapy.githu... |
2020-10-16 17:41:23+00:00 | Callum Rollo | 13873684842 | Added notebook on webscraping | ueapy/ueapy.github.io | 3364fd1e786612c16b4e999cbb832e9584818804 | https://api.github.com/repos/ueapy/ueapy.githu... |
2020-10-16 16:47:44+00:00 | Callum Rollo | 13873114207 | added beam miss figure and subplot labelling | callumrollo/adcp-glider | ad7aeeac2527b92cb5a98aa8cfc134bf355571e6 | https://api.github.com/repos/callumrollo/adcp-... |
2020-10-16 16:43:06+00:00 | Callum Rollo | 13873061051 | Generate Pelican site | callumrollo/callumrollo.github.io | f5e76561f2e5886db6a413e079209d039f0bac74 | https://api.github.com/repos/callumrollo/callu... |
2020-10-16 16:42:30+00:00 | Callum Rollo | 13873054646 | added MATS poster | callumrollo/callumrollo.github.io | 30a6802786ba5b479e9cf09594e51e3e5ebcef94 | https://api.github.com/repos/callumrollo/callu... |
Now that we have a method for finding commits by a user, the next step is to loop through a list of users. As a test case, I have analysed the commits from Oceanhackweek 2020
def gh_scrape(tgt_users, cred_file = 'credentials.json', verbose=True):
"""Simple scraping function
Supply a list of GitHub usernames ['jane-doe', 'torvalds', 'satoshi_nakamoto']
Returns a dataframe of unique commits by these users over the last 90 days
Assumes that the github user is the user with most git commits associated with their github profile
Rate limited by the Github API to 300 events
Requires you to supply a Github API token in a credentials.json file
Verbose switch prints a line for each user with the number of events and commits found
Returns a pandas dataframe of commit info for all usernames in supplied list
"""
# Get user supplied credentials for the github API
credentials = json.loads(open(cred_file).read())
username = credentials['username']
token = credentials['token']
df = pd.DataFrame()
for tgt_user in tgt_users:
base_url = 'https://api.github.com/users/' + tgt_user +'/events'
url = base_url
url_list = [base_url]
data = []
page_no = 1
repos_data = []
total_fetched = 0
while (True):
response = requests.get(url,auth = (username,token)).json()
if type(response)==dict:
# Catch when the API returns a dict rather than expected list. Usually a credentials error message
print(response)
return
data = data + response
events_fetched = len(response)
total_fetched = events_fetched + total_fetched
if total_fetched == 300:
# Requesting more will max out the API
break
if (events_fetched == 30):
# if we fetched 30 events from this page, there will be another one after it
page_no = page_no + 1
url = base_url + '?page=' + str(page_no)
url_list.append(url)
else:
# We have collected all events by this user
break
commits_events = []
for event in data:
# We're only interested in commits, which are classed as "PushEvents"
if event["type"] == "PushEvent":
commits_events.append(event)
if len(commits_events)==0:
# If the user has no commit events, stop processing
continue
commit_usernames_list = []
for event in commits_events:
# Search through the payload for which git user is associated with each commit
commit_list = event["payload"]["commits"]
if len(commit_list)>0:
for com_n in range(len(commit_list)):
commit_username = commit_list[com_n]["author"]["name"]
commit_usernames_list.append(commit_username)
# Working on the assumption that the git user with the most commits pushed to Github by this user is the one we want
c = Counter(commit_usernames_list)
most_common_username = c.most_common(1)[0][0]
user_commits = []
# Go back through the commits and pull only the ones by the most common git username
for event in commits_events:
commit_list = event["payload"]["commits"]
commit_list_author = []
if len(commit_list)>0:
for com_n in range(len(commit_list)):
commit_username = commit_list[com_n]["author"]["name"]
if commit_username == most_common_username:
commit_list_author.append(commit_list[com_n])
if commit_list_author:
event["payload"]["commits"] = commit_list_author
user_commits.append(event)
# Extract the information we're interested in and put it in a pandas DataFrame
for commit in user_commits:
for com_n in range(len(commit["payload"]["commits"])):
commit_detail = commit["payload"]["commits"][com_n]
commit_subset = {"id": commit["id"],
"datetime" : commit["created_at"],
"sha" : commit_detail["sha"],
"message" : commit_detail["message"],
"author" : commit_detail["author"]["name"],
"url": commit_detail["url"],
"repo": commit["repo"]["name"]}
df = df.append(commit_subset, ignore_index=True)
df.index = pd.DatetimeIndex(df.datetime)
df = df.drop_duplicates(subset=['sha'])
if verbose:
print(f"{tgt_user}: found {len(data)} events containing {len(user_commits)} unique commits by {most_common_username}\n")
return df
users = ["callumrollo"]
df = gh_scrape(users, cred_file = 'credentials-secret.json')
len(df)
callumrollo: found 299 events containing 182 unique commits by Callum Rollo
210
Quite a few commits. What if we want solely the hackweek ones?
df['ohw20_repo'] = df['repo'].str.contains("ohw20")
sum(df['ohw20_repo'] )
23
Using the above function and a list of hackweek participants (not included), I grab GitHub commits from the last 90 days.
import csv
from itertools import chain
with open('ohw_participants.csv', newline='') as f:
nest_list = list(csv.reader(f))
ohw_participants_list = list(chain.from_iterable(nest_list))
df_all = gh_scrape(ohw_participants_list, cred_file='credentials-secret.json', verbose=False)
len(df_all)
1356
As you saw when I grabbed the data just from my username, it contains a lot of identifying information. I have anonymised and saved this data for a later analysis notebook.
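The anonymisation step itself isn't shown here, but a minimal sketch (my actual approach may differ) could replace the identifying columns with short, stable SHA-256 digests — deterministic, so the same author always maps to the same label:

```python
import hashlib

import pandas as pd

def anonymise(df, cols=('author',)):
    """Replace identifying columns with short, stable SHA-256 digests."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].map(
            lambda s: hashlib.sha256(s.encode()).hexdigest()[:10])
    return out

# Tiny demo frame mirroring the columns built above
df_demo = pd.DataFrame({'author': ['Callum Rollo', 'Filipe'],
                        'message': ['Generate Pelican site', 'fix typo']})
df_anon = anonymise(df_demo)
```

Note that short digests of public usernames are pseudonymous rather than truly anonymous: anyone with the participant list could recompute them.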
We can get all the details of a commit by delving deeper into the JSON structure accessed through the commit URL. For example:
url = event['payload']['commits'][0]['url']
commit_detail = requests.get(url,auth = (username,token)).json()
commit_detail.keys()
dict_keys(['sha', 'node_id', 'commit', 'url', 'html_url', 'comments_url', 'author', 'committer', 'parents', 'stats', 'files'])
The most interesting section is files
. This gives a summary of the lines changed in each file altered by this commit.
'files' in commit_detail.keys()
True
commit_detail['files']
[{'sha': 'c90c6b45472c3d1a962024805ce9bacac967066c', 'filename': 'dev-environment.yml', 'status': 'added', 'additions': 13, 'deletions': 0, 'changes': 13, 'blob_url': 'https://github.com/callumrollo/geotiff-generator/blob/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/dev-environment.yml', 'raw_url': 'https://github.com/callumrollo/geotiff-generator/raw/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/dev-environment.yml', 'contents_url': 'https://api.github.com/repos/callumrollo/geotiff-generator/contents/dev-environment.yml?ref=1b7056d0bd435af6ca6eec9414dd4dc3cd39d587', 'patch': '@@ -0,0 +1,13 @@\n+name: geotiff-test\n+channels:\n+ - conda-forge\n+dependencies:\n+ - python=3.7\n+ - numpy\n+ - pandas\n+ - xarray\n+ - netcdf4\n+ - gdal\n+ - jupyter\n+ - ipython\n+ - pytest'}, {'sha': '8ddd4f0dc27b9251adaa755485f754ded4ddc3f3', 'filename': 'geotiff_gen.py', 'status': 'modified', 'additions': 0, 'deletions': 1, 'changes': 1, 'blob_url': 'https://github.com/callumrollo/geotiff-generator/blob/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/geotiff_gen.py', 'raw_url': 'https://github.com/callumrollo/geotiff-generator/raw/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/geotiff_gen.py', 'contents_url': 'https://api.github.com/repos/callumrollo/geotiff-generator/contents/geotiff_gen.py?ref=1b7056d0bd435af6ca6eec9414dd4dc3cd39d587', 'patch': '@@ -13,7 +13,6 @@\n import xarray as xr\n from netCDF4 import Dataset\n import copy\n-import glob\n from pathlib import Path\n from osgeo import gdal, osr\n '}, {'sha': '1d901078bfb7d640216b52bc1c76b44bf86718a0', 'filename': 'test_geotiff.py', 'status': 'added', 'additions': 1, 'deletions': 0, 'changes': 1, 'blob_url': 'https://github.com/callumrollo/geotiff-generator/blob/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/test_geotiff.py', 'raw_url': 'https://github.com/callumrollo/geotiff-generator/raw/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/test_geotiff.py', 'contents_url': 
'https://api.github.com/repos/callumrollo/geotiff-generator/contents/test_geotiff.py?ref=1b7056d0bd435af6ca6eec9414dd4dc3cd39d587', 'patch': '@@ -0,0 +1 @@\n+from geotiff_gen import\n\\ No newline at end of file'}]
This was originally Filipe's idea.
This requires a small extension to our scraping function. Because we want details at the commit level, it makes an additional API call for every single commit, adding quite a time penalty. A good target for optimisation with async methods, perhaps? This is left as an exercise for the reader.
def gh_scrape(tgt_users, cred_file = 'credentials.json', verbose=True):
"""Simple scraping function
Supply a list of GitHub usernames ['jane-doe', 'torvalds', 'satoshi_nakamoto']
Returns a dataframe of unique commits by these users over the last 90 days
Assumes that the github user is the user with most git commits associated with their github profile
Rate limited by the Github API to 300 events
Requires you to supply a Github API token in a credentials.json file
Verbose switch prints a line for each user with the number of events and commits found
Returns a pandas dataframe of commit info for all usernames in supplied list
"""
# Get user supplied credentials for the github API
credentials = json.loads(open(cred_file).read())
username = credentials['username']
token = credentials['token']
df = pd.DataFrame()
for tgt_user in tgt_users:
print(tgt_user)
base_url = 'https://api.github.com/users/' + tgt_user +'/events'
url = base_url
url_list = [base_url]
data = []
page_no = 1
repos_data = []
total_fetched = 0
while (True):
response = requests.get(url,auth = (username,token)).json()
if type(response)==dict:
# Catch when the API returns a dict rather than expected list. Usually a credentials error message
print(response)
return
data = data + response
events_fetched = len(response)
total_fetched = events_fetched + total_fetched
if total_fetched == 300:
# Requesting more will max out the API
break
if (events_fetched == 30):
# if we fetched 30 events from this page, there will be another one after it
page_no = page_no + 1
url = base_url + '?page=' + str(page_no)
url_list.append(url)
else:
# We have collected all events by this user
break
commits_events = []
for event in data:
# We're only interested in commits, which are classed as "PushEvents"
if event["type"] == "PushEvent":
commits_events.append(event)
if len(commits_events)==0:
# If the user has no commit events, stop processing
continue
commit_usernames_list = []
for event in commits_events:
# Search through the payload for which git user is associated with each commit
commit_list = event["payload"]["commits"]
if len(commit_list)>0:
for com_n in range(len(commit_list)):
commit_username = commit_list[com_n]["author"]["name"]
commit_usernames_list.append(commit_username)
# Working on the assumption that the git user with the most commits pushed to Github by this user is the one we want
c = Counter(commit_usernames_list)
most_common_username = c.most_common(1)[0][0]
user_commits = []
# Go back through the commits and pull only the ones by the most common git username
for event in commits_events:
commit_list = event["payload"]["commits"]
commit_list_author = []
if len(commit_list)>0:
for com_n in range(len(commit_list)):
commit_username = commit_list[com_n]["author"]["name"]
if commit_username == most_common_username:
commit_list_author.append(commit_list[com_n])
if commit_list_author:
event["payload"]["commits"] = commit_list_author
user_commits.append(event)
# Extract the information we're interested in and put it in a pandas DataFrame
for commit in user_commits:
for com_n in range(len(commit["payload"]["commits"])):
commit_detail = commit["payload"]["commits"][com_n]
commit_all_details = requests.get(commit_detail["url"],auth = (username,token)).json()
extensions = ""
if 'files' not in commit_all_details.keys():
continue
for file in commit_all_details['files']:
extensions = extensions + file['filename'].split('.')[-1] + ', '
commit_subset = {"id": commit["id"],
"datetime" : commit["created_at"],
"sha" : commit_detail["sha"],
"message" : commit_detail["message"],
"author" : commit_detail["author"]["name"],
"url": commit_detail["url"],
"repo": commit["repo"]["name"],
"extensions": extensions[:-2]}
df = df.append(commit_subset, ignore_index=True)
df.index = pd.DatetimeIndex(df.datetime)
df = df.drop_duplicates(subset=['sha'])
if verbose:
print(f"{tgt_user}: found {len(data)} events containing {len(user_commits)} unique commits by {most_common_username}\n")
return df
users = ["callumrollo"]
df = gh_scrape(users, cred_file = 'credentials-secret.json')
df.head()
callumrollo: found 299 events containing 182 unique commits by Callum Rollo
author | datetime | extensions | id | message | repo | sha | url | |
---|---|---|---|---|---|---|---|---|
datetime | ||||||||
2020-10-16 17:42:00+00:00 | Callum Rollo | 2020-10-16T17:42:00Z | html, html, html, html, html, html, html, html... | 13873691257 | Generate Pelican site | ueapy/ueapy.github.io | 0b70258f3668dbd25fe8e9681b9324af62e3066f | https://api.github.com/repos/ueapy/ueapy.githu... |
2020-10-16 17:41:23+00:00 | Callum Rollo | 2020-10-16T17:41:23Z | md, png, ipynb | 13873684842 | Added notebook on webscraping | ueapy/ueapy.github.io | 3364fd1e786612c16b4e999cbb832e9584818804 | https://api.github.com/repos/ueapy/ueapy.githu... |
2020-10-16 16:47:44+00:00 | Callum Rollo | 2020-10-16T16:47:44Z | ipynb, png, png | 13873114207 | added beam miss figure and subplot labelling | callumrollo/adcp-glider | ad7aeeac2527b92cb5a98aa8cfc134bf355571e6 | https://api.github.com/repos/callumrollo/adcp-... |
2020-10-16 16:43:06+00:00 | Callum Rollo | 2020-10-16T16:43:06Z | pdf, html | 13873061051 | Generate Pelican site | callumrollo/callumrollo.github.io | f5e76561f2e5886db6a413e079209d039f0bac74 | https://api.github.com/repos/callumrollo/callu... |
2020-10-16 16:42:30+00:00 | Callum Rollo | 2020-10-16T16:42:30Z | pdf, md | 13873054646 | added MATS poster | callumrollo/callumrollo.github.io | 30a6802786ba5b479e9cf09594e51e3e5ebcef94 | https://api.github.com/repos/callumrollo/callu... |
Success! We have the filetypes used in every commit by this author. This should tell us, in broad strokes, what programming language they are working in, and potentially many other things. Are they working on .md files for documentation? Uploading lots of .png images? Do they prefer .ipynb
notebooks to pure .py
files?
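As a quick taste of that analysis, a Counter over the extensions column gives the filetype breakdown for a scraped dataframe. This is a sketch against the comma-separated format built above (the demo frame and function name are mine):

```python
from collections import Counter

import pandas as pd

def extension_counts(df):
    """Tally file extensions across every commit in the scraped dataframe."""
    counts = Counter()
    for exts in df['extensions'].dropna():
        counts.update(e.strip() for e in exts.split(',') if e.strip())
    return counts

# Tiny example mirroring the 'extensions' column format built above
df_demo = pd.DataFrame({'extensions': ['md, png, ipynb', 'ipynb, png, png']})
print(extension_counts(df_demo).most_common())
# → [('png', 3), ('ipynb', 2), ('md', 1)]
```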
I hope this notebook has given you some ideas of the kinds of information you can grab from GitHub. It may also serve as a good reminder that everything we put on there is public and scrapable, so write helpful commit messages!
HTML(html)