name = "2020-09-10-github-scrape"
title = "Scraping GitHub after a hackweek"
tags = "requests, github, webscrape, pandas, dataviz, text processing"
author = "Callum Rollo"
from nb_tools import connect_notebook_to_post
from IPython.core.display import HTML
html = connect_notebook_to_post(name, title, tags, author)
I put this notebook together after attending the excellent Ocean Hack Week 2020 (OHW) event. You can read my blog post about it here. A key part of the event was creating collaborative projects on GitHub. I wanted to see if attending the hackweek changed participants' patterns of activity on GitHub. I thought the simplest way would be to get info on all the commits by OHW participants to OHW projects.
It goes without saying that commits ≠ work done on a project. I freely admit that many of my commits are nonsense! This is just a fun side project for me to explore the GitHub API and try some simple analysis methods.
First off, to access the GitHub API you'll need to edit the credentials file credentials.json
to supply your username and a GitHub access token.
{
"username": "<your-username>",
"token": "<your-access-token>"
}
Once you have supplied these creds, you are still limited to 5,000 requests per hour, so if you get bounced by the API, leave it some time to cool off. Without credentials the limit is far lower and you would soon generate an error message like this:
Obviously I have not provided my own credentials! I'm not sure what action GitHub would take if a DDoS on their API originated from my account, but I'm not willing to find out.
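Before scraping in earnest, it's worth checking how much of your hourly allowance remains. The API exposes this at the https://api.github.com/rate_limit endpoint; the helper below is a minimal sketch that just parses the relevant field out of that response (the sample dict mimics the shape of the real response, with made-up values).

```python
def remaining_requests(rate_limit_json):
    """Pull the remaining core-API request count out of a /rate_limit response."""
    return rate_limit_json['resources']['core']['remaining']

# In practice you would fetch the response first (authenticated requests
# get the higher 5,000/hour limit):
# rate_limit_json = requests.get('https://api.github.com/rate_limit',
#                                auth=(username, token)).json()

# Sample with made-up numbers, matching the shape of the real response
sample = {'resources': {'core': {'limit': 5000, 'remaining': 4987,
                                 'reset': 1602862980}}}
print(remaining_requests(sample))
```

If the remaining count is low, sleep until the `reset` timestamp before continuing.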
import json
import requests
from collections import Counter
import pandas as pd
import numpy as np
credentials = json.loads(open('credentials-secret.json').read()) #don't forget to add your creds here!
username = credentials['username']
token = credentials['token']
For a start, let's use the API to get some details on my account
user_data = requests.get('https://api.github.com/users/' + credentials['username'],auth = (username,token)).json()
user_data
{'login': 'callumrollo', 'id': 28703282, 'node_id': 'MDQ6VXNlcjI4NzAzMjgy', 'avatar_url': 'https://avatars0.githubusercontent.com/u/28703282?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/callumrollo', 'html_url': 'https://github.com/callumrollo', 'followers_url': 'https://api.github.com/users/callumrollo/followers', 'following_url': 'https://api.github.com/users/callumrollo/following{/other_user}', 'gists_url': 'https://api.github.com/users/callumrollo/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/callumrollo/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/callumrollo/subscriptions', 'organizations_url': 'https://api.github.com/users/callumrollo/orgs', 'repos_url': 'https://api.github.com/users/callumrollo/repos', 'events_url': 'https://api.github.com/users/callumrollo/events{/privacy}', 'received_events_url': 'https://api.github.com/users/callumrollo/received_events', 'type': 'User', 'site_admin': False, 'name': 'Callum Rollo', 'company': None, 'blog': 'https://callumrollo.github.io/', 'location': None, 'email': None, 'hireable': None, 'bio': 'Oceanographer, Pythonista and vim user. Regularly breaks things on Fridays', 'twitter_username': 'callum_rollo', 'public_repos': 36, 'public_gists': 0, 'followers': 18, 'following': 3, 'created_at': '2017-05-15T09:23:12Z', 'updated_at': '2020-10-09T14:36:52Z', 'private_gists': 0, 'total_private_repos': 7, 'owned_private_repos': 7, 'disk_usage': 389086, 'collaborators': 0, 'two_factor_authentication': True, 'plan': {'name': 'pro', 'space': 976562499, 'collaborators': 0, 'private_repos': 9999}}
This returns a lot of info. Now, I'll try the account of one of my collaborators, ocefpaf
. You can specify any user, though less information is returned than when you look at your own account.
data = requests.get('https://api.github.com/users/' + 'ocefpaf',auth = (username,token)).json()
data
{'login': 'ocefpaf', 'id': 950575, 'node_id': 'MDQ6VXNlcjk1MDU3NQ==', 'avatar_url': 'https://avatars1.githubusercontent.com/u/950575?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/ocefpaf', 'html_url': 'https://github.com/ocefpaf', 'followers_url': 'https://api.github.com/users/ocefpaf/followers', 'following_url': 'https://api.github.com/users/ocefpaf/following{/other_user}', 'gists_url': 'https://api.github.com/users/ocefpaf/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/ocefpaf/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/ocefpaf/subscriptions', 'organizations_url': 'https://api.github.com/users/ocefpaf/orgs', 'repos_url': 'https://api.github.com/users/ocefpaf/repos', 'events_url': 'https://api.github.com/users/ocefpaf/events{/privacy}', 'received_events_url': 'https://api.github.com/users/ocefpaf/received_events', 'type': 'User', 'site_admin': False, 'name': 'Filipe', 'company': None, 'blog': 'http://ocefpaf.github.io/python4oceanographers', 'location': 'Florianópolis, SC', 'email': 'ocefpaf@gmail.com', 'hireable': True, 'bio': 'Physical oceanographer turned research software engineer, and software packaging hobbyist.', 'twitter_username': 'ocefpaf', 'public_repos': 1190, 'public_gists': 133, 'followers': 395, 'following': 4, 'created_at': '2011-07-31T23:10:26Z', 'updated_at': '2020-10-10T15:23:18Z'}
We can see a user's core stats. How about their commits and other actions? Simply append /events
to the request query.
data = requests.get('https://api.github.com/users/' + 'callumrollo' +'/events',auth = (username,token)).json()
data[0]
{'id': '13873691257', 'type': 'PushEvent', 'actor': {'id': 28703282, 'login': 'callumrollo', 'display_login': 'callumrollo', 'gravatar_id': '', 'url': 'https://api.github.com/users/callumrollo', 'avatar_url': 'https://avatars.githubusercontent.com/u/28703282?'}, 'repo': {'id': 48320187, 'name': 'ueapy/ueapy.github.io', 'url': 'https://api.github.com/repos/ueapy/ueapy.github.io'}, 'payload': {'push_id': 5868625550, 'size': 1, 'distinct_size': 1, 'ref': 'refs/heads/master', 'head': '0b70258f3668dbd25fe8e9681b9324af62e3066f', 'before': 'f9c3596cef28b24cfd4d7c27ba91a1c038c943b6', 'commits': [{'sha': '0b70258f3668dbd25fe8e9681b9324af62e3066f', 'author': {'email': 'c.rollo@outlook.com', 'name': 'Callum Rollo'}, 'message': 'Generate Pelican site', 'distinct': True, 'url': 'https://api.github.com/repos/ueapy/ueapy.github.io/commits/0b70258f3668dbd25fe8e9681b9324af62e3066f'}]}, 'public': True, 'created_at': '2020-10-16T17:42:00Z', 'org': {'id': 15898025, 'login': 'ueapy', 'gravatar_id': '', 'url': 'https://api.github.com/orgs/ueapy', 'avatar_url': 'https://avatars.githubusercontent.com/u/15898025?'}}
Note that the GitHub API returns only the 30 most recent events per request
len(data)
30
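As an aside, you can cut down on round trips with the per_page query parameter, which the events API accepts up to a value of 100 (the overall event cap still applies). A small helper for building the paginated URLs might look like this — the function name is mine, not from the original code:

```python
def events_url(user, page=1, per_page=30):
    """Build a paginated events URL for a given GitHub user."""
    return (f'https://api.github.com/users/{user}/events'
            f'?per_page={per_page}&page={page}')

print(events_url('callumrollo', page=2, per_page=100))
# → https://api.github.com/users/callumrollo/events?per_page=100&page=2
```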
To get more events, we use a short loop to request subsequent pages of results. I found out the hard way that the API restricts you to 10 pages.
tgt_user = 'callumrollo'
base_url = 'https://api.github.com/users/' + tgt_user +'/events'
url = base_url
url_list = [base_url]
data = []
page_no = 1
repos_data = []
total_fetched = 0
while (True):
response = requests.get(url,auth = (username,token)).json()
data = data + response
events_fetched = len(response)
total_fetched = events_fetched + total_fetched
print(f"Page: {page_no} total events fetched: {total_fetched}")
if total_fetched == 300:
print(f"\nAPI maxed out! https://docs.github.com/v3/#pagination\n\
returning only most recent 300 events by {tgt_user}")
print(f"\nevents span the range \n{data[-1]['created_at']}\n{data[0]['created_at']}")
break
if (events_fetched == 30):
page_no = page_no + 1
url = base_url + '?page=' + str(page_no)
url_list.append(url)
else:
print(f"\n{tgt_user}: all your events are belong to us now")
print(f"\nevents span the range \n{data[-1]['created_at']}\n{data[0]['created_at']}")
break
Page: 1 total events fetched: 30 Page: 2 total events fetched: 60 Page: 3 total events fetched: 90 Page: 4 total events fetched: 120 Page: 5 total events fetched: 150 Page: 6 total events fetched: 180 Page: 7 total events fetched: 210 Page: 8 total events fetched: 240 Page: 9 total events fetched: 270 Page: 10 total events fetched: 299 callumrollo: all your events are belong to us now events span the range 2020-07-22T15:01:25Z 2020-10-16T17:42:00Z
This log includes all event types: commits, issues, PRs, forks, stars etc. We are only interested in commits, which are recorded with type PushEvent
in each JSON entry.
for event in data[-10:]:
print(event["type"])
CreateEvent CreateEvent WatchEvent WatchEvent WatchEvent CreateEvent PushEvent PushEvent PushEvent PushEvent
commit_events = []
for event in data:
if event["type"] == "PushEvent":
commit_events.append(event)
len(commit_events)
186
There are some complications. Not all of the commits in these events are by the GitHub user we are querying. For instance, some are commits by other users that our target user has merged in.
To work around this, we look through the payload of each PushEvent
and retain only the commits associated with the user we are interested in.
N.B. this approach will only work if the GitHub account's display name exactly matches the name the author uses for their git commits. We apply a workaround for when this is not the case later.
tgt_username = requests.get('https://api.github.com/users/' + credentials['username'],
auth = (username,token)).json()["name"]
tgt_username
'Callum Rollo'
user_commits = []
for event in commit_events:
commit_list = event["payload"]["commits"]
commit_list_author = []
if len(commit_list)>0:
for com_n in range(len(commit_list)):
commit_username = commit_list[com_n]["author"]["name"]
#print(commit_username)
if commit_username == tgt_username:
commit_list_author.append(commit_list[com_n])
if commit_list_author:
event["payload"]["commits"] = commit_list_author
user_commits.append(event)
len(user_commits)
182
We can print out the names associated with the commits we have selected to confirm:
commit_names = []
for event in user_commits:
commit_list = event["payload"]["commits"]
for com_n in range(len(commit_list)):
commit_username = commit_list[com_n]["author"]["name"]
commit_names.append(commit_username)
print("git usernames and number of commits:")
Counter(commit_names).most_common()
git usernames and number of commits:
[('Callum Rollo', 232)]
print(f"From {len(data)} events we have extracted {len(commit_names)} commits by {tgt_username}")
From 299 events we have extracted 232 commits by Callum Rollo
The final step of this (almost certainly imperfect) data cleaning is to get info on all the commits by this user. We will pull the author, message, SHA, URL, repo and date into a pandas dataframe.
df = pd.DataFrame()
for event in user_commits:
for commit in event["payload"]["commits"]:
commit_subset = {"id": event["id"],
"datetime" : event["created_at"],
"sha" : commit["sha"],
"message" : commit["message"],
"author" : commit["author"]["name"],
"url": commit["url"],
"repo": event["repo"]["name"]}
df = df.append(commit_subset, ignore_index=True)
We index by datetime and have a look at our dataframe
df.index = pd.DatetimeIndex(df.datetime)
df = df.drop("datetime", axis=1)
df.head()
author | id | message | repo | sha | url | |
---|---|---|---|---|---|---|
datetime | ||||||
2020-10-16 17:42:00+00:00 | Callum Rollo | 13873691257 | Generate Pelican site | ueapy/ueapy.github.io | 0b70258f3668dbd25fe8e9681b9324af62e3066f | https://api.github.com/repos/ueapy/ueapy.githu... |
2020-10-16 17:41:23+00:00 | Callum Rollo | 13873684842 | Added notebook on webscraping | ueapy/ueapy.github.io | 3364fd1e786612c16b4e999cbb832e9584818804 | https://api.github.com/repos/ueapy/ueapy.githu... |
2020-10-16 16:47:44+00:00 | Callum Rollo | 13873114207 | added beam miss figure and subplot labelling | callumrollo/adcp-glider | ad7aeeac2527b92cb5a98aa8cfc134bf355571e6 | https://api.github.com/repos/callumrollo/adcp-... |
2020-10-16 16:43:06+00:00 | Callum Rollo | 13873061051 | Generate Pelican site | callumrollo/callumrollo.github.io | f5e76561f2e5886db6a413e079209d039f0bac74 | https://api.github.com/repos/callumrollo/callu... |
2020-10-16 16:42:30+00:00 | Callum Rollo | 13873054646 | added MATS poster | callumrollo/callumrollo.github.io | 30a6802786ba5b479e9cf09594e51e3e5ebcef94 | https://api.github.com/repos/callumrollo/callu... |
We can see an uncharacteristically helpful set of commit messages, and the meta event of a commit I made to this very notebook.
Now we remove any repeated commits that may have snuck in, by deduplicating on the SHA checksum.
Side note: the SHA checksum uniquely identifies each commit. Even if you had commits by the same author to the same repo with the same message ("added stuff" or something similarly helpful), the SHA will differentiate the two. See more here
df = df.drop_duplicates(subset=['sha'])
df.head()
author | id | message | repo | sha | url | |
---|---|---|---|---|---|---|
datetime | ||||||
2020-10-16 17:42:00+00:00 | Callum Rollo | 13873691257 | Generate Pelican site | ueapy/ueapy.github.io | 0b70258f3668dbd25fe8e9681b9324af62e3066f | https://api.github.com/repos/ueapy/ueapy.githu... |
2020-10-16 17:41:23+00:00 | Callum Rollo | 13873684842 | Added notebook on webscraping | ueapy/ueapy.github.io | 3364fd1e786612c16b4e999cbb832e9584818804 | https://api.github.com/repos/ueapy/ueapy.githu... |
2020-10-16 16:47:44+00:00 | Callum Rollo | 13873114207 | added beam miss figure and subplot labelling | callumrollo/adcp-glider | ad7aeeac2527b92cb5a98aa8cfc134bf355571e6 | https://api.github.com/repos/callumrollo/adcp-... |
2020-10-16 16:43:06+00:00 | Callum Rollo | 13873061051 | Generate Pelican site | callumrollo/callumrollo.github.io | f5e76561f2e5886db6a413e079209d039f0bac74 | https://api.github.com/repos/callumrollo/callu... |
2020-10-16 16:42:30+00:00 | Callum Rollo | 13873054646 | added MATS poster | callumrollo/callumrollo.github.io | 30a6802786ba5b479e9cf09594e51e3e5ebcef94 | https://api.github.com/repos/callumrollo/callu... |
Now that we have a method for finding commits by a user, the next step is to loop through a list of users. As a test case, I have analysed the commits from Oceanhackweek 2020
def gh_scrape(tgt_users, cred_file = 'credentials.json', verbose=True):
"""Simple scraping function
Supply a list of GitHub usernames ['jane-doe', 'torvalds', 'satoshi_nakamoto']
Returns a dataframe of unique commits by these users over the last 90 days
Assumes that the github user is the user with most git commits associated with their github profile
Rate limited by the Github API to 300 events
Requires you to supply a Github API token in a credentials.json file
Verbose switch prints a line for each user with the number of events and commits found
Returns a pandas dataframe of commit info for all usernames in supplied list
"""
# Get user supplied credentials for the github API
credentials = json.loads(open(cred_file).read())
username = credentials['username']
token = credentials['token']
df = pd.DataFrame()
for tgt_user in tgt_users:
base_url = 'https://api.github.com/users/' + tgt_user +'/events'
url = base_url
url_list = [base_url]
data = []
page_no = 1
repos_data = []
total_fetched = 0
while (True):
response = requests.get(url,auth = (username,token)).json()
if type(response)==dict:
# Catch when the API returns a dict rather than expected list. Usually a credentials error message
print(response)
return
data = data + response
events_fetched = len(response)
total_fetched = events_fetched + total_fetched
if total_fetched == 300:
# Requesting more will max out the API
break
if (events_fetched == 30):
# if we fetched 30 events from this page, there will be another one after it
page_no = page_no + 1
url = base_url + '?page=' + str(page_no)
url_list.append(url)
else:
# We have collected all events by this user
break
commits_events = []
for event in data:
# We're only interested in commits, which are classed as "PushEvents"
if event["type"] == "PushEvent":
commits_events.append(event)
if len(commits_events)==0:
# If the user has no commit events, stop processing
continue
commit_usernames_list = []
for event in commits_events:
# Search through the payload for which git user is associated with each commit
commit_list = event["payload"]["commits"]
if len(commit_list)>0:
for com_n in range(len(commit_list)):
commit_username = commit_list[com_n]["author"]["name"]
commit_usernames_list.append(commit_username)
# Working on the assumption that the git user with the most commits pushed to Github by this user is the one we want
c = Counter(commit_usernames_list)
most_common_username = c.most_common(1)[0][0]
user_commits = []
# Go back through the commits and pull only the ones by the most common git username
for event in commits_events:
commit_list = event["payload"]["commits"]
commit_list_author = []
if len(commit_list)>0:
for com_n in range(len(commit_list)):
commit_username = commit_list[com_n]["author"]["name"]
if commit_username == most_common_username:
commit_list_author.append(commit_list[com_n])
if commit_list_author:
event["payload"]["commits"] = commit_list_author
user_commits.append(event)
# Extract the information we're interested in and put it in a pandas DataFrame
for commit in user_commits:
for com_n in range(len(commit["payload"]["commits"])):
commit_detail = commit["payload"]["commits"][com_n]
commit_subset = {"id": commit["id"],
"datetime" : commit["created_at"],
"sha" : commit_detail["sha"],
"message" : commit_detail["message"],
"author" : commit_detail["author"]["name"],
"url": commit_detail["url"],
"repo": commit["repo"]["name"]}
df = df.append(commit_subset, ignore_index=True)
df.index = pd.DatetimeIndex(df.datetime)
df = df.drop_duplicates(subset=['sha'])
if verbose:
print(f"{tgt_user}: found {len(data)} events containing {len(user_commits)} unique commits by {most_common_username}\n")
return df
users = ["callumrollo"]
df = gh_scrape(users, cred_file = 'credentials-secret.json')
len(df)
callumrollo: found 299 events containing 182 unique commits by Callum Rollo
210
Quite a few commits. What if we want solely the hackweek ones?
df['ohw20_repo'] = df['repo'].str.contains("ohw20")
sum(df['ohw20_repo'] )
23
Using the above function and a list of hackweek participants (not included), I grab GitHub commits from the last 90 days.
import csv
from itertools import chain
with open('ohw_participants.csv', newline='') as f:
nest_list = list(csv.reader(f))
ohw_participants_list = list(chain.from_iterable(nest_list))
df_all = gh_scrape(ohw_participants_list, cred_file='credentials-secret.json', verbose=False)
len(df_all)
1356
As you saw when I grabbed the data just from my username, it contains a lot of identifying information. I have anonymised and saved this data for a later analysis notebook.
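The anonymisation step itself isn't shown here, but a minimal sketch (my actual approach may differ) could replace the identifying columns with short, stable SHA-256 digests — deterministic, so the same author always maps to the same label:

```python
import hashlib

import pandas as pd

def anonymise(df, cols=('author',)):
    """Replace identifying columns with short, stable SHA-256 digests."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].map(
            lambda s: hashlib.sha256(s.encode()).hexdigest()[:10])
    return out

# Tiny demo frame mirroring the columns built above
df_demo = pd.DataFrame({'author': ['Callum Rollo', 'Filipe'],
                        'message': ['Generate Pelican site', 'fix typo']})
df_anon = anonymise(df_demo)
```

Note that short digests of public usernames are pseudonymous rather than truly anonymous: anyone with the participant list could recompute them.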
We can get all the details of a commit by delving deeper into the JSON structure accessed through the commit URL. For example:
url = event['payload']['commits'][0]['url']
commit_detail = requests.get(url,auth = (username,token)).json()
commit_detail.keys()
dict_keys(['sha', 'node_id', 'commit', 'url', 'html_url', 'comments_url', 'author', 'committer', 'parents', 'stats', 'files'])
The most interesting section is files
. This gives a summary of the lines changed in each file altered by this commit.
'files' in commit_detail.keys()
True
commit_detail['files']
[{'sha': 'c90c6b45472c3d1a962024805ce9bacac967066c', 'filename': 'dev-environment.yml', 'status': 'added', 'additions': 13, 'deletions': 0, 'changes': 13, 'blob_url': 'https://github.com/callumrollo/geotiff-generator/blob/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/dev-environment.yml', 'raw_url': 'https://github.com/callumrollo/geotiff-generator/raw/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/dev-environment.yml', 'contents_url': 'https://api.github.com/repos/callumrollo/geotiff-generator/contents/dev-environment.yml?ref=1b7056d0bd435af6ca6eec9414dd4dc3cd39d587', 'patch': '@@ -0,0 +1,13 @@\n+name: geotiff-test\n+channels:\n+ - conda-forge\n+dependencies:\n+ - python=3.7\n+ - numpy\n+ - pandas\n+ - xarray\n+ - netcdf4\n+ - gdal\n+ - jupyter\n+ - ipython\n+ - pytest'}, {'sha': '8ddd4f0dc27b9251adaa755485f754ded4ddc3f3', 'filename': 'geotiff_gen.py', 'status': 'modified', 'additions': 0, 'deletions': 1, 'changes': 1, 'blob_url': 'https://github.com/callumrollo/geotiff-generator/blob/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/geotiff_gen.py', 'raw_url': 'https://github.com/callumrollo/geotiff-generator/raw/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/geotiff_gen.py', 'contents_url': 'https://api.github.com/repos/callumrollo/geotiff-generator/contents/geotiff_gen.py?ref=1b7056d0bd435af6ca6eec9414dd4dc3cd39d587', 'patch': '@@ -13,7 +13,6 @@\n import xarray as xr\n from netCDF4 import Dataset\n import copy\n-import glob\n from pathlib import Path\n from osgeo import gdal, osr\n '}, {'sha': '1d901078bfb7d640216b52bc1c76b44bf86718a0', 'filename': 'test_geotiff.py', 'status': 'added', 'additions': 1, 'deletions': 0, 'changes': 1, 'blob_url': 'https://github.com/callumrollo/geotiff-generator/blob/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/test_geotiff.py', 'raw_url': 'https://github.com/callumrollo/geotiff-generator/raw/1b7056d0bd435af6ca6eec9414dd4dc3cd39d587/test_geotiff.py', 'contents_url': 
'https://api.github.com/repos/callumrollo/geotiff-generator/contents/test_geotiff.py?ref=1b7056d0bd435af6ca6eec9414dd4dc3cd39d587', 'patch': '@@ -0,0 +1 @@\n+from geotiff_gen import\n\\ No newline at end of file'}]
This was originally Filipe's idea.
This requires a small extension to our scraping function. Because we want details at the commit level, it makes an additional API call for every single commit, adding quite a time penalty. A good target for optimisation with async methods, perhaps? This is left as an exercise for the reader.
def gh_scrape(tgt_users, cred_file = 'credentials.json', verbose=True):
"""Simple scraping function
Supply a list of GitHub usernames ['jane-doe', 'torvalds', 'satoshi_nakamoto']
Returns a dataframe of unique commits by these users over the last 90 days
Assumes that the github user is the user with most git commits associated with their github profile
Rate limited by the Github API to 300 events
Requires you to supply a Github API token in a credentials.json file
Verbose switch prints a line for each user with the number of events and commits found
Returns a pandas dataframe of commit info for all usernames in supplied list
"""
# Get user supplied credentials for the github API
credentials = json.loads(open(cred_file).read())
username = credentials['username']
token = credentials['token']
df = pd.DataFrame()
for tgt_user in tgt_users:
print(tgt_user)
base_url = 'https://api.github.com/users/' + tgt_user +'/events'
url = base_url
url_list = [base_url]
data = []
page_no = 1
repos_data = []
total_fetched = 0
while (True):
response = requests.get(url,auth = (username,token)).json()
if type(response)==dict:
# Catch when the API returns a dict rather than expected list. Usually a credentials error message
print(response)
return
data = data + response
events_fetched = len(response)
total_fetched = events_fetched + total_fetched
if total_fetched == 300:
# Requesting more will max out the API
break
if (events_fetched == 30):
# if we fetched 30 events from this page, there will be another one after it
page_no = page_no + 1
url = base_url + '?page=' + str(page_no)
url_list.append(url)
else:
# We have collected all events by this user
break
commits_events = []
for event in data:
# We're only interested in commits, which are classed as "PushEvents"
if event["type"] == "PushEvent":
commits_events.append(event)
if len(commits_events)==0:
# If the user has no commit events, stop processing
continue
commit_usernames_list = []
for event in commits_events:
# Search through the payload for which git user is associated with each commit
commit_list = event["payload"]["commits"]
if len(commit_list)>0:
for com_n in range(len(commit_list)):
commit_username = commit_list[com_n]["author"]["name"]
commit_usernames_list.append(commit_username)
# Working on the assumption that the git user with the most commits pushed to Github by this user is the one we want
c = Counter(commit_usernames_list)
most_common_username = c.most_common(1)[0][0]
user_commits = []
# Go back through the commits and pull only the ones by the most common git username
for event in commits_events:
commit_list = event["payload"]["commits"]
commit_list_author = []
if len(commit_list)>0:
for com_n in range(len(commit_list)):
commit_username = commit_list[com_n]["author"]["name"]
if commit_username == most_common_username:
commit_list_author.append(commit_list[com_n])
if commit_list_author:
event["payload"]["commits"] = commit_list_author
user_commits.append(event)
# Extract the information we're interested in and put it in a pandas DataFrame
for commit in user_commits:
for com_n in range(len(commit["payload"]["commits"])):
commit_detail = commit["payload"]["commits"][com_n]
commit_all_details = requests.get(commit_detail["url"],auth = (username,token)).json()
extensions = ""
if 'files' not in commit_all_details.keys():
continue
for file in commit_all_details['files']:
extensions = extensions + file['filename'].split('.')[-1] + ', '
commit_subset = {"id": commit["id"],
"datetime" : commit["created_at"],
"sha" : commit_detail["sha"],
"message" : commit_detail["message"],
"author" : commit_detail["author"]["name"],
"url": commit_detail["url"],
"repo": commit["repo"]["name"],
"extensions": extensions[:-2]}
df = df.append(commit_subset, ignore_index=True)
df.index = pd.DatetimeIndex(df.datetime)
df = df.drop_duplicates(subset=['sha'])
if verbose:
print(f"{tgt_user}: found {len(data)} events containing {len(user_commits)} unique commits by {most_common_username}\n")
return df
users = ["callumrollo"]
df = gh_scrape(users, cred_file = 'credentials-secret.json')
df.head()
callumrollo: found 299 events containing 182 unique commits by Callum Rollo
author | datetime | extensions | id | message | repo | sha | url | |
---|---|---|---|---|---|---|---|---|
datetime | ||||||||
2020-10-16 17:42:00+00:00 | Callum Rollo | 2020-10-16T17:42:00Z | html, html, html, html, html, html, html, html... | 13873691257 | Generate Pelican site | ueapy/ueapy.github.io | 0b70258f3668dbd25fe8e9681b9324af62e3066f | https://api.github.com/repos/ueapy/ueapy.githu... |
2020-10-16 17:41:23+00:00 | Callum Rollo | 2020-10-16T17:41:23Z | md, png, ipynb | 13873684842 | Added notebook on webscraping | ueapy/ueapy.github.io | 3364fd1e786612c16b4e999cbb832e9584818804 | https://api.github.com/repos/ueapy/ueapy.githu... |
2020-10-16 16:47:44+00:00 | Callum Rollo | 2020-10-16T16:47:44Z | ipynb, png, png | 13873114207 | added beam miss figure and subplot labelling | callumrollo/adcp-glider | ad7aeeac2527b92cb5a98aa8cfc134bf355571e6 | https://api.github.com/repos/callumrollo/adcp-... |
2020-10-16 16:43:06+00:00 | Callum Rollo | 2020-10-16T16:43:06Z | pdf, html | 13873061051 | Generate Pelican site | callumrollo/callumrollo.github.io | f5e76561f2e5886db6a413e079209d039f0bac74 | https://api.github.com/repos/callumrollo/callu... |
2020-10-16 16:42:30+00:00 | Callum Rollo | 2020-10-16T16:42:30Z | pdf, md | 13873054646 | added MATS poster | callumrollo/callumrollo.github.io | 30a6802786ba5b479e9cf09594e51e3e5ebcef94 | https://api.github.com/repos/callumrollo/callu... |
Success! We have the filetypes used in every commit by this author. This should tell us, in broad strokes, what programming language they are working in, and potentially many other things. Are they working on .md files for documentation? Uploading lots of .png images? Do they prefer .ipynb
notebooks to pure .py
files?
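As a quick taste of that analysis, a Counter over the extensions column gives the filetype breakdown for a scraped dataframe. This is a sketch against the comma-separated format built above (the demo frame and function name are mine):

```python
from collections import Counter

import pandas as pd

def extension_counts(df):
    """Tally file extensions across every commit in the scraped dataframe."""
    counts = Counter()
    for exts in df['extensions'].dropna():
        counts.update(e.strip() for e in exts.split(',') if e.strip())
    return counts

# Tiny example mirroring the 'extensions' column format built above
df_demo = pd.DataFrame({'extensions': ['md, png, ipynb', 'ipynb, png, png']})
print(extension_counts(df_demo).most_common())
# → [('png', 3), ('ipynb', 2), ('md', 1)]
```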
I hope this notebook has given you some ideas of the kinds of information you can grab from GitHub. It may also serve as a good reminder that everything we put on there is public and scrapable, so write helpful commit messages!
HTML(html)