name = "2020-09-11-github-analysis"
title = "Analysis of GitHub commits"
tags = "github, webscrape, pandas, wordcloud, nlp"
author = "Callum Rollo"
from nb_tools import connect_notebook_to_post
from IPython.core.display import HTML
html = connect_notebook_to_post(name, title, tags, author)
To see where we got this dataset from, check out our last notebook on scraping GitHub.
To get plotting, we start by importing the holy trinity of Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
We read the anonymised data into a pandas DataFrame and make a second DataFrame containing only commits to OHW20 repos.
df = pd.read_csv('ohw_anonymised.csv', index_col='datetime', parse_dates=True)
df_ohw20 = df[df.ohw20_repo]
df
datetime | author | extensions | message | ohw20_repo |
---|---|---|---|---|
2020-08-20 15:08:09+00:00 | participant-21 | md | Merge pull request #1 from cbirdferrer/patch-1... | False |
2020-08-17 16:24:17+00:00 | participant-21 | ipynb | deleting old presentation_figs file | True |
2020-08-14 20:00:38+00:00 | participant-21 | ipynb, ipynb | adding improvements to interpolate notebook to... | True |
2020-08-14 20:00:38+00:00 | participant-21 | ipynb | merging changes | True |
2020-08-14 20:00:38+00:00 | participant-21 | ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipyn... | merging interpolate notebook | True |
... | ... | ... | ... | ... |
2020-08-07 15:58:21+00:00 | participant-33 | md | change read.md | False |
2020-08-06 22:26:24+00:00 | participant-33 | md, md | change new2.md and create new.md | False |
2020-08-06 22:25:02+00:00 | participant-33 | md, md | changes in new.md and new2.md | False |
2020-08-06 22:20:59+00:00 | participant-33 | md, md, md | edit readme and new, create new2.md | False |
2020-08-06 22:13:07+00:00 | participant-33 | md, md | edit readme and add new.md | False |
1483 rows × 4 columns
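As a minimal sketch of the filtering step, here is the same boolean-column mask applied to a small made-up DataFrame. The rows below are hypothetical and only mimic the column layout of `ohw_anonymised.csv`; they are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy rows with the same columns as the real CSV
toy = pd.DataFrame(
    {
        "author": ["participant-1", "participant-2", "participant-1"],
        "extensions": ["ipynb", "md", "py, ipynb"],
        "message": ["add notebook", "edit readme", "tidy up"],
        "ohw20_repo": [True, False, True],
    },
    index=pd.to_datetime(
        ["2020-08-10 09:00", "2020-08-11 10:30", "2020-08-14 20:00"], utc=True
    ),
)
toy.index.name = "datetime"

# A boolean column can be used directly as a row mask
toy_ohw20 = toy[toy.ohw20_repo]
print(toy_ohw20.author.tolist())
```

Only the rows where `ohw20_repo` is `True` survive, which is all that `df[df.ohw20_repo]` does on the full data.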
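# DatetimeIndex.week is deprecated; isocalendar().week gives the same ISO week number
weeks = df.index.isocalendar().week
weekly_commits = df.author.groupby(weeks).count()
df_hw = df[weeks == 33]  # the hackweek fell in ISO week 33
hw_commits = df_hw.author.groupby(df_hw.index.isocalendar().week).count()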
plt.rcParams.update({'font.size': 18})
fig, ax = plt.subplots(figsize=(15,10))
ax.bar(weekly_commits.index, weekly_commits.values, label='Weeks of 2020')
ax.bar(hw_commits.index, hw_commits.values, label='OHW20')
ax.legend()
ax.set(xlabel='week of 2020', ylabel='Commits by OHW participants')
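Grouping by week number is one way to get weekly counts. Another option, sketched here on made-up timestamps, is `resample("W")`, which bins by calendar week and labels each bin with the Sunday that ends it. The toy data is hypothetical, and the index is sorted first because the real DataFrame is ordered by author rather than by time.

```python
import pandas as pd

# Hypothetical commit timestamps: one in the week before the hackweek, three during it
stamps = pd.to_datetime(
    ["2020-08-12 15:00", "2020-08-03 12:00", "2020-08-10 09:00", "2020-08-15 18:00"],
    utc=True,
)
toy = pd.DataFrame({"author": ["a", "b", "a", "c"]}, index=stamps)

# Sort by time, then count commits in each calendar week
weekly = toy.sort_index().author.resample("W").count()
print(weekly)
```

The resulting index holds real dates rather than week numbers, which can make the x-axis of a bar chart easier to read.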
daily_commits = df_hw.author.groupby(df_hw.index.day).count()
fig, ax = plt.subplots(figsize=(12,8))
ax.bar(daily_commits.index, daily_commits.values)
ax.set(xlabel='day of hackweek', ylabel='Commits by OHW participants')
ax.set_xticks(np.arange(10,17))
ax.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']);
df_friday = df_hw[df_hw.index.day==14]
friday_commits = df_friday.author.groupby(df_friday.index.hour).count()
fig, ax = plt.subplots(figsize=(12,8))
ax.bar(friday_commits.index, friday_commits.values)
ax.set(xlabel='Hour of Friday (UTC)', ylabel='Commits by OHW participants')
Hopefully an indication of Oceanhackweek's global participation, not some sorely sleep-deprived coders!
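To get a feel for what a UTC hour means locally, the timestamps can be shifted with `tz_convert`. This is only a sketch: the dataset does not record where participants were, so the two time zones and the three timestamps below are assumptions picked for illustration.

```python
import pandas as pd

# Hypothetical commit times on the Friday of the hackweek, in UTC
utc_stamps = pd.to_datetime(
    ["2020-08-14 02:00", "2020-08-14 16:00", "2020-08-14 23:30"], utc=True
)

# Assumed participant time zones, chosen only for illustration
zones = ["Australia/Sydney", "America/Los_Angeles"]

# Local hour of each commit in each assumed zone
local_hours = {zone: list(utc_stamps.tz_convert(zone).hour) for zone in zones}
for zone, hours in local_hours.items():
    print(zone, hours)
```

A commit at 02:00 UTC looks like a late night from Europe, but it is midday in Sydney and early evening on the US west coast.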
# severely unpythonic, but it works: count the distinct authors committing in each week
unique_users = []
for week in weekly_commits.index:
    sub_df = df[df.index.isocalendar().week == week]
    unique_users.append(sub_df.author.nunique())
fig, ax = plt.subplots(figsize=(15,10))
ax.bar(weekly_commits.index, unique_users)
ax.bar(33, max(unique_users))
ax.set(xlabel='week of 2020', ylabel='Unique OHW20 participants with commits')
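The per-week loop can also be written as a single groupby with `nunique`. A minimal sketch on hypothetical data, assuming pandas 1.1 or later for `isocalendar()`:

```python
import pandas as pd

# Hypothetical commits: author "a" commits in two weeks, "b" only in the second
stamps = pd.to_datetime(
    ["2020-08-03 12:00", "2020-08-10 09:00", "2020-08-10 11:00", "2020-08-12 15:00"],
    utc=True,
)
toy = pd.DataFrame({"author": ["a", "a", "a", "b"]}, index=stamps)

# ISO week number of every commit, then distinct authors per week
weeks = toy.index.isocalendar().week
unique_per_week = toy.author.groupby(weeks).nunique()
print(unique_per_week)
```

Repeated commits by the same author within a week are counted once, which is what the loop achieves with `groupby('author')`.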
from collections import Counter
# df_ohw20 is already restricted to OHW20 repos, so no second filter is needed
c = Counter(df_ohw20.author)
MVP = c.most_common()
# unzip the (author, count) pairs into two parallel tuples
mvp_names, mvp_nums = zip(*MVP)
fig, ax = plt.subplots(figsize=(14,8))
ax.bar(mvp_names, mvp_nums)
ax.set(xticks=[], ylabel='Number of commits per participant');
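pandas can produce the same ranking without `Counter`. The sketch below uses `value_counts` on a hypothetical list of authors; it returns the counts already sorted from most to least active.

```python
import pandas as pd

# Hypothetical author column
authors = pd.Series(
    [
        "participant-2",
        "participant-1",
        "participant-2",
        "participant-3",
        "participant-2",
        "participant-1",
    ]
)

# Commits per author, sorted in descending order
commit_counts = authors.value_counts()
print(commit_counts)
```

The index and values of `commit_counts` can be passed straight to `ax.bar`, playing the role of `mvp_names` and `mvp_nums`.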
A basic bit of NLP with Counter and the WordCloud library
What were the most common file types pushed to git?
from collections import Counter
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
all_commit_strings = ' '.join(df_ohw20.message)  # join with spaces so words from adjacent messages don't fuse
extensions = ''.join(str(df_ohw20.extensions.values))
extensions
"['ipynb' 'ipynb, ipynb' 'ipynb'\n 'ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, py, ipynb, ipynb, ipynb, py'\n 'ipynb' 'ipynb' 'ipynb' 'gitignore' 'ipynb' 'ipynb' 'ipynb' 'ipynb' 'py'\n 'py' 'ipynb' 'py' 'py, py, py, ipynb' 'py' 'py, py'\n 'py, gitkeep, py, py, gitkeep, py, py, py, gitkeep, py, py, py, gitkeep, py, py, py, py, py'\n 'md, binder/Dockerfile, txt, yml, json, binder/postBuild, binder/start'\n 'gitignore, Makefile, docs/Makefile, rst, py, rst, rst, bat, ipynb, gitkeep, gitkeep, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, gitkeep, gitkeep, gitkeep, txt, py, py, gitkeep, py, py, gitkeep, py, py, gitkeep, py, py, py, gitkeep, py, py, nc, py, ini'\n 'py' 'md' 'Rhistory, gitignore, Rmd, html, Rmd, ipynb, R, png, R, Rmd'\n 'ipynb' 'ipynb' 'ipynb' 'ipynb, ipynb, html, pkl' 'mat, ipynb'\n 'ipynb, ipynb' 'ipynb' 'ipynb' 'md' 'ipynb' 'ipynb' 'ipynb' 'yml'\n 'ipynb, ipynb, md, nc, ipynb, ipynb' 'ipynb, ipynb' 'yml'\n 'ipynb, png, ipynb' 'yml' 'md, yml, postBuild' 'md, yml'\n 'nc, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb'\n 'postBuild' 'yml' 'yml' 'yml' 'ipynb' 'ipynb' 'ipynb' 'gitignore'\n 'ipynb, ipynb' 'angel' 'py, ipynb' 'py, ipynb' 'py, py, ipynb' 'ipynb'\n 'yml' 'yml' 'yml' 'py' 'py, ipynb' 'py' 'py, ipynb' 'ipynb' 'py' 'md'\n 'ipynb' 'gitignore, yml, py' 'ipynb' 'ipynb' 'ipynb' 'ipynb' 'ipynb'\n 'ipynb' 'ipynb' 'ipynb' 'ipynb' 'ipynb' 'ipynb, ipynb' 'ipynb'\n 'ipynb, ipynb, ipynb, ipynb' 'ipynb' 'ipynb' 'gitignore, ipynb' 'ipynb'\n 'ipynb' 'ipynb' 'ipynb' 'ipynb' 'ipynb' 'ipynb, ipynb, ipynb, ipynb' 'md'\n 'ipynb, ipynb' 'ipynb, ipynb' 'ipynb, ipynb' 'ipynb, ipynb' 'Rmd'\n 'gitignore, R, ipynb' 'ipynb' 'R, Rmd' 'R, R' 'Rmd, geojson' 'ipynb'\n 'R, Rmd' 'ipynb' 'Rmd, html' 'ipynb' 'ipynb' 'nc, nc'\n 'ipynb, ipynb, nc, xyz, ipynb, ipynb, ipynb, ipynb, ipynb, xyz, ipynb'\n 'ipynb, ipynb, ipynb, nc, nc, xyz, ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, nc'\n 'ipynb' 'ipynb' 
'ipynb, ipynb, ipynb' 'yml' 'yml' 'yml' 'ipynb' 'ipynb'\n 'ipynb' 'ipynb, html, html' 'ipynb' 'ipynb, csv' 'ipynb' 'ipynb, ipynb'\n 'ipynb' nan 'ipynb' 'ipynb' 'md' 'md' 'ipynb' 'md' 'ipynb, ipynb' 'ipynb'\n 'ipynb' 'ipynb' 'yml' 'ipynb'\n 'ipynb, ipynb, ipynb, nc, nc, ipynb, ipynb, ipynb, nc, nc, nc, nc, nc, nc, nc, nc, nc, nc, nc, nc'\n 'ipynb' 'ipynb' 'ipynb' 'ipynb' 'ipynb' 'ipynb, ipynb, ipynb, ipynb'\n 'ipynb, ipynb, ipynb, ipynb' 'ipynb' 'ipynb' 'ipynb, nc, xyz'\n 'ipynb, nc, xyz' 'ipynb' 'ipynb' 'ipynb' 'ipynb' 'yml, yml'\n 'ipynb, ipynb' 'ipynb, ipynb, ipynb, md, yml' 'ipynb, ipynb' 'md' 'ipynb'\n 'ipynb' 'ipynb' 'ipynb' 'ipynb' 'py' 'ipynb' 'ipynb'\n 'ipynb, ipynb, ipynb, ipynb' 'ipynb' 'ipynb, ipynb, ipynb, ipynb, ipynb'\n 'ipynb' 'ipynb' 'ipynb, ipynb' 'ipynb' 'ipynb, ipynb, ipynb, nc'\n 'ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, csv, ipynb, ipynb'\n 'ipynb, ipynb, ipynb, py, ipynb, ipynb' 'ipynb, ipynb' 'csv' 'R, Rmd'\n 'py' 'ipynb' 'ipynb' 'ipynb' 'ipynb' 'ipynb, py' 'ipynb' 'py' 'py' 'py'\n 'py' 'py' 'ipynb' 'ipynb' 'py' 'py' 'ipynb' 'ipynb' 'ipynb, py' 'ipynb'\n 'yml' 'cfg, yml, docs/Makefile, png, py, rst, rst, py' 'py'\n 'cfg, in, md, py, ipynb, ipynb, ipynb, txt, cfg, py' 'py'\n 'cfg, in, py, ipynb, ipynb, ipynb, txt, cfg, py' 'ipynb' 'ipynb'\n 'gitignore, yml, py, ipynb' 'ipynb' 'ipynb, ipynb' 'ipynb' 'ipynb'\n 'ipynb' 'ipynb, ipynb, ipynb, ipynb, ipynb' 'ipynb' 'ipynb' 'nc, ipynb'\n 'ipynb, ipynb, md, ipynb, ipynb' 'shp, R' 'ipynb' 'ipynb' 'ipynb'\n 'ipynb, ipynb' 'ipynb'\n 'md, ipynb, ipynb, ipynb, ipynb, yml, ipynb, ipynb, ipynb' 'py' 'ipynb'\n 'ipynb, ipynb, ipynb, ipynb, ipynb, ipynb, py, ipynb, py']"
This needs a bit of cleaning
extensions_no_newline = extensions.replace('\n','')
extensions_no_quotes = extensions_no_newline.replace("'", "")
extensions_no_commas = extensions_no_quotes.replace(",", "")
extensions_no_commas
'[ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb py ipynb ipynb ipynb py ipynb ipynb ipynb gitignore ipynb ipynb ipynb ipynb py py ipynb py py py py ipynb py py py py gitkeep py py gitkeep py py py gitkeep py py py gitkeep py py py py py md binder/Dockerfile txt yml json binder/postBuild binder/start gitignore Makefile docs/Makefile rst py rst rst bat ipynb gitkeep gitkeep ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb gitkeep gitkeep gitkeep txt py py gitkeep py py gitkeep py py gitkeep py py py gitkeep py py nc py ini py md Rhistory gitignore Rmd html Rmd ipynb R png R Rmd ipynb ipynb ipynb ipynb ipynb html pkl mat ipynb ipynb ipynb ipynb ipynb md ipynb ipynb ipynb yml ipynb ipynb md nc ipynb ipynb ipynb ipynb yml ipynb png ipynb yml md yml postBuild md yml nc ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb postBuild yml yml yml ipynb ipynb ipynb gitignore ipynb ipynb angel py ipynb py ipynb py py ipynb ipynb yml yml yml py py ipynb py py ipynb ipynb py md ipynb gitignore yml py ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb gitignore ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb md ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb Rmd gitignore R ipynb ipynb R Rmd R R Rmd geojson ipynb R Rmd ipynb Rmd html ipynb ipynb nc nc ipynb ipynb nc xyz ipynb ipynb ipynb ipynb ipynb xyz ipynb ipynb ipynb ipynb nc nc xyz ipynb ipynb ipynb ipynb ipynb ipynb nc ipynb ipynb ipynb ipynb ipynb yml yml yml ipynb ipynb ipynb ipynb html html ipynb ipynb csv ipynb ipynb ipynb ipynb nan ipynb ipynb md md ipynb md ipynb ipynb ipynb ipynb ipynb yml ipynb ipynb ipynb ipynb nc nc ipynb ipynb ipynb nc nc nc nc nc nc nc nc nc nc nc nc ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb nc xyz ipynb nc xyz ipynb ipynb ipynb ipynb yml yml ipynb ipynb ipynb ipynb ipynb md yml ipynb ipynb md ipynb 
ipynb ipynb ipynb ipynb py ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb nc ipynb ipynb ipynb ipynb ipynb ipynb csv ipynb ipynb ipynb ipynb ipynb py ipynb ipynb ipynb ipynb csv R Rmd py ipynb ipynb ipynb ipynb ipynb py ipynb py py py py py ipynb ipynb py py ipynb ipynb ipynb py ipynb yml cfg yml docs/Makefile png py rst rst py py cfg in md py ipynb ipynb ipynb txt cfg py py cfg in py ipynb ipynb ipynb txt cfg py ipynb ipynb gitignore yml py ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb ipynb nc ipynb ipynb ipynb md ipynb ipynb shp R ipynb ipynb ipynb ipynb ipynb ipynb md ipynb ipynb ipynb ipynb yml ipynb ipynb ipynb py ipynb ipynb ipynb ipynb ipynb ipynb ipynb py ipynb py]'
# drop the enclosing brackets left over from the array repr, then split on whitespace
extensions_list = extensions_no_commas.strip("[]").split()
c = Counter(extensions_list)
top16 = c.most_common(16)
# unzip the (extension, count) pairs for plotting
word, counts = zip(*top16)
fig, ax = plt.subplots(figsize=(12,8))
ax.bar(word, counts)
plt.setp(ax.get_xticklabels(), ha="right", rotation=45);
ax.set_title("Occurrence of filetypes in commits to OHW20 repos");
Jupyter notebooks (ipynb) are overwhelmingly the most popular file type.
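The string cleaning above works, but the counts can also be reached without going through the array's printed form. A minimal sketch on a hypothetical `extensions` column, assuming entries are separated by a comma and a space as in the table at the top:

```python
import pandas as pd

# Hypothetical extensions column, including one commit with no recorded files
ext = pd.Series(["ipynb", "ipynb, py", None, "md, yml, ipynb"])

# Drop missing entries, split each commit into its files, one row per file
per_file = ext.dropna().str.split(", ").explode()

# Occurrences of each extension
counts = per_file.value_counts()
print(counts)
```

Dropping missing values first keeps a stray `nan` from being counted as a file type, and no bracket or quote characters need to be stripped.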
For a slightly fancier visualisation, let's make a word cloud.
stopwords = set(STOPWORDS)
wc = WordCloud(background_color="white", max_words=16, collocations=False,
               stopwords=stopwords, contour_width=3, contour_color='steelblue')
wordcloud = wc.generate(extensions_no_commas)
fig, ax = plt.subplots(figsize=(12,8))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis("off")
How about git commit messages?
wc = WordCloud(background_color="white", max_words=33, collocations=False,
               stopwords=stopwords, contour_width=3, contour_color='steelblue')
wordcloud = wc.generate(all_commit_strings)
fig, ax = plt.subplots(figsize=(12,8))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis("off")
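A word cloud is, underneath, a word-frequency table. For anyone who wants the numbers rather than the picture, here is a rough sketch using only the standard library. The commit messages and the tiny stopword set are made up for illustration; the real analysis uses the much longer `STOPWORDS` list from the wordcloud package.

```python
import re
from collections import Counter

# Hypothetical commit messages and a small hand-picked stopword set
messages = ["Update README", "update notebook and add figure", "Add data to notebook"]
stop = {"and", "to", "the", "a"}

# Join with spaces so words from neighbouring messages stay separate,
# lowercase everything, and pull out runs of letters
words = re.findall(r"[a-z']+", " ".join(messages).lower())

# Count every word that is not a stopword
freq = Counter(w for w in words if w not in stop)
print(freq.most_common(3))
```

Lowercasing before counting means "Update" and "update" land in the same bucket.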
I have only scratched the surface of the data you can get from scraping GitHub here. For more ideas on what to do with this data, check out the end of the other notebook.
HTML(html)
This post was written as an IPython (Jupyter) notebook. You can view or download it using nbviewer.