Dirk Loss / @dloss, v1.0, 2014-04-17
The OpenSSL project has a public git repository. Let's clone it and see if we can use it to answer some simple questions.
%time !git clone git://git.openssl.org/openssl.git
Cloning into 'openssl'...
remote: Counting objects: 138452, done.
remote: Compressing objects: 100% (29938/29938), done.
remote: Total 138452 (delta 110293), reused 135887 (delta 108259)
Receiving objects: 100% (138452/138452), 37.76 MiB | 481.00 KiB/s, done.
Resolving deltas: 100% (110293/110293), done.
Checking connectivity... done.
CPU times: user 513 ms, sys: 137 ms, total: 650 ms
Wall time: 1min 6s
from IPython.display import IFrame
IFrame("http://en.wikipedia.org/wiki/OpenSSL#History_of_the_OpenSSL_project", 800, 400)
So the official start of the OpenSSL project was on December 23, 1998. Now let's see what we have in our repository:
cd openssl/
/Users/dirk/projekte/openssl-git/openssl
!git log --reverse | head -40
commit 90718ac5274e07cd7b1933f068e9546d12e621f5
Author: Ralf S. Engelschall <rse@openssl.org>
Date: Mon Dec 21 10:52:45 1998 +0000
This commit was generated by cvs2svn to track changes on a CVS vendor branch.

commit ec96f926b98721d6b84c7023fde0ecc5fe98e644
Author: Ralf S. Engelschall <rse@openssl.org>
Date: Mon Dec 21 10:52:45 1998 +0000
Import of old SSLeay release: SSLeay 0.8.1b

commit b7896b3cb86d80206af14a14d69b0717786f2729
Merge: 90718ac d02b48c
Author: Ralf S. Engelschall <rse@openssl.org>
Date: Mon Dec 21 10:52:47 1998 +0000
This commit was generated by cvs2svn to track changes on a CVS vendor branch.

commit d02b48c63a58ea4367a0e905979f140b7d090f86
Author: Ralf S. Engelschall <rse@openssl.org>
Date: Mon Dec 21 10:52:47 1998 +0000
Import of old SSLeay release: SSLeay 0.8.1b

commit eda1f21f1af8b6f77327e7b37573af9c1ba73726
Merge: b7896b3 c7e9169
Author: Ralf S. Engelschall <rse@openssl.org>
Date: Mon Dec 21 10:56:30 1998 +0000
This commit was generated by cvs2svn to track changes on a CVS vendor branch.

commit c7e91699977f0dcf5025c00670d9dde0c2296641
Author: Ralf S. Engelschall <rse@openssl.org>
Date: Mon Dec 21 10:56:30 1998 +0000
Import of old SSLeay release: SSLeay 0.9.0b
!git log -1
commit 300b9f0b704048f60776881f1d378c74d9c32fbd
Author: Dr. Stephen Henson <steve@openssl.org>
Date: Tue Apr 15 18:48:54 2014 +0100
Extension checking fixes.
When looking for an extension we need to set the last found
position to -1 to properly search all extensions.
PR#3309.
So we have commits from two days earlier than the official start up to today. More than 15 years of history. Good.
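The date arithmetic behind that statement can be checked quickly. This is just a small sketch using the dates quoted above (the Wikipedia start date and the first and last commit dates from `git log`); the variable names are made up for illustration.

```python
from datetime import date

official_start = date(1998, 12, 23)  # official project start, per the Wikipedia article
first_commit = date(1998, 12, 21)    # date of the oldest commit shown above
last_commit = date(2014, 4, 15)      # date of the newest commit shown above

# How many days before the official start does the history begin?
days_before_start = (official_start - first_commit).days

# Rough length of the recorded history in years (365.25 days per year).
history_years = (last_commit - first_commit).days / 365.25

print(days_before_start, round(history_years, 1))
```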
!git log --oneline | wc -l
11856
About twelve thousand commits.
First let's see how much space the current checkout (excluding the .git repo) takes:
!du -hs -I\.git
28M .
For a deeper analysis, we use David Wheeler's SLOCCount:
!sloccount .
Have a non-directory at the top, so creating directory top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./ACKNOWLEDGMENTS to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./CHANGES to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./CHANGES.SSLeay to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./Configure to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./FAQ to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./GitConfigure to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./GitMake to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.DJGPP to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.MacOS to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.NW to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.OS2 to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.VMS to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.W32 to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.W64 to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.WCE to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./LICENSE to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./Makefile.fips to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./Makefile.org to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./Makefile.shared to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./NEWS to top_dir Creating filelist for Netware Adding /Users/dirk/projekte/openssl-git/openssl/./PROBLEMS to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./README to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./README.ASN1 to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./README.ECC to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./README.ENGINE to top_dir Adding 
/Users/dirk/projekte/openssl-git/openssl/./README.FIPS to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./TABLE to top_dir Creating filelist for VMS Creating filelist for apps Creating filelist for bugs Creating filelist for certs Adding /Users/dirk/projekte/openssl-git/openssl/./config to top_dir Creating filelist for crypto Creating filelist for demos Creating filelist for doc Adding /Users/dirk/projekte/openssl-git/openssl/./e_os.h to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./e_os2.h to top_dir Creating filelist for engines Creating filelist for fips Creating filelist for include Adding /Users/dirk/projekte/openssl-git/openssl/./install.com to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./makevms.com to top_dir Creating filelist for ms Adding /Users/dirk/projekte/openssl-git/openssl/./openssl.doxy to top_dir Adding /Users/dirk/projekte/openssl-git/openssl/./openssl.spec to top_dir Creating filelist for os2 Creating filelist for perl Creating filelist for shlib Creating filelist for ssl Creating filelist for test Creating filelist for times Creating filelist for tools Creating filelist for util Categorizing files. Finding a working MD5 command.... Found a working MD5 command. Computing results. 
pod without closing cut in file /Users/dirk/projekte/openssl-git/openssl/crypto/sha/asm/sha256-c64xplus.pl

SLOC | Directory | SLOC-by-Language (Sorted)
---|---|---
283856 | crypto | ansic=184902,perl=88876,asm=9463,cpp=605,sh=10
46606 | ssl | ansic=46606
36042 | apps | ansic=35535,perl=355,sh=152
20548 | fips | ansic=18413,perl=2017,sh=118
17520 | engines | ansic=16476,perl=1044
10418 | demos | ansic=9638,sh=550,cpp=218,perl=12
7769 | util | perl=7207,sh=562
3554 | test | perl=1562,sh=1304,ansic=688
1471 | top_dir | sh=764,ansic=707
543 | ms | ansic=320,perl=223
446 | Netware | perl=446
260 | shlib | sh=260
241 | times | cpp=225,perl=16
177 | tools | perl=146,sh=31
166 | bugs | ansic=166
31 | VMS | perl=31
27 | os2 | perl=27
24 | doc | lisp=24
0 | certs | (none)
0 | include | (none)
0 | perl | (none)

Totals grouped by language (dominant language first):
ansic: 313451 (72.95%)
perl: 101962 (23.73%)
asm: 9463 (2.20%)
sh: 3751 (0.87%)
cpp: 1048 (0.24%)
lisp: 24 (0.01%)

Total Physical Source Lines of Code (SLOC) = 429,699
Development Effort Estimate, Person-Years (Person-Months) = 116.37 (1,396.48)
(Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months) = 3.26 (39.18)
(Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule) = 35.64
Total Estimated Cost to Develop = $ 15,720,421
(average salary = $56,286/year, overhead = 2.40).
SLOCCount, Copyright (C) 2001-2004 David A. Wheeler
SLOCCount is Open Source Software/Free Software, licensed under the GNU GPL.
SLOCCount comes with ABSOLUTELY NO WARRANTY, and you are welcome to redistribute it under certain conditions as specified by the GNU GPL license; see the documentation for details.
Please credit this data as "generated using David A. Wheeler's 'SLOCCount'."
So we have nearly 430 kSLOC: mostly C, as expected, but roughly a quarter is Perl. There are also nearly 10,000 lines of assembler code.
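The effort and schedule figures in the SLOCCount output come from the Basic COCOMO formulas it prints. As a rough sketch, they can be recomputed from the SLOC total; the cost line (person-years × salary × overhead) is my reading of how SLOCCount combines its defaults, so treat that part as an assumption.

```python
sloc = 429_699     # total physical SLOC reported by SLOCCount above
salary = 56_286    # average salary per year (SLOCCount default shown above)
overhead = 2.40    # overhead factor (SLOCCount default shown above)

ksloc = sloc / 1000

# Basic COCOMO, as printed in the SLOCCount output.
person_months = 2.4 * ksloc ** 1.05
schedule_months = 2.5 * person_months ** 0.38
developers = person_months / schedule_months

# Assumption: cost = person-years * salary * overhead.
cost = person_months / 12 * salary * overhead

print(f"effort:     {person_months:,.2f} person-months")
print(f"schedule:   {schedule_months:.2f} months")
print(f"developers: {developers:.2f}")
print(f"cost:       $ {cost:,.0f}")
```

The results should land very close to the numbers in the SLOCCount report; small differences in the last digits are rounding.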
I'll save the commit timestamps, authors and commit IDs as a CSV file, which can then be imported and analysed using the excellent pandas library:
!git log --format=format:"%ai,%an,%H" > ../commits
cd ..
/Users/dirk/projekte/openssl-git
import pandas as pd
df=pd.read_csv("commits", header=None, names=["time", "author", "id"], index_col="time", parse_dates=True)
df.sort_index(ascending=True, inplace=True)
df.head()
time | author | id
---|---|---
1998-12-21 10:52:45 | Ralf S. Engelschall | 90718ac5274e07cd7b1933f068e9546d12e621f5
1998-12-21 10:52:45 | Ralf S. Engelschall | ec96f926b98721d6b84c7023fde0ecc5fe98e644
1998-12-21 10:52:47 | Ralf S. Engelschall | b7896b3cb86d80206af14a14d69b0717786f2729
1998-12-21 10:52:47 | Ralf S. Engelschall | d02b48c63a58ea4367a0e905979f140b7d090f86
1998-12-21 10:56:30 | Ralf S. Engelschall | eda1f21f1af8b6f77327e7b37573af9c1ba73726
5 rows × 2 columns
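One caveat with the `%ai,%an,%H` format: an author name containing a comma would produce an extra field in the CSV. None of the names in this repository seem to, but a more defensive parser is easy to sketch. The timestamp and the hash never contain commas, so we can split once from each end. The function name and the second sample line (author "Doe, Jane" and its hash) are hypothetical.

```python
from datetime import datetime

def parse_commit_line(line):
    """Split one '%ai,%an,%H' line into (timestamp, author, commit id)."""
    # The timestamp (%ai) and the hash (%H) contain no commas, so split once
    # from the left and once from the right; the middle is the author name.
    timestamp, rest = line.split(",", 1)
    author, commit_id = rest.rsplit(",", 1)
    when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S %z")
    return when, author, commit_id

# First line is taken from the log above, second one is made up.
real = "1998-12-21 10:52:45 +0000,Ralf S. Engelschall,90718ac5274e07cd7b1933f068e9546d12e621f5"
tricky = "2014-04-15 18:48:54 +0100,Doe, Jane,0123456789abcdef0123456789abcdef01234567"

for line in (real, tricky):
    print(parse_commit_line(line))
```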
Pandas provides a convenience function that shows how often each value occurs in a given column:
commits_per_author=df.author.value_counts()
commits_per_author
Dr. Stephen Henson 3558
Richard Levitte 2331
Andy Polyakov 1800
Bodo Möller 1699
Ulf Möller 661
Ben Laurie 590
Geoff Thorpe 408
Lutz Jänicke 300
Nils Larsch 197
Ralf S. Engelschall 189
Mark J. Cox 18
Paul C. Sutton 11
Adam Langley 11
Daniel Kahn Gillmor 10
Rob Stradling 10
Scott Deboy 7
stephen 5
Trevor Perrin 5
Carlos Alberto Lopez Perez 4
Bodo Moeller 4
Trevor 3
Piotr Sikora 3
Michael Tuexen 3
Lubomir Rintel 2
Robin Seggelmann 2
Kurt Roeckx 2
Matt Caswell 2
Jeff Trawick 2
Scott Schaefer 2
Kaspar Brand 2
Nick Mathewson 2
Krzysztof Kwiatkowski 1
Emilia Kasper 1
Lutz Jaenicke 1
Steve Marquess 1
Eric Young 1
Ard Biesheuvel 1
David Woodhouse 1
Veres Lajos 1
Klaus-Peter Junghanns 1
Mat 1
Tim Hudson 1
Jeff Walton 1
Nick Alcock 1
dtype: int64
So we have 10 people with more than 100 commits. Not a lot. But no surprises, either: the top 11 committers are exactly the current development team mentioned on the OpenSSL homepage.
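To double-check the "more than 100 commits" count, here is a minimal sketch in plain Python. The counts are copied from the top of the `value_counts()` output above, and the total comes from the earlier `git log --oneline | wc -l`; the variable names are mine.

```python
# Top 11 entries copied from the value_counts() output above.
commit_counts = {
    "Dr. Stephen Henson": 3558,
    "Richard Levitte": 2331,
    "Andy Polyakov": 1800,
    "Bodo Möller": 1699,
    "Ulf Möller": 661,
    "Ben Laurie": 590,
    "Geoff Thorpe": 408,
    "Lutz Jänicke": 300,
    "Nils Larsch": 197,
    "Ralf S. Engelschall": 189,
    "Mark J. Cox": 18,
}
total_commits = 11856  # from `git log --oneline | wc -l` above

# Authors with more than 100 commits, and their share of all commits.
regulars = [name for name, n in commit_counts.items() if n > 100]
share = sum(commit_counts[name] for name in regulars) / total_commits

print(len(regulars), f"{share:.1%}")
```

So these ten authors account for the vast majority of all commits in the log.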
Let's visualize the commit counts with Matplotlib. But first import seaborn, which gives us much prettier graphics:
import seaborn as sns
%matplotlib inline
commits_per_author.plot(kind="bar", figsize=(10,6))
<matplotlib.axes.AxesSubplot at 0x109a5e7d0>
Dr. Stephen Henson clearly dominates.
To plot the cumulative number of commits over time, we introduce a counter column that is 1 for every commit:
df["c"]=1 # counter
commits_over_time=df.c.cumsum().plot()
commits_over_time
<matplotlib.axes.AxesSubplot at 0x10999ded0>
authors = commits_per_author.index
timelines=pd.DataFrame(index=df.index)
for author in authors:
    timelines[author]=df.c.where(df.author==author)
timelines.head()
time | Dr. Stephen Henson | Richard Levitte | Andy Polyakov | Bodo Möller | Ulf Möller | Ben Laurie | Geoff Thorpe | Lutz Jänicke | Nils Larsch | Ralf S. Engelschall | Mark J. Cox | Paul C. Sutton | Adam Langley | Daniel Kahn Gillmor | Rob Stradling | Scott Deboy | stephen | Trevor Perrin | Carlos Alberto Lopez Perez | Bodo Moeller | ...
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1998-12-21 10:52:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ...
1998-12-21 10:52:45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ...
1998-12-21 10:52:47 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ...
1998-12-21 10:52:47 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ...
1998-12-21 10:56:30 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ...
5 rows × 44 columns
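The `timelines` table has one column per author: a 1 where that author made the commit, NaN everywhere else. The point of this layout is that `cumsum()` skips the NaNs, so each column turns into that author's running commit count. Here is a tiny sketch of the same idea with a hypothetical four-commit log (the authors "alice" and "bob" are made up).

```python
import pandas as pd

# Hypothetical mini-log: four commits by two made-up authors.
log = pd.DataFrame(
    {"author": ["alice", "bob", "alice", "alice"]},
    index=pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03", "2014-01-04"]),
)
log["c"] = 1  # counter column, as in the notebook

# One column per author: 1 where the author committed, NaN otherwise.
timelines = pd.DataFrame(index=log.index)
for author in log["author"].unique():
    timelines[author] = log["c"].where(log["author"] == author)

# cumsum() ignores NaN, so each column becomes a per-author running total.
cumulative = timelines.cumsum()
print(cumulative)
```

The NaN gaps remain in the result, which is why plotting with `style="o"` (points instead of lines) works well for these columns.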
default_palette = sns.color_palette()
sns.set_palette("Set1")
top_authors=authors[:10]
timelines[top_authors].cumsum().plot(style="o",figsize=(20,10))
<matplotlib.axes.AxesSubplot at 0x109f9dd10>
sns.set_palette(default_palette)
Let's see how many authors were active together, e.g. during a 3 month period:
per_months=timelines.resample("3M").sum()
per_months["nauthors"]=per_months.applymap(lambda x: min(x, 1)).sum(axis=1)
per_months["nauthors"].plot(kind="bar", figsize=(20,5))
<matplotlib.axes.AxesSubplot at 0x10a16d550>
So there have been 3 to 13 authors per quarter year.
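The same question can be answered without pandas. The following is only a sketch under my own assumptions: it groups by calendar quarter, which is not exactly the same binning as the 3-month resampling above, and the function name and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def authors_per_quarter(commits):
    """Map (year, quarter) to the number of distinct authors in that quarter."""
    active = defaultdict(set)
    for when, author in commits:
        quarter = (when.month - 1) // 3 + 1  # Jan-Mar -> 1, Apr-Jun -> 2, ...
        active[(when.year, quarter)].add(author)
    return {key: len(names) for key, names in sorted(active.items())}

# Hypothetical sample: (commit time, author) pairs.
sample = [
    (datetime(2014, 1, 10), "alice"),
    (datetime(2014, 2, 3), "bob"),
    (datetime(2014, 3, 30), "alice"),
    (datetime(2014, 4, 1), "carol"),
]
print(authors_per_quarter(sample))
```

Using a set per quarter means an author is counted once, no matter how many commits they made in that period.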
For now we just count the number of files in each commit:
cd openssl/
/Users/dirk/projekte/openssl-git/openssl
%%time
filecounts = []
for commit in df["id"]:
    cfiles = !git ls-tree -r --name-only $commit
    filecounts.append(len(cfiles))
CPU times: user 16.7 s, sys: 36.6 s, total: 53.3 s Wall time: 4min 2s
filestats=pd.DataFrame({"filecount": filecounts}, index=df.index)
filestats.plot(figsize=(10,6))
<matplotlib.axes.AxesSubplot at 0x109d19f10>
As we have seen before, at the beginning code was imported from SSLeay, so the graph starts with more than 1000 files.
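The file-counting loop relies on IPython's `!` syntax. Outside a notebook, roughly the same thing could be done with `subprocess`. This is a sketch, not what the notebook actually ran; the helper names `count_paths` and `files_in_commit` are made up, and it assumes `git` is on the PATH.

```python
import subprocess

def count_paths(ls_tree_output):
    """Count the paths in `git ls-tree -r --name-only` output (one per line)."""
    return len(ls_tree_output.splitlines())

def files_in_commit(commit, repo="."):
    """Return the number of files tracked in the given commit."""
    result = subprocess.run(
        ["git", "-C", repo, "ls-tree", "-r", "--name-only", commit],
        capture_output=True,
        text=True,
        check=True,  # raise if git reports an error (e.g. unknown commit)
    )
    return count_paths(result.stdout)

# Example (needs a git checkout in the current directory):
# print(files_in_commit("HEAD"))
```

This still starts one git process per commit, so it should not be expected to run much faster than the loop above.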
Next, let's look at which files change most often. The idea for the following git command comes from Gary Bernhardt's git-churn. We can simplify it though, because we have Python and pandas:
file_changes =! git log --all -M -C --name-only --format='format:' | grep -v '^$'
dfc = pd.Series(list(file_changes))
dfc.value_counts()
CHANGES 2993
Configure 1440
Makefile.org 713
ssl/ssl.h 665
TABLE 628
util/libeay.num 567
ssl/s3_srvr.c 539
STATUS 513
ssl/ssl_lib.c 484
apps/s_server.c 439
ssl/s3_clnt.c 435
FAQ 407
ssl/t1_lib.c 387
config 384
ssl/s3_lib.c 376
...
fips/sha/fips_sha.h 1
VMS/compaq/cpq-axpvms-ssl-t0100--1.pcsi$text 1
fips/testvectors/des3/sample/TCBCinvperm.sam 1
fips/testvectors/des3/sample/TOFBvarkey.sam 1
fips/testvectors/des2/req/TCFB1permop.req 1
fips/testvectors/dsa/req/SigVer.req 1
cpq-axpvms-ssl-t0100--1.pcsi$desc 1
fips/testvectors/des3/req/TCFB8Monte2.req 1
doc/crypto/d2i_ASN1_OBJECT.pod 1
fips/testvectors/des2/sample/TCFB64MMT2.sam 1
fips/testvectors/des2/req/TECBinvperm.req 1
demos/vms_examples/ssl$simple_serv.c 1
fips-1.0/fipsalgtest.pl 1
fips/testvectors/des3/sample/TCBCvarkey.sam 1
fips/testvectors/des3/req/TCBCMonte2.req 1
Length: 4236, dtype: int64
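What happens here is simple: `git log --name-only` with an empty format prints the paths touched by each commit, and counting how often a path appears gives its churn. A plain-Python sketch of the counting step (the function name is hypothetical, and the sample text is made up, although the paths are real ones from the output above):

```python
from collections import Counter

def count_changes(name_only_log):
    """Count how often each path appears in `git log --name-only` output."""
    # Empty lines separate commits; skip them like `grep -v '^$'` does.
    return Counter(line for line in name_only_log.splitlines() if line)

# Made-up log excerpt: two commits, separated by an empty line.
sample_log = "ssl/ssl.h\nCHANGES\n\nCHANGES\nConfigure\n"

changes = count_changes(sample_log)
print(changes.most_common())
```

`Counter.most_common()` sorts by descending count, much like `value_counts()` does for the pandas Series.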
c_changes=dfc.where(dfc.str.endswith(".c")).value_counts()
c_changes
ssl/s3_srvr.c 539
ssl/ssl_lib.c 484
apps/s_server.c 439
ssl/s3_clnt.c 435
ssl/t1_lib.c 387
ssl/s3_lib.c 376
apps/s_client.c 375
apps/apps.c 321
apps/ca.c 296
apps/speed.c 286
ssl/ssltest.c 269
crypto/x509/x509_vfy.c 248
ssl/ssl_err.c 248
ssl/s3_pkt.c 237
ssl/ssl_ciph.c 236
...
fips/ecdsa/fips_ecdsa_lib.c 1
demos/vms_examples/ssl$serv_sess_reuse.c 1
demos/err/main.c 1
crypto/evp/evp_aead.c 1
fips/sha/fips_sha1dgst.c 1
demos/vms_examples/ssl$serv_verify_client.c 1
demos/vms_examples/ssl$serv_sess_reuse_cli_ver.c 1
crypto/poly1305/poly1305_arm.c 1
engines/ccgost/md_gost.c 1
demos/vms_examples/ssl$cli_sess_renego.c 1
crypto/poly1305/poly1305.c 1
engines/ccgost/pmeth.c 1
demos/vms_examples/ssl$simple_cli.c 1
crypto/ts/ts_resp_sign.c 1
crypto/poly1305/poly1305test.c 1
Length: 1288, dtype: int64
c_changes.plot()
<matplotlib.axes.AxesSubplot at 0x10b498390>
As expected, a few files are changed very often and most files are changed infrequently.
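One way to put a number on "a few files are changed very often" would be to compute the share of all changes that falls on the top N files. The sketch below uses made-up counts, not the OpenSSL data, and `share_of_top` is a hypothetical helper.

```python
def share_of_top(counts, n):
    """Fraction of all changes that falls on the n most-changed files."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    return sum(ordered[:n]) / total if total else 0.0

# Hypothetical change counts for ten files (NOT the OpenSSL numbers).
hypothetical_counts = [500, 300, 100, 50, 25, 10, 5, 5, 3, 2]

print(share_of_top(hypothetical_counts, 2))
```

With the real data, something like `share_of_top(c_changes.values, 15)` should work, since `c_changes` holds one count per file.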
What about header files?
h_changes=dfc.where(dfc.str.endswith(".h")).value_counts()
h_changes
ssl/ssl.h 665
crypto/evp/evp.h 366
ssl/ssl_locl.h 343
crypto/opensslv.h 329
crypto/asn1/asn1.h 280
crypto/x509/x509.h 260
crypto/objects/obj_dat.h 234
crypto/bn/bn.h 225
e_os.h 214
crypto/objects/obj_mac.h 194
crypto/rsa/rsa.h 190
crypto/crypto.h 190
apps/apps.h 189
crypto/x509v3/x509v3.h 182
ssl/ssl3.h 176
...
engines/ccgost/keywrap.h 1
apps/term_sock.h 1
fips/sha/fips_md32_common.h 1
crypto/poly1305/poly1305.h 1
engines/vendor_defns/cswift.h 1
fips/sha/fips_sha_locl.h 1
engines/ccgost/paramset.h 1
fips/sha/fips_sha.h 1
crypto/o_dir.h 1
crypto/engine/vendor_defns/keyclient.h 1
demos/err/test_err.h 1
crypto/chacha/chacha.h 1
engines/vendor_defns/hw_4758_cca.h 1
engines/vendor_defns/hw_ubsec.h 1
engines/vendor_defns/atalla.h 1
Length: 270, dtype: int64
To be continued... ;-)