In this tutorial we explore the data of APiCS, the Atlas of Pidgin and Creole Language Structures.
We download the complete dataset as an SQLite database from http://apics-online.info/download
import urllib
url = 'http://apics-online.info/static/download/apics-dataset.sqlite.zip'
filename, headers = urllib.urlretrieve(url, url.rpartition('/')[2])
import zipfile
import os
os.listdir('.')
['include', 'bin', 'Untitled0.ipynb', 'apics.sqlite', 'local', 'lib', 'apics-dataset.sqlite.zip']
print zipfile.ZipFile(filename).namelist()
['apics-dataset.sqlite', 'README.txt']
with zipfile.ZipFile(filename) as fp:
    # only read the README here; nothing needs to be opened for writing yet
    print fp.read('README.txt')
APiCS Online data download
==========================

Data of APiCS Online is published under the following license:
http://creativecommons.org/licenses/by-sa/3.0/

It should be cited as

Michaelis, Susanne Maria & Maurer, Philippe & Haspelmath, Martin & Huber, Magnus (eds.) 2013. Atlas of Pidgin and Creole Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://apics-online.info, Accessed on 2013-08-13.)
with zipfile.ZipFile(filename) as fp:
    # open the target in binary mode: the SQLite file is not text
    with open('apics.sqlite', 'wb') as fp2:
        fp2.write(fp.read('apics-dataset.sqlite'))
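The extraction step above reads the whole archive member into memory before writing it out. For larger downloads the member can be streamed to disk instead. The following is a minimal sketch, not part of the original tutorial: the helper name `extract_member` and the demo file names are hypothetical, and it relies only on `zipfile.ZipFile.open` and `shutil.copyfileobj` from the standard library. It is written to run under Python 2.7 as well as Python 3.

```python
import os
import shutil
import tempfile
import zipfile


def extract_member(zip_path, member, target):
    """Copy one archive member to `target` in binary mode (hypothetical helper)."""
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile.open returns a file-like object, so the member is copied
        # in chunks instead of being loaded into memory in one piece.
        src = archive.open(member)
        try:
            with open(target, 'wb') as dst:
                shutil.copyfileobj(src, dst)
        finally:
            src.close()
    return target


# Self-contained demo with a throwaway archive (file names are made up).
workdir = tempfile.mkdtemp()
demo_zip = os.path.join(workdir, 'demo.zip')
with zipfile.ZipFile(demo_zip, 'w') as archive:
    archive.writestr('data.bin', b'\x00\x01binary payload\xff')

extracted = extract_member(demo_zip, 'data.bin', os.path.join(workdir, 'out.bin'))
with open(extracted, 'rb') as fp:
    payload = fp.read()
```

In the tutorial's setting the call would be `extract_member(filename, 'apics-dataset.sqlite', 'apics.sqlite')`.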
import sqlite3
db = sqlite3.connect('apics.sqlite')
db.execute("select name from dataset").fetchone()
(u'APiCS Online',)
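Before querying individual tables it can help to list what the database contains. The sketch below is one possible way to do that with SQLite's own catalogue (`sqlite_master`) and `pragma table_info`. The helper name `describe_tables` is hypothetical, and the two-table in-memory database is an invented stand-in for `apics.sqlite`, not its real schema.

```python
import sqlite3


def describe_tables(conn):
    """Map each table name to the list of its column names (hypothetical helper)."""
    tables = [row[0] for row in conn.execute(
        "select name from sqlite_master where type = 'table' order by name")]
    schema = {}
    for table in tables:
        # pragma table_info yields one row per column; index 1 is the column name.
        # The table names come from sqlite_master, so simple quoting suffices here.
        info = conn.execute('pragma table_info("%s")' % table).fetchall()
        schema[table] = [col[1] for col in info]
    return schema


# Toy in-memory database standing in for apics.sqlite (schema is invented).
demo = sqlite3.connect(':memory:')
demo.execute("create table dataset (pk integer primary key, name text)")
demo.execute("create table language (pk integer primary key, id text, name text)")
schema = describe_tables(demo)
```

Calling `describe_tables(db)` on the connection opened above would list the tables and columns of the downloaded dataset in the same way.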
When exploring the database of a clld app, two things have to be kept in mind:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
db = sqlite3.connect('apics.sqlite')
cu = db.cursor()
print cu.execute("select count(*) from language").fetchone()
(104,)
That's curious. From looking at http://apics-online.info/contributions#list-container we would have expected to find just 76 languages. But consulting https://github.com/clld/apics/blob/master/apics/models.py#L55 we see that languages may have lects, and these are listed in the language table as well. The core languages are those that are not related to another language, i.e. those whose language_pk in the lect table is null. Filtering on that condition gives the expected result:
print cu.execute("select count(*) from lect where language_pk is null").fetchone()
(76,)
Now let's see what "joined table inheritance" means. The lect table and the language table are associated in such a way that the lect table adds information for each of the objects in the language table, e.g. information about the lexifier of a language:
for row in cu.execute("select lexifier, count(pk) as c from lect where language_pk is null group by lexifier order by c desc"):
    print row
(u'English', 26)
(u'Portuguese', 14)
(u'Other', 10)
(u'French', 9)
(u'Spanish', 6)
(u'Bantu', 3)
(u'Dutch', 3)
(u'Malay', 3)
(u'Arabic', 2)
for row in cu.execute("select l.name from language as l, lect as ll where ll.pk = l.pk and ll.lexifier = 'Malay'"):
    print row
(u'Sri Lankan Malay',)
(u'Singapore Bazaar Malay',)
(u'Ambon Malay',)
So associated rows from language and lect have the same primary key pk.
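The shared-primary-key pattern can be reproduced on a toy database. The sketch below is an illustration only: the two tables are cut down to a few columns, the rows are invented, and the real APiCS schema has more fields. It uses an explicit `join ... on` in place of the comma-style join above, which is equivalent here.

```python
import sqlite3

demo = sqlite3.connect(':memory:')
demo.executescript("""
create table language (pk integer primary key, name text);
-- lect extends language: its primary key is also a foreign key to language.pk
create table lect (
    pk integer primary key references language (pk),
    lexifier text,
    language_pk integer  -- parent lect; null for core languages
);
""")
# Invented rows: two core languages and one sub-lect of the first.
demo.executemany("insert into language values (?, ?)",
                 [(1, 'Core A'), (2, 'Core B'), (3, 'Sub-lect of A')])
demo.executemany("insert into lect values (?, ?, ?)",
                 [(1, 'English', None), (2, 'Malay', None), (3, 'English', 1)])

# Every lect is also a row in language, so the base table holds all three.
total = demo.execute("select count(*) from language").fetchone()[0]

# Joining on the shared pk combines base columns (name) with the
# subclass columns (lexifier); the null filter keeps only core languages.
core = demo.execute(
    "select l.name, ll.lexifier from language as l "
    "join lect as ll on ll.pk = l.pk "
    "where ll.language_pk is null order by l.pk").fetchall()
```

Here `total` counts all three rows while `core` contains only the two languages without a parent, mirroring the 104 versus 76 difference seen above.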
import pandas
# use the fully qualified option name and pass the index column by keyword
pandas.set_option('display.max_rows', 10)
languages = pandas.read_sql('SELECT * FROM language', db, index_col='id')
languages.latitude.hist(bins=45)
print
languages.longitude.hist(bins=45)
print