This notebook documents the process by which I created a pilot sample of the Review and Herald. It is a 10% sample of the Review and Herald corpus available from the Adventist Archives as of Spring 2014.
The first function reads the file names from a given directory. The second function parses the standard file names of the SDA periodicals (assigned by the SDA archives) into a list of fields that can be written out as a standard CSV file.
# %load '../lib/list-directory.py'
import os, sys
def get_corpus_list(directory):
    wd = os.listdir(directory)
    return(wd)
# %load '../lib/get_corpus_data_SDA.py'
import csv
import os, sys
import re
def get_corpus_data_SDA(directory):
    wd = os.listdir(directory)
    listing = []
    for each in wd:
        if each.endswith('pdf'):
            _id = each
            # Build the path from the directory argument rather than a hard-coded string.
            path = os.path.join(directory, _id)
            # Leading letters are the periodical prefix, e.g. 'RH'.
            pre = re.findall(r'^[a-zA-Z]*', _id)
            # Digit runs are, in order: date (YYYYMMDD), volume, issue.
            foo = re.findall(r'\d+', _id)
            year = foo[0][0:4]
            month = foo[0][4:6]
            day = foo[0][6:8]
            volume = foo[1]
            issue = foo[2]
            url = "http://documents.adventistarchives.org/Periodicals/" + pre[0] + "/" + _id
            listing.append([_id, pre[0], year, month, day, volume, issue, path, url])
    return(listing)
RHList = get_corpus_data_SDA('/Users/jeriwieringa/Dissertation/text/corpus-RH/all-pdf/')
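To make the file-name convention concrete, here is a minimal sketch that parses a single name with one regular expression. The helper name parse_sda_filename and the pattern itself are assumptions inferred from names such as RH18501201-V01-02.pdf; they are not part of the library code loaded above, and real file names that deviate from this shape would return None.

```python
import re

# Hypothetical pattern inferred from names such as 'RH18501201-V01-02.pdf':
# prefix letters, YYYYMMDD date, '-V' plus volume, '-' plus issue, '.pdf'.
SDA_NAME = re.compile(r'^([A-Za-z]+)(\d{4})(\d{2})(\d{2})-V(\d+)-(\d+)\.pdf$')

def parse_sda_filename(name):
    """Return (prefix, year, month, day, volume, issue), or None if the name does not match."""
    match = SDA_NAME.match(name)
    return match.groups() if match else None

print(parse_sda_filename('RH18501201-V01-02.pdf'))
# ('RH', '1850', '12', '01', '01', '02')
```

Returning None for a non-matching name makes irregular files easy to spot, whereas indexing into the result of re.findall raises an IndexError when a name has fewer digit groups than expected.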
I experimented with two methods of identifying the different periodicals in the list. The first adds a UUID to each issue in the list of periodicals.
import uuid
for each in RHList:
    each.append(str(uuid.uuid4()))
print(RHList[1])
['RH18501201-V01-02.pdf', 'RH', '1850', '12', '01', '01', '02', '/Users/jeriwieringa/Dissertation/text/corpus-RH/all-pdf/RH18501201-V01-02.pdf', 'http://documents.adventistarchives.org/Periodicals/RH/RH18501201-V01-02.pdf', '6e759178-8c2c-4d88-893c-30eeae4600ed']
This second method adds a count to the array (called num_id in the resulting CSVs) for all of the issues in the list of files. This was useful for quickly seeing how the random library sampled the total corpus.
n=1
for each in RHList:
    each.append(n)
    n=n+1
print(RHList[1])
['RH18501201-V01-02.pdf', 'RH', '1850', '12', '01', '01', '02', '/Users/jeriwieringa/Dissertation/text/corpus-RH/all-pdf/RH18501201-V01-02.pdf', 'http://documents.adventistarchives.org/Periodicals/RH/RH18501201-V01-02.pdf', '6e759178-8c2c-4d88-893c-30eeae4600ed', 2]
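The same 1-based count can also be produced with enumerate, which removes the manual counter. This is only an equivalent sketch on toy data; the list rows and its file names are placeholders, not entries from the corpus.

```python
# Toy stand-in for RHList; the real rows carry more fields.
rows = [['issue-a.pdf'], ['issue-b.pdf'], ['issue-c.pdf']]

# enumerate(..., start=1) yields the same 1-based count as the manual counter.
for num_id, row in enumerate(rows, start=1):
    row.append(num_id)

print(rows[1])
# ['issue-b.pdf', 2]
```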
Next, for record keeping, I printed the list of all of the periodicals, with their identifiers, to a CSV file.
fout = open('20150904-2-corpus-list-RH-sample.csv', 'w')
writer = csv.writer(fout, delimiter=',', quotechar='"')
headers = ['_id', 'prefix', 'year', 'month', 'day', 'volume', 'issue', 'path', 'url', 'UUID', 'num_id']
writer.writerow(headers)
61
for each in RHList:
    writer.writerow(each)
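One caveat with the cell above is that the file handle is never explicitly closed, so buffered rows may not reach the disk until the kernel exits. A hedged sketch of the same write using a with block follows. The header is shortened, the rows are toy data, and the output path in the system temp directory is a placeholder.

```python
import csv
import os
import tempfile

headers = ['_id', 'prefix', 'num_id']            # shortened header for the example
rows = [['a.pdf', 'RH', 1], ['b.pdf', 'RH', 2]]  # toy rows, not real corpus data

# Hypothetical output location in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), 'example-corpus-list.csv')

# The with block closes (and flushes) the file even if a write fails;
# newline='' is what the csv module documentation recommends for file objects.
with open(out_path, 'w', newline='') as fout:
    writer = csv.writer(fout, delimiter=',', quotechar='"')
    writer.writerow(headers)
    writer.writerows(rows)

# Read the file back to confirm what was written (all values come back as strings).
with open(out_path, newline='') as fin:
    written = list(csv.reader(fin))
print(written)
```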
Next, I created a second CSV file to hold the list of sample files.
open('20150904-corpus-list-RH-sample.csv')
<_io.TextIOWrapper name='20150904-corpus-list-RH-sample.csv' mode='r' encoding='UTF-8'>
Then, I imported math to calculate the size of the sample, given the total number of periodicals in the corpus.
import random
import math
sampleSize = math.floor(len(RHList)/10)
print(sampleSize)
382
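The arithmetic behind that number can be checked directly. The corpus size of 3822 is the figure reported in this notebook; the variable names are mine.

```python
import math

corpus_size = 3822                           # size of the full corpus reported in this notebook
sample_size = math.floor(corpus_size / 10)   # 3822 / 10 = 382.2, floored to 382

# Integer division gives the same result without the intermediate float.
same_result = (sample_size == corpus_size // 10)
print(sample_size, same_result)
# 382 True
```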
Following advice from Stack Overflow, I used random.sample and passed in the range of list positions for the full corpus and the sample size calculated above. random.sample pulled 382 numbers out of an available field of 3822 (the size of the whole corpus); the code then sorted them and added the periodical at each of those list positions to a new list. Note that, while the number from the random generator matches the list position, the associated num_id in the CSV is one ahead, because list positions start at 0 while the num_id count starts at 1.
# http://stackoverflow.com/questions/6482889/get-random-sample-from-list-while-maintaining-ordering-of-items
rand_smpl = [RHList[i] for i in sorted(random.sample(range(len(RHList)), sampleSize)) ]
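The call above draws from Python's global random generator without a seed, so rerunning the cell produces a different sample. If a repeatable draw were wanted, one option is a seeded generator, sketched below. The function name ordered_sample, the seed value, and the toy corpus are all assumptions for illustration; this does not recover the sample actually drawn for the pilot.

```python
import random

def ordered_sample(items, k, seed=None):
    """Return k items chosen at random, kept in their original order."""
    rng = random.Random(seed)  # private generator; seed=None means unseeded
    positions = sorted(rng.sample(range(len(items)), k))
    return [items[i] for i in positions]

toy_corpus = ['issue-%02d' % n for n in range(1, 21)]  # 20 placeholder issues
first = ordered_sample(toy_corpus, 2, seed=2015)
second = ordered_sample(toy_corpus, 2, seed=2015)
print(first == second)
# True
```

Sorting the sampled positions before indexing is what preserves the chronological order of the issues, just as in the original one-liner.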
And to confirm that everything worked as anticipated, I printed both one random list item and the length of the sample. Both pointed to a successful sampling.
print(rand_smpl[3])
['RH18520722-V03-06.pdf', 'RH', '1852', '07', '22', '03', '06', '/Users/jeriwieringa/Dissertation/data-sources/corpus-RH/pdf/RH18520722-V03-06.pdf', 'http://documents.adventistarchives.org/Periodicals/RH/RH18520722-V03-06.pdf', '634b03d4-090b-499b-815f-3abe1986011e', 34]
len(rand_smpl)
382
The final step was to use the resulting list of periodicals to separate out the sample corpus. First, the file path was isolated out of the information on each sample member.
fileList = []
for each in rand_smpl:
    fileList.append(each[7])
Then, using shutil, each file in the sample was copied to a separate directory.
import shutil
for each in fileList:
    if os.path.isfile(each):
        shutil.copy(each, '../../data-sources/corpus-RH/sample-pdf/')
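The loop above silently skips any path that is not a file, so a stale path in the list would shrink the sample without warning. The sketch below is one way to surface that. The function copy_sample is hypothetical, and the demonstration runs in a throwaway temp directory with placeholder file names rather than the real corpus.

```python
import os
import shutil
import tempfile

def copy_sample(paths, dest_dir):
    """Copy every existing file in paths to dest_dir; return the paths that were not found."""
    os.makedirs(dest_dir, exist_ok=True)  # create the target directory if needed
    missing = []
    for path in paths:
        if os.path.isfile(path):
            shutil.copy(path, dest_dir)
        else:
            missing.append(path)
    return missing

# Demonstration in a throwaway directory with one real and one absent file.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, 'present.pdf')
    absent = os.path.join(tmp, 'absent.pdf')
    with open(present, 'w') as handle:
        handle.write('placeholder')
    dest = os.path.join(tmp, 'sample-pdf')
    missing = copy_sample([present, absent], dest)
    copied = sorted(os.listdir(dest))

print(copied, len(missing))
# ['present.pdf'] 1
```

Comparing the number of copied files against the sample size (382 here) would then confirm that every sampled issue made it into the sample directory.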
And finally, for record keeping, the information for the sample was saved into a CSV file.
sampleListOut = open('20150904-sample-list.csv', 'w')
writer2 = csv.writer(sampleListOut, delimiter=',', quotechar='"')
writer2.writerow(headers)
61
for each in rand_smpl:
    writer2.writerow(each)