This notebook runs Python scripts that scan through the contents of various test corpora, invoke various tools on them and analyse the results.
Using an IPython Notebook makes it easy to regenerate the results by re-running these analyses as part of a continuous integration process. An IPython Notebook also generates output that is easy to publish on the web as static pages.
See also https://github.com/richardlehane/comparator
sf DIR
java -jar ~/droid/droid-command-line-6.1.5.jar -Ns ~/.droid6/signature_files/DROID_SignatureFile_V81.xml -Nc ~/.droid6/container_sigs/container-signature-20150218.xml -recurse -Nr DIR
droid -Ns ~/.droid6/signature_files/DROID_SignatureFile_V81.xml -Nc ~/.droid6/container_sigs/container-signature-20150218.xml -recurse -Nr DIR
python fido.py -recurse DIR
find systems-showcase-files -type f -exec file -I {} \;
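The invocations above can be gathered into one table so that each tool is started the same way from Python. This is a minimal sketch rather than the notebook's own method: `TOOL_COMMANDS` and `build_command` are hypothetical names, only the Siegfried, Fido and `file` command lines listed above are included, and the `find` example's directory has been replaced by the same `DIR` placeholder used for the other tools.

```python
# Hypothetical table of identification commands, copied from the
# invocations listed above; "DIR" marks where the target path goes.
TOOL_COMMANDS = {
    "siegfried": ["sf", "DIR"],
    "fido": ["python", "fido.py", "-recurse", "DIR"],
    "file": ["find", "DIR", "-type", "f", "-exec", "file", "-I", "{}", ";"],
}


def build_command(tool, target_dir):
    """Return the argument list for `tool` with DIR replaced by target_dir."""
    template = TOOL_COMMANDS[tool]
    return [target_dir if arg == "DIR" else arg for arg in template]


if __name__ == "__main__":
    for name in sorted(TOOL_COMMANDS):
        print(name, build_command(name, "corpora"))
```

Because each command is an argument list rather than a shell string, it can be passed to `subprocess.Popen` directly, and the `{}` and `;` arguments to `find` need no shell escaping.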
Where appropriate, the format corpus pulls in existing corpora by remote reference rather than duplicating them in the main repository. Therefore, the first step is to create or update the local copies of those resources.
We then run various tools of interest, and collect the results.
We then summarise the results from the various tools.
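As one way to picture this summary step, the sketch below counts how often each identified type occurs. It assumes a tab-separated results file whose first two columns are the file path and the identified type, which is the layout the scan loop later in this notebook writes to `scan-results.out`. The `summarise_types` helper is a hypothetical name and the sample rows are made up for illustration.

```python
import collections
import io


def summarise_types(lines):
    """Count how often each identified type appears in tab-separated rows."""
    counts = collections.Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample rows for illustration only, not real results.
    sample = io.StringIO(
        "a/one.pdf\tapplication/pdf\t0\tFalse\t1.2\n"
        "a/two.pdf\tapplication/pdf\t0\tFalse\t1.1\n"
        "b/three.txt\ttext/plain\t0\tFalse\t0.9\n"
    )
    for mime, n in summarise_types(sample).most_common():
        print(mime, n)
```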
We take the latest results and combine them with earlier sets of results, in order to see how things have changed over time.
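A comparison between two runs could look roughly like the following. It is a sketch under the same assumption of tab-separated rows with the path in the first column and the identified type in the second; `load_types` and `changed_types` are hypothetical helpers, and the rows in the usage example are invented.

```python
def load_types(lines):
    """Map each file path to its identified type from tab-separated rows."""
    result = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            result[fields[0]] = fields[1]
    return result


def changed_types(earlier, latest):
    """Return {path: (old_type, new_type)} for files whose type differs."""
    return {
        path: (earlier[path], latest[path])
        for path in sorted(set(earlier) & set(latest))
        if earlier[path] != latest[path]
    }


if __name__ == "__main__":
    # Invented rows standing in for an earlier and a later run.
    earlier = load_types(["a.bin\tapplication/octet-stream\n", "b.txt\ttext/plain\n"])
    latest = load_types(["a.bin\tapplication/x-example\n", "b.txt\ttext/plain\n"])
    for path, (old, new) in changed_types(earlier, latest).items():
        print(path, old, "->", new)
```

Only files present in both runs are compared here; files added or removed between runs would need to be reported separately.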
The data and graphs produced in this way are then used to build a static website via Jekyll.
...???...
...
# __future__ imports must come before any other statement in the cell:
from __future__ import print_function
import os
import subprocess
import time
def run_command(cmd):
    '''Given a command as an argument list, start it and return the Popen
    object, with stdout, stderr and stdin attached to pipes.'''
    return subprocess.Popen(cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            stdin=subprocess.PIPE)
def run_tika(fp, out_fp):
    '''Run "tika -m" on fp; save stdout to out_fp.out and any stderr to out_fp.err.'''
    start_time = os.times()[4]
    p = run_command(["tika", "-m", fp])
    # communicate() drains both pipes and waits for the process to exit, so
    # trailing output is not lost and a full stderr pipe cannot block Tika:
    out, errs = p.communicate()
    with open(out_fp + ".out", 'wb') as of:
        of.write(out)
    # Convert bytes to string, as UTF-8, and look for the content type:
    tika_type = None
    for line in out.decode('utf-8', 'replace').splitlines():
        if "Content-Type" in line:
            tika_type = line.split(':', 1)[1].strip()
    # Determine run-time:
    end_time = os.times()[4]
    run_time = end_time - start_time
    # Check for stderr
    has_stderr = False
    if len(errs) > 0:
        with open(out_fp + ".err", 'wb') as ef:
            ef.write(errs)
        has_stderr = True
    # Return:
    return {'type': tika_type, 'returncode': p.returncode, 'has_stderr': has_stderr, 'duration': run_time}
prefix = '/Users/andy/Documents/workspace/format-corpus/'
indir = prefix+'corpora/'
outdir = prefix+'tool-output/tika/'
count = 0
of = open(prefix+"scan-results.out",'w')
for root, dirs, filenames in os.walk(indir):
    rel_path = root.replace(indir, "")
    out_name = rel_path.replace("/",".")
    for f in filenames:
        # Set up input and output filenames:
        fp = os.path.join(root, f)
        rel_fp = os.path.join(rel_path,f)
        out_fp = os.path.join(outdir,out_name+"."+f)
        out_fp = out_fp.replace(" ","_")
        # Run tools
        print("Running Tika on",f)
        tika = run_tika(fp,out_fp)
        # One tab-separated row per file: path, type, return code, stderr flag, duration
        of.write("%s\t%s\t%s\t%s\t%s\n" % (rel_fp, tika['type'], tika['returncode'], tika['has_stderr'], tika['duration']))
        # Count files processed:
        count+=1
        # Only process one file per folder right now:
        break
of.close()
print("DONE")
Running Tika on Curation outline3.nmind.tar
Running Tika on 00000019.300.tif
Running Tika on Neddy_Flyer_ft_HeatherRyan.jpg
Running Tika on Aesops-Fables.azw
Running Tika on .DS_Store
Running Tika on create-variations.sh
Running Tika on lorem-ipsum-openprintcopypw.pdf
Running Tika on lorem-ipsum-plus-image-updated-opencopyprintpw.pdf
Running Tika on readme.md
Running Tika on MAPS.ARJ
Running Tika on readme.md
Running Tika on !
Running Tika on .gitignore
Running Tika on null
Running Tika on readme.md
Running Tika on 008677.pdf
Running Tika on 020747.pdf
Running Tika on balloon.j2c
Running Tika on diagram.png
Running Tika on balloon.jpg
Running Tika on balloon_eciRGBv2.tif
Running Tika on balloon.tif
Running Tika on ConceptDraw Format metadata template.csv
Running Tika on copac-uknuc.mmp
Running Tika on Curation outline 3.nmind
Running Tika on readme.md
Running Tika on KSBASE.WK1
Running Tika on PEYTREND.WK3
Running Tika on KSBASE.WQ1
Running Tika on KS4000.WQ2
Running Tika on MonteCarlo.xls
Running Tika on demoLibreOfficeImagePasteBug.odt
Running Tika on readme.md
Running Tika on .DS_Store
Running Tika on lorem-ipsum-lossless.jp2
Running Tika on convert-version.txt
Running Tika on index.md
Running Tika on MS Access Format metadata template.csv
Running Tika on acc95.mdb
Running Tika on MS Word 5 Format metadata template.csv
Running Tika on README.md
Running Tika on embedded-lucinda-sans-PDFA-1a.pdf
Running Tika on jap_91055688_japredcross_ss_ue_fnl_12212011.pdf
Running Tika on application-manifest.sha1
Running Tika on simple.odt
Running Tika on .DS_Store
Running Tika on ripole-0.1.4.tar.gz
Running Tika on bt-int.c
Running Tika on bt-int.c
Running Tika on bt-int.c
Running Tika on ecdl paris. 1997-2003.ppt
Running Tika on AREA2.MAP
Running Tika on corruptionOneByteMissing.pdf
Running Tika on .DS_Store
Running Tika on .DS_Store
Running Tika on .DS_Store
Running Tika on convert-dependencies.txt
Running Tika on corkam-osx
Running Tika on bmpjs.asm
Running Tika on BOXLAAG.STG
Running Tika on .history
Running Tika on test-hw.doc
Running Tika on .DS_Store
Running Tika on grayscale_8bpp_wrong_bpptag.tif
Running Tika on lorem-ipsum.txt
Running Tika on lorem-ipsum-pages-09-4.1-923.epub
Running Tika on lorem-ipsum-pages-09-4.1-923.doc
Running Tika on lorem-ipsum-pages-09-4.1-923.pdf
Running Tika on lorem-ipsum.rtf
Running Tika on lorem-ipsum.oo3.2.odt
Running Tika on lorem-ipsum.docx
Running Tika on About Pages '08 3.0.3.png
Running Tika on index.xml.gz
Running Tika on PkgInfo
Running Tika on Thumbnail.jpg
Running Tika on lorem-ipsum.pages
Running Tika on lorem-ipsum.pages
Running Tika on .DS_Store
Running Tika on Index.zip
Running Tika on Hardcover_bullet_black-13.png
Running Tika on BuildVersionHistory.plist
Running Tika on lorem-ipsum.pages
Running Tika on lorem-ipsum.im.jpg
Running Tika on lorem-ipsum.im.png
Running Tika on create-variations.sh
Running Tika on index.md
Running Tika on lorem-ipsum.htm
Running Tika on filelist.xml
Running Tika on animation.mov
Running Tika on www-original-proposal-MacWord-1989