%run talktools.py
Paul Ivanov
TL;DR version: "What's easy, won't last. What lasts, won't be easy."
My background: (since this will be autobiographical)
Data is hard to get: 1-2 years training an animal on a task.
"Minor brain surgery", every day of data collection for 4-6 months, every day, 6-10 hours per day.
Data is very rich. It is hoarded. With a very tight lid.
My naive conclusion: Data is precious. Free the DATA!
If data were just more accessible...
But the reality is that having accessible data is not enough...
You need the code, and I don't mean a tarball. And even that's not enough...
Git specifically?
It's not rocket science! There are sane GUIs for novice users.
I explained the benefits of version control to my biologist friend Sara, and put SmartGit on her machine. No more _v1, _v3_works, etc. "I didn't think it would be this easy".
Back in my home lab, we do computational experiments.
Unsupervised learning of natural signals. "How should the brain encode images given their properties?"
A very popular dataset, the van Hateren natural image database (camera calibrated, linearized, uncompressed, etc.); the paper to cite for it came out in 1998.
As of 2007, it had 336 citations according to Google Scholar (then the 99th most cited paper in the Vision literature).
Today that number is up to 802.
Then in 2010, an email went out to a vision community mailing list saying:
Does anyone have a copy of van Hateren database? I have been looking
for the 4000 still image database. The links to images
http://hlab.phys.rug.nl/imlib/l1_200/index.html
are broken! And it looks like there is no mirror of the full database
anywhere. I would appreciate your help and suggestions.
So I put up a mirror.
Shortly thereafter, another grad student in a lab in Germany (one of my academic nephews) did the same.
This happened again a year later with another dataset. Luckily, I had downloaded that one as well, and now host the canonical version.
Lesson learned: don't take today's data sources for granted.
Metalinks: multiple sources for the same data (HTTP, FTP, BitTorrent) listed in one container file.
Let's start using them!
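Here's a sketch of one way to generate such a container (a Metalink 4 file, RFC 5854) from Python; the dataset name, mirror URLs, and digest below are made up for illustration:

import xml.etree.ElementTree as ET

# Metalink 4 namespace (RFC 5854)
NS = "urn:ietf:params:xml:ns:metalink"
ET.register_namespace("", NS)

root = ET.Element("{%s}metalink" % NS)
f = ET.SubElement(root, "{%s}file" % NS, name="dataset.tar")
digest = ET.SubElement(f, "{%s}hash" % NS, type="sha-256")
digest.text = "0" * 64  # placeholder for the real digest of dataset.tar
for mirror in ("http://mirror-a.example.org/dataset.tar",
               "ftp://mirror-b.example.org/dataset.tar"):
    ET.SubElement(f, "{%s}url" % NS).text = mirror
ET.SubElement(f, "{%s}metaurl" % NS, mediatype="torrent").text = \
    "http://mirror-a.example.org/dataset.torrent"

ET.ElementTree(root).write("dataset.meta4", encoding="UTF-8",
                           xml_declaration=True)

Metalink-aware downloaders (aria2, for one) can then try each source in turn and verify the hash on arrival.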
Use a Python decorator attached to your data-loading routines to verify hashes on load.
Let's see an example of what that looks like.
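Here's a minimal sketch of such a decorator; the digest and the load_images loader below are hypothetical, just to show the shape:

import functools
import hashlib
import numpy as np

def verify_sha1(expected):
    """Refuse to load a file whose SHA-1 digest doesn't match."""
    def decorator(loader):
        @functools.wraps(loader)
        def wrapper(filename, *args, **kwargs):
            h = hashlib.sha1()
            with open(filename, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    h.update(chunk)
            if h.hexdigest() != expected:
                raise IOError("%s: hash mismatch, refusing to load" % filename)
            return loader(filename, *args, **kwargs)
        return wrapper
    return decorator

# hypothetical digest and loader, for illustration only
@verify_sha1('da39a3ee5e6b4b0d3255bfef95601890afd80709')
def load_images(filename):
    return np.fromfile(filename, dtype=np.uint16)  # e.g. raw 16-bit images

Hash checks assume the bytes are already on your disk, though. For getting and keeping the data itself, there's git annex.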
I have some data, and a problem: I don't want to lose my data.
So I make a copy -- and now I have two problems.
%%bash
cd ~/cur/siam
# annexed content is kept read-only; restore write permission so rm can remove it
chmod -R +rwx tmp fake_usb
rm -rf tmp fake_usb  # start with a clean slate
mkdir -p tmp
cd tmp
git init .
git annex init "local laptop"
# %%bash cells run in a subshell, so move the notebook's own working directory too
%cd ~/cur/siam/tmp
%%bash
# let's just make a file to see how annex works
echo "pretend this is a large file" > original.dus
for x in {1..10000}; do echo GATTACA >> original.dus; done
ls -lh
This is an ~80 KB file; let's check it into git annex.
!git annex add original.dus
Let's see what happened to it:
!ls -lh
So by annexing the file, git annex hashed its contents, moved the content to a file named after that hash (content-based addressing), and left behind a symbolic link pointing at it.
It turns out git annex also staged this symbolic link for us in git.
!git status
Let's check that into git.
!git commit -m"original data of unusual size checked in"
!git log
What did we actually check in? Just one line: a symbolic link pointing to the contents of original.dus.
!git log -p
!git annex whereis ./original.dus
Let's copy this repository to an external hard drive:
!git clone ./ ../fake_usb
cd ../fake_usb/
!git annex init "pi's external harddrive"
ls -al
!head original.dus  # this fails here: we have the symlink but not the content yet
On the external harddrive, we only have a catalogue of the annexed files. We can grab them explicitly:
!git annex whereis original.dus
!git annex get original.dus
ls -al
!head original.dus
Here's an example of one of my annexes: the total known annex size is 557 GB, but this laptop has only 6 GB of it (and it only has a 100 GB SSD).
The key point is that the catalogue is available in a very lightweight manner. Everything in the catalogue is just a git annex get away.
%%bash
# this cell will only run on pi's computer
cd ~/annex
git annex status  # reports, among other things, local vs. total known annex size
Ok, now that we have data under control, let's move on to doing something with it... (code)
import numpy as np
np.test()  # run NumPy's own test suite: calibrate the instrument before trusting it
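The same idea applies to your own analysis code. Here's a minimal sketch, with a hypothetical rescale function standing in for the real thing:

import numpy as np

def rescale(x):
    """Map an array linearly onto the interval [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def test_rescale():
    out = rescale([2, 4, 6])
    assert out.min() == 0.0
    assert out.max() == 1.0

test_rescale()  # silence means the calibration passed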
ttyrec: a lightweight capture tool (I use it daily; it helps me account for how I spend my time). It just writes everything you see in the shell to a file, with timing information, which you can later play back.
demo in the shell: ttyplay ~/2012-08-01_2.tty
This stuff is obvious here, but in another context I would mention these features of the IPython Notebook.
- simple availability is not enough
- let's start using Metalinks, BitTorrent, etc.
- hashes (Python decorators: verify on-load)
- git annex
- "Running software without a test suite is like running experiments without calibrating your instruments"
- ttyrec
- IPython Notebook
Paul Ivanov
"The task must be made difficult, for only the difficult inspires the noble-hearted." -- Kierkegaard