# Alistair Miles <alimanfoo@googlemail.com>
import datetime
print(datetime.datetime.now().isoformat())
2015-06-24T21:54:06.202128
This notebook describes conventions used by the Anopheles gambiae 1000 genomes project for storing genome variation data using the HDF5 file format.
The conventions are illustrated with examples using data from the Ag1000G phase 1 AR2 data release. We assume data from the FTP site have been downloaded to the local file system and are stored in a folder called "data" relative to the current working directory.
The examples below use the Python 3 programming language and the h5py library for accessing data in the HDF5 files. However, HDF5 files can also be accessed from a range of other programming environments, including C++, Java and R.
Note that the HDF5 files described below have been built by extracting data from VCF files using the vcfnp library. VCF is a very flexible format in terms of the data that can be represented. Some of this flexibility is lost in the translation to HDF5 datasets, in order to achieve the significant performance improvements that can be gained from using HDF5. For example, in VCF files an INFO field can have a variable number of items, whereas in the translation to HDF5 we have fixed the shape for all datasets. For the use cases we have encountered so far in Ag1000G this has not been an issue, however in other situations it may become relevant.
The Ag1000G phase 1 AR2 data release comprises variant calls on 765 individual mosquito specimens, sequenced to around 30X using the Illumina platform. Variants were called using GATK UnifiedGenotyper.
Here is a list of all the HDF5 files we have created for this variant call set:
!ls -lh data/*.h5
-r--r--r-- 1 aliman aliman 38G Sep 4 2014 data/ag1000g.phase1.AR2.2L.h5 -r--r--r-- 1 aliman aliman 18G Sep 4 2014 data/ag1000g.phase1.AR2.2L.PASS.h5 -r--r--r-- 1 aliman aliman 44G Sep 4 2014 data/ag1000g.phase1.AR2.2R.h5 -r--r--r-- 1 aliman aliman 23G Sep 4 2014 data/ag1000g.phase1.AR2.2R.PASS.h5 -r--r--r-- 1 aliman aliman 32G Sep 4 2014 data/ag1000g.phase1.AR2.3L.h5 -r--r--r-- 1 aliman aliman 16G Sep 4 2014 data/ag1000g.phase1.AR2.3L.PASS.h5 -r--r--r-- 1 aliman aliman 44G Sep 4 2014 data/ag1000g.phase1.AR2.3R.h5 -r--r--r-- 1 aliman aliman 21G Sep 4 2014 data/ag1000g.phase1.AR2.3R.PASS.h5 -r--r--r-- 1 aliman aliman 8.1K Sep 4 2014 data/ag1000g.phase1.AR2.h5 -r--r--r-- 1 aliman aliman 8.1K Sep 4 2014 data/ag1000g.phase1.AR2.PASS.h5 -r--r--r-- 1 aliman aliman 12G Sep 4 2014 data/ag1000g.phase1.AR2.UNKN.h5 -r--r--r-- 1 aliman aliman 625M Sep 4 2014 data/ag1000g.phase1.AR2.UNKN.PASS.h5 -r--r--r-- 1 aliman aliman 17G Sep 4 2014 data/ag1000g.phase1.AR2.X.h5 -r--r--r-- 1 aliman aliman 7.9G Sep 4 2014 data/ag1000g.phase1.AR2.X.PASS.h5 -r--r--r-- 1 aliman aliman 22M Sep 4 2014 data/ag1000g.phase1.AR2.Y_unplaced.h5 -r--r--r-- 1 aliman aliman 736K Sep 4 2014 data/ag1000g.phase1.AR2.Y_unplaced.PASS.h5
The A. gambiae reference genome has 7 chromosomes: 2L, 2R, 3L, 3R, X, Y_unplaced, UNKN.
There are two HDF5 files for each chromosome. One file (e.g., "ag1000g.phase1.AR2.2L.h5") has data on all the variants discovered on that chromosome, including those that fail quality filters. The other file (e.g., "ag1000g.phase1.AR2.2L.PASS.h5") has data on only variants that pass all quality filters.
We have organised the files by chromosome for convenience - they are quicker to build that way, and analyses tend to work with data from a single chromosome at a time. To make it easier to work with data from multiple chromosomes, we have also created some entry point HDF5 files which do not contain any data, but have external links out to the other HDF5 files, to make it appear as though all data are contained a single HDF5 file. For example:
import h5py
callset = h5py.File("data/ag1000g.phase1.AR2.h5", mode='r')
callset
<HDF5 file "ag1000g.phase1.AR2.h5" (mode r)>
This file has a group for each chromosome:
list(callset.keys())
['samples', '2L', '2R', '3L', '3R', 'UNKN', 'X', 'Y_unplaced']
callset['2L']
<HDF5 group "/2L" (3 members)>
There is also a dataset containing identifiers for the 765 mosquito DNA samples that were sequenced and genotyped:
samples = callset['samples']
samples
<HDF5 dataset "samples": shape (765,), type "|S8">
samples[:10]
array([b'AB0085-C', b'AB0087-C', b'AB0088-C', b'AB0089-C', b'AB0090-C', b'AB0091-C', b'AB0092-C', b'AB0094-C', b'AB0095-C', b'AB0097-C'], dtype='|S8')
For now, let's work with a single chromosome:
chrom = '2L'
Data for each chromosome are further organised into two main groups, "variants" and "calldata":
variants = callset[chrom]['variants']
variants
<HDF5 group "/2L/variants" (50 members)>
calldata = callset[chrom]['calldata']
calldata
<HDF5 group "/2L/calldata" (5 members)>
The "variants" group contains a number of datasets, where each dataset is a column of data on the variants in the callset.
There are columns corresponding to the fixed VCF fields, e.g.:
variants['CHROM']
<HDF5 dataset "CHROM": shape (19453287,), type "|S12">
variants['POS']
<HDF5 dataset "POS": shape (19453287,), type "<i4">
variants['REF']
<HDF5 dataset "REF": shape (19453287,), type "|S1">
variants['ALT']
<HDF5 dataset "ALT": shape (19453287, 3), type "|S1">
You can see from the shape of these datasets that there are 19,453,287 variants on chromosome 2L in the callset.
There are also columns extracted from the VCF INFO field. Here's the full list:
list(variants.keys())
['ABHet', 'ABHom', 'AC', 'AF', 'ALT', 'AN', 'BaseCounts', 'BaseQRankSum', 'CHROM', 'DP', 'DS', 'Dels', 'EFF', 'FILTER_FS', 'FILTER_HRun', 'FILTER_HighDP', 'FILTER_LowDP', 'FILTER_LowQual', 'FILTER_MQ', 'FILTER_Multiallelic', 'FILTER_PASS', 'FILTER_QD', 'FILTER_ReadPosRankSum', 'FILTER_RepeatMask', 'FS', 'GC', 'HRun', 'HW', 'HaplotypeScore', 'InbreedingCoeff', 'LOF', 'MLEAC', 'MLEAF', 'MQ', 'MQ0', 'MQRankSum', 'NDA', 'NMD', 'OND', 'POS', 'QD', 'QUAL', 'REF', 'RPA', 'RU', 'ReadPosRankSum', 'STR', 'is_snp', 'num_alleles', 'svlen']
Note that the FILTER field from the VCF has been broken into a number of separate Boolean datasets, e.g.:
variants['FILTER_PASS']
<HDF5 dataset "FILTER_PASS": shape (19453287,), type "|b1">
An entire column can be extracted from the HDF5 and loaded into memory, e.g.:
filter_pass = variants['FILTER_PASS'][:]
filter_pass
array([False, False, False, ..., False, False, False], dtype=bool)
Here's a quick way to count the number of variants passing all filters:
import numpy as np
np.count_nonzero(filter_pass)
9085177
It's also straightforward to plot variant annotation data, e.g.:
import matplotlib.pyplot as plt
%matplotlib inline
dp = variants['DP'][:]
plt.hist(dp, bins=100);
Columns can also be used to query variants, e.g.:
dp = variants['DP'][:]
qd = variants['QD'][:]
mq = variants['MQ'][:]
query = (dp < 40000) & (mq > 30) & (qd > 5)
query
array([False, False, False, ..., False, False, False], dtype=bool)
Note that some columns are 1-dimensional, e.g.:
variants['REF']
<HDF5 dataset "REF": shape (19453287,), type "|S1">
variants['REF'][:10]
array([b'C', b'G', b'G', b'A', b'G', b'A', b'A', b'T', b'T', b'T'], dtype='|S1')
...however some columns are 2-dimensional, e.g.:
variants['ALT']
<HDF5 dataset "ALT": shape (19453287, 3), type "|S1">
variants['ALT'][:10]
array([[b'T', b'', b''], [b'A', b'', b''], [b'A', b'', b''], [b'G', b'', b''], [b'A', b'', b''], [b'G', b'', b''], [b'C', b'', b''], [b'C', b'', b''], [b'A', b'', b''], [b'C', b'', b'']], dtype='|S1')
We fixed the shape of the ALT dataset to have 3 columns, because we were only dealing with SNP variants and so at most we would only ever observe 3 alternate alleles. For any variant with less than 3 alternate alleles, the extra columns are filled with an empty string (e.g., all the variants above are biallelic, having only one alternate allele).
The "calldata" group contains several datasets extracted from the sample columns of the VCF:
list(calldata.keys())
['AD', 'DP', 'GQ', 'genotype', 'is_called']
calldata['DP']
<HDF5 dataset "DP": shape (19453287, 765), type "<u2">
calldata['GQ']
<HDF5 dataset "GQ": shape (19453287, 765), type "|u1">
calldata['genotype']
<HDF5 dataset "genotype": shape (19453287, 765, 2), type "|i1">
Here, AD, DP and GQ have come straight from the VCF FORMAT fields.
The genotype dataset corresponds to the "GT" FORMAT field, however it has been parsed into a pair of integers for each genotype call.
For example, the genotype call for the 1000th variant in the 10th sample is homozygous reference ("0/0"):
calldata['genotype'][999, 9]
array([0, 0], dtype=int8)
Note that indexing into a dataset is 0-based.
Rows, columns and contiguous slices of an HDF5 dataset can be accessed very efficiently.
For example, load genotype calls for the first 10000 variants:
g = calldata['genotype'][:10000]
g.shape
(10000, 765, 2)
For example, load genotype calls for the 10th sample:
gs = calldata['genotype'][:, 9]
gs.shape
(19453287, 2)
How efficient these access operations are depends a lot on how the dataset has been chunked. Here's what we've used, we found it gives a good compromise between row-wise and column-wise access:
calldata['genotype'].chunks
(6553, 10, 2)
The vcfnp library was used to parse the data from the original VCF files and load them into HDF5 files, see the GitHub page for usage information.
The scikit-allel library has various functions for exploring, visualising and analysing variant call data that has been organised in HDF5 files in a similar way to the conventions described here.
Feel free to email (alimanfoo@googlemail.com) if you have any questions.
import sys
print('Python ' + '.'.join(map(str, sys.version_info[:3])))
print('numpy ' + np.__version__)
print('h5py ' + h5py.__version__)
Python 3.4.2 numpy 1.9.2 h5py 2.5.0