#!/usr/bin/env python
# coding: utf-8

# # Ag1000G project - conventions for storing variant call data in HDF5 files

# In[1]:


# Alistair Miles
import datetime
print(datetime.datetime.now().isoformat())


# This notebook describes conventions used by the [Anopheles gambiae 1000 genomes project](http://www.malariagen.net/ag1000g) for storing genome variation data using the [HDF5 file format](https://www.hdfgroup.org/HDF5/).
#
# The conventions are illustrated with examples using data from the [Ag1000G phase 1 AR2 data release](http://www.malariagen.net/data/ag1000g-phase1-AR2). We assume data from the FTP site have been downloaded to the local file system and are stored in a folder called "data" relative to the current working directory.
#
# The examples below use the Python 3 programming language and the [h5py](http://docs.h5py.org/en/latest/) library for accessing data in the HDF5 files. However, HDF5 files can also be accessed from a range of other programming environments, including C++, Java and R.
#
# Note that the HDF5 files described below have been built by extracting data from [VCF files](http://www.1000genomes.org/wiki/analysis/variant%20call%20format/vcf-variant-call-format-version-41) using the [vcfnp](https://github.com/alimanfoo/vcfnp) library. VCF is a very flexible format in terms of the data that can be represented. Some of this flexibility is lost in the translation to HDF5 datasets, in order to achieve the significant performance improvements that can be gained from using HDF5. For example, in VCF files an INFO field can have a variable number of items, whereas in the translation to HDF5 we have fixed the shape of all datasets. For the use cases we have encountered so far in Ag1000G this has not been an issue, however in other situations it may become relevant.

# ## File organisation

# The Ag1000G phase 1 AR2 data release comprises variant calls on 765 individual mosquito specimens, sequenced to around 30X coverage using the Illumina platform. Variants were called using GATK UnifiedGenotyper.
#
# Here is a list of all the HDF5 files we have created for this variant call set:

# In[2]:


get_ipython().system('ls -lh data/*.h5')


# The *A. gambiae* reference genome comprises 7 sequences: the chromosome arms 2L, 2R, 3L, 3R and X, plus Y_unplaced and UNKN.
#
# There are two HDF5 files for each chromosome. One file (e.g., "ag1000g.phase1.AR2.2L.h5") has data on all the variants discovered on that chromosome, including those that fail quality filters. The other file (e.g., "ag1000g.phase1.AR2.2L.PASS.h5") has data only on variants that pass all quality filters.
#
# We have organised the files by chromosome for convenience: they are quicker to build that way, and analyses tend to work with data from a single chromosome at a time. To make it easier to work with data from multiple chromosomes, we have also created some entry point HDF5 files which do not contain any data themselves, but have external links out to the other HDF5 files, making it appear as though all data are contained in a single HDF5 file.
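# As an aside, the following cell is a minimal sketch of how an entry point file like this could be built using h5py external links. It illustrates the general approach rather than the exact code used to build the release files; the output file name is hypothetical, and it assumes that each per-chromosome file stores its data under a root group named after the chromosome.

# In[ ]:


import h5py

# illustrative sketch only: build an entry point file containing no data,
# just external links out to the per-chromosome HDF5 files
with h5py.File('data/example_entry_point.h5', mode='w') as entry:
    for chrom in '2L', '2R', '3L', '3R', 'X':
        # each chromosome group becomes an external link to the group of
        # the same name in the corresponding per-chromosome file
        entry[chrom] = h5py.ExternalLink(
            'ag1000g.phase1.AR2.{}.h5'.format(chrom), '/' + chrom)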
# For example, here is the entry point file for the whole phase 1 AR2 callset:

# In[3]:


import h5py
callset = h5py.File("data/ag1000g.phase1.AR2.h5", mode='r')
callset


# This file has a group for each chromosome:

# In[4]:


list(callset.keys())


# In[5]:


callset['2L']


# There is also a dataset containing identifiers for the 765 mosquito DNA samples that were sequenced and genotyped:

# In[6]:


samples = callset['samples']
samples


# In[7]:


samples[:10]


# ## Internal layout

# For now, let's work with a single chromosome:

# In[8]:


chrom = '2L'


# Data for each chromosome are further organised into two main groups, "variants" and "calldata":

# In[9]:


variants = callset[chrom]['variants']
variants


# In[10]:


calldata = callset[chrom]['calldata']
calldata


# ### Variants

# The "variants" group contains a number of datasets, where each dataset is a column of data on the variants in the callset.
#
# There are columns corresponding to the fixed VCF fields, e.g.:

# In[11]:


variants['CHROM']


# In[12]:


variants['POS']


# In[13]:


variants['REF']


# In[14]:


variants['ALT']


# You can see from the shape of these datasets that there are 19,453,287 variants on chromosome 2L in the callset.

# There are also columns extracted from the VCF INFO field. Here's the full list:

# In[15]:


list(variants.keys())


# Note that the FILTER field from the VCF has been broken into a number of separate Boolean datasets, e.g.:

# In[16]:


variants['FILTER_PASS']


# An entire column can be extracted from the HDF5 file and loaded into memory, e.g.:

# In[17]:


filter_pass = variants['FILTER_PASS'][:]
filter_pass


# Here's a quick way to count the number of variants passing all filters:

# In[18]:


import numpy as np
np.count_nonzero(filter_pass)


# It's also straightforward to plot variant annotation data, e.g.:

# In[19]:


import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
dp = variants['DP'][:]
plt.hist(dp, bins=100);


# Columns can also be used to query variants, e.g.:

# In[20]:


dp = variants['DP'][:]
qd = variants['QD'][:]
mq = variants['MQ'][:]
query = (dp < 40000) & (mq > 30) & (qd > 5)
query


# Note that some columns are 1-dimensional, e.g.:

# In[21]:


variants['REF']


# In[22]:


variants['REF'][:10]


# ...however some columns are 2-dimensional, e.g.:

# In[23]:


variants['ALT']


# In[24]:


variants['ALT'][:10]


# We fixed the shape of the *ALT* dataset to have 3 columns, because we were only dealing with SNP variants and so at most we would only ever observe 3 alternate alleles. For any variant with fewer than 3 alternate alleles, the extra columns are filled with an empty string (e.g., all the variants above are biallelic, having only one alternate allele).

# ### Calldata

# The "calldata" group contains several datasets extracted from the sample columns of the VCF:

# In[25]:


list(calldata.keys())


# In[26]:


calldata['DP']


# In[27]:


calldata['GQ']


# In[28]:


calldata['genotype']


# Here, *AD*, *DP* and *GQ* have come straight from the VCF FORMAT fields.
#
# The *genotype* dataset corresponds to the "GT" FORMAT field, however it has been parsed into a pair of integers for each genotype call.
#
# For example, the genotype call for the 1000th variant in the 10th sample is homozygous reference ("0/0"):

# In[29]:


calldata['genotype'][999, 9]


# Note that indexing into a dataset is 0-based.
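# As an aside, the pair of integers can easily be mapped back to a VCF-style genotype string. The small helper below is just an illustration (it is not part of the callset or of any library), and assumes that a missing allele call is encoded as a negative value such as -1:

# In[ ]:


def gt_string(call):
    """Format a genotype call (a pair of allele indices) as a VCF-style string."""
    # negative values are assumed to encode missing allele calls
    return '/'.join('.' if allele < 0 else str(allele) for allele in call)

# the same homozygous reference call as above, formatted as "0/0"
gt_string(calldata['genotype'][999, 9])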
# Rows, columns and contiguous slices of an HDF5 dataset can be accessed very efficiently.
#
# For example, load genotype calls for the first 10000 variants:

# In[30]:


g = calldata['genotype'][:10000]
g.shape


# For example, load genotype calls for the 10th sample:

# In[31]:


gs = calldata['genotype'][:, 9]
gs.shape


# How efficient these access operations are depends a lot on how the dataset has been chunked. Here's what we've used; we found it gives a good compromise between row-wise and column-wise access. (A sketch of how chunking could be specified when creating a similar dataset is given at the end of the appendix below.)

# In[32]:


calldata['genotype'].chunks


# ## Further reading

# The [vcfnp](https://github.com/alimanfoo/vcfnp) library was used to parse the data from the original VCF files and load them into HDF5 files; see the GitHub page for usage information.
#
# The [scikit-allel](http://scikit-allel.readthedocs.org/en/latest/) library has various functions for exploring, visualising and analysing variant call data that has been organised in HDF5 files in a similar way to the conventions described here.
#
# Feel free to email (alimanfoo@googlemail.com) if you have any questions.

# ## Appendix

# In[33]:


import sys
print('Python ' + '.'.join(map(str, sys.version_info[:3])))
print('numpy ' + np.__version__)
print('h5py ' + h5py.__version__)
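# As referenced above, here is a minimal sketch of how explicit chunking and compression could be specified when creating a genotype-like dataset with h5py. The file name, array contents, chunk shape and compression settings below are purely illustrative; they are not the settings used to build the Ag1000G files.

# In[ ]:


# illustrative sketch only: create a small genotype-like dataset with
# explicit chunking and gzip compression
with h5py.File('data/example_calldata.h5', mode='w') as example:
    n_variants, n_samples, ploidy = 10000, 765, 2
    # placeholder data, all homozygous reference calls
    genotype = np.zeros((n_variants, n_samples, ploidy), dtype='i1')
    example.create_dataset(
        'calldata/genotype',
        data=genotype,
        # chunks spanning a block of variants and a block of samples give a
        # compromise between row-wise and column-wise access
        chunks=(1000, 100, ploidy),
        compression='gzip',
        compression_opts=3)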