Getting Started with Big Data Genomics

Big Data Genomics is a collection of schemas, libraries, and command-line utilities meant to standardize and scale the processing of the massive amounts of data generated by next-generation sequencing (NGS). All Big Data Genomics tools use a common representation of NGS entities defined in an Avro schema. This allows for much greater flexibility in the tools and languages that can be used to process genomic data.

Avro is useful because it specifies both an object serialization format and an object container file format for storing many objects. You can read and write Avro data from many languages, and the file format has been designed for parallel processing by Apache Hadoop MapReduce or Apache Spark. In addition to Avro's row-based container file format, objects conforming to an Avro schema can be stored in a file using the columnar Parquet container file format. Unfortunately, you can only work with Parquet files in Java today, though there are clients under development in other languages.

Loading the Schemas

The Big Data Genomics schemas are currently defined in a single file written in the Avro interface description language (IDL). In order to be parsed by Python's Avro library, the IDL representation must be transformed to the standard JSON Avro schema representation using avro-tools.
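As a rough sketch of that conversion step, the snippet below builds the avro-tools command line using its `idl` subcommand. The jar path and the input file name `adam.avdl` are placeholders assumed for illustration; only the output name `adam.avpr` comes from this notebook.

```python
import subprocess


def idl_to_avpr_command(tools_jar, idl_path, avpr_path):
    """Build the avro-tools command that turns an IDL file into a JSON protocol."""
    return ["java", "-jar", tools_jar, "idl", idl_path, avpr_path]


# Placeholder paths (assumptions): point these at your own jar and IDL file.
command = idl_to_avpr_command("avro-tools.jar", "adam.avdl", "adam.avpr")
print(" ".join(command))

# Uncomment to run the conversion (needs Java and the avro-tools jar):
# subprocess.check_call(command)
```

The call is left commented out so the cell can be run without Java installed.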

Once this is done, the schema may be loaded and parsed by the Python library. We want to inspect the types (which will include records) used by Big Data Genomics tools.

The IDL representation is much easier to read than the JSON representation, so we'll be looking at the IDL while working with the "raw" JSON.

In [1]:

import avro.protocol as avpr

In [2]:

ADAM_formats = avpr.parse(open("adam.avpr", "r").read()) 
sorted([(v.type, k) for k, v in ADAM_formats.types_dict.items()])

Out[2]:

[('enum', u'ADAMGenotypeAllele'),
 ('enum', u'ADAMGenotypeType'),
 ('enum', u'Base'),
 (u'record', u'ADAMContig'),
 (u'record', u'ADAMDatabaseVariantAnnotation'),
 (u'record', u'ADAMGenotype'),
 (u'record', u'ADAMNestedPileup'),
 (u'record', u'ADAMNucleotideContigFragment'),
 (u'record', u'ADAMPileup'),
 (u'record', u'ADAMRecord'),
 (u'record', u'ADAMVariant'),
 (u'record', u'VariantCallingAnnotations'),
 (u'record', u'VariantEffect')]

Reading and Writing Schemas

We need to be able to read and write records for a given schema from and to disk. We will demonstrate that for an arbitrary record (a mock ADAMContig) below:

In [3]:

from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

In [4]:

ADAM_contig_schema = ADAM_formats.types_dict['ADAMContig']
ADAM_contig_schema.to_json()

Out[4]:

{'fields': [{'default': None, 'name': u'contigId', 'type': [u'null', u'int']},
  {'default': None, 'name': u'contigName', 'type': [u'null', u'string']},
  {'default': None, 'name': u'contigLength', 'type': [u'null', u'long']},
  {'default': None, 'name': u'contigMD5', 'type': [u'null', u'string']},
  {'default': None, 'name': u'referenceURL', 'type': [u'null', u'string']}],
 'name': u'ADAMContig',
 'namespace': u'org.bdgenomics.adam.avro',
 'type': u'record'}

In [5]:

# this is the Python representation of the Avro object. 
ADAM_contig = {'contigId': 9230,
               'contigName':"1Nabc", 
               'contigLength':7781,
               'contigMD5':"8743b52063cd84097a65d1633f5c74f5",
               'referenceURL': 'http://data.dna/223'} # mock data

In [6]:

# Avro container files are binary, so open the file in binary mode.
with DataFileWriter(open("contigs.avro", "wb"), DatumWriter(), ADAM_contig_schema) as contig_writer:
    contig_writer.append(ADAM_contig)

In [7]:

# Read the binary container file back; each datum comes out as a Python dict.
with DataFileReader(open("contigs.avro", "rb"), DatumReader()) as contigs:
    for contig in contigs:
        print(contig)

{u'contigMD5': u'8743b52063cd84097a65d1633f5c74f5', u'contigId': 9230, u'contigLength': 7781, u'referenceURL': u'http://data.dna/223', u'contigName': u'1Nabc'}
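The Avro writer checks each datum against the schema when it is appended, so a misspelled field name or a wrongly typed value only surfaces at write time. The sketch below is a hand-rolled pre-check for ADAMContig dictionaries, not part of the Avro library. The field table is copied from the schema shown earlier; the function name and the Avro-to-Python type mapping are assumptions, written for Python 3.

```python
# Field names and union types copied from the ADAMContig schema shown earlier.
CONTIG_FIELDS = {
    "contigId": ["null", "int"],
    "contigName": ["null", "string"],
    "contigLength": ["null", "long"],
    "contigMD5": ["null", "string"],
    "referenceURL": ["null", "string"],
}

# Rough Avro-to-Python mapping (an assumption of this sketch; the Avro
# library also applies range checks that are not reproduced here).
PYTHON_TYPES = {"null": type(None), "int": int, "long": int, "string": str}


def contig_problems(record):
    """Return a list of problems found in record; an empty list means it looks fine."""
    problems = ["unknown field: %s" % name
                for name in sorted(record) if name not in CONTIG_FIELDS]
    for name, union in sorted(CONTIG_FIELDS.items()):
        allowed = tuple(PYTHON_TYPES[branch] for branch in union)
        # A missing key reads as None, which every union above accepts.
        if not isinstance(record.get(name), allowed):
            problems.append("bad type for %s" % name)
    return problems


good = {"contigId": 9230, "contigName": "1Nabc", "contigLength": 7781}
bad = {"contigId": "9230", "contigNam": "1Nabc"}  # wrong type, misspelled key
print(contig_problems(good))
print(contig_problems(bad))
```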

The Schemas Themselves

We can inspect the JSON schema of a record to see which fields (and their respective types) that record contains.

We'll write a few records to a new file, using some arbitrary data (but of the correct type), and then load that file and print the Python representations of those records. This is about all there is to do with these records.

ADAMRecord

An ADAMRecord stores the alignment information of sequence reads against a reference sequence. Additional metadata is stored alongside that core information, and every field is annotated below. It is the ADAM equivalent to a record in a SAM/BAM file (but much easier to read).

Below is a real example of an ADAMRecord, exported to JSON. It was generated by converting an existing BAM file to ADAM format (e.g. adam bam2adam hg37.bam hg37.bam.adam), and then printing it as JSON (e.g. adam print hg37.bam.adam).

An ADAMRecord, like a record in a BAM file, stores a read sequence with its quality data (Phred scores encoded as ASCII characters with an offset of 33), along with alignment information stored in extended CIGAR format. More information on all fields can be found in the comments on the IDL representation below.
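To make the quality encoding concrete, here is a minimal sketch that turns a quality string back into integer Phred scores by subtracting the offset of 33 from each character code. The helper name is made up for illustration; the input is the first 13 characters of the qual value in the example record below.

```python
def phred_scores(qual):
    """Decode an ASCII+33 quality string into a list of integer Phred scores."""
    return [ord(char) - 33 for char in qual]


# First 13 characters of the "qual" field of the example ADAMRecord.
qual_prefix = "EHHHFGDGHFHF8"
scores = phred_scores(qual_prefix)
print(scores)
```

Each score lines up with the base at the same index in the sequence field.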

Additionally, important metadata is parsed out of the BAM file and stored as first-class key-value pairs. Nonstandard tags (and, for now, many standard ones) are stored as a single string under attributes, in the same format as found in SAM files.
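Because the attributes string keeps the SAM layout (tab-separated TAG:TYPE:VALUE triples), it can be unpacked with plain string splitting. The sketch below is illustrative only: the function name is hypothetical, only the integer type code `i` is converted, and the input is an abridged copy of the attributes value from the example record below.

```python
def parse_attributes(attributes):
    """Split a SAM-style attribute string into {tag: (type_code, value)}."""
    parsed = {}
    for field in attributes.split("\t"):
        # Split at most twice so that values containing ':' stay intact.
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            value = int(value)
        parsed[tag] = (type_code, value)
    return parsed


# Abridged from the "attributes" field of the example ADAMRecord.
attributes = "XT:A:U\tMQ:i:60\tXO:i:0\tNM:i:0\tRG:Z:20GAV.4"
tags = parse_attributes(attributes)
print(tags["RG"])
```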

In [8]:

ADAM_record = {"referenceName": "20", "referenceId": 19, "start": 19893804, "mapq": 60, 
               "readName": "20GAVAAXX100126:4:64:6132:191287", 
               "sequence": "GTTTTCTATGAAGTTATTTTCTAGGGATTCTGTTTTGTTGTCGTTGTTCACACTGTAGCTCTCAGATCTTACTGTTTTTTTTTTAATTGTGATAAAGCATA", 
               "mateReference": "20", "mateAlignmentStart": 19893476, "cigar": "101M", 
               "qual": "EHHHFGDGHFHF8EEFB=B@=GHFFBAA@8??>IHHEIHEH>EHH@HFG>GEGHEGHFA<GDGHFFHGHGFCFCE@EEHHHHHGGFHGEGFFFGGGFBDDD", 
               "recordGroupName": "20GAV.4", "recordGroupId": 11, "readPaired": True, "properPair": True, "readMapped": True, 
               "mateMapped": True, "readNegativeStrand": True, "mateNegativeStrand": False, "firstOfPair": True, 
               "secondOfPair": False, "primaryAlignment": True, "failedVendorQualityChecks": False, 
               "duplicateRead": False, "mismatchingPositions": "101", 
               "attributes": "XT:A:U\tOQ:Z:HHHHFHGHHHHG=FGGC?DC@HHHHFDDA:?E>HHHHHHHHHHHHFHGHGHGHHHHHHF@HFHHHHHHHHHGGGGGGGHHHHHHHHHHHHHHHHHHHGHHH\tMQ:i:60\tBQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\tXO:i:0\tXM:i:0\tSM:i:37\tNM:i:0\tAM:i:37\tXG:i:0\tRG:Z:20GAV.4\tX1:i:0\tX0:i:1", 
               "recordGroupSequencingCenter": "BI", "recordGroupDescription": None, "recordGroupRunDateEpoch": None, 
               "recordGroupFlowOrder": None, "recordGroupKeySequence": None, "recordGroupLibrary": "Solexa-18484", 
               "recordGroupPredictedMedianInsertSize": None, "recordGroupPlatform": "illumina", "recordGroupPlatformUnit": "20GAVAAXX100126.4", 
               "recordGroupSample": "NA12878", 
               "mateReferenceId": 19, "referenceLength": 63025520, "referenceUrl": None, 
               "mateReferenceLength": 63025520, "mateReferenceUrl": None, "origQual": None}

Below is the record from the SAM file that is represented above:

20GAVAAXX100126:4:64:6132:191287 83 20 19893805 60 101M = 19893477 -428 GTTTTCTATGAAGTTATTTTCTAGGGATTCTGTTTTGTTGTCGTTGTTCACACTGTAGCTCTCAGATCTTACTGTTTTTTTTTTAATTGTGATAAAGCATA EHHHFGDGHFHF8EEFB=B@=GHFFBAA@8??>IHHEIHEH>EHH@HFG>GEGHEGHFA
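The FLAG column (83) and the 1-based POS column of this SAM line map onto the boolean fields and the 0-based start of the ADAMRecord above. The sketch below decodes a subset of the flag bits using the values given in the SAM specification. The table and function names are hypothetical, and this is not ADAM's own conversion code.

```python
# (bit, ADAMRecord field, inverted?) for a subset of the SAM FLAG bits.
# "Inverted" fields are True when the SAM bit is NOT set.
FLAG_BITS = [
    (0x1, "readPaired", False),
    (0x2, "properPair", False),
    (0x4, "readMapped", True),
    (0x8, "mateMapped", True),
    (0x10, "readNegativeStrand", False),
    (0x20, "mateNegativeStrand", False),
    (0x40, "firstOfPair", False),
    (0x80, "secondOfPair", False),
    (0x200, "failedVendorQualityChecks", False),
    (0x400, "duplicateRead", False),
]


def decode_flag(flag):
    """Translate a SAM FLAG integer into ADAMRecord-style booleans."""
    fields = {}
    for bit, name, inverted in FLAG_BITS:
        is_set = bool(flag & bit)
        fields[name] = (not is_set) if inverted else is_set
    return fields


flags = decode_flag(83)   # FLAG column of the SAM line above
start = 19893805 - 1      # SAM POS is 1-based; ADAM's start is 0-based
print(flags["readNegativeStrand"], start)
```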

The IDL for ADAMRecord is below, with all fields commented.

record ADAMRecord {

    ////////////////////////////
    // Main alignment fields  //
    ////////////////////////////
    // This is the Query Template Name (the name of the read sequence "template")
    // Corresponds to QNAME (column 1) in SAM.
    union { null, string } readName = null;
    // The reference sequence name of the alignment.
    // Corresponds to RNAME (column 3) in SAM.
    union { null, string } referenceName = null;
    // Reference position start of the first matching base.
    // Corresponds to POS (column 4) in SAM (adjusted to be 0-indexed).
    union { null, long } start = null;
    // The PHRED-scored mapping quality 
    // -10Log_10 Pr{the mapping position is wrong}.
    // Corresponds to MAPQ (column 5) in SAM.
    union { null, int } mapq = null;
    // CIGAR string, indicating alignment, indels, and mismatches against the reference.
    // Corresponds to CIGAR (column 6) in SAM.
    union { null, string } cigar = null;
    // The referenceName of the mate template (the next read in the template).
    // Corresponds to RNEXT (column 7) in SAM.
    union { null, string } mateReference = null;
    // The position of the primary alignment of the next read in the template.
    // Corresponds to PNEXT (column 8) in SAM (adjusted to be 0-indexed).
    union { null, long } mateAlignmentStart = null;
    // ASCII+33 Phred scores for the bases of sequence.
    // Corresponds to QUAL (column 11) in SAM.
    union { null, string } qual = null;
    // The actual sequence of bases. 
    // Corresponds to SEQ (column 10) in SAM.
    union { null, string } sequence = null;


    //////////////////////////////////////////
    // Read flags (all default to false)    //
    // Derived from FLAG (column 2) in SAM  //
    //////////////////////////////////////////
    union { boolean, null } readPaired = false;          // the template has multiple segments (0x1 in SAM) 
    union { boolean, null } properPair = false;          // each segment is properly aligned (0x2 in SAM)
    union { boolean, null } readMapped = false;          // the segment is mapped (opposite of 0x4 in SAM)
    union { boolean, null } mateMapped = false;          // the mate is mapped (opposite of 0x8 in SAM)
    union { boolean, null } readNegativeStrand = false;  // sequence is reverse complemented (0x10 in SAM)
    union { boolean, null } mateNegativeStrand = false;  // mate is reverse complemented (0x20 in SAM)
    union { boolean, null } firstOfPair = false;         // first segment in the template (0x40 in SAM)
    union { boolean, null } secondOfPair = false;        // the last segment in the template (0x80 in SAM)
    union { boolean, null } failedVendorQualityChecks = false;  // not passing quality controls (0x200 in SAM)
    union { boolean, null } duplicateRead = false;       // PCR or optical duplicate (0x400 in SAM)
    union { boolean, null } primaryAlignment = false;    // not a secondary alignment (opposite of 0x100 in SAM)


    //////////////////////////////
    // Derived & header fields  //
    //////////////////////////////
    // The corresponding reference length (LN in the SAM header).
    union { null, long }   referenceLength = null;
    // The corresponding reference URL (UR in the SAM header [NOTE: technically the URI in SAM spec]).
    union { null, string } referenceUrl = null;
    // The index of the referenceName (derived from the SAM header).
    union { null, int } referenceId = null;
    // c.f. referenceLength
    union { null, long } mateReferenceLength = null;
    // c.f. referenceUrl
    union { null, string } mateReferenceUrl = null;
    // c.f. referenceId
    union { null, int } mateReferenceId = null;

    // This is the name of the record group for the alignment.
    // It is taken from the header of the SAM file, by joining 
    // part of the QNAME with the PU, and taking the associated RG ID.
    union { null, string } recordGroupName = null;
    // The index of the readGroup.
    // (derived from the reference in the SAM header).
    union { null, int } recordGroupId = null;


    ///////////////////////////////////////
    // Commonly used optional attributes //
    ///////////////////////////////////////
    // String for mismatching positions.
    // Taken from the alignment optional field tag MD.
    union { null, string } mismatchingPositions = null;
    // Original base quality (generally from before BQSR).
    // Taken from the alignment optional field tag OQ.
    union { null, string } origQual = null;
    // Corresponds to the optional fields (column 12 onward) in SAM, less the
    // above optional attributes which have been parsed out.
    union { null, string } attributes = null;


    /////////////////////////////////////////////////
    // Record group identifier from sequencing run //
    /////////////////////////////////////////////////
    // The sequencing center which produced the read.
    // Taken from the CN attribute of the read group (RG) line in the
    // SAM header corresponding to the particular read group of the alignment.
    union { null, string } recordGroupSequencingCenter = null;
    union { null, string } recordGroupDescription = null;
    union { null, long } recordGroupRunDateEpoch = null;
    union { null, string } recordGroupFlowOrder = null;
    union { null, string } recordGroupKeySequence = null;
    union { null, string } recordGroupLibrary = null;
    union { null, int } recordGroupPredictedMedianInsertSize = null;
    union { null, string } recordGroupPlatform = null;
    union { null, string } recordGroupPlatformUnit = null;
    union { null, string } recordGroupSample = null;
}

The corresponding JSON schema can be found in the protocol dictionary we created earlier (the IDL above is the more readable of the two representations):

In [9]:

ADAM_formats.types_dict['ADAMRecord'].to_json()

Out[9]:

{'fields': [{'default': None,
   'doc': u'* These two fields, along with the two\n     * reference{Length, Url} fields at the bottom\n     * of the schema, collectively form the contents\n     * of the Sequence Dictionary embedded in the these\n     * records from the BAM / SAM itself.\n     * TODO: this should be moved to ADAMContig',
   'name': u'referenceName',
   'type': [u'null', u'string']},
  {'default': None, 'name': u'referenceId', 'type': [u'null', u'int']},
  {'default': None, 'name': u'start', 'type': [u'null', u'long']},
  {'default': None, 'name': u'mapq', 'type': [u'null', u'int']},
  {'default': None, 'name': u'readName', 'type': [u'null', u'string']},
  {'default': None, 'name': u'sequence', 'type': [u'null', u'string']},
  {'default': None, 'name': u'mateReference', 'type': [u'null', u'string']},
  {'default': None, 'name': u'mateAlignmentStart', 'type': [u'null', u'long']},
  {'default': None, 'name': u'cigar', 'type': [u'null', u'string']},
  {'default': None, 'name': u'qual', 'type': [u'null', u'string']},
  {'default': None, 'name': u'recordGroupName', 'type': [u'null', u'string']},
  {'default': None, 'name': u'recordGroupId', 'type': [u'null', u'int']},
  {'default': False, 'name': u'readPaired', 'type': [u'boolean', u'null']},
  {'default': False, 'name': u'properPair', 'type': [u'boolean', u'null']},
  {'default': False, 'name': u'readMapped', 'type': [u'boolean', u'null']},
  {'default': False, 'name': u'mateMapped', 'type': [u'boolean', u'null']},
  {'default': False,
   'name': u'readNegativeStrand',
   'type': [u'boolean', u'null']},
  {'default': False,
   'name': u'mateNegativeStrand',
   'type': [u'boolean', u'null']},
  {'default': False, 'name': u'firstOfPair', 'type': [u'boolean', u'null']},
  {'default': False, 'name': u'secondOfPair', 'type': [u'boolean', u'null']},
  {'default': False,
   'name': u'primaryAlignment',
   'type': [u'boolean', u'null']},
  {'default': False,
   'name': u'failedVendorQualityChecks',
   'type': [u'boolean', u'null']},
  {'default': False, 'name': u'duplicateRead', 'type': [u'boolean', u'null']},
  {'default': None,
   'name': u'mismatchingPositions',
   'type': [u'null', u'string']},
  {'default': None, 'name': u'attributes', 'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupSequencingCenter',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupDescription',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupRunDateEpoch',
   'type': [u'null', u'long']},
  {'default': None,
   'name': u'recordGroupFlowOrder',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupKeySequence',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupLibrary',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupPredictedMedianInsertSize',
   'type': [u'null', u'int']},
  {'default': None,
   'name': u'recordGroupPlatform',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupPlatformUnit',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupSample',
   'type': [u'null', u'string']},
  {'default': None, 'name': u'mateReferenceId', 'type': [u'null', u'int']},
  {'default': None, 'name': u'referenceLength', 'type': [u'null', u'long']},
  {'default': None, 'name': u'referenceUrl', 'type': [u'null', u'string']},
  {'default': None,
   'name': u'mateReferenceLength',
   'type': [u'null', u'long']},
  {'default': None, 'name': u'mateReferenceUrl', 'type': [u'null', u'string']},
  {'default': None, 'name': u'origQual', 'type': [u'null', u'string']}],
 'name': u'ADAMRecord',
 'namespace': u'org.bdgenomics.adam.avro',
 'type': u'record'}

ADAMGenotype

An ADAMGenotype stores a single variant (as compared to a reference sequence) in a given sample. It is the ADAM equivalent to a call (a line, or record) in a VCF (variant call format) file. ADAMGenotype does not store the ID of the variant (the ID column in VCF), and it splits up multi-allelic variants into separate records.

Information on the variant itself (e.g. the reference allele and alternate allele) is stored in a subschema, ADAMVariant. Variant annotations are stored in the VariantCallingAnnotations subschema.

Below is a real example of an ADAMGenotype.

In [10]:

ADAM_genotype = {"variant": None, # c.f. ADAMVariant section
                 "variantCallingAnnotations": None, # c.f. VariantCallingAnnotations section
                 "sampleId": "GENOTYPE", "sampleDescription": None, 
                 "processingDescription": None, "alleles": ["Ref", "Alt"], 
                 "referenceReadDepth": None, "alternateReadDepth": None, 
                 "readDepth": None, "genotypeQuality": None, "genotypeLikelihoods": [], 
                 "expectedAlleleDosage": None, "readsMappedForwardStrand": None, 
                 "splitFromMultiAllelic": False, "isPhased": False, "phaseSetId": None, 
                 "phaseQuality": None}

An excerpt of the VCF from which the ADAMGenotype above was derived is found below. The first data line of the excerpt (chr17) corresponds to the record above.

##fileformat=VCFv4.1
##fileDate=20140416
##source=23andme_to_vcf.pl
##reference=file://23andme_hg19ref_20121017.txt.gz
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GENOTYPE
chr17 7621777 rs1544724 T G . . . GT 0/1
chr1 82154 rs4477212 a . . . . GT 1/1
chr1 752566 rs3094315 g A . . . GT 0/1
chr1 752721 rs3131972 A G . . . GT 0/0
chr1 776546 rs12124819 A . . . . GT 0/0
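As an illustration of how the chr17 line relates to the Python records in this notebook, the sketch below parses one single-sample, biallelic VCF data line. The function name and the flat dictionary it returns are hypothetical (ADAM nests the variant inside the genotype), and only the GT values 0 and 1 are handled. The allele names "Ref" and "Alt" follow the example record above.

```python
# Only reference (0) and first-alternate (1) calls are handled in this sketch.
ALLELE_NAMES = {"0": "Ref", "1": "Alt"}


def parse_vcf_line(line):
    """Parse a single-sample, biallelic VCF data line into a flat dict."""
    chrom, pos, _id, ref, alt, _qual, _filter, _info, fmt, sample = line.split()
    # Pair FORMAT keys with the sample's values and pick out the genotype.
    genotype = dict(zip(fmt.split(":"), sample.split(":")))["GT"]
    return {
        "contigName": chrom,
        "position": int(pos) - 1,  # VCF POS is 1-based; ADAM positions are 0-based
        "referenceAllele": ref,
        "variantAllele": alt,
        "alleles": [ALLELE_NAMES[a] for a in genotype.replace("|", "/").split("/")],
        "isPhased": "|" in genotype,
    }


call = parse_vcf_line("chr17 7621777 rs1544724 T G . . . GT 0/1")
print(call["position"], call["alleles"])
```

The resulting position (7621776) matches the position in the ADAMVariant example later in this notebook.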

record ADAMGenotype {
  // Information regarding the actual variant (c.f. ADAMVariant for more information)
  union { null, ADAMVariant }               variant; 
  // c.f. VariantCallingAnnotations for more information
  union { null, VariantCallingAnnotations } variantCallingAnnotations = null;

  // This is the actual name of the sample the genotype is associated with (taken from the VCF header line).
  union { null, string }  sampleId = null;
  // A description, if any, associated with the sample (this would be from the VCF metainformation).
  union { null, string }  sampleDescription = null;
  // Optional information regarding the processing of the VCF, taken from the file's metainformation.
  union { null, string }  processingDescription = null;

  // The actual genotype of the sample at the location; this is the call itself.
  // This corresponds to the information in the Sample column in the VCF. It could look like, for example, 
  // ["REF", "ALT"] if one chromosome's allele corresponds to the reference's, and the other to the alternate's.
  // Note: Length is equal to the ploidy. Values: "REF", "ALT", "NOCALL". 
  // Note too that, unlike in the VCF format, in the sample columns,
  // where one line/record can contain multiple alternate alleles, ADAMGenotype corresponds to exactly one alternate
  // and thus we do not need to refer to the specific alternate allele being called. 
  array <ADAMGenotypeAllele> alleles = null;

  //////////////////////////////////////////////////////////////////
  // Information optionally encoded in the VCF's Samples columns  //
  //////////////////////////////////////////////////////////////////
  // How many reads consider this allele to be the reference.
  union { null, int }     referenceReadDepth = null;
  // How many reads consider this allele to be the alternate. 
  union { null, int }     alternateReadDepth = null;
  // How many total reads at this position. Corresponds to the DP tag in VCF.
  union { null, int }     readDepth = null;
  // The phred-scaled probability that we're correct for this genotype call.
  union { null, int }     genotypeQuality = null;

  // Phred-scaled scores for the called genotypes. (Length 3)
  array<int>              genotypeLikelihoods = null;

  // (Not sure what this is, does not appear to be used in ADAM. 
  // Looks like it could be the expected allele counts (c.f. http://pngu.mgh.harvard.edu/~purcell/plink/dosage.shtml).)
  union { null, float }    expectedAlleleDosage = null;

  // Number of reads mapped at site on forward strand (Not sure what this means, where this is taken from. 
  // Is not used in ADAM.)
  union { null, int } readsMappedForwardStrand = null;

  // In the ADAM world we split multiallelic VCF lines into multiple
  // single-alternate records.  This bit is set if that happened for this record.
  boolean                 splitFromMultiAllelic = false;

  // Whether this is a phased genotype. A phased genotype means that we know which allele belongs to which
  // strand of the chromosome. (This information is encoded in the VCF's sample column.)
  union { null, boolean } isPhased = null;
  // And if so, what is the phase ID.
  union { null, int }     phaseSetId = null;

  // The quality of the phasing.  (This isn't precisely defined in v4.2 of the spec.)
  union { null, int }     phaseQuality = null;
}
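The splitFromMultiAllelic flag can be pictured with a small sketch: a VCF line whose ALT column lists several comma-separated alleles becomes one record per alternate allele, each marked as split. This only illustrates the idea described in the comments above and is not ADAM's implementation. The function name is hypothetical and the per-sample genotype fields are left out.

```python
def split_multiallelic(reference_allele, alt_column):
    """Return one variant dict per alternate allele in a VCF ALT column."""
    alternates = alt_column.split(",")
    was_split = len(alternates) > 1
    return [
        {
            "referenceAllele": reference_allele,
            "variantAllele": alternate,
            "splitFromMultiAllelic": was_split,
        }
        for alternate in alternates
    ]


single = split_multiallelic("T", "G")
multi = split_multiallelic("T", "G,C")
print(len(single), len(multi))
```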

ADAMVariant

An ADAMVariant is used with ADAMGenotype to denote the actual variant and reference allele of the genotype, as well as the alignment information associated with it.

In [11]:

variant = {"contig": {"contigId": 0, "contigName": "chr17", "contigLength": None, 
                      "contigMD5": None, "referenceURL": None}, 
                      "position": 7621776, "referenceAllele": "T", "variantAllele": "G"}

In [12]:

ADAM_variant_schema = ADAM_formats.types_dict['ADAMVariant']
ADAM_variant_schema.to_json()

Out[12]:

{'fields': [{'default': None,
   'name': u'contig',
   'type': [u'null',
    {'fields': [{'default': None,
       'name': u'contigId',
       'type': [u'null', u'int']},
      {'default': None, 'name': u'contigName', 'type': [u'null', u'string']},
      {'default': None, 'name': u'contigLength', 'type': [u'null', u'long']},
      {'default': None, 'name': u'contigMD5', 'type': [u'null', u'string']},
      {'default': None,
       'name': u'referenceURL',
       'type': [u'null', u'string']}],
     'name': u'ADAMContig',
     'namespace': u'org.bdgenomics.adam.avro',
     'type': u'record'}]},
  {'default': None, 'name': u'position', 'type': [u'null', u'long']},
  {'name': u'referenceAllele', 'type': u'string'},
  {'name': u'variantAllele', 'type': u'string'}],
 'name': u'ADAMVariant',
 'namespace': u'org.bdgenomics.adam.avro',
 'type': u'record'}

record ADAMVariant {
  union { null, ADAMContig } contig = null;
  // The position in the reference sequence this variant is located.
  union { null, long }       position = null;
  // The reference allele at that position.
  string                     referenceAllele;
  // The alternate allele being considered in this variant.
  string                     variantAllele;
}

VariantCallingAnnotations

This record represents all stats that, inside a VCF, are stored outside of the sample but are computed based on the samples. For instance, MAPQ0 is an aggregate stat computed from all samples and stored inside the INFO line.

In [13]:

variant_calling_annotation  = {
    "readDepth": None,
    "downsampled": None,
    "baseQRankSum": None,
    "clippingRankSum": None, 
    "fisherStrandBiasPValue": None,
    "haplotypeScore": None,
    "inbreedingCoefficient": None,
    "alleleCountMLE": [], 
    "alleleFrequencyMLE": [],
    "rmsMapQ": None,
    "mapq0Reads": None,
    "mqRankSum": None, 
    "usedForNegativeTrainingSet": None,
    "usedForPositiveTrainingSet": None,
    "variantQualityByDepth": None, 
    "readPositionRankSum": None,
    "vqslod": None, "culprit": None, 
    "variantCallErrorProbability": None,
    "variantIsPassing": True,
    "variantFilters": [],
} 

In [14]:

ADAM_variant_calling_annotations_schema = ADAM_formats.types_dict['VariantCallingAnnotations']
ADAM_variant_calling_annotations_schema.to_json()

Out[14]:

{'fields': [{'default': None, 'name': u'readDepth', 'type': [u'null', u'int']},
  {'default': None, 'name': u'downsampled', 'type': [u'null', u'boolean']},
  {'default': None, 'name': u'baseQRankSum', 'type': [u'null', u'float']},
  {'default': None, 'name': u'clippingRankSum', 'type': [u'null', u'float']},
  {'default': None,
   'name': u'fisherStrandBiasPValue',
   'type': [u'null', u'float']},
  {'default': None, 'name': u'haplotypeScore', 'type': [u'null', u'float']},
  {'default': None,
   'name': u'inbreedingCoefficient',
   'type': [u'null', u'float']},
  {'default': None,
   'name': u'alleleCountMLE',
   'type': {'items': u'int', 'type': 'array'}},
  {'default': None,
   'name': u'alleleFrequencyMLE',
   'type': {'items': u'int', 'type': 'array'}},
  {'default': None, 'name': u'rmsMapQ', 'type': [u'null', u'float']},
  {'default': None, 'name': u'mapq0Reads', 'type': [u'null', u'int']},
  {'default': None, 'name': u'mqRankSum', 'type': [u'null', u'float']},
  {'default': None,
   'name': u'usedForNegativeTrainingSet',
   'type': [u'null', u'boolean']},
  {'default': None,
   'name': u'usedForPositiveTrainingSet',
   'type': [u'null', u'boolean']},
  {'default': None,
   'name': u'variantQualityByDepth',
   'type': [u'null', u'float']},
  {'default': None,
   'name': u'readPositionRankSum',
   'type': [u'null', u'float']},
  {'default': None, 'name': u'vqslod', 'type': [u'null', u'float']},
  {'default': None, 'name': u'culprit', 'type': [u'null', u'string']},
  {'default': None,
   'name': u'variantCallErrorProbability',
   'type': [u'null', u'float']},
  {'default': True, 'name': u'variantIsPassing', 'type': u'boolean'},
  {'default': None,
   'name': u'variantFilters',
   'type': {'items': u'string', 'type': 'array'}}],
 'name': u'VariantCallingAnnotations',
 'namespace': u'org.bdgenomics.adam.avro',
 'type': u'record'}

record VariantCallingAnnotations {
  union { null, int }     readDepth = null;
  // Was this downsampled?
  union { null, boolean } downsampled = null;

  // Base quality rank sum. 
  union { null, float }   baseQRankSum = null;
  union { null, float }   clippingRankSum = null;
  union { null, float }   fisherStrandBiasPValue = null; // Phred-scaled.
  union { null, float }   haplotypeScore = null;
  union { null, float }   inbreedingCoefficient = null;
  array<int>              alleleCountMLE = null;
  array<int>              alleleFrequencyMLE = null;
  union { null, float }   rmsMapQ = null;
  union { null, int }     mapq0Reads = null;
  union { null, float }   mqRankSum = null;
  union { null, boolean } usedForNegativeTrainingSet = null;
  union { null, boolean } usedForPositiveTrainingSet = null;
  union { null, float }   variantQualityByDepth = null;
  union { null, float }   readPositionRankSum = null;
  // Log-odds ratio of being a true vs false variant under trained
  // Gaussian mixture model.
  union { null, float }   vqslod = null;
  union { null, string }  culprit = null;
  // Phred-scaled probability of error for this variant call.
  union { null, float }   variantCallErrorProbability = null;
  // True implies either filters were applied and the variant passed
  // those filters, or no filters were applied.  False implies filters
  // were applied and the variant did not pass.
  boolean                 variantIsPassing = true;
  // A list of filters applied.
  array <string>          variantFilters = null;
}
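Several of these fields are described as Phred-scaled, meaning a probability p is stored as -10 * log10(p). A minimal sketch of the conversion in both directions follows; the helper names are made up for illustration.

```python
import math


def phred_to_probability(score):
    """Convert a Phred-scaled score into the probability it encodes."""
    return 10.0 ** (-score / 10.0)


def probability_to_phred(probability):
    """Convert a probability (0 < p <= 1) into a Phred-scaled score."""
    return -10.0 * math.log10(probability)


# A Phred score of 30 corresponds to roughly a 1-in-1000 chance of error.
print(phred_to_probability(30))
print(probability_to_phred(0.001))
```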

ADAMNucleotideContigFragment

Next we'll look at an ADAMNucleotideContigFragment, which stores a contig of nucleotides; this may be a reference chromosome, an assembly, or a BAC.

In [15]:

ADAM_nucleotide_contig_fragment_schema = ADAM_formats.types_dict['ADAMNucleotideContigFragment']
ADAM_nucleotide_contig_fragment_schema.to_json()

Out[15]:

{'fields': [{'default': None,
   'name': u'contigName',
   'type': [u'null', u'string']},
  {'default': None, 'name': u'contigId', 'type': [u'null', u'int']},
  {'default': None, 'name': u'description', 'type': [u'null', u'string']},
  {'default': None, 'name': u'url', 'type': [u'null', u'string']},
  {'default': None, 'name': u'fragmentSequence', 'type': [u'null', u'string']},
  {'default': None, 'name': u'contigLength', 'type': [u'null', u'long']},
  {'default': None, 'name': u'fragmentNumber', 'type': [u'null', u'int']},
  {'default': None,
   'name': u'fragmentStartPosition',
   'type': [u'null', u'long']},
  {'default': None,
   'name': u'numberOfFragmentsInContig',
   'type': [u'null', u'int']}],
 'name': u'ADAMNucleotideContigFragment',
 'namespace': u'org.bdgenomics.adam.avro',
 'type': u'record'}

For example, below is a real contig fragment from the human reference genome (with part of the sequence itself elided for brevity).

contigName = 20
contigId = 0
fragmentSequence = TTCG……………AACCGGCTCGA
contigLength = 20000000
fragmentNumber = 4
fragmentStartPosition = 40000
numberOfFragmentsInContig = 2000

record ADAMNucleotideContigFragment {
    union { null, string } contigName = null;
    union { null, int } contigId = null;
    union { null, string } description = null;
    union { null, string } url = null; 
    union { null, string } fragmentSequence = null; // sequence of bases in this fragment
    union { null, long } contigLength = null; // length of the total contig (all fragments)
    union { null, int } fragmentNumber = null; // ordered number for this fragment
    union { null, long } fragmentStartPosition = null; // position of first base of fragment in contig
    union { null, int } numberOfFragmentsInContig = null; // total number of fragments in contig
}
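The fragment fields fit together as follows: in the example above, a contig of 20000000 bases in 2000 fragments gives 10000 bases per fragment, so fragment number 4 starts at position 40000. The sketch below cuts a toy sequence the same way. It assumes fixed-length fragments numbered from 0, which is consistent with the example but not confirmed by the schema, and the function name is hypothetical.

```python
def fragment_contig(contig_name, sequence, fragment_length):
    """Cut a sequence into fixed-length fragments with ADAM-style bookkeeping fields."""
    # Ceiling division: the last fragment may be shorter than the others.
    count = (len(sequence) + fragment_length - 1) // fragment_length
    fragments = []
    for number in range(count):
        start = number * fragment_length
        fragments.append({
            "contigName": contig_name,
            "fragmentSequence": sequence[start:start + fragment_length],
            "contigLength": len(sequence),
            "fragmentNumber": number,            # assumed to count from 0
            "fragmentStartPosition": start,
            "numberOfFragmentsInContig": count,
        })
    return fragments


fragments = fragment_contig("toy", "ACGTACGTAC", 4)
for fragment in fragments:
    print(fragment["fragmentNumber"], fragment["fragmentStartPosition"],
          fragment["fragmentSequence"])
```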

ADAMContig

The ADAMContig record is used to describe a region of the sequence being considered.

record ADAMContig {
  union { null, int }    contigId = null;
  union { null, string } contigName = null;
  union { null, long }   contigLength = null;
  union { null, string } contigMD5 = null;
  union { null, string } referenceURL = null;
}

ADAMPileup

ADAMPileup summarizes the base calls of aligned reads at each position of a reference sequence, whatever the coverage at that position. This is similar to the pileup format used in the SAMtools suite, and is generally used in alignment work and for visual inspection of alignments.

ADAM can print alignment information in pileup format with the command mpileup. A very abridged sample of output is found below, and more information on the format can be found in the SAMtools documentation for mpileup.

20 9999944 A 35 ,,,....,.,,..,,...,.,,,,,,,.,,.,...
20 9999945 C 37 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,
20 9999946 T 37 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,
20 9999947 C 38 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,,
20 9999948 T 40 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,,.,
20 9999949 T 41 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,,.,.
20 9999950 A 42 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,,.,..
20 9999951 G 43 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,,.,...
20 9999952 T 46 ,,,....,.,,..,,...,.,,,,,,,.,,.,....,,.,.....,
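Each record in this abridged output has five columns: contig, position, reference base, read depth, and one character per read. In the pileup format a `.` is a read matching the reference on the forward strand and a `,` is a match on the reverse strand. The sketch below parses one such record. The function name and dictionary keys are hypothetical, and base quality columns and indel or mismatch symbols are not handled.

```python
def parse_pileup_line(line):
    """Parse one abridged pileup record: contig, position, ref base, depth, read bases."""
    contig, position, reference_base, depth, bases = line.split()
    return {
        "referenceName": contig,
        "position": int(position),  # kept exactly as printed in the pileup output
        "referenceBase": reference_base,
        "depth": int(depth),
        "forwardMatches": bases.count("."),
        "reverseMatches": bases.count(","),
    }


row = parse_pileup_line("20 9999944 A 35 ,,,....,.,,..,,...,.,,,,,,,.,,.,...")
print(row["depth"], row["forwardMatches"], row["reverseMatches"])
```

For the first record, the forward and reverse match counts add up to the reported depth, since every read matches the reference.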

In [16]:

ADAM_pileup_schema = ADAM_formats.types_dict['ADAMPileup']
ADAM_pileup_schema.to_json()

Out[16]:

{'fields': [{'default': None,
   'name': u'referenceName',
   'type': [u'null', u'string']},
  {'default': None, 'name': u'referenceId', 'type': [u'null', u'int']},
  {'default': None, 'name': u'position', 'type': [u'null', u'long']},
  {'default': None, 'name': u'rangeOffset', 'type': [u'null', u'int']},
  {'default': None, 'name': u'rangeLength', 'type': [u'null', u'int']},
  {'default': None,
   'name': u'referenceBase',
   'type': [u'null',
    {'name': u'Base',
     'namespace': u'org.bdgenomics.adam.avro',
     'symbols': [u'A',
      u'C',
      u'T',
      u'G',
      u'U',
      u'N',
      u'X',
      u'K',
      u'M',
      u'R',
      u'Y',
      u'S',
      u'W',
      u'B',
      u'V',
      u'H',
      u'D'],
     'type': 'enum'}]},
  {'default': None,
   'name': u'readBase',
   'type': [u'null', u'org.bdgenomics.adam.avro.Base']},
  {'default': None, 'name': u'sangerQuality', 'type': [u'null', u'int']},
  {'default': None, 'name': u'mapQuality', 'type': [u'null', u'int']},
  {'default': None, 'name': u'numSoftClipped', 'type': [u'null', u'int']},
  {'default': None, 'name': u'numReverseStrand', 'type': [u'null', u'int']},
  {'default': None, 'name': u'countAtPosition', 'type': [u'null', u'int']},
  {'default': None, 'name': u'readName', 'type': [u'null', u'string']},
  {'default': None, 'name': u'readStart', 'type': [u'null', u'long']},
  {'default': None, 'name': u'readEnd', 'type': [u'null', u'long']},
  {'default': None,
   'name': u'recordGroupSequencingCenter',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupDescription',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupRunDateEpoch',
   'type': [u'null', u'long']},
  {'default': None,
   'name': u'recordGroupFlowOrder',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupKeySequence',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupLibrary',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupPredictedMedianInsertSize',
   'type': [u'null', u'int']},
  {'default': None,
   'name': u'recordGroupPlatform',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupPlatformUnit',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'recordGroupSample',
   'type': [u'null', u'string']}],
 'name': u'ADAMPileup',
 'namespace': u'org.bdgenomics.adam.avro',
 'type': u'record'}

record ADAMPileup {
    union { null, string } referenceName = null;
    union { null, int } referenceId = null;
    union { null, long } position = null;
    union { null, int } rangeOffset = null;
    union { null, int } rangeLength = null;
    union { null, Base } referenceBase = null;
    union { null, Base } readBase = null;
    union { null, int } sangerQuality = null;
    union { null, int } mapQuality = null;
    union { null, int } numSoftClipped = null;
    union { null, int } numReverseStrand = null;
    union { null, int } countAtPosition = null;

    union { null, string } readName = null;
    union { null, long } readStart = null;
    union { null, long } readEnd = null;

    // record group identifier from sequencing run
    union { null, string } recordGroupSequencingCenter = null;
    union { null, string } recordGroupDescription = null;
    union { null, long } recordGroupRunDateEpoch = null;
    union { null, string } recordGroupFlowOrder = null;
    union { null, string } recordGroupKeySequence = null;
    union { null, string } recordGroupLibrary = null;
    union { null, int } recordGroupPredictedMedianInsertSize = null;
    union { null, string } recordGroupPlatform = null;
    union { null, string } recordGroupPlatformUnit = null;
    union { null, string } recordGroupSample = null;
}

VariantEffect¶

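VariantEffect denotes the effect that a variant allele has on a given gene. It records the amino acid change, if any, caused by the mutation of a codon, along with the HGVS description of the variant and the identifiers of the affected gene and transcript.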

In [17]:

ADAM_variant_effect_schema = ADAM_formats.types_dict['VariantEffect']
ADAM_variant_effect_schema.to_json()

Out[17]:

{'fields': [{'default': None, 'name': u'hgvs', 'type': [u'null', u'string']},
  {'default': None,
   'name': u'referenceAminoAcid',
   'type': [u'null', u'string']},
  {'default': None,
   'name': u'alternateAminoAcid',
   'type': [u'null', u'string']},
  {'default': None, 'name': u'geneId', 'type': [u'null', u'string']},
  {'default': None, 'name': u'transcriptId', 'type': [u'null', u'string']}],
 'name': u'VariantEffect',
 'namespace': u'org.bdgenomics.adam.avro',
 'type': u'record'}

record VariantEffect {
  union { null, string} hgvs = null;
  union { null, string } referenceAminoAcid = null;
  union { null, string } alternateAminoAcid = null;
  union {null, string} geneId = null;
  union {null, string} transcriptId = null;
}
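
The JSON form is verbose next to the IDL, but it is easy to condense programmatically. The sketch below walks a dict shaped like the `to_json()` output above and reports each field's name, non-null type, and whether it is nullable. The schema is re-typed inline so the example runs without `adam.avpr`; `type_name` and `describe_fields` are hypothetical helpers, not part of the Avro library.

```python
variant_effect_json = {
    "type": "record",
    "name": "VariantEffect",
    "namespace": "org.bdgenomics.adam.avro",
    "fields": [
        {"name": "hgvs", "type": ["null", "string"], "default": None},
        {"name": "referenceAminoAcid", "type": ["null", "string"], "default": None},
        {"name": "alternateAminoAcid", "type": ["null", "string"], "default": None},
        {"name": "geneId", "type": ["null", "string"], "default": None},
        {"name": "transcriptId", "type": ["null", "string"], "default": None},
    ],
}


def type_name(avro_type):
    """Return a short label for a primitive name or a nested type dict."""
    if isinstance(avro_type, dict):
        # Named types (record, enum) carry 'name'; arrays only carry 'type'.
        return avro_type.get("name", avro_type["type"])
    return avro_type


def describe_fields(schema_json):
    """Return (field name, non-null type label, nullable) for each field."""
    rows = []
    for field in schema_json["fields"]:
        field_type = field["type"]
        # A union appears as a list; wrap anything else for uniform handling.
        branches = field_type if isinstance(field_type, list) else [field_type]
        labels = [type_name(branch) for branch in branches if branch != "null"]
        rows.append((field["name"], " | ".join(labels), "null" in branches))
    return rows


for name, label, nullable in describe_fields(variant_effect_json):
    print("{0:20s} {1:8s} nullable={2}".format(name, label, nullable))
```

The same function should work on the larger schemas in this notebook, since nested enums and records are reduced to their names.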

ADAMDatabaseVariantAnnotation¶

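This record annotates a variant with information drawn from external databases: its dbSNP identifier, gene symbol, clinical fields (OMIM, COSMIC, ClinVar), conservation scores, 1000 Genomes population statistics, and predicted effects (SIFT and MutationTaster).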

record ADAMDatabaseVariantAnnotation {
  union { null, ADAMVariant } variant;
  union { null, int } dbSnpId = null;

  // Domain information
  union {null, string} geneSymbol = null;

  // Clinical fields
  union {null, string} omimId  = null;
  union {null, string} cosmicId = null;
  union {null, string} clinvarId  = null;
  union {null, string} clinicalSignificance  = null;

  // Conservation
  union { null, string } gerpNr  = null;
  union { null, string } gerpRs  = null;
  union { null, float } phylop  = null;
  union { null, string } ancestralAllele  = null;

  // Population statistics
  union {null, int} thousandGenomesAlleleCount = null;
  union {null, float} thousandGenomesAlleleFrequency = null;

  // Effect of the variant
  //array<VariantEffect> effects = null;

  // Predicted effects
  union { null, float } siftScore = null;
  union { null, float } siftScoreConverted = null;
  union { null, string } siftPred = null;

  union { null, float } mutationTasterScore = null;
  union { null, float } mutationTasterScoreConverted = null;
  union { null, string } mutationTasterPred = null;
}
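
In the IDL above, `variant` is the only field declared without a default; every other field defaults to null. The following plain-Python sketch mirrors those declarations for a handful of the fields. It does not reproduce the behavior of any Avro library, and `new_annotation`, the `REQUIRED` sentinel, and the sample values are all hypothetical.

```python
REQUIRED = object()  # sentinel: the IDL declares no default for this field

# A subset of the ADAMDatabaseVariantAnnotation fields and their defaults.
ANNOTATION_FIELDS = [
    ("variant", REQUIRED),
    ("dbSnpId", None),
    ("geneSymbol", None),
    ("clinicalSignificance", None),
    ("thousandGenomesAlleleFrequency", None),
]


def new_annotation(**values):
    """Build an annotation dict, filling in the declared null defaults."""
    known = set(name for name, _ in ANNOTATION_FIELDS)
    unknown = sorted(set(values) - known)
    if unknown:
        raise KeyError("unknown field(s): {0}".format(", ".join(unknown)))
    record = {}
    for name, default in ANNOTATION_FIELDS:
        if name in values:
            record[name] = values[name]
        elif default is REQUIRED:
            raise ValueError("field '{0}' has no default".format(name))
        else:
            record[name] = default
    return record


# The empty dict stands in for an ADAMVariant record; values are made up.
annotation = new_annotation(variant={}, geneSymbol="BRCA1")
print(sorted(annotation.items(), key=lambda item: item[0]))
```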

ADAMNestedPileup¶

ADAMNestedPileup is a nested pileup data type: it pairs an ADAMPileup with the list of overlapping ADAMRecord reads (readEvidence) associated with it.

In [18]:

ADAM_nested_pileup_schema = ADAM_formats.types_dict['ADAMNestedPileup']
ADAM_nested_pileup_schema.to_json()

Out[18]:

{'fields': [{'name': u'pileup',
   'type': {'fields': [{'default': None,
      'name': u'referenceName',
      'type': [u'null', u'string']},
     {'default': None, 'name': u'referenceId', 'type': [u'null', u'int']},
     {'default': None, 'name': u'position', 'type': [u'null', u'long']},
     {'default': None, 'name': u'rangeOffset', 'type': [u'null', u'int']},
     {'default': None, 'name': u'rangeLength', 'type': [u'null', u'int']},
     {'default': None,
      'name': u'referenceBase',
      'type': [u'null',
       {'name': u'Base',
        'namespace': u'org.bdgenomics.adam.avro',
        'symbols': [u'A',
         u'C',
         u'T',
         u'G',
         u'U',
         u'N',
         u'X',
         u'K',
         u'M',
         u'R',
         u'Y',
         u'S',
         u'W',
         u'B',
         u'V',
         u'H',
         u'D'],
        'type': 'enum'}]},
     {'default': None,
      'name': u'readBase',
      'type': [u'null', u'org.bdgenomics.adam.avro.Base']},
     {'default': None, 'name': u'sangerQuality', 'type': [u'null', u'int']},
     {'default': None, 'name': u'mapQuality', 'type': [u'null', u'int']},
     {'default': None, 'name': u'numSoftClipped', 'type': [u'null', u'int']},
     {'default': None, 'name': u'numReverseStrand', 'type': [u'null', u'int']},
     {'default': None, 'name': u'countAtPosition', 'type': [u'null', u'int']},
     {'default': None, 'name': u'readName', 'type': [u'null', u'string']},
     {'default': None, 'name': u'readStart', 'type': [u'null', u'long']},
     {'default': None, 'name': u'readEnd', 'type': [u'null', u'long']},
     {'default': None,
      'name': u'recordGroupSequencingCenter',
      'type': [u'null', u'string']},
     {'default': None,
      'name': u'recordGroupDescription',
      'type': [u'null', u'string']},
     {'default': None,
      'name': u'recordGroupRunDateEpoch',
      'type': [u'null', u'long']},
     {'default': None,
      'name': u'recordGroupFlowOrder',
      'type': [u'null', u'string']},
     {'default': None,
      'name': u'recordGroupKeySequence',
      'type': [u'null', u'string']},
     {'default': None,
      'name': u'recordGroupLibrary',
      'type': [u'null', u'string']},
     {'default': None,
      'name': u'recordGroupPredictedMedianInsertSize',
      'type': [u'null', u'int']},
     {'default': None,
      'name': u'recordGroupPlatform',
      'type': [u'null', u'string']},
     {'default': None,
      'name': u'recordGroupPlatformUnit',
      'type': [u'null', u'string']},
     {'default': None,
      'name': u'recordGroupSample',
      'type': [u'null', u'string']}],
    'name': u'ADAMPileup',
    'namespace': u'org.bdgenomics.adam.avro',
    'type': u'record'}},
  {'name': u'readEvidence',
   'type': {'items': {'fields': [{'default': None,
       'doc': u'* These two fields, along with the two\n     * reference{Length, Url} fields at the bottom\n     * of the schema, collectively form the contents\n     * of the Sequence Dictionary embedded in the these\n     * records from the BAM / SAM itself.\n     * TODO: this should be moved to ADAMContig',
       'name': u'referenceName',
       'type': [u'null', u'string']},
      {'default': None, 'name': u'referenceId', 'type': [u'null', u'int']},
      {'default': None, 'name': u'start', 'type': [u'null', u'long']},
      {'default': None, 'name': u'mapq', 'type': [u'null', u'int']},
      {'default': None, 'name': u'readName', 'type': [u'null', u'string']},
      {'default': None, 'name': u'sequence', 'type': [u'null', u'string']},
      {'default': None,
       'name': u'mateReference',
       'type': [u'null', u'string']},
      {'default': None,
       'name': u'mateAlignmentStart',
       'type': [u'null', u'long']},
      {'default': None, 'name': u'cigar', 'type': [u'null', u'string']},
      {'default': None, 'name': u'qual', 'type': [u'null', u'string']},
      {'default': None,
       'name': u'recordGroupName',
       'type': [u'null', u'string']},
      {'default': None, 'name': u'recordGroupId', 'type': [u'null', u'int']},
      {'default': False, 'name': u'readPaired', 'type': [u'boolean', u'null']},
      {'default': False, 'name': u'properPair', 'type': [u'boolean', u'null']},
      {'default': False, 'name': u'readMapped', 'type': [u'boolean', u'null']},
      {'default': False, 'name': u'mateMapped', 'type': [u'boolean', u'null']},
      {'default': False,
       'name': u'readNegativeStrand',
       'type': [u'boolean', u'null']},
      {'default': False,
       'name': u'mateNegativeStrand',
       'type': [u'boolean', u'null']},
      {'default': False,
       'name': u'firstOfPair',
       'type': [u'boolean', u'null']},
      {'default': False,
       'name': u'secondOfPair',
       'type': [u'boolean', u'null']},
      {'default': False,
       'name': u'primaryAlignment',
       'type': [u'boolean', u'null']},
      {'default': False,
       'name': u'failedVendorQualityChecks',
       'type': [u'boolean', u'null']},
      {'default': False,
       'name': u'duplicateRead',
       'type': [u'boolean', u'null']},
      {'default': None,
       'name': u'mismatchingPositions',
       'type': [u'null', u'string']},
      {'default': None, 'name': u'attributes', 'type': [u'null', u'string']},
      {'default': None,
       'name': u'recordGroupSequencingCenter',
       'type': [u'null', u'string']},
      {'default': None,
       'name': u'recordGroupDescription',
       'type': [u'null', u'string']},
      {'default': None,
       'name': u'recordGroupRunDateEpoch',
       'type': [u'null', u'long']},
      {'default': None,
       'name': u'recordGroupFlowOrder',
       'type': [u'null', u'string']},
      {'default': None,
       'name': u'recordGroupKeySequence',
       'type': [u'null', u'string']},
      {'default': None,
       'name': u'recordGroupLibrary',
       'type': [u'null', u'string']},
      {'default': None,
       'name': u'recordGroupPredictedMedianInsertSize',
       'type': [u'null', u'int']},
      {'default': None,
       'name': u'recordGroupPlatform',
       'type': [u'null', u'string']},
      {'default': None,
       'name': u'recordGroupPlatformUnit',
       'type': [u'null', u'string']},
      {'default': None,
       'name': u'recordGroupSample',
       'type': [u'null', u'string']},
      {'default': None, 'name': u'mateReferenceId', 'type': [u'null', u'int']},
      {'default': None,
       'name': u'referenceLength',
       'type': [u'null', u'long']},
      {'default': None, 'name': u'referenceUrl', 'type': [u'null', u'string']},
      {'default': None,
       'name': u'mateReferenceLength',
       'type': [u'null', u'long']},
      {'default': None,
       'name': u'mateReferenceUrl',
       'type': [u'null', u'string']},
      {'default': None, 'name': u'origQual', 'type': [u'null', u'string']}],
     'name': u'ADAMRecord',
     'namespace': u'org.bdgenomics.adam.avro',
     'type': u'record'},
    'type': 'array'}}],
 'name': u'ADAMNestedPileup',
 'namespace': u'org.bdgenomics.adam.avro',
 'type': u'record'}

record ADAMNestedPileup {
     ADAMPileup pileup;
     array<ADAMRecord> readEvidence;
}
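
To make the nesting concrete, the sketch below pairs a pileup position with the reads that overlap it, using plain dicts keyed by the field names from the schema. The helper `nest_pileup` and the sample reads are hypothetical. A read's extent is approximated as `start` plus the length of `sequence`, which ignores the CIGAR string (clipping, insertions, deletions), and both records are assumed to use the same coordinate base.

```python
def nest_pileup(pileup, reads):
    """Return an ADAMNestedPileup-shaped dict for one pileup position."""
    position = pileup["position"]
    evidence = [
        read for read in reads
        if read["referenceName"] == pileup["referenceName"]
        # Assumed half-open span [start, start + len(sequence)).
        and read["start"] <= position < read["start"] + len(read["sequence"])
    ]
    return {"pileup": pileup, "readEvidence": evidence}


# Made-up records for illustration only.
pileup = {"referenceName": "20", "position": 9999944, "referenceBase": "A"}
reads = [
    {"readName": "read1", "referenceName": "20", "start": 9999940,
     "sequence": "ACGTACGTAC"},
    {"readName": "read2", "referenceName": "20", "start": 9999944,
     "sequence": "ACGT"},
    {"readName": "read3", "referenceName": "20", "start": 9999945,
     "sequence": "ACGT"},
    {"readName": "read4", "referenceName": "21", "start": 9999940,
     "sequence": "ACGTACGTAC"},
]

nested = nest_pileup(pileup, reads)
print([read["readName"] for read in nested["readEvidence"]])
```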