Reduced-representation genomic sequence data (e.g., RADseq, GBS, ddRAD) are commonly used to study population-level research questions, and consequently most software packages for analyzing RADseq data are designed for data with little variation across samples. Phylogenetic analyses typically include species with deeper divergence times (and thus more variable loci across samples), so a different approach to clustering and identifying orthologs will perform better.
pyRAD, the software pipeline described here, is intended to maximize phylogenetic information across disparate samples from RAD, ddRAD, or GBS data, hereafter referred to generally as RADseq. Written in Python, the program is flexible and easy to use. The code is human-readable and open-source, and may be modified to meet users' specific needs. With respect to constructing phylogenetic data sets, the largest difference between pyRAD and alternative programs, such as Stacks, is in the way that loci are clustered. Stacks clusters reads by grouping those with fewer than N differences between them, but does not allow for indels, such that the presence of even a single indel will cause a frameshift resulting in many base differences between otherwise highly similar sequences. pyRAD instead uses a measure of sequence similarity based on a global alignment of sequences through the program USEARCH. Alignment clustering allows for indels and should perform better in data sets including more distantly related samples.
pyRAD was written with the aim that it be easy to use, fast, and flexible. It is easy in the sense that installation does not require any difficult libraries or extensions, and that it requires few arguments on the command line. It is fast in offering a number of heuristics to speed clustering of very large data sets or long (paired-end) reads. And it is flexible in that it has built-in methods to account for different library types (RAD, ddRAD, GBS, paired-ddRAD, paired-GBS) as well as quality filtering, including overlap among paired or reverse-complemented reads, and the presence of adapter sequences. Thus, data do not need to be pre-screened or filtered before analysis in pyRAD.
Two fast clustering methods, hierarchical clustering for large data sets and split clustering for paired-end ddRAD data, are described further below. Using these methods, population- or phylogenetic-scale data sets with even many hundreds of individuals can be assembled in relatively short time by distributing the work across multiple processing cores. Finally, pyRAD can be used to perform D-statistic tests on RADseq loci to test for historical introgression.
An open access pre-print describing pyRAD and comparing it with Stacks is available here: link
pyRAD is available for Linux and Mac (but not Windows). It can be downloaded at the following link.
A few Python packages and external software are required as dependencies. I list how you can find and install each below:
Numpy and Scipy are common Python libraries required for numerical and scientific calculations.
On Ubuntu Linux:
These can be installed using apt-get commands:
>>> sudo apt-get install python-numpy
>>> sudo apt-get install python-scipy
On a Mac:
There are a few ways to install these, and it varies depending on the version of your OS. I suggest you search 'Mac install scipy', or Scipysuperpack, or Conda, for installing these two packages.
To check that they installed properly you can open Python in a command terminal by typing "python" and enter the import statements below. If everything is installed properly you will get no error messages.
import numpy
import scipy
import scipy.stats
The other two required pieces of software are MUSCLE and USEARCH. Both are available at http://www.drive5.com. Once downloaded you can move the executables into your $PATH such that they can be called from the command line. If you don't know what that means, don't worry, simply take note of the location where you saved the downloads, this is all you will need to execute them.
Sometimes the downloaded executable of USEARCH does not have the correct permissions set for execution (I don't know why). If you get an error when you execute the program saying you do not have permission to use it, then run the following unix command to change permissions of the file:
>>> chmod 555 usearch_v.7.0.1090_i86linux32
Important note: Versions of pyRAD v.1.70 and above are compatible with USEARCH v.7.0.1090 and above, not with older versions.
The barcodes file is a simple table linking barcodes to samples. Barcodes can be of varying lengths. Each line should have one name and one barcode, separated by a tab (important: not spaces).
sample1 ACAGG
sample2 ATTCA
sample3 CGGCATA
sample4 AAGAACA
If your data are already de-multiplexed (sorted) by samples into separate files then a barcode file is not needed.
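If you want to sanity-check a barcodes file before running, a few lines of Python suffice. This is a hypothetical helper (not part of pyRAD) that enforces the rules above: one name and one barcode per line, separated by a tab, with barcodes of possibly varying length.

```python
# Hypothetical helper (not part of pyRAD) to sanity-check a barcodes file:
# one name and one barcode per line, separated by a tab (not spaces).
def check_barcodes(lines):
    """Return a dict {sample: barcode}; raise ValueError on malformed lines."""
    barcodes = {}
    for i, line in enumerate(lines, 1):
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            raise ValueError("line %d is not tab-separated: %r" % (i, line))
        name, bc = parts
        if set(bc) - set("ACGT"):
            raise ValueError("line %d has non-ACGT barcode: %r" % (i, bc))
        barcodes[name] = bc
    return barcodes

bcs = check_barcodes(["sample1\tACAGG\n", "sample2\tATTCA\n"])
print(bcs["sample1"])   # ACAGG
```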
The parameter input file, usually named params.txt, can be created with the -n option in pyRAD, like below. This file lists all of the options or parameter settings necessary for an analysis. An example is here: params.txt
Every line has a parameter (e.g., a number or character string) followed by any number of spaces or tabs and then two hash marks (“##”), after which the parameter is described or comments can be added. In parentheses I list the step of a pyRAD analysis affected by the parameter. If a line is left blank the default value is used. I highly recommend beginning every analysis by creating an empty template params.txt file with the following command:
~/pyrad_v.2.0/pyRAD -n
This will create a file with the default settings which you will want to edit with a text editor.
The following section describes the params file line by line. The first 14 lines are required, all lines following this are optional and used for more advanced analyses or data filtering. For each input line I list several possible formats:
Line 1: The working directory for your analysis. Three different examples are shown below. This is where all of the output files from the analyses will be placed.
## 1. uses current as working directory
./ ## 1. uses current as working directory
/home/user/RAD/ ## 1. uses set location as working directory
Line 2: The location of the raw fastq formatted Illumina sequence data. Use the wildcard character * to select multiple files in the same directory. Paired-end data files should be in pairs that differ only by the presence of a "R1" and "R2" in their names. Data files can be gzipped.
./raw/*.fastq ## 2. path to raw data files
raw/*.fastq.gz ## 2. path to raw data files (gzipped)
Line 3: The location of the barcodes file (described above). If your data are already de-multiplexed then lines 2 and 3 can be left blank, but line 18 (see below) must be entered. Example:
./barcodes.txt ## 3. path to barcodes file
mydata.barcodes ## 3. path to barcodes file
Line 4: The command to call USEARCH. If USEARCH is in your $PATH, then simply list the executable file name here. Otherwise, enter the full path to USEARCH that is required for you to call it from the command line. Examples:
/user/bin/usearch7.0.1090_i86linux32 ## 4. full path
usearch7.0.1090_i86linux32 ## 4. in system PATH
Line 5: The command to call muscle. Similar to above.
muscle ## 5. path to call muscle
muscle3.8.31_i86linux32 ## 5. path to muscle long name
Line 6: The sequence of the restriction recognition site overhang. If your data are ddRAD list the common cutter (attached to first reads) followed by the rare cutter (attached to second reads), separated by a comma. If you're not sure of the sequence of your cutters look at your raw data files for the common sequence found at the left side of your reads. Cutters can contain ambiguous bases (Example: CWGC). Examples:
TGCAG ## 6. PstI cutsite overhang C|TGCAG
CWGC,AATT ## 6. ddRAD example for apeKI, EcoRI
Line 7: The number of processors (cores) to use. Entering a value that is more than the number of available processors will yield slower results. Parallel processing in pyRAD works by executing individual samples separately on different processors, so you cannot gain any speed by setting a number greater than the number of samples in your data set. Example:
8 ## 7. N processors
Line 8: The minimum depth of coverage to make a statistical base call at each site in a cluster. Depending on the inferred error rate, five is typically the minimum depth at which a base can be called. You may want to use a higher minimum depth, like 10. If your data are very low coverage you can instead call majority consensus base calls on reads with coverage as low as 2 by setting option 31 (see below).
10 ## 8. mindepth for base calls
Line 9: The maximum number of low quality, undetermined (“N”) sites allowed in filtered sequences from step 2. This number should be selected with respect to the clustering threshold. A low clustering threshold (.85) can allow more undetermined sites, whereas a high clustering threshold (.95) should exclude reads with too many undetermined sites or they will affect clustering.
4 ## 9. maxN: max number of Ns in reads
Line 10: The similarity threshold to use for alignment clustering in USEARCH, entered as a decimal. This value is used for both within-sample clustering and across-sample clustering. I (more or less) require you to use the same threshold for both, and recommend doing so: given the way the clustering algorithm works, reads that do not cluster together within a sample would otherwise cluster together later during across-sample clustering, which is undesirable. During clustering the cutsite bases remain attached to the reads, thus all loci will have an invariable set of 4-6 bases on one side. For 100 bp reads containing a 5-bp cutter and with a 5-bp barcode trimmed off, this results in 95 bp of data, 5 of which are invariable. Thus, if you want to allow 5 base differences between reads (90/95 = .947) you should use a setting of .94. Example:
.90 ## 10. clustering threshold
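The arithmetic above can be written out explicitly. The values here are the example's assumptions (100-bp reads, 5-bp barcode trimmed, 5-bp invariable cutsite retained, 5 allowed base differences):

```python
# Back-of-envelope clustering threshold calculation for the example above.
read_len = 100
barcode = 5
cutsite = 5                       # invariable bases that stay attached
data_len = read_len - barcode     # 95 bp compared during clustering
allowed_diffs = 5
threshold = (data_len - allowed_diffs) / float(data_len)
print(round(threshold, 3))        # 0.947, so a setting of .94 is appropriate
```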
Line 11: The type of reduced-representation library. There are currently five options, entered in all lowercase lettering: rad, ddrad, gbs, pairddrad, pairgbs. The different treatments of these data types are described below in the section RADseq/RRL datatypes.
rad ## 11. Datatype
pairddrad ## 11. Datatype
Line 12: Minimum taxon coverage. The minimum number of (ingroup) samples with data for a given locus to be retained in the final data set. If you enter a number equal to the full number of samples in your data set then it will return only loci that have data across all samples. If you enter a lower value, like 4, it will return a more sparse matrix, including any loci for which at least four samples contain data. Example:
4 ## 12. minCov
Line 13: Maximum number (or proportion) of shared polymorphic sites in a locus. Enter a number, or a decimal with the prefix “p” (e.g., p.10 for 10%). This option is used to detect potential paralogs, as a shared heterozygous site across many samples likely represents clustering of paralogs with a fixed difference rather than a true heterozygous site. Example:
4 ## 13. maxSharedH.
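The logic of this filter can be sketched as follows. This is an illustration of the concept only (not pyRAD's actual code; `exceeds_shared_hets` is a hypothetical helper): count how many samples are heterozygous (IUPAC ambiguity code) at each aligned site, and flag the locus if any site exceeds the allowed maximum.

```python
# Sketch of the shared-heterozygote paralog check (not pyRAD's actual code).
AMBIG = set("RKSYWM")  # IUPAC codes for heterozygous base calls

def exceeds_shared_hets(seqs, max_shared):
    """seqs: equal-length aligned consensus sequences, one per sample."""
    for site in zip(*seqs):
        if sum(base in AMBIG for base in site) > max_shared:
            return True
    return False

# A site heterozygous (R) in 3 of 4 samples trips a max_shared of 2:
locus = ["ACRT", "ACRT", "ACRT", "ACGT"]
print(exceeds_shared_hets(locus, 2))   # True
```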
Line 14: Prefix name for final output files. I usually enter a name indicative of the other parameters from the lines above. Example:
c90d10m4p4 ## 14. output name prefix
These options are not necessary for all analyses and can be left empty in the params file, in which case pyRAD uses their default values. They mostly pertain to options for additional quality filtering, for paired-end analyses, or for increasing the speed of analyses.
Line 15: Subset selector for steps 2-7. If, for example, you have data for a number of individuals from two different clades, and for one clade all of the sample names begin with the letter "A", then you can select only these samples to analyze by entering the shared prefix "A*" on this line.
A ## 15. selects prefix subset of files
Line 16: Add-on (outgroup) taxa selector. The MinCov parameter on line 12 tells pyRAD to keep only loci with data from at least n ingroup samples. If you want to maximize coverage within a certain clade, you can set the other taxa as "outgroups": only loci with data for at least MinCov ingroup samples will be kept, and any 'outgroup' taxa matching those loci will be included in the data but will not count toward the minimum number of samples. List 'outgroup' taxa names here separated by commas.
outg1,outg2 ## 16. outgroup selector for step 7
Line 17: Exclude taxa. When you are constructing data sets at step 7 you can exclude certain taxa in order to maximize coverage among other included samples. For example you may want to exclude taxa that are found to have very few data. List sample names comma separated here. Example:
sample3,sample4 ## 17. Exclude from step 7
Line 18: If your data are already de-multiplexed into separate files for each barcode then you can skip step 1 in the analysis and go straight to step 2. Instead, you must enter the location of your sorted fastq files on this line. The reads should have the barcode trimmed off. If they also have the cutsite trimmed off then enter the ’@’ symbol at the beginning of the line to indicate this. Files can be compressed (.gz) or not. Examples:
~/RAD/fastq/*.fastq ## 18. Sorted files
@~/RAD/fastq/*.fastq ## 18. sorted and stripped files
Line 19: The maximum number of mismatches allowed in a barcode during de-multiplexing. Default is 1; I don’t generally recommend going above this.
1 ## 19. max barcode mismatches
Line 20: Phred quality score offset (usually 33, but sometimes 64). Base calls with a phred score below 20 are converted to Ns. If you want this filter to be more strict you can raise the offset. For example, to raise the minimum score to 30, set the offset to 43. If left blank the default is 33.
33 ## 20. min Quality score step 2
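To see how the offset interacts with the score cutoff, here is an illustrative masking function (my own sketch, not pyRAD's code): each quality character encodes score = ord(char) - offset, and bases scoring below 20 become N.

```python
# Illustrative Phred masking (not pyRAD's code): score = ord(char) - offset.
def mask_low_quality(seq, quals, offset=33, minscore=20):
    return "".join(
        base if ord(q) - offset >= minscore else "N"
        for base, q in zip(seq, quals)
    )

# 'I' is phred 40 at offset 33; '#' is phred 2, so that base is masked.
print(mask_low_quality("ACGT", "II#I"))   # ACNT
# With offset=43, a base whose true phred-33 score is 25 computes as 15 and
# is masked; this is how inflating the offset raises the effective cutoff.
```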
Line 21: Strictness of filtering in step 2. (0) means no filtering for barcodes, adapters, or cut sites; only quality scores of base calls are corrected. (1) looks for cutsite+adapters and trims them out. (2) tries to detect these while allowing some errors in them, thus enforcing the most strict filtering. For most data sets that do not include many overlapping and short fragments, option 0 or 1 is recommended.
0 ## 21. strictness of Q filter
Line 22: a priori error rate and heterozygosity. Step 4 of the analysis jointly estimates the error rate and heterozygosity, and step 5 uses these values to make base calls. If for some reason you wish to skip the error rate estimate (e.g., your samples are polyploid), then you can enter an a priori estimate of the error rate here. Caution: you should generally let the program estimate it by using step 4, as the error rate here is being measured on data that have already passed through some filters, so error rates estimated elsewhere will not be correct. Across several data sets I typically find error rates between 0.0001 - 0.002, depending on length and sequence chemistry. Example:
0.0005,0.001 ## 22. E and H, respectively
Line 23: maximum number of “N”s in a consensus sequence. Clustering across samples can be affected by the number of “N”s, and so this number should be chosen with respect to line 6 and the lengths of reads. Default is 4.
5 ## 23. MaxN in consensus seqs
Line 24: maximum number of Hs (heterozygous sites) in a consensus sequence. Among several methods to exclude potential paralogs or highly repetitive genomic regions, you can set a maximum number of heterozygous sites in a consensus sequence. Default is 4.
5 ## 24. MaxH in a consensus seq
Line 25: Allow only 2 haplotypes in a consensus sequence. In diploid organisms, after correcting for sequencing errors, there should be at most two haplotypes making up any consensus genotype call. By default this option will exclude loci with more than 2 alleles. If set to 0 it will not apply a filter. More advanced polyploid options will be implemented in the future. Example:
1 ## 25. diploid filter max haplos=2
Line 26: Maximum number of SNPs in a final locus. This can remove potential effects of poor alignments in repetitive regions in a final data set by excluding loci with more than N SNPs in them. For paired data you can set max SNPs in the first read and second read separately, or use one value for the pair. Examples:
10 ## 26. max SNPs in a final locus
8,12 ## 26. max SNPs in 1st and 2nd pair read
Line 27: Maximum number of insertions/deletions. If only one value, applies to within-sample clusters. If two values, first is within-sample clusters, second is across-sample clusters. For paired-data, enter four values: withinclusterread1, withinclusterread2, acrossclustersread1, acrossclustersread2. Defaults are 3,6,99,99 respectively. Examples:
3 ## 27. max indels in a within-sample cluster
3,10 ## 27. max within-sample, max for across-sample cluster
3,6,10,10 ## 27. max within_read1,within_read2,across_read1,across_read2
Line 28: random number seed. Randomization is used to sort the input order of sequences before clustering in steps 3 and 6. This can have an effect on results. If you set a random number seed then analyses should be repeatable.
112233 ## 28. random number seed
Line 29: Allow overhanging ends of reads in final data set. If reads are different lengths or overlap to different degrees, 1,1 will trim to shortest sequence on either side of locus; 0,1 would trim only the right side; 0,0 allows both ends to overhang. For paired data 1,1,1,1 would trim overhangs on left1,right1,left2,right2.
1,1 ## 29. overhang/trim ends of locus
Line 30: Output formats (u,s,a,n). u = unlinked snps, s = snps, a = alleles, n = nexus. See output/formatting section.
a,n ## 30. outputs formats
Line 31: Call simple consensus (majority base) on clusters with depth less than mindepth, (allows you to set option 8 as low as 2).
1 ## 31. call majority base on low depth sites
Line 32: Check for overlap of paired reads. Enter minimum overlap of paired reads. Overlapping reads are merged and written to a separate file, and a file is created with only non-merged reads.
30 ## 32. minimum overlap (nbases) of pairs
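One way to picture the merge-overlap check is below. This is an assumption about the general approach (suffix/prefix matching against the reverse complement), not pyRAD's exact implementation; `merge_pair` and `revcomp` are invented helpers.

```python
# Illustrative pair merging (an assumption about the approach, not pyRAD's
# exact code): find the longest suffix of read1 matching a prefix of the
# reverse complement of read2, and merge if it meets the minimum overlap.
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def merge_pair(r1, r2, min_overlap=30):
    """Return the merged sequence, or None if overlap is too short."""
    r2rc = revcomp(r2)
    for olap in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        if r1[-olap:] == r2rc[:olap]:
            return r1 + r2rc[olap:]
    return None

# Toy example: a 16-bp fragment sequenced with 12-bp reads from both ends.
frag = "AACCGGTTACGTTGCA"
r1, r2 = frag[:12], revcomp(frag[4:])
print(merge_pair(r1, r2, min_overlap=4))   # AACCGGTTACGTTGCA
```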
Line 33: Keep trimmed sequences. This option is only for very messy single-end GBS or ddRAD data, where a size selection method did not work properly, such that many fragments were shorter than the read length and thus the adapters were frequently sequenced. Instead of discarding reads from which adapters were trimmed, this option keeps those that remain longer than X.
0 ## 33. don't keep trimmed sequences
50 ## 33. keep trimmed longer than 50
Line 34: maximum depth of a within-sample cluster. The default (empty) is max(meandepth+2*SD, 500), meaning the higher value of either the mean plus two times the standard deviation of cluster depth, or 500. If instead you want to set an absolute value enter it here. Or enter a ridiculously high value for no maxdepth filtering.
## 34. default max depth option
50000 ## 34. a set max depth
999999999999999 ## 34. no max depth
Line 35: minimum number of copies of de-replicated reads used for clustering. Reads are dereplicated, or collapsed, into a single read that records its number of occurrences before clustering. The default is to not exclude any data, but for exploratory analyses this can be useful, because the speed of within-sample clustering can be greatly improved by excluding singleton reads, or reads that occurred fewer than N times. In general I do not recommend discarding data for your final analysis, but for data sets with deep coverage it has little effect on results and greatly improves clustering times.
1 ## 35. exclude singletons
5 ## 35. exclude reads occurring less than 5 times
Remaining lines: Hierarchical clustering. If performing hierarchical (two-step) clustering, the assignment of individuals to groups and the minimum cluster size within groups are listed here. Each group is listed on a separate line; it is OK to leave a line blank. Each line contains three elements separated by spaces: (1) a group name (e.g., 'A' or '1'); these names are arbitrary, just choose something different for each group. (2) the minimum size of a cluster within this group; any cluster smaller than this will not be clustered across groups. (3) the names of samples/individuals in this group. Sample names can be individually listed comma separated, or selected using a wildcard selector (e.g., *). Example:
A 4 1A0,1B0,1C0,1D0
B 4 2*
C 4 3*
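Hypothetically, lines like those above could be parsed as follows (this is an invented helper, not pyRAD's code): split each line into name, minimum size, and a member list, expanding wildcards against the full sample list.

```python
# Hypothetical parser for the group-assignment lines (not pyRAD's code):
# group name, minimum cluster size, then comma-separated names or wildcards.
import fnmatch

def parse_groups(lines, all_samples):
    groups = {}
    for line in lines:
        if not line.strip():
            continue  # blank lines are allowed
        name, minsize, members = line.split()
        expanded = []
        for pat in members.split(","):
            expanded += fnmatch.filter(all_samples, pat)
        groups[name] = (int(minsize), expanded)
    return groups

samples = ["1A0", "1B0", "2A0", "2B0", "3A0"]
groups = parse_groups(["A 4 1A0,1B0", "B 4 2*", "C 4 3*"], samples)
print(groups["B"])   # (4, ['2A0', '2B0'])
```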
pyRAD should be called from the directory in which it was downloaded (i.e., alongside its other dependency .py files). For example, to call pyRAD if it had been downloaded to a user’s home/ directory, type the command below, which will print the help menu.
~/pyrad_v.2.0/pyRAD -h
There are five main options to the command line interface of pyRAD:
The parameter input file (params.txt) should be created using the -n option and then filled in based on the type of analysis you plan to do. If the params file is properly filled out then the entire analysis – converting raw RADseq data into assembled de novo loci – can be done by simply entering:
~/pyrad_v.2.0/pyRAD -p params.txt
This will perform all steps (1-7) of the analysis, as described below. If, however, you wish to perform only one, or a few, steps at a time, then you can select these using the -s (steps) option. To select only the first step:
~/pyrad_v.2.0/pyRAD -p params.txt -s 1
And to run steps 2-7 you could enter:
~/pyrad_v.2.0/pyRAD -p params.txt -s 234567
After finishing one assembly, you could rerun step 7 with the same or a different params.txt file indicating different step 7 filtering options.
~/pyrad_v.2.0/pyRAD -p params.txt -s 7
When first learning new software it often takes a while to learn how to properly fill in the correct parameter settings and options. This can be especially frustrating with computationally intensive analyses, where one might have to wait many hours (or days) before learning that they made a simple mistake. The -s option is an attempt to reduce this annoyance by breaking up the analysis into seven discrete jobs, each of which is performed sequentially. In this way, you will know whether step 1 worked before moving on to step 2. And if step 2 fails, you will retain the results of step 1, and can start again from where the error arose. I also suggest running the example data sets to familiarize yourself with the steps of an analysis and what the output files and statistics look like before beginning your own analysis.
Step 1: de-multiplexing. This step uses information from the barcodes file to separate sequences from your raw fastq files into a separate file for each sample. These are placed in a new directory within your working directory called “fastq/”. For paired-end data it is necessary that the raw data file names follow a specific format: The first read files must contain "_R1_" in the name, and the second read files must be identical to the first read files but with "_R2_" in place of "_R1_". Here is an example pair of input files:
yourfilename_R1_001.fastq
yourfilename_R2_001.fastq
A reminder: if your data are already demultiplexed this step can be skipped. You will not need to enter a location for your raw data in the params file; instead, you must enter the location of your demultiplexed files on line 18 of the params file.
Step 2: filtering. This step uses the Phred quality scores recorded in the FASTQ data files to filter low quality base calls. Sites with a score below a set value are changed into “N”s, and reads with more than the allowed number of “N”s are discarded. Files are written to the “edits/” directory with the suffix “.edit”. This step also implements a number of optional filters. For paired-end data you can separate out overlapping sequences (line 32). You can filter for Illumina adapter sequences (line 21).
Step 3: within-sample clustering. This step first dereplicates the filtered sequences from step 2, recording the number of times each unique read is observed. These are then clustered using USEARCH, grouping reads that match within the set sequence similarity. Sequences that clustered together are then aligned and written to a new file in the “clust.xx/” directory with the ending “.clustS.gz”.
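Dereplication can be pictured in a few lines of Python (a conceptual sketch, not pyRAD's implementation). The `;size=` annotation mimics the abundance convention USEARCH uses for dereplicated reads.

```python
# Conceptual dereplication: collapse identical reads into one record carrying
# its copy number, so clustering compares each unique sequence only once.
from collections import Counter

reads = ["ACGT", "ACGT", "ACGT", "ACGA", "TTGA"]
derep = Counter(reads)
# Order by abundance; singletons could be dropped here (params line 35).
for seq, count in derep.most_common():
    print(">%s;size=%d" % (seq, count))
```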
Step 4: error-rate and heterozygosity estimates. This step uses the ML equation of Lynch (20XX) to jointly estimate error rate and heterozygosity from the base counts in each site across all clusters. Results are written to the Pi_estimate.txt file in the stats/ directory.
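The idea behind the joint estimate can be sketched as a simple two-component mixture. This toy version is only an illustration of the concept (the real likelihood in pyRAD follows Lynch's equations, which also handle multiple base counts): each site is heterozygous with probability H, in which case the major-base frequency is near 0.5, or homozygous, in which case mismatches arise from the error rate E alone.

```python
# Toy sketch of joint E (error) and H (heterozygosity) estimation, not
# pyRAD's actual implementation of Lynch's equations.
import numpy as np
from scipy import stats, optimize

def neg_loglik(params, depths, majors):
    E, H = params
    if not (0 < E < 0.5 and 0 < H < 0.5):
        return np.inf  # keep the optimizer in a valid region
    p_hom = stats.binom.pmf(majors, depths, 1.0 - E)  # errors only
    p_het = stats.binom.pmf(majors, depths, 0.5)      # two alleles
    return -np.sum(np.log((1 - H) * p_hom + H * p_het))

# toy data: (depth, count of the most common base) at five cluster sites;
# the 11-of-20 site looks heterozygous, the others homozygous.
depths = np.array([20, 20, 20, 20, 20])
majors = np.array([20, 19, 20, 11, 20])
fit = optimize.minimize(neg_loglik, [0.01, 0.01],
                        args=(depths, majors), method="Nelder-Mead")
E_hat, H_hat = fit.x
```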
Step 5: create consensus sequences. Using the mean error rate and heterozygosity estimated in step 4, this step creates consensus sequences for each cluster. Those with less than the minimum coverage, more than the maximum number of undetermined sites, more than the maximum number of heterozygous sites, or more than the allowed number of alleles are discarded. If two alleles are present, the phase of heterozygous sites is retained in the consensus sequences.
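Heterozygous consensus bases are recorded with standard IUPAC ambiguity codes. As a quick reference, here is a minimal mapping (the codes are a general standard, not specific to pyRAD; `consensus_base` is an invented helper):

```python
# IUPAC ambiguity codes for the six two-base combinations.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def consensus_base(allele1, allele2):
    """Return the base written into a consensus sequence for a diploid call."""
    if allele1 == allele2:
        return allele1
    return IUPAC[frozenset((allele1, allele2))]

print(consensus_base("C", "T"))   # Y
```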
Step 6. Consensus sequences are clustered across samples using the same settings as in step 3. If a sample is heterozygous, one allele is randomly sampled and used in this step, although both alleles are retained in the final data set.
Step 7. Alignment, detection of paralogs, and output of a human-readable fasta-like file (.loci) and concatenated loci (.phy); other optional formats can be specified (.nex, .snps, .alleles), and final statistics are written to the .stats file. This step is relatively fast, and can be repeated with different values for options 12, 13, 16, and 17 to create different data sets that are optimized in coverage for different samples.
The following reduced representation library data types are supported by pyRAD, and I list beside each my general experience with analyzing empirical data sets. My quality judgements are based on the frequency with which users have generated data sets that include many short fragment reads with adapter sequences, or which require extra types of analyses (e.g., reverse-complement matching for overlapping or paired-end GBS). This may change as library preparations of GBS and ddRAD libraries improve. I describe each data type and how pyRAD treats them differently below.
RAD (very good)
ddRAD (usually good)
GBS (sometimes difficult/messy)
nextRAD (very good)
paired ddRAD (usually difficult/messy)
paired GBS (very difficult/messy)
Figure: overview of supported data types (https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/datatypes.svg)
In general single-end Illumina data that employ a good size selection window will produce very good and usable data by any of the library prep methods above. Problems tend to arise when the size selection is too small, and especially if this coincides with paired-end sequencing.
Poor size selection has little effect on traditional RAD or on single end ddRAD, as you can see above, except to include adapter sequences in the reads which can be filtered out. It has a more significant effect on paired ddRAD, and a very significant effect on single or paired GBS data. This is because short fragments will be sequenced from both ends of these types of data, leading to sequence overlap.
In paired ddRAD the first and second reads come in pairs, and pyRAD can test for overlap of these reads using the merge overlap option. This will remove merged reads and write them to a separate file. These can be analyzed separately from the non-overlapping reads if they make up a large proportion of your data set. Recommendations for this type of analysis are made in the paired ddRAD tutorial.
For GBS data the problem is more difficult. In single end data the Illumina adapter can ligate to either end of fragments, meaning short single-end fragments can have overlap leading to duplication in the data set. pyRAD can correct for this by performing reverse-complement clustering. This is much slower than normal clustering. It detects overlap of the reads and groups them into partially or fully overlapping clusters that are then used to make consensus base calls to create merged contigs. Recommendations for this type of analysis are made in the GBS tutorial.
For paired GBS data the merge overlap option can merge overlapping reads, however the first and second reads are completely interchangeable, so reverse complement clustering is also required in this case.
The links below are to a number of analyses including example data sets that you can download and try for yourself.
Basics
[Simple ddRAD analysis]
[Simple GBS analysis]
Advanced Examples
paired ddRAD with overlap and Hierarchical clustering of many individuals
[GBS analysis with short fragment rescue (poor size selection)]
[D-statistic tests for introgression]
By default pyRAD returns the final data in two formats: individual loci (.loci file) and concatenated loci (.phy file), as well as outputting a file with the loci that did not pass the final step of paralog filtering (.excluded_loci).
There are a number of additional formats possible as well. These include nexus (.nex), haplotypes/alleles (.alleles), snps (.snps), and unlinked snps (.usnps). These additional formats will be output if indicated on line 30 of the params file by their first letter, i.e., a for alleles, n for nexus, etc.
This is a custom format that I use to output the individual loci and indicate variable sites. For each locus it provides the sequence data for each sample and indicates the location of variable sites or SNPs. Custom scripts can easily parse this file for loci containing certain amounts of taxon coverage or variable sites. It is also the most easily readable file for verifying that your analyses are working properly. Example below:
Figure: example .loci output (https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/dotLoci_single.png)
For paired-end data the two linked loci are shown separated by a 'xxxx' separator.
Figure: .loci output for paired data (https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/dotLoci_pairs.png)
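Because the format is simple, per-locus statistics such as taxon coverage are easy to compute. Below is a sketch with a hypothetical parser (not a pyRAD utility), assuming the layout described above: one '>name sequence' line per sample, with a '//' line ending each locus.

```python
# Hypothetical .loci parser: count how many samples have data in each locus.
def loci_coverage(lines):
    """Yield the number of samples in each locus of a .loci-style file."""
    count = 0
    for line in lines:
        if line.startswith("//"):   # locus separator / variable-site line
            yield count
            count = 0
        elif line.startswith(">"):  # one sample's sequence
            count += 1

example = [
    ">A  ACGTT", ">B  ACGTT", ">C  ACTTT",
    "//       -    |",
    ">A  GGGCA", ">D  GGGCA",
    "//            |",
]
print(list(loci_coverage(example)))   # [3, 2]
```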
This is a phylip formatted data file which contains all of the loci from the .loci file concatenated into a supermatrix, with missing data for any sample filled in with N's. This format is used in RAxML among other phylogenetic programs.
Figure: example .phy output (https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/dotPhy.png)
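Conceptually, the supermatrix is built by concatenating every locus and padding samples missing a locus with Ns. A sketch of the idea (not pyRAD's code, and ignoring the phylip header for brevity):

```python
# Toy supermatrix construction: concatenate loci, fill missing samples with N.
loci = [
    {"A": "ACGT", "B": "ACGA"},   # locus 1: sample C missing
    {"A": "TTTT", "C": "TTTA"},   # locus 2: sample B missing
]
samples = sorted({s for locus in loci for s in locus})
matrix = {
    s: "".join(locus.get(s, "N" * len(next(iter(locus.values()))))
               for locus in loci)
    for s in samples
}
print(matrix["B"])   # ACGANNNN
```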
This is a nexus formatted data file which contains all of the loci from the .loci file concatenated into a supermatrix, but printed in an interleaved format, with missing data for any sample filled in with N's, and with data information appended to the beginning. This format is used in BEAST among other phylogenetic programs.
Figure: example .nex output (https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/dotNex.png)
This is the same as the .loci format but instead of ambiguity codes for the consensus sequences the two alleles are printed out for each sample at each locus. Output of polyploid (>2) alleles are not yet supported.
Figure: example .alleles output (https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/dotAllele.png)
This format lists only the variable sites in each locus with a space between SNPs from different loci. An underscore represents an invariable locus. Loci are in the same order as in the .loci file. Paired loci are treated as a single locus, meaning SNPs from the two loci are not separated in this file (linked).
Figure: example .snps output (https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/dotSNPs.png)
Each column contains one SNP sampled from one locus. If a locus contains multiple SNPs, the SNP site with the least missing data across taxa is sampled; if sites have equal amounts of missing data, one is randomly sampled.
Figure: example .usnps output (https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/dotUSNPs.png)
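The selection rule can be sketched as follows (illustrative only; `sample_usnp` is an invented helper, where each SNP site is a tuple of per-taxon bases and 'N' marks missing data):

```python
# Pick one SNP per locus: least missing data, ties broken at random.
import random

def sample_usnp(snp_columns, rng=random):
    """snp_columns: list of per-site base tuples; 'N' counts as missing."""
    fewest = min(col.count("N") for col in snp_columns)
    best = [col for col in snp_columns if col.count("N") == fewest]
    return rng.choice(best)

cols = [("A", "G", "N"), ("C", "T", "T")]   # second site has no missing data
print(sample_usnp(cols))   # ('C', 'T', 'T')
```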
The excluded loci file is in the same format as the .loci file but shows the loci that were filtered out in step 7. The filter is listed below each locus. %P indicates the locus was filtered as a paralog because a heterozygous site was shared across more than the allowed number of samples. %D indicates the locus had more than one sequence from the same individual. %S indicates there were more SNPs detected in the locus than the set maximum allowed number. %I means there were more indels in the locus than the max allowed number.
>A CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>B CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>C CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>D CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCMCCCGCGATCT
>E CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCAGCACCCCCGCGATCT
>F CCAACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCAGCAACCCCGCGATCT
//%P * *-- |
>A CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>B CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>C CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>D CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
>D CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
//%D - - |
>A CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>B CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGGACGGACGGACGGA
>C CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGGACGGACGGACGGA
>D CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
//%S - - ** ******* *|
...
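As a quick post-hoc summary of a file like the one above, the sketch below tallies how many loci were removed by each filter. It is not part of pyRAD; it only assumes the format shown above, where each locus ends with a "//" line carrying a filter code such as %P.

```python
from collections import Counter

# Filter codes as described above; the prose labels are paraphrases.
FILTER_NAMES = {
    "%P": "shared heterozygous site (putative paralog)",
    "%D": "duplicate sequence from one individual",
    "%S": "too many SNPs",
    "%I": "too many indels",
}

def tally_excluded(lines):
    """Count loci per filter code, given the lines of a .excluded-style
    file where each locus ends with a '//' line carrying its code."""
    counts = Counter()
    for line in lines:
        if line.startswith("//"):
            for code in FILTER_NAMES:
                if code in line:
                    counts[code] += 1
    return counts
```

Running this over the example above would report one locus each for %P, %D, and %S.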
Clustering across samples (step 6 of the analysis) can be quite fast for data sets with few sampled individuals, but for studies with many (>100) samples this step can take a very long time (weeks), depending on the total number of unique sequences. Hierarchical clustering splits this job into multiple smaller jobs that each run on a separate processing core, allowing large data sets to be clustered many times faster.
The greatest speed increase comes from excluding, at the first step of clustering, loci with significant missing data within a subgroup. By excluding singletons, or loci with little shared data, early in the clustering process, time is not spent comparing them to all of the other, more distantly related samples. This is particularly relevant if you plan to retain in your final data set only loci that recover data shared across many individuals.
SVG("https://dl.dropboxusercontent.com/u/2538935/PYRAD_TUTORIALS/figures/twostep2.svg")
The figure above illustrates the general concept. Individuals are grouped a priori into a small number of clades or populations, in this case a, b, and c. The assignment of individuals does not have to match their true phylogeny exactly; rather, it is an approximation or heuristic in which individuals grouped together are expected to share more data than those that are not.
In this case each clade has four sampled taxa. To group our samples into these three clades we assign them to groups in the bottom lines of the params file.
Each group is listed on a separate line; blank lines are allowed. Each line contains three elements, separated by spaces: (1) a group name (e.g., 'A' or '1'); the names are arbitrary, just choose something different for each group. (2) The minimum size of a cluster within this group; any cluster smaller than this will not be clustered across groups. (3) The names of the samples/individuals in this group, either listed individually and separated by commas, or selected with a wildcard (e.g., *).
The sample names being matched are the prefixes of the ".consens.gz" files located in the clust directory. For example, imagine we have the samples below.
sample1.consens.gz
sample2.consens.gz
sample3.consens.gz
sample4.consens.gz
sample5.consens.gz
sample6.consens.gz
sample7.consens.gz
sample8.consens.gz
sample9.consens.gz
Imagine also that we know samples 1-4 are individuals of species A, samples 5-8 are individuals of species B, and sample9 is from a more distant outgroup species. We could cluster them using the following settings:
A 4 sample1,sample2,sample3,sample4
B 2 sample5,sample6,sample7,sample8
C 1 sample9
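The group-line format above can be parsed with a few lines of Python. This is a minimal sketch of the format as described, not pyRAD's own parser: it matches comma-separated names or wildcard patterns against the prefixes of the ".consens.gz" files.

```python
import fnmatch

def parse_groups(lines, consens_files):
    """Parse group lines of the form 'NAME MINSIZE sample1,sample2,...'
    (wildcards such as * allowed) and match the patterns against the
    prefixes of .consens.gz filenames. A sketch of the format described
    above; pyRAD's own parser may differ."""
    prefixes = [f[: -len(".consens.gz")] for f in consens_files]
    groups = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # blank or malformed lines are skipped
        name, minsize, patterns = parts[0], int(parts[1]), parts[2]
        members = []
        for pat in patterns.split(","):
            members += fnmatch.filter(prefixes, pat)
        groups[name] = (minsize, sorted(set(members)))
    return groups
```

With the nine example files above, the line `C 1 sample9` would yield group C with minimum size 1 and the single member `sample9`; a wildcard such as `sample*` would match all nine prefixes.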
The minimum cluster size option tells pyRAD to cluster across clades only those loci that have at least N matches within a clade. In this way, loci that are singletons, or that are shared by very few taxa, are not clustered across clades. In the case of the outgroup sample in group C, we entered 1 for the minimum size, and thus all reads from this sample are clustered against the other two clades. In contrast, we chose 4 for clade A and 2 for clade B. This means that only loci with data for all four samples in clade A will be clustered with clades B and C, and only loci with data from at least two individuals in clade B will be clustered with clades A and C.
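The minimum-size rule can be sketched as a simple filter applied to each group's within-group clusters before the across-group step. The data structures here are hypothetical illustrations (a cluster is just a list of member sequences), not pyRAD internals.

```python
def filter_within_group_clusters(group_clusters, min_sizes):
    """For each group, drop clusters smaller than that group's minimum
    size before across-group clustering, mirroring the rule described
    above. `group_clusters` maps group name -> list of clusters; each
    cluster is a (hypothetical) list of member sequences."""
    kept = {}
    for group, clusters in group_clusters.items():
        minsize = min_sizes.get(group, 1)
        kept[group] = [c for c in clusters if len(c) >= minsize]
    return kept
```

With a minimum size of 4 for clade A, a cluster recovered in only three of its four samples would be dropped from the across-group step, while a minimum size of 1 (as for the outgroup in group C) keeps even singletons.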
Following this logic, you can assign individuals to clades in a way that speeds clustering but also maximizes the amount of data retained in your data set. If a clade has very few individuals, or those individuals have less data overall than other samples, you may want to set a low minimum cluster size to discard as little data as possible. In general, though, dramatic speed increases come only when at least some clades have a minimum cluster size greater than 1.
An example tutorial implementing hierarchical clustering, which provides more detail, is available in the Examples section above.
coming soon...