This is the full tutorial for pyRAD v.3.0. It is an attempt to explain the inner workings of each parameter setting used in assembling RADseq data sets in pyRAD. I highly recommend that users begin by doing a run-through of the RAD tutorial (v.3.0), which only takes a few minutes, and will help you to gain a general sense of how the program works and what the data and results look like. Then follow up with a more detailed look at this tutorial to see the full range of options available, and if you still have questions, please send them to the google group at the link above.
Reduced-representation genomic sequence data (e.g., RADseq, GBS, ddRAD) are commonly used to study population-level research questions and consequently most software packages for assembling or analyzing such data are designed for sequences with little variation across samples. Phylogenetic analyses typically include species with deeper divergence times (more variable loci across samples) and thus a different approach to clustering and identifying orthologs will perform better.
pyRAD, the software pipeline described here, is intended to maximize phylogenetic information across disparate samples from RAD, ddRAD, or GBS data, hereafter referred to generally as RADseq. Written in Python, the program is flexible and easy to use. The code is human-readable and open-source, and may be modified to meet users' specific needs. With respect to constructing phylogenetic data sets, the largest difference between pyRAD and alternative programs such as Stacks is in the way that loci are clustered. Stacks clusters reads by grouping those with fewer than N differences between them, but does not allow for indels, so the presence of even a single indel causes a frameshift that results in many base differences between otherwise highly similar sequences. pyRAD instead uses a measure of sequence similarity based on a global alignment of sequences through the program VSEARCH (or USEARCH). Alignment-based clustering allows for indels and should perform better in data sets that include more distantly related samples.
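To illustrate why alignment clustering matters, the toy example below (plain Python, not pyRAD code) compares a position-by-position mismatch count against an edit distance that allows indels. A single deleted base inflates the former, while the latter sees only the indel itself.

```python
# Illustration (not pyRAD code): a single indel inflates a positional
# mismatch count, while an alignment-style edit distance sees one event.
def hamming(a, b):
    """Count positional mismatches between equal-length sequences (no gaps)."""
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a, b):
    """Levenshtein distance: substitutions plus indels, as in a global alignment."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # match/mismatch
        prev = curr
    return prev[-1]

seq1 = "ACGTACGTACGT"
seq2 = "ACTACGTACGTA"  # seq1 with its first G deleted (reads stay fixed-length)
print(hamming(seq1, seq2))        # 10: nearly every downstream site mismatches
print(edit_distance(seq1, seq2))  # 2: one deletion plus the base shifted in at the end
```

A distance-N clustering rule would treat these two reads as very different, while a global alignment recognizes them as nearly identical.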
pyRAD was written with the aim that it be easy to use, fast, and flexible. It is easy in that installation does not require any difficult libraries or extensions, and in that it requires few arguments at the command line. It is fast, using both multi-processing and multi-threading, and it also offers a number of heuristics to speed clustering of extremely large data sets. And it is flexible in that it has built-in methods to account for different library types (RAD, ddRAD, GBS, paired-ddRAD, paired-GBS) as well as quality filtering, including overlap among paired or reverse-complemented reads and the presence of adapter sequences. Thus, data require little or no filtering before analysis in pyRAD.
Two fast clustering methods are described further below: hierarchical clustering for large data sets, and split clustering for paired-end ddRAD data. Using these methods, data sets with even many hundreds of individuals can be assembled in a few days by distributing the work across multiple processing cores. pyRAD can also be used to perform D-statistic tests on RADseq loci to test for historical introgression. Assembled data sets can be output in a number of formats for easy input into external applications such as BEAST, RAxML, TreeMix, Structure, Admixture, and migrate-n.
An open access pre-print describing pyRAD v.1 and comparing it with Stacks is available here: link
git clone https://github.com/dereneaton/pyrad.git
With git, future updates can be obtained by using cd to move into the pyrad directory and running git pull.
pyRAD requires the common Python packages Numpy and Scipy, and two external programs, muscle and vsearch.
These will already be available on almost any HPC machine. To install them on a personal machine running Ubuntu Linux you can use the command sudo apt-get install python-numpy python-scipy. On a Mac there are several ways to install them, depending on the version of your OS; I recommend using conda install, which you should be able to find with a quick Google search.
Muscle is used for sequence alignments. It will already be available on most HPC machines. It can be installed on Ubuntu Linux with the command apt-get install muscle, or you can find binaries at www.drive5.com/muscle/downloads.
Vsearch is used to cluster sequences by similarity. Both source and binaries are available here: https://github.com/torognes/vsearch. Older versions of pyRAD required the program Usearch (usearch_v.7.0.1090) for clustering, and this is still supported. However, I now recommend vsearch as a replacement because it is free and open-source, and I've found that it can perform more than 30X faster while yielding identical results. The speed improvement does come at the cost of a greater memory load.
Take note of the location where you install muscle and vsearch. Below I will describe how to ensure pyRAD can find your installations.
The barcodes file is a simple table linking barcodes to samples. Barcodes can be of varying lengths. Each line should have one name and one barcode, separated by a tab or space.
sample1 ACAGG
sample2 ATTCA
sample3 CGGCATA
sample4 AAGAACA
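As a quick sanity check, a barcodes file like the one above can be parsed with a few lines of Python (a minimal sketch; read_barcodes is my own illustrative helper, not part of pyRAD):

```python
# Minimal sketch of reading a barcodes file: one sample name and one
# barcode per line, separated by tabs or spaces. Illustrative helper only.
def read_barcodes(text):
    barcodes = {}
    for line in text.splitlines():
        if not line.strip():
            continue                  # blank lines are ignored
        name, barcode = line.split()  # any run of tabs or spaces separates
        barcodes[name] = barcode.upper()
    return barcodes

table = """sample1 ACAGG
sample2 ATTCA
sample3 CGGCATA
sample4 AAGAACA"""
print(read_barcodes(table))
```

Note that barcodes may be of different lengths, as in this example.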
If your data are already de-multiplexed (sorted) into separate files for each individual then a barcodes file is not needed.
The parameter input file, usually named params.txt, can be created with the -n option in pyRAD, as shown below. This file lists all of the options or parameter settings necessary for an analysis.
Every line has a parameter (e.g., a number or character string) followed by any number of spaces or tabs and then two hash marks (“##”), after which the parameter is described or comments can be added. In parentheses I list the step of a pyRAD analysis affected by each parameter. If a line is left blank the default value is used. I highly recommend beginning every analysis by creating an empty template params.txt file with the following command:
pyRAD -n
The following section describes the params file line by line. The first 14 lines are required, all lines following this are optional and used for more advanced analyses or data filtering. For each input line I list several possible formats:
## 1. uses current as working directory
./ ## 1. uses current as working directory
/home/user/RAD/ ## 1. uses set location as working directory
./raw/*.fastq ## 2. path to raw data files
raw/*.fastq.gz ## 2. path to raw data files (gzipped)
./barcodes.txt ## 3. path to barcodes file
mydata.barcodes ## 3. path to barcodes file
vsearch ## 4. in systemPATH
/user/bin/usearch7.0.1090_i86linux32 ## 4. full path
usearch7.0.1090_i86linux32 ## 4. in system PATH
muscle ## 5. path to call muscle
muscle3.8.31_i86linux32 ## 5. path to muscle long name
TGCAG ## 6. PstI cutsite overhang C|TGCAG
CWGC,AATT ## 6. ddRAD example for apeKI, EcoRI
8 ## 7. N processors
5 ## 8. mindepth for base calls
4 ## 9. maxN: max number of Ns in reads
.90 ## 10. clustering threshold
rad ## 11. Datatype
pairddrad ## 11. Datatype
4 ## 12. minCov
4 ## 13. maxSharedH.
arabidopsis_c90d5m4p4 ## 14. output name prefix
These options are not necessary for all analyses and can be left empty in the params file, in which case pyRAD uses their default values. They mostly pertain to options for additional quality filtering, for paired-end analyses, or for increasing the speed of analyses.
A ## 15. selects prefix subset of files
outg1,outg2 ## 16. outgroup selector for step 7
sample3,sample4 ## 17. Exclude from step 7
~/RAD/fastq/*.fastq ## 18. Sorted files
@~/RAD/fastq/*.fastq ## 18. sorted and stripped files
1 ## 19. max barcode mismatches
33 ## 20. min Quality score step 2
0 ## 21. strictness of Q filter
0.0005,0.001 ## 22. E and H, respectively
5 ## 23. MaxN in consensus seqs
5 ## 24. MaxH in a consensus seq
1 ## 25. diploid filter max haplos=2
10 ## 26. max SNPs in a final locus
8,12 ## 26. max SNPs in 1st and 2nd pair read
3 ## 27. max indels in a within-sample cluster
3,10 ## 27. max within-sample, max for across-sample cluster
3,6,10,10 ## 27. max within_read1,within_read2,across_read1,across_read2
998877 ## 28. random number seed
1,1 ## 29. overhang/trim ends of locus
a,n ## 30. outputs formats
1 ## 31. call majority base on low depth sites
0 ## 32. don't keep trimmed sequences
50 ## 32. keep trimmed longer than 50
## 33. default max depth option
50000 ## 33. a set max depth
9999999999 ## 33. no max depth
2 ## 34. exclude singletons
5 ## 34. exclude reads occurring less than 5 times
1 ## 35. use hierarchical clustering
0 ## 36. turn off repeat masking
6 ## 37. N vsearch threads
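As an aside, the barcode-mismatch setting on line 19 can be pictured with a small sketch (my own illustration using simple Hamming-distance matching; pyRAD's actual matching code may differ):

```python
# Illustration of barcode matching with a mismatch tolerance (params
# line 19). Simple Hamming matching against the start of each read;
# not pyRAD's actual code.
def match_barcode(read, barcodes, max_mismatch=1):
    """Return the sample whose barcode fits the start of the read, or None."""
    for sample, bc in barcodes.items():
        mismatches = sum(a != b for a, b in zip(bc, read))
        if mismatches <= max_mismatch:
            return sample
    return None

barcodes = {"sample1": "ACAGG", "sample2": "ATTCA"}
print(match_barcode("ACAGGTGCAG", barcodes))  # exact match -> sample1
print(match_barcode("ACTGGTGCAG", barcodes))  # one mismatch -> still sample1
print(match_barcode("GGGGGTGCAG", barcodes))  # no barcode close enough -> None
```

Allowing a mismatch or two rescues reads with sequencing errors in the barcode, at a small risk of mis-assignment when barcodes are very similar.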
Group/clade assignments. These are used for "population"-formatted output files (e.g., treemix, migrate-n) and for hierarchical clustering. Each group is listed on a separate line, and it is OK to leave a line blank. Each line contains three elements, separated by spaces or tabs: (1) a group name (e.g., 'A' or '1'); these names are arbitrary, just choose something different for each group. (2) The minimum size of a cluster within this group; any cluster smaller than this will not be clustered across groups, or will not be output in the assembled data set. (3) The names of the samples/individuals in this group. Sample names can be listed individually, comma-separated, or selected using a wildcard selector (e.g., *). Example:
A 4 1A0,1B0,1C0,1D0
B 4 2*
C 4 3*
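The wildcard selector behaves like shell-style filename matching. A sketch of how such selectors can be expanded against sample names (an illustration using Python's fnmatch module, not pyRAD's code):

```python
# Sketch of expanding a group-assignment selector against sample names.
# Shell-style wildcard matching via fnmatch; illustration only.
import fnmatch

samples = ["1A0", "1B0", "1C0", "1D0", "2E0", "2F0", "3G0", "3H0"]

def expand(selector, samples):
    """Expand a comma-separated list that may contain * wildcards."""
    chosen = []
    for token in selector.split(","):
        chosen.extend(fnmatch.filter(samples, token))
    return chosen

print(expand("1A0,1B0,1C0,1D0", samples))  # explicit comma-separated list
print(expand("2*", samples))               # wildcard selects 2E0 and 2F0
```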
pyRAD should either be called from the directory in which it resides (i.e., alongside its other dependency .py files), or you can create a symlink so the program can be called from anywhere. For example, the command below creates a link so that, provided you place the link in your $PATH, you can call pyRAD by simply typing 'pyRAD'.
ln -s ~/pyrad_v.3.0/pyRAD pyRAD
If the symlink business seems confusing, don't worry. You can just call pyRAD from its directory. For example, to call pyRAD if it had been downloaded to a user’s home/ directory, type the command below, which will print the help menu.
~/pyrad_v.3.0/pyRAD -h
Or if you created a symlink:
pyRAD -h
There are six options to the command line interface of pyRAD:
The parameter input file (params.txt) should be created using the -n option and then filled in based on the type of analysis you plan to do. If the params file is properly filled out, the entire analysis (converting raw RADseq data into assembled de novo loci) can be performed by simply entering:
pyRAD -p params.txt
This will perform all steps (1-7) of the analysis, as described below. If, however, you wish to perform only one or a few steps at a time, you can select them using the -s (steps) option. For example, to run only the first step:
pyRAD -p params.txt -s 1
And to run steps 2-7 you could enter:
pyRAD -p params.txt -s 234567
After finishing one assembly, you could rerun step 7 with the same or a different params.txt file indicating different step 7 filtering options.
pyRAD -p params.txt -s 7
When learning new software, or applying it to a new data set, it can take time to find the appropriate parameter settings and options. This can be especially frustrating with computationally intensive analyses, where one might have to wait many hours (or days) before discovering a simple mistake. The -s option is an attempt to reduce this annoyance by breaking the analysis into seven discrete jobs, each of which is performed sequentially. In this way, you know whether step 1 worked before moving to step 2. And if step 2 fails, you retain the results of step 1 and can start again from where the error arose. I suggest running the example data sets (tutorials) to make yourself familiar with the steps of an analysis and what the output files and statistics look like before beginning your own analysis.
Step 1: de-multiplexing. This step uses information from the barcodes file to separate sequences from your raw fastq files into a separate file for each sample. These are placed in a new directory within your working directory called “fastq/”. File names are not important for single end data. However, for paired-end reads it is necessary that the raw data file names follow a specific format: The first read files must contain "_R1_" in the name, and the second read files must be identical to the first read files but with "_R2_" in place of "_R1_". Here is an example pair of input files:
yourfilename_R1_001.fastq
yourfilename_R2_001.fastq
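The pairing convention means the mate of any first-read file can be derived mechanically (mate_file below is an illustrative helper, not part of pyRAD):

```python
# Sketch of the "_R1_"/"_R2_" pairing convention: the second-read file
# name is the first-read name with "_R2_" in place of "_R1_".
def mate_file(r1_name):
    assert "_R1_" in r1_name, "first-read files must contain _R1_"
    return r1_name.replace("_R1_", "_R2_")

print(mate_file("yourfilename_R1_001.fastq"))  # yourfilename_R2_001.fastq
```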
A reminder, if your data are already demultiplexed this step can be skipped. You will not need to enter a location for your raw data in the params file, but instead you must enter the location of your demultiplexed files into the proper location on line 18 of the params file.
Step 2: filtering. This step uses the Phred quality score recorded in the FASTQ data files to filter low quality base calls. Sites with a score below a set value are changed into “N”s, and reads with more than the number of allowed “N”s are discarded. Files are written to the “edits/” directory with the suffix “.edit”. It also implements a number of optional filters.
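The gist of this filter can be sketched as follows (an illustration only: the Phred+33 offset, the quality cutoff of 20, and the maxN of 4 are example values, and filter_read is not pyRAD's actual code):

```python
# Sketch of step 2's quality filter: bases with a Phred score below the
# cutoff become N, and reads with more than maxN Ns are discarded.
# Assumes a Phred+33 quality encoding; all thresholds are illustrative.
def filter_read(seq, qual_string, min_q=20, maxN=4, offset=33):
    scores = [ord(c) - offset for c in qual_string]
    edited = "".join(b if q >= min_q else "N" for b, q in zip(seq, scores))
    return edited if edited.count("N") <= maxN else None

print(filter_read("ACGTACGT", "IIII!III"))  # one low-quality base becomes N
print(filter_read("ACGTACGT", "!!!!!!!!"))  # all bases low quality -> discarded
```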
Step 3: within-sample clustering. This step first dereplicates the filtered sequences from step 2, recording the number of times each unique read is observed. These are then clustered using VSEARCH, recording if they match within a set sequence similarity. Sequences that clustered together are then aligned and written to a new file in the “clust.xx/” directory with the ending “.clustS.gz”.
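The dereplication part of this step can be sketched with a few lines of Python (an illustration, not pyRAD's code):

```python
# Sketch of dereplication before within-sample clustering: identical
# reads are collapsed to one record with an observation count, so the
# clustering program only sees each unique sequence once.
from collections import Counter

reads = ["TGCAGAACCT", "TGCAGAACCT", "TGCAGAACCT", "TGCAGTTCCT"]
derep = Counter(reads)
for seq, count in derep.most_common():
    print(seq, count)
```

The counts are carried along so that read depths can be recovered after clustering.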
Step 4: error-rate and heterozygosity estimates. This step uses the ML equation of Lynch (20XX) to jointly estimate error rate and heterozygosity from the base counts in each site across all clusters. Results are written to the Pi_estimate.txt file in the stats/ directory.
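The general idea can be sketched as a toy maximum-likelihood calculation (a deliberately simplified two-state mixture over error rate e and heterozygosity h, maximized on a coarse grid; this is an illustration of the concept, not pyRAD's actual model):

```python
# Toy sketch of jointly estimating error rate (e) and heterozygosity (h)
# from base counts. Each site contributes a read depth n and a count k of
# its most common base. Simplified illustration, not pyRAD's model.
from math import comb, log

def loglik(sites, e, h):
    total = 0.0
    for n, k in sites:
        # homozygous site: the n-k minority reads are sequencing errors
        homo = comb(n, k) * (1 - e) ** k * (e / 3) ** (n - k)
        # heterozygous site: each allele expected in half the reads
        het = comb(n, k) * 0.5 ** n
        total += log((1 - h) * homo + h * het)
    return total

def estimate(sites):
    e_grid = [0.0001, 0.001, 0.005, 0.01, 0.05]
    h_grid = [0.001, 0.01, 0.05, 0.1, 0.2, 0.5]
    return max(((e, h) for e in e_grid for h in h_grid),
               key=lambda p: loglik(sites, *p))

# 90 invariant sites plus 10 sites with a 50/50 base split
sites = [(10, 10)] * 90 + [(10, 5)] * 10
e, h = estimate(sites)
print(e, h)
```

With one site in ten showing a balanced base split and no other mismatches, the grid search lands on a low error rate and a heterozygosity near 0.1, matching intuition.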
Step 5: create consensus sequences. Using the mean error rate and heterozygosity estimated in step 4, this step creates consensus sequences for each cluster. Clusters with less than the minimum coverage, more than the maximum number of undetermined sites, more than the maximum number of heterozygous sites, or more than the allowed number of alleles are discarded. In diploid data, if two alleles are present the phase of heterozygous sites is retained in the consensus sequences.
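At a heterozygous site the two alleles are recorded with an IUPAC ambiguity code, which can be sketched as (illustrative helper, not pyRAD's code):

```python
# Sketch of consensus base calling at a heterozygous site: the two
# observed alleles are encoded as an IUPAC ambiguity code.
AMBIG = {frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
         frozenset("GT"): "K", frozenset("AT"): "W", frozenset("CG"): "S"}

def consensus_base(allele1, allele2):
    if allele1 == allele2:
        return allele1          # homozygous site: plain base
    return AMBIG[frozenset((allele1, allele2))]

print(consensus_base("A", "A"))  # A
print(consensus_base("A", "G"))  # R
```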
Step 6: across-sample clustering. Consensus sequences are clustered across samples using the same settings as in step 3. For heterozygous sites, one allele is randomly sampled and used in this step, although both alleles are retained in the final data set.
Step 7: alignment, filtering for paralogs, and output of a human-readable fasta-like file (.loci). A large number of alternative formats are also available (see output formats). Final assembly statistics are written to the stats/ directory in the .stats file. This step is relatively fast, and can be repeated with different values for options 12, 13, 14, 16, 17, etc. to create different data sets that are optimized in coverage for different samples.
The following reduced representation library data types are supported by pyRAD, and I list beside each my general experience with analyzing empirical data sets. My quality judgements are based on the frequency with which users have generated data sets that include many short-fragment reads with adapter sequences, or which require extra types of analyses (e.g., reverse-complement matching for overlapping or paired-end GBS reads). This may change as library preparation methods for GBS and ddRAD libraries improve. I describe each data type and how pyRAD treats them differently below.
RAD (very good)
ddRAD (usually good)
GBS (sometimes difficult/messy)
nextRAD (very good)
paired ddRAD (sometimes difficult/messy)
paired GBS (very difficult/messy)
other types (... I haven't tried yet)
In general single-end Illumina data that employ a good size selection window will produce very good and usable data by any of the library prep methods above. Problems tend to arise when the size selection is too small (or non-existent), and especially if this coincides with paired-end sequencing.
Poor size selection has little effect on traditional RAD or on single-end ddRAD, as noted above, except that it can include adapter sequences in the reads, which can be filtered out. It has a more significant effect on paired ddRAD, and a very significant effect on single or paired GBS data. This is because short fragments will be sequenced from both ends in these types of data, leading to sequence overlap.
In paired ddRAD the first and second reads come in pairs, and pyRAD can test for overlap of these reads using the merge overlap option. This will remove merged reads and write them to a separate file. These can be analyzed separately from the non-overlapping reads if they make up a large proportion of your data set. Recommendations for this type of analysis are made in the paired ddRAD tutorial. I only recommend trying to rescue merged reads if they comprise a large proportion of your data.
For GBS data the problem is more difficult. In single-end data the Illumina adapter can ligate to either end of a fragment, meaning short single-end fragments can overlap, leading to duplication in the data set. pyRAD can correct for this by performing reverse-complement clustering, which is much slower than normal clustering. It detects overlap of the reads and groups them into partially or fully overlapping clusters, which are then used to make consensus base calls to create merged contigs. Recommendations for this type of analysis are made in the GBS tutorial.
For paired GBS data the merge overlap option can merge overlapping reads, however the first and second reads are completely interchangeable, so reverse complement clustering is also required in this case.
See Example Tutorials at this page
By default pyRAD will return the final data set and excluded loci in the easily readable .loci format.
These data can be transformed into a large number of additional formats, each listed by a single letter in the params file:
And two more complicated output formats that allow pooling individuals into populations/clades using the assignments in the params file. See instructions in the Example RAD tutorial.
Examples are best viewed in the RAD tutorial above, but are also described below.
This is a custom format that is easy to read, showing each individual locus with variable sites indicated. Custom scripts can easily parse this file for loci containing certain amounts of taxon coverage or variable sites. It is also the most easily readable file for verifying that your analyses are working properly. A (-) indicates a variable site, and a (*) indicates the site is phylogenetically informative. Example:
For paired-end data the two linked loci are shown separated by a 'xxxx' separator.
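My reading of the annotation line can be sketched as follows (an illustration that ignores ambiguity codes and missing data; not pyRAD's code):

```python
# Sketch of the .loci annotation line: "-" marks a variable site, "*"
# marks a phylogenetically informative site (two or more states each
# observed in two or more samples). Ignores ambiguity codes and Ns.
from collections import Counter

def annotate(seqs):
    marks = []
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) == 1:
            marks.append(" ")                                   # invariant
        elif sum(1 for c in counts.values() if c >= 2) >= 2:
            marks.append("*")                                   # informative
        else:
            marks.append("-")                                   # autapomorphy
    return "".join(marks)

seqs = ["ACGTA",
        "ACGTT",
        "ATGTA",
        "ACGTT"]
print(annotate(seqs))  # " -  *"
```

Here the second column varies in only one sample (a singleton, "-"), while the last column has two states each shared by two samples ("*").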
This is a phylip formatted data file which contains all of the loci from the .loci file concatenated into a supermatrix, with missing data for any sample filled in with N's. This format is used in RAxML among other phylogenetic programs.
This is a nexus-formatted data file which contains all of the loci from the .loci file concatenated into a supermatrix, printed in an interleaved format, with missing data for any sample filled in with N's and with header information describing the data prepended. This format is used in BEAST among other phylogenetic programs.
This is the same as the .loci format but instead of ambiguity codes for the consensus sequences the two alleles are printed out for each sample at each locus. Output of polyploid (>2) alleles is not supported.
This format lists only the variable sites in each locus with a space between SNPs from different loci. An underscore represents an invariable locus. Loci are in the same order as in the .loci file. Paired loci are treated as a single locus, meaning SNPs from the two reads are not separated in this file (they're linked).
Each column contains one SNP sampled from one locus. If a locus has multiple SNPs, the SNP site containing the least missing data across taxa is sampled; if several sites have equal amounts of missing data, one of them is sampled at random.
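That sampling rule can be sketched as (illustrative helper with a fixed seed; not pyRAD's code):

```python
# Sketch of sampling one unlinked SNP per locus: among a locus's SNP
# columns, take the one with the least missing data, breaking ties
# at random. Illustration only.
import random

def sample_unlinked(snp_columns, seed=42):
    """snp_columns: one string per SNP site; 'N' and '-' count as missing."""
    rng = random.Random(seed)
    least = min(col.count("N") + col.count("-") for col in snp_columns)
    candidates = [c for c in snp_columns
                  if c.count("N") + c.count("-") == least]
    return rng.choice(candidates)

print(sample_unlinked(["ACNN", "AGAG", "CTNT"]))  # "AGAG": no missing data
```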
The excluded loci file is in the same format as the .loci file but shows the loci that were filtered out in step 7. The filter is listed below each locus. %P indicates the locus was filtered as a paralog because the samples shared a heterozygous site over more than the allowed number of samples. %D indicates the locus had more than one sequence from the same individual. %S indicates there were more SNPs detected in the locus than the set maximum allowed number. %I means there were more indels in the locus than the max allowed number.
>A CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>B CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>C CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>D CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCMCCCGCGATCT
>E CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCAGCACCCCCGCGATCT
>F CCAACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCAGCAACCCCGCGATCT
//%P * *-- |
>A CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>B CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>C CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>D CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
>D CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
//%D - - |
>A CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>B CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGGACGGACGGACGGA
>C CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGGACGGACGGACGGA
>D CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
//%S - - ** ******* *|
...
Clustering across samples (step 6 of the analysis) can be quite fast for data sets with few sampled individuals, but for large studies (many hundreds of individuals) this step can sometimes take a very long time (>1 week), depending on the total number of unique sequences. Hierarchical clustering splits this job into multiple smaller jobs that each run on a separate processing core, allowing large data sets to be clustered many times faster.
(With vsearch, clustering times are much shorter, so I no longer recommend excluding data through hierarchical clustering, except perhaps for very, very large data sets.)
The most significant speed increase comes from excluding, at the first step of clustering, loci that have significant missing data within a subgroup. By excluding singletons, or loci with little shared data, early in the clustering process, no time is spent comparing them against all of the other more distantly related samples. This is particularly relevant if you plan to keep only loci that recover data shared across many individuals in your final data set.
However, even without excluding data, hierarchical clustering can still be useful, particularly for studies of highly divergent organisms. Because the input order of sequences can affect their clustering, hierarchical clustering can better cluster very highly divergent sequences by having close relatives cluster with each other first, before clustering with more distant groups.
The figure above illustrates the general concept. Individuals are grouped a priori into a small number of clades or populations, in this case a, b, and c, through their assignment listed in the params file. The assignment of individuals does not have to be perfectly correct in terms of knowing their true phylogeny, rather this is an approximation or heuristic, where those grouped together are expected to share more data than those which are not grouped together.
In this case each clade has four sampled taxa. To group our samples into these three clades we assign them to groups in the bottom lines of the params file.
Each group is listed on a separate line, and it is OK to leave a line blank. Each line contains three elements, separated by spaces or tabs: (1) a group name (e.g., 'A' or '1'); these names are arbitrary, just choose something different for each group. (2) The minimum size of a cluster within this group; any cluster smaller than this will not be clustered across groups (1 = do not exclude any data). (3) The names of the samples/individuals in this group. Sample names can be listed individually, comma-separated, or selected using a wildcard selector (e.g., *).
The sample names that this is matching to are the prefix names to the ".consens.gz" files located in the clust directory. For example, imagine we have the samples below.
sample1.consens.gz
sample2.consens.gz
sample3.consens.gz
sample4.consens.gz
sample5.consens.gz
sample6.consens.gz
sample7.consens.gz
sample8.consens.gz
sample9.consens.gz
Imagine also that we know samples 1-4 are individuals of species A, samples 5-8 are individuals of species B, and sample9 is from a more distant outgroup species. We could cluster them using the following settings:
A 4 sample1,sample2,sample3,sample4
B 2 sample5,sample6,sample7,sample8
C 1 sample9
The minimum cluster size option tells pyRAD to only cluster loci across clades that have at least N matches within a clade. In this way loci that are singletons, or shared by very few taxa are not clustered across clades. In the case of the outgroup sample C, we entered 1 for the minimum size, and thus all reads from this sample are clustered to the other two clades. In contrast, we chose 4 for clade A, and 2 for clade B. This means only loci that have data for all four samples in clade A will be clustered with clades B and C, and only loci with data from at least 2 individuals in clade B will be clustered with clades A and C.
Following this logic, you can assign individuals to clades in a way that speeds clustering but also maximizes the amount of data retained in your data set. If a clade has very few individuals in it, or those individuals have less data overall than other samples then you may want to set a low minimum cluster size to discard as little data as possible.
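The filter itself amounts to a simple coverage check per within-group cluster (a sketch under the group assignments from the example above; not pyRAD's code):

```python
# Sketch of the minimum-cluster-size filter applied before across-group
# clustering: a within-group cluster is passed on only if it contains
# data for at least that group's minimum number of samples.
def pass_filter(cluster_samples, group_minimums, group_of):
    """cluster_samples: sample names present in one within-group cluster."""
    group = group_of[cluster_samples[0]]
    return len(cluster_samples) >= group_minimums[group]

group_of = {s: "A" for s in ["sample1", "sample2", "sample3", "sample4"]}
group_of.update({s: "B" for s in ["sample5", "sample6", "sample7", "sample8"]})
group_minimums = {"A": 4, "B": 2}

print(pass_filter(["sample1", "sample2", "sample3", "sample4"],
                  group_minimums, group_of))  # all of clade A -> kept
print(pass_filter(["sample5", "sample6"],
                  group_minimums, group_of))  # 2 of clade B -> kept
print(pass_filter(["sample1", "sample2"],
                  group_minimums, group_of))  # only 2 of clade A -> dropped
```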
There is an example tutorial implementing hierarchical clustering and which provides more detail in the Examples section above.
coming soon...
For the migrate-n and treemix outputs, the minimum size option in the population assignments filters out loci that have data for fewer than this number of individuals in that population. Only loci with data for at least the minimum number of individuals across all populations are included in the data set.