The double digest RAD (ddRAD) library preparation method was developed and described by Peterson et al. (2012). Here I discuss __paired-end ddRAD__ data and describe my recommendations for how to set up a params file to analyze them in pyRAD when many of the reads are merged.
A ddRAD library is prepared by cutting genomic DNA with two different restriction enzymes and selecting the intervening fragments that fall within a certain size window. These fragments carry overhangs from the respective cut sites on either side. One side has a barcode+Illumina adapter ligated, and the other end has only the reverse Illumina adapter ligated. The first reads may come in one or multiple files with "_R1_" in the name, and the second reads come in separate files with "_R2_". Second-read files contain the mate of each read in the first-read files, in exactly the same order.
A feature of ddRAD data is that two different cutters are used to generate the data: typically a rare cutter and a common cutter. You will need to know the overhang sequence that each cutter leaves on your reads. This can easily be found by looking at the raw forward and reverse reads files: the invariant sequence near the beginning of reads in the R1 files comes from the first cutter, and the invariant sequence at the end of reads in the R2 files comes from the second cutter. You will list them in this order in the params file, discussed below.
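This check can be sketched in the shell. Here I fake a tiny R1 file with printf just to show the idea (the file name toy_R1.fastq and its reads are made up); with real data you would substitute a zcat of your gzipped R1 file:

```shell
## make a toy R1 file with two fake reads (real data: zcat your _R1_ file instead)
printf '@r0\nTGCAGAAACTTGA\n+\nBBBBBBBBBBBBB\n@r1\nTGCAGGGTTCCAA\n+\nBBBBBBBBBBBBB\n' > toy_R1.fastq

## keep only sequence lines (every 4th line, offset 2) and tally their
## first five characters; the invariant string is the cutter overhang
awk 'NR % 4 == 2' toy_R1.fastq | cut -c 1-5 | sort | uniq -c
```

Both toy reads begin with TGCAG, so a single tally appears. For the second cutter you would tally the last few characters of the R2 reads instead (e.g., by passing the sequence lines through `rev` before the cut).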
Your data will likely come to you non-demultiplexed (meaning not sorted by which individual the reads belong to). You can demultiplex the reads according to their barcodes using pyRAD or separate software. If your reads are already de-multiplexed that is OK as well.
A failure to merge paired-end reads that have overlapping sequences can lead to major problems during the assembly. A number of external programs are available to detect and merge overlapping paired-end reads, and you can run your data through one of them before input to pyRAD. At the time of writing, I recommend the software PEAR (https://github.com/xflouris/PEAR), which I'll demonstrate below.
For this tutorial I simulated paired-end ddRAD reads on a 12 taxon species tree, shown below. You can download the data using the script below and assemble these data by following along with all of the instructions.
%%bash
## download the data
wget -q https://github.com/dereneaton/dereneaton.github.io/raw/master/downloads/simpairddradsmerge.zip
## unzip the data
unzip simpairddradsmerge.zip
Archive:  simpairddradsmerge.zip
  inflating: simpairddradsmerge.barcodes
  inflating: simpairddradsmerge_R1_.fastq.gz
  inflating: simpairddradsmerge_R2_.fastq.gz
You can simply download the data, but it might also be worth describing how the data were generated. In this case I simulated ddRAD-like data using the egglib coalescent simulator with the following options using my program simRRLs.py
indels = 0.005 ## low rate of indels (prob. mutation is indel)
dropout = 0 ## if 1, mutations to restriction site can cause locus dropout.
nloci = 1000 ## Total Nloci simulated (less if locus dropout or merged reads)
ninds = 1 ## sampled individuals per tip taxon
shortfrag = 50 ## smallest size of digested fragments
longfrag = 300 ## largest size of digested fragments
## Because the data are simulated at 20X coverage,
## and the reads are sequenced from both ends (paired-end)
## the total number of reads is:
reads = 12*ninds*nloci*20*2
print "%s sequenced reads" % reads
480000 sequenced reads
Here I execute the simulation script to simulate the data and write the output to files with the prefix simpairddradsmerge. If you simply downloaded the data you do not need to do this step; I show it only for those interested. The relevant point is that the fragment size is randomly selected between 50 and 300 bp, meaning that around half of our fragments will be <200 bp long, which in the case of paired 100 bp reads will lead to overlap.
%%bash
## download the simulation program
wget -q http://www.dereneaton.com/downloads/simRRLs.py
## simulate data with param settings described above
## (requires the Python package `Egglib`) <---- !!
python simRRLs.py 0.005 0 1000 1 50,300 pairddrad simpairddradsmerge
	simulating pairddrad data
	simulating 1000 loci at 20X coverage across 12 tip taxa
	with 1 samples per taxon
	indels arise at frequency of 0.005000 per mutation
	mutations in restriction site = False
	sequencing error rate = 0.0005
	theta=4Nu= 0.0014
	min fragment length allows read overlaps/adapter sequences
	creating new barcode map
	.
We first create an empty params.txt input file for the pyRAD analysis.
The following command will create a template which we will fill in with all relevant parameter settings for the analysis.
%%bash
pyrad -n
new params.txt file created
Take a look at the default options. Each line designates a parameter, followed by a "##" symbol after which comments appear, including a description of the parameter. The affected step for each parameter is shown in parentheses. The first 14 parameters are required; numbers 15-37 are optional.
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.4  **========================  affected step ==
./                        ## 1. Working directory                                 (all)
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 16)  (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 16)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                     ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
2                         ## 7. N processors (parallel)                           (all)
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                         ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.88                       ## 10. Wclust: clustering threshold as a decimal        (s3,s6)
rad                       ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad,merged (all)
4                         ## 12. MinCov: min samples in a final locus             (s7)
3                         ## 13. MaxSH: max inds with shared hetero site          (s7)
c88d6m4p3                 ## 14. Prefix name for final output (no spaces)         (s7)
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
                       ## 19.opt.: maxM: N mismatches in barcodes (def= 1)          (s1)
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
                       ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)   (s5)
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)                 (s3,s6,s7)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
                       ## 30.opt.: output formats: p,n,a,s,v,u,t,m,k,g,* (see docs) (s7)
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================
I will use the sed commands below to substitute new values, but you can simply use any text editor to make the changes. For this analysis I made the following changes from the defaults:
6. set the two restriction enzymes used to generate the ddRAD data
10. lowered clustering threshold to .85
11. set datatype to pairddrad
14. changed the output name prefix
19. mismatches for demultiplexing set to 0, exact match.
24. Raised maxH. Lower is better for filtering paralogs.
30. added additional (all) output formats (e.g., nexus,SNPs,STRUCTURE)
%%bash
sed -i '/## 6. /c\TGCAG,AATT ## 6. cutsites... ' ./params.txt
sed -i '/## 10. /c\.85 ## 10. lowered clust thresh' ./params.txt
sed -i '/## 11. /c\pairddrad ## 11. datatype... ' ./params.txt
sed -i '/## 14. /c\merged ## 14. prefix name ' ./params.txt
sed -i '/## 19./c\0 ## 19. errbarcode... ' ./params.txt
sed -i '/## 24./c\10 ## 24. maxH... ' ./params.txt
sed -i '/## 30./c\* ## 30. outformats... ' ./params.txt
Let's take a look at the edited params.txt file
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.4  **========================  affected step ==
./                        ## 1. Working directory                                 (all)
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 16)  (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 16)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG,AATT                ## 6. cutsites...
2                         ## 7. N processors (parallel)                           (all)
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                         ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.85                       ## 10. lowered clust thresh
pairddrad                 ## 11. datatype...
4                         ## 12. MinCov: min samples in a final locus             (s7)
3                         ## 13. MaxSH: max inds with shared hetero site          (s7)
merged                    ## 14. prefix name
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
0                      ## 19. errbarcode...
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
10                     ## 24. maxH...
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)                 (s3,s6,s7)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
*                      ## 30. outformats...
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================
Four examples of acceptable input file name formats for paired-end data:
1. xxxx_R1_001.fastq xxxx_R2_001.fastq
2. xxxx_R1_001.fastq.gz xxxx_R2_001.fastq.gz
3. xxxx_R1_100.fq.gz xxxx_R2_100.fq.gz
4. xxxx_R1_.fq xxxx_R2_.fq
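Before running step 1 it can be worth confirming that every first-read file has a matching second-read file. A minimal sketch, using touch to fake one matched pair (the rawtest/ directory and file names are placeholders for your raw data):

```shell
## fake one matched pair of raw data files (placeholders for your real files)
mkdir -p rawtest
touch rawtest/xxxx_R1_001.fastq.gz rawtest/xxxx_R2_001.fastq.gz

## for each first-read file, derive the expected second-read name and check it
for r1 in rawtest/*_R1_*; do
    r2=${r1/_R1_/_R2_}
    if [ -f "$r2" ]; then echo "OK: $r1 <-> $r2"; else echo "MISSING mate for $r1"; fi
done
```

The `${r1/_R1_/_R2_}` substitution is a bash-ism, so run this with bash rather than a plain POSIX sh.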
If your data are already demultiplexed skip step 1 and see step 2 below.
Now we run step 1 of the analysis by designating the params file with the -p flag, and the step with the -s flag.
%%bash
pyrad -p params.txt -s 1
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------

	step 1: sorting reads by barcode
	 .
Now we can look at the stats output for this below:
%%bash
cat stats/s1.sorting.txt
file                          Nreads    cut_found    bar_matched
simpairddradsmerge_.fastq.gz  240000    240000       240000

sample    true_bar    obs_bars    N_obs
1C0       AAAAGA      AAAAGA      20000
2G0       AAGTGA      AAGTGA      20000
1D0       AGAATG      AGAATG      20000
2H0       AGGGTG      AGGGTG      20000
1A0       CATCAT      CATCAT      20000
3I0       GAATGG      GAATGG      20000
3K0       GAGTAA      GAGTAA      20000
2E0       GGAGAG      GGAGAG      20000
1B0       GTAGTG      GTAGTG      20000
3L0       GTTGAA      GTTGAA      20000
2F0       TGATTT      TGATTT      20000
3J0       TTAATG      TTAATG      20000
nomatch   _           0
The de-multiplexed reads are written to a separate file for each individual in a new directory, fastq/, created within your working directory.
%%bash
ls fastq/
1A0_R1.fq.gz 1A0_R2.fq.gz 1B0_R1.fq.gz 1B0_R2.fq.gz 1C0_R1.fq.gz 1C0_R2.fq.gz 1D0_R1.fq.gz 1D0_R2.fq.gz 2E0_R1.fq.gz 2E0_R2.fq.gz 2F0_R1.fq.gz 2F0_R2.fq.gz 2G0_R1.fq.gz 2G0_R2.fq.gz 2H0_R1.fq.gz 2H0_R2.fq.gz 3I0_R1.fq.gz 3I0_R2.fq.gz 3J0_R1.fq.gz 3J0_R2.fq.gz 3K0_R1.fq.gz 3K0_R2.fq.gz 3L0_R1.fq.gz 3L0_R2.fq.gz
An individual file will look like this:
%%bash
## FIRST READS file --
## I show only the first 12 lines and 80 characters to print it clearly here.
less fastq/1A0_R1.fq.gz | head -n 12 | cut -c 1-80
@lane1_fakedata0_R1_0 1:N:0:
TGCAGCATGACAATCTATGGACCACAGAGCGCGAAATCTGCTCTCGGACATCAACGTCTGAAGTTCGGGCCCGTAACGCT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_1 1:N:0:
TGCAGCATGACAATCTATGGACCACAGAGCGCGAAATCTGCTCTCGGACATCAACGTCTGAAGTTCGGGCCCGTAACGCT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_2 1:N:0:
TGCAGCATGACAATCTATGGACCACAGAGCGCGAAATCTGCTCTCGGACATCAACGTCTGAAGTTCGGGCCCGTAACGCT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
%%bash
## SECOND READS file
less fastq/1A0_R2.fq.gz | head -n 12 | cut -c 1-80
@lane1_fakedata0_R2_0 1:N:0:
AATTGTTGGTTGTTTTACATGCAGGATATGAACTAGTGGCACTGATAGGATATTATCCCGTGCGAGCGCGTATACCGTGG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R2_1 1:N:0:
AATTGTTGGTTGTTTTACATGCAGGATATGAACTAGTGGCACTGATAGGATATTATCCCGTGCGAGCGCGTATACCGTGG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R2_2 1:N:0:
AATTGTTGGTTGTTTTACATGCAGGATATGAACTAGTGGCACTGATAGGATATTATCCCGTGCGAGCGCGTATACCGTGG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
The reads were sorted into a separate file for the first (R1) and second (R2) reads for each individual.
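A quick way to confirm that the paired files stayed in sync is to check that each sample's R1 and R2 files hold the same number of reads (fastq stores one read per four lines). Sketched here on a one-read toy pair made with printf (the toy file names and reads are made up); with real data you would zcat the files in fastq/ instead:

```shell
## toy demultiplexed pair: one read each (placeholders for fastq/1A0_R1.fq.gz etc.)
printf '@a_R1_0\nTGCAGAAAC\n+\nBBBBBBBBB\n' > toy1A0_R1.fq
printf '@a_R2_0\nAATTGTTGG\n+\nBBBBBBBBB\n' > toy1A0_R2.fq

## a fastq read spans 4 lines, so reads = lines / 4
r1=$(( $(wc -l < toy1A0_R1.fq) / 4 ))
r2=$(( $(wc -l < toy1A0_R2.fq) / 4 ))
[ "$r1" -eq "$r2" ] && echo "read counts match: $r1"
```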
If your data were previously de-multiplexed you need the following things before step 2:
your sorted file names should be formatted similar to above, but with sample names substituted for 1A0, 1A1, etc.
file names should include "_R1." in first read files, and "_R2." in second read files (note this is different from the format for non de-multiplexed data files).
the files can be gzipped (.gz) or not (.fq or .fastq).
the barcodes should already be removed (i.e., not present on the left side of first reads)
the restriction site should not be removed; if it has been, enter a '@' symbol before the location of your sorted data.
Enter on line 18 of the params file the location of your sorted data.
Due to the use of a common cutter (e.g., EcoRI), paired ddRAD data often contain fragments that are shorter than the sequenced read length, such that the first and second reads overlap. It is usually beneficial to remove these from your data. To do so, you could run the following commands in the program PEAR. If you think your data don't have this problem then skip this step (but most data sets have overlaps).
%%bash
## you first need to un-archive the files
for gfile in fastq/*.gz;
do gunzip $gfile;
done
%%bash
## then run PEAR on each data file
for gfile in fastq/*_R1.fq;
do pear -f $gfile \
-r ${gfile/_R1.fq/_R2.fq} \
-o ${gfile/_R1.fq/} \
-n 33 \
-j 4 >> pear.log 2>&1
done
%%bash
tail -n 42 pear.log
PEAR v0.9.7 [February 12, 2015]

Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR
Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593

Forward reads file.................: fastq/3L0_R1.fq
Reverse reads file.................: fastq/3L0_R2.fq
PHRED..............................: 33
Using empirical frequencies........: YES
Statistical method.................: OES
Maximum assembly length............: 999999
Minimum assembly length............: 33
p-value............................: 0.010000
Quality score threshold (trimming).: 0
Minimum read size after trimming...: 1
Maximal ratio of uncalled bases....: 1.000000
Minimum overlap....................: 10
Scoring method.....................: Scaled score
Threads............................: 4

Allocating memory..................: 200,000,000 bytes
Computing empirical frequencies....: DONE
  A: 0.256721
  C: 0.240800
  G: 0.247794
  T: 0.254686
  0 uncalled bases
Assemblying reads: 100%

Assembled reads ...................: 9,981 / 20,000 (49.905%)
Discarded reads ...................: 0 / 20,000 (0.000%)
Not assembled reads ...............: 10,019 / 20,000 (50.095%)
Assembled reads file...............: fastq/3L0.assembled.fastq
Discarded reads file...............: fastq/3L0.discarded.fastq
Unassembled forward reads file.....: fastq/3L0.unassembled.forward.fastq
Unassembled reverse reads file.....: fastq/3L0.unassembled.reverse.fastq
As expected, about 1/2 of reads are merged since we simulated fragment data in a size range of 50-300 bp. We now have a number of decisions to make. We could (1) analyze only the merged reads, (2) analyze only the non-merged reads, or (3) analyze both. If most of your data are merged, it may not be worth your time to bother with the non-merged data, or vice-versa. And combining them may just introduce more noise rather than more signal.
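One way to weigh that decision is to tally the merge rate per sample from the log. A sketch that assumes the PEAR v0.9.x "Assembled reads ...: x / y (z%)" summary line shown above, demonstrated on a one-line log faked with printf (the file pear.demo.log is made up; with real data point grep at your pear.log):

```shell
## fake one PEAR summary line (real runs: use your pear.log from above)
printf 'Assembled reads ...................: 9,981 / 20,000 (49.905%%)\n' > pear.demo.log

## pull the percentage between the parentheses on each summary line
grep 'Assembled reads \.' pear.demo.log | awk -F'[()]' '{print $2}'
```

The pattern `Assembled reads \.` deliberately skips the "Not assembled reads" and "Assembled reads file" lines, so one percentage prints per sample.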
We could combine the merged and non-merged reads early in the analysis, however, I recommend assembling the two separately (if you choose to keep both), and combining them at the end, if desired. This way you can easily check whether the two give conflicting signals, and if one dataset or the other appears messy or noisy. To see how to assemble the non-merged reads see the paired ddRAD non-merged tutorial. For step 2 you would enter the location of the ".unassembled.*" reads below instead of the ".assembled.*", as we do here. I show at the end of this tutorial how to combine the data sets.
In this tutorial I focus on how to assemble the merged data:
For this, we need to make two changes to the params file:
%%bash
## set datatype to 'ddrad', since merged reads behave like single-end reads,
## and set the location of the demultiplexed data that are 'pear' filtered
sed -i '/## 11./c\ddrad ## 11. datatype ' ./params.txt
sed -i '/## 18./c\fastq/*.assembled.* ## 18. demulti data loc ' ./params.txt
Next we apply the quality filtering.
%%bash
pyrad -p params.txt -s 2
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------

	sorted .fastq from fastq/*.assembled.* being used
	step 2: editing raw reads
	............
Statistics for the number of reads that passed filtering can be found in the stats/ directory. You can see that not all samples have exactly the same number of reads; this is because there is some variation in how well PEAR merges reads when they overlap by only a few bases.
%%bash
cat stats/s2.rawedit.txt
sample           Nreads    passed    passed.w.trim    passed.total
1A0.assembled    9960      9960      0                9960
1B0.assembled    9970      9970      0                9970
1C0.assembled    9961      9961      0                9961
1D0.assembled    9960      9960      0                9960
2E0.assembled    9961      9961      0                9961
2F0.assembled    9962      9962      0                9962
2G0.assembled    9980      9980      0                9980
2H0.assembled    9960      9960      0                9960
3I0.assembled    9960      9960      0                9960
3J0.assembled    9960      9960      0                9960
3K0.assembled    9961      9961      0                9961
3L0.assembled    9981      9981      0                9981

    Nreads = total number of reads for a sample
    passed = retained reads that passed quality filtering at full length
    passed.w.trim = retained reads that were trimmed due to detection of adapters
    passed.total = total kept reads of sufficient length
    note: you can set the option in params file to include trimmed reads of xx length.
The filtered data files are converted to fasta format and written to a directory called edits/. Each sample in our merged data set has ".assembled" attached to its name, which keeps it separate from our un-merged data set.
%%bash
ls edits/
1A0.assembled.edit 1B0.assembled.edit 1C0.assembled.edit 1D0.assembled.edit 2E0.assembled.edit 2F0.assembled.edit 2G0.assembled.edit 2H0.assembled.edit 3I0.assembled.edit 3J0.assembled.edit 3K0.assembled.edit 3L0.assembled.edit
The merged data should generally have the first cutter at the beginning of the read, and the second cutter at the end of the read, like below:
%%bash
head -n 8 edits/1A0.assembled.edit
>1A0.assembled_0_r1
TGCAGCTCAAGCCTCTTAGGGGCCAGCGGTGCTACATGCTCCGCGTGCACTGGGTTCCTCCTGTTACCGTTCTGGGTGCTGCTACTACGTCTAGCGCTTCTGAAAGGTGGATGCCGGTTTGACCAAATATGCTACTGCGAGTATATACCTTGTGTGAGAGAATT
>1A0.assembled_1_r1
TGCAGCTCAAGCCTCTTAGGGGCCAGCGGTGCTACATGCTCCGCGTGCACTGGGTTCCTCCTGTTACCGTTCTGGGTGCTGCTACTACGTCTAGCGCTTCTGAAAGGTGGATGCCGGTTTGACCAAATATGCTACTGCGAGTATATACCTTGTGTGAGAGAATT
>1A0.assembled_2_r1
TGCAGCTCAAGCCTCTTAGGGGCCAGCGGTGCTACATGCTCCGCGTGCACTGGGTTCCTCCTGTTACCGTTCTGGGTGCTGCTACTACGTCTAGCGCTTCTGAAAGGTGGATGCCGGTTTGACCAAATATGCTACTGCGAGTATATACCTTGTGTGAGAGAATT
>1A0.assembled_3_r1
TGCAGCTCAAGCCTCTTAGGGGCCAGCGGTGCTACATGCTCCGCGTGCACTGGGTTCCTCCTGTTACCGTTCTGGGTGCTGCTACTACGTCTAGCGCTTCTGAAAGGTGGATGCCGGTTTGACCAAATATGCTACTGCGAGTATATACCTTGTGTGAGAGAATT
Unlike for paired-GBS or ez-RAD data, we do not need to perform reverse-complement clustering for paired or single ddRAD data. There are some options for different ways to cluster paired-ddRAD data in the un-merged tutorial, but for merged data it is pretty straightforward, since the data act like single-end reads. With our datatype set to 'ddrad' we simply run step 3 as below:
%%bash
pyrad -p params.txt -s 3
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------

	de-replicating files for clustering...

	step 3: within-sample clustering of 12 samples at '.85' similarity.
	Running 2 parallel jobs with up to 6 threads per job.
	If needed, adjust to avoid CPU and MEM limits

	sample 3L0 finished, 500 loci
	sample 2G0 finished, 499 loci
	sample 1B0 finished, 499 loci
	sample 2F0 finished, 499 loci
	sample 1C0 finished, 499 loci
	sample 3K0 finished, 499 loci
	sample 2E0 finished, 499 loci
	sample 2H0 finished, 498 loci
	sample 1A0 finished, 498 loci
	sample 3I0 finished, 498 loci
	sample 1D0 finished, 498 loci
	sample 3J0 finished, 498 loci
%%bash
head -n 23 stats/s3.clusters.txt
taxa             total    dpt.me    dpt.sd    d>5.tot    d>5.me    d>5.sd    badpairs
1A0.assembled    498      20.0      0.0       498        20.0      0.0       0
1B0.assembled    499      19.98     0.447     499        19.98     0.447     0
1C0.assembled    499      19.962    0.85      498        20.0      0.0       0
1D0.assembled    498      20.0      0.0       498        20.0      0.0       0
2E0.assembled    499      19.962    0.85      498        20.0      0.0       0
2F0.assembled    499      19.964    0.805     498        20.0      0.0       0
2G0.assembled    499      19.98     0.447     499        19.98     0.447     0
2H0.assembled    498      20.0      0.0       498        20.0      0.0       0
3I0.assembled    498      20.0      0.0       498        20.0      0.0       0
3J0.assembled    498      20.0      0.0       498        20.0      0.0       0
3K0.assembled    499      19.962    0.85      498        20.0      0.0       0
3L0.assembled    500      19.962    0.849     499        20.0      0.0       0

## total = total number of clusters, including singletons
## dpt.me = mean depth of clusters
## dpt.sd = standard deviation of cluster depth
## >N.tot = number of clusters with depth greater than N
## >N.me = mean depth of clusters with depth greater than N
## >N.sd = standard deviation of cluster depth for clusters with depth greater than N
## badpairs = mismatched 1st & 2nd reads (only for paired ddRAD data)
We next make consensus base calls for each cluster within each individual. First we estimate the error rate and heterozygosity within each sample:
%%bash
pyrad -p params.txt -s 4
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------

	step 4: estimating error rate and heterozygosity
	............
Calling consensus sequences applies a number of filters, as listed in the params file and the general tutorial.
%%bash
pyrad -p params.txt -s 5
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------

	step 5: creating consensus seqs for 12 samples, using H=0.00129 E=0.00049
	............
%%bash
cat stats/s5.consens.txt
taxon            nloci    f1loci    f2loci    nsites    npoly    poly
1A0.assembled    498      498       498       55339     61       0.0011023
2E0.assembled    499      498       497       55187     83       0.001504
3I0.assembled    498      498       498       55339     83       0.0014998
3L0.assembled    500      499       499       55508     71       0.0012791
2F0.assembled    499      498       498       55340     88       0.0015902
2H0.assembled    498      498       498       55339     62       0.0011204
3J0.assembled    498      498       498       55339     73       0.0013191
1D0.assembled    498      498       498       55340     74       0.0013372
2G0.assembled    499      499       499       55509     83       0.0014953
1C0.assembled    499      498       498       55339     73       0.0013191
1B0.assembled    499      499       498       55349     70       0.0012647
3K0.assembled    499      498       498       55339     72       0.0013011

## nloci = number of loci
## f1loci = number of loci with >N depth coverage
## f2loci = number of loci with >N depth and passed paralog filter
## nsites = number of sites across f loci
## npoly = number of polymorphic sites in nsites
## poly = frequency of polymorphic sites
This step will sometimes take the longest, depending on the size of your data set. Here it will go very quickly.
%%bash
pyrad -p params.txt -s 6
vsearch v1.0.7_linux_x86_64, 7.5GB RAM, 4 cores
https://github.com/torognes/vsearch

finished clustering
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------

	step 6: clustering across 12 samples at '.85' similarity

Reading file /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/clust.85/cat.haplos_ 100%
724037 nt in 5977 seqs, min 59, max 184, avg 121
Indexing sequences 100%
Counting unique k-mers 100%
Clustering 100%
Writing clusters 100%
Clusters: 501 Size min 1, max 12, avg 11.9
Singletons: 3, 0.1% of seqs, 0.6% of clusters
Step 7 aligns the loci, computes statistics, and writes output in various data formats.
%%bash
pyrad -p params.txt -s 7
	ingroup 1A0.assembled,1B0.assembled,1C0.assembled,1D0.assembled,2E0.assembled,2F0.assembled,2G0.assembled,2H0.assembled,3I0.assembled,3J0.assembled,3K0.assembled,3L0.assembled
	addon
	exclude
	final stats written to:
	 /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/stats/merged.stats
	output files being written to:
	 /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/outfiles/ directory

	writing nexus file
	writing phylip file
	  + writing full SNPs file
	  + writing unlinked SNPs file
	  + writing STRUCTURE file
	  + writing geno file
	  ** must enter group/clade assignments for treemix output
	writing vcf file
	writing alleles file
	  ** must enter group/clade assignments for migrate-n output
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------
	...
Stats for this run. In this case 496 loci were recovered in all 12 samples, and 498 loci passed all filters in at least 4 samples.
%%bash
cat stats/merged.stats
498        ## loci with > minsp containing data
498        ## loci with > minsp containing data & paralogs removed
498        ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon            nloci
1A0.assembled    498
1B0.assembled    497
1C0.assembled    498
1D0.assembled    498
2E0.assembled    497
2F0.assembled    498
2G0.assembled    498
2H0.assembled    498
3I0.assembled    498
3J0.assembled    498
3K0.assembled    498
3L0.assembled    498

## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa    nloci    saved    ntotal
1        -
2        -        -
3        -        -
4        0        *        498
5        0        *        498
6        0        *        498
7        0        *        498
8        0        *        498
9        0        *        498
10       0        *        498
11       2        *        498
12       496      *        496

## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
      nvar    sumvar    PIS    sumPIS
0     9       0         107    0
1     21      21        150    150
2     53      127       125    400
3     78      361       63     589
4     69      637       33     721
5     82      1047      13     786
6     56      1383      3      804
7     45      1698      2      818
8     35      1978      2      834
9     21      2167      0      834
10    11      2277      0      834
11    9       2376      0      834
12    6       2448      0      834
13    2       2474      0      834
14    0       2474      0      834
15    1       2489      0      834
total var= 2489
total pis= 834
sampled unlinked SNPs= 489
sampled unlinked bi-allelic SNPs= 465
%%bash
ls outfiles/
merged.alleles merged.excluded_loci merged.geno merged.loci merged.nex merged.phy merged.snps merged.str merged.unlinked_snps merged.vcf
%%bash
ls stats/
merged.stats Pi_E_estimate.txt s1.sorting.txt s2.rawedit.txt s3.clusters.txt s5.consens.txt
%%bash
head -n 39 outfiles/merged.loci | cut -b 1-80
>1A0.assembled     GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>1B0.assembled     GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>1C0.assembled     GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>1D0.assembled     GGGTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>2E0.assembled     GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCGAATATTTACCCCTATTA
>2F0.assembled     GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>2G0.assembled     GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>2H0.assembled     GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>3I0.assembled     GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>3J0.assembled     GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>3K0.assembled     GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
>3L0.assembled     GGCTGCGTAAGAATGTCGACCTGACGACAAAGCTGCCCCCGCCCTAATATTTACCCCTATTA
// - -
>1A0.assembled     GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>1B0.assembled     GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>1C0.assembled     GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>1D0.assembled     GCGTGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>2E0.assembled     GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTWTTCAGGGGCAGTTAGCCACTG
>2F0.assembled     GCATGAGCGAAAAATCTTAMGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>2G0.assembled     GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>2H0.assembled     GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTAGCCACTG
>3I0.assembled     GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTATCCACTG
>3J0.assembled     GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTATCCACTG
>3K0.assembled     GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTATCCACTG
>3L0.assembled     GCATGAGCGAAAAATCTTACGCTTAAGGAGTTCGATCTTTTTTCAGGGGCAGTTATCCACTG
// - - - *
>1A0.assembled     TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>1B0.assembled     TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>1C0.assembled     TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>1D0.assembled     TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>2E0.assembled     TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>2F0.assembled     TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>2G0.assembled     TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>2H0.assembled     TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>3I0.assembled     TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>3J0.assembled     TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>3K0.assembled     TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTACGGTGAAGCCCGC
>3L0.assembled     TCATCCGACCCGCGCCTCCATGTACCATGCCTGTATTTGGTACAAACTMCGGTGAAGCCCRC
// - -
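Each locus in a .loci file ends with a "//" separator line, so counting those lines gives a quick locus tally. A sketch on a two-locus dummy file made with printf (the names and sequences in toy.loci are made up; with real data you would point grep at outfiles/merged.loci):

```shell
## two-locus dummy .loci file
printf '>a  TTTT\n>b  TTTT\n//\n>a  GGGG\n>b  GGGG\n//\n' > toy.loci

## each locus ends with a line starting "//", so count those lines
grep -c '^//' toy.loci
```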
To combine the merged and non-merged data sets: first assemble the non-merged reads as described in the paired ddRAD non-merged tutorial, then follow the steps below (shown commented out).
(1) Concatenate the .loci output files from the merged and non-merged analyses to a tempfile.
%%bash
#cat outfiles/merged.loci outfiles/nonmerged.loci > outfiles/totaldata.loci
(2) Remove ".assembled" and ".unassembled" from the sample names in the file.
%%bash
#sed -i 's/\.unassembled//g' outfiles/totaldata.loci
#sed -i 's/\.assembled//g' outfiles/totaldata.loci
(3) Set the output prefix (param 14) to the name of your concatenated data file. And ask for additional output formats.
%%bash
#sed -i '/## 14. /c\totaldata ## 14. prefix name ' ./params.txt
#sed -i '/## 30. /c\* ## 30. additional formats ' ./params.txt
(4) Run step 7 to create additional output file formats using the concatenated .loci file.
%%bash
#pyrad -p params.txt -s 7