This tutorial focuses on the analysis of single-end GBS reads as well as methods for salvaging short fragments containing adapter sequence that often result from poor size selection on this type of library preparation. If you have not yet read the full tutorial, you should start there for a broader description of how pyRAD works.
This tutorial is written as an IPython notebook. Each cell that begins with the header %%bash
or with !
should be executed in a command line shell, for example by copying and pasting the text (but excluding the header). If you have IPython notebook installed on your machine you can download this notebook and edit/execute the cells yourself.
Genomic DNAs are digested with a single restriction enzyme such that the resulting DNA fragments have the same cutsite overhang on both ends. A key difference between GBS and other library types is that the adapter+barcode can ligate to either ends of these fragments, meaning that the sequence read may have been sequenced from either end.
GBS, like ddRAD, is cheaper to prepare than RAD, and easier to prepare in your own lab. Some GBS preps do not include a size selection step, in which case the data generally include many short fragments that when read from either end in different reads will overlap considerably. For this reason pyrad uses reverse complement clustering to identify that these reads come from the same fragment. Some people add a size selection step to GBS preps, in which case, depending on the size range selected, overlap between sequenced reads from either ends of fragments can be reduced.
In this tutorial I show how to test whether overlap occurs in your GBS data, and how to setup the params file to perform reverse complement clustering in the case that it does occur; or to not perform this type of clustering if your data do not overlap (reverse complement clustering is a bit slower.)
The bash script in the cell below can be used to download an example GBS data set and unarchive it into your current directory. This data set was simulated with the following parameters
This size selection window was chosen to demonstrate the analysis of short fragments.
%%bash
wget -q http://dereneaton.com/downloads/simGBS.zip
unzip simGBS.zip
Archive: simGBS.zip inflating: simGBS.barcodes inflating: simGBS_R1.fastq.gz
We begin by creating the params.txt file, which sets all of the parameters for the pyrad analysis. A default template should always be created first using the -n
option in pyRAD. Then you can use any text editor to fill in the relevant lines with parameter valuels.
%%bash
pyRAD -n
new params.txt file created
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.2 **======================== affected step == ./ ## 1. Working directory (all) ./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 16) (s1) ./*.barcodes ## 3. Loc. of barcode file (if not line 16) (s1) vsearch ## 4. command (or path) to call vsearch (or usearch) (s3,s6) muscle ## 5. command (or path) to call muscle (s3,s7) TGCAG ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG) (s1,s2) 2 ## 7. N processors (parallel) (all) 6 ## 8. Mindepth: min coverage for a cluster (s4,s5) 4 ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2) .88 ## 10. Wclust: clustering threshold as a decimal (s3,s6) rad ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad (all) 4 ## 12. MinCov: min samples in a final locus (s7) 3 ## 13. MaxSH: max inds with shared hetero site (s7) c88d6m4p3 ## 14. Prefix name for final output (no spaces) (s7) ==== optional params below this line =================================== affected step == ## 15.opt.: select subset (prefix* only selector) (s2-s7) ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7) ## 17.opt.: exclude taxa (list or prefix*) (s7) ## 18.opt.: loc. of de-multiplexed data (s2) ## 19.opt.: maxM: N mismatches in barcodes (def= 1) (s1) ## 20.opt.: phred Qscore offset (def= 33) (s2) ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict (s2) ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5) ## 23.opt.: maxN: max Ns in a cons seq (def=5) (s5) ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5) (s5) ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5) ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100) (s7) ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7) ## 28.opt.: random number seed (def. 112233) (s3,s6,s7) ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7) ## 30.opt.: output formats: p,n,a,s,v,u,t,m,k,g,* (see docs) (s7) ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5) ## 32.opt.: keep trimmed reads (def=0). Enter min length. (s2) ## 33.opt.: max stack size (int), def= max(500,mean+2*SD) (s3) ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3) ## 35.opt.: use hierarchical clustering (def.=0, 1=yes) (s6) ## 36.opt.: repeat masking (def.=1='dust' method, 0=no) (s3,s6) ## 37.opt.: vsearch max threads per job (def.=6; see docs) (s3,s6) ==== optional: list group/clade assignments below this line (see docs) ==================
%%bash
sed -i '/## 10. /c\.85 ## 10. lowered clust thresh... ' params.txt
sed -i '/## 11. /c\gbs ## 11. changed datatype to gbs ' params.txt
sed -i '/## 14. /c\c85m4p3 ## 14. outprefix... ' params.txt
sed -i '/## 21./c\1 ## 21. set filter to 1 ' params.txt
sed -i '/## 24./c\8 ## 24. increased maxH ' params.txt
sed -i '/## 30./c\* ## 30. all output formats ' params.txt
sed -i '/## 32./c\50 ## 32. keep fragments longer than 50 ' params.txt
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.2 **======================== affected step == ./ ## 1. Working directory (all) ./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 16) (s1) ./*.barcodes ## 3. Loc. of barcode file (if not line 16) (s1) vsearch ## 4. command (or path) to call vsearch (or usearch) (s3,s6) muscle ## 5. command (or path) to call muscle (s3,s7) TGCAG ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG) (s1,s2) 2 ## 7. N processors (parallel) (all) 6 ## 8. Mindepth: min coverage for a cluster (s4,s5) 4 ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2) .85 ## 10. lowered clust thresh... gbs ## 11. changed datatype to gbs 4 ## 12. MinCov: min samples in a final locus (s7) 3 ## 13. MaxSH: max inds with shared hetero site (s7) c85m4p3 ## 14. outprefix... ==== optional params below this line =================================== affected step == ## 15.opt.: select subset (prefix* only selector) (s2-s7) ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7) ## 17.opt.: exclude taxa (list or prefix*) (s7) ## 18.opt.: loc. of de-multiplexed data (s2) ## 19.opt.: maxM: N mismatches in barcodes (def= 1) (s1) ## 20.opt.: phred Qscore offset (def= 33) (s2) 1 ## 21. set filter to 1 ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5) ## 23.opt.: maxN: max Ns in a cons seq (def=5) (s5) 8 ## 24. increased maxH ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5) ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100) (s7) ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7) ## 28.opt.: random number seed (def. 112233) (s3,s6,s7) ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7) * ## 30. all output formats ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5) 50 ## 32. keep fragments longer than 50 ## 33.opt.: max stack size (int), def= max(500,mean+2*SD) (s3) ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3) ## 35.opt.: use hierarchical clustering (def.=0, 1=yes) (s6) ## 36.opt.: repeat masking (def.=1='dust' method, 0=no) (s3,s6) ## 37.opt.: vsearch max threads per job (def.=6; see docs) (s3,s6) ==== optional: list group/clade assignments below this line (see docs) ==================
Your input data will be in fastQ format, usually ending in .fq or .fastq. Your data could be split among multiple files, or all within a single file. The file/s may be compressed with gzip so that they have a .gz, but do not need to be compressed. The location of these files should be entered on line 2 of the params file unless your data are already demultiplexed in which case you can skip this step and instead list the location of your demultiplexed files on line 18. Below are the first three reads in the example data set.
%%bash
less simGBS_R1.fastq.gz | head -n 12 | cut -c 1-80
@lane1_fakedata0_R1_0 1:N:0: ATAATATGCAGATCTGTAAAGCCGTATGTAATTCTGTTGTAGTTTAGACTTCTCACTGTCGTTAGTTATTTAATGAGGAG + BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB @lane1_fakedata0_R1_0 1:N:0: ATAATATGCAGAGAAAGTCCAAAAATATATATCCCAATTACCCATCGACCTACGTTGTTCGCGATACCCAAGTTACTTTT + BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB @lane1_fakedata0_R1_1 1:N:0: ATAATATGCAGATCTGTAAAGCCGTATGTAATTCTGTTGTAGTTTAGACTTCTCACTGTCGTTAGTTATTTAATGAGGAG + BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Each read takes four lines. The first is the name of the read (its location on the plate). The second line contains the sequence data. The third line is a spacer. And the fourth line the quality scores for the base calls. In this case arbitrarily high since the data were simulated.
These are 100 bp single-end reads prepared as GBS. The first six bases form the barcode and the next five bases (TGCAG) the restriction site overhang. All following bases make up the sequence data.
This step uses information in the barcodes file to sort data into a separate file for each sample, which is then saved in a new directory called fastq/
%%bash
pyRAD -p params.txt -s 1
------------------------------------------------------------ pyRAD : RADseq for phylogenetics & introgression analyses ------------------------------------------------------------ step 1: sorting reads by barcode .
%%bash
ls fastq/
1A0_R1.fq.gz 1B0_R1.fq.gz 1C0_R1.fq.gz 1D0_R1.fq.gz 2E0_R1.fq.gz 2F0_R1.fq.gz 2G0_R1.fq.gz 2H0_R1.fq.gz 3I0_R1.fq.gz 3J0_R1.fq.gz 3K0_R1.fq.gz 3L0_R1.fq.gz
For each raw data file the number of reads is shown, the number for which the cut site was detected, and the number for which the detected barcode matched in the barcodes file. Then for each true barcode it lists the observed barcodes that matched within the allowed number of mismatches, and their number of occurrences. Finally at the bottom the non-matching barcodes are shown, and their occurrences as well.
%%bash
cat stats/s1.sorting.txt
file Nreads cut_found bar_matched simGBS_R1.fastq.gz 480000 480000 480000 sample true_bar obs_bars N_obs 2H0 AAATAT AAATAT 40000 1C0 AAGGAG AAGGAG 40000 1B0 ATAATA ATAATA 40000 1A0 CATCAT CATCAT 40000 3L0 GAGGTG GAGGTG 40000 2G0 GATGTA GATGTA 40000 3J0 GGTTGA GGTTGA 40000 2E0 TAAGAT TAAGAT 40000 3I0 TAGGTA TAGGTA 40000 1D0 TGGGTT TGGGTT 40000 3K0 TGTGTA TGTGTA 40000 2F0 TTTTTG TTTTTG 40000 nomatch _ 0
Step 2 filters reads based on quality scores, and can be used to detect and trim off Illumina adapters if present, which we expect to have in these simulated data. Filter option 1 looks for a fuzzy match to the reverse complement cutsite and the Illumina adapter. If present the adapter is chopped off and the read is either discarded or kept if the remaining length is longer than the set minimum length (here 50) it is kept. If you wanted, you could instead use external software to filter adapter sequences from your data.
%%bash
pyRAD -p params.txt -s 2
------------------------------------------------------------ pyRAD : RADseq for phylogenetics & introgression analyses ------------------------------------------------------------ step 2: editing raw reads ............
%%bash
ls edits/
1A0.edit 1B0.edit 1C0.edit 1D0.edit 2E0.edit 2F0.edit 2G0.edit 2H0.edit 3I0.edit 3J0.edit 3K0.edit 3L0.edit
The filtered data are written in fasta format (quality scores removed) into a new directory called edits/
%%bash
head -n 10 edits/1A0.edit | cut -c 1-80
>1A0_0_r1 TGCAGATCTGCAAAGCCGTATGTAATTCTGTTGTAGTTTAGACTTCTCACTGTCGTTAGTTATTTAATGAGGAGATCGCA >1A0_1_r1 TGCAGAGAAAGTCCAAAAATATATATCCCAATTACCCATCGACCTACGTTGTTCGCGATACCCAAGTTACTTTTTGAGGA >1A0_2_r1 TGCAGATCTGCAAAGCCGTATGTAATTCTGTTGTAGTTTAGACTTCTCACTGTCGTTAGTTATTTAATGAGGAGATCGCA >1A0_3_r1 TGCAGAGAAAGTCCAAAAATATATATCCCAATTACCCATCGACCTACGTTGTTCGCGATACCCAAGTTACTTTTTGAGGA >1A0_4_r1 TGCAGATCTGCAAAGCCGTATGTAATTCTGTTGTAGTTTAGACTTCTCACTGTCGTTAGTTATTTAATGAGGAGATCGCA
You can see here that of the 40K reads for each sample about 39,900 pass filtering. Of these, about 33,000 passed with no adapter sequence detected, and about 6,000 had the adapter trimmed off but still passed because their remaining length was longer than 50 bp. Only about 100 reads were excluded from each sample.
%%bash
cat stats/s2.rawedit.txt
sample Nreads passed passed.w.trim passed.total 1A0 40000 33560 6340 39900 1B0 40000 33558 6341 39899 1C0 40000 33561 6340 39901 1D0 40000 33521 6340 39861 2E0 40000 33520 6361 39881 2F0 40000 33521 6359 39880 2G0 40000 33520 6361 39881 2H0 40000 33541 6340 39881 3I0 40000 33540 6350 39890 3J0 40000 33538 6342 39880 3K0 40000 33521 6360 39881 3L0 40000 33539 6340 39879 Nreads = total number of reads for a sample passed = retained reads that passed quality filtering at full length passed.w.trim= retained reads that were trimmed due to detection of adapters passed.total = total kept reads of sufficient length note: you can set the option in params file to include trimmed reads of xx length.
Step 3 de-replicates and then clusters reads within samples by the set clustering threshold and writes the clusters to new files in a directory called clust.xx.
Below I show the results of this analysis which uses reverse complement clustering to detect overlap among sequences originating from either end of short GBS fragments. I then show how to check for overlap in your data, and how to turn off reverse complement clustering if overlap is not detected, because it will allow your analysis to run much faster.
%%bash
pyRAD -p params.txt -s 3
------------------------------------------------------------ pyRAD : RADseq for phylogenetics & introgression analyses ------------------------------------------------------------ de-replicating files for clustering... step 3: within-sample clustering of 12 samples at '.85' similarity. Running 2 parallel jobs with up to 6 threads per job. If needed, adjust to avoid CPU and MEM limits sample 1C0 finished, 1594 loci sample 1A0 finished, 1594 loci sample 1B0 finished, 1594 loci sample 3I0 finished, 1594 loci sample 2H0 finished, 1593 loci sample 3J0 finished, 1592 loci sample 3K0 finished, 1594 loci sample 2G0 finished, 1594 loci sample 2E0 finished, 1594 loci sample 3L0 finished, 1593 loci sample 1D0 finished, 1592 loci sample 2F0 finished, 1593 loci
As you can see below, within the first few clusters we find two reverse complement matching clusters. The first cluster has reads from either end that overlap almost completely, meaning this fragment was only a little more than 100 bp long, and we sequenced 100 bp data. The third cluster is offset a little more. I found that reads which ovelap by less than about 30% are difficult to detect, and the minimum amount of overlap allowed is not a parameter you can change in the params file, but it can be edited in the pyrad code.
%%bash
less clust.85/1A0.clustS.gz | head -n 26 | cut -c 1-80
>1A0_27700_r1;size=20; TGCAGACTTACTGAGTAAGTTCCATTAAAAATGAATTTCGGAGATGAATAGTGTGATCTCTGCCCGGTGAGGTTGCTGCG >1A0_27701_c1;size=20;- -------------------------------------------------AGTGTGATCTCTGCCCGGTGAGGTTGCTGCG // // >1A0_5203_c1;size=19;- -----CGTCGTATGTCGCTCAGCACCTCGCGTGTATGTTGTTCTGGCCACCATGGACTGACAAGACGTCCGCACGGTTAA >1A0_5202_c1;size=19; TGCAGCGTCGTATGTCGCTCAGCACCTCGCGTGTATGTTGTTCTGGCCACCATGGACTGACAAGACGTCCGCACGGTTAA >1A0_5200_c1;size=1;+ TGCAGCGTCGTATGTCGCTCAGCACCTCGCGTGTATCTTGTTCTGGCCACCATGGACTGACAAGACGTCCGCACGGTTAA >1A0_5201_c1;size=1;- -----CGTCGTATGTCGCTCAGCACCTCGCGTGTATCTTGTTCTGGCCACCATGGACTGACAAGACGTCCGCACGGTTAA // // >1A0_26781_r1;size=19; TGCAGTTGAACTAACCAGCGTTTTTACTACTGGCGAAAATAGCTGGTACCATACCCTCGTTGTACAGACAAACAGAGGGC >1A0_26780_c1;size=18;- --------------------------CTACTGGCGAAAATAGCTGGTACCATACCCTCGTTGTACAGACAAACAGAGGGC >1A0_26804_c1;size=1;- --------------------------CTACTGGCGAAAATAGCTGGTACCATACCCTCGTTGTACAGACGAACAGAGGGC >1A0_26812_c1;size=1;- --------------------------CTACTGGCGAAAATAGCTGGTACCATACCCTCGTTGTACAGACAAACAGAGGGC >1A0_26805_r1;size=1;+ TGCAGTTGAACTAACCAGCGTTTTTACTACTGGCGAAAATAGCTGGTACCATACCCTCGTTGTACAGACGAACAGAGGGC
The stats output tells you how many clusters were found, and their mean depth of coverage. It also tells you how many pass your minimum depth setting. You can use this information to decide if you wish to increase or decrease the mindepth. It also shows histograms of coverage depth for each sample. Below I show sample 1A0 which clearly shows the number of samples in this data set that matched normally plus those which matched in the reverse complement direction, which doubled the coverage.
%%bash
head -n 53 stats/s3.clusters.txt
taxa total dpt.me dpt.sd d>5.tot d>5.me d>5.sd badpairs 1A0 1594 25.031 8.678 1594 25.031 8.678 0 1B0 1594 25.031 8.679 1594 25.031 8.679 0 1C0 1594 25.032 8.705 1593 25.047 8.687 0 1D0 1592 25.038 8.709 1591 25.053 8.691 0 2E0 1594 25.019 8.698 1593 25.035 8.68 0 2F0 1593 25.034 8.68 1593 25.034 8.68 0 2G0 1594 25.019 8.698 1593 25.035 8.68 0 2H0 1593 25.035 8.707 1592 25.05 8.689 0 3I0 1594 25.025 8.686 1594 25.025 8.686 0 3J0 1592 25.05 8.689 1592 25.05 8.689 0 3K0 1594 25.019 8.698 1593 25.035 8.68 0 3L0 1593 25.034 8.68 1593 25.034 8.68 0 ## total = total number of clusters, including singletons ## dpt.me = mean depth of clusters ## dpt.sd = standard deviation of cluster depth ## >N.tot = number of clusters with depth greater than N ## >N.me = mean depth of clusters with depth greater than N ## >N.sd = standard deviation of cluster depth for clusters with depth greater than N ## badpairs = mismatched 1st & 2nd reads (only for paired ddRAD data) HISTOGRAMS sample: 1A0 bins depth_histogram cnts : 0------------50-------------100% 0 0 5 0 10 0 15 0 20 *********************** 1193 25 0 30 0 35 0 40 ********* 401 50 0 100 0 250 0 500 0 sample: 1B0 bins depth_histogram cnts : 0------------50-------------100% 0 0 5 0 10 0 15 * 2 20 *********************** 1191 25 0 30 0
For any single end GBS analysis I suggest that you first begin step 3 with the datatype set to gbs, which tells pyRAD to perform reverse complement clustering. Let this run for about 15 minutes, and then look at the resulting data file in the clust directory that has a ".u" ending, using the unix command "less". I use the command "head" below to print only the first 20 lines.
%%bash
head -n 20 clust.85/1A0.u
1A0_1021_c1;size=20; 1A0_1020_c1;size=20; 100.0 0 - 92.6 1A0_10501_r1;size=20; 1A0_10500_r1;size=20; 100.0 0 - 35.1 1A0_11221_r1;size=20; 1A0_11220_r1;size=20; 100.0 0 - 38.3 1A0_11341_r1;size=20; 1A0_11340_r1;size=20; 100.0 0 - 74.5 1A0_11381_r1;size=20; 1A0_11380_r1;size=20; 100.0 0 - 72.3 1A0_11481_r1;size=20; 1A0_11480_r1;size=20; 100.0 0 - 76.6 1A0_11721_r1;size=20; 1A0_11720_r1;size=20; 100.0 0 - 88.3 1A0_11941_r1;size=20; 1A0_11940_r1;size=20; 100.0 0 - 92.6 1A0_12061_r1;size=20; 1A0_12060_r1;size=20; 100.0 0 - 80.9 1A0_121_c1;size=20; 1A0_120_c1;size=20; 100.0 0 - 92.3 1A0_12221_r1;size=20; 1A0_12220_r1;size=20; 100.0 0 - 43.6 1A0_12421_r1;size=20; 1A0_12420_r1;size=20; 100.0 0 - 88.3 1A0_1261_c1;size=20; 1A0_1260_c1;size=20; 100.0 0 - 93.3 1A0_12961_r1;size=20; 1A0_12960_r1;size=20; 100.0 0 - 76.6 1A0_13801_r1;size=20; 1A0_13800_r1;size=20; 100.0 0 - 35.1 1A0_1421_c1;size=20; 1A0_1420_c1;size=20; 100.0 0 - 92.3 1A0_14521_r1;size=20; 1A0_14520_r1;size=20; 100.0 0 - 76.6 1A0_15241_r1;size=20; 1A0_15240_r1;size=20; 100.0 0 - 37.2 1A0_1541_c1;size=20; 1A0_1540_c1;size=20; 100.0 0 - 91.7 1A0_1581_c1;size=20; 1A0_1580_c1;size=20; 100.0 0 - 92.3
The ".u" file records the clustering process, showing which hits match to seeds, their measured similarity, the direction in which they match, and the number of indels. If your data do not contain any reverse complement clusters, meaning you see only "+" in column 5, and no "-", then you do not need to use reverse complement clustering. In that case you could change the datatype option on from 'gbs' to 'rad' and you would get a speed increase. If speed is not a concern, or if you see many "-" in the .u file then you should perform the rest of your analysis using the 'gbs' option.
Step 4 jointly infers the error-rate and heterozygosity across samples.
%%bash
pyRAD -p params.txt -s 4
------------------------------------------------------------ pyRAD : RADseq for phylogenetics & introgression analyses ------------------------------------------------------------ step 4: estimating error rate and heterozygosity ............
%%bash
cat stats/Pi_E_estimate.txt
taxa H E 1C0 0.00127475 0.00049401 2F0 0.00135011 0.0004923 3I0 0.00151533 0.00046613 1D0 0.0012648 0.00050004 1B0 0.00149071 0.00049023 3K0 0.00153561 0.00051244 2H0 0.00134803 0.00049295 2E0 0.00138114 0.00049448 2G0 0.00162647 0.00050323 1A0 0.00146453 0.00053174 3L0 0.00158455 0.00051403 3J0 0.00123198 0.00050979
Step 5 calls consensus sequences using the parameters inferred above, and filters for paralogs.
%%bash
pyRAD -p params.txt -s 5
------------------------------------------------------------ pyRAD : RADseq for phylogenetics & introgression analyses ------------------------------------------------------------ step 5: creating consensus seqs for 12 samples, using H=0.00142 E=0.00050 ............
%%bash
cat stats/s5.consens.txt
taxon nloci f1loci f2loci nsites npoly poly 3J0 1592 1592 1582 144424 186 0.0012879 3L0 1593 1593 1583 144401 219 0.0015166 1A0 1594 1594 1584 144633 211 0.0014589 2G0 1594 1593 1586 144633 233 0.001611 2H0 1593 1592 1575 143678 195 0.0013572 2E0 1594 1593 1584 144296 207 0.0014346 3K0 1594 1593 1582 144157 212 0.0014706 1B0 1594 1594 1586 144695 214 0.001479 1D0 1592 1591 1579 144128 186 0.0012905 3I0 1594 1594 1591 145171 213 0.0014672 2F0 1593 1593 1587 144886 193 0.0013321 1C0 1594 1593 1586 144815 183 0.0012637 ## nloci = number of loci ## f1loci = number of loci with >N depth coverage ## f2loci = number of loci with >N depth and passed paralog filter ## nsites = number of sites across f loci ## npoly = number of polymorphic sites in nsites ## poly = frequency of polymorphic sites
Step 6 clusters consensus sequences across samples. Again this uses reverse complement clustering if the datatype is set to 'gbs'. This step can take a long time for very large data sets (>100 individuals). I suggest trying it first. If it takes too long you can implement the hierarchical clustering method (described in detail in a separate tutorial).
%%bash
pyRAD -p params.txt -s 6
vsearch v1.0.7_linux_x86_64, 7.5GB RAM, 4 cores https://github.com/torognes/vsearch finished clustering
------------------------------------------------------------ pyRAD : RADseq for phylogenetics & introgression analyses ------------------------------------------------------------ step 6: clustering across 12 samples at '.85' similarity Reading file /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_GBS/clust.85/cat.haplos_ 100% 1828942 nt in 19005 seqs, min 57, max 152, avg 96 Indexing sequences 100% Counting unique k-mers 100% Clustering 100% Writing clusters 100% Clusters: 1594 Size min 1, max 19, avg 11.9 Singletons: 1, 0.0% of seqs, 0.1% of clusters
%%bash
pyRAD -p params.txt -s 7
ingroup 1A0,1B0,1C0,1D0,2E0,2F0,2G0,2H0,3I0,3J0,3K0,3L0 addon exclude final stats written to: /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_GBS/stats/c85m4p3.stats output files being written to: /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_GBS/outfiles/ directory writing nexus file writing phylip file + writing full SNPs file + writing unlinked SNPs file + writing STRUCTURE file + writing geno file ** must enter group/clade assignments for treemix output writing vcf file writing alleles file ** must enter group/clade assignments for migrate-n output
------------------------------------------------------------ pyRAD : RADseq for phylogenetics & introgression analyses ------------------------------------------------------------ ...
As you can see we did not recover all 2000 loci in the data set. This is because some reads were filtered for containing Illumina adapters, and others were merged when they overlapped. Below I print the first several loci in the ".loci" output to show the variable length loci that are recovered from this GBS analysis that included fragmented and overlapping reads.
%%bash
cat stats/c85m4p3.stats
1591 ## loci with > minsp containing data 1591 ## loci with > minsp containing data & paralogs removed 1591 ## loci with > minsp containing data & paralogs removed & final filtering ## number of loci recovered in final data set for each taxon. taxon nloci 1A0 1581 1B0 1583 1C0 1584 1D0 1579 2E0 1582 2F0 1585 2G0 1584 2H0 1574 3I0 1588 3J0 1581 3K0 1580 3L0 1581 ## nloci = number of loci with data for exactly ntaxa ## ntotal = number of loci for which at least ntaxa have data ntaxa nloci saved ntotal 1 - 2 - - 3 - - 4 0 * 1591 5 0 * 1591 6 1 * 1591 7 3 * 1590 8 2 * 1587 9 8 * 1585 10 10 * 1577 11 37 * 1567 12 1530 * 1530 ## nvar = number of loci containing n variable sites (pis+autapomorphies). ## sumvar = sum of variable sites (SNPs). ## pis = number of loci containing n parsimony informative sites. ## sumpis = sum of parsimony informative sites. nvar sumvar PIS sumPIS 0 26 0 466 0 1 112 112 543 543 2 259 630 328 1199 3 257 1401 169 1706 4 314 2657 60 1946 5 259 3952 18 2036 6 180 5032 6 2072 7 95 5697 1 2079 8 52 6113 0 2079 9 20 6293 0 2079 10 9 6383 0 2079 11 6 6449 0 2079 12 2 6473 0 2079 total var= 6473 total pis= 2079 sampled unlinked SNPs= 1565 sampled unlinked bi-allelic SNPs= 1445
%%bash
head -n 200 outfiles/c85m4p3.loci | cut -c 1-85
>1A0 TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT >1B0 TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGAYAATGCATCTTGCTTATCTAAT >1C0 TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT >1D0 TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT >2E0 TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT >2F0 TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT >2G0 TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT >3I0 TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT >3J0 TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT >3K0 TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT >3L0 TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT // - | >1A0 TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT >1B0 TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT >1C0 TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT >1D0 TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT >2E0 TAATCAAAACAGTAWGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT >2F0 TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTCCGTTGCCGTCCGGGACCGCGCT >2G0 TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTTCCGTCCGGGACCGCGCT >2H0 TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT >3I0 TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT >3J0 TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCSTCCGGGACCGCGCT >3K0 TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT >3L0 TAATCAAAACAATATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT // - - - - - >1A0 TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT >1B0 TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT >1C0 TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT >1D0 TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT >2E0 TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGACGTGGTCAAGTT >2F0 TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGACGTGGTCAAGTT >2G0 TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT >2H0 TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT >3I0 TCGTAGGCCTGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT >3J0 TCGTAGGCCTGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT >3K0 TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT >3L0 TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT // * * >1A0 ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT >1B0 ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT >1C0 ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT >1D0 ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT >2E0 ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT >2F0 ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT >2G0 ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT >2H0 ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT >3I0 ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT >3J0 ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT >3K0 ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT >3L0 ATACATAGTACAGTCCTCTAATTGCAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT // - >1A0 TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC >1B0 KGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC >1C0 TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC >1D0 TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC >2E0 TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTCATATCCAAGTGTCTGGGATGAAACGAATAATGGACTAC >2F0 TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTCATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC >2G0 TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTCATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC >2H0 TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC >3I0 TGCATATGCGGCTCACGGCCAGGCTTCCTCAGGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC >3J0 TGCATATGCGGCTCACGGCCAGGCTTCCTCASGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC >3K0 TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC >3L0 TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC // - * * - >1A0 ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC >1B0 ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC >1C0 ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC >1D0 ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC >2E0 ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACTACAGTTGCCTCCAGGAATAATC >2F0 ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACTACAGTTGCCTCCAGGAATAATC >2G0 ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACTACAGTTGCCTCCAGGAATAATC >2H0 ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC >3I0 ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC >3J0 ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC >3K0 ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC >3L0 ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACCGTTGCCTCCAGGAATAATC // * - >1A0 TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA >1B0 TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGRGGTCATGGTATAATTCCCTA >1C0 TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA >1D0 TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA >2E0 TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA >2F0 TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA >2G0 TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA >2H0 TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA >3I0 TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA >3J0 TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA >3K0 TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA >3L0 TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA // - >1A0 CRTTGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG >1B0 CATTGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG >1C0 CATTGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG >1D0 CATTGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG >2E0 CATGGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTWTTTCTCTAAC >2F0 CATGGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAC >2G0 CATGGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAC >2H0 CATGGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG >3I0 CATGGTTTAACAGAGGTGTTTATAGAGATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG >3J0 CATGGTTTAACAGAGGTGTTTATAGAGATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG >3K0 CATGGTTTAACAGAGGTGTTTATAGAGATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG >3L0 CATGGTTTAACAGAGGTGTTTATAGAGATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG // - * * - * >1A0 AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT >1B0 AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT >1C0 AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT >1D0 AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT >2E0 AAAATGGCAGCCGTACGCGAGCGTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTGCCCTT >2F0 AAAATGGCAGCCGTACGCGAGCGTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTGCCCTT >2G0 AAAATGGCAGCCGTACGCGAGCGTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTGCCCTT >2H0 AAAATGGCAGCCGTACGCGAGCGTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTGCCCTT >3I0 AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT >3J0 AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT >3K0 AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT >3L0 AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT // * * * >1A0 CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA >1B0 CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA >1C0 CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA >1D0 CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA >2E0 CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA >2F0 CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA >2G0 CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA >2H0 CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA >3I0 CTAGTACTTACGTAGAACGCCCCCCGTCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTGTCTA >3J0 CTAGTACTTACGTAGAACGCCCCCCGTCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTGTCTA >3K0 CTAGTACTTACGTAGAACGCCCCCCGTCACTATGATGCTCCCCAAGCAGCTGTGGAGTGTGTCTA >3L0 CTAGTACTTACGTAGARCGCCCCCCGTCGCTATGATGCTCCCCAAGCAGCCGTGGAGTGTCTCTA // - * - - * | >1A0 ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC >1B0 ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC >1C0 ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC >1D0 ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC >2E0 ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC >2F0 ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC >2G0 ATAAATCTGTAGAGACGTAGACTAAGCGTGGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC >2H0 ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC >3I0 ATAAATCTGTAGAGACGTAGACTAAGCATCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC >3J0 ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC >3K0 ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC >3L0 ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC // - - >1A0 GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA >1B0 GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA >1C0 GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA >1D0 GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA >2E0 GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA >2F0 GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA >2G0 GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA >2H0 GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA >3I0 GACATCGTGATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA >3J0 GACATCGTGATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA >3K0 GACATCGTGATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA >3L0 GACATCGTGATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA // * | >1A0 TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC >1B0 TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAAGGATAACGGTCC >1C0 TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC >1D0 TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC >2E0 TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC >2F0 TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC >2G0 TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC >2H0 TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC >3I0 TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC >3J0 TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC >3K0 TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC >3L0 TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTCTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC // - - >1A0 CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGACCTTCTTTGCCACGCGGGC >1B0 CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC >1C0 CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC >1D0 CGATGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC >2E0 CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC >2F0 CGCTGATTGTTTATTTCAATAGCTATATCAGGAWTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC >2G0 CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACYAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC >2H0 CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC >3I0 CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCMACGCGGGC >3J0 CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC >3K0 CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC >3L0 CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC // - - - - - >1A0 CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT >1B0 CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT >1C0 CACCGATGAGACGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGCCTTTTGGGCTACTTCGAAT >1D0 CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT >2E0 CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT >2F0 CACCGATGAGCCGRGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT >2G0 CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT >2H0 CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT >3I0 CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT >3J0 CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT >3K0 CACCGAYGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT >3L0 CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCAKCTCGTCCCCTGGGACTTTTGGACTACTTCGAAT // - - - - - - >1A0 TTGCCCCCGCAGTCCAAACTCACATGGCGGCGCCGTTGCGGGGAGAGTTCATCTTCGCACTACAGCACAGTGACTCG >1B0 TTGCCCCCGCAGTCCAAACTCACATGGCGGCGCCGTTGCGGGGAGAGTTCATCTTCGCACTACAGCACAGTGACTCG >1C0 TTGCCCCCGCAGTCCAAACTCACATGGCGGCGCCGTTGCGGGGAGAGTTCATCTTCGCACTACAGCGCAGTGACTCG >1D0 TTGCCCCCGCAGTCCAAACTCACATGGCGGCGCCGTTGCGGGGAGAGTTCTTCTTCGCACTACAGCACAGTGACTCG >2E0 TTGCCCCCGCAGTCCAAACTCACATGGCGGCGCCGTTGCGGGGAGAGTTCATCTTCGCACTTCAGCACAGTGACTCG >2F0 TTGCCCCCGCAGTCCAAACTCACATGGCGGCGCCGTTGCGGGGAGAGTTCATCTTCGCACTACAGCACAGTGACTCG