The double digest (ddRAD) library preparation method was developed and described by Peterson et al. (2012). Here I will be talking about __paired-end ddRAD__ data and describe my recommendations for how to set up a params file to analyze them in pyRAD.
A ddRAD library is prepared by cutting genomic DNA with two different restriction enzymes and selecting the intervening fragments that fall within a certain size window. These fragments will contain overhangs from the respective cut sites on either side. One end will have a barcode + Illumina adapter ligated, and the other end will have only the reverse Illumina adapter ligated. The first reads may come in one or multiple files with "_R1_" in the name, and the second reads are in a separate file(s) with "_R2_". Second-read files contain the corresponding pairs from the first-read files in the exact same order.
Your data will likely come to you non-demultiplexed (meaning not sorted by which individual the reads belong to). You can demultiplex the reads according to their barcodes using pyRAD or separate software.
A number of programs are available to check for overlap of paired-end reads, and you can run your data through these programs before input to pyRAD. At the time of writing, I recommend the software PEAR (https://github.com/xflouris/PEAR), which I demonstrate on pairddrad data in a separate tutorial here.
For this tutorial I simulated paired-end ddRAD loci on a 12 taxon species tree, shown below. You can download the data using the script below and assemble these data by following along with all of the instructions.
%%bash
## download the data
wget -q http://www.dereneaton.com/downloads/simpairddrads.zip -O simpairddrads.zip
## unzip the data
unzip simpairddrads.zip
Archive:  simpairddrads.zip
  inflating: simpairddrads_R1_.fastq.gz
  inflating: simpairddrads_R2_.fastq.gz
  inflating: simpairddrads.barcodes
We first create a template params.txt input file for the pyRAD analysis. The following command creates the template, which we will then fill in with the relevant parameter settings for the analysis.
%%bash
pyrad -n
new params.txt file created
Take a look at the default options. Each line designates a parameter; after the "##" symbol is a comment describing that parameter, with the affected step(s) shown in parentheses. The first 14 parameters are required. Parameters 15-37 are optional.
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.4  **======================== affected step ==
./                        ## 1. Working directory                                 (all)
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 16)  (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 16)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                     ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
2                         ## 7. N processors (parallel)                           (all)
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                         ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.88                       ## 10. Wclust: clustering threshold as a decimal        (s3,s6)
rad                       ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad,merged (all)
4                         ## 12. MinCov: min samples in a final locus             (s7)
3                         ## 13. MaxSH: max inds with shared hetero site          (s7)
c88d6m4p3                 ## 14. Prefix name for final output (no spaces)         (s7)
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
                       ## 19.opt.: maxM: N mismatches in barcodes (def= 1)          (s1)
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
                       ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)   (s5)
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)                 (s3,s6,s7)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
                       ## 30.opt.: output formats: p,n,a,s,v,u,t,m,k,g,* (see docs) (s7)
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================
I will use the script below to substitute new values, but you should simply use any text editor to make changes. For this analysis I made the following changes from the defaults:
6. set the two restriction enzymes used to generate the ddRAD data
10. lowered clustering threshold to .85
11. set datatype to pairddrad
14. changed the output name prefix
19. mismatches for demultiplexing set to 0, exact match.
24. Raised maxH. Lower is better for filtering paralogs.
30. added additional output formats (e.g., nexus,SNPs,STRUCTURE)
%%bash
sed -i '/## 6. /c\TGCAG,AATT ## 6. cutsites... ' ./params.txt
sed -i '/## 10. /c\.85 ## 10. lowered clust thresh' ./params.txt
sed -i '/## 11. /c\pairddrad ## 11. datatype... ' ./params.txt
sed -i '/## 14. /c\c85d6m4p3 ## 14. prefix name ' ./params.txt
sed -i '/## 19./c\0 ## 19. errbarcode... ' ./params.txt
sed -i '/## 24./c\10 ## 24. maxH... ' ./params.txt
sed -i '/## 30./c\* ## 30. outformats... ' ./params.txt
Let's take a look at the edited params.txt file.
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.4  **======================== affected step ==
./                        ## 1. Working directory                                 (all)
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 16)  (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 16)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG,AATT                ## 6. cutsites...
2                         ## 7. N processors (parallel)                           (all)
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                         ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.85                       ## 10. lowered clust thresh
pairddrad                 ## 11. datatype...
4                         ## 12. MinCov: min samples in a final locus             (s7)
3                         ## 13. MaxSH: max inds with shared hetero site          (s7)
c85d6m4p3                 ## 14. prefix name
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
0                      ## 19. errbarcode...
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
10                     ## 24. maxH...
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)                 (s3,s6,s7)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
*                      ## 30. outformats...
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================
Four examples of acceptable input file name formats for paired-end data:
1. xxxx_R1_001.fastq xxxx_R2_001.fastq
2. xxxx_R1_001.fastq.gz xxxx_R2_001.fastq.gz
3. xxxx_R1_001.fq.gz xxxx_R2_001.fq.gz
4. xxxx_R1_001.fq xxxx_R2_001.fq
The file ending can be .fastq, .fq, or .gz.
Each pair should share a unique name or number and contain the characters "_R1_" and "_R2_".
For every file name with "_R1_" there should be a corresponding "_R2_" file.
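The rules above are easy to verify before launching pyRAD. Below is a quick sanity-check sketch (the file names are hypothetical examples matching the formats listed above): for every file containing "_R1_" it confirms that the matching "_R2_" file exists.

```shell
## for each first-read file, derive the expected second-read file name
## by swapping _R1_ for _R2_, and report any missing mates
for r1 in ./*_R1_*; do
    [ -e "$r1" ] || continue          ## glob matched nothing; skip
    r2=${r1/_R1_/_R2_}
    [ -e "$r2" ] || echo "missing mate for $r1"
done
```

If this prints nothing, every first-read file has its second-read partner.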
If your data are already demultiplexed skip step 1 and see step 2 below.
%%bash
pyrad -p params.txt -s 1
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------
	step 1: sorting reads by barcode
	 .
Now we can look at the stats output for this below:
%%bash
cat stats/s1.sorting.txt
file                        Nreads    cut_found    bar_matched
simpairddrads_.fastq.gz     240000    240000       240000

sample    true_bar    obs_bars    N_obs
2G0       AATATT      AATATT      20000
1D0       ATATGG      ATATGG      20000
1B0       ATGAAG      ATGAAG      20000
1A0       CATCAT      CATCAT      20000
3L0       GAGTTG      GAGTTG      20000
1C0       GTATAA      GTATAA      20000
3K0       GTGGAA      GTGGAA      20000
2H0       GTTTAT      GTTTAT      20000
3I0       TATATA      TATATA      20000
2E0       TGAAAG      TGAAAG      20000
3J0       TGTAGT      TGTAGT      20000
2F0       TTGGTT      TTGGTT      20000
nomatch   _                       0
The de-multiplexed reads are written to a new file for each individual in a directory called fastq/ created within your working directory.
%%bash
ls fastq/
1A0_R1.fq.gz 1A0_R2.fq.gz 1B0_R1.fq.gz 1B0_R2.fq.gz 1C0_R1.fq.gz 1C0_R2.fq.gz 1D0_R1.fq.gz 1D0_R2.fq.gz 2E0_R1.fq.gz 2E0_R2.fq.gz 2F0_R1.fq.gz 2F0_R2.fq.gz 2G0_R1.fq.gz 2G0_R2.fq.gz 2H0_R1.fq.gz 2H0_R2.fq.gz 3I0_R1.fq.gz 3I0_R2.fq.gz 3J0_R1.fq.gz 3J0_R2.fq.gz 3K0_R1.fq.gz 3K0_R2.fq.gz 3L0_R1.fq.gz 3L0_R2.fq.gz
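As a further check on the demultiplexing (a sketch, assuming the fastq/ directory created above), you can count the reads in each gzipped file. Each FASTQ record spans four lines, so the read count is the line count divided by four; every sample here should report 20000 reads.

```shell
## count reads in each demultiplexed file: lines / 4
for f in fastq/*.fq.gz; do
    [ -e "$f" ] || continue           ## skip quietly if no files are present
    echo "$f: $(( $(zcat "$f" | wc -l) / 4 )) reads"
done
```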
An individual file will look like the example below:
%%bash
## FIRST READS file
## I show the first 12 lines and 80 characters to print clearly here
zcat fastq/1A0_R1.fq.gz | head -n 12 | cut -c 1-80
@lane1_fakedata0_R1_0 1:N:0:
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTAC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_1 1:N:0:
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTAC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_2 1:N:0:
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTAC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
%%bash
## SECOND READS file
zcat fastq/1A0_R2.fq.gz | head -n 12 | cut -c 1-80
@lane1_fakedata0_R2_0 1:N:0:
AATTATTAACCCAGACGAGTTGTCAGAGAGTGCTTTACCCCTTCTACGGTCCTCTGGGAGATTCGTGTGATTGTACTGGT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R2_1 1:N:0:
AATTATTAACCCAGACGAGTTGTCAGAGAGTGCTTTACCCCTTCTACGGTCCTCTGGGAGATTCGTGTGATTGTACTGGT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R2_2 1:N:0:
AATTATTAACCCAGACGAGTTGTCAGAGAGTGCTTTACCCCTTCTACGGTCCTCTGGGAGATTCGTGTGATTGTACTGGT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
The reads were sorted into a separate file for the first (R1) and second (R2) reads for each individual.
If your data were previously de-multiplexed, you need the following before step 2:
your sorted file names should be formatted like those above, but with your sample names substituted for 1A0, 1B0, etc.
the files can be gzipped (.gz) or not (.fq or .fastq).
barcodes should already be removed (i.e., not on the left side of reads).
the restriction site should not be removed; if it is, enter a '@' symbol before the location of your sorted data.
enter the location of your sorted data on line 18 of the params file.
Next we apply the quality filtering.
%%bash
pyrad -p params.txt -s 2
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------
	step 2: quality filtering
	............
Statistics for the number of reads that passed filtering can be found in the stats/ directory.
%%bash
cat stats/s2.rawedit.txt
sample    Nreads    exclude    trimmed    passed
1B0       20000     0          0          20000
3I0       20000     0          0          20000
2H0       20000     0          0          20000
3J0       20000     0          0          20000
3K0       20000     0          0          20000
2G0       20000     0          0          20000
2E0       20000     0          0          20000
1C0       20000     0          0          20000
1D0       20000     0          0          20000
2F0       20000     0          0          20000
3L0       20000     0          0          20000
1A0       20000     0          0          20000

    Nreads = total number of reads for a sample
    exclude = reads that were excluded
    trimmed = reads that had adapter trimmed but were kept
    passed = total kept reads
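A quick awk one-liner can total the 'passed' column across samples (a sketch, assuming the stats table layout shown above, where data rows have exactly five whitespace-separated fields):

```shell
## sum the 5th field ('passed') over rows whose 2nd field is numeric,
## which skips the header and the footnote lines
if [ -f stats/s2.rawedit.txt ]; then
    awk 'NF == 5 && $2 ~ /^[0-9]+$/ { sum += $5 } END { print sum+0, "total reads passed" }' stats/s2.rawedit.txt
fi
```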
The filtered data files are converted to fasta format and written to a directory called edits/, shown below:
%%bash
ls edits/
1A0.edit 1B0.edit 1C0.edit 1D0.edit 2E0.edit 2F0.edit 2G0.edit 2H0.edit 3I0.edit 3J0.edit 3K0.edit 3L0.edit
Here you can see that the first and second reads have been concatenated into a single read separated by 'nnnn' in these files, and the read quality scores have now been discarded.
%%bash
head -n 8 edits/1A0.edit
>1A0_0_pair
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTACCGAATGGTTGCGAnnnnTCATAAAACATTACTCTAAGACCAGTACAATCACACGAATCTCCCAGAGGACCGTAGAAGGGGTAAAGCACTCTCTGACAACTCGTCTGGGTTAATAATT
>1A0_1_pair
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTACCGAATGGTTGCGAnnnnTCATAAAACATTACTCTAAGACCAGTACAATCACACGAATCTCCCAGAGGACCGTAGAAGGGGTAAAGCACTCTCTGACAACTCGTCTGGGTTAATAATT
>1A0_2_pair
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTACCGAATGGTTGCGAnnnnTCATAAAACATTACTCTAAGACCAGTACAATCACACGAATCTCCCAGAGGACCGTAGAAGGGGTAAAGCACTCTCTGACAACTCGTCTGGGTTAATAATT
>1A0_3_pair
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTACCGAATGGTTGCGAnnnnTCATAAAACATTACTCTAAGACCAGTACAATCACACGAATCTCCCAGAGGACCGTAGAAGGGGTAAAGCACTCTCTGACAACTCGTCTGGGTTAATAATT
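You can confirm that every concatenated read carries the separator with grep (a sketch, assuming the edits/1A0.edit file and the fasta layout shown above):

```shell
## every sequence (non-header) line in a .edit file should contain the
## 'nnnn' separator joining the first and second read
if [ -f edits/1A0.edit ]; then
    total=$(grep -vc '^>' edits/1A0.edit)
    joined=$(grep -v '^>' edits/1A0.edit | grep -c 'nnnn')
    echo "$joined of $total reads contain the separator"
fi
```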
From here you have two options for how to proceed. The first is to treat these concatenated reads like single-end ddRAD sequences and cluster them as concatenated reads. To do this, change the datatype option in the params file from "pairddrad" to "ddrad" and proceed with steps 3-7. pyRAD will detect and remove the 'nnnn' separator and cluster the long reads.
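If you choose the concatenated route, the datatype switch can be made with the same sed approach used earlier (a sketch; any text editor works equally well, and the guard simply skips the edit when params.txt is absent):

```shell
## switch the datatype on line 11 from pairddrad to ddrad
## so that concatenated reads are clustered as single-end data
if [ -f params.txt ]; then
    sed -i '/## 11./c\ddrad       ## 11. datatype (concatenated clustering)' params.txt
fi
```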
Alternatively, the more complex option is to cluster the reads separately, which I call the "split clustering" method. In this approach the first reads are clustered on their own, their matched second reads are gathered into corresponding second-read clusters, and the two alignments are then filtered separately. This is illustrated graphically below:
The benefit of this method over clustering concatenated first+second reads is that second reads can be retained even when they are more divergent than the clustering threshold or contain many low-quality base calls (Ns), thus potentially yielding more SNPs. For very large data sets this method also clusters faster.
The drawbacks are that it discards sequences for which the first and second reads do not appear to come from the same DNA fragment, whereas the concatenated clustering method would likely cluster them separately; moreover, even a single incompletely digested second read can cause an otherwise good cluster to be discarded.
The split clustering method filters alignments of second reads based on the number of indels in the alignment. This number can be set on line 27 of the params file. The default values are 3,6,99,99, meaning that for within-sample clusters we allow 3 indels in first-read alignments and 6 in second-read alignments, and for across-sample clusters we allow 99 in first-read and 99 in second-read alignments. If only two numbers are set (e.g., 3,10), this is equivalent to 3,3,10,10.
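To set all four indel limits explicitly, the same sed pattern used for the other parameters applies (a sketch; the guard skips the edit when params.txt is absent, and the comment text after ## is my own shorthand):

```shell
## write the four maxIndels values on line 27:
## within-sample R1, within-sample R2, across-sample R1, across-sample R2
if [ -f params.txt ]; then
    sed -i '/## 27./c\3,6,99,99   ## 27. maxIndels: R1-within,R2-within,R1-across,R2-across' params.txt
fi
```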
%%bash
## we are using the split-clustering method since the
## datatype is still set to pairddrad
pyrad -p params.txt -s 3
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------
	de-replicating files for clustering...
	step 3: within-sample clustering of 12 samples at '.85' similarity.
	Running 2 parallel jobs with up to 6 threads per job.
	If needed, adjust to avoid CPU and MEM limits
	sample 1C0 finished, 1000 loci
	sample 2H0 finished, 1000 loci
	sample 3L0 finished, 1000 loci
	sample 3I0 finished, 1000 loci
	sample 1D0 finished, 1000 loci
	sample 2F0 finished, 1000 loci
	sample 3K0 finished, 1000 loci
	sample 1B0 finished, 1000 loci
	sample 1A0 finished, 1000 loci
	sample 2G0 finished, 1000 loci
	sample 2E0 finished, 1000 loci
	sample 3J0 finished, 1000 loci
%%bash
head -n 23 stats/s3.clusters.txt
taxa    total    dpt.me    dpt.sd    d>5.tot    d>5.me    d>5.sd    badpairs
1A0     1000     20.0      0.0       1000       20.0      0.0       0
1B0     1000     20.0      0.0       1000       20.0      0.0       0
1C0     1000     20.0      0.0       1000       20.0      0.0       0
1D0     1000     20.0      0.0       1000       20.0      0.0       0
2E0     1000     20.0      0.0       1000       20.0      0.0       0
2F0     1000     20.0      0.0       1000       20.0      0.0       0
2G0     1000     20.0      0.0       1000       20.0      0.0       0
2H0     1000     20.0      0.0       1000       20.0      0.0       0
3I0     1000     20.0      0.0       1000       20.0      0.0       0
3J0     1000     20.0      0.0       1000       20.0      0.0       0
3K0     1000     20.0      0.0       1000       20.0      0.0       0
3L0     1000     20.0      0.0       1000       20.0      0.0       0

## total = total number of clusters, including singletons
## dpt.me = mean depth of clusters
## dpt.sd = standard deviation of cluster depth
## >N.tot = number of clusters with depth greater than N
## >N.me = mean depth of clusters with depth greater than N
## >N.sd = standard deviation of cluster depth for clusters with depth greater than N
## badpairs = mismatched 1st & 2nd reads (only for paired ddRAD data)
We next make consensus base calls for each cluster within each individual. First we estimate the error rate and heterozygosity within each sample:
%%bash
pyrad -p params.txt -s 4
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------
	step 4: estimating error rate and heterozygosity
	............
Calling consensus sequences applies a number of filters, as listed in the params file, to remove potential paralogs or highly repetitive markers from the data set. For paired-end data the maxN filter only applies to first reads, since we don't mind allowing low quality base calls in second reads. The maxH filter applies to both reads together. The diploid filter (only allow 2 haplotypes in a consensus sequence) also applies to the two reads together.
%%bash
pyrad -p params.txt -s 5
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------
	step 5: created consensus seqs for 12 samples, using H=0.00141 E=0.00050
	............
%%bash
cat stats/s5.consens.txt
taxon    nloci    f1loci    f2loci    nsites    npoly    poly
3I0      1000     1000      1000      183003    268      0.0014645
2G0      1000     1000      1000      183006    287      0.0015683
3K0      1000     1000      1000      183008    266      0.0014535
2H0      1000     1000      1000      183003    272      0.0014863
3J0      1000     1000      1000      183008    274      0.0014972
1B0      1000     1000      1000      183010    269      0.0014699
1D0      1000     1000      1000      183008    245      0.0013387
2E0      1000     1000      1000      183009    232      0.0012677
2F0      1000     1000      1000      183004    264      0.0014426
1C0      1000     1000      1000      183008    267      0.001459
1A0      1000     1000      1000      183004    245      0.0013388
3L0      1000     1000      1000      183004    241      0.0013169

## nloci = number of loci
## f1loci = number of loci with >N depth coverage
## f2loci = number of loci with >N depth and passed paralog filter
## nsites = number of sites across f loci
## npoly = number of polymorphic sites in nsites
## poly = frequency of polymorphic sites
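If you want a single summary number from this table, awk can average the per-sample frequency of polymorphic sites (a sketch, assuming the table layout shown above, where data rows have exactly seven fields):

```shell
## average the 'poly' column (7th field) over rows whose 2nd field is
## numeric, which skips the header and footnote lines
if [ -f stats/s5.consens.txt ]; then
    awk 'NF == 7 && $2 ~ /^[0-9]+$/ { s += $7; n++ } END { if (n) printf "mean poly = %.7f\n", s / n }' stats/s5.consens.txt
fi
```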
This step clusters consensus sequences across samples. It can take a long time for very large data sets. If run normally it will print its progress to the screen so you can judge how long it might take. This example will take less than a minute.
%%bash
pyrad -p params.txt -s 6
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------
	step 6: clustering across 12 samples at '.85' similarity

vsearch v1.0.7_linux_x86_64, 7.5GB RAM, 4 cores
https://github.com/torognes/vsearch

Reading file /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/clust.85/cat.firsts_ 100%
1116009 nt in 12000 seqs, min 93, max 94, avg 93
Indexing sequences 100%
Counting unique k-mers 100%
Clustering 100%
Writing clusters 100%
Clusters: 1000 Size min 12, max 12, avg 12.0
Singletons: 0, 0.0% of seqs, 0.0% of clusters
finished clustering
Step 7 aligns loci, computes statistics, and writes output files in various formats.
%%bash
pyrad -p params.txt -s 7
 ------------------------------------------------------------
  pyRAD : RADseq for phylogenetics & introgression analyses
 ------------------------------------------------------------
	ingroup 1A0,1B0,1C0,1D0,2E0,2F0,2G0,2H0,3I0,3J0,3K0,3L0
	addon
	exclude
	final stats written to:
	 /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/stats/c85d6m4p3.stats
	output files being written to:
	 /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/outfiles/ directory
	writing nexus file
	writing phylip file
	  + writing full SNPs file
	  + writing unlinked SNPs file
	  + writing STRUCTURE file
	  + writing geno file
	  ** must enter group/clade assignments for treemix output
	writing vcf file
	writing alleles file
	  ** must enter group/clade assignments for migrate-n output
	...
Stats for this run are shown below. In this case all 1000 loci were shared across all 12 samples.
%%bash
cat stats/c85d6m4p3.stats
1000        ## loci with > minsp containing data
1000        ## loci with > minsp containing data & paralogs removed
1000        ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
1A0	1000
1B0	1000
1C0	1000
1D0	1000
2E0	1000
2F0	1000
2G0	1000
2H0	1000
3I0	1000
3J0	1000
3K0	1000
3L0	1000

## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa	nloci	saved	ntotal
1	-
2	-	-
3	-	-
4	0	*	1000
5	0	*	1000
6	0	*	1000
7	0	*	1000
8	0	*	1000
9	0	*	1000
10	0	*	1000
11	0	*	1000
12	1000	*	1000

## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
	nvar	sumvar	PIS	sumPIS
0	0	0	70	0
1	2	2	179	179
2	8	18	256	691
3	20	78	218	1345
4	43	250	144	1921
5	78	640	77	2306
6	123	1378	36	2522
7	146	2400	12	2606
8	150	3600	6	2654
9	126	4734	1	2663
10	93	5664	1	2673
11	83	6577	0	2673
12	61	7309	0	2673
13	32	7725	0	2673
14	16	7949	0	2673
15	11	8114	0	2673
16	3	8162	0	2673
17	3	8213	0	2673
18	1	8231	0	2673
19	1	8250	0	2673
total var= 8250
total pis= 2673
sampled unlinked SNPs= 1000
sampled unlinked bi-allelic SNPs= 1000
%%bash
ls outfiles/
c85d6m4p3.alleles c85d6m4p3.excluded_loci c85d6m4p3.geno c85d6m4p3.loci c85d6m4p3.nex c85d6m4p3.phy c85d6m4p3.snps c85d6m4p3.str c85d6m4p3.unlinked_snps c85d6m4p3.vcf
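As a final cross-check against the stats file (a sketch, assuming the .loci format in which each locus block ends with a separator line containing '//'), you can count the loci in the main output file directly:

```shell
## count locus-separator lines in the .loci output;
## this should match the locus total reported in the stats file
if [ -f outfiles/c85d6m4p3.loci ]; then
    echo "$(grep -c '//' outfiles/c85d6m4p3.loci) loci in the final data set"
fi
```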
%%bash
ls stats/
c85d6m4p3.stats Pi_E_estimate.txt s1.sorting.txt s2.rawedit.txt s3.clusters.txt s5.consens.txt