samtools的具体使用
以下是一个简单的应用,具体的使用情况参见http://samtools.sourceforge.net/samtools.shtml
生成sam格式以后
第一步:先转化为bam格式
samtools view -bS aln.sam > aln.bam
第二步:sort一下
samtools sort aln.bam aln.sorted 得到aln.sorted.bam
第三步:index一下
samtools index aln.sorted.bam
第四步:找snp
samtools pileup -vcf ref.fa aln.bam | tee raw.txt | samtools.pl varFilter -D100 > flt.txt
以上命令是寻找最大深度为100的SNP,raw.txt是原始SNP的文件
The -D option of varFilter controls the maximum read depth
awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' flt.txt > final.txt
以上是根据sam文件的第三列和第六例进行质量控制。这个根据自己设定的阈值,进行筛选。
如果想了解每个命令的意思,最好先看一下sam格式的说明。
VCF FORMAT
The Variant Call Format (VCF) is a TAB-delimited format with each data line consists of the following fields:
Col Field Description
1 CHROM CHROMosome name
2 POS the left-most POSition of the variant
3 ID unique variant IDentifier
4 REF the REFerence allele
5 ALT the ALTernate allele(s), separated by comma
6 QUAL variant/reference QUALity
7 FILTER FILTers applied
8 INFO INFOrmation related to the variant, separated by semi-colon
9 FORMAT FORMAT of the genotype fields, separated by colon (optional)
10+ SAMPLE SAMPLE genotypes and per-sample information (optional)
Understanding the output: the VCF/BCF format
The VCF format
The Variant Call Format (VCF) is the emerging standard for storing variant data. Originally designed for SNPs and short INDELs, it also works for structural variations.
VCF consists of a header section and a data section. The header must contain a line starting with one '#', showing the name of each field, and then the sample names starting at the 10th column. The data section is TAB delimited with each line consisting of at least 8 mandatory fields (the first 8 fields in the table below). The FORMAT field and sample information are allowed to be absent. We refer to the official VCF spec for a more rigorous description of the format.
Col Field Description
1 CHROM Chromosome name
2 POS 1-based position. For an indel, this is the position preceding the indel.
3 ID Variant identifier. Usually the dbSNP rsID.
4 REF Reference sequence at POS involved in the variant. For a SNP, it is a single base.
5 ALT Comma delimited list of alternative seuqence(s).
6 QUAL Phred-scaled probability of all samples being homozygous reference.
7 FILTER Semicolon delimited list of filters that the variant fails to pass.
8 INFO Semicolon delimited list of variant information.
9 FORMAT Colon delimited list of the format of individual genotypes in the following fields.
10+ Sample(s) Individual genotype information defined by FORMAT.
这是我的结果
chr1 51459 G A 11 11 11 1 A
chr1 133012 C T 21 21 21 1 t
chr1 139463 T A 6 6 18 1 a
chr1 139467 G C 18 18 18 1 c 5
chr1 564598 A G 51 51 60 8 GggGggG^~G @CCCCCCC
chr1 564654 G A 39 39 60 4 a$aAa CCCC
chr1 564766 T C 42 42 60 5 cCcCC CCCCC
chr1 564862 T C 33 33 57 2 CC CC
chr1 564868 T C 30 30 60 1 C C
chr1 565286 C T 38 48 60 8 t$tttgtTt CCCC+CCC
chr1 565406 C T 39 39 60 4 TttT CCCC
chr1 565454 T C 36 36 54 3 ccC CCC
chr1 565464 T C 33 33 40 2 C^Ic CC
chr1 565467 A G 33 33 40 2 Gg CC
chr1 565490 T C 36 36 54 3 cCC CCC
chr1 565508 G A 33 33 60 2 Aa CC
chr1 565541 A G 42 42 59 5 GGGgg CCCCB
chr1 565591 C T 36 36 60 3 ttT BCC
chr1 565697 A G 45 45 58 6 gGggGG 0CCCCC
chr1 565827 T C 5 36 60 4 .CcC CCCC
chr1 565870 T C 54 54 60 9 cCcccCcC^gc CCCCCCCCA
chr1 565901 G A 33 33 60 2 aA AD
chr1 565937 T C 39 39 60 4 cccc CCCC
chr1 566021 A G 42 42 60 5 GggGg CCCCC
chr1 566024 G A 39 39 60 4 aAaA CCDC
chr1 566048 G A 45 45 60 6 AaaaA^~a CCDCCC
chr1 566130 C T 42 42 60 5 tttTT CCCCC
chr1 566573 A G 36 36 51 3 gGG CCC
chr1 566771 C T 36 36 58 3 TtT CCC
chr1 566792 T C 45 45 56 6 CCcccc ACCCC<
chr1 566816 C A 57 57 52 10 aaaAAAaAA^>a CBCCCCCCCC
chr1 566849 G A 36 36 29 3 AAa CCC
chr1 566916 A G 51 51 54 8 gGgggGgGgg CC?CCCCC
chr1 566933 A G 57 57 54 10 gggGGgGGgG CCCCCCCACC
chr1 566960 T C 36 36 54 3 cCC CCC
chr1 567002 T C 30 30 30 1 C C
chr1 567033 T C 30 30 30 1 c C
chr1 567039 C A 30 30 30 1 a C
chr1 567062 C T 33 33 47 2 tT CC
chr1 567092 T C 36 36 27 3 cCc CCC
chr1 567119 A C 42 42 43 5 cCccc CCCCC
chr1 567191 C T 45 45 40 6 Tttttt CCCCCC
chr1 567486 T C 42 42 47 5 CCcC CCCCC
chr1 567489 T C 36 36 57 3 CcC CBC
chr1 567579 C T 39 39 53 4 tttT CCCC
chr1 567697 G A 33 33 30 2 Aa CD
chr1 567783 T C 24 24 59 2 cc +0
chr1 567807 T C 26 26 26 1 c C
chr1 568072 A G 36 36 30 3 GGg CCC
chr1 568201 T C 30 30 58 1 c C
chr1 568256 C T 36 36 30 3 TTT CCC
有意义的是samtool软件老的版本没有在比对上参考上的每个碱基出现的次数,1.1,1.2版本以上会有一个DPR参数,这个可以显示每个碱基出现的次数,对于分析是有好处的。
欢迎关注生信人