Knowledge Center - 北京概普生物科技有限公司 (GapTech)

Introduction to SAM Files

Bioinformatics Tips · sulin · August 30, 2015 22:43

SAM bowtie output

Following is a brief description of the SAM format as output by bowtie when the -S/--sam option is specified. For more details, see the SAM format specification.

When -S/--sam is specified, bowtie prints a SAM header with @HD, @SQ and @PG lines. When one or more --sam-RG arguments are specified, bowtie will also print an @RG line that includes all user-specified --sam-RG tokens separated by tabs.

Each subsequent line corresponds to a read or an alignment. Each line is a collection of at least 12 fields separated by tabs; from left to right, the fields are:

1. （A）Name of read that aligned

2. （B）Sum of all applicable flags. Flags relevant to Bowtie are:

1	The read is one of a pair
2	The alignment is one end of a proper paired-end alignment
4	The read has no reported alignments
8	The read is one of a pair and has no reported alignments
16	The alignment is to the reverse reference strand
32	The other mate in the paired-end alignment is aligned to the reverse reference strand
64	The read is the first (#1) mate in a pair
128	The read is the second (#2) mate in a pair

Thus, an unpaired read that aligns to the reverse reference strand will have flag 16. A paired-end read that aligns to the reverse strand as part of a proper pair and is the first mate in the pair will have flag 83 (= 64 + 16 + 2 + 1).

3. （C）Name of reference sequence where alignment occurs, or ordinal ID if no name was provided

4. （D）1-based offset into the forward reference strand where leftmost character of the alignment occurs

5. （E）Mapping quality

6. （F）CIGAR string representation of alignment

7. （G）Name of reference sequence where mate's alignment occurs. Set to = if the mate's reference sequence is the same as this alignment's, or * if there is no mate.

8. （H）1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate.

9. （I）Inferred insert size. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if there is no mate.

10. （J）Read sequence (reverse-complemented if aligned to the reverse strand)

11. （K）ASCII-encoded read qualities (reversed if the read aligned to the reverse strand). The encoded quality values are on the Phred quality scale and the encoding is ASCII-offset by 33 (ASCII char !), as in a FASTQ file.

12. （L）Optional fields. Fields are tab-separated. For descriptions of all possible optional fields, see the SAM format specification. bowtie outputs some of these optional fields for each alignment, depending on the type of the alignment:

NM:i:<N>	Aligned read has an edit distance of <N>.
CM:i:<N>	Aligned read has an edit distance of <N> in colorspace. This field is present in addition to the NM field in -C/--color mode, but is omitted otherwise.
MD:Z:<S>	For aligned reads, <S> is a string representation of the mismatched reference bases in the alignment. See SAM format specification for details. For colorspace alignments, <S> describes the decoded nucleotide alignment, not the colorspace alignment.
XA:i:<N>	Aligned read belongs to stratum <N>. See Strata for definition.
XM:i:<N>	For a read with no reported alignments, <N> is 0 if the read had no alignments. If -m was specified and the read's alignments were suppressed because the -m ceiling was exceeded, <N> equals the -m ceiling + 1, to indicate that there were at least that many valid alignments (but all were suppressed). In -M mode, if the alignment was randomly selected because the -M ceiling was exceeded, <N> equals the -M ceiling + 1, to indicate that there were at least that many valid alignments (of which one was reported at random).
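The field layout described above can be sketched in a few lines of Python. This is only an illustrative parser, not part of bowtie: the `SamRecord` type, the `parse_sam_line` function, and the example line are hypothetical names and data made up for this sketch. It accepts lines that carry only the 11 mandatory columns, converts only integer (`i`) optional fields to numbers, and leaves every other optional field as a string.

```python
from typing import Dict, List, NamedTuple, Union


class SamRecord(NamedTuple):
    """One alignment line, using the column names from this article."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    mrnm: str
    mpos: int
    isize: int
    seq: str
    qual: List[int]                     # Phred scores, decoded from ASCII-33
    tags: Dict[str, Union[int, str]]    # optional fields keyed by tag name


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-separated alignment line into typed fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 tab-separated columns")
    # Qualities are stored as ASCII characters offset by 33; "*" means absent.
    qual = [] if cols[10] == "*" else [ord(ch) - 33 for ch in cols[10]]
    tags: Dict[str, Union[int, str]] = {}
    for field in cols[11:]:
        # Optional fields look like TAG:VTYPE:VALUE; the value may contain ":".
        tag, vtype, value = field.split(":", 2)
        tags[tag] = int(value) if vtype == "i" else value
    return SamRecord(cols[0], int(cols[1]), cols[2], int(cols[3]),
                     int(cols[4]), cols[5], cols[6], int(cols[7]),
                     int(cols[8]), cols[9], qual, tags)


# Hypothetical alignment line, invented for illustration only.
example = ("read1\t16\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII"
           "\tXA:i:0\tMD:Z:4\tNM:i:0")
rec = parse_sam_line(example)
print(rec.qname, rec.flag, rec.qual, rec.tags)
```

Header lines (those starting with @) are not alignment lines and would need to be skipped before calling the function.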

CIGAR:

+-----+----------------+-------+
| Op  | BAM constant   | Value |
+-----+----------------+-------+
| M   | BAM_CMATCH     | 0     |
| I   | BAM_CINS       | 1     |
| D   | BAM_CDEL       | 2     |
| N   | BAM_CREF_SKIP  | 3     |
| S   | BAM_CSOFT_CLIP | 4     |
| H   | BAM_CHARD_CLIP | 5     |
| P   | BAM_CPAD       | 6     |
| =   | BAM_CEQUAL     | 7     |
| X   | BAM_CDIFF      | 8     |
+-----+----------------+-------+
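To make the operation codes concrete, here is a minimal sketch that splits a CIGAR string into (length, operation, numeric code) triples using the codes from the table. The function names `parse_cigar` and `reference_span` are hypothetical, and the example string is made up. The second function relies on the SAM specification's rule that M, D, N, = and X are the operations that consume reference bases.

```python
import re
from typing import List, Tuple

# Position in this string equals the numeric code in the table (M=0 ... X=8).
CIGAR_OPS = "MIDNSHP=X"
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str, int]]:
    """Return (length, op, numeric code) for each operation in the string."""
    if cigar == "*":          # "*" means no CIGAR is available
        return []
    ops = []
    pos = 0
    for match in _CIGAR_RE.finditer(cigar):
        if match.start() != pos:   # unrecognised characters between operations
            raise ValueError(f"malformed CIGAR string: {cigar!r}")
        pos = match.end()
        op = match.group(2)
        ops.append((int(match.group(1)), op, CIGAR_OPS.index(op)))
    if pos != len(cigar):          # trailing characters that did not match
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(length for length, op, _ in parse_cigar(cigar) if op in "MDN=X")


print(parse_cigar("3S10M2I5M1D4M"))
print(reference_span("3S10M2I5M1D4M"))
```

Adding the reference span to the 1-based POS field (and subtracting 1) gives the rightmost reference position of the alignment.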

Quality

A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect). Two different equations have been in use. The first is the standard Sanger variant to assess reliability of a base call, otherwise known as Phred quality score:

$$Q_{\text{sanger}} = -10 \log_{10} p$$

The Solexa pipeline (i.e., the software delivered with the Illumina Genome Analyzer) earlier used a different mapping, encoding the odds p/(1-p) instead of the probability p:

$$Q_{\text{solexa}} = -10 \log_{10} \frac{p}{1-p}$$

Although both mappings are asymptotically identical at higher quality values, they differ at lower quality levels (i.e., approximately p > 0.05, or equivalently, Q < 13).
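The difference between the two mappings is easy to check numerically. The sketch below implements both definitions as stated in the text and adds a Solexa-to-Phred conversion obtained by solving the Solexa equation for p and substituting it into the Phred equation. The function names are hypothetical and chosen for this example, and the scores are left as real numbers rather than rounded to integers.

```python
import math


def phred_quality(p: float) -> float:
    """Sanger/Phred score: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)


def solexa_quality(p: float) -> float:
    """Early Solexa score: Q = -10 * log10(p / (1 - p))."""
    return -10.0 * math.log10(p / (1.0 - p))


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to the Phred scale.

    From Q_solexa = -10*log10(p/(1-p)) it follows that
    p = 1 / (10**(Q_solexa/10) + 1), so
    Q_phred = 10 * log10(10**(Q_solexa/10) + 1).
    """
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


for p in (0.5, 0.05, 0.0001):
    print(f"p={p}: phred={phred_quality(p):.4f} solexa={solexa_quality(p):.4f}")
```

For small p the two scores nearly coincide, while for error probabilities close to 0.5 the Solexa score drops to zero and below.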

Figure: Relationship between Q and p using the Sanger (red) and Solexa (black) equations (described above). The vertical dotted line indicates p = 0.05, or equivalently, Q ≈ 13.

At times there has been disagreement about which mapping Illumina actually uses. The user guide (Appendix B, page 122) for version 1.4 of the Illumina pipeline states that: "The scores are defined as Q=10*log10(p/(1-p)) [sic], where p is the probability of a base call corresponding to the base in question".[2] In retrospect, this entry in the manual appears to have been an error. The user guide (What's New, page 5) for version 1.5 of the Illumina pipeline lists this description instead: "Important Changes in Pipeline v1.3 [sic]. The quality scoring scheme has changed to the Phred [i.e., Sanger] scoring scheme, encoded as an ASCII character by adding 64 to the Phred value. A Phred score of a base is: $Q_{\text{phred}} = -10 \log_{10}(e)$, where e is the estimated probability of a base being wrong."[3]

SAM is tab-delimited. Apart from the header lines, which begin with @, each alignment line contains the following fields:

+-----+-------+------------------------------------------------------------------+
| Col | Field | Description                                                      |
+-----+-------+------------------------------------------------------------------+
| 1   | QNAME | Query (pair) NAME: name of the template the read belongs to      |
| 2   | FLAG  | bitwise FLAG (bit meanings are given below)                      |
| 3   | RNAME | Reference sequence NAME                                          |
| 4   | POS   | 1-based leftmost POSition/coordinate of the clipped sequence     |
| 5   | MAPQ  | MAPping Quality (Phred-scaled)                                   |
| 6   | CIGAR | extended CIGAR string (see the SAM format specification)         |
| 7   | MRNM  | Mate Reference sequence NaMe ("=" if same as RNAME)              |
| 8   | MPOS  | 1-based Mate POSition                                            |
| 9   | ISIZE | Inferred insert SIZE                                             |
| 10  | SEQ   | query SEQuence on the same strand as the reference               |
| 11  | QUAL  | query QUALity (ASCII code minus 33 gives the Phred base quality) |
| 12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE           |
+-----+-------+------------------------------------------------------------------+

The following table lists the meaning of each bit in the FLAG column:

+---------------+--------------------------------------------------+
| Flag          | Description                                      |
+---------------+--------------------------------------------------+
| 0x0001 (1)    | the read is paired in sequencing                 |
| 0x0002 (2)    | the read is mapped in a proper pair              |
| 0x0004 (4)    | the query sequence itself is unmapped            |
| 0x0008 (8)    | the mate is unmapped                             |
| 0x0010 (16)   | strand of the query (1 for reverse)              |
| 0x0020 (32)   | strand of the mate (1 for reverse)               |
| 0x0040 (64)   | the read is the first read in a pair             |
| 0x0080 (128)  | the read is the second read in a pair            |
| 0x0100 (256)  | the alignment is not primary                     |
| 0x0200 (512)  | the read fails platform/vendor quality checks    |
| 0x0400 (1024) | the read is either a PCR or an optical duplicate |
+---------------+--------------------------------------------------+
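Because the FLAG value is a sum of individual bits, it can be decoded with bitwise AND. The following is a minimal sketch based on the table above; the short labels and the function name `decode_flag` are hypothetical and exist only for this example.

```python
from typing import List

# (bit value, short label) pairs taken from the FLAG table.
FLAG_BITS = [
    (0x0001, "paired"),
    (0x0002, "proper_pair"),
    (0x0004, "unmapped"),
    (0x0008, "mate_unmapped"),
    (0x0010, "reverse"),
    (0x0020, "mate_reverse"),
    (0x0040, "first_in_pair"),
    (0x0080, "second_in_pair"),
    (0x0100, "not_primary"),
    (0x0200, "qc_fail"),
    (0x0400, "duplicate"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


print(decode_flag(83))   # 83 = 64 + 16 + 2 + 1
print(decode_flag(16))
```

A flag of 83 therefore decodes to a paired read in a proper pair, aligned to the reverse strand, that is the first mate, while 16 is simply a reverse-strand alignment.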

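Limitations:

1. Unaligned words are used in bam_import.c, bam_endian.h, bam.c and bam_aux.c.

2. The CIGAR operation P is not handled properly.

3. When merging, the input files must have the same number of reference sequences. This requirement could be relaxed. In addition, merging does not automatically reconstruct the header dictionary, so users must supply the correct header themselves. Picard handles merging better.

4. samtools rmdup cannot handle single-end data and cannot remove duplicates that span different chromosomes. Picard handles these cases better.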
