[Transcriptome Sequencing Analysis Series, Part 6]
Introduction to the SRA Database
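The [Transcriptome Sequencing Analysis Series] covers the following workflow topics:
Previous posts in the series
Basic formats
Data download and preprocessing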
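The previous post went straight to downloading sequencing data, when it should have introduced the relevant databases first, so this post covers the SRA database. A general understanding of the database and a few simple search techniques are well worth having. If you know better search tricks, feel free to share them~
Let's start with a look at the database home page:
SRA database home page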
https://www.ncbi.nlm.nih.gov/sra/
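SRA stands for Sequence Read Archive. It stores biological sequence data so that the research community can compare datasets, improve the reproducibility of results, and make new discoveries. It shares user-submitted sequencing data with the EMBL-EBI and DDBJ databases.
The sequencing platforms currently archived include: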
Roche 454 GS System
Illumina Genome Analyzer
Applied Biosystems SOLiD System
Helicos Heliscope
Complete Genomics
Pacific Biosciences SMRT
SRA database growth
(as of 2019-05-31)
The data growth curve can be viewed on the official site:
https://www.ncbi.nlm.nih.gov/sra/docs/sragrowth/
1, SRA accession numbers
There are six main types of SRA accession, each with a three-letter prefix beginning with S. Official description:
https://www.ncbi.nlm.nih.gov/books/NBK56913/#search.what_do_the_different_sra_accessi
Accession Prefix | Accession Name | Definition |
SRA | SRA submission accession | The submission accession represents a virtual container that holds the objects represented by the other five accessions and is used to track the submission in the archive. |
SRP | SRA study accession | A Study is an object that contains the project metadata describing a sequencing study or project. Imported from BioProject. |
SRX | SRA experiment accession | An Experiment is an object that contains the metadata describing the library, platform selection, and processing parameters involved in a particular sequencing experiment. |
SRR | SRA run accession | A Run is an object that contains actual sequencing data for a particular sequencing experiment. Experiments may contain many Runs depending on the number of sequencing instrument runs that were needed. |
SRS | SRA sample accession | A Sample is an object that contains the metadata describing the physical sample upon which a sequencing experiment was performed. Imported from BioSample. |
SRZ | SRA analysis accession | An analysis is an object that contains a sequence data analysis BAM file and the metadata describing the sequence analysis. |
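As mentioned above, NCBI SRA shares data with the EMBL-EBI and DDBJ databases, so you will also see accessions that begin with E or D.
Accessions that originate from EMBL-EBI begin with E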
Accession Prefix | Accession Name |
ERA | ERA submission accession |
ERP | ERA study accession |
ERX | ERA experiment accession |
ERR | ERA run accession |
ERS | ERA sample accession |
ERZ | ERA analysis accession |
Accessions that originate from DDBJ begin with D
Accession Prefix | Accession Name |
DRA | DRA submission accession |
DRP | DRA study accession |
DRX | DRA experiment accession |
DRR | DRA run accession |
DRS | DRA sample accession |
DRZ | DRA analysis accession |
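The three tables above follow one pattern: the first letter names the archive, the second is always R, and the third names the object type. A minimal sketch of reading an accession that way is shown here; the function name `classify_accession` and the assumption that the prefix is followed only by digits are illustrative choices, not part of any NCBI tool.

```python
# Prefix letters taken from the accession tables above.
ARCHIVES = {"S": "NCBI SRA", "E": "EMBL-EBI", "D": "DDBJ"}
OBJECT_TYPES = {
    "A": "submission",
    "P": "study",
    "X": "experiment",
    "R": "run",
    "S": "sample",
    "Z": "analysis",
}


def classify_accession(accession):
    """Return (archive, object_type) for an accession such as 'SRP140592'.

    Assumption: an accession is a three-letter prefix followed by digits only.
    Raises ValueError for anything that does not fit that shape.
    """
    acc = accession.strip().upper()
    digits = acc[3:]
    if len(acc) < 4 or acc[1] != "R" or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not an SRA-style accession: {accession!r}")
    try:
        return ARCHIVES[acc[0]], OBJECT_TYPES[acc[2]]
    except KeyError:
        raise ValueError(f"unknown accession prefix: {acc[:3]!r}") from None


print(classify_accession("SRP140592"))
```

The call on the last line classifies the study accession that appears later in this post.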
2, Search tips
How to find data that match your research interest. Official guide:
https://www.ncbi.nlm.nih.gov/sra/docs/srasearch/
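1) SRX accessions
The SRX (experiment) accession is the smallest unit that the SRA database publishes, and an SRA search returns a list of SRX accessions. For example, a search for data related to breast cancer looks like the screenshot below:
2) Filtering with Boolean operators
OR: matches records that contain at least one of the terms
AND: matches records that contain both terms
NOT: excludes records that contain the term
To narrow the search further to triple-negative breast cancer, human samples, and non-cell-line data, combine the Boolean operators as follows: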
(breast cancer AND triple negative) NOT "cell line" AND "Homo sapiens"[porgn:__txid9606]
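The same query can also be submitted programmatically. The sketch below only builds the request URL for NCBI's E-utilities ESearch endpoint and does not send it; the helper name `sra_esearch_url` and the `retmax` default are illustrative assumptions, and E-utilities itself is not covered in this post.

```python
from urllib.parse import urlencode

# E-utilities ESearch endpoint. The URL is only constructed here, not requested.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def sra_esearch_url(term, retmax=20):
    """Build an ESearch URL that would run `term` against the SRA database."""
    params = {"db": "sra", "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode percent-escapes quotes, brackets and spaces in the query.
    return ESEARCH + "?" + urlencode(params)


query = (
    '(breast cancer AND triple negative) NOT "cell line" '
    'AND "Homo sapiens"[porgn:__txid9606]'
)
url = sra_esearch_url(query)
print(url)
```

Letting `urlencode` do the escaping avoids hand-editing the quotes and square brackets that the search syntax relies on.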
3) Searching by accession, using one of four accession types:
STUDY:SRP#, ERP#, or DRP#
SAMPLE:SRS#, ERS#, or DRS#
EXPERIMENT:SRX#, ERX#, or DRX#
RUN:SRR#, ERR#, or DRR#
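Accessions often have to be picked out of free text, such as the methods section of a paper, before they can be searched. The sketch below groups them by the four types listed above, assuming an accession is one of those prefixes followed by digits. The function name `find_accessions` and the run numbers in the sample text are made up for illustration.

```python
import re
from collections import defaultdict

# Third letter of the prefix -> the search type listed above.
CATEGORY = {"P": "STUDY", "S": "SAMPLE", "X": "EXPERIMENT", "R": "RUN"}
ACCESSION_RE = re.compile(r"\b([SED]R[PSXR])(\d+)\b")


def find_accessions(text):
    """Group every study/sample/experiment/run accession found in `text`."""
    found = defaultdict(list)
    for match in ACCESSION_RE.finditer(text):
        prefix, digits = match.groups()
        accession = prefix + digits
        bucket = found[CATEGORY[prefix[2]]]
        if accession not in bucket:  # keep first occurrence only
            bucket.append(accession)
    return dict(found)


# The SRR/ERR numbers below are placeholders, not real records.
sample_text = "Data: SRP140592 (runs SRR0000001, ERR0000002; see SRP140592)."
print(find_accessions(sample_text))
```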
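For example, SRP140592: this project has six samples and is also linked from the GEO database, as shown below:
4) Advanced search
Open the advanced search page:
Notes on the search page:
Using the Filter field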
Filter in SRA primarily allows you to find SRA records that are cross-referenced with other NCBI databases: PubMed, PubMed Central (PMC), Nucleotide, Assembly, and others.
Example: retrieve all Drosophila melanogaster data that have related articles in PMC:
The related PMC articles are returned:
Using the Properties field
Properties in SRA primarily allows you to narrow search results by controlled-vocabulary library's annotations.
Index terms of SRA records that can be used under Properties:
1. 'Instrument': "instrument illumina genome analyzer iix"[Properties]
2. 'Library layout': "library layout paired"[Properties]
3. 'Platform': "platform helicos"[Properties]
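Property terms like these can be combined with other fields using the Boolean operators described earlier. A minimal sketch of composing such a query string follows; the helpers `tagged` and `all_of` are hypothetical, and the `[Organism]` field tag is assumed from general Entrez search syntax rather than taken from this post.

```python
def tagged(value, field):
    """Format one search term, e.g. "library layout paired"[Properties]."""
    return f'"{value}"[{field}]'


def all_of(*terms):
    """Join terms with AND so that a record must match every one of them."""
    return " AND ".join(terms)


query = all_of(
    tagged("Drosophila melanogaster", "Organism"),  # assumed field tag
    tagged("library layout paired", "Properties"),
    tagged("instrument illumina genome analyzer iix", "Properties"),
)
print(query)
```

The resulting string can be pasted into the SRA search box as-is.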
5) You can also search the BioProject and BioSample databases.