Knowledge Center - 北京概普生物科技有限公司 (GapTech)

Recent High-Impact Bioinformatics Literature Digest

Bioinformatics Essentials article · March 31, 2018, 18:04

1. Cultivation and sequencing of rumen microbiome members from the Hungate1000 Collection (Nature Biotechnology)

Abstract

Productivity of ruminant livestock depends on the rumen microbiota, which ferment indigestible plant polysaccharides into nutrients used for growth. Understanding the functions carried out by the rumen microbiota is important for reducing greenhouse gas production by ruminants and for developing biofuels from lignocellulose. We present 410 cultured bacteria and archaea, together with their reference genomes, representing every cultivated rumen-associated archaeal and bacterial family. We evaluate polysaccharide degradation, short-chain fatty acid production and methanogenesis pathways, and assign specific taxa to functions. A total of 336 organisms were present in available rumen metagenomic data sets, and 134 were present in human gut microbiome data sets. Comparison with the human microbiome revealed rumen-specific enrichment for genes encoding de novo synthesis of vitamin B12, ongoing evolution by gene loss and potential vertical inheritance of the rumen microbiome based on underrepresentation of markers of environmental stress. We estimate that our Hungate genome resource represents ∼75% of the genus-level bacterial and archaeal taxa present in the rumen.

2. Linear assembly of a human centromere on the Y chromosome (Nature Biotechnology)

Abstract

The human genome reference sequence remains incomplete owing to the challenge of assembling long tracts of near-identical tandem repeats in centromeres. We implemented a nanopore sequencing strategy to generate high-quality reads that span hundreds of kilobases of highly repetitive DNA in a human Y chromosome centromere. Combining these data with short-read variant validation, we assembled and characterized the centromeric region of a human Y chromosome.

3. Cost-effective high-throughput single-haplotype iterative mapping and sequencing for complex genomic structures (Nature Protocols)

Abstract

The reference sequences of structurally complex regions can be obtained only through highly accurate clone-based approaches. We and others have successfully used single-haplotype iterative mapping and sequencing (SHIMS) 1.0 to assemble structurally complex regions across the sex chromosomes of several vertebrate species and to allow for targeted improvements to the reference sequences of human autosomes. However, SHIMS 1.0 is expensive and time consuming, requiring resources that only a genome center can provide. Here we introduce SHIMS 2.0, an improved SHIMS protocol that allows even a small laboratory to generate high-quality reference sequence from complex genomic regions. Using a streamlined and parallelized library-preparation protocol, and taking advantage of inexpensive high-throughput short-read-sequencing technologies, a small laboratory with both molecular biology and bioinformatics experience can sequence and assemble 192 large-insert bacterial artificial chromosome (BAC) or fosmid clones in 1 week. In SHIMS 2.0, in contrast to other pooling strategies, each clone is sequenced with a unique barcode, thus enabling clones containing nearly identical sequences to be multiplexed in a single sequencing run and assembled separately. Relative to SHIMS 1.0, SHIMS 2.0 decreases the required cost and time by two orders of magnitude while preserving high sequencing accuracy.
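The per-clone barcoding idea described in this abstract can be illustrated with a minimal demultiplexing sketch. It assumes, purely for illustration, that each read starts with a fixed-length inline barcode and that a lookup table maps barcodes to clone names. The barcode sequences, clone names, and the `demultiplex` function are hypothetical and are not taken from the SHIMS 2.0 protocol, which may handle indexing differently.

```python
from collections import defaultdict

# Hypothetical barcode-to-clone table (illustrative values, not from the protocol).
BARCODES = {
    "ACGT": "clone_A",
    "TGCA": "clone_B",
}

def demultiplex(reads, barcodes, barcode_len=4):
    """Group reads by their leading barcode and strip the barcode.

    Reads whose prefix matches no known barcode are kept whole
    under the key "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        prefix, insert = read[:barcode_len], read[barcode_len:]
        clone = barcodes.get(prefix, "unassigned")
        bins[clone].append(insert if clone != "unassigned" else read)
    return dict(bins)

# Two clones carrying an identical insert sequence, plus one unknown barcode.
reads = ["ACGTGGGATTACA", "TGCAGGGATTACA", "ACGTCCCTTTAAA", "NNNNGGGATTACA"]
by_clone = demultiplex(reads, BARCODES)
for clone, seqs in sorted(by_clone.items()):
    print(clone, seqs)
```

In this toy input, `clone_A` and `clone_B` share the same insert sequence yet end up in separate bins, which is the property that lets nearly identical clones be pooled in one run and still be assembled separately.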

4. Target-enrichment sequencing for detailed characterization of small RNAs (Nature Protocols)

Abstract

Identification of important, functional small RNA (sRNA) species is currently hampered by the lack of reliable and sensitive methods to isolate and characterize them. We have developed a method, termed target-enrichment of sRNAs (TEsR), that enables targeted sequencing of rare sRNAs and diverse precursor and mature forms of sRNAs not detectable by current standard sRNA sequencing methods. It is based on the amplification of full-length sRNA molecules, production of biotinylated RNA probes, hybridization to one or multiple targeted RNAs, removal of nontargeted sRNAs and sequencing. By this approach, target sRNAs can be enriched by a factor of 500–30,000 while maintaining strand specificity. TEsR enriches for sRNAs irrespective of length or different molecular features, such as the presence or absence of a 5′ cap or of secondary structures or abundance levels. Moreover, TEsR allows the detection of the complete sequence (including sequence variants, and 5′ and 3′ ends) of precursors, as well as intermediate and mature forms, in a quantitative manner. A well-trained molecular biologist can complete the TEsR procedure, from RNA extraction to sequencing library preparation, within 4–6 d.

5. Genome-wide analyses using UK Biobank data provide insights into the genetic architecture of osteoarthritis (Nature Genetics)

Abstract

Osteoarthritis is a common complex disease imposing a large public-health burden. Here, we performed a genome-wide association study for osteoarthritis, using data across 16.5 million variants from the UK Biobank resource. After performing replication and meta-analysis in up to 30,727 cases and 297,191 controls, we identified nine new osteoarthritis loci, in all of which the most likely causal variant was noncoding. For three loci, we detected association with biologically relevant radiographic endophenotypes, and in five signals we identified genes that were differentially expressed in degraded compared with intact articular cartilage from patients with osteoarthritis. We established causal effects on osteoarthritis for higher body mass index but not for triglyceride levels or genetic predisposition to type 2 diabetes.

6. Multiancestry genome-wide association study of 520,000 subjects identifies 32 loci associated with stroke and stroke subtypes (Nature Genetics)

Abstract

Stroke has multiple etiologies, but the underlying genes and pathways are largely unknown. We conducted a multiancestry genome-wide-association meta-analysis in 521,612 individuals (67,162 cases and 454,450 controls) and discovered 22 new stroke risk loci, bringing the total to 32. We further found shared genetic variation with related vascular traits, including blood pressure, cardiac traits, and venous thromboembolism, at individual loci (n = 18), and using genetic risk scores and linkage-disequilibrium-score regression. Several loci exhibited distinct association and pleiotropy patterns for etiological stroke subtypes. Eleven new susceptibility loci indicate mechanisms not previously implicated in stroke pathophysiology, with prioritization of risk variants and genes accomplished through bioinformatics analyses using extensive functional datasets. Stroke risk loci were significantly enriched in drug targets for antithrombotic therapy.

7. A comprehensive evaluation of module detection methods for gene expression data (Nature Communications)

Abstract

A critical step in the analysis of large genome-wide gene expression datasets is the use of module detection methods to group genes into co-expression modules. Because of limitations of classical clustering methods, numerous alternative module detection methods have been proposed, which improve upon clustering by handling co-expression in only a subset of samples, modelling the regulatory network, and/or allowing overlap between modules. In this study we use known regulatory networks to do a comprehensive and robust evaluation of these different methods. Overall, decomposition methods outperform all other strategies, while we do not find a clear advantage of biclustering and network inference-based approaches on large gene expression datasets. Using our evaluation workflow, we also investigate several practical aspects of module detection, such as parameter estimation and the use of alternative similarity measures, and conclude with recommendations for the further development of these methods.

8. Capture Hi-C identifies putative target genes at 33 breast cancer risk loci (Nature Communications)

Abstract

Genome-wide association studies (GWAS) have identified approximately 100 breast cancer risk loci. Translating these findings into a greater understanding of the mechanisms that influence disease risk requires identification of the genes or non-coding RNAs that mediate these associations. Here, we use Capture Hi-C (CHi-C) to annotate 63 loci; we identify 110 putative target genes at 33 loci. To assess the support for these target genes in other data sources we test for associations between levels of expression and SNP genotype (eQTLs), disease-specific survival (DSS), and compare them with somatically mutated cancer genes. 22 putative target genes are eQTLs, 32 are associated with DSS and 14 are somatically mutated in breast, or other, cancers. Identifying the target genes at GWAS risk loci will lead to a greater understanding of the mechanisms that influence breast cancer risk and prognosis.

9. Structure and evolution of the Fam20 kinases (Nature Communications)

Abstract

The Fam20 proteins are novel kinases that phosphorylate secreted proteins and proteoglycans. Fam20C phosphorylates hundreds of secreted proteins and is activated by the pseudokinase Fam20A. Fam20B phosphorylates a xylose residue to regulate proteoglycan synthesis. Despite these wide-ranging and important functions, the molecular and structural basis for the regulation and substrate specificity of these kinases are unknown. Here we report molecular characterizations of all three Fam20 kinases, and show that Fam20C is activated by the formation of an evolutionarily conserved homodimer or heterodimer with Fam20A. Fam20B has a unique active site for recognizing Galβ1-4Xylβ1, the initiator disaccharide within the tetrasaccharide linker region of proteoglycans. We further show that in animals the monomeric Fam20B preceded the appearance of the dimeric Fam20C, and the dimerization trait of Fam20C emerged concomitantly with a change in substrate specificity. Our results provide comprehensive structural, biochemical, and evolutionary insights into the function of the Fam20 kinases.

10. QAPA: a new method for the systematic analysis of alternative polyadenylation from RNA-seq data (Genome Biology)

Abstract

Alternative polyadenylation (APA) affects most mammalian genes. The genome-wide investigation of APA has been hampered by an inability to reliably profile it using conventional RNA-seq. We describe ‘Quantification of APA’ (QAPA), a method that infers APA from conventional RNA-seq data. QAPA is faster and more sensitive than other methods. Application of QAPA reveals discrete, temporally coordinated APA programs during neurogenesis and that there is little overlap between genes regulated by alternative splicing and those by APA. Modeling of these data uncovers an APA sequence code. QAPA thus enables the discovery and characterization of programs of regulated APA using conventional RNA-seq.

11. SUPPA2: fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions (Genome Biology)

Abstract

Despite the many approaches to study differential splicing from RNA-seq, many challenges remain unsolved, including computing capacity and sequencing depth requirements. Here we present SUPPA2, a new method that addresses these challenges, and enables streamlined analysis across multiple conditions taking into account biological variability. Using experimental and simulated data, we show that SUPPA2 achieves higher accuracy compared to other methods, especially at low sequencing depth and short read length. We use SUPPA2 to identify novel Transformer2-regulated exons, novel microexons induced during differentiation of bipolar neurons, and novel intron retention events during erythroblast differentiation.

12. FusorSV: an algorithm for optimally combining data from multiple structural variation detection methods (Genome Biology)

Abstract

Comprehensive and accurate identification of structural variations (SVs) from next generation sequencing data remains a major challenge. We develop FusorSV, which uses a data mining approach to assess performance and merge callsets from an ensemble of SV-calling algorithms. It includes a fusion model built using analysis of 27 deep-coverage human genomes from the 1000 Genomes Project. We identify 843 novel SV calls that were not reported by the 1000 Genomes Project for these 27 samples. Experimental validation of a subset of these calls yields a validation rate of 86.7%. FusorSV is available at https://github.com/TheJacksonLaboratory/SVE.

13. taxMaps: Comprehensive and highly accurate taxonomic classification of short-read data in reasonable time (Genome Research)

Abstract

High-throughput sequencing is a revolutionary technology for the analysis of metagenomic samples. However, querying large volumes of reads against comprehensive DNA/RNA databases in a sensitive manner can be compute-intensive. Here, we present taxMaps, a highly efficient, sensitive and fully scalable taxonomic classification tool. Using a combination of simulated and real metagenomics datasets, we demonstrate that taxMaps is more sensitive and more precise than widely used taxonomic classifiers, being capable of delivering classification accuracy comparable to that of BLASTn, but at up to 3 orders of magnitude less computational cost.

14. Whole-genome sequencing of Atacama skeleton shows novel mutations linked with dysplasia (Genome Research)

Abstract

Over a decade ago, the Atacama humanoid skeleton (Ata) was discovered in the Atacama region of Chile. The Ata specimen carried a strange phenotype—6-in stature, fewer than expected ribs, elongated cranium, and accelerated bone age—leading to speculation that this was a preserved nonhuman primate, human fetus harboring genetic mutations, or even an extraterrestrial. We previously reported that it was human by DNA analysis with an estimated bone age of about 6–8 yr at the time of demise. To determine the possible genetic drivers of the observed morphology, DNA from the specimen was subjected to whole-genome sequencing using the Illumina HiSeq platform with an average 11.5× coverage of 101-bp, paired-end reads. In total, 3,356,569 single nucleotide variations (SNVs) were found as compared to the human reference genome, 518,365 insertions and deletions (indels), and 1047 structural variations (SVs) were detected. Here, we present the detailed whole-genome analysis showing that Ata is a female of human origin, likely of Chilean descent, and its genome harbors mutations in genes (COL1A1, COL2A1, KMT2D, FLNB, ATR, TRIP11, PCNT) previously linked with diseases of small stature, rib anomalies, cranial malformations, premature joint fusion, and osteochondrodysplasia (also known as skeletal dysplasia). Together, these findings provide a molecular characterization of Ata's peculiar phenotype, which likely results from multiple known and novel putative gene mutations affecting bone development and ossification.

15. SvABA: genome-wide detection of structural variants and indels by local assembly (Genome Research)

Abstract

Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA's performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs and substantially improves detection performance for variants in the 20–300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (<1000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types and found that short templated-sequence insertions occur in ∼4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized (50–300 bp) SVs.
