Knowledge Center - 北京概普生物科技有限公司 (GapTech)

Recent Bioinformatics Literature Digest (11.11)

Bioinformatics Tips · sxr2 · November 12, 2017, 05:29

1. GRIDSS: software for sensitive and specific detection of genomic rearrangements using positional de Bruijn graph assembly.

GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly (Genome Research)

Abstract

The identification of genomic rearrangements with high sensitivity and specificity using massively parallel sequencing remains a major challenge, particularly in precision medicine and cancer research. Here, we describe a new method for detecting rearrangements, GRIDSS (Genome Rearrangement IDentification Software Suite). GRIDSS is a multithreaded structural variant (SV) caller that performs efficient genome-wide break-end assembly prior to variant calling using a novel positional de Bruijn graph-based assembler. By combining assembly, split read, and read pair evidence using probabilistic scoring, GRIDSS achieves high sensitivity and specificity on simulated, cell line, and patient tumor data, recently winning SV subchallenge #5 of the ICGC-TCGA DREAM8.5 Somatic Mutation Calling Challenge. On human cell line data, GRIDSS halves the false discovery rate compared to other recent methods while matching or exceeding their sensitivity. GRIDSS identifies nontemplate sequence insertions, microhomologies, and large imperfect homologies, estimates a quality score for each breakpoint, stratifies calls into high or low confidence, and supports multisample analysis.
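To make the idea of a positional de Bruijn graph more concrete, here is a minimal Python sketch. It is not GRIDSS's implementation: it simply assumes each read comes with an expected genomic start position (the `anchored_start` input is hypothetical) and keys every node by a (k-mer, position) pair, so that identical k-mers from different loci are kept apart instead of being collapsed as in an ordinary de Bruijn graph.

```python
from collections import defaultdict

def positional_debruijn(reads, k):
    """Build a toy positional de Bruijn graph.

    reads: iterable of (sequence, anchored_start) pairs; anchored_start is a
           hypothetical stand-in for the position the read is expected at.
    Returns (weights, edges):
      weights: node -> number of supporting reads
      edges:   node -> set of successor nodes
    where a node is a (kmer, position) tuple.
    """
    weights = defaultdict(int)
    edges = defaultdict(set)
    for seq, start in reads:
        prev = None
        for i in range(len(seq) - k + 1):
            node = (seq[i:i + k], start + i)
            weights[node] += 1
            if prev is not None:
                edges[prev].add(node)
            prev = node
    return weights, edges

# Two overlapping reads near position 100 and one identical read near 500.
reads = [("ACGTAC", 100), ("CGTACG", 101), ("ACGTAC", 500)]
weights, edges = positional_debruijn(reads, k=4)
```

In this toy input the k-mer `ACGT` appears at positions 100 and 500 as two separate nodes, while the overlapping reads near position 100 reinforce the same nodes.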

2. Nanopore sequencing of complex genomic rearrangements in yeast reveals mechanisms of repeat-mediated double-strand break repair.

Nanopore sequencing of complex genomic rearrangements in yeast reveals mechanisms of repeat-mediated double-strand break repair (Genome Research)

Abstract

Improper DNA double-strand break (DSB) repair results in complex genomic rearrangements (CGRs) in many cancers and various congenital disorders in humans. Trinucleotide repeat sequences, such as (GAA)n repeats in Friedreich's ataxia, (CTG)n repeats in myotonic dystrophy and (CGG)n repeats in fragile X syndrome, are also subject to double-strand breaks within the repetitive tract followed by DNA repair. Mapping the outcomes of CGRs is important for understanding their causes and potential phenotypic effects. However, high-resolution mapping of CGRs has traditionally been a laborious and highly skilled process. Recent advances in long-read DNA sequencing technologies, specifically Nanopore sequencing, have made possible the rapid identification of CGRs with single base pair resolution. Here we have employed whole-genome Nanopore sequencing to characterize several CGRs that originated from naturally occurring DSBs at (GAA)n microsatellites in S. cerevisiae. These data gave us important insights into the mechanisms of DSB repair leading to CGRs.

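3. New hot pepper reference genome sequences reveal the massive evolution of plant disease-resistance genes through retroduplication.

New reference genome sequences of hot pepper reveal the massive evolution of plant disease-resistance genes by retroduplication (Genome Biology)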

Abstract

Background

Transposable elements are major evolutionary forces which can cause new genome structure and species diversification. The role of transposable elements in the expansion of nucleotide-binding and leucine-rich-repeat proteins (NLRs), the major disease-resistance gene families, has been unexplored in plants.

Results

We report two high-quality de novo genomes (Capsicum baccatum and C. chinense) and an improved reference genome (C. annuum) for peppers. Dynamic genome rearrangements involving translocations among chromosomes 3, 5, and 9 were detected in comparison between C. baccatum and the two other peppers. The amplification of Athila LTR-retrotransposons, members of the gypsy superfamily, led to genome expansion in C. baccatum. In-depth genome-wide comparison of genes and repeats unveiled that the copy numbers of NLRs were greatly increased by LTR-retrotransposon-mediated retroduplication. Moreover, retroduplicated NLRs are abundant across the angiosperms and, in most cases, are lineage-specific.

Conclusions

Our study reveals that retroduplication has played key roles for the massive emergence of NLR genes including functional disease-resistance genes in pepper plants.

4. McEnhancer: predicting gene expression by assigning enhancers to target genes with a semi-supervised algorithm.

McEnhancer: predicting gene expression via semi-supervised assignment of enhancers to target genes (Genome Biology)

Abstract

Transcriptional enhancers regulate spatiotemporal gene expression. While genomic assays can identify putative enhancers en masse, assigning target genes is a complex challenge. We devised a machine learning approach, McEnhancer, which links target genes to putative enhancers via a semi-supervised learning algorithm that predicts gene expression patterns based on enriched sequence features. Predicted expression patterns were 73.98% accurate, predicted assignments showed strong Hi-C interaction enrichment, enhancer-associated histone modifications were evident, and known functional motifs were recovered. Our model provides a general framework to link globally identified enhancers to targets and contributes to deciphering the regulatory genome.

5. High-throughput annotation of full-length long noncoding RNAs using capture long-read sequencing.

High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing (Nature Genetics)

Abstract

Accurate annotation of genes and their transcripts is a foundation of genomics, but currently no annotation technique combines throughput and accuracy. As a result, reference gene collections remain incomplete: many gene models are fragmentary, and thousands more remain uncataloged, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), which combines targeted RNA capture with third-generation long-read sequencing. Here we present an experimental reannotation of the GENCODE intergenic lncRNA populations in matched human and mouse tissues that resulted in novel transcript models for 3,574 and 561 gene loci, respectively. CLS approximately doubled the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enabled us to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential. Thus, CLS removes a long-standing bottleneck in transcriptome annotation and generates manual-quality full-length transcript models at high-throughput scales.

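6. Exome chip meta-analysis identifies novel loci and East Asian-specific coding variants associated with blood lipid levels and coronary artery disease.

Exome chip meta-analysis identifies novel loci and East Asian-specific coding variants that contribute to lipid levels and coronary artery disease (Nature Genetics)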

Abstract

Most genome-wide association studies have been of European individuals, even though most genetic variation in humans is seen only in non-European samples. To search for novel loci associated with blood lipid levels and clarify the mechanism of action at previously identified lipid loci, we used an exome array to examine protein-coding genetic variants in 47,532 East Asian individuals. We identified 255 variants at 41 loci that reached chip-wide significance, including 3 novel loci and 14 East Asian-specific coding variant associations. After a meta-analysis including >300,000 European samples, we identified an additional nine novel loci. Sixteen genes were identified by protein-altering variants in both East Asians and Europeans, and thus are likely to be functional genes. Our data demonstrate that most of the low-frequency or rare coding variants associated with lipids are population specific, and that examining genomic data across diverse ancestries may facilitate the identification of functional genes at associated loci.

7. Exome-wide association study of plasma lipids in >300,000 individuals.

Exome-wide association study of plasma lipids in >300,000 individuals (Nature Genetics)

Abstract

We screened variants on an exome-focused genotyping array in >300,000 participants (replication in >280,000 participants) and identified 444 independent variants in 250 loci significantly associated with total cholesterol (TC), high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), and/or triglycerides (TG). At two loci (JAK2 and A1CF), experimental analysis in mice showed lipid changes consistent with the human data. We also found that: (i) beta-thalassemia trait carriers displayed lower TC and were protected from coronary artery disease (CAD); (ii) excluding the CETP locus, there was not a predictable relationship between plasma HDL-C and risk for age-related macular degeneration; (iii) only some mechanisms of lowering LDL-C appeared to increase risk for type 2 diabetes (T2D); and (iv) TG-lowering alleles involved in hepatic production of TG-rich lipoproteins (TM6SF2 and PNPLA3) tracked with higher liver fat, higher risk for T2D, and lower risk for CAD, whereas TG-lowering alleles involved in peripheral lipolysis (LPL and ANGPTL4) had no effect on liver fat but decreased risks for both T2D and CAD.

8. Transcriptome dynamics revealed by a gene expression atlas of the early Arabidopsis embryo.

Transcriptome dynamics revealed by a gene expression atlas of the early Arabidopsis embryo (Nature Plants)

Abstract

During early plant embryogenesis, precursors for all major tissues and stem cells are formed. While several components of the regulatory framework are known, how cell fates are instructed by genome-wide transcriptional activity remains unanswered, in part because of difficulties in capturing transcriptome changes at cellular resolution. Here, we have adapted a two-component transgenic labelling system to purify cell-type-specific nuclear RNA and generate a transcriptome atlas of early Arabidopsis embryo development, with a focus on root stem cell niche formation. We validated the dataset through gene expression analysis, and show that gene activity shifts in a spatiotemporal manner, probably signifying transcriptional reprogramming, to induce developmental processes reflecting cell states and state transitions. This atlas provides the most comprehensive tissue- and cell-specific description of genome-wide gene activity in the early plant embryo, and serves as a valuable resource for understanding the genetic control of early plant development.

9. Highly accurate fluorogenic DNA sequencing using an information theory-based error-correction algorithm.

Highly accurate fluorogenic DNA sequencing with information theory-based error correction (Nature Biotechnology)

Abstract

Eliminating errors in next-generation DNA sequencing has proved challenging. Here we present error-correction code (ECC) sequencing, a method to greatly improve sequencing accuracy by combining fluorogenic sequencing-by-synthesis (SBS) with an information theory-based error-correction algorithm. ECC embeds redundancy in sequencing reads by creating three orthogonal degenerate sequences, generated by alternate dual-base reactions. This is similar to encoding and decoding strategies that have proved effective in detecting and correcting errors in information communication and storage. We show that, when combined with a fluorogenic SBS chemistry with raw accuracy of 98.1%, ECC sequencing provides single-end, error-free sequences up to 200 bp. ECC approaches should enable accurate identification of extremely rare genomic variations in various applications in biology and medicine.
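The redundancy of three degenerate sequences can be illustrated with a small Python sketch. It is a per-base simplification and not the published decoding algorithm, which works on raw dual-base reaction signals. The assumption here is that the three "dual-base" views correspond to the three ways of splitting the four bases into two pairs, written with standard IUPAC codes (M/K, R/Y, W/S). Any two views determine a base, so the third acts as a parity check; this sketch only detects inconsistent positions and does not attempt to correct them.

```python
# Three ways to split the four bases into two pairs (IUPAC degenerate codes).
ENCODINGS = {
    "MK": {"A": "M", "C": "M", "G": "K", "T": "K"},  # M = A/C, K = G/T
    "RY": {"A": "R", "G": "R", "C": "Y", "T": "Y"},  # R = A/G, Y = C/T
    "WS": {"A": "W", "T": "W", "C": "S", "G": "S"},  # W = A/T, S = C/G
}
MEMBERS = {"M": "AC", "K": "GT", "R": "AG", "Y": "CT", "W": "AT", "S": "CG"}

def encode(seq):
    """Return the three degenerate views of a DNA sequence."""
    return {name: "".join(table[base] for base in seq)
            for name, table in ENCODINGS.items()}

def decode(mk, ry, ws):
    """Decode base by base from the three views.

    Returns (sequence, error_positions). A position is consistent only if
    the three degenerate codes share exactly one base; otherwise it is
    reported as an error and written as 'N'.
    """
    decoded, errors = [], []
    for i, codes in enumerate(zip(mk, ry, ws)):
        common = set("ACGT")
        for code in codes:
            common &= set(MEMBERS[code])
        if len(common) == 1:
            decoded.append(common.pop())
        else:
            decoded.append("N")
            errors.append(i)
    return "".join(decoded), errors

views = encode("GATTACA")
clean = decode(views["MK"], views["RY"], views["WS"])

# Simulate a single error in one view (position 2 of the M/K sequence).
corrupted_mk = views["MK"][:2] + "M" + views["MK"][3:]
flagged = decode(corrupted_mk, views["RY"], views["WS"])
```

With an error in one view, the three codes at that position no longer agree on a single base, which is what exposes the error.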

10. microRPM: a miRNA prediction model based only on plant small RNA sequencing data.

microRPM: A microRNA Prediction Model based only on plant small RNA sequencing data (Bioinformatics)

Abstract

Motivation

MicroRNAs (miRNAs) are endogenous non-coding small RNAs (of about 22 nucleotides), which play an important role in the post-transcriptional regulation of gene expression via either mRNA cleavage or translation inhibition. Several machine learning-based approaches have been developed to identify novel miRNAs from next-generation sequencing (NGS) data. Typically, precursor/genomic sequences are required as references for most methods. However, the non-availability of genomic sequences is often a limitation in miRNA discovery in non-model plants. A systematic approach to determine novel miRNAs without reference sequences is thus necessary.

Results

In this study, an effective method was developed to identify miRNAs from non-model plants based only on NGS datasets. The miRNA prediction model was trained with several duplex structure-related features of mature miRNAs and their passenger strands using a support vector machine (SVM) algorithm. The accuracy of the independent test reached 96.61% and 93.04% for dicots (Arabidopsis) and monocots (rice), respectively. Furthermore, true small RNA sequencing data from orchids was tested in this study. Twenty-one predicted orchid miRNAs were selected and experimentally validated. Significantly, eighteen of them were confirmed in the qRT-PCR experiment. This novel approach was also compiled as a user-friendly program called microRPM (microRNA Prediction Model).

11. Identifying structural variants using linked-read sequencing data.

Identifying structural variants using linked-read sequencing data (Bioinformatics)

Abstract

Motivation

Structural variation, including large deletions, duplications, inversions, translocations, and other rearrangements, is common in human and cancer genomes. A number of methods have been developed to identify structural variants from Illumina short-read sequencing data. However, reliable identification of structural variants remains challenging because many variants have breakpoints in repetitive regions of the genome and thus are difficult to identify with short reads. The recently developed linked-read sequencing technology from 10X Genomics combines a novel barcoding strategy with Illumina sequencing. This technology labels all reads that originate from a small number (~5-10) of DNA molecules ~50 Kbp in length with the same molecular barcode. These barcoded reads contain long-range sequence information that is advantageous for identification of structural variants.
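How barcodes carry long-range information can be sketched in a few lines of Python. The function below is an illustrative simplification, not part of any 10X Genomics or NAIBR software: it assumes reads arrive as hypothetical (barcode, chromosome, position) tuples and merges reads that share a barcode into a putative source molecule as long as neighbouring reads are no more than `max_gap` apart (an arbitrary threshold chosen for the example).

```python
from collections import defaultdict

def infer_molecules(reads, max_gap=50_000):
    """Group barcoded reads into putative source molecules.

    reads: iterable of (barcode, chrom, pos) tuples (hypothetical input).
    Reads with the same barcode on the same chromosome are merged while
    consecutive positions are at most max_gap apart.
    Returns a list of (barcode, chrom, start, end, n_reads).
    """
    by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        by_key[(barcode, chrom)].append(pos)

    molecules = []
    for (barcode, chrom), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current molecule, open a new one.
                molecules.append((barcode, chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        molecules.append((barcode, chrom, start, prev, count))
    return molecules

reads = [
    ("BC1", "chr1", 1_000), ("BC1", "chr1", 20_000), ("BC1", "chr1", 45_000),
    ("BC1", "chr1", 900_000), ("BC2", "chr1", 30_000),
]
molecules = infer_molecules(reads)
```

A barcode whose reads fall into two distant groups, like `BC1` here, may simply reflect two molecules sharing a barcode, or it may hint at a structural variant joining the two loci; telling these apart is the job of a statistical model and is beyond this sketch.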

Results

We present Novel Adjacency Identification with Barcoded Reads (NAIBR), an algorithm to identify structural variants in linked-read sequencing data. NAIBR predicts novel adjacencies in an individual genome resulting from structural variants using a probabilistic model that combines multiple signals in barcoded reads. We show that NAIBR outperforms several existing methods for structural variant identification, including two recent methods that also analyze linked-reads, on simulated sequencing data and 10X whole-genome sequencing data from the NA12878 human genome and the HCC1954 breast cancer cell line. Several of the novel somatic structural variants identified in HCC1954 overlap known cancer genes.

12. ARCS: scaffolding genome assemblies using linked reads.

ARCS: Scaffolding Genome Drafts with Linked Reads (Bioinformatics)

Abstract

Motivation

Sequencing of human genomes is now routine, and assembly of shotgun reads is increasingly feasible. However, assemblies often fail to inform about chromosome-scale structure due to a lack of linkage information over long stretches of DNA, a shortcoming that is being addressed by new sequencing protocols, such as the GemCode and Chromium linked reads from 10x Genomics.

Results

Here, we present ARCS, an application that utilizes the barcoding information contained in linked reads to further organize draft genomes into highly contiguous assemblies. We show how the contiguity of an ABySS H. sapiens genome assembly can be increased over sixfold, using moderate coverage (25-fold) Chromium data. We expect ARCS to have broad utility in harnessing the barcoding information contained in linked-read data for connecting high-quality sequences in genome assembly drafts.
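The general principle of using shared barcodes to connect contigs can be sketched as follows. This is a toy illustration, not the ARCS code or its parameters: the input dictionary, the `head`/`tail` labels, and the `min_shared` threshold are assumptions made for the example. Contig ends that share enough barcodes are proposed as neighbours in a scaffold.

```python
from itertools import combinations

def barcode_links(end_barcodes, min_shared=2):
    """Propose links between contig ends from shared barcodes.

    end_barcodes: dict mapping (contig, end) -> set of barcodes whose reads
                  align near that contig end ('head' or 'tail').
    Returns {(end_a, end_b): n_shared} for ends of different contigs that
    share at least min_shared barcodes (an illustrative threshold).
    """
    links = {}
    for a, b in combinations(sorted(end_barcodes), 2):
        if a[0] == b[0]:
            continue  # the two ends of one contig are already connected
        shared = len(end_barcodes[a] & end_barcodes[b])
        if shared >= min_shared:
            links[(a, b)] = shared
    return links

end_barcodes = {
    ("contig1", "tail"): {"BC1", "BC2", "BC3"},
    ("contig2", "head"): {"BC2", "BC3", "BC9"},
    ("contig2", "tail"): {"BC7"},
    ("contig3", "head"): {"BC7", "BC8"},
}
links = barcode_links(end_barcodes)
```

In this example only the tail of `contig1` and the head of `contig2` share enough barcodes to be linked; the single barcode shared by `contig2` and `contig3` falls below the threshold.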

13. Predicting chromatin accessibility with a hybrid deep convolutional neural network.

Chromatin accessibility prediction via a hybrid deep convolutional neural network (Bioinformatics)

Abstract

Motivation

A majority of known genetic variants associated with human inherited diseases lie in non-coding regions that lack adequate interpretation, making it indispensable to systematically discover functional sites at the whole genome level and precisely decipher their implications in a comprehensive manner. Although computational approaches have been complementing high-throughput biological experiments towards the annotation of the human genome, it still remains a big challenge to accurately annotate regulatory elements in the context of a specific cell type via automatic learning of the DNA sequence code from large-scale sequencing data. Indeed, the development of an accurate and interpretable model to learn the DNA sequence signature and further enable the identification of causative genetic variants has become essential in both genomic and genetic studies.

Results

We proposed Deopen, a hybrid framework mainly based on a deep convolutional neural network, to automatically learn the regulatory code of DNA sequences and predict chromatin accessibility. In a series of comparisons with existing methods, we show the superior performance of our model in not only the classification of accessible regions against background sequences sampled at random, but also the regression of DNase-seq signals. Besides, we further visualize the convolutional kernels and show the match of identified sequence signatures and known motifs. We finally demonstrate the sensitivity of our model in finding causative non-coding variants in the analysis of a breast cancer dataset. We expect to see wide applications of Deopen with either public or in-house chromatin accessibility data in the annotation of the human genome and the identification of non-coding variants associated with diseases.
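Why a convolutional kernel can be read as a sequence motif is easiest to see in a tiny example. The Python sketch below is not Deopen's architecture or training code. It one-hot encodes DNA and slides a single kernel along the sequence; the kernel is hand-set to the motif "TATA" (an assumption for illustration), whereas in a trained network the kernel weights would be learned from data.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def conv1d_scan(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence without padding.

    Returns one ReLU activation per window position.
    """
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(max(0.0, total))  # ReLU
    return scores

# Hypothetical kernel that responds most strongly to the motif "TATA".
kernel = one_hot("TATA")
scores = conv1d_scan(one_hot("GGTATACC"), kernel)
best = max(range(len(scores)), key=scores.__getitem__)
```

The activation peaks at the window that matches the motif, which is why visualizing learned kernels can reveal sequence signatures that resemble known motifs.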

14. Evaluation of splice-aware alignment tools for long-read RNA-seq.

Evaluation of tools for long read RNA-seq splice-aware alignment (Bioinformatics)

Abstract

Motivation

High-throughput sequencing has transformed the study of gene expression levels through RNA-seq, a technique that is now routinely used by various fields, such as genetic research or diagnostics. The advent of third-generation sequencing technologies providing significantly longer reads opens up new possibilities. However, the high error rates common to these technologies set new bioinformatics challenges for the gapped alignment of reads to their genomic origin. In this study, we have explored how currently available RNA-seq splice-aware alignment tools cope with increased read lengths and error rates. All tested tools were initially developed for short NGS reads, but some have claimed support for long Pacific Biosciences (PacBio) or even Oxford Nanopore Technologies (ONT) MinION reads.

Results

The tools were tested on synthetic and real datasets from two technologies (PacBio and ONT MinION). Alignment quality and resource usage were compared across different aligners. The effect of error correction of long reads was explored, both using self-correction and correction with an external short reads dataset. A tool was developed for evaluating RNA-seq alignment results. This tool can be used to compare the alignment of simulated reads to their genomic origin, or to compare the alignment of real reads to a set of annotated transcripts.

Our tests show that while some RNA-seq aligners were unable to cope with long error-prone reads, others produced overall good results. We further show that alignment accuracy can be improved using error-corrected reads.

15. miRTarBase update 2018: a resource for experimentally validated microRNA-target interactions.

miRTarBase update 2018: a resource for experimentally validated microRNA-target interactions (Nucleic Acids Research)

Abstract

MicroRNAs (miRNAs) are small non-coding RNAs of ∼22 nucleotides that are involved in negative regulation of mRNA at the post-transcriptional level. Previously, we developed miRTarBase which provides information about experimentally validated miRNA-target interactions (MTIs). Here, we describe an updated database containing 422 517 curated MTIs from 4076 miRNAs and 23 054 target genes collected from over 8500 articles. The number of MTIs curated by strong evidence has increased ∼1.4-fold since the last update in 2016. In this updated version, target sites validated by reporter assay that are available in the literature can be downloaded. The target site sequence can extract new features for analysis via a machine learning approach which can help to evaluate the performance of miRNA-target prediction tools. Furthermore, different ways of browsing enhance user browsing specific MTIs. With these improvements, miRTarBase serves as more comprehensively annotated, experimentally validated miRNA-target interactions databases in the field of miRNA related research. miRTarBase is available at http://miRTarBase.mbc.nctu.edu.tw/.

16. 3DIV: a 3D-genome interaction viewer and database.

3DIV: A 3D-genome Interaction Viewer and database (Nucleic Acids Research)

Abstract

Three-dimensional (3D) chromatin structure is an emerging paradigm for understanding gene regulation mechanisms. Hi-C (high-throughput chromatin conformation capture), a method to detect long-range chromatin interactions, allows extensive genome-wide investigation of 3D chromatin structure. However, broad application of Hi-C data has been hindered by the level of complexity in processing Hi-C data and the large size of raw sequencing data. In order to overcome these limitations, we constructed a database named 3DIV (a 3D-genome Interaction Viewer and database) that provides a list of long-range chromatin interaction partners for the queried locus with genomic and epigenomic annotations. 3DIV is the first of its kind to collect all publicly available human Hi-C data to provide 66 billion uniformly processed raw Hi-C read pairs obtained from 80 different human cell/tissue types. In contrast to other databases, 3DIV uniquely provides normalized chromatin interaction frequencies against genomic distance-dependent background signals and a dynamic browsing visualization tool for the listed interactions, which could greatly advance the interpretation of chromatin interactions. ‘3DIV’ is available at http://kobic.kr/3div.
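Normalizing interaction frequencies against a distance-dependent background can be illustrated with a common observed/expected calculation. The abstract does not specify 3DIV's background model, so the Python sketch below is an assumption-laden simplification: the expected count at each genomic distance is taken as the mean raw count over all supplied bin pairs at that distance, and the input dictionary of binned contacts is hypothetical.

```python
from collections import defaultdict

def distance_normalize(contacts):
    """Compute observed/expected ratios for binned contact counts.

    contacts: dict mapping (bin_i, bin_j) -> raw interaction count.
    The expected value for a pair is the mean count over all supplied
    pairs separated by the same distance |bin_i - bin_j|.
    """
    by_distance = defaultdict(list)
    for (i, j), count in contacts.items():
        by_distance[abs(i - j)].append(count)
    expected = {d: sum(v) / len(v) for d, v in by_distance.items()}

    ratios = {}
    for (i, j), count in contacts.items():
        background = expected[abs(i - j)]
        ratios[(i, j)] = count / background if background > 0 else 0.0
    return ratios

contacts = {(0, 1): 10, (1, 2): 30, (2, 3): 20, (0, 3): 6, (1, 4): 2}
ratios = distance_normalize(contacts)
```

A ratio above 1 marks a pair that interacts more often than typical pairs at the same distance, which is the kind of signal distance normalization is meant to expose.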
