Knowledge Center - GapTech (北京概普生物科技有限公司)

October bioRxiv Bioinformatics Highlights at a Glance

Bioinformatics Essentials · montreal · October 31, 2018, 01:22

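Many journals now allow authors to post a preprint on bioRxiv before submitting the manuscript to the journal. On May 1 this year, to speed up the sharing and dissemination of research results, the PLOS journals launched a new service: when submitting to a PLOS journal, authors can opt to have the manuscript posted as a preprint on the bioRxiv server at the same time. Last month, the number of preprints posted through this service finally reached 1,000! The original post illustrates the workflow with a flowchart (which, curiously, seems to lack a Rejection step).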

1. 【Transcriptomics】Stanford researchers: APEX-seq makes subcellular transcript localization possible

Atlas of Subcellular RNA Localization Revealed by APEX-seq (CC-BY-ND 4.0)

We introduce APEX-seq, a method for RNA sequencing based on spatial proximity to the peroxidase enzyme APEX2. APEX-seq in nine distinct subcellular locales produced a nanometer-resolution spatial map of the human transcriptome, revealing extensive and exquisite patterns of localization for diverse RNA classes and transcript isoforms. We uncover a radial organization of the nuclear transcriptome, which is gated at the inner surface of the nuclear pore for cytoplasmic export of processed transcripts. We identify two distinct pathways of messenger RNA localization to mitochondria, each associated with specific sets of transcripts for building complementary macromolecular machines within the organelle. APEX-seq should be widely applicable to many systems, enabling comprehensive investigations of the spatial transcriptome.

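Regrettably, perhaps to avoid being scooped, the preprint does not include a methods section, so we can only catch a glimpse of the approach from Figures 1A and 1E.

2. 【Bioinformatics】MiniScrub: a neural network method for screening out low-quality read segments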

MiniScrub: de novo long read scrubbing using approximate alignment and deep learning (This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.)

Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. Many methods for resolving these errors require access to reference genomes or high-fidelity short reads, which are often not available. De novo error correction modules are available, often as part of assembly tools, but large-scale errors still remain in resulting assemblies, motivating further innovation in this area. We developed a novel Convolutional Neural Network (CNN) based method, called MiniScrub, for de novo identification and subsequent "scrubbing" (removal) of low-quality Nanopore read segments. MiniScrub first generates read-to-read alignments by MiniMap, then encodes the alignments into images, and finally builds CNN models to predict low-quality segments that could be scrubbed based on a customized quality cutoff. Applying MiniScrub to real world control datasets under several different parameters, we show that it robustly improves read quality. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub.
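To make the final "scrubbing" step concrete, here is a minimal Python sketch of cutting a read at segments whose predicted quality falls below a cutoff. It is not MiniScrub's code: the function name `scrub_read`, the fixed segment length, and the `min_len` filter are assumptions made for illustration, and the per-segment scores are simply taken as given rather than predicted by a CNN.

```python
def scrub_read(read, segment_scores, segment_len, cutoff, min_len=1):
    """Split a read into the pieces that survive scrubbing.

    segment_scores[i] is the (assumed, externally predicted) quality of
    read[i * segment_len:(i + 1) * segment_len]. Segments scoring below
    `cutoff` are removed, and the read is split at each removed segment.
    """
    pieces = []
    current = []  # consecutive high-quality segments collected so far
    for i, score in enumerate(segment_scores):
        segment = read[i * segment_len:(i + 1) * segment_len]
        if score >= cutoff:
            current.append(segment)
        else:
            # A low-quality segment ends the current piece.
            if current:
                pieces.append("".join(current))
            current = []
    if current:
        pieces.append("".join(current))
    # Drop pieces too short to be useful downstream (hypothetical filter).
    return [p for p in pieces if len(p) >= min_len]
```

Raising the cutoff removes more segments and yields shorter but cleaner pieces, which is the trade-off the abstract refers to as a "customized quality cutoff".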

3. 【Single-cell】New paper from Prof. Mortazavi's group at UC Irvine: linking scATAC-seq and scRNA-seq

Building gene regulatory networks from single-cell ATAC-seq and RNA-seq using Linked Self-Organizing Maps (CC-BY 4.0)

Rapid advances in single-cell assays have outpaced methods for analysis of those data types. Different single-cell assays show extensive variation in sensitivity and signal to noise levels. In particular, single-cell ATAC-seq generates extremely sparse and noisy datasets. Existing methods developed to analyze this data require cells amenable to pseudo-time analysis or require a dataset with drastically different cell-types. We describe a novel approach using self-organizing maps (SOM) to link scATAC-seq and scRNA-seq data that overcomes these limitations and can generate draft regulatory networks. Our SOMatic software package generates chromatin and gene expression SOMs separately and combines them using a linking function. We applied SOMatic on a mouse pre-B cell differentiation time-course using controlled Ikaros over-expression to recover temporal gene ontology enrichments, identify motifs in genomic regions showing similar single-cell profiles, and generate a gene regulatory network that both recovers known interactions and predicts new Ikaros targets during the differentiation process. The ability of linked SOMs to detect emergent properties from multiple types of highly-dimensional genomic data with very different signal properties opens new avenues for integrative analysis of single cells.
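For readers unfamiliar with self-organizing maps, the sketch below fits a generic one-dimensional SOM in plain Python. It only illustrates the basic idea (find the best-matching unit, then pull it and its neighbours toward the sample); it is not SOMatic's implementation, it does not attempt the linking function mentioned in the abstract, and the names `train_som` and `best_matching_unit` as well as the decay schedules are assumptions.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the index of the unit whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)),
    )


def train_som(data, n_units, epochs=50, lr0=0.5, seed=0):
    """Fit a 1-D SOM (units arranged on a line) to equal-length vectors."""
    rng = random.Random(seed)
    # Initialise every unit from a randomly chosen data point.
    weights = [list(rng.choice(data)) for _ in range(n_units)]
    sigma0 = max(n_units / 2.0, 1.0)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood shrinks over time
        for x in rng.sample(data, len(data)):    # visit samples in random order
            bmu = best_matching_unit(weights, x)
            for i in range(n_units):
                # Gaussian neighbourhood: units near the BMU move the most.
                h = math.exp(-((i - bmu) ** 2) / (2.0 * sigma ** 2))
                weights[i] = [w + lr * h * (v - w) for w, v in zip(weights[i], x)]
    return weights
```

In a two-assay setting one would fit one map per data type (for example, one on accessibility profiles and one on expression profiles) and then relate units across the two maps; how SOMatic does that is described in the preprint, not here.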

4. 【Genomics】The jellyfish genome reveals the early evolution of active predation

The jellyfish genome sheds light on the early evolution of active predation (CC-BY 4.0)

Unique among cnidarians, jellyfish have remarkable morphological and biochemical innovations that allow them to actively hunt in the water column. One of the first animals to become free-swimming, jellyfish employ pulsed jet propulsion and venomous tentacles to capture prey. To understand these key innovations, we sequenced the genome of the giant Nomura jellyfish (Nemopilema nomurai), and the transcriptomes of its medusa bell and tentacles and complemented it with transcriptomes across tissues and developmental stages of another jellyfish, Sanderia malayensis. Analyses of Nemopilema and other cnidarian genomes revealed adaptations associated with active swimming and mobile predation, marked by codon bias in muscle contraction and the expansion of neurotransmitter genes. Nemopilema also showed a conservation in cellular chemical homeostasis and ion transport function, probably reflecting the high demand for sodium ions created by their muscle contraction-based locomotion. We also discovered expanded myosin heavy and light chain genes, Wnt genes, posterior Hox genes, and venom domains, possibly contributing to jellyfish mobility, medusa structure formation, and active predation, respectively. Taken together, the jellyfish genome and transcriptomes genetically confirm their unique morphological and physiological traits that have combined to make these animals one of the world's earliest and most successful multi-cellular predators.

(Figure 1a of the original paper)

5. 【Epigenomics】BitMapperBS from Yun Xu's group at the University of Science and Technology of China: faster and more accurate alignment for DNA methylation sequencing

BitMapperBS: a fast and accurate read aligner for whole-genome bisulfite sequencing (CC-BY-NC-ND 4.0)

As a gold-standard technique for DNA methylation analysis, whole-genome bisulfite sequencing (WGBS) helps researchers to study the genome-wide DNA methylation at single-base resolution. However, aligning WGBS reads to the large reference genome is a major computational bottleneck in DNA methylation analysis projects. Although several WGBS aligners have been developed in recent years, it is difficult for them to efficiently process the ever-increasing bisulfite sequencing data. Here we propose BitMapperBS, an ultrafast and memory-efficient aligner that is designed for WGBS reads. To improve the performance of BitMapperBS, we propose various strategies specifically for the challenges that are unique to the WGBS aligners, which are ignored in most existing methods. Our experiments on real and simulated datasets show that BitMapperBS is one order of magnitude faster than the state-of-the-art WGBS aligners, while achieving similar or better sensitivity and precision. BitMapperBS is freely available at https://github.com/chhylp123/BitMapperBS.
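Why is WGBS alignment harder than ordinary alignment? Bisulfite treatment turns unmethylated C into T, so reads no longer match the reference directly. A common workaround in WGBS aligners generally is to convert C to T in both the read and the reference before matching, then compare the original bases to call methylation. The toy Python sketch below illustrates that general idea only; the abstract does not describe BitMapperBS's specific strategies, and the function names, the exact-match search, and the forward-strand-only handling are simplifying assumptions.

```python
def bisulfite_convert(seq):
    """In silico C-to-T conversion (forward strand only)."""
    return seq.upper().replace("C", "T")


def align_and_call(read, reference):
    """Exact-match a bisulfite read in three-letter space, then call methylation.

    Returns (position, calls), where calls maps each reference C covered by
    the read to True (read shows C: methylated) or False (read shows T:
    converted, i.e. unmethylated). Returns None if the read does not match.
    """
    pos = bisulfite_convert(reference).find(bisulfite_convert(read))
    if pos < 0:
        return None
    calls = {}
    for i, base in enumerate(read.upper()):
        if reference[pos + i].upper() == "C":
            if base == "C":
                calls[pos + i] = True
            elif base == "T":
                calls[pos + i] = False
    return pos, calls


result = align_and_call("ACGTTCGA", "TTACGTCCGAAT")
print(result)  # (2, {3: True, 6: False, 7: True})
```

A real aligner also has to handle the reverse strand (G-to-A conversion), mismatches, indels, and a genome-scale index, which is where the computational bottleneck mentioned in the abstract comes from.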

6. 【Genomics】Two papers on songbird genomes

(1) Where did the chromosome go? Alexander Suh of Uppsala University, Sweden, finds a "ghost" chromosome present only in germ cells

Programmed DNA elimination of germline development genes in songbirds (CC-BY 4.0)

Genomes can vary within individual organisms. Programmed DNA elimination leads to dramatic changes in genome organisation during the germline-soma differentiation of ciliates, lampreys, nematodes, and various other eukaryotes. A particularly remarkable example of tissue-specific genome differentiation is the germline-restricted chromosome (GRC) in the zebra finch which is consistently absent from somatic cells. Although the zebra finch is an important animal model system, molecular evidence from its large GRC (>150 megabases) is limited to a short intergenic region and a single mRNA. Here, we combined cytogenetic, genomic, transcriptomic, and proteomic evidence to resolve the evolutionary origin and functional significance of the GRC. First, by generating tissue-specific de-novo linked-read genome assemblies and re-sequencing two additional germline and soma samples, we found that the GRC contains at least 115 genes which are paralogous to single-copy genes on 18 autosomes and the Z chromosome. We detected an amplification of 38 GRC-linked genes into high copy numbers (up to 185 copies) but, surprisingly, no enrichment of transposable elements on the GRC. Second, transcriptome and proteome data provided evidence for functional expression of GRC genes at the RNA and protein levels in testes and ovaries. Interestingly, the GRC is enriched for genes with highly expressed orthologs in chicken gonads and gene ontologies involved in female gonad development. Third, we detected evolutionary strata of GRC-linked genes. Genes such as bicc1 and trim71 have resided on the GRC for tens of millions of years, whereas dozens have become GRC-linked very recently. The GRC is thus likely widespread in songbirds (half of all bird species) and its rapid evolution may have contributed to their diversification. 
Together, our results demonstrate a highly dynamic evolutionary history of the songbird GRC leading to dramatic germline-soma genome differences as a novel mechanism to minimize genetic conflict between germline and soma.

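(2) 【Evolution】How do chromosomes evolve? Prof. Qi Zhou of Zhejiang University dissects the complex evolutionary history of songbird sex chromosomes

Dynamic evolutionary history and gene content of sex chromosomes across diverse songbirds (CC-BY-NC-ND 4.0)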

Songbirds have a species number almost equivalent to that of mammals, and are classic models for studying mechanisms of speciation and sexual selection. Sex chromosomes are hotspots of both processes, yet their evolutionary history in songbirds remains unclear. To elucidate that, we characterize female genomes of 11 songbird species having ZW sex chromosomes, with 5 genomes of bird-of-paradise species newly produced in this work. We conclude that songbird sex chromosomes have undergone at least four steps of recombination suppression before their species radiation, producing a gradient pattern of pairwise sequence divergence termed 'evolutionary strata'. Interestingly, the latest stratum probably emerged due to a songbird-specific burst of retrotransposon CR1-E1 elements at its boundary, or chromosome inversion on the W chromosome. The formation of evolutionary strata has reshaped the genomic architecture of both sex chromosomes. We find stepwise variations of Z-linked inversions, repeat and GC contents, as well as W-linked gene loss rate that are associated with the age of strata. Over 30 W-linked genes have been preserved for their essential functions, indicated by their higher and broader expression of orthologs in lizard than those of other sex-linked genes. We also find a different degree of accelerated evolution of Z-linked genes vs. autosomal genes among different species, potentially reflecting their diversified intensity of sexual selection. Our results uncover the dynamic evolutionary history of songbird sex chromosomes, and provide novel insights into the mechanisms of recombination suppression.

7. 【Epigenomics】How chromatin accessibility affects auxin regulation in Arabidopsis

A network of transcriptional repressors mediates auxin response specificity (CC-BY 4.0)

The regulation of signalling capacity plays a pivotal role in setting developmental patterns in both plants and animals. The hormone auxin is a key signal for plant growth and development that acts through the AUXIN RESPONSE FACTOR (ARF) transcription factors. A subset of these ARFs comprises transcriptional activators of target genes in response to auxin, and are essential for regulating auxin signalling throughout the plant lifecycle. While ARF activators show tissue-specific expression patterns, it is unknown how their expression patterns are established. Chromatin modifications and accessibility studies revealed the chromatin of loci encoding ARF activators is constitutively open for transcription. Using a high-throughput yeast one-hybrid (Y1H) approach, we discovered a network of transcriptional regulators of ARF activator genes from Arabidopsis thaliana. Expression analyses demonstrated that the majority of these regulators act as repressors of ARF transcription in planta. Our observations support a scenario where the default configuration of open chromatin enables a network of transcriptional repressors to shape the expression pattern of ARF activators and provide specificity in auxin signalling output throughout development.

8. Why doesn't the great gerbil get plague? Genome sequencing points to an answer

The genome of the plague-resistant great gerbil reveals species-specific duplication of an MHCII gene (CC-BY 4.0)

(Image credit: Hudong Baike)

Background: The great gerbil (Rhombomys opimus) is a social rodent living in permanent, complex burrow systems distributed throughout Central Asia, where it serves as the main host of several important vector-borne infectious diseases and is defined as a key reservoir species for plague (Yersinia pestis). Studies from the wild have shown that the great gerbil is largely resistant to plague but the genetic basis for resistance is yet to be determined. Results: Here, we present a highly contiguous annotated genome assembly of great gerbil, covering over 96 % of the estimated 2.47 Gb genome. Comparative genomic analyses focusing on the immune gene repertoire, reveal shared gene losses within TLR gene families (i.e. TLR8, TLR10 and all members of TLR11-subfamily) for the Gerbillinae lineage, accompanied with signs of diversifying selection of TLR7 and TLR9. Most notably, we find a great gerbil-specific duplication of the MHCII DRB locus. In silico analyses suggest that the duplicated gene provides high peptide binding affinity for Yersiniae epitopes. Conclusion: The great gerbil genome provides new insights into the genomic landscape that confers immunological resistance towards plague. The high affinity for Yersinia epitopes could be key in our understanding of the high resistance in great gerbils, putatively conferring a faster initiation of the adaptive immune response leading to survival of the infection. Our study demonstrates the power of studying zoonosis in natural hosts through the generation of a genome resource for further comparative and experimental work on plague survival and evolution of host-pathogen interactions.

9. 【Bioinformatics】Olga Troyanskaya (Princeton University): Selene, a deep learning tool for biological sequences

Selene: a PyTorch-based deep learning library for sequence-level data (CC-BY-ND 4.0)

To enable the application of deep learning in biology, we present Selene (https://selene.flatironinstitute.org/), a PyTorch-based deep learning library for fast and easy development, training, and application of deep learning model architectures for any biological sequences. We demonstrate how Selene allows researchers to easily train a published architecture on new data, develop and evaluate a new architecture, and use a trained model to answer biological questions of interest.

Figure 1 of the original paper. Overview of functionality provided in Selene. (a) Selene enables users to train and evaluate new deep learning models with very few lines of code. As input, the library accepts (left) the model architecture, dataset, and (mid) a configuration file that specifies the necessary input data paths and training parameters; Selene automatically splits the data into training and validation/testing, trains the model, evaluates it, and (right) generates figures from the results. (b) Selene also supports the use of trained models to interpret variants. In addition to being able to run variant effect prediction with the same configuration file format, Selene includes a visualization of the variants and their difference scores as a Manhattan plot, where a user can hover over each point to see variant information. (c) Users interested in finding informative bases that contribute to the binding of a certain protein can run Selene to get mutation effect scores and visualize the scores as a heatmap.
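The mutation effect scores in panel (c) come from in silico mutagenesis: mutate every position to every alternative base and record how the model's prediction changes. The plain-Python sketch below shows the idea with a toy scoring function standing in for a trained network. It deliberately does not use Selene's API; `one_hot`, `toy_model`, and `in_silico_mutagenesis` are hypothetical names, and the "model" merely checks for a TATA motif.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element indicator rows (A, C, G, T)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def toy_model(seq):
    """Stand-in for a trained model: 1.0 if the sequence contains TATA, else 0.0."""
    return 1.0 if "TATA" in seq.upper() else 0.0


def in_silico_mutagenesis(seq, model):
    """Score every single-base substitution as model(mutant) - model(reference)."""
    seq = seq.upper()
    ref_score = model(seq)
    scores = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            scores[(pos, alt)] = model(mutant) - ref_score
    return scores


scores = in_silico_mutagenesis("GTATAC", toy_model)
print(scores[(0, "A")], scores[(1, "A")])  # 0.0 -1.0
```

Arranged as a positions-by-bases grid, these differences are what a mutation effect heatmap displays: substitutions inside the motif score -1.0, while flanking positions score 0.0.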

10. 【Transcriptomics】Errors in RNA-seq transcript abundance estimates (CC-BY 4.0)

RNA-sequencing data is widely used to identify disease biomarkers and therapeutic targets. Here, using data from five RNA-seq processing pipelines applied to 6,690 human tumor and normal tissues, we show that for >12% of protein-coding genes, in at least 1% of samples, current RNA-seq processing pipelines differ in their abundance estimates by more than four-fold using the same samples and the same set of RNA-seq reads, raising clinical concern.
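The comparison behind this claim can be sketched in a few lines of Python: for each gene and sample, take the ratio between the largest and smallest abundance estimate across pipelines, then flag genes where that ratio exceeds four-fold in at least 1% of samples. The data layout, the function names, and the pseudocount of 1.0 (used to avoid division by zero) are assumptions for illustration; the preprint's exact procedure may differ.

```python
def max_fold_difference(estimates, pseudocount=1.0):
    """Largest fold difference among pipeline estimates for one gene in one sample."""
    values = [v + pseudocount for v in estimates]
    return max(values) / min(values)


def discordant_genes(table, fold=4.0, min_fraction=0.01):
    """Return genes whose estimates disagree by more than `fold` in at least
    `min_fraction` of samples.

    table maps gene -> sample -> list of abundance estimates, one per pipeline.
    """
    flagged = []
    for gene, samples in table.items():
        n_discordant = sum(
            1 for estimates in samples.values()
            if max_fold_difference(estimates) > fold
        )
        if samples and n_discordant / len(samples) >= min_fraction:
            flagged.append(gene)
    return sorted(flagged)


# Made-up numbers, purely to show the shape of the input.
toy = {
    "GENE_A": {"s1": [10, 12, 11], "s2": [8, 9, 8]},
    "GENE_B": {"s1": [3, 40, 5], "s2": [7, 7, 7]},
}
print(discordant_genes(toy))  # ['GENE_B']
```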

11. 【Review】A review of RNA-seq and differential expression analysis by Mark Robinson, author of edgeR (posted on the PeerJ Preprints server)

RNA sequencing data: hitchhiker's guide to expression analysis

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.
