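At the end of last month, your editor came across an interesting tweet: Alice Soragni, a cancer biologist in the orthopaedics department at the University of California, Los Angeles (UCLA), announced that she and Prachee Avasthi, a cell biologist at Dartmouth College (an Ivy League school), had launched a project called 365preprints: each of them picks one interesting preprint every day and discusses it briefly, for about five minutes.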
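As Professor Avasthi says in the tweet below, although she already reads preprints regularly, the 365preprints project has given them a much broader view of the latest stories emerging in other fields. Our monthly column of quick bioRxiv bioinformatics highlights is limited by space and by your editor's ability and time, so each issue covers only ten papers, but we hope it serves as a starting point and helps readers build the habit of reading preprints.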
[Probability] University of California, Berkeley: applying probabilistic models to single-cell omics
scvi-tools: a library for deep probabilistic analysis of single-cell omics data
Probabilistic models have provided the underpinnings for state-of-the-art performance in many single-cell omics data analysis tasks, including dimensionality reduction, clustering, differential expression, annotation, removal of unwanted variation, and integration across modalities. Many of the models being deployed are amenable to scalable stochastic inference techniques, and accordingly they are able to process single-cell datasets of realistic and growing sizes. However, the community-wide adoption of probabilistic approaches is hindered by a fractured software ecosystem resulting in an array of packages with distinct, and often complex interfaces. To address this issue, we developed scvi-tools (https://scvi-tools.org), a Python package that implements a variety of leading probabilistic methods. These methods, which cover many fundamental analysis tasks, are accessible through a standardized, easy-to-use interface with direct links to Scanpy, Seurat, and Bioconductor workflows. By standardizing the implementations, we were able to develop and reuse novel functionalities across different models, such as support for complex study designs through nonlinear removal of unwanted variation due to multiple covariates and reference-query integration via scArches. The extensible software building blocks that underlie scvi-tools also enable a developer environment in which new probabilistic models for single cell omics can be efficiently developed, benchmarked, and deployed. We demonstrate this through a code-efficient reimplementation of Stereoscope for deconvolution of spatial transcriptomics profiles. By catering to both the end user and developer audiences, we expect scvi-tools to become an essential software dependency and serve to formulate a community standard for probabilistic modeling of single cell omics.
[Ants] University of Pennsylvania: what sets the gene expression atlas of the ant brain apart
Genome annotation with long RNA reads reveals new patterns of gene expression in an ant brain
Functional genomic analyses rely on high-quality genome assemblies and annotations. Highly contiguous genome assemblies have become available for a variety of species, but accurate and complete annotation of gene models, inclusive of alternative splice isoforms and transcription start and termination sites, remains difficult with traditional approaches. Here, we utilized full-length isoform sequencing (Iso-Seq), a long-read RNA sequencing technology, to obtain a comprehensive annotation of the transcriptome of the ant Harpegnathos saltator. The improved genome annotations include additional splice isoforms and extended 3’ untranslated regions for more than 4,000 genes. Reanalysis of RNA-seq experiments using these annotations revealed several genes with caste-specific differential expression and tissue- or caste-specific splicing patterns that were missed in previous analyses. The extended 3’ untranslated regions afforded great improvements in the analysis of existing single-cell RNA-seq data, resulting in the recovery of the transcriptomes of 18% more cells. The deeper single-cell transcriptomes obtained with these new annotations allowed us to identify additional markers for several cell types in the ant brain, as well as genes differentially expressed across castes in specific cell types. Our results demonstrate that Iso-Seq is an efficient and effective approach to improve genome annotations and maximize the amount of information that can be obtained from existing and future genomic datasets in Harpegnathos and other organisms.
[Showdown] Which variant calling pipeline comes out on top? From Saint Petersburg State University, Russia
Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery
Accurate variant detection in the coding regions of the human genome is a key requirement for molecular diagnostics of Mendelian disorders. Efficiency of variant discovery from next-generation sequencing (NGS) data depends on multiple factors, including reproducible coverage biases of NGS methods and the performance of read alignment and variant calling software. In this work, we systematically evaluated the performance of 4 popular short read aligners (Bowtie2, BWA, Isaac, and Novoalign) and 6 variant calling and filtering methods (based on DeepVariant, Genome Analysis ToolKit (GATK), FreeBayes, and Strelka2) using a set of 10 “gold standard” WES and WGS datasets. Our results suggest that Bowtie2 performs significantly worse than other aligners and should not be used for medical variant calling. When other aligners were used, the accuracy of variant discovery mostly depended on the variant caller and not the read aligner. Among the tested variant callers, DeepVariant consistently showed the best performance and the highest robustness, with other state-of-the-art tools, i.e. Strelka2 and GATK with 1D convolutional neural network variant scoring, also showing high performance on both WES and WGS data. The results show surprisingly large differences in the performance of cutting-edge tools even on high confidence regions of the coding genome. This highlights the importance of regular benchmarking of quickly evolving tools and pipelines. Finally, we discuss the need for a more diverse set of gold standard genomes (e.g. of African, Hispanic, or mixed ancestry) that would make it possible to control for deep learning model overfitting. For similar reasons there is a need for better variant caller assessment in the repetitive regions of the coding genome.
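To make the accuracy comparison in this abstract more concrete, here is a minimal Python sketch of how precision, recall, and F1 can be computed for a call set against a gold-standard truth set. The function name `benchmark`, the `(chrom, pos, ref, alt)` tuple layout, and the example variants are all illustrative assumptions and are not taken from the paper. The sketch matches variants by exact identity only; real benchmarking workflows typically also reconcile different representations of the same variant and restrict the comparison to high-confidence regions.

```python
def benchmark(calls, truth):
    """Compare called variants with a truth set by exact match.

    Each variant is a hashable key; (chrom, pos, ref, alt) is assumed here.
    Returns counts of true/false positives and false negatives plus
    precision, recall, and F1.
    """
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # present in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data: four true variants, three of them recovered,
# plus one spurious call.
truth_set = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
             ("chr2", 42, "G", "A"), ("chr2", 900, "T", "C")]
call_set = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
            ("chr2", 42, "G", "A"), ("chr3", 7, "A", "T")]

result = benchmark(call_set, truth_set)
print(result)
```

Running the same function over the output of each aligner and caller combination would give a small table of scores of the kind such a benchmark compares.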
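[Brassica] University of Missouri: evolutionary and population genomic analysis of wild, domesticated, and feral Brassica oleracea (your editor translated this title only after being surprised to find Chinese text in the paper)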
The Evolutionary History of Wild, Domesticated, and Feral Brassica oleracea (Brassicaceae)
Understanding the evolutionary history of crops, including identifying wild relatives, helps to provide insight for designing new approaches in crop breeding efforts. Cultivated Brassica oleracea has intrigued researchers for centuries due to its wide diversity in forms, which include cabbage, broccoli, cauliflower, kale, kohlrabi, and Brussels sprouts. Yet, the evolutionary history of this species remains understudied. With such different vegetables produced from a single species, B. oleracea is a model organism for understanding the power of artificial selection. Persistent challenges in the study of B. oleracea include conflicting hypotheses regarding domestication and the identity of the closest living wild relative. Using a diversity panel of 224 accessions, which represents 14 different B. oleracea crop types and nine potential wild progenitor species, we integrate phylogenetic and population genetic techniques with ecological niche modeling, archaeological, and literary evidence to examine relationships among cultivars and wild relatives to clarify the origin of this horticulturally important species. Our analyses point to the Aegean endemic B. cretica as the closest living relative of cultivated B. oleracea, supporting an origin of cultivation in the Eastern Mediterranean region. Additionally, we identify several feral lineages, suggesting that cultivated plants of this species are able to revert to a wild-like state with relative ease. By expanding our understanding of the evolutionary history in B. oleracea, these results contribute to a growing body of knowledge on crop domestication that will facilitate continued breeding efforts including adaptation to changing environmental conditions.
[Transposons] University of Cambridge researchers: TransposonUltimate, a powerful tool for transposon classification
TransposonUltimate: software for transposon classification, annotation and detection
Motivation: Most genomes harbor a large number of transposons, and they play an important role in evolution and gene regulation. They are also of interest to clinicians as they are involved in several diseases, including cancer and neurodegeneration. Although several methods for transposon identification are available, they are often highly specialised towards specific tasks or classes of transposons, and they lack common standards such as a unified taxonomy scheme and output file format. Moreover, many methods are difficult to install, poorly documented, and difficult to reproduce. Results: We present TransposonUltimate, a powerful bundle of three modules for transposon classification, annotation, and detection of transposition events. TransposonUltimate comes as a Conda package under the GPL-3.0 licence, is well documented, and is easy to install. We benchmark the classification module on the large TransposonDB covering over 891,051 sequences to demonstrate that it outperforms the currently best existing solutions. The annotation and detection modules combine sixteen existing software tools, and we illustrate their use by annotating Caenorhabditis elegans, Rhizophagus irregularis and Oryza sativa subsp. japonica genomes. Finally, we use the detection module to discover 29,554 transposition events in the genomes of twenty wild type strains of Caenorhabditis elegans. Availability: Running software and source code available on https://github.com/DerKevinRiehl/TransposonClassifierRFSB. Databases, assemblies, annotations and further findings can be downloaded from https://cellgeni.cog.sanger.ac.uk/browser.html?shared=transposonultimate.
[Solo] Why are most principal component analyses in population genetics wrong? This paper has a single author: Eran Elhaik of Lund University, Sweden
Why most Principal Component Analyses (PCA) in population genetic studies are wrong
Principal Component Analysis (PCA) is a multivariate analysis that allows reduction of the complexity of datasets while preserving data’s covariance and visualizing the information on colorful scatterplots, ideally with only a minimal loss of information. PCA applications are extensively used as the foremost analyses in population genetics and related fields (e.g., animal and plant or medical genetics), implemented in well-cited packages like EIGENSOFT and PLINK. PCA outcomes are used to shape study design, identify and characterize individuals and populations, and draw historical and ethnobiological conclusions on origins, evolution, whereabouts, and relatedness. The replicability crisis in science has prompted us to evaluate whether PCA results are reliable, robust, and replicable. We employed an intuitive color-based model alongside human population data for eleven common test cases. We demonstrate that PCA results are artifacts of the data and that they can be easily manipulated to generate desired outcomes. PCA results may not be reliable, robust, or replicable as the field assumes. Our findings raise concerns on the validity of results reported in the literature of population genetics and related fields that place a disproportionate reliance upon PCA outcomes and the insights derived from them. We conclude that PCA may have a biasing role in genetic investigations. An alternative mixed-admixture population genetic model is discussed.
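The claim that PCA outcomes depend on how the data are assembled can be illustrated with a tiny Python sketch. This is a two-dimensional toy stand-in and not the paper's color-based model: the three groups, their coordinates, the sample counts, and the helper name `pc1_angle` are all assumptions made for illustration. It computes the direction of the first principal component from the covariance matrix and shows that the direction shifts when one group is oversampled, even though no individual data point has changed.

```python
import math


def pc1_angle(groups):
    """Angle (radians) of the first principal component for 2-D points.

    `groups` maps a label to ((x, y), count), meaning `count` identical
    samples located at (x, y).
    """
    items = list(groups.values())
    n = sum(count for _, count in items)
    mean_x = sum(p[0] * c for p, c in items) / n
    mean_y = sum(p[1] * c for p, c in items) / n
    var_x = sum(c * (p[0] - mean_x) ** 2 for p, c in items) / n
    var_y = sum(c * (p[1] - mean_y) ** 2 for p, c in items) / n
    cov_xy = sum(c * (p[0] - mean_x) * (p[1] - mean_y) for p, c in items) / n
    # Orientation of the leading eigenvector of the 2x2 covariance matrix.
    return 0.5 * math.atan2(2 * cov_xy, var_x - var_y)


# Same three hypothetical groups; only the sample sizes differ.
balanced = pc1_angle({"A": ((0, 0), 10), "B": ((1, 0), 10), "C": ((0, 1), 10)})
unbalanced = pc1_angle({"A": ((0, 0), 10), "B": ((1, 0), 100), "C": ((0, 1), 10)})

print(f"PC1 angle, balanced design:  {math.degrees(balanced):.1f} degrees")
print(f"PC1 angle, group B enlarged: {math.degrees(unbalanced):.1f} degrees")
```

In this toy setup the first axis lies at -45 degrees for the balanced design and rotates by roughly 12 degrees once group B is oversampled, so the same individuals land at different coordinates depending purely on sampling.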
[Motifs] An R package for the well-known motif analysis toolkit, the MEME Suite, is here!
Memes: an R interface to the MEME Suite
Identification of biopolymer motifs represents a key step in the analysis of biological sequences. The MEME Suite is a widely used toolkit for comprehensive analysis of biopolymer motifs; however, these tools are poorly integrated within popular analysis frameworks like the R/Bioconductor project, creating barriers to their use. Here we present memes, an R package which provides a seamless R interface to the MEME Suite. memes provides a novel “data aware” interface to these tools, enabling rapid and complex discriminative motif analysis workflows. In addition to interfacing with popular MEME Suite tools, memes leverages existing R/Bioconductor data structures to store the complex, multidimensional data returned by MEME Suite tools for rapid data access and manipulation. Finally, memes provides data visualization capabilities to facilitate communication of results. memes is available as a Bioconductor package at https://bioconductor.org/packages/memes, and the source code can be found at github.com/snystrom/memes.
[Olfaction] Columbia University researchers: evolving the olfactory system with machine learning
Evolving the Olfactory System with Machine Learning
The role of cell-cell communication in cell fate decision-making has not been well-characterized through a dynamical systems perspective. To do so, here we develop multiscale models that couple cell-cell communication with cell-internal gene regulatory network dynamics. This allows us to study the influence of external signaling on cell fate decision-making at the resolution of single cells. We study the granulocyte-monocyte vs. megakaryocyte-erythrocyte fate decision, dictated by the GATA1-PU.1 network, as an exemplary bistable cell fate system, modeling the cell-internal dynamic with ordinary differential equations and the cell-cell communication via a Poisson process. We show that, for a wide range of cell communication topologies, subtle changes in signaling can lead to dramatic changes in cell fate. We find that cell-cell coupling can explain how populations of heterogeneous cell types can arise. Analysis of intrinsic and extrinsic cell-cell communication noise demonstrates that noise alone can alter the cell fate decision-making boundaries. These results illustrate how external signals alter transcriptional dynamics, and provide insight into hematopoietic cell fate decision-making.
[Sequencing] Long-read whole-genome sequencing of single human cells, from Adam Ameur's lab at Uppsala University, Sweden
Long-read whole genome analysis of human single cells
With long-read sequencing we have entered an era where individual genomes are routinely assembled to near-completion and where complex genetic variation can efficiently be resolved. Here we demonstrate that long reads can be applied also to study the genomic architecture of individual human cells. Clonally expanded CD8+ T-cells from a human donor were used as starting material for a droplet-based multiple displacement amplification (dMDA) method designed to ensure long molecule lengths and minimal amplification bias. Sequencing of two single cells was performed on the PacBio Sequel II system, generating over 2.5 million reads and ~20Gb HiFi data (>QV20) per cell, achieving up to 40% genome coverage. This data allowed for single nucleotide variant (SNV) detection, including in genomic regions inaccessible by short reads. Over 1000 high-confidence structural variants (SVs) per cell were discovered in the PacBio data, which is four times more than the number of SVs detected in Illumina dMDA data from clonally related cells. In addition, several putative clone-specific somatic SV events could be identified. Single-cell de novo assembly resulted in 454-598 Mb assembly sizes and 35-42 kb contig N50 values. 1762 (12.8%) of expected gene models were found to be complete in the best single-cell assembly. The de novo constructed mitochondrial genomes were 100% identical for the two single cells subjected to PacBio sequencing, although mitochondrial heteroplasmy was also observed. In summary, the work presented here demonstrates the utility of long-read sequencing towards understanding the extent and distribution of complex genetic variation at the single cell level.
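For readers unfamiliar with the contig N50 statistic quoted in this abstract, the following Python sketch shows one common way to compute it: sort the contigs from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the toy contig lengths are illustrative assumptions, not values from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    account for at least half of the total assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the assembly
            return length


# Hypothetical contig lengths (arbitrary units).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is often reported alongside total assembly size as a contiguity measure.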
[Intraspecific] How much can genome size vary within a species? Eyebrights as a case study
The nature of intraspecific genome size variation in taxonomically complex eyebrights
Genome size (GS) is a key trait related to morphology, life history, and evolvability. Although GS is, by definition, affected by presence/absence variants (PAVs), which are ubiquitous in population sequencing studies, GS is often treated as an intrinsic property of a species. Here, we studied intra- and interspecific GS variation in taxonomically complex British eyebrights (Euphrasia). We generated GS data for 192 individuals of diploid and tetraploid Euphrasia and analysed GS variation in relation to ploidy, taxonomy, population affiliation, and geography. We further compared the genomic repeat content of 30 samples. We found considerable genuine intraspecific GS variation, and observed isolation-by-distance for GS in outcrossing diploids. Tetraploid Euphrasia showed contrasting patterns, with GS increasing with latitude in outcrossing Euphrasia arctica, but little GS variation in the highly selfing Euphrasia micrantha. Interspecific differences in GS and genomic repeat percentages were small. We show the utility of treating GS as the outcome of polygenic variation. Like other types of genetic variation, such as single nucleotide polymorphisms, GS variation may be increased through hybridisation and population subdivision. In addition to selection on associated traits, GS is predicted to be affected indirectly by selection due to pleiotropy of the underlying PAVs.