October 2020 bioRxiv Bioinformatics Highlights
Bioinformatics Essentials
montreal · November 6, 2020, 20:09
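In the October just past, Trump and a number of White House staff tested positive for COVID-19, adding plenty of drama to the US election in November. Where did the virus sweeping through the "House" come from? Researchers from the University of Washington recently completed genome sequencing and analysis of the virus and, on the last day of October, posted their results as a preprint on the medical preprint server medRxiv (highlight 7). From Trump's diagnosis on October 2 to the paper appearing at the end of the month, the Seattle-based researchers must have spent many sleepless nights.

Mention of the University of Washington may leave some readers puzzled: the United States seems to have two other "Washington universities", Washington University in St. Louis and George Washington University. All of these schools are academically strong and similarly ranked, so they are especially easy to confuse. How do these "Washington universities" differ?

As a general rule, an American university named after a place is called "University of [place]", for example University of California or University of Florida. A university named after a person puts the name first: Harvard University, Yale University and Stanford University are all named after people (Harvard and Yale after benefactors, Stanford after the founders' son). So a look at the English names of the three "Washington universities" makes everything clear:

University of Washington: located in Seattle, the "pearl of the West Coast" and the largest city of Washington State in the northwestern United States. It takes the name Washington from the state.

Washington University in St. Louis: located in St. Louis, Missouri, in the central United States. "In 1854 the board of trustees renamed the school 'Washington Institute' in honor of Washington. In 1857 the name was changed again to 'Washington University'" (Wikipedia).

George Washington University: jokingly called the "Founding Father university", or the Washington university with the purest pedigree, because it bears the full name of the founding father George Washington and sits in downtown Washington, D.C.

Interestingly, all of these can be rendered in Chinese as "华盛顿大学" (Washington University), so they can often be told apart only by adding a location or some other qualifier. This in turn leads many people to assume that they are somehow connected, or even that they are campuses of one university.

Conversely, some cities host several different universities, and the names do not reveal it. Academia has plenty of such cases, but it is rare for one to show up among the 10 preprints in our highlights. This issue finally has an example: the University of Melbourne and Monash University, both in the famous Australian city of Melbourne (highlights 4 and 5). Let's see what the two papers are about.

1. [Assembly] Researchers at the Max Planck Institute for Plant Breeding (Germany) develop a gap-free, chromosome-scale assembly method for long reads

GALA: gap-free chromosome-scale assembly with long reads

High-quality genome assembly has wide applications in genetics and medical studies. However, it is still very challenging to achieve gap-free chromosome-scale assemblies using current workflows of long-read platforms. Here we propose a chromosome-by-chromosome assembly strategy implemented through the multiple-layer computer graph which identifies mis-assemblies within preliminary assemblies or chimeric raw reads and partitions the data into chromosome-scale linkage groups. The subsequent independent assembly of each linkage group generates gap-free assembly free from the mis-assembly errors which usually plague existing workflows. This flexible framework also allows us to integrate data from various technologies, such as PacBio, Nanopore, Hi-C, and the genetic map, to generate gap-free chromosome-scale assembly. We de novo assembled C. elegans and A. thaliana genomes using GALA with combined PacBio and Nanopore sequencing data from publicly available datasets. We also demonstrated its applicability with a gap-free assembly of two chromosomes in the human genome. In addition, GALA showed promising performance for PacBio high-fidelity long reads. Our method enables straightforward assembly of genomes with multiple data sources and multiple computational tools, overcoming barriers that at present restrict the application of de novo genome assembly technology.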
2. [Prehistoric] Ancient DNA sequencing reveals the genomic secrets of a prehistoric giant lemur

Evolutionary and phylogenetic insights from a nuclear genome sequence of the extinct, giant ‘subfossil’ koala lemur Megaladapis edwardsi

No endemic Madagascar animal with body mass >10 kg survived a relatively recent wave of extinction on the island. From morphological and isotopic analyses of skeletal ‘subfossil’ remains we can reconstruct some of the biology and behavioral ecology of giant lemurs (primates; up to ~160 kg), elephant birds (up to ~860 kg), and other extraordinary Malagasy megafauna that survived well into the past millennium. Yet much about the evolutionary biology of these now extinct species remains unknown, along with persistent phylogenetic uncertainty in some cases. Thankfully, despite the challenges of DNA preservation in tropical and sub-tropical environments, technical advances have enabled the recovery of ancient DNA from some Malagasy subfossil specimens. Here we present a nuclear genome sequence (~2X coverage) for one of the largest extinct lemurs, the koala lemur Megaladapis edwardsi (~85 kg). To support the testing of key phylogenetic and evolutionary hypotheses we also generated new high-coverage complete nuclear genomes for two extant lemur species, Eulemur rufifrons and Lepilemur mustelinus, and we aligned these sequences with previously published genomes for three other extant lemur species and 47 non-lemur vertebrates. Our phylogenetic results confirm that Megaladapis is most closely related to the extant Lemuridae (typified in our analysis by E. rufifrons) to the exclusion of L. mustelinus, which contradicts morphology-based phylogenies. Our evolutionary analyses identified significant convergent evolution between M. edwardsi and extant folivorous primates (colobine monkeys) and ungulate herbivores (horses) in genes encoding protein products that function in the biodegradation of plant toxins and nutrient absorption. These results suggest that koala lemurs were highly adapted to a leaf-based diet, which may also explain their convergent craniodental morphology with the small-bodied folivore Lepilemur.
3. [Scoring] SoftWipe, a tool for assessing the quality of scientific software

SoftWipe – a tool and benchmark to assess scientific software quality

Scientific software from all areas of scientific research is pivotal to obtaining novel insights. Yet the quality of scientific software is rarely assessed, even though it might lead to incorrect scientific results in the worst case. Therefore, we have developed an open source tool and benchmark called SoftWipe, that provides a relative software quality ranking of 51 computational tools from diverse research areas. SoftWipe can be used in the review process of software papers and to inform the scientific software selection process.

4. [Melbourne x1] Researchers at The University of Melbourne: a multivariate method for batch-effect correction in microbiome data

A multivariate method to correct for batch effects in microbiome data

Microbial communities are highly dynamic and sensitive to changes in the environment. Thus, microbiome data are highly susceptible to batch effects, defined as sources of unwanted variation that are not related to, and obscure any factors of interest. Existing batch correction methods have been primarily developed for gene expression data. As such, they do not consider the inherent characteristics of microbiome data, including zero inflation, overdispersion and correlation between variables. We introduce a new multivariate and non-parametric batch correction method based on Partial Least Squares Discriminant Analysis. PLSDA-batch first estimates treatment and batch variation with latent components to then subtract batch variation from the data. The resulting batch effect corrected data can then be input in any downstream statistical analysis. Two variants are also proposed to handle unbalanced batch x treatment designs and to include variable selection during component estimation. We compare our approaches with existing batch correction methods removeBatchEffect and ComBat on simulated data and three case studies. We show that our three methods lead to competitive performance in removing batch variation while preserving treatment variation, and especially when batch effects have high variability. Reproducible code and vignettes are available on GitHub.

5. [Melbourne x2] Researchers at Monash University: regional heterogeneity in the brain's transcriptional landscape

Dynamical consequences of regional heterogeneity in the brain’s transcriptional landscape

Brain regions vary in their molecular and cellular composition, but how this heterogeneity shapes neuronal dynamics is unclear. Here, we investigate the dynamical consequences of regional heterogeneity using a biophysical model of whole-brain functional magnetic resonance imaging (MRI) dynamics in humans. We show that models in which transcriptional variations in excitatory and inhibitory receptor (E:I) gene expression constrain regional heterogeneity more accurately reproduce the spatiotemporal structure of empirical functional connectivity estimates than do models constrained by global gene expression profiles and MRI-derived estimates of myeloarchitecture. We further show that regional heterogeneity is essential for yielding both ignition-like dynamics, which are thought to support conscious processing, and a wide variance of regional activity timescales, which supports a broad dynamical range. We thus identify a key role for E:I heterogeneity in generating complex neuronal dynamics and demonstrate the viability of using transcriptional data to constrain models of large-scale brain function.

6. [Archaea] Li Meng's group at Shenzhen University: 75 new Asgard archaeal genomes suggest there is more to the origin of eukaryotes than meets the eye

Expanding diversity of Asgard archaea and the elusive ancestry of eukaryotes

Comparative analysis of 162 (nearly) complete genomes of Asgard archaea, including 75 not reported previously, substantially expands the phylogenetic and metabolic diversity of the Asgard superphylum, with six additional phyla proposed. Phylogenetic analysis does not strongly support origin of eukaryotes from within Asgard, leaning instead towards a three-domain topology, with eukaryotes branching outside archaea. Comprehensive protein domain analysis in the 162 Asgard genomes results in a major expansion of the set of eukaryote signature proteins (ESPs). The Asgard ESPs show variable phyletic distributions and domain architectures, suggestive of dynamic evolution via horizontal gene transfer (HGT), gene loss, gene duplication and domain shuffling. The results appear best compatible with the origin of the conserved core of eukaryote genes from an unknown ancestral lineage deep within or outside the extant archaeal diversity. Such hypothetical ancestors would accumulate components of the mobile archaeal ‘eukaryome’ via extensive HGT, eventually giving rise to eukaryote-like cells.

7. Viral genome sequencing places White House COVID-19 outbreak into phylogenetic context

In October 2020, an outbreak of at least 50 COVID-19 cases was reported surrounding individuals employed at or visiting the White House. Here, we applied genomic epidemiology to investigate the origins of this outbreak. We enrolled two individuals with exposures linked to the White House COVID-19 outbreak into an IRB-approved research study and sequenced their SARS-CoV-2 infections. We find these viral sequences are highly genetically similar to each other, but are distinct from over 160,000 publicly available SARS-CoV-2 genomes, possessing 5 nucleotide mutations that differentiate this lineage from all other circulating lineages sequenced to date. We estimate this lineage has a common ancestor in the USA in April or May 2020, but its whereabouts for the past 5 to 6 months are not clear. Looking forwards, sequencing of additional community SARS-CoV-2 infections collected in the USA prior to October 2020 may reveal linked infections and shed light on its geographic ancestry. In sequencing of SARS-CoV-2 infections collected after October 2020, the relative rarity of this constellation of mutations may make it possible to identify infections that likely descend from the White House COVID-19 outbreak.
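The White House outbreak abstract above reports nucleotide mutations that set the outbreak lineage apart from every other sequenced genome. As a rough illustration of what such a comparison means, the sketch below scans already-aligned sequences for positions where all outbreak genomes share a base that no background genome carries. The function name `lineage_defining_sites` and the toy sequences are made up for this example. This is not the authors' pipeline, which works on full-length genomes and phylogenetic trees.

```python
from typing import List, Tuple


def lineage_defining_sites(
    outbreak: List[str], background: List[str]
) -> List[Tuple[int, str]]:
    """Return (1-based position, base) pairs where every outbreak sequence
    carries the same unambiguous base and no background sequence does.

    Assumes all sequences are already aligned to the same length.
    """
    if not outbreak:
        raise ValueError("need at least one outbreak sequence")
    length = len(outbreak[0])
    if any(len(seq) != length for seq in outbreak + background):
        raise ValueError("sequences must be aligned to the same length")

    sites = []
    for i in range(length):
        outbreak_bases = {seq[i] for seq in outbreak}
        if len(outbreak_bases) != 1:
            continue  # outbreak genomes disagree at this position
        base = next(iter(outbreak_bases))
        if base not in "ACGT":
            continue  # skip gaps and ambiguous calls such as N
        if all(seq[i] != base for seq in background):
            sites.append((i + 1, base))
    return sites


# Toy data, invented for illustration only (real genomes are ~30,000 bases).
outbreak_seqs = ["ACGTTAGC", "ACGTTAGC"]
background_seqs = ["ACGTAAGC", "ACCTAAGC", "ACGTAAGT"]

print(lineage_defining_sites(outbreak_seqs, background_seqs))
```

In this toy alignment only position 5 qualifies, because it is the only column where both outbreak sequences agree on a base that is absent from the whole background set. One simplification to keep in mind: an ambiguous background call such as N is treated here as "not carrying the base", which a real analysis would handle more carefully.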
8. [Deletion] Savage lab at the University of California, Berkeley: deletions from a single codon up to the scale of the whole gene (making every possible deletion across a gene)

Comprehensive deletion landscape of CRISPR-Cas9 identifies minimal RNA-guided DNA-binding modules

Proteins evolve through the modular rearrangement of elements known as domains. It is hypothesized that extant, multidomain proteins are the result of domain accretion, but there has been limited experimental validation of this idea. Here, we introduce a technique for genetic minimization by iterative size-exclusion and recombination (MISER) that comprehensively assays all possible deletions of a protein. Using MISER, we generated a deletion landscape for the CRISPR protein Cas9. We found that Cas9 can tolerate large single deletions to the REC2, REC3, HNH, and RuvC domains, while still functioning in vitro and in vivo, and that these deletions can be stacked together to engineer minimal, DNA-binding effector proteins. In total, our results demonstrate that extant proteins retain significant modularity from the accretion process and, as genetic size is a major limitation for viral delivery systems, establish a general technique to improve genome editing and gene therapy-based therapeutics.

9. [Astonishing] Italian researchers: intercellular telomere transfer extends T cell lifespan

Intercellular telomere transfer extends T cell lifespan

The common view is that T-lymphocytes activate telomerase, a DNA polymerase that extends telomeres at chromosome ends, to delay senescence. We show that independently of telomerase, T cells elongate telomeres by acquiring telomere vesicles from antigen-presenting cells (APCs). Upon contact with T cells, APCs degraded shelterin to donate telomeres, which were cleaved by TZAP, and then transferred in extracellular vesicles (EVs) at the immunological synapse. Telomere vesicles retained the Rad51 recombination factor that enabled them to fuse with T cell chromosomal ends causing an average lengthening of ∼3000 base pairs. Thus, we identify a previously unknown telomere transfer program that supports T cell lifespan.

10. Creating Clear and Informative Image-based Figures for Scientific Publications

Scientists routinely use images to display data. Readers often examine figures first; therefore, it is important that figures are accessible to a broad audience. Many resources discuss fraudulent image manipulation and technical specifications for image acquisition; however, data on the legibility and interpretability of images are scarce. We systematically examined these factors in non-blot images published in the top 15 journals in three fields: plant sciences, cell biology and physiology. Common problems included missing scale bars, misplaced or poorly marked insets, images or labels that were not accessible to colorblind readers, and insufficient explanations of colors, labels, annotations, or the species and tissue or object depicted in the image. Papers that met all good practice criteria examined for all image-based figures were uncommon (physiology 16%, cell biology 12%, plant sciences 2%). We present detailed descriptions and visual examples to help scientists avoid common pitfalls when publishing images. Our recommendations address image magnification, scale information, insets, annotation, and color and may encourage discussion about quality standards for bioimage publishing.