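Last month, the Suez Canal became the focus of the world's attention. As the story spread, the news photos also became raw material for memes.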
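Mick Watson, a professor of bioinformatics at the University of Edinburgh, posted a meme joking that RefSeq, NCBI's reference sequence database, had been rammed by the giant ship of metagenomics. Indeed, by analysing the DNA of microbes in an environment far more comprehensively, metagenomics has shaken up traditional microbial genomics, which relies on clonal culture. But anything new (although metagenomics hardly deserves that label any more) comes with controversy, and one point of contention is what counts as appropriate use of its data. Last month, Roland Hatzenpichler, a microbiologist at Montana State University, tweeted that a preprint on bioRxiv had improperly used more than 15,000 datasets. These datasets are marked as public on NCBI, yet 95% of them were still under embargo (a period during which use is restricted). Worse, when Hatzenpichler contacted the authors, they told him they would not make any changes. Hatzenpichler was furious and declared that "the PI is STEALING data!"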
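Soon afterwards, one of the preprint's authors responded online, saying they had declined to make changes because doing so would retroactively (digging up old history, so to speak) undermine the paper's use of data that had already been marked as public.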
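It should be stressed that this is not the first time Hatzenpichler has "opened fire" on this issue. This February, he criticised a paper published in the well-known journal Science Advances, saying it had used data sequenced at JGI, including data from his own lab, that were still under embargo. The paper was swiftly retracted after publication as a result. His reasoning then was almost identical.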
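Disputes like this are hard to adjudicate to begin with, and since this editor does not work in the field, it is hard to say who is right. Still, after some digging (the following is not guaranteed to be fully accurate; corrections are welcome), the crux seems to be this. Hatzenpichler had metagenomic samples collected by his own lab sequenced through the Joint Genome Institute (JGI), which is part of the U.S. Department of Energy. Data at JGI are subject to an embargo period, during which others may not publish on them without permission. JGI also has another requirement: data must be made public as soon as they are sequenced, so JGI may upload them to NCBI. Interestingly, NCBI generally only stores whatever is deposited and takes no responsibility beyond that. If the data carry no embargo, all the better; if they do, the original embargo period still applies, but NCBI does not indicate whether an embargo exists. That is the strangest part of the whole affair.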
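So how exactly does the JGI embargo work? From November 2018 onwards, data may be used once they have been public for two years or once they have been published in a paper; before November 2018, data may be used if and only if they have been published. See [1] for details. Of course, if neither condition is met, you can still contact the PI, and use is fine if they agree.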
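The rule as summarised above can be written down as a small decision function. This is only a sketch of this article's reading of the policy, not JGI's official logic: the function and parameter names are hypothetical, the exact cutoff day in November 2018 is an assumption (the text gives only the month), and so is the choice to key the rule on the date the data became public.

```python
from datetime import date

# Assumption: the text says only "November 2018"; the 1st is a placeholder.
POLICY_CHANGE = date(2018, 11, 1)
EMBARGO_YEARS = 2


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 Feb to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def may_use(made_public: date, today: date,
            published_by_owner: bool = False,
            pi_consent: bool = False) -> bool:
    """Return True if the rule as summarised in the text would allow reuse.

    Hypothetical helper: it encodes this article's summary only.
    """
    # Publication by the data owner or explicit PI consent always unlocks use.
    if published_by_owner or pi_consent:
        return True
    # Newer regime (assumed to apply by release date): two years after release.
    if made_public >= POLICY_CHANGE:
        return today >= add_years(made_public, EMBARGO_YEARS)
    # Older regime: only publication (handled above) unlocks the data.
    return False
```

For example, `may_use(date(2019, 1, 15), date(2020, 12, 31))` is false under these assumptions, because the two-year window has not yet elapsed and neither publication nor consent is given.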
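As an outsider, I think the biggest problem is that JGI and NCBI have not done a good job of labelling and standardisation. JGI in particular: you require the data to be public immediately, you insist on keeping an embargo, and you also upload the data to a place (NCBI) that carries no notice of it. Whatever others may think, to me it feels a bit like JGI wants to have its cake and... how does the saying go?
Right: and eat it too!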
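Reportedly, JGI has already been in contact with NCBI about the matter, and hopefully the two will soon reach an agreement and resolve it properly. Since we are on the subject of metagenomics, this month's roundup of notable bioinformatics preprints on bioRxiv deliberately includes two papers on the topic. The Tara Oceans marine microbiome research they draw on has been covered in an earlier 生信人 column. More good news: last December the Tara set sail again on a scientific expedition of 130,000 nautical miles along South America and Africa to the South Atlantic. Interested readers can follow the link at the end of this article.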
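1. [TARA] Université Paris-Saclay: a comprehensive Tara Oceans survey points to the important role of heterotrophic nitrogen-fixing bacteria in the nitrogen cycle of surface seawater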
Heterotrophic bacterial diazotrophs are more abundant than their cyanobacterial counterparts in metagenomes covering most of the sunlit ocean
Biological nitrogen fixation is a major factor contributing to microbial primary productivity in the open ocean. The current view depicts a few cyanobacterial diazotrophs as the most relevant marine nitrogen fixers, whereas heterotrophic diazotrophs are more diverse and considered to have lower impacts on the nitrogen balance. Here, we used 891 Tara Oceans metagenomes to create a manually curated, non-redundant genomic database corresponding to free-living, as well as filamentous, colony-forming, particle-attached and symbiotic bacterial and archaeal populations occurring in the surface of five oceans and two seas. Notably, the database provided the genomic content of eight cyanobacterial diazotrophs including Trichodesmium populations and a newly discovered population similar to Richelia, as well as 40 heterotrophic bacterial diazotrophs organized into three main functional groups that considerably expand the known diversity of abundant marine nitrogen fixers compared to previous genomic surveys. Critically, these 48 populations may account for more than 90% of cells containing known nifH genes and occurring in the sunlit ocean, suggesting that the genomic characterization of the most abundant marine diazotrophs may be nearing completion. The newly identified heterotrophic bacterial diazotrophs are widespread, express their nifH genes in situ, and co-occur under nitrate-depleted conditions in large size fractions where they might form aggregates providing the low-oxygen microenvironments required for nitrogen fixation. Most significantly, we found heterotrophic bacterial diazotrophs to be more abundant than cyanobacterial diazotrophs in most metagenomes from the open oceans and seas. This large-scale environmental genomic survey emphasizes the considerable potential of heterotrophs in the marine nitrogen balance.
2. [Biosynthesis] Mining ocean microbiome data hints at the vast biosynthetic potential of marine bacteria (from Shinichi Sunagawa, ETH Zurich, Switzerland)
Uncharted biosynthetic potential of the ocean microbiome
Microbes are phylogenetically and metabolically diverse. Yet capturing this diversity, assigning functions to host organisms and exploring the biosynthetic potential in natural environments remains challenging. We reconstructed >25,000 draft genomes, including from >2,500 uncharacterized species, from globally-distributed ocean microbial communities, and combined them with ∼10,000 genomes from cultivated and single cells. Mining this resource revealed ∼40,000 putative biosynthetic gene clusters (BGCs), many from unknown phylogenetic groups. Among these, we discovered Candidatus Eudoremicrobiaceae as one of the most biosynthetically diverse microbes detected to date. Discrete transcriptional states structuring natural populations were associated with a potentially niche-partitioning role for BGC products. Together with the characterization of the first Eudoremicrobiaceae natural product, this study demonstrates how microbiomics enables prospecting for candidate bioactive compounds in underexplored microbes and environments.
3. [Compensation] Max Planck Institute for Molecular Genetics: a new look at the mechanism of X-chromosome dosage compensation
Distal and proximal cis-regulatory elements sense X-chromosomal dosage and developmental state at the Xist locus
Developmental genes such as Xist, the master regulator of X-chromosome inactivation (XCI), are controlled by complex cis-regulatory landscapes, which decode multiple signals to establish specific spatio-temporal expression patterns. Xist integrates information on X-chromosomal dosage and developmental stage to trigger XCI at the primed pluripotent state in females only. Through a pooled CRISPR interference screen in differentiating mouse embryonic stem cells, we identify functional enhancer elements of Xist during the onset of random XCI. By quantifying how enhancer activity is modulated by X-dosage and differentiation, we find that X-dosage controls the promoter-proximal region in a binary switch-like manner. By contrast, differentiation cues activate a series of distal elements and bring them into closer spatial proximity of the Xist promoter. The strongest distal element is part of an enhancer cluster ∼200 kb upstream of the Xist gene which is associated with a previously unannotated Xist-enhancing regulatory transcript, we named Xert. Developmental cues and X-dosage are thus decoded by distinct regulatory regions, which cooperate to ensure female-specific Xist upregulation at the correct developmental time. Our study is the first step to disentangle how multiple, functionally distinct regulatory regions interact to generate complex expression patterns in mammals.
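4. [Altitude] Kunlun He (Chinese PLA General Hospital) and Manolis Kellis (MIT): sequencing 300+ genomes sheds further light on how Tibetans adapt to high altitude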
Structural variant selection for high-altitude adaptation using single-molecule long-read sequencing
Structural variants (SVs) can be important drivers of human adaptation with strong effects, but previous studies have focused primarily on common variants with weak effects. Here, we used large-scale single-molecule long-read sequencing of 320 Tibetan and Han samples, to show that SVs are key drivers of selection under high-altitude adaptation. We expand the landscape of global SVs, apply robust models of selection and population differentiation combining SVs, SNPs and InDels, and use epigenomic analyses to predict driver enhancers, target genes, upstream regulators, and biological functions, which we validate using enhancer reporter and DNA pull-down assays. We reveal diverse Tibetan-specific SVs affecting the cis- and trans-regulatory circuitry of diverse biological functions, including hypoxia response, energy metabolism, lung function, etc. Our study greatly expands the global SV landscape, reveals the central role of gene-regulatory circuitry rewiring in human adaptation, and illustrates the diverse functional roles that SVs can play in human biology.
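5. [Symmetry] Ziniu Yu (South China Sea Institute of Oceanology, Chinese Academy of Sciences): why are bivalve shells left-right asymmetric? Comparative genomics offers an answer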
Comparative genomics reveals evolutionary drivers of sessile life and left-right shell asymmetry in bivalves
Bivalves are species-rich mollusks with prominent protective roles in coastal ecosystems. Across these ancient lineages, colony-founding larvae anchor themselves either by byssus production or by cemented attachment. The latter mode of sessile life is strongly molded by left-right shell asymmetry during larval development of Ostreoida oysters such as Crassostrea hongkongensis. Here, we sequenced the genome of C. hongkongensis in high resolution and compared it to reference bivalve genomes to unveil genomic determinants driving cemented attachment and shell asymmetry. Importantly, loss of the homeobox gene antennapedia (Antp) and broad expansion of lineage-specific extracellular gene families are implicated in a shift from byssal to cemented attachment in bivalves. Evidence from comparative transcriptomics shows that the left-right asymmetrical C. hongkongensis plausibly diverged from the symmetrical Pinctada fucata in expression profiles marked by elevated activities of orthologous transcription factors and lineage-specific shell-related gene families including tyrosinases, which may cooperatively govern asymmetrical shell formation in Ostreoida oysters.
6. [Details] The ins and outs of gene co-expression analysis with RNA-seq
Guidance for RNA-seq co-expression estimates: the importance of data normalization, batch effects, and correlation measures
We conducted a systematic analysis of 50 different data processing workflows and applied them on RNA-seq data of 68 human and 76 mouse cell types and tissues. We analyzed the resulting 7,200 gene co-expression networks and identified the factors that contribute to their quality focusing on data normalization, batch effect correction and the measure of correlation. We confirmed the key importance of large sample counts for generating high-quality networks. However, choosing a suitable normalization approach and applying batch effect correction can further improve the quality of co-expression networks, equivalent to a >70% and >40% increase in samples count. Finally, Pearson correlation appears more suitable than Spearman correlation, except for smaller datasets.
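To make the Pearson versus Spearman distinction mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's workflow: the two "genes" and their expression values are made up for illustration, and the helper names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up expression values across six samples (illustration only).
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]  # monotonic in gene_a, but not linear

print("Pearson :", pearson(gene_a, gene_b))
print("Spearman:", spearman(gene_a, gene_b))
```

Because `gene_b` rises monotonically with `gene_a`, the two rank vectors agree exactly and Spearman is 1 up to floating-point rounding, while Pearson falls below 1 because the relationship is not linear. Which measure suits a given dataset is the empirical question the paper addresses; this toy example does not settle it.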
7. [Green] Baker Heart and Diabetes Institute, Australia: the environmental cost of bioinformatics
The carbon footprint of bioinformatics
Bioinformatic research relies on large-scale computational infrastructures which have a non-zero carbon footprint. So far, no study has quantified the environmental costs of bioinformatic tools and commonly run analyses. In this study, we estimate the bioinformatic carbon footprint (in kilograms of CO2 equivalent units, kgCO2e) using the freely available Green Algorithms calculator (www.green-algorithms.org). We assess (i) bioinformatic approaches in genome-wide association studies (GWAS), RNA sequencing, genome assembly, metagenomics, phylogenetics and molecular simulations, as well as (ii) computation strategies, such as parallelisation, CPU (central processing unit) vs GPU (graphics processing unit), cloud vs. local computing infrastructure and geography. In particular, for GWAS, we found that biobank-scale analyses emitted substantial kgCO2e and simple software upgrades could make GWAS greener, e.g. upgrading from BOLT-LMM v1 to v2.3 reduced carbon footprint by 73%. Switching from the average data centre to a more efficient data centres can reduce carbon footprint by ~34%. Memory over-allocation can be a substantial contributor to an algorithm’s carbon footprint. The use of faster processors or greater parallelisation reduces run time but can lead to, sometimes substantially, greater carbon footprint. Finally, we provide guidance on how researchers can reduce power consumption and minimise kgCO2e. Overall, this work elucidates the carbon footprint of common analyses in bioinformatics and provides solutions which empower a move toward greener research.
8. Live-seq: making scRNA-seq of living cells possible, from the École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Genome-wide molecular recording using Live-seq
Single-cell transcriptomics (scRNA-seq) has greatly advanced our ability to characterize cellular heterogeneity in health and disease. However, scRNA-seq requires lysing cells, which makes it impossible to link the individual cells to downstream molecular and phenotypic states. Here, we established Live-seq, an approach for single-cell transcriptome profiling that preserves cell viability during RNA extraction using fluidic force microscopy. Based on cell division, functional responses and whole-cell transcriptome read-outs, we show that Live-seq does not induce major cellular perturbations and therefore can function as a transcriptomic recorder. We demonstrate this recording capacity by preregistering the transcriptomes of individual macrophage-like RAW 264.7 cells that were subsequently subjected to time-lapse imaging after lipopolysaccharide (LPS) exposure. This enabled the unsupervised, genome-wide ranking of genes based on their ability to impact macrophage LPS response heterogeneity, revealing basal NFKBIA expression level and cell cycle state as major phenotypic determinants. Furthermore, we show that Live-seq can be used to sequentially profile the transcriptomes of individual macrophages before and after stimulation with LPS, thus enabling the direct mapping of a cell’s trajectory. Live-seq can address a broad range of biological questions by transforming scRNA-seq from an end-point to a temporal analysis approach.
9. Wellcome Sanger Institute, UK: spatial transcriptomics of the developing neocortex
Transcriptome-wide spatial RNA profiling maps the cellular architecture of the developing human neocortex
Spatial genomic technologies can map gene expression in tissues, but provide limited potential for transcriptome-wide discovery approaches and application to fixed tissue samples. Here, we introduce the GeoMX Whole Transcriptome Atlas (WTA), a new technology for transcriptome-wide spatial profiling of tissues with cellular resolution. WTA significantly expands the Digital Spatial Profiling approach to enable in situ hybridisation against 18,190 genes at high-throughput using a sequencing readout. We applied WTA to generate the first spatial transcriptomic map of the fetal human neocortex, validating transcriptome-wide spatial profiling on formalin-fixed tissue material and demonstrating the spatial enrichment of autism gene expression in deep cortical layers. To demonstrate the value of WTA for cell atlasing, we integrated single-cell RNA-sequencing (scRNA-seq) and WTA data to spatially map dozens of neural cell types and showed that WTA can be used to directly measure cell type specific transcriptomes in situ. Moreover, we developed computational tools for background correction of WTA data and accurate integration with scRNA-seq. Our results present WTA as a versatile transcriptome-wide discovery tool for cell atlasing and fixed tissue spatial transcriptomics.
10. Ensembl's dedicated resource for SARS-CoV-2 research
The Ensembl COVID-19 resource: Ongoing integration of public SARS-CoV-2 data
The COVID-19 pandemic has seen unprecedented use of SARS-CoV-2 genome sequencing for epidemiological tracking and identification of emerging variants. Understanding the potential impact of these variants on the infectivity of the virus and the efficacy of emerging therapeutics and vaccines has become a cornerstone of the fight against the disease. To support the maximal use of genomic information for SARS-CoV-2 research, we launched the Ensembl COVID-19 browser, incorporating a new Ensembl gene set, multiple variant sets (including novel variation calls), and annotation from several relevant resources integrated into the reference SARS-CoV-2 assembly. This work included key adaptations of existing Ensembl genome annotation methods to model ribosomal slippage, stringent filters to elucidate the highest confidence variants and utilisation of our comparative genomics pipelines on viruses for the first time. Since May 2020, the content has been regularly updated and tools such as the Ensembl Variant Effect Predictor have been integrated. The Ensembl COVID-19 browser is freely available at https://covid-19.ensembl.org.
References
1. JGI Data Release Policy. https://twitter.com/environmicrobio/status/1377377728098230273
2. Montreal. A modern-day Beagle: a quick look at the voyage of the Tara and its plankton microbiome project. 生信人 (2020).
3. The Tara Microbiome Mission. https://oceans.taraexpeditions.org/en/m/science/news/microbiomes-mission