Knowledge Center - 北京概普生物科技有限公司 (GapTech)

A Quick Look at Noteworthy Bioinformatics Preprints on bioRxiv, April 2019

Bioinformatics Essentials · montreal · May 7, 2019, 19:51

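Early last month, the iPlants WeChat account reported on an interesting pair of preprints: two papers that independently reported major progress on the biosynthesis of the plant hormone salicylic acid [1]. The first, a collaboration between teams at the University of Goettingen in Germany and the University of British Columbia in Canada, was posted on April 5 [2]. The second, from Jing-Ke Weng's group at MIT [3], was posted on April 7 and is very close in content to the German-Canadian paper [1].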

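As a complete outsider to this field, I looked over the two papers and the related information and came away with three impressions. First, both papers read as somewhat hastily written. Second, neither paper selected a license among the copyright-sharing options that bioRxiv offers. Third, the two papers appeared only two days apart, and it is hard to believe that this was pure coincidence.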

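Taken together, I think it is not hard to see the fierce competition behind the near-simultaneous posting of the two papers on bioRxiv. The benefit of posting there is obvious: it bypasses the slow and often obstructive peer-review process to stake a claim to the discovery first, and bioRxiv brings the work into public view.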

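In fact, establishing the priority of one's findings is one of the main functions of preprint platforms, bioRxiv included. It can help shield scientists from unfair suppression during peer review, which may matter even more for young researchers. Recall the 2014 Nobel Prize in Chemistry, which the American scientist Eric Betzig shared with two other scientists for the invention of PALM (Photoactivated Localization Microscopy). Xiaowei Zhuang, the brilliant Chinese-American scientist and Harvard professor who invented STORM (Stochastic Optical Reconstruction Microscopy), a technique similar in principle to PALM, regrettably missed out on the prize. Zhuang's paper was published online in Nature Methods on August 9, 2006, one day before Betzig's paper appeared online in Science. Zhuang's supporters argued that, judging by the dates, she deserved a share. The Nobel Committee responded that it had considered Betzig's earlier contributions to the field, and that Betzig had clearly submitted first: he sent his manuscript to Science in March of that year and spent more than four months in review, whereas Zhuang submitted in early July and was accepted in under a month [4]. The matter remains controversial to this day. I wonder whether there would have been less gossip had bioRxiv existed back then. That is enough for now; below is last month's roundup of noteworthy bioinformatics preprints on bioRxiv.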

1. [Sequencing] California company GenapSys launches a portable sequencer based on sequencing-by-synthesis and electronic detection

High accuracy DNA sequencing on a small, scalable platform via electrical detection of single base incorporations

High throughput DNA sequencing technologies have undergone tremendous development over the past decade. Although optical detection-based sequencing has constituted the majority of data output, it requires a large capital investment and aggregation of samples to achieve optimal cost per sample. We have developed a novel electronic detection-based platform capable of accurately detecting single base incorporations. The GenapSys technology with its electronic detection modality allows the system to be compact, accessible, and affordable. We demonstrate the performance of the system by sequencing several different microbial genomes with varying GC content. The platform is capable of generating 1.5 Gb of high-quality nucleic acid sequence in a single run. We routinely generate sequence data that exceeds 99% raw accuracy with read lengths of up to 175 bp. The utility of the platform is highlighted by targeted sequencing of the human genome. We show high concordance of SNP detection on the human NA12878 HapMap cell line with data generated on the Illumina sequencing platform. In addition, we sequenced a targeted panel of cancer-associated genes in a well characterized reference standard. With multiple library preparation approaches on this sample, we were able to identify low frequency mutations at expected allele frequencies.

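2. [Bioinformatics] Yadong Wang's group at Harbin Institute of Technology develops a new de Bruijn graph-based method for aligning long transcriptomic reads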

deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index (CC-BY-NC 4.0)

Long-read RNA sequencing (RNA-seq) is promising to transcriptomics studies, however, the alignment of the reads is still a fundamental but non-trivial task due to the sequencing errors and complicated gene structures. We propose deSALT, a tailored two-pass long RNA-seq read alignment approach, which constructs graph-based alignment skeletons to sensitively infer exons, and use them to generate spliced reference sequence to produce refined alignments. deSALT addresses several difficult issues, such as small exons, serious sequencing errors and consensus spliced alignment. Benchmarks demonstrate that this approach has a better ability to produce high-quality full-length alignments, which has enormous potentials to transcriptomics studies.
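For readers unfamiliar with the data structure named in the title, here is a minimal sketch of a de Bruijn graph built from a single sequence. It is only an illustration of the general idea and is not deSALT's index or alignment method; the function name `build_de_bruijn` and the toy sequence are assumptions made for this example.

```python
from collections import defaultdict


def build_de_bruijn(sequence, k):
    """Build a toy de Bruijn graph from one sequence.

    Nodes are (k-1)-mers; every k-mer in the sequence contributes one edge
    from its prefix (k-1)-mer to its suffix (k-1)-mer.
    Hypothetical helper for illustration, not part of deSALT.
    """
    graph = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Toy input: the 3-mers are ACG, CGT, GTA, TAC, ACG, CGA.
graph = build_de_bruijn("ACGTACGA", 3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Repeated k-mers (here ACG) collapse onto the same pair of nodes, which is what makes such a graph a compact way to index repetitive sequence.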

3. [Genomics] Human population variation in miRNA binding sites

The impact of population variation in the analysis of microRNA target sites (CC-BY 4.0)

The impact of population variation in the analysis of regulatory interactions is an underdeveloped area. MicroRNA target recognition occurs via pairwise complementarity. Consequently, a number of computational prediction tools have been developed to identify potential target sites, that can be further validated experimentally. However, as microRNA target predictions are done mostly considering a reference genome sequence, target sites showing variation among populations are neglected. Here we study variation at microRNA target sites in human populations and quantify their impact in microRNA target prediction. We found that African populations carry a significant number of potential microRNA target sites that are not detectable in the current human reference genome sequence. Some of these targets are conserved in primates and only lost in Out-of-Africa populations. Indeed, we identified experimentally validated microRNA/transcript interactions that are not detected in standard microRNA target prediction programs, yet they have segregating target alleles abundant in non-European populations. In conclusion, here we show that ignoring population diversity may leave out regulatory elements essential to understand disease and gene expression, particularly neglecting populations of African origin.

4. [Genomics] Seoul National University researchers sequence nearly 2,000 genomes to reveal rare variants in Northeast Asian populations

Whole-genome reference panel of 1,781 Northeast Asians improves imputation accuracy of rare and low-frequency variants

Genotype imputation using the reference panel is a cost-effective strategy to fill millions of missing genotypes for the purpose of various genetic analyses. Here, we present the Northeast Asian Reference Database (NARD), including whole-genome sequencing data of 1,781 individuals from Korea, Mongolia, Japan, China, and Hong Kong. NARD provides the genetic diversities of Korean (n=850) and Mongolian (n=386) ancestries that were not present in the 1000 Genomes Project Phase 3 (1KGP3). We combined and re-phased the genotypes from NARD and 1KGP3 to construct a union set of haplotypes. This approach established a robust imputation reference panel for the Northeast Asian populations, which yields the greatest imputation accuracy of rare and low-frequency variants compared with the existing panels. Also, we illustrate that NARD can potentially improve disease variant discovery by reducing pathogenic candidates. Overall, this study provides a decent reference panel for the genetic studies in Northeast Asia.

5. [Bioinformatics] A benchmark of 74 alignment-free sequence comparison methods

Benchmarking of alignment-free sequence comparison methods (CC-BY-ND 4.0)

Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. Here, we present a community resource (http://afproject.org) to establish standards for comparing AF methods across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference and reconstruction of species trees under horizontal gene transfer and recombination events. The interactive web service allows researchers to explore the performance of AF tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with the current state-of-the art tools, accelerating the development of new, more accurate AF solutions.
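To give a feel for what "alignment-free" means, here is a minimal sketch of one common family of AF measures: comparing two sequences by the overlap of their k-mer sets (Jaccard distance). This is a generic illustration and not one of the 74 benchmarked methods specifically; the function names and the default k are assumptions chosen for this example.

```python
def kmer_set(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """Alignment-free distance: 1 - |shared k-mers| / |all k-mers|.

    Hypothetical helper for illustration; real tools use larger k
    and often sketching or k-mer counts instead of plain sets.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:  # both sequences shorter than k
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


print(jaccard_distance("ACGTACGT", "ACGTACGT"))  # identical sequences
print(jaccard_distance("ACGT", "ACGA"))          # one shared 3-mer out of three
print(jaccard_distance("AAAA", "CCCC"))          # no shared 3-mers
```

No alignment is computed at any point, so the cost grows roughly linearly with sequence length, which is why such measures appeal to data-intensive applications.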

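6. [Transcriptomics] Swiss researchers' transcriptome study across organs, developmental stages, and species identifies functionally constrained long non-coding RNAs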

Comparative transcriptomics analyses across species, organs and developmental stages reveal functionally constrained lncRNAs (CC-BY-NC-ND 4.0)

Background: Transcription of long non-coding RNAs (lncRNAs) is pervasive, but their functionality is disputed. As a class, lncRNAs show little selective constraint and negligible phenotypic effects upon perturbation. However, key biological roles were demonstrated for individual lncRNAs. Most validated lncRNAs were implicated in gene expression regulation, in pathways related to cellular pluripotency, differentiation and organ morphogenesis, suggesting that functional lncRNAs may be more abundant in embryonic development, rather than in adult organs. Results: Here, we perform a multi-dimensional comparative transcriptomics analysis, across five developmental time-points (two embryonic stages, newborn, adult and aged individuals), four organs (brain, kidney, liver and testes) and three species (mouse, rat and chicken). Overwhelmingly, lncRNAs are preferentially expressed in adult and aged testes, consistent with the presence of permissive transcription during spermatogenesis. LncRNAs are often differentially expressed among developmental stages and are less abundant in embryos and newborns compared to adult individuals, in agreement with a requirement for tighter expression control and less tolerance for noisy transcription early in development. However, lncRNAs expressed during embryonic development show increased levels of evolutionary conservation, both in terms of primary sequence and of expression patterns, and in particular at their promoter regions. We find that species-specific lncRNA transcription is frequent for enhancer-associated loci and occurs in parallel with expression pattern changes for neighboring protein-coding genes. Conclusions: We show that functionally constrained lncRNA loci are enriched in developing organ transcriptomes, and propose that many of these loci may function in an RNA-independent manner.

7. [Evolution] Whole-genome phylogenies reveal the distinctive structure of bacterial populations

Whole genome phylogenies reflect long-tailed distributions of recombination rates in many bacterial species (CC-BY-NC-ND 4.0)

Although homologous recombination is accepted to be common in bacteria, so far it has been challenging to accurately quantify its impact on genome evolution within bacterial species. We here introduce methods that use the statistics of single-nucleotide polymorphism (SNP) splits in the core genome alignment of a set of strains to show that, for many bacterial species, recombination dominates genome evolution. Each genomic locus has been overwritten so many times by recombination that it is impossible to reconstruct the clonal phylogeny and, instead of a consensus phylogeny, the phylogeny typically changes many thousands of times along the core genome alignment. We also show how SNP splits can be used to quantify the relative rates with which different subsets of strains have recombined in the past. We find that virtually every strain has a unique pattern of recombination frequencies with other strains and that the relative rates with which different subsets of strains share SNPs follow long-tailed distributions. Our findings show that bacterial populations are neither clonal nor freely recombining, but structured such that recombination rates between different lineages vary along a continuum spanning several orders of magnitude, with a unique pattern of rates for each lineage. Thus, rather than reflecting clonal ancestry, whole genome phylogenies reflect these long-tailed distributions of recombination rates.

BTW: Stanford evolutionary biologist Petrov commented that this paper completely changed his understanding of bacterial evolution.

8. [Bioinformatics] A large collection of pipelines endorsed by the bioinformatics community

nf-core: Community curated bioinformatics pipelines (CC-BY-NC 4.0)

The standardization, portability, and reproducibility of analysis pipelines is a renowned problem within the bioinformatics community. Bioinformatic analysis pipelines are often designed for execution on-premise, and this inevitably leads to a level of customisation and integration that is only applicable to the local infrastructure. More notably, the software required to run these pipelines is also tightly coupled with the local compute environment, and this leads to poor pipeline portability, and reproducibility of the ensuing results - both of which are fundamental requirements for the validation of scientific findings. Here we introduce nf-core, a framework that provides a community-driven platform for the creation and development of best practice analysis pipelines written in the Nextflow language. Nextflow has built-in support for pipeline execution on most computational infrastructures, as well as automated deployment using container technologies such as Conda, Docker, and Singularity. Therefore, key obstacles in pipeline development such as portability, reproducibility, scalability and unified parallelism are inherently addressed by all nf-core pipelines. Furthermore, to ensure that new pipelines can be added seamlessly, and existing pipelines are able to inherit up-to-date functionality the nf-core community is actively developing a suite of tools that automate pipeline creation, testing, deployment and synchronization. The peer-review process during pipeline development ensures that best practices and common usage patterns are imposed and therefore, adhere to community guidelines. Our primary goal is to provide a community-driven platform for high-quality, excellent documented and reproducible bioinformatics pipelines that can be utilized across various institutions and research facilities.

BTW: several related articles can be found via the links at the bottom of this preprint's page:

https://www.biorxiv.org/content/10.1101/610741v1

9. [Transcriptomics] Does the disease-associated microglia signature seen in mouse models of Alzheimer's disease apply to human patients? Transcriptomes provide the answer

Alzheimer’s patient brain myeloid cells exhibit enhanced aging and unique transcriptional activation (CC-BY-ND 4.0)

Gene expression changes in brain microglia from mouse models of Alzheimer’s disease (AD) are highly characterized and reflect specific myeloid cell activation states that could modulate AD risk or progression. While some groups have produced valuable expression profiles for human brain cells, the cellular clarity with which we now view transcriptional responses in mouse AD models has not yet been realized for human AD tissues due to limited availability of fresh tissue samples and technological hurdles of recovering transcriptomic data with cell-type resolution from frozen samples. We developed a novel method for isolating multiple cell types from frozen post-mortem specimens of superior frontal gyrus for RNA-Seq and identified 66 genes differentially expressed between AD and control subjects in the myeloid cell compartment. Myeloid cells sorted from fusiform gyrus of the same subjects showed similar changes, and whole tissue RNA analyses further corroborated our findings. The changes we observed did not resemble the “damage-associated microglia” (DAM) profile described in mouse AD models, or other known activation states from other disease models. Instead, roughly half of the changes were consistent with an “enhanced human aging” phenotype, whereas the other half, including the AD risk gene APOE, were altered in AD myeloid cells but not differentially expressed with age. We refer to this novel profile in human Alzheimer’s microglia/myeloid cells as the HAM signature. These results, which can be browsed at research-pub.gene.com/BrainMyeloidLandscape/reviewVersion, highlight considerable differences between myeloid activation in mouse models and human disease, and provide a genome-wide picture of brain myeloid activation in human AD.

10. [Bioinformatics] Belgian researchers: read length requirements in long-read resequencing

Critical length in long read resequencing (CC-BY 4.0)

Long read sequencing has a substantial advantage for structural variant discovery and phasing of variants compared to short-read technologies, but the required and optimal read length has not been assessed. In this work, we used simulated long reads and evaluated structural variant discovery and variant phasing using current best practice bioinformatics methods. We determined that optimal discovery of structural variants from human genomes can be obtained with reads of minimally 15 kbp. Haplotyping genes entirely only reaches its optimum from reads of 100 kbp. These findings are important for the design of future long read sequencing projects.

References

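1. iPlants, 竞争激烈！几乎同时又一文章揭示了植物水杨酸全合成还有新的因子？ (Fierce competition! Almost simultaneously, another paper reveals a new factor in the complete biosynthesis of salicylic acid in plants?)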

2. Rekhter D, et al., PBS3 is the missing link in plant-specific isochorismate-derived salicylic acid biosynthesis, bioRxiv https://www.biorxiv.org/content/10.1101/600692v1 (an updated version, v2, was posted on April 19)

3. Torrens-Spence M, et al., PBS3 and EPS1 complete salicylic acid biosynthesis from isochorismate in Arabidopsis, bioRxiv https://www.biorxiv.org/content/10.1101/601948v1

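4. 澎湃新闻 (The Paper), 诺奖委员回应“华裔女科学家未获奖”质疑：得主有论文早19年 (Nobel Committee member responds to questions over the "Chinese-American woman scientist who did not win": a laureate had a paper 19 years earlier)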
