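In June, the preprint family welcomed a new member: medRxiv, a dedicated preprint server for medicine and the health sciences.
Website: https://www.medrxiv.org
medRxiv was jointly launched by Cold Spring Harbor Laboratory in the United States (also the sponsor of bioRxiv), the journal BMJ, and Yale University. Whereas bioRxiv serves biological research, medRxiv mainly posts preprints in medicine, clinical research in particular. There is good reason to expect medRxiv to become an important companion platform to bioRxiv.
Welcome as this news is, we should also recognize that medical preprints may carry a heavier burden than the basic life-science preprints on bioRxiv. They speed up the spread of medically valuable findings, but without peer review, currently the best available system for evaluating research, flawed results could be applied directly in the clinic and cause harm. For this reason, medRxiv displays a warning in red English text at the bottom of its home page: preprints have not been peer reviewed, and extra caution is needed before applying them clinically.
In addition, some users on social media have pointed out that the articles on bioRxiv and medRxiv clearly overlap to some degree; in other words, for some interdisciplinary manuscripts it is not easy to decide which preprint server is the better home. In response, Richard Sever of Cold Spring Harbor Laboratory, a co-founder of bioRxiv and medRxiv, said the choice depends mainly on the readership: a manuscript aimed chiefly at clinical researchers fits medRxiv, while one aimed at basic biology researchers belongs on bioRxiv. A service that lets users search the literature across bioRxiv and medRxiv is also planned. In short, medRxiv is still in its start-up phase and much remains to be improved. For this issue of our bioinformatics preprint highlights we have also picked one article from medRxiv. As noted above, please keep the red notice on the medRxiv home page in mind and read with care. Then again, shouldn't the same apply to articles published after peer review?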
1. 【Transcriptomics】Does the mapping method affect transcript abundance estimates?
Alignment and mapping methodology influence transcript abundance estimation(CC-BY-ND 4.0)
The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. We investigate the effect of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential gene expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large, and can affect downstream analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally-acquired samples. We discuss best practices regarding alignment for the purposes of quantification, and also introduce a new hybrid alignment methodology, called selective alignment (SA), to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment.
2. 【Genomics】Assembling the Korean reference genome KOREF: is nanopore more cost-effective than PacBio?
Chromosome-scale assembly comparison of the Korean Reference Genome KOREF from PromethION and PacBio with Hi-C mapping information(CC-BY-NC 4.0)
Background Long DNA reads produced by single molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short read DNA fragments. For de novo assembly, PacBio and Oxford Nanopore Technologies (ONT) are favorite options. However, PacBio’s SMRT sequencing is expensive for a full human genome assembly and costs over 50,000 USD for 30x coverage as of 2018. ONT PromethION sequencing, on the other hand, is cost-effective at one-twentieth the price of PacBio for the same coverage and provides longer average and maximum read lengths, heralding the personalized reference genome era. We aimed to check the cost-effectiveness of PromethION sequencing product. Findings We performed whole genome de novo assemblies and comparison to construct the new version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64x coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with N50s of 16.7 Mbp and a total genome length of 2.8 Gbp. It was comparable to the KOREF assembly constructed using PacBio at 62x coverage (188 Gbp, 2,695 contigs and N50s of 17.9 Mbp). When we applied Hi-C-derived long-range mapping data, an even higher quality assembly for the 64x coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mbp. Conclusion In conclusion, the pore-based PromethION approach provides a good quality chromosome-scale human genome assembly at a low cost and is much more cost-effective than PacBio.
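For readers less familiar with the N50 figures quoted in this abstract, here is a minimal sketch of how the statistic is usually computed: sort the contig lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the toy lengths are our own illustration, not data or code from the KOREF paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contig first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

A higher N50 means that more of the assembly sits in long contigs, which is why the abstract reports it alongside the contig count and total length.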
3. 【Genomics】Penn State scientists report small RNAs in the parasitic plant Cuscuta (dodder) that may target host mRNAs
Compensatory sequence variation between trans-species small RNAs and their target sites(CC-BY-NC-ND 4.0)
Trans-species small regulatory RNAs (sRNAs) are delivered to host plants from diverse pathogens and parasites and can target host mRNAs. How trans-species sRNAs can be effective on diverse hosts has been unclear. Multiple species of the parasitic plant Cuscuta produce trans-species sRNAs that collectively target many host mRNAs. Confirmed target sites are nearly always in highly conserved, protein-coding regions of host mRNAs. Cuscuta trans-species sRNAs can be grouped into superfamilies that have variation in a three-nucleotide period. These variants compensate for synonymous-site variation in host mRNAs. By targeting host mRNAs at highly conserved protein-coding sites, and simultaneously expressing multiple variants to cover synonymous-site variation, Cuscuta trans-species sRNAs may be able to successfully target homologous mRNAs from diverse hosts. One Sentence Summary The parasitic plant Cuscuta produces a diverse set of sRNAs that compensate for sequence variation in mRNA targets in diverse hosts.
4. 【Transcriptomics】Tony Wyss-Coray and colleagues at Stanford University report gene expression in 17 mouse organs across the whole lifespan
The murine transcriptome reveals global aging nodes with organ-specific phase and amplitude(CC-BY-NC 4.0)
Aging is the single greatest cause of disease and death worldwide, and so understanding the associated processes could vastly improve quality of life. While the field has identified major categories of aging damage such as altered intercellular communication, loss of proteostasis, and eroded mitochondrial function, these deleterious processes interact with extraordinary complexity within and between organs. Yet, a comprehensive analysis of aging dynamics organism-wide is lacking. Here we performed RNA-sequencing of 17 organs and plasma proteomics at 10 ages across the mouse lifespan. We uncover previously unknown linear and non-linear expression shifts during aging, which cluster in strikingly consistent trajectory groups with coherent biological functions, including extracellular matrix regulation, unfolded protein binding, mitochondrial function, and inflammatory and immune response. Remarkably, these gene sets are expressed similarly across tissues, differing merely in age of onset and amplitude. Especially pronounced is widespread immune cell activation, detectable first in white adipose depots in middle age. Single-cell RNA-sequencing confirms the accumulation of adipose T and B cells, including immunoglobulin J-expressing plasma cells, which also accrue concurrently across diverse organs. Finally, we show how expression shifts in distinct tissues are highly correlated with corresponding protein levels in plasma, thus potentially contributing to aging of the systemic circulation. Together, these data demonstrate a similar yet asynchronous inter- and intra-organ progression of aging, thereby providing a foundation to track systemic sources of declining health at old age.
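5. 【Genomics】Inner Mongolia Agricultural University: genome sequencing of 128 camels sheds light on the origin and domestication history of Bactrian camels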
Whole-genome sequencing of 128 camels across Asia provides insights into origin and migration of domestic Bactrian camels(CC-BY-NC-ND 4.0)
The domestic Bactrian camels were treated as the principal means of locomotion between the eastern and western cultures in history. To address the question of their origin, we performed whole-genome sequencing of 128 camels across Asia, including representative populations of domestic Bactrian camels from the Mongolian Plateau to the Caspian Sea, as well as the extant wild Bactrian camels and dromedaries. The domestic and wild Bactrian camels showed remarkable genetic divergence since they were split from dromedaries, confirming they were separated species. The wild Bactrian camels made also little contribution to the ancestry of domestic ones. Among the domestic Bactrian camels, those from Iran exhibited the largest genetic distance from others, and were the first population to separate in the phylogeny. Although evident admixture was observed between domestic Bactrian camels and dromedaries living around the Caspian Sea, the large genetic distance and basal position of Iranian Bactrian camels could not be explained by introgression alone. Taken together, our study favored the Iranian origin of domestic Bactrian camels, which were then immigrated eastward to Mongolia where the native wild Bactrian camels inhabited. This study illustrated the complex genomic landscape of migration underlying domestication in Bactrian camels.
6. 【Genomics】A large benchmark for germline structural variants (SVs)
A robust benchmark for germline structural variant detection
New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution, and comprehensiveness. Translating these methods to routine research and clinical practice requires robust benchmark sets. We developed the first benchmark set for identification of both false negative and false positive germline SVs, which complements recent efforts emphasizing increasingly comprehensive characterization of SVs. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle (GIAB) Consortium integrated 19 sequence-resolved variant calling methods, both alignment- and de novo assembly-based, from short-, linked-, and long-read sequencing, as well as optical and electronic mapping. The final benchmark set contains 12745 isolated, sequence-resolved insertion and deletion calls ≥50 base pairs (bp) discovered by at least 2 technologies or 5 callsets, genotyped as heterozygous or homozygous variants by long reads. The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.66 Gbp and 9641 SVs supported by at least one diploid assembly. Support for SVs was assessed using svviz with short-, linked-, and long-read sequence data. In general, there was strong support from multiple technologies for the benchmark SVs, with 90% of the Tier 1 SVs having support in reads from more than one technology. The Mendelian genotype error rate was 0.3%, and genotype concordance with manual curation was >98.7%. We demonstrate the utility of the benchmark set by showing it reliably identifies both false negatives and false positives in high-quality SV callsets from short-, linked-, and long-read sequencing and optical mapping. 
GIAB is working towards a new version of the benchmark set that will use new technologies and methods such as PacBio Circular Consensus Sequencing and ultralong Oxford Nanopore sequencing to expand to more challenging genome regions and include more challenging SVs such as inversions. We are also developing a robust integration process to make calls on GRCh37 and GRCh38 for all seven GIAB samples.
7. 【Bioinformatics】The Unckless lab at the University of Kansas develops a new deep-learning method for detecting copy number variants
A Simple Deep Learning Approach for Detecting Duplications and Deletions in Next-Generation Sequencing Data(CC-BY-ND 4.0)
Copy number variants (CNV) are associated with phenotypic variation in several species. However, properly detecting changes in copy numbers of sequences remains a difficult problem, especially in lower quality or lower coverage next-generation sequencing data. Here, inspired by recent applications of machine learning in genomics, we describe a method to detect duplications and deletions in short-read sequencing data. In low coverage data, machine learning appears to be more powerful in the detection of CNVs than the gold-standard methods or coverage estimation alone, and of equal power in high coverage data. We also demonstrate how replicating training sets allows a more precise detection of CNVs, even identifying novel CNVs in two genomes previously surveyed thoroughly for CNVs using long read data. Available at: https://github.com/tomh1lll/dudem
8. 【Genomics】Hunter Fraser and colleagues at Stanford University: the landscape of somatic mutations across the human body
The somatic mutation landscape of the human body(CC-BY-ND 4.0)
Somatic mutations in healthy tissues contribute to aging, neurodegeneration, and cancer initiation, yet remain largely uncharacterized. To gain a better understanding of their distribution and functional impacts, we leveraged the genomic information contained in the transcriptome to uniformly call somatic mutations from over 7,500 tissue samples, representing 36 distinct tissues. This catalog, containing over 280,000 mutations, revealed a wide diversity of tissue-specific mutation profiles associated with gene expression levels and chromatin states. We found pervasive negative selection acting on missense and nonsense mutations, except for mutations previously observed in cancer samples, which were under positive selection and were highly enriched in many healthy tissues. These findings reveal fundamental patterns of tissue-specific somatic evolution and shed light on aging and the earliest stages of tumorigenesis.
9. 【Bioinformatics】A high-school intern at UC San Diego publishes, as sole author, new algorithms for PCR deduplication
Algorithms for efficiently collapsing reads with Unique Molecular Identifiers(CC-BY 4.0)
Background Unique Molecular Identifiers (UMI) are used in many experiments to find and remove PCR duplicates. Although there are many tools for solving the problem of deduplicating reads based on their finding reads with the same alignment coordinates and UMIs, many tools either cannot handle substitution errors, or require expensive pairwise UMI comparisons that do not efficiently scale to larger datasets. Results We formulate the problem of deduplicating UMIs in a manner that enables optimizations to be made, and more efficient data structures to be used. We implement our data structures and optimizations in a tool called UMICollapse, which is able to deduplicate over one million unique UMIs of length 9 at a single alignment position in around 26 seconds. Conclusions We present a new formulation of the UMI deduplication problem, and show that it can be solved faster, with more sophisticated data structures.
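To make the problem in this abstract concrete, the sketch below collapses UMIs observed at a single alignment position, treating any UMI within one substitution of a more abundant UMI as a PCR or sequencing error of it. Note that this is the naive pairwise-comparison approach that the abstract describes as scaling poorly, not the data structures implemented in UMICollapse; the names `hamming` and `collapse_umis`, the greedy most-abundant-first rule, and the toy UMIs are all our own assumptions for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umis, max_mismatches=1):
    """Greedily merge UMIs seen at one alignment position.

    UMIs are visited from most to least abundant (ties broken
    alphabetically). Each UMI joins the first existing representative
    within `max_mismatches` substitutions; otherwise it becomes a new
    representative. Returns {representative UMI: total read count}.
    """
    counts = Counter(umis)
    representatives = {}
    for umi, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for rep in representatives:
            if hamming(umi, rep) <= max_mismatches:
                representatives[rep] += count
                break
        else:
            # No close representative found: treat as a distinct molecule.
            representatives[umi] = count
    return representatives


# Hypothetical reads that share the same alignment coordinates.
reads = ["AAAA"] * 5 + ["AAAT"] + ["GGGG"] * 3 + ["GGGC"]
print(collapse_umis(reads))
```

Every UMI is compared against every current representative, so the work grows roughly quadratically with the number of distinct UMIs at a position. That cost is what motivates the faster formulations the preprint proposes.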
10. 【Omics】Gene expression analysis of innate and adaptive immunity in newborns (medRxiv)
Molecular profiling of neonatal dried blood spots reveals changes in innate and adaptive immunity following fetal inflammatory response(CC-BY-NC-ND 4.0)
The fetal inflammatory response (FIR) increases the risk of perinatal brain injury, particularly in extremely low gestational age newborns (ELGANs, < 28 weeks of gestation). One of the mechanisms contributing to such a risk is a postnatal intermittent or sustained systemic inflammation (ISSI) following FIR. The link between prenatal and postnatal systemic inflammation is supported by the presence of well-established inflammatory biomarkers in the umbilical cord and peripheral blood. However, the extent of molecular changes contributing to this association is unknown. Using RNA sequencing and mass spectrometry proteomics, we profiled the transcriptome and proteome of archived neonatal dried blood spot (DBS) specimens from 21 ELGANs. Comparing FIR-affected and unaffected ELGANs, we identified 783 gene and 27 protein expression changes of 50% magnitude or more, and an experiment-wide significance level below 5% false discovery rate. These expression changes confirm the robust postnatal activation of the innate immune system in FIR-affected ELGANs and reveal an impairment of their adaptive immunity. In turn, the altered pathways provide clues about the molecular mechanisms triggering ISSI after FIR, and the onset of perinatal brain injury.
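The selection rule in this abstract (a change of 50% or more, at a false discovery rate below 5%) can be sketched as a two-step filter. The abstract does not say which FDR procedure the authors used, so the Benjamini-Hochberg adjustment below is our assumption, as is reading "50% magnitude" as a fold change of at least 1.5 or at most 0.5. The function names and toy numbers are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay
    # monotone: each one is capped by the adjusted value ranked above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_changes(features, min_change=0.5, max_fdr=0.05):
    """Keep features that change by at least `min_change` with FDR < `max_fdr`.

    `features` is a list of (name, fold_change, pvalue) tuples, where
    fold_change is the affected / unaffected expression ratio.
    """
    qvalues = benjamini_hochberg([p for _, _, p in features])
    selected = []
    for (name, fold_change, _), q in zip(features, qvalues):
        # Assumption: "50% magnitude" means the ratio differs from 1 by >= 0.5.
        if abs(fold_change - 1) >= min_change and q < max_fdr:
            selected.append(name)
    return selected


# Hypothetical measurements, for illustration only.
toy = [("g1", 2.0, 0.001), ("g2", 1.1, 0.002), ("g3", 0.4, 0.03), ("g4", 1.8, 0.5)]
print(select_changes(toy))
```

In this toy input, g2 is significant but changes too little, and g4 changes enough but is not significant, so only g1 and g3 pass both criteria.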