The BLM Movement May Affect Bioinformatics & a New Aligner Takes Down bowtie2
Bioinformatics Essentials
montreal · September 9, 2020, 23:45
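In recent months the BLM (Black Lives Matter) movement, which began in the United States and quickly spread to many other countries, has had a major impact on people's daily lives. Now BLM has finally reached the bioinformatics community. I first learned of this through the President's Award of the internationally renowned Society for the Study of Evolution (SSE), which recognizes outstanding PhD dissertations. The award used to have another, very memorable name: the Fisher Prize, in honor of the famous British evolutionary biologist Sir Ronald Fisher, a founding figure of population and quantitative genetics. Shortly after this year's BLM protests broke out, however, Fisher's name was formally removed from the award because of his radical views on human populations and eugenics.

Then, at Fisher's alma mater, the University of Cambridge, a group of activists calling themselves anti-racists targeted the Sir Ronald Fisher memorial window. They "decorated" the gate of the college that houses it by spray-painting the slogan "eugenics is genocide - Fisher must fall", and the incident ultimately led the college to announce that the window would be removed [1].

You might say: I don't study evolutionary biology, so what does this have to do with me? You would be badly mistaken, because Fisher is far better known as a statistician. Beyond evolutionary biology, his achievements in statistics have profoundly influenced almost every field of science. Widely used methods such as Fisher's exact test, analysis of variance (ANOVA), Fisher's discriminant and the permutation test all come from this great British scholar. The famous t-test, proposed by Gosset, also gained wide acceptance only through Fisher's hands. Even the p-value of hypothesis testing was formalized and popularized by him.

So let me timidly ask: will using these statistical methods and concepts land us on a blacklist from now on? Can papers that already used them be forgiven? If the matter must be pursued, will a prompt confession earn leniency? Should we rename the methods as well? And if so, will each test be named after whoever redefines it first?

Speaking of statistical tests, this issue's preprint roundup also includes a paper posted on the arXiv preprint server that introduces a new family of non-parametric tests.

By convention, August, and especially late August, is peak vacation season for professors abroad. I remember that my own thesis defense was pushed from July to September for exactly this reason, skipping August entirely. However busy they are, researchers put down their work to spend time with family, meet a few close friends, or set off alone with a backpack to enjoy what is left of the summer. This year the pandemic has made all of that difficult, and many people have probably had to stay home and keep working from home, as the steady stream of new preprints on bioRxiv suggests.

1. SiT: a new technique for capturing spatial information on alternative splicing, from KTH Royal Institute of Technology, Sweden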
The spatial landscape of gene expression isoforms in tissue sections
In situ capturing technologies add tissue context to gene expression data, with the potential of providing a greater understanding of complex biological systems. However, splicing variants and full-length sequence heterogeneity cannot be characterized with current methods. Here, we introduce Spatial Isoform Transcriptomics (SiT), an explorative method for characterizing spatial isoform and sequence heterogeneity in tissue sections, and show how it can be used to profile isoform expression and sequence heterogeneity in a tissue context.

2. Large-scale expansion of transposable elements in grasshopper genomes, from Uppsala University, Sweden
Too much too many: comparative analysis of morabine grasshopper genomes reveals highly abundant transposable elements and rapidly proliferating satellite DNA repeats
We obtained linked-read genome assemblies of 2.73-3.27 Gb from estimated genome sizes of 4.26-5.07 Gb DNA per haploid genome of the four chromosomal races of V. viatica. These constitute the third largest insect genomes assembled so far (the largest being two locust grasshoppers). Combining complementary annotation tools and manual curation, we found a large diversity of TEs and satDNAs constituting 66 to 75% per genome assembly. A comparison of sequence divergence within the TE classes revealed massive accumulation of recent TEs in all four races (314-463 Mb per assembly), indicating that their large genome size is likely due to similar rates of TE accumulation across the four races. Transcriptome sequencing showed more biased TE expression in reproductive tissues than somatic tissues, implying permissive transcription in gametogenesis. Out of 129 satDNA families, 102 satDNA families were shared among the four chromosomal races, which likely represent a repertoire of satDNA families in the ancestor of the V. viatica chromosomal races. Notably, 50 of these shared satDNA families underwent differential proliferation since the recent diversification of the V. viatica species complex.

3. Divergent influenza-like viruses found in amphibians and fish
Divergent influenza-like viruses of amphibians and fish support an ancient evolutionary association
Influenza viruses (family Orthomyxoviridae) infect a variety of vertebrates, including birds, humans, and other mammals. Recent metatranscriptomic studies have uncovered divergent influenza viruses in amphibians, fish and jawless vertebrates, suggesting that these viruses may be widely distributed. We sought to identify additional vertebrate influenza-like viruses through the analysis of publicly available RNA sequencing data. Accordingly, by data mining, we identified the complete coding segments of five divergent vertebrate influenza-like viruses. Three fell as sister lineages to influenza B virus: salamander influenza-like virus in Mexican walking fish (Ambystoma mexicanum) and plateau tiger salamander (Ambystoma velasci), siamese algae-eater influenza-like virus in siamese algae-eater fish (Gyrinocheilus aymonieri) and chum salmon influenza-like virus in chum salmon (Oncorhynchus keta). Similarly, we identified two influenza-like viruses of amphibians that fell as sister lineages to influenza D virus: cane toad influenza-like virus and the ornate chorus frog influenza-like virus, in the cane toad (Rhinella marina) and ornate chorus frog (Microhyla fissipes), respectively. Despite their divergent phylogenetic positions, these viruses retained segment conservation and splicing consistent with transcriptional regulation in influenza B and influenza D viruses, and were detected in respiratory tissues. These data suggest that influenza viruses have been associated with vertebrates for their entire evolutionary history.
4. Evolutionary genomic analysis suggests that DNA methylation of duplicate genes in angiosperms may be linked to gene-dosage regulation
DNA methylation signatures of duplicate gene evolution in angiosperms
Gene duplications have greatly shaped the gene content of plants. Multiple factors, such as the epigenome, can shape the subsequent evolution of duplicate genes and are the subject of ongoing study. We analyze genic DNA methylation patterns in 43 angiosperm species and 928 Arabidopsis thaliana ecotypes to find differences in the association of whole-genome and single-gene duplicates with genic DNA methylation patterns. Whole-genome duplicates were enriched for patterns associated with higher gene expression and depleted for patterns of non-CG DNA methylation associated with gene silencing. Single-gene duplicates showed variation in DNA methylation patterns based on modes of duplication (tandem, proximal, transposed, and dispersed) and species. Age of gene duplication was a key factor in the DNA methylation of single-gene duplicates. In single-gene duplicates, non-CG DNA methylation patterns associated with silencing were younger, less conserved, and enriched for presence-absence variation. In comparison, DNA methylation patterns associated with constitutive expression were older and more highly conserved. Surprisingly, across the phylogeny, genes marked by non-CG DNA methylation were enriched for duplicate pairs with evidence of positive selection. We propose that DNA methylation has a role in maintaining gene-dosage balance and silencing by non-CG methylation and may facilitate the evolutionary fate of duplicate genes.

5. Rob Patro's group at the University of Maryland releases a new aligner, PuffAligner, claimed to outperform bowtie2, STAR and similar tools
PuffAligner: An Efficient and Accurate Aligner Based on the Pufferfish Index
In this paper, we introduce PuffAligner, a fast, accurate and versatile aligner built on top of the Pufferfish index. PuffAligner is able to produce highly-sensitive alignments, similar to those of Bowtie2, but much more quickly. While exhibiting similar speed to the ultrafast STAR aligner, PuffAligner requires considerably less memory to construct its index and align reads. PuffAligner strikes a desirable balance with respect to the time, space, and accuracy tradeoffs made by different alignment tools, and provides a promising foundation on which to test new alignment ideas over large collections of sequences.

6. University of Melbourne: nanopore direct RNA sequencing resolves differential expression between cell populations
Nanopore direct RNA sequencing detects differential expression between human cell populations
Accurately quantifying gene and isoform expression changes is essential to understanding cell functions, differentiation and disease. Therefore, a crucial requirement of RNA sequencing is identifying differential expression. The recent development of long-read direct RNA (dRNA) sequencing has the potential to overcome many limitations of short and long-read sequencing methods that require RNA fragmentation, cDNA synthesis or PCR. dRNA sequences native RNA and can encompass an entire RNA in a single read. However, its ability to identify differential gene and isoform expression in complex organisms is poorly characterised. Using a mixture of synthetic controls and human SH-SY5Y cell differentiation into neuron-like cells, we show that dRNA sequencing accurately quantifies RNA expression and identifies differential expression of genes and isoforms. We generated ~4 million dRNA reads with a median length of 991 nt. On average, reads covered 74% of SH-SY5Y transcripts and 29% were full-length. Measurement of expression and fold changes between synthetic control RNAs confirmed accurate quantification of genes and isoforms. Differential expression of 231 genes, 291 isoforms, plus 27 isoform switches were detected between undifferentiated and differentiated SH-SY5Y cells and samples clustered by differentiation state at the gene and isoform level. Genes upregulated in neuron-like cells were associated with neurogenesis. We further identified >30,000 expressed transcripts including thousands of novel splice isoforms and transcriptional units. Our results establish the ability of dRNA sequencing to identify biologically relevant differences in gene and isoform expression and perform the key capabilities of expression profiling methodologies.

7. Researchers at the École Normale Supérieure in Paris release metabaR, an R package for DNA metabarcoding analysis that has gone viral online
metabaR: an R package for the evaluation and improvement of DNA metabarcoding data quality
Here, we present metabaR, an R package that provides a comprehensive suite of tools to effectively curate DNA metabarcoding data after basic bioinformatic analyses. In particular, metabaR uses experimental negative or positive controls to identify different types of artefactual sequences, i.e. reagent contaminants and tag-jumps. It also flags potentially dysfunctional PCRs based on PCR replicate similarities when those are available. Finally, metabaR provides tools to visualise DNA metabarcoding data characteristics in their experimental context as well as their distribution, and facilitate assessment of the appropriateness of data curation filtering thresholds. metabaR is applicable to any DNA metabarcoding experimental design but is most powerful when the design includes experimental controls and replicates. More generally, the simplicity and flexibility of the package makes it applicable to any DNA marker, and data generated with any sequencing platform, and pre-analysed with any bioinformatic pipeline. Its outputs are easily usable for downstream analyses with any ecological R package. metabaR complements existing bioinformatics pipelines by providing scientists with a variety of functions with customisable methods that will allow the user to effectively clean DNA metabarcoding data and avoid serious misinterpretations. It thus offers a promising platform for automatised data quality assessments of DNA metabarcoding data for environmental research and biomonitoring.
8. The team of RAxML author Alexandros Stamatakis: phylogenetic analyses of SARS-CoV-2 can be highly misleading
Phylogenetic analysis of SARS-CoV-2 data is difficult
Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising all virus sequences available on May 5, 2020 from gisaid.org. We find that it is difficult to infer a reliable phylogeny on these data due to the large number of sequences in conjunction with the low number of mutations. We further find that rooting the inferred phylogeny with some degree of confidence either via the bat and pangolin outgroups or by applying novel computational methods on the ingroup phylogeny does not appear to be possible. Finally, an automatic classification of the current sequences into sub-classes based on statistical criteria is also not possible, as the sequences are too closely related. We conclude that, although the application of phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be considered and interpreted with extreme caution.

9. [COVID-19] medRxiv: SwabSeq, a five-dollar technique for rapid SARS-CoV-2 testing
Swab-Seq: A high-throughput platform for massively scaled up SARS-CoV-2 testing
The rapid spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is due to the high rates of transmission by individuals who are asymptomatic at the time of transmission. Frequent, widespread testing of the asymptomatic population for SARS-CoV-2 is essential to suppress viral transmission and is a key element in safely reopening society. Despite increases in testing capacity, multiple challenges remain in deploying traditional reverse transcription and quantitative PCR (RT-qPCR) tests at the scale required for population screening of asymptomatic individuals. We have developed SwabSeq, a high-throughput testing platform for SARS-CoV-2 that uses next-generation sequencing as a readout. SwabSeq employs sample-specific molecular barcodes to enable thousands of samples to be combined and simultaneously analyzed for the presence or absence of SARS-CoV-2 in a single run. Importantly, SwabSeq incorporates an in vitro RNA standard that mimics the viral amplicon, but can be distinguished by sequencing. This standard allows for end-point rather than quantitative PCR, improves quantitation, reduces requirements for automation and sample-to-sample normalization, enables purification-free detection, and gives better ability to call true negatives. We show that SwabSeq can test nasal and oral specimens for SARS-CoV-2 with or without RNA extraction while maintaining analytical sensitivity better than or comparable to that of fluorescence-based RT-qPCR tests. SwabSeq is simple, sensitive, flexible, rapidly scalable, inexpensive enough to test widely and frequently, and can provide a turnaround time of 12 to 24 hours.
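
The pooling logic described in the SwabSeq abstract above can be sketched roughly as follows. This is not the authors' pipeline: the amplicon labels ("virus" and "spike"), the function name call_samples and both thresholds are hypothetical choices made purely for illustration. The sketch counts reads per sample barcode, uses the spike-in standard to check that a sample produced usable data at all, and then calls each sample from the ratio of viral reads to spike-in reads.

```python
from collections import Counter


def call_samples(reads, min_spike_reads=10, ratio_threshold=0.1):
    """Call each barcoded sample from pooled sequencing reads.

    reads: iterable of (sample_barcode, amplicon) pairs, where amplicon is
           "virus" (viral amplicon) or "spike" (in vitro RNA standard).
    Both thresholds are hypothetical illustration values, not from the paper.
    """
    virus = Counter()
    spike = Counter()
    for barcode, amplicon in reads:
        if amplicon == "virus":
            virus[barcode] += 1
        elif amplicon == "spike":
            spike[barcode] += 1

    calls = {}
    for barcode in set(virus) | set(spike):
        if spike[barcode] < min_spike_reads:
            # Too few standard reads: the reaction may have failed, so a
            # missing viral signal cannot be trusted as a true negative.
            calls[barcode] = "inconclusive"
        else:
            ratio = virus[barcode] / spike[barcode]
            calls[barcode] = "detected" if ratio >= ratio_threshold else "not detected"
    return calls


if __name__ == "__main__":
    # Toy reads for three samples, invented for this example only.
    toy_reads = (
        [("S1", "virus")] * 50 + [("S1", "spike")] * 20
        + [("S2", "spike")] * 30
        + [("S3", "virus")] * 2 + [("S3", "spike")] * 3
    )
    for sample, call in sorted(call_samples(toy_reads).items()):
        print(sample, call)
```

Dividing by the spike-in count is what lets an end-point measurement stand in for a quantitative one in this sketch: a sample with plenty of standard reads and no viral reads can be reported as negative, while a sample with almost no reads of either kind is flagged instead of being silently called negative.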
10. arXiv: a new family of non-parametric tests
Generalized Spacing-Statistics and a New Family of Non-Parametric Tests
Random divisions of an interval arise in various contexts, including statistics, physics, and geometric analysis. For testing the uniformity of a random partition of the unit interval $[0,1]$ into $k$ disjoint subintervals of size $(S_k[1],\ldots,S_k[k])$, Greenwood (1946) suggested using the squared $\ell_2$-norm of this size vector as a test statistic, prompting a number of subsequent studies. Despite much progress on understanding its power and asymptotic properties, attempts to find its exact distribution have succeeded so far for only small values of $k$. Here, we develop an efficient method to compute the distribution of the Greenwood statistic and more general spacing-statistics for an arbitrary value of $k$. Specifically, we consider random divisions of $\{1,2,\ldots,n\}$ into $k$ subsets of consecutive integers and study $\|S_{n,k}\|_{p,w}^{p}$, the $p$th power of the weighted $\ell_p$-norm of the subset size vector $S_{n,k}=(S_{n,k}[1],\ldots,S_{n,k}[k])$ for arbitrary weights $w=(w_1,\ldots,w_k)$. We present an exact and quickly computable formula for its moments, as well as a simple algorithm to accurately reconstruct a probability distribution using the moment sequence. We also study various scaling limits, one of which corresponds to the Greenwood statistic in the case of $p=2$ and $w=(1,\ldots,1)$, and this connection allows us to obtain information about regularity, monotonicity and local behavior of its distribution. Lastly, we devise a new family of non-parametric tests using $\|S_{n,k}\|_{p,w}^{p}$ and demonstrate that they exhibit substantially improved power for a large class of alternatives, compared to existing popular methods such as the Kolmogorov-Smirnov, Cramér-von Mises, and Mann-Whitney/Wilcoxon rank-sum tests.

1. BBC News, 2020, Sir Ronald Fisher memorial in Cambridge targeted by activists
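
To make the Greenwood statistic from the spacing-statistics abstract above concrete, here is a minimal Python sketch. It assumes the classical setting in which k-1 points cut [0,1] into k spacings, and it estimates an upper-tail p-value by plain Monte Carlo simulation, not by the moment-based exact method the paper develops. The function names, the simulation count and the seed are illustrative choices and do not come from the paper.

```python
import random


def greenwood_statistic(points):
    """Sum of squared spacings induced by points in [0, 1].

    k-1 points cut the unit interval into k spacings; the statistic is
    the squared l2-norm of the spacing vector.
    """
    cuts = [0.0] + sorted(points) + [1.0]
    spacings = [b - a for a, b in zip(cuts, cuts[1:])]
    return sum(s * s for s in spacings)


def monte_carlo_p_value(observed, n_points, n_sim=20000, seed=0):
    """Upper-tail p-value under uniformity, estimated by simulation.

    This is a stand-in for the exact distribution, which this sketch
    does not compute.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        simulated = greenwood_statistic([rng.random() for _ in range(n_points)])
        if simulated >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    clustered = [0.01, 0.02, 0.03]  # toy data, invented for illustration
    g = greenwood_statistic(clustered)
    print("Greenwood statistic:", g)
    print("Monte Carlo p-value:", monte_carlo_p_value(g, len(clustered)))
```

Evenly spaced points give the smallest possible value, 1/k, so large values of the statistic point to clustering; a lower-tail version of the same test would instead flag points that are spaced too regularly.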