Knowledge Center - 北京概普生物科技有限公司 (GapTech)

Weekly Bioinformatics Literature Digest (09.23)

Bioinformatics Essentials · sxr2 · September 24, 2017, 07:22

1. Pearl millet genome sequence provides a resource to improve agronomic traits in arid environments (Nature Biotechnology)

Abstract

Pearl millet [Cenchrus americanus (L.) Morrone] is a staple food for more than 90 million farmers in arid and semi-arid regions of sub-Saharan Africa, India and South Asia. We report the ~1.79 Gb draft whole genome sequence of reference genotype Tift 23D2B1-P1-P5, which contains an estimated 38,579 genes. We highlight the substantial enrichment for wax biosynthesis genes, which may contribute to heat and drought tolerance in this crop. We resequenced and analyzed 994 pearl millet lines, enabling insights into population structure, genetic diversity and domestication. We use these resequencing data to establish marker trait associations for genomic selection, to define heterotic pools, and to predict hybrid performance. We believe that these resources should empower researchers and breeders to improve this important staple crop.

2. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads (Nature Methods)

Abstract

We present a tool that combines fast mapping, error correction, and de novo assembly (MECAT; accessible at https://github.com/xiaochuanle/MECAT) for processing single-molecule sequencing (SMS) reads. MECAT's computing efficiency is superior to that of current tools, while the results MECAT produces are comparable or improved. MECAT enables reference mapping or de novo assembly of large genomes using SMS reads on a single computer.

3. Comprehensive benchmarking and ensemble approaches for metagenomic classifiers (Genome Biology)

Abstract

Background

One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole-genome shotgun sequencing data, comprehensive comparisons of these methods are limited.

Results

In this study, we use the largest-to-date set of laboratory-generated and simulated controls across 846 species to evaluate the performance of 11 metagenomic classifiers. Tools were characterized on the basis of their ability to identify taxa at the genus, species, and strain levels, quantify relative abundances of taxa, and classify individual reads to the species level. Strikingly, the number of species identified by the 11 tools can differ by over three orders of magnitude on the same datasets. Various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples, which are especially important where they concern medically relevant species. Overall, pairing tools with different classification strategies (k-mer, alignment, marker) can combine their respective advantages.

Conclusions

This study provides positive and negative controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision, accuracy, and recall. We show that proper experimental design and analysis parameters can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.
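
The mitigation strategies named in the Results (abundance filtering, tool intersection, and ensemble voting) can be illustrated with a minimal Python sketch. This is not the paper's pipeline: the tool names, read counts, and thresholds below are hypothetical, and simple majority voting stands in for whatever ensemble method the authors used.

```python
from collections import Counter


def abundance_filter(profile, min_rel_abundance=0.001):
    """Keep taxa whose share of classified reads meets the threshold."""
    total = sum(profile.values())
    if total == 0:
        return {}
    return {taxon: count for taxon, count in profile.items()
            if count / total >= min_rel_abundance}


def intersect_tools(calls_by_tool):
    """Return only the taxa reported by every tool."""
    taxa_sets = [set(calls) for calls in calls_by_tool.values()]
    return set.intersection(*taxa_sets) if taxa_sets else set()


def vote(calls_by_tool, min_votes=2):
    """Return taxa reported by at least min_votes tools."""
    votes = Counter(taxon
                    for calls in calls_by_tool.values()
                    for taxon in set(calls))
    return {taxon for taxon, n in votes.items() if n >= min_votes}


# Hypothetical read counts per species from three classifier families.
calls = {
    "kmer_tool": {"Escherichia coli": 900, "Bacillus subtilis": 99,
                  "Yersinia pestis": 1},
    "alignment_tool": {"Escherichia coli": 850, "Bacillus subtilis": 150},
    "marker_tool": {"Escherichia coli": 40, "Salmonella enterica": 2},
}

# Drop taxa below 1% relative abundance within each tool's profile.
filtered = {tool: abundance_filter(profile, 0.01)
            for tool, profile in calls.items()}

consensus = intersect_tools(filtered)   # strictest: all tools must agree
majority = vote(filtered, min_votes=2)  # looser: any two tools agree
```

Intersection is the most conservative choice and trades recall for precision; voting sits between intersection and a plain union of all calls.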

4. Linnorm: improved statistical analysis for single cell RNA-seq expression data (Nucleic Acids Research)

Abstract

Linnorm is a novel normalization and transformation method for the analysis of single cell RNA sequencing (scRNA-seq) data. Linnorm is developed to remove technical noises and simultaneously preserve biological variations in scRNA-seq data, such that existing statistical methods can be improved. Using real scRNA-seq data, we compared Linnorm with existing normalization methods, including NODES, SAMstrt, SCnorm, scran, DESeq and TMM. Linnorm shows advantages in speed, technical noise removal and preservation of cell heterogeneity, which can improve existing methods in the discovery of novel subtypes, pseudo-temporal ordering of cells, clustering analysis, etc. Linnorm also performs better than existing DEG analysis methods, including BASiCS, NODES, SAMstrt, Seurat and DESeq2, in false positive rate control and accuracy.

5. Detecting and characterizing microRNAs of diverse genomic origins via miRvial (Nucleic Acids Research)

Abstract

MicroRNAs form an essential class of post-transcriptional gene regulators of eukaryotic species, and play critical parts in development and disease and stress responses. MicroRNAs may originate from various genomic loci, have structural characteristics, and appear in canonical or modified forms, making them subtle to detect and analyze. We present miRvial, a robust computational method and companion software package that supports parameter adjustment and visual inspection of candidate microRNAs. Extensive results comparing miRvial and six existing microRNA finding methods on six model organisms, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana, Oryza sativa, Physcomitrella patens and Chlamydomonas reinhardtii, demonstrated the utility and rigor of miRvial in detecting novel microRNAs and characterizing features of microRNAs. Experimental validation of several novel microRNAs in C. reinhardtii that were predicted by miRvial but missed by the other methods illustrated the superior performance of miRvial over the existing methods. miRvial is open source and available at https://github.com/SystemsBiologyOfJianghanUniversity/miRvial

6. IntPred: a structure-based predictor of protein-protein interaction sites (Bioinformatics)

Abstract

Motivation: Protein-protein interactions are vital for protein function with the average protein having between three and ten interacting partners. Knowledge of precise protein-protein interfaces comes from crystal structures deposited in the Protein Data Bank (PDB), but only 50% of structures in the PDB are complexes. There is therefore a need to predict protein-protein interfaces in silico, and various methods exist for this purpose. Here we explore the use of a predictor based on structural features and which exploits random forest machine learning, comparing its performance with a number of popular established methods.

Results: On an independent test set of obligate and transient complexes, our IntPred predictor performs well (MCC=0.370, ACC=0.811, SPEC=0.916, SENS=0.411) and compares favourably with other methods. Overall, IntPred ranks second of six methods tested with SPPIDER having slightly better overall performance (MCC=0.410, ACC=0.759, SPEC=0.783, SENS=0.676), but considerably worse specificity than IntPred. As with SPPIDER, using an independent test set of obligate complexes enhanced performance (MCC=0.381) while performance is somewhat reduced on a dataset of transient complexes (MCC=0.303). The trade-off between sensitivity and specificity compared with SPPIDER suggests that the choice of the appropriate tool is application-dependent.
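
For readers less familiar with the four scores quoted above, the sketch below computes MCC, accuracy, specificity, and sensitivity from a binary confusion matrix using their standard definitions. The counts are hypothetical and are not taken from the IntPred or SPPIDER evaluations.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification scores from confusion-matrix counts."""
    sens = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    spec = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    acc = (tp + tn) / (tp + fp + tn + fn)
    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"MCC": mcc, "ACC": acc, "SPEC": spec, "SENS": sens}


# Hypothetical per-residue counts: interface residues are the positive class.
scores = binary_metrics(tp=40, fp=20, tn=180, fn=60)
for name, value in scores.items():
    print(f"{name}={value:.3f}")
```

Because interface residues are usually the minority class, accuracy alone can look high for a predictor that rarely predicts an interface. MCC takes all four cells of the confusion matrix into account, which makes it more informative for such imbalanced data.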

7. DeepLoc: prediction of protein subcellular localization using deep learning (Bioinformatics)

Abstract

Motivation: The prediction of eukaryotic protein subcellular localization is a well-studied topic in bioinformatics due to its relevance in proteomics research. Many machine learning methods have been successfully applied in this task, but in most of them, predictions rely on annotation of homologues from knowledge databases. For novel proteins where no annotated homologues exist, and for predicting the effects of sequence variants, it is desirable to have methods for predicting protein properties from sequence information only.

Results: Here, we present a prediction algorithm using deep neural networks to predict protein subcellular localization relying only on sequence information. At its core, the prediction model uses a recurrent neural network that processes the entire protein sequence and an attention mechanism identifying protein regions important for the subcellular localization. The model was trained and tested on a protein dataset extracted from one of the latest UniProt releases, in which experimentally annotated proteins follow more stringent criteria than previously. We demonstrate that our model achieves a good accuracy (78% for 10 categories; 92% for membrane-bound or soluble), outperforming current state-of-the-art algorithms, including those relying on homology information.
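
The attention step described above can be sketched in a few lines of plain Python. This is only an illustration of generic dot-product attention pooling, not DeepLoc's actual architecture: the per-residue hidden states (which a recurrent network would produce) and the query vector are hypothetical, hand-written values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weight each position by softmax(h . query) and sum into one vector."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context


# Hypothetical hidden states for a three-residue sequence (dimension 2).
hidden = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
query = [1.0, 1.0]  # would be a learned parameter in a real model

weights, context = attention_pool(hidden, query)
```

The weights sum to one across sequence positions, so they can be read as a per-residue importance profile, and the fixed-length context vector is what a downstream classifier would consume.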

8. SNP genotyping and parameter estimation in polyploids using low-coverage sequencing data (Bioinformatics)

Abstract

Motivation

Genotyping and parameter estimation using high throughput sequencing data are everyday tasks for population geneticists, but methods developed for diploids are typically not applicable to polyploid taxa. This is due to their duplicated chromosomes, as well as the complex patterns of allelic exchange that often accompany whole genome duplication (WGD) events. For WGDs within a single lineage (autopolyploids), inbreeding can result from mixed mating and/or double reduction. For WGDs that involve hybridization (allopolyploids), alleles are typically inherited through independently segregating subgenomes.

Results

We present two new models for estimating genotypes and population genetic parameters from genotype likelihoods for auto- and allopolyploids. We then use simulations to compare these models to existing approaches at varying depths of sequencing coverage and ploidy levels. These simulations show that our models typically have lower levels of estimation error for genotype and parameter estimates, especially when sequencing coverage is low. Finally, we also apply these models to two empirical data sets from the literature. Overall, we show that the use of genotype likelihoods to model non-standard inheritance patterns is a promising approach for conducting population genomic inferences in polyploids.
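
To make the idea of genotype likelihoods in a polyploid concrete, here is a minimal sketch under simplifying assumptions: a biallelic SNP, independent reads, a symmetric per-read error rate, and a uniform prior over genotypes. It is a textbook binomial model, not either of the paper's two models, which additionally account for inbreeding and subgenome inheritance. The function names and the error rate are hypothetical.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, ploidy, error=0.01):
    """P(read counts | g) for g = 0..ploidy copies of the alternate allele."""
    n = ref_reads + alt_reads
    likelihoods = []
    for g in range(ploidy + 1):
        frac = g / ploidy
        # Chance that one read shows the alternate allele, allowing for
        # symmetric sequencing error.
        p_alt = frac * (1 - error) + (1 - frac) * error
        likelihoods.append(
            comb(n, alt_reads) * p_alt ** alt_reads * (1 - p_alt) ** ref_reads
        )
    return likelihoods


def posterior_uniform_prior(likelihoods):
    """Normalize likelihoods into posteriors under a uniform genotype prior."""
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]


# Hypothetical low-coverage site in a tetraploid: 3 reference reads, 1 alternate.
liks = genotype_likelihoods(ref_reads=3, alt_reads=1, ploidy=4)
post = posterior_uniform_prior(liks)
best_dosage = post.index(max(post))
```

With only four reads, several dosages keep appreciable posterior mass. That is why it can help to carry the full likelihood vector into downstream parameter estimation rather than committing to a single hard genotype call.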
