Knowledge Center - 北京概普生物科技有限公司 (GapTech)

A Quick Look at August's Best Bioinformatics Preprints on bioRxiv

Bioinformatics Essentials · montreal · September 2, 2018, 17:41

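More and more preprint servers have appeared recently, giving researchers more places to post their preprint manuscripts. We found an up-to-date list of preprint servers online. Notably, the list also includes ChinaXiv from China.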

Image source: https://pbs.twimg.com/media/Di7rCOCXcAEQFfF.jpg

1. [NGS] Which metagenome binning software performs best?

AMBER: Assessment of Metagenome BinnERs (CC-BY 4.0)

Reconstructing the genomes of microbial community members is key to the interpretation of shotgun metagenome samples. Genome binning programs deconvolute reads or assembled contigs of such samples into individual bins, but assessing their quality is difficult due to the lack of evaluation software and standardized metrics. We present AMBER, an evaluation package for the comparative assessment of genome reconstructions from metagenome benchmark data sets. It calculates the performance metrics and comparative visualizations used in the first benchmarking challenge of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). As an application, we show the outputs of AMBER for ten different binnings on two CAMI benchmark data sets. AMBER is implemented in Python and available under the Apache 2.0 license on GitHub (https://github.com/CAMI-challenge/AMBER).
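To make the idea of binning metrics concrete, here is a minimal sketch of two measures commonly used to score a genome bin against a gold standard: purity and completeness, weighted by base pairs. This is not AMBER's code, and AMBER's exact definitions may differ. The function name `bin_purity_completeness`, the input layout, and the toy data are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def bin_purity_completeness(contigs):
    """Score each predicted bin against the gold-standard genome labels.

    contigs: iterable of (bin_id, true_genome, length_bp) tuples.
    Each bin is mapped to the genome contributing the most base pairs.
    purity       = bp of that genome in the bin / total bp in the bin
    completeness = bp of that genome in the bin / total bp of that genome
    (Hypothetical helper; assumes every contig of a genome appears in the input.)
    """
    bin_genome_bp = defaultdict(Counter)  # bin -> genome -> base pairs
    genome_bp = Counter()                 # genome -> total base pairs
    for bin_id, genome, length in contigs:
        bin_genome_bp[bin_id][genome] += length
        genome_bp[genome] += length

    results = {}
    for bin_id, counts in bin_genome_bp.items():
        genome, overlap = counts.most_common(1)[0]
        results[bin_id] = {
            "genome": genome,
            "purity": overlap / sum(counts.values()),
            "completeness": overlap / genome_bp[genome],
        }
    return results


# Toy gold standard: two genomes split imperfectly across two bins.
toy = [
    ("bin1", "genomeA", 800),
    ("bin1", "genomeB", 200),
    ("bin2", "genomeB", 600),
    ("bin2", "genomeA", 200),
]
metrics = bin_purity_completeness(toy)
for bin_id, m in sorted(metrics.items()):
    print(bin_id, m)
```

A bin with high purity but low completeness is clean yet fragmentary, while the reverse indicates a merged bin, which is why the two numbers are usually reported together.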

2. [Population genetics] Quality assessment of the 1000 Genomes Project data

Evaluating the quality of the 1000 Genomes Project data (CC-BY-ND 4.0)

Data from the 1000 Genomes project is quite often used as a reference for human genomic analysis. However, its accuracy needs to be assessed to understand the quality of predictions made using this reference. We present here an assessment of the genotype, phasing, and imputation accuracy data in the 1000 Genomes project. We compare the phased haplotype calls from the 1000 Genomes project to experimentally phased haplotypes for 28 of the same individuals sequenced using the 10X Genomics platform. We observe that phasing and imputation for rare variants are unreliable, which likely reflects the limited sample size of the 1000 Genomes project data. Further, it appears that using a population specific reference panel does not improve the accuracy of imputation over using the entire 1000 Genomes data set as a reference panel. We also note that the error rates and trends depend on the choice of definition of error, and hence any error reporting needs to take these definitions into account.
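The abstract's closing point, that error rates depend on how error is defined, can be illustrated with two common ways of scoring a phased haplotype against a truth set. The sketch below is not taken from the preprint. The function names and the toy haplotypes are assumptions, and each haplotype is represented only by the alleles (0 or 1) carried on its first copy at heterozygous sites.

```python
def switch_error_rate(truth, predicted):
    """Fraction of adjacent heterozygous site pairs whose relative phase flips."""
    if len(truth) != len(predicted):
        raise ValueError("haplotypes must cover the same sites")
    if len(truth) < 2:
        return 0.0
    # True where the predicted first haplotype carries the same allele as the truth.
    agree = [t == p for t, p in zip(truth, predicted)]
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    return switches / (len(agree) - 1)


def hamming_error_rate(truth, predicted):
    """Fraction of sites on the wrong haplotype, allowing a global label swap."""
    if len(truth) != len(predicted):
        raise ValueError("haplotypes must cover the same sites")
    if not truth:
        return 0.0
    mismatches = sum(1 for t, p in zip(truth, predicted) if t != p)
    return min(mismatches, len(truth) - mismatches) / len(truth)


# Toy example: one phase switch after the second site.
truth = [0, 1, 1, 0, 1, 0]
predicted = [0, 1, 0, 1, 0, 1]
print(switch_error_rate(truth, predicted))   # one switch over five adjacent pairs
print(hamming_error_rate(truth, predicted))  # two of six sites on the wrong haplotype
```

In this toy case a single switch yields a switch error rate of 1/5 but a Hamming error rate of 2/6, so the same phasing result looks different depending on the metric chosen.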

3. [Genomics] SNPpet: neural network models for deciphering noncoding regulatory sequences

Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays (CC-BY-NC-ND 4.0)

The relationship between noncoding DNA sequence and gene expression is not well-understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present SNPpet, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained SNPpet on the Sharpr-MPRA dataset that measures the activity of ~500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. SNPpet's predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of SNPpet to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
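For readers less familiar with the Spearman correlation quoted in the abstract, the sketch below shows how such a rank correlation between predicted and measured activities could be computed. It is a plain-Python illustration, not code from SNPpet. The function names and the toy values are assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data: hypothetical measured activities and model predictions.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.0, 1.0, 4.0, 3.0, 5.0]
rho = spearman_rho(measured, predicted)
print(rho)
```

Because only ranks matter, the statistic rewards a model for ordering sequences correctly by activity even when the predicted values are on a different scale from the measurements.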

4. [Database] A database dedicated to microorganisms in dairy products

DAIRYdb: A manually curated gold standard reference database for improved taxonomy annotation of 16S rRNA gene sequences from dairy products (CC-BY-NC 4.0)

Reads assignment to taxonomic units is a key step in microbiome analysis pipelines. To date, accurate taxonomy annotation, particularly at species rank, is still challenging due to the short size of read sequences and differently curated classification databases. However, the close phylogenetic relationship between species encountered in dairy products requires accurate species annotation to achieve sufficient phylogenetic resolution for further downstream ecological studies or for food diagnostics. Taxonomy annotation in universal 16S databases with environmental sequences like Silva, RDP or Greengenes is based on predictions rather than on studies of type strains or isolates. We provide a manually curated database composed of 10'290 full-length 16S rRNA gene sequences from prokaryotes tailored for dairy products analysis (https://github.com/marcomeola/DAIRYdb). The performance of the DAIRYdb was compared with the universal databases Silva, LTP, RDP and Greengenes. The DAIRYdb significantly outperformed all other databases independently of the classification algorithm by enabling higher accurate taxonomy annotation down to the species rank. The DAIRYdb accurately annotates over 90% of the sequences of either single or paired hypervariable regions automatically. The manually curated DAIRYdb strongly improves taxonomic classification accuracy for microbiome studies in dairy environments. The DAIRYdb is a practical solution that enables automatization of this key step, thus facilitating the routine application of NGS microbiome analyses for microbial ecology studies and diagnostics in dairy products.

5. [Evolution] Cyanobacteria can produce methane? If confirmed, this would rewrite the textbooks

Widespread formation of methane by Cyanobacteria in aquatic and terrestrial environments (CC-BY-NC-ND 4.0)

Evidence is accumulating to challenge the paradigm that biogenic methanogenesis, traditionally considered a strictly anaerobic process, is exclusive to Archaea. This change in our perception on methane production has important consequences for the feedback with global climate. Our study shows that cyanobacteria produce methane at substantial rates under light and dark oxic conditions, demonstrating biogenic methane production within the Bacteria, the second prokaryotic domain. Biogenic methane production was enhanced during oxygenic photosynthesis and directly attributed to the cyanobacteria by stable isotope labelling. Global production of methane by cyanobacteria is conservatively estimated to be up to 20 Tg CH4 per year, about a third of the methane thought to be emitted from non-wetland biogenic sources. Climate change, leading to worldwide increases in cyanobacterial blooms frequency and intensity will accordingly have a direct feedback on warming. With a ubiquitous presence on Earth for 3.5 billion years, cyanobacteria have had and will continue to have a substantial, yet not considered, impact on the global methane budget.

6. [Genome sequencing] Genome sequencing of the soybean cyst nematode

The genome of the soybean cyst nematode (Heterodera glycines) reveals complex patterns of duplications involved in the evolution of parasitism genes (CC-BY 4.0)

Heterodera glycines, commonly referred to as the soybean cyst nematode (SCN), is an obligatory and sedentary plant parasite that causes over a billion-dollar yield loss to soybean production annually. Although there are genetic determinants that render soybean plants resistant to certain nematode genotypes, resistant soybean cultivars are increasingly ineffective because their multi-year usage has selected for virulent H. glycines populations. The parasitic success of H. glycines relies on the comprehensive re-engineering of an infection site into a syncytium, as well as the long-term suppression of host defense to ensure syncytial viability. At the forefront of these complex molecular interactions are effectors, the proteins secreted by H. glycines into host root tissues. The mechanisms of effector acquisition, diversification, and selection need to be understood before effective control strategies can be developed, but the lack of an annotated genome has been a major roadblock. Here, we use PacBio long-read technology to assemble a H. glycines genome of 738 contigs into 123Mb with annotations for 29,769 genes. The genome contains significant numbers of repeats (34%), tandem duplicates (18.7Mb), and horizontal gene transfer events (151 genes). Using previously published effector sequences, the newly generated H. glycines genome, and comparisons to other nematode genomes, we investigate the evolutionary mechanisms responsible for the emergence and diversification of effector genes.

7. [Comparative genomics] Large-scale comparative genomics and DNA methylome analysis of arthropods

The Genomic Basis of Arthropod Diversity (CC-BY 4.0)

Arthropods comprise the largest and most diverse phylum on Earth and play vital roles in nearly every ecosystem. Their diversity stems in part from variations on a conserved body plan, resulting from and recorded in adaptive changes in the genome. Dissection of the genomic record of sequence change enables broad questions regarding genome evolution to be addressed, even across hyper-diverse taxa within arthropods. Using 76 whole genome sequences representing 21 orders spanning more than 500 million years of arthropod evolution, we document changes in gene and protein domain content and provide temporal and phylogenetic context for interpreting these innovations. We identify many novel gene families that arose early in the evolution of arthropods and during the diversification of insects into modern orders. We reveal unexpected variation in patterns of DNA methylation across arthropods and examples of gene family and protein domain evolution coincident with the appearance of notable phenotypic and physiological adaptations such as flight, metamorphosis, sociality and chemoperception. These analyses demonstrate how large-scale comparative genomics can provide broad new insights into the genotype to phenotype map and generate testable hypotheses about the evolution of animal diversity.

8. [NGS] VALOR2: new software for characterizing genomic structural variation

Characterization of segmental duplications and large inversions using Linked-Reads (CC-BY-NC 4.0)

Many algorithms aimed at characterizing genomic structural variation (SV) have been developed since the inception of high-throughput sequencing. However, the full spectrum of SVs in the human genome is not yet assessed. Most of the existing methods focus on discovery and genotyping of deletions, insertions, and mobile elements. Detection of balanced SVs with no gain or loss of genomic segments (e.g., inversions) is particularly a challenging task. Long read sequencing has been leveraged to find short inversions but there is still a need to develop methods to detect large genomic inversions. Furthermore, currently there are no algorithms to predict the insertion locus of large interspersed segmental duplications. Here we propose novel algorithms to characterize large (>40Kbp) interspersed segmental duplications and (>80Kbp) inversions using Linked-Read sequencing data. Linked-Read sequencing provides long range information, where Illumina reads are tagged with barcodes that can be used to assign short reads to pools of larger (30-50 Kbp) molecules. Our methods rely on split molecule sequence signature that we have previously described. Similar to the split read, split molecules refer to large segments of DNA that span an SV breakpoint. Therefore, when mapped to the reference genome, the mapping of these segments would be discontinuous. We redesign our earlier algorithm, VALOR, to specifically leverage Linked-Read sequencing data to discover large inversions and characterize interspersed segmental duplications. We implement our new algorithms in a new software package, called VALOR2.
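The split molecule idea can be sketched in a few lines: reads sharing a barcode are clustered by mapped position, and a barcode whose reads fall into two or more well-separated segments is a candidate for spanning a breakpoint. This is only a toy illustration under strong assumptions, not the VALOR2 algorithm. The function names, the single-chromosome input, and the 50 Kbp `max_gap` threshold are hypothetical. In real Linked-Read data one barcode can tag several unrelated molecules, so a real caller needs additional evidence before calling an SV.

```python
from collections import defaultdict


def molecule_segments(reads, max_gap=50_000):
    """Cluster reads into per-barcode segments on a single chromosome.

    reads: iterable of (barcode, position) pairs.
    Consecutive reads of a barcode further apart than max_gap (a hypothetical
    threshold) start a new segment. Returns barcode -> list of (start, end).
    """
    by_barcode = defaultdict(list)
    for barcode, pos in reads:
        by_barcode[barcode].append(pos)

    segments = {}
    for barcode, positions in by_barcode.items():
        positions.sort()
        start = end = positions[0]
        segs = []
        for pos in positions[1:]:
            if pos - end > max_gap:
                segs.append((start, end))  # close the current segment
                start = pos
            end = pos
        segs.append((start, end))
        segments[barcode] = segs
    return segments


def candidate_split_barcodes(segments):
    """Barcodes whose reads map discontinuously (two or more segments)."""
    return sorted(b for b, segs in segments.items() if len(segs) >= 2)


# Toy reads: BC1 maps contiguously, BC2 maps to two distant segments.
reads = [
    ("BC1", 1_000), ("BC1", 9_000), ("BC1", 20_000),
    ("BC2", 5_000), ("BC2", 15_000), ("BC2", 400_000), ("BC2", 410_000),
]
segments = molecule_segments(reads)
print(segments)
print(candidate_split_barcodes(segments))
```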

9. [NGS] Magic-BLAST: a BLAST with a touch of magic

Magic-BLAST, an accurate DNA and RNA-seq aligner for long and short reads (The copyright holder for this preprint is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.)

Next-generation sequencing technologies can produce tens of millions of reads, often paired-end, from transcripts or genomes. But few programs can align RNA on the genome and accurately discover introns, especially with long reads. To address these issues, we introduce Magic-BLAST, a new aligner based on ideas from the Magic pipeline. It uses innovative techniques that include the optimization of a spliced alignment score and selective masking during seed selection. We evaluate the performance of Magic-BLAST to accurately map short or long sequences and its ability to discover introns on real RNA-seq data sets from PacBio, Roche and Illumina runs, and on six benchmarks, and compare it to other popular aligners. Additionally, we look at alignments of human idealized RefSeq mRNA sequences perfectly matching the genome. We show that Magic-BLAST is the best at intron discovery over a wide range of conditions. It is versatile and robust to high levels of mismatches or extreme base composition and works well with very long reads. It is reasonably fast. It can align reads to a BLAST database or a FASTA file. It can accept a FASTQ file as input or automatically retrieve an accession from the SRA repository at the NCBI.

10. [Evolution] The latest eukaryote tree of life

New phylogenomic analysis of the enigmatic phylum Telonemia further resolves the eukaryote tree of life (CC-BY-NC-ND 4.0)

The broad-scale tree of eukaryotes is constantly improving, but the evolutionary origin of several major groups remains unknown. Resolving the phylogenetic position of these 'orphan' groups is important, especially those that originated early in evolution, because they represent missing evolutionary links between established groups. Telonemia is one such orphan taxon for which little is known. The group is composed of molecularly diverse biflagellated protists, often prevalent although not abundant in aquatic environments. Telonemia has been hypothesized to represent a deeply diverging eukaryotic phylum but no consensus exists as to where it is placed in the tree. Here, we established cultures and report the phylogenomic analyses of three new transcriptome datasets for divergent telonemid lineages. All our phylogenetic reconstructions, based on 248 genes and using site-heterogeneous mixture models, robustly resolve the evolutionary origin of Telonemia as sister to the Sar supergroup. This grouping remains well supported when as few as 60% of the genes are randomly subsampled, thus is not sensitive to the sets of genes used but requires a minimal alignment length to recover enough phylogenetic signal. Telonemia occupies a crucial position in the tree to examine the origin of Sar, one of the most lineage-rich eukaryote supergroups. We propose the moniker 'TSAR' to accommodate this new mega-assemblage in the phylogeny of eukaryotes.
