1. The Apostasia genome and the evolution of orchids (Nature)
Abstract
Constituting approximately 10% of flowering plant species, orchids (Orchidaceae) display unique flower morphologies, possess an extraordinary diversity in lifestyle, and have successfully colonized almost every habitat on Earth. Here we report the draft genome sequence of Apostasia shenzhenica, a representative of one of two genera that form a sister lineage to the rest of the Orchidaceae, providing a reference for inferring the genome content and structure of the most recent common ancestor of all extant orchids and improving our understanding of their origins and evolution. In addition, we present transcriptome data for representatives of Vanilloideae, Cypripedioideae and Orchidoideae, and novel third-generation genome data for two species of Epidendroideae, covering all five orchid subfamilies. A. shenzhenica shows clear evidence of a whole-genome duplication, which is shared by all orchids and occurred shortly before their divergence. Comparisons between A. shenzhenica and other orchids and angiosperms also permitted the reconstruction of an ancestral orchid gene toolkit. We identify new gene families, gene family expansions and contractions, and changes within MADS-box gene classes, which control a diverse suite of developmental processes, during orchid evolution. This study sheds new light on the genetic mechanisms underpinning key orchid innovations, including the development of the labellum and gynostemium, pollinia, and seeds without endosperm, as well as the evolution of epiphytism; reveals relationships between the Orchidaceae subfamilies; and helps clarify the evolutionary history of orchids within the angiosperms.
2. Shotgun metagenomics, from sampling to analysis (Nature Biotechnology)
Abstract
Diverse microbial communities of bacteria, archaea, viruses and single-celled eukaryotes have crucial roles in the environment and in human health. However, microbes are frequently difficult to culture in the laboratory, which can confound cataloging of members and understanding of how communities function. High-throughput sequencing technologies and a suite of computational pipelines have been combined into shotgun metagenomics methods that have transformed microbiology. Still, computational approaches to overcome the challenges that affect both assembly-based and mapping-based metagenomic profiling, particularly of high-complexity samples or environments containing organisms with limited similarity to sequenced genomes, are needed. Understanding the functions and characterizing specific strains of these communities offers biotechnological promise in therapeutic discovery and innovative ways to synthesize products using microbial factories and can pinpoint the contributions of microorganisms to planetary, animal and human health.
3. Analysis of somatic microsatellite indels identifies driver events in human tumors (Nature Biotechnology)
Abstract
Microsatellites (MSs) are tracts of variable-length repeats of short DNA motifs that exhibit high rates of mutation in the form of insertions or deletions (indels) of the repeated motif. Despite their prevalence, the contribution of somatic MS indels to cancer has been largely unexplored, owing to difficulties in detecting them in short-read sequencing data. Here we present two tools: MSMuTect, for accurate detection of somatic MS indels, and MSMutSig, for identification of genes containing MS indels at a higher frequency than expected by chance. Applying MSMuTect to whole-exome data from 6,747 human tumors representing 20 tumor types, we identified >1,000 previously undescribed MS indels in cancer genes. Additionally, we demonstrate that the number and pattern of MS indels can accurately distinguish microsatellite-stable tumors from tumors with microsatellite instability, thus potentially improving classification of clinically relevant subgroups. Finally, we identified seven MS indel driver hotspots: four in known cancer genes (ACVR2A, RNF43, JAK1, and MSH3) and three in genes not previously implicated as cancer drivers (ESRP1, PRDM2, and DOCK3).
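The idea of flagging loci that are hit "at a higher frequency than expected by chance" can be illustrated with a plain one-sided binomial test. This is only a generic sketch and not MSMutSig's actual statistical model, which the abstract does not describe; the function name, the background rate, and the tumor counts below are all hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical numbers: 12 of 200 tumors carry an indel at one locus,
# while the assumed background rate is 1% per tumor.
p_value = binomial_upper_tail(k=12, n=200, p=0.01)
print(f"P(at least 12 hits by chance) = {p_value:.3g}")
```

A real analysis would also need a locus-specific background model and a multiple-testing correction across all loci tested.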
4. Dynamic DNA methylation reconfiguration during seed development and germination (Genome Biology)
Abstract
Background
Unlike animals, plants can pause their life cycle as dormant seeds. In both plants and animals, DNA methylation is involved in the regulation of gene expression and genome integrity. In animals, reprogramming erases and re-establishes DNA methylation during development. However, knowledge of reprogramming or reconfiguration in plants has been limited to pollen and the central cell. To better understand epigenetic reconfiguration in the embryo, which forms the plant body, we compared time-series methylomes of dry and germinating seeds to publicly available seed development methylomes.
Results
Time-series whole genome bisulfite sequencing reveals extensive gain of CHH methylation during seed development and drastic loss of CHH methylation during germination. These dynamic changes in methylation mainly occur within transposable elements. Active DNA methylation during seed development depends on both RNA-directed DNA methylation and heterochromatin formation pathways, whereas global demethylation during germination occurs in a passive manner. However, an active DNA demethylation pathway is initiated during late seed development.
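For readers unfamiliar with the term, "CHH" is one of the three sequence contexts in which plant cytosines are methylated (CG, CHG and CHH, where H stands for A, C or T). The sketch below classifies a cytosine on the forward strand by its two downstream bases. The function name is an illustrative assumption and is not taken from the paper's pipeline; reverse-strand cytosines would need the reverse complement.

```python
from typing import Optional

H_BASES = {"A", "C", "T"}  # IUPAC code H: any base except G


def cytosine_context(seq: str, pos: int) -> Optional[str]:
    """Classify the forward-strand cytosine at 0-based index pos.

    Returns "CG", "CHG" or "CHH", or None when the base is not a C or
    the downstream bases are missing or ambiguous.
    """
    seq = seq.upper()
    if pos < 0 or pos >= len(seq) or seq[pos] != "C":
        return None
    downstream = seq[pos + 1 : pos + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) == 2 and downstream[0] in H_BASES:
        if downstream[1] == "G":
            return "CHG"
        if downstream[1] in H_BASES:
            return "CHH"
    return None


print(cytosine_context("CTA", 0))
```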
Conclusions
This study provides new insights into dynamic DNA methylation reprogramming events during seed development and germination and suggests possible mechanisms of regulation. The observed sequential methylation/demethylation cycle suggests an important role of DNA methylation in seed dormancy.
5. Extensive transcriptomic and epigenomic remodelling occurs during Arabidopsis thaliana germination (Genome Biology)
Abstract
Background
Seed germination involves progression from complete metabolic dormancy to a highly active, growing seedling. Many factors regulate germination and these interact extensively, forming a complex network of inputs that control the seed-to-seedling transition. Our understanding of the direct regulation of gene expression and the dynamic changes in the epigenome and small RNAs during germination is limited. The interactions between genome, transcriptome and epigenome must be revealed in order to identify the regulatory mechanisms that control seed germination.
Results
We present an integrated analysis of high-resolution RNA sequencing, small RNA sequencing and MethylC sequencing over ten developmental time points in Arabidopsis thaliana seeds, finding extensive transcriptomic and epigenomic transformations associated with seed germination. We identify previously unannotated loci from which messenger RNAs are expressed transiently during germination and find widespread alternative splicing and divergent isoform abundance of genes involved in RNA processing and splicing. We generate the first dynamic transcription factor network model of germination, identifying known and novel regulatory factors. Expression of both microRNA and short interfering RNA loci changes significantly during germination, particularly between the seed and the post-germinative seedling. These are associated with changes in gene expression and large-scale demethylation observed towards the end of germination, as the epigenome transitions from an embryo-like to a vegetative seedling state.
Conclusions
This study reveals the complex dynamics and interactions of the transcriptome and epigenome during seed germination, including the extensive remodelling of the seed DNA methylome from an embryo-like to vegetative-like state during the seed-to-seedling transition. Data are available for exploration in a user-friendly browser at
https://jbrowse.latrobe.edu.au/germination_epigenome.
6. Splatter: simulation of single-cell RNA sequencing data (Genome Biology)
Abstract
As single-cell RNA sequencing (scRNA-seq) technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed, and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here, we present the Splatter Bioconductor package for simple, reproducible, and well-documented simulation of scRNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types, or differentiation paths.
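To give a feel for what a gamma-Poisson simulation means, here is a minimal standard-library sketch: each gene's mean expression is drawn from a gamma distribution, and each cell's count for that gene is then drawn from a Poisson distribution around that mean. This does not attempt to reproduce Splat or the Splatter API (Splatter is an R/Bioconductor package); the function names and parameter values are assumptions chosen for illustration.

```python
import math
import random
from typing import List


def poisson(rng: random.Random, lam: float) -> int:
    """Draw one Poisson(lam) sample with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_counts(n_genes: int, n_cells: int, shape: float = 2.0,
                    mean: float = 5.0, seed: int = 0) -> List[List[int]]:
    """Return a genes x cells count matrix from a gamma-Poisson hierarchy."""
    rng = random.Random(seed)
    scale = mean / shape  # gamma mean = shape * scale
    gene_means = [rng.gammavariate(shape, scale) for _ in range(n_genes)]
    return [[poisson(rng, m) for _ in range(n_cells)] for m in gene_means]


counts = simulate_counts(n_genes=200, n_cells=10, seed=1)
print(counts[0])
```

Marginally, each count follows a negative binomial distribution, which is why this hierarchy is a common starting point for overdispersed count data.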
7. Identification of high-confidence RNA regulatory elements by combinatorial classification of RNA–protein binding sites (Genome Biology)
Abstract
Crosslinking immunoprecipitation sequencing (CLIP-seq) technologies have enabled researchers to characterize transcriptome-wide binding sites of RNA-binding protein (RBP) with high resolution. We apply a soft-clustering method, RBPgroup, to various CLIP-seq datasets to group together RBPs that specifically bind the same RNA sites. Such combinatorial clustering of RBPs helps interpret CLIP-seq data and suggests functional RNA regulatory elements. Furthermore, we validate two RBP–RBP interactions in cell lines. Our approach links proteins and RNA motifs known to possess similar biochemical and cellular properties and can, when used in conjunction with additional experimental data, identify high-confidence RBP groups and their associated RNA regulatory elements.
8. Ritornello: high fidelity control-free chromatin immunoprecipitation peak calling (Nucleic Acids Research)
Abstract
With the advent of next generation high-throughput DNA sequencing technologies, omics experiments have become the mainstay for studying diverse biological effects on a genome wide scale. Chromatin immunoprecipitation (ChIP-seq) is the omics technique that enables genome wide localization of transcription factor (TF) binding or epigenetic modification events. Since the inception of ChIP-seq in 2007, many methods have been developed to infer ChIP-target binding loci from the resultant reads after mapping them to a reference genome. However, interpreting these data has proven challenging, and as such these algorithms have several shortcomings, including susceptibility to false positives due to artifactual peaks, poor localization of binding sites and the requirement for a total DNA input control which increases the cost of performing these experiments. We present Ritornello, a new approach for finding TF-binding sites in ChIP-seq, with roots in digital signal processing that addresses all of these problems. We show that Ritornello generally performs equally or better than the peak callers tested and recommended by the ENCODE consortium, but in contrast, Ritornello does not require a matched total DNA input control to avoid false positives, effectively decreasing the sequencing cost to perform ChIP-seq. Ritornello is freely available at
https://github.com/KlugerLab/Ritornello.
9. IGESS: a statistical approach to integrating individual-level genotype data and summary statistics in genome-wide association studies (Bioinformatics)
Abstract
Motivation: Results from genome-wide association studies (GWAS) suggest that a complex phenotype is often affected by many variants with small effects, known as ‘polygenicity’. Tens of thousands of samples are often required to ensure statistical power of identifying these variants with small effects. However, it is often the case that a research group can only get approval for the access to individual-level genotype data with a limited sample size (e.g. a few hundreds or thousands). Meanwhile, summary statistics generated using single-variant-based analysis are becoming publicly available. The sample sizes associated with the summary statistics datasets are usually quite large. How to make the most efficient use of existing abundant data resources largely remains an open question.
Results: In this study, we propose a statistical approach, IGESS, to increasing statistical power of identifying risk variants and improving accuracy of risk prediction by integrating individual level genotype data and summary statistics. An efficient algorithm based on variational inference is developed to handle the genome-wide analysis. Through comprehensive simulation studies, we demonstrated the advantages of IGESS over the methods which take either individual-level data or summary statistics data as input. We applied IGESS to perform integrative analysis of Crohn's Disease from WTCCC and summary statistics from other studies. IGESS was able to significantly increase the statistical power of identifying risk variants and improve the risk prediction accuracy from 63.2% (±0.4%) to 69.4% (±0.1%) using about 240 000 variants.
10. FunGAP: Fungal Genome Annotation Pipeline using evidence-based gene model evaluation (Bioinformatics)
Abstract
Motivation: Successful genome analysis depends on the quality of gene prediction. Although fungal genome sequencing and assembly have become trivial, its annotation procedure has not been standardized yet.
Results: FunGAP predicts protein-coding genes in a fungal genome assembly. To attain high-quality gene models, this program runs multiple gene predictors, evaluates all predicted genes, and assembles gene models that are highly supported by homology to known sequences. To do this, we built a scoring function to estimate the congruency of each gene model based on known protein or domain homology.
11. Identifying topologically associating domains and subdomains by Gaussian Mixture model And Proportion test (Nature Communications)
Abstract
The spatial organization of the genome plays a critical role in regulating gene expression. Recent chromatin interaction mapping studies have revealed that topologically associating domains and subdomains are fundamental building blocks of the three-dimensional genome. Identifying such hierarchical structures is a critical step toward understanding the three-dimensional structure–function relationship of the genome. Existing computational algorithms lack statistical assessment of domain predictions and are computationally inefficient for high-resolution Hi-C data. We introduce the Gaussian Mixture model And Proportion test (GMAP) algorithm to address the above-mentioned challenges. Using simulated and experimental Hi-C data, we show that domains identified by GMAP are more consistent with multiple lines of supporting evidence than three state-of-the-art methods. Application of GMAP to normal and cancer cells reveals several unique features of subdomain boundary as compared to domain boundary, including its higher dynamics across cell types and enrichment for somatic mutations in cancer.
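The abstract does not say exactly what GMAP's proportion test compares, so the following is only a generic two-proportion z-test, included to show the kind of statistical assessment such a test provides. The function name and the example counts are hypothetical and are not taken from the GMAP software.

```python
import math
from typing import Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Two-sided z-test for equality of two proportions x1/n1 and x2/n2.

    Returns (z, p_value) using the pooled standard error.
    """
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (x1 / n1 - x2 / n2) / se
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 80 of 100 bins on one side of a candidate boundary
# meet some criterion, versus 20 of 100 bins on the other side.
z, p = two_proportion_z_test(80, 100, 20, 100)
print(f"z = {z:.2f}, p = {p:.3g}")
```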
12. A genome-wide association study identifies a novel susceptibility locus for the immunogenicity of polyethylene glycol (Nature Communications)
Abstract
Conjugation of polyethylene glycol (PEG) to therapeutic molecules can improve bioavailability and therapeutic efficacy. However, some healthy individuals have pre-existing anti-PEG antibodies and certain patients develop anti-PEG antibody during treatment with PEGylated medicines, suggesting that genetics might play a role in PEG immunogenicity. Here we perform genome-wide association studies for anti-PEG IgM and IgG responses in Han Chinese with 177 and 140 individuals, defined as positive for anti-PEG IgM and IgG responses, respectively, and with 492 subjects without either anti-PEG IgM or IgG as controls. We validate the association results in the replication cohort, consisting of 211 and 192 subjects with anti-PEG IgM and anti-PEG IgG, respectively, and 596 controls. We identify the immunoglobulin heavy chain (IGH) locus to be associated with anti-PEG IgM response at genome-wide significance (P = 2.23 × 10^-22). Our findings may provide novel genetic markers for predicting the immunogenicity of PEG and efficacy of PEGylated therapeutics.
13. Genome-wide association study identifies 112 new loci for body mass index in the Japanese population (Nature Genetics)
Abstract
Obesity is a risk factor for a wide variety of health problems. In a genome-wide association study (GWAS) of body mass index (BMI) in Japanese people (n = 173,430), we found 85 loci significantly associated with obesity (P < 5.0 × 10^-8), of which 51 were previously unknown. We conducted trans-ancestral meta-analyses by integrating these results with the results from a GWAS of Europeans and identified 61 additional new loci. In total, this study identifies 112 novel loci, doubling the number of previously known BMI-associated loci. By annotating associated variants with cell-type-specific regulatory marks, we found enrichment of variants in CD19+ cells. We also found significant genetic correlations between BMI and lymphocyte count (P = 6.46 × 10^-5, rg = 0.18) and between BMI and multiple complex diseases. These findings provide genetic evidence that lymphocytes are relevant to body weight regulation and offer insights into the pathogenesis of obesity.
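The 5.0 × 10^-8 cutoff quoted in the abstract is the conventional genome-wide significance threshold, which equals a Bonferroni correction of 0.05 across roughly one million independent tests. The sketch below shows that arithmetic and applies the cutoff to a few made-up variant P values; the variant IDs and P values are hypothetical and are not data from the study.

```python
import math
from typing import Dict, List

# Bonferroni-style arithmetic behind the conventional threshold.
bonferroni_alpha = 0.05 / 1_000_000
GENOME_WIDE_ALPHA = 5e-8
print(math.isclose(bonferroni_alpha, GENOME_WIDE_ALPHA))


def significant_variants(pvalues: Dict[str, float],
                         alpha: float = GENOME_WIDE_ALPHA) -> List[str]:
    """Return the IDs of variants whose P value is strictly below alpha."""
    return sorted(v for v, p in pvalues.items() if p < alpha)


# Hypothetical summary statistics, for illustration only.
example = {"variant_a": 3.1e-12, "variant_b": 4.0e-6, "variant_c": 4.9e-8}
hits = significant_variants(example)
print(hits)
```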