Knowledge Center - Beijing GapTech Biotechnology Co., Ltd. (GapTech)

bioRxiv bioinformatics highlights: a quick overview for May 2021

Bioinformatics Essentials · Montreal · June 17, 2021, 17:03

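Last month's most sensational preprint was probably the release of a complete human genome. From the first draft of the human genome in 2001 to a fully finished assembly, a full 20 years have passed. Nature covered the result in a dedicated news piece and praised it highly. The preprint comes with two companion preprints, which offer new analyses of the epigenome and the segmental duplications of this complete genome, giving readers a first glimpse of what was hidden in the former gap regions.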

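Speaking of the human genome, I recently came across an amusing tweet. John Mulley, a zoologist at Bangor University in the UK, wrote that the first sentence of an article he was reading declared that "the human genome is the most complex molecular structure in nature." His comment: "Oh dear." Indeed, calling the human genome the most complex invites plenty of criticism. At least in the ordinary sense, many organisms have genomes more structurally complex than ours: the famously complex polyploid wheat genome (five times the size of the human genome), maize, where transposons run rampant (making up 70 to 80 percent of the genome), and the crustacean water flea (Daphnia), which has about 25% more genes than humans, to name a few. Saying that humans have the most complex brain, however, is probably uncontroversial. How the human brain works is also one of the hottest topics in science. Last month, researchers from Harvard University and other institutions posted a preprint on bioRxiv claiming to have completed "a 1.4 PB (petabyte) rendering of a small sample of human brain tissue", "the largest sample of brain tissue imaged and reconstructed at this level of detail in any species to date" [1]. Read on for the details.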

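By the way, this post is the 37th installment of the monthly bioRxiv bioinformatics digest from 生信人, which means the column has now been running for three years. In each issue we pick ten preprints from the previous month and, alongside them, report and comment on recent news and stories from the preprint world. The number of papers we can recommend per issue is small; we hope the selection serves as a starting point that draws more attention to preprints and encourages readers to make more and better use of them in their own future submissions.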

A completely sequenced human genome

The complete sequence of a human genome: T2T-CHM13

In 2001, Celera Genomics and the International Human Genome Sequencing Consortium published their initial drafts of the human genome, which revolutionized the field of genomics. While these drafts and the updates that followed effectively covered the euchromatic fraction of the genome, the heterochromatin and many other complex regions were left unfinished or erroneous. Addressing this remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium has finished the first truly complete 3.055 billion base pair (bp) sequence of a human genome, representing the largest improvement to the human reference genome since its initial release. The new T2T-CHM13 reference includes gapless assemblies for all 22 autosomes plus Chromosome X, corrects numerous errors, and introduces nearly 200 million bp of novel sequence containing 2,226 paralogous gene copies, 115 of which are predicted to be protein coding. The newly completed regions include all centromeric satellite arrays and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies for the first time.

Segmental duplications in a complete human genome

Segmental duplications and their variation in a complete human genome

Despite their importance in disease and evolution, highly identical segmental duplications (SDs) have been among the last regions of the human reference genome (GRCh38) to be finished. Based on a complete telomere-to-telomere human genome (T2T-CHM13), we present the first comprehensive view of human SD organization. SDs account for nearly one-third of the additional sequence increasing the genome-wide estimate from 5.4% to 7.0% (218 Mbp). An analysis of 266 human genomes shows that 91% of the new T2T-CHM13 SD sequence (68.3 Mbp) better represents human copy number. We find that SDs show increased single-nucleotide variation diversity when compared to unique regions; we characterize methylation signatures that correlate with duplicate gene transcription and predict 182 novel protein-coding gene candidates. We find that 63% (35.11/55.7 Mbp) of acrocentric chromosomes consist of SDs distinct from rDNA and satellite sequences. Acrocentric SDs are 1.75-fold longer (p=0.00034) than other SDs, are frequently shared with autosomal pericentromeric regions, and are heteromorphic among human chromosomes. Comparing long-read assemblies from other human (n=12) and nonhuman primate (n=5) genomes, we use the T2T-CHM13 genome to systematically reconstruct the evolution and structural haplotype diversity of biomedically relevant (LPA, SMN) and duplicated genes (TBC1D3, SRGAP2C, ARHGAP11B) important in the expansion of the human frontal cortex. The analysis reveals unprecedented patterns of structural heterozygosity and massive evolutionary differences in SD organization between humans and their closest living relatives.

Epigenetic patterns of a complete human genome

Epigenetic Patterns in a Complete Human Genome

The completion of the first telomere-to-telomere human genome, T2T-CHM13, enables exploration of the full epigenome, removing limitations previously imposed by the missing reference sequence. Existing epigenetic studies omit unassembled and unmappable genomic regions (e.g. centromeres, pericentromeres, acrocentric chromosome arms, subtelomeres, segmental duplications, tandem repeats). Leveraging the new assembly, we were able to measure enrichment of epigenetic marks with short reads using k-mer assisted mapping methods. This granted array-level enrichment information to characterize the epigenetic regulation of these satellite repeats. Using nanopore sequencing data, we generated base-level maps of the most complete human methylome ever produced. We examined methylation patterns in satellite DNA and revealed organized patterns of methylation along individual molecules. When exploring the centromeric epigenome, we discovered a distinctive dip in centromere methylation consistent with active sites of kinetochore assembly. Through long-read chromatin accessibility measurements (nanoNOMe) paired to CUT&RUN data, we found the hypomethylated region was extremely inaccessible and paired to CENP-A/B binding. With long reads we interrogated allele-specific, long-range epigenetic patterns in complex macro-satellite arrays such as those involved in X chromosome inactivation. Using the single molecule measurements we can cluster reads based on methylation status alone, distinguishing epigenetically heterogeneous and homogeneous areas. The analysis provides a framework to investigate the most elusive regions of the human genome, applying both long and short-read technology to grant new insights into epigenetic regulation.
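
The abstract does not spell out the "k-mer assisted mapping" step. One way to picture the general idea, offered as a sketch rather than the authors' pipeline: a short-read placement inside a repeat is trusted only if it spans a k-mer that occurs exactly once in the reference. The function names, the toy reference, and k = 4 are all hypothetical.

```python
from collections import Counter


def unique_kmers(reference: str, k: int) -> set:
    """Return the k-mers that occur exactly once in the reference."""
    counts = Counter(reference[i:i + k] for i in range(len(reference) - k + 1))
    return {kmer for kmer, n in counts.items() if n == 1}


def overlaps_unique_kmer(reference: str, start: int, end: int, k: int, unique: set) -> bool:
    """True if the aligned interval [start, end) fully contains a unique k-mer."""
    return any(reference[i:i + k] in unique for i in range(start, end - k + 1))


# Hypothetical toy reference: two identical repeat arrays around a short unique insert.
reference = "ACGTACGTACGT" + "GGCTA" + "ACGTACGTACGT"
k = 4
unique = unique_kmers(reference, k)

# A placement entirely inside the repeat array is ambiguous.
print(overlaps_unique_kmer(reference, 0, 8, k, unique))   # False
# A placement that reaches into the unique insert can be anchored.
print(overlaps_unique_kmer(reference, 8, 16, k, unique))  # True
```

In a real genome k would be far larger and the k-mer counts would come from a dedicated counter; the toy only shows why a unique k-mer lets a read be placed within an otherwise repetitive array.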

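What is special about the population genetics of the human gut microbiome, which can pass through nearly ten thousand generations of evolution in 25 years?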

Comparative Population Genetics in the Human Gut Microbiome

The genetic variation in the human gut microbiome is responsible for conferring a number of crucial phenotypes like the ability to digest food and metabolize drugs. Yet, our understanding of how this variation arises and is maintained remains relatively poor. Thus, the microbiome remains a largely untapped resource, as the large number of co-existing species in this microbiome presents a unique opportunity to compare and contrast evolutionary processes across species to identify universal trends and deviations. Here we outline features of the human gut microbiome that, while not unique in isolation, as an assemblage make it a system with unparalleled potential for comparative population genomics studies. We consciously take a broad view of comparative population genetics, emphasizing how sampling a large number of species allows researchers to identify universal evolutionary dynamics in addition to new genes, which can then be leveraged to identify exceptional species that deviate from general patterns. To highlight the potential power of comparative population genetics in the microbiome, we re-analyzed patterns of purifying selection across ~40 prevalent species in the human gut microbiome to identify intriguing trends which highlight functional categories in the microbiome that may be under more or less constraint.

[Yeast] Global soil yeast diversity, from Jianping Xu's group at McMaster University, Canada

Global Patterns in Culturable Soil Yeast Diversity

Yeasts, broadly defined as unicellular fungi, fulfill essential roles in soil ecosystems as decomposers and nutrition sources for fellow soil-dwellers. Broad-scale investigations of soil yeasts pose a methodological challenge as metagenomics are of limited use on this group of fungi. Here we characterize global soil yeast diversity using fungal DNA barcoding on 1473 yeasts cultured from 3826 soil samples obtained from nine countries in six continents. We identify mean annual precipitation and international air travel as two significant predictors of soil yeast community structure and composition worldwide. Anthropogenic influences on soil yeast communities, directly via travel and indirectly via altered rainfall patterns resulting from climate change, are concerning as we found common infectious yeasts frequently distributed in soil in several countries. Our discovery of 41 putative novel species highlights the need to revise the current estimate of ~1500 recognized yeast species. Our findings demonstrate the continued need for culture-based studies to advance our knowledge of environmental yeast diversity.

Conservation genomics of Indian tigers

Genomic evidence for inbreeding depression and purging of deleterious genetic variation in Indian tigers

Increasing habitat fragmentation leads to wild populations becoming small, isolated, and threatened by inbreeding depression. However, small populations may be able to purge recessive deleterious alleles as they become expressed in homozygotes, thus reducing inbreeding depression and increasing population viability. We used genome sequencing of 57 tigers to estimate individual inbreeding and mutation loads in a small-isolated, and two large-connected populations in India. As expected, the small-isolated population had substantially higher average genomic inbreeding (FROH=0.57) than the large-connected (FROH=0.35 and FROH=0.46) populations. The small-isolated population had the lowest loss-of-function mutation load, likely due to purging of highly deleterious recessive mutations. The large populations had lower missense mutation loads than the small-isolated population, but were not identical, possibly due to different demographic histories. While the number of the loss-of-function alleles in the small-isolated population was lower, these alleles were at higher frequencies and homozygosity than in the large populations. Together, our data and analyses provide evidence of (a) high mutation load; (b) purging and (c) the highest predicted inbreeding depression, despite purging, in the small-isolated population. Frequency distributions of damaging and neutral alleles uncover genomic evidence that purifying selection has removed part of the mutation load across Indian tiger populations. These results provide genomic evidence for purifying selection in both small and large populations, but also suggest that the remaining deleterious alleles may have inbreeding associated fitness costs. We suggest that genetic rescue from sources selected based on genome-wide differentiation should offset any possible impacts of inbreeding depression.
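
For readers unfamiliar with FROH: it is commonly defined as the fraction of the genome that lies in runs of homozygosity (ROH). A minimal sketch of that calculation follows, assuming the ROH segments have already been called and do not overlap; the segment coordinates, the genome length, and the `min_length` cutoff are hypothetical and are not taken from the tiger study.

```python
from typing import Iterable, Tuple


def froh(roh_segments: Iterable[Tuple[int, int]], genome_length: int,
         min_length: int = 0) -> float:
    """Fraction of the genome covered by runs of homozygosity.

    Segments are half-open (start, end) intervals in bp and are assumed
    not to overlap. Segments shorter than min_length are ignored.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = sum(end - start for start, end in roh_segments
                  if end - start >= min_length)
    return covered / genome_length


# Hypothetical individual: three ROH segments on a 1,000,000 bp toy genome.
segments = [(0, 200_000), (300_000, 550_000), (900_000, 920_000)]

print(froh(segments, 1_000_000))                      # 0.47
print(froh(segments, 1_000_000, min_length=100_000))  # 0.45
```

The length cutoff matters in practice, since short ROH reflect older shared ancestry and long ROH reflect recent inbreeding, so reported values depend on the threshold a study chooses.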

Brigham Young University, USA: comparing the performance of 50 classification algorithms on gene-expression datasets

Benchmarking 50 classification algorithms on 50 gene-expression datasets

By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Diverse types of biomarkers have been proposed for assigning patients to subgroups. For example, DNA variants in tumors show promise as biomarkers; however, tumors exhibit considerable genomic heterogeneity. As an alternative, transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist (and most support diverse hyperparameters), so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 50 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection in nested cross-validation folds. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms outperformed more sophisticated methods. Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.
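
The phrase "hyperparameter optimization and feature selection in nested cross-validation folds" describes a pattern that is easy to get wrong, so here is a dependency-free sketch of it. It is not the authors' pipeline: the nearest-centroid classifier, the mean-difference feature ranking, the grid of feature counts, and the synthetic data are all stand-ins chosen for illustration.

```python
import random
from statistics import mean


def folds(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def select_features(X, y, m):
    """Univariate selection: keep the m features with the largest class-mean gap."""
    def gap(j):
        ones = [x[j] for x, lab in zip(X, y) if lab == 1]
        zeros = [x[j] for x, lab in zip(X, y) if lab == 0]
        return abs(mean(ones) - mean(zeros))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:m]


def fit_predict(X_tr, y_tr, X_te, m):
    """Select features and fit centroids on training data only, then predict."""
    feats = select_features(X_tr, y_tr, m)
    cents = {}
    for lab in (0, 1):
        rows = [x for x, l in zip(X_tr, y_tr) if l == lab]
        cents[lab] = [mean(r[j] for r in rows) for j in feats]
    preds = []
    for x in X_te:
        dist = {lab: sum((x[j] - c) ** 2 for j, c in zip(feats, cents[lab]))
                for lab in (0, 1)}
        preds.append(min(dist, key=dist.get))
    return preds


def cv_accuracy(X, y, m, k):
    """Plain k-fold accuracy for a fixed number of selected features m."""
    correct = 0
    for test_idx in folds(len(X), k):
        held = set(test_idx)
        tr = [i for i in range(len(X)) if i not in held]
        preds = fit_predict([X[i] for i in tr], [y[i] for i in tr],
                            [X[i] for i in test_idx], m)
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(X)


def nested_cv(X, y, grid, outer_k=5, inner_k=3):
    """Outer folds estimate performance; inner folds choose m without seeing the outer test fold."""
    correct = 0
    for test_idx in folds(len(X), outer_k):
        held = set(test_idx)
        tr = [i for i in range(len(X)) if i not in held]
        X_tr, y_tr = [X[i] for i in tr], [y[i] for i in tr]
        best_m = max(grid, key=lambda m: cv_accuracy(X_tr, y_tr, m, inner_k))
        preds = fit_predict(X_tr, y_tr, [X[i] for i in test_idx], best_m)
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(X)


# Synthetic data: 60 samples, 20 "genes", only the first two carry signal.
rng = random.Random(0)
y = [i % 2 for i in range(60)]
X = [[rng.gauss(0, 1) + (3.0 if j < 2 and lab == 1 else 0.0) for j in range(20)]
     for lab in y]

acc = nested_cv(X, y, grid=(1, 2, 5, 20))
print(f"nested-CV accuracy: {acc:.2f}")
```

The point of the nesting is that both the feature ranking and the choice of `m` are recomputed inside each outer training split, so the outer test fold never influences them.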

NERD-seq, a new sequencing technique for non-coding RNAs, from the University of Lethbridge, Canada

NERD-seq: A novel approach of Nanopore direct RNA sequencing that expands representation of non-coding RNAs

The new next-generation sequencing platforms by Oxford Nanopore Technologies for direct RNA sequencing (direct RNA-seq) allow for an in-depth and comprehensive study of the epitranscriptome by enabling direct base calling of RNA modifications. Non-coding RNAs constitute the most frequently documented targets for RNA modifications. However, the current standard direct RNA-seq approach is unable to detect many of these RNAs. Here we present NERD-seq, a sequencing approach which enables the detection of multiple classes of non-coding RNAs excluded by the current standard approach. Using total RNA from a tissue with high known transcriptional and non-coding RNA activity in mouse, the brain hippocampus, we show that, in addition to detecting polyadenylated coding and non-coding transcripts as the standard approach does, NERD-seq is able to significantly expand the representation for other classes of RNAs such as snoRNAs, snRNAs, scRNAs, srpRNAs, tRNAs, rRFs and non-coding RNAs originating from LINE L1 elements. Thus, NERD-seq presents a new comprehensive direct RNA-seq approach for the study of epitranscriptomes in brain tissues and beyond.

Universidad Nacional de San Martín, Argentina: can conformational buffering explain why disordered proteins exist in genomes?

Conformational buffering underlies functional selection in intrinsically disordered protein regions

Many disordered proteins conserve essential functions in the face of extensive sequence variation. This makes it challenging to identify the forces responsible for functional selection. Viruses are robust model systems to investigate functional selection and they take advantage of protein disorder to acquire novel traits. Here, we combine structural and computational biophysics with evolutionary analysis to determine the molecular basis for functional selection in the intrinsically disordered adenovirus early gene 1A (E1A) protein. E1A competes with host factors to bind the retinoblastoma (Rb) protein, triggering early S-phase entry and disrupting normal cellular proliferation. We show that the ability to outcompete host factors depends on the picomolar binding affinity of E1A for Rb, which is driven by two binding motifs tethered by a hypervariable disordered linker. Binding affinity is determined by the spatial dimensions of the linker, which constrain the relative position of the two binding motifs. Despite substantial sequence variation across evolution, the linker dimensions are finely optimized through compensatory changes in amino acid sequence and sequence length, leading to conserved linker dimensions and maximal affinity. We refer to the mechanism that conserves spatial dimensions despite large-scale variations in sequence as conformational buffering. Conformational buffering explains how variable disordered proteins encode functions and could be a general mechanism for functional selection within disordered protein regions.

Jian Zhou, UT Southwestern Medical Center: Orca, a sequence-based tool for modeling 3D chromosome structure (single-author paper)

Sequence-based modeling of genome 3D architecture from kilobase to chromosome-scale

The structural organization of the genome plays an important role in multiple aspects of genome function. Understanding how genomic sequence influences 3D organization can help elucidate their roles in various processes in healthy and disease states. However, the sequence determinants of genome structure across multiple spatial scales are still not well understood. To learn the complex sequence dependencies of multiscale genome architecture, here we developed a sequence-based deep learning approach, Orca, that predicts genome 3D architecture from kilobase to whole-chromosome scale, covering structures including chromatin compartments and topologically associating domains. Orca also makes both intrachromosomal and interchromosomal predictions and captures the sequence dependencies of diverse types of interactions, from CTCF-mediated to enhancer-promoter interactions and Polycomb-mediated interactions. Orca enables the interpretation of the effects of any structural variant at any size on multiscale genome organization and provides an in silico model to help study the sequence-dependent mechanistic basis of genome architecture. We show that the models accurately recapitulate effects of experimentally studied structural variants at varying sizes (300bp-80Mb) using only sequence. Furthermore, these sequence models enable in silico virtual screen assays to probe the sequence-basis of genome 3D organization at different scales. At the submegabase scale, the models predicted specific transcription factor motifs underlying cell-type-specific genome interactions. At the compartment scale, based on virtual screens of sequence activities, we propose a new model for the sequence basis of chromatin compartments: sequences at active transcription start sites are primarily responsible for establishing the expression-active compartment A, while the inactive compartment B typically requires extended stretches of AT-rich sequences (at least 6-12kb) and can form ‘passively’ without depending on any particular sequence pattern. Orca thus effectively provides an “in silico genome observatory” to predict variant effects on genome structure and probe the sequence-based mechanisms of genome organization.

Harvard University and others: a map of the human brain

A connectomic study of a petascale fragment of human cerebral cortex

We acquired a rapidly preserved human surgical sample from the temporal lobe of the cerebral cortex. We stained a 1 mm³ volume with heavy metals, embedded it in resin, cut more than 5000 slices at ∼30 nm and imaged these sections using a high-speed multibeam scanning electron microscope. We used computational methods to render the three-dimensional structure of 50,000 cells, hundreds of millions of neurites and 130 million synaptic connections. The 1.4 petabyte electron microscopy volume, the segmented cells, cell parts, blood vessels, myelin, inhibitory and excitatory synapses, and 100 manually proofread cells are available to peruse online. Despite the incompleteness of the automated segmentation caused by split and merge errors, many interesting features were evident. Glia outnumbered neurons 2:1 and oligodendrocytes were the most common cell type in the volume. The E:I balance of neurons was 69:31%, as was the ratio of excitatory versus inhibitory synapses in the volume. The E:I ratio of synapses was significantly higher on pyramidal neurons than inhibitory interneurons. We found that deep layer excitatory cell types can be classified into subsets based on structural and connectivity differences, that chandelier interneurons not only innervate excitatory neuron initial segments as previously described, but also each other’s initial segments, and that among the thousands of weak connections established on each neuron, there exist rarer highly powerful axonal inputs that establish multi-synaptic contacts (up to ∼20 synapses) with target neurons. Our analysis indicates that these strong inputs are specific, and allow small numbers of axons to have an outsized role in the activity of some of their postsynaptic partners.

References

[1] DeepTech深科技. "Google and Harvard jointly release a 'map of the human brain', reconstructing tens of thousands of neurons in 3D."