Consumables Terminology
•insert size: The length of the double-stranded nucleic acid fragment in a SMRTbell™ template, excluding the hairpin adapters.
•MagBead: Small paramagnetic bead, 2-3 μm in size. The DNA-polymerase complexes are attached to the magnetic beads, which can then be pulled down for easy removal of contaminants from the supernatant during the binding step. The DNA-polymerase complex/bead mixture is then used for the on-instrument immobilization step. See also MagBead loading.
•Plasmidbell Complex (11 kb): A fixed-length DNA template of 11 kb pre-bound to DNA polymerase.
•primed template: Refers to a template molecule that is annealed with primer; product of the template prep protocol and an input to the binding protocol.
•SMRT® Cells: Consumable substrates comprising arrays of zero-mode waveguide nanostructures. SMRT® Cells are used in conjunction with the DNA Sequencing Kit for on-instrument DNA sequencing.
•SMRTbell™ template: A double-stranded DNA template capped by hairpin adapters (i.e., SMRTbell™ adapters) at both ends. A SMRTbell™ template is topologically circular and structurally linear, and is the library format created by the DNA Template Prep Kit.
•template: A nucleic acid molecule to be sequenced; the DNA Template Prep Kit produces templates.
•template annealing: Process of hybridizing primer(s) to nucleic acid templates.
•template library: A set of nucleic acid molecules to be sequenced; the DNA Template Prep Kit process generates template libraries.
•template-polymerase complex: Primed template bound to DNA polymerase; the output of the DNA/Polymerase Binding Kit process.
•zero-mode waveguide (ZMW): A nanophotonic device for confining light to a small observation volume. This can be, for example, a small hole in a conductive layer whose diameter is too small to permit the propagation of light in the wavelength range used for detection. Physically part of a SMRT® Cell.
Sample Preparation Terminology
•AT ligation: The library construction protocol option by which an adapter with a single-nucleotide T overhang is ligated to an insert with a single-nucleotide A overhang. The workflow that uses this ligation option also contains an A-tailing step.
•barcode padding: An optional 5 bp constant sequence appended to unique barcode sequences. Can be used to normalize ligation of adapters during template preparation.
•barcoded adapter: A SMRTbell™ adapter with a barcode sequence appended to the end of the stem region. When using barcoded adapters, SMRTbell™ templates will have a symmetric barcode structure.
•barcoded SMRTbell™ template: A SMRTbell™ template with two barcoded adapters.
•blunt ligation: The library construction protocol option by which an adapter lacking any overhangs is ligated to an insert also lacking any overhangs. The workflow that uses this ligation option also lacks the A-tailing step.
•diffusion loading: Immobilization of DNA-polymerase complex into the ZMWs on the SMRT® Cell via diffusion. Smaller inserts load preferentially compared to larger inserts.
•DNA damage repair: A step in the SMRTbell™ library preparation that repairs a variety of types of DNA damage, including pyrimidine dimers, abasics, and nicks.
•DNA end repair: A step in the SMRTbell™ library preparation that removes 5’ and 3’ overhangs, and phosphorylates 5’ ends.
•DNA fragmentation: The generation of smaller DNA fragments. Multiple methods may be used to fragment DNA, including hydrodynamic shearing, mechanical shearing, sonication, and enzymatic digestion.
•MagBead loading: Immobilization of large DNA molecules into the ZMWs on the SMRT® Cell chip via MagBeads. The smallest inserts, hairpin dimers, and excess polymerase are washed out in the initial MagBead binding and washing steps. As a result, medium and larger size inserts load better and have a higher sequencing accuracy (compared to diffusion loading of similar-sized inserts).
•PacBio® SampleNet (http://www.smrtcommunity.com/SampleNet): Resource for information and discussion on sample preparation and sequencing with the PacBio® System.
•polymerase binding: The binding of the sequencing polymerase to an appropriate binding site on a nucleic acid template.
•primer annealing: The hybridization of a sequencing primer to an appropriate binding site on a template.
•size selection: The removal of unwanted fragments from a mixture based on size. This can refer to the removal of only the shortest fragments, such as adapter dimers, or to the isolation of a very narrow range of insert sizes. Depending on the size range of interest and the equipment available, size selection can be accomplished with AMPure PB beads, manual isolation from an agarose gel, or automated gel isolation.
Software Terminology
•AHA: A hybrid assembly algorithm that takes a draft assembly and joins contigs using PacBio® reads as evidence. Part of the SMRT® Analysis suite.
•Binding Calculator: Web-based application used to calculate binding and annealing reactions for preparing DNA samples for use on the PacBio® System.
•BLASR: Used for targeted sequencing. Maps reads against a reference; part of SMRT® Analysis.
•Celera® Assembler: Combines Pacific Biosciences’ long reads with short reads generated by other technologies. Used for de novo assembly. Third party software integrated with the SMRT® Analysis suite.
•GATK: Identifies haploid and diploid SNPs using the Broad’s Unified Genotyper software.
•HGAP: The Hierarchical Genome Assembly Process (HGAP) can generate high quality (≥ 99.999% accurate) de novo assemblies using a single PacBio® library prep. HGAP includes pre-assembly, de novo assembly with Celera® Assembler, and assembly polishing with Quiver.
•PacBio DevNet (http://pacbiodevnet.com/): Resource for informatics researchers, independent software vendors, and life scientists; includes data sets, source code, application programming interfaces, and documentation.
•pacBioToCA: A software module that aligns high-accuracy reads to the CLRs (continuous long reads), error-corrects the CLRs when a minimum coverage is satisfied, and splits or trims the CLRs otherwise. Third party software integrated with the SMRT® Analysis suite.
•PBJelly: A gap-filling algorithm from Baylor College of Medicine that takes an assembly containing scaffolds and tries to fill the internal gaps. Not part of the SMRT® Analysis suite.
•Quiver: A highly accurate consensus and variant caller that can generate 99.999% accurate consensus sequences using local realignment and the full range of quality scores associated with Pacific Biosciences reads. Part of the SMRT® Analysis suite.
•DBG2OLC: An assembler for efficient assembly of large genomes using the compressed overlap graph.
•Racon: A stand-alone consensus module based on SIMD-accelerated partial order alignment. The assembly of long reads from Pacific Biosciences and Oxford Nanopore Technologies typically requires resource-intensive error correction and consensus generation steps to obtain high quality assemblies; with Racon, the error correction step can be omitted and high quality consensus sequences can still be generated efficiently. Based on tests with PacBio and Oxford Nanopore datasets, its authors report that Racon coupled with Miniasm enables consensus genomes with similar or better quality than state-of-the-art methods while being an order of magnitude faster.
•Canu: An assembler for scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
•FALCON and FALCON-Unzip: Algorithms (https://github.com/PacificBiosciences/FALCON/) that assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. Their authors used them to generate new reference sequences for heterozygous samples that have challenged short-read assembly approaches, including an F1 hybrid of Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata. The authors report that the FALCON-based assemblies are substantially more contiguous and complete than alternate short- or long-read approaches, and that the phased diploid assembly enabled the study of haplotype structure and heterozygosities between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.
•HINGE: A long-read assembler that seeks to achieve optimal repeat resolution by distinguishing repeats that can be resolved given the data from those that cannot. This is accomplished by adding "hinges" to reads when constructing an overlap graph, so that only unresolvable repeats are merged.
•MECAT: A tool for fast mapping, error correction, and de novo assembly of single-molecule sequencing (SMS) reads (accessible at https://github.com/xiaochuanle/MECAT). Its authors report that MECAT’s computing efficiency is superior to that of existing tools, while the results it produces are comparable or improved. MECAT enables reference mapping or de novo assembly of large genomes using SMS reads on a single computer.
•RS Dashboard: Web-based software that displays quality metrics for individual instruments, runs, and SMRT® Cells.
•RS Remote: Windows-based client software used to design and monitor sequencing runs. Directs user to look at primary analysis data in-depth using RS Dashboard.
•RS Touch: Touchscreen user interface on the instruments to help the user load the instrument and start a run. Provides direct feedback on instrument status.
•SMRT® Analysis Suite: Client/server software that performs automated and distributed analysis of sequencing data generated by the PacBio® System.
•SMRT® Pipe: Command-line software used to launch secondary analysis jobs. Part of the SMRT® Analysis suite.
•SMRT® Portal: Web-based software used to help set up secondary analysis jobs and view quality reports. Part of the SMRT® Analysis suite.
•SMRT® View: Java-based genome browser used to visualize aligned or assembled reads. Part of the SMRT® Analysis suite.
Analysis Terminology
•collection time: The time specified for collecting data from a SMRT® Cell.
•consensus accuracy: Accuracy based on aligning multiple sequencing reads or subreads together, optionally with a reference sequence.
•high quality (HQ) region screening: Annotates the high quality sequencing regions of a read to be used during raw read trimming.
•movie: The set of data collected during real-time observation of the SMRT® Cell, including the spectral and temporal information used to determine a read.
•primary analysis: Includes signal processing of the movie, base calling of the traces and pulses, and quality assessment of the base calls.
•pulse: The representation of an illumination event derived from a trace that includes metrics such as interpulse duration, pulse height, and pulse width.
•raw read trimming: Extraction of high quality regions from an unfiltered read. Trimming of an unfiltered read produces a polymerase read.
•reads/SMRT® Cell: The average number of reads generated per SMRT® Cell.
•SMRT® Sequencing: The process of nucleic acid sequencing using Pacific Biosciences’ single molecule, real-time sequencing technology.
•standard sequencing: Sequencing of SMRTbell™ templates to produce either single pass reads or circular consensus reads, depending on the template length and collection time.
•tertiary analysis: Statistical analyses following secondary analysis, which includes comparisons of secondary analysis results across different samples, application-specific analyses, variant classification, and disease/gene annotations.
•trace: The raw intensity values from all four spectral channels of a single ZMW derived from a movie.
Secondary Analysis Terminology
•analysis group: A group of reads from a single or multiple SMRT® Cells to be analyzed together in secondary analysis.
•barcode FASTA: A FASTA-format file used by barcoding software to identify ideal barcode sequences. For symmetric barcodes, each barcode sequence identifies a single bin for demultiplexing reads. For paired barcodes, each unique pair of barcodes should be listed as two sequentially-named FASTA sequences.
•barcode score: The alignment score between a read and an ideal barcode sequence. The maximum barcode score is twice the length of the ideal barcode sequence.
•circular consensus accuracy: Accuracy based on multiple sequencing passes around a single circular template molecule.
•circular consensus analysis: Processing of sequencing data generated by circular consensus sequencing to create a circular consensus read.
•circular consensus sequencing (CCS): Sequencing performed on a circular template in which multiple subreads are generated during multiple sequencing passes around the template, and then collapsed to form a single high-accuracy read. CCS data are generated when at least two full-pass subreads are present.
•consensus sequence determination: Generation of a consensus sequence from multiple individual reads of the same template or identical copies thereof. Also termed “consensus calling.”
•paired barcodes: Barcode sequences that are different (asymmetric) on either end of an insert present in a SMRTbell™ template. The barcoding analysis software uses unique pairs of barcodes to separate and analyze reads.
•QV Metric: "Phred"-like scores that predict, for each base call, the probability of a correct call.
•secondary analysis: Statistical analyses following primary analysis base calling, which include: 1) filtering/selection of data that meet desired criteria (such as quality, read length, and so on); 2) comparison of reads to a reference for mapping and variant calling, consensus sequence determination, alignment and assembly (de novo or reference-based), variant identification and base modification detection; and 3) quality evaluations for a sequencing run, consensus sequence, assembly, and so on.
•secondary analysis job: An analysis of data from an analysis group using a secondary analysis protocol.
•symmetric barcodes: Barcode sequences that are identical on both ends of an insert present in a SMRTbell™ template.
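The QV Metric entry above describes "Phred"-like scores but does not state a formula. The following is a minimal sketch, assuming the standard Phred convention Q = -10·log10(P_error) applies to these scores; the function names are hypothetical and not part of any PacBio software.

```python
import math


def phred_to_error_prob(qv):
    """Convert a Phred-style quality value to the probability that the call is wrong.

    Assumes the standard Phred convention: Q = -10 * log10(P_error).
    """
    return 10 ** (-qv / 10)


def error_prob_to_phred(p_error):
    """Convert an error probability (0 < p_error <= 1) to a Phred-style quality value."""
    if not 0 < p_error <= 1:
        raise ValueError("p_error must be in the interval (0, 1]")
    return -10 * math.log10(p_error)


# Under this convention a QV of 20 corresponds to a 1% chance of an
# incorrect call, i.e. a 99% chance of a correct call.
p_wrong = phred_to_error_prob(20)
p_correct = 1 - p_wrong
```

Under the same assumption, the "99.999% accurate" figure quoted for HGAP and Quiver corresponds to an error probability of 1e-5, or a quality value of about 50.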
Read Terminology
•polymerase read (formerly called “read”): A sequence of nucleotides incorporated by the DNA polymerase while reading a template, such as a circular SMRTbell™ template. Polymerase reads are most useful for quality control of the instrument run. Polymerase read metrics primarily reflect movie length and other run parameters rather than insert size distribution. Polymerase reads are trimmed to include only the high quality region; they include sequences from adapters; and can further include sequence from multiple passes around a circular template.
•subread: Each polymerase read is partitioned to form one or more subreads, which contain sequence from a single pass of a polymerase on a single strand of an insert within a SMRTbell™ template and no adapter sequences. The subreads contain the full set of quality values and kinetic measurements. Subreads are useful for applications like de novo assembly, resequencing, base modification analysis, and so on.
•circular consensus (CCS) read: The consensus sequence determined using subreads taken from a single ZMW. This is not aligned against a reference sequence. In contrast to Reads of Insert, CCS reads require at least two full-pass subreads from the insert.
•continuous long read (CLR): A continuous, long read generated from a multi-kilobase insert sequence that is long enough that circular consensus sequences are not generated. As such, a CLR comprises only sequence for a single strand of an insert sequence in a nucleic acid template.
•full-pass subread: A subread that begins at one adapter sequence and ends at another adapter sequence. A full-pass subread does not begin or end in the middle of an insert sequence.
•mapped polymerase read length: The total number of bases along a read from the first adapter or aligned subread to the last adapter or aligned subread. Approximates the sequence produced by a polymerase in a ZMW.
•mapped subread length: The length of the subread alignment to a target reference sequence. This does not include the adapter sequence.
•N50 subread length metric: The read length at which 50% of the bases are in subreads (or polymerase reads) longer than, or equal to, this value.
•PacBio® corrected read (PBcR): The result of error-correcting Pacific Biosciences CLR data with high-accuracy reads using either Pacific Biosciences’ CCS or a short-read technology.
•polymerase read length: The total number of bases produced from a ZMW after trimming. This may include the adapter sequence.
•polymerase read quality: A trained prediction of a read’s mapped accuracy based on its pulse and base file characteristics (peak signal-to-noise ratio, average base QV, inter-pulse duration, and so on).
•preassembled long read (PLR): A generated read that has been output from the preassembly step of HGAP.
•productivity: A measure of the reads from a ZMW. P=1 means that there is a polymerase read from that ZMW. P=0 means that a ZMW did not produce a read and is presumed to be lacking a polymerase. P=2 means “other” and the signal collected from the ZMW was not conducive to efficient base calling, possibly due to multiple template-polymerase complexes bound in the ZMW, high background signal, and so on.
•read length: The number of contiguous bases incorporated into a nascent strand during template-directed synthesis, reported as average, mode, and max (>1%).
•read of insert: Represents the highest quality single sequence for an insert, regardless of the number of passes. For example, if your template received one-and-a-half subreads, that information will be combined into a Read of Insert. CCS is an example of a special case where at least two full subreads are collected for an insert. Reads of Insert give the most accurate estimate of the length of the insert sequence loaded onto a SMRT® Cell. For long templates, Reads of Insert may be the same as Polymerase Reads.
•read length of insert: The average length of the Read of Insert, which is a representative read of a DNA molecule from a single ZMW; that is, the sequence of a DNA molecule read from a single ZMW. On circularized SMRTbell™ templates that are shorter than the read length, a Read Length of Insert distribution will closely resemble the insert size distribution.
•read quality (RQ): The de novo prediction of the mapped accuracy of subreads from a single ZMW. Sometimes also referred to as QC Score or Read Score.
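The N50 subread length metric defined above can be computed directly from a list of read lengths. The following is a minimal sketch that follows the wording of that entry; the function name and the sample lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    The N50 is the length L such that reads of length >= L together
    contain at least 50% of all bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total_bases = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases until half
    # of the total has been covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total_bases:
            return length


# Hypothetical subread lengths (in bases), for illustration only.
subread_lengths = [2, 3, 4, 5, 6]
result = n50(subread_lengths)  # 6 + 5 = 11 of 20 bases, so the N50 is 5
```

Because the metric weights each read by its number of bases, a few long reads raise the N50 more than they raise the mean read length.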
Base Modification Terminology
•amplified control: Control created by separately sequencing an amplified version of the sample of interest.
•interpulse duration (IPD): Metric for the length of time between emission pulses indicative of base incorporation events. Base modifications in a template molecule can impact IPD, so changes in IPD are used to detect base modifications during SMRT® Sequencing.
•IPD Ratio: The ratio of the mean IPD of a native sample to the mean IPD in a second sample or control at a position of inquiry in a template.
•in silico control: Computational model for predicting the mean IPD per given sequence context at the position of inquiry.
•native control: Native DNA sample used as a control to analyze a second, typically native, DNA sample to identify differences in modification.
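The IPD Ratio entry above reduces to a ratio of two means at a single template position. The following is a minimal sketch, assuming the per-position IPD observations are already available as plain lists of numbers; the function name and the sample values are hypothetical.

```python
from statistics import mean


def ipd_ratio(native_ipds, control_ipds):
    """Ratio of the mean native IPD to the mean control IPD at one position.

    Both arguments are non-empty sequences of interpulse durations
    observed at the same template position.
    """
    control_mean = mean(control_ipds)
    if control_mean == 0:
        raise ValueError("mean control IPD must be non-zero")
    return mean(native_ipds) / control_mean


# Hypothetical IPD observations (arbitrary time units) at one position.
native = [2.0, 4.0]
control = [1.0, 1.0]
ratio = ipd_ratio(native, control)  # mean 3.0 over mean 1.0
```

A ratio well above 1 at a position means the polymerase paused longer on the native sample than on the control, which is the kind of IPD change the interpulse duration entry describes as evidence of a base modification. The control values could come from an amplified control or from an in silico control's predicted mean.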