I. Definition of CNV
CNV refers to a type of intermediate-scale structural variant (SV) with copy number changes, involving a DNA fragment that is typically larger than 1 kilobase (kb) and smaller than 5 megabases (Mb).
1. Sequencing methods
(1) Whole-genome sequencing
(2) Exome sequencing
2. Detection methods
Five types in total:
(1) Summary of paired-end mapping (PEM), split read (SR), and de novo assembly (AS)-based tools for CNV detection using NGS data
(2) Summary of bioinformatics tools for CNV detection using exome sequencing data
IV. Brief introduction to the specific detection methods
RP (read-pair) methods compare the average insert size of the actually sequenced read pairs with the expected size based on a reference genome. PEM methods can efficiently identify not only insertions and deletions but also mobile element insertions, inversions, and tandem duplications.
1) Detection is based on the insert size of the sequencing library;
2) Unable to handle duplicated regions;
3) Unable to detect insertions larger than the average insert size of the PE library; larger insertions become detectable when a large-insert mate-pair library is available.
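The insert-size comparison described above can be illustrated with a minimal sketch. It assumes the library's expected insert size and standard deviation are already known, and it uses a hypothetical cutoff of three standard deviations; the function name `classify_pairs` and the labels are illustrative only and do not correspond to any of the tools listed in these notes.

```python
def classify_pairs(insert_sizes, expected_mean, expected_sd, n_sd=3.0):
    """Label each read pair by comparing its mapped insert size
    with the insert size expected from the library (assumed known)."""
    upper = expected_mean + n_sd * expected_sd
    lower = expected_mean - n_sd * expected_sd
    labels = []
    for size in insert_sizes:
        if size > upper:
            # Mates map farther apart on the reference than expected:
            # the sample may lack sequence that the reference has.
            labels.append("deletion_candidate")
        elif size < lower:
            # Mates map closer together than expected:
            # the sample may carry extra sequence between them.
            labels.append("insertion_candidate")
        else:
            labels.append("concordant")
    return labels


sizes = [300, 310, 295, 900, 120, 305]
labels = classify_pairs(sizes, expected_mean=300, expected_sd=15)
print(labels)
```

Because the signal is a shortened mapped distance, an insertion longer than the library insert size cannot show up this way, which is the limitation noted in item 3.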
Software
Ulysses (2014): for mate pairs; handles repeat regions.
BreakDancer (2009): for short and medium-sized copies; ignores multiply aligned reads; not applicable to repeat regions; detects small indels (usually between 10 bp and 100 bp).
VariationHunter (2010): resolves reads mapped to multiple locations; detects transposon insertions.
commonLAW (2011): compares all samples to a reference genome simultaneously; resolves reads mapped to multiple locations.
The split-read (SR) method uses reads from paired-end sequencing in which only one read of the pair maps reliably and the other either completely or partially fails to map to the genome. The SR-based approach relies heavily on read length and is only applicable to unique regions of the reference genome. It provides the precise start and end positions of a variant
but has limited ability to identify large-scale SVs.
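One common way a partially mapped read appears in alignment output is as a soft-clipped alignment, where the clipped end marks a candidate breakpoint. The sketch below assumes SAM-style input (1-based leftmost position plus a CIGAR string) and only shows how a base-pair-resolution coordinate can be read off such an alignment; `softclip_breakpoints` is a hypothetical helper, not the algorithm of Pindel or any other tool named here.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def softclip_breakpoints(pos, cigar):
    """Return 1-based reference coordinates adjacent to soft-clipped ends.

    pos   -- 1-based leftmost aligned position, as in the SAM POS field
    cigar -- CIGAR string, e.g. "60M40S"
    """
    breakpoints = []
    ref_pos = pos          # next reference base to be consumed
    seen_aligned = False   # True once any reference-consuming op was seen
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "S":
            if seen_aligned:
                breakpoints.append(ref_pos - 1)  # last aligned base
            else:
                breakpoints.append(pos)          # first aligned base
        elif op in "MDN=X":                      # ops that consume reference
            ref_pos += length
            seen_aligned = True
    return breakpoints


print(softclip_breakpoints(1000, "60M40S"))
print(softclip_breakpoints(2000, "40S60M"))
print(softclip_breakpoints(3000, "100M"))
```

The clipped part must still be long enough to be placed uniquely elsewhere in the genome, which is why the approach depends on read length and on unique reference regions.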
Software
Pindel (2009): for large deletions (1 bp - 10 kb) and medium-sized insertions (1 - 20 bp); it does not use probabilistic models, which results in a higher false-positive rate.
Gustaf (2014): detects dispersed duplications and translocations (≥30 bp in length).
PRISM (2012): faster run times and higher sensitivity in detecting large CNVs.
SVseq2 (2012): supports indel calling from low-coverage sequence data.
RD (read depth) is the major approach for estimating copy numbers.
RD methods can detect exact copy numbers, large insertions, and CNVs in complex genomic regions. The tools fall into three categories: single samples, paired case/control samples, and large populations of samples. They are based on a sliding window and proceed in the following steps:
First, reads are aligned to a reference genome and the RD is counted in windows of a predefined size. Then the counts are normalized to remove potential biases, mainly due to GC content and repeat regions (Boeva et al., 2011; Janevski et al., 2012). Next, a segmentation algorithm is applied to identify contiguous sets of windows having the same copy number. Finally, the statistical significance of the calls is estimated and filtering is applied (Janevski et al., 2012; Zhao et al., 2013).
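The window-count, normalize, and segment steps above can be sketched as follows. This is a deliberately simplified illustration under several assumptions: read start positions are 0-based, normalization uses only the median window count (GC and repeat correction are omitted), segmentation simply merges adjacent windows with the same rounded copy number, and no significance test is performed. The function names and the `ploidy` parameter are hypothetical.

```python
from statistics import median


def window_counts(read_starts, chrom_len, window):
    """Count reads per fixed-size window (0-based read start positions)."""
    n_windows = (chrom_len + window - 1) // window
    counts = [0] * n_windows
    for start in read_starts:
        counts[start // window] += 1
    return counts


def call_segments(counts, window, ploidy=2):
    """Estimate a copy number per window and merge equal neighbours.

    Returns (start, end, copy_number) for segments that differ from ploidy.
    """
    baseline = median(counts)
    if baseline == 0:
        raise ValueError("median depth is zero; cannot normalize")
    # Normalized depth scaled so the typical window equals the ploidy.
    copy_numbers = [round(ploidy * c / baseline) for c in counts]
    segments = []
    seg_start = 0
    for i in range(1, len(copy_numbers) + 1):
        if i == len(copy_numbers) or copy_numbers[i] != copy_numbers[seg_start]:
            if copy_numbers[seg_start] != ploidy:
                segments.append((seg_start * window, i * window,
                                 copy_numbers[seg_start]))
            seg_start = i
    return segments


counts = [100, 98, 102, 51, 49, 50, 101, 99, 150, 100]
segments = call_segments(counts, window=1000)
print(segments)
```

Note that the reported boundaries fall on window edges, so the breakpoint resolution of this kind of approach is limited by the window size.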
Software
CNV-seq (2009): paired case and control; assumes read counts follow a Poisson distribution, which might not be an optimal model for many CNVs; gives better results at higher depth.
BIC-seq / CNVnator (2011): for paired samples; in the MSB approach, adjacent genomic windows with similar read depths are merged along chromosomes. Breakpoints are reported when the read depth of a sliding window is significantly discordant with the depth of the merged windows.
CNVnator: single-sample sequencing data; capable of detecting CNVs from a 500 bp window at 4–6× coverage to a 30 bp window at 100× coverage.
cn.MOPS (2012): for populations.
RDXplorer (2009): individual genomes; uses EWT (Event-Wise Testing) to achieve high resolution for CNV detection with 100 bp windows; can detect CNVs with a length of ~500 bp; human only.
ReadDepth (2011): sets an appropriate size for a sliding window; does not require a reference sample; applied to single samples; high resolution from low-coverage experiments using breakpoint information from paired-end sequencing; 2-5 kb CNVs.
SegSeq (2009): for case/control samples; uses the log ratio of the case versus control read counts of single samples [41].
CNVeM (2012): individual samples; can predict breakpoints at base-pair resolution; can be used for duplicated regions.
CNVrd2 (2014): for large populations.
JointSLM (2011): based on multiple samples; detects small CNV regions as short as 500 bp.
Control-FREEC: can call CNVs from WGS and WES data with or without control samples.
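Several of the paired case/control tools above work on a ratio of read counts between the two samples. A minimal sketch of a per-window log2 ratio is given below; it assumes both samples were counted in the same windows, normalizes by total read count, and adds a hypothetical pseudocount to avoid taking the log of zero. It is a generic illustration, not the statistical model of SegSeq, CNV-seq, or any other listed tool.

```python
import math


def log2_ratios(case_counts, control_counts, pseudocount=0.5):
    """Per-window log2(case / control) after library-size normalization."""
    if len(case_counts) != len(control_counts):
        raise ValueError("case and control must use the same windows")
    case_total = sum(case_counts)
    control_total = sum(control_counts)
    ratios = []
    for case, control in zip(case_counts, control_counts):
        case_frac = (case + pseudocount) / case_total
        control_frac = (control + pseudocount) / control_total
        ratios.append(math.log2(case_frac / control_frac))
    return ratios


case = [100, 100, 200, 50]
control = [100, 100, 100, 100]
ratios = log2_ratios(case, control)
for window_index, ratio in enumerate(ratios):
    print(window_index, round(ratio, 2))
```

Positive values suggest a gain in the case sample relative to the control and negative values a loss; real tools then segment these ratios and test their significance.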
AS methods are less used in CNV detection because of their overwhelming demand on computational resources.
In addition, eukaryotic genomes contain a significant fraction of repeats and segmental duplications, which makes AS methods less accurate and more complex, as they perform poorly in these regions.
Another issue is that AS methods are unable to handle haplotype sequences, so only homozygous structural variations can be detected.
While RD-based methods are best suited for detecting absolute copy number (Alkan et al., 2009), they are less efficient at detecting small CNVs (<1 kb; Bellos et al., 2012).
Tools using RP have low sensitivity for detecting variation in repetitive regions (Medvedev et al., 2009).
SR approaches can achieve single-base-pair resolution but remain highly dependent on read length and are less reliable in repetitive regions (Bellos et al., 2012).
AS-based tools have the advantage of not requiring a reference genome, but they are computationally expensive and perform poorly on repeat regions (Zhao et al., 2013).
PEM methods can report accurate breakpoints but have low sensitivity for large CNV regions (e.g., insertions longer than the insert size) and cannot count exact copy numbers.
RD methods are advantageous for detecting large CNVs but cannot report exact breakpoints.
Combining the approaches gives exact breakpoints and covers CNVs of various lengths, especially longer insertions that are undetectable by PEM.
SVDetect (Zeitouni et al., 2010): combines an RP approach with RD ratios between case and control samples; uses paired-end and mate-pair data.
cnvHiTSeq (Bellos et al., 2012): combines evidence from RD, RP, and SR; for population sequencing data, even at low coverage; uses LOESS smoothing and GC correction to mitigate sequencing biases.
Clever-sv (Marschall et al., 2013): combines SR and discordant RP; works best for calling midsize deletions at medium coverage.
CNVer (Medvedev et al., 2010): combines RP and RD information for CNV detection; uses all good mappings of every mate-pair read to detect repeat and duplication regions.
DELLY (Rausch et al., 2012): combines RP with SR; ascertains the full spectrum of genomic rearrangements.
inGAP-SV (2011): combines PEM and RD; can identify different types of CNVs and offers customized visualization.
GenomeSTRiP (Haraksingh and Snyder, 2013): combines discordant RP and [...]; can accurately call relatively long CNVs (≥200 bp); designed for large populations and works best when at least 20 individuals are analyzed together.
Gindel (Chu et al., 2014): uses discordant RP, SR, spanning reads (reads mapped to a region that overlaps the indel), and RD near the deletion; for populations.
GASVPro (Sindi et al., 2012): integrates RP and RD methods to improve specificity in detecting structural variation, especially in repetitive regions.
Hydra-Multi (Lindberg et al., 2014): works with multiple samples to detect SVs; combines RP and AS methods.
LUMPY (Layer et al., 2014): combines RP, SR, and RD; for single samples.
PSCC (population-scale CNV calling; Li et al., 2014): combines RP and RD for populations; uses a combined statistical test to ensure the best performance.
SoftSearch (Hart et al., 2013): uses SR and RP strategies; can handle multiple mappings when detecting SVs to increase sensitivity.
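The idea shared by the hybrid tools above, using one signal for the presence and size of an event and another for its exact boundaries, can be illustrated with a small sketch. It assumes a coarse RD call with window-level boundaries and a list of candidate breakpoints from SR or PEM evidence, and it moves each boundary to the nearest breakpoint within a hypothetical tolerance `max_shift`. The function `refine_call` is illustrative and is not taken from any tool listed here.

```python
def refine_call(rd_start, rd_end, breakpoints, max_shift):
    """Snap coarse RD boundaries to nearby SR/PEM breakpoints.

    A boundary is kept unchanged when no breakpoint lies within max_shift.
    """
    def snap(coord):
        nearby = [b for b in breakpoints if abs(b - coord) <= max_shift]
        if not nearby:
            return coord
        return min(nearby, key=lambda b: abs(b - coord))

    return snap(rd_start), snap(rd_end)


refined = refine_call(3000, 6000, breakpoints=[3127, 5940, 9000], max_shift=500)
print(refined)
```

When no supporting breakpoint is found, the call falls back to the RD boundaries, so long events that PEM or SR alone would miss are still reported, only with coarser coordinates.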
V. Comparison of the tested CNV detection methods
Zhao, M., et al. (2013). Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics 14(Suppl 11):S1.
Pirooznia, M., Goes, F. S., and Zandi, P. P. (2015). Whole-genome CNV analysis: advances in computational approaches. Front. Genet. 6:138.
Tan, R., Wang, Y., Kleinstein, S. E., Liu, Y., Zhu, X., Guo, H., et al. (2014). An evaluation of copy number variation detection tools from whole-exome sequencing data. Hum. Mutat. 35,
Duan, J., Zhang, J.-G., Deng, H.-W., and Wang, Y.-P. (2013). Comparative studies of copy number variation detection methods for next-generation sequencing technologies. PLoS ONE 8(3)