Data Quality Control with fastp
- I. Features of fastp quality control
- II. Quality control standards
- III. Quality control methods
- IV. Parameter reference
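Quality control of raw sequencer output: tools such as FastQC or fastp inspect the raw data and report on its quality and composition, so that low-quality data can be identified and filtered out.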
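I. Features of fastp quality control
fastp is a fast tool for processing high-throughput sequencing data. On Illumina data it performs quality control, filters low-quality reads, trims low-quality bases from the 3' end, and removes adapter sequences. Its main features are:
- Comprehensive QC: evaluates FASTQ files across the board and produces a visual report, similar to FastQC.
- Low-quality read filtering: discards reads that are low quality, too short, or contain too many undetermined bases (N).
- Sliding-window quality trimming: evaluates mean quality in a sliding window and trims low-quality stretches from the head and tail of each read.
- Automatic adapter removal: detects and trims adapters automatically, with no adapter sequence supplied by the user.
- Paired-end processing: finds the overlapping region of each read pair and, when correction is enabled with -c, corrects mismatched bases inside it.
- polyG removal: trims polyG tails from Illumina NextSeq/NovaSeq data.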
- Long-read support: not limited to short Illumina reads; the fastp documentation also lists long-read data such as PacBio and Nanopore.
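- Speed: implemented in C++, so it runs fast while remaining accurate.
- Multiple data formats: supports several input and output formats, which simplifies processing and analysis.
- Report output: writes results as JSON and HTML, which are easy to read and share.
- Multithreading: processes reads with multiple worker threads, which suits large datasets.
- Easy installation: install through Conda or download the executable from GitHub.
- Command-line interface: offers a rich set of options that can be tuned to the task.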
II. Quality control standards
Getting to know raw sequencing data: FASTQ
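Metric | Typical standard |
---|---|
Per-position base quality distribution across the read | Mostly above 30 |
Overall base quality distribution | >20 |
Q20 | >95% (no lower than 90% at worst) |
Q30 | >85% (no lower than 80% at worst) |
Per-position base composition across the read, used to assess base separation | A and T proportions should be about equal, as should C and G; the average deviation is ideally within 1% |
GC content distribution | About 40% for the human genome |
Per-position N content | 0; the lower the better |
Whether reads still contain sequencing adapters | |
Read duplication rate, introduced by the amplification step of the experiment | |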
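The Q20 and Q30 rows can be made concrete with a small calculation. The sketch below assumes Phred+33 quality strings and treats Q20/Q30 as the fraction of bases with quality at or above 20/30; the function name `q_fractions` is made up for this illustration and is not part of fastp.

```python
def q_fractions(quality_strings, offset=33):
    """Return (Q20, Q30): fractions of bases with quality >= 20 and >= 30.

    Assumes Phred+33 encoding unless another offset is given.
    """
    total = q20 = q30 = 0
    for qual in quality_strings:
        for ch in qual:
            q = ord(ch) - offset  # decode one quality character
            total += 1
            q20 += q >= 20
            q30 += q >= 30
    if total == 0:
        return 0.0, 0.0
    return q20 / total, q30 / total


# 'I' decodes to Q40, '5' to Q20 and '+' to Q10 under Phred+33.
print(q_fractions(["IIII5555", "IIII++++"]))
```

Against the table above, a dataset would be expected to return values above roughly 0.95 and 0.85.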
Notes:
Reference article: https://zhuanlan.zhihu.com/p/28802083
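1. GC content
GC content is the proportion of G and C among all bases. Every second-generation sequencing platform has some degree of sequencing bias, and this value helps judge whether sequencing was sufficiently random. For human, genomic GC content is about 40%. If the GC content profile clearly deviates from this value, the run has substantial sequence bias: certain regions of the genome were sequenced more often than average, which skews coverage and also affects downstream variant calling and CNV analysis.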
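As a minimal sketch of the definition above, GC content can be computed directly from read sequences. Ignoring N bases in the denominator is an assumption of this example, and `gc_fraction` is an illustrative name.

```python
def gc_fraction(reads):
    """Fraction of G and C among all called bases (A, C, G, T); N is ignored."""
    gc = total = 0
    for seq in reads:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    return gc / total if total else 0.0


print(gc_fraction(["ATGC", "GGNA"]))
```

For human WGS data, a result far from about 0.40 would be the kind of deviation described above.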
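2. Adapter contamination
When the read length exceeds the length of the DNA fragment being sequenced, the adapter is sequenced at the end of the read. Typical WGS runs do not reach the adapters, because the library inserts are long (a few hundred bp) while reads are 100 to 150 bp. In some RNA sequencing, however, the molecules are short, many only tens of bp (miRNA in particular), so the read easily runs through the insert and adapter sequence appears at the read tail.
Like low-quality bases, adapter sequence must be trimmed from reads before the actual analysis.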
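The read-through situation can be sketched as follows. This is a simplified illustration that only handles exact matches; it is not fastp's detection algorithm, and both the function name `trim_adapter` and the `min_match` parameter are assumptions of the example.

```python
def trim_adapter(read, adapter, min_match=4):
    """Remove an adapter from the 3' end of a read (exact matching only).

    Handles a full adapter anywhere in the read, or an adapter prefix of at
    least `min_match` bases at the very end of the read.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Try the longest partial adapter first, then shorter prefixes.
    for k in range(min(len(adapter), len(read)) - 1, min_match - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


adapter = "AGATCGGAAGAGC"
# A 10 bp insert sequenced with a 17 bp read: the last 7 bases are adapter.
print(trim_adapter("ACGTACGTAC" + adapter[:7], adapter))
```

A read from a long WGS insert contains no adapter and is returned unchanged.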
III. Quality control methods
1. Single-end sequencing
fastp -i fq1 -o sample.clean.R1.fq.gz -R sample -j sample.json -h sample.html
2. Paired-end sequencing
fastp -i fq1 -o sample.clean.R1.fq.gz -I fq2 -O sample.clean.R2.fq.gz -R sample -j sample.json -h sample.html
3. Other parameters:
-w 16 -f 2 -F 2 -x -5 -3 -W 7 -M 20 -l 60
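When many samples are processed, the commands above are often assembled by a script. The following sketch builds the argument list without running fastp; the helper `build_fastp_command` and its parameter names are hypothetical, while the flags themselves are the ones used in the commands above.

```python
import shlex


def build_fastp_command(r1_in, r1_out, sample, r2_in=None, r2_out=None, extra=()):
    """Assemble a fastp argument list for single-end or paired-end data."""
    cmd = ["fastp", "-i", r1_in, "-o", r1_out]
    if r2_in is not None:
        if r2_out is None:
            raise ValueError("paired-end input needs an output file for read2")
        cmd += ["-I", r2_in, "-O", r2_out]
    # Report title plus JSON and HTML report names, as in the commands above.
    cmd += ["-R", sample, "-j", f"{sample}.json", "-h", f"{sample}.html"]
    cmd += list(extra)
    return cmd


extra = "-w 16 -f 2 -F 2 -x -5 -3 -W 7 -M 20 -l 60".split()
cmd = build_fastp_command(
    "fq1", "sample.clean.R1.fq.gz", "sample",
    r2_in="fq2", r2_out="sample.clean.R2.fq.gz", extra=extra,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where fastp is installed.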
IV. Parameter reference
1. Input and output file settings
-i, --in1 read1 input file name (string) # read1 input file
-o, --out1 read1 output file name (string [=]) # read1 output file
-I, --in2 read2 input file name (string [=]) # read2 input file
-O, --out2 read2 output file name (string [=]) # read2 output file
-6, --phred64 indicates the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33) # in Phred+64, quality = ASCII code of the character - 64; Phred+64 characters have ASCII codes of at least 64 (at least 59 in the older Solexa+64 variant), so any quality character with an ASCII code below 59 indicates Phred+33
-z, --compression compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 2. (int [=2]) # gzip compression level for output (1 to 9); 1 is fastest, 9 gives the smallest files, default 2
--reads_to_process specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0]) # number of reads/pairs to process; the default 0 processes all reads
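Whether `-6` is needed can be estimated from the quality characters themselves. The heuristic below is only a sketch: the upper cutoff of 74 (the character 'J', Q41 in Phred+33) is an assumption about typical Illumina data, and `guess_phred_offset` is an illustrative name.

```python
def guess_phred_offset(quality_strings):
    """Guess the quality encoding offset: 33, 64, or None when undecidable."""
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    if not codes:
        return None
    if min(codes) < 59:
        # Characters below ASCII 59 cannot occur in Phred+64 or Solexa+64.
        return 33
    if max(codes) > 74:
        # Assumed ceiling for Phred+33 Illumina data; higher codes suggest +64.
        return 64
    return None  # only mid-range characters seen: ambiguous


print(guess_phred_offset(["II5+"]))
print(guess_phred_offset(["hhhh"]))
```

Sampling a few thousand reads is usually enough for such a check, since a single low-quality base settles the Phred+33 case.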
2. adapter trimming options: adapter removal settings
-A, --disable_adapter_trimming adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled # adapters are removed from the raw data automatically by default; this option turns adapter removal off
-a, --adapter_sequence the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto]) # adapter for read1; for single-end data it is auto-detected when not given, and for paired-end data it is applied to pairs whose R1/R2 do not overlap
--adapter_sequence_r2 the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=]) # adapter for read2, paired-end data only; applied to pairs whose R1/R2 do not overlap, and defaults to the read1 adapter when not given
3. global trimming options: number of bases to trim from the start and end of every read
-f, --trim_front1 trimming how many bases in front for read1, default is 0 (int [=0]) # number of bases to trim from the start of read1, default 0
-t, --trim_tail1 trimming how many bases in tail for read1, default is 0 (int [=0]) # number of bases to trim from the end of read1, default 0
-F, --trim_front2 trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0]) # number of bases to trim from the start of read2; follows the read1 setting when not given
-T, --trim_tail2 trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0]) # number of bases to trim from the end of read2; follows the read1 setting when not given
4. polyG tail trimming, useful for NextSeq/NovaSeq data
-g, --trim_poly_g force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data # forces polyG tail trimming; it is already enabled automatically for Illumina NextSeq/NovaSeq data
--poly_g_min_len the minimum length to detect polyG in the read tail. 10 by default. (int [=10]) # minimum polyG length detected at the read tail, default 10
-G, --disable_trim_poly_g disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data # turns polyG tail trimming off
# polyX tail trimming
-x, --trim_poly_x enable polyX trimming in 3' ends. # trims polyX at the 3' end
--poly_x_min_len the minimum length to detect polyX in the read tail. 10 by default. (int [=10]) # minimum polyX length detected at the read tail, default 10
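The effect of the minimum-length setting can be illustrated with a simplified polyG trimmer. This sketch only removes exact runs of G; the real tool may tolerate mismatches inside the tail, so it should be read as an approximation, and `trim_poly_g_tail` is an illustrative name.

```python
def trim_poly_g_tail(seq, min_len=10):
    """Drop a trailing run of G if it is at least `min_len` bases long."""
    stripped = seq.rstrip("G")
    run_length = len(seq) - len(stripped)
    return stripped if run_length >= min_len else seq


print(trim_poly_g_tail("ACGT" + "G" * 12))  # long tail is removed
print(trim_poly_g_tail("ACGTGGG"))          # short run is kept
```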
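5. per read cutting by quality options: sliding-window trimming
-5, --cut_by_quality5 enable per read cutting by quality in front (5'), default is disabled (WARNING: this will interfere deduplication for both PE/SE data) # slides a window from the 5' end toward the tail and drops the bases in each window whose mean quality is below the threshold, stopping at the first window that meets it
-3, --cut_by_quality3 enable per read cutting by quality in tail (3'), default is disabled (WARNING: this will interfere deduplication for SE data) # slides a window from the 3' end toward the front and drops the bases in each window whose mean quality is below the threshold, stopping at the first window that meets it
-W, --cut_window_size the size of the sliding window for sliding window trimming, default is 4 (int [=4]) # sliding window size in bases (1 to 1000), default 4; the window moves along the read much like a k-mer
-M, --cut_mean_quality the bases in the sliding window with mean quality below cutting_quality will be cut, default is Q20 (int [=20]) # mean quality threshold for a window (1 to 36), default Q20; a window whose mean falls below it is treated as a low-quality region and cut
-r, --cut_right # slides a window from the front of the read to the tail; at the first window whose mean quality is below the threshold, drops that window and everything to its right, then stops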
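The `--cut_right` behaviour described above can be sketched as follows, with `window` and `mean_q` standing in for `-W` and `-M`. This is a reading of the option description, not fastp's source code, and qualities are passed as a list of integers for simplicity.

```python
def cut_right(seq, quals, window=4, mean_q=20):
    """Slide a window from the 5' end; at the first window whose mean quality
    is below `mean_q`, drop that window and everything to its right."""
    for start in range(0, len(seq) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < mean_q:
            return seq[:start], quals[:start]
    return seq, quals  # no low-quality window found (or read shorter than window)


seq = "ACGTACGTAC"
quals = [30] * 6 + [10] * 4
# The window starting at index 5 has mean 15, so the read is cut there.
print(cut_right(seq, quals))
```

A larger window, such as the `-W 7` used in the example parameters, smooths over isolated low-quality bases at the cost of reacting later to a real quality drop.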
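6. quality filtering options: filter reads by base quality
-Q, --disable_quality_filtering quality filtering is enabled by default. If this option is specified, quality filtering is disabled # controls low-quality read removal, which is on by default; -Q turns it off
-q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15]) # threshold for a qualified base, default 15: bases below Q15 count as low quality; 20 (the familiar Q20) is a common setting
-u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40]) # allowed percentage of low-quality bases; a read is not discarded merely for containing low-quality bases, only when they exceed this share. The default 40 means 40%, so a 150 bp read with more than 60 low-quality bases is discarded, and in paired-end data the whole pair is discarded if either read fails
-n, --n_base_limit if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5]) # filters reads with too many N bases: if the N count exceeds n, the read/pair is discarded, default 5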
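The interplay of `-q`, `-u` and `-n` can be written out as a small predicate. The sketch below mirrors the option descriptions above with their default values; the function and parameter names are illustrative, and qualities are given as a list of integers.

```python
def passes_quality_filter(seq, quals, qualified_q=15, unqualified_pct=40, n_limit=5):
    """Return True if a read survives the N-count and low-quality-share checks."""
    if seq.count("N") > n_limit:
        return False
    low = sum(1 for q in quals if q < qualified_q)
    # Integer comparison avoids rounding issues: low / len > pct / 100 fails.
    return low * 100 <= unqualified_pct * len(quals)


def pair_passes(read1, read2, **kwargs):
    """A pair is kept only if both reads pass; each read is (seq, quals)."""
    return passes_quality_filter(*read1, **kwargs) and passes_quality_filter(*read2, **kwargs)


seq = "A" * 150
print(passes_quality_filter(seq, [10] * 60 + [30] * 90))  # exactly 40% low: kept
print(passes_quality_filter(seq, [10] * 61 + [30] * 89))  # more than 40% low: dropped
```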
6. length filtering options: filter reads by length
-L, --disable_length_filtering length filtering is enabled by default. If this option is specified, length filtering is disabled # turns read length filtering off
-l, --length_required reads shorter than length_required will be discarded, default is 15. (int [=15]) # takes a length; reads shorter than it are discarded, default 15. This is useful when processing non-Illumina data
# low complexity filtering
-y, --low_complexity_filter enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]). # enables the low-complexity filter; complexity is the percentage of bases that differ from the next base (base[i] != base[i+1])
-Y, --complexity_threshold the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30]) # threshold for the low-complexity filter (0 to 100), default 30
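The complexity definition quoted above translates almost directly into code. In this sketch the denominator is the number of adjacent base pairs (read length minus one), which is an assumption about how the percentage is normalised; `complexity_pct` is an illustrative name.

```python
def complexity_pct(seq):
    """Percentage of bases that differ from the next base in the read."""
    if len(seq) < 2:
        return 0.0
    different = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return 100.0 * different / (len(seq) - 1)


def passes_complexity(seq, threshold=30):
    """Keep a read only if its complexity reaches the threshold (like -Y)."""
    return complexity_pct(seq) >= threshold


print(complexity_pct("AAAAAAAATTTTTTTT"))   # a single A-to-T change
print(passes_complexity("AAAAAAAATTTTTTTT"))
```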
7. filter reads with unwanted indexes (to remove possible contamination): filter reads by index to remove possible contamination
--filter_by_index1 specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])
--filter_by_index2 specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])
--filter_by_index_threshold the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])
8. base correction by overlap analysis options: correct bases using the read overlap
-c, --correction enable base correction in overlapped regions (only for PE data), default is disabled # corrects errors inside the overlapping region, so it only applies to paired-end reads
9. UMI processing: unique molecular identifier handling
-U, --umi enable unique molecular identifier (UMI) preprocessing
--umi_loc specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read), default is none (string [=])
--umi_len if the UMI is in read1/read2, its length should be provided (int [=0])
--umi_prefix if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])
--umi_skip if the UMI is in read1/read2, fastp can skip several bases following UMI, default is 0 (int [=0])
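For the case where the UMI sits at the start of a read, the options above can be pictured with the sketch below. Appending the UMI to the first field of the read name with a colon is an assumption made for this illustration and may not match fastp's exact output; the function `extract_umi` and its parameters are hypothetical stand-ins for `--umi_len`, `--umi_prefix` and `--umi_skip`.

```python
def extract_umi(name, seq, qual, umi_len, prefix="", skip=0):
    """Move the leading UMI of a read into its name and trim it from the read."""
    umi = seq[:umi_len]
    # With a prefix, an underscore joins prefix and UMI (e.g. UMI_AATTCG).
    tag = f"{prefix}_{umi}" if prefix else umi
    first, sep, rest = name.partition(" ")
    new_name = f"{first}:{tag}{sep}{rest}"  # assumed naming layout
    cut = umi_len + skip  # also drop `skip` bases that follow the UMI
    return new_name, seq[cut:], qual[cut:]


print(extract_umi("@read1 1:N:0", "AATTCGACGTACGT", "I" * 14, 6, prefix="UMI"))
```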
10. overrepresented sequence analysis
-p, --overrepresentation_analysis enable overrepresented sequence analysis.
-P, --overrepresentation_sampling One in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20. (int [=20])
# reporting options
-j, --json the json format report file name (string [=fastp.json]) # file name of the JSON report
-h, --html the html format report file name (string [=fastp.html]) # file name of the HTML report
-R, --report_title should be quoted with ' or ", default is "fastp report" (string [=fastp report])
11. threading options: number of threads
-w, --thread worker thread number, default is 3 (int [=3]) # number of worker threads, default 3
12. output splitting options: controls output splitting. When a single reads file is too large, it can be split into several parts that are aligned separately, with the resulting BAM files merged afterward, which improves efficiency.
-s, --split split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0]) # number of files to split into (2 to 999); the default 0 means no splitting
-S, --split_by_lines split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (long [=0])
-d, --split_prefix_digits the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4]) # number of digits in the output prefix, default 4 (0001, 0002, ...); with 3 the prefixes become 001, 002, ...
--help print this message