VarScan2 Call SNP

先搬一段VarScan的介绍http://dkoboldt.github.io/varscan/
VarScan is a platform-independent mutation caller for targeted, exome, and whole-genome resequencing data generated on Illumina, SOLiD, Life/PGM, Roche/454, and similar instruments.

It can be used to detect different types of variation:

  • Germline variants (SNPs an dindels) in individual samples or pools of samples
  • Multi-sample variants (shared or private) in multi-sample datasets (with mpileup)
  • Somatic mutations, LOH events, and germline variants in tumor-normal pairs
  • Somatic copy number alterations (CNAs) in tumor-normal exome data

从上面可以看出,VarScan可以用于Germline variants以及Somatic mutations,因此我用这软件来Call SNP。。。

至于为什么选用VarScan来call variant,VarScan是这么说的:对于NGS数据,大多数variant caller都是基于贝叶斯算法来识别变异位点,然后通过可信度来评估,但会受extreme read depth,pooled samples,contaminated or impure samples的影响;而VarScan是基于启发式算法来寻找变异位点,可以很好的meet desired thresholds for read depth, base quality, variant allele frequency, and statistical significance

安装很简单,下载jar文件即可使用,https://sourceforge.net/projects/varscan/files/

  1. 用samtools mpileup生成mpileup文件作为VarScan的输入文件,这里跟samtools+bcftools流程的是不同的,所以只需要输出mpileup文件即可,那么就不用加-g参数了,还有其他参数倒是可以加上

    -q, -min-MQ 比对的mapping quality

    -d, –max-depth 最大测序深度,过滤掉超深度测序的位点

    samtools mpileup -q 1 -d 30000 -f ~/reference/genome/human_g1k_v37/human_g1k_v37.fasta merged_marked.bam 1>merged_marked.mpileup 2>mpileup.log
    
  2. 既然是call snp,那么肯定是用VarScan2的mpileup2snp命令,具体参数可以查看http://dkoboldt.github.io/varscan/using-varscan.html#v2.3_mpileup2snp,其他命令也类似

    USAGE: java -jar VarScan.jar mpileup2snp [mpileup file] OPTIONS mpileup file - The SAMtools mpileup file
    

    参数不多,全部列出来也可以:

    –min-coverage Minimum read depth at a position to make a call(位点的测序深度,覆盖度)默认 8

    –min-reads2 Minimum supporting reads at a position to call variants(paired reads,同上)默认 2

    –min-avg-qual Minimum base quality at a position to count a read(位点上的reads碱基质量)默认 15

    –min-var-freq Minimum variant allele frequency threshold(不太懂这个阈值是怎么卡的。。。。。。)默认 0.01

    –min-freq-for-hom Minimum frequency to call homozygote(这个还是不懂)默认 0.75

    –p-value Default p-value threshold for calling variants(call出来的varians的可行度)默认是0.99

    –strand-filter Ignore variants with >90% support on one strand 默认1

    –output-vcf If set to 1, outputs in VCF format

    –variants Report only variant (SNP/indel) positions (mpileup2cns only)(对于mpileup2snp就无视这个了) [0]

    由上可看出,大部分默认参数即可,特定情况再依情况修改

    java -jar VarScan.jar mpileup2snp merged_marked.mpileup --output-vcf 1 >varscan.snp.vcf
    

    但是VarScan的http://dkoboldt.github.io/varscan/germline-calling.html里面对于input文件(mpileup)又一个卡阈值,如下:

    Only bases meeting the minimum base quality (default: 20) from reads meeting the minimum mapping quality (default: 1) are considered. The coverage (number of qualifying bases) is calculated. If this meets the minimum threshold (default: 20), VarScan examines each allele that was observed, testing to see if it:
    Was supported by at least the minimum number of supporting reads [–min-reads2]
    Meets the minimum allele frequency threshold [–min-var-freq]
    Passes a basic strand-bias filter (if –strand-filter set to 1)
    Has a Fisher’s Exact Test p-value below the threshold (if –p-value specified)

    简单的说就是满足minimum base quality=20,minimum mapping quality=1,minimum coverage=20的位点才call variant,怎么跟mpileup2snp的默认参数又有些不一致了呢,我暂时也没搞清楚

  3. 过滤位点,VarScan call出来的vcf文件有点跟其他软件不同,最明显的地方就是没有QUAL值,毕竟算法不一样嘛,但是其他常用的参数比如GQ,DP还是都有值的

    varscan也有自己filter命令,内容很简单http://dkoboldt.github.io/varscan/using-varscan.html#v2.3_filter,可以看看

    所以也可以从vcf的第10列的信息来过滤,脚本可以简单的过滤低GQ和DP值的位点