Freebayes Call SNP

Freebayes身为众多call变异软件中的一员,一直被使用到现在,肯定有其独到之处。因此对其的简单使用方法做个笔记

  1. 下载及安装

    git clone --recursive git://github.com/ekg/freebayes.git
    make
    make install 
    
  2. 软件介绍及原理

    可参考一篇博客http://www.cnblogs.com/freemao/p/3754659.html
    官方文档https://github.com/ekg/freebayes

  3. 使用方法

    3.1 官网给的基本流程如下:

    • Align your reads to a suitable reference (e.g. with bwa or MOSAIK)
    • Ensure your alignments have read groups attached so their sample may be identified by freebayes. Aligners allow you to do this, but you can also use bamaddrg to do so post-alignment.
    • Sort the alignments (e.g. bamtools sort).
    • Mark duplicates, for instance with samtools rmdup (if PCR was used in the preparation of your sequencing library)
    • Run freebayes on all your alignment data simultaneously, generating a VCF. The default settings should work for most use cases, but if your samples are not diploid, set the –ploidy and adjust the –min-alternate-fraction suitably.
    • Filter the output e.g. using reported QUAL and/or depth (DP) or observation count (AO).
    • Interpret your results.
    • (possibly, Iterate the variant detection process in response to insight gained from your interpretation)

    一般来说前4步都是大同小异,比如bwa(add @RG) -> sort(samtools, picard) -> mark duplicates(picard),接着就是用freebayes产生vcf文件,最后过滤。。。

    3.2 使用freebayes从bam到vcf文件

    freebayes -h
    

    可看到参数众多,对几个主要参数做个解释

    –min-alternate-count (default: 2):Require at least this count of observations supporting an alternate allele within a single individual in order to evaluate the position(最少的alt碱基数,可以根据你测序深度来定)

    –min-alternate-fraction (default: 0.2):Require at least this fraction of observations supporting an alternate allele within a single individual in the
    in order to evaluate the position(跟–min-alternate-count类似,就是变成比例了,双倍体用默认值0.2即可,如果多倍体(>2)的话,建议使用>0.2的值)

    –genotype-qualities :Calculate the marginal probability of genotypes and report as GQ in each sample field in the VCF output(在输出的VCF文件中有GQ的tag)

    –min-mapping-quality(default: 1):Exclude alignments from analysis if they have a mapping quality less than Q

    –min-base-quality (default: 0):Exclude alleles from analysis if their supporting base quality is less than Q (这个和上面两个参数是对input进行过滤的,但我觉得也可以在最后vcf文件,对结果再过滤也行?)

    -X –no-mnps:Ignore multi-nuceotide polymorphisms, MNPs.

    -u –no-complex:Ignore complex events (composites of other classes)(官网是不建议这么做,其认为减少这些信息会降低检测变异的可行度,其建议在最后文件中再去过滤获得snp or indel)

    从上述可看出,基本上如果单纯的call变异的话,并且前期reads已做质控,那么在freebayes call变异这步几乎可以说使用默认参数即可(对于2倍体生物),或者后续有其他需求可再加入其他参数,所以命令如下:

    freebayes -f ~/reference/genome/human_g1k_v37/human_g1k_v37.fasta merged_marked.bam >freebayes_raw.vcf
    
  4. 提取SNP以及INDEL

    freebayes产生的VCF文件中INFO一列有专门的一个tag来注释是snp、ins(插入)、del(缺失)、mnp(连续两个snp位点,如ref为AT, alt为CG)以及complex(composite insetion and substitution events)

    所以只要用下面的shell命令即可提取SNP

    grep 'TYPE=snp' freebayes_raw.vcf > freebayes_snp_raw.vcf
    

    其他类型的变异也是类似

  5. 最后就是过滤低质量的SNP位点了

    可以用bcftools过滤,可看之前的文章http://www.bioinfo-scrounger.com/archives/248

    也可以用vcftools,粗略看过,功能比bcftools的filter功能要强大点

    当然如果是简单过滤,直接脚本就行了