GATK best practices(Pre-processing+Variant discovery)

GATK Best Practices的目的:Best Practices for Germline SNP & Indel Discovery in Whole Genome and Exome Sequence

准备按照下述教程对其做个小笔记(流程参照官网,并结合公众号),熟悉一下如何寻找变异位点的GATK流程

官网教程:https://software.broadinstitute.org/gatk/best-practices/

公众号教程:https://mp.weixin.qq.com/s/egAnRfr3etccU_RsN-zIlg

大致流程可以分为两部分:pre-processing and variant discovery,官网教程还有第3步:callset refinement(本文不准备做),选用的GATK版本为3.7

首先是初始数据的下载:

基因组数据:ftp://ftp.kobic.re.kr/pub/KPGP/2015_release_candidate/WGS/KPGP-00001里面有6个lane,解压后大概70G+

PRE-PROCESSING
  1. 将KPGP-00001样本的fastq比对到参考基因组上,这里参考基因组选用human_g1k_v37.fasta,比对软件选用bwa,代码展示以KPGP-00001_L1为例

    # 建索引
    bwa index -a bwtsw -p human_g1k_v37 human_g1k_v37.fasta 1>human_g1k_v37.bwa_index.log 2>&1
    # 比对
    bwa mem -t 5 -R "@RG\tID:KPGP-00001_L1\tSM:KPGP-00001_L1\tLB:WGS\tPL:Illumina" human_g1k_v37 KPGP-00001_L1_R1.fq KPGP-00001_L1_R2.fq >KPGP-00001_L1.sam
    
  2. 将bwa比对后的sam文件用picard套件中SortSam转化为sort后的bam文件(我的picard软件是放在/home/anlan/biosoft/picard目录下)

    java -Djava.io.tmpdir=/home/anlan/tmp -Xmx10g -jar /home/anlan/biosoft/picard/picard.jar SortSam I=KPGP-00001_L1.sam O=KPGP-00001_L1.bam SORT_ORDER=coordinate
    
  3. 接下来将bam文件进行mark duplicates,也就是将duplicate reads进行flag标记下,其并不是去除这些duplicate reads,从而使得后续GATK软件能够无视这些duplicate reads,使用picard套件中的MarkDuplicates进行处理

    java -Djava.io.tmpdir=/home/anlan/tmp -Xmx10g -jar /home/anlan/biosoft/picard/picard.jar MarkDuplicates I=KPGP-00001_L1.bam O=KPGP-00001_L1_marked.bam METRICS_FILE=KPGP-00001_L1.metrics
    
  4. 将上述6个lane的marked.bam文件merge到一个文件,使用samtools merge命令,输出的merged.bam文件也是sort后的,-b 参数输入的是文件列表(每行是一个bam文件名,如KPGP-00001_L1_marked.bam,总共有6个bam文件需要合并,合并后对merge.bam做一个index

    samtools merge -@ 5 -b bamfile.list merged.bam
    samtools index merged.bam
    
  5. 公众号教程接下来是做了RealignerTargetCreator以及IndelRealigner,但GATK Best Practices将这两部舍去了,见RealignerTargetCreatorIndelRealigner,这一步的目的主要是将比对到indel附近的reads进行局部重新比对,将比对的错误率降到最低,可参考一个博客http://blog.sina.com.cn/s/blog_12d5e3d3c0101qu6k.html
    所以说并不是这步骤不好用了,而是说GATK可能觉得这步的影响不太大,如:

    Note that indel realignment is no longer necessary for variant discovery if you plan to use a variant caller that performs a haplotype assembly step, such as HaplotypeCaller or MuTect2. However it is still required when using legacy callers such as UnifiedGenotyper or the original MuTect

    但是个人的话,也可以做以下这个步骤,公众号里有具体步骤

  6. 然后下载BQSR需要的数据库

    ## reference
    wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
    wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/README.human_g1k_v37.fasta.txt
    java -jar ~/biosoft/picardtools/picard.jar CreateSequenceDictionary R=human_g1k_v37.fasta O=human_g1k_v37.dict
    
    ## SNP
    wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/dbsnp_138.b37.vcf.gz
    wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/dbsnp_138.b37.vcf.idx.gz
    
    ##INDEL
    wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/1000G_phase1.indels.b37.vcf.gz
    wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/1000G_phase1.indels.b37.vcf.idx.gz
    wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz
    wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz
    

    将上述都gunzip解压,具体为何选择这些数据库可见:https://software.broadinstitute.org/gatk/documentation/article?id=1247

  7. 进行GATK的BQSR步骤

    7.1 Analyze patterns of covariation in the sequence dataset

    nohup java -Djava.io.tmpdir=/home/anlan/software -Xmx10g  
    -jar /home/anlan/biosoft/gatk/GenomeAnalysisTK.jar   
    -T BaseRecalibrator -fixMisencodedQuals  
    -R /home/anlan/reference/genome/human_g1k_v37/human_g1k_v37.fasta  
    -I merged_marked.bam -o recal_data.table  
    -knownSites /home/anlan/annotation/variation/human/dbSNP/dbsnp_138.b37.vcf  
    -knownSites /home/anlan/annotation/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf  
    -knownSites /home/anlan/annotation/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf &
    

    至于为什么要用-fixMisencodedQuals,是因为有下述报错 SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@3c62c81f appears to be using the wrong encoding for quality scores

    这个报错是原因则是:我的用练习数据是2013年的,那时还是phred64的,用这个参数则会将调整为phred33

    解决方法:https://software.broadinstitute.org/gatk/documentation/article.php?id=6470

    7.2 Apply the recalibration to your sequence data

    nohup java -Djava.io.tmpdir=/home/anlan/software -Xmx10g  
    -jar /home/anlan/biosoft/gatk/GenomeAnalysisTK.jar  
    -T PrintReads  
    -R /home/anlan/reference/genome/human_g1k_v37/human_g1k_v37.fasta  
    -I merged_marked.bam -BQSR recal_data.table  
    -o recal_merged_marked.bam 1>recal.log &
    
VARIANT DISCOVERY
  1. 做完recal后,接下来就是VARIANT DISCOVERY阶段,对recal的bam进行variant-calling

    nohup java -Djava.io.tmpdir=/home/anlan/software -Xmx10g  
    -jar /home/anlan/biosoft/gatk/GenomeAnalysisTK.jar  
    -T HaplotypeCaller  
    -R /home/anlan/reference/genome/human_g1k_v37/human_g1k_v37.fasta  
    -I recal_merged_marked.bam  
    --dbsnp /home/anlan/annotation/variation/human/dbSNP/dbsnp_138.b37.vcf  
    -newQual -o KPGP-00001_recal_raw_variants.vcf 1>HC.log &
    

    这里有几点注意的,–dbSNP参数在best practices里是可选项;官网的教程中的-stand_emit_conf 10参数已经在GATK 3.7中取消了,加上则会报错,改为用-newQual,见http://gatkforums.broadinstitute.org/gatk/discussion/8692/version-highlights-for-gatk-version-3-7

  2. 上述输出的KPGP-00001_recal_raw_variants.vcf则就是raw and unfiltered SNP and indel calls,接下来就是通过GATK工具过滤掉低质量的位点

  3. 先提取SNP并过滤低质量位点

    3.1 提取SNP位点

    java -Djava.io.tmpdir=/home/anlan/software -Xmx10g  
    -jar /home/anlan/biosoft/gatk/GenomeAnalysisTK.jar  
    -T SelectVariants  
    -R /home/anlan/reference/genome/human_g1k_v37/human_g1k_v37.fasta  
    -selectType SNP  
    -V KPGP-00001_recal_raw_variants.vcf  
    -o KPGP-00001_raw_snps.vcf
    

    3.2 过滤低质量位点

    java -Djava.io.tmpdir=/home/anlan/software -Xmx10g  
    -jar /home/anlan/biosoft/gatk/GenomeAnalysisTK.jar  
    -T VariantFiltration  
    -R /home/anlan/reference/genome/human_g1k_v37/human_g1k_v37.fasta  
    -V KPGP-00001_raw_snps.vcf  
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"  
    --filterName "my_snp_filter"  
    -o KPGP-00001_filtered_snps.vcf
    

    3.3 由于上述过滤过程只是将原始位点标注为PASSorFILTER,因此我们需要将过滤pass的位点提取出来

    java -Djava.io.tmpdir=/home/anlan/software -Xmx10g  
    -jar /home/anlan/biosoft/gatk/GenomeAnalysisTK.jar  
    -T SelectVariants  
    -R /home/anlan/reference/genome/human_g1k_v37/human_g1k_v37.fasta  
    --excludeFiltered  
    -V KPGP-00001_filtered_snps.vcf  
    -o KPGP-00001_filtered_pass.snp.vcf
    

    3.4 最后用VariantEval来评估变异位点

    java -Djava.io.tmpdir=/home/anlan/software -Xmx10g  
    -jar /home/anlan/biosoft/gatk/GenomeAnalysisTK.jar  
    -T VariantEval  
    -R /home/anlan/reference/genome/human_g1k_v37/human_g1k_v37.fasta  
    -eval KPGP-00001_filtered_pass.snp.vcf -o KPGP-00001_filtered_pass.snp.vcf.eval 
    
  4. 过滤INDEL,类似重复上述步骤

    # extracr indels
    java -Djava.io.tmpdir=/home/anlan/software -Xmx10g -jar /home/anlan/biosoft/gatk/GenomeAnalysisTK.jar -T SelectVariants -R /home/anlan/reference/genome/human_g1k_v37/human_g1k_v37.fasta -selectType INDEL -V KPGP-00001_recal_raw_variants.vcf -o KPGP-00001_raw_indels.vcf
    # filter indels 
    java -Djava.io.tmpdir=/home/anlan/software -Xmx10g -jar /home/anlan/biosoft/gatk/GenomeAnalysisTK.jar -T VariantFiltration -R /home/anlan/reference/genome/human_g1k_v37/human_g1k_v37.fasta -V KPGP-00001_raw_indels.vcf --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" --filterName "my_indel_filter" -o KPGP-00001_filtered_indels.vcf
    # extract pass indels
    java -Djava.io.tmpdir=/home/anlan/software -Xmx10g -jar /home/anlan/biosoft/gatk/GenomeAnalysisTK.jar -T SelectVariants -R /home/anlan/reference/genome/human_g1k_v37/human_g1k_v37.fasta --excludeFiltered -V KPGP-00001_filtered_indels.vcf -o KPGP-00001_filtered_pass.indels.vcf
    # variant evaluation
    java -Djava.io.tmpdir=/home/anlan/software -Xmx10g -jar /home/anlan/biosoft/gatk/GenomeAnalysisTK.jar -T VariantEval -R /home/anlan/reference/genome/human_g1k_v37/human_g1k_v37.fasta -eval KPGP-00001_filtered_pass.indels.vcf -o KPGP-00001_filtered_pass.indels.vcf.eval
    

最后得到的KPGP-00001_filtered_pass.snp.vcf和KPGP-00001_filtered_pass.indels.vcf两个vcf文件即可用于后续分析了,另外在做这个笔记的时候,GAKT 4.0版本也刚好出来了,可能GATK 3.7的流程会有变动,我也要重新尝试下GATK 4.0的啦