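A First Try at Calling Somatic Mutations with GATK4 Mutect2

Cancer studies typically sequence tumor tissue together with matched normal tissue (WGS/WES) and then filter out the germline mutations, which leaves the somatic mutations of the tumor.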

  • Somatic mutations – occur in a single body cell and cannot be inherited (only tissues derived from mutated cell are affected)
  • Germline mutations – occur in gametes and can be passed onto offspring (every cell in the entire organism will be affected)

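The difference between germline and somatic mutations is illustrated in the figure below, from "Somatic vs Germline Mutations":

[Figure: somatic-vs-germline_med]

Put simply: germline mutations are generally inherited from the parents (they arise in the germ cells, sperm and egg, and are heritable), so they are studied mostly in pedigree and hereditary-disease research. Somatic mutations are not inherited; they are acquired during the individual's lifetime and are studied mainly in the context of tumors.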

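Quite a few tools are available for calling somatic mutations. The first ones I heard of were the four callers used by TCGA (see the post "Hands-on experience with the four somatic mutation callers used by the TCGA project"):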

  • VarScan
  • Mutect2
  • SomaticSniper
  • MuSE

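Strelka is also said to work quite well (see the post "Teachers' Day special: comparing three tumor variant callers, MuTect2/Strelka/VarScan2").

Finding somatic mutations is an indispensable step in tumor research. The results are affected by the sequencing library, sequencing depth, tumor purity, the limitations of the software and other factors, but somatic mutations are undeniably of great research value, so I want to start by learning how to find them.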


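Since I have mostly used GATK so far, I am starting with Mutect2, which is one of the tools within GATK. The test data is tutorial_11136.tar.gz from the GATK website. It is downloaded the same way as the GATK resource bundle, over FTP (host: ftp.broadinstitute.org, user name: gsapubftp-anonymous), or with the following command: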

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/tutorials/datasets/tutorial_11136.tar.gz

The Mutect2 workflow is outlined in the diagram below:

[Figure: gatk_MuTect2]

Reference: the GATK tutorial "(How to) Call somatic mutations using GATK4 Mutect2".

The main steps are:

  • Build a Panel of Normals (PoN) with CreateSomaticPanelOfNormals
  • Call somatic SNVs and indels with Mutect2
  • Filter the calls with FilterMutectCalls to obtain confident somatic mutations

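There are also two optional steps:

  • Estimate cross-sample contamination with GetPileupSummaries and CalculateContamination (GATK4 no longer seems to describe this step as optional, so it may now be required?)
  • Filter sample-preparation artifacts (OxoG and FFPE) with FilterByOrientationBias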
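
As a rough illustration of the contamination step, the sketch below only prints the two commands rather than running them. The tool names come from the list above, and tumor.bam and tumor_calculatecontamination.table are file names used later in this post. The other two file names are hypothetical placeholders, and the -I/-V/-O flags are written as I understand them for GATK 4.0.x, so check `gatk GetPileupSummaries --help` for your version (newer releases also expect an interval argument, -L).

```shell
# Sketch only: build and print the two contamination commands, do not run them.
GATK="gatk"
TUMOR_BAM="tumor.bam"
# Hypothetical placeholder: a VCF of common biallelic germline sites
# with population allele frequencies. Not named anywhere in this post.
COMMON_SITES="common_biallelic_sites.vcf.gz"
# Hypothetical intermediate file name.
PILEUP_TABLE="tumor_getpileupsummaries.table"
# This name matches the table passed to FilterMutectCalls later on.
CONTAM_TABLE="tumor_calculatecontamination.table"

# Step 1: tabulate REF/ALT read counts in the tumor at the common sites.
pileup_cmd="$GATK GetPileupSummaries -I $TUMOR_BAM -V $COMMON_SITES -O $PILEUP_TABLE"

# Step 2: estimate the contamination fraction from that pileup table.
contam_cmd="$GATK CalculateContamination -I $PILEUP_TABLE -O $CONTAM_TABLE"

echo "$pileup_cmd"
echo "$contam_cmd"
```

Printing the commands first makes it easy to check the file names before committing to a long run.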

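Before using the tool, it is worth reading "Somatic calling is NOT simply a difference between two callsets" to better understand the reasoning and the details behind how Mutect2 calls somatic variants. The main points are: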

  • The HaplotypeCaller model uses ploidy in its genotype likelihood calculations. The Mutect2 model does not
  • Mutect2 works primarily by contrasting the presence or absence of evidence for variation between two samples, the tumor and matched normal, from the same individual. The tool can run on unmatched tumors but this produces high rates of false positives. (Mutect2 finds mutations mainly by comparing sites between the matched normal and tumor samples; it still runs without a normal sample, but the false positive rate is then high)
  • Although we present the tool for somatic calling, it may apply to other contexts, such as mitochondrial variant calling. (so it can apparently also be used for mitochondrial variant calling?)
  • Because a somatic callset is based on a single individual rather than a cohort, annotations in the INFO column of a Mutect2 VCF only refer to the ALT alleles and do not include values for the REF allele. (unlike a germline callset, the INFO annotations in a Mutect2 VCF describe only the ALT alleles and carry no values for the REF allele)
  • If a PoN or matched normal is provided, Mutect2 can use either to filter sites before reassembly, and it can use a germline resource to filter alleles. (when a PoN or matched normal is supplied, Mutect2 uses it to filter sites before local reassembly, and a germline resource can be used to filter alleles)
  • Mutect2 allows for varying ploidy in the form of allele fractions for each variant. (Mutect2 lets each variant have its own allele fraction, whereas HaplotypeCaller assumes a fixed ploidy)
  • To protect patient identities, somatic calling was designed to avoid passing on any identifying germline variation from untranslated and noncoding regions. (variants in untranslated and noncoding regions can be used to tell individuals apart, so to protect patient privacy somatic calling is designed not to pass on identifying germline variation from these regions)
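  • What this means is that researchers prefer to remove the maximal number of false positives even at the expense of losing some true positives. (Mutect2 leans toward maximizing specificity: in a cohort study, losing some information from individual samples is tolerable, and this also cuts down the false positive mutations that repeated sequencing would produce)
  • A somatic caller should detect low fraction alleles, can make no explicit ploidy assumption and omits genotyping in the traditional sense. (1. some tumor sections contain a large share of normal cells, which dilutes the tumor signal; 2. tumor mutations may be heterogeneous; 3. some mutations are aneuploid, for example as a result of copy number variation)
  • A variant allele in the case sample is not called if the site is variant in controls. (in the earlier version, MuTect, a site that is variant in the matched normal, or that appears in a database such as dbSNP, is not called as somatic; under the random mutation hypothesis such a site is either unlikely to be somatic or of little value for downstream analysis. Mutect2 changes this slightly: it discounts sites that are in the PoN or that have ambiguous evidence in the normal, and sites present in the germline resource are treated in inverse proportion to their variant allele frequencies)
  • Somatic workflows filter case sites with multiple variant alleles. (Mutect2 does call sites with multiple variant alleles, but they are filtered out at the FilterMutectCalls step)
  • The panel of normals helps filter systematic artifacts of sequencing. (artifacts from sample preparation, sequencing and mapping introduce systematic errors that show up as noise when calling somatic mutations, which is why a PoN is needed. A PoN is built from several germline normal samples (two or more), called by Mutect2 with the same parameters as the tumor sample. The PoN also helps filter out common germline sites, which is handy when no population resource is available)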
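
To make the point about allele fractions concrete, here is a minimal sketch that computes alt / (ref + alt) from the AD (allelic depth) field of each sample column. The record is a toy line with invented values, not real Mutect2 output, and real Mutect2 FORMAT columns carry more fields than the two shown here.

```shell
# Toy VCF data line (values invented for illustration).
# Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tumor normal
record=$(printf 'chr17\t1000\t.\tC\tT\t.\t.\t.\tGT:AD\t0/1:95,5\t0/0:60,0')

# Locate AD in the FORMAT column, then report alt/(ref+alt) per sample.
allele_fractions=$(printf '%s\n' "$record" | awk -F'\t' '{
    n = split($9, keys, ":")
    for (k = 1; k <= n; k++) if (keys[k] == "AD") ad_idx = k
    for (i = 10; i <= NF; i++) {
        split($i, vals, ":")          # per-sample values, same order as FORMAT
        split(vals[ad_idx], ad, ",")  # ad[1] = ref depth, ad[2] = alt depth
        total = ad[1] + ad[2]
        printf "%s%.3f", (i > 10 ? " " : ""), (total > 0 ? ad[2] / total : 0)
    }
    printf "\n"
}')

echo "$allele_fractions"
```

In this toy record the tumor carries the ALT allele in 5 of 100 reads, a fraction that a diploid genotype model would fit poorly, while the normal shows no ALT reads at all.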

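Finally, let's actually run Mutect2 (all of the data below comes from the test dataset).

First build a PoN. Run Mutect2 in tumor-only mode on each normal sample that will go into the PoN, taking HG00190 as an example: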

~/biosoft/GATK4.0/gatk-4.0.5.1/gatk Mutect2 \
-R ~/reference/genome/gatk_hg38/Homo_sapiens_assembly38.fasta \
-I HG00190.bam \
-tumor HG00190 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 3_HG00190.vcf.gz

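Here --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter is used to keep paired reads whose mates map to different contigs.

Then pass the VCFs of the three normal samples to CreateSomaticPanelOfNormals, which collates the sites found across these samples: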
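
Since the same tumor-only command has to be run once per normal, a small loop saves some typing. This is only a sketch: the sample names and the 3_/4_/5_ output prefixes follow the file names used for the PoN in this post, but the <sample>.bam naming for NA19771 and HG02759 is my assumption (only HG00190.bam appears above). With DRY_RUN=1 the commands are printed, not executed.

```shell
GATK=~/biosoft/GATK4.0/gatk-4.0.5.1/gatk
REF=~/reference/genome/gatk_hg38/Homo_sapiens_assembly38.fasta

# 1 = only print each command; change to 0 to actually run it.
DRY_RUN=1

index=3   # output prefixes 3_, 4_, 5_ as used in this post
for sample in HG00190 NA19771 HG02759; do
    # Assumption: each normal BAM is named <sample>.bam
    last_output="${index}_${sample}.vcf.gz"
    cmd="$GATK Mutect2 -R $REF -I ${sample}.bam -tumor $sample --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter -L chr17plus.interval_list -O $last_output"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        # Unquoted on purpose so the string splits into arguments;
        # this breaks if any path contains spaces.
        $cmd
    fi
    index=$((index + 1))
done
```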

~/biosoft/GATK4.0/gatk-4.0.5.1/gatk CreateSomaticPanelOfNormals \
-vcfs 3_HG00190.vcf.gz \
-vcfs 4_NA19771.vcf.gz \
-vcfs 5_HG02759.vcf.gz \
-O 6_threesamplepon.vcf.gz

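Next, run Mutect2 on the tumor sample together with the matched normal, the PoN and the germline resource to call somatic mutations. To save time, the official example restricts the test to chromosome 17: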

time ~/biosoft/GATK4.0/gatk-4.0.5.1/gatk --java-options "-Xmx2g" Mutect2 \
-R ~/reference/genome/gatk_hg38/Homo_sapiens_assembly38.fasta \
-I tumor.bam \
-I normal.bam \
-tumor HCC1143_tumor \
-normal HCC1143_normal \
-pon resources/chr17_m2pon.vcf.gz \
--germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 1_somatic_m2.vcf.gz \
-bamout 2_tumor_normal_m2.bam

Finally, filter the calls to keep the higher-confidence somatic mutation sites:

~/biosoft/GATK4.0/gatk-4.0.5.1/gatk FilterMutectCalls \
-V somatic_m2.vcf.gz \
--contamination-table tumor_calculatecontamination.table \
-O 9_somatic_oncefiltered.vcf.gz

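The --contamination-table argument passes in tumor_calculatecontamination.table, so that the contamination estimate is taken into account during filtering. Note that the input VCF in this command is somatic_m2.vcf.gz, not the 1_somatic_m2.vcf.gz written by the Mutect2 command above; when working with your own data, pass your own Mutect2 output to -V.

I will try the two optional analyses next time on real data, and also write up how to interpret the result files. I plan to go through the whole workflow again once I find a better dataset.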
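
To get a quick feel for what the filtering did, one option is to tabulate the FILTER column (column 7) of the output VCF. The sketch below defines a small helper, filter_summary (a name made up for this example), and runs it on toy records invented for illustration. The filter names in the toy data are just examples of the labels FilterMutectCalls can write.

```shell
# Count how often each FILTER value occurs in a VCF read from stdin.
filter_summary() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            n = split($7, names, ";")      # one record may carry several filters
            for (i = 1; i <= n; i++) count[names[i]]++
        }
        END { for (name in count) print name "\t" count[name] }
    ' | LC_ALL=C sort
}

# Toy records (invented, not real FilterMutectCalls output).
toy_vcf=$(
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr17\t100\t.\tA\tT\t.\tPASS\t.\n'
    printf 'chr17\t200\t.\tG\tC\t.\tpanel_of_normals\t.\n'
    printf 'chr17\t300\t.\tC\tA\t.\tgermline_risk;panel_of_normals\t.\n'
    printf 'chr17\t400\t.\tT\tG\t.\tPASS\t.\n'
)

summary=$(printf '%s\n' "$toy_vcf" | filter_summary)
echo "$summary"
```

On a real result the same helper can be fed with `gzip -dc 9_somatic_oncefiltered.vcf.gz | filter_summary`, since bgzip-compressed VCFs can be read by gzip.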

References:

Differences between GATK3 MuTect2 and GATK4 Mutect2

(How to) Call somatic mutations using GATK4 Mutect2
(How to) Call somatic SNVs and indels using MuTect2
Off-label workflow to simply call differences in two samples

This article comes from http://www.bioinfo-scrounger.com; please credit the source when reposting.