TransDecoder使用的简易教程

TransDecoder按照其官网的说明,主要用于识别转录本序列中的潜在的编码区域,也就是预测CDS。转录本可以由RNA-Seq数据通过Trinity组装来的,也可以由RNA-Seq比对到参考基因组上构建的转录本。

TransDecoder识别可能的编码区域是基于以下几个标准:

  • a minimum length open reading frame (ORF) is found in a transcript sequence
  • a log-likelihood score similar to what is computed by the GeneID software is > 0
  • the above coding score is greatest when the ORF is scored in the 1st reading frame as compared to scores in the other 5 reading frames
  • if a candidate ORF is found fully encapsulated by the coordinates of another candidate ORF, the longer one is reported. However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc)
  • optional the putative peptide has a match to a Pfam domain above the noise cutoff score

其是基于马尔可夫模型,这点跟ESTScan软件(基于隐马尔科夫模型)有较大区别。但是相比较ESTScan,TransDecoder软件使用的简便性是非常令人满意的。

从FASTA格式文件预测编码区域

  1. 提取FASTA序列中最长的ORF

    TransDecoder.LongOrfs -t target_transcripts.fasta
    

    可以通过-m 参数来设定ORF的最小长度,按照其说法,随着ORF长度的减少,其假阳性的概率也会随之上升,其默认值是100。所以也可以按照自己的需求进行调整-m 参数

  2. 可选的步骤

    通过Blast或者pfam搜索已知蛋白序列来识别ORF

  3. 预测可能的编码区域

    TransDecoder.Predict -t target_transcripts.fasta [ homology options ]
    

    这里要注意的是,运行该程序时,需跟第一步的当前路径保持一致。因此最终的文件可以在当前目录找到,也就是后缀为.pep, .cds, .gff3和.bed的文件

我觉得第二步不必按照其要求来做,一般来说,我们会使用TransDecoder对无参转录组的拼接结果序列预测其CDS,所以我们可以先将拼接序列用BLAST比对nr以及swissprot蛋白数据库,然后提取其比对上的同源序列的位置来识别CDS,最后再通过TransDecoder的第一步和第三步来预测那些未比对上的序列的CDS。

从有参考基因组的转录结果GTF文件预测编码区域

  1. 需要有参转录组比对后拼接的转录本的GTF文件以及参考基因组序列

    util/cufflinks_gtf_genome_to_cdna_fasta.pl transcripts.gtf test.genome.fasta > transcripts.fasta
    
  2. 将GTF文件转化为GFF3文件

    util/cufflinks_gtf_to_alignment_gff3.pl transcripts.gtf > transcripts.gff3
    
  3. 接着就跟上面的步骤一样了

    TransDecoder.LongOrfs -t transcripts.fasta
    (optionally, identify peptides with homology to known proteins)
    TransDecoder.Predict -t transcripts.fasta [ homology options ]
    
  4. 最后生成一个基于有参基因组的编码区域注释文件

    util/cdna_alignment_orf_to_genome_orf.pl transcripts.fasta.transdecoder.gff3 transcripts.gff3 transcripts.fasta > transcripts.fasta.transdecoder.genome.gff3 
    

输出文件说明

其实结果文件中我们最关注的就2个文件,一个是预测后的CDS序列,也就是在当前路径下的后缀为.cds的文件;另一个则为其对应的蛋白序列文件.pep。

TransDecoder对每个文件有详细的说明:

  • longest_orfs.pep: all ORFs meeting the minimum length criteria, regardless of coding potential.
  • longest_orfs.gff3: positions of all ORFs as found in the target transcripts
  • longest_orfs.cds: the nucleotide coding sequence for all detected ORFs
  • longest_orfs.cds.top_500_longest: the top 500 longest ORFs, used for training a Markov model for coding sequences.
  • hexamer.scores: log likelihood score for each k-mer (coding/random)
  • longest_orfs.cds.scores : the log likelihood sum scores for each ORF across each of the 6 reading frames
  • longest_orfs.cds.scores.selected : the accessions of the ORFs that were selected based on the scoring criteria (described at top)
  • longest_orfs.cds.best_candidates.gff3 : the positions of the selected ORFs in transcripts

Then, the final outputs are reported in your current working directory:

  • transcripts.fasta.transdecoder.pep : peptide sequences for the final candidate ORFs; all shorter candidates within longer ORFs were removed.
  • transcripts.fasta.transdecoder.cds : nucleotide sequences for coding regions of the final candidate ORFs
  • transcripts.fasta.transdecoder.gff3 : positions within the target transcripts of the final selected ORFs
  • transcripts.fasta.transdecoder.bed : bed-formatted file describing ORF positions, best for viewing using GenomeView or IGV.

参考自:http://transdecoder.github.io/