TransDecoder使用的简易教程

TransDecoder按照其官网的说明，主要用于识别转录本序列中的潜在的编码区域，也就是预测CDS。转录本可以由RNA-Seq数据通过Trinity组装来的，也可以由RNA-Seq比对到参考基因组上构建的转录本。 TransDecoder识别可能的编码区域是基于以下几个标准：

a minimum length open reading frame (ORF) is found in a transcript sequence
a log-likelihood score similar to what is computed by the GeneID software is > 0
the above coding score is greatest when the ORF is scored in the 1st reading frame as compared to scores in the other 5 reading frames
if a candidate ORF is found fully encapsulated by the coordinates of another candidate ORF, the longer one is reported. However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc)
optional the putative peptide has a match to a Pfam domain above the noise cutoff score

其是基于马尔可夫模型，这点跟ESTScan软件（基于隐马尔科夫模型）有较大区别。但是相比较ESTScan，TransDecoder软件使用的简便性是非常令人满意的。

从FASTA格式文件预测编码区域

提取FASTA序列中最长的ORF
```
 TransDecoder.LongOrfs -t target_transcripts.fasta
```
可以通过-m 参数来设定ORF的最小长度，按照其说法，随着ORF长度的减少，其假阳性的概率也会随之上升，其默认值是100。所以也可以按照自己的需求进行调整-m 参数
可选的步骤

通过Blast或者pfam搜索已知蛋白序列来识别ORF
预测可能的编码区域
```
 TransDecoder.Predict -t target_transcripts.fasta [ homology options ]
```
这里要注意的是，运行该程序时，需跟第一步的当前路径保持一致。因此最终的文件可以在当前目录找到，也就是后缀为.pep, .cds, .gff3和.bed的文件

我觉得第二步不必按照其要求来做，一般来说，我们会使用TransDecoder对无参转录组的拼接结果序列预测其CDS，所以我们可以先将拼接序列用BLAST比对nr以及swissprot蛋白数据库，然后提取其比对上的同源序列的位置来识别CDS，最后再通过TransDecoder的第一步和第三步来预测那些未比对上的序列的CDS。

从有参考基因组的转录结果GTF文件预测编码区域

需要有参转录组比对后拼接的转录本的GTF文件以及参考基因组序列

 util/cufflinks_gtf_genome_to_cdna_fasta.pl transcripts.gtf test.genome.fasta > transcripts.fasta

将GTF文件转化为GFF3文件

 util/cufflinks_gtf_to_alignment_gff3.pl transcripts.gtf > transcripts.gff3

接着就跟上面的步骤一样了

 TransDecoder.LongOrfs -t transcripts.fasta
 (optionally, identify peptides with homology to known proteins)
 TransDecoder.Predict -t transcripts.fasta [ homology options ]

最后生成一个基于有参基因组的编码区域注释文件

 util/cdna_alignment_orf_to_genome_orf.pl transcripts.fasta.transdecoder.gff3 transcripts.gff3 transcripts.fasta > transcripts.fasta.transdecoder.genome.gff3

输出文件说明

其实结果文件中我们最关注的就2个文件，一个是预测后的CDS序列，也就是在当前路径下的后缀为.cds的文件；另一个则为其对应的蛋白序列文件.pep。

TransDecoder对每个文件有详细的说明：

longest_orfs.pep: all ORFs meeting the minimum length criteria, regardless of coding potential.
longest_orfs.gff3: positions of all ORFs as found in the target transcripts
longest_orfs.cds: the nucleotide coding sequence for all detected ORFs
longest_orfs.cds.top_500_longest: the top 500 longest ORFs, used for training a Markov model for coding sequences.
hexamer.scores: log likelihood score for each k-mer (coding/random)
longest_orfs.cds.scores : the log likelihood sum scores for each ORF across each of the 6 reading frames
longest_orfs.cds.scores.selected : the accessions of the ORFs that were selected based on the scoring criteria (described at top)
longest_orfs.cds.best_candidates.gff3 : the positions of the selected ORFs in transcripts

Then, the final outputs are reported in your current working directory:

transcripts.fasta.transdecoder.pep : peptide sequences for the final candidate ORFs; all shorter candidates within longer ORFs were removed.
transcripts.fasta.transdecoder.cds : nucleotide sequences for coding regions of the final candidate ORFs
transcripts.fasta.transdecoder.gff3 : positions within the target transcripts of the final selected ORFs
transcripts.fasta.transdecoder.bed : bed-formatted file describing ORF positions, best for viewing using GenomeView or IGV.

参考自：http://transdecoder.github.io/

本文出自于http://www.bioinfo-scrounger.com转载请注明出处