TCGA lncRNA的提取

众所周知lncRNA属于RNA中的非编码RNA，在转录调控中扮演者重要角色。并且现在听说lncRNA的研究也很火热，使用TCGA的数据对lncRNA的研究也是非常常见的需求。而我们如果想对TCGA的lncRNA进行定量则必须从TCGA RNA-Seq的表达量数据中提取出lncRNA的那部分数据。我之前也没做过lncRNA相关的研究，所以就从网上搜了下，整理了下思路

首先下载TCGA中的RNA-Seq数据，因为新版的TCGA数据库是用Ensembl id作为基因的命名的，所以这些RNA-Seq基因表达量文件中必然也包含了lncRNA的表达量数据

接着要获取TCGA RNA表达量的Ensembl id中属于lncRNA的id

最后从总的RNA表达量文件中提取lncRNA的表达量数据

第一步已经在前面的博文里做了，是以乳腺癌的TCGA数据为例

第二步获取ensembl id与lncRNA的对应关系

这问题的解决思路第一个想起来应该去GENCODE数据库中寻找，因为逛过ENCODE人类相关数据库的话，会发现有个Long non-coding RNA gene annotation选项，果断下载了其gtf文件，GENCODE对这个文件的表述：evidence-based annotation of the human genome (GRCh38), version 27 (Ensembl 90) - long non-coding RNAs，从中也能看出该文件的版本跟TCGA使用的版本也保持一致，因此只要对该gtf文件进行处理就能获得属于lncRNA的ensembl id

粗略看几行gtf格式

chr1    HAVANA  gene    29554   31109   .   +   .   gene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1    HAVANA  transcript  29554   31097   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1    HAVANA  exon    29554   30039   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; exon_number 1; exon_id "ENSE00001947070.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1    HAVANA  exon    30564   30667   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1    HAVANA  exon    30976   31097   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";

因为TCGA是基于基因水平进行定量分析的，那么我需要将第3列是gene的行中的ensembl id，gene name提取出来，ensembl id需要去掉版本号；而至于为什么提gene name，因为我想最后以gene name来表示各个gene。

 perl -F'\t' -alne 'next unless ($F[2] eq "gene");$F[8] =~ /gene_id\s+"(\w+)\.\d+";.*?gene_name\s+"(\S+)";/; print "$1\t$2"' gencode.v27.long_noncoding_RNAs.gtf >lncRNA_ensembl2name.list

lncRNA_ensembl2name.list文件总共有15778行，其中12行是重复的数据，那么去重后是15766行

如果仔细研究下，会发现lncRNA应该有其特定的几个gene_type，从上述的gtf文件就可看出，那么哪些gene_type属于lncRNA呢，首先我们从GENCODE的gtf文件中先提取下看看

perl -F'\t' -alne '$F[8] =~ /gene_type\s+"(\S+?)"/;print $1' gencode.v27.long_noncoding_RNAs.gtf |sort |uniq

几个gene_type结果如下：

3prime_overlapping_ncRNA  
antisense_RNA  
bidirectional_promoter_lncRNA  
lincRNA  
macro_lncRNA  
non_coding  
processed_transcript  
sense_intronic  
sense_overlapping  
TEC

总共有10种类型，那么ENSEMBL官网是怎么对lncRNA定义及分类的呢。http://www.ensembl.org/Help/Faq?id=468中lncRNA的biotype分类有:

3prime overlapping ncrna
ambiguous orf
antisense
antisense RNA
lincRNA
ncrna host
processed transcript
sense intronic
sense overlapping

ENSEMBL官网还有一个对lncRNA biotype的定义http://vega.archive.ensembl.org/info/about/gene_and_transcript_types.html，怎么同是ENSEMBL官网，怎么还有差别？；其实后者是Havana biotypes，ENSEMBL是这么说的：The full list of Havana biotypes can be found here，那么这个Havana对gene的分类（biotype）是怎么样的，如下：

Non coding
3prime_overlapping_ncRNA
Antisense
lincRNA (long interspersed ncRNA)
Retained_intron
Sense_intronic
Sense_overlapping
Macro_lncRNA
Bidirectional lncRNA

其实我们最终还是要看ENSEMBL的Homo_sapiens.GRCh38.90.chr.gtf文件中的gene_biotype有哪些。。然后我们再根据gene_biotype中属于lncRNA的ensembl id提取出来，先看看Homo_sapiens.GRCh38.90.chr.gtf总共有哪些：

 perl -F'\t' -alne '$F[8] =~ /gene_biotype\s+"(\S+?)";/; print $1' Homo_sapiens.GRCh38.90.chr.gtf |sort |uniq

结果比较长：

3prime_overlapping_ncRNA
antisense_RNA
bidirectional_promoter_lncRNA
IG_C_gene
IG_C_pseudogene
IG_D_gene
IG_J_gene
IG_J_pseudogene
IG_pseudogene
IG_V_gene
IG_V_pseudogene
lincRNA
macro_lncRNA
miRNA
misc_RNA
Mt_rRNA
Mt_tRNA
non_coding
polymorphic_pseudogene
processed_pseudogene
processed_transcript
protein_coding
pseudogene
ribozyme
rRNA
scaRNA
scRNA
sense_intronic
sense_overlapping
snoRNA
snRNA
sRNA
TEC
TR_C_gene
TR_D_gene
TR_J_gene
TR_J_pseudogene
TR_V_gene
TR_V_pseudogene
transcribed_processed_pseudogene
transcribed_unitary_pseudogene
transcribed_unprocessed_pseudogene
translated_processed_pseudogene
unitary_pseudogene
unprocessed_pseudogene
vaultRNA

综合一下，以下biotype可以认为属于lncRNA?（其实我也不确定这样可行否）

3prime_overlapping_ncRNA
antisense_RNA
bidirectional_promoter_lncRNA
lincRNA
macro_lncRNA
non_coding
processed_transcript
sense_intronic
sense_overlapping

从这里可以看出来了，ENSEMBL的gtf文件中属于lncRNA的有以上9种，除了没有TEC外，其他跟从GENCODE提取出来的完全一致。ENSEMBL对TEC定义是这样的：This is used for non-spliced EST clusters that have polyA features. This category has been specifically created for the ENCODE project to highlight regions that could indicate the presence of novel protein coding genes that require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies.我理解为TEC biotype是GENCODE认为可能是新的转录本，但是需要进行实验验证，这就可以解释为何从GENCODE lncRNA的gtf文件提取出来的gene type中有TEC

最后我需要将ENSEMBL的Homo_sapiens.GRCh38.90.chr.gtf文件中属于lncRNA的ensembl id给提取出来，当然也顺便提取了gene name和biotype

#!/usr/bin/perl -w
use strict;

my $biotype = shift @ARGV;
my %biotype_list;
open my $fh1, $biotype or die;
while (<$fh1>) {
    chomp;
    $biotype_list{$_} = 1;
}
close $fh1;

my $gtf = shift @ARGV;
open my $fh2, $gtf or die;
while (<$fh2>) {
    chomp;
    my @array = split /\t/, $_;
    next unless ($array[2] eq "gene");
    $array[8] =~ /gene_id\s+"(\S+?)";.*gene_name\s+"(\S+?)";.*gene_biotype\s+"(\S+?)";/;
    my $geneid = $1;
    my $genename = $2;
    my $genebiotype = $3;
    if ($biotype_list{$genebiotype}) {
        print "$geneid\t$genename\t$genebiotype\n";
    }
}
close $fh2;

输出结果有3列，第一列是ensembl id，第二列是gene name，第三列是biotype，总共有14700行，然后跟GENCODE的结果比较下，后者包含了前者，说明结果是正常的，因为后者还多一个TEC的biotype。

最后我例行跟生信人工具再比较下，因为其也有一个ENSG_ID_LNC.txt文件，里面记录了ensembl id跟lncRNA的对应关系，当然也有biotype信息，先粗略看下biotype有几类，结果如下：

3prime_overlapping_ncRNA
antisense
bidirectional_promoter_lncRNA
lincRNA
macro_lncRNA
processed_transcript
sense_intronic
sense_overlapping

跟我上面总结的结果比少了一个non_coding的biotype，那么再看一下相同的结果有多少，结果显示只有4000不到的对应关系是相同的，其实就是说至于4000不到的ensembl id对应的lncRNA的gene name是相同的，应该是lncRNA的命名的差别，只是还没查到其ensembl对应的lncRNA是什么来历；有些是lncRNA命名的版本号不同。

最后检查了好久，才晓得原来生信人用的人类ENSEMBL版本是GRCh37.p13，而我用的是GRCh38.p10，TCGA用的版本也是GRCh38，具体可参看网址GDC TCGA的历史版本，其在2016年就使用GRCh38了，比如mRNA-Seq数据则是BAM files aligned to GRCh38 using STAR 2-pass strategy，所以还是纯手工的比较靠谱，软件毕竟是别人开发的。。。

第三步，只要基于ensembl id和lncRNA id的对应关系，从TCGA的表达矩阵中属于lncRNA的那些基因的表达量提取出来即可，然后再按照需要再考虑要不要转化为ensembl的gene name

Summary

从上述几步应该可以顺利的将TCGA中lncRNA的数据从RNA-Seq中提取出来，至于拿lncRNA数据来获得什么具有生物意义的结论，我暂时还没学会。。从TCGA的数据下载到这篇的提取lncRNA可看出，TCGA数据的使用也不是一个太容易的事情。

本文出自于http://www.bioinfo-scrounger.com转载请注明出处