0%

TCGA lncRNA的提取

众所周知lncRNA属于RNA中的非编码RNA,在转录调控中扮演者重要角色。并且现在听说lncRNA的研究也很火热,使用TCGA的数据对lncRNA的研究也是非常常见的需求。而我们如果想对TCGA的lncRNA进行定量则必须从TCGA RNA-Seq的表达量数据中提取出lncRNA的那部分数据。 我之前也没做过lncRNA相关的研究,所以就从网上搜了下,整理了下思路

首先下载TCGA中的RNA-Seq数据,因为新版的TCGA数据库是用Ensembl id作为基因的命名的,所以这些RNA-Seq基因表达量文件中必然也包含了lncRNA的表达量数据

接着要获取TCGA RNA表达量的Ensembl id中属于lncRNA的id

最后从总的RNA表达量文件中提取lncRNA的表达量数据

  1. 第一步已经在前面的博文里做了,是以乳腺癌的TCGA数据为例

  2. 第二步获取ensembl id与lncRNA的对应关系

    这问题的解决思路第一个想起来应该去GENCODE数据库中寻找,因为逛过ENCODE人类相关数据库的话,会发现有个Long non-coding RNA gene annotation选项,果断下载了其gtf文件,GENCODE对这个文件的表述:evidence-based annotation of the human genome (GRCh38), version 27 (Ensembl 90) - long non-coding RNAs,从中也能看出该文件的版本跟TCGA使用的版本也保持一致,因此只要对该gtf文件进行处理就能获得属于lncRNA的ensembl id

    粗略看几行gtf格式

    chr1    HAVANA  gene    29554   31109   .   +   .   gene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
    chr1    HAVANA  transcript  29554   31097   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
    chr1    HAVANA  exon    29554   30039   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; exon_number 1; exon_id "ENSE00001947070.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
    chr1    HAVANA  exon    30564   30667   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
    chr1    HAVANA  exon    30976   31097   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; transcript_type "lincRNA"; transcript_name "MIR1302-2HG-202"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";

    因为TCGA是基于基因水平进行定量分析的,那么我需要将第3列是gene的行中的ensembl id,gene name提取出来,ensembl id需要去掉版本号;而至于为什么提gene name,因为我想最后以gene name来表示各个gene。

     perl -F'\t' -alne 'next unless ($F[2] eq "gene");$F[8] =~ /gene_id\s+"(\w+)\.\d+";.*?gene_name\s+"(\S+)";/; print "$1\t$2"' gencode.v27.long_noncoding_RNAs.gtf >lncRNA_ensembl2name.list

    lncRNA_ensembl2name.list文件总共有15778行,其中12行是重复的数据,那么去重后是15766行

    如果仔细研究下,会发现lncRNA应该有其特定的几个gene_type,从上述的gtf文件就可看出,那么哪些gene_type属于lncRNA呢,首先我们从GENCODE的gtf文件中先提取下看看

    perl -F'\t' -alne '$F[8] =~ /gene_type\s+"(\S+?)"/;print $1' gencode.v27.long_noncoding_RNAs.gtf |sort |uniq

    几个gene_type结果如下:

    3prime_overlapping_ncRNA  
    antisense_RNA  
    bidirectional_promoter_lncRNA  
    lincRNA  
    macro_lncRNA  
    non_coding  
    processed_transcript  
    sense_intronic  
    sense_overlapping  
    TEC

    总共有10种类型,那么ENSEMBL官网是怎么对lncRNA定义及分类的呢。http://www.ensembl.org/Help/Faq?id=468中lncRNA的biotype分类有:

    3prime overlapping ncrna
    ambiguous orf
    antisense
    antisense RNA
    lincRNA
    ncrna host
    processed transcript
    sense intronic
    sense overlapping

    ENSEMBL官网还有一个对lncRNA biotype的定义http://vega.archive.ensembl.org/info/about/gene_and_transcript_types.html,怎么同是ENSEMBL官网,怎么还有差别?;其实后者是Havana biotypes,ENSEMBL是这么说的:The full list of Havana biotypes can be found here,那么这个Havana对gene的分类(biotype)是怎么样的,如下:

    Non coding
    3prime_overlapping_ncRNA
    Antisense
    lincRNA (long interspersed ncRNA)
    Retained_intron
    Sense_intronic
    Sense_overlapping
    Macro_lncRNA
    Bidirectional lncRNA

    其实我们最终还是要看ENSEMBL的Homo_sapiens.GRCh38.90.chr.gtf文件中的gene_biotype有哪些。。然后我们再根据gene_biotype中属于lncRNA的ensembl id提取出来,先看看Homo_sapiens.GRCh38.90.chr.gtf总共有哪些:

     perl -F'\t' -alne '$F[8] =~ /gene_biotype\s+"(\S+?)";/; print $1' Homo_sapiens.GRCh38.90.chr.gtf |sort |uniq

    结果比较长:

    3prime_overlapping_ncRNA
    antisense_RNA
    bidirectional_promoter_lncRNA
    IG_C_gene
    IG_C_pseudogene
    IG_D_gene
    IG_J_gene
    IG_J_pseudogene
    IG_pseudogene
    IG_V_gene
    IG_V_pseudogene
    lincRNA
    macro_lncRNA
    miRNA
    misc_RNA
    Mt_rRNA
    Mt_tRNA
    non_coding
    polymorphic_pseudogene
    processed_pseudogene
    processed_transcript
    protein_coding
    pseudogene
    ribozyme
    rRNA
    scaRNA
    scRNA
    sense_intronic
    sense_overlapping
    snoRNA
    snRNA
    sRNA
    TEC
    TR_C_gene
    TR_D_gene
    TR_J_gene
    TR_J_pseudogene
    TR_V_gene
    TR_V_pseudogene
    transcribed_processed_pseudogene
    transcribed_unitary_pseudogene
    transcribed_unprocessed_pseudogene
    translated_processed_pseudogene
    unitary_pseudogene
    unprocessed_pseudogene
    vaultRNA

    综合一下,以下biotype可以认为属于lncRNA?(其实我也不确定这样可行否)

    3prime_overlapping_ncRNA
    antisense_RNA
    bidirectional_promoter_lncRNA
    lincRNA
    macro_lncRNA
    non_coding
    processed_transcript
    sense_intronic
    sense_overlapping

    从这里可以看出来了,ENSEMBL的gtf文件中属于lncRNA的有以上9种,除了没有TEC外,其他跟从GENCODE提取出来的完全一致。ENSEMBL对TEC定义是这样的:This is used for non-spliced EST clusters that have polyA features. This category has been specifically created for the ENCODE project to highlight regions that could indicate the presence of novel protein coding genes that require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies.我理解为TEC biotype是GENCODE认为可能是新的转录本,但是需要进行实验验证,这就可以解释为何从GENCODE lncRNA的gtf文件提取出来的gene type中有TEC

    最后我需要将ENSEMBL的Homo_sapiens.GRCh38.90.chr.gtf文件中属于lncRNA的ensembl id给提取出来,当然也顺便提取了gene name和biotype

    #!/usr/bin/perl -w
    use strict;
    
    my $biotype = shift @ARGV;
    my %biotype_list;
    open my $fh1, $biotype or die;
    while (<$fh1>) {
        chomp;
        $biotype_list{$_} = 1;
    }
    close $fh1;
    
    my $gtf = shift @ARGV;
    open my $fh2, $gtf or die;
    while (<$fh2>) {
        chomp;
        my @array = split /\t/, $_;
        next unless ($array[2] eq "gene");
        $array[8] =~ /gene_id\s+"(\S+?)";.*gene_name\s+"(\S+?)";.*gene_biotype\s+"(\S+?)";/;
        my $geneid = $1;
        my $genename = $2;
        my $genebiotype = $3;
        if ($biotype_list{$genebiotype}) {
            print "$geneid\t$genename\t$genebiotype\n";
        }
    }
    close $fh2;

    输出结果有3列,第一列是ensembl id,第二列是gene name,第三列是biotype,总共有14700行,然后跟GENCODE的结果比较下,后者包含了前者,说明结果是正常的,因为后者还多一个TEC的biotype。

    最后我例行跟生信人工具再比较下,因为其也有一个ENSG_ID_LNC.txt文件,里面记录了ensembl id跟lncRNA的对应关系,当然也有biotype信息,先粗略看下biotype有几类,结果如下:

    3prime_overlapping_ncRNA
    antisense
    bidirectional_promoter_lncRNA
    lincRNA
    macro_lncRNA
    processed_transcript
    sense_intronic
    sense_overlapping

    跟我上面总结的结果比少了一个non_coding的biotype,那么再看一下相同的结果有多少,结果显示只有4000不到的对应关系是相同的,其实就是说至于4000不到的ensembl id对应的lncRNA的gene name是相同的,应该是lncRNA的命名的差别,只是还没查到其ensembl对应的lncRNA是什么来历;有些是lncRNA命名的版本号不同。

    最后检查了好久,才晓得原来生信人用的人类ENSEMBL版本是GRCh37.p13,而我用的是GRCh38.p10,TCGA用的版本也是GRCh38,具体可参看网址GDC TCGA的历史版本,其在2016年就使用GRCh38了,比如mRNA-Seq数据则是BAM files aligned to GRCh38 using STAR 2-pass strategy,所以还是纯手工的比较靠谱,软件毕竟是别人开发的。。。

  3. 第三步,只要基于ensembl id和lncRNA id的对应关系,从TCGA的表达矩阵中属于lncRNA的那些基因的表达量提取出来即可,然后再按照需要再考虑要不要转化为ensembl的gene name

Summary

从上述几步应该可以顺利的将TCGA中lncRNA的数据从RNA-Seq中提取出来,至于拿lncRNA数据来获得什么具有生物意义的结论,我暂时还没学会。。从TCGA的数据下载到这篇的提取lncRNA可看出,TCGA数据的使用也不是一个太容易的事情。

本文出自于http://www.bioinfo-scrounger.com转载请注明出处