STRINGDB包的简单使用

String是一个很好的蛋白互作网络数据库，其不仅包含了直接物理作用的互作关系，还包含了蛋白之间以间接作用的互作关系。除了有实验证据支持的数据外，还有整合其他数据库中的互作数据以及利用生物信息学预测获得的互作数据。官网网址:https://string-db.org/

从网址上可看出其提供了一个比较友好的界面可供大家使用，可根据需求输入不同的protein name or protein sequence。String网站还提供了互作信息的数据库可供大家下载DOWNLOAD，如果当你需要处理N组数据时，将数据库下载到本地进行计算是一个不错的选择。

但是如果想避免下载数据库又要获得多组数据的互作信息，那么使用StringDB这个R包是一个不错的选择。The STRINGdb package provides a R interface to the STRING protein-protein interactions database (http://www.string-db.org)，从这个介绍可看出，StringDB包就是连接String网站的数据库进行蛋白互作网络分析的，这个R包半年更新一次，基本上也跟String数据库的更新速度差不多，所以数据的实时性还是有保证的，下面简单介绍下StringDB包的使用

安装R包

source("https://bioconductor.org/biocLite.R")
biocLite("STRINGdb")

简单使用

StringDB包将其所有函数都放在了string_db对象里面，所以可用$符号来调用函数，如用`STRINGdb$methods()`查看包中的所有函数：

> STRINGdb$methods()
 [1] ".objectPackage"                      ".objectParent"                      
 [3] "add_diff_exp_color"                  "add_proteins_description"           
 [5] "benchmark_ppi"                       "benchmark_ppi_pathway_view"         
 [7] "callSuper"                           "copy"                               
 [9] "enrichment_heatmap"                  "export"                             
[11] "field"                               "get_aliases"                        
[13] "get_annotations"                     "get_annotations_desc"               
[15] "get_bioc_graph"                      "get_clusters"                       
[17] "get_enrichment"                      "get_graph"                          
[19] "get_homologs"                        "get_homologs_besthits"              
[21] "get_homology_graph"                  "get_interactions"                   
[23] "get_link"                            "get_neighbors"                      
[25] "get_pathways_benchmarking_blackList" "get_png"                            
[27] "get_ppi_enrichment"                  "get_ppi_enrichment_full"            
[29] "get_proteins"                        "get_pubmed"                         
[31] "get_pubmed_interaction"              "get_subnetwork"                     
[33] "get_summary"                         "get_term_proteins"                  
[35] "getClass"                            "getRefClass"                        
[37] "import"                              "initFields"                         
[39] "initialize"                          "load"                               
[41] "load_all"                            "map"                                
[43] "mp"                                  "plot_network"                       
[45] "plot_ppi_enrichment"                 "post_payload"                       
[47] "remove_homologous_interactions"      "set_background"                     
[49] "show"                                "show#envRefClass"                   
[51] "trace"                               "untrace"                            
[53] "usingMethods"

由于没有帮助文档，所以不能用？help来查看函数用法，但可用STRINGdb$help("map")来查看map函数的介绍及说明：

> STRINGdb$help("map")
Call:
$map(my_data_frame, my_data_frame_id_col_names, takeFirst = , removeUnmappedRows = , quiet = )

Description:
  Maps the gene identifiers of the input dataframe to STRING identifiers.
  It returns the input dataframe with the "STRING_id" additional column.

Input parameters:
  "my_data_frame"                 data frame provided as input. 
  "my_data_frame_id_col_names"    vector contatining the names of the columns of "my_data_frame" that have to be used for the mapping.
  "takeFirst"                     boolean indicating what to do in case of multiple STRING proteins that map to the same name. 
                                      If TRUE, only the first of those is taken. Otherwise all of them are used. (default TRUE)
  "removeUnmappedRows"            remove the rows that cannot be mapped to STRING 
                                      (by default those lines are left and their STRING_id is set to NA)
  "quiet"                         Setting this variable to TRUE we can avoid printing the warning relative to the unmapped values.

Author(s):
   Andrea Franceschini

最后就是使用函数进行蛋白互作网络分析了，首先我们肯定需要一个用于分析的gene list，然后使用map函数将gene list 的id转化为string id，stringDB包支持多种ID转化，比如HUGO names，Entrez GeneID， ENSEMBL proteins， RefSeq transcripts等，虽然stringdb手册中写着是上述几种，其他的似乎也行，比如我这次以ENSEMBL gene id为例，总共43个ensembl id

> head(data, 5)
              geneid
1 ENSRNOG00000050827
2 ENSRNOG00000011815
3 ENSRNOG00000061527
4 ENSRNOG00000049944
5 ENSRNOG00000016348

接着我们需要先构建一个string_db对象

string_db <- STRINGdb$new(version = "10", species = 10116, score_threshold = 700, input_directory = "")

这里我为什么选择version 10.0而不用最新的10.5呢，因为stringDB包现在只支持10.0的数据库（2016年的）。。。原来用了才知道也不是最新跟官网同步的。。后面的score_threshold则是阈值卡分，string官网上也有这一步，700表示高可信度

然后使用map函数来做ID转化，stringDB包会下载临时的数据库文件，所以需要联网的，但是如果将R关闭，这些文件则会被删除

data_mapped <- string_db$map(data, "geneid", removeUnmappedRows = TRUE)
Warning:  we couldn't map to STRING 11% of your identifiers

我用removeUnmappedRows参数设定了如果没mapping上的id在最后结果中不显示，最后43个ID有38个mapping上了，如下：

hit <- data_mapped$STRING_id
> length(hit)
[1] 38

再着我们将转化后的string id进行作图，这是stringDB包还是需要下载string数据库中的互作连接数据，这个比较大，需要等待下的，然后简单的使用plot_network函数作图

string_db$plot_network(hit)

结果略尴尬，一个有互作关系的蛋白都没，可能是我设置的可信度太高了，也有可能是因为版本不是最新的。所以我也string官网也尝试了下，45个ID有40个mapping上了，比stringDB包多2个。然后出图，结果中有4条蛋白有互作关系，但是这时的默认可信度是400（属于中等可信度），但是当我将可信度设置为700后，这2条互作信息就没了。。。看样子之前stringdb包出的结果还是跟官网一样的，只是我可信度设置过高了

用stringdb作图后，stringdb还提供了一个小功能STRING payload mechanism，就是能根据基因表达的上下调信息对互作网络图上的点进行标注颜色，比如上调标注为红色，下调标注为绿色。个人觉得这个不如用cytoscape软件来实现比较好，但我们需要先从stringdb中将互作网络的信息导出，所以我先将可信度重新设置为400，然后重置下string_db对象，结果显示的互作关系比string官网的还多1条，这时只能用版本不同来解释了，毕竟数据库都是在更新的

info <- string_db$get_interactions(hit)

最后将info数据库输出到txt文件中，然后再导入cytoscape即可作图了

Summary

Stringdb包还提供了聚类功能，可将复杂的互作网络图分割成几个小块（官网也有这个功能的），还有GO/KEGG富集功能（个人感觉其他的包做的更好，毕竟stringDB包只是基于string数据库的，信息不够全）。最后如果想用string数据库最新数据的话，还是建议去官网使用，或者把官网的数据库下载下来后自行用脚本进行处理，但是如果不太建议数据的更新程度，stringDB包则是个不错的选择。

参考：
http://bioconductor.org/packages/release/bioc/vignettes/STRINGdb/inst/doc/STRINGdb.pdf
http://bioconductor.org/packages/release/bioc/html/STRINGdb.html

本文出自于http://www.bioinfo-scrounger.com转载请注明出处