
Derived variables and string manipulation (SAS & R)

These notes follow Sections 2.1 to 2.2 (Data management) of SAS and R: Data Management, Statistical Analysis, and Graphics (second edition).

Format

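In R, there is no single tool for formatting data; the approach depends on the data type, which makes it quite flexible.

In SAS, proc format is dedicated to this task.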

proc format ;
value sex
  1 = "Male"
  2 = "Female"
  . = "Unknown";
value bp
  140-high = "high"
  135-140 = "mid"
  other = "low";
value $gp
  "A1" = "C";
run ;


data report;
  format gender sex.
    sbp bp.
    group $gp.;
  input id gender sbp group $;
  datalines;
100001 1 160 A
100002 2 133 A1
100003 . 120 B
;
run;
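
For comparison, here is a rough sketch of how the same three formats might be imitated in R. It is only one of several possible approaches. The data frame and the new column names (gender_f, sbp_f, group_f) are made up for illustration, and the treatment of the boundary value 140 is an assumption. Unlike a SAS format, which only changes how a value is displayed, this creates new columns.

```r
# Hypothetical data frame mirroring the SAS `report` dataset
report <- data.frame(
  id     = c(100001, 100002, 100003),
  gender = c(1, 2, NA),
  sbp    = c(160, 133, 120),
  group  = c("A", "A1", "B"),
  stringsAsFactors = FALSE
)

# Counterpart of `value sex`: look codes up in a named vector, missing -> "Unknown"
sex_labels <- c("1" = "Male", "2" = "Female")
report$gender_f <- unname(sex_labels[as.character(report$gender)])
report$gender_f[is.na(report$gender_f)] <- "Unknown"

# Counterpart of `value bp`: bin a numeric variable with cut()
# Assumption: 140 itself counts as "high" (intervals are closed on the left)
report$sbp_f <- cut(report$sbp, breaks = c(-Inf, 135, 140, Inf),
                    labels = c("low", "mid", "high"), right = FALSE)

# Counterpart of `value $gp`: recode one character value, leave the rest alone
report$group_f <- ifelse(report$group == "A1", "C", report$group)

print(report)
```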

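As the example above shows, proc format mainly affects how data values are displayed on output. There is also the informat, which controls how SAS reads input data, for example: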

data fs;
  informat x $2.;
  input x$ y;
  x1 = x + 1;
  y1 = y + 1;
  datalines;
1100 1200
;
run;

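A SAS dataset that carries user-defined formats cannot be opened when the format catalog is missing or comes from an incompatible version. The formats have to be available in every session before the data can be used, either by pointing SAS at the catalog with the fmtsearch option or by loading them from a dataset, for example: options fmtsearch=(libname);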

proc format library=work cntlout=work.fmt;
  value  rang  1-<20="Normal"  Low-<1="Abnormal" 20< -High="Abnormal, NCS" .="MISS";
run;

data temp1;
input ORRES1;
LBTESTCD1=put(ORRES1,rang30.);
cards;
100
209
.
23
22
18
-1
;
run;

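When the format catalog is missing, there are roughly three ways to deal with it:

  • Ignore the formats altogether: options nofmterr;
  • Keep a format catalog like the one above, stored in the library named by the library= option
  • Keep the formats as a dataset, written out by the cntlout= option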

Access variables from a dataset

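In R, to output a variable or a value from a dataset (data frame, list, or vector), you simply reference it explicitly. There are many ways to do this and it is quite flexible, for example: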

print(ds$col1)
head(ds, 5)
ds
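
The lines above assume that ds already exists. As a small self-contained sketch, with a made-up data frame standing in for ds, the common access patterns look like this:

```r
# Hypothetical data frame standing in for `ds`
ds <- data.frame(col1 = 1:6, col2 = letters[1:6], stringsAsFactors = FALSE)

ds$col1            # one column as a vector
ds[["col2"]]       # the same idea, with the column name as a string
ds[2, "col1"]      # a single value: row 2 of col1
first5 <- head(ds, 5)  # the first five rows, similar to (obs=5) in SAS
print(first5)
```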

In SAS, you first have to name a dataset, explicitly or implicitly, in a data step or a proc step, and then operate on it:

proc print data=sashelp.class (obs=5);
  var Name Age Sex;
run;

Label variables

In R, the comment() function can attach a label to a variable, but it does not seem to be used much in day-to-day work.
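
A minimal sketch of comment() on a data frame column follows. The data frame and the label text are made up. The second half uses a plain "label" attribute, which is a convention some packages follow, not something base R enforces.

```r
# Hypothetical data frame
df <- data.frame(age = c(14, 13, 15))

# Attach a label with comment(); it is stored as an attribute and not printed
comment(df$age) <- "Age in years"
comment(df$age)

# Alternative convention (an assumption, not a base R rule): a "label" attribute
attr(df$age, "label") <- "Age in years"
attr(df$age, "label")
```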

In SAS, because of clinical trial requirements, variables are usually given labels, for example:

proc print data=sashelp.class(obs=5) label;
  label Age="This is a modified label";
run;

Rename variables in a dataset

In R, there are also several ways to rename the columns of a dataset, for example:

names(df)[1] <- c("col1")
colnames(df)[1:2] <- c("col1", "col2")
df <- dplyr::rename(df, new_name = old_name)  # new_name and old_name are placeholders
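
To make the base R variants concrete, here is a small runnable sketch on a made-up data frame. Renaming by name instead of by position, as in the last step, is usually safer when the column order may change.

```r
# Hypothetical data frame with three columns
df <- data.frame(a = 1:2, b = c("x", "y"), c = c(TRUE, FALSE))

names(df)[1] <- "col1"                      # rename by position
colnames(df)[2:3] <- c("col2", "col3")      # rename several columns at once
names(df)[names(df) == "col3"] <- "flag"    # rename by current name

names(df)
```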

In SAS, the rename= dataset option does the job:

data class;
  set sashelp.class (rename=(Sex=Gender));
run;

String and numeric variable conversion

In R, as.character() converts to character and as.numeric() converts to numeric.
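
A short sketch of the round trip, with made-up values. Note that as.numeric() returns NA (with a warning) for text that is not a number, so it is worth checking for NA after converting.

```r
x <- c(14, 62.8, 102.5)          # hypothetical numeric values

x_chr <- as.character(x)         # numeric -> character
x_num <- as.numeric(x_chr)       # character -> numeric

# Text that cannot be parsed becomes NA; the warning is silenced here on purpose
bad <- suppressWarnings(as.numeric("abc"))

print(x_chr)
print(x_num)
print(bad)
```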

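In SAS, most conversions are done with put and input. Taking the class dataset in sashelp as an example, Age, Height, and Weight are numeric (you can tell by whether the values are right-aligned); put then converts them from numeric to character: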

/* convert from numeric to character with put; a numeric variable needs a numeric format */
data class;
  set sashelp.class (obs=5);
  C_age=put(Age,best8.);
  C_height=put(Height,best8.);
  C_weight=put(Weight,best8.);
  drop Age Height Weight;
run;

Then input converts the character values back to numeric:

data class2;
  set work.class;
  age=input(C_age,8.);
  keep Name Sex age;
run;

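There is also a simpler way to get the same result, which relies on SAS's automatic character-to-numeric conversion (SAS writes a note to the log when it does this):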

data class2;
  set work.class;
  age=C_age+0;
  keep Name Sex age;
run;

Create a categorical variable using logic

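In R, there are many ways to create a variable from logical conditions in a data frame; I usually use dplyr::mutate() together with if_else() and similar functions.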
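
As a sketch of the idea, the version below uses base R's ifelse() instead of dplyr, so that it runs without extra packages. The rows are a few illustrative records in the shape of sashelp.class, and the grouping rule is the same one used in the SAS data step in this section.

```r
# Illustrative rows in the shape of sashelp.class
class_df <- data.frame(
  Name = c("Alfred", "Alice", "Philip", "Mary"),
  Sex  = c("M", "F", "M", "F"),
  Age  = c(14, 13, 16, 15),
  stringsAsFactors = FALSE
)

# Nested ifelse(): the first matching condition wins, everything else is NA
class_df$gp <- ifelse(
  class_df$Age > 15 & class_df$Sex == "M", "GroupA",
  ifelse(class_df$Age <= 15 & class_df$Sex == "F", "GroupB", NA_character_)
)

print(class_df)
```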

SAS is similar: use if statements in a data step.

data class;
  set sashelp.class;
  if Age>15 and Sex eq "M" then gp="GroupA";
  else if Age<=15 and Sex eq "F" then gp="GroupB";
  else gp=" ";
run;

Extract characters from string variables

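This is a string manipulation problem. Between them, the SAS and R functions cover almost every string manipulation need, for example:

Extract characters 2 to 4 of a string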

# R code
stringr::str_sub("abcdef", start = 2, end = 4)

# SAS code
data _null_;
  str=substr("abcdef", 2, 3);
  put str=;
run;

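One slightly uncomfortable thing about SAS: every time you want to output a value you have to wrap it in a data step or a proc, which is a bit of a hassle and does not feel like programming at all!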

Check whether a string matches a given pattern

# R code
stringr::str_detect("Hello world!", "world")
stringr::str_match("Hello world!", "world")

# SAS code
data _null_;
   /* Use PRXMATCH to find the position of the pattern match. */
   position=prxmatch("/world/", "Hello world!");
   put position=;
   if position then put "word, Match!";
run;

Extract the matched text from a string

# R code
stringr::str_match("AE 2021-01-01", "\\w+\\s+(.*)")[1,2]

# SAS code
data _null_;
  re=prxparse("/\w+\s+(.*)/");
  if prxmatch(re, "AE 2021-01-01") then 
  do;
    date=prxposn(re,1,"AE 2021-01-01");
  end;
  put date=;
run;
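
If stringr is not available, base R can do the same detection and extraction. The sketch below is one possible equivalent, not the only one. It reuses the strings from the examples above, and the variable names are made up.

```r
# Does the pattern occur at all? (like str_detect)
found <- grepl("world", "Hello world!")

# Where does the match start? (similar in spirit to prxmatch)
pos <- regexpr("world", "Hello world!")

# Pull out the first capture group by replacing the whole match with it
date_part <- sub("^\\w+\\s+(.*)$", "\\1", "AE 2021-01-01")

print(found)
print(as.integer(pos))
print(date_part)
```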

Replace strings within string variables

To replace certain characters in a string, R can again rely on the stringr package, while SAS can use tranwrd

# R code
stringr::str_replace_all("a_b_c", "_", "+")

# SAS code
data _null_;
    my_string = "a_b_c";
    my_new_string = tranwrd(my_string,"_", "+");
    put "My String: " my_string;
    put "My New String: " my_new_string;
run;
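
One point worth keeping in mind on the R side is that the pattern is treated as a regular expression by default. The sketch below uses base R's gsub() with made-up inputs to show the difference; fixed = TRUE gives literal matching, which is closer to how tranwrd behaves.

```r
x <- "a_b_c"
y <- "a.b.c"

literal_x <- gsub("_", "+", x, fixed = TRUE)   # "_" has no special meaning anyway
literal_y <- gsub(".", "+", y, fixed = TRUE)   # "." matched literally
regex_y   <- gsub(".", "+", y)                 # "." as a regex matches every character

print(literal_x)
print(literal_y)
print(regex_y)
```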

Length of string variables

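To get the length of a string, R uses nchar() (length() gives the number of elements in a vector instead); in SAS, the length function returns the length of a string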

# R code
nchar("12345")

# SAS code
data _null_;
  len=length("12345");
  put len=;
run;
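
The difference between nchar() and length() is easy to mix up, so here is a small sketch with made-up values. It also shows that R counts trailing blanks, whereas the SAS length function ignores them.

```r
x <- c("12345", "ab", "")

n_chars    <- nchar(x)      # characters in each element
n_elements <- length(x)     # number of elements in the vector
padded     <- nchar("abc  ")  # trailing blanks are counted in R

print(n_chars)
print(n_elements)
print(padded)
```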

Concatenate string variables

To concatenate strings, R uses the paste() function, while SAS uses the || operator

# R code
paste("Hello", "World!")
stringr::str_c("Hello ", "World!")

# SAS code
data _null_;
  newcharvar="Hello " || "World!";
  put newcharvar=;
run;
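
paste() inserts a space between its arguments by default, which is why the first R call above needs no trailing blank after "Hello". A short sketch of the common variants, with made-up inputs:

```r
a <- paste("Hello", "World!")                 # default separator is a single space
b <- paste0("Hello ", "World!")               # no separator at all
c_sep <- paste("a", "b", sep = "-")           # custom separator
joined <- paste(c("a", "b", "c"), collapse = "+")  # collapse a vector into one string

print(a)
print(b)
print(c_sep)
print(joined)
```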

Split strings into multiple strings

To split strings, R commonly uses strsplit() or stringr::str_split(); in SAS the basic tool is scan, while the countw function counts the number of pieces after splitting

# R code
strsplit("Smith John", " ")

# SAS code
data have;
  name="Smith John";
  lastname=scan(name,1," ");
  firstname=scan(name,2," ");
run;
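
strsplit() returns a list with one element per input string, so picking out individual pieces takes one more step than scan does in SAS. A sketch with made-up names, where lengths() plays a role similar to countw:

```r
name <- "Smith John"

# One input string, so take the first (and only) list element
parts <- strsplit(name, " ", fixed = TRUE)[[1]]
lastname  <- parts[1]
firstname <- parts[2]

# Number of pieces per string, for several strings at once
n_words <- lengths(strsplit(c("Smith John", "Mary Ann Lee"), " ", fixed = TRUE))

print(lastname)
print(firstname)
print(n_words)
```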

Set operations

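Set operations test whether a string or variable is contained in a vector or array. R commonly uses %in%, while SAS uses in. The two return different forms: the former returns TRUE/FALSE, the latter returns 1/0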

# R code
"a" %in% c("a", "b")

# SAS code
data _null_;
  res=("a" in ("a","b"));
  put res=;
run;
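
%in% is vectorized over its left-hand side, and the logical result can be turned into 1/0 if the SAS-style form is wanted. A small sketch with made-up vectors:

```r
x <- c("a", "c", "b")
lookup <- c("a", "b")

is_member <- x %in% lookup          # one TRUE/FALSE per element of x
as_flag   <- as.integer(is_member)  # 1/0, as SAS would return
position  <- match("b", lookup)     # index of the first match, NA if absent

print(is_member)
print(as_flag)
print(position)
```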

References:

SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)

This article comes from http://www.bioinfo-scrounger.com. Please credit the source when reposting.