0%

Dataset manipulation (SAS & R)

This is reference to the 2.3 section of Data management and 2.4 Date and time variables in <SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)>.

在R中常说的数据集操作是指处理数据框类型的数据,当然有时也会是其他的数据类型。在SAS中就是数据集,SAS相比其他编程方法来说数据类型还是太少了。。

处理数据,常见的不外乎combination, collation, and subsetting

Subsetting datasets by rows

在R中比较简单,根据行索引即可提取任意行的数据集;在SAS则可以通过firstobsobs来定义

提取起始于第i行的数据集

data class;
  set sashelp.class (firstobs=5);
run;

提取前i行的数据集

data class;
  set sashelp.class (obs=5);
run;

提取第i行到第j行的数据集

data class;
  set sashelp.class (firstobs=5 obs=10);
run;

Subsetting datasets by observations

这是指根据变量过滤数据集的行,在R中常用的函数是dplyr::filter();而是SAS中则是搭配where语句使用

# R code
filter(iris, Species == "setosa")

# SAS code
data class;
  set sashelp.class(where=(Sex="M"));
run;

Splitting a Dataset

在R中,可以相当于过滤行,即可用dplyr::filter()函数来实现;在SAS中则可以通过多个IF语句分割数据集

data female male;
  set sashelp.class;
  if Sex="F" then output female;
  if Sex="M" then output male;
run;

Drop or keep variables in a dataset

这是指根据变量过滤数据集的行,在R中常用的函数是dplyr::select();在SAS中则可以通过keep来实现,如:

# R code
dplyr::select(iris, c("Sepal.Length", "Sepal.Width"))
    
# SAS code
data class;
  set sashelp.class(keep= Name Sex Age);
run;

Random sample of a dataset

这个在simulation中蛮好用的,从逻辑上将实现方式有很多,类似于subsetting datasets;但是也有现成的函数可以直接用,在R中可以考虑用dplyr::sample_n()函数,在SAS则是surveyselect

# R code
# replace = FALSE
dplyr::sample_n(iris, 10)

# SAS code
/* sample without replacement 简单随机抽样*/
proc surveyselect data=sashelp.class out=outds n=10 method=srs; run;

Observation number

输出观测编号,也就是行号,或者说行索引;方法肯定很多,比如SAS是用_n_变量,R则可以row.names()

# R code
row.names(iris)

# SAS code
data _null_;
  set sashelp.class;
  put _n_;
run;

Keep unique values

这种可以分成两种情况:

  • 整个数据集去重
  • 基于指定的变量(列)来去重

在R中可以用dplyr::distinct()函数,在SAS中则是用sort步和nodupkey参数

# R code
unique(iris) # All duplicate rows removed  
dplyr::distinct(iris, Species, .keep_all = TRUE) # Species rows removed

# SAS code
/*Romove duplicated value by all variables*/
proc sort data=sashelp.retail out=retail_without_duplicated_values nodupkey;
  by _all_;
run;
/*Remove duplicatd values by Year variable*/
proc sort data=sashelp.retail out=retail_with_unique_value nodupkey;
  by Year;
run;

Identify duplicated values

这个与上述的 keep unique values 刚好相反,返回有重复的行;在R中对于向量可以用duplicated函数,对于数据框则需要几个函数配合以下;在SAS中还是用sort步和nouniquekey参数

# R code
iris %>% add_count(Species) %>% filter(n>1)

/*Return duplicatd values by Year variable*/
proc sort data=sashelp.retail out=retail_with_duplicated_value nouniquekey;
  by Year Month;
run;

Convert from wide to long format or long to wide format

这个在数据处理中是一个非常常见的需求,俗称长数据和宽数据的转化,如宽转长或者长转宽;在R中,现在比较好用的函数是tidyr::pivot_longer() or tidyr::pivot_wider();在SAS则是用transpose

# R code
# wide to long transform
relig_income %>%
  pivot_longer(!religion, names_to = "income", values_to = "count")
  
# long to wide transform
fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)


# SAS code
/*proc transpose, the same as pivot for long to wide or wide to long*/
data wide;
  input SID $ Programming State English;
  datalines;
  S01 98 100 80
  S02 84 98 94
  S03 89 92 88
  ;
run;

/*wide to long transform*/
proc transpose data=wide out=long(rename=(_name_=Coursename col1=Score));
  var Programming State English;
  by SID;
run;

/*long to wide transform*/
proc transpose data=long out=Rewide(drop=_name_);
  var Score;
  by SID;
  id Coursename;
run;

Concatenate and stack datasets

将一个数据集拆分成不同变量的数据集,R会使用列索引的方式,SAS则可以用keepoutput搭配,如:

/* one to one, having sample observation*/
data class1(keep=name sex) class2(keep=age height weight);
  set sashelp.class;
  output class1;
  output class2;
run;

常见的合并数据集,可以分成列合并和行合并,对应在R中常用的就是rbind()cbind();在SAS中方法就比较多了,如setmerge

/*Set*/
data class;
  set class1;
  set class2;
run;

/*Sometime is the same as merge*/
data class;
  merge class1 class2;
run;

以上是简单的例子,其实对数据的拼接也有比较复杂的用法,比如除了set外,还有append方法,或则SQL的union等

Merge datasets

有时我们是想根据指定的列(或变量),连接两个数据集,在SQL中就是left join, right join 或者 inner join,也就是merge datasets;在R中,可以使用left_join()right_join()inner_join()等,跟SQL很类似,非常好记;在SAS中,要么先sort再merge,要么用sql语句

# R code
band_members %>% inner_join(band_instruments, by = "name")

# SAS code
/*For correct way is first sort and then merge*/
data class1(keep=name sex) class2(keep=name age height weight);
  set sashelp.class;
  output class1;
  output class2;
run;
proc sort data=class1;
  by name;
run;
proc sort data=class2;
  by name;
run;
data class;
  merge class1 class2;
  by name;
run;

/*proc sql to implementary left join, right join, inner join*/
data class1(keep=name sex) class2(keep=name age height weight);
  set sashelp.class;
  output class1;
  if _n_ in (1,5,10,15) then output class2;
run;
data class2
  set class2;
  if name="Janet" then name="Janey";
run;

proc sql;
  create table class_left as 
  select a.*,b.*
  from class1 as a left join class2 as b
  on a.name=b.name;
quit;

Select, copy, change name, delete datasets

这里是单指SAS中datasets的一些操作了

/*select, copy, change name, delete*/
proc datasets;
  copy in=sashelp out=work;
  select class;
quit;
proc datasets lib=work;
  change class=student;
quit;
proc datasets lib=work;
  delete student;
quit;

/*delete all datasets*/
proc datasets lib=work kill memtype=data;
run;
/*only save class data*/
proc datasets lib=work;
  save class;
run;

Create date variable

创建日期型变量,在R中常用as.Date()函数,在SAS中则是input转化或者mdy用法

值得注意的是,前者是返回日期型变量,后者SAS则是返回距离1960.1.1号的天数,是数值型。。。

# R code
as.Date("2014-04-29")
as.Date(Sys.time())

data dt;
  dayvar = input("04/29/2014", mmddyy10.);
  dayvar2 = mdy(4, 29, 2014);
  todays_date = today();
run;

参考资料:

SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)

本文出自于http://www.bioinfo-scrounger.com转载请注明出处