Handling the duplicates and missing values in data manipulation is a very common process. This post is taking a few examples to list how to accomplish it from a datasets in SAS.
In R, I prefer to use
dplyr::distinct toolkit to remove duplicates, and
na.omit() functions or external packages like
mice to handle missing values.
We can use the
proc sort to remove rows that have duplicate values across all columns of the dataset.
proc sort data = sashelp.cars(keep = make type origin) out = without_dups nodupkey; by _all_; run;
In some special condition, we would like to select only unique/distinct rows from a dataset as per a specific column and keep the first row of values for that column.
proc sort data = sashelp.cars out = make_without_dups nodupkey; by Make; run;
In clinical trial data, missing data or missing values is a common occurrence when no data is stored for the variable in the observation. It can be occurred in numeric or character variables as a single period (
For all we know, according to the missing values, the reasons can be summarized as below:
- Missing completely at random (MCAR)
- Missing at random (MAR), not completely random
- Not missing at random (NMAR)
So how to handle the missing values?
Suppose we did a reaction time study with six subjects, and the subjects reaction time was measured by three times. That data is as shown below.
data times; input id trial1 trial2 trial3; cards; 1 1.5 1.4 1.6 2 1.5 . 1.9 3 . 2.0 1.6 4 . . 2.2 5 2.1 2.3 2.2 6 1.8 2.0 1.9 ; run;
As you see below, we can use some useful functions to count the number of missing observations, like
nmiss for numeric and
cmiss for character. Or
missing to indicate whether the argument contains a missing value. And then filter any rows that have more than one missing value.
data raw_0; set times (where = (nmiss(trial1,trial2,trial3) = 0)); run;
Or just indicate the specific variable, like
data raw_1; set times; missing_flag = missing(trial1); run;
First off, let's try to replace all missing values with zero in every column in a simple way, which is creating an implicit Array
NumVar to hold all numeric variables in the dataset and then loop over it. If you just want to replace one column, so then add that variable name instead of
data raw_3; set times; array NumVar _numeric_; do over NumVar; if NumVar=. then NumVar=0; end; run;
If your question is more complicated, such as not replaced by zero but by mean, then how would we address it? I suppose that
proc stdize is a good solution.
/*proc stdize data = times out = stdize_vars reponly missing = 0; run;*/ proc stdize data = times out = stdize_vars reponly method = mean; var trial1 trial2; /* or _numeric_, or empty*/ run;
Imputation missing values is a complicated data manipulation process that can work well if you select the correct method for specific variables. But I would not learn more about how to do it with SAS by now, since I prefer to use R for imputation.
Here just list a few of useful sas procedures so that I can read and recall them later if needed.
Hope above notes will be helpful for you.
Please indicate the source: http://www.bioinfo-scrounger.com