# KeepNotes blog

Stay hungry, Stay Foolish.


We discussed multiple imputation for non-monotone missingness patterns in the article Understanding Multiple Imputation in SAS, and worked out how to implement it in SAS. Here, I would like to learn how to use linear regression within multiple imputation to handle monotone-pattern data in R.

In R, we can use the `mice` package (Multivariate Imputation by Chained Equations) to perform multiple imputation, and it provides an option for the linear regression method.

Chained equations is a variation of the Gibbs sampler (an MCMC approach) that iterates between drawing imputations of the missing values and drawing parameter estimates for the conditional distribution of each variable (both conditional on the other variables).
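To make that iteration concrete, here is a toy two-variable sketch of the chained-equations idea in base R. It uses deterministic prediction updates only; the real algorithm also draws parameters and adds residual noise at each step. All variable names are invented for the example.

```r
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
x[1:5] <- NA; y[6:10] <- NA            # introduce some missingness
mx <- is.na(x); my <- is.na(y)
x[mx] <- mean(x, na.rm = TRUE)         # crude starting fill
y[my] <- mean(y, na.rm = TRUE)

for (iter in 1:10) {
  fx <- lm(x ~ y); x[mx] <- predict(fx)[mx]  # re-impute x given y
  fy <- lm(y ~ x); y[my] <- predict(fy)[my]  # re-impute y given x
}
```

After a few sweeps the imputed values stabilize; `mice` runs this kind of variable-by-variable cycle for every incomplete column.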

In regression imputation, the existing variables are used to predict each missing value, and the predicted value is then substituted as if it were an actually observed value. This approach has several advantages: it retains far more data than listwise or pairwise deletion, and it avoids severely distorting the standard deviation or the shape of the distribution. However, as with mean substitution, a regression imputation adds no new information, since the substituted value is predicted entirely from the other variables, yet the sample size is increased and the standard error is artificially reduced.

Then let's jump into implementation with the `mice` package. Here I will use the example data set (`low1.sas7bdat`) from Mallinckrodt et al. (https://journals.sagepub.com/doi/pdf/10.1177/2168479013501310) which is available via https://www.lshtm.ac.uk/research/centres-projects-groups/missing-data#dia-missing-data.

```r
low1 <- haven::read_sas("./low1.sas7bdat")
head(low1)

## # A tibble: 6 × 6
##   PATIENT POOLINV trt   basval  week change
##     <dbl> <chr>   <chr>  <dbl> <dbl>  <dbl>
## 1    1005 101     2         16     1     -3
## 2    1005 101     2         16     2     -5
## 3    1005 101     2         16     4    -10
## 4    1005 101     2         16     6    -11
## 5    1005 101     2         16     8    -13
## 6    1006 101     2         17     1     -1
```

As shown above, this is a long-format data set with a few variables whose meanings are straightforward to understand. To meet the input expectations of the `mice` functions, I will first convert it to wide format with separate variables for each time point (week1-week8).

```r
library(dplyr)
library(tidyr)

low_wide <- low1 %>%
  pivot_wider(names_from = week,
              names_prefix = "week",
              values_from = change) %>%
  select(-POOLINV)
head(low_wide)

## # A tibble: 6 × 8
##   PATIENT trt   basval week1 week2 week4 week6 week8
##     <dbl> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    1005 2         16    -3    -5   -10   -11   -13
## 2    1006 2         17    -1    -2    -6   -10   -12
## 3    1008 1         32    -6   -12   -17   -20   -22
## 4    1011 1         18    -1    -5    -8    NA    NA
## 5    1012 1         22    -6    -9   -13   -16   -17
## 6    1015 2         29    -6   -14   -14   -20   -25
```

Then let's have a look at the missing pattern of this example.

```r
low_wide %>%
  select(basval, week1, week2, week4, week8) %>%
  mutate(across(1:5, function(x) {
    if_else(is.na(x), ".", "X")
  })) %>%
  group_by_all() %>%
  count(name = "Freq")

## # A tibble: 4 × 6
## # Groups:   basval, week1, week2, week4, week8 [4]
##   basval week1 week2 week4 week8  Freq
##   <chr>  <chr> <chr> <chr> <chr> <int>
## 1 X      X     .     .     .         4
## 2 X      X     X     .     .         3
## 3 X      X     X     X     .         9
## 4 X      X     X     X     X       184
```

Alternatively, you can use `mice::md.pattern(low_wide)` to display the missing-data patterns in a form similar to the output above.

As shown above, an 'X' mark indicates that the data point at a given visit is observed, and a '.' mark indicates that it is missing. We can therefore treat the example data as having a monotone missingness pattern, and we will assume the missingness is MAR for the later analysis.
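If you would rather verify the monotone pattern programmatically than by eye, one simple check is that, within each row, once a value is missing all later weeks are missing too. A sketch on a toy data frame shaped like the week columns of `low_wide` (the values are invented):

```r
# Toy rows mimicking low_wide's week columns:
wk <- data.frame(
  week1 = c(-3, -1, -6),
  week2 = c(-5, -2, NA),
  week4 = c(-10, NA, NA),
  week6 = c(-11, NA, NA),
  week8 = c(-13, NA, NA)
)
# Monotone per row: the NA indicator (0 = observed, 1 = missing)
# never drops back from 1 to 0 as we move across the weeks.
is_monotone <- apply(is.na(wk), 1, function(z) all(as.integer(z) == cummax(z)))
all(is_monotone)  # TRUE when every row follows a monotone dropout pattern
```

Any row that fails this check would indicate intermittent (non-monotone) missingness, which `norm.predict` alone is not designed for.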

Now we will generate 5 imputed datasets via the `mice()` function by setting `m = 5` and `method = "norm.predict"` (linear regression, predicted values).

```r
library(mice)

low_imp <- mice(low_wide, m = 5, method = "norm.predict")
summary(low_imp)
```

If you would like to check whether the missing values have all been imputed, inspect `low_imp$imp`. You can also specify which imputed dataset to extract via the `action` argument of `complete()`: `action = 0` returns the original dataset with missing values, and `action = 1` corresponds to the first imputed dataset.

``low_imp_1 <- complete(low_imp, action = 1) ``

And maybe we'd like to have all imputed datasets in a long format, which will be easier to handle and analyze later on.

``low_imp_res <- complete(low_imp, action = "long", include = TRUE)``

If you already have an imputed dataset in long format from another imputation method, `mice::as.mids()` is a helpful function that converts it to an object of class `mids` for further analysis in `mice`. The long data should contain the original data as well as the imputed datasets, with `.imp` and `.id` columns included.

``mids <- as.mids(low_imp_res)``

The second step is to fit the ANCOVA model for the `week8` time point, with treatment (`trt`) as the independent variable, change from baseline at week 8 (`week8`) as the response variable, and baseline (`basval`) as a covariate.

```r
mods <- with(
  low_imp,
  lm(week8 ~ basval + trt)
)
```

`mods` is an object of class `mira` that contains the call and the fitted model object for each of the imputations.

The final step is to pool the results of the ANCOVA models using Rubin's rules (https://bookdown.org/mwheymans/bookmi/rubins-rules.html), which is also the method used by `proc mianalyze` in SAS.

```r
pool_res <- pool(mods)
summary(pool_res)

##          term   estimate  std.error  statistic       df      p.value
## 1 (Intercept)  0.5759012 1.73148889  0.3326046 194.3680 7.397912e-01
## 2      basval -0.5329080 0.08089651 -6.5875270 194.3860 4.086519e-10
## 3        trt2 -1.7308118 0.66660413 -2.5964612 194.7855 1.013719e-02
```

We can see the pooled estimates, such as the coefficients and standard errors, in the above output; they are computed from the `coef()` and `vcov()` methods of the fitted models (`lm` fits here). The estimated treatment coefficient is around `-1.73` with a standard error of around `0.67`.
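For intuition, Rubin's rules can be reproduced by hand: the pooled estimate is the mean of the per-imputation estimates, and the total variance combines the within- and between-imputation variances. A sketch with made-up numbers, not the actual model output:

```r
# Per-imputation coefficients and standard errors (invented for illustration):
q <- c(-1.70, -1.75, -1.72, -1.74, -1.73)  # m = 5 point estimates
u <- c(0.43, 0.44, 0.42, 0.45, 0.43)^2     # their squared standard errors

m     <- length(q)
qbar  <- mean(q)                 # pooled point estimate
w     <- mean(u)                 # within-imputation variance
b     <- var(q)                  # between-imputation variance
total <- w + (1 + 1/m) * b       # Rubin's total variance
se    <- sqrt(total)             # pooled standard error
```

The `(1 + 1/m)` factor inflates the between-imputation component to account for using a finite number of imputations; `pool()` performs this same combination (plus the degrees-of-freedom adjustment) for every coefficient.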

However, the coefficient and standard error are not the final results we would display in a clinical report. We still need the LS-means estimate for each group, the contrast between the two groups, and the corresponding hypothesis test. That will be discussed in the next topic.