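Because I migrated by using an image, the process is very simple. Before switching servers, you only need to prepare the following:

- Create a custom image of the current server
- Back up the existing Hexo site files to guard against data loss
- Configure the new ECS server, choosing the same region as the old ECS and selecting the custom image you created, so that the later migration goes smoothly

Once you have purchased the new ECS server, you can start switching servers with the following steps:

- The new ECS already has the same configuration as the old one, so there is almost no need to reconfigure Hexo, apart from replacing the old IP with the new IP in the `_config.yml` file
- Check that the required ports are open and that the firewall is configured
- Redeploy the blog posts
- Finally, update the DNS records of the domain, replacing the old IP with the new one; otherwise the site can only be reached by IP and not through the www domain

With the steps above, the Hexo migration to the new ECS server is complete.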

]]>As we know, the primary outcome in randomized trials is often the difference in means (LS means) at a given timepoint (visit). One way to analyze such data is to ignore the measurements at intermediate timepoints and estimate the outcome at the specific timepoint by ANCOVA, but this requires complete data. If data are missing, multiple imputation is sometimes suggested. In an MMRM, however, it is generally thought that using the information from all timepoints implicitly handles missing data. In SAS, `proc mixed` handles missing values more efficiently than `proc glm`, because it allows the inclusion of subjects with missing data. In R, I feel that the `mmrm` package is more powerful and runs more smoothly than the alternatives.

Here, I take the example data from the `mmrm` package and implement the MMRM using SAS and R, respectively. In this randomized trial, subjects are treated with a treatment drug or placebo, and FEV1 (forced expiratory volume in one second) is a measure of how quickly the lungs can be emptied. This measure is repeated from Visit 1 to Visit 4. Low levels of FEV1 may indicate chronic obstructive pulmonary disease (COPD).

```r
library(mmrm)
data("fev_data")
write.csv(fev_data, file = "./fev_data.csv", na = "", row.names = FALSE)
```

To evaluate the effect of treatment on FEV1, the endpoint measurements can be analyzed using an MMRM with an unstructured covariance matrix reflecting the correlation between visits within subjects; treatment (treatment drug or placebo), visit and treatment-by-visit interaction as the fixed effects; subject as a random effect; visit as a repeated measure; and race as the covariate.

The SAS code is shown below.

```sas
proc import datafile="./fev_data.csv"
    out=fev_data
    dbms=csv replace;
    getnames=yes;
run;

proc mixed data=fev_data method=reml;
    class ARMCD(ref='PBO') AVISIT RACE USUBJID;
    model FEV1 = RACE ARMCD AVISIT ARMCD*AVISIT / ddfm=KR;
    repeated AVISIT / subject=USUBJID type=UN r rcorr;
    lsmeans ARMCD*AVISIT / cl alpha=0.05 diff slice=AVISIT;
    lsmeans ARMCD / cl alpha=0.05 diff;
    ods output lsmeans=lsm diffs=diff;
run;
```

From the SAS code above, we can see that the `method` option specifies `REML` as the estimation method. The `repeated` statement specifies the repeated-measures factor and controls the covariance structure. In repeated measures models, the `subject` option defines which observations belong to the same subject and which belong to different subjects, who are assumed to be independent. The `type` option specifies the model for the within-subject covariance structure of the errors. We also add `ddfm=KR` to the `model` statement to specify the method for the denominator degrees of freedom (Kenward-Roger here). Lastly, the `lsmeans` statement with the `cl` and `diff` options is very commonly used to compute the LS means. These two options give us the confidence intervals and the differences of the LS means, along with the p-value for testing a difference of `0`.

Specifying `ARMCD*AVISIT` in the `lsmeans` statement requests the LS means, and the tests between them, for every combination of treatment arm and visit. `lsmeans ARMCD` instead gives the LS mean of each arm averaged over the visits, which is identical to the mean of that arm's visit-level LS means from `lsmeans ARMCD*AVISIT`.
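As a quick check of this averaging, the sketch below takes the rounded visit-level LS means that `emmeans` prints later in this post and averages them per arm. The vector names are made up for illustration, and because the inputs are rounded, the results only approximate what `lsmeans ARMCD` would report.

```r
# Visit-level LS means (VIS1..VIS4), copied from the rounded emmeans output
lsm_pbo <- c(33.3, 38.2, 43.7, 48.4)
lsm_trt <- c(37.1, 41.9, 46.8, 52.8)

# The arm-level LS mean gives every visit the same weight
arm_pbo <- mean(lsm_pbo)
arm_trt <- mean(lsm_trt)

# Treatment difference averaged over the visits
arm_diff <- arm_trt - arm_pbo
round(c(PBO = arm_pbo, TRT = arm_trt, diff = arm_diff), 2)
```

The equal weighting is the point here: the arm-level value is not weighted by how many subjects were observed at each visit.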

The equivalent model in R is shown below.

```r
library(mmrm)
library(emmeans)
data("fev_data")
fit <- mmrm(
  formula = FEV1 ~ RACE + ARMCD + AVISIT + ARMCD * AVISIT + us(AVISIT | USUBJID),
  data = fev_data
)
# summary(fit)
```

If you would like to obtain the LS mean of each visit for each group, like the `lsm` dataset in SAS, you can use the `emmeans` function from the `emmeans` package, as the mmrm object can be analyzed by this external package.

```r
# emmeans(fit, "ARMCD", by = "AVISIT")
emmeans(fit, ~ ARMCD | AVISIT)
## AVISIT = VIS1:
##  ARMCD emmean    SE  df lower.CL upper.CL
##  PBO     33.3 0.757 149     31.8     34.8
##  TRT     37.1 0.764 144     35.6     38.6
## 
## AVISIT = VIS2:
##  ARMCD emmean    SE  df lower.CL upper.CL
##  PBO     38.2 0.608 150     37.0     39.4
##  TRT     41.9 0.598 146     40.7     43.1
## 
## AVISIT = VIS3:
##  ARMCD emmean    SE  df lower.CL upper.CL
##  PBO     43.7 0.462 131     42.8     44.6
##  TRT     46.8 0.507 130     45.8     47.8
## 
## AVISIT = VIS4:
##  ARMCD emmean    SE  df lower.CL upper.CL
##  PBO     48.4 1.189 134     46.0     50.7
##  TRT     52.8 1.188 133     50.4     55.1
## 
## Results are averaged over the levels of: RACE 
## Confidence level used: 0.95
```

As for the `diff` dataset from SAS, you can use the `pairs` function to get the corresponding output.

```r
pairs(emmeans(fit, ~ ARMCD | AVISIT), reverse = TRUE, adjust = "tukey")
## AVISIT = VIS1:
##  contrast  estimate    SE  df t.ratio p.value
##  TRT - PBO     3.78 1.076 146   3.508  0.0006
## 
## AVISIT = VIS2:
##  contrast  estimate    SE  df t.ratio p.value
##  TRT - PBO     3.76 0.853 148   4.405  <.0001
## 
## AVISIT = VIS3:
##  contrast  estimate    SE  df t.ratio p.value
##  TRT - PBO     3.11 0.689 132   4.509  <.0001
## 
## AVISIT = VIS4:
##  contrast  estimate    SE  df t.ratio p.value
##  TRT - PBO     4.41 1.681 133   2.622  0.0098
## 
## Results are averaged over the levels of: RACE
```

I feel that if we use the ANCOVA model and focus on a specific timepoint before the end of the trial, we can say the treatment effect is the main difference between the treatment and control groups at that timepoint. In an MMRM, by contrast, we include the information from all timepoints, even though the primary outcome is often still the difference at that specific or final timepoint. This brings a couple of advantages, such as improving power and reducing dropout bias, because subjects who withdraw from the study before the final timepoint may still contribute information from their earlier visits. Once all the timepoints are included, the treatment-by-visit interaction should also be added to the model, to allow for a treatment effect that differs in the slope of the outcome over time.

Initially, the unstructured (`type=UN`) covariance structure lets SAS estimate the covariance matrix freely, as the unstructured approach makes no assumption at all about the correlations among study visits. How to select an appropriate covariance structure depends on your understanding of the study and the data you have. For instance, lower AIC values suggest a better fit.

Here are two documents for your reference if you would like to know which structures can be used and how to select a more suitable one:

- Selecting an Appropriate Covariance Structure
- Guidelines for Selecting the Covariance Structure in Mixed Model Analysis

- MMRM Package Introduction
- MIXED MODEL REPEATED MEASURES (MMRM)
- Proc mixed
- Understanding Interaction Effects in Statistics
- Mixed Model Repeated Measures (MMRM)
- Repeated Measures Modeling With PROC MIXED
- Mixed Models for Repeated Measures Should Include Time-by-Covariate Interactions to Assure Power Gains and Robustness Against Dropout Bias Relative to Complete-Case ANCOVA

`mcradds` (version 1.0.1) helps with designing, analyzing and visualizing In Vitro Diagnostic trials. You can install it from CRAN with:

`install.packages("mcradds")`

or you can install the development version directly from GitHub with:

```r
if (!require("devtools")) {
  install.packages("devtools")
}
devtools::install_github("kaigu1990/mcradds")
```

This blog post introduces the package and its main functions. Let's start by loading it.

`library(mcradds)`

The `mcradds` R package complements the `mcr` package and offers common and solid functions for designing, analyzing, and visualizing In Vitro Diagnostic (IVD) trials. In my work experience as a statistician for diagnostic trials at Roche Diagnostics, the `mcr` package is an internally built tool for regression analysis and other related methodologies that is also widely used in the IVD industry community.

However, the `mcr` package focuses on method comparison trials and does not include some other common diagnostic methods, which are provided in `mcradds`. It is intuitive and easy to use, so you can perform statistical analysis and produce graphics for different IVD trials with its analytical functions.

- Estimate the sample size for trials, following NMPA guidelines.
- Evaluate diagnostic accuracy with/without reference, following CLSI EP12-A2.
- Perform regression method analysis and plots, following CLSI EP09-A3.
- Perform Bland-Altman analysis and plots, following CLSI EP09-A3.
- Detect outliers with 4E method from CLSI EP09-A2 and ESD from CLSI EP09-A3.
- Estimate bias in medical decision level, following CLSI EP09-A3.
- Perform Pearson and Spearman correlation analysis, adding hypothesis test and confidence interval.
- Evaluate Reference Range/Interval, following CLSI EP28-A3 and NMPA guidelines.
- Add paired ROC/AUC test for superiority and non-inferiority trials, following CLSI EP05-A3/EP15-A3.
- Perform reproducibility analysis (reader precision) for immunohistochemical assays, following CLSI I/LA28-A2 and NMPA guidelines.
- Evaluate precision of quantitative measurements, following CLSI EP05-A3.

Please note that these functions and methods have not been formally validated and QC'ed, so I cannot guarantee that all of them are entirely proper and error-free. However, I always strive to compare the results with those of other resources in order to obtain consistent results. And because some of them were used in my past routine work, I believe the quality of this package is sufficient for use for the time being.

Let's demonstrate that by looking at a few examples. More detailed usage can be found on the Get started page.

Suppose that we have a new diagnostic assay with an expected sensitivity of `0.9` and a clinically acceptable criterion of `0.85`. If we conduct a two-sided normal Z-test at a significance level of `α = 0.05` and want to achieve a power of `80%`, what should the total sample size be?

The result from the sample size function is:

```r
size_one_prop(p1 = 0.9, p0 = 0.85, alpha = 0.05, power = 0.8)
#> 
#>  Sample size determination for one Proportion 
#> 
#>  Call: size_one_prop(p1 = 0.9, p0 = 0.85, alpha = 0.05, power = 0.8)
#> 
#>    optimal sample size: n = 363 
#> 
#>    p1:0.9 p0:0.85 alpha:0.05 power:0.8 alternative:two.sided
```
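To see where a number like this can come from, here is a minimal sketch of the textbook normal-approximation formula for a one-sample proportion test. I am assuming, not confirming, that `size_one_prop()` is based on the same formula; `n_one_prop()` is a hypothetical helper and not part of `mcradds`. With the inputs above it also gives 363.

```r
# Normal-approximation sample size for a one-sample proportion test
# (hypothetical helper for illustration, not the mcradds implementation)
n_one_prop <- function(p1, p0, alpha = 0.05, power = 0.8) {
  z_alpha <- qnorm(1 - alpha / 2)  # two-sided critical value
  z_beta <- qnorm(power)           # quantile matching the target power
  n <- ((z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))) /
    (p1 - p0))^2
  ceiling(n)                       # round up to a whole subject
}

n <- n_one_prop(p1 = 0.9, p0 = 0.85)
n
```

The null proportion `p0` drives the variance under the null hypothesis and `p1` the variance under the alternative, which is why both square roots appear in the numerator.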

Suppose that you have a wide structure of data like `qualData` that contains the qualitative measurements of the candidate (your own product) and comparative (reference product) assays. In this scenario, if you’re interested in how to create a 2x2 contingency table, the `diagTab()` function is a good solution.

```r
data("qualData")
tb <- qualData %>%
  diagTab(
    formula = ~ CandidateN + ComparativeN,
    levels = c(1, 0)
  )
tb
#> Contingency Table: 
#> 
#> levels: 1 0
#>           ComparativeN
#> CandidateN   1   0
#>          1 122   8
#>          0  16  54
```

However, there are different formula settings when the data structure is long.

```r
dummy <- data.frame(
  id = c("1001", "1001", "1002", "1002", "1003", "1003"),
  value = c(1, 0, 0, 0, 1, 1),
  type = c("Test", "Ref", "Test", "Ref", "Test", "Ref")
) %>%
  diagTab(
    formula = type ~ value,
    bysort = "id",
    dimname = c("Test", "Ref"),
    levels = c(1, 0)
  )
dummy
#> Contingency Table: 
#> 
#> levels: 1 0
#>     Ref
#> Test 1 0
#>    1 1 1
#>    0 0 1
```

And then you can use the `getAccuracy()` method to compute the diagnostic performance based on the table above.

```r
# Default method is Wilson score, and digit is 4.
tb %>% getAccuracy(ref = "r")
#>         EST LowerCI UpperCI
#> sens 0.8841  0.8200  0.9274
#> spec 0.8710  0.7655  0.9331
#> ppv  0.9385  0.8833  0.9685
#> npv  0.7714  0.6605  0.8541
#> plr  6.8514  3.5785 13.1181
#> nlr  0.1331  0.0832  0.2131
```
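For readers who want to see what sits behind the `sens` and `spec` rows, the sketch below recomputes them by hand from the 2x2 table with the Wilson score interval (without continuity correction). `wilson_ci()` is a hypothetical helper written for this post, not an `mcradds` function; the counts are the ones printed in the contingency table above.

```r
# Wilson score interval for a binomial proportion (no continuity correction)
wilson_ci <- function(x, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p <- x / n
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(EST = p, LowerCI = centre - half, UpperCI = centre + half)
}

# Counts from the table above (rows: candidate, columns: comparative)
tp <- 122  # candidate 1, comparative 1
fp <- 8    # candidate 1, comparative 0
fn <- 16   # candidate 0, comparative 1
tn <- 54   # candidate 0, comparative 0

sens <- wilson_ci(tp, tp + fn)  # sensitivity = TP / (TP + FN)
spec <- wilson_ci(tn, tn + fp)  # specificity = TN / (TN + FP)
round(rbind(sens, spec), 4)
```

Rounded to four digits, these agree with the `sens` and `spec` rows of the `getAccuracy()` output.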

If you want to estimate reader precision between different readers, reads, or sites, `APA` (average positive agreement), `ANA` (average negative agreement) and `OPA` (overall percent agreement) are used as the primary endpoints in PDL1 assay trials. Let’s see an example of precision between readers.

```r
data("PDL1RP")
reader <- PDL1RP$btw_reader
tb1 <- reader %>%
  diagTab(
    formula = Reader ~ Value,
    bysort = "Sample",
    levels = c("Positive", "Negative"),
    rep = TRUE,
    across = "Site"
  )
getAccuracy(tb1, ref = "bnr", rng.seed = 12306)
#>        EST LowerCI UpperCI
#> apa 0.9479  0.9260  0.9686
#> ana 0.9540  0.9342  0.9730
#> opa 0.9511  0.9311  0.9711
```
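The point estimates follow the usual agreement formulas, which can be sketched in a few lines. The counts below are hypothetical and are not taken from `PDL1RP`; the confidence intervals in the output above are not reproduced here (the `rng.seed` argument suggests a resampling approach).

```r
# Hypothetical pairwise agreement counts between two readers
n_pp <- 40  # both readers positive
n_pn <- 3   # reader 1 positive, reader 2 negative
n_np <- 2   # reader 1 negative, reader 2 positive
n_nn <- 55  # both readers negative

# Average positive / negative agreement and overall percent agreement
apa <- 2 * n_pp / (2 * n_pp + n_pn + n_np)
ana <- 2 * n_nn / (2 * n_nn + n_pn + n_np)
opa <- (n_pp + n_nn) / (n_pp + n_pn + n_np + n_nn)
round(c(apa = apa, ana = ana, opa = opa), 4)
```

Because neither reader is treated as the reference, the discordant cells enter `apa` and `ana` symmetrically.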

Suppose that in another scenario, you have a wide structure of quantitative data like `platelet` and would like to do the Bland-Altman analysis to obtain a series of descriptive statistics, including `mean`, `median`, `Q1`, `Q3`, `min`, `max` and other estimates like `CI` (confidence interval of the mean) and `LoA` (limit of agreement).

```r
data("platelet")
# Default difference type
blandAltman(
  x = platelet$Comparative, y = platelet$Candidate,
  type1 = 3, type2 = 5
)
#>  Call: blandAltman(x = platelet$Comparative, y = platelet$Candidate, 
#>     type1 = 3, type2 = 5)
#> 
#>   Absolute difference type:  Y-X
#>   Relative difference type:  (Y-X)/(0.5*(X+Y))
#> 
#>                             Absolute.difference Relative.difference
#> N                                           120                 120
#> Mean (SD)                        7.330 (15.990)      0.064 ( 0.145)
#> Median                                    6.350               0.055
#> Q1, Q3                         ( 0.150, 15.750)    ( 0.001,  0.118)
#> Min, Max                       (-47.800, 42.100)   (-0.412,  0.667)
#> Limit of Agreement             (-24.011, 38.671)   (-0.220,  0.347)
#> Confidence Interval of Mean    ( 4.469, 10.191)    ( 0.038,  0.089)
```
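The last two rows can be checked by hand from the reported mean, SD and N of the absolute differences. The sketch below assumes normal quantiles (about 1.96) for both intervals, which is what reproduces the printed numbers up to rounding; the variable names are my own.

```r
# Summary statistics of the absolute differences, copied from the output above
d_mean <- 7.330
d_sd <- 15.990
n <- 120

z <- qnorm(0.975)

# Limits of agreement: mean difference +/- z * SD
loa <- d_mean + c(-1, 1) * z * d_sd

# Confidence interval of the mean difference: mean +/- z * SD / sqrt(n)
ci_mean <- d_mean + c(-1, 1) * z * d_sd / sqrt(n)

round(rbind(loa, ci_mean), 3)
```

Small differences in the third decimal place come from using the rounded mean and SD as inputs.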

The Bland-Altman plot can then be easily drawn with the `autoplot` method.

```r
object <- blandAltman(x = platelet$Comparative, y = platelet$Candidate)
# Absolute difference plot
autoplot(object, type = "absolute")
```

Here is a plot of the data.

Based on the output from Bland-Altman, you can also detect the potential outliers using the `getOutlier()` method.

```r
# ESD approach
ba <- blandAltman(x = platelet$Comparative, y = platelet$Candidate)
out <- getOutlier(ba, method = "ESD", difference = "rel")
out$stat
#>   i       Mean        SD          x Obs     ESDi   Lambda Outlier
#> 1 1 0.06356753 0.1447540  0.6666667   1 4.166372 3.445148    TRUE
#> 2 2 0.05849947 0.1342496  0.5783972   4 3.872621 3.442394    TRUE
#> 3 3 0.05409356 0.1258857  0.5321101   2 3.797226 3.439611    TRUE
#> 4 4 0.05000794 0.1183096 -0.4117647  10 3.903086 3.436800    TRUE
#> 5 5 0.05398874 0.1106738 -0.3132530  14 3.318236 3.433961   FALSE
#> 6 6 0.05718215 0.1056542 -0.2566372  23 2.970250 3.431092   FALSE
out$outmat
#>   sid    x    y
#> 1   1  1.5  3.0
#> 2   2  4.0  6.9
#> 3   4 10.2 18.5
#> 4  10 16.4 10.8
```
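To make the `ESDi` and `Lambda` columns less of a black box, here is a minimal base-R sketch of the generalized ESD procedure (Rosner's test) on toy data. `esd_outliers()` is a hypothetical helper, and I am assuming, rather than having verified, that `getOutlier()` applies the same statistic and critical value.

```r
# Generalized ESD test, written as a hypothetical base-R helper
esd_outliers <- function(x, max_outliers = 3, alpha = 0.05) {
  n <- length(x)
  idx <- seq_along(x)               # original positions of remaining values
  obs <- integer(max_outliers)
  esd <- numeric(max_outliers)
  lambda <- numeric(max_outliers)
  for (i in seq_len(max_outliers)) {
    dev <- abs(x - mean(x))
    k <- which.max(dev)             # most extreme remaining value
    esd[i] <- dev[k] / sd(x)        # test statistic for step i
    obs[i] <- idx[k]
    p <- 1 - alpha / (2 * (n - i + 1))
    t_crit <- qt(p, df = n - i - 1)
    lambda[i] <- (n - i) * t_crit /
      sqrt((n - i - 1 + t_crit^2) * (n - i + 1))
    x <- x[-k]                      # drop the value and repeat
    idx <- idx[-k]
  }
  # Number of outliers = largest step whose statistic exceeds its critical value
  exceed <- which(esd > lambda)
  n_out <- if (length(exceed) > 0) max(exceed) else 0
  data.frame(
    i = seq_len(max_outliers), Obs = obs, ESDi = esd, Lambda = lambda,
    Outlier = seq_len(max_outliers) <= n_out
  )
}

vals <- c(1:10, 100)                # toy data with one obvious outlier
out <- esd_outliers(vals, max_outliers = 2)
out
```

In the toy data only the value 100 (observation 11) is flagged; after it is removed, the remaining values are well within the critical value.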

Suppose that you would like to evaluate the regression agreement between two assays with the Deming method. You can use `mcreg`, the main function, which is wrapped from the `mcr` package.

```r
# Deming regression
fit <- mcreg(
  x = platelet$Comparative, y = platelet$Candidate,
  error.ratio = 1, method.reg = "Deming", method.ci = "jackknife"
)
```

As with the Bland-Altman plot, the `autoplot` function can draw the regression scatter plot with the fitted line, as shown below.

Based on this regression analysis, you can also estimate the bias at one or more medical decision levels.

```r
# absolute bias.
calcBias(fit, x.levels = c(30))
#>    Level     Bias       SE      LCI      UCI
#> X1    30 4.724429 1.378232 1.995155 7.453704
# proportional bias.
calcBias(fit, x.levels = c(30), type = "proportional")
#>    Level Prop.bias(%)       SE      LCI      UCI
#> X1    30      15.7481 4.594106 6.650517 24.84568
```
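The relationship between the two outputs is easy to sketch: for a fitted line `y = intercept + slope * x`, the bias at a decision level `xc` is the fitted value minus `xc`, and the proportional bias is that bias as a percentage of `xc`. The coefficients below are hypothetical and are not the Deming estimates from `fit`.

```r
# Hypothetical regression coefficients (not taken from the fit above)
intercept <- 2
slope <- 1.1
xc <- 30  # medical decision level

# Absolute bias: fitted value at xc minus xc itself
bias <- intercept + (slope - 1) * xc

# Proportional bias, in percent of the decision level
prop_bias <- bias / xc * 100

c(bias = bias, prop_bias = prop_bias)
```

The same ratio can be seen in the output above, where 4.724429 / 30 is about 15.7481%.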

Suppose that you have data from a target population and would like to compute the 95% reference interval (RI) with the nonparametric method.

```r
data("calcium")
refInterval(x = calcium$Value, RI_method = "nonparametric", CI_method = "nonparametric")
#> 
#>  Reference Interval Method: nonparametric, Confidence Interval Method: nonparametric 
#> 
#>  Call: refInterval(x = calcium$Value, RI_method = "nonparametric", CI_method = "nonparametric")
#> 
#>   N = 240
#>   Outliers: NULL
#>   Reference Interval: 9.10, 10.30
#>   RefLower Confidence Interval: 8.9000, 9.2000
#>   Refupper Confidence Interval: 10.3000, 10.4000
```
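As a rough illustration of the nonparametric idea, the limits can be taken as the observations ranked at 0.025 × (n + 1) and 0.975 × (n + 1), which corresponds to `type = 6` in base R's `quantile()`. This is a sketch under my own assumption about the rank rule; I have not checked that `refInterval()` uses exactly this definition, and the toy data are not the `calcium` values.

```r
# Nonparametric 95% reference interval via ranks p * (n + 1)
# (hypothetical helper; quantile type 6 implements this rank rule)
ref_interval_np <- function(x) {
  unname(quantile(x, probs = c(0.025, 0.975), type = 6))
}

vals <- 1:39  # toy data: n + 1 = 40, so the ranks are exactly 1 and 39
ri <- ref_interval_np(vals)
ri
```

With 39 ordered values the two limits are simply the smallest and largest observations, which makes the rank rule easy to verify by eye.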

Suppose that you want to see if the OxLDL assay is superior to the LDL assay by comparing the two AUCs of paired two-sample diagnostic assays using the standardized difference method, with a margin of `0.1`. In this case, the null hypothesis is that the difference is less than or equal to `0.1`.

```r
data("ldlroc")
# H0 : Superiority margin <= 0.1:
aucTest(
  x = ldlroc$LDL, y = ldlroc$OxLDL, response = ldlroc$Diagnosis,
  method = "superiority", h0 = 0.1
)
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> 
#> The hypothesis for testing superiority based on Paired ROC curve
#> 
#>  Test assay:
#>   Area under the curve: 0.7995
#>   Standard Error(SE): 0.0620
#>   95% Confidence Interval(CI): 0.6781-0.9210 (DeLong)
#> 
#>  Reference/standard assay:
#>   Area under the curve: 0.5617
#>   Standard Error(SE): 0.0836
#>   95% Confidence Interval(CI): 0.3979-0.7255 (DeLong)
#> 
#>  Comparison of Paired AUC:
#>   Alternative hypothesis: the difference in AUC is superiority to 0.1
#>   Difference of AUC: 0.2378
#>   Standard Error(SE): 0.0790
#>   95% Confidence Interval(CI): 0.0829-0.3927 (standardized differenec method)
#>   Z: 1.7436
#>   Pvalue: 0.04061
```
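The `Z` and `Pvalue` lines can be approximated by hand from the reported difference and its standard error: subtract the margin, divide by the SE, and take the upper tail of the standard normal distribution. The sketch below uses the rounded numbers printed above, so it matches the output only to about two or three decimals.

```r
# Values copied from the aucTest output above (rounded)
auc_diff <- 0.2378  # difference of AUC (OxLDL - LDL)
se_diff <- 0.0790   # standard error of the difference
margin <- 0.1       # superiority margin (h0)

# One-sided z-test of H0: difference <= margin
z <- (auc_diff - margin) / se_diff
p_one_sided <- pnorm(z, lower.tail = FALSE)

round(c(Z = z, Pvalue = p_one_sided), 4)
```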

Suppose that you would like to test the hypothesis `H0=0.5` rather than `H0=0` in a Pearson or Spearman correlation analysis. The `pearsonTest()` and `spearmanTest()` functions would be helpful.

```r
# Pearson hypothesis test
x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c(2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)
pearsonTest(x, y, h0 = 0.5, alternative = "greater")
#> $stat
#>        cor    lowerci    upperci          Z       pval 
#>  0.5711816 -0.1497426  0.8955795  0.2448722  0.4032777 
#> 
#> $method
#> [1] "Pearson's correlation"
#> 
#> $conf.level
#> [1] 0.95

# Spearman hypothesis test
x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c(2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)
spearmanTest(x, y, h0 = 0.5, alternative = "greater")
#> $stat
#>        cor    lowerci    upperci          Z       pval 
#>  0.6000000 -0.1478261  0.9656153  0.3243526  0.3728355 
#> 
#> $method
#> [1] "Spearman's correlation"
#> 
#> $conf.level
#> [1] 0.95
```
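The Pearson numbers above are consistent with the classical Fisher z-transformation, which the sketch below applies by hand in base R. The variable names are my own, and I am only showing that this approach reproduces the printed Pearson result, not claiming it is how `pearsonTest()` is coded.

```r
x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c(2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)

r <- cor(x, y)  # Pearson correlation
n <- length(x)
h0 <- 0.5       # hypothesized correlation under H0

# Fisher z-transformation: atanh(r) is approximately normal with SE 1 / sqrt(n - 3)
z <- (atanh(r) - atanh(h0)) * sqrt(n - 3)
pval <- pnorm(z, lower.tail = FALSE)  # alternative = "greater"

# 95% confidence interval, back-transformed to the correlation scale
ci <- tanh(atanh(r) + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

round(c(cor = r, lowerci = ci[1], upperci = ci[2], Z = z, pval = pval), 7)
```

Testing against `h0 = 0` would reduce to the usual question of whether any correlation exists; a non-zero `h0` asks whether the correlation exceeds a required minimum.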

That's it! That's the `mcradds` package. More details can be found in the Introduction to mcradds vignette.

The main reference is Chapter 22, Releasing to CRAN, at https://r-pkgs.org/release.html. Follow the steps below.

Use `usethis::use_release_issue()` to open a GitHub issue containing a checklist of the recommended tasks you should finish before the release.

If you don't have a README yet, create one and render it with `devtools::build_readme()` before releasing. Don't forget to add the installation instructions to the README, and keep the NEWS file up to date as well.

A vignette, the long-form guide to your package, is also necessary. Use `usethis::use_vignette("my-vignette")` to create a default template first; after that, you can follow the structure of the vignettes of other mature packages (that's what I'm doing).

In addition, a website built with pkgdown helps users learn more about your package. The following functions can help you build it: `usethis::use_pkgdown()` for the initial setup, `pkgdown::build_site()` to render your site, and `usethis::use_pkgdown_github_pages()` to deploy the site to GitHub Pages with GitHub Actions.

Check the DESCRIPTION carefully:

- Proofread the title and follow the naming rules: it should be plain text (no markup), capitalized like a title, and NOT end in a period.
- Provide a good description, which is very important.
- Check the version number, updating it manually or with `usethis::use_version()`.
- Don't forget to add the "cph" (copyright holder) role to `Authors@R`. If you are the only developer, you should put the three roles "aut", "cre" and "cph" together.
- Make sure the license is reasonable and correct.
- Add correct URLs that comply with CRAN's URL checks, and verify them with `urlchecker::url_check()`.

Check the spelling and record all accepted words in `inst/WORDLIST` automatically with `usethis::use_spell_check()`. That's a fantastic approach, since it takes just a one-line command.

Finally, run `devtools::check()` once again to ensure everything is ready.

As usual, I use `devtools::check()` to double-check that all is still well before I merge or commit an update. Before releasing, however, you'd better add `remote = TRUE` and `manual = TRUE` and run `R CMD check` again, as in `devtools::check(remote = TRUE, manual = TRUE)`, which will build and check the manual and perform a number of CRAN incoming checks.

Maybe you will encounter the same problem I had, a confusing warning: `pdflatex not found! Not building PDF manual`. I didn't understand the meaning of this warning at first. I checked all the options in R and RStudio, but that didn’t help. Finally, I found that it occurred because I didn’t have the `pdflatex` executable on this computer!

The problem is easy to solve once you have found it. I chose to install `pdflatex` using the solution provided by Yihui Xie, described at https://yihui.org/tinytex/.

```r
install.packages('tinytex')
tinytex::install_tinytex()
```

Another option is to add some more packages for building PDF vignettes of many CRAN packages.

`tinytex:::install_yihui_pkgs()`

Finally, if it still doesn't work, make sure the directory containing `pdflatex` has been added to the PATH environment variable on your computer.

After `R CMD check`, you'd better use `devtools::check_win_devel()`, as checking with r-devel is required by CRAN policy. This makes sure your package passes CRAN's win-builder service, which is only for Windows. Another good option is `rhub::check_for_cran()`, a service supported by the R Consortium, to check your package.

If this package is a new submission to CRAN, it does not have any downstream dependencies yet. Otherwise, you should run the reverse dependency checks.

```r
usethis::use_revdep()
revdepcheck::revdep_check(num_workers = 4)
```

or

`revdepcheck::cloud_check()`

After all of the above, record your comments about the submission in `cran-comments.md`. This file is created by `usethis::use_cran_comments()`, which you run at the beginning, so there is no need to add it manually.

Once you’re satisfied that all issues have been addressed and it’s time to submit your package to CRAN, run `usethis::use_version()` to set the final version you would like for the first CRAN release, and then submit with `devtools::submit_cran()` without any hesitation.

Afterwards, you will receive an email telling you that the package is pending a manual inspection as a new CRAN submission. You will get a response within the next 10 working days, but sometimes the feedback is very fast.

If there are comments from CRAN, respond to every remark and double-check everything. Fix what needs to be fixed; where you decide not to change something, explain why as clearly as you can. Don't forget to add a "Resubmission" section at the top of `cran-comments.md` to clearly identify that the package is a resubmission, and list the changes that you have made. Anything you want to explain or clarify can also be added there.

Finally, if you receive an email telling you that your package will be published within 24 hours in the corresponding CRAN directory, your package has been accepted and released on CRAN. You should then push it to GitHub with the new version number. Next, use `usethis::use_github_release()` to create a new release with a version tag on GitHub, and update the NEWS file as well to indicate that this is the CRAN release.

Now you can bump the version number to the development version using `usethis::use_dev_version()`. It makes sense to push to GitHub immediately so that any further update is based on the development version.

Other checking lists for CRAN are also available for reference.

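]]>`officer` can be used to generate editable figures in PowerPoint. Editable here means that every element of the figure, including the points, the X/Y axes and the labels, can be modified, which is commonly used for retouching figures. Reference: Chapter 5 officer for PowerPoint

In fact, `officer` is one package of the `Officeverse` suite, which also includes other familiar packages, such as:

- `officedown`, for generating Word documents from R Markdown
- `flextable`, for producing very handy tables
- `rvg`, for generating vector graphics
- `mschart`, for generating native Microsoft Office charts

Getting to the point: suppose you have a figure produced in R. It can come from base R graphics, from ggplot2, or from another plotting package (which must, however, provide a `ggplot` object). In each case it can be converted into an editable figure in PowerPoint as follows.

First, create the figure and wrap it as a vector graphic with the `rvg::dml` function, so that it can later be inserted into the slides.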

```r
library(rvg)
p1 <- dml(plot(1:10))

library(ggplot2)
g2 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  theme_classic()
p2 <- dml(ggobj = g2)

library(survival)
library(survminer)
g3 <- survfit(Surv(time, status) ~ sex, data = lung) %>%
  ggsurvplot(data = lung)
p3 <- dml(ggobj = g3$plot)
```

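Once the vector graphic objects have been created, add them to the PowerPoint file following the steps in the figure below.

First use `read_pptx` to create an empty PowerPoint file from the default template; then use `add_slide` to add an empty slide; finally use `ph_with` to insert the vector graphic object into it. Some of the arguments involved require a basic understanding of PowerPoint components; see: 2.2 PowerPoint presentation properties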

```r
library(officer)
doc <- read_pptx()
doc <- add_slide(doc, layout = "Title and Content", master = "Office Theme")
doc <- ph_with(doc, p1, location = ph_location_fullsize())
doc <- add_slide(doc, layout = "Title and Content", master = "Office Theme")
doc <- ph_with(doc, p2, location = ph_location_fullsize())
doc <- add_slide(doc, layout = "Title and Content", master = "Office Theme")
doc <- ph_with(doc, p3, location = ph_location_fullsize())
print(doc, target = "test.pptx")
```

Finally, you can open the `test.pptx` file and polish the figures.

In terms of the ANCOVA model, if you would like to add a non-inferiority or superiority margin, you can just use the `lsmestimate` statement with `testvalue=2` when the margin is 2. For multiple imputation, however, you can't just add this statement in the analysis step; you should define the margin in the pooling step.

To follow up on the last article, I will use the identical example data and the same first and second steps of the MI process, and just illustrate the difference in the third step. Assume that the endpoint is the change from baseline (CHG) at week 6. Given that this drug is used to reduce the primary indicator, the null hypothesis might be that the CHG in the treatment group minus that in the placebo group is more than `-2`, meaning that the drug is not superior to placebo.

```sas
ods output ParameterEstimates=super;
proc mianalyze data=diff theta0=-2;
    modeleffects estimate;
    stderr stderr;
run;
```

The combined result with a margin of `-2` is as follows.

Now we can see that the `Theta0` value is `-2` rather than the default `0`, and that the two-sided p-value is `0.4745`. If we would like to obtain the one-sided p-value, an additional calculation is needed. Simply halving the two-sided p-value is also fine here and gives the same result.

```sas
data super;
    set super;
    pval = (1 - probt(abs(tvalue), df));
run;
```

Alternatively, the t-statistic and p-value can be computed directly from the t distribution, as shown below in R.

```r
est <- -2.803439
theta0 <- -2
se <- 1.123403
df <- 4800.7

t <- (est - theta0) / se
t
#> [1] -0.7151832

pval <- pt(t, df)
pval
#> [1] 0.2372653
```

The superiority test is used as the example above; a non-inferiority test follows the same procedure, simply with a different margin.

]]>There are plenty of methods that could be applied to the missing data, depending on the goal of the clinical trial. The most common and recommended is multiple imputation (MI), and other methods such as last observation carried forward (LOCF), observed case (OC) and mixed model for repeated measurement (MMRM) are also available for sensitivity analysis.

Multiple imputation is a model-based method, but it is not just a model for imputing data; it is also a framework that incorporates various analysis models such as ANCOVA. In general, there are 3 steps to implement MI, and they are the same in R and SAS.

- Imputation: generate M imputed datasets. Before starting this step, you'd better examine the missing data pattern, which is either monotone or arbitrary.
- Analysis: generate M sets of estimates from the M imputed datasets using the statistical model.
- Pooling: combine the M sets of estimates into one MI estimate. This is very different from other imputation methods, as MI not only imputes the missing values but also produces an estimate that combines the multiple imputed datasets. The pooling method is Rubin's Rules (RR), which can pool parameter estimates such as mean differences, regression coefficients and standard errors, and then derive confidence intervals and p-values. The pooling logic is briefly introduced below.

After this routine introduction of MI, let's talk about how to implement the MI model in SAS to deal with actual missing data. I'm also planning to compare the SAS procedure with the `rbmi` R package in the next article. To be honest, I tend to use R instead of SAS in my actual work, so I would like to introduce more uses of R in clinical trials.

Here is an example dataset from an antidepressant clinical trial of an active drug versus placebo. The relevant endpoint is the Hamilton 17-item depression rating scale (HAMD17), which was assessed at baseline and at weeks 1, 2, 4, and 6. This example comes from the `rbmi` package, so that I can use the same dataset in R programming. I do some pre-processing and transposing, however, to meet the data structure required by the MI procedure in SAS.

```r
library(rbmi)
library(tidyverse)

data("antidepressant_data")
dat <- antidepressant_data

# Use expand_locf to add rows corresponding to visits with missing outcomes to the dataset
dat <- expand_locf(
  dat,
  PATIENT = levels(dat$PATIENT), # expand by PATIENT and VISIT
  VISIT = levels(dat$VISIT),
  vars = c("BASVAL", "THERAPY"), # fill with LOCF BASVAL and THERAPY
  group = c("PATIENT"),
  order = c("PATIENT", "VISIT")
)

dat2 <- pivot_wider(
  dat,
  id_cols = c(PATIENT, THERAPY, BASVAL),
  names_from = VISIT,
  names_prefix = "VISIT",
  values_from = HAMDTL17
)
write.csv(dat2, file = "./antidepressant.csv", na = "", row.names = FALSE)
```

And then import the csv file into SAS, as shown below.

Next, we should examine the missing data pattern in the dataset by using zero imputations (`nimpute=0`).

```sas
proc mi data=antidepressant nimpute=0;
    var BASVAL VISIT4 VISIT5 VISIT6 VISIT7;
run;
```

The above graph indicates that there is one patient who doesn’t fit the monotone missing data pattern, so the missing pattern is non-monotone. As for which MI method should be used, the MISSING DATA PATTERNS section and Table 4 of this article (MI FOR MI, OR HOW TO HANDLE MISSING INFORMATION WITH MULTIPLE IMPUTATION) can be used as references. I will select the MCMC (Markov Chain Monte Carlo) method for the multiple imputation below.

For the first step, I choose MCMC full-data imputation with `impute=full` and specify the BY statement to obtain separately imputed datasets for each treatment group. I also specify the seed, but keep in mind that it applies only to the first group; the second group does not use the seed you define, but another fixed seed that follows from it.

```sas
proc sort data=antidepressant;
    by THERAPY;
run;

proc mi data=antidepressant seed=12306 nimpute=100 out=imputed_data;
    mcmc chain=single impute=full initial=em (maxiter=1000) niter=1000 nbiter=1000;
    em maxiter=1000;
    by THERAPY;
    var BASVAL VISIT4 VISIT5 VISIT6 VISIT7;
run;
```

The second step is to fit the analysis model to each of the imputed datasets created in the first step. Assume that I want to estimate the endpoint, the change from baseline at week 6, as the LS mean of each treatment from an ANCOVA model, together with the difference between them. The model includes treatment as a fixed effect and baseline as a covariate.

`data imputed_data2; set imputed_data; CHG = VISIT7 - BASVAL;run;proc sort; by _imputation_; run;ods output lsmeans=lsm diffs=diff; proc mixed data=imputed_data2; by _imputation_; class THERAPY; model CHG=BASVAL THERAPY /ddfm=kr; lsmeans THERAPY / cl pdiff diff;run;`

We can see the LS mean for each imputation (`_Imputation_`) in the `lsm` dataset, where each imputation has two rows (drug and placebo), and the difference between the two groups in the `diff` dataset, as shown below.

The third step is to pool all estimates from the second step, including the LS mean estimates and difference.

`proc sort data=lsm; by THERAPY; run;ods output ParameterEstimates=combined_lsm; proc mianalyze data=lsm; by THERAPY; modeleffects estimate; stderr stderr;run;ods output ParameterEstimates=combined_diff; proc mianalyze data=diff; by THERAPY _THERAPY; modeleffects estimate; stderr stderr;run;`

For now the imputations have been combined as shown below.

The above results indicate that the imputations have been combined and that the final estimate is calculated by Rubin's Rules (RR). The t-statistic, confidence interval and p-value are based on the t distribution, so the most important thing is how the estimate and standard error are calculated. The pooled estimate is the mean of the estimates from all imputations, and the pooled SE is the square root of `Vtotal`, which can be calculated through formulas 9.2-9.4. The formulas are cited from https://bookdown.org/mwheymans/bookmi/rubins-rules.html.
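For reference, the pooling step can be written out as below. This is only a sketch in my own notation, not a copy of the cited chapter: $m$ is the number of imputations (called `n` in the R code of this post), and $\hat{\theta}_i$ and $SE_i$ are the estimate and standard error from the $i$-th imputed dataset.

```latex
\bar{\theta} = \frac{1}{m}\sum_{i=1}^{m}\hat{\theta}_i
\qquad
V_W = \frac{1}{m}\sum_{i=1}^{m} SE_i^{2}
\qquad
V_B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{\theta}_i-\bar{\theta}\right)^{2}

V_{Total} = V_W + V_B + \frac{V_B}{m}
\qquad
SE_{pooled} = \sqrt{V_{Total}}
```

Here $V_W$ is the within-imputation variance, $V_B$ is the between-imputation variance, and the extra $V_B/m$ term accounts for using a finite number of imputations.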

To illustrate how RR is computed, I reproduce the calculation in R with the `diff` dataset as an example. I use R because it's easy for me to do the matrix operations there.

`diff <- haven::read_sas("./diff.sas7bdat")
est <- mean(diff$Estimate)
est
# [1] -2.803439
n <- nrow(diff)
Vw <- mean(diff$StdErr^2)
Vb <- sum((diff$Estimate - est)^2) / (n - 1)
Vtotal <- Vw + Vb + Vb / n
se <- sqrt(Vtotal)
se
# [1] 1.123403`

These `est` and `se` values are equal to the pooled Estimate of `-2.803439` and StdErr of `1.123403` in SAS.

If you are interested in the other parameters, you may ask how the DF and t-statistic are calculated. I recommend reading the entire chapter mentioned above to understand the complete process, which is not covered in detail here. Once the DF and t-statistic are determined, the confidence interval and p-value can be computed easily from the t distribution.
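As a rough sketch only, the remaining quantities can be computed in base R with the classical Rubin (1987) degrees of freedom. The function name `pool_rubin()` and the toy inputs are my own assumptions for illustration; SAS `proc mianalyze` may apply adjusted degrees of freedom depending on its options, so the numbers are not guaranteed to match in every setting.

```r
# Pool m estimates and standard errors with Rubin's Rules (hypothetical helper)
pool_rubin <- function(est, se, conf_level = 0.95) {
  m <- length(est)
  q_bar <- mean(est)                       # pooled estimate
  v_w <- mean(se^2)                        # within-imputation variance
  v_b <- var(est)                          # between-imputation variance
  v_total <- v_w + (1 + 1 / m) * v_b       # total variance
  r <- (1 + 1 / m) * v_b / v_w             # relative increase in variance
  df <- (m - 1) * (1 + 1 / r)^2            # classical Rubin degrees of freedom
  se_pooled <- sqrt(v_total)
  t_stat <- q_bar / se_pooled
  crit <- qt(1 - (1 - conf_level) / 2, df)
  data.frame(
    estimate = q_bar,
    se = se_pooled,
    df = df,
    t = t_stat,
    lower = q_bar - crit * se_pooled,
    upper = q_bar + crit * se_pooled,
    p_value = 2 * pt(-abs(t_stat), df)
  )
}

# Toy input: three imputations, made-up numbers
res <- pool_rubin(est = c(1, 2, 3), se = c(1, 1, 1))
res
```

With the `diff` dataset from SAS, the call would look like `pool_rubin(diff$Estimate, diff$StdErr)`.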

Multiple imputation is a recommended and useful tool in clinical trials. How robust the resulting parameter estimates are depends on the missing data pattern and on choosing an imputation method that suits it.

In the next article, I will try to illustrate how to use MI in non-inferiority and superiority trials.

Multiple Imputation using SAS and R Programming

MI FOR MI, OR HOW TO HANDLE MISSING INFORMATION WITH MULTIPLE IMPUTATION

Chapter9 Rubin's Rules

We all know that there are two common methods to compute the confidence limits for a Hazard Ratio in the SAS `PHREG` procedure.

- Wald's Confidence Limits
- Profile-Likelihood Confidence Limits

However, in R, we commonly use the `confint()` or `summary()` function to compute the CI from the `coxph` model, which assumes normality. So it is identical to Wald's CI.

You can also compute it manually from `EST ± SE * Z`, as shown below.

`library(survival)
m <- coxph(Surv(time, status) ~ ph.ecog, data = na.omit(lung))
ss <- summary(m)
coef <- coef(m)
se <- ss$coefficients[, "se(coef)"]
c(exp(coef - qnorm(0.975) * se), exp(coef + qnorm(0.975) * se))`

But what's the weakness of Wald's CI? Refer to Why and When to Use Profile Likelihood Based Confidence Intervals

This blog says that since the standard errors of the general linear model are based on asymptotic variance, they may not be a good estimator of standard error for small samples. In particular, Wald Confidence Intervals may not perform very well. One should only use the Wald Confidence Interval if the likelihood function is symmetric about the MLE.

So what's the superiority of the Profile Likelihood CI?

In cases where the likelihood function is not symmetric about the MLE, the Profile Likelihood Based Confidence Interval serves better. This is because the Profile Likelihood Based Confidence Interval is based on the asymptotic chi-square distribution of the log likelihood ratio test statistic.

If you use the SAS `PHREG` procedure, you can simply set the `rl` (`risklimits`) option of the MODEL statement to `pl` to get the Profile Likelihood CI for the hazard ratio.

Unfortunately, you cannot get it from the `coxph()` function in the `survival` package. I have tried the `coxphf()` function from the `coxphf` package, but the CI is not identical to SAS, with small differences in a few decimal places. Note that `coxphf()` fits the Cox model with Firth's penalized likelihood, so small differences from an unpenalized fit are not surprising.

`m2 <- coxphf::coxphf(formula = Surv(time, status) ~ ph.ecog, pl = TRUE, data = na.omit(lung))
summary(m2)`

Anyway, this is an alternative way to compute the Profile Likelihood CI.

It mainly introduces four inference methods:

- One-Proportion Inference
- One-Mean Inference
- Two-proportion inference
- Two-mean inference

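When you want to test whether a population proportion differs from a fixed value, you can use **One-Proportion Inference**.

For example, suppose you conjecture that the proportion of women in some churches exceeds 55%. Here 55% is the proportion under test, and the hypotheses are `H0: pi=0.55` and `Ha: pi>0.55`. We then find that 62 of the 100 people in one church are women, so `phat=62/100`. Do we have enough evidence to reject the null hypothesis? After all, this is only sample data from a single church, so we turn to simulation and compute the P-value of the test assuming the null hypothesis is true. We therefore use the `do` and `rflip` functions to build 1000 simulated trials with `pi=0.55`.

`library(mosaic)
pi <- 0.55          # probability of success for each toss
n <- 100            # Number of times we toss the penny (sample size)
trials <- 1000      # Number of trials (number of samples)
observed <- 62      # Observed number of heads
phat = observed / n # p-hat - the observed proportion of heads
data.sim <- do(trials) * rflip(n, prob = pi)`

Then count how many of these 1000 simulations give a proportion greater than or equal to `phat`; dividing that count by the total number of trials gives the "P-value" we want. This is fairly easy to understand. Because the result comes from simulation, it will not match the value in the reference exactly, but as the number of simulations grows, the result settles around a stable value.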

`pvalue <- sum(data.sim$prop >= phat) / trials`
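The same idea can be sketched without `mosaic`, using base R only. This is a minimal sketch under my own assumptions (an arbitrary seed, 10000 trials instead of 1000, and `rbinom()` in place of `rflip()`); it also compares the simulated P-value with the exact binomial tail probability to show that the simulation converges to it.

```r
set.seed(2023)        # arbitrary seed, only for reproducibility
pi0 <- 0.55           # proportion under the null hypothesis
n <- 100              # sample size
trials <- 10000       # number of simulated samples
observed <- 62        # observed number of women

# Each simulated sample is the number of "successes" out of n under H0
sim_counts <- rbinom(trials, size = n, prob = pi0)

# One-sided simulated P-value: share of samples at least as extreme as observed
p_sim <- mean(sim_counts >= observed)

# Exact one-sided P-value, P(X >= observed), from the binomial distribution
p_exact <- pbinom(observed - 1, size = n, prob = pi0, lower.tail = FALSE)

c(simulated = p_sim, exact = p_exact)
```

The two values should be close, and they get closer as `trials` increases.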

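When the value you are testing is a mean rather than a proportion, consider **One-Mean Inference**.

For example, you conjecture that a car's average miles per gallon is not equal to 22. Here `22` is the mean under test, and the hypotheses are `H0: μ=22` and `Ha: μ≠22`. We then observe that the mean miles per gallon (`mpg`) in the `mtcars` dataset is 20.09. We can now resample the `mtcars` dataset to make inference about this hypothesis.

`mu <- 22
observed <- mean(~mpg, data=mtcars, na.rm=T)
paste("Observed value for sample mean: ", observed)
trials <- 1000
samples <- do(trials) * mosaic::mean(~mpg, data=resample(mtcars))`

Based on the simulated data in `samples`, we can roughly compute a 95% confidence interval.

`# Let's compute a 95% Confidence Interval
(ci <- quantile(samples$mean, c(0.025, 0.975)))
#     2.5%    97.5%
# 18.18406 22.23453`

The confidence interval contains the null value `μ=22`, so as a first judgment we cannot reject the null hypothesis; that is, the data do not support the conjecture that the average miles per gallon differs from `22`.

Next, from the resampled data, compute the proportion of resampled `mpg` means that are greater than or equal to `22`. Because the hypothesis is two-sided, the P-value is finally multiplied by 2.

`pvalue <- sum(samples$mean >= mu) / trials
paste("Two-sided p-value is", 2 * pvalue)
# [1] "Two-sided p-value is 0.088"`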

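When you want to know whether two proportions differ significantly, consider **Two-proportion inference**.

For example, you conjecture that the proportion of women who support a policy differs from the proportion of men who support it. These two proportions are what we need to compare, and the hypotheses are `H0: π1=π2` and `Ha: π1≠π2`. In a sample we observe that `p1=62/100` of women support the policy, while for men it is `p2=51/100`. The data are as follows:

`df <- rbind(
  do(38) * data.frame(Group = "Men", Support = "no"),
  do(62) * data.frame(Group = "Men", Support = "yes"),
  do(49) * data.frame(Group = "Women", Support = "no"),
  do(51) * data.frame(Group = "Women", Support = "yes")
)
(df.summary <- tally(Support ~ Group, data=df))`

Next, compute the difference in proportions between the two groups, which is simply `0.62-0.51=0.11`, or

`observed <- diffprop(Support ~ Group, data = df)
paste("Observed difference in proportions: ", round(observed, 3))
# [1] "Observed difference in proportions: 0.11"`

Then simulate by shuffling the group labels, and look at the null distribution of the difference between the two groups after shuffling.

`trials <- 1000
null.dist <- do(trials) * diffprop(Support ~ shuffle(Group), data=df)
histogram( ~ diffprop, data= null.dist,
  xlab = "Differences in proportions",
  main = "Null distribution for differences in proportions",
  v = observed)`

Finally, compute the P-value from the null distribution to decide whether the null hypothesis can be rejected. In other words, if H0 is true, what is the probability of seeing a difference of `0.11` (or a more extreme value), and is it small (e.g. less than 0.05)? If the P-value is small, we can reject the null hypothesis; if it is large, we cannot.

`p.value <- prop(~diffprop >= observed, data= null.dist)
paste(" One-sided p-value: ", round(p.value, 3))
# [1] " One-sided p-value: 0.088"`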

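When you want to know whether two means differ significantly, consider **Two-mean inference**.

For example, you conjecture that the mean mercury content found in albacore tuna differs from that found in yellowfin tuna. These two means are what we need to compare, and the hypotheses are `H0: μ1=μ2` and `Ha: μ1≠μ2`. In the dataset `tuna.txt` we observe an albacore mean of `0.35763` and a yellowfin mean of `0.35444`, a difference of `-0.003`. We can then shuffle the group labels of the `tuna` dataset to simulate and make inference about this hypothesis.

`df <- read.delim("http://citadel.sjfc.edu/faculty/ageraci/data/tuna.txt")
str(df)
favstats(~Mercury | Tuna, data=df)
observed <- diffmean(~Mercury | Tuna, data=df, na.rm = T)
paste("Observed difference in the means: ", round(observed, 3))
# [1] "Observed difference in the means: -0.003"`

Then, as in `Two-proportion inference`, use the `shuffle` function to shuffle the groups, simulate 1000 times, and compute the P-value to judge whether the observed value is extreme when the null hypothesis holds, and finally whether the null hypothesis can be rejected.

`trials <- 1000
null.dist <- do(trials) * diffmean(Mercury ~ shuffle(Tuna), data=df, na.rm = T)
pvalue <- prop(~ diffmean <= observed, data=null.dist)
paste("The one-sided p-value is ", round(pvalue, 3))
# [1] "The one-sided p-value is 0.428"`

The above is a brief record of the reference material. Simulation is a very interesting method and is fairly common in clinical trials, so it is worth studying further; the simulations introduced here are very simple and easy-to-understand examples.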

Chapter 7 Simulation-based Inference

Simulation-based inference with mosaic

MOSAIC R packages

Here we don't talk about how to determine which type of missingness your data have; for that, you can refer to the article Multiple Imputation.

Or a summary (Missing data assumptions and corresponding imputation methods) in Multiple imputation as a valid way of dealing with missing data

Let's keep it practical and focus on how to impute missing data. For example, LOCF (Last Observation Carried Forward) is a widely used method for imputing missing data in clinical trials. It fills in missing values at later visits with the last observed value, but that can lead to biased results. Other methods such as BOCF (Baseline Observation Carried Forward), WOCF (Worst Observation Carried Forward), and Multiple Imputation are also used, but are rarely seen in oncology studies. The last common method, MMRM (Mixed-effect Model for Repeated Measures), is used for continuous data with missing values.

Given that SAS is still the dominant delivery program, here I will record how to use SAS to handle missing data. However, I also strongly suggest using R as the alternative or QC program, as I believe R will be accepted by regulatory authorities, at least as an optional delivery program. Therefore, in another article, I'm going to record how to use the `rbmi` package to deal with missing data with methods like LOCF and multiple imputation, and compute the LS means with an ANCOVA model.

Here we create a dummy dataset with 3 columns: `usubjid`, `avisitn` and `aval`.

`data data;
  input usubjid $8. avisitn aval;
  datalines;
1001-101 0 85
1001-101 1 84
1001-101 2 86
1001-101 3 .
1001-101 4 .
1001-101 5 85
1002-101 0 90
1002-101 1 .
1002-101 2 91
1002-101 3 92
1002-101 4 .
1002-101 5 .
;
run;
proc sort; by usubjid avisitn; run;`

Actually there are several ways to implement LOCF; see LOCF-Different Approaches, Same Results Using LAG Function, RETAIN Statement, and ARRAY Facility. I usually use the `RETAIN` statement as it's easy to understand and also very elegant. This brings us to the final code: whenever `usubjid` changes, the retained `rn` variable is reset to missing (.). The `if` statement then checks whether the current `aval` is non-missing: if it is, `rn` is updated to that value; otherwise `rn`, which holds the last non-missing `aval` within the same `usubjid`, is carried forward into `aval_locf`.

`data locf; length dtype $10.; retain rn; set data; by usubjid avisitn; if first.usubjid then rn=.; if aval ne . then do; rn=aval; aval_locf=aval; end; else do; aval_locf=rn; dtype="LOCF"; end;run;`

We can see the final dataset below with the LOCF'ed variable, `aval_LOCF`.
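For QC purposes, the same carry-forward logic can be sketched in base R. This is a minimal sketch under my own assumptions: the helper `locf()` is a hypothetical name, the data frame mirrors the SAS dummy data above, and the rows are assumed to be already sorted by `usubjid` and `avisitn`.

```r
# Carry the last non-missing value forward within a vector (hypothetical helper)
locf <- function(x) {
  idx <- cumsum(!is.na(x))       # how many non-missing values seen so far
  c(NA, x[!is.na(x)])[idx + 1]   # idx == 0 (nothing observed yet) maps to NA
}

dat <- data.frame(
  usubjid = rep(c("1001-101", "1002-101"), each = 6),
  avisitn = rep(0:5, times = 2),
  aval    = c(85, 84, 86, NA, NA, 85, 90, NA, 91, 92, NA, NA)
)

# Apply LOCF separately within each subject
dat$aval_locf <- ave(dat$aval, dat$usubjid, FUN = locf)

# Flag the records that were imputed, as DTYPE does in the SAS code
dat$dtype <- ifelse(is.na(dat$aval) & !is.na(dat$aval_locf), "LOCF", "")
dat
```

Grouping with `ave()` plays the role of `first.usubjid` in the SAS data step: the carried value never leaks from one subject into the next.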

The BOCF and WOCF methods are also conservative like LOCF, and their programming logic is roughly the same. The former can be used when subjects drop out due to adverse events, while the latter can be used when they drop out due to lack of efficacy (LOE).

For Multiple Imputation(MI), it's more robust than LOCF, as it has multiple imputations.

The procedures for Multiple Imputation are generally the same in both SAS and R:

- Imputation: the missing data are imputed `m` times, generating `m` complete datasets with a specified model or distribution.
- Analysis: each of these datasets is analyzed using a certain statistical model or function, generating `m` sets of estimates.
- Pooling: the `m` sets of estimates are combined into one MI result with an appropriate method, like Rubin's Rules (RR), which are specifically designed to pool parameter estimates and are also wrapped into SAS and R packages.

These procedures are easy to understand, so how do we implement them?

- In SAS, you can use the `proc mi` procedure for imputation, select a statistical model such as `proc mixed` for analysis, and lastly use the `proc mianalyze` procedure for pooling.
- In R, although there are several packages available, I personally prefer the `mice` and `rbmi` packages, which will be introduced in other articles.

Compared to the above two methods, the MMRM (Mixed-effect Model for Repeated Measures) method does not impute individual missing values. Instead, it treats each individual as a random effect and uses all the observed data in the model, so the missing data are handled (implicitly imputed) by the model itself.

So it can be seen that MMRM does well in controlling type I error but LOCF may lead to the inflation of type I error. Although the MI method can also control Type I error, it is more conservative than MMRM because it will underestimate the treatment effect.

There is a really impressive article that compares MMRM with MI and discusses the regulatory authorities' considerations on this topic. It is well worth reading: Handling of Missing Data: Comparison of MMRM (mixed model repeated measures) versus MI (multiple imputation).

- In SAS, you can simply use the `proc mixed` procedure to fit the mixed model with a maximum likelihood-based method.
- In R, the `nlme` package is commonly used, but the new `mmrm` package offers advanced functionality (just heard before...).

Multiple Imputation

SAS LOCF For Multiple Variables

SAS LOCF

LOCF-Different Approaches, Same Results Using LAG Function, RETAIN Statement, and ARRAY Facility

LOCF Method and Application in Clinical Data Analysis

临床试验中缺失数据的预防与处理

Handling of Missing Data: Comparison of MMRM (mixed model repeated measures) versus MI (multiple imputation)

`ggplot2`.

I encounter this question when I want to map two different color ranges to the `col` aesthetic, for example in `geom_line` and `geom_text`. Sometimes I may choose another way to visualize the data to avoid this situation, but I really want to know how to solve it if I have to use this color strategy.

After searching on Google, I found the best solution, thanks to Elio Campitelli, the author of the `ggnewscale` package. He demonstrates how to implement two color scales and explains the principle behind it. You can refer to this article, Multiple color (and fill) scales with ggplot2.

Here I just show how it works. Firstly we prepare the dummy data.

`library(tidyverse)
library(ggnewscale)
set.seed(123)
data <- tibble(
  id = rep(1:5, each = 4),
  day = sample(5:20, 20, replace = TRUE),
  linecol = str_c("col", id),
  day2 = day + 2,
  label = rep(c("Group1", "Group2"), each = 10)
)`

And then I'd like to draw a line plot with labels around it. The line colors are determined by the `linecol` variable, while the label colors are determined by the `label` group. Let's first look at the failing example, which unsurprisingly doesn't work.

`data %>%
  ggplot(aes(x = day, y = id)) +
  geom_line(aes(col = linecol)) +
  scale_color_manual(values = c("red", "orange", "yellow", "green", "blue")) +
  geom_text(aes(label = label, col = label)) +
  scale_color_manual(values = c("blue", "orange"), guide = NULL)
# Error
# Scale for colour is already present.
# Adding another scale for colour, which will replace the existing scale.
# Error in palette():
# ! Insufficient values in manual scale. 7 needed but only 2 provided.`

To solve it, you just need to add one line of code, as mentioned in the reference article: `structure(ggplot2::standardise_aes_names("colour"), class = "new_aes")`, or the function `new_scale_color()` that wraps it in the `ggnewscale` package. If you want to use `scale_fill_*`, replacing "colour" with "fill" is fine.

So the final code without any error is shown below.

`data %>%
  ggplot(aes(x = day, y = id)) +
  geom_line(aes(col = linecol)) +
  scale_color_manual(values = c("red", "orange", "yellow", "green", "blue")) +
  structure(ggplot2::standardise_aes_names("colour"), class = "new_aes") +
  # new_scale_color() +
  geom_text(aes(label = label, col = label)) +
  scale_color_manual(values = c("blue", "orange"), guide = NULL)`

Admittedly, this solution is not very common or official. I really hope it can be merged into the `ggplot2` family so that I only need to import one package. Ah!

This method follows Call ChatGPT (or really any other API) from R, which published its tutorial as early as March 2nd!

An example is shown below:

`library(httr)
api_key <- "sk-Zzpcse7C0Mabe461NvEbToA3g765nYnmFwGgZ5b"
response <- POST(
  # curl https://api.openai.com/v1/chat/completions
  url = "https://api.openai.com/v1/chat/completions",
  # -H "Authorization: Bearer $OPENAI_API_KEY"
  add_headers(Authorization = paste("Bearer", api_key)),
  # -H "Content-Type: application/json"
  content_type_json(),
  # -d '{
  #   "model": "gpt-3.5-turbo",
  #   "messages": [{"role": "user", "content": "What is a banana?"}]
  # }'
  encode = "json",
  body = list(
    model = "gpt-3.5-turbo",
    messages = list(list(role = "user", content = "How to use ChatGPT API in R?"))
  )
)
chatGPT_answer <- content(response)$choices[[1]]$message$content
chatGPT_answer <- stringr::str_trim(chatGPT_answer)
cat(chatGPT_answer)`

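First, create an API key on the OpenAI API page; you need to log in with an OpenAI account.

Then click `Create new secret key`, which produces a string of characters similar to the `sk-Zzpcse7C0Mabe461NvEbToA3g765nYnmFwGgZ5b` in the example code above (note: the key in the example code is fake and cannot be used, because I changed some characters of my own key).

Finally, change the text after `content` in the example code to whatever you want to ask, and you can interact with ChatGPT.

You can also wrap this into a function for convenient reuse, for example: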

`# Calls the ChatGPT API with the given prompt and returns the answer
ask_chatgpt <- function(prompt) {
  response <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    content_type_json(),
    encode = "json",
    body = list(
      model = "gpt-3.5-turbo",
      messages = list(list(
        role = "user",
        content = prompt
      ))
    )
  )
  stringr::str_trim(content(response)$choices[[1]]$message$content)
}
answer <- ask_chatgpt("How to use ChatGPT API in R?")
cat(answer)`

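Overall it's quite convenient, right?

`chatgpt`

Is there an R package for ChatGPT? You can refer to ChatGPT coding assistant for RStudio.

It not only provides the function `ask_chatgpt()` for interacting with ChatGPT, but also other interesting features; see its documentation for details.

Its usage is similar to the API method above: first set the API key, then call the corresponding function. It is a bit more concise, for example: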

`Sys.setenv(OPENAI_API_KEY = "sk-Zzpcse7C0Mabe461NvEbToA3g765nYnmFwGgZ5b") # fake key, don't use it.
library(chatgpt)
cat(ask_chatgpt("How to use ChatGPT API in R?"))`

*** ChatGPT input:

How to use ChatGPT API in R?

You can use the `httr` package in R to interact with ChatGPT's API. Here's a sample code to get started.

1. Install `httr` package via `install.packages("httr")`

`# Load httr library
library(httr)
# Set the API parameters
url <- "https://api.chatgpt.com/chat"
body <- list(
  query = "Hi, how are you?",
  token = "<your-api-token>"
)
# Send a POST request to the endpoint
response <- POST(url, body = body)
# Extract the response content as a string
content(response, as = "text")`

In this example, the `url` variable stores the endpoint URL of ChatGPT API. `body` has two parameters:

- `query`: The text you want to send to ChatGPT API as input.
- `token`: Your API token provided by ChatGPT.

After setting the variables, the `POST` function from `httr` package sends a POST request to the API endpoint. Finally, the response content is extracted as a string using `content` function. Make sure to replace `<your-api-token>` with your ChatGPT API token before running the

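ChatGPT is truly amazing. As long as the web version of ChatGPT doesn't crash, I will still prefer the web version, because at this stage I only use it for search rather than for development.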

`mutate` function from the `dplyr` package in R. All the approaches come from this discussion on Stack Overflow. I keep a record here for convenient future reference.

First of all, I show one wrong way that I've used before. Suppose you have the dummy data below and would like to split each string on the `_` delimiter and keep the first half.

`library(tidyverse)
data <- tibble(
  label = c("a_1", "b_2", "c_3", "d_4", "e_5")
)`

From past experience, I got used to splitting the label column with `str_split(label, "_")[[1]][1]`. But that does not give the correct output: the values are all "a". You can see it below or try it yourself.

`data %>% mutate(sublabel = str_split(label, "_")[[1]][1])
# A tibble: 5 × 2
#   label sublabel
#   <chr> <chr>
# 1 a_1   a
# 2 b_2   a
# 3 c_3   a
# 4 d_4   a
# 5 e_5   a`

Obviously that's wrong. The correct approaches are listed below, summarized from that Stack Overflow discussion.

- Add the `simplify = T` argument, which returns a character matrix instead of a list, so that I can use `[,1]` to extract the first half.
  `data %>% mutate(sublabel = str_split(label, "_", simplify = T)[,1])`
- Use the `separate()` function instead of `str_split()`, a very clever way to avoid the error.
  `data %>% separate(label, c("sublabel1", "sublabel2"))`
- Similar to the first one, but extract the first half in a more straightforward and explicit way with the `map_chr()` function, which applies a function to each element of a list. So to select the first item of each list element, just use `map_chr(., 1)`.
  `data %>% mutate(sublabel = str_split(label, "_") %>% map_chr(., 1))`
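The reason the first attempt fails is that splitting a vector returns a list with one element per string, and `[[1]][1]` only looks at the first string. A minimal base R sketch of the same point, using `strsplit()` instead of `str_split()` (my substitution, so that no package is needed):

```r
label <- c("a_1", "b_2", "c_3", "d_4", "e_5")

# Splitting a vector gives a list: one character vector per input string
parts <- strsplit(label, "_", fixed = TRUE)

# The wrong way: first piece of the FIRST string only, later recycled to every row
wrong <- parts[[1]][1]

# The right way: take the first piece of EACH list element
first_half <- vapply(parts, function(p) p[1], character(1))

# Alternative without splitting: drop everything from the first "_" onwards
first_half_regex <- sub("_.*$", "", label)

first_half
```

Inside `mutate()`, the single value `wrong` is recycled down the whole column, which is why every row shows "a".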

This is a brief post, and I hope it will be a reminder for me when I forget something.

I was also confused about the distinction between these two estimates when I was new to drug trials and was asked to calculate the follow-up time.

For the median survival time, I suppose many people know how to obtain it from the Kaplan-Meier curve. But how about the median follow-up time? When we calculate the follow-up time, we cannot assume that all subjects are still ongoing. Thus we need a reasonable way to deal with the subjects who have already completed; if we directly take the median of the observed times, the result will be an underestimate.

Schemper and Smith describe a very clear way to calculate the median follow-up time with the reverse Kaplan-Meier curve. Think of a tumor trial in which the event of interest is actually the loss to follow-up: for subjects who had the original event, we cannot know how long they would have been followed if that event hadn't happened. To make this calculation more analytical, Schemper and Smith suggest using the Kaplan-Meier curve with the reversed status indicator (in R terms, `1` indicates a subject who is censored, and `0` indicates a subject with the original event), where the median survival time can then be interpreted as the median follow-up time.

Suppose we have a set of OS survival data where `status` 0 indicates censoring and 1 indicates death. In R, the reverse Kaplan-Meier estimate can be obtained as below (`surv_median()` comes from the `survminer` package).

`library(survival)
library(survminer)
fit <- survfit(Surv(time, status == 0) ~ 1, data = os_data)
surv_median(fit)`
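To see the effect of reversing the status indicator, here is a small self-contained sketch that needs only the `survival` package. The data frame `os_toy` is made up for illustration (status 1 = death, 0 = censored), and the median is read from the `table` element of `summary()` rather than with `surv_median()`.

```r
library(survival)

# Hypothetical toy data: status 1 = death, 0 = censored (still in follow-up)
os_toy <- data.frame(
  time   = c(2, 4, 6, 8, 10),
  status = c(1, 0, 1, 0, 0)
)

# Reverse Kaplan-Meier: censoring becomes the "event", death becomes censoring
fit_fu <- survfit(Surv(time, status == 0) ~ 1, data = os_toy)
median_fu <- unname(summary(fit_fu)$table["median"])

# Naive median of the observed times, for comparison
median_naive <- median(os_toy$time)

c(reverse_km = median_fu, naive = median_naive)
```

In this toy example the naive median is smaller than the reverse Kaplan-Meier median, which is the underestimation described above.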

- M Schemper and TL Smith. A note on quantifying follow-up in studies of failure time. Controlled clinical trials (1996) vol. 17 (4) pp. 343-346
- Determining the median followup time

In the trials, we would actually draw up a plan to define the rules for how to impute the partial date. But here, I simplify the imputation rule as shown below to illustrate its implementation in R and SAS:

- If the day of analysis start date is missing then impute the first day of the month. If both the day and month are missing then impute to 01-Jan.
- If the day of analysis end date is missing then impute the last day of the month. If both the day and month are missing then impute to 31-Dec.
- If the imputed analysis end date is after the last alive date then set it to the last alive date.

Firstly, let’s create dummy data in SAS that includes four variables.

`data dummy;
  length USUBJID $20. LSTALVDT $20. AESTDTC $20. AEENDTC $20.;
  input USUBJID $ LSTALVDT $ AESTDTC $ AEENDTC $;
  datalines;
SITE01-001 2023-01-10 2019-06-18 2019-06-29
SITE01-001 2023-01-10 2020-01-02 2020-02
SITE01-001 2023-01-10 2022-03 2022-03
SITE01-001 2023-01-10 2022-06 2022-06
SITE01-001 2023-01-10 2023 2023
;
run;`

- `USUBJID`, unique subject identifier.
- `LSTALVDT`, last known alive date.
- `AESTDTC`, start date of adverse event.
- `AEENDTC`, end date of adverse event.

We can see from the above rules that concatenating "01" to a date that is missing the day is very easy. However, to calculate `AENDT` we need to know the last day of each month, for example the 28th or 29th, the 30th or 31st. So we need to apply the `intnx` function to get the last day correctly.

`data dummy_2; set dummy; if length(AESTDTC)=7 then do; ASTDTF="D"; ASTDT=catx('-', AESTDTC, "01"); end; else if length(AESTDTC)=4 then do; ASTDTF="M"; ASTDT=catx('-', AESTDTC, "01-01"); end; else if length(AESTDTC)=10 then ASTDT=AESTDTC; if length(AEENDTC)=7 then do; AENDTF="D"; AEENDTC_=catx('-', AEENDTC, "01"); AENDT=put(intnx('month', input(AEENDTC_,yymmdd10.), 0, 'E'), yymmdd10.); end; else if length(AEENDTC)=4 then do; AENDTF="M"; AENDT=catx('-', AEENDTC, "12-31"); end; else if length(AEENDTC)=10 then AENDT=AEENDTC; if input(AENDT,yymmdd10.)>input(LSTALVDT,yymmdd10.) then AENDT=LSTALVDT; drop AEENDTC_;run;`

From the output we can see that when the day of the date is missing, the imputation flag variable, such as `ASTDTF`, is set to "D". If the month of the date is missing, it is set to "M". The code also handles leap years and sets the date to the last alive date if the imputed date is later than the last alive date. So I suppose all the dates have been imputed correctly.

Then let’s create the same dummy to see how to conduct the rules in R.

`library(tidyverse)
library(lubridate)
dummy <- tibble(
  USUBJID = "SITE01-001",
  LSTALVDT = "2023-01-10",
  AESTDTC = c("2019-06-18", "2020-01-02", "2022-03", "2022-06", "2023"),
  AEENDTC = c("2019-06-29", "2020-02", "2022-03", "2022-06", "2023")
)`

The dummy data is shown below.

`# A tibble: 5 × 4
#   USUBJID    LSTALVDT   AESTDTC    AEENDTC
#   <chr>      <chr>      <chr>      <chr>
# 1 SITE01-001 2023-01-10 2019-06-18 2019-06-29
# 2 SITE01-001 2023-01-10 2020-01-02 2020-02
# 3 SITE01-001 2023-01-10 2022-03    2022-03
# 4 SITE01-001 2023-01-10 2022-06    2022-06
# 5 SITE01-001 2023-01-10 2023       2023`

And then we follow the same rules as in SAS to impute the partial dates in R. To get the last day of the month, we'd better use the `rollback()` and `ceiling_date()` functions in the `lubridate` package, which give the correct day even in leap years. The other functions are common `tidyverse` functions for manipulating data, like `case_when()` and `select()`.

`dummy_2 <- dummy %>%
  mutate(
    ASTDTF = case_when(
      str_length(AESTDTC) == 4 ~ "M",
      str_length(AESTDTC) == 7 ~ "D"
    ),
    ASTDT_ = case_when(
      str_length(AESTDTC) == 4 ~ str_c(AESTDTC, "01-01", sep = "-"),
      str_length(AESTDTC) == 7 ~ str_c(AESTDTC, "01", sep = "-"),
      is.na(ASTDTF) ~ AESTDTC
    ),
    ASTDT = ymd(ASTDT_),
    AENDTF = case_when(
      str_length(AEENDTC) == 4 ~ "M",
      str_length(AEENDTC) == 7 ~ "D"
    ),
    AENDT_ = case_when(
      str_length(AEENDTC) == 4 ~ str_c(AEENDTC, "12-31", sep = "-"),
      str_length(AEENDTC) == 7 ~ str_c(AEENDTC, "-15"),
      is.na(AENDTF) ~ AEENDTC
    ),
    AENDT = case_when(
      str_length(AEENDTC) == 7 ~ rollback(ceiling_date(ymd(AENDT_), "month")),
      TRUE ~ ymd(AENDT_)
    ),
    AENDT = if_else(AENDT > ymd(LSTALVDT), ymd(LSTALVDT), AENDT)
  ) %>%
  select(-ASTDT_, -AENDT_)`
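The month-end step is the only non-trivial part, so here it is in isolation as a base R sketch without `lubridate`. The helper name `last_day_of_month()` is my own; it takes partial dates in `YYYY-MM` form, steps to the first day of the next month and subtracts one day.

```r
# Last calendar day of a "YYYY-MM" partial date (hypothetical helper)
last_day_of_month <- function(ym) {
  first <- as.Date(paste0(ym, "-01"))
  ends <- lapply(first, function(d) {
    next_month <- seq(d, by = "month", length.out = 2)[2]  # first day of next month
    next_month - 1                                         # step back one day
  })
  do.call(c, ends)
}

ends <- last_day_of_month(c("2020-02", "2022-03", "2022-06"))
format(ends)
```

Because the calculation goes through real calendar dates, leap years such as February 2020 are handled without any special case.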

Here we can see that the output is consistent with SAS. It's very easy in R, right? You can also use many useful functions to convert between date formats, for example from `date9.` to `yymmdd10.` with `dmy("01Jan2023")`. The `lubridate` package provides a whole series of functions for date manipulation, such as `interval()` to calculate the duration of AEs.

`# A tibble: 5 × 8
#   USUBJID    LSTALVDT   AESTDTC    AEENDTC    ASTDTF ASTDT      AENDTF AENDT
#   <chr>      <chr>      <chr>      <chr>      <chr>  <date>     <chr>  <date>
# 1 SITE01-001 2023-01-10 2019-06-18 2019-06-29 NA     2019-06-18 NA     2019-06-29
# 2 SITE01-001 2023-01-10 2020-01-02 2020-02    NA     2020-01-02 D      2020-02-29
# 3 SITE01-001 2023-01-10 2022-03    2022-03    D      2022-03-01 D      2022-03-31
# 4 SITE01-001 2023-01-10 2022-06    2022-06    D      2022-06-01 D      2022-06-30
# 5 SITE01-001 2023-01-10 2023       2023       M      2023-01-01 M      2023-01-10`

`admiral` Package

Maybe you would ask whether there is a package that can deal with date imputation for ADaM, with a set of functions that cover the common imputation situations. For that you can rely on the `admiral` package. Let me show an example here to demonstrate how to use it for imputing partial dates.

`library(admiral)
dummy %>%
  derive_vars_dt(
    dtc = AESTDTC,
    new_vars_prefix = "AST",
    highest_imputation = "M",
    date_imputation = "first"
  ) %>%
  mutate(LSTALVDT = ymd(LSTALVDT)) %>%
  derive_vars_dt(
    dtc = AEENDTC,
    new_vars_prefix = "AEND",
    highest_imputation = "M",
    date_imputation = "last",
    max_dates = vars(LSTALVDT)
  )`

Isn't the code quite straightforward? If your date variable is a datetime (DTM), you can use `derive_vars_dtm()` instead.

`# A tibble: 5 × 8
#   USUBJID    LSTALVDT   AESTDTC    AEENDTC    ASTDT      ASTDTF AENDDT     AENDDTF
#   <chr>      <date>     <chr>      <chr>      <date>     <chr>  <date>     <chr>
# 1 SITE01-001 2023-01-10 2019-06-18 2019-06-29 2019-06-18 NA     2019-06-29 NA
# 2 SITE01-001 2023-01-10 2020-01-02 2020-02    2020-01-02 NA     2020-02-29 D
# 3 SITE01-001 2023-01-10 2022-03    2022-03    2022-03-01 D      2022-03-31 D
# 4 SITE01-001 2023-01-10 2022-06    2022-06    2022-06-01 D      2022-06-30 D
# 5 SITE01-001 2023-01-10 2023       2023       2023-01-01 M      2023-01-10 M`

I'm planning to learn how to use the `admiral` package, for example, by building ADaM ADRS. I suppose this package greatly improves the R ecosystem for drug trials.

Common Dating in R: With an example of partial date imputation

Tips to Manipulate the Partial Dates

Date and Time Imputation

To reach this purpose, we just need to take two steps:

- Split the data frame by group.
- Add a blank row.

The idea is extremely clear and similar to the SAS process. Here, let's see how to complete these two steps.

Firstly, I create test data like:

`library(tidyverse)
data <- iris %>%
  group_by(Species) %>%
  slice_head(n = 3) %>%
  select(Species, everything())
data
# A tibble: 9 × 5
# Groups:   Species [3]
#   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
#   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
# 1 setosa              5.1         3.5          1.4         0.2
# 2 setosa              4.9         3            1.4         0.2
# 3 setosa              4.7         3.2          1.3         0.2
# 4 versicolor          7           3.2          4.7         1.4
# 5 versicolor          6.4         3.2          4.5         1.5
# 6 versicolor          6.9         3.1          4.9         1.5
# 7 virginica           6.3         3.3          6           2.5
# 8 virginica           5.8         2.7          5.1         1.9
# 9 virginica           7.1         3            5.9         2.1`

Now I'd like to insert rows between each `Species`, which means inserting a row between rows 3 and 4 and between rows 6 and 7. So we need to use the `group_split` function to split the data by the `Species` variable.

`data %>% group_split(Species)`

The output class is a list, so the next step is to convert this list into a data frame with blank rows. We can use the functional programming tool `purrr`, which has a map function `map_dfr` for this. It applies a function (here, `add_row`) to each element of the list.

`data %>%
  group_split(Species) %>%
  map_dfr(~add_row(.x, .after = Inf))
# A tibble: 12 × 5
#    Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
#    <fct>             <dbl>       <dbl>        <dbl>       <dbl>
#  1 setosa              5.1         3.5          1.4         0.2
#  2 setosa              4.9         3            1.4         0.2
#  3 setosa              4.7         3.2          1.3         0.2
#  4 NA                 NA          NA           NA          NA
#  5 versicolor          7           3.2          4.7         1.4
#  6 versicolor          6.4         3.2          4.5         1.5
#  7 versicolor          6.9         3.1          4.9         1.5
#  8 NA                 NA          NA           NA          NA
#  9 virginica           6.3         3.3          6           2.5
# 10 virginica           5.8         2.7          5.1         1.9
# 11 virginica           7.1         3            5.9         2.1
# 12 NA                 NA          NA           NA          NA`

The above output is what I expected. And I feel the R programming is more brief and clear than SAS, do you think so?

In R the most simple function to replace NA is `replace()`

or `is.na()`

functions.

`library(tidyverse)data <- tibble( a = c(1, 2, NA, 3, 4), b = c(5, NA, 6, 7, 8), c = c(9, 10, 11, NA, 12))`

For instance, if we want to replace NAs in all columns, the simple functions can be used like:

`# option 1: overwrite the NAs in place
data[is.na(data)] <- 0
# option 2: return a copy with the NAs replaced
replace(data, is.na(data), 0)`

In a more realistic scenario, we have numeric and character columns at the same time, not only numeric ones as in the example above. The previous method is then inconvenient, because we must first select the numeric or character columns and then replace NA with an appropriate value. After searching on Google, I think the simpler way is to use `dplyr::mutate_if()` to select columns of a specific type, and `replace_na()` to replace the NAs.

`data <- tibble(
  num1 = c(NA, 1, NA),
  num2 = c(2, NA, 3),
  chr1 = c("a", NA, "b"),
  chr2 = c("c", "d", NA)
)
data %>%
  mutate_if(is.numeric, ~replace_na(., 0)) %>%
  mutate_if(is.character, ~replace_na(., "xx"))`

To be honest, I prefer these combined functions because I am used to the pipe operator `%>%` in R, so functions like `mutate_if()`, `mutate_all()` and `mutate_at()` from the `tidyverse` packages are very convenient for me.

For instance, to replace NAs with 0 in columns selected by name or index:

`data %>% mutate_at(c(1,2), ~replace_na(., 0))`
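Since dplyr 1.0.0 the scoped verbs `mutate_if()`, `mutate_at()` and `mutate_all()` are superseded by `across()`. A minimal sketch of the equivalent calls, assuming a recent dplyr and the same mixed-type `data` tibble as above (`filled_by_type` and `filled_by_pos` are illustrative names):

```r
library(dplyr)
library(tidyr)
library(tibble)

data <- tibble(
  num1 = c(NA, 1, NA),
  num2 = c(2, NA, 3),
  chr1 = c("a", NA, "b"),
  chr2 = c("c", "d", NA)
)

# Equivalent of the two mutate_if() calls: pick columns by type
filled_by_type <- data %>%
  mutate(
    across(where(is.numeric), ~ replace_na(.x, 0)),
    across(where(is.character), ~ replace_na(.x, "xx"))
  )

# Equivalent of mutate_at(c(1, 2), ...): pick columns by position
filled_by_pos <- data %>%
  mutate(across(c(1, 2), ~ replace_na(.x, 0)))
```

Both forms stay inside a single `mutate()`, so they still fit the pipe-based style.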

Besides, the `dplyr::coalesce()` function can also replace NAs in a slightly tricky way, although it is normally used to find the first non-missing element.

`data %>% mutate(num1 = coalesce(num1, 0))`
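When each column needs its own fill value, the data frame method of `tidyr::replace_na()` accepts a named list, which avoids one `mutate()` call per column. A minimal sketch on the same mixed-type tibble; the fill values and the name `filled` are arbitrary choices for illustration:

```r
library(tibble)
library(tidyr)

data <- tibble(
  num1 = c(NA, 1, NA),
  num2 = c(2, NA, 3),
  chr1 = c("a", NA, "b"),
  chr2 = c("c", "d", NA)
)

# One fill value per column; columns not named in the list are left untouched
filled <- replace_na(data, list(num1 = 0, chr1 = "xx"))

filled$num1  # 0 1 0
filled$num2  # still contains its NA
```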

R – Replace NA with Empty String in a DataFrame

R – Replace NA with 0 in Multiple Columns

Let's see a demo.

`library(ggplot2)
library(tidyverse)
# Data
data(iris)
ggplot(iris, aes(x = Species, y = Sepal.Length, colour = Species)) +
  geom_boxplot()`

Adding jittered points to a box plot in `ggplot` is useful to see the underlying distribution of the data. You can use the `geom_jitter` function with a few parameters, for example `width` to adjust the horizontal spread of the jittered points.

`ggplot(iris, aes(x = Species, y = Sepal.Length, colour = Species, shape = Species)) +
  geom_boxplot() +
  geom_jitter(width = 0.25)`

Sometimes we want to add jittered data points to a grouped boxplot, but we cannot use the `geom_jitter()` function directly, as it is just a handy shortcut for `geom_point(position = "jitter")` and does not dodge the points by group. Let's see what chart is generated below: a grouped boxplot with overlapping jittered data points.
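The `iris2` data used in the grouped plots is not defined in this post. A minimal sketch of one way to build something comparable, assuming a hypothetical two-level `group` column ("A"/"B") assigned alternately within each species; the real `iris2` may have been constructed differently:

```r
library(dplyr)

# Hypothetical stand-in for `iris2`: alternate two made-up groups within each species
iris2 <- iris %>%
  group_by(Species) %>%
  mutate(group = factor(rep(c("A", "B"), length.out = n()))) %>%
  ungroup()

# 25 observations per species and group
table(iris2$Species, iris2$group)
```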

`ggplot(iris2, aes(x = Species, y = Sepal.Length, colour = group, shape = group)) +
  geom_boxplot() +
  geom_jitter(width = 0.25)`

So how do we correctly add jittered data points to a grouped boxplot? We can use `position_jitterdodge()` as the position parameter inside the `geom_point` function.

`ggplot(iris2, aes(x = Species, y = Sepal.Length, colour = group, shape = group)) +
  geom_boxplot() +
  geom_point(position = position_jitterdodge(jitter.width = 0.25))`

Now we get a nice-looking grouped boxplot with clearly separated boxes and jittered data points within each box.
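Two refinements are worth considering with this approach: the boxplot draws its own outlier markers, so outliers appear twice once the raw points are added, and the jitter changes on every render unless a seed is set. A minimal sketch of both, using the built-in `ToothGrowth` data as a different grouped example (this dataset is my choice, not part of the original post):

```r
library(ggplot2)

# ToothGrowth ships with R: tooth length by dose (0.5, 1, 2) and supplement (OJ, VC)
p <- ggplot(ToothGrowth, aes(x = factor(dose), y = len, colour = supp, shape = supp)) +
  # hide the boxplot's own outlier markers so they are not drawn twice
  geom_boxplot(outlier.shape = NA) +
  geom_point(
    position = position_jitterdodge(
      jitter.width = 0.25,  # horizontal spread within each box
      dodge.width = 0.75,   # the function's default, written out explicitly
      seed = 1              # fixed seed: same jitter on every render
    )
  ) +
  labs(x = "dose")

p
```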

https://r-charts.com/distribution/box-plot-jitter-ggplot2/

https://datavizpyr.com/how-to-make-grouped-boxplot-with-jittered-data-points-in-ggplot2/

Suppose I want to draw a scatter plot with a regression line for the `sashelp.iris` dataset using GTL (Graph Template Language). First, I define a GTL template.

`proc template;
  define statgraph ScatterRegPlot;
    begingraph / backgroundcolor=white border=false
                 datacontrastcolors=(orange purple blue)
                 datasymbols=(circlefilled trianglefilled DiamondFilled);
      layout overlay;
        scatterplot x=SepalLength y=SepalWidth / group=Species name='points';
        regressionplot x=SepalLength y=SepalWidth / group=Species degree=3 name='reg';
        discretelegend 'points';
      endlayout;
    endgraph;
  end;
run;`

Now let's see how to create RTF or PDF with this graph.

For PDF as below:

`ods escapechar="^";
ods listing close;
options nonumber nodate;
ods pdf file="C:/Users/Desktop/example.pdf";
proc sgrender data = sashelp.iris template = ScatterRegPlot;
run;
ods pdf close;
ods listing;`

For RTF, just change `ods pdf` above to `ods rtf`.

If we just want to save the graph as a PNG:

`ods listing gpath='C:/Users/TJ0695/Desktop' image_dpi = 300 style=Journal;
ods graphics / imagename="example" imagefmt=png width = 20cm height = 15cm;
proc sgrender data = sashelp.iris template = ScatterRegPlot;
run;
ods graphics off;`

If we increase the DPI to 600, SAS fails with an error like `ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: Java heap space.` To fix it, we need to modify the SAS configuration file.

- Run `proc options option=config; run;` to find the configuration file in use.
- Open that file, find the text starting with `-Xms` or `-Xmx`, and change both of them from `128m` to `1024m`.
- Restart SAS and rerun the code.

After that the error doesn't appear again but the warning is still there.

I find it difficult to understand what LS actually means in its literal sense.

The definition from the `lsmeans` package, which has since been superseded by the `emmeans` package, is shown below.

Least-squares means (LS means for short) for a linear model are simply predictions—or averages thereof—over a regular grid of predictor settings which I call the

reference grid.

In fact, even after reading this sentence I was still confused. What is the reference grid, and how are the predictions made?

So let's see how the LS means are calculated, and the corresponding confidence intervals as well.

First, import the CDISC pilot dataset, the same as in the previous blog article, Conduct an ANCOVA model in R for Drug Trial. Then process `adsl` and `adlb` to create an analysis dataset `ana_dat`, so that we can fit an ANCOVA with the `lm` function. Suppose we want to see how `CHG` (change from baseline) is affected by the independent variable `TRTP` (treatment) while controlling for the covariates `BASE` (baseline) and `AGE` (age).

Filter the dataset on the `BASE` variable, as it contains one missing value.

`library(tidyverse)
library(emmeans)
ana_dat2 <- filter(ana_dat, !is.na(BASE))`

Then fit the ANCOVA model with the `lm` function.

`fit <- lm(CHG ~ BASE + AGE + TRTP, data = ana_dat2)
anova(fit)
# Analysis of Variance Table
#
# Response: CHG
#           Df  Sum Sq Mean Sq F value Pr(>F)
# BASE       1   1.699  1.6989  0.9524 0.3322
# AGE        1   0.001  0.0010  0.0006 0.9811
# TRTP       2   8.343  4.1715  2.3385 0.1034
# Residuals 76 135.570  1.7838`

We know that the LS means are calculated from a reference grid, which contains the means of the covariates and all levels of the factor variables.

`rg <- ref_grid(fit)
# 'emmGrid' object with variables:
#     BASE = 5.4427
#     AGE = 75.309
#     TRTP = Placebo, Xanomeline Low Dose, Xanomeline High Dose`

The means of `BASE` and `AGE` are, as we can see from the output above, `5.4427` and `75.309`, respectively. We can also calculate them manually:

`summary(ana_dat2[,c("BASE", "AGE")])
#       BASE             AGE
#  Min.   : 3.497   Min.   :51.00
#  1st Qu.: 4.774   1st Qu.:71.00
#  Median : 5.273   Median :77.00
#  Mean   : 5.443   Mean   :75.31
#  3rd Qu.: 5.718   3rd Qu.:81.00
#  Max.   :10.880   Max.   :88.00`

Then we can use the `summary()` or `predict()` function to get the predicted values based on the reference grid `rg`.

`rg_pred <- summary(rg)
rg_pred
#  BASE  AGE TRTP                 prediction    SE df
#  5.44 75.3 Placebo                  0.0578 0.506 76
#  5.44 75.3 Xanomeline Low Dose     -0.1833 0.211 76
#  5.44 75.3 Xanomeline High Dose     0.5031 0.235 76`

The prediction column is the same as the output of `predict(rg)`. The table shows the predicted value for each factor level with the covariates held constant at their means.
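The reference grid can also be rebuilt by hand, which makes it clearer what `ref_grid()` does for a model like this one: the covariate is fixed at its mean and the factor is expanded to all of its levels. A minimal sketch in base R; because the CDISC data is not reproduced here, it assumes the built-in `iris` data as a stand-in, with `Sepal.Width` playing the role of the covariate and `Species` the role of `TRTP` (`fit_iris` and `grid` are illustrative names):

```r
# Stand-in ANCOVA: one covariate plus one factor
fit_iris <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

# Reference grid: covariate at its mean, factor at every level
grid <- expand.grid(
  Sepal.Width = mean(iris$Sepal.Width),
  Species = levels(iris$Species)
)

# Predictions on the grid; with a single factor these are its LS means
grid$prediction <- predict(fit_iris, newdata = grid)
grid
```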

In fact, we can also calculate the predicted values ourselves, as the coefficient estimates of the regression equation are available from `fit$coefficients`:

`> fit$coefficients
             (Intercept)                     BASE                      AGE
             -1.11361290               0.11228582               0.00743963
 TRTPXanomeline Low Dose TRTPXanomeline High Dose
             -0.24108746               0.44531274`

As `TRTP` is a factor with multiple levels, it has been converted into dummy variables:

`contrasts(ana_dat2$TRTP)
#                      Xanomeline Low Dose Xanomeline High Dose
# Placebo                                0                    0
# Xanomeline Low Dose                    1                    0
# Xanomeline High Dose                   0                    1`

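Now if we want to calculate the predicted value for the `Xanomeline Low Dose` level, it can be done as follows:

`> 0.11229*5.44+0.00744*75.3-0.24109*1-1.11361
[1] -0.1836104`

The small difference from the `-0.1833` reported by `summary(rg)` comes from rounding the coefficients and covariate means before multiplying.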

Back to LS means, from its definition, it seems to be the average of the predicted values.

`rg_pred %>%
  group_by(TRTP) %>%
  summarise(LSmean = mean(prediction))
# # A tibble: 3 × 2
#   TRTP                 LSmean
#   <fct>                 <dbl>
# 1 Placebo              0.0578
# 2 Xanomeline Low Dose -0.183
# 3 Xanomeline High Dose  0.503`
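With a single factor in the model, every LS mean is the average of exactly one prediction, so the averaging step is invisible. It starts to matter once the reference grid contains a second factor. A minimal sketch in base R, assuming the built-in `warpbreaks` data as a stand-in (the CDISC data is not reproduced here; `fit_wb`, `grid` and `lsm_tension` are illustrative names):

```r
# Additive model with two factors: wool (A, B) and tension (L, M, H)
fit_wb <- lm(breaks ~ wool + tension, data = warpbreaks)

# Reference grid: every combination of the factor levels (2 x 3 = 6 rows)
grid <- expand.grid(
  wool = levels(warpbreaks$wool),
  tension = levels(warpbreaks$tension)
)
grid$prediction <- predict(fit_wb, newdata = grid)

# LS mean of each tension level: equally weighted average over the wool levels
lsm_tension <- aggregate(prediction ~ tension, data = grid, FUN = mean)
lsm_tension
```

Because `warpbreaks` is balanced and the model is additive, these values coincide with the raw means per tension level. With unbalanced data the two generally differ, which is where LS means become useful.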

These are exactly the same results as `lsmeans(rg, "TRTP")` from the `emmeans` package. Calling `emmeans(fit, "TRTP")` directly gives the same results.

`lsmeans(rg, "TRTP")
#  TRTP                  lsmean    SE df lower.CL upper.CL
#  Placebo               0.0578 0.506 76   -0.949    1.065
#  Xanomeline Low Dose  -0.1833 0.211 76   -0.603    0.236
#  Xanomeline High Dose  0.5031 0.235 76    0.036    0.970`

The degrees of freedom are `76`, which is the residual DF of the model: it estimates `1` intercept, `2` parameters for `TRTP`, and `1` for each covariate. With 81 observations, the residual DF is `81-1-2-1-1=76`.
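The residual degrees of freedom can be checked on any `lm` fit as the number of observations minus the number of estimated coefficients, intercept included. A minimal sketch, assuming the built-in `iris` data as a stand-in for `ana_dat2` (`fit_iris`, `n_obs` and `n_coef` are illustrative names):

```r
fit_iris <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

n_obs  <- nobs(fit_iris)          # 150 observations
n_coef <- length(coef(fit_iris))  # intercept + 1 covariate + 2 dummy variables = 4

n_obs - n_coef                    # 146
df.residual(fit_iris)             # 146, the same value
```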

Using `test()` we can get the P value for comparing each lsmean to zero.

`test(lsmeans(fit, "TRTP"))
#  TRTP                  lsmean    SE df t.ratio p.value
#  Placebo               0.0578 0.506 76   0.114  0.9093
#  Xanomeline Low Dose  -0.1833 0.211 76  -0.870  0.3869
#  Xanomeline High Dose  0.5031 0.235 76   2.145  0.0351`

In fact, the `t.ratio` is the t statistic, so we can calculate the P values manually:

`2 * pt(abs(0.114), 76, lower.tail = FALSE)
2 * pt(abs(-0.870), 76, lower.tail = FALSE)
2 * pt(abs(2.145), 76, lower.tail = FALSE)`

Likewise, the confidence interval of an lsmean can be calculated manually from its `SE` and `DF`, for example for the Placebo level:

`> 0.0578 + c(-1, 1) * qt(0.975, 76) * 0.506
[1] -0.9499863 1.0655863`
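The `SE` column itself can be reproduced from the coefficient covariance matrix of the model: for a grid row with design vector x, the standard error is the square root of x' V x, where V is `vcov(fit)`. A minimal sketch in base R that chains this with the interval and test calculations, assuming the built-in `iris` data as a stand-in for `ana_dat2` (names such as `fit_iris`, `X` and `result` are illustrative):

```r
fit_iris <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

# Reference grid: covariate at its mean, one row per factor level
grid <- expand.grid(
  Sepal.Width = mean(iris$Sepal.Width),
  Species = levels(iris$Species)
)

# Design matrix of the grid rows, built with the same coding as the model
X <- model.matrix(delete.response(terms(fit_iris)), data = grid,
                  xlev = fit_iris$xlevels)

lsmean <- drop(X %*% coef(fit_iris))                 # predictions on the grid
se     <- sqrt(diag(X %*% vcov(fit_iris) %*% t(X)))  # standard errors
df     <- df.residual(fit_iris)

# 95% confidence limits and two-sided test against zero
result <- data.frame(
  Species  = grid$Species,
  lsmean   = lsmean,
  SE       = se,
  df       = df,
  lower.CL = lsmean - qt(0.975, df) * se,
  upper.CL = lsmean + qt(0.975, df) * se,
  t.ratio  = lsmean / se,
  p.value  = 2 * pt(abs(lsmean / se), df, lower.tail = FALSE)
)
result
```

The same standard errors and limits come out of `predict(fit_iris, newdata = grid, se.fit = TRUE, interval = "confidence")`, which is a convenient cross-check.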

I think these steps go a long way toward understanding the meaning of least-squares means and the logic behind them. I hope this is helpful.

“emmeans” package

最小二乘均值的估计模型 (Estimation model for least-squares means)

UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA)

Confidence intervals and tests in emmeans

Least-squares Means: The R Package lsmeans