When we mention how to find the cutoff, the first response in our brain may be the ROC curve. Absolutely, ROC curve is a very common approach in the biomarker field to look for a fit cutoff to a reagent. However in the ROC curve, the dependent variable must be two categorical variables, which is not universal to different types of data, such as quantitative variables, survival(censored) variables.
In this situation, as an option we can consider using maximally selected rank statistics to find the cutoff.
The statistics depends on your data types.
What is the maximally selected rank statistics? Briefly speaking it assumes that an unknown cutoff in X (independent variable) determines two groups of observations regarding the response Y, and the two groups have the biggest statistics between each other. This statistics is an appropriate standardized two-sample linear rank statistic of the responses that represents the difference between two groups.
The hypothesis test is to find out the maximum of the standardized statistics of all possible cutoffs, which can provide the best separation of the response into two groups.
So Maximally selected rank statistics can be used for estimation as well as evaluation of a simple cutpoint model.
surv_cutpoint() function in the
survminer package wrapped
maxstat in it to determine the optimal cutpoint for each variable.
A simple example below is about how to use the
maxstat package to find a statistically significant cutoff.
Load the survival data from the maxstat package.
library(survival) library(maxstat) data(DLBCL) mod <- maxstat.test(Surv(time, cens) ~ MGE, data = DLBCL, smethod = "LogRank", pmethod = "condMC", B = 9999) > mod Maximally selected LogRank statistics using condMC data: Surv(time, cens) by MGE M = 3.1772, p-value = 0.009701 sample estimates: estimated cutpoint 0.1860526
smthod provides several kinds of statistics to be computed and the
pmethod is used to specify the kind of p-value approximation. The argument
B specifies the number of Monte-Carlo replications to be performed (and defaults to 10000).
For the overall survival time, the estimated cutpoint is 0.186 mean gene expression, the maximum of the log-rank statistics is M = 3.1772. The probability that, under the null hypothesis, the maximally selected log-rank statistic is greater M = 3.171 is less then than 0.0097.
If you have more than one dependent variable to be evaluated, you need to evaluate these predictors simultaneously and find out which one is better than the others.
mod2 <- maxstat.test(Surv(time, cens) ~ MGE + IPI, data = DLBCL, smethod = "LogRank", pmethod = "exactGauss", abseps=0.01) > mod2 Optimally Selected Prognostic Factors Call: maxstat.test.data.frame(formula = Surv(time, cens) ~ MGE + IPI, data = DLBCL, smethod = "LogRank", pmethod = "exactGauss", abseps = 0.01) Selected: Maximally selected LogRank statistics using exactGauss data: Surv(time, cens) by IPI M = 2.9603, p-value = 0.01104 sample estimates: estimated cutpoint 1 Adjusted p.value: 0.03430325 , error: 0.001754899 > mod2$maxstats [] Maximally selected LogRank statistics using exactGauss data: Surv(time, cens) by MGE M = 3.0602, p-value = 0.02721 sample estimates: estimated cutpoint 0.1860526 [] Maximally selected LogRank statistics using exactGauss data: Surv(time, cens) by IPI M = 2.9603, p-value = 0.01104 sample estimates: estimated cutpoint 1
The p-value of the global test for the null hypothesis “survival is independent from both IPI and MGE” is 0.034 and IPI provides a better distinction into two groups than MGE does.
The visualization of the result can be shown by
Please indicate the source: http://www.bioinfo-scrounger.com