This guide provides an introduction to applying the
multicate
package in health data analyses. The explanations
below are based on the original paper that is based on for development
of this package: Comparison of methods that combine multiple
randomized trials to estimate heterogeneous treatment effects.
Statistics in medicine (Brantner et al., 2024).1 The detailed github
code and readme are in this link(https://github.com/dobengjhu/multicate).
To get strated, install and load the multicate
package
using either install.packages(multicate)
or
pak::pak("dobengjhu/multicate")
. To explore the
functionality of the package, you can examine the sample dataset
dummy_tbl
. This dataset includes three previously conducted
trials comparing the same two treatments. It contains:
An outcome variable called response
A treatment indicator tx
Five signal covariates (var
to var5
)
that may act as effect modifiers
You can view a summary of the dataset with
summary(dummy_tbl)
.
In this documnet, except for multicate
, we used the
packages as follow.
# 1) Install multicate package
#install.packages("multicate")
#pak::pak("dobengjhu/multicate")
library(multicate)
#2) Additional packages for data cleansing
library(tibble)
#install.packages("tableone")
library(tableone)
library(tidyverse)
#install.packages("fastDummies") # if not already installed
library(fastDummies)
#install.packages("kableExtra")
library(kableExtra)
#install.packages("performance")
library(performance)
studyid | tx | var1 | var2 | var3 | var4 | var5 | response |
---|---|---|---|---|---|---|---|
study1 | 0 | 0.9229540 | -0.6017255 | 0.5678660 | 0.2579868 | -0.3901526 | 0.5331128 |
study1 | 1 | 0.6724726 | 0.3215742 | 0.3526782 | 1.0991352 | -0.9077090 | 3.3049825 |
study1 | 0 | -1.1199247 | -1.1144356 | 1.7288654 | 0.3144015 | -0.7294912 | 0.8041214 |
study1 | 1 | -0.0527381 | 0.9808838 | -0.9881958 | -1.0123336 | 1.0122642 | -0.1292580 |
study1 | 1 | -0.8410323 | -0.8977820 | -1.1944022 | -0.2001853 | -0.3946877 | -1.6867035 |
study1 | 0 | -0.7123436 | 1.2483083 | -1.0701537 | -1.6792970 | 0.7385637 | -1.3736278 |
study1 | 1 | 1.6375565 | -0.2251780 | -1.1567990 | -0.3433757 | -1.9489651 | 0.7792991 |
study1 | 0 | 1.5271901 | 0.8185257 | 1.4695467 | 2.3093622 | 0.6189032 | 4.9178592 |
study1 | 0 | -0.4038597 | 0.5745078 | 1.2317660 | 0.4126709 | 0.0939900 | 2.3317713 |
study1 | 1 | -0.7455479 | -0.5814232 | -1.0777537 | 0.3291436 | 2.3484475 | -0.5198707 |
In this example, we use data from a study of Right Heart Catheterization (RHC), a diagnostic procedure used to measure cardiac function in critically ill patients. While RHC can guide urgent and ongoing treatment decisions, it also carries a risk of serious complications. The observational dataset originates from Murphy and Cluff (1990), was later reanalyzed by Connors et al. (1996), and has since become a benchmark dataset in causal inference research.
The dataset (rhc.xls
) includes information on 5,735
hospitalized adult patients across five U.S. medical centers. The
treatment
variable (column 2) indicates whether RHC was
administered within 24 hours of admission (TRUE
for
treatment; FALSE
for control). The outcome variable of this
study is originally dth30
(column 54), records 30-day
post-admission mortality (TRUE
for death;
FALSE
for survival), but in this example, we set the
outcome Glasgow Coma Score (scoma1
variable) as our
continuous coutcome.
Covariate information (column 3 to 53): age, sex, race (black, white, other), years of education,income, type of medical insurance (private, Medicare, Medicaid, private and Medicare, Medicare and Medicaid, or none), primary disease category, secondary disease category, 10 categories of admission diagnosis, activities of daily living (ADL) and Duke Activity Status Index (DASI) 2 weeks before admission, do-not-resuscitate status on day 1 of admission, cancer (none, localized, metastatic), an estimate of the probability of surviving 2 months, acute physiology component of the APACHE III score, Glasgow Coma Score, weight, temperature, mean blood pressure, respiratory rate, heart rate, PaO2=FIO2 ratio, PaCO2, pH, WBC count, hematocrit, sodium, potassium, creatinine, bilirubin, albumin, urine output, 12 categories of comorbidity illness, and whether the patient transferred from another hospital. Since the number of covariates are too big for this demonstration purpose, we will choose a smaller set of covariates in analysis.
To demonstrate the functionality of the multicate
package in the context of multiple studies, we simulate study membership
by arbitrarily dividing the dataset into five groups, treating them as
if they came from five distinct medical centers. Although the original
dataset does not indicate which patient belongs to which hospital, this
simulated structure allows us to explore treatment effect heterogeneity
across centers and individual-level covariates.
#Load data
rhc <- read.csv(file="rhc.xls", header=TRUE)
rhc <- rhc %>% mutate(across(where(is.logical), as.numeric))
rhc <- rhc %>% subset(select = c(-ID))
rhc$center <- rep(1:5, each=nrow(rhc)/5)
rhc$center <- as.character(rhc$center)
#Choose covariates to be used for this study
rhc <- rhc %>% select(treatment, center, scoma1,
race, ninsclas,
hrt1, resp1, bili1, pafi1, aps1, temp1, das2d3pc, sod1, liverhx)
center | treatment | scoma1 | race | ninsclas | hrt1 | resp1 | bili1 | pafi1 | aps1 | temp1 | das2d3pc | sod1 | liverhx |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | white | Medicare | 124 | 10 | 1.0097656 | 68.0000 | 46 | 38.69531 | 23.50000 | 145 | 0 |
1 | 1 | 0 | white | Private & Medicare | 137 | 38 | 0.6999512 | 218.3125 | 50 | 38.89844 | 14.75195 | 137 | 0 |
1 | 1 | 0 | white | Private | 130 | 40 | 1.0097656 | 275.5000 | 82 | 36.39844 | 18.13672 | 146 | 0 |
1 | 0 | 0 | white | Private & Medicare | 58 | 26 | 0.3999634 | 156.6562 | 48 | 35.79688 | 22.92969 | 117 | 0 |
1 | 1 | 41 | white | Medicare | 125 | 27 | 1.0097656 | 478.0000 | 72 | 34.79688 | 21.05078 | 126 | 0 |
1 | 0 | 0 | white | Medicare | 134 | 36 | 1.0097656 | 184.1875 | 38 | 39.19531 | 17.50000 | 138 | 0 |
1 | 0 | 26 | white | Private | 135 | 10 | 1.0097656 | 148.7500 | 29 | 39.39844 | 15.40625 | 136 | 0 |
1 | 0 | 100 | white | Private | 102 | 34 | 0.8999023 | 240.0000 | 25 | 39.19531 | 29.14453 | 136 | 0 |
1 | 0 | 0 | white | Private | 118 | 30 | 1.0998535 | 333.3125 | 47 | 38.39844 | 26.28906 | 136 | 0 |
1 | 1 | 0 | white | Medicaid | 141 | 40 | 0.5000000 | 68.0000 | 48 | 38.50000 | 23.25781 | 146 | 0 |
As described above, the multicate
package builds on
other modeling packages, which require the input data to be fully
numeric and free of missing (NA) values. To meet these requirements, we
pre-processed the dataset accordingly.
After removing missing values (NA) and simplifying categorical
variable levels, we convert all factor and character covariates
variables into numeric or dummy variables using fastDummies
package. The janitor
package’s function,
janitor::clean_names()
is helpful for standardizing column
names by replacing spaces and special characters (e.g., converting
condition_no symptoms
to
condition_no_symptoms
).
center | treatment | scoma1 | race | ninsclas | hrt1 | resp1 | bili1 | pafi1 | aps1 | temp1 | das2d3pc | sod1 | liverhx | race_other | race_white | ninsclas_medicare | ninsclas_medicare_medicaid | ninsclas_no_insurance | ninsclas_private | ninsclas_private_medicare |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | white | Medicare | 124 | 10 | 1.0097656 | 68.0000 | 46 | 38.69531 | 23.50000 | 145 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | white | Private_Medicare | 137 | 38 | 0.6999512 | 218.3125 | 50 | 38.89844 | 14.75195 | 137 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | 0 | white | Private | 130 | 40 | 1.0097656 | 275.5000 | 82 | 36.39844 | 18.13672 | 146 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | 0 | 0 | white | Private_Medicare | 58 | 26 | 0.3999634 | 156.6562 | 48 | 35.79688 | 22.92969 | 117 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | 41 | white | Medicare | 125 | 27 | 1.0097656 | 478.0000 | 72 | 34.79688 | 21.05078 | 126 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | white | Medicare | 134 | 36 | 1.0097656 | 184.1875 | 38 | 39.19531 | 17.50000 | 138 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
1 | 0 | 26 | white | Private | 135 | 10 | 1.0097656 | 148.7500 | 29 | 39.39844 | 15.40625 | 136 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | 0 | 100 | white | Private | 102 | 34 | 0.8999023 | 240.0000 | 25 | 39.19531 | 29.14453 | 136 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | 0 | 0 | white | Private | 118 | 30 | 1.0998535 | 333.3125 | 47 | 38.39844 | 26.28906 | 136 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 0 | white | Medicaid | 141 | 40 | 0.5000000 | 68.0000 | 48 | 38.50000 | 23.25781 | 146 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
The example below is the sample with causal forest for the estimation method, and study specific for the aggregation method.
set.seed(100)
cate_mod <- estimate_cate(rhc_clean,
estimation_method = "causalforest",
aggregation_method = "studyspecific",
study_col = "center",
treatment_col = "treatment",
outcome_col = "scoma1",
covariate_col = NULL,
drop_col = c("race","ninsclas"))
summary_cate_mod <- summary(cate_mod)
#result
#cate_mod2
#result summary
summary_cate_mod$ate
#> Estimate Std. Error
#> -3.309329 NA
set.seed(100)
cate_mod2 <- estimate_cate(rhc_clean,
estimation_method = "causalforest",
aggregation_method = "studyindicator",
study_col = "center",
treatment_col = "treatment",
outcome_col = "scoma1",
covariate_col = NULL,
drop_col = c("race","ninsclas"))
summary_cate_mod2 <- summary(cate_mod2)
#result
#cate_mod2
#result summary
summary_cate_mod2$ate
#> Estimate Std. Error
#> -3.375521 0.752609
The plot()
funciton with estimate_cate()
objects provides five visualizations to help interpret the estiamted
CATEs: 1. Histogram of the CATEs, 2. Boxplot of CATEs stratified by
study membership, 3. Confidence interval plot shoing 95% confidence
intervals for all CATEs, sorted by their estiamted CATEs, 4. Best linear
projection (available only when
estimation_method = "causalforest"
), 5. Interpretation tree
for visualizing how covariates drive treatment effect heterogeneity.
plot(cate_mod)
Additionally, the plot_vteffect()
provides a more
focused visualization if you are interested in exploring treatment
effect heterogeneity with respect to a specific
continuous covariate. It plots the estimated CATE
(tau_hat
) for each observation against the selected
covariate, allowing you to see how treatment effects vary across its
range and across different studies.
For example, if you’re interested in how treatment effects differ by
aps1
across studies, you can specify this variable using
the covariate_name = "aps1"
parameter.
plot_vteffect(cate_mod, covariate_name = "aps1")
For prediction, the aggregation method must be set to
‘studyspecific
’. While only the studyspecific
aggregation is supported for prediction, the estimation method can be
either “causal forest” or “s-learner”. To illustrate, we can randomly
select 100 observations from the original dataset as a simplified target
population. In practice, the target data can include any set of
covariate profiles for which the researcher wants to generate prediction
intervals. However, an important assumption is that the covariate
profiles in the target data fall within the distribution of at least one
of the studies used to estimate the CATE—this is known as the positivity
assumption.
The predict()
function with estimate_cate()
objects in multicate
implements a two-stage meta-analysis
approach to generate prediction intervals:
The resulting prediction interval for the CATE for covariate profile in the target setting represents a range of potential values that the CATE may be in the new setting. For more detailed assumptions and mathematical formulations, refer to the link.4
new_dat <- sample_n(rhc_clean, 100)
This prediction results include three additional columns beyond the original dataet. - tau_predicted: the estimated Conditional Average Treatment Effect (CATE) - pi_lower: the lower bound of the prediction interval - pi_upper: the upper bound of the prediction interval
tau_predicted | pi_lower | pi_upper | treatment | center.x | scoma1 | race | ninsclas | hrt1 | resp1 | bili1 | pafi1 | aps1 | temp1 | das2d3pc | sod1 | liverhx | race_other | race_white | ninsclas_medicare | ninsclas_medicare_medicaid | ninsclas_no_insurance | ninsclas_private | ninsclas_private_medicare | center.y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-1.9615021 | -6.991720 | 3.0687158 | 1 | 3 | 44 | white | Medicare | 35 | 36 | 1.0097656 | 166.6562 | 102 | 32.19531 | 15.40625 | 136 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 3 |
-2.7086885 | -8.486070 | 3.0686932 | 0 | 3 | 9 | white | Private_Medicare | 55 | 26 | 0.5999756 | 282.0000 | 46 | 36.50000 | 11.00000 | 144 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 3 |
-0.2958267 | -6.900161 | 6.3085077 | 0 | 1 | 9 | black | Medicare | 120 | 26 | 0.1999817 | 134.2812 | 61 | 35.89844 | 14.75195 | 132 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
-0.7522934 | -4.898785 | 3.3941985 | 0 | 3 | 0 | white | Medicare | 121 | 28 | 1.0097656 | 199.9688 | 74 | 35.00000 | 21.94922 | 133 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 3 |
-4.8684897 | -9.389990 | -0.3469899 | 0 | 2 | 100 | white | Private_Medicare | 70 | 12 | 1.6999512 | 103.3281 | 30 | 36.50000 | 22.60547 | 138 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 2 |
-1.8268046 | -6.305585 | 2.6519755 | 0 | 1 | 44 | black | Medicaid | 150 | 48 | 0.3999634 | 148.3125 | 40 | 39.09375 | 14.75195 | 142 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
-3.2989329 | -8.368655 | 1.7707893 | 0 | 3 | 0 | white | Medicare_Medicaid | 43 | 36 | 0.5000000 | 357.1250 | 43 | 35.29688 | 19.00391 | 139 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 3 |
-1.4803195 | -6.780706 | 3.8200675 | 0 | 2 | 0 | white | Private | 122 | 37 | 2.8999023 | 233.3125 | 54 | 38.50000 | 20.11719 | 151 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 2 |
-2.8276971 | -6.971732 | 1.3163381 | 1 | 3 | 0 | black | Private | 122 | 42 | 1.0097656 | 276.0000 | 63 | 38.50000 | 29.14844 | 135 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 |
-3.6820256 | -9.128999 | 1.7649482 | 1 | 4 | 0 | white | Private | 144 | 40 | 1.0097656 | 182.0000 | 89 | 39.89844 | 19.33203 | 121 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 4 |
While the primary purpose of this package is for RCTs, but the base dataset can be from randomized controlled trials, observational studies, or a combination of both. Our functions are primarily designed for IPD (Individual Patient Data). The dataset must contain at least 4 key components, a study indicator variable, a treatment indicator variable, An outcome variable, and covariate variables.
You should be careful to use which data formats your data has. For this package, the treatment variable should be numeric (e.g. 0 or 1), the study ID should be character (factor is allowed.), and the outcome must be numeric. Note that this package is primarily designed for continuous outcome, and covariates should be all numeric.
Data column | Data format |
---|---|
treatment | numeric(0,1) |
studyid | character |
outcome | numeric |
covariates | numeric |
No. Packages like grf
(used for causal_forest) do not
support missing values. You must either remove rows with NA
or impute missing values before analysis.
If you need to include factor variables, we recommend the following:
The brief summary of solutions suggested by the grf vignett is as follows:
fastdummies
to convert
categorical variables into dummy variables easily.Means encoding: Use the sufrep
package for
target/means encoding.
Oridinal encoding: If your categorical variable has a natural order (e.g. months), encode it as a numeric variable accordingly.
See the grf
vignette on categorical inputs for mode
deail.5
Ensure that the column names or formulas passed to estimate_cate() do
not contain spaces or logical connectors (like ‘and’). The
janitor
package is helpful for cleaning column names by
replacing spaces with underscores (e.g., ‘variable name’ becomes
‘variable_name’). This step is especially important when using the
ensemble forest method.
These errors may be caused by lack of sufficient variation or sparsity in one or more covariates. When a variable has very few observations in a particular category (e.g., a binary variable with only a handful of “Yes” values), it can lead to unstable splits in tree-based models like causal forest, or to overfitting, especially in small subgroups.
Before applying the model to clinical data, it is recommended to examine the distribution of each covariate to identify rare levels or highly imbalanced variables that may affect model performance.
dbarts
and
grf
packages.
grf
’s
causal_forest
.
importance
is set to impurity
, and
keep.inbag
is set to TRUE
.
To estiamte CATE, you can use the function
estiamte_cate()
and this function requires parameter like
below:
c('var1','var2','var3')
.c('var1','var2','var3')
. Default is
NULL.causal forest
for the
estimation method you may adjust extra arguments to compelte splitting
tress faster by setting a smaller number of trees. (e.g.extra_args =
list(num.trees = 50))
The result objects of estimate_cate()
and
summary()
provide different information.
estimate_cate()
result object:
summary()
of the result object of estimate_cate():
ate
)studycate
) with minimum,
median, maximum CATE.
Plot4 BLP(Best Linear projection figure) is not available for the ensemble forest method. You can refer to the table below to see which plots are supported by each combination of aggregation and estimation methods.
No | Plots from plot() |
Aggregation Method | Estimation Method |
---|---|---|---|
1 | Histogram of estimated CATEs | Available to all | Available to all |
2 | Boxplot of CATEs by study ID | Available to all | Available to all |
3 | 95% CI for all CATEs | Study specific, Study indicator | Available to all |
4 | Best Linear Projection | Study specific, Study indicator | Causal forest |
5 | Interpretation tree | Available to all | Available to all |
Brantner, C. L., Nguyen, T. Q., Tang, T., Zhao, C., Hong, H., & Stuart, E. A. (2024). Comparison of methods that combine multiple randomized trials to estimate heterogeneous treatment effects. Statistics in medicine, 43(7), 1291-1314. https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9955↩︎
https://github.com/dobengjhu/multicate/blob/main/R/predict_cate.R↩︎
https://grf-labs.github.io/grf/articles/categorical_inputs.html↩︎
Hill, J. L. (2011). Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics, 20(1), 217–240. https://doi.org/10.1198/jcgs.2010.08162↩︎
Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences, 116(10), 4156-4165. https://www.pnas.org/doi/abs/10.1073/pnas.1804597116↩︎
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. https://arxiv.org/abs/1610.01271↩︎