This guide provides an introduction to applying the multicate package in health data analyses. The explanations below are based on the original paper that is based on for development of this package: Comparison of methods that combine multiple randomized trials to estimate heterogeneous treatment effects. Statistics in medicine (Brantner et al., 2024).¹ The detailed github code and readme are in this link(https://github.com/dobengjhu/multicate).

Application

1) Load packages

To get strated, install and load the multicate package using either install.packages(multicate) or pak::pak("dobengjhu/multicate"). To explore the functionality of the package, you can examine the sample dataset dummy_tbl. This dataset includes three previously conducted trials comparing the same two treatments. It contains:

An outcome variable called response
A treatment indicator tx
Five signal covariates (var to var5) that may act as effect modifiers

You can view a summary of the dataset with summary(dummy_tbl).

In this documnet, except for multicate, we used the packages as follow.

# 1) Install multicate package
#install.packages("multicate")
#pak::pak("dobengjhu/multicate")
library(multicate)

#2) Additional packages for data cleansing
library(tibble)
#install.packages("tableone")
library(tableone)
library(tidyverse)
#install.packages("fastDummies")  # if not already installed
library(fastDummies)
#install.packages("kableExtra")
library(kableExtra)
#install.packages("performance")
library(performance)

studyid	tx	var1	var2	var3	var4	var5	response
study1	0	0.9229540	-0.6017255	0.5678660	0.2579868	-0.3901526	0.5331128
study1	1	0.6724726	0.3215742	0.3526782	1.0991352	-0.9077090	3.3049825
study1	0	-1.1199247	-1.1144356	1.7288654	0.3144015	-0.7294912	0.8041214
study1	1	-0.0527381	0.9808838	-0.9881958	-1.0123336	1.0122642	-0.1292580
study1	1	-0.8410323	-0.8977820	-1.1944022	-0.2001853	-0.3946877	-1.6867035
study1	0	-0.7123436	1.2483083	-1.0701537	-1.6792970	0.7385637	-1.3736278
study1	1	1.6375565	-0.2251780	-1.1567990	-0.3433757	-1.9489651	0.7792991
study1	0	1.5271901	0.8185257	1.4695467	2.3093622	0.6189032	4.9178592
study1	0	-0.4038597	0.5745078	1.2317660	0.4126709	0.0939900	2.3317713
study1	1	-0.7455479	-0.5814232	-1.0777537	0.3291436	2.3484475	-0.5198707

2) RHC dataset

In this example, we use data from a study of Right Heart Catheterization (RHC), a diagnostic procedure used to measure cardiac function in critically ill patients. While RHC can guide urgent and ongoing treatment decisions, it also carries a risk of serious complications. The observational dataset originates from Murphy and Cluff (1990), was later reanalyzed by Connors et al. (1996), and has since become a benchmark dataset in causal inference research.

Data²
Dictionary³

The dataset (rhc.xls) includes information on 5,735 hospitalized adult patients across five U.S. medical centers. The treatment variable (column 2) indicates whether RHC was administered within 24 hours of admission (TRUE for treatment; FALSE for control). The outcome variable of this study is originally dth30 (column 54), records 30-day post-admission mortality (TRUE for death; FALSE for survival), but in this example, we set the outcome Glasgow Coma Score (scoma1 variable) as our continuous coutcome.

Covariate information (column 3 to 53): age, sex, race (black, white, other), years of education,income, type of medical insurance (private, Medicare, Medicaid, private and Medicare, Medicare and Medicaid, or none), primary disease category, secondary disease category, 10 categories of admission diagnosis, activities of daily living (ADL) and Duke Activity Status Index (DASI) 2 weeks before admission, do-not-resuscitate status on day 1 of admission, cancer (none, localized, metastatic), an estimate of the probability of surviving 2 months, acute physiology component of the APACHE III score, Glasgow Coma Score, weight, temperature, mean blood pressure, respiratory rate, heart rate, PaO2=FIO2 ratio, PaCO2, pH, WBC count, hematocrit, sodium, potassium, creatinine, bilirubin, albumin, urine output, 12 categories of comorbidity illness, and whether the patient transferred from another hospital. Since the number of covariates are too big for this demonstration purpose, we will choose a smaller set of covariates in analysis.

To demonstrate the functionality of the multicate package in the context of multiple studies, we simulate study membership by arbitrarily dividing the dataset into five groups, treating them as if they came from five distinct medical centers. Although the original dataset does not indicate which patient belongs to which hospital, this simulated structure allows us to explore treatment effect heterogeneity across centers and individual-level covariates.

#Load data
rhc <- read.csv(file="rhc.xls", header=TRUE)
rhc <- rhc %>% mutate(across(where(is.logical), as.numeric))
rhc <- rhc %>% subset(select = c(-ID))
rhc$center <- rep(1:5, each=nrow(rhc)/5)
rhc$center <- as.character(rhc$center)

#Choose covariates to be used for this study
rhc <- rhc %>% select(treatment, center, scoma1,  
                      race, ninsclas,
                      hrt1, resp1, bili1, pafi1, aps1, temp1, das2d3pc, sod1, liverhx)

center	treatment	scoma1	race	ninsclas	hrt1	resp1	bili1	pafi1	aps1	temp1	das2d3pc	sod1
1	0	0	white	Medicare	124	10	1.0097656	68.0000	46	38.69531	23.50000	145
1	1	0	white	Private & Medicare	137	38	0.6999512	218.3125	50	38.89844	14.75195	137
1	1	0	white	Private	130	40	1.0097656	275.5000	82	36.39844	18.13672	146
1	0	0	white	Private & Medicare	58	26	0.3999634	156.6562	48	35.79688	22.92969	117
1	1	41	white	Medicare	125	27	1.0097656	478.0000	72	34.79688	21.05078	126
1	0	0	white	Medicare	134	36	1.0097656	184.1875	38	39.19531	17.50000	138
1	0	26	white	Private	135	10	1.0097656	148.7500	29	39.39844	15.40625	136
1	0	100	white	Private	102	34	0.8999023	240.0000	25	39.19531	29.14453	136
1	0	0	white	Private	118	30	1.0998535	333.3125	47	38.39844	26.28906	136
1	1	0	white	Medicaid	141	40	0.5000000	68.0000	48	38.50000	23.25781	146

3) Data preparation

As described above, the multicate package builds on other modeling packages, which require the input data to be fully numeric and free of missing (NA) values. To meet these requirements, we pre-processed the dataset accordingly.

After removing missing values (NA) and simplifying categorical variable levels, we convert all factor and character covariates variables into numeric or dummy variables using fastDummies package. The janitor package’s function, janitor::clean_names() is helpful for standardizing column names by replacing spaces and special characters (e.g., converting condition_no symptoms to condition_no_symptoms).

center	treatment	scoma1	race	ninsclas	hrt1	resp1	bili1	pafi1	aps1	temp1	das2d3pc	sod1	race_white	ninsclas_medicare	ninsclas_private	ninsclas_private_medicare
1	0	0	white	Medicare	124	10	1.0097656	68.0000	46	38.69531	23.50000	145	1	1	0	0
1	1	0	white	Private_Medicare	137	38	0.6999512	218.3125	50	38.89844	14.75195	137	1	0	0	1
1	1	0	white	Private	130	40	1.0097656	275.5000	82	36.39844	18.13672	146	1	0	1	0
1	0	0	white	Private_Medicare	58	26	0.3999634	156.6562	48	35.79688	22.92969	117	1	0	0	1
1	1	41	white	Medicare	125	27	1.0097656	478.0000	72	34.79688	21.05078	126	1	1	0	0
1	0	0	white	Medicare	134	36	1.0097656	184.1875	38	39.19531	17.50000	138	1	1	0	0
1	0	26	white	Private	135	10	1.0097656	148.7500	29	39.39844	15.40625	136	1	0	1	0
1	0	100	white	Private	102	34	0.8999023	240.0000	25	39.19531	29.14453	136	1	0	1	0
1	0	0	white	Private	118	30	1.0998535	333.3125	47	38.39844	26.28906	136	1	0	1	0
1	1	0	white	Medicaid	141	40	0.5000000	68.0000	48	38.50000	23.25781	146	1	0	0	0

4) Choose aggregation and estimation method

The example below is the sample with causal forest for the estimation method, and study specific for the aggregation method.

set.seed(100)
cate_mod <- estimate_cate(rhc_clean,
                          estimation_method = "causalforest",
                          aggregation_method = "studyspecific",
                          study_col = "center",
                          treatment_col = "treatment",
                          outcome_col = "scoma1",
                          covariate_col = NULL,
                          drop_col = c("race","ninsclas"))
summary_cate_mod <- summary(cate_mod)

#result
#cate_mod2

#result summary
summary_cate_mod$ate
#>   Estimate Std. Error 
#>  -3.309329         NA

set.seed(100)
cate_mod2 <- estimate_cate(rhc_clean,
                          estimation_method = "causalforest",
                          aggregation_method = "studyindicator",
                          study_col = "center",
                          treatment_col = "treatment",
                          outcome_col = "scoma1",
                          covariate_col = NULL,
                          drop_col = c("race","ninsclas"))
summary_cate_mod2 <- summary(cate_mod2)

#result
#cate_mod2

#result summary
summary_cate_mod2$ate
#>   Estimate Std. Error 
#>  -3.375521   0.752609

5) Visualization of estimation

The plot() funciton with estimate_cate() objects provides five visualizations to help interpret the estiamted CATEs: 1. Histogram of the CATEs, 2. Boxplot of CATEs stratified by study membership, 3. Confidence interval plot shoing 95% confidence intervals for all CATEs, sorted by their estiamted CATEs, 4. Best linear projection (available only when estimation_method = "causalforest"), 5. Interpretation tree for visualizing how covariates drive treatment effect heterogeneity.

plot(cate_mod)

Additionally, the plot_vteffect() provides a more focused visualization if you are interested in exploring treatment effect heterogeneity with respect to a specific continuous covariate. It plots the estimated CATE (tau_hat) for each observation against the selected covariate, allowing you to see how treatment effects vary across its range and across different studies.

For example, if you’re interested in how treatment effects differ by aps1 across studies, you can specify this variable using the covariate_name = "aps1" parameter.

plot_vteffect(cate_mod, covariate_name = "aps1")

5) Predict CATEs for target population

For prediction, the aggregation method must be set to ‘studyspecific’. While only the studyspecific aggregation is supported for prediction, the estimation method can be either “causal forest” or “s-learner”. To illustrate, we can randomly select 100 observations from the original dataset as a simplified target population. In practice, the target data can include any set of covariate profiles for which the researcher wants to generate prediction intervals. However, an important assumption is that the covariate profiles in the target data fall within the distribution of at least one of the studies used to estimate the CATE—this is known as the positivity assumption.

The predict() function with estimate_cate() objects in multicate implements a two-stage meta-analysis approach to generate prediction intervals:

Stage1: A model is fit within each study to estimate CATE given covariate profile in the target data.
Stage2: A random effects meta-analysis is then applied to combine the study-specific CATE estiamtes and quantify uncertainty.

The resulting prediction interval for the CATE for covariate profile in the target setting represents a range of potential values that the CATE may be in the new setting. For more detailed assumptions and mathematical formulations, refer to the link.⁴

new_dat <- sample_n(rhc_clean, 100)

This prediction results include three additional columns beyond the original dataet. - tau_predicted: the estimated Conditional Average Treatment Effect (CATE) - pi_lower: the lower bound of the prediction interval - pi_upper: the upper bound of the prediction interval

tau_predicted	pi_lower	pi_upper	treatment	center.x	scoma1	race	ninsclas	hrt1	resp1	bili1	pafi1	aps1	temp1	das2d3pc	sod1	race_white	ninsclas_medicare	ninsclas_medicare_medicaid	ninsclas_private	ninsclas_private_medicare	center.y
-1.9615021	-6.991720	3.0687158	1	3	44	white	Medicare	35	36	1.0097656	166.6562	102	32.19531	15.40625	136	1	1	0	0	0	3
-2.7086885	-8.486070	3.0686932	0	3	9	white	Private_Medicare	55	26	0.5999756	282.0000	46	36.50000	11.00000	144	1	0	0	0	1	3
-0.2958267	-6.900161	6.3085077	0	1	9	black	Medicare	120	26	0.1999817	134.2812	61	35.89844	14.75195	132	0	1	0	0	0	1
-0.7522934	-4.898785	3.3941985	0	3	0	white	Medicare	121	28	1.0097656	199.9688	74	35.00000	21.94922	133	1	1	0	0	0	3
-4.8684897	-9.389990	-0.3469899	0	2	100	white	Private_Medicare	70	12	1.6999512	103.3281	30	36.50000	22.60547	138	1	0	0	0	1	2
-1.8268046	-6.305585	2.6519755	0	1	44	black	Medicaid	150	48	0.3999634	148.3125	40	39.09375	14.75195	142	0	0	0	0	0	1
-3.2989329	-8.368655	1.7707893	0	3	0	white	Medicare_Medicaid	43	36	0.5000000	357.1250	43	35.29688	19.00391	139	1	0	1	0	0	3
-1.4803195	-6.780706	3.8200675	0	2	0	white	Private	122	37	2.8999023	233.3125	54	38.50000	20.11719	151	1	0	0	1	0	2
-2.8276971	-6.971732	1.3163381	1	3	0	black	Private	122	42	1.0097656	276.0000	63	38.50000	29.14844	135	0	0	0	1	0	3
-3.6820256	-9.128999	1.7649482	1	4	0	white	Private	144	40	1.0097656	182.0000	89	39.89844	19.33203	121	1	0	0	1	0	4

Which dataset is suitable?

Q. What kinds of data should I have?

While the primary purpose of this package is for RCTs, but the base dataset can be from randomized controlled trials, observational studies, or a combination of both. Our functions are primarily designed for IPD (Individual Patient Data). The dataset must contain at least 4 key components, a study indicator variable, a treatment indicator variable, An outcome variable, and covariate variables.

Q. What data format is required?

You should be careful to use which data formats your data has. For this package, the treatment variable should be numeric (e.g. 0 or 1), the study ID should be character (factor is allowed.), and the outcome must be numeric. Note that this package is primarily designed for continuous outcome, and covariates should be all numeric.

Data column	Data format
treatment	numeric(0,1)
studyid	character
outcome	numeric
covariates	numeric

Q. Can I have missing values?

No. Packages like grf (used for causal_forest) do not support missing values. You must either remove rows with NA or impute missing values before analysis.

Q. What if I have categorical or factor variables in my covariates?

If you need to include factor variables, we recommend the following:

The brief summary of solutions suggested by the grf vignett is as follows:

One-hot encoding: Use fastdummies to convert categorical variables into dummy variables easily.

Ensure that values do not contain spaces or special characters. For example, convert ‘study 1’ to ‘study_1’.

Means encoding: Use the sufrep package for target/means encoding.
Oridinal encoding: If your categorical variable has a natural order (e.g. months), encode it as a numeric variable accordingly.

See the grf vignette on categorical inputs for mode deail.⁵

Q. I applied one-hot encoding but it is still not working-Why?

Ensure that the column names or formulas passed to estimate_cate() do not contain spaces or logical connectors (like ‘and’). The janitor package is helpful for cleaning column names by replacing spaces with underscores (e.g., ‘variable name’ becomes ‘variable_name’). This step is especially important when using the ensemble forest method.

Q. Why did I encounter errors when using the causal forest with the study-specific method?

These errors may be caused by lack of sufficient variation or sparsity in one or more covariates. When a variable has very few observations in a particular category (e.g., a binary variable with only a handful of “Yes” values), it can lead to unstable splits in tree-based models like causal forest, or to overfitting, especially in small subgroups.

Before applying the model to clinical data, it is recommended to examine the distribution of each covariate to identify rare levels or highly imbalanced variables that may affect model performance.

Q. What other packages are used internally?

slearner: Based on the dbarts and grfpackages.
- The detailed arguments can be found in the paper by Hill (2011)⁶ and Künzel (2019)⁷. The default arguments are as follows: keeptrees is set to TRUE, and verbose is set to FALSE when aggregation_method is set to ensembleforest.
causalforest: Based on the package grf’s causal_forest.
- You can refer to the detailed arguments in the paper by Athey et al. (2019)⁸. This is estimated by averaging the treatment effect across patients, accounting for counterfactual assignments. The default arguments are as follows: importance is set to impurity, and keep.inbag is set to TRUE.

Q. How do I input my data into the function parameters?

To estiamte CATE, you can use the function estiamte_cate() and this function requires parameter like below:

estimation_method (string): Choose ‘slearner’ or ‘causalforest’
aggregation_method (String): Choose one of options below.
- 1. studyindicator: Pool all data together but keep study ID as an indicator, including them as covariates in the single-study method.
- 1. ensembleforest: Fit the estimation model within each study, apply each model to all indivudals across all studies, then fit ensemble random forest to augmented data.
- 1. studyspecific: Fit the estimation model separately for each study, and report out study-specific estimates.
study_col (String): Name of the column of study indicator.Note that study_col can be NULL only when aggregation method is studyspecific. In other cases, study_col parameter should be filled with the specific column name, like ‘studyid’.
treatment_col (String): Name of the column of treatment.
outcome_col (String): Name of the column of outcome.
covariate_col (Vector): Names of the columns of covariates. The default is NULL. The length of string should be the same with the number of covariates. You can type the list like c('var1','var2','var3').
drop_col (Vector): Name of the columns to be deleted. You can type the list like c('var1','var2','var3'). Default is NULL.
extra_args (List): Add additional arguments for each estimation method. For example, when you choose causal forest for the estimation method you may adjust extra arguments to compelte splitting tress faster by setting a smaller number of trees. (e.g.extra_args = list(num.trees = 50))

Q. What kinds of objects are returned?

The result objects of estimate_cate() and summary() provide different information.

estimate_cate() result object:
- ‘GRF forest object of type causal_forest’ (e.g. Number of trees, Number of training samples)
- ‘variance importance tibble’
- ‘original data with additional two columns, tau hat and variance of estimates’.
summary() of the result object of estimate_cate():
- overall ATE (object ate)
- its standard error
- study-specific ATE (object studycate) with minimum, median, maximum CATE.

Q. Why can’t I get Plot 4 (BLP) with ensemble forest?

Plot4 BLP(Best Linear projection figure) is not available for the ensemble forest method. You can refer to the table below to see which plots are supported by each combination of aggregation and estimation methods.

No	Plots from `plot()`	Aggregation Method	Estimation Method
1	Histogram of estimated CATEs	Available to all	Available to all
2	Boxplot of CATEs by study ID	Available to all	Available to all
3	95% CI for all CATEs	Study specific, Study indicator	Available to all
4	Best Linear Projection	Study specific, Study indicator	Causal forest
5	Interpretation tree	Available to all	Available to all

Brantner, C. L., Nguyen, T. Q., Tang, T., Zhao, C., Hong, H., & Stuart, E. A. (2024). Comparison of methods that combine multiple randomized trials to estimate heterogeneous treatment effects. Statistics in medicine, 43(7), 1291-1314. https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9955 ↩︎
https://hbiostat.org/data ↩︎
https://hbiostat.org/data/repo/rhc ↩︎
https://github.com/dobengjhu/multicate/blob/main/R/predict_cate.R ↩︎
https://grf-labs.github.io/grf/articles/categorical_inputs.html ↩︎
Hill, J. L. (2011). Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics, 20(1), 217–240. https://doi.org/10.1198/jcgs.2010.08162 ↩︎
Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences, 116(10), 4156-4165. https://www.pnas.org/doi/abs/10.1073/pnas.1804597116 ↩︎
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. https://arxiv.org/abs/1610.01271 ↩︎

Applying multicate

Kyungeun Jeon, Carly Brantner, Daniel Obeng, Elizabeth Stuart

2025-07-11