This guide provides an introduction to applying the multicate package in health data analyses. The explanations below are based on the original paper that is based on for development of this package: Comparison of methods that combine multiple randomized trials to estimate heterogeneous treatment effects. Statistics in medicine (Brantner et al., 2024).1 The detailed github code and readme are in this link(https://github.com/dobengjhu/multicate).

Application

1) Load packages

To get strated, install and load the multicate package using either install.packages(multicate) or pak::pak("dobengjhu/multicate"). To explore the functionality of the package, you can examine the sample dataset dummy_tbl. This dataset includes three previously conducted trials comparing the same two treatments. It contains:

  • An outcome variable called response

  • A treatment indicator tx

  • Five signal covariates (var to var5) that may act as effect modifiers

You can view a summary of the dataset with summary(dummy_tbl).

In this documnet, except for multicate, we used the packages as follow.

# 1) Install multicate package
#install.packages("multicate")
#pak::pak("dobengjhu/multicate")
library(multicate)

#2) Additional packages for data cleansing
library(tibble)
#install.packages("tableone")
library(tableone)
library(tidyverse)
#install.packages("fastDummies")  # if not already installed
library(fastDummies)
#install.packages("kableExtra")
library(kableExtra)
#install.packages("performance")
library(performance)
studyid tx var1 var2 var3 var4 var5 response
study1 0 0.9229540 -0.6017255 0.5678660 0.2579868 -0.3901526 0.5331128
study1 1 0.6724726 0.3215742 0.3526782 1.0991352 -0.9077090 3.3049825
study1 0 -1.1199247 -1.1144356 1.7288654 0.3144015 -0.7294912 0.8041214
study1 1 -0.0527381 0.9808838 -0.9881958 -1.0123336 1.0122642 -0.1292580
study1 1 -0.8410323 -0.8977820 -1.1944022 -0.2001853 -0.3946877 -1.6867035
study1 0 -0.7123436 1.2483083 -1.0701537 -1.6792970 0.7385637 -1.3736278
study1 1 1.6375565 -0.2251780 -1.1567990 -0.3433757 -1.9489651 0.7792991
study1 0 1.5271901 0.8185257 1.4695467 2.3093622 0.6189032 4.9178592
study1 0 -0.4038597 0.5745078 1.2317660 0.4126709 0.0939900 2.3317713
study1 1 -0.7455479 -0.5814232 -1.0777537 0.3291436 2.3484475 -0.5198707

 

2) RHC dataset

In this example, we use data from a study of Right Heart Catheterization (RHC), a diagnostic procedure used to measure cardiac function in critically ill patients. While RHC can guide urgent and ongoing treatment decisions, it also carries a risk of serious complications. The observational dataset originates from Murphy and Cluff (1990), was later reanalyzed by Connors et al. (1996), and has since become a benchmark dataset in causal inference research.

  • Data2

  • Dictionary3

The dataset (rhc.xls) includes information on 5,735 hospitalized adult patients across five U.S. medical centers. The treatment variable (column 2) indicates whether RHC was administered within 24 hours of admission (TRUE for treatment; FALSE for control). The outcome variable of this study is originally dth30 (column 54), records 30-day post-admission mortality (TRUE for death; FALSE for survival), but in this example, we set the outcome Glasgow Coma Score (scoma1 variable) as our continuous coutcome.

Covariate information (column 3 to 53): age, sex, race (black, white, other), years of education,income, type of medical insurance (private, Medicare, Medicaid, private and Medicare, Medicare and Medicaid, or none), primary disease category, secondary disease category, 10 categories of admission diagnosis, activities of daily living (ADL) and Duke Activity Status Index (DASI) 2 weeks before admission, do-not-resuscitate status on day 1 of admission, cancer (none, localized, metastatic), an estimate of the probability of surviving 2 months, acute physiology component of the APACHE III score, Glasgow Coma Score, weight, temperature, mean blood pressure, respiratory rate, heart rate, PaO2=FIO2 ratio, PaCO2, pH, WBC count, hematocrit, sodium, potassium, creatinine, bilirubin, albumin, urine output, 12 categories of comorbidity illness, and whether the patient transferred from another hospital. Since the number of covariates are too big for this demonstration purpose, we will choose a smaller set of covariates in analysis.

To demonstrate the functionality of the multicate package in the context of multiple studies, we simulate study membership by arbitrarily dividing the dataset into five groups, treating them as if they came from five distinct medical centers. Although the original dataset does not indicate which patient belongs to which hospital, this simulated structure allows us to explore treatment effect heterogeneity across centers and individual-level covariates.

#Load data
rhc <- read.csv(file="rhc.xls", header=TRUE)
rhc <- rhc %>% mutate(across(where(is.logical), as.numeric))
rhc <- rhc %>% subset(select = c(-ID))
rhc$center <- rep(1:5, each=nrow(rhc)/5)
rhc$center <- as.character(rhc$center)

#Choose covariates to be used for this study
rhc <- rhc %>% select(treatment, center, scoma1,  
                      race, ninsclas,
                      hrt1, resp1, bili1, pafi1, aps1, temp1, das2d3pc, sod1, liverhx)
center treatment scoma1 race ninsclas hrt1 resp1 bili1 pafi1 aps1 temp1 das2d3pc sod1 liverhx
1 0 0 white Medicare 124 10 1.0097656 68.0000 46 38.69531 23.50000 145 0
1 1 0 white Private & Medicare 137 38 0.6999512 218.3125 50 38.89844 14.75195 137 0
1 1 0 white Private 130 40 1.0097656 275.5000 82 36.39844 18.13672 146 0
1 0 0 white Private & Medicare 58 26 0.3999634 156.6562 48 35.79688 22.92969 117 0
1 1 41 white Medicare 125 27 1.0097656 478.0000 72 34.79688 21.05078 126 0
1 0 0 white Medicare 134 36 1.0097656 184.1875 38 39.19531 17.50000 138 0
1 0 26 white Private 135 10 1.0097656 148.7500 29 39.39844 15.40625 136 0
1 0 100 white Private 102 34 0.8999023 240.0000 25 39.19531 29.14453 136 0
1 0 0 white Private 118 30 1.0998535 333.3125 47 38.39844 26.28906 136 0
1 1 0 white Medicaid 141 40 0.5000000 68.0000 48 38.50000 23.25781 146 0

 

3) Data preparation

As described above, the multicate package builds on other modeling packages, which require the input data to be fully numeric and free of missing (NA) values. To meet these requirements, we pre-processed the dataset accordingly.

After removing missing values (NA) and simplifying categorical variable levels, we convert all factor and character covariates variables into numeric or dummy variables using fastDummies package. The janitor package’s function, janitor::clean_names() is helpful for standardizing column names by replacing spaces and special characters (e.g., converting condition_no symptoms to condition_no_symptoms).

center treatment scoma1 race ninsclas hrt1 resp1 bili1 pafi1 aps1 temp1 das2d3pc sod1 liverhx race_other race_white ninsclas_medicare ninsclas_medicare_medicaid ninsclas_no_insurance ninsclas_private ninsclas_private_medicare
1 0 0 white Medicare 124 10 1.0097656 68.0000 46 38.69531 23.50000 145 0 0 1 1 0 0 0 0
1 1 0 white Private_Medicare 137 38 0.6999512 218.3125 50 38.89844 14.75195 137 0 0 1 0 0 0 0 1
1 1 0 white Private 130 40 1.0097656 275.5000 82 36.39844 18.13672 146 0 0 1 0 0 0 1 0
1 0 0 white Private_Medicare 58 26 0.3999634 156.6562 48 35.79688 22.92969 117 0 0 1 0 0 0 0 1
1 1 41 white Medicare 125 27 1.0097656 478.0000 72 34.79688 21.05078 126 0 0 1 1 0 0 0 0
1 0 0 white Medicare 134 36 1.0097656 184.1875 38 39.19531 17.50000 138 0 0 1 1 0 0 0 0
1 0 26 white Private 135 10 1.0097656 148.7500 29 39.39844 15.40625 136 0 0 1 0 0 0 1 0
1 0 100 white Private 102 34 0.8999023 240.0000 25 39.19531 29.14453 136 0 0 1 0 0 0 1 0
1 0 0 white Private 118 30 1.0998535 333.3125 47 38.39844 26.28906 136 0 0 1 0 0 0 1 0
1 1 0 white Medicaid 141 40 0.5000000 68.0000 48 38.50000 23.25781 146 0 0 1 0 0 0 0 0

 

4) Choose aggregation and estimation method

 

The example below is the sample with causal forest for the estimation method, and study specific for the aggregation method.

set.seed(100)
cate_mod <- estimate_cate(rhc_clean,
                          estimation_method = "causalforest",
                          aggregation_method = "studyspecific",
                          study_col = "center",
                          treatment_col = "treatment",
                          outcome_col = "scoma1",
                          covariate_col = NULL,
                          drop_col = c("race","ninsclas"))
summary_cate_mod <- summary(cate_mod)

#result
#cate_mod2
#result summary
summary_cate_mod$ate
#>   Estimate Std. Error 
#>  -3.309329         NA
set.seed(100)
cate_mod2 <- estimate_cate(rhc_clean,
                          estimation_method = "causalforest",
                          aggregation_method = "studyindicator",
                          study_col = "center",
                          treatment_col = "treatment",
                          outcome_col = "scoma1",
                          covariate_col = NULL,
                          drop_col = c("race","ninsclas"))
summary_cate_mod2 <- summary(cate_mod2)

#result
#cate_mod2
#result summary
summary_cate_mod2$ate
#>   Estimate Std. Error 
#>  -3.375521   0.752609

 

5) Visualization of estimation

The plot() funciton with estimate_cate() objects provides five visualizations to help interpret the estiamted CATEs: 1. Histogram of the CATEs, 2. Boxplot of CATEs stratified by study membership, 3. Confidence interval plot shoing 95% confidence intervals for all CATEs, sorted by their estiamted CATEs, 4. Best linear projection (available only when estimation_method = "causalforest"), 5. Interpretation tree for visualizing how covariates drive treatment effect heterogeneity.

plot(cate_mod)

Additionally, the plot_vteffect() provides a more focused visualization if you are interested in exploring treatment effect heterogeneity with respect to a specific continuous covariate. It plots the estimated CATE (tau_hat) for each observation against the selected covariate, allowing you to see how treatment effects vary across its range and across different studies.

For example, if you’re interested in how treatment effects differ by aps1 across studies, you can specify this variable using the covariate_name = "aps1" parameter.

plot_vteffect(cate_mod, covariate_name = "aps1")

 

5) Predict CATEs for target population

For prediction, the aggregation method must be set to ‘studyspecific’. While only the studyspecific aggregation is supported for prediction, the estimation method can be either “causal forest” or “s-learner”. To illustrate, we can randomly select 100 observations from the original dataset as a simplified target population. In practice, the target data can include any set of covariate profiles for which the researcher wants to generate prediction intervals. However, an important assumption is that the covariate profiles in the target data fall within the distribution of at least one of the studies used to estimate the CATE—this is known as the positivity assumption.

The predict() function with estimate_cate() objects in multicate implements a two-stage meta-analysis approach to generate prediction intervals:

  1. Stage1: A model is fit within each study to estimate CATE given covariate profile in the target data.
  2. Stage2: A random effects meta-analysis is then applied to combine the study-specific CATE estiamtes and quantify uncertainty.

The resulting prediction interval for the CATE for covariate profile in the target setting represents a range of potential values that the CATE may be in the new setting. For more detailed assumptions and mathematical formulations, refer to the link.4

new_dat <- sample_n(rhc_clean, 100)

This prediction results include three additional columns beyond the original dataet. - tau_predicted: the estimated Conditional Average Treatment Effect (CATE) - pi_lower: the lower bound of the prediction interval - pi_upper: the upper bound of the prediction interval

tau_predicted pi_lower pi_upper treatment center.x scoma1 race ninsclas hrt1 resp1 bili1 pafi1 aps1 temp1 das2d3pc sod1 liverhx race_other race_white ninsclas_medicare ninsclas_medicare_medicaid ninsclas_no_insurance ninsclas_private ninsclas_private_medicare center.y
-1.9615021 -6.991720 3.0687158 1 3 44 white Medicare 35 36 1.0097656 166.6562 102 32.19531 15.40625 136 0 0 1 1 0 0 0 0 3
-2.7086885 -8.486070 3.0686932 0 3 9 white Private_Medicare 55 26 0.5999756 282.0000 46 36.50000 11.00000 144 0 0 1 0 0 0 0 1 3
-0.2958267 -6.900161 6.3085077 0 1 9 black Medicare 120 26 0.1999817 134.2812 61 35.89844 14.75195 132 0 0 0 1 0 0 0 0 1
-0.7522934 -4.898785 3.3941985 0 3 0 white Medicare 121 28 1.0097656 199.9688 74 35.00000 21.94922 133 0 0 1 1 0 0 0 0 3
-4.8684897 -9.389990 -0.3469899 0 2 100 white Private_Medicare 70 12 1.6999512 103.3281 30 36.50000 22.60547 138 0 0 1 0 0 0 0 1 2
-1.8268046 -6.305585 2.6519755 0 1 44 black Medicaid 150 48 0.3999634 148.3125 40 39.09375 14.75195 142 0 0 0 0 0 0 0 0 1
-3.2989329 -8.368655 1.7707893 0 3 0 white Medicare_Medicaid 43 36 0.5000000 357.1250 43 35.29688 19.00391 139 0 0 1 0 1 0 0 0 3
-1.4803195 -6.780706 3.8200675 0 2 0 white Private 122 37 2.8999023 233.3125 54 38.50000 20.11719 151 0 0 1 0 0 0 1 0 2
-2.8276971 -6.971732 1.3163381 1 3 0 black Private 122 42 1.0097656 276.0000 63 38.50000 29.14844 135 0 0 0 0 0 0 1 0 3
-3.6820256 -9.128999 1.7649482 1 4 0 white Private 144 40 1.0097656 182.0000 89 39.89844 19.33203 121 0 0 1 0 0 0 1 0 4

 

Which dataset is suitable?

 

Q. What kinds of data should I have?

While the primary purpose of this package is for RCTs, but the base dataset can be from randomized controlled trials, observational studies, or a combination of both. Our functions are primarily designed for IPD (Individual Patient Data). The dataset must contain at least 4 key components, a study indicator variable, a treatment indicator variable, An outcome variable, and covariate variables.

 

Q. What data format is required?

You should be careful to use which data formats your data has. For this package, the treatment variable should be numeric (e.g. 0 or 1), the study ID should be character (factor is allowed.), and the outcome must be numeric. Note that this package is primarily designed for continuous outcome, and covariates should be all numeric.

Data column Data format
treatment numeric(0,1)
studyid character
outcome numeric
covariates numeric

 

Q. Can I have missing values?

No. Packages like grf (used for causal_forest) do not support missing values. You must either remove rows with NA or impute missing values before analysis.

 

Q. What if I have categorical or factor variables in my covariates?

If you need to include factor variables, we recommend the following:

The brief summary of solutions suggested by the grf vignett is as follows:

  1. One-hot encoding: Use fastdummies to convert categorical variables into dummy variables easily.
  • Ensure that values do not contain spaces or special characters. For example, convert ‘study 1’ to ‘study_1’.
  1. Means encoding: Use the sufrep package for target/means encoding.

  2. Oridinal encoding: If your categorical variable has a natural order (e.g. months), encode it as a numeric variable accordingly.

See the grf vignette on categorical inputs for mode deail.5

 

Q. I applied one-hot encoding but it is still not working-Why?

Ensure that the column names or formulas passed to estimate_cate() do not contain spaces or logical connectors (like ‘and’). The janitor package is helpful for cleaning column names by replacing spaces with underscores (e.g., ‘variable name’ becomes ‘variable_name’). This step is especially important when using the ensemble forest method.

 

Q. Why did I encounter errors when using the causal forest with the study-specific method?

These errors may be caused by lack of sufficient variation or sparsity in one or more covariates. When a variable has very few observations in a particular category (e.g., a binary variable with only a handful of “Yes” values), it can lead to unstable splits in tree-based models like causal forest, or to overfitting, especially in small subgroups.

Before applying the model to clinical data, it is recommended to examine the distribution of each covariate to identify rare levels or highly imbalanced variables that may affect model performance.

 

Q. What other packages are used internally?

  • slearner: Based on the dbarts and grfpackages.
    • The detailed arguments can be found in the paper by Hill (2011)6 and Künzel (2019)7. The default arguments are as follows: keeptrees is set to TRUE, and verbose is set to FALSE when aggregation_method is set to ensembleforest.
  • causalforest: Based on the package grf’s causal_forest.
    • You can refer to the detailed arguments in the paper by Athey et al. (2019)8. This is estimated by averaging the treatment effect across patients, accounting for counterfactual assignments. The default arguments are as follows: importance is set to impurity, and keep.inbag is set to TRUE.

 

Q. How do I input my data into the function parameters?

To estiamte CATE, you can use the function estiamte_cate() and this function requires parameter like below:

  • estimation_method (string): Choose ‘slearner’ or ‘causalforest’
  • aggregation_method (String): Choose one of options below.
      1. studyindicator: Pool all data together but keep study ID as an indicator, including them as covariates in the single-study method.
      1. ensembleforest: Fit the estimation model within each study, apply each model to all indivudals across all studies, then fit ensemble random forest to augmented data.
      1. studyspecific: Fit the estimation model separately for each study, and report out study-specific estimates.
  • study_col (String): Name of the column of study indicator.Note that study_col can be NULL only when aggregation method is studyspecific. In other cases, study_col parameter should be filled with the specific column name, like ‘studyid’.
  • treatment_col (String): Name of the column of treatment.
  • outcome_col (String): Name of the column of outcome.
  • covariate_col (Vector): Names of the columns of covariates. The default is NULL. The length of string should be the same with the number of covariates. You can type the list like c('var1','var2','var3').
  • drop_col (Vector): Name of the columns to be deleted. You can type the list like c('var1','var2','var3'). Default is NULL.
  • extra_args (List): Add additional arguments for each estimation method. For example, when you choose causal forest for the estimation method you may adjust extra arguments to compelte splitting tress faster by setting a smaller number of trees. (e.g.extra_args = list(num.trees = 50))

 

Q. What kinds of objects are returned?

The result objects of estimate_cate() and summary() provide different information.

  • estimate_cate() result object:
    • ‘GRF forest object of type causal_forest’ (e.g. Number of trees, Number of training samples)
    • ‘variance importance tibble’
    • ‘original data with additional two columns, tau hat and variance of estimates’.
  • summary() of the result object of estimate_cate():
    • overall ATE (object ate)
    • its standard error
    • study-specific ATE (object studycate) with minimum, median, maximum CATE.

 

Q. Why can’t I get Plot 4 (BLP) with ensemble forest?

Plot4 BLP(Best Linear projection figure) is not available for the ensemble forest method. You can refer to the table below to see which plots are supported by each combination of aggregation and estimation methods.

No Plots from plot() Aggregation Method Estimation Method
1 Histogram of estimated CATEs Available to all Available to all
2 Boxplot of CATEs by study ID Available to all Available to all
3 95% CI for all CATEs Study specific, Study indicator Available to all
4 Best Linear Projection Study specific, Study indicator Causal forest
5 Interpretation tree Available to all Available to all

 


  1. Brantner, C. L., Nguyen, T. Q., Tang, T., Zhao, C., Hong, H., & Stuart, E. A. (2024). Comparison of methods that combine multiple randomized trials to estimate heterogeneous treatment effects. Statistics in medicine, 43(7), 1291-1314. https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9955↩︎

  2. https://hbiostat.org/data↩︎

  3. https://hbiostat.org/data/repo/rhc↩︎

  4. https://github.com/dobengjhu/multicate/blob/main/R/predict_cate.R↩︎

  5. https://grf-labs.github.io/grf/articles/categorical_inputs.html↩︎

  6. Hill, J. L. (2011). Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics, 20(1), 217–240. https://doi.org/10.1198/jcgs.2010.08162↩︎

  7. Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences, 116(10), 4156-4165. https://www.pnas.org/doi/abs/10.1073/pnas.1804597116↩︎

  8. Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. https://arxiv.org/abs/1610.01271↩︎