The main wrapper function for conducting ipd using various methods and models. It returns a list of fitted model components.
Usage
ipd(
  formula,
  method,
  model,
  data,
  label = NULL,
  unlabeled_data = NULL,
  seed = NULL,
  intercept = TRUE,
  alpha = 0.05,
  alternative = "two-sided",
  n_t = Inf,
  na_action = "na.fail",
  ...
)
Arguments
- formula
An object of class formula: a symbolic description of the model to be fitted. Must be of the form Y - f ~ X, where Y is the name of the column corresponding to the observed outcome in the labeled data, f is the name of the column corresponding to the predicted outcome in both labeled and unlabeled data, and X corresponds to the features of interest (i.e., X = X1 + ... + Xp).
- method
The method to be used for fitting the model. Must be one of "postpi_analytic", "postpi_boot", "ppi", "pspa", or "ppi_plusplus".
- model
The type of model to be fitted. Must be one of "mean", "quantile", "ols", or "logistic".
- data
A data.frame containing the variables in the model, either a stacked data frame with a specific column identifying the labeled versus unlabeled observations (label), or only the labeled data set. Must contain columns for the observed outcomes (Y), the predicted outcomes (f), and the features (X) needed to specify the formula.
- label
A string, int, or logical specifying the column in the data that distinguishes between the labeled and unlabeled observations. See the Details section for more information. If NULL, unlabeled_data must be specified.
- unlabeled_data
(optional) A data.frame of unlabeled data. If NULL, label must be specified. Specifying both the label and unlabeled_data arguments will result in an error message. If specified, must contain columns for the predicted outcomes (f) and the features (X) needed to specify the formula.
- seed
(optional) An integer seed for random number generation.
- intercept
Logical. Should an intercept be included in the model? Default is TRUE.
- alpha
The significance level for confidence intervals. Default is 0.05.
- alternative
A string specifying the alternative hypothesis. Must be one of "two-sided", "less", or "greater".
- n_t
(integer, optional) Size of the dataset used to train the prediction function (necessary for the "postpi" methods if n_t < nrow(X_l)). Defaults to Inf.
- na_action
(string, optional) How missing covariate data should be handled. Currently "na.fail" and "na.omit" are accommodated. Defaults to "na.fail".
- ...
Additional arguments to be passed to the fitting function. See the Details section for more information.
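The alpha and alternative arguments together determine the interval reported for each coefficient. As a rough illustration, the sketch below builds a standard normal-approximation (Wald) interval in base R. The helper wald_ci is hypothetical and not part of the package, and the individual methods may construct their intervals differently.

```r
# Hypothetical helper (not part of the ipd package): normal-approximation
# confidence interval for one estimate, driven by alpha and alternative.
wald_ci <- function(est, se, alpha = 0.05,
                    alternative = c("two-sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    "two-sided" = est + c(-1, 1) * qnorm(1 - alpha / 2) * se,
    "less"      = c(-Inf, est + qnorm(1 - alpha) * se),
    "greater"   = c(est - qnorm(1 - alpha) * se, Inf)
  )
}

# Two-sided 95% interval, then a one-sided interval with a lower bound only.
wald_ci(est = 0.90, se = 0.10)
wald_ci(est = 0.90, se = 0.10, alternative = "greater")
```

The one-sided conventions used here (an upper bound for "less", a lower bound for "greater") follow the usual base R hypothesis-test convention and are an assumption about how the argument is interpreted.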
Value
A summary of model output, returned as a list containing the fitted model components:
- coefficients
Estimated coefficients of the model
- se
Standard errors of the estimated coefficients
- ci
Confidence intervals for the estimated coefficients
- formula
The formula used to fit the ipd model.
- data
The data frame used for model fitting.
- method
The method used for model fitting.
- model
The type of model fitted.
- intercept
Logical. Indicates if an intercept was included in the model.
- fit
Fitted model object containing estimated coefficients, standard errors, confidence intervals, and additional method-specific output.
- ...
Additional output specific to the method used.
Details
1. Formula:
The ipd function uses one formula argument that specifies both the
calibrating model (e.g., PostPI "relationship model", PPI "rectifier" model)
and the inferential model. These separate models will be created internally
based on the specific method called.
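To make the single-formula idea concrete, the sketch below pulls the outcome, prediction, and feature names out of a formula of the form Y - f ~ X using base R only. The helper split_ipd_formula and the two derived formulas are hypothetical illustrations; the package's internal parsing, and the exact models each method builds, may differ.

```r
# Hypothetical helper (not part of the ipd package): split a formula of the
# form Y - f ~ X1 + ... + Xp into its three roles using base R only.
split_ipd_formula <- function(formula) {
  stopifnot(inherits(formula, "formula"), length(formula) == 3L)
  lhs <- all.vars(formula[[2]])  # variables on the left-hand side: Y and f
  rhs <- all.vars(formula[[3]])  # variables on the right-hand side: features
  if (length(lhs) != 2L) {
    stop("left-hand side must be of the form Y - f")
  }
  list(outcome = lhs[1], prediction = lhs[2], features = rhs)
}

parts <- split_ipd_formula(Y - f ~ X1 + X2)

# Assumed shapes of the two internal models, for illustration only:
# one relating Y to f, and one relating f to the features.
calibration_formula <- reformulate(parts$prediction, response = parts$outcome)
inferential_formula <- reformulate(parts$features, response = parts$prediction)
```

Here parts holds the column names "Y", "f", and the features, so a method could assemble whichever internal formulas it needs from one user-supplied formula.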
2. Data:
The data can be specified in two ways:
(1) A single data argument (data) containing a stacked data.frame and a label identifier (label).
(2) Two data arguments, one for the labeled data (data) and one for the unlabeled data (unlabeled_data).
For option (1), provide one data argument (data) that contains a
stacked data.frame with both the labeled and unlabeled data, and a
label argument that specifies the column identifying the labeled
versus the unlabeled observations in the stacked data.frame.
NOTE: Labeled data identifiers can be:
- String
"l", "lab", "label", "labeled", "labelled", "tst", "test", "true"
- Logical
TRUE
- Factor
Non-reference category (i.e., binary 1)
Unlabeled data identifiers can be:
- String
"u", "unlab", "unlabeled", "unlabelled", "val", "validation", "false"
- Logical
FALSE
- Factor
Reference category (i.e., binary 0)
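The identifier lists above can be pictured as a simple lookup. The following base R sketch maps a label column to TRUE (labeled) or FALSE (unlabeled). The helper is_labeled is hypothetical and is not the package's own code; in particular, treating the first factor level as the unlabeled category is an assumption made for illustration.

```r
# Hypothetical helper (not the package's own code): map a label column to
# TRUE (labeled) / FALSE (unlabeled) using the identifiers listed above.
labeled_ids   <- c("l", "lab", "label", "labeled", "labelled",
                   "tst", "test", "true")
unlabeled_ids <- c("u", "unlab", "unlabeled", "unlabelled",
                   "val", "validation", "false")

is_labeled <- function(x) {
  if (is.logical(x)) return(x)  # TRUE = labeled, FALSE = unlabeled
  if (is.factor(x)) {
    # Assumption: the first (reference) level marks the unlabeled rows.
    return(as.integer(x) > 1L)
  }
  x <- as.character(x)
  unknown <- setdiff(unique(x), c(labeled_ids, unlabeled_ids))
  if (length(unknown) > 0L) {
    stop("unrecognized label value(s): ", paste(unknown, collapse = ", "))
  }
  x %in% labeled_ids
}

is_labeled(c("labeled", "unlabeled", "test", "val"))
is_labeled(c(TRUE, FALSE))
```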
For option (2), provide separate data arguments for the labeled data set
(data) and the unlabeled data set (unlabeled_data). If the
second argument is provided, the function ignores the label identifier and
assumes the data provided is not stacked.
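The two layouts carry the same information, as the base R sketch below tries to show with toy data. The column names Y, f, X1, and set are illustrative choices, and the ipd() calls are left as comments so that the snippet runs without the package installed.

```r
set.seed(1)
n <- 5

# Toy labeled data: observed outcome Y, prediction f, and one feature X1.
labeled <- data.frame(Y = rnorm(n), f = rnorm(n), X1 = rnorm(n))

# Toy unlabeled data: no observed outcome, only f and X1.
unlabeled <- data.frame(f = rnorm(n), X1 = rnorm(n))

# Option (1): one stacked data.frame plus a column identifying each row.
stacked <- rbind(
  data.frame(labeled, set = "labeled"),
  data.frame(Y = NA_real_, unlabeled, set = "unlabeled")
)
# ipd(Y - f ~ X1, method = "ppi", model = "ols", data = stacked, label = "set")

# Option (2): keep the two data sets separate.
# ipd(Y - f ~ X1, method = "ppi", model = "ols",
#     data = labeled, unlabeled_data = unlabeled)

# Splitting the stacked layout recovers the two separate data sets.
labeled_again   <- stacked[stacked$set == "labeled", c("Y", "f", "X1")]
unlabeled_again <- stacked[stacked$set == "unlabeled", c("f", "X1")]
```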
3. Method:
Use the method argument to specify the fitting method:
- "postpi_analytic", "postpi_boot"
Wang et al. (2020) Post-Prediction Inference (PostPI), with the analytic or bootstrap correction
- "ppi"
Angelopoulos et al. (2023) Prediction-Powered Inference (PPI)
- "ppi_plusplus"
Angelopoulos et al. (2023) PPI++
- "pspa"
Miao et al. (2023) Assumption-Lean and Data-Adaptive Post-Prediction Inference (PSPA)
4. Model:
Use the model argument to specify the type of model:
- "mean"
Mean value of the outcome
- "quantile"
qth quantile of the outcome
- "ols"
Linear regression
- "logistic"
Logistic regression
- "poisson"
Poisson regression
The ipd wrapper function will concatenate the method and model
arguments to identify the required helper function, following
the naming convention "method_model".
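A minimal sketch of this kind of name-based dispatch in base R is given below. The functions ppi_ols and pspa_mean defined here are stand-ins that only return a string, and dispatch is a hypothetical name; the sketch illustrates the "method_model" convention and is not the wrapper's actual code.

```r
# Stand-in helpers for illustration only; they just report that they ran.
ppi_ols   <- function(...) "called ppi_ols"
pspa_mean <- function(...) "called pspa_mean"

# Hypothetical dispatcher: build "method_model" and look the function up.
dispatch <- function(method, model, ...) {
  helper_name <- paste(method, model, sep = "_")
  if (!exists(helper_name, mode = "function")) {
    stop("no helper named '", helper_name, "'")
  }
  helper <- get(helper_name, mode = "function")
  helper(...)
}

dispatch("ppi", "ols")
dispatch("pspa", "mean")
```

Under this convention, method = "postpi_boot" with model = "logistic" would resolve to a helper named "postpi_boot_logistic".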
5. Auxiliary Arguments:
The wrapper function will take method-specific auxiliary arguments (e.g.,
q for the quantile estimation models) and pass them to the helper
function through the "..." with specified defaults for simplicity.
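The pass-through can be illustrated with a toy wrapper and helper in base R. Both function names below are hypothetical, and the default of q = 0.5 is an assumption chosen for the example rather than a documented package default.

```r
# Hypothetical helper standing in for a quantile method; q defaults to 0.5.
toy_quantile_helper <- function(y, q = 0.5) {
  unname(quantile(y, probs = q))
}

# Hypothetical wrapper: anything it does not name itself travels through "...".
toy_wrapper <- function(y, ...) {
  toy_quantile_helper(y, ...)
}

y <- c(1, 2, 3, 4, 5)
toy_wrapper(y)            # falls back on the helper's default q
toy_wrapper(y, q = 0.25)  # overrides the default through "..."
```

Because the default lives in the helper, callers who do not care about q can omit it, while callers who do can supply it without the wrapper needing a dedicated argument.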
6. Other Arguments:
All other arguments that relate to all methods (e.g., alpha, ci.type), or other method-specific arguments, will have defaults.
Examples
#-- Generate Example Data
set.seed(12345)
dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)
head(dat)
#>           X1          X2         X3          X4         Y  f      set
#> 1  0.5855288 -0.78486098  1.1872102  1.05076285 1.4008570 NA training
#> 2  0.7094660 -2.56005244 -0.3567140 -0.07179733 4.1079201 NA training
#> 3 -0.1093033  0.07280078  1.2122385  0.11673662 1.4501726 NA training
#> 4 -0.4534972  0.75024358 -0.6939527  0.97786651 1.2987926 NA training
#> 5  0.6058875 -0.12824888  1.3560616 -1.03154201 2.5256490 NA training
#> 6 -1.8179560 -0.48786673  0.9057313  2.19912933 0.2889297 NA training
formula <- Y - f ~ X1
#-- PostPI Analytic Correction (Wang et al., 2020)
ipd(formula, method = "postpi_analytic", model = "ols",
  data = dat, label = "set")
#>
#> Call:
#> Y - f ~ X1
#>
#> Coefficients:
#> (Intercept)          X1
#>   0.7654975   0.8975856
#-- PostPI Bootstrap Correction (Wang et al., 2020)
nboot <- 200
ipd(formula, method = "postpi_boot", model = "ols",
  data = dat, label = "set", nboot = nboot)
#>
#> Call:
#> Y - f ~ X1
#>
#> Coefficients:
#> (Intercept)          X1
#>   0.7725950   0.9026754
#-- PPI (Angelopoulos et al., 2023)
ipd(formula, method = "ppi", model = "ols",
  data = dat, label = "set")
#>
#> Call:
#> Y - f ~ X1
#>
#> Coefficients:
#>                  [,1]
#> (Intercept) 0.7899574
#> X1          0.8012371
#> attr(,"names")
#> [1] "(Intercept)" "X1"
#-- PPI++ (Angelopoulos et al., 2023)
ipd(formula, method = "ppi_plusplus", model = "ols",
  data = dat, label = "set")
#>
#> Call:
#> Y - f ~ X1
#>
#> Coefficients:
#> (Intercept)          X1
#>   0.7341888   0.7718728
#-- PSPA (Miao et al., 2023)
ipd(formula, method = "pspa", model = "ols",
  data = dat, label = "set")
#>
#> Call:
#> Y - f ~ X1
#>
#> Coefficients:
#> (Intercept)          X1
#>   0.7306059   0.7744542