Anatomy of Biological Sequence Modeling

Overview

The ambitious journey to capture the long evolutionary history of biological sequences is reaching important landmarks—ESM3, Evo2, NTv3, and others—by implicitly learning the constraints underlying contemporary biological sequences.

A previous blog post emphasized the importance of negative examples, which are often under-represented due to selective pressure. However, modern approaches struggle even with observable biological sequences.

Evo2, a genomic sequence model trained on 9.3 trillion DNA base pairs from prokaryotic and eukaryotic genomes, is one of the open-sourced landmarks of this journey. It provides a valuable template covering tokenization, architecture design, zero-shot evaluation, probing, and mechanistic interpretability analysis.

We therefore aim to explain key ingredients in biological sequence modeling using Evo2 as an example.

Tokenization

Evo 2 is trained using next-token-prediction on the byte-tokenized OpenGenome2 dataset.

Beyond this simple statement, there are detailed considerations in choosing a tokenization approach. We should first note that conventional tokenizers from the NLP field and from classical sequence models, including BPE (byte-pair encoding) and k-mer tokenization, are promising starting points.

However, we can question the optimality of the tokenizer and conjecture that the optimal tokenization strategy for biological sequences should differ from that of natural languages. One remarkable fact is that BPE is an information-theoretic approach that is more flexible than k-mer tokenization, thanks to the choice of vocabulary size and training corpus. Specifically, we can modify a BPE tokenizer to operate at different granularities (by changing the vocabulary size) and encourage it to explicitly capture biologically informative chunks of nucleotides (Zhou et al.).

Benchmark results from Zhou et al. demonstrate that no single tokenizer consistently outperformed the others. This suggests that the tokenizer itself can be optimized in a data-driven way, an idea recently realized by Qiao et al. and by anonymous authors (DNAChunker).

The motivation for these adaptive approaches is reasonable, since conventional tokenizers rely on hand-crafted heuristics. However, Evo2 has shown that a byte-level tokenizer is sufficient to achieve long-range contextual reasoning (>1M base pairs) when paired with careful sequence-modeling choices (StripedHyena 2). No matter how learnable tokens are produced, scientists will now expect such optimized tokens to capture important biological properties and to perform well on long-range reasoning tasks. In this context, modifying BPE with proper biological considerations could be a lightweight alternative with greater transparency.
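To make the contrast concrete, below is a minimal Python sketch of the two extremes discussed above: byte-level (single-nucleotide) tokenization as used by Evo 2, and a toy BPE trainer whose vocabulary size controls the granularity of the learned chunks. This is purely illustrative; it is not the tokenizer of Evo2, Zhou et al., or DNAChunker, and the corpus is a made-up fragment.

```python
from collections import Counter

def byte_tokenize(seq):
    # Byte-level tokenization: one token per nucleotide, no learned vocabulary.
    return list(seq)

def train_dna_bpe(corpus, vocab_size):
    # Toy BPE trainer: start from single nucleotides and repeatedly merge the
    # most frequent adjacent pair until the vocabulary budget is exhausted.
    # The vocab_size knob is what controls tokenization granularity.
    vocab = {"A", "C", "G", "T"}
    tokenized = [list(seq) for seq in corpus]
    merges = []
    while len(vocab) < vocab_size:
        pair_counts = Counter()
        for toks in tokenized:
            pair_counts.update(zip(toks, toks[1:]))
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]
        merges.append((a, b))
        vocab.add(a + b)
        for i, toks in enumerate(tokenized):   # apply the new merge everywhere
            merged, j = [], 0
            while j < len(toks):
                if j + 1 < len(toks) and (toks[j], toks[j + 1]) == (a, b):
                    merged.append(a + b)
                    j += 2
                else:
                    merged.append(toks[j])
                    j += 1
            tokenized[i] = merged
    return merges, vocab

corpus = ["ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"]   # made-up fragment
print(byte_tokenize(corpus[0])[:6])                   # ['A', 'T', 'G', 'G', 'C', 'C']
print(train_dna_bpe(corpus, vocab_size=12)[0])        # learned merge rules
```

A biologically informed variant could bias the merge step toward codon boundaries or known motifs rather than raw pair frequency, which is the kind of modification referred to above.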

Sequence modeling

Evo 2 uses StripedHyena 2, the first multi-hybrid architecture based on input-dependent convolutions.

We roughly demonstrate, step by step, the correspondence between convolutions and SSMs (state space models), and between SSMs and the self-attention mechanism.

Correspondence of convolutions to SSM

A canonical linear SSM is described by the first-order difference equations:

\[x_t = Ax_{t-1} + Bu_t\] \[y_t = Cx_t + Du_t\]

Recursively unrolling this system of equations (with the state initialized to zero) yields a convolution representation of $y_t$:

\[y_t = \sum_{n=0}^t (CA^{t-n}B+D\delta_{t-n})u_n\] \[\left(\text{note that } (f*g)[t] = \sum_n f[n]\,g[t-n] \text{ for the convolution operator } *\right)\]

For this system, we often specify the filter $h$ with respect to the SSM as:

\[h_t = CA^tB+D\delta_t \quad (t \geq 0)\]
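A quick numerical check of this correspondence, using a random single-input single-output system and the recurrence convention above (illustrative NumPy only):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 32                                  # state dimension, sequence length
A = rng.standard_normal((d, d)) * 0.3
B = rng.standard_normal((d, 1))
C = rng.standard_normal((1, d))
D = rng.standard_normal((1, 1))
u = rng.standard_normal(T)

# (1) Recurrent view: x_t = A x_{t-1} + B u_t,  y_t = C x_t + D u_t
x = np.zeros((d, 1))
y_rec = np.zeros(T)
for t in range(T):
    x = A @ x + B * u[t]
    y_rec[t] = (C @ x + D * u[t]).item()

# (2) Convolutional view: y = h * u with filter h_t = C A^t B + D delta_t
h = np.array([(C @ np.linalg.matrix_power(A, t) @ B).item() for t in range(T)])
h[0] += D.item()
y_conv = np.array([sum(h[t - n] * u[n] for n in range(t + 1)) for t in range(T)])

assert np.allclose(y_rec, y_conv)   # the two views agree
```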

Correspondence of SSM to self-attention

General self-attention is a map which performs:

\[y_i = \frac{\phi(Q_i)^T\sum_{j=1}^i \phi(K_j)V_j^T}{\phi(Q_i)^T\sum_{j=1}^i \phi(K_j)} \quad (*)\]

where $Q_i, K_i, V_i$ are the query, key, and value projections of the input $u$ and $\phi(\cdot)$ is a nonlinear feature map.

This representation is often framed as applying the attention matrix $A=\text{softmax}(QK^T)$ to the values $V$, but we emphasize the decomposed version because the result can be computed efficiently without explicitly constructing $A$, by sequentially accumulating the products involving $\phi(Q), \phi(K)$.

By denoting $\sum_{j=1}^i\phi(K_j)V_j^T$ as $S_i$ and choosing $\phi(\cdot)$ such that $\phi(Q_i)^T\sum_{j=1}^i\phi(K_j)=1$, we notice that equation $(*)$ can be rewritten in the language of SSMs:

\[S_i = S_{i-1}+\phi(K_i)V_i^T\] \[y_i = \phi(Q_i)^TS_i\]

We have thus represented the self-attention map as an SSM, and an SSM as a convolution. Since the resulting convolutions can be accelerated with the FlashConv algorithm, self-attention can be approximated in a computationally efficient way.
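To see the equivalence concretely, the sketch below evaluates the masked attention map both directly via $(*)$ and recurrently via the state $S_i$. The normalizer is kept explicit and tracked with its own running sum instead of assuming $\phi(Q_i)^T\sum_j\phi(K_j)=1$, and the feature map $\phi=\exp$ is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, dv = 16, 8, 8
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, dv))
phi = np.exp                         # arbitrary positive feature map

# (1) Direct evaluation of (*), normalizer kept explicit
y_direct = np.zeros((T, dv))
for i in range(T):
    num = phi(Q[i]) @ (phi(K[: i + 1]).T @ V[: i + 1])   # phi(Q_i)^T sum_j phi(K_j) V_j^T
    den = phi(Q[i]) @ phi(K[: i + 1]).sum(axis=0)        # phi(Q_i)^T sum_j phi(K_j)
    y_direct[i] = num / den

# (2) Recurrent (SSM-like) evaluation: S_i = S_{i-1} + phi(K_i) V_i^T
S = np.zeros((d, dv))
z = np.zeros(d)                      # running normalizer state
y_rec = np.zeros((T, dv))
for i in range(T):
    S += np.outer(phi(K[i]), V[i])
    z += phi(K[i])
    y_rec[i] = (phi(Q[i]) @ S) / (phi(Q[i]) @ z)

assert np.allclose(y_direct, y_rec)  # constant memory per step, same result
```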

Recalling that convolution can be performed by element-wise multiplication in the spectral domain, Hyena can be viewed as an alternating application of convolutions in the time and frequency domains. Poli et al. abstractly explain the effectiveness of this procedure as follows:

  • Convolution in the time domain increases the memory length, allowing for a broader context to be taken into account
  • Element-wise multiplication in the time domain allows for more fine-grained selection of specific frequency components of the signal

Now we can define the Hyena operator as an extension of the H3 mechanism:

\[\text{H3}(q,k,v) = A(q,k)\,v = D_q S_\psi D_k S_\phi\, v\]

where $S_\psi, S_\phi$ are the Toeplitz matrices of SSMs with filters $\psi, \phi$, and $D_q, D_k$ are diagonal matrices constructed from the projections $q, k$. The Hyena operator extends H3 to an arbitrary number of projections and filters:

\[\text{Hyena}(x_1,\dots,x_N,v) = A(x_1,\dots,x_N)\,v = D_{x_N} S_{h_N} \cdots D_{x_1} S_{h_1}\, v\]
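A minimal NumPy sketch of the operator just defined, with random stand-ins for the projections $x_i$ and filters $h_i$ (in the actual Hyena, projections are linear functions of the input and the long filters are parameterized implicitly by a small network; FFT-based convolution keeps the cost near $O(L\log L)$):

```python
import numpy as np

def causal_fft_conv(u, h):
    # Long causal convolution via FFT: zero-pad to 2L to avoid circular wrap-around.
    L = len(u)
    n = 2 * L
    return np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(h, n), n)[:L]

def hyena_operator(projections, filters, v):
    # y = D_{x_N} S_{h_N} ... D_{x_1} S_{h_1} v :
    # alternate causal convolution (Toeplitz S_h) with element-wise gating (diagonal D_x).
    y = v
    for x, h in zip(projections, filters):
        y = x * causal_fft_conv(y, h)
    return y

rng = np.random.default_rng(0)
L, N = 128, 2                                                # sequence length, operator order
v = rng.standard_normal(L)                                   # "value" projection of the input
projections = [rng.standard_normal(L) for _ in range(N)]     # stand-ins for learned projections
filters = [rng.standard_normal(L) * 0.1 for _ in range(N)]   # stand-ins for implicit long filters
y = hyena_operator(projections, filters, v)
```

With $N=2$ and suitable filter choices (a shift filter and a diagonal-SSM filter), this loop reduces to the H3 structure above.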

The StripedHyena architecture, composed of Hyena operators and rotary attention layers, was used in Evo (Nguyen et al.). Extending StripedHyena, StripedHyena 2 applies multiple Hyena operators whose inner filters follow different design principles: SE (short explicit), MR (medium regularized), and LI (long implicit). Ablation experiments at the 7B scale demonstrated that the SE-MR-LI arrangement surpassed alternative designs, and Evo2 uses this architecture to train its 40B-parameter model.

Evaluation and Benchmarks

Perplexity, Long-context recall, Mutational effects, Variant pathogenicity

We have previously discussed, in an earlier blog post, what properties a biological sequence model might possess, and distinguished objectives achievable through fine-tuning from those achievable through zero-shot inference (e.g., probing).

In addition to these examples, Evo2 was also evaluated with novel benchmarks.

Briefly, the perplexity metric traditionally used in NLP was used to determine the model architecture and to demonstrate the benefit of model scale and context length. Additionally, Evo2 uses a needle-in-a-haystack benchmark, synthetically inserting 100 bp needles into long haystack sequences ($2^9$ to $2^{20}$ base pairs) to evaluate retrieval strength. Formally, retrieval is measured with the Euclidean magnitude of the categorical Jacobian (Zhang et al.), which quantifies the effect on the model's predictions at the query position when the needle sequence is mutated.
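As a rough sketch of that score (a finite-difference reading of the categorical Jacobian; the exact construction in Zhang et al. and in Evo2's evaluation may differ in details, and `log_probs` is a hypothetical wrapper returning per-position log-probabilities from the model):

```python
import numpy as np

BASES = "ACGT"

def needle_retrieval_score(log_probs, seq, query_pos, needle_start, needle_end):
    # How much do the model's predictions at query_pos move when the needle is mutated?
    # log_probs(seq) is a hypothetical oracle returning an (L, 4) array of
    # per-position log-probabilities over A/C/G/T.
    reference = log_probs(seq)[query_pos]
    deltas = []
    for j in range(needle_start, needle_end):
        for b in BASES:
            if b == seq[j]:
                continue
            mutated = seq[:j] + b + seq[j + 1:]
            deltas.append(log_probs(mutated)[query_pos] - reference)
    # Euclidean magnitude of the (query, needle) block of the categorical Jacobian
    return np.linalg.norm(np.stack(deltas))
```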

For inferring variant effects, we can explicitly mutate a sequence and observe the resulting change in likelihood, using Evo2 as an oracle. The calculated likelihood difference can be used directly for zero-shot inference or as a feature for training a simple supervised classifier, targeting variant fitness prediction, exon classification, gene/lncRNA essentiality prediction, and variant pathogenicity prediction.
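A minimal sketch of this zero-shot scoring scheme; `log_likelihood` stands in for any oracle that returns the total log-likelihood of a DNA string (for instance, a wrapper around an Evo 2 forward pass), and all names here are illustrative:

```python
def score_variant(log_likelihood, ref_seq, pos, alt_base):
    # Delta log-likelihood between the mutated sequence and the reference:
    # negative scores mean the variant makes the sequence less likely under the model.
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return log_likelihood(alt_seq) - log_likelihood(ref_seq)

# The raw scores can be ranked directly (zero-shot), or collected as features to
# train a small supervised classifier (e.g. for pathogenicity prediction).
# delta = score_variant(evo2_log_likelihood, ref_window, pos=4096, alt_base="T")  # hypothetical usage
```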

Mechanistic Interpretability

Batch-TopK SAE was trained on Evo 2’s representations at layer 26

Over the past two decades, annotation and cataloging of genomic sequences have continuously advanced, with these results primarily obtained through observational studies. Generally speaking, if a comprehensive biological sequence model exists, we can leverage the catalog to annotate features and then use the model's extrapolation ability to perform massive in silico evaluation and prioritization across entire sequences.

An SAE (sparse autoencoder) is a dictionary learning approach that decomposes polysemantic neurons into features that activate with high specificity and sensitivity for a hypothesized context. The hope is that such dictionary learning approaches have sufficient capacity to disentangle intermingled concepts into monosemantic ones at the level of human cognition. Sparsity regularization is what keeps SAE features sparse: without it, whenever the latent dimension is larger than the input dimension, the identity mapping becomes a trivial solution to the reconstruction objective.

Compared to a vanilla SAE, a TopK SAE (which enforces an L0 constraint instead of an L1 penalty) is superior at reducing dead latents (Gao et al.), so Evo2 chose a BatchTopK SAE. Another notable choice is that Evo2 applied the SAE to layer 26 rather than, say, the penultimate layer. This is also reasonable given that late layers tend to show higher concept importance across diverse ablation studies (Fel et al., where the penultimate layer was optimal).
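For concreteness, here is a minimal PyTorch sketch of a TopK sparse autoencoder; the dimensions and $k$ are hypothetical, and Evo2's actual choice is the BatchTopK variant, which (to my understanding) selects the top activations across a whole batch rather than per token, letting per-token sparsity vary.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder: sparsity is enforced by an L0-style
    top-k constraint on the latent activations rather than an L1 penalty."""

    def __init__(self, d_model, d_latent, k):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))                        # candidate feature activations
        topk = torch.topk(z, self.k, dim=-1)                   # keep only the k largest per token
        z_sparse = torch.zeros_like(z).scatter(-1, topk.indices, topk.values)
        return self.decoder(z_sparse), z_sparse                # reconstruction + sparse code

# Hypothetical training step on cached layer-26 activations h of shape (batch, d_model):
# sae = TopKSAE(d_model=4096, d_latent=32768, k=64)
# h_hat, z = sae(h)
# loss = ((h_hat - h) ** 2).mean()   # plain reconstruction error; sparsity comes from the top-k constraint
```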

Evo2 mapped annotations to features by collecting and comparing feature activations for annotated sequence elements associated with a specific biological question of interest. As expected, the features form a hierarchy: much of the feature variation accounts for high-level categories such as species (semantic), down to fine-grained associations such as protein secondary structure (structural) and genome organization (ORFs, intergenic regions). Furthermore, applying approaches from the mechanistic interpretability field to design sequences that maximally activate specific features, to discover circuits, and to automate these processes is also a promising direction.

Generative Design

Generation of genomes across the domains of life and guiding sequences toward a target epigenomic signal

The key idea of generative design is to use existing genome sequences as a prompt. Evo2 demonstrates its generative ability across mitochondrial, prokaryotic, and eukaryotic genomes. One remarkable result is that these sequences are generated purely autoregressively; I had expected that external guidance would be needed to achieve such results.

Evo2 also demonstrates inference-time guidance by plugging in a reward signal that steers generated sequences toward a target. The procedure resembles beam search, a traditional approach from NLP: chunks generated by Evo2 are scored with Borzoi and Enformer for their epigenomic features, beam search accepts or rejects candidates, and accepted chunks are appended to the prompt for the next round of generation. Iterating this procedure yields sequences that possess the desired epigenomic signal.
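A minimal sketch of this guidance loop; `generate_chunk` stands in for autoregressive sampling of a short continuation from the language model, and `score` stands in for the external reward (e.g. how well Enformer/Borzoi-predicted epigenomic tracks match the target profile). Beam width, sample counts, and names are all illustrative.

```python
def guided_generation(generate_chunk, score, prompt, n_rounds, beam_width=4, samples_per_beam=8):
    # Beam-search-style guidance: sample several continuations per beam, score the
    # resulting sequences with the external reward model, and keep only the best
    # beams as prompts for the next round.
    beams = [prompt]
    for _ in range(n_rounds):
        candidates = []
        for seq in beams:
            for _ in range(samples_per_beam):
                candidates.append(seq + generate_chunk(seq))
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beams, key=score)
```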

We have already discussed some ambitious directions for applying biological selection systems as an endogenous oracle that provides the reward signal, so I recommend reading the previous blog post.

Conclusion

Biological sequence modeling is transitioning from heuristic-driven representations toward principled, scalable, and mechanistically interpretable systems. Through Evo2, we observe that advances in architecture design, long-context modeling, evaluation methodology, and interpretability can collectively unlock biological reasoning without relying on massive fine-tuning.

While adaptive tokenization and biologically informed representations remain an active area of exploration, Evo2 demonstrates that carefully designed sequence modeling alone can recover long-range dependencies across entire genomes.

As these models become increasingly interpretable and generative, we can consider a new paradigm: one in which biological hypotheses are generated, evaluated, and refined in silico at unprecedented scale. Through the example of Evo2, we have seen how mechanistic interpretability and biologically grounded evaluation can transform sequence modeling into a predictive and generative engine for biology.

Backpropagation through the NMF block

Motivation

We often decompose high-dimensional profiles into low-rank, sparse matrices using factorization algorithms. These algorithms are closely related to dimensionality reduction—the factors they produce naturally encode low-dimensional representations.

Before discussing factorization algorithms further, let's first understand dimensionality reduction as an instance of representation learning. This can be formalized through a universal framework called I-Con: representation learning aligns the transition probabilities (i.e., conditional distributions) defined by a supervisory signal with those of the representations specified by a learnable function.

The core I-Con loss function is:

\[\mathcal{L}(\theta, \phi) =\int_{i\in \mathcal{X}} D_{KL}(p_\theta(\cdot|i)||q_\phi(\cdot|i))=\int_{i\in\mathcal{X}}\int_{j\in\mathcal{X}}p_\theta(j|i)\log\frac{p_\theta(j|i)}{q_\phi(j|i)}\]

Here, $p_\theta$ is a supervisory distribution, while $q_\phi$ is learned by capturing the structure of the desired representation. We can optimize both $\theta,\phi$ (e.g., X-Sample), though most methods optimize only $\phi$. Many representation learning approaches—including t-SNE, SimCLR, and K-Means clustering—can be understood as instances of the I-Con framework.

Now let’s circle back to factorization algorithms. These include PCA, K-Means clustering, and NMF, but here we focus on NMF as a representative factorization algorithm that learns latent factors as linear combinations of high-dimensional features and reconstructs the original features via linear projection. Note that NMF (nonnegative matrix factorization) is equivalent to K-means clustering on a bipartite graph (where nodes correspond to both samples and features), with a relaxation that allows soft assignments to the K clusters (Ding et al.).

Since NMF provides a framework to interpret high-dimensional features as a coordinated set of factors, it is widely applied in statistical approaches to find modular structure in high-dimensional profiles and in mechanistic interpretability studies of neural activations.

Interestingly, an NMF block can be seamlessly integrated into deep neural networks to enhance interpretability while still benefiting from the effectiveness of automatic backpropagation in neural networks. In this blog post, I take a closer look at the technical details underlying gradient calculation through the NMF block when incorporated into a neural network.


Nonnegative matrix factorization

A previous blog post explained the multiplicative update rule for NMF, which solves the following NP-hard problem:

\[(U,W)=\arg\min_{U\geq0,W\geq0}||A-UW^T||_F^2\]

This problem is not convex with respect to the pair $(U,W)$, but fixing one of the two factors and optimizing the other turns NMF into a pair of convex NNLS problems. We call this alternating NNLS, and the convexity of each sub-problem ensures that alternating minimization eventually reaches a local minimum. Here, we first discuss the technical details of solving the NMF optimization problem with the alternating direction method of multipliers (ADMM).

The standard ADMM practice for incorporating nonnegativity constraints into the objective is to introduce auxiliary variables $\tilde{U}, \tilde{W}$ as follows:

\[\min_{U,\tilde{U},W,\tilde{W}}\frac{1}{2}\|A-\tilde{U}\tilde{W}^T\|^2_F+\delta(U)+\delta(W) \quad \text{s.t. } \tilde{U}=U,\ \tilde{W}=W\] \[\text{where } \delta(H)=0 \text{ if } H\geq 0 \text{ and } +\infty \text{ otherwise.}\]

Introducing $\tilde{U}, \tilde{W}$ may seem redundant, but it separates the

(1) unconstrained optimization from the

(2) constraints ($\delta(\cdot)$) applied to $U,W$.

Note that $\tilde{U},U$ and $\tilde{W},W$ differ during optimization but converge to equality at the limit. During optimization, dual variables $\bar{U},\bar{W}$ balance the objectives of (1) and (2).

Following standard ADMM practice, we create an augmented Lagrangian incorporating these constraints:

\[\mathcal{L}(A,U,W,\tilde{U},\tilde{W},\bar{U},\bar{W})=\] \[\frac{1}{2}\|A-\tilde{U}\tilde{W}^T\|_F^2 + \delta(U)+ \delta(W)\] \[+\langle\bar{U},\tilde{U}-U\rangle+\langle\bar{W},\tilde{W}-W\rangle\] \[+\frac{\rho}{2}\left(\|\tilde{U}-U\|_F^2+\|\tilde{W}-W\|_F^2\right)\]

Here $\langle X, Y\rangle = \operatorname{tr}(X^TY)$ denotes the matrix inner product.

We solve this Lagrangian by decomposing it into a sequence of convex problems.

ADMM iterates over the $(U, \tilde{U}, \bar{U}), (W,\tilde{W},\bar{W})$ triplets as follows:

\[U_{t+1} = \arg\min_{U=\tilde{U}} \frac{1}{2}\|A-\tilde{U}W^T_t\|^2_F+\delta(U)+\frac{\rho}{2}\|\tilde{U}-U\|_F^2\] \[W_{t+1} = \arg\min_{W=\tilde{W}} \frac{1}{2}\|A-U_t\tilde{W}^T\|^2_F+\delta(W)+\frac{\rho}{2}\|\tilde{W}-W\|_F^2\]

Each problem decomposes into three sub-problems solved by ADMM. Their simplicity and efficiency are detailed in Fel et al., Appendix C.2.
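A minimal NumPy sketch of this alternating scheme, written with unscaled dual variables and a fixed penalty $\rho$; the iteration counts and initialization are arbitrary, and the actual sub-problem solvers of Fel et al. may differ in detail:

```python
import numpy as np

def nnls_admm(A, W, rho=1.0, n_iter=100):
    # ADMM for one half-step of NMF:  U = argmin_{U >= 0} 0.5 * ||A - U W^T||_F^2  (W fixed).
    m, r = A.shape[0], W.shape[1]
    U = np.zeros((m, r))          # constrained copy (nonnegative)
    U_tilde = np.zeros((m, r))    # unconstrained copy
    U_bar = np.zeros((m, r))      # dual variable coupling the two copies
    G_inv = np.linalg.inv(W.T @ W + rho * np.eye(r))
    for _ in range(n_iter):
        U_tilde = (A @ W + rho * U - U_bar) @ G_inv    # (1) unconstrained least-squares step
        U = np.maximum(U_tilde + U_bar / rho, 0.0)     # (2) projection onto the nonnegative orthant
        U_bar = U_bar + rho * (U_tilde - U)            # (3) dual ascent enforcing U_tilde = U
    return U

def nmf_admm(A, rank, n_outer=50, rho=1.0, seed=0):
    # Alternate the two convex NNLS problems.
    rng = np.random.default_rng(seed)
    U = np.abs(rng.standard_normal((A.shape[0], rank)))
    W = np.abs(rng.standard_normal((A.shape[1], rank)))
    for _ in range(n_outer):
        U = nnls_admm(A, W, rho=rho)
        W = nnls_admm(A.T, U, rho=rho)
    return U, W
```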


Implicit differentiation of NMF block with Jaxopt

Let’s motivate the use of “implicit differentiation” for backpropagating through the NMF block. Here $A$ is the activation matrix being factorized, produced by the network from an upstream input $X$, and our goal is to compute $\frac{\partial{U}}{\partial{X}}, \frac{\partial{W}}{\partial{X}}$.

The chain rule shows we need to compute the Jacobians $\frac{\partial{U}}{\partial{A}}, \frac{\partial{W}}{\partial{A}}$ and feed them into the automatic differentiation computational graph implemented by PyTorch or TensorFlow.

The Jaxopt library provides an efficient, modular way to perform implicit differentiation. It calculates these Jacobians without explicitly forming the entire Jacobian matrix. Instead, it uses VJP and JVP (vector-Jacobian product and Jacobian-vector product) to reduce the problem to solving a linear system.

General principles are omitted here, but Blondel et al. show that various families of optimality conditions (stationarity conditions, KKT conditions, etc.) reduce to the same recipe by choosing an appropriate optimality function $F$ whose root characterizes the solution, i.e., $F(x^\star(\theta),\theta)=0$.

Specifically, backpropagation through the NMF block stacks the KKT conditions of the two NNLS problems to obtain the optimality function $F$.

For the NMF block, we can perform two-stage backpropagation following these steps:

(1) Construct optimality function $F((U,W,\bar{U},\bar{W}),A)=((UW^T-A)W-\bar{U}, (WU^T-A^T)U-\bar{W},\bar{U}\odot U, \bar{W}\odot W)$.
(2) Jaxopt computes $\frac{\partial(U,W,\bar{U},\bar{W})}{\partial A}= -(\partial_1{F})^{-1}\partial_2F$.
(2’) Here, $(\partial_1F)^{-1}$ is not explicitly computed. Instead, $(\partial_1F) \frac{\partial(U,W,\bar{U},\bar{W})}{\partial A}= -\partial_2F$ is solved by conjugate gradient using JVP $v \mapsto(\partial_1F)v$.
(3) Use the chain rule to compute $\frac{\partial U}{\partial X}=\frac{\partial A}{\partial X}\frac{\partial U}{\partial A}$.
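As a minimal sketch of this recipe with jaxopt, the snippet below differentiates through a single NNLS half-step ($U$ with $W$ fixed). For simplicity it uses a projected-gradient fixed-point condition as the optimality function instead of the explicitly stacked KKT system with dual variables, and the shapes, step counts, and random inputs are all hypothetical; the full NMF block would stack the conditions for both factors as in step (1).

```python
import jax
import jax.numpy as jnp
from jaxopt.implicit_diff import custom_root

W = jnp.abs(jax.random.normal(jax.random.PRNGKey(0), (16, 4)))   # fixed factor (hypothetical shape)

def optimality_fun(U, A):
    # U solves  argmin_{U >= 0} 0.5 * ||A - U W^T||_F^2  iff a projected gradient
    # step leaves it unchanged; the root of this map characterizes the solution.
    grad = (U @ W.T - A) @ W
    return jnp.maximum(U - grad, 0.0) - U

@custom_root(optimality_fun)
def nnls_solver(U_init, A):
    # Forward pass: plain projected gradient descent. The inner solver is
    # interchangeable, because gradients are obtained from optimality_fun via
    # the implicit function theorem, not by unrolling these iterations.
    step = 1.0 / (jnp.linalg.norm(W.T @ W, 2) + 1e-6)
    def body(U, _):
        U = jnp.maximum(U - step * ((U @ W.T - A) @ W), 0.0)
        return U, None
    U, _ = jax.lax.scan(body, U_init, None, length=500)
    return U

A = jnp.abs(jax.random.normal(jax.random.PRNGKey(1), (32, 16)))
U0 = jnp.ones((32, 4))

# Downstream gradients w.r.t. A (and hence w.r.t. any upstream X producing A)
# flow through the argmin without backpropagating through the 500 solver steps.
loss = lambda A: jnp.sum(nnls_solver(U0, A) ** 2)
dA = jax.grad(loss)(A)
```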