Coalescing causal inference and foundation models

Randomized experiments and the average treatment effect (ATE)

Consider a dataset $\mathcal{D}$ containing $n$ tuples $(X,Y,A)$, where $X$ denotes the covariates, $Y$ the outcome, and $A \in \{0,1\}$ the treatment variable. We assume that

  1. The data are sampled i.i.d. from a joint distribution $\mathbb{P}$ over $(X,Y,A)$
  2. $Y(a) ⫫ A$, for $a=0,1$ (randomization)
  3. $\mathbb{P}(A=a)=\pi_a>0$, for $a=0,1$, and the propensity score $\pi_a$ is known
  4. SUTVA (stable unit treatment value assumption) holds: $Y_i=Y_i(A_i)$

Under these assumptions, we aim to estimate the average treatment effect (ATE):

\[\theta := \mathbb{E}[Y(1)-Y(0)] = \mathbb{E}[Y\mid A=1]-\mathbb{E}[Y\mid A=0]\]

The standard approach is to estimate $\theta$ with the difference-in-means (DM) estimator:

\[\hat{\theta} _ {DM} = \frac{1}{n_1} \sum_{i;A_i=1} {Y_i} - \frac{1}{n_0} \sum_{i;A_i=0} {Y_i}\]

where $n_a$ denotes the number of samples with $A_i=a$.
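As a concrete illustration, here is a minimal NumPy sketch of the DM estimator; the data-generating process below is invented purely for illustration (true ATE set to 2.0).

```python
import numpy as np

def dm_estimate(y, a):
    """Difference-in-means estimate of the ATE from outcomes y and treatments a."""
    y, a = np.asarray(y, dtype=float), np.asarray(a)
    return y[a == 1].mean() - y[a == 0].mean()

# Toy randomized experiment with true ATE = 2.0 (illustrative only).
rng = np.random.default_rng(0)
n = 10_000
a = rng.binomial(1, 0.5, size=n)   # coin-flip treatment assignment
y = 2.0 * a + rng.normal(size=n)   # outcome with an additive treatment effect
theta_hat = dm_estimate(y, a)      # close to 2.0 for large n
```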

Estimators of ATE

Difference in means estimator $\hat{\theta}_{DM}$

For the difference-in-means (DM) estimator $\hat{\theta}_{DM}$, it is known that the estimator is consistent and asymptotically normal:

\[\sqrt{n}(\hat{\theta} _ {DM}-\theta) \rightarrow^d \mathcal{N}(0, V_{DM})\]

with asymptotic variance $V_{DM}$.

We also have a consistent estimator of the asymptotic variance, $\hat{V} _ {DM}=V_{DM}+o_P(1)$ (see Wager, Theorem 1.2). We can therefore construct an asymptotically valid CI:

\[\mathcal{C}^\alpha_{DM} = (\hat{\theta} _ {DM} \pm z_{1-\frac{\alpha}{2}}\sqrt{\frac{\hat{V}_{DM}}{n}})\]
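A common plug-in choice is $\hat V_{DM}/n = S_1^2/n_1 + S_0^2/n_0$, with $S_a^2$ the sample variance of outcomes in arm $a$; the sketch below uses that estimator (simulated data are illustrative):

```python
import numpy as np
from statistics import NormalDist

def dm_confidence_interval(y, a, alpha=0.05):
    """Asymptotic (1 - alpha) CI for the ATE via the DM estimator.

    Plug-in variance: V_hat / n = S1^2/n1 + S0^2/n0, where S_a^2 is the
    sample variance of outcomes in arm a.
    """
    y, a = np.asarray(y, dtype=float), np.asarray(a)
    y1, y0 = y[a == 1], y[a == 0]
    theta_hat = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1-alpha/2}, ~1.96 at alpha = 0.05
    return theta_hat - z * se, theta_hat + z * se

rng = np.random.default_rng(0)
a = rng.binomial(1, 0.5, size=5_000)
y = 2.0 * a + rng.normal(size=5_000)  # true ATE = 2.0 (illustrative)
lo, hi = dm_confidence_interval(y, a)
```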

Augmented Inverse Probability Weighting estimator $\hat{\theta}_{AIPW}$

Let’s start with the unbiasedness of the AIPW estimator.

The AIPW estimator is derived from the IPW estimator, which weights each sample in the treated and untreated groups by the inverse of its estimated propensity score.

\[\hat{\theta}_{IPW}=\frac{1}{n}\sum_i\left\{\frac{A_iY_i}{\hat{\pi}(X_i)}-\frac{(1-A_i)Y_i}{1-\hat{\pi}(X_i)}\right\}\]

If the propensity score is known, we substitute

\[\hat{\pi}(X_i) \leftarrow \pi_1, \quad 1-\hat{\pi}(X_i) \leftarrow \pi_0\]

which yields an unbiased estimator of $\theta$. Unbiasedness follows since

\[\mathbb{E}[A_iY_i]=\mathbb{E}[Y_i\mid A_i=1]\cdot\pi_1, \mathbb{E}[(1-A_i)Y_i]=\mathbb{E}[Y_i\mid A_i=0]\cdot\pi_0\]

Note that if we estimate the propensity score consistently, the IPW estimator is consistent for $\theta$.
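The known-propensity IPW estimator is a one-liner in NumPy; this sketch assumes arrays of outcomes and treatment indicators:

```python
import numpy as np

def ipw_estimate(y, a, pi1):
    """IPW estimate of the ATE when P(A=1) = pi1 is known."""
    y, a = np.asarray(y, dtype=float), np.asarray(a)
    return np.mean(a * y / pi1 - (1 - a) * y / (1 - pi1))

# Two-unit toy example: terms are 4/0.5 = 8 and -2/0.5 = -4, so the mean is 2.0.
theta = ipw_estimate([4.0, 2.0], [1, 0], 0.5)  # -> 2.0
```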

The AIPW estimator can be expressed as the sum of the IPW estimator and an adjustment term.

\[\hat{\theta} _ {AIPW}=\hat{\theta} _ {IPW}-\frac{1}{n} \sum_i \frac{(A_i-\hat{\pi}(X_i))}{\hat{\pi}(X_i)(1-\hat{\pi}(X_i))} \cdot [(1-\hat{\pi}(X_i))\cdot h(X_i,1) + \hat{\pi}(X_i)\cdot h(X_i,0)]\]

When the propensity score is known, we can express the AIPW estimator as follows:

\[\hat{\theta}_{AIPW}=\frac{1}{n} \sum_i \psi_i(h), \text{ where } \psi_i(h) := (\frac{A_i}{\pi_1}(Y_i-h(X_i,1))+h(X_i,1)) - (\frac{1-A_i}{\pi_0} (Y_i-h(X_i,0))+h(X_i,0))\]

Here, $h(\cdot,\cdot)$ is a square-integrable outcome model that maps a covariate $X_i$ and a binary treatment indicator $A_i$ to a prediction of the outcome.

The AIPW estimator is unbiased when the propensity score is known and $h(\cdot, \cdot)$ is a fixed function (i.e., not estimated from the same data). This is because

\[\mathbb{E}[\frac{A_ih(X_i,1)}{\pi_1}]=\mathbb{E}[h(X_i,1)], \mathbb{E}[\frac{(1-A_i)h(X_i,0)}{\pi_0}]=\mathbb{E}[h(X_i,0)]\] \[\therefore \mathbb{E}[\psi_i(h)]=\mathbb{E}[Y_i\mid A_i=1]-\mathbb{E}[Y_i\mid A_i=0]\]
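The score $\psi_i(h)$ translates directly into code. In this sketch, `h1` and `h0` are arrays of outcome-model predictions $h(X_i,1)$ and $h(X_i,0)$, and the propensity $\pi_1$ is known:

```python
import numpy as np

def aipw_estimate(y, a, h1, h0, pi1):
    """AIPW estimate of the ATE: the average of the scores psi_i(h)."""
    y, a = np.asarray(y, dtype=float), np.asarray(a)
    h1, h0 = np.asarray(h1, dtype=float), np.asarray(h0, dtype=float)
    psi = (a / pi1 * (y - h1) + h1) - ((1 - a) / (1 - pi1) * (y - h0) + h0)
    return psi.mean()
```

With $h \equiv 0$ the score collapses to the IPW terms, so the function reduces to the IPW estimator.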

Among the class of AIPW estimators indexed by the outcome model $h$, the most efficient estimator minimizes the asymptotic variance. Specifically, the efficiency lower bound is attained when $h$ is the conditional mean of the outcome:

\[h^*(X,A)=\mathbb{E}[Y\mid X,A]\]

We combine this result with the fact that all estimators of $\theta$ that are consistent and asymptotically normal are asymptotically equivalent to an AIPW estimator (when the propensity score is known; see Robins). We therefore conclude that $\hat{\theta}_{AIPW}(h^*)$ has the smallest possible CI among all consistent and asymptotically normal estimators of $\theta$ (see De Bartolomeis).

One practical issue is that we only have an estimator $\hat{h}$ of $h^*$, so the efficiency lower bound is achieved only if

\[\left\lVert \hat{h}-h^{*} \right\rVert_{L_2}=o(1)\]

A notable property of the AIPW estimator is that asymptotic normality holds as long as $\hat{h}$ has an asymptotic limit $h^\dagger$ (see De Bartolomeis, Proposition 1.1). This means we can construct a valid confidence interval regardless of the choice of outcome-model estimator $\hat{h}$. In particular, when $\hat{h}$ is chosen as the empirical risk minimizer over a linear function class $\mathcal{H}$, the result is referred to as the standard AIPW estimator.

\[\hat{h}(X,a) \in \arg\min_{h\in \mathcal{H}} \frac{1}{n_a} \sum_{i;A_i=a} \mathcal{L}(Y_i, h(X_i))\]
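A sketch of the standard AIPW estimator with a per-arm linear outcome model fit by least squares (squared-error loss; the data-generating process is invented for illustration):

```python
import numpy as np

def standard_aipw(y, a, x, pi1):
    """AIPW with h-hat fit per arm by linear least squares (known pi1)."""
    y, a, x = np.asarray(y, dtype=float), np.asarray(a), np.asarray(x, dtype=float)
    X = np.column_stack([np.ones(len(y)), x])  # design matrix with intercept
    preds = {}
    for arm in (0, 1):
        # Fit the outcome model on units in this arm only...
        beta, *_ = np.linalg.lstsq(X[a == arm], y[a == arm], rcond=None)
        preds[arm] = X @ beta                  # ...then predict h(X_i, arm) for all units
    psi = (a / pi1 * (y - preds[1]) + preds[1]) \
        - ((1 - a) / (1 - pi1) * (y - preds[0]) + preds[0])
    return psi.mean()

rng = np.random.default_rng(0)
n = 4_000
x = rng.normal(size=n)
a = rng.binomial(1, 0.5, size=n)
y = x + 2.0 * a + rng.normal(size=n)  # true ATE = 2.0 (illustrative)
theta_hat = standard_aipw(y, a, x, pi1=0.5)
```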

Hybrid AIPW from De Bartolomeis

Hybrid AIPW comes from the idea that we can improve on the outcome model $\hat{h}$ by combining multiple AIPW estimators whose outcome models are replaced by foundation models. Plugging in foundation models $f_1, \ldots, f_k$ as outcome estimators, we obtain multiple AIPW estimators

\[\hat{\theta} _ {AIPW} (\hat{h}), \hat{\theta} _ {AIPW} (f_1), ..., \hat{\theta} _ {AIPW}(f_k)\]

One important point is that $\hat{h}$ is estimated from the experimental data, while $f_1, \ldots, f_k$ are foundation models trained on independent external data. We can then consider a linear combination of these estimators, parameterized by a weight vector $\lambda$, and select the optimal combination.

\[\hat{\theta}_\lambda := \lambda_1 \hat{\theta} _{AIPW}(\hat{h}) + \sum _{j=1}^k \lambda _{j+1} \hat{\theta} _{AIPW}(f_j)\]

Here, $\lambda$ satisfies $\sum_{j=1}^{k+1}\lambda_j=1$.

If we know the asymptotic covariance $\Sigma := \text{Cov}[(\psi(h^\dagger), \psi(f_1), \ldots, \psi(f_k))^T]$ for the asymptotic limit $h^\dagger$, we can choose the $\lambda$ that minimizes the variance of $\hat{\theta}_\lambda$ via

\[\lambda^* = \arg\min_{\mathbb{1}^T\lambda=1} \text{Var}[\hat{\theta}_\lambda] = \arg\min_{\mathbb{1}^T\lambda=1} \lambda^T\Sigma\lambda = \Sigma^{-1}\mathbb{1}/(\mathbb{1}^T\Sigma^{-1}\mathbb{1})\]

In practice, we only have an estimate $\hat{\Sigma}$ of the covariance, so we use $\hat{\lambda} := \arg\min_{\mathbb{1}^T\lambda=1} \lambda^T\hat{\Sigma}\lambda$.
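Given an estimated covariance matrix, the closed-form weights amount to one linear solve plus a normalization; a minimal sketch:

```python
import numpy as np

def optimal_weights(sigma):
    """Solve argmin lambda^T Sigma lambda subject to sum(lambda) = 1."""
    sigma = np.asarray(sigma, dtype=float)
    ones = np.ones(sigma.shape[0])
    w = np.linalg.solve(sigma, ones)  # Sigma^{-1} 1
    return w / (ones @ w)             # normalize so the weights sum to one

def haipw_estimate(thetas, sigma_hat):
    """Variance-minimizing combination of individual AIPW estimates."""
    return optimal_weights(sigma_hat) @ np.asarray(thetas, dtype=float)

# Example: two uncorrelated estimators with variances 1 and 4
# receive weights inversely proportional to variance: (0.8, 0.2).
lam = optimal_weights(np.diag([1.0, 4.0]))
```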

One of the most intriguing properties of the H-AIPW estimator is that the asymptotic variance of the combined estimator is no greater than that of any individual estimator (see De Bartolomeis, Theorem 2, Appendix A.1.2).

This ensures that the combined estimator is guaranteed to be as precise as the best estimator in the ensemble. Even when the foundation models are biased, the H-AIPW estimator falls back to the standard AIPW estimator.


Connection with PPI (prediction-powered inference)

An interesting connection between the AIPW estimator and PPI is discussed in Angelopoulos, Xu, and De Bartolomeis.

Considering the scenario of estimating the counterfactual mean $\mathbb{E}[Y(1)]$, we can derive a direct equivalence between PPI++ and AIPW as follows:

\[\hat{\theta} _{PPI++} = \frac{1}{n_1}\sum _{i;A_i=1} Y_i + \lambda\left(\frac{1}{n_0} \sum _{i;A_i=0}f(X_i) - \frac{1}{n_1} \sum _{i;A_i=1}f(X_i)\right)\]

Here, the baseline $\hat{\theta} _{DM}$ is the mean outcome of the treated group. The PPI++ estimator $\hat{\theta} _{PPI++}$ improves on the DM estimator by leveraging predictions from a black-box ML model $f$. We can tune $\lambda$ to minimize the variance, but in particular, when we choose $\lambda = \frac{n_0}{n_0+n_1} = \frac{n_0}{n}$, the PPI++ estimator reduces to the AIPW estimator.

Let’s start from the AIPW estimator:

\[\hat{\theta} _ {AIPW} = \frac{1}{n} \sum_ {i=1}^n [{\frac{A_i(Y_i-f(X_i))}{\pi_1}+f(X_i)}]\]

Since the propensity score is $\pi_1 = \frac{n_1}{n_1+n_0}$, the AIPW estimator reduces to

\[= \frac{1}{n_1} \sum_ {i=1}^n A_i(Y_i-f(X_i)) + \frac{1}{n} \sum_ {i=1}^n f(X_i)\] \[= \frac{1}{n_1} \sum_ {i;A_i=1} (Y_i-f(X_i)) + \frac{1}{n} \sum_ {i;A_i=0} f(X_i) + \frac{1}{n} \sum_ {i;A_i=1} f(X_i)\] \[= \frac{1}{n_1} \sum_{i;A_i=1} Y_i -\frac{n_0}{nn_1} \sum_ {i;A_i=1} f(X_i) + \frac{1}{n} \sum_{i;A_i=0} f(X_i)\] \[= \frac{1}{n_1} \sum_{i;A_i=1} Y_i + \frac{n_0}{n} \left(\frac{1}{n_0} \sum_ {i;A_i=0} f(X_i) - \frac{1}{n_1} \sum_ {i;A_i=1} f(X_i)\right)\]

By identifying $\lambda = \frac{n_0}{n} = \frac{n_0}{n_0+n_1}$, we recover the PPI++ estimator $\hat{\theta}_{PPI++}$.
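The equivalence is easy to verify numerically; the arrays below are invented stand-ins for the outcomes $Y_i$ and the black-box predictions $f(X_i)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
a = rng.binomial(1, 0.6, size=n)  # treatment indicators
y = rng.normal(size=n)            # outcomes
f = rng.normal(size=n)            # stand-in for black-box predictions f(X_i)
n1, n0 = int(a.sum()), n - int(a.sum())

# AIPW for E[Y(1)] with propensity pi_1 = n1 / n.
aipw = np.mean(a * (y - f) / (n1 / n) + f)

# PPI++ with lambda = n0 / n.
ppi = y[a == 1].mean() + (n0 / n) * (f[a == 0].mean() - f[a == 1].mean())
```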

Foundation Models, Surrogate Biology

An analogy: harvesting knowledge from language models


The origin of this idea traces back to a conversation with a CS colleague during the summer of 2022. At the time, I was fascinated by questions in efficient learning theory and the role of model architecture in determining learning capacity. Our topic was, “Why haven’t vision models achieved results comparable to language models?” This was shortly after the emergence of RLHF, when large language models were attracting global attention.

I proposed a hypothesis: “Perhaps language itself is inherently structured for learning—serving as a highly compressed and organized form of representation.”

He then added an intriguing point: “If large language models capture this structure, we can think of them not just as tools, but as datasets themselves. For small-scale researchers, one good strategy is to harvest knowledge embedded in these models.”

This conversation planted a question in my mind: Could similar principles apply in biology? Could large-scale models, trained on vast biological observations, sufficiently serve as surrogates for experimental validation or as proxies for biological knowledge?

Validation of high-throughput measurements


Three years later, these questions have become increasingly relevant in functional genomics, where high-throughput measurements produce enormous yet noisy datasets. One emerging approach involves asking whether conclusions drawn from large-scale studies (both observational and perturbational)—often summarized as global statements about systemic architecture—can be recapitulated in silico using models trained by self-supervision on rich data.

For example, models trained on genomic sequences have been used to impute data in a composite$^\dagger$ fashion [1], identify cis-regulatory syntax [2], and be probed through simpler surrogate models such as linear predictors [1,2]. These examples demonstrate how deep models can complement or even substitute experimental measurements by serving as structured surrogates for biological information.

$\dagger$ It reminds me of the term ‘Socratic Model’, coined in this seminal paper [3].

In silico recapitulation of global biology


It has become increasingly clear that equating scale with foundational capability is problematic. Models trained on massive datasets in a self-supervised manner do not inherently qualify as foundation models; rather, true foundation models might demonstrate robust generalization beyond pretraining objectives and the ability to reproduce established biological principles without task-specific tuning.

Biology offers a proper testbed for this argument. Biologists possess a wealth of experimentally validated propositions about biological system architecture, and also have natural system/synthetic tools to validate novel hypotheses experimentally. Consider an example from enhancer biology: Gasperini et al.[4] produced large-scale CRISPRi screens, and another study[5] re-analyzed the data to argue that enhancer action is predominantly multiplicative and that evidence for enhancer interactions was undetectable. Here, the Enformer model was used as a computational surrogate to validate these claims in silico. Reversal of this logic is also instructive: if a proposed model aspires to be considered a foundation model for biology, it might reproduce such propositions in a zero-shot manner without explicit fine-tuning.

Some might argue that this claim is quite aggressive, but propositions about biological systems with an established consensus now serve as good benchmarks. Evidence must be systematic and comprehensive—not cherry-picked or anecdotal. Notably, increasing efforts are being made to verify whether so-called “foundation models” enable zero-shot inference of verifiable biological properties derived from large-scale data[6,7].

Learning Constraints from Biological Feedback


An even more ambitious direction is to learn models directly from biological systems by leveraging endogenous mechanisms. Biological systems impose strong selective constraints, determining which states persist and which are eliminated. Negative examples—e.g., states that fail selection—are particularly challenging to obtain under natural conditions, leading to biased datasets where unfavorable configurations are underrepresented.

Among biological systems, the immune repertoire offers a unique opportunity in this regard. For instance, B-cell development is shaped by stringent selection. Productive B cell receptor (BCR) sequences are preferentially retained, while nonproductive or autoreactive sequences are eliminated. This process provides a natural source of implicit preference data, analogous to reward signals in reinforcement learning.

Building on recent work[8], which exploits allelic inclusion to classify suboptimal BCR sequences, one could extend this idea to a generative framework. Specifically, incorporating preference-based reinforcement learning—where productive versus nonproductive repertoires provide implicit ranking signals—could enable models to internalize selective constraints shaping BCR diversity. Such models would not only generate biologically plausible antibody sequences but also simulate affinity maturation pathways, offering new tools for immunological research and therapeutic design.

References

[1] Zhou, Yichao, et al. “scPrediXcan integrates advances in deep learning and single-cell data into a powerful cell-type–specific transcriptome-wide association study framework.” bioRxiv (2025): 2024-11.
[2] Seitz, Evan E., et al. “Interpreting cis-regulatory mechanisms from genomic deep neural networks using surrogate models.” Nature Machine Intelligence 6.6 (2024): 701-713.
[3] Zeng, Andy, et al. “Socratic models: Composing zero-shot multimodal reasoning with language.” arXiv preprint arXiv:2204.00598 (2022).
[4] Gasperini, Molly, et al. “A genome-wide framework for mapping gene regulation via cellular genetic screens.” Cell 176.1 (2019): 377-390.
[5] Zhou, Jessica L., et al. “Analysis of single-cell CRISPR perturbations indicates that enhancers predominantly act multiplicatively.” Cell Genomics 4.11 (2024).
[6] Wang, Yihui, et al. “Genomic Touchstone: Benchmarking Genomic Language Models in the Context of the Central Dogma.” bioRxiv (2025): 2025-06.
[7] Tang, Ziqi, et al. “Evaluating the representational power of pre-trained DNA language models for regulatory genomics.” Genome Biology 26.1 (2025): 203.
[8] Jagota, Milind, et al. “Learning antibody sequence constraints from allelic inclusion.” bioRxiv (2024).