22 Nov 2025
•
Annotated BI
Testbed for benchmarking large models trained on biological sequences
My recent blog post examined the stringent criteria required to meaningfully evaluate large models trained on biological sequences.
For convenience, I’d like to refer to “large models trained on biological sequences” as NLP models, which abbreviates “Nature’s Language Processing” to emphasize the analogy between natural language modeling and the modeling of genomic, transcriptomic, and proteomic sequences. \(\dagger\)
\(\dagger\): Future blog posts will discuss some critical gaps in an apples-to-apples comparison between natural languages and biological sequences.
Despite rapid progress, an important challenge persists: we lack biologically principled, community-wide standards for benchmarking NLP models, and a semantic gap remains between biologists and computer scientists regarding what “good performance” should look like.
To guide the discussion, we articulate two core aspirations for NLP models:
- robust generalization beyond pretraining objectives
- reproduction of established biological principles without task-specific tuning
To clarify this landscape, we first examine biologically grounded tasks within the framework of the central dogma.
Central Dogma
The central dogma (DNA → RNA → protein) provides a natural organizing principle for evaluating NLP models across distinct sequence modalities.
Wang et al. recently proposed a comprehensive benchmark designed with extensive tasks spanning the central dogma.
Below are a few examples of sequence-level (DNA, RNA, protein) tasks from the full list of 36 tasks proposed in the original work.
| Modality | Task | Description | Category |
| --- | --- | --- | --- |
| DNA | Promoter annotation | Distinguishing promoter regions: 1) TATA box, Initiator, CCAAT box, GC box; 2) minimal sequence for transcription initiation around the TSS | Classification |
| DNA | Enhancer type classification | Distinguishing 1) tissue-specific enhancers, 2) tissue-invariant enhancers, 3) non-enhancer regions | Classification |
| RNA | mRNA stability prediction | Predicting mRNA stability profiles | Regression |
| RNA | SARS-CoV-2 vaccine degradation prediction | Predicting hydrolysis rates of mRNA | Regression |
| Protein | Post-translational modification prediction | Predicting PTM categories | Classification |
| Protein | Fluorescence prediction | Predicting log-fluorescence for avGFP sequences | Regression |
Here, we observe that the scope of these tasks varies significantly. Some test general principles and the regulatory syntax of gene expression, while others assess specific biological properties (e.g., mRNA degradation, protein fluorescence).
This naturally constrains zero-shot predictability. For some tasks, we expect NLP models to succeed easily with simple linear probing (a minimal sketch follows below). Others are inherently designed with narrow scopes and thus require adjustment, ranging from fine-tuning the penultimate layers to extensive fine-tuning of all layers.
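To make the probing setup concrete, here is a minimal sketch of a linear probe on frozen embeddings. The `embeddings` and `y` arrays are random placeholders standing in for pretrained-model representations and assay labels, not any specific benchmark's data:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Placeholders: in practice, `embeddings` would hold frozen per-sequence
# representations (e.g., mean-pooled token embeddings from a pretrained
# NLP model) and `y` would hold the task label (e.g., mRNA stability).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))
y = rng.normal(size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, y, test_size=0.2, random_state=0
)

# Linear probe: the pretrained model stays frozen; only a ridge head is fit.
probe = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_train, y_train)
print("held-out R^2:", probe.score(X_test, y_test))
```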
Focusing on zero-shot inference
In practice, zero-shot inference in biological modeling roughly refers to predicting outcomes for a task or distribution that was never explicitly used during training.
New tasks or distributions can involve new functional assays, cellular contexts, or sequence families. This is why we focus on evaluating the representational power of NLP models under a zero-shot inference setting.
Note that Wang et al. provide a comprehensive benchmark but primarily focus on fine-tuning diverse NLP models on specific datasets.
On the other hand, Tang et al. provide a good example of designing DNA-level tasks and benchmarking across NLP models using simple probing models. Here, we summarize key tasks that have been challenged by zero-shot inference using NLP models.
Tasks challenged with zero-shot inference (or with simple probing)
| Modality | Task | Description | Probing model | Reference |
| --- | --- | --- | --- | --- |
| DNA | lentiMPRA cell-type-specific regulatory activity prediction | Predicting experimentally measured enhancer activity via lentiMPRA, mainly cell-type-specific CRE activity; note that conventional genomic language models are not cell-type aware | Ridge regression, MLP, CNN (CNN outperformance indicates nonlinearity) | Tang et al. |
| DNA | ChIP-seq TF binding prediction | Binary classification of TF binding events identified from ChIP-seq | CNN | Tang et al. |
| RNA | Alternative splicing prediction | Given splice acceptor and donor sequences, predicting percent-spliced-in across 56 tissues (multi-task regression) | CNN | Tang et al. |
| RNA | RBP binding prediction (eCLIP-seq) | Binary classification of whether a sequence corresponds to an eCLIP-seq peak | CNN | Tang et al. |
| RNA | RNA Pol II elongation potential prediction | Regression of RNA Pol II elongation potential from INSERT-seq | CNN | Tang et al. |
| RNA | Secondary structure prediction | 1) Transformer attention maps inform base-pairing probabilities; 2) these are supplied to construct RNA secondary structure | Logistic regression | Gong et al. |
| Protein | Guidance of antibody evolution (efficient evolution theory) | Restricting the mutational space using PLM (ESM-1b, ESM-1v) likelihood ratios relative to the WT amino acid sequence | N/A (log conditional likelihood) | Hie et al. |
| Protein | Deep mutational scan success prediction | Predicting mutation-effect prediction capability (DMS correlation) | N/A (pseudo log-likelihood) | Gordon et al. |
Here, we do not claim that these tasks are “solved” by zero-shot inference.
Tang et al. demonstrate that for tasks heavily influenced by the cell-type-specific nature of TF binding (lentiMPRA, ChIP-seq), one-hot encoding of nucleotides surpasses nonlinear probing of embeddings.
Another important lesson from these benchmarks is that supervised models such as Enformer and Sei outperform language models trained in a self-supervised manner.
The success of Enformer in recapitulating the multiplicative mode of enhancer action in silico further supports this performance gap between supervised and self-supervised models (Zhou et al.).
Bridging the two worlds
A promising direction lies between the supervised and self-supervised paradigms. One proof-of-concept approach that connects these two worlds is aligning language models with experimental feedback data (RLXF, Blalock et al.).
It intuitively resembles RLHF in human-language models. This approach employs biological assays to construct reward signals, providing a framework that leverages both large-scale pretraining and targeted feedback.
Further methodological advances will be necessary to scale this paradigm, but it already represents a vital conceptual bridge.
21 Sep 2025
•
Annotated BI
Randomized experiments and the average treatment effect (ATE)
Consider a dataset $\mathcal{D}$ containing $n$ tuples $(X,Y,A)$, where $X$ denotes the covariates, $Y$ the outcome, and $A \in \{0,1\}$ the treatment variable.
We assume that
- The data is sampled i.i.d. from a joint distribution $\mathbb{P}$ over $(X,Y,A)$
- $Y(a) ⫫ A$, for $a=0,1$
- $\mathbb{P}(A=a)=\pi_a>0$, for $a=0,1$, and propensity score $\pi_a$ is known.
- SUTVA (stable unit treatment value assumption) holds: $Y_i=Y_i(A_i)$
Under these assumptions, we aim to estimate the average treatment effect (ATE):
\[\theta := \mathbb{E}[Y(1)-Y(0)] = \mathbb{E}[Y\mid A=1]-\mathbb{E}[Y\mid A=0]\]
The standard approach is to estimate $\theta$ using the difference-in-means (DM) estimator:
\[\hat{\theta} _ {DM} = \frac{1}{n_1} \sum_{i;A_i=1} {Y_i} - \frac{1}{n_0} \sum_{i;A_i=0} {Y_i}\]
where $n_a$ denotes the number of samples with $A_i=a$.
Estimators of ATE
Difference in means estimator $\hat{\theta}_{DM}$
The difference-in-means (DM) estimator $\hat{\theta}_{DM}$ is known to be consistent and asymptotically normal:
\[\sqrt{n}(\hat{\theta} _ {DM}-\theta) \rightarrow^d \mathcal{N}(0, V_{DM})\]
with asymptotic variance $V_{DM}$.
We also have a consistent estimator of the asymptotic variance, $\hat{V} _ {DM}=V_{DM}+o_P(1)$ (see Wager, Theorem 1.2).
We therefore can construct asymptotically valid CI:
\[\mathcal{C}^\alpha_{DM} = (\hat{\theta} _ {DM} \pm z_{1-\frac{\alpha}{2}}\sqrt{\frac{\hat{V}_{DM}}{n}})\]
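As a minimal numpy sketch of this procedure (assuming the standard plug-in variance estimate $\hat{V} _ {DM} = n(s_1^2/n_1 + s_0^2/n_0)$, where $s_a^2$ is the sample variance within arm $a$):

```python
import numpy as np
from scipy import stats

def dm_estimate(y, a, alpha=0.05):
    """Difference-in-means estimate of the ATE with an asymptotic CI."""
    y1, y0 = y[a == 1], y[a == 0]
    theta_hat = y1.mean() - y0.mean()
    # Plug-in variance estimate: V_hat = n * (s1^2/n1 + s0^2/n0).
    n = len(y)
    v_hat = n * (y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(v_hat / n)
    return theta_hat, (theta_hat - half, theta_hat + half)
```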
Augmented Inverse Probability Weighting estimator $\hat{\theta}_{AIPW}$
Let’s start with the unbiasedness of the AIPW estimator.
The AIPW estimator is derived from the IPW estimator, which weights each sample from the treated and untreated groups by the inverse of the estimated propensity score:
\[\hat{\theta}_{IPW}=\frac{1}{n}\sum_i\left\{\frac{A_iY_i}{\hat{\pi}(X_i)}-\frac{(1-A_i)Y_i}{1-\hat{\pi}(X_i)}\right\}\]
If the propensity score is known, we replace
\[\hat{\pi}(X_i) \leftarrow \pi_1, \quad 1-\hat{\pi}(X_i) \leftarrow \pi_0\]
which yields an unbiased estimator of $\theta$.
We can show the unbiasedness of this estimator since
\[\mathbb{E}[A_iY_i]=\mathbb{E}[Y_i\mid A_i=1]\cdot\pi_1, \mathbb{E}[(1-A_i)Y_i]=\mathbb{E}[Y_i\mid A_i=0]\cdot\pi_0\]
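Dividing by $\pi_1$ and $\pi_0$ respectively and taking expectations term by term then recovers the estimand:

\[\mathbb{E}[\hat{\theta} _ {IPW}] = \frac{\mathbb{E}[A_iY_i]}{\pi_1} - \frac{\mathbb{E}[(1-A_i)Y_i]}{\pi_0} = \mathbb{E}[Y_i\mid A_i=1] - \mathbb{E}[Y_i\mid A_i=0] = \theta\]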
Note that if we estimate the propensity score consistently, the IPW estimator is consistent for $\theta$.
The AIPW estimator can be expressed as the sum of the IPW estimator and an adjustment term:
\[\hat{\theta} _ {AIPW}=\hat{\theta} _ {IPW}-\frac{1}{n} \sum_i \frac{(A_i-\hat{\pi}(X_i))}{\hat{\pi}(X_i)(1-\hat{\pi}(X_i))} \cdot [(1-\hat{\pi}(X_i))\cdot h(X_i,1) + \hat{\pi}(X_i)\cdot h(X_i,0)]\]
When the propensity score is known, we can express the AIPW estimator as follows:
\[\hat{\theta}_{AIPW}=\frac{1}{n} \sum_i \psi_i(h), \text{ where } \psi_i(h) := (\frac{A_i}{\pi_1}(Y_i-h(X_i,1))+h(X_i,1)) - (\frac{1-A_i}{\pi_0} (Y_i-h(X_i,0))+h(X_i,0))\]
Here, $h(\cdot,\cdot)$ is a square-integrable function that models the outcome given the covariates $X_i$ and the binary treatment indicator $A_i$.
The AIPW estimator is unbiased when the propensity score is known and the outcome model $h(\cdot, \cdot)$ is a fixed function.
This is because
\[\mathbb{E}[\frac{A_ih(X_i,1)}{\pi_1}]=\mathbb{E}[h(X_i,1)], \mathbb{E}[\frac{(1-A_i)h(X_i,0)}{\pi_0}]=\mathbb{E}[h(X_i,0)]\]
\[\therefore \mathbb{E}[\psi_i(h)]=\mathbb{E}[Y_i\mid A_i=1]-\mathbb{E}[Y_i\mid A_i=0]\]
Among the class of AIPW estimators indexed by the outcome model $h$, the most efficient one minimizes the asymptotic variance.
Specifically, the efficiency lower bound is attained when $h$ is the conditional mean of the outcome:
\[h^*(X,A)=\mathbb{E}[Y\mid X,A]\]
We combine this result with the fact that all consistent and asymptotically normal estimators of $\theta$ are asymptotically equivalent to an AIPW estimator (when the propensity score is known; see Robins).
We therefore conclude that $\hat{\theta}_{AIPW}(h^*)$ attains the smallest asymptotic variance, and hence the narrowest CI, among all consistent and asymptotically normal estimators of $\theta$ (see De Bartolomeis).
One practical issue is that we only have an estimate $\hat{h}$ of $h^*$, so the efficiency lower bound is achieved only if
\[\left\lVert \hat{h}-h^{*} \right\rVert_{L_2}=o(1)\]
A notable property of the AIPW estimator is that its asymptotic normality holds as long as $\hat{h}$ has an asymptotic limit $h^\dagger$ (see De Bartolomeis, Proposition 1.1).
This means that we can construct a valid confidence interval regardless of the choice of outcome model estimator $\hat{h}$.
In particular, when we choose $\hat{h}$ as the minimizer of the empirical risk over a linear function class $\mathcal{H}$, the result is referred to as the standard AIPW estimator.
\[\hat{h}(X,a) \in \arg\min_{h\in \mathcal{H}} \frac{1}{n_a} \sum_{i;A_i=a} \mathcal{L}(Y_i, h(X_i))\]
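As a minimal sketch of the standard AIPW estimator with a known propensity score, here the outcome model $\hat{h}$ is a linear regression fit separately within each arm (cross-fitting and variance estimation are omitted for brevity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def aipw_estimate(x, y, a, pi1):
    """AIPW estimate of the ATE when the propensity score pi1 = P(A=1) is known.

    Returns the point estimate and the per-sample values psi_i(h), which
    are reused below when combining several AIPW estimators.
    """
    pi0 = 1.0 - pi1
    # Outcome model h(X, a): a linear regression fit separately in each arm.
    h1 = LinearRegression().fit(x[a == 1], y[a == 1]).predict(x)
    h0 = LinearRegression().fit(x[a == 0], y[a == 0]).predict(x)
    psi = (a / pi1 * (y - h1) + h1) - ((1 - a) / pi0 * (y - h0) + h0)
    return psi.mean(), psi
```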
Hybrid AIPW from De Bartolomeis
Hybrid AIPW (H-AIPW) comes from the idea that we can improve on the outcome model $\hat{h}$ by combining multiple AIPW estimators whose outcome models are replaced by foundation models. Plugging in foundation models $f_1, \ldots, f_k$ as alternative outcome estimators, we obtain multiple AIPW estimators:
\[\hat{\theta} _ {AIPW} (\hat{h}), \hat{\theta} _ {AIPW} (f_1), \ldots, \hat{\theta} _ {AIPW}(f_k)\]
One important point is that $\hat{h}$ is estimated from the experimental data, while $f_1, \ldots, f_k$ are foundation models trained on independent external data. We can simply consider a linear combination of these estimators, parameterized by a weight vector $\lambda$, to select an optimal estimator:
\[\hat{\theta}_\lambda := \lambda_1 \hat{\theta} _{AIPW}(\hat{h}) + \sum _{j=1}^k \lambda _{j+1} \hat{\theta} _{AIPW}(f_j)\]
Here, $\lambda$ satisfies $\sum_{j=1}^{k+1}\lambda_j=1$.
If we know the asymptotic covariance $\Sigma := \text{Cov}[(\psi(h^\dagger), \psi(f_1), \ldots, \psi(f_k))^T]$ for the asymptotic limit $h^\dagger$, we can choose the $\lambda$ that minimizes the variance of $\hat{\theta}_\lambda$ subject to this constraint via
\[\lambda^* = \arg\min_{\lambda} \text{Var}[\hat{\theta}_\lambda] = \arg\min_{\lambda} \lambda^T\Sigma\lambda = \Sigma^{-1}\mathbb{1}/(\mathbb{1}^T\Sigma^{-1}\mathbb{1})\]
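This closed form follows from minimizing $\lambda^T\Sigma\lambda$ subject to $\mathbb{1}^T\lambda = 1$: the Lagrangian condition

\[\nabla_\lambda \left( \lambda^T\Sigma\lambda - \mu(\mathbb{1}^T\lambda - 1) \right) = 2\Sigma\lambda - \mu\mathbb{1} = 0\]

implies $\lambda \propto \Sigma^{-1}\mathbb{1}$, and normalizing the weights to sum to one gives $\lambda^*$.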
In practice, we only have an estimate $\hat{\Sigma}$ of the covariance, so we use $\hat{\lambda} := \arg\min_\lambda \lambda^T\hat{\Sigma}\lambda$ under the same constraint.
One of the most intriguing properties of the H-AIPW estimator is that the asymptotic variance of the combined estimator is no greater than that of any individual estimator (see De Bartolomeis, Theorem 2, Appendix A.1.2).
This guarantees that the combined estimator is at least as precise as the best estimator in the ensemble. Even when a foundation model is biased, the H-AIPW estimator falls back to the standard AIPW. A sketch of the combination step is shown below.
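Here is a sketch of the combination step, reusing per-sample $\psi$ values of the form returned by the `aipw_estimate` sketch above; $\hat{\Sigma}$ is taken to be the empirical covariance of these values:

```python
import numpy as np

def h_aipw_combine(psi_matrix):
    """Combine AIPW estimators with plug-in optimal weights.

    psi_matrix has shape (n, k+1): column j holds the per-sample psi values
    for the j-th outcome model (the fitted h_hat plus k foundation models).
    """
    n, m = psi_matrix.shape
    sigma_hat = np.cov(psi_matrix, rowvar=False)   # empirical (k+1) x (k+1) covariance
    ones = np.ones(m)
    w = np.linalg.solve(sigma_hat, ones)           # Sigma_hat^{-1} 1
    lam = w / (ones @ w)                           # normalize so weights sum to 1
    theta_hat = psi_matrix.mean(axis=0) @ lam      # combined point estimate
    var_hat = lam @ sigma_hat @ lam / n            # plug-in variance of the estimate
    return theta_hat, lam, var_hat
```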
Connection with PPI (prediction-powered inference)
An interesting connection between the AIPW estimator and PPI is discussed in Angelopoulos, Xu, and De Bartolomeis.
Considering the scenario of estimating the counterfactual mean $\mathbb{E}[Y(1)]$, we can derive a direct equivalence between PPI++ and AIPW as follows:
\[\hat{\theta} _ {PPI++} = \frac{1}{n_1}\sum _{i;A_i=1} Y_i + \lambda\left(\frac{1}{n_0} \sum _{i;A_i=0}f(X_i) - \frac{1}{n_1} \sum _{i;A_i=1}f(X_i)\right)\]
Here, since we target $\mathbb{E}[Y(1)]$, the standard $\hat{\theta} _{DM}$ (the first term) is simply the mean outcome of the treated group.
The PPI++ estimator $\hat{\theta} _{PPI++}$ improves on the DM estimator by leveraging predictions from a black-box ML model $f$.
We can tune $\lambda$ to minimize the variance; in particular, when we choose $\lambda = \frac{n_0}{n_0+n_1} = \frac{n_0}{n}$, the PPI++ estimator reduces to the AIPW estimator.
Let’s start from the AIPW estimator:
\[\hat{\theta} _ {AIPW} = \frac{1}{n} \sum_ {i=1}^n [{\frac{A_i(Y_i-f(X_i))}{\pi_1}+f(X_i)}]\]
Since the propensity score is $\pi_1 = \frac{n_1}{n_1+n_0}$, the AIPW estimator reduces to
\[= \frac{1}{n_1} \sum_ {i=1}^n [A_i(Y_i-f(X_i))] + \frac{1}{n} \sum_ {i=1}^n f(X_i)\]
\[= \frac{1}{n_1} \sum_ {i;A_i=1} [Y_i-f(X_i)] + \frac{1}{n} \sum_ {i;A_i=0} f(X_i) + \frac{1}{n} \sum_ {i;A_i=1} f(X_i)\]
\[= \frac{1}{n_1} \sum_{i;A_i=1} Y_i -\frac{n_0}{nn_1} \sum_ {i;A_i=1} f(X_i) + \frac{1}{n} \sum_{i;A_i=0} f(X_i)\]
\[= \frac{1}{n_1} \sum_{i;A_i=1} Y_i + \frac{n_0}{n} (\frac{1}{n_0} \sum_ {i;A_i=0} f(X_i) - \frac{1}{n_1} \sum_ {i;A_i=1} f(X_i))\]
Identifying $\lambda$ with $\frac{n_0}{n} = \frac{n_0}{n_0+n_1}$, we recover the PPI++ estimator $\hat{\theta} _{PPI++}$.
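As a quick numerical sanity check of this identity, the following sketch evaluates both expressions on synthetic data, with random numbers standing in for the predictions $f(X_i)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
a = rng.integers(0, 2, size=n)                # random treatment assignment
y = rng.normal(size=n)                        # outcomes
f_x = rng.normal(size=n)                      # stand-in for model predictions f(X_i)
n1, n0 = a.sum(), n - a.sum()

# AIPW for E[Y(1)] with pi_1 = n_1 / n.
aipw = np.mean(a * (y - f_x) / (n1 / n) + f_x)

# PPI++ with lambda = n_0 / n.
lam = n0 / n
ppi = y[a == 1].mean() + lam * (f_x[a == 0].mean() - f_x[a == 1].mean())

assert np.isclose(aipw, ppi)                  # identical, as derived above
```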