Testbed for benchmarking large models trained on biological sequences
My recent blog post examined the stringent criteria required to meaningfully evaluate large models trained on biological sequences.
For convenience, I’d like to refer to “large models trained on biological sequences” as NLP models, where NLP stands for “Nature’s Language Processing,” emphasizing the analogy between natural language modeling and the modeling of genomic, transcriptomic, and proteomic sequences. \(\dagger\)
\(\dagger\): Future blog posts will discuss some critical gaps in an apples-to-apples comparison between natural languages and biological sequences.
Despite rapid progress, an important challenge persists: we lack biologically principled, community-wide standards for benchmarking NLP models, and a semantic gap remains between biologists and computer scientists regarding what “good performance” should look like.
To guide the discussion, we articulate two core aspirations for NLP models:
- robust generalization beyond pretraining objectives
- reproduction of established biological principles without task-specific tuning
To clarify this landscape, we first examine biologically grounded tasks within the framework of the central dogma.
Central Dogma
The central dogma (DNA → RNA → protein) provides a natural organizing principle for evaluating NLP models across distinct sequence modalities.
Wang et al. recently proposed a comprehensive benchmark comprising extensive tasks spanning the central dogma.
A few examples of sequence-level (DNA, RNA, protein) tasks from the full list of 36 tasks proposed in the original work are presented below.
| Modality | Tasks | Description | Category |
|---|---|---|---|
| DNA | Promoter annotation | Distinguishing promoter regions: 1) TATA box, initiator, CCAAT box, GC box; 2) minimal sequence for transcription initiation around the TSS | Classification |
| DNA | Enhancer type classification | Distinguishing 1) tissue-specific enhancers, 2) tissue-invariant enhancers, 3) non-enhancer regions | Classification |
| RNA | mRNA stability prediction | Predicting mRNA stability profile | Regression |
| RNA | SARS-CoV-2 vaccine degradation prediction | Predicting hydrolysis rates of mRNA | Regression |
| Protein | Post-translational modification prediction | Predict PTM categories | Classification |
| Protein | Fluorescence prediction | Predicting log-fluorescence for avGFP sequences | Regression |
Here, we observe that the scope of these tasks varies significantly. Some test general principles and the regulatory syntax of gene expression, while others assess specific biological properties (e.g., degradation of mRNA, fluorescence of a protein).
This naturally constrains zero-shot predictability. For some tasks, we expect NLP models to succeed easily with simple linear probing. Others are inherently narrow in scope and thus require adaptation, ranging from fine-tuning the penultimate layers to extensive fine-tuning of all layers.
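To make the probing end of this spectrum concrete, here is a minimal linear-probing sketch. The data are placeholders (random arrays standing in for frozen, mean-pooled embeddings and assay labels), so only the workflow is illustrated: the pretrained model stays frozen and only the probe is trained.

```python
# Minimal linear-probing sketch: per-sequence embeddings from a frozen
# pretrained model are fed to a ridge regressor; only the probe is trained.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder data: mean-pooled embeddings (n_sequences x embed_dim) and assay labels.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))   # stand-in for frozen model embeddings
labels = rng.normal(size=1000)              # stand-in for measured activity

X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, random_state=0)
probe = Ridge(alpha=1.0).fit(X_train, y_train)   # the language model itself is never updated
print("held-out R^2:", r2_score(y_test, probe.predict(X_test)))
```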
Focusing on zero-shot inference
In practice, zero-shot inference in biological modeling roughly refers to predicting outcomes for a task or distribution that was never explicitly used for training.
New tasks or distributions can involve new functional assays, cellular contexts, or sequence families. This is why we focus on evaluating the representational power of NLP models under a zero-shot inference setting.
Note that while Wang et al. provide a comprehensive benchmark, they primarily focus on fine-tuning diverse NLP models on specific datasets.
On the other hand, Tang et al. provide a good example of designing DNA-level tasks and benchmarking NLP models using simple probing models. Here, we summarize key tasks that have been approached via zero-shot inference with NLP models.
Tasks challenged with zero-shot inference (or with simple probing)
| Modality | Tasks | Description | Probing model | Reference |
|---|---|---|---|---|
| DNA | LentiMPRA cell-type-specific regulatory activity prediction | Predicting experimentally measured enhancer activity via lentiMPRA, mainly cell-type-specific CRE activity. Note that conventional genomic language models are not cell-type aware. | Ridge regression, MLP, CNN (CNN outperformance indicates nonlinearity) | Tang et al. |
| DNA | ChIP-seq TF binding prediction | Binary classification of TF binding events identified from ChIP-seq | CNN | Tang et al. |
| RNA | Predicting alternative splicing | Given splice acceptor and donor sequence, predict percentage-spliced-in across 56 tissues (multi-task regression) | CNN | Tang et al. |
| RNA | RBP binding prediction (eCLIP-seq) | Binary classification of whether the sequence corresponds to an eCLIP-seq peak | CNN | Tang et al. |
| RNA | Predicting RNA Pol II elongation potential | Regression of RNA Pol II elongation potential from INSERT-seq | CNN | Tang et al. |
| RNA | Secondary structure prediction | 1) Transformer attention maps inform base-pairing probabilities; 2) these are supplied to construct RNA secondary structure | Logistic regression | Gong et al. |
| Protein | Guidance of antibody evolution (efficient evolution theory) | Restricting the mutational space using the PLM (ESM-1b, ESM-1v) likelihood ratio relative to the WT amino acid sequence (see the sketch after this table) | N/A (log of conditional likelihood) | Hie et al. |
| Protein | Deep mutational scan success prediction | Predicting mutation-effect prediction capability (DMS correlation) | N/A (pseudo log-likelihood) | Gordon et al. |
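To illustrate the likelihood-ratio style of zero-shot scoring referenced in the last two rows, here is a minimal sketch assuming the fair-esm package and its pretrained ESM-1b checkpoint (downloaded on first use); the sequence and mutation are toy examples. The score for a substitution is read directly off the frozen model’s logits (a wild-type-marginal variant of the idea), with no task-specific training.

```python
# Zero-shot variant scoring sketch: likelihood ratio of mutant vs. wild-type
# residue under a frozen protein language model (fair-esm, ESM-1b).
import torch
import esm  # pip install fair-esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

wt_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy wild-type sequence
_, _, tokens = batch_converter([("wt", wt_seq)])

with torch.no_grad():
    log_probs = torch.log_softmax(model(tokens)["logits"], dim=-1)[0]

def wt_marginal_score(pos: int, mut_aa: str) -> float:
    """log p(mutant aa) - log p(wild-type aa) at a 0-based position (+1 offset skips the BOS token)."""
    row = log_probs[pos + 1]
    return (row[alphabet.get_idx(mut_aa)] - row[alphabet.get_idx(wt_seq[pos])]).item()

print(wt_marginal_score(4, "W"))   # positive values favor the substitution over wild-type
```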
Here, we do not describe these tasks as having been “solved” by zero-shot inference. Tang et al. demonstrate that for tasks highly influenced by the cell-type-specific nature of TF binding (lentiMPRA, ChIP-seq), one-hot encoding of nucleotides surpasses nonlinear probing of embeddings.
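For comparison, the one-hot baseline needs no pretrained model at all. The following is a hedged sketch (the architecture and dimensions are illustrative, not Tang et al.’s exact setup) of encoding DNA as 4 × L matrices and feeding them to a small CNN.

```python
# One-hot baseline sketch: DNA encoded as 4 x L matrices, fed to a small CNN.
import torch
import torch.nn as nn

def one_hot_dna(seq: str) -> torch.Tensor:
    """Encode an A/C/G/T string as a (4, L) float tensor; unknown bases stay all-zero."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq.upper()):
        if base in mapping:
            x[mapping[base], i] = 1.0
    return x

class SmallCNN(nn.Module):
    """Illustrative CNN head, comparable in size to the probing models used on embeddings."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, x):                 # x: (batch, 4, L)
        return self.head(self.conv(x).squeeze(-1))

batch = torch.stack([one_hot_dna("ACGT" * 50) for _ in range(8)])
print(SmallCNN()(batch).shape)            # torch.Size([8, 1])
```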
Another important lesson from benchmarking on these tasks is that supervised models such as Enformer and Sei outperform language models trained in a self-supervised manner. The success of Enformer in recapitulating the multiplicative mode of enhancer action in silico also supports this performance gap between supervised and self-supervised models (Zhou et al.).
Bridging the two worlds
A promising direction lies between the supervised and self-supervised paradigms. One proof-of-concept approach that connects these two worlds is aligning language models with experimental feedback data (RLXF, Blalock et al.).
This intuitively resembles RLHF for models of human language. The approach uses biological assays to construct reward signals, providing a framework that leverages both large-scale pretraining and targeted feedback.
Further methodological advancements will be necessary to scale this paradigm, but it still represents a vital conceptual bridge.
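As a deliberately simplified illustration (not the RLXF algorithm itself), the sketch below fits a toy reward model on hypothetical assay measurements and uses it to re-rank candidate sequences; a full RLXF-style loop would instead feed such a reward signal back into the generator’s weights. All sequences, values, and the featurization are made up for illustration.

```python
# Toy illustration of assay-derived reward signals: fit a small reward model on
# hypothetical measurements, then re-rank candidate sequences by predicted reward.
import numpy as np
from sklearn.linear_model import Ridge

AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def featurize(seq: str) -> np.ndarray:
    """Toy featurization: amino-acid composition fractions."""
    counts = np.array([seq.count(a) for a in AA_ALPHABET], dtype=float)
    return counts / max(len(seq), 1)

# Hypothetical assay data: sequences paired with a measured readout (the reward signal).
assay_seqs = ["MKTAYIAK", "MKTWYIAK", "MKTAYLAK", "MATAYIAK"]
assay_vals = [0.1, 0.9, 0.4, 0.2]
reward_model = Ridge(alpha=1.0).fit(np.stack([featurize(s) for s in assay_seqs]), assay_vals)

# Re-rank candidates (imagined to come from a pretrained generative model);
# an RLXF-style method would use this signal to update the generator itself.
candidates = ["MKTWYLAK", "MKTAYIAK", "MATWYIAK"]
scores = reward_model.predict(np.stack([featurize(s) for s in candidates]))
for seq, score in sorted(zip(candidates, scores), key=lambda t: -t[1]):
    print(seq, round(float(score), 3))
```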