Testbed for benchmarking large models trained on biological sequences
My recent blog post examined the stringent criteria required to meaningfully evaluate large models trained on biological sequences.
For convenience, I’d like to refer to “large models trained on biological sequences” as NLP models, where NLP stands for “Nature’s Language Processing,” emphasizing the analogy between natural language modeling and the modeling of genomic, transcriptomic, and proteomic sequences. \(\dagger\)
\(\dagger\): Future blog posts will discuss some critical gaps in an apples-to-apples comparison between natural languages and biological sequences.
Despite rapid progress, an important challenge persists: we lack biologically principled, community-wide standards for benchmarking NLP models, and a semantic gap remains between biologists and computer scientists regarding what “good performance” should look like.
To guide the discussion, we articulate two core aspirations for NLP models:
- robust generalization beyond pretraining objectives
- reproduction of established biological principles without task-specific tuning
To clarify this landscape, we first examine biologically grounded tasks within the framework of the central dogma.
Central Dogma
The central dogma (DNA → RNA → protein) provides a natural organizing principle for evaluating NLP models across distinct sequence modalities.
Wang et al. recently proposed a comprehensive benchmark comprising extensive tasks spanning the central dogma.
A few examples of sequence-level (DNA, RNA, protein) tasks from the full list of 36 tasks proposed in the original work are presented below.
| Modality | Tasks | Description | Category |
|---|---|---|---|
| DNA | Promoter annotation | Distinguishing promoter regions: 1) TATA box, initiator, CCAAT box, GC box; 2) minimal sequence for transcription initiation around the TSS | Classification |
| DNA | Enhancer type classification | Distinguishing 1) tissue-specific enhancers, 2) tissue-invariant enhancers, 3) non-enhancer regions | Classification |
| RNA | mRNA stability prediction | Predicting mRNA stability profile | Regression |
| RNA | SARS-CoV-2 vaccine degradation prediction | Predicting hydrolysis rates of mRNA | Regression |
| Protein | Post-translational modification prediction | Predict PTM categories | Classification |
| Protein | Fluorescence prediction | Predicting log-fluorescence for avGFP sequences | Regression |
Here, we observe that the scope of these tasks varies significantly. Some test general principles and the regulatory syntax of gene expression, while others assess specific biological properties (e.g., degradation of mRNA, fluorescence of a protein).
This naturally constrains zero-shot predictability. For some tasks, we expect NLP models to succeed easily with simple linear probing. Others are inherently narrow in scope and thus require adaptation, ranging from fine-tuning the penultimate layers to extensive fine-tuning of all layers.
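To make the probing end of this spectrum concrete, here is a minimal linear-probing sketch. The data are placeholders (random arrays standing in for frozen, mean-pooled embeddings and assay labels), so only the workflow is illustrated: the pretrained model stays frozen and only the probe is trained.

```python
# Minimal linear-probing sketch: per-sequence embeddings from a frozen
# pretrained model are fed to a ridge regressor; only the probe is trained.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder data: mean-pooled embeddings (n_sequences x embed_dim) and assay labels.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))   # stand-in for frozen model embeddings
labels = rng.normal(size=1000)              # stand-in for measured activity

X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, random_state=0)
probe = Ridge(alpha=1.0).fit(X_train, y_train)   # the language model itself is never updated
print("held-out R^2:", r2_score(y_test, probe.predict(X_test)))
```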
Focusing on zero-shot inference
In practice, zero-shot inference in biological modeling roughly refers to predicting outcomes for a task or distribution that was never explicitly used for training.
New tasks or distributions can involve new functional assays, cellular contexts, or sequence families. This is why we focus on evaluating the representational power of NLP models under a zero-shot inference setting.
Note that while Wang et al. provide a comprehensive benchmark, they primarily focus on fine-tuning diverse NLP models on specific datasets.
On the other hand, Tang et al. provide a good example of designing DNA-level tasks and benchmarking NLP models using simple probing models. Here, we summarize key tasks that have been approached via zero-shot inference with NLP models.
Tasks challenged with zero-shot inference (or with simple probing)
| Modality | Tasks | Description | Probing model | Reference |
|---|---|---|---|---|
| DNA | LentiMPRA cell-type-specific regulatory activity prediction | Predicting experimentally measured enhancer activity via lentiMPRA, mainly cell-type-specific CRE activity. Note that conventional genomic language models are not cell-type aware. | Ridge regression, MLP, CNN (CNN outperformance indicates nonlinearity) | Tang et al. |
| DNA | ChIP-seq TF binding prediction | Binary classification of TF binding events identified from ChIP-seq | CNN | Tang et al. |
| RNA | Predicting alternative splicing | Given splice acceptor and donor sequence, predict percentage-spliced-in across 56 tissues (multi-task regression) | CNN | Tang et al. |
| RNA | RBP binding prediction (eCLIP-seq) | Binary classification of whether the sequence corresponds to an eCLIP-seq peak | CNN | Tang et al. |
| RNA | Predicting RNA Pol II elongation potential | Regression of RNA Pol II elongation potential from INSERT-seq | CNN | Tang et al. |
| RNA | Secondary structure prediction | 1) Transformer attention maps inform base-pairing probabilities; 2) these are supplied to construct RNA secondary structure | Logistic regression | Gong et al. |
| Protein | Guidance of antibody evolution (efficient evolution theory) | Restricting the mutational space using the PLM (ESM-1b, ESM-1v) likelihood ratio relative to the WT amino acid sequence (see the sketch after this table) | N/A (log of conditional likelihood) | Hie et al. |
| Protein | Deep mutational scan success prediction | Predicting mutation-effect prediction capability (DMS correlation) | N/A (pseudo log-likelihood) | Gordon et al. |
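To illustrate the likelihood-ratio style of zero-shot scoring referenced in the last two rows, here is a minimal sketch assuming the fair-esm package and its pretrained ESM-1b checkpoint (downloaded on first use); the sequence and mutation are toy examples. The score for a substitution is read directly off the frozen model’s logits (a wild-type-marginal variant of the idea), with no task-specific training.

```python
# Zero-shot variant scoring sketch: likelihood ratio of mutant vs. wild-type
# residue under a frozen protein language model (fair-esm, ESM-1b).
import torch
import esm  # pip install fair-esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

wt_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy wild-type sequence
_, _, tokens = batch_converter([("wt", wt_seq)])

with torch.no_grad():
    log_probs = torch.log_softmax(model(tokens)["logits"], dim=-1)[0]

def wt_marginal_score(pos: int, mut_aa: str) -> float:
    """log p(mutant aa) - log p(wild-type aa) at a 0-based position (+1 offset skips the BOS token)."""
    row = log_probs[pos + 1]
    return (row[alphabet.get_idx(mut_aa)] - row[alphabet.get_idx(wt_seq[pos])]).item()

print(wt_marginal_score(4, "W"))   # positive values favor the substitution over wild-type
```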
Here, we do not describe these tasks as having been “solved” by zero-shot inference. Tang et al. demonstrate that for tasks highly influenced by the cell-type-specific nature of TF binding (lentiMPRA, ChIP-seq), one-hot encoding of nucleotides surpasses nonlinear probing of embeddings.
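For comparison, the one-hot baseline needs no pretrained model at all. The following is a hedged sketch (the architecture and dimensions are illustrative, not Tang et al.’s exact setup) of encoding DNA as 4 × L matrices and feeding them to a small CNN.

```python
# One-hot baseline sketch: DNA encoded as 4 x L matrices, fed to a small CNN.
import torch
import torch.nn as nn

def one_hot_dna(seq: str) -> torch.Tensor:
    """Encode an A/C/G/T string as a (4, L) float tensor; unknown bases stay all-zero."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq.upper()):
        if base in mapping:
            x[mapping[base], i] = 1.0
    return x

class SmallCNN(nn.Module):
    """Illustrative CNN head, comparable in size to the probing models used on embeddings."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, x):                 # x: (batch, 4, L)
        return self.head(self.conv(x).squeeze(-1))

batch = torch.stack([one_hot_dna("ACGT" * 50) for _ in range(8)])
print(SmallCNN()(batch).shape)            # torch.Size([8, 1])
```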
Another important lesson from benchmarking on these tasks is that supervised models such as Enformer and Sei outperform language models trained in a self-supervised manner. The success of Enformer in recapitulating the multiplicative mode of enhancer action in silico also supports this performance gap between supervised and self-supervised models (Zhou et al.).
Bridging the two worlds
A promising direction lies between the supervised and self-supervised paradigms. One proof-of-concept approach that connects these two worlds is aligning language models with experimental feedback data (RLXF, Blalock et al.).
This intuitively resembles RLHF for models of human language. The approach uses biological assays to construct reward signals, providing a framework that leverages both large-scale pretraining and targeted feedback.
Further methodological advancements will be necessary to scale this paradigm, but it still represents a vital conceptual bridge.
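As a deliberately simplified illustration (not the RLXF algorithm itself), the sketch below fits a toy reward model on hypothetical assay measurements and uses it to re-rank candidate sequences; a full RLXF-style loop would instead feed such a reward signal back into the generator’s weights. All sequences, values, and the featurization are made up for illustration.

```python
# Toy illustration of assay-derived reward signals: fit a small reward model on
# hypothetical measurements, then re-rank candidate sequences by predicted reward.
import numpy as np
from sklearn.linear_model import Ridge

AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def featurize(seq: str) -> np.ndarray:
    """Toy featurization: amino-acid composition fractions."""
    counts = np.array([seq.count(a) for a in AA_ALPHABET], dtype=float)
    return counts / max(len(seq), 1)

# Hypothetical assay data: sequences paired with a measured readout (the reward signal).
assay_seqs = ["MKTAYIAK", "MKTWYIAK", "MKTAYLAK", "MATAYIAK"]
assay_vals = [0.1, 0.9, 0.4, 0.2]
reward_model = Ridge(alpha=1.0).fit(np.stack([featurize(s) for s in assay_seqs]), assay_vals)

# Re-rank candidates (imagined to come from a pretrained generative model);
# an RLXF-style method would use this signal to update the generator itself.
candidates = ["MKTWYLAK", "MKTAYIAK", "MATWYIAK"]
scores = reward_model.predict(np.stack([featurize(s) for s in candidates]))
for seq, score in sorted(zip(candidates, scores), key=lambda t: -t[1]):
    print(seq, round(float(score), 3))
```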