Nature's Language Processing

Testbed for benchmarking large models trained on biological sequences


My recent blog post examined the stringent criteria required to meaningfully evaluate large models trained on biological sequences.

For convenience, I will refer to “large models trained on biological sequences” as NLP models, where NLP abbreviates “Nature’s Language Processing” to emphasize the analogy between natural language modeling and the modeling of genomic, transcriptomic, and proteomic sequences. \(\dagger\)

\(\dagger\): Future blog posts will discuss some critical gaps in an apples-to-apples comparison between natural languages and biological sequences.

Despite rapid progress, an important challenge persists: we lack biologically principled, community-wide standards for benchmarking NLP models, and a semantic gap remains between biologists and computer scientists regarding what “good performance” should look like.

To guide the discussion, we articulate two core aspirations for NLP models:

  1. robust generalization beyond pretraining objectives
  2. reproduction of established biological principles without task-specific tuning

To clarify this landscape, we first examine biologically grounded tasks within the framework of the central dogma.

Central Dogma


The central dogma (DNA → RNA → protein) provides a natural organizing principle for evaluating NLP models across distinct sequence modalities.

Wang et al. recently proposed a comprehensive benchmark with an extensive set of tasks spanning the central dogma.

A few examples of sequence-level (DNA, RNA, protein) tasks from the full list of 36 tasks proposed in the original work are presented below.

| Modality | Task | Description | Category |
| --- | --- | --- | --- |
| DNA | Promoter annotation | Distinguishing promoter regions: 1) TATA box, initiator, CCAAT box, GC box; 2) minimal sequence for transcription initiation around the TSS | Classification |
| DNA | Enhancer type classification | Distinguishing 1) tissue-specific enhancers, 2) tissue-invariant enhancers, and 3) non-enhancer regions | Classification |
| RNA | mRNA stability prediction | Predicting mRNA stability profiles | Regression |
| RNA | SARS-CoV-2 vaccine degradation prediction | Predicting hydrolysis rates of mRNA | Regression |
| Protein | Post-translational modification prediction | Predicting PTM categories | Classification |
| Protein | Fluorescence prediction | Predicting log-fluorescence for avGFP sequences | Regression |

Here, we observe that the scope of these tasks varies significantly. Some test general principles and the regulatory syntax of gene expression, while others assess specific biological properties (e.g., mRNA degradation, protein fluorescence).

This naturally constrains zero-shot predictability. For some tasks, we expect NLP models to succeed easily with simple linear probing; others are inherently narrow in scope and thus require adaptation, ranging from fine-tuning the penultimate layers to extensive fine-tuning of all layers.
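To make “simple linear probing” concrete, the evaluation can be as lightweight as fitting a ridge regression on frozen per-sequence embeddings, leaving the NLP model itself untouched. The sketch below assumes hypothetical precomputed files (`embeddings.npy`, `labels.npy`) holding mean-pooled model representations and the matching assay measurements.

```python
# Minimal sketch of linear probing: a ridge regression on frozen embeddings,
# with no fine-tuning of the language model itself.
# `embeddings.npy` / `labels.npy` are hypothetical precomputed files
# (e.g., mean-pooled final-layer representations and assay readouts).
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X = np.load("embeddings.npy")   # (n_sequences, d_model), frozen LM embeddings
y = np.load("labels.npy")       # (n_sequences,), e.g. measured activity

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
r, _ = pearsonr(probe.predict(X_te), y_te)
print(f"linear-probe Pearson r = {r:.3f}")
```

Anything beyond this, such as training an MLP or CNN probe or unfreezing model layers, moves progressively further away from a zero-shot reading of the model’s representations.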

Focusing on zero-shot inference


In practice, zero-shot inference in biological modeling roughly refers to predicting outcomes for a task or data distribution that the model never explicitly saw during training.

New tasks or distributions can involve new functional assays, cellular contexts, or sequence families. This is why we focus on evaluating the representational power of NLP models in a zero-shot setting.

Note that while Wang et al. is a comprehensive benchmark, it primarily focuses on fine-tuning diverse NLP models on specific datasets.

Tang et al., on the other hand, provide a good example of designing DNA-level tasks and benchmarking across NLP models with simple probing models. Below, we summarize key tasks that have been challenged with zero-shot inference using NLP models.

Tasks challenged with zero-shot inference (or with simple probing)


| Modality | Task | Description | Probing model | Reference |
| --- | --- | --- | --- | --- |
| DNA | lentiMPRA cell-type-specific regulatory activity prediction | Predicting experimentally measured enhancer activity via lentiMPRA, mainly cell-type-specific CRE activity; note that conventional genomic language models are not cell-type aware | Ridge regression, MLP, CNN (CNN outperformance indicates nonlinearity) | Tang et al. |
| DNA | ChIP-seq TF binding prediction | Binary classification of TF binding events identified from ChIP-seq | CNN | Tang et al. |
| RNA | Alternative splicing prediction | Given splice acceptor and donor sequences, predicting percent-spliced-in across 56 tissues (multi-task regression) | CNN | Tang et al. |
| RNA | RBP binding prediction (eCLIP-seq) | Binary classification of whether a sequence corresponds to an eCLIP-seq peak | CNN | Tang et al. |
| RNA | RNA Pol II elongation potential prediction | Regression of RNA Pol II elongation potential from INSERT-seq | CNN | Tang et al. |
| RNA | Secondary structure prediction | 1) Transformer attention maps inform base-pairing probabilities; 2) these are supplied to construct RNA secondary structure | Logistic regression | Gong et al. |
| Protein | Guidance of antibody evolution (efficient evolution) | Restricting the mutational space using PLM (ESM-1b, ESM-1v) likelihood ratios relative to the wild-type amino acid sequence | N/A (log of conditional likelihood) | Hie et al. |
| Protein | Deep mutational scan prediction | Predicting mutation effects (correlation with DMS measurements) | N/A (pseudo-log-likelihood) | Gordon et al. |
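The last two protein tasks in the table require no probe at all: scores are read directly from the language model’s (pseudo) likelihoods. Below is a minimal sketch of likelihood-ratio scoring of single substitutions in the spirit of Hie et al., assuming the fair-esm package; the checkpoint, toy sequence, and wild-type-marginal scoring scheme are illustrative (the original work differs in its masking and model-ensembling details).

```python
# Minimal sketch: score single substitutions by their log-likelihood ratio
# against the wild-type residue, using a protein language model.
# Assumes the fair-esm package (pip install fair-esm); downloads model weights.
import torch
import esm

model, alphabet = esm.pretrained.esm1v_t33_650M_UR90S_1()
model.eval()
batch_converter = alphabet.get_batch_converter()

wt_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy wild-type sequence
_, _, tokens = batch_converter([("wt", wt_seq)])

with torch.no_grad():
    logits = model(tokens)["logits"]          # (1, seq_len + special tokens, vocab)
log_probs = torch.log_softmax(logits, dim=-1)[0]

def llr(pos: int, mut_aa: str) -> float:
    """Log-likelihood ratio of a substitution vs. the wild-type residue.

    `pos` is 0-based in the raw sequence; +1 skips the prepended BOS token.
    """
    col = log_probs[pos + 1]
    return (col[alphabet.get_idx(mut_aa)] - col[alphabet.get_idx(wt_seq[pos])]).item()

print(llr(4, "W"))   # > 0 means the mutation is judged more plausible than wild type
```

Substitutions with a positive ratio would be retained in the restricted mutational space; a pseudo-log-likelihood for a whole sequence (as in the DMS-style scoring) is obtained analogously by masking each position in turn and summing the log-probabilities of the observed residues.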

Note that we do not describe these tasks as having “succeeded” under zero-shot inference. Tang et al. demonstrate that for tasks highly influenced by the cell-type-specific nature of TF binding (lentiMPRA, ChIP-seq), one-hot encoding of nucleotides surpasses nonlinear probing of embeddings.
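For reference, the one-hot baseline involves no pretrained model at all: nucleotides are encoded directly and a small convolutional probe is trained on top. The sketch below is illustrative and does not follow Tang et al.’s exact architecture or hyperparameters.

```python
# Minimal sketch of the one-hot baseline: encode nucleotides directly and
# probe with a small CNN; no language-model embeddings are used.
import numpy as np
import torch
import torch.nn as nn

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq: str) -> np.ndarray:
    """(4, L) one-hot encoding; ambiguous bases (e.g. N) stay all-zero."""
    x = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in NUC:
            x[NUC[base], i] = 1.0
    return x

class CNNProbe(nn.Module):
    def __init__(self, n_filters: int = 64, kernel: int = 8):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel)
        self.head = nn.Linear(n_filters, 1)

    def forward(self, x):                      # x: (batch, 4, L)
        h = torch.relu(self.conv(x))
        h = h.max(dim=-1).values               # global max pooling over positions
        return self.head(h).squeeze(-1)        # one scalar (e.g. activity) per sequence

x = torch.from_numpy(one_hot("ACGTACGTAGCTAGCTTT")).unsqueeze(0)
print(CNNProbe()(x))
```

When a baseline like this matches or beats probes trained on model embeddings, the embeddings are evidently not contributing information beyond the raw sequence for that task.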

Another important lesson from benchmarking on these tasks is that supervised models such as Enformer and Sei outperform language models trained in a self-supervised manner. The success of Enformer in recapitulating the multiplicative mode of enhancer action in silico further supports this performance gap between supervised and self-supervised models (Zhou et al.).


Bridging the two worlds

A promising direction lies between the supervised and self-supervised paradigms. One proof-of-concept approach that connects these two worlds is aligning language models with experimental feedback data (RLXF; Blalock et al.).

This intuitively resembles RLHF in natural language models: biological assays are used to construct reward signals, providing a framework that leverages both large-scale pretraining and targeted experimental feedback.
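As a schematic of the underlying idea (not the RLXF recipe itself), the toy loop below applies a plain policy-gradient update to a small sequence “policy” using a stand-in reward. In practice the policy would be a pretrained NLP model and the reward would come from a model fit to experimental assay readouts; both are toys here.

```python
# Toy REINFORCE loop: push a sequence policy toward samples that score well
# under a stand-in reward. Purely illustrative; not the RLXF method.
import torch

LENGTH, VOCAB = 20, 4                             # toy nucleotide sequences
logits = torch.zeros(LENGTH, VOCAB, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

def reward(samples: torch.Tensor) -> torch.Tensor:
    # Stand-in for an assay-derived reward: fraction of "G" (index 2) per sequence.
    return (samples == 2).float().mean(dim=1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    samples = dist.sample((64,))                  # (64, LENGTH)
    log_p = dist.log_prob(samples).sum(dim=1)     # log-prob of each sampled sequence
    r = reward(samples)
    loss = -((r - r.mean()) * log_p).mean()       # REINFORCE with a mean baseline
    opt.zero_grad()
    loss.backward()
    opt.step()

final = torch.distributions.Categorical(logits=logits).sample((64,))
print(reward(final).mean())                       # mean reward should rise toward 1.0
```

The appeal of the alignment framing is that the pretrained model serves as the starting policy, so experimental feedback only needs to nudge it rather than teach it from scratch.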

Further methodological advancements will be necessary to scale this paradigm, but it still represents a vital conceptual bridge.
