14 Dec 2025
•
Annotated BI
Motivation
We often decompose high-dimensional profiles into low-rank, sparse matrices using factorization algorithms.
These algorithms are closely related to dimensionality reduction—the factors they produce naturally encode low-dimensional representations.
Before discussing factorization algorithms further, let’s first understand dimensionality reduction as an instance of representation learning.
This can be formalized through a universal framework called I-Con. In this view, representation learning aligns the conditional (transition) distribution induced by a supervisory signal with the one induced by a learnable representation function.
The core I-Con loss function is:
\[\mathcal{L}(\theta, \phi) =\int_{i\in \mathcal{X}} D_{KL}(p_\theta(\cdot|i)||q_\phi(\cdot|i))=\int_{i\in\mathcal{X}}\int_{j\in\mathcal{X}}p_\theta(j|i)\log\frac{p_\theta(j|i)}{q_\phi(j|i)}\]
Here, $p_\theta$ is a supervisory distribution, while $q_\phi$ is learned by capturing the structure of the desired representation.
We can optimize both $\theta,\phi$ (e.g., X-Sample), though most methods optimize only $\phi$.
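As a concrete discrete sketch of this objective, the snippet below evaluates the I-Con loss when the integrals reduce to sums over row-stochastic transition matrices. The function name `icon_loss` is illustrative, not from any library.

```python
import numpy as np

def icon_loss(P, Q, eps=1e-12):
    """Discrete I-Con loss: mean KL(p(.|i) || q(.|i)) over anchor points i.

    P, Q are (n, n) row-stochastic matrices: P[i, j] = p(j|i), Q[i, j] = q(j|i).
    """
    P = np.clip(P, eps, None)  # avoid log(0); illustrative numerical guard
    Q = np.clip(Q, eps, None)
    return np.mean(np.sum(P * (np.log(P) - np.log(Q)), axis=1))

# Toy check: identical distributions give zero loss; a mismatched Q does not.
rng = np.random.default_rng(0)
P = rng.random((5, 5))
P /= P.sum(axis=1, keepdims=True)
print(icon_loss(P, P))                      # 0 up to floating point
print(icon_loss(P, np.full((5, 5), 0.2)))   # strictly positive
```

In practice $q_\phi$ would be parameterized by the representation (e.g., a Gaussian kernel over embedding distances, as in t-SNE), and the loss minimized over $\phi$.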
Many representation learning approaches—including t-SNE, SimCLR, and K-Means clustering—can be understood as instances of the I-Con framework.
Now let’s circle back to factorization algorithms.
These include PCA, K-Means clustering, and NMF, but here we focus on NMF as a representative factorization algorithm that learns latent factors as linear combinations of high-dimensional features and reconstructs the original features via linear projection.
Note that NMF (nonnegative matrix factorization) is equivalent to K-means clustering on a bipartite graph (where nodes correspond to both samples and features), with a relaxation that allows soft assignments to the K clusters (Ding et al.).
Since NMF provides a framework to interpret high-dimensional features as a coordinated set of factors, it is widely applied in statistical approaches to find modular structure in high-dimensional profiles and in mechanistic interpretability studies of neural activations.
Interestingly, an NMF block can be seamlessly integrated into deep neural networks to enhance interpretability while still benefiting from the effectiveness of automatic backpropagation in neural networks. In this blog post, I take a closer look at the technical details underlying gradient calculation through the NMF block when incorporated into a neural network.
Nonnegative matrix factorization
A previous blog post explained the multiplicative update rule for NMF, which solves the following NP-hard problem:
\[(U,W)=\arg\min_{U\geq0,W\geq0}||A-UW^T||_F^2\]
This problem is not convex w.r.t. the pair $(U,W)$, but fixing one of the two factors and optimizing the other turns NMF into a pair of convex NNLS problems. This scheme is called alternating NNLS, and the convexity of each subproblem ensures that alternating minimization eventually reaches a local minimum. Here, we first discuss the technical details of solving the NMF optimization problem with the alternating direction method of multipliers (ADMM).
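As a baseline before the ADMM derivation, here is a minimal NumPy sketch of the classical alternating scheme via Lee-Seung multiplicative updates (the iteration count, initialization, and `eps` guard are illustrative choices):

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=1000, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for A ~ U @ W.T with U, W >= 0."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    U, W = rng.random((n, k)), rng.random((m, k))
    for _ in range(n_iter):
        # Each update rescales entries by the ratio of the negative and
        # positive parts of the gradient, preserving nonnegativity.
        U *= (A @ W) / (U @ (W.T @ W) + eps)
        W *= (A.T @ U) / (W @ (U.T @ U) + eps)
    return U, W

# Toy check on data with exact nonnegative rank 2.
rng = np.random.default_rng(1)
A = rng.random((20, 2)) @ rng.random((15, 2)).T
U, W = nmf_multiplicative(A, k=2)
print(np.linalg.norm(A - U @ W.T) / np.linalg.norm(A))  # small relative residual
```

Because the updates are entrywise rescalings, $U, W$ stay nonnegative without any explicit projection.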
The standard ADMM practice for incorporating nonnegativity constraints into the optimization objective is to introduce auxiliary variables $\tilde{U}, \tilde{W}$ as follows:
\[\min_{U,\tilde{U},W,\tilde{W}}\frac{1}{2}||A-\tilde{U}\tilde{W}^T||^2_F+\delta(U)+\delta(W) \quad \text{s.t. } \tilde{U}=U,\ \tilde{W}=W\]
where $\delta(H)=0$ if $H\geq 0$ and $+\infty$ otherwise.
Introducing $\tilde{U}, \tilde{W}$ may seem redundant, but it separates (1) the unconstrained optimization from (2) the constraints ($\delta(\cdot)$) applied to $U,W$.
Note that $\tilde{U},U$ and $\tilde{W},W$ differ during optimization but converge to equality in the limit.
During optimization, dual variables $\bar{U},\bar{W}$ balance the objectives of (1) and (2).
Following standard ADMM practice, we create an augmented Lagrangian incorporating these constraints:
\[\mathcal{L}(A,U,W,\tilde{U},\tilde{W},\bar{U},\bar{W})=\]
\[\frac{1}{2}||A-\tilde{U}\tilde{W}^T||_F^2 + \delta(U)+ \delta(W)\]
\[+\langle\bar{U},\tilde{U}-U\rangle+\langle\bar{W},\tilde{W}-W\rangle\]
\[+\frac{\rho}{2}(||\tilde{U}-U||_F^2+||\tilde{W}-W||_F^2)\]
We solve this Lagrangian by decomposing it into a sequence of convex problems.
ADMM iterates over the $(U, \tilde{U}, \bar{U}), (W,\tilde{W},\bar{W})$ triplets as follows:
\[U_{t+1} = \arg\min_{U=\tilde{U}} \frac{1}{2}||A-\tilde{U}W^T_t||^2_F+\delta(U)+\frac{\rho}{2}||\tilde{U}-U||_F^2\]
\[W_{t+1} = \arg\min_{W=\tilde{W}} \frac{1}{2}||A-U_t\tilde{W}^T||^2_F+\delta(W)+\frac{\rho}{2}||\tilde{W}-W||_F^2\]
Each triplet update further decomposes into three sub-steps: an unconstrained least-squares solve for the auxiliary variable, a projection onto the nonnegative orthant, and a dual-variable update. Their simplicity and efficiency are detailed in Fel et al., Appendix C.2.
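These three sub-steps can be sketched in NumPy as follows (a simplified scaled-dual variant; $\rho$, the iteration count, and the direct linear solve are illustrative choices, not the exact scheme of Fel et al.):

```python
import numpy as np

def nmf_admm(A, k, rho=1.0, n_iter=500, seed=0):
    """ADMM for A ~ U @ W.T, with nonnegativity handled via variable splitting.

    Per factor, each outer iteration runs the three sub-steps (scaled dual form):
      (1) unconstrained ridge-like least squares for the auxiliary variable,
      (2) projection of (auxiliary + dual) onto the nonnegative orthant,
      (3) dual ascent on the consensus constraint.
    """
    rng = np.random.default_rng(seed)
    n, m = A.shape
    U, W = rng.random((n, k)), rng.random((m, k))
    Ub, Wb = np.zeros((n, k)), np.zeros((m, k))  # scaled dual variables
    I = np.eye(k)
    for _ in range(n_iter):
        # --- (U, U~, U_bar) triplet ---
        Ut = (A @ W + rho * (U - Ub)) @ np.linalg.inv(W.T @ W + rho * I)
        U = np.maximum(0.0, Ut + Ub)
        Ub += Ut - U
        # --- (W, W~, W_bar) triplet ---
        Wt = (A.T @ U + rho * (W - Wb)) @ np.linalg.inv(U.T @ U + rho * I)
        W = np.maximum(0.0, Wt + Wb)
        Wb += Wt - W
    return U, W

# Toy check on data with exact nonnegative rank 2.
rng = np.random.default_rng(1)
A = rng.random((20, 2)) @ rng.random((15, 2)).T
U, W = nmf_admm(A, k=2)
print(np.linalg.norm(A - U @ W.T) / np.linalg.norm(A))  # small relative residual
```

Note how the indicator $\delta(\cdot)$ never appears explicitly: its proximal operator is exactly the `np.maximum(0.0, ...)` projection.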
Implicit differentiation of NMF block with Jaxopt
Let’s motivate the use of “implicit differentiation” for backpropagating through the NMF block.
Our goal is to compute $\frac{\partial{U}}{\partial{X}}, \frac{\partial{W}}{\partial{X}}$.
The chain rule shows we need to compute the Jacobians $\frac{\partial{U}}{\partial{A}}, \frac{\partial{W}}{\partial{A}}$ and feed them into the automatic differentiation computational graph implemented by PyTorch or TensorFlow.
The Jaxopt library provides an efficient, modular way to perform implicit differentiation. It calculates these Jacobians without explicitly forming the entire Jacobian matrix. Instead, it uses VJP and JVP (vector-Jacobian product and Jacobian-vector product) to reduce the problem to solving a linear system.
General principles are omitted here, but Blondel et al. show that various families of optimality conditions (including stationarity conditions, KKT, etc.) fit this general recipe by choosing an appropriate optimality function $F$.
Specifically, backpropagation through the NMF block stacks the KKT conditions on the NNLS problems to obtain optimality function $F$.
For the NMF block, we can perform two-stage backpropagation following these steps:
(1) Construct optimality function $F((U,W,\bar{U},\bar{W}),A)=((UW^T-A)W-\bar{U}, (WU^T-A^T)U-\bar{W},\bar{U}\odot U, \bar{W}\odot W)$.
(2) Jaxopt computes $\frac{\partial(U,W,\bar{U},\bar{W})}{\partial A}= -(\partial_1{F})^{-1}\partial_2F$.
(2’) Here, $(\partial_1F)^{-1}$ is not explicitly computed. Instead, $(\partial_1F) \frac{\partial(U,W,\bar{U},\bar{W})}{\partial A}= -\partial_2F$ is solved by conjugate gradient using JVP $v \mapsto(\partial_1F)v$.
(3) Use the chain rule to compute $\frac{\partial U}{\partial X}=\frac{\partial U}{\partial A}\frac{\partial A}{\partial X}$.
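The recipe in steps (1)-(3) can be illustrated on a toy problem where $-(\partial_1 F)^{-1}\partial_2 F$ has a closed form to compare against: the stationarity condition of an unconstrained least-squares problem. This sketch uses finite differences and a dense solve purely for illustration; Jaxopt instead works matrix-free with JVPs and conjugate gradient.

```python
import numpy as np

def implicit_jacobian(F, x_star, a, eps=1e-6):
    """dx*/da = -(d1 F)^{-1} d2 F at a root F(x*, a) = 0.

    d1 F and d2 F are approximated by central finite differences here;
    an autodiff framework would obtain them via JVPs instead.
    """
    nx, na = x_star.size, a.size
    d1 = np.zeros((nx, nx))
    d2 = np.zeros((nx, na))
    for j in range(nx):
        e = np.zeros(nx); e[j] = eps
        d1[:, j] = (F(x_star + e, a) - F(x_star - e, a)) / (2 * eps)
    for j in range(na):
        e = np.zeros(na); e[j] = eps
        d2[:, j] = (F(x_star, a + e) - F(x_star, a - e)) / (2 * eps)
    return -np.linalg.solve(d1, d2)

# Toy root: u*(a) minimizes 0.5||a - M u||^2, so F(u, a) = M.T @ (M @ u - a).
rng = np.random.default_rng(0)
M = rng.random((6, 3))
a = rng.random(6)
u_star = np.linalg.lstsq(M, a, rcond=None)[0]
F = lambda u, a_: M.T @ (M @ u - a_)
J = implicit_jacobian(F, u_star, a)
# Closed form for this quadratic problem: (M.T M)^{-1} M.T
J_closed = np.linalg.solve(M.T @ M, M.T)
print(np.max(np.abs(J - J_closed)))  # agrees up to finite-difference error
```

For the NMF block, $F$ is the stacked KKT system from step (1) and the unknowns are $(U,W,\bar{U},\bar{W})$; the linear algebra is the same, just larger.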
22 Nov 2025
•
Annotated BI
Testbed for benchmarking large models trained on biological sequences
My recent blog post examined the stringent criteria required to meaningfully evaluate large models trained on biological sequences.
For convenience, I’d like to refer to “large models trained on biological sequences” as NLP models, which abbreviates “Nature’s Language Processing” to emphasize the analogy between natural language modeling and the modeling of genomic, transcriptomic, and proteomic sequences. \(\dagger\)
\(\dagger\): Future blog posts will discuss some critical gaps in an apple-to-apple comparison between natural languages and biological sequences.
Despite rapid progress, an important challenge persists: we lack biologically principled, community-wide standards for benchmarking NLP models, and a semantic gap remains between biologists and computer scientists regarding what “good performance” should look like.
To guide the discussion, we articulate two core aspirations for NLP models:
- robust generalization beyond pretraining objectives
- reproduction of established biological principles without task-specific tuning
To clarify this landscape, we first examine biologically grounded tasks within the framework of the central dogma.
Central Dogma
The central dogma (DNA → RNA → protein) provides a natural organizing principle for evaluating NLP models across distinct sequence modalities.
Wang et al. recently proposed a comprehensive benchmark designed with extensive tasks spanning the central dogma.
A few examples of sequence-level (DNA, RNA, protein) tasks from the full list of 36 tasks proposed in the original work are presented below.
| Modality | Task | Description | Category |
| --- | --- | --- | --- |
| DNA | Promoter annotation | Distinguishing promoter regions: 1) TATA box, Initiator, CCAAT box, GC box; 2) minimal sequence for transcription initiation around the TSS | Classification |
| DNA | Enhancer type classification | Distinguishing 1) tissue-specific enhancers, 2) tissue-invariant enhancers, 3) non-enhancer regions | Classification |
| RNA | mRNA stability prediction | Predicting mRNA stability profiles | Regression |
| RNA | SARS-CoV-2 vaccine degradation prediction | Predicting hydrolysis rates of mRNA | Regression |
| Protein | Post-translational modification prediction | Predicting PTM categories | Classification |
| Protein | Fluorescence prediction | Predicting log-fluorescence for avGFP sequences | Regression |
Here, we observe that the scope of the task varies significantly. Some tasks test general principles and regulatory syntaxes of gene expression, while others assess specific biological properties (e.g., degradation of mRNA, fluorescence of protein).
This naturally poses a constraint on zero-shot predictability. For some tasks, we expect NLP models to succeed easily with simple linear probing. On the other hand, some tasks are inherently designed with narrow scopes and thus require adjustment ranging from fine-tuning of penultimate layers to extensive fine-tuning of all layers.
Focusing on zero-shot inference
In practice, zero-shot inference in biological modeling roughly refers to predicting outcomes for a task or distribution that was never explicitly used for training.
New tasks or distributions can involve new functional assays, cellular contexts, or sequence families. This is why we focus on evaluating the representational power of NLP models in a zero-shot setting.
Note that Wang et al. provide a comprehensive benchmark but primarily focus on fine-tuning diverse NLP models on specific datasets.
On the other hand, Tang et al. provide a good example of designing DNA-level tasks and benchmarking across NLP models using simple probing models. Below, we summarize key tasks that have been attempted with zero-shot inference using NLP models.
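To make the probing setup concrete, here is a minimal sketch of a ridge-regression probe on frozen embeddings, with synthetic embeddings standing in for actual NLP-model outputs (all names and data here are illustrative, not from Tang et al.):

```python
import numpy as np

def ridge_probe(E_train, y_train, E_test, lam=1.0):
    """Closed-form ridge regression probe on frozen embeddings.

    E_* are (n_samples, d) embedding matrices from a frozen model
    (synthetic stand-ins here); y_train is the assay readout to predict.
    """
    d = E_train.shape[1]
    w = np.linalg.solve(E_train.T @ E_train + lam * np.eye(d), E_train.T @ y_train)
    return E_test @ w

# Synthetic stand-in: embeddings whose first direction carries the signal.
rng = np.random.default_rng(0)
E = rng.normal(size=(200, 16))
y = 2.0 * E[:, 0] + 0.1 * rng.normal(size=200)
y_pred = ridge_probe(E[:150], y[:150], E[150:])
corr = np.corrcoef(y_pred, y[150:])[0, 1]
print(corr)  # high only if the embedding linearly encodes the readout
```

The probe's weights are the only trained parameters, which is what makes such evaluations a test of the frozen representation rather than of the probe.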
Tasks challenged with zero-shot inference (or with simple probing)
| Modality | Task | Description | Probing model | Reference |
| --- | --- | --- | --- | --- |
| DNA | LentiMPRA cell-type-specific regulatory activity prediction | Predicting experimentally measured enhancer activity via lentiMPRA, mainly cell-type-specific CRE activity. Note that conventional genomic language models are not cell-type aware. | Ridge regression, MLP, CNN (CNN outperformance indicates nonlinearity) | Tang et al. |
| DNA | ChIP-seq TF binding prediction | Binary classification of TF binding events identified from ChIP-seq | CNN | Tang et al. |
| RNA | Predicting alternative splicing | Given splice acceptor and donor sequences, predict percentage-spliced-in across 56 tissues (multi-task regression) | CNN | Tang et al. |
| RNA | RBP binding prediction (eCLIP-seq) | Binary classification of whether the sequence corresponds to an eCLIP-seq peak | CNN | Tang et al. |
| RNA | Predicting RNA Pol II elongation potential | Regression of RNA Pol II elongation potential from INSERT-seq | CNN | Tang et al. |
| RNA | Secondary structure prediction | 1) Transformer attention maps inform base-pairing probability; 2) supplied to construct RNA secondary structure | Logistic regression | Gong et al. |
| Protein | Guidance of antibody evolution (efficient evolution theory) | Restricting mutational space using PLM (ESM-1b, ESM-1v) likelihood ratios compared to the WT amino acid sequence | N/A (log of conditional likelihood) | Hie et al. |
| Protein | Deep mutational scan success prediction | Predicting mutation-effect capability (DMS correlation) | N/A (pseudo log-likelihood) | Gordon et al. |
Here, we do not describe these tasks as "solved" by zero-shot inference.
Tang et al. demonstrate that for tasks heavily influenced by the cell-type-specific nature of TF binding (LentiMPRA, ChIP-seq), one-hot encoding of nucleotides surpasses nonlinear probing of embeddings.
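For reference, the one-hot baseline is as simple as it sounds: each nucleotide maps to a 4-dimensional indicator vector, and the resulting matrix feeds the probing model directly (a minimal sketch; the handling of ambiguous bases is an illustrative choice):

```python
import numpy as np

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """One-hot encode a DNA sequence into a (len, 4) array."""
    X = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        if base in NUC:           # ambiguous bases (e.g., N) stay all-zero
            X[i, NUC[base]] = 1.0
    return X

X = one_hot("ACGTN")
print(X.shape)   # (5, 4)
print(X.sum())   # 4.0 -- the N row is all zeros
```

That such a representation-free baseline can beat learned embeddings on cell-type-specific tasks underscores how little cell-type context current genomic language models encode.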
Another important lesson from benchmarking on these tasks is that supervised models such as Enformer and Sei outperform other language models trained in a self-supervised manner.
The success of Enformer in recapitulating the multiplicative mode of enhancer action in silico also supports this performance gap between supervised and self-supervised models (Zhou et al.).
Bridging the two worlds
A promising direction lies between the supervised and self-supervised paradigms. One proof-of-concept approach connecting the two is aligning language models with experimental feedback data (RLXF, Blalock et al.).
It intuitively resembles the concept of RLHF in human-language models. This approach employs biological assays to construct reward signals, providing a framework that leverages both large-scale pretraining and targeted feedback.
Further methodological advancements will be necessary to scale this paradigm, but it still represents a vital conceptual bridge.