Foundation Models, Surrogate Biology

An analogy: harvesting knowledge from language models


The origin of this idea traces back to a conversation with a CS colleague during the summer of 2022. At the time, I was fascinated by questions in efficient learning theory and the role of model architecture in determining learning capacity. Our topic was, “Why haven’t vision models achieved results comparable to language models?” This was shortly after the emergence of RLHF, when large language models were attracting global attention.

I proposed a hypothesis: “Perhaps language itself is inherently structured for learning, serving as a highly compressed and organized form of representation.”

He then added an intriguing point: “If large language models capture this structure, we can think of them not just as tools, but as datasets themselves. For small-scale researchers, one good strategy is to harvest the knowledge embedded in these models.”

This conversation planted a question in my mind: Could similar principles apply in biology? Could large-scale models, trained on vast biological observations, sufficiently serve as surrogates for experimental validation or as proxies for biological knowledge?

Validation of high-throughput measurements


Three years later, these questions have become increasingly relevant in functional genomics, where high-throughput measurements produce enormous yet noisy datasets. One emerging approach asks whether conclusions drawn from large-scale studies (both observational and perturbational), often summarized as global statements about systemic architecture, can be recapitulated in silico using models trained by self-supervision on rich data.

For example, models trained on genomic sequences have been used to impute data in a composite $^\dagger$ fashion[1], identify cis-regulatory syntaxes[2], and probe simpler statistical models such as linear predictors[1,2]. These examples demonstrate how deep models can complement or even substitute experimental measurements by serving as structured surrogates for biological information.

$\dagger$ This reminds me of the term ‘Socratic Models’, coined in a seminal paper[3].

In silico recapitulation of global biology


It has become increasingly clear that equating scale with foundational capability is problematic. Models trained on massive datasets in a self-supervised manner do not inherently qualify as foundation models; rather, true foundation models might demonstrate robust generalization beyond pretraining objectives and the ability to reproduce established biological principles without task-specific tuning.

Biology offers a proper testbed for this argument. Biologists possess a wealth of experimentally validated propositions about biological system architecture, and they also have natural systems and synthetic tools with which to test novel hypotheses experimentally. Consider an example from enhancer biology: Gasperini et al.[4] produced large-scale CRISPRi screens, and another study[5] re-analyzed the data to argue that enhancer action is predominantly multiplicative and that evidence for enhancer interactions was undetectable. Here, the Enformer model was used as a computational surrogate to validate these claims in silico. Reversing this logic is also instructive: if a proposed model aspires to be considered a foundation model for biology, it should reproduce such propositions in a zero-shot manner, without explicit fine-tuning.

Some might argue that this claim is quite aggressive, but propositions about biological systems with an established consensus now serve as good benchmarks. Evidence must be systematic and comprehensive—not cherry-picked or anecdotal. Notably, increasing efforts are being made to verify whether so-called “foundation models” enable zero-shot inference of verifiable biological properties derived from large-scale data[6,7].

Learning Constraints from Biological Feedback


An even more ambitious direction is to learn models directly from biological systems by leveraging endogenous mechanisms. Biological systems impose strong selective constraints, determining which states persist and which are eliminated. Negative examples—e.g., states that fail selection—are particularly challenging to obtain under natural conditions, leading to biased datasets where unfavorable configurations are underrepresented.

Among biological systems, the immune repertoire offers a unique opportunity in this regard. For instance, B-cell development is shaped by stringent selection. Productive B cell receptor (BCR) sequences are preferentially retained, while nonproductive or autoreactive sequences are eliminated. This process provides a natural source of implicit preference data, analogous to reward signals in reinforcement learning.

Building on recent work[8], which exploits allelic inclusion to classify suboptimal BCR sequences, one could extend this idea to a generative framework. Specifically, incorporating preference-based reinforcement learning—where productive versus nonproductive repertoires provide implicit ranking signals—could enable models to internalize selective constraints shaping BCR diversity. Such models would not only generate biologically plausible antibody sequences but also simulate affinity maturation pathways, offering new tools for immunological research and therapeutic design.

References

[1] Zhou, Yichao, et al. “scPrediXcan integrates advances in deep learning and single-cell data into a powerful cell-type–specific transcriptome-wide association study framework.” bioRxiv (2025): 2024-11.
[2] Seitz, Evan E., et al. “Interpreting cis-regulatory mechanisms from genomic deep neural networks using surrogate models.” Nature machine intelligence 6.6 (2024): 701-713.
[3] Zeng, Andy, et al. “Socratic models: Composing zero-shot multimodal reasoning with language.” arXiv preprint arXiv:2204.00598 (2022).
[4] Gasperini, Molly, et al. “A genome-wide framework for mapping gene regulation via cellular genetic screens.” Cell 176.1 (2019): 377-390.
[5] Zhou, Jessica L., et al. “Analysis of single-cell CRISPR perturbations indicates that enhancers predominantly act multiplicatively.” Cell Genomics 4.11 (2024).
[6] Wang, Yihui, et al. “Genomic Touchstone: Benchmarking Genomic Language Models in the Context of the Central Dogma.” bioRxiv (2025): 2025-06.
[7] Tang, Ziqi, et al. “Evaluating the representational power of pre-trained DNA language models for regulatory genomics.” Genome Biology 26.1 (2025): 203.
[8] Jagota, Milind, et al. “Learning antibody sequence constraints from allelic inclusion.” bioRxiv (2024).

Collection of genes: A sc-linker case study


Introduction


With the advent of single-cell genomics, we now have access to genome-wide transcriptomic measurements at the resolution of single cells. This unprecedented level of detail allows us to explore how genes are expressed across diverse cell types, states, and molecular contexts.

One key objective in single-cell RNA sequencing (scRNA-seq) analysis is to discover collections of genes that are jointly expressed across diverse cell types and molecular contexts. These are often called gene modules: sets of genes that exhibit coordinated expression patterns.

To assess the biological relevance of these gene modules, researchers commonly perform ontology$\dagger$ enrichment analysis to support the functional coherence of the collected genes.

$\dagger$ There have been notable advances in gene ontology, from GO terms and molecular signatures to large-scale model-based embeddings. I’ll discuss these topics later.


Gene Modules from Pairwise Gene Expression


We naturally expect functional relevance across gene collections inferred from gene expression patterns, and one key principle shared by diverse module detection algorithms is the measurement of pairwise relationships. Two common approaches to gene module detection, principal component analysis (PCA) and nonnegative matrix factorization (NMF), both leverage this principle, interpreting a gene module as a linear combination of individual genes.

  • In PCA, for a normalized matrix $X$ with one column per gene, we compute $X^TX$, which represents a gene-gene graph whose edge weights are the inner products of the two genes' feature vectors.
  • NMF is equivalent to K-means clustering on the bipartite graph (whose nodes correspond to both cells and genes), relaxed to allow soft assignments to the K clusters[1].
  • Another, non-linear algorithm, Hotspot[2], explicitly computes local autocorrelation on a defined neighborhood graph for all selected feature pairs.
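To make the PCA bullet concrete, here is a minimal NumPy sketch of the gene-gene graph. The toy data (200 cells, 6 genes, two latent programs) and all variable names are my own illustrative assumptions, not taken from any of the cited methods.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 200 cells x 6 genes. Genes 0-2 follow one latent
# program and genes 3-5 follow another, plus independent noise.
n_cells = 200
program_a = rng.normal(size=(n_cells, 1))
program_b = rng.normal(size=(n_cells, 1))
X = np.hstack([
    program_a + 0.3 * rng.normal(size=(n_cells, 3)),
    program_b + 0.3 * rng.normal(size=(n_cells, 3)),
])

# Normalize each gene (column) to zero mean and unit variance.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Gene-gene graph: entry (i, j) is the inner product of gene i and gene j
# across cells. Dividing by n_cells turns it into a correlation matrix.
gene_graph = X.T @ X / n_cells

# PCA loadings are eigenvectors of this matrix. Each loading assigns a linear
# weight to every gene, which is the "module as a linear combination" view.
eigvals, eigvecs = np.linalg.eigh(gene_graph)
order = np.argsort(eigvals)[::-1]
loadings = eigvecs[:, order[:2]]

print(np.round(gene_graph, 2))
print(np.round(loadings, 2))
```

In this toy setting the two leading loadings together span the two planted programs, but an individual principal component can mix them. That is one reason rotated or nonnegative factorizations are often easier to read as modules.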

Despite differences in methodology, these approaches share a conceptual foundation: modeling gene-gene relationships as a graph structure. This observation naturally leads to a question: how useful are the detected gene modules in a biological context? An elegant framework addressing this question is sc-linker[3]. This method identifies gene modules contrastively—focusing on cell-type and disease-specific patterns—and then links these modules to complex trait associations using GWAS summary statistics. By doing so, sc-linker not only discovers expression-driven modules but also prioritizes those that are enriched for trait heritability, effectively nominating them as functional categories relevant to disease biology.


A case of sc-linker


sc-linker defines a gene module $M$ as a linear combination of genes: \(M=\sum_{i=1}^N w_ix_i\), where $x_i$ is the expression of gene $i$ and $w_i$ is the corresponding weight. Gene modules fall roughly into three types:

  1. $M_{cell}$ (cell-type specific): \(M_{cell}=\sum_i w_ix_i \text{ where } w_i=\sigma(\chi_i) \text{ for } \chi_i=-2\log(P_i)\sim\chi_2^2\). Here, $P_i$ is the p-value from a differential expression (DE) test comparing a specific cell type $C$ to all others, and $\sigma(\cdot)$ is a min-max scaling function.

  2. $M_{dis}$ (disease dependent): Defined similarly to $M_{cell}$, but the p-values are derived from disease vs. healthy comparisons.

  3. $M_{proc}$ (cellular process): Obtained using contrastive NMF, which learns shared and condition-specific components from healthy and diseased scRNA-seq data.
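The weighting used for $M_{cell}$ and $M_{dis}$ can be sketched in a few lines. This is a minimal illustration of the formula above, not the sc-linker code: the function names, the clipping of tiny p-values, and the handling of the all-equal case are my own assumptions.

```python
import numpy as np


def module_weights(p_values):
    """Map DE p-values to gene weights in [0, 1].

    Implements w_i = sigma(chi_i) with chi_i = -2 * log(P_i) and sigma a
    min-max scaling. The clipping guard is an assumption for stability.
    """
    p = np.clip(np.asarray(p_values, dtype=float), 1e-300, 1.0)  # avoid log(0)
    chi = -2.0 * np.log(p)
    span = chi.max() - chi.min()
    if span == 0:
        # All p-values identical: no gene stands out, so return zero weights.
        return np.zeros_like(chi)
    return (chi - chi.min()) / span


def module_score(weights, expression):
    """Compute M = sum_i w_i * x_i per cell; expression is cells x genes."""
    return np.asarray(expression, dtype=float) @ weights


# Hypothetical p-values for five genes, from most to least significant.
p_values = np.array([1e-8, 1e-3, 0.05, 0.5, 1.0])
w = module_weights(p_values)
print(np.round(w, 3))
```

The most significant gene receives weight 1 and the least significant receives weight 0, so the module is a soft gene set, not a hard cutoff on significance.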


Contrastive (case-control) NMF for module identification:

Given two scRNA-seq feature matrices:

  • $H_{P\times N_1}$ for healthy samples
  • $D_{P\times N_2}$ for diseased samples

Conventional NMF decomposes each matrix into:

\[H_{P\times N_1}\approx [L^{CH}\ L^{UH}]F^H, \quad D_{P\times N_2}\approx [L^{CD}\ L^{UD}]F^D\]

where $L^H=[L^{CH}\ L^{UH}]$ and $L^D=[L^{CD}\ L^{UD}]$ contain shared components ($L^{CH}, L^{CD}$) and unique components ($L^{UH}, L^{UD}$).

In the contrastive setting, we enforce similarity between the shared components by introducing a penalty term $\Vert L^{CH}-L^{CD}\Vert_F^2$. Adding this term to the conventional NMF objective leads to the minimization of the objective $Q$:

\[Q=\frac{1}{2} \Vert{H-L^HF^H}\Vert _F^2+\frac{1}{2}\Vert{D-L^DF^D}\Vert _F^2+\frac{\mu}{2}(\Vert{L^H} \Vert _F^2+\Vert{L^D}\Vert _F^2)+\frac{\gamma}{2}(\Vert{L^{CH}-L^{CD}}\Vert _F^2)\]

Computing the gradients $\nabla Q(L^H), \nabla Q(L^D), \nabla Q(F^H), \nabla Q(F^D)$ and splitting each into its positive and negative components yields a multiplicative update rule[4], where $\circ$ and the fraction are applied elementwise:

\[\nabla Q(L)=Q_+ - Q_-, \quad L \leftarrow L \circ \frac{Q_-}{Q_+}\]
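For concreteness, the split can be written out for the healthy-side factors. This is my own derivation from $Q$ as defined above, not a formula quoted from the sc-linker paper; $[\,\cdot\ \ 0\,]$ pads the shared block with zeros over the unique columns:

\[\nabla Q(L^H)=\underbrace{L^HF^H(F^H)^T+\mu L^H+\gamma[L^{CH}\ \ 0]}_{Q_+}-\underbrace{\left(H(F^H)^T+\gamma[L^{CD}\ \ 0]\right)}_{Q_-}\]

\[\nabla Q(F^H)=\underbrace{(L^H)^TL^HF^H}_{Q_+}-\underbrace{(L^H)^TH}_{Q_-}\]

The diseased-side expressions follow by swapping $H\leftrightarrow D$ and $L^{CH}\leftrightarrow L^{CD}$. Every term in $Q_+$ and $Q_-$ is nonnegative when the data and the current factors are nonnegative, so the elementwise update preserves nonnegativity.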

The penalty encourages the shared components to capture common structure while allowing disease-specific modules to emerge.
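The multiplicative updates can be sketched in NumPy. This is a minimal illustration under my own assumptions, not the sc-linker implementation: the function names, the random initialization, the small `eps` guard in the denominators, and the convention that the first `k_shared` columns of each loading matrix hold the shared components are all hypothetical choices.

```python
import numpy as np


def objective(H, D, LH, LD, FH, FD, k_shared, mu, gamma):
    """Evaluate Q for the current factors."""
    fit = 0.5 * np.sum((H - LH @ FH) ** 2) + 0.5 * np.sum((D - LD @ FD) ** 2)
    ridge = 0.5 * mu * (np.sum(LH ** 2) + np.sum(LD ** 2))
    tie = 0.5 * gamma * np.sum((LH[:, :k_shared] - LD[:, :k_shared]) ** 2)
    return fit + ridge + tie


def update_loadings(X, L, F, L_other, k_shared, mu, gamma, eps):
    """One multiplicative step L <- L * Q_minus / Q_plus for one condition."""
    q_minus = X @ F.T                  # negative part of the gradient
    q_plus = L @ (F @ F.T) + mu * L    # positive part of the gradient
    # The similarity penalty only touches the shared columns.
    q_minus[:, :k_shared] += gamma * L_other[:, :k_shared]
    q_plus[:, :k_shared] += gamma * L[:, :k_shared]
    return L * q_minus / (q_plus + eps)


def contrastive_nmf(H, D, k_shared, k_unique, mu=0.1, gamma=1.0,
                    n_iter=200, eps=1e-9, seed=0):
    """Factorize H (genes x healthy cells) and D (genes x diseased cells)."""
    rng = np.random.default_rng(seed)
    k = k_shared + k_unique
    LH, LD = rng.random((H.shape[0], k)), rng.random((D.shape[0], k))
    FH, FD = rng.random((k, H.shape[1])), rng.random((k, D.shape[1]))
    history = [objective(H, D, LH, LD, FH, FD, k_shared, mu, gamma)]
    for _ in range(n_iter):
        # Factor matrices use the standard unregularized NMF update.
        FH = FH * (LH.T @ H) / (LH.T @ LH @ FH + eps)
        FD = FD * (LD.T @ D) / (LD.T @ LD @ FD + eps)
        # Loadings are updated in turn, each seeing the other's latest value.
        LH = update_loadings(H, LH, FH, LD, k_shared, mu, gamma, eps)
        LD = update_loadings(D, LD, FD, LH, k_shared, mu, gamma, eps)
        history.append(objective(H, D, LH, LD, FH, FD, k_shared, mu, gamma))
    return LH, LD, FH, FD, history


# Toy nonnegative matrices (hypothetical): 30 genes, 40 healthy and 50
# diseased cells.
rng = np.random.default_rng(42)
H = rng.random((30, 40))
D = rng.random((30, 50))
LH, LD, FH, FD, history = contrastive_nmf(H, D, k_shared=3, k_unique=2,
                                          n_iter=50)
print(history[0], history[-1])
```

Raising `gamma` pulls the shared columns of the two loading matrices closer together, at the cost of a looser fit to each condition. Setting `gamma=0` reduces the problem to two independent ridge-regularized NMF runs.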


Linking Gene Modules to GWAS via s-LDSC

Once modules are defined, sc-linker treats each module as a functional category and uses stratified LDSC (s-LDSC)[5] to quantify its contribution to trait heritability.

Introducing s-LDSC from scratch would be a long journey, but the key concepts are:

  • Regressing GWAS summary statistics (chi-square statistics) on LD scores (for each SNP, the sum of LD $r^2$ values between that SNP and all others in the region) yields an estimate of heritability, along with an intercept that captures confounding factors such as population structure.
  • s-LDSC extends LDSC by accounting for the functional categories of SNPs.
  • In the sc-linker study, the genes comprising a module are mapped to SNPs via an enhancer-gene mapping strategy (Roadmap-ABC)[6-10], which translates gene-level functional categories into collections of mapped SNPs.
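The regression idea behind these bullets can be illustrated with a small simulation. This is a sketch under strong simplifying assumptions, not the LDSC software: it uses unweighted ordinary least squares (the real method uses a weighted regression with block-jackknife standard errors), and the sample size, SNP count, LD scores, and per-SNP heritability coefficients `tau_true` are all made-up values. The model simulated is $E[\chi^2_j]=N\sum_c \tau_c\,\ell(j,c)+1$, that is, no confounding.

```python
import numpy as np

rng = np.random.default_rng(1)

n_samples = 50_000                  # GWAS sample size N (hypothetical)
n_snps = 200_000                    # number of SNPs (hypothetical)
tau_true = np.array([4e-6, 2e-5])   # per-SNP heritability: [baseline, module]

# Stratified LD scores: column c is the summed r^2 between each SNP and the
# SNPs in category c. They are drawn independently at random here, which is
# not realistic but keeps the example short.
ld_scores = np.column_stack([
    rng.uniform(1, 200, n_snps),    # baseline category
    rng.uniform(0, 20, n_snps),     # SNPs linked to the gene module
])

# Simulate association statistics whose expectation follows the s-LDSC model.
expected_chi2 = n_samples * (ld_scores @ tau_true) + 1.0
chi2 = expected_chi2 * rng.chisquare(df=1, size=n_snps)

# Regress chi-square statistics on the LD scores, with an intercept column.
design = np.column_stack([np.ones(n_snps), ld_scores])
coef, *_ = np.linalg.lstsq(design, chi2, rcond=None)
intercept = coef[0]
tau_hat = coef[1:] / n_samples      # slopes are N * tau_c

print("intercept:", round(float(intercept), 2))
print("tau_hat:", tau_hat)
```

A module whose coefficient stays clearly positive after conditioning on the baseline category is the kind of signal that s-LDSC reports as heritability enrichment.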

Together, these consecutive steps of identifying gene programs and linking them to GWAS summary statistics provide a foundational framework for identifying and applying collections of genes to interpret the complex, context-dependent hierarchy spanning variants, enhancers, genes, and traits.

References

[1] Ding, C., Li, T., & Peng, W. (2008). On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics & Data Analysis, 52(8), 3913-3927.

[2] DeTomaso, D., & Yosef, N. (2021). Hotspot identifies informative gene modules across modalities of single-cell genomics. Cell systems, 12(5), 446-456.

[3] Jagadeesh, K. A., Dey, K. K., Montoro, D. T., Mohan, R., Gazal, S., Engreitz, J. M., … & Regev, A. (2022). Identifying disease-critical cell types and cellular processes by integrating single-cell RNA-sequencing and human genetics. Nature genetics, 54(10), 1479-1492.

[4] Lee, Daniel, and H. Sebastian Seung. “Algorithms for non-negative matrix factorization.” Advances in neural information processing systems 13 (2000).

[5] Finucane, Hilary K., et al. “Partitioning heritability by functional annotation using genome-wide association summary statistics.” Nature genetics 47.11 (2015): 1228-1235.

[6] Ernst, J., Kheradpour, P., Mikkelsen, T. S., Shoresh, N., Ward, L. D., Epstein, C. B., … & Bernstein, B. E. (2011). Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473(7345), 43-49.

[7] Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., Kheradpour, P., … & Roadmap Epigenomics Consortium. (2015). Integrative analysis of 111 reference human epigenomes. Nature, 518(7539), 317.

[8] Liu, Y., Sarkar, A., Kheradpour, P., Ernst, J., & Kellis, M. (2017). Evidence of reduced recombination rate in human regulatory domains. Genome biology, 18(1), 193.

[9] Fulco, C. P., Nasser, J., Jones, T. R., Munson, G., Bergman, D. T., Subramanian, V., … & Engreitz, J. M. (2019). Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nature genetics, 51(12), 1664-1669.

[10] Nasser, Joseph, et al. “Genome-wide enhancer maps link risk variants to disease genes.” Nature 593.7858 (2021): 238-243.