Collection of genes: A sc-linker case study


Introduction


With the advent of single cell genomics, we now have access to high-resolution, genome-wide transcriptomic measurements at the resolution of single cells. This unprecedented level of detail allows us to explore how genes are expressed across diverse cell types, states, and molecular contexts.

One key objective in single-cell RNA sequencing (scRNA-seq) analysis is to discover a collection of genes that are jointly expressed across diverse cell types and molecular contexts. We often call it as gene modules, sets of genes that exhibit coordianted expression patterns.

To assess the biological relevance of these gene modules, researchers commonly perform ontology$\dagger$ enrichment analysis, to support the functional coherence of collected genes.

$\dagger$ Notable advancement in gene ontology; from GO terms, molecular signatures to large-scale model-based embeddings - I’ll discuss these topics later.


Gene Modules from Pairwise Gene Expression


We naturally expect functional relevance across gene collections inferred from gene expression patterns, and one key principle shared by diverse module detection algorithms is the measurement of pairwise relationships. Common approaches in gene module detection, principal component analysis(PCA) and nonnegative matrix factorization(NMF), both leverage this pricniple, interpreting gene module as a linear combination of individual gene instances.

  • In PCA, For normalized gene-feature matrix $X$, we compute $X^TX$; which represents gene-gene graph where edge weight corresponds to the inner product of two feature vectors.
  • NMF(nonnegative matrix factorization) has equivalence with K-means clustering on the bipartite graph(which nodes correspond to both cells and genes), applying relaxation by allowing soft assignments to the K clusters[1].
  • Other non-linear algorithm, Hotspot[2], explicitly computes local autocorrelation on the defined neighborhood graph for all selected feature pairs.

Despite differences in methodology, these approaches share a conceptual foundation: modeling gene-gene relationships as a graph structure. This observation naturally leads to a question: how useful are the detected gene modules in a biological context? An elegant framework addressing this question is sc-linker[3]. This method identifies gene modules contrastively—focusing on cell-type and disease-specific patterns—and then links these modules to complex trait associations using GWAS summary statistics. By doing so, sc-linker not only discovers expression-driven modules but also prioritizes those that are enriched for trait heritability, effectively nominating them as functional categories relevant to disease biology.


A case of sc-linker


sc-linker define gene module $M$ as a linear combination of gene $x_i$: \(P=\sum_{i=1}^N w_ix_i\) where $x_i$ is the expression of gene $i$ and $w_i$ is the corresponding weight. Definition of gene module is roughly categorized into three types:

  1. $M_{cell}$(cell-type specific): \(M_{cell}=\sum w_ix_i \text{ where } w_i=\sigma(\chi_i) \text{ for } \chi_i=-2log(P_i)~\chi_2^2\). Here, $P_i$ is a p-value after DE test comparing specific cell type $C$ to all others, and $\sigma(\cdot)$ is a min-max scaling function.

  2. $M_{dis}$(disease dependent): Defined similarly to $M_{cell}$, but the p-values are derived from disease vs. healthy comparisons.

  3. $M_{proc}$(cellular process): Obatined using contrastive NMF, which learns shared and condition-specific components from healthy and diseased scRNA-seq data.


Contrastive(case-control) NMF for Module identification:

Given two scRNA-seq feature matrices:

  • $H_{P\times N_1}$ for healthy samples
  • $D_{P\times N_2}$ for diseased samples

Conventional NMF decompose each matrix into:

\[H_{P\times N_1}\approx [L^{CH}L^{UH}]F^H, D_{P\times N_2}\approx [L^{CD}L^{UD}]F^D\]

,where $L^H=[L^{CH} L^{UH}], L^D=[L^{CD}L^{UD}]$ contain shared($L^{CH}, L^{CD}$) component and unique components (\(L^{UH}, L^{UD}\)).

In the contrastive setting, we enforce similarity between shared components by introducing a penalty term $\Vert L^{CH}-L^{CD}\Vert $. This term is added to conventional NMF obejctive thus it leads to the minimzation of objective $Q$:

\[Q=\frac{1}{2} \Vert{H-L^HF^H}\Vert _F^2+\frac{1}{2}\Vert{D-L^DF^D}\Vert _F^2+\frac{\mu}{2}(\Vert{L^H} \Vert _F^2+\Vert{L^D}\Vert _F^2)+\frac{\gamma}{2}(\Vert{L^{CH}-L^{CD}}\Vert _F^2)\]

Compuptation of gradient $\nabla Q(L^H), \nabla Q(L^D), \nabla Q(F^H), \nabla Q(F^D)$ yields a multiplicative update rule, derived from splitting the gradient into positive and negative components[4]:

\[\nabla Q(L)=Q_+ - Q_-, L \leftarrow L \circ \frac{Q_-}{Q_+}\]

This encourages shared components to capture common structure while allowing disease-specific modules to emerge.


Linking Gene Modules to GWAS via s-LDSC

Once modules are defined, sc-linker treats each module as a functional category and uses stratified LDSC (s-LDSC)[5] to quantify its contribution to trait heritability.

It is a long journey to introduce s-LDSC from scratch, but some key concepts include:

  • regressing GWAS summary statistics(chi-square statistic) with LD score(sum of LD $r^2$ values between that SNP and all others in the region) yields a estimate of hearitability and confounding factors such as population structure.
  • s-LDSC extends LDSC accounting for the functional category of SNPs
  • in sc-linker study, genes comprising the module are mapped to SNPs via enhancer-gene mapping(Roadmap-ABC)[6-10] strategy thus translates gene functional categories to a collection of mapped SNPs.

These consecutive steps in identifying gene programs and linking with GWAS summary statistics, provide a foundational framework in identifying and applying a collection of genes to interpret complex, context-dependent hierarchy across variant-enhancer-gene-trait.

Reference

[1] Ding, C., Li, T., & Peng, W. (2008). On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics & Data Analysis, 52(8), 3913-3927.

[2] DeTomaso, D., & Yosef, N. (2021). Hotspot identifies informative gene modules across modalities of single-cell genomics. Cell systems, 12(5), 446-456.

[3] Jagadeesh, K. A., Dey, K. K., Montoro, D. T., Mohan, R., Gazal, S., Engreitz, J. M., … & Regev, A. (2022). Identifying disease-critical cell types and cellular processes by integrating single-cell RNA-sequencing and human genetics. Nature genetics, 54(10), 1479-1492.

[4] Lee, Daniel, and H. Sebastian Seung. “Algorithms for non-negative matrix factorization.” Advances in neural information processing systems 13 (2000).

[5] Finucane, Hilary K., et al. “Partitioning heritability by functional annotation using genome-wide association summary statistics.” Nature genetics 47.11 (2015): 1228-1235.

[6] Ernst, J., Kheradpour, P., Mikkelsen, T. S., Shoresh, N., Ward, L. D., Epstein, C. B., … & Bernstein, B. E. (2011). Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473(7345), 43-49.

[7] Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., Kheradpour, P., … & Roadmap Epigenomics Consortium. (2015). Integrative analysis of 111 reference human epigenomes. Nature, 518(7539), 317.

[8] Liu, Y., Sarkar, A., Kheradpour, P., Ernst, J., & Kellis, M. (2017). Evidence of reduced recombination rate in human regulatory domains. Genome biology, 18(1), 193.

[9] Fulco, C. P., Nasser, J., Jones, T. R., Munson, G., Bergman, D. T., Subramanian, V., … & Engreitz, J. M. (2019). Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nature genetics, 51(12), 1664-1669.

[10] Nasser, Joseph, et al. “Genome-wide enhancer maps link risk variants to disease genes.” Nature 593.7858 (2021): 238-243.

Unified perspective on GRN inference with external knowledge

Motivation


Understanding how genes regulate each other is central to unraveling the complex circuitry of cellular life. Gene regulatory networks (GRNs) offer a blueprint of these intricate molecular conversations. With the advent of multiplexed, high-throughout, and multi-modal measurements, new computational models now seek to infer GRNs with unprecedented granularity.

Unfortunately, the field of GRN inference is fragmented. Each method— based on regression, probabilistic graphical models, or graph neural networks—proposes its own formulation. As researchers, we often face the dilemma of choosing between methods, without a clear framework for understanding their shared foundations. As end users, we also need a holistic benchmark stratified by the types of external information giving aid to GRN inference; since some external modality could be critically powerful in GRN inference.

Inspired by the excellent review on GRN inference methods[1], I’d like to share one perspective on understanding diverse GRN inference methods with external information under a unified theoretical framework.

Unified perspective


Gene regulatory network(GRN), is defined as a directed, weighted graph with each node represents genes that could be a regulator or target. Here, let’s describe a GRN with regulatory influence matrix $\Theta \in \mathbb{R}^{p \times p}$, where $p$ is the number of nodes(genes) composing the GRN. Some works limit transcription factor(TF) as active regulator among genome-wide candidates, so $\Theta \in \mathbb{R}^{q \times p}$ for this case where $q$ is the number of selected TFs to compose GRN.

Coarsely we can consider two categories of GRN inference algorithms;
1) directly inferring $\Theta$ from given scRNA-seq data $X$ with regularization term to incorporate external knowledge $\Omega$
2) understanding $\Theta$ as a part of generative process that supports the realization of scRNA-seq data $X$ and apply soft or hard constraints with external data $Y$.

For instance of the first category, we consider CellOracle[2] and NetREX-CF[3]. Their loss function can be represented in a unified framework; \(\hat{\Theta}=argmin_\Theta[{\mathcal{L}_{data}(X,\Theta)+\mathcal{L}_{reg}(\Theta,\Omega)}]\)

The first term stands for the data fit loss, and the remainder stands for the regularization term with respect to the prior information $\Omega$.

Proposition 1. CellOracle regression model is an instance of unified framework.
proof sketch. CellOracle regression model is based on LASSO regression, and candidate regulators are limited to the pre-defined set informed by the prior knowledge.

For gene $j$, let $S_j$ be the set of candidate regulators. For gene $i$ and $j$, introduce a prior-informed support mask $\Omega=[\Omega_{ij}]$ where $\Omega_{ij}=1(i \in S_j) \text{ or } 0(i \notin S_j)$.

Consider a regression model \(X_j=\sum_{i\neq j}\Theta_{ij}X_i+\epsilon_j, \text{ where } \epsilon_j \sim N(0,\sigma_j^2)\) Then optimization objective for LASSO regression is \(\mathcal{L_{data}} = \sum_{j=1}^p[{1\over 2\sigma_j^2}||X_j-\sum_{i=1}^p \Omega_{ij}\Theta_{ij}X_i||_2^2+\kappa \sum_{i=1}^p\Omega_{ij}|\Theta_{ij}|]\) , and with prior regularization, Additionally, hard constraint on unavailable regulatory pairs on $\Theta$ yields \(\mathcal{L}_{reg}=0\text{ if } \Theta_{ij}=0 \text{ for all } i,j \text{ with } m_{ij}=0, \text{ otherwise } \infty\) so that $\Theta_{ij}=0$ is enforced outside $S_j$. Thus, we get optimal $\Theta$ by solving \(\hat{\Theta}=argmin_\Theta[{\mathcal{L}_{data}(X,\Theta)+\mathcal{L}_{reg}(\Theta,\Omega)}]\)

Proposition 2. NetREX-CF regression model is an instance of unified framework.
proof sketch. Let $P^{(k)} \in \mathbb{R}^{q\times p}|_{k=1}^K$ be the prior, quantified knowledge of regulatory pair existence from $K$ sources(it can be diverse modalities). The NetREX-CF model is based on NCA(network component analysis), which loss function can be described as: \(\mathcal{L}_{data}=[||X-\Theta A||_2^2+\kappa||A||_2^2+\lambda||\Theta||_2^2+\eta||\Theta||]\) For prior regularization perspective, the model jointly achieves prior regularization with the optimization of collaborative filtering(CF) objective by explicitly introducing a penalty matrix of CF $C=1+a•\sum_k P^{(k)}$ and prior mask $B=\mathbb{1}(\sum_kP^{(k)}>0)$. With respect to $\Theta$, these matrices are refined to $\Omega=\alpha_1C|\Theta|\odot B+\alpha_2|\Theta|\odot(1-B)+C$ and $M=|\Theta|+(1-|\Theta|)\odot B$. Then prior regularization term is designed to optimize regulator latent matrix $U$ and gene latent matrix $V$, where loss function is: \(\mathcal{L}_{reg}=\sum_{i,j}\Omega_{ij}(M_{ij}-u_i^Tv_j)^2\)


For the second category, here we consider probabilistic graphical model Symphony[4] and PMF-GRN[5] as an instance. Under a bayesian perspective, given observation $X$ and prior $P$, Variational approach is used to approximate the distribution of latent GRN $\Theta$, thus enables estimation of statistics on the posterior distribution of $\Theta$.

We can consider two types of external knowledge to provide soft and hard constraints; data itself($Y$, e.g. ATAC-seq) can be jointly incorporated to the generative process(soft), or existing knowledge(e.g. available TF-gene pairs) can be used as a parameter of prior distribution to guide the inference.

Despite differences in the generative process of scRNA-seq observation $X$, unified perspective of incorporating external data $Y$ or established knowledge as a prior $P$ can be described as: \(P(X,Y|\Theta, P, \Omega_{x}, \Omega_{y})\) Here, we assume a hierarchical model with parameters $\Omega_{x}, \Omega_{y}$ to describe generative process of data $X, Y$, respectively. For simplicity, we consider a dependence relationship that satisfies $P(X,Y|\Theta, P, \Omega_{x}, \Omega_{y})=P(X|\Theta, P, \Omega_{x})•P(Y|\Theta,\Omega_y)$ and $\Theta$ is also dependent to $P, \Omega_{x}$.

Variational perspective yields ELBO as an objective function: \(\mathcal{L}_{elbo}=\mathbb{E}_{q(\Theta,\Omega_x,P)}[\log p(X|\Theta,\Omega_x,P)]+\mathbb{E}_{q(\Theta,\Omega_y,P)}[\log p(Y|\Theta,\Omega_y,P)]+D_{KL}(q(\Theta,\Omega_x,\Omega_y,P)||p(\Theta,\Omega_x,\Omega_y,P))\) Here, we can apply known regulatory relationship to design prior $p(P)$, for instance, directly incorporating knowledge to mean parameter of the prior distribution.

Two different algorithms, symphony and PMF-GRN differs in the data generative process and the choice of strategy for external knowledge injection but they both employ variational inference to optimize the objective function.

Looking ahead


Recently, several sporadic reports provide a clue about the shape of underlying topology of GRN; we can make theoretical hypotheses about the relationship between the amount of information about GRN which can be extracted from diverse modalities of observation.

Which provides a more granular view of GRN? Gene expression patterns, mainly co-expression, from scRNA-seq data? Or the counterparts obtained from causal studies such as perturb-seq? Is there any underlying inequality that holds for the information from two types of observation?

What is currently understood; when gene expression was simulated via a stochastic process from GRNs, the amount of information from direct causal studies seemed to provide a higher resolution view of the underlying network structure than co-expression[6]. Additionally, another report show scRNA-seq atlas scale learning is sufficient to effectively extrapolate on genetic perturbation effects; especially using scGPT and GEARS embeddings[7]. These findings therefore suggest the feasibility of pushing the limit approach of analytical methods to infer GRN from scRNA-seq co-expression information.

Naturally, this raises the question of when we perform GRN inference based on scRNA-seq co-expression, whether perturb-seq can effectively aid inference(very limited at this point, since genome-wide CRISPRi single cell screens are known for one cell type) compared to another type of priors(e.g. ATAC-seq or knowledge based databases) and whether it can be used as a universal prior despite limitations in cell type differences.
To answer this question, we can consider a naive benchmark experiment; regarding diverse types of external information under unified objective function, and systematically assess the information gain obtained from diverse types of prior knowledge.

However, we still lack a rich knowledge about how we could set up with a ground truth data in an experimental or simulational study. Preparing a ground truth network by utilizing TF-gene interaction is one approach under current understanding, and CRISPR screen paired with rich phenotypic measurements could emerge as a novel, information-rich modality. Emergence of these data naturally leads to the question about the choice of dataset usage; they can be both incorporated as a prior knowledge or as a ground truth for evaluation. Sharpening this fuzzy relationship is thus important in systematic benchmark studies, and this is one of the reason why I especially appreciate geneRNIB[8] as a solid foundation for improving GRN inference benchmark studies.

References


[1] Stock, M., Losert, C., Zambon, M., Popp, N., Lubatti, G., Hörmanseder, E., … & Scialdone, A. (2025). Leveraging prior knowledge to infer gene regulatory networks from single-cell RNA-sequencing data. Molecular Systems Biology, 1-17.
[2] Kamimoto, K., Stringa, B., Hoffmann, C. M., Jindal, K., Solnica-Krezel, L., & Morris, S. A. (2023). Dissecting cell identity via network inference and in silico gene perturbation. Nature, 614(7949), 742-751.
[3] Wang, Y., Lee, H., Fear, J. M., Berger, I., Oliver, B., & Przytycka, T. M. (2022). NetREX-CF integrates incomplete transcription factor data with gene expression to reconstruct gene regulatory networks. Communications Biology, 5(1), 1282.
[4] Burdziak, C., Azizi, E., Prabhakaran, S., & Pe’er, D. (2019). A nonparametric multi-view model for estimating cell type-specific gene regulatory networks. arXiv preprint arXiv:1902.08138.
[5] Skok Gibbs, C., Mahmood, O., Bonneau, R., & Cho, K. (2024). PMF-GRN: a variational inference approach to single-cell gene regulatory network inference using probabilistic matrix factorization. Genome biology, 25(1), 88.
[6] Aguirre, M., Spence, J. P., Sella, G., & Pritchard, J. K. (2024). Gene regulatory network structure informs the distribution of perturbation effects. bioRxiv.
[7] Ahlmann-Eltze, C., Huber, W., & Anders, S. (2024). Deep learning-based predictions of gene perturbation effects do not yet outperform simple linear methods. biorxiv.
[8] Nourisa, J., Passemiers, A., Stock, M., Zeller-Plumhoff, B., Cannoodt, R., Arnold, C., … & Luecken, M. D. (2025). geneRNIB: a living benchmark for gene regulatory network inference. bioRxiv, 2025-02.