16 May 2026
•
Research
This article introduces my research on a problem I have pondered for a long time.
When I was studying anatomy in college, I found the “땡시[ddaeng-si]” type of exam particularly difficult.
These exams asked for the names of human anatomical structures and landmarks after only a very brief look at each one.
Since all surrounding structures that could serve as clues were covered, it was often difficult to even identify the approximate location based on just a tiny portion of observable structure.
At the time, I thought it would have been easier to deduce the answer if I could have known the approximate location of the presented structure.
This intuition became the starting point for my research.
For high-resolution pathology images or 3D medical images, artificial neural networks are often trained using small patches—smaller than the original image—to ensure efficient learning.
As in the exam anecdote above, patch-based learning restricts the model to a narrower field of view than the entire image.
In this environment, it is important to consider the location from which the patch was sampled.
Generally, the morphology and signal intensity of medical images serve as critical information to identify organ boundaries.
The additional consideration of “location” represents a unique context that emerges in patch-based learning.
For example, we know that the heart is highly unlikely to appear in a patch sampled near the leg.
Therefore, my focus shifted to “where to sample the next patch” and “how to understand the anatomical context of the current patch’s location.”
In short, this boils down to “Where” and “What”, and this research is the result of our efforts to answer these questions.
Additionally, the method that came out of this process is naturally explainable.
In the era of large models and data, I am drawn to principles that are reemerging as the field reflects on the efficiency of the learning process.
Among these, using the human reasoning process as an analytical framework and applying it to machine learning is a classic yet still interesting approach.
25 Feb 2026
•
Annotated BI
Brief overview
What is the key bottleneck of typical supervised learning?
The most common one is curating manual annotations (usually called labeling), which is especially challenging for dense prediction tasks (e.g., segmentation). Cell segmentation for spatial transcriptomics poses a significant challenge for downstream analysis, as it is error-prone and its bias is inherently hard to mitigate.
CSDE decomposes this challenge into two intertwined components: (1) difficulty in curating manual annotations and (2) potential errors in curated annotations.
To address these challenges, CSDE makes full use of error-prone labels generated automatically from model predictions.
For the first challenge, the authors implement an interface for accepting a label, modifying it (modifying segmentation contours is not supported), or rejecting an invalid (segmentation, label) pair, so that a small but high-fidelity dataset can be curated efficiently.
For the second challenge, prediction-powered inference (PPI) provides a statistical framework for valid parameter estimation that combines the small curated dataset with imputations on a large unlabeled dataset.
Given the growing attention to interactive frameworks that use human feedback, CSDE seems to be an important example of how quantitative analysis of biological measurements can be refined efficiently through human intervention. Technical details follow.
Technical details
Key technical details of CSDE include: (1) validity of naive PPI, (2) validity of importance sampling, and (3) the choice of $\lambda_g$ guided by theorems from PPI++.
I’ve discussed some details related to PPI in my previous post.
Problem setup
- Log fold change parameters for gene $g$: $b^g$
- Cell type label for cell $i$: $Y_i$
- Expression of gene $g$ in cell $i$: $X_{ig}$ (quantified from segmentation contours)
Validity of Naive PPI in CSDE
CSDE is built on Prediction-Powered Inference (PPI), which combines:
- A large automated dataset (high power, potentially biased),
- A small manually curated dataset (low bias, higher variance).
The core statistical task is estimating gene-specific log-fold changes (LFCs) using a GLM.
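As a minimal illustration of what an LFC is in a count GLM, assume a Poisson model with a log link and a single binary covariate (target cell type versus the rest). This is my simplification and not necessarily the likelihood CSDE uses; the function name `poisson_lfc` is hypothetical. In this special case the maximum-likelihood LFC has a closed form: the natural log of the ratio of the two group means.

```python
import numpy as np

def poisson_lfc(counts, is_target):
    """MLE of the log fold change in a Poisson GLM with an intercept and
    one binary covariate: log(mean in target group / mean in the rest)."""
    counts = np.asarray(counts, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    mean_target = counts[is_target].mean()
    mean_rest = counts[~is_target].mean()
    return float(np.log(mean_target / mean_rest))

# Toy counts for one gene: three target cells, three other cells.
counts = [8, 12, 10, 2, 3, 1]
is_target = [1, 1, 1, 0, 0, 0]
lfc = poisson_lfc(counts, is_target)  # log(10 / 2) on the natural-log scale
```

With more covariates or a different count distribution there is no such shortcut, and the coefficient has to be obtained by numerically maximizing the likelihood.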
Rather than the naive likelihood, the CSDE estimator maximizes a prediction-powered objective: $\hat{\beta}^g = \arg\max_{b^g} \mathcal{J}_g(b^g)$.
This objective is specified as follows:
\[\mathcal{J}_g(b^g) =\lambda_g \mathcal{L}^g_{\hat{D}_N} (b^g) + (\mathcal{L}^g_{D_n}(b^g)-\lambda_g \mathcal{L}^g_{\hat{D}_n}(b^g))\]
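Here $\mathcal{L}^g_{\hat{D}_N}$ is the log-likelihood on the large automated dataset, $\mathcal{L}^g_{D_n}$ the log-likelihood on the curated dataset, and $\mathcal{L}^g_{\hat{D}_n}$ the log-likelihood of the automated predictions for the same curated cells. When curated cells are sampled uniformly and each $\mathcal{L}$ is taken as a per-cell average, the two automated terms have the same expectation and cancel on average, so the objective is an unbiased estimator of the expected log-likelihood for any fixed $\lambda_g$. Additionally, the CSDE problem setup (differential expression) can be understood as hypothesis testing in a GLM, so Theorem 1 of PPI++ guarantees consistency and asymptotic normality of the PPI estimator.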
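The structure of this objective can be sketched in a few lines. The following is a toy version under my own assumptions, not the CSDE implementation: a Poisson likelihood in which `b[k]` is the log mean expression of cell type `k`, per-cell averaged log-likelihoods, and hypothetical function names `mean_loglik` and `pp_objective`.

```python
import numpy as np

def mean_loglik(b, counts, labels):
    """Average Poisson log-likelihood, dropping the log(x!) term that does
    not depend on b. b[k] is the log mean expression for cell type k
    (an assumed toy parametrization)."""
    eta = np.asarray(b, dtype=float)[labels]
    return float(np.mean(counts * eta - np.exp(eta)))

def pp_objective(b, lam, curated, auto_on_curated, auto_all):
    """lam * L(automated, N cells) + [L(curated) - lam * L(automated, n cells)]."""
    x, y = curated
    xh_n, yh_n = auto_on_curated
    xh_N, yh_N = auto_all
    imputed = mean_loglik(b, xh_N, yh_N)
    rectifier = mean_loglik(b, x, y) - lam * mean_loglik(b, xh_n, yh_n)
    return lam * imputed + rectifier

rng = np.random.default_rng(0)
b = np.array([0.5, 1.5])
y_true = rng.integers(0, 2, size=50)
x_true = rng.poisson(np.exp(b[y_true]))
flip = rng.random(50) < 0.2                    # 20% label errors
y_hat = np.where(flip, 1 - y_true, y_true)
x_hat = x_true + rng.integers(0, 2, size=50)   # crude stand-in for segmentation error

curated, auto_cur = (x_true, y_true), (x_hat, y_hat)
# Degenerate check: if the automated set is exactly the curated cells,
# the two lam-weighted terms cancel and only the curated likelihood remains.
j_lam0 = pp_objective(b, 0.0, curated, auto_cur, auto_cur)
j_lam1 = pp_objective(b, 1.0, curated, auto_cur, auto_cur)
```

The degenerate check mirrors the unbiasedness argument: the automated terms only differ by sampling noise, so on average they contribute nothing to the value of the objective while still reducing its variance when the predictions are informative.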
Validity of Importance Sampling in CSDE
CSDE does not necessarily sample curated cells uniformly. Instead, it may use importance sampling to prioritize cells that are more informative.
The adjusted objective function is as follows:
\[\mathcal{J}_g(b^g) = \lambda_g \mathcal{L}^g_{\hat{D}_N}(b^g) + \sum_{i=1}^n \eta_i \left( \log p(X_{ig} \mid Y_i; b^g) - \lambda_g \log p(\hat{X}_{ig} \mid \hat{Y}_i; b^g) \right)\]
CSDE defines the sampling weights to prioritize the cell type of interest (more precisely, cells that the automated pipeline predicts to be of that type), making those cells more likely to be sampled. Specifically, it aims for approximately one-third of the curated cells to be of the target cell type (e.g., T cells).
This heuristic is simple and can be effective, but personally I wonder whether it could be replaced by a more principled approach.
Nevertheless, this approach is still valid: the automated dataset term converges to the population expectation by the law of large numbers (LLN), and the reweighted curated term consistently estimates the population discrepancy. Each term converges in probability to its expectation under the unweighted distribution (by the LLN and by the consistency of self-normalized importance sampling (SNIS), respectively).
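To make the reweighting concrete, here is a minimal sketch of self-normalized importance weights. It assumes that $\eta_i$ is proportional to the inverse of the probability with which cell $i$ was selected for curation; the function names and the toy numbers are hypothetical, and CSDE's exact weighting scheme may differ.

```python
import numpy as np

def snis_weights(sampling_probs):
    """Self-normalized importance weights: proportional to 1 / q_i, summing to 1."""
    inv = 1.0 / np.asarray(sampling_probs, dtype=float)
    return inv / inv.sum()

def weighted_mean(values, sampling_probs):
    """Estimate a population average from a non-uniformly sampled subset."""
    w = snis_weights(sampling_probs)
    return float(np.dot(w, np.asarray(values, dtype=float)))

# Two curated cells: the first (a predicted target cell) was twice as
# likely to be drawn, so it receives half the weight of the second.
eta = snis_weights([0.5, 0.25])               # [1/3, 2/3]
est = weighted_mean([1.0, 3.0], [0.5, 0.25])  # 1/3 * 1 + 2/3 * 3 = 7/3
```

Because the weights are renormalized, the selection probabilities only need to be known up to a constant factor. The price is a small finite-sample bias, which is why the argument above relies on consistency rather than exact unbiasedness.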
Choice of $\lambda_g$: theory from the PPI++ paper
The parameter $\lambda_g$ balances the imputed likelihood from the large automated dataset and the correction term from curated data.
Proposition 2 of PPI++ shows that there exists an optimal value that minimizes the total asymptotic variance of the entire parameter vector.
However, in CSDE, the parameter of interest is a single coefficient $\beta_k^g$ (for cell type $k$), so it is natural to choose $\lambda_g$ to minimize the asymptotic variance of $\hat{\beta}_k^g$.
Taking the $k$th diagonal term of the asymptotic covariance of the full parameter vector given in the PPI++ paper yields a closed-form solution (see Supplementary Methods C.2 for details).
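The flavor of such a closed form is easiest to see in the simplest scalar case, mean estimation, rather than in the GLM coefficient that CSDE targets. For the estimator $\bar{Y}_n + \lambda(\bar{f}_N - \bar{f}_n)$, minimizing the variance over $\lambda$ gives $\lambda^* = \mathrm{Cov}(Y, f) / ((1 + n/N)\,\mathrm{Var}(f))$. The sketch below implements only this simplified case; the function name is hypothetical and, for brevity, the variance of the predictions is estimated from the curated sample alone.

```python
import numpy as np

def power_tuned_lambda(y_cur, f_cur, n_auto):
    """Variance-minimizing lambda for prediction-powered mean estimation.

    y_cur:  curated (gold) values on the n curated samples
    f_cur:  automated predictions on the same n samples
    n_auto: size N of the large automated dataset
    """
    y = np.asarray(y_cur, dtype=float)
    f = np.asarray(f_cur, dtype=float)
    n = y.size
    var_f = np.var(f, ddof=1)
    if var_f == 0.0:
        return 0.0  # constant predictions carry no information
    cov_yf = np.cov(y, f, ddof=1)[0, 1]
    return float(cov_yf / ((1.0 + n / n_auto) * var_f))

# Perfect predictions with n = 4 and N = 12 give 1 / (1 + 4/12) = 0.75.
lam = power_tuned_lambda([1, 2, 3, 4], [1, 2, 3, 4], n_auto=12)
```

The same logic carries over to the single-coefficient case: predictions that track the curated values closely push $\lambda_g$ toward 1, while uninformative predictions push it toward 0, which recovers the curated-only estimator.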