# From Transposable Element Modeling to Physics-informed Biological AI

**Zhongning “Nina” Deng**  
*Research direction memo — working draft*

## Current foundation

My current work focuses on computational genomics for transposable element (TE) annotation. Two practical problems anchor this work. The first is model design: how can complementary genomic signals be combined in a multimodal architecture to improve TE annotation and classification? The second is data design: how can records and taxonomies derived from Repbase and Dfam be reconciled into a traceable, biologically meaningful, ML-ready dataset?

These problems are closely coupled. A sequence model inherits the assumptions, ambiguity, and missing structure of its training data. TE biology makes this especially visible because repetitive sequences, nested insertions, fragmentation, divergent nomenclatures, and uneven taxonomic resolution complicate both labels and evaluation.
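
To make the data-design problem concrete, here is a minimal sketch of how a reconciled record might keep its original label next to a unified hierarchical label. Everything in it is an assumption for illustration: the `TERecord` fields, the `reconcile` helper, and the label strings in `LABEL_MAP` are hypothetical placeholders, not the actual Repbase or Dfam vocabularies or any existing schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative mapping only: (source, source_label) -> unified hierarchical path.
# The label strings are placeholders, not real database vocabularies.
LABEL_MAP = {
    ("repbase", "Gypsy"): ("Retrotransposon", "LTR", "Gypsy"),
    ("dfam", "LTR/Gypsy"): ("Retrotransposon", "LTR", "Gypsy"),
    ("repbase", "hAT"): ("DNA transposon", "TIR", "hAT"),
}


@dataclass(frozen=True)
class TERecord:
    accession: str                          # identifier in the source database
    source: str                             # provenance: which database it came from
    source_label: str                       # label exactly as given by the source
    unified_path: Optional[Tuple[str, ...]]  # None when no mapping is known


def reconcile(accession: str, source: str, source_label: str) -> TERecord:
    """Attach a unified label while preserving the original label and source."""
    key = (source.lower(), source_label)
    return TERecord(accession, source.lower(), source_label, LABEL_MAP.get(key))


# Hypothetical accessions, used only to exercise the function.
mapped = reconcile("ACC_0001", "Dfam", "LTR/Gypsy")
unmapped = reconcile("ACC_0002", "Repbase", "SomeUnlistedLabel")
```

Keeping unmapped labels as `None` rather than guessing a class leaves the ambiguity visible, so it can be resolved by curation instead of being absorbed silently into the training labels.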

## Research question

How can biological machine-learning systems move from pattern recognition toward representations that capture biological structure, state, and change?

I am interested in three connected layers:

1. **Representation:** learning useful genomic features across sequence and complementary modalities.
2. **Data and taxonomy:** defining labels, provenance, hierarchy, and evaluation so that learned representations correspond to defensible biological concepts.
3. **Dynamics and constraints:** building models that represent biological transitions while respecting known structure or physical principles.
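
One way the second layer could shape evaluation is a hierarchy-aware score that gives partial credit when a prediction is right at the class or order level but wrong at the family level. The sketch below is an assumption about how such a score might look, not a metric the memo commits to; the function names and the example paths are hypothetical.

```python
def shared_depth(true_path, pred_path):
    """Number of leading hierarchy levels on which the two paths agree."""
    depth = 0
    for true_level, pred_level in zip(true_path, pred_path):
        if true_level != pred_level:
            break
        depth += 1
    return depth


def hierarchical_score(true_path, pred_path):
    """Fraction of the true path recovered from the root (1.0 = every level)."""
    if not true_path:
        raise ValueError("true_path must have at least one level")
    return shared_depth(true_path, pred_path) / len(true_path)


# Illustrative paths only: correct down to the second level, wrong at the third.
truth = ("Retrotransposon", "LTR", "Gypsy")
guess = ("Retrotransposon", "LTR", "Copia")
partial = hierarchical_score(truth, guess)
```

A flat accuracy would count this prediction as simply wrong, whereas the prefix-based score records that two of three levels were recovered.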

## Why TE work is a useful starting point

TE annotation is a concrete test bed for reasoning across scales. Local motifs, long-range sequence organization, family relationships, evolutionary divergence, and genomic context can all matter. Robust progress therefore requires more than a larger model: it requires careful decisions about information, hierarchy, uncertainty, and generalization.

The transferable asset is not only a TE classifier but also a disciplined approach to building biological learning systems: curate the data lineage, make taxonomy explicit, combine signals deliberately, test what the representation learns, and identify where biological constraints should enter the model.

## Forward direction

My longer-term goal is to contribute to biological foundation models and virtual cell systems that connect rich observations to latent biological dynamics. I am particularly interested in physics-informed biological AI: models whose architectures, objectives, or inference procedures incorporate meaningful constraints rather than learning unconstrained correlations alone.
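
As a toy illustration of a constraint entering through the objective, the sketch below adds a soft penalty to an ordinary data-fit loss. The constraint itself is an assumption chosen for simplicity (predicted state fractions should sum to a fixed total); the function names and the `weight` parameter are hypothetical, and a soft penalty is only one of several ways to impose such structure.

```python
def data_loss(pred, target):
    """Mean squared error between predicted and observed state vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def conservation_penalty(pred, total=1.0):
    """Squared violation of the assumed constraint sum(pred) == total."""
    return (sum(pred) - total) ** 2


def constrained_loss(pred, target, weight=10.0):
    """Data fit plus a weighted penalty for breaking the assumed constraint."""
    return data_loss(pred, target) + weight * conservation_penalty(pred)


# Toy values: the second prediction fits no worse per element than it looks,
# but it is penalized further because its entries no longer sum to one.
observed = [0.5, 0.5]
consistent = constrained_loss([0.4, 0.6], observed)
violating = constrained_loss([0.6, 0.6], observed)
```

The same idea carries over to gradient-based training, where the penalty term would be added to the loss before differentiation; hard constraints built into the architecture are an alternative when violations are unacceptable.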

This is a research direction rather than a completed project. The next steps are to strengthen the current TE/genomics work, study modern genomic foundation models, formulate small bridge problems involving structured latent dynamics or constrained learning, and collaborate with researchers in genomics, mathematical biology, and machine learning.

## Near-term agenda

- Establish a reproducible, taxonomically coherent TE dataset with explicit provenance.
- Benchmark unimodal and multimodal representations for TE annotation.
- Analyze failure modes across TE families, sequence divergence, and dataset shift.
- Identify biological priors that are testable in small, well-defined modeling problems.
- Develop the mathematical and computational tools needed for models of biological state and dynamics.
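
The failure-mode analysis in the agenda above could start from stratified accuracy: the same predictions scored separately within each stratum, such as a divergence bin or a held-out dataset. This is a minimal sketch under that assumption; the `accuracy_by_group` helper, the group names, and the toy labels are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Accuracy per stratum for (group, true_label, pred_label) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, true_label, pred_label in records:
        totals[group] += 1
        hits[group] += int(true_label == pred_label)
    return {group: hits[group] / totals[group] for group in totals}


# Made-up toy predictions, grouped by an illustrative divergence bin.
toy_records = [
    ("low_divergence", "Gypsy", "Gypsy"),
    ("low_divergence", "hAT", "hAT"),
    ("high_divergence", "Gypsy", "Copia"),
    ("high_divergence", "hAT", "hAT"),
]
per_group = accuracy_by_group(toy_records)
```

Reporting results per stratum rather than as one pooled number makes it easier to see whether errors concentrate in divergent sequences or in particular data sources.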