[arXiv] score: 0.17

120M Genomic Model with Adaptive Tokenization Beats Models 20x Larger

June 4, 2026

LDARNet applies H-Net-style dynamic chunking to masked language modeling for DNA, using BiMamba-2 layers with local attention to learn biologically meaningful token boundaries without supervision. At 120M parameters it wins 11 of 18 benchmark tasks among sub-300M models and achieves state-of-the-art results on 5 histone modification tasks, outperforming models of up to 2.4B parameters.
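The dynamic chunking idea can be illustrated with a toy sketch. It assumes the routing rule described for H-Net, where the boundary probability between adjacent positions is half of one minus the cosine similarity of their representations. The learned projections are replaced by the identity here, and the function names and example vectors are hypothetical. This is not LDARNet's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def boundary_probs(hidden):
    """p_t = 0.5 * (1 - cos(h_t, h_{t-1})); the first position always opens a chunk.

    Assumption: identity projections stand in for the learned ones.
    """
    probs = [1.0]
    for prev, cur in zip(hidden, hidden[1:]):
        probs.append(0.5 * (1.0 - cosine(cur, prev)))
    return probs

def chunk(tokens, hidden, threshold=0.5):
    """Start a new chunk wherever the boundary probability reaches the threshold."""
    chunks = []
    for tok, p in zip(tokens, boundary_probs(hidden)):
        if p >= threshold or not chunks:
            chunks.append([tok])
        else:
            chunks[-1].append(tok)
    return chunks

# Hypothetical 2-d representations for six nucleotides.
tokens = list("AAACGG")
hidden = [[1, 0], [1, 0], [1, 0], [-1, 0], [0, 1], [0, 1]]
print(chunk(tokens, hidden))  # [['A', 'A', 'A'], ['C'], ['G', 'G']]
```

Positions whose representations resemble their predecessor are merged into the same chunk, so token length adapts to the sequence content rather than being fixed in advance as with k-mers or BPE.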

cs.CL, q-bio.GN

HOW THIS AFFECTS YOU

● researcher: A FLOPs-matched controlled experiment isolates a meaningful efficiency gain from adaptive tokenization via learned routing over fixed k-mer or BPE schemes in genomic modeling.

● health: A compact genomic foundation model competitive with much larger models on histone modification tasks could reduce compute costs for epigenomics research pipelines.

SOURCE

https://arxiv.org/abs/2606.04552
