[arXiv]score: 0.17
120M Genomic Model with Adaptive Tokenization Beats Models 20x Larger
June 4, 2026
LDARNet applies H-Net-style dynamic chunking to masked language modeling for DNA, using BiMamba-2 layers with local attention to learn biologically meaningful token boundaries without supervision. At 120M parameters it wins 11/18 benchmark tasks among sub-300M models and achieves state-of-the-art on 5 histone modification tasks, outperforming models up to 2.4B parameters.
cs.CLq-bio.GN
HOW THIS AFFECTS YOU
●
researcherAdaptive tokenization via learned routing isolates a meaningful efficiency gain over fixed k-mer or BPE schemes in genomic modeling, validated by a FLOPs-matched controlled experiment.
●
healthA compact genomic foundation model competitive with much larger models on histone modification tasks could reduce compute costs for epigenomics research pipelines.