[arXiv] score: 0.35
Unified Pix Token And Word Token Generative Language Model
May 15, 2026
Unified multimodal model that combines pixel tokens with word tokens to recover fine visual detail (small text, numbers) that is lost by the CLIP/SigLIP-based vision encoders used in current state-of-the-art open-source models.
cs.CV