Unified Pix Token And Word Token Generative Language Model

May 15, 2026

A unified multimodal model that combines pixel tokens and word tokens to address the limited visual detail (small text, numbers) captured by the CLIP/SigLIP-based vision encoders used in current state-of-the-art open-source models.

cs.CV

SOURCE

https://arxiv.org/abs/2605.14028
