[arXiv]

JSTIP Interleaved Pretraining Boosts ASR Entity Accuracy

July 3, 2026

Joint Speech-Text Interleaved Pretraining (JSTIP) improves automatic speech recognition (ASR) by constructing word-level and segment-level interleaved speech-text sequences for speech-LLM architectures. Using 38k hours of ASR data, the method achieves entity recognition performance on par with using synthetic pairs, while simplifying domain adaptation.
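
To make the idea of interleaving concrete, here is a minimal sketch of how word-level and segment-level sequences might be built from word-aligned ASR data. It is not the paper's implementation: the names `AlignedWord`, `interleave_words`, and `interleave_segments` are hypothetical, and both the per-word random modality choice and the fixed segment length are assumptions made only for illustration. Speech is represented by a (start, end) time span that a real pipeline would replace with encoder features.

```python
import random
from dataclasses import dataclass


@dataclass
class AlignedWord:
    """A transcript word with its (assumed) forced-alignment times in seconds."""
    text: str
    start: float
    end: float


def interleave_words(words, speech_prob=0.5, seed=0):
    """Word-level sketch: each word is emitted either as a speech span or as text.

    The random per-word choice is an illustrative assumption.
    """
    rng = random.Random(seed)
    sequence = []
    for w in words:
        if rng.random() < speech_prob:
            sequence.append(("speech", (w.start, w.end)))
        else:
            sequence.append(("text", w.text))
    return sequence


def interleave_segments(words, segment_len=2, start_with_speech=True):
    """Segment-level sketch: alternate modality every `segment_len` words.

    Fixed-length segments are an illustrative assumption.
    """
    sequence = []
    use_speech = start_with_speech
    for i in range(0, len(words), segment_len):
        chunk = words[i:i + segment_len]
        if use_speech:
            # One speech span covering the whole chunk.
            sequence.append(("speech", (chunk[0].start, chunk[-1].end)))
        else:
            sequence.append(("text", " ".join(w.text for w in chunk)))
        use_speech = not use_speech
    return sequence


words = [
    AlignedWord("call", 0.0, 0.4),
    AlignedWord("doctor", 0.4, 0.8),
    AlignedWord("okafor", 0.8, 1.2),
    AlignedWord("at", 1.2, 1.6),
    AlignedWord("noon", 1.6, 2.0),
]
segment_sequence = interleave_segments(words, segment_len=2)
word_sequence = interleave_words(words, speech_prob=0.5, seed=0)
```

In this sketch the same utterance yields mixed sequences in which some spans are presented as audio and others as text, so an entity such as a name can appear in either modality.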

HOW THIS AFFECTS YOU

● Builder: You can use JSTIP to improve the entity recognition capabilities of your speech-to-text pipelines.

● Researcher: The results suggest that interleaving speech and text sequences leverages LLM priors better than simple joint training.

