JSTIP Interleaved Pretraining Boosts ASR Entity Accuracy
July 3, 2026
Joint Speech-Text Interleaved Pretraining (JSTIP) improves automatic speech recognition (ASR) by constructing word-level and segment-level interleaved sequences for speech-LLM architectures. Using 38k hours of ASR data, the method achieves entity recognition performance on par with training on synthetic pairs while simplifying domain adaptation.
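To make the idea of word-level and segment-level interleaving concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the summary does not describe how JSTIP actually builds its sequences, so the `AlignedWord` type, the integer "speech units", the random modality choice, and the `p_speech` and `segment_len` parameters are all hypothetical. The sketch assumes each word comes with a span of speech tokens (for example from a forced alignment) and emits a mixed sequence in which each unit appears either as text or as speech.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class AlignedWord:
    """A word plus the speech tokens aligned to it (hypothetical format)."""
    text: str
    speech_units: List[int]  # stand-in for discrete speech units or frame ids


# One element of an interleaved sequence: ("text", str) or ("speech", [ids]).
Span = Tuple[str, Union[str, List[int]]]


def _interleave(units, p_speech: float, rng: random.Random) -> List[Span]:
    """For each (text, speech) unit, keep the speech form with probability
    p_speech and the text form otherwise."""
    out: List[Span] = []
    for text, speech in units:
        if rng.random() < p_speech:
            out.append(("speech", speech))
        else:
            out.append(("text", text))
    return out


def word_level(words: List[AlignedWord], p_speech: float = 0.5,
               seed: int = 0) -> List[Span]:
    """Switch modality independently for every word."""
    rng = random.Random(seed)
    units = [(w.text, w.speech_units) for w in words]
    return _interleave(units, p_speech, rng)


def segment_level(words: List[AlignedWord], segment_len: int = 3,
                  p_speech: float = 0.5, seed: int = 0) -> List[Span]:
    """Group consecutive words into fixed-size segments, then switch
    modality once per segment."""
    rng = random.Random(seed)
    units = []
    for i in range(0, len(words), segment_len):
        chunk = words[i:i + segment_len]
        text = " ".join(w.text for w in chunk)
        speech = [u for w in chunk for u in w.speech_units]
        units.append((text, speech))
    return _interleave(units, p_speech, rng)


if __name__ == "__main__":
    # Made-up utterance with made-up speech unit ids.
    utterance = [
        AlignedWord("call", [11, 12]),
        AlignedWord("doctor", [13]),
        AlignedWord("nakamura", [14, 15, 16]),
        AlignedWord("tomorrow", [17]),
    ]
    print(word_level(utterance, p_speech=0.5, seed=1))
    print(segment_level(utterance, segment_len=3, p_speech=0.5, seed=1))
```

In a sequence like this, an entity name that happens to be kept as text sits next to words kept as speech, which is one plausible way an interleaved model could tie the LLM's text knowledge of the entity to its acoustic context. A real pipeline would replace the integer units with the model's actual speech representation.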
HOW THIS AFFECTS YOU
● Builder: You can use JSTIP to improve the entity recognition capabilities of your speech-to-text pipelines.
● Researcher: This method suggests that interleaving speech and text sequences leverages LLM priors better than simple joint training.