[HUGGINGFACE] score: 0.55

Sparse Autoencoders on CosyVoice3 TTS Reveal Steerable Phoneme and Gender Features

June 8, 2026

BatchTopK sparse autoencoders trained on CosyVoice3's LM backbone recover interpretable features spanning phonemes, laughter, accent, and speaker gender, with causal steering interventions raising laughter probability from 0.02 to 0.79 and flipping perceived gender. A modality-aware auto-interp pipeline labels features by whether they fire on text, speech, or both.
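The two mechanisms named above, BatchTopK sparsity and latent steering, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are random, the dimensions are toy-sized, and every name (`batch_topk`, `encode`, `decode`, `steer`, `feature_idx`, `strength`) is hypothetical. It assumes BatchTopK keeps the `k * batch_size` largest activations across the whole batch rather than `k` per sample, and that steering adds a scaled, normalized decoder direction to the hidden state.

```python
import numpy as np


def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest activations across the whole batch."""
    batch_size = pre_acts.shape[0]
    acts = np.maximum(pre_acts, 0.0)  # ReLU before selection
    n_keep = k * batch_size
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    # Indices of the n_keep largest values, chosen batch-wide.
    keep_idx = np.argpartition(flat, -n_keep)[-n_keep:]
    out = np.zeros_like(flat)
    out[keep_idx] = flat[keep_idx]
    return out.reshape(acts.shape)


def encode(x, W_enc, b_enc, k):
    """Map hidden states to sparse latents."""
    return batch_topk(x @ W_enc + b_enc, k)


def decode(z, W_dec, b_dec):
    """Reconstruct hidden states from sparse latents."""
    return z @ W_dec + b_dec


def steer(hidden, W_dec, feature_idx, strength):
    """Push hidden states along one feature's (unit-norm) decoder direction."""
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return hidden + strength * direction


# Toy setup with random weights (assumption: real SAE weights would be trained).
rng = np.random.default_rng(0)
d_model, d_sae, batch, k = 16, 64, 4, 3
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=(batch, d_model))   # stand-in for LM hidden states
z = encode(x, W_enc, b_enc, k)          # sparse latents
x_hat = decode(z, W_dec, b_dec)         # reconstruction
steered = steer(x, W_dec, feature_idx=5, strength=4.0)
```

Because the selection is batch-wide, one sample can hold more than `k` active latents while another holds fewer; only the average is fixed at `k`. In a real pipeline the steered hidden state would be written back into the model's forward pass at the layer the SAE was trained on, which this sketch does not attempt.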

HOW THIS AFFECTS YOU

● builder: You can use SAE latent steering as a lightweight control mechanism for TTS attributes like laughter and speech rate without retraining the base model.

● researcher: This extends SAE interpretability methods to multimodal token streams, with a modality-aware labeling pipeline that could generalize to other text-audio LMs.

● designer: Targeted latent interventions give you fine-grained expressive control over TTS output (laughter, gender, rate) without prompt engineering or model swaps.

Read original: huggingface.co