BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis
Abstract
Autoregressive (AR) frameworks have recently achieved remarkable progress in zero-shot text-to-speech (TTS) by leveraging discrete speech tokens and large language model techniques. Despite their success, existing AR-based zero-shot TTS systems face two critical limitations: (i) an inherent speed–quality trade-off, as sequential token generation either reduces frame rates at the cost of expressiveness or enriches tokens at the cost of efficiency, and (ii) a text-oriented supervision mismatch, as cross-entropy loss penalizes token errors uniformly without considering the fine-grained acoustic similarity among adjacent tokens. To address these challenges, we propose BridgeTTS, a novel AR-TTS framework built upon the dual speech representation paradigm BridgeCode. BridgeTTS reduces AR iterations by predicting sparse tokens while reconstructing rich continuous features for high-quality synthesis. Joint optimization of token-level and feature-level objectives further enhances naturalness and intelligibility. Experiments demonstrate that BridgeTTS achieves competitive quality and speaker similarity while significantly accelerating synthesis. Speech demos are available at https://test1562.github.io/demo/.
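The supervision mismatch named in the abstract can be illustrated with a toy example. The sketch below is not the BridgeTTS implementation: the 2-D codebook, the probability vectors, the weight `lam`, and all function names are hypothetical, chosen only to show that cross-entropy gives the same penalty to an acoustically near miss and a far miss, while a feature-level term (here a plain mean squared error, assumed for illustration) separates them.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target token under `probs`."""
    return -math.log(probs[target])

def feature_mse(codebook, predicted, target):
    """Mean squared error between the embeddings of two tokens."""
    a, b = codebook[predicted], codebook[target]
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def argmax(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical codebook: tokens 0 and 1 are acoustically close, token 2 is far.
codebook = [[0.0, 0.0], [0.1, 0.0], [3.0, 4.0]]
target = 0

# Both predictions give the correct token probability 0.1,
# but place the remaining mass on different wrong tokens.
near_miss = [0.1, 0.8, 0.1]  # favours token 1 (close to the target)
far_miss = [0.1, 0.1, 0.8]   # favours token 2 (far from the target)

# Token-level objective: identical for both errors.
ce_near = cross_entropy(near_miss, target)
ce_far = cross_entropy(far_miss, target)

# Feature-level objective: sensitive to how far the wrong token is.
mse_near = feature_mse(codebook, argmax(near_miss), target)
mse_far = feature_mse(codebook, argmax(far_miss), target)

# A joint objective in the spirit of "token-level plus feature-level";
# the weight `lam` is an assumption, not a value from the paper.
lam = 1.0
joint_near = ce_near + lam * mse_near
joint_far = ce_far + lam * mse_far

print(ce_near == ce_far)       # cross-entropy cannot tell the two errors apart
print(joint_near < joint_far)  # the joint loss penalizes the far miss more
```

Under these assumptions, only the feature-level term distinguishes the two errors, which is the motivation the abstract gives for optimizing both objectives jointly.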
Contents
Model Architecture
Comparison Experiments
- VALL-E: https://github.com/lifeiteng/VALL-E
- CosyVoice: https://github.com/FunAudioLLM/CosyVoice
- UniAudio: https://github.com/yangdongchao/UniAudio
- GPT-Talker: https://github.com/walker-hyf/GPT-Talker
Sample for Seen Speaker - Development Set
Sample 1
Text: If we had one real critic in London-but what can one expect?
Groundtruth | VALL-E | CosyVoice | UniAudio | GPT-Talker | BridgeTTS |
---|---|---|---|---|---|
Sample 2
Text: As soon as they beheld the twilight of sense and heresy, they started, measured back their steps, and were again involved in the gloom of impenetrable orthodoxy.
Groundtruth | VALL-E | CosyVoice | UniAudio | GPT-Talker | BridgeTTS |
---|---|---|---|---|---|
Sample 3
Text: The Author wishes it to be understood that Erewhon is pronounced as a word of three syllables, all short-thus, E re whon.
Groundtruth | VALL-E | CosyVoice | UniAudio | GPT-Talker | BridgeTTS |
---|---|---|---|---|---|
Sample 4
Text: The paternal parent has a right to his infants, no doubt." That was Bozzle's law.
Groundtruth | VALL-E | CosyVoice | UniAudio | GPT-Talker | BridgeTTS |
---|---|---|---|---|---|
Sample 5
Text: And Mr Ossipon brings every week a pile of these f p tracts to sell at a halfpenny each.
Groundtruth | VALL-E | CosyVoice | UniAudio | GPT-Talker | BridgeTTS |
---|---|---|---|---|---|
Sample for Unseen Speaker - Test Set
Sample 1
Text: So choose for yourself-to make a rush or tarry here."
Groundtruth | VALL-E | CosyVoice | UniAudio | GPT-Talker | BridgeTTS |
---|---|---|---|---|---|
Sample 2
Text: They met a good many acquaintances; Mainhall, indeed, knew almost every one, and he babbled on incontinently, screwing his small head about over his high collar.
Groundtruth | VALL-E | CosyVoice | UniAudio | GPT-Talker | BridgeTTS |
---|---|---|---|---|---|
Sample 3
Text: The first was-sort of in play, wasn't it?"
Groundtruth | VALL-E | CosyVoice | UniAudio | GPT-Talker | BridgeTTS |
---|---|---|---|---|---|
Sample 4
Text: "Nothing whatever," replied the courtier, as pale as death; "but your majesty has not thought of Fruits."
Groundtruth | VALL-E | CosyVoice | UniAudio | GPT-Talker | BridgeTTS |
---|---|---|---|---|---|
Ablation Study
Sample for Seen Speaker - Development Set
Sample 1
Text: He had stolen out during the half hour allowed at the works for tea, to buy them an orange or two, which now puffed out his jacket pocket.
Groundtruth | Proposed | -w/o DenseBridge | -w/o Loss Features |
---|---|---|---|
Sample 2
Text: I was walking backward, in a crouching position, when I heard Antonia scream.
Groundtruth | Proposed | -w/o DenseBridge | -w/o Loss Features |
---|---|---|---|
Sample 3
Text: It was the afternoon of a holiday, and she had closed early.
Groundtruth | Proposed | -w/o DenseBridge | -w/o Loss Features |
---|---|---|---|
Sample for Unseen Speaker - Test Set
Sample 1
Text: But, sir, how shall I find a teacher?
Groundtruth | Proposed | -w/o DenseBridge | -w/o Loss Features |
---|---|---|---|
Sample 2
Text: A young lady quietly joined the party at the supper table.
Groundtruth | Proposed | -w/o DenseBridge | -w/o Loss Features |
---|---|---|---|
Sample 3
Text: Yet were they little worse than what were insisted on before the battle of Naseby.
Groundtruth | Proposed | -w/o DenseBridge | -w/o Loss Features |
---|---|---|---|
Sample 4
Text: A ring of amethyst I could not wear here, plainer to my sight, Than that first kiss.
Groundtruth | Proposed | -w/o DenseBridge | -w/o Loss Features |
---|---|---|---|