BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis

Abstract

Autoregressive (AR) frameworks have recently achieved remarkable progress in zero-shot text-to-speech (TTS) by leveraging discrete speech tokens and large language model techniques. Despite their success, existing AR-based zero-shot TTS systems face two critical limitations: (i) an inherent speed–quality trade-off, as sequential token generation either reduces frame rates at the cost of expressiveness or enriches tokens at the cost of efficiency, and (ii) a text-oriented supervision mismatch, as cross-entropy loss penalizes token errors uniformly without considering the fine-grained acoustic similarity among adjacent tokens. To address these challenges, we propose BridgeTTS, a novel AR-TTS framework built upon the dual speech representation paradigm BridgeCode. BridgeTTS reduces AR iterations by predicting sparse tokens while reconstructing rich continuous features for high-quality synthesis. Joint optimization of token-level and feature-level objectives further enhances naturalness and intelligibility. Experiments demonstrate that BridgeTTS achieves competitive quality and speaker similarity while significantly accelerating synthesis. Speech demos are available at https://test1562.github.io/demo/.
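The joint token-level and feature-level optimization described in the abstract can be sketched as below. This is a minimal illustrative sketch only: the function name, the L1 reconstruction term, and the weighting factor `lambda_feat` are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def joint_tts_loss(token_logits, target_tokens, recon_feats, target_feats,
                   lambda_feat=1.0):
    """Illustrative joint objective: cross-entropy on sparse speech tokens
    plus a reconstruction loss on the rich continuous features.

    token_logits : (T, V) unnormalized scores over the token vocabulary
    target_tokens: (T,)   integer token ids
    recon_feats  : (T, D) reconstructed continuous features
    target_feats : (T, D) reference continuous features
    """
    # Token-level term: numerically stable log-softmax cross-entropy,
    # which penalizes all token errors uniformly.
    logits = token_logits - token_logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(target_tokens)), target_tokens].mean()

    # Feature-level term: L1 distance between reconstructed and reference
    # features, sensitive to fine-grained acoustic similarity.
    feat = np.abs(recon_feats - target_feats).mean()

    return ce + lambda_feat * feat
```

Under this sketch, a model that reconstructs the continuous features perfectly is rewarded even when its token predictions are unchanged, which is the intuition behind supplementing cross-entropy with a feature-level objective.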

Model Architecture


Figure 1: Comparison Between Previous and Proposed BridgeTTS Frameworks Using Autoregressive Generators.



Figure 2: Overview of the Training Process for BridgeCode.



Figure 3: Overview of the proposed BridgeTTS. (A) Training Process Diagram. (B) Inference Process Diagram.


Comparison Experiments

  • The audio samples below were synthesized using the model proposed in this paper.
  • The LibriTTS dataset is used; it can be downloaded from https://www.openslr.org/60/.
  • LibriTTS is a multi-speaker English corpus comprising about 585 hours of speech from over 2,300 speakers. Train-clean-100, train-clean-360, and train-other-500 are merged as the training set; dev-clean and dev-other are merged as the development set; and test-clean and test-other are merged as the test set.
  • The sources of the comparison models (VALL-E, CosyVoice, UniAudio, and GPT-Talker) are listed below.
  • Sample for Seen Speaker - Development Set

    Sample 1

    Text: If we had one real critic in London-but what can one expect?

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 2

    Text: As soon as they beheld the twilight of sense and heresy, they started, measured back their steps, and were again involved in the gloom of impenetrable orthodoxy.

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 3

    Text: The Author wishes it to be understood that Erewhon is pronounced as a word of three syllables, all short-thus, E re whon.

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 4

    Text: The paternal parent has a right to his infants, no doubt." That was Bozzle's law.

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 5

    Text: And Mr Ossipon brings every week a pile of these f p tracts to sell at a halfpenny each.

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS


    Sample for Unseen Speaker - Test Set

    Sample 1

    Text: So choose for yourself-to make a rush or tarry here."

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 2

    Text: They met a good many acquaintances; Mainhall, indeed, knew almost every one, and he babbled on incontinently, screwing his small head about over his high collar.

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 3

    Text: The first was-sort of in play, wasn't it?"

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 4

    Text: "Nothing whatever," replied the courtier, as pale as death; "but your majesty has not thought of Fruits.",

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS


    Ablation Study

  • Ablation studies are performed to evaluate the effectiveness of the sequence compression encoding and feature loss in the DenseBridge framework.
  • -w/o DenseBridge refers to training the AR generator with the compressed tokens alone, without using DenseBridge.
  • -w/o Loss Features refers to training the AR generator with DenseBridge but without the feature loss component.
  • Sample for Seen Speaker - Development Set

    Sample 1

    Text: He had stolen out during the half hour allowed at the works for tea, to buy them an orange or two, which now puffed out his jacket pocket.

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features

    Sample 2

    Text: I was walking backward, in a crouching position, when I heard Antonia scream.

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features

    Sample 3

    Text: It was the afternoon of a holiday, and she had closed early.

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features


    Sample for Unseen Speaker - Test Set

    Sample 1

    Text: But, sir, how shall I find a teacher?

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features

    Sample 2

    Text: A young lady quietly joined the party at the supper table.

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features

    Sample 3

    Text: Yet were they little worse than what were insisted on before the battle of Naseby.

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features

    Sample 4

    Text: A ring of amethyst I could not wear here, plainer to my sight, Than that first kiss.

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features