BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis

Abstract

Autoregressive (AR) frameworks have recently achieved remarkable progress in zero-shot text-to-speech (TTS) by leveraging discrete speech tokens and large language model techniques. Despite their success, existing AR-based zero-shot TTS systems face two critical limitations: (i) an inherent speed–quality trade-off, as sequential token generation either reduces frame rates at the cost of expressiveness or enriches tokens at the cost of efficiency, and (ii) a text-oriented supervision mismatch, as cross-entropy loss penalizes token errors uniformly without considering the fine-grained acoustic similarity among adjacent tokens. To address these challenges, we propose BridgeTTS, a novel AR-TTS framework built upon the dual speech representation paradigm BridgeCode. BridgeTTS reduces AR iterations by predicting sparse tokens while reconstructing rich continuous features for high-quality synthesis. Joint optimization of token-level and feature-level objectives further enhances naturalness and intelligibility. Experiments demonstrate that BridgeTTS achieves competitive quality and speaker similarity while significantly accelerating synthesis. Speech demos are available at https://test1562.github.io/demo/.
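The joint token-level and feature-level optimization described in the abstract can be sketched as below. This is a minimal illustrative sketch only: the function name, the L1 reconstruction term, and the weighting factor `lambda_feat` are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def joint_tts_loss(token_logits, target_tokens, recon_feats, target_feats,
                   lambda_feat=1.0):
    """Illustrative joint objective: cross-entropy on sparse speech tokens
    plus a reconstruction loss on the rich continuous features.

    token_logits : (T, V) unnormalized scores over the token vocabulary
    target_tokens: (T,)   integer token ids
    recon_feats  : (T, D) reconstructed continuous features
    target_feats : (T, D) reference continuous features
    """
    # Token-level term: numerically stable log-softmax cross-entropy,
    # which penalizes all token errors uniformly.
    logits = token_logits - token_logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(target_tokens)), target_tokens].mean()

    # Feature-level term: L1 distance between reconstructed and reference
    # features, sensitive to fine-grained acoustic similarity.
    feat = np.abs(recon_feats - target_feats).mean()

    return ce + lambda_feat * feat
```

Under this sketch, a model that reconstructs the continuous features perfectly is rewarded even when its token predictions are unchanged, which is the intuition behind supplementing cross-entropy with a feature-level objective.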

Model Architecture


Figure 1: Comparison Between Previous and Proposed BridgeTTS Frameworks Using Autoregressive Generators.



Figure 2: Overview of the Training Process for BridgeCode.



Figure 3: Overview of the proposed BridgeTTS. (A) Training Process Diagram. (B) Inference Process Diagram.


Comparison Experiments

  • The audio samples below were synthesized using the model proposed in this paper.
  • The LibriTTS dataset is used; it can be downloaded from https://www.openslr.org/60/.
  • LibriTTS is a multi-speaker English corpus comprising about 585 hours of speech from over 2,300 speakers. Train-clean-100, train-clean-360, and train-other-500 are merged as the training set; dev-clean and dev-other are merged as the development set; and test-clean and test-other are merged as the test set.
  • The sources of the comparison models (VALL-E, CosyVoice, UniAudio, and GPT-Talker) are listed below.
  • Sample for Seen Speaker - Development Set

    Sample 1

    Text: If we had one real critic in London-but what can one expect?

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 2

    Text: As soon as they beheld the twilight of sense and heresy, they started, measured back their steps, and were again involved in the gloom of impenetrable orthodoxy.

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 3

    Text: The Author wishes it to be understood that Erewhon is pronounced as a word of three syllables, all short-thus, E re whon.

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 4

    Text: The paternal parent has a right to his infants, no doubt." That was Bozzle's law.

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 5

    Text: And Mr Ossipon brings every week a pile of these f p tracts to sell at a halfpenny each.

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS


    Sample for Unseen Speaker - Test Set

    Sample 1

    Text: So choose for yourself-to make a rush or tarry here."

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 2

    Text: They met a good many acquaintances; Mainhall, indeed, knew almost every one, and he babbled on incontinently, screwing his small head about over his high collar.

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 3

    Text: The first was-sort of in play, wasn't it?"

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS

    Sample 4

    Text: "Nothing whatever," replied the courtier, as pale as death; "but your majesty has not thought of Fruits.",

    Groundtruth VALL-E CosyVoice UniAudio GPT-Talker BridgeTTS


    Ablation Study

  • Ablation studies are performed to evaluate the effectiveness of the sequence compression encoding and feature loss in the DenseBridge framework.
  • -w/o DenseBridge refers to training the AR generator with the compressed tokens alone, without using DenseBridge.
  • -w/o Loss Features refers to training the AR generator with DenseBridge but without the feature loss component.
  • Sample for Seen Speaker - Development Set

    Sample 1

    Text: He had stolen out during the half hour allowed at the works for tea, to buy them an orange or two, which now puffed out his jacket pocket.

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features

    Sample 2

    Text: I was walking backward, in a crouching position, when I heard Antonia scream.

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features

    Sample 3

    Text: It was the afternoon of a holiday, and she had closed early.

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features


    Sample for Unseen Speaker - Test Set

    Sample 1

    Text: But, sir, how shall I find a teacher?

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features

    Sample 2

    Text: A young lady quietly joined the party at the supper table.

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features

    Sample 3

    Text: Yet were they little worse than what were insisted on before the battle of Naseby.

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features

    Sample 4

    Text: A ring of amethyst I could not wear here, plainer to my sight, Than that first kiss.

    Groundtruth Proposed -w/o DenseBridge -w/o Loss Features