HinglishTTS — Audio Examples & Metric Illustrations

Sarvam TTS (bulbul:v3) vs Qwen3-TTS (Base + ICL voice cloning) · Test set T01–T20 · 2026-03-26

Objective Metrics — Definitions & Formulas

CSPI — Code-Switching Phonetic Index ↑ higher is better  ·  range 0–1

Composite index measuring how faithfully a TTS model renders both Hindi and English phonetics in a code-switched sentence. Uses language-proportional weighting: each sentence's Hindi and English components are weighted by that sentence's own token distribution, so errors in the dominant language count more.

w_hi = n_hindi / (n_hindi + n_english)
w_en = n_english / (n_hindi + n_english)

CSPI = w_hi × (H-Index + H-Phoneme) / 2
+ w_en × (E-Index + E-Phoneme) / 2

H-Index = correct Hindi tokens / total Hindi tokens
E-Index = correct English tokens / total English tokens
H-Phoneme = mean char-level similarity of Hindi token pairs
E-Phoneme = mean char-level similarity of English token pairs

All comparisons normalise to Devanagari so Whisper's script inconsistencies don't penalise correct pronunciation. A Hindi-heavy sentence (e.g. T01: w_hi=0.83) penalises Hindi errors much more than English ones; an English-heavy sentence (T05: w_en=0.83) does the reverse.

0.0 — unintelligible · 0.5 — moderate · 1.0 — perfect
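The weighting and averaging above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation pipeline's code: it assumes a 1:1 token alignment between reference and transcription, uses `difflib.SequenceMatcher` as a stand-in for whatever char-level similarity the pipeline uses, and skips the Devanagari normalisation step.

```python
from difflib import SequenceMatcher

def char_sim(a: str, b: str) -> float:
    # Stand-in for the char-level similarity behind H-Phoneme / E-Phoneme.
    return SequenceMatcher(None, a, b).ratio()

def cspi(ref_tokens, hyp_tokens, tags):
    # tags: "HI" or "EN" per reference token; hypothesis assumed aligned 1:1.
    def side(lang):
        pairs = [(r, h) for r, h, t in zip(ref_tokens, hyp_tokens, tags) if t == lang]
        if not pairs:
            return 0.0, 0.0
        index = sum(r == h for r, h in pairs) / len(pairs)            # H-/E-Index
        phoneme = sum(char_sim(r, h) for r, h in pairs) / len(pairs)  # H-/E-Phoneme
        return (index + phoneme) / 2, len(pairs) / len(ref_tokens)    # score, weight

    hi_score, w_hi = side("HI")
    en_score, w_en = side("EN")
    return w_hi * hi_score + w_en * en_score
```

A perfect transcription scores 1.0, and the weights w_hi and w_en fall out of the tag counts exactly as in the formula (e.g. 5 Hindi tokens out of 6 gives w_hi = 0.83).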

HNR — Harmonics-to-Noise Ratio ↑ higher is better  ·  unit: dB

Absolute measure of voice quality — the ratio of periodic (harmonic) energy to aperiodic (noise) energy in the synthesised speech. Computed via Praat's autocorrelation method over voiced frames only.

HNR (dB) = 10 × log₁₀( E_harmonic / E_noise )

where E_harmonic = energy of the periodic component
and E_noise = total energy − E_harmonic

Unlike MFCC-based stability scores (which self-normalise per model), HNR is an absolute dB scale comparable across models. Unvoiced and silent frames are excluded — only voiced speech frames contribute.

<10 dB — poor · 15–20 dB — good · >20 dB — excellent
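The autocorrelation method can be sketched as follows — a simplified single-frame version of what Praat computes (Praat adds peak interpolation, voicing detection, and per-frame aggregation). The normalised autocorrelation peak r is read as the harmonic fraction of energy, giving HNR = 10·log₁₀(r / (1 − r)):

```python
import numpy as np

def hnr_db(frame):
    # Simplified Praat-style HNR for one voiced frame: the normalised
    # autocorrelation peak r is the harmonic fraction of energy, and
    # 1 - r is the noise fraction.
    x = frame - frame.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac = ac / ac[0]                            # lag 0 == total energy == 1
    first_neg = int(np.argmax(ac < 0)) or 1    # skip the lag-0 main lobe
    r = ac[first_neg:].max()
    return 10 * np.log10(r / (1 - r))

# Synthetic check: a clean 150 Hz tone vs the same tone plus noise.
sr = 16000
t = np.arange(0, 0.04, 1 / sr)
tone = np.sin(2 * np.pi * 150 * t)
noisy = tone + 0.3 * np.random.default_rng(0).standard_normal(t.size)
```

Adding noise lowers the autocorrelation peak, so `hnr_db(noisy)` comes out several dB below `hnr_db(tone)` — the same direction as the Sarvam/Qwen3 gaps in the tables below.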

Boundary Penalty — Code-Switch Transition Smoothness ↓ lower is better  ·  ratio ≥ 0

Measures how much rougher code-switch boundaries are compared to within-language transitions, using frame-to-frame MFCC Euclidean distances. A value near 1.0 means the switch is as smooth as any other frame transition.

BP = mean_disc(boundary frames) / mean_disc(within frames)

disc(frame i) = ‖ MFCC(i+1) − MFCC(i) ‖₂

boundary frames: ±2 MFCC frames around each estimated
language-switch point (uniform word-timing assumed)

BP = 1.0 is ideal (boundaries indistinguishable from within-language). BP > 1.5 indicates the model audibly struggles at the language switch point.

1.0 — ideal · 1.5 — rough · >2.0 — jarring
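Under the definitions above, the ratio can be sketched directly on an MFCC matrix. This is an illustrative sketch on synthetic frame data, not the pipeline's implementation; the (n_frames, n_coeffs) layout is an assumption:

```python
import numpy as np

def boundary_penalty(mfcc, switch_frames, window=2):
    # mfcc: array of shape (n_frames, n_coeffs).
    # switch_frames: frame indices of estimated language-switch points.
    disc = np.linalg.norm(np.diff(mfcc, axis=0), axis=1)  # disc(i) = ||MFCC(i+1) - MFCC(i)||_2
    boundary = set()
    for s in switch_frames:  # +/- `window` frames around each switch point
        boundary.update(range(max(0, s - window), min(len(disc), s + window + 1)))
    within = [i for i in range(len(disc)) if i not in boundary]
    return disc[sorted(boundary)].mean() / disc[within].mean()
```

A trajectory with a large spectral jump at the switch frame yields BP well above 1; the same trajectory without the jump yields BP near 1, since boundary and within-language discontinuities are then statistically identical.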
Contents
Example 1 — Noun Insertion · T01
Example 2 — Verb Grafting · T03 (E-Phoneme)
Example 3 — Tag Switching · T05 (Boundary Penalty)
Example 4 — Technical Slang · T09 (HNR)

Example 1 — Noun Insertion · T01

Roman: Aaj ka meeting bahut lamba tha
Mixed: आज का meeting बहुत लंबा था
Tags: HI HI EN HI HI HI
What to listen for: All four outputs say "मीटिंग" correctly (E-Phoneme = 1.0 for all). The score that differs is HNR — voice clarity. Sarvam roman (15.8 dB) is noticeably cleaner than Qwen3 mixed (11.5 dB).
Model · Script | Whisper transcription | CSPI (lang-weighted) | H-Phoneme | E-Phoneme | HNR (dB) | BP
Sarvam · Roman | आजका मीटिंग बहुत लमबा था | 0.688 | 0.65 | 1.00 | 15.8 | 1.045
Sarvam · Mixed | आजका मीटिंग बहुत लमबा था | 0.688 | 0.65 | 1.00 | 14.6 | 1.125
Qwen3 · Roman | आजका मीटिंग बहुत लमबा था | 0.688 | 0.65 | 1.00 | 13.5 | 0.960
Qwen3 · Mixed | अचका मीटिंग बहुत लिम्बा था | 0.583 | 0.60 | 1.00 | 11.5 | 1.205
Metric insight: E-Phoneme = 1.0 across all variants confirms "meeting" → मीटिंग is handled correctly by both models regardless of input script. HNR separates them: the 4.3 dB gap between Sarvam roman and Qwen3 mixed is audible as breathiness/noise in the voice cloning output.
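As a sanity check on the weighting, the T01 row can be reproduced from the CSPI formula. H-Index is not tabulated, but back-solving with w_hi = 5/6 and a perfect English side (E-Index = E-Phoneme = 1.0, since "meeting" is rendered correctly) gives H-Index = 0.6, i.e. 3 of the 5 Hindi tokens matched exactly — an inferred value, not a reported one:

```python
w_hi, w_en = 5 / 6, 1 / 6          # T01: 5 Hindi tokens, 1 English token
h_index = 0.6                      # inferred by back-solving, not tabulated
h_phoneme, e_index, e_phoneme = 0.65, 1.0, 1.0

cspi = w_hi * (h_index + h_phoneme) / 2 + w_en * (e_index + e_phoneme) / 2
assert abs(cspi - 0.688) < 1e-3    # tabulated 0.688; exact value 0.6875
```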

Example 2 — Verb Grafting · T03 · E-Phoneme illustration

Roman: Pehle send karo phir baat karte hain
Mixed: पहले send करो फिर बात करते हैं
Tags: HI EN HI HI HI HI HI
What to listen for: Does the model say सेंड (correct) or something like सेन / सैंट (wrong)? Sarvam drops the final /d/ sound. Qwen3 preserves it. E-Phoneme captures this exact distinction.
Model · Script | Whisper transcription | CSPI (lang-weighted) | H-Phoneme | E-Phoneme | HNR (dB) | BP
Sarvam · Roman | तहले सेन करो फिर बात करते हैं | 0.875 | 0.958 | 0.50 | 16.1 | 0.864
Sarvam · Mixed | पहले सैंट करो फिर बात करते हैं | 0.893 | 1.000 | 0.50 | 16.1 | 0.878
Qwen3 · Roman | पेले सेंड करो फिर बात करते हैं | 0.982 | 0.958 | 1.00 | 14.1 | 1.123
Qwen3 · Mixed | तहले सेंड करो फिर बात करते हैं | 0.982 | 0.958 | 1.00 | 15.4 | 0.914
Metric insight: E-Phoneme = 0.50 for both Sarvam variants catches a real, audible mispronunciation — "send" is rendered as "सेन" (roman) or "सैंट" (mixed), losing the final /d/. Qwen3's multilingual training handles English verb stems correctly (E-Phoneme = 1.00), a genuine advantage in verb-grafted code-switching.

Example 3 — Tag Switching · T05 · Boundary Penalty + H/E trade-off

Roman: Yaar the deadline is really urgent
Mixed: यार the deadline is really urgent
Tags: HI EN EN EN EN EN
What to listen for: Single Hindi word ("Yaar") followed immediately by a full English clause. The hardest possible boundary — the model must switch language after word 1. Does "Yaar" survive? Is the switch smooth or jarring?
Model · Script | Whisper transcription | CSPI (lang-weighted) | H-Phoneme | E-Phoneme | HNR (dB) | BP
Sarvam · Roman | यार ते देड़ाइन इस रियोली अर्जिनट | 0.633 | 1.00 | 0.52 | 15.7 | 1.895
Sarvam · Mixed | Yaar, the deadline is really urgent | 1.000 | 1.00 | 1.00 | 14.7 | 1.674
Qwen3 · Roman | (Yaar dropped) The deadline is really urgent | 0.833 | 0.00 | 1.00 | 14.5 | 1.075
Qwen3 · Mixed | (hallucinates) Yeah yeah yeah… the deadline is really urgent | 0.833 | 0.00 | 1.00 | 11.1 | 1.095
Metric insight — three stories in one sentence:
1. Boundary Penalty: Sarvam roman BP = 1.895 (highest in test set) — the abrupt switch from pronounced Hindi "Yaar" to mangled English creates a severe acoustic jolt. Audible as a jarring discontinuity.
2. H/E trade-off: Sarvam preserves "Yaar" (H-Phoneme 1.0) but mangles English; Qwen3 produces fluent English but silently drops "Yaar" (H-Phoneme 0.0). Neither model handles both sides in roman script.
3. Mixed script solves it for Sarvam: Devanagari "यार" gives Sarvam an explicit script-level boundary cue → H-Phoneme 1.0 and E-Phoneme 1.0. Best output of the four.

Example 4 — Technical Slang · T09 · HNR extremes

Roman: Mera laptop slow ho gaya hai kya karoon
Mixed: मेरा laptop slow हो गया है क्या करूं
Tags: HI EN EN HI HI HI HI HI
What to listen for: Two consecutive English loanwords (laptop, slow) mid-sentence. Qwen3 mixed hits HNR = 17.6 dB — the highest single-sentence score in the entire test set. Both Qwen3 variants have BP < 1.0, meaning the code-switch transitions are smoother than within-language frames.
Model · Script | Whisper transcription | CSPI (lang-weighted) | H-Phoneme | E-Phoneme | HNR (dB) | BP
Sarvam · Roman | मेरा लाप्टौप स्लो हो गया है क्या परू | 0.817 | 0.917 | 0.786 | 16.5 | 1.081
Sarvam · Mixed | मेरा लाप्टाब स्लो हो घया है क्या करूँ | 0.803 | 0.903 | 0.714 | 15.2 | 1.118
Qwen3 · Roman | मेरा लाप्टॉप स्लो हो गया है क्या करूण | 0.967 | 0.958 | 0.857 | 15.3 | 0.967
Qwen3 · Mixed | मेरा लेप्टोप स्लो हो घगया है क्या करों | 0.879 | 0.917 | 0.786 | 17.6 | 0.960
Metric insight:
1. HNR 17.6 dB (Qwen3 mixed) is the peak voice-quality score across all 80 synthesised files — this particular sentence, with the voice cloning, produces exceptionally clean harmonics. Sarvam roman (16.5 dB) is also good but measurably noisier.
2. BP < 1.0 for both Qwen3 variants means the code-switch boundaries are acoustically smoother than the within-language frames — the opposite of a rough transition, and exactly the outcome Boundary Penalty is designed to reward.

Loudness normalisation — for listening only. Raw outputs differ by ~13 dB: Sarvam averaged −15.8 dBFS, Qwen3 averaged −29.1 dBFS. Without normalisation, Sarvam sounds artificially louder and punchier, which biases subjective quality judgements. All audio here has been normalised to −23 LUFS (EBU R128) so both models play back at equal perceived loudness.

This normalisation is applied only to these audio players and has no effect on any objective metric. HNR, Boundary Penalty, and CSPI were all computed from the original unnormalised files: HNR is a harmonic/noise ratio (loudness-invariant); MFCCs use log compression (loudness-invariant); Whisper normalises its input internally before transcription.
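The gain-only nature of the normalisation can be illustrated with a simplified sketch. RMS dBFS here is a rough stand-in for LUFS — real EBU R128 measurement adds K-weighting and gating (a library such as pyloudnorm implements it properly) — but the point is the same: a pure gain change shifts level without touching spectral shape.

```python
import numpy as np

def rms_dbfs(x):
    # RMS level relative to full scale (full scale = 1.0).
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)))

def normalise(x, target_db=-23.0):
    # Pure gain change: shifts playback level only. Spectral shape --
    # and hence HNR, MFCC discontinuities, and Whisper's internally
    # normalised input -- is left untouched.
    return x * 10 ** ((target_db - rms_dbfs(x)) / 20)
```

After this step, a −15.8 dBFS file and a −29.1 dBFS file play back at the same level, removing the ~13 dB loudness bias from subjective comparison.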

Generated 2026-03-26 · HinglishTTS evaluation pipeline · Sarvam bulbul:v3 (suhani speaker) vs Qwen3-TTS-12Hz-1.7B-Base (ICL voice cloning)