HinglishTTS — Audio Examples & Metric Illustrations

Sarvam TTS (bulbul:v3) vs Qwen3-TTS (Base + ICL voice cloning) · Test set T01–T20 · 2026-03-26

Objective Metrics — Definitions & Formulas

CSPI — Code-Switching Phonetic Index ↑ higher is better  ·  range 0–1

Composite index measuring how faithfully a TTS model renders both Hindi and English phonetics in a code-switched sentence. Uses language-proportional weighting: each sentence's Hindi and English components are weighted by that sentence's own token distribution, so errors in the dominant language count more.

w_hi = n_hindi / (n_hindi + n_english)
w_en = n_english / (n_hindi + n_english)

CSPI = w_hi × (H-Index + H-Phoneme) / 2
+ w_en × (E-Index + E-Phoneme) / 2

H-Index = correct Hindi tokens / total Hindi tokens
E-Index = correct English tokens / total English tokens
H-Phoneme = mean char-level similarity of Hindi token pairs
E-Phoneme = mean char-level similarity of English token pairs

All comparisons normalise to Devanagari so Whisper's script inconsistencies don't penalise correct pronunciation. A Hindi-heavy sentence (e.g. T01: w_hi=0.83) penalises Hindi errors much more than English ones; an English-heavy sentence (T05: w_en=0.83) does the reverse.

0.0 — unintelligible · 0.5 — moderate · 1.0 — perfect
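The weighting and averaging above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation pipeline's code: it assumes a 1:1 token alignment between reference and transcription, uses `difflib.SequenceMatcher` as a stand-in for whatever char-level similarity the pipeline uses, and skips the Devanagari normalisation step.

```python
from difflib import SequenceMatcher

def char_sim(a: str, b: str) -> float:
    # Stand-in for the char-level similarity behind H-Phoneme / E-Phoneme.
    return SequenceMatcher(None, a, b).ratio()

def cspi(ref_tokens, hyp_tokens, tags):
    # tags: "HI" or "EN" per reference token; hypothesis assumed aligned 1:1.
    def side(lang):
        pairs = [(r, h) for r, h, t in zip(ref_tokens, hyp_tokens, tags) if t == lang]
        if not pairs:
            return 0.0, 0.0
        index = sum(r == h for r, h in pairs) / len(pairs)            # H-/E-Index
        phoneme = sum(char_sim(r, h) for r, h in pairs) / len(pairs)  # H-/E-Phoneme
        return (index + phoneme) / 2, len(pairs) / len(ref_tokens)    # score, weight

    hi_score, w_hi = side("HI")
    en_score, w_en = side("EN")
    return w_hi * hi_score + w_en * en_score
```

A perfect transcription scores 1.0, and the weights w_hi and w_en fall out of the tag counts exactly as in the formula (e.g. 5 Hindi tokens out of 6 gives w_hi = 0.83).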

HNR — Harmonics-to-Noise Ratio ↑ higher is better  ·  unit: dB

Absolute measure of voice quality — the ratio of periodic (harmonic) energy to aperiodic (noise) energy in the synthesised speech. Computed via Praat's autocorrelation method over voiced frames only.

HNR (dB) = 10 × log₁₀( E_harmonic / E_noise )

where E_harmonic = energy of the periodic component
and E_noise = total energy − E_harmonic

Unlike MFCC-based stability scores (which self-normalise per model), HNR is an absolute dB scale comparable across models. Unvoiced and silent frames are excluded — only voiced speech frames contribute.

<10 dB — poor · 15–20 dB — good · >20 dB — excellent
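The autocorrelation method can be sketched as follows — a simplified single-frame version of what Praat computes (Praat adds peak interpolation, voicing detection, and per-frame aggregation). The normalised autocorrelation peak r is read as the harmonic fraction of energy, giving HNR = 10·log₁₀(r / (1 − r)):

```python
import numpy as np

def hnr_db(frame):
    # Simplified Praat-style HNR for one voiced frame: the normalised
    # autocorrelation peak r is the harmonic fraction of energy, and
    # 1 - r is the noise fraction.
    x = frame - frame.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac = ac / ac[0]                            # lag 0 == total energy == 1
    first_neg = int(np.argmax(ac < 0)) or 1    # skip the lag-0 main lobe
    r = ac[first_neg:].max()
    return 10 * np.log10(r / (1 - r))

# Synthetic check: a clean 150 Hz tone vs the same tone plus noise.
sr = 16000
t = np.arange(0, 0.04, 1 / sr)
tone = np.sin(2 * np.pi * 150 * t)
noisy = tone + 0.3 * np.random.default_rng(0).standard_normal(t.size)
```

Adding noise lowers the autocorrelation peak, so `hnr_db(noisy)` comes out several dB below `hnr_db(tone)` — the same direction as the Sarvam/Qwen3 gaps in the tables below.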

Boundary Penalty — Code-Switch Transition Smoothness ↓ lower is better  ·  ratio ≥ 0

Measures how much rougher code-switch boundaries are compared to within-language transitions, using frame-to-frame MFCC Euclidean distances. A value near 1.0 means the switch is as smooth as any other frame transition.

BP = mean_disc(boundary frames) / mean_disc(within frames)

disc(frame i) = ‖ MFCC(i+1) − MFCC(i) ‖₂

boundary frames: ±2 MFCC frames around each estimated
language-switch point (uniform word-timing assumed)

BP = 1.0 is ideal (boundaries indistinguishable from within-language). BP > 1.5 indicates the model audibly struggles at the language switch point.

1.0 — ideal · 1.5 — rough · >2.0 — jarring
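Under the definitions above, the ratio can be sketched directly on an MFCC matrix. This is an illustrative sketch on synthetic frame data, not the pipeline's implementation; the (n_frames, n_coeffs) layout is an assumption:

```python
import numpy as np

def boundary_penalty(mfcc, switch_frames, window=2):
    # mfcc: array of shape (n_frames, n_coeffs).
    # switch_frames: frame indices of estimated language-switch points.
    disc = np.linalg.norm(np.diff(mfcc, axis=0), axis=1)  # disc(i) = ||MFCC(i+1) - MFCC(i)||_2
    boundary = set()
    for s in switch_frames:  # +/- `window` frames around each switch point
        boundary.update(range(max(0, s - window), min(len(disc), s + window + 1)))
    within = [i for i in range(len(disc)) if i not in boundary]
    return disc[sorted(boundary)].mean() / disc[within].mean()
```

A trajectory with a large spectral jump at the switch frame yields BP well above 1; the same trajectory without the jump yields BP near 1, since boundary and within-language discontinuities are then statistically identical.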
Contents
Example 1 — Noun Insertion · T01
Example 2 — Verb Grafting · T03 (E-Phoneme)
Example 3 — Tag Switching · T05 (Boundary Penalty)
Example 4 — Technical Slang · T09 (HNR)

Example 1 — Noun Insertion · T01

Roman: Aaj ka meeting bahut lamba tha
Mixed: आज का meeting बहुत लंबा था
Tags: HI HI EN HI HI HI
What to listen for: All four outputs say "मीटिंग" correctly (E-Phoneme = 1.0 for all). The score that differs is HNR — voice clarity. Sarvam roman (15.8 dB) is noticeably cleaner than Qwen3 mixed (11.5 dB).
Model · Script | Whisper transcription | CSPI (lang-weighted) | H-Phoneme | E-Phoneme | HNR (dB) | BP
Sarvam · Roman | आजका मीटिंग बहुत लमबा था | 0.688 | 0.65 | 1.00 | 15.8 | 1.045
Sarvam · Mixed | आजका मीटिंग बहुत लमबा था | 0.688 | 0.65 | 1.00 | 14.6 | 1.125
Qwen3 · Roman | आजका मीटिंग बहुत लमबा था | 0.688 | 0.65 | 1.00 | 13.5 | 0.960
Qwen3 · Mixed | अचका मीटिंग बहुत लिम्बा था | 0.583 | 0.60 | 1.00 | 11.5 | 1.205
Metric insight: E-Phoneme = 1.0 across all variants confirms "meeting" → मीटिंग is handled correctly by both models regardless of input script. HNR separates them: the 4.3 dB gap between Sarvam roman and Qwen3 mixed is audible as breathiness/noise in the voice cloning output.
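As a sanity check on the weighting, the T01 row can be reproduced from the CSPI formula. H-Index is not tabulated, but back-solving with w_hi = 5/6 and a perfect English side (E-Index = E-Phoneme = 1.0, since "meeting" is rendered correctly) gives H-Index = 0.6, i.e. 3 of the 5 Hindi tokens matched exactly — an inferred value, not a reported one:

```python
w_hi, w_en = 5 / 6, 1 / 6          # T01: 5 Hindi tokens, 1 English token
h_index = 0.6                      # inferred by back-solving, not tabulated
h_phoneme, e_index, e_phoneme = 0.65, 1.0, 1.0

cspi = w_hi * (h_index + h_phoneme) / 2 + w_en * (e_index + e_phoneme) / 2
assert abs(cspi - 0.688) < 1e-3    # tabulated 0.688; exact value 0.6875
```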

Example 2 — Verb Grafting · T03 · E-Phoneme illustration

Roman: Pehle send karo phir baat karte hain
Mixed: पहले send करो फिर बात करते हैं
Tags: HI EN HI HI HI HI HI
What to listen for: Does the model say सेंड (correct) or something like सेन / सैंट (wrong)? Sarvam drops the final /d/ sound. Qwen3 preserves it. E-Phoneme captures this exact distinction.
Model · Script | Whisper transcription | CSPI (lang-weighted) | H-Phoneme | E-Phoneme | HNR (dB) | BP
Sarvam · Roman | तहले सेन करो फिर बात करते हैं | 0.875 | 0.958 | 0.50 | 16.1 | 0.864
Sarvam · Mixed | पहले सैंट करो फिर बात करते हैं | 0.893 | 1.000 | 0.50 | 16.1 | 0.878
Qwen3 · Roman | पेले सेंड करो फिर बात करते हैं | 0.982 | 0.958 | 1.00 | 14.1 | 1.123
Qwen3 · Mixed | तहले सेंड करो फिर बात करते हैं | 0.982 | 0.958 | 1.00 | 15.4 | 0.914
Metric insight: E-Phoneme = 0.50 for both Sarvam variants catches a real, audible mispronunciation — "send" is rendered as "सेन" (roman) or "सैंट" (mixed), losing the final /d/. Qwen3's multilingual training handles English verb stems correctly (E-Phoneme = 1.00), a genuine advantage in verb-grafted code-switching.

Example 3 — Tag Switching · T05 · Boundary Penalty + H/E trade-off

Roman: Yaar the deadline is really urgent
Mixed: यार the deadline is really urgent
Tags: HI EN EN EN EN EN
What to listen for: Single Hindi word ("Yaar") followed immediately by a full English clause. The hardest possible boundary — the model must switch language after word 1. Does "Yaar" survive? Is the switch smooth or jarring?
Model · Script | Whisper transcription | CSPI (lang-weighted) | H-Phoneme | E-Phoneme | HNR (dB) | BP
Sarvam · Roman | यार ते देड़ाइन इस रियोली अर्जिनट | 0.633 | 1.00 | 0.52 | 15.7 | 1.895
Sarvam · Mixed | Yaar, the deadline is really urgent | 1.000 | 1.00 | 1.00 | 14.7 | 1.674
Qwen3 · Roman | (Yaar dropped) The deadline is really urgent | 0.833 | 0.00 | 1.00 | 14.5 | 1.075
Qwen3 · Mixed | (hallucinates) Yeah yeah yeah… the deadline is really urgent | 0.833 | 0.00 | 1.00 | 11.1 | 1.095
Metric insight — three stories in one sentence:
1. Boundary Penalty: Sarvam roman BP = 1.895 (highest in test set) — the abrupt switch from pronounced Hindi "Yaar" to mangled English creates a severe acoustic jolt. Audible as a jarring discontinuity.
2. H/E trade-off: Sarvam preserves "Yaar" (H-Phoneme 1.0) but mangles English; Qwen3 produces fluent English but silently drops "Yaar" (H-Phoneme 0.0). Neither model handles both sides in roman script.
3. Mixed script solves it for Sarvam: Devanagari "यार" gives Sarvam an explicit script-level boundary cue → H-Phoneme 1.0 and E-Phoneme 1.0. Best output of the four.

Example 4 — Technical Slang · T09 · HNR extremes

Roman: Mera laptop slow ho gaya hai kya karoon
Mixed: मेरा laptop slow हो गया है क्या करूं
Tags: HI EN EN HI HI HI HI HI
What to listen for: Two consecutive English loanwords (laptop, slow) mid-sentence. Qwen3 mixed hits HNR = 17.6 dB — the highest single-sentence score in the entire test set. Both Qwen3 variants have BP < 1.0, meaning the code-switch transitions are smoother than within-language frames.
Model · Script | Whisper transcription | CSPI (lang-weighted) | H-Phoneme | E-Phoneme | HNR (dB) | BP
Sarvam · Roman | मेरा लाप्टौप स्लो हो गया है क्या परू | 0.817 | 0.917 | 0.786 | 16.5 | 1.081
Sarvam · Mixed | मेरा लाप्टाब स्लो हो घया है क्या करूँ | 0.803 | 0.903 | 0.714 | 15.2 | 1.118
Qwen3 · Roman | मेरा लाप्टॉप स्लो हो गया है क्या करूण | 0.967 | 0.958 | 0.857 | 15.3 | 0.967
Qwen3 · Mixed | मेरा लेप्टोप स्लो हो घगया है क्या करों | 0.879 | 0.917 | 0.786 | 17.6 | 0.960
Metric insight:
1. HNR 17.6 dB (Qwen3 mixed) is the peak voice-quality score across all 80 synthesised files — this particular sentence, with the voice cloning, produces exceptionally clean harmonics. Sarvam roman (16.5 dB) is also good but measurably noisier.
2. BP < 1.0 for both Qwen3 variants means the code-switch boundaries are acoustically smoother than the within-language frames — the opposite of a rough transition, and exactly the outcome Boundary Penalty is designed to reward.

Loudness normalisation — for listening only. Raw outputs differ by ~13 dB: Sarvam averaged −15.8 dBFS, Qwen3 averaged −29.1 dBFS. Without normalisation, Sarvam sounds artificially louder and punchier, which biases subjective quality judgements. All audio here has been normalised to −23 LUFS (EBU R128) so both models play back at equal perceived loudness.

This normalisation is applied only to these audio players and has no effect on any objective metric. HNR, Boundary Penalty, and CSPI were all computed from the original unnormalised files: HNR is a harmonic/noise ratio (loudness-invariant); MFCCs use log compression (loudness-invariant); Whisper normalises its input internally before transcription.
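The gain-only nature of the normalisation can be illustrated with a simplified sketch. RMS dBFS here is a rough stand-in for LUFS — real EBU R128 measurement adds K-weighting and gating (a library such as pyloudnorm implements it properly) — but the point is the same: a pure gain change shifts level without touching spectral shape.

```python
import numpy as np

def rms_dbfs(x):
    # RMS level relative to full scale (full scale = 1.0).
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)))

def normalise(x, target_db=-23.0):
    # Pure gain change: shifts playback level only. Spectral shape --
    # and hence HNR, MFCC discontinuities, and Whisper's internally
    # normalised input -- is left untouched.
    return x * 10 ** ((target_db - rms_dbfs(x)) / 20)
```

After this step, a −15.8 dBFS file and a −29.1 dBFS file play back at the same level, removing the ~13 dB loudness bias from subjective comparison.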

Generated 2026-03-26 · HinglishTTS evaluation pipeline · Sarvam bulbul:v3 (suhani speaker) vs Qwen3-TTS-12Hz-1.7B-Base (ICL voice cloning)