
Speech-to-SOAP in Hinglish — measuring what works

Most voice scribes break on code-switched dialects. We measured WER per language across 12 weeks of clinical dictation. Here are the numbers — and the architecture changes that fixed the worst gaps.

Walk into any clinic in north India and the doctor's dictation will sound roughly like this: "Patient hai 45 year old male, chief complaint chest pain since teen din, exertional, radiating to left arm, ECG normal, troponin awaited, mein admit kar raha hoon for observation." (Roughly: the patient is a 45-year-old male, chief complaint chest pain for the last three days, exertional, radiating to the left arm, ECG normal, troponin awaited, I'm admitting him for observation.)

That's three languages in one sentence — English clinical terminology, Hindi connectives, English numerics, Hindi verbs at the end. Every off-the-shelf speech-to-text model we tested got large parts of this wrong. The English chunks were transcribed as English. The Hindi chunks were transcribed as English-sounding gibberish, or dropped entirely, or — most embarrassingly — transcribed in Devanagari script that the doctor would then have to read back and edit.

The problem is that "Indian English" isn't a discrete language. It's a continuum that shifts with the speaker, the listener, and the topic. Clinical speech in particular code-switches more aggressively than casual speech, because the medical vocabulary is mostly English while the connective tissue of the sentence (numbers, body parts, time expressions, modal verbs) is often Hindi.

What we measured

Over twelve weeks in late 2025, we collected 4,318 doctor dictations from 22 pilot clinics across Punjab, Delhi, Maharashtra, and Tamil Nadu. Each dictation was reviewed by the dictating doctor for accuracy, with corrections logged at the word level. We computed Word Error Rate (WER) per language pair and per scenario.

Here's the baseline — one of the three off-the-shelf models we tested (vendor A) on this corpus, before any fine-tuning or pipeline work:

English-only: 7.2% WER · vendor A
Hindi-only: 9.4% WER · vendor A
Code-switched: 31.6% WER · vendor A

That third number is the one that kills the product. A 31.6% WER means roughly one in three words is wrong, and the doctor spends more time correcting than they would have spent typing. We had to do better.

What goes wrong on code-switched speech

Three failure modes dominate.

Language ID flips mid-utterance. Most ASR systems assign a language tag at the utterance level: they commit to "Hindi" or "English" and then force every word into that choice. When a sentence has both, the model picks one language and badly degrades the other.

Numerics get mangled. A doctor saying "BP one-forty by ninety" and "temp ek sau do" (102) in the same dictation is normal. Some models render "ek sau do" as "100 do" because they fall back to digit-mapping mid-stream. Others render "one-forty" in Hindi orthography. Either way, the chart gets a wrong number — which is the worst possible kind of error in a medical context.

Drug names are accent-sensitive. "Telmisartan" pronounced with a north-Indian accent reads as "tel-mi-sar-tan" with stress patterns that throw English-tuned acoustic models. The model sometimes writes "Telmisartan", sometimes "Tel Mi Sar Tan", sometimes "Telmysertan", sometimes nothing.

The pipeline we ended up with

We didn't train our own model. The strong acoustic models already out there (open-source Whisper, Deepgram's Nova API) are better-engineered than anything we could build in six months. What we did instead was build a pipeline around off-the-shelf models that handles these failure modes specifically.

1. Two-pass transcription with confidence routing

Audio first hits a fast, cheap model — Deepgram Nova-2 — which returns word-level confidence scores. Any segment with a confidence below 0.6 is re-routed to a second pass through a slower, more capable model with multilingual context (we use a fine-tuned variant of Whisper-large with our own clinical vocabulary). The second pass is ~4× slower and ~6× more expensive per minute, so we only run it on the 8–12% of audio that needs it.

This gave us a 30% relative WER reduction on code-switched segments without ballooning the unit economics.
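If you're curious what the routing step looks like, here's a stripped-down sketch. It isn't our production code: fast_transcribe and careful_transcribe are placeholders for the two ASR backends, and the Word shape is just the minimum we need from the first pass.

```python
# Sketch of confidence-routed two-pass transcription. The 0.6 threshold is
# the one described above; fast_transcribe / careful_transcribe stand in for
# the first-pass and second-pass ASR backends.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float        # seconds
    end: float           # seconds
    confidence: float    # 0.0-1.0, from the first-pass model

CONFIDENCE_THRESHOLD = 0.6

def transcribe_with_routing(audio, fast_transcribe, careful_transcribe):
    """First pass on everything; second pass only on low-confidence spans."""
    words = fast_transcribe(audio)  # -> list[Word]

    # Group consecutive low-confidence words into spans, so the second pass
    # sees whole phrases rather than isolated words.
    spans, current = [], None
    for w in words:
        if w.confidence < CONFIDENCE_THRESHOLD:
            current = (w.start, w.end) if current is None else (current[0], w.end)
        elif current is not None:
            spans.append(current)
            current = None
    if current is not None:
        spans.append(current)

    # Re-transcribe only the flagged spans (typically ~10% of the audio).
    # Merging the corrected spans back into the transcript is omitted here.
    corrections = {
        span: careful_transcribe(audio, start=span[0], end=span[1])
        for span in spans
    }
    return words, corrections
```

The span-grouping matters: re-running isolated words would strip away the surrounding context that the bigger multilingual model needs to disambiguate code-switched phrases.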

2. Clinical lexicon biasing

Both passes get a bias list of ~12,000 clinical terms (drug names, lab terms, body parts, conditions, common abbreviations). Bias terms get a probability boost during decoding. This dropped drug-name errors from 18% to 4%.

The bias list is curated and updated weekly. We also append per-clinic custom vocabulary — every clinic's prescribing patterns, common patient terms (especially for vernacular complaint phrases), and the doctor's own preferred shorthand — to the bias list at inference time.
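Roughly, the assembly looks like this. It's a simplified sketch rather than the real service: the file path, the clinic_vocab_store object, and the bias_terms parameter on the transcribe call are illustrative names, and how the boost is actually applied (keyword boosting, shallow fusion, prompt biasing) depends on the model behind it.

```python
# Sketch of assembling the bias list at inference time: global clinical
# lexicon plus this clinic's custom vocabulary. Names below are illustrative.
import json
from pathlib import Path

GLOBAL_LEXICON_PATH = Path("lexicon/clinical_terms.json")  # ~12k curated terms

def load_bias_terms(clinic_id: str, clinic_vocab_store) -> list[str]:
    """Merge the global lexicon with per-clinic vocabulary, de-duplicated."""
    global_terms = json.loads(GLOBAL_LEXICON_PATH.read_text())
    clinic_terms = clinic_vocab_store.get(clinic_id, [])  # doctor corrections, local phrases

    seen, merged = set(), []
    for term in global_terms + clinic_terms:
        key = term.lower()
        if key not in seen:            # case-insensitive de-dup,
            seen.add(key)              # but keep original drug-name casing
            merged.append(term)
    return merged

def transcribe_biased(audio, asr_client, clinic_id, clinic_vocab_store):
    bias = load_bias_terms(clinic_id, clinic_vocab_store)
    # The boost mechanism is vendor-specific; shown here as a generic parameter.
    return asr_client.transcribe(audio, bias_terms=bias)
```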

3. Numeric normalisation post-processor

This is a small, dumb, deterministic step that runs after transcription. It looks for sequences like "ek sau do" / "one hundred two" / "100 do" and normalises them all to "102". It runs in regex-plus-lookup-table time, which is roughly free, and it eliminates the majority of numeric errors that survive the ASR step.
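Here's a cut-down sketch of that normaliser. The real lookup table is much larger and the edge cases are nastier, but the shape is the same: a regex finds runs of number tokens, a table resolves each token, and simple place-value arithmetic turns the run into digits.

```python
# Simplified numeric normaliser: "ek sau do" / "one hundred two" / "100 do" -> "102".
import re

NUMBER_WORDS = {
    # English
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
    "hundred": 100,
    # Hindi (romanised)
    "ek": 1, "do": 2,   # "do" is ambiguous with English "do"; the real table uses context rules
    "teen": 3, "char": 4, "paanch": 5, "chhe": 6, "saat": 7,
    "aath": 8, "nau": 9, "das": 10, "sau": 100,
}

TOKEN_RE = re.compile(r"\b(?:\d+|" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE)

def _resolve(tokens: list[str]) -> int:
    """Place-value accumulation: 'ek sau do' -> 1*100 + 2 = 102."""
    value = 0
    for tok in tokens:
        n = int(tok) if tok.isdigit() else NUMBER_WORDS[tok.lower()]
        if n == 100:
            value = max(value, 1) * 100   # "sau" / "hundred" multiplies what came before
        else:
            value += n
    return value

def normalise_numbers(text: str) -> str:
    """Replace each run of adjacent number tokens with its digit form."""
    out, last_end, run = [], 0, []
    for match in TOKEN_RE.finditer(text):
        if run and match.start() > run[-1].end() + 1:
            # A gap bigger than one character ends the current run.
            out.append(text[last_end:run[0].start()])
            out.append(str(_resolve([m.group() for m in run])))
            last_end = run[-1].end()
            run = []
        run.append(match)
    if run:
        out.append(text[last_end:run[0].start()])
        out.append(str(_resolve([m.group() for m in run])))
        last_end = run[-1].end()
    out.append(text[last_end:])
    return "".join(out)

# normalise_numbers("temp ek sau do, pulse ninety")  ->  "temp 102, pulse 90"
```

Colloquial digit-concatenation forms like "one-forty" for 140 need an extra rule on top of this; the sketch above would read that run as 41.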

4. Structure extraction with LLM-as-formatter

The transcribed text — code-switched, raw, sometimes punctuated, sometimes not — is then fed to a small LLM (we use GPT-4o-mini) with a prompt that says: "Format this clinical dictation as SOAP. Preserve verbatim quotes. Mark uncertain spans with [?]." The LLM doesn't translate; it organises. Hindi sentences stay in Hindi if that's how they were dictated; English sentences stay in English.

This step also extracts structured data — the chief complaint, vitals, drug names — into JSON fields that the rest of the app can render as proper UI elements. The doctor sees a clean SOAP screen, but under the hood the visit object has typed fields, not just a wall of text.
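A sketch of that call, assuming the OpenAI Python SDK; the exact prompt wording, field names, and model choice here are illustrative rather than our production values.

```python
# Sketch of the LLM-as-formatter step: organise, don't translate, don't rewrite.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Format this clinical dictation as SOAP. Preserve verbatim quotes. "
    "Mark uncertain spans with [?]. Do not translate: keep Hindi sentences in "
    "Hindi and English sentences in English. Return JSON with keys: "
    "subjective, objective, assessment, plan, chief_complaint, vitals, medications."
)

def format_dictation(transcript: str) -> dict:
    """Turn a raw code-switched transcript into typed SOAP fields."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
        response_format={"type": "json_object"},
        temperature=0,  # formatting, not generation: keep it deterministic
    )
    return json.loads(response.choices[0].message.content)
```

Running it at temperature 0 and asking for JSON keeps the step deliberately boring: the LLM organises and extracts, it doesn't rewrite.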

The new numbers

Here's the same corpus, run through our pipeline:

English-only: 4.1% WER
Hindi-only: 5.8% WER
Code-switched: 9.7% WER

The 9.7% on code-switched speech is still higher than English-only, but it's now in the range where doctors find the assist useful rather than counterproductive. Drug-name errors specifically dropped to 1.3%. Numeric errors dropped to 0.8%.

What we still get wrong

Honest list:

  • Tamil and Telugu code-switching is worse than Hindi (~14% WER on code-switched segments) because we have less training data. Working on it.
  • Background noise in busy clinics adds 2–3 percentage points. We're not solving this with software — we recommend a $30 lapel mic for any clinic above ~40 patients/day, and the WER goes back to baseline.
  • Multi-speaker dictation (doctor + patient + accompanying family member talking over each other) is a known unsolved problem. Most doctors solve it by stepping out for the dictation; we're working on speaker diarisation but it's not shipped yet.
  • Highly local terms. A patient saying "kamar mein dard" (back pain) is fine. A patient saying "lakwa" (stroke / paralysis, regional usage) sometimes gets transcribed as a sound-alike word. Per-clinic vocabulary fixes this within a few weeks of the doctor correcting it once.

What you can take from this

Three lessons that probably generalise beyond clinical dictation:

The pipeline beats the model. We didn't train a better speech model. We built a smarter pipeline around an existing one. Confidence-routed two-pass transcription, clinical lexicon biasing, deterministic post-processing, and an LLM-as-structurer combined to cut the code-switched error rate from 31.6% to 9.7% without any model training.

Measure per-scenario, not in aggregate. Our overall WER didn't tell us anything useful. The English-only number was already fine; it was the code-switched number that mattered to actual users. Bucketing the eval by scenario was the single most useful engineering decision we made.
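If you want to replicate the bucketed eval, the whole thing is a few lines. This sketch uses a plain edit-distance WER with no dependencies; the scenario labels are whatever you tag your eval samples with.

```python
# Per-scenario WER: compute edit-distance errors per sample, aggregate by bucket.
from collections import defaultdict

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

def wer_by_scenario(samples):
    """samples: iterable of (scenario, reference, hypothesis) tuples."""
    errors, words = defaultdict(float), defaultdict(int)
    for scenario, ref, hyp in samples:
        n = len(ref.split())
        errors[scenario] += wer(ref, hyp) * n   # back to an edit count
        words[scenario] += n
    return {s: errors[s] / words[s] for s in words}
```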

Drug names are sacred. A 1% drug-name error rate sounds great until you realise that 1% means one in every hundred prescriptions has a wrong drug. We weight drug-name accuracy 5× higher than any other error in our internal metric, and the bias list is the single most-edited file in the repo.
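For what it's worth, the weighting itself can be very simple. This is a deliberately crude bag-of-words version, not our internal metric: it just counts reference words missing from the hypothesis and weights drug names five times as heavily.

```python
# Crude sketch of a drug-weighted error score. drug_names would come from the
# bias list (lowercased here); a real metric would align words rather than
# compare bags, but the weighting idea is the same.
DRUG_WEIGHT = 5.0

def weighted_error(reference: str, hypothesis: str, drug_names: set[str]) -> float:
    """Fraction of weighted reference words missing from the hypothesis."""
    ref = reference.lower().split()
    hyp_counts: dict[str, int] = {}
    for w in hypothesis.lower().split():
        hyp_counts[w] = hyp_counts.get(w, 0) + 1

    score, total = 0.0, 0.0
    for word in ref:
        weight = DRUG_WEIGHT if word in drug_names else 1.0
        total += weight
        if hyp_counts.get(word, 0) > 0:
            hyp_counts[word] -= 1   # recovered correctly
        else:
            score += weight         # missing or mangled: weighted penalty
    return score / total if total else 0.0
```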

Try it

The pipeline above is what runs on every dictation in MediSero+ today. If you want to try it on your own voice and your own clinical vocabulary, sign up at app.medisero.com/signup — the first 10 dictations are free. If you've built something similar and want to compare notes, we'd love to read your post too.
