How we measure speech-to-text accuracy

One metric, one audio set, identical conditions — the only way to compare transcription engines without misleading anyone.

Speech-to-text "accuracy" is often quoted without saying how it was measured. A fair comparison needs one defined metric, one audio set, and identical conditions for every engine. This is the method any honest STT comparison should use, and the one we hold our published numbers to.

Word Error Rate (WER), the core metric

WER measures how far a transcript is from a correct reference, counted as the fewest word-level edits needed to turn one into the other:

WER = (Substitutions + Deletions + Insertions) ÷ Total reference words

Substitution — wrong word ("their" → "there").
Deletion — a spoken word missing from the transcript.
Insertion — a word the engine added that wasn't said.
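The formula and the three error types above can be sketched in a few lines of code. This is a minimal illustration, not our production scorer: it assumes whitespace tokenization and a standard word-level edit-distance alignment, and the names `WerResult` and `word_error_rate` are made up for this example.

```python
from typing import NamedTuple


class WerResult(NamedTuple):
    substitutions: int
    deletions: int
    insertions: int
    reference_words: int

    @property
    def wer(self) -> float:
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.reference_words


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    """Align two transcripts word by word and count S, D, and I."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = (total errors, subs, dels, ins) for ref[:i] vs hyp[:j].
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)          # correct word
            else:
                diagonal = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = cost[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            # Tuples compare by total errors first, so min() keeps
            # the cheapest alignment.
            cost[i][j] = min(diagonal, deletion, insertion)

    _, subs, dels, ins = cost[-1][-1]
    return WerResult(subs, dels, ins, len(ref))


result = word_error_rate("their house is over there", "there house is over")
print(result)      # one substitution, one deletion, five reference words
print(result.wer)  # 0.4
```

When several alignments tie on total errors, the split between substitutions, deletions, and insertions can differ from one scoring tool to another. The WER itself does not.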

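Lower is better. A WER of 0.10 means, on average, 1 error per 10 reference words. Because insertions count as errors, WER can exceed 1.0 when an engine adds many words that weren't said. WER is computed after light normalization (casing and punctuation removed) of both the reference and the transcript, so engines aren't penalized for formatting choices.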
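The normalization step can be as small as the sketch below. It assumes ASCII punctuation only and simply deletes it, so "don't" becomes "dont". How apostrophes, hyphens, and numerals are treated is a choice that has to be fixed once and applied to every engine. The `normalize` helper is a hypothetical name for this example.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCTUATION).split())


reference = normalize("Hello, World! It's fine.")
transcript = normalize("hello world its fine")
print(reference == transcript)  # True
```

The same function must be applied to the reference and to every engine's output before scoring. Otherwise the comparison measures formatting, not recognition.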

What else has to be equal

Factor	Why it matters
Audio set	Same files for every engine — clean speech, accented speech, and noisy/overlapping speech tested separately.
Domain	General vs. medical/legal vocabulary changes results dramatically.
Latency	Real-time vs. batch; we report median seconds to first and final token.
Diarization	Whether speaker labels are correct, scored separately from WER.
Price	Cost per audio-hour at list pricing, noted with the test date.
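Reporting WER separately per audio condition might look like the sketch below. It pools error counts and reference-word counts within each subset, rather than averaging per-file WERs, so short files don't get outsized weight. That pooling choice is an assumption of this example. The function name, the condition labels, and the counts are invented for illustration and are not benchmark results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def wer_by_condition(
    file_scores: Iterable[Tuple[str, int, int]],
) -> Dict[str, float]:
    """Pool (condition, errors, reference_words) rows into one WER per condition."""
    totals = defaultdict(lambda: [0, 0])  # condition -> [errors, reference words]
    for condition, errors, reference_words in file_scores:
        totals[condition][0] += errors
        totals[condition][1] += reference_words
    return {
        condition: errors / words
        for condition, (errors, words) in totals.items()
        if words > 0
    }


# Made-up per-file counts for one engine, purely illustrative.
scores = [
    ("clean", 5, 100),
    ("clean", 15, 300),
    ("noisy", 30, 100),
]
print(wer_by_condition(scores))  # {'clean': 0.05, 'noisy': 0.3}
```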

Our honesty rule: any number on this site that hasn't been measured under these conditions is labelled illustrative, not a ranking. We publish a verdict only after running the real benchmark on identical audio.
