ScribeBench

How we measure speech-to-text accuracy

One metric, one audio set, identical conditions — the only way to compare transcription engines without misleading anyone.

Speech-to-text "accuracy" gets thrown around loosely. To compare engines fairly you need one defined metric, one audio set, and identical conditions. Here's the method any honest STT comparison should use — and the one we hold our published numbers to.

Word Error Rate (WER), the core metric

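WER counts the word-level edits (substituted, deleted, and inserted words) that separate an engine's transcript from a correct reference, relative to the length of the reference: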

WER = (Substitutions + Deletions + Insertions) ÷ Total reference words

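Lower is better. A WER of 0.10 means one error for every 10 reference words on average. Because inserted words count as errors, WER can exceed 1.0 when a transcript contains many extra words. WER is reported after light normalization (casing and punctuation removed) so engines aren't penalized for formatting choices.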
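As a minimal sketch of the formula above, the function below computes WER with a word-level edit distance after lowercasing and stripping punctuation. The names normalize and wer are hypothetical, and the normalization shown (ASCII punctuation only, apostrophes included) is an assumption for illustration, not a description of any particular engine's or benchmark's pipeline.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace.

    Assumption: this is one simple reading of "light normalization".
    """
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = minimum edits to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(wer("The quick brown fox jumps over the lazy dog today.",
          "the quick brown fox jumped over the lazy dog today"))  # 0.1
```

The example sentence has 10 reference words and one substitution ("jumps" vs. "jumped"), so the result is 0.1. Casing and the trailing period do not count as errors because both strings are normalized first.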

What else has to be equal

Factor       Why it matters
Audio set    Same files for every engine, with clean speech, accented speech, and noisy/overlapping speech tested separately.
Domain       General vs. medical/legal vocabulary changes results dramatically.
Latency      Real-time vs. batch; we report median seconds to first and final token.
Diarization  Whether speaker labels are correct, scored separately from WER.
Price        Cost per audio-hour at list pricing, noted with the test date.
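To illustrate what reporting each audio condition separately can look like, here is a rough sketch that groups per-file results by condition, pools WER within each group (total errors divided by total reference words), and takes the median latency to the final token. The record keys (condition, errors, ref_words, final_latency_s) and the choice to pool rather than average per-file WERs are assumptions made for this example. The numbers are toy values, not measurements.

```python
from collections import defaultdict
from statistics import median


def summarize(results):
    """Pool WER and take the median final-token latency per audio condition.

    `results` is an iterable of dicts with the hypothetical keys
    `condition`, `errors`, `ref_words`, and `final_latency_s`.
    """
    groups = defaultdict(list)
    for row in results:
        groups[row["condition"]].append(row)

    summary = {}
    for condition, rows in groups.items():
        total_errors = sum(r["errors"] for r in rows)
        total_words = sum(r["ref_words"] for r in rows)
        summary[condition] = {
            # Pooled WER: long files weigh more than short ones.
            "wer": total_errors / total_words,
            "median_final_latency_s": median(r["final_latency_s"] for r in rows),
        }
    return summary


# Toy records for one engine; every value here is made up for illustration.
toy_results = [
    {"condition": "clean", "errors": 2, "ref_words": 100, "final_latency_s": 1.0},
    {"condition": "clean", "errors": 8, "ref_words": 100, "final_latency_s": 3.0},
    {"condition": "noisy", "errors": 30, "ref_words": 100, "final_latency_s": 2.0},
]
print(summarize(toy_results))
```

Keeping conditions in separate groups avoids a single blended figure in which strong results on clean speech hide weak results on noisy or overlapping speech.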
Our honesty rule: any number on this site that hasn't been measured under these conditions is labelled illustrative, not a ranking. We publish a verdict only after running the real benchmark on identical audio.

See the API comparisons