One metric, one audio set, identical conditions — the only way to compare transcription engines without misleading anyone.
Speech-to-text "accuracy" gets thrown around loosely. To compare engines fairly you need one defined metric, one audio set, and identical conditions. Here's the method any honest STT comparison should use — and the one we hold our published numbers to.
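Word error rate (WER) measures how far a transcript is from a correct reference: WER = (S + D + I) / N, where S, D, and I count the substituted, deleted, and inserted words in the transcript relative to the reference, and N is the number of words in the reference.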
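Lower is better. A WER of 0.10 means roughly 1 error per 10 reference words; because inserted words count as errors, WER can exceed 1.0 on very poor transcripts. WER is reported after light normalization (casing and punctuation removed) so engines aren't penalized for formatting choices.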
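To make the metric concrete, here is a minimal Python sketch of WER as a word-level edit distance (substitutions, deletions, insertions) divided by the reference length. The `normalize` and `wer` names and the exact normalization rules (lowercasing, dropping punctuation except apostrophes) are assumptions for illustration, not the scoring code behind any published numbers.

```python
import re


def normalize(text: str) -> list[str]:
    """Light normalization (assumed rules): lowercase, drop punctuation, split on whitespace."""
    text = text.lower()
    # Replace anything that is not a word character, whitespace, or apostrophe.
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference word count."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "The quick brown fox jumps over the lazy dog today."
    hypothesis = "the quick brown fox jumped over the lazy dog today"
    print(wer(reference, hypothesis))
```

In this example the reference has 10 words and the transcript gets one of them wrong, so the score is 1 error per 10 words. Casing and the trailing period do not count against the engine because both strings are normalized first.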
| Factor | Why it matters |
|---|---|
| Audio set | Same files for every engine — clean speech, accented speech, and noisy/overlapping speech tested separately. |
| Domain | General vs. medical/legal vocabulary changes results dramatically. |
| Latency | Real-time vs. batch; we report median seconds to first and final token. |
| Diarization | Whether speaker labels are correct, scored separately from WER. |
| Price | Cost per audio-hour at list pricing, noted with the test date. |
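The factors in the table suggest reporting results per engine and per audio condition rather than as one blended number. The sketch below shows one way to do that; the `FileResult` fields, the `summarize` function, and the choice to pool errors across files (instead of averaging per-file WER) are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class FileResult:
    """One engine's result on one audio file (hypothetical schema)."""
    engine: str
    condition: str        # e.g. "clean", "accented", "noisy"
    errors: int           # substitutions + deletions + insertions
    ref_words: int        # words in the reference transcript
    first_token_s: float  # seconds until the first token arrived
    final_token_s: float  # seconds until the final token arrived


def summarize(results: list[FileResult]) -> dict:
    """Group by (engine, condition); report pooled WER and median latencies."""
    groups = defaultdict(list)
    for r in results:
        groups[(r.engine, r.condition)].append(r)

    summary = {}
    for key, rows in groups.items():
        summary[key] = {
            # Pooled WER: total errors over total reference words in the group.
            "wer": sum(r.errors for r in rows) / sum(r.ref_words for r in rows),
            "median_first_token_s": median(r.first_token_s for r in rows),
            "median_final_token_s": median(r.final_token_s for r in rows),
        }
    return summary


if __name__ == "__main__":
    # Made-up numbers, only to show the shape of the output.
    demo = [
        FileResult("engine_a", "clean", 5, 100, 0.5, 2.0),
        FileResult("engine_a", "clean", 15, 100, 1.5, 4.0),
        FileResult("engine_a", "noisy", 30, 100, 1.0, 3.0),
    ]
    for key, stats in summarize(demo).items():
        print(key, stats)
```

Keeping conditions in separate groups means a strong score on clean speech cannot hide a weak score on noisy speech. Diarization accuracy and price would be tracked as separate columns, since the table scores them apart from WER.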