When I implemented my first speech-synthesis app using the Web Speech API, I was shocked how hard it was to set up and execute with cross-browser support in mind:
Some browsers don't support speech synthesis at all, for instance IE (which at least I don't care about), Opera (which I do care about) and a few more mobile browsers (where I haven't decided yet whether I care or not).
On top of that, each browser implements the API differently or with specific quirks the other browsers don't have.
Just try it yourself: run the MDN speech synthesis example on different browsers and different platforms:
Linux, Windows, MacOS, BSD, Android, iOS
Firefox, Chrome, Chromium, Safari, Opera, Edge, IE, Samsung Browser, Android Webview, Safari on iOS, Opera Mini
You will realize that this example only works on a subset of these platform-browser combinations. Worse: once you start researching, you'll be shocked how quirky and underdeveloped this whole API still is in 2021/2022.
To be fair: it is still labeled as experimental technology. However, it's been almost 10 years since it was first drafted, and it's still not a living standard.
This makes it much harder to leverage in our applications, and I hope this guide will help you get the most out of it on as many browsers as possible.
Minimal example
Let's approach this topic step-by-step and start with a minimal example that all browsers (that generally support speech synthesis) should run:
```js
if ('speechSynthesis' in window) {
  window.speechSynthesis.speak(new SpeechSynthesisUtterance('Hello, world!'))
}
```
You can simply copy that code and execute it in your browser console.
If you have basic support you will hear some "default" voice speaking the text 'Hello, world!' and it may sound natural or not, depending on the default "voice" that is used.
Loading voices
Browsers may detect your current language and select a default voice, if one is installed. However, this may not be the language you'd like the text to be spoken in.
In that case you need to load the list of voices, which are instances of SpeechSynthesisVoice. This is the first major obstacle, where browsers behave quite differently:
Load voices sync-style
```js
const voices = window.speechSynthesis.getVoices()
voices // Array of voices or empty if none are installed
```
Firefox and Safari Desktop just load the voices immediately, sync-style. On Chrome Desktop and Chrome Android, however, this returns an empty array, and it may return an empty array on Firefox Android, too (see next section).
Load voices async-style
```js
window.speechSynthesis.onvoiceschanged = function () {
  const voices = window.speechSynthesis.getVoices()
  voices // Array of voices or empty if none are installed
}
```
This method loads the voices asynchronously, so your code needs a callback or a Promise wrapper. Firefox Desktop does not support this event at all, although onvoiceschanged is defined as a property of window.speechSynthesis, while Safari does not have it at all.
In contrast, Firefox Android loads the voices the first time using this method, and on a refresh has them available via the sync-style method.
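The sync-style and async-style strategies can be combined into a small Promise wrapper. This is just a sketch of mine, not part of the API; the synthesis object is passed in as a parameter so the helper can also be tested outside a browser:

```js
// Sketch (helper name is mine): resolve voices via the sync call
// when possible, otherwise wait for the onvoiceschanged event.
function loadVoicesAsync (synth = globalThis.speechSynthesis) {
  return new Promise((resolve) => {
    const voices = synth.getVoices()
    if (voices.length > 0) {
      // sync-style (Firefox / Safari Desktop)
      return resolve(voices)
    }
    // async-style (Chrome Desktop / Chrome Android)
    synth.onvoiceschanged = () => resolve(synth.getVoices())
  })
}
```

In the browser you would simply call `loadVoicesAsync().then(voices => ...)`. Note that this sketch does not yet cover the older-Safari case described in the next section.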
Load voices using an interval
Some users of older Safari versions have reported that their voices are not available immediately (while onvoiceschanged is not available either). For this case we need to check for the voices at a constant interval:
```js
let timeout = 0
const maxTimeout = 2000
const interval = 250

const loadVoices = (cb) => {
  const voices = speechSynthesis.getVoices()
  if (voices.length > 0) {
    return cb(undefined, voices)
  }
  if (timeout >= maxTimeout) {
    return cb(new Error('loadVoices max timeout exceeded'))
  }
  timeout += interval
  setTimeout(() => loadVoices(cb), interval)
}

loadVoices((err, voices) => {
  if (err) return console.error(err)
  voices // voices loaded and available
})
```
Speaking with a certain voice
There are use-cases where the default voice does not match the language of the text to be spoken. In that case we need to change the voice of the "utterance" to speak.
Step 1: get a voice by a given language
```js
// assume voices are loaded, see previous section
const getVoiceByLang = lang => speechSynthesis
  .getVoices()
  .find(voice => voice.lang.startsWith(lang))

const german = getVoiceByLang('de')
```
Note: voices have standard language codes, like en-GB, en-US or de-DE. However, on Android's Samsung Browser or Android Chrome, voices have underscore-connected codes, like en_GB.
On Firefox Android, voices have three characters before the separator, like deu-DEU-f00 or eng-GBR-f00.
However, they all start with the language code, so passing a two-letter short-code should be sufficient.
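To make that matching explicit, here is a tiny helper of my own (the names are not part of the Web Speech API) that covers all three formats:

```js
// Sketch: match voices by a two-letter language code across the
// formats mentioned above: 'de-DE', 'de_DE' and 'deu-DEU-f00'.
const matchesLang = (voice, lang) =>
  voice.lang.toLowerCase().startsWith(lang.toLowerCase())

// filter a loaded voices array by language short-code
const filterVoicesByLang = (voices, lang) =>
  voices.filter(voice => matchesLang(voice, lang))
```

Lower-casing both sides also guards against browsers that report codes with unexpected casing.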
Step 2: create a new utterance
We can now pass the voice to a new SpeechSynthesisUtterance and, as your precognitive abilities correctly manifest, there are again some browser-specific issues to consider:
```js
const text = 'Guten Tag!'
const utterance = new SpeechSynthesisUtterance(text)

if (utterance.text !== text) {
  // I found no browser yet that does not support text
  // as constructor arg but who knows!?
  utterance.text = text
}

utterance.voice = german
// iOS required
utterance.lang = german.lang
// Android Chrome required; who knows if required elsewhere?
utterance.voiceURI = german.voiceURI
utterance.pitch = 1
utterance.volume = 1
// API allows up to 10 but values > 2 break on all Chrome
utterance.rate = 1
```
We can now pass the utterance to the speak function as a preview:
```js
speechSynthesis.speak(utterance) // speaks 'Guten Tag!' in German
```
Step 3: add events and speak
This is of course just half of it. We actually want deeper insights into what's happening and what's missing, by tapping into some of the utterance's events:
```js
const handler = e => console.debug(e.type)

utterance.onstart = handler
utterance.onend = handler
utterance.onerror = e => console.error(e)

// SSML markup is rarely supported
// See: https://www.w3.org/TR/speech-synthesis/
utterance.onmark = handler

// word boundaries are supported by
// Safari MacOS and on Windows but
// not on Linux and Android browsers
utterance.onboundary = handler

// not supported / fired
// on many browsers somehow
utterance.onpause = handler
utterance.onresume = handler

// finally speak and log all the events
speechSynthesis.speak(utterance)
```
Step 4: Chrome-specific fix
Longer texts on Chrome Desktop are cancelled automatically after about 15 seconds. This can be fixed either by chunking the text or by firing a near-zero-latency pause/resume combination at a constant interval. At the same time, this fix breaks on Android, since Android devices don't implement speechSynthesis.pause() as pause but as cancel:
```js
let timer

utterance.onstart = () => {
  // detection is up to you for this article as
  // this is a huge topic of its own
  if (!isAndroid) {
    resumeInfinity(utterance)
  }
}

const clear = () => {
  clearTimeout(timer)
}

utterance.onerror = clear
utterance.onend = clear

const resumeInfinity = (target) => {
  // prevent memory-leak in case utterance is deleted, while this is ongoing
  if (!target && timer) {
    return clear()
  }
  speechSynthesis.pause()
  speechSynthesis.resume()
  timer = setTimeout(function () {
    resumeInfinity(target)
  }, 5000)
}
```
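The other fix mentioned above, chunking the text, could be sketched like this. Splitting on sentence punctuation is my own simplification; real sentence splitting is a harder problem:

```js
// Sketch: split long text into sentence-sized chunks, so each
// utterance stays well below Chrome's ~15 second cutoff.
// speechSynthesis.speak() queues utterances internally, so the
// chunks can simply be spoken one after another.
const chunkText = text =>
  (text.match(/[^.!?]+[.!?]*/g) || []).map(s => s.trim())
```

For example, `chunkText('Guten Tag. Wie geht es?')` yields `['Guten Tag.', 'Wie geht es?']`, and each chunk can then be wrapped in its own utterance.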
Furthermore, some browsers don't update the speechSynthesis.paused property when speechSynthesis.pause() is executed (even though speech is correctly paused). You then need to manage these states yourself.
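A minimal way to manage that state yourself could look like the following sketch. The wrapper name is mine, and the synthesis object is injected so the logic can be tested without a browser:

```js
// Sketch: track the paused state manually, because
// speechSynthesis.paused is not updated reliably everywhere.
function createPausedTracker (synth = globalThis.speechSynthesis) {
  let paused = false
  return {
    pause () { synth.pause(); paused = true },
    resume () { synth.resume(); paused = false },
    get paused () { return paused }
  }
}
```

Keep in mind that on Android, pause effectively cancels the speech (see below), so this tracker only makes sense on platforms where pause actually pauses.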
Issues that can't be fixed with JavaScript
All of the above fixes rely on JavaScript, but some issues are platform-specific. You need to design your app in a way that avoids these issues, where possible:
All browsers on Android actually do a cancel/stop when calling speechSynthesis.pause(); pause is simply not supported on Android
There are no voices on Chromium-Ubuntu and Ubuntu derivatives unless the browser is started with a specific flag
If, on Chromium Desktop on Ubuntu, the very first page visited wants to load speech synthesis, then no voices are ever loaded until the page is refreshed or a new page is entered. This can be fixed with JavaScript, but auto-refreshing the page can lead to very bad UX
If voices are not installed on the host OS and no voices are loaded from remote by the browser, then there are no voices and thus no speech synthesis
There is no way to just instant-load custom voices from remote and use them as a shim in case there are no voices
If the installed voices are just bad, users have to manually install better voices
Making your life easier with EasySpeech
Now you have seen the worst, and believe me, it takes ages to implement all the potential fixes.
Fortunately, I have already done this and published a package to NPM, with the intent to provide a common API that handles most issues internally and provides the same experience across browsers (that support speechSynthesis):
You should give it a try the next time you implement speech synthesis. It also comes with a demo page, so you can easily test and debug your devices there: https://jankapunkt.github.io/easy-speech/
Let's take a look at how it works:
```js
import EasySpeech from 'easy-speech'

// sync, returns an Object with detected features
EasySpeech.detect()

EasySpeech.init()
  .then(() => {
    EasySpeech.speak({ text: 'Hello, world!' })
  })
  .catch(e => console.error('no speech synthesis:', e.message))
```
It will not only detect which features are available but also load an optimal default voice, based on a few heuristics.