dev-resources.site

Building a Real-time Speech-to-text Web App with Web Speech API

Published at: 1/2/2024
Categories: tutorial, webdev, javascript
Author: ℵi✗✗

Happy New Year, everyone! In this short tutorial, we will build a simple yet useful real-time speech-to-text web app using the Web Speech API. The feature set is straightforward: click a button to start recording, and your speech is converted to text and displayed on the screen in real time. We'll also play with voice commands: saying "stop recording" will halt the recording. Sounds fun? Okay, let's get into it. 😊

Web Speech API Overview

The Web Speech API is a browser API that lets developers integrate speech recognition (the SpeechRecognition interface) and speech synthesis (the SpeechSynthesis interface) into web applications. It opens up possibilities for hands-free and voice-controlled features that improve accessibility and user experience. Support varies between browsers, so check that the API is available before using it.

Some use cases for the Web Speech API include voice commands, voice-driven interfaces, transcription services, and more.
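Before relying on either half of the API, it helps to check what the current browser exposes. The sketch below is one way to do that. The helper names `getSpeechRecognition` and `hasSpeechSynthesis` are made up for illustration, and the `scope` parameter exists only so the helpers can be called with a stand-in object outside a browser.

```javascript
// Hypothetical helpers (not part of the Web Speech API itself).
// `scope` defaults to globalThis, which is `window` in a browser.

// Returns the speech recognition constructor, or null when unsupported.
// Some browsers expose it only under the webkit prefix.
function getSpeechRecognition(scope = globalThis) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

// Returns true when the speech synthesis half of the API is available.
function hasSpeechSynthesis(scope = globalThis) {
  return 'speechSynthesis' in scope && 'SpeechSynthesisUtterance' in scope;
}

const Recognition = getSpeechRecognition();
console.log('Speech recognition supported:', Recognition !== null);
console.log('Speech synthesis supported:', hasSpeechSynthesis());
```

If `getSpeechRecognition()` returns null, the app can hide or disable its recording controls instead of failing later.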

Let's Get Started

Now, let's dive into building our real-time speech-to-text web app. I'm going to use Vite to scaffold the project, but feel free to use any build tool of your choice, or none at all, for this mini demo project.

STEP 1️⃣. Create a new Vite project:

   npm create vite@latest

STEP 2️⃣. Choose "Vanilla" on the next screen and "JavaScript" on the following one, using the arrow keys to move up and down. Then change into the new project folder, run npm install, and start the dev server with npm run dev.

HTML Structure

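The markup below defines the basic layout: a start button with an animated recording indicator, a stop button, and a container for the transcript.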

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <script type="module" src="/main.js"></script>
    <title>Real-time Speech to Text App</title>
  </head>
  <body>
    <div class="container">
      <h1>Real-time STT App</h1>

      <div class="btn-wrapper">
        <button id="startBtn" class="btn-start">
          <svg viewBox="0 0 100 100" class="hidden">
            <!-- Outer circle -->
            <circle
              cx="50"
              cy="50"
              r="40"
              stroke="#ccc"
              stroke-width="5"
              fill="none"
            />

            <!-- Inner circle indicating recording -->
            <circle
              cx="50"
              cy="50"
              r="30"
              stroke="#ccc"
              stroke-width="5"
              fill="none"
            >
              <animate
                attributeName="r"
                values="30; 25; 30"
                dur="1.5s"
                repeatCount="indefinite"
              />
            </circle>

            <!-- Record icon in the center -->
            <circle cx="50" cy="50" r="5" fill="#ccc" />
          </svg>

          <span> Start Recording </span>
        </button>
        <button id="stopBtn" class="btn-stop" disabled>Stop Recording</button>
      </div>

      <div id="result" class="result"></div>
    </div>
  </body>
</html>

CSS Styling

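The styles below define the visual appearance of the application. The HTML above does not link a stylesheet, so load them either by importing the CSS file from main.js, as the Vite vanilla template does, or with a <link> tag in the <head>.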

:root {
  font-family: Inter, system-ui, Avenir, Helvetica, Arial, sans-serif;
  line-height: 1.5;
  font-weight: 400;

  font-synthesis: none;
  text-rendering: optimizeLegibility;
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

* {
  margin: 0;
  padding: 0;
  box-sizing: border-box;
}

body {
  background: radial-gradient(
      circle at 100%,
      rgba(3, 6, 21, 0.9) 15%,
      rgba(189, 205, 226, 0.5) 5%,
      rgba(7, 9, 22, 0.9) 15%
    ),
    url('./public/chevron.png') center/cover;

  height: 100vh;
  padding: 40px 0;
}

.container {
  max-width: 1100px;
  margin: 0 auto;
  display: flex;
  flex-direction: column;
  align-items: center;
  padding: 0 15px;
}

h1 {
  color: #fff;
  font-size: 1.5rem;
  text-transform: uppercase;
}

.btn-wrapper {
  margin-top: 20px;
  display: flex;
  flex-wrap: wrap;
  justify-content: center;
  align-items: center;
  gap: 10px;
}

button {
  display: flex;
  align-items: center;
  column-gap: 5px;
  border: none;
  cursor: pointer;
  padding: 12px 24px;
  border-radius: 3px;
  font-weight: 600;
  box-shadow: 0 0 10px rgba(0, 0, 0, 0.3);
  transition: opacity 400ms ease-in-out;
}

button:disabled {
  opacity: 0.47;
  cursor: default;
}

button:hover:not(:disabled) {
  opacity: 0.9;
}

button > svg {
  height: 1rem;
}

.btn-start {
  background-color: #ff2c4f;
  color: #fff;
}

.btn-stop {
  background-color: rgb(7, 2, 44);
  color: #fff;
}

.result {
  background-color: #fff;
  width: 100%;
  min-height: 200px;
  padding: 10px;
  border-radius: 3px;
  margin-top: 20px;
  box-shadow: 0 0 10px rgba(0, 0, 0, 0.3);
  text-transform: capitalize;
}

.result:empty {
  display: none;
}

.hidden {
  display: none !important;
}

@media screen and (min-width: 768px) {
  h1 {
    font-size: 3.125rem;
    text-transform: capitalize;
  }

  .container {
    padding: 0 30px;
  }

  .result {
    padding: 15px;
  }
}

JavaScript Implementation

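The script below wires up the buttons, configures speech recognition, and renders the transcript as results arrive.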

// Cache the DOM elements the app works with.
const resultElement = document.getElementById('result');
const startBtn = document.getElementById('startBtn');
const animatedSvg = startBtn.querySelector('svg');
const stopBtn = document.getElementById('stopBtn');

startBtn.addEventListener('click', startRecording);
stopBtn.addEventListener('click', stopRecording);

// Some browsers expose the constructor only under the webkit prefix.
const SpeechRecognitionCtor =
  window.SpeechRecognition || window.webkitSpeechRecognition;
let recognition = null;

if (SpeechRecognitionCtor) {
  recognition = new SpeechRecognitionCtor();
  recognition.continuous = true; // keep listening across pauses
  recognition.interimResults = true; // report partial results while speaking
  recognition.lang = 'en-US';

  recognition.onstart = () => {
    startBtn.disabled = true;
    stopBtn.disabled = false;
    animatedSvg.classList.remove('hidden');
    console.log('Recording started');
  };

  recognition.onresult = (event) => {
    let result = '';

    // Rebuild the transcript from every result of this session so earlier
    // phrases stay on screen. Starting at event.resultIndex would keep only
    // the results that changed in this event.
    for (let i = 0; i < event.results.length; i++) {
      const { transcript } = event.results[i][0];
      result += event.results[i].isFinal ? transcript + ' ' : transcript;
    }

    // Voice command: saying "stop recording" ends the session.
    if (result.toLowerCase().includes('stop recording')) {
      result = result.replace(/stop recording/gi, '');
      stopRecording();
    }

    resultElement.innerText = result.trim();
  };

  recognition.onerror = (event) => {
    startBtn.disabled = false;
    stopBtn.disabled = true;
    console.error('Speech recognition error:', event.error);
  };

  recognition.onend = () => {
    startBtn.disabled = false;
    stopBtn.disabled = true;
    animatedSvg.classList.add('hidden');
    console.log('Speech recognition ended');
  };
} else {
  // Without support there is nothing to start, so disable the button.
  startBtn.disabled = true;
  console.error('Speech recognition not supported');
}

function startRecording() {
  if (!recognition) return;
  resultElement.innerText = '';
  // Disable immediately: calling start() again before onstart fires throws.
  startBtn.disabled = true;
  recognition.start();
}

function stopRecording() {
  if (recognition) {
    recognition.stop();
  }
}
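The `onresult` handler receives a list of results, each marked as final or interim. If you want to keep those two kinds of text apart, for example to style interim text differently, one possible approach is sketched below. `splitTranscript` is a hypothetical helper, and the plain arrays in the usage lines are stand-ins for a real `event.results` list.

```javascript
// Hypothetical helper: walks every result of a recognition event and
// separates confirmed (final) text from text that may still change (interim).
// `results` is array-like; each entry has `isFinal` and its best
// alternative at index 0.
function splitTranscript(results) {
  let finalText = '';
  let interimText = '';

  for (let i = 0; i < results.length; i++) {
    const transcript = results[i][0].transcript;
    if (results[i].isFinal) {
      finalText += transcript.trim() + ' ';
    } else {
      interimText += transcript;
    }
  }

  return { finalText: finalText.trim(), interimText: interimText.trim() };
}

// Stand-in data shaped like event.results (for demonstration only).
const fakeResults = [
  Object.assign([{ transcript: 'hello world' }], { isFinal: true }),
  Object.assign([{ transcript: ' how are' }], { isFinal: false }),
];

console.log(splitTranscript(fakeResults));
// { finalText: 'hello world', interimText: 'how are' }
```

Inside the handler this would be called as `splitTranscript(event.results)`, and the two strings could then be written to separate elements.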

Conclusion

This simple web app uses the Web Speech API to convert spoken words into text in real time. Users can start and stop recording with the provided buttons or by saying "stop recording". Customize the design and functionality further based on your project requirements.
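As one possible customization, the single hard-coded "stop recording" phrase could become a small table of voice commands. The sketch below is an assumption about how that might look: `findCommand`, `escapeRegExp`, the `commands` table, and the "clear text" phrase are illustrative and not part of the original app.

```javascript
// Hypothetical command table: spoken phrase (lowercase) -> action name.
const commands = {
  'stop recording': 'stop',
  'clear text': 'clear',
};

// Escapes characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Looks for the first known phrase in the transcript (case-insensitive).
// Returns the action and the transcript with the phrase removed,
// or action: null when nothing matched.
function findCommand(transcript, table = commands) {
  const lower = transcript.toLowerCase();

  for (const [phrase, action] of Object.entries(table)) {
    if (lower.includes(phrase)) {
      const pattern = new RegExp(escapeRegExp(phrase), 'gi');
      const text = transcript.replace(pattern, '').replace(/\s+/g, ' ').trim();
      return { action, text };
    }
  }

  return { action: null, text: transcript };
}

console.log(findCommand('Buy milk Stop Recording'));
// { action: 'stop', text: 'Buy milk' }
```

The result handler could then switch on the returned `action`, calling `stopRecording()` for `'stop'` and emptying the result element for `'clear'`.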

Final demo: https://stt.nixx.dev

Feel free to explore the complete code on the GitHub repository.

You now have a basic understanding of how to create a real-time speech-to-text web app using the Web Speech API. Experiment with additional features and enhancements to make it even more versatile and user-friendly. 😊 🙏
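One such enhancement is showing a readable message when recognition fails, instead of only logging `event.error`. The mapping below is a sketch: the keys are a subset of the error codes the Web Speech API reports, while the helper name `describeRecognitionError` and the message wording are made up for illustration.

```javascript
// Hypothetical lookup from Web Speech API error codes to user-facing text.
const errorMessages = new Map([
  ['no-speech', 'No speech was detected. Please try again.'],
  ['audio-capture', 'No microphone was found.'],
  ['not-allowed', 'Microphone access was blocked.'],
  ['network', 'A network error interrupted recognition.'],
  ['aborted', 'Recognition was cancelled.'],
]);

// Returns a friendly message, falling back to a generic one that
// includes the raw code for anything not listed above.
function describeRecognitionError(code) {
  return errorMessages.get(code) || `Speech recognition failed (${code}).`;
}

console.log(describeRecognitionError('not-allowed'));
// Microphone access was blocked.
```

In the app, the `onerror` handler could pass `event.error` to this helper and write the returned text into the result element.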
