
Speech Recognition

by James G. Barr

Docid: 00018029

Publication Date: 2206

Publication Type: TUTORIAL

Preview

Speech recognition, or speech-to-text, is the technology that enables the
recognition and translation of spoken language into text through
computational linguistics. Speech recognition is often confused with
voice recognition. Whereas speech recognition identifies spoken
words, voice recognition is a biometric technology that establishes the
identity of the speaker. Initially deployed as an automated
dictation tool, speech recognition has become an essential
business-to-consumer (B2C) capability. Today’s leading speech recognition
systems are Amazon Alexa, Google Assistant, and Apple’s Siri.

Executive Summary


Speech recognition, or speech-to-text, is the technology that enables the
recognition and translation of spoken language into text through
computational linguistics. Speech recognition is often confused with
voice recognition. Whereas speech recognition identifies spoken words, voice
recognition is a biometric technology that establishes the identity of the
speaker.

Related Faulkner Reports:

  • Speech Analytics in the Call Center Tutorial
  • Computational Linguistics Tutorial
  • Biometrics Market Trends Market

Initially deployed as an automated dictation tool, speech recognition has
become an essential business-to-consumer (B2C) capability. Today’s
leading speech recognition systems are:

  • Google Assistant
  • Amazon Alexa
  • Apple’s Siri

Figure 1. Siri Speech Recognition

Source: Wikimedia Commons

How Does Speech Recognition Work?

As outlined by Summa Linguae Technologies, the science of speech
recognition consists of four basic steps:

  1. “A microphone transmits the vibrations of a person’s voice into a
    wavelike electrical signal.
  2. “This signal in turn is converted by the system’s hardware – a
    computer’s sound card, for example – into a digital signal.
  3. “The speech recognition software analyzes the digital signal to
    register phonemes, units of sound that distinguish one word from another
    in a particular language.
  4. “The phonemes are reconstructed into words.”1
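The final step in this pipeline can be illustrated in code. The sketch below implements only step 4, reconstructing words from a recognized phoneme sequence; the phoneme symbols and the two-word pronunciation dictionary are invented for illustration and are not a real lexicon:

```python
# Toy illustration of step 4: reconstructing words from phonemes.
# The phoneme inventory and dictionary below are illustrative only.
PRONUNCIATIONS = {
    ("HH", "EH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}

def phonemes_to_words(phonemes, lexicon):
    """Greedily match the longest known phoneme sequence at each position."""
    words, i = [], 0
    while i < len(phonemes):
        for j in range(len(phonemes), i, -1):  # longest match first
            chunk = tuple(phonemes[i:j])
            if chunk in lexicon:
                words.append(lexicon[chunk])
                i = j
                break
        else:
            i += 1  # skip an unrecognized phoneme
    return words

print(phonemes_to_words(
    ["HH", "EH", "L", "OW", "W", "ER", "L", "D"], PRONUNCIATIONS))
# ['hello', 'world']
```

Real recognizers perform this mapping statistically, scoring many candidate word sequences against acoustic and language models rather than matching a fixed dictionary.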

An Imperfect Science

While speech recognition is aimed at enabling people to communicate
comfortably and naturally with today’s electronic information
infrastructure, speech recognition technology, despite decades of
development, is still a “work in progress.”

Writing on the current state of “speech engines,” analyst Erik J. Martin
highlights two persistent problems: background noise and speech diversity.

Regarding background noise, Martin quotes Michael Zagorsek, chief
operating officer of SoundHound. “Noise disrupts the speech patterns
that are being picked up by the microphone. The ability to remove
noise can open the door for interacting with the voice assistant in a
variety of environments, such as cars, on the street, or in areas with a
lot of background noise.”

Concerning speech diversity, Martin cites Phil Steitz, chief technology
officer of Nextiva. “Current [automatic speech recognition (ASR)]
models are now extremely good at clear, slow speech, but they need to get
better at picking up different dialects and specialized vocabularies.”2

Another prevailing problem with speech recognition is that many consumers
hate it. More specifically, they hate interacting with “virtual
agents,” a type of electronic customer service representative that – thanks
to speech recognition – has largely replaced traditional call center
personnel. The problem is twofold:

  • First, the virtual agent may – indeed, frequently does – misunderstand
    what a consumer is saying;
  • Second, and unrelated to speech recognition, the virtual agent’s set
    of pre-programmed responses may not answer a consumer’s question or
    solve her problem.

To help address this dilemma, Zagorsek predicts that “In the near future,
we will see voice assistants taking a proactive role and providing greater
usefulness by collecting information about the context and situation and
then taking the initiative to make helpful suggestions and take actions.”3

Some even suggest that voice assistants will be programmed to project
empathy, thus helping defuse hostile interactions.

Applications


Speech recognition technology is featured in a wide variety of personal
and business functions. Among the more prominent of these are:

Voice Search

As analyst Cem Dilmegani reports, voice search “is, arguably, the most
common use of speech recognition. Specifically, using [a] voice
assistant to search for stuff on the Internet has now become the ideal way
of searching for 71 percent of participants in a PWC survey.”4

Speech-to-Text

Speech-to-text, a term often used interchangeably with speech
recognition, offers a hands-free approach to:

  • Writing e-mails
  • Sending texts
  • Composing documents

Microsoft Word, for example, has a built-in speech-to-text capability for
dictating documents.

Call Handling

To the delight of call center managers – and the dismay of some call
center callers – speech recognition has allowed the creation of virtual
customer service representatives, which work 24x7 and at a fraction of the
cost of real CSRs.

Closed Captioning

Speech recognition is central to the delivery of “closed captioning”
services. With it, the US Federal Communications Commission ensures
that viewers who are deaf and hard of hearing have full access to
programming. Implemented with limited exemptions, the FCC requires
that captions be:

  • Accurate – Captions must match the spoken words in the dialogue and
    convey background noises and other sounds to the fullest extent
    possible.
  • Synchronous – Captions must coincide with their corresponding spoken
    words and sounds to the greatest extent possible and must be displayed
    on the screen at a speed that can be read by viewers.
  • Complete – Captions must run from the beginning to the end of the
    program to the fullest extent possible.
  • Properly placed – Captions should not block other important visual
    content on the screen, overlap one another or run off the edge of the video
    screen.
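These requirements lend themselves to automated checking. The sketch below validates a caption track against two of the criteria, synchronicity (no overlapping captions) and a readable display speed; the 20-characters-per-second limit is an assumed illustrative threshold, not an FCC figure:

```python
# Illustrative checks for two FCC caption criteria: synchronicity
# (captions ordered and non-overlapping in time) and readability
# (display speed a viewer can read; the 20 chars/sec limit is an
# assumed illustrative threshold, not an FCC figure).
def validate_captions(captions, max_chars_per_sec=20.0):
    """Captions are (start_sec, end_sec, text) tuples in display order."""
    problems = []
    for i, (start, end, text) in enumerate(captions):
        if end <= start:
            problems.append(f"caption {i}: empty or negative duration")
            continue
        if len(text) / (end - start) > max_chars_per_sec:
            problems.append(f"caption {i}: displayed too briefly to read")
        if i and start < captions[i - 1][1]:
            problems.append(f"caption {i}: overlaps previous caption")
    return problems

captions = [(0.0, 2.0, "Hello, and welcome."),
            (1.5, 3.0, "Tonight's top story...")]
print(validate_captions(captions))  # flags the overlap at caption 1
```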

Smart Home Management

Smart home devices leverage speech recognition to carry out residents’
household commands, like:

  • Turn off the lights
  • Lock the doors
  • Adjust the thermostat
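Mapping such recognized commands to device actions can be sketched as a simple dispatch table; the command phrases and device names below are illustrative:

```python
# Minimal sketch of routing recognized household commands to device
# actions; the command phrases and device names are illustrative.
COMMANDS = {
    "turn off the lights": ("lights", "off"),
    "lock the doors": ("doors", "locked"),
    "adjust the thermostat": ("thermostat", "adjusted"),
}

def dispatch(utterance, state):
    """Apply the device action for a recognized utterance, if any."""
    action = COMMANDS.get(utterance.lower().strip())
    if action is None:
        return False  # utterance not recognized
    device, setting = action
    state[device] = setting
    return True

home = {"lights": "on", "doors": "unlocked"}
dispatch("Turn off the lights", home)
print(home)  # {'lights': 'off', 'doors': 'unlocked'}
```

Production assistants replace the exact-phrase lookup with natural language understanding, so that “kill the lights” and “lights off, please” resolve to the same action.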

In-Car Management

According to analyst Dilmegani, “In-car speech recognition systems have
become a standard feature for most modern vehicles. The biggest
benefit of this technology is that it eliminates the need for the driver
to look anywhere else except the road ahead while they are driving,
allowing them to multitask by using their voice. Thanks to these
systems, drivers can use simple voice commands to initiate phone calls,
select radio stations, or play music.”5

Medical and Legal Transcription

Speech recognition helps produce medical and legal paperwork, including:

  • Patient examination notes
  • Trial transcripts
  • e-discovery search results
  • Depositions
  • Interrogation proceedings

Emotion Recognition

As analysts Ben Lutkevich and Karolina Kiwak observe, speech recognition
is empowering a new field, “emotion recognition,” which “can analyze
certain vocal characteristics to determine what emotion the speaker is
feeling. Paired with ‘sentiment analysis,’ this can reveal how
someone feels about a product or service [for example].”6
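The idea can be illustrated with a toy rule-based classifier over vocal features. Production systems use trained models over many acoustic features; the pitch and energy thresholds below are invented for the sketch:

```python
# Toy feature-based classifier illustrating emotion recognition from
# vocal characteristics. Real systems use learned models; the pitch
# and energy thresholds here are invented for illustration.
def classify_emotion(mean_pitch_hz, mean_energy):
    """Map two simple vocal features to an emotion label."""
    if mean_energy > 0.7:  # loud, forceful speech
        return "angry" if mean_pitch_hz > 220 else "confident"
    if mean_pitch_hz > 220:  # quieter but high-pitched
        return "excited"
    return "calm"

print(classify_emotion(mean_pitch_hz=250, mean_energy=0.9))  # angry
print(classify_emotion(mean_pitch_hz=180, mean_energy=0.2))  # calm
```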


Speech and Voice Recognition Market

According to MarketsandMarkets, the speech and voice recognition market is
expected to grow from $8.3 billion in 2021 to $22.0 billion by 2026, which
represents a robust compound annual growth rate (CAGR) of 21.6 percent
during the forecast period. A major market driver is the increasing adoption of speech and voice
recognition technology in the smart appliance space.
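These figures can be checked with the standard compound annual growth rate formula, CAGR = (end/start)^(1/years) - 1; the rounded $8.3 billion and $22.0 billion endpoints over the five-year window yield roughly 21.5 percent, in line with the reported 21.6 percent:

```python
# Verify the reported CAGR from the 2021 and 2026 market sizes.
start, end, years = 8.3, 22.0, 5  # $B, 2021 -> 2026
cagr = (end / start) ** (1 / years) - 1
print(f"{cagr:.1%}")  # 21.5%, consistent with the reported 21.6 percent
```

The small gap comes from computing with the rounded endpoint values.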

A major challenge for speech and voice recognition providers is the
absence of a standard application programming interface (API) through
which they could address issues related to high costs, deployment delays,
and inter-system interoperability.7

Real-Time Speech Translation Devices

The speech recognition industry is “stepping up” to help monolingual
Americans and others negotiate foreign lands and foreign languages.
Imagine you’re an English-only speaker visiting Paris on holiday or Rome
on business. You need to communicate with the locals, but it’s too
late to sign up with Rosetta Stone or Babbel. Fortunately, there is a
practical alternative: a real-time speech translation device.

Figure 2. Langogo Genesis 2-in-1 AI Translator Device

Source: Amazon

One of the best, according to analyst Joy Sallegue, is the Langogo Genesis
2-in-1 AI Translator Device, depicted in Figure 2. Key features
include:

  • “Accurate and fast translation for over 100 languages.
  • “Self-learning algorithm and continuous updates.
  • “Active noise canceling.
  • “Offline translations for Chinese, English, Japanese, and Korean.
  • “Intelligent travel assistant features.”8

Support for Less-Spoken Languages

While automatic speech recognition technology exists for common
languages, speakers fluent in approximately 7,000 less-common languages
have been out of luck. But, as analyst Adam Zewe
reports, “Recent advances have enabled machine learning models that can
learn the world’s uncommon languages, which lack the large amount of
transcribed speech needed to train algorithms. Researchers at MIT
[have developed] a simple technique that reduces the complexity of an
advanced speech-learning model, enabling it to run more efficiently and
achieve higher performance.

“Their technique involves removing unnecessary parts of a common, but
complex, speech recognition model and then making minor adjustments so it
can recognize a specific language. Because only small tweaks are
needed once the larger model is cut down to size, it is much less
expensive and time-consuming to teach this model an uncommon language.

“This work could help level the playing field and bring automatic speech
recognition systems to many areas of the world where they have yet to be
deployed.”9
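The pruning idea at the heart of this technique can be sketched with simple magnitude pruning, which zeroes the smallest weights of a layer before the trimmed model is fine-tuned on the target language. The weights and the 50 percent sparsity target below are invented for the sketch; the MIT researchers’ actual method differs in detail:

```python
# Illustrative magnitude pruning: zero out the smallest-magnitude
# weights of a layer, shrinking the model before fine-tuning it on a
# low-resource language. Weights and sparsity target are invented.
def prune_by_magnitude(weights, sparsity):
    """Zero the fraction `sparsity` of weights with smallest magnitude."""
    k = int(len(weights) * sparsity)
    cutoff = sorted(abs(w) for w in weights)[k - 1] if k else 0.0
    return [0.0 if abs(w) <= cutoff else w for w in weights]

layer = [0.9, -0.05, 0.4, 0.01, -0.8, 0.02, 0.6, -0.03]
pruned = prune_by_magnitude(layer, sparsity=0.5)
print(pruned)  # [0.9, 0.0, 0.4, 0.0, -0.8, 0.0, 0.6, 0.0]
```

Because the zeroed weights need not be stored or updated, the remaining small model is cheaper to fine-tune on the limited transcribed speech available for an uncommon language.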

Natural Language Understanding

One of the newly emerging elements of computational linguistics – and one
of the most exciting – is natural language understanding (NLU), also
called “natural language interpretation” (NLI). Through NLU analysis,
“computers manage to interpret … language and define a user’s
intent.” Unlike simple speech recognition, NLU “focuses on the
determination of intent, sentiment, and context.”
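Intent determination can be illustrated with a minimal keyword-matching sketch. Real NLU systems use trained models that also weigh sentiment and context; the intents and keywords below are invented for illustration:

```python
# Minimal keyword-based sketch of intent determination; production NLU
# uses trained models, and these intents and keywords are illustrative.
INTENTS = {
    "check_balance": {"balance", "account"},
    "reset_password": {"password", "reset", "login"},
    "contact_agent": {"human", "agent", "representative"},
}

def detect_intent(utterance):
    """Return the intent whose keywords best overlap the utterance."""
    tokens = set(utterance.lower().replace("?", "").split())
    scores = {intent: len(tokens & kws) for intent, kws in INTENTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] else "unknown"

print(detect_intent("Can I reset my password?"))  # reset_password
print(detect_intent("What's the weather?"))       # unknown
```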

According to analyst Bogdan Koretski, “[with] the evolution of NLU, we’ll
see the ubiquitous growth of its applications. Just imagine how it
will be: systems that understand the meaning, provide instant answers, and
even direct the conversation. Directory services that assist people
in finding the desired information, resolving their issues, and
more. Such products will allow companies to dramatically improve
customer service and automate manual operations.

“Due to the deep and correct language interpretation, machine-to-human
communication will reach a completely new level. Business processes
like data collection and analysis, data vetting, and facts checking will
be automated and errors excluded.”10

Recommendations


Establish a Speech Recognition Strategy

With speech recognition becoming a more viable – and, thus, valuable –
technology, enterprise planners should consider how speech might align
with their products and priorities.

To that end, analyst Cindy Gordon encourages planners to entertain a few
simple questions:

  • “How many data sources do we have that are speech enabled that could
    help us secure a competitive advantage?
  • “What percentage of our products and services are leveraging speech
    recognition … to create new communication channels?
  • “What are our competitors doing in advancing speech recognition
    solutions across their ecosystems?
  • “How many AI enabled solutions do we have leveraging voice?
  • “Do we have … speech recognition skills and talents in our
    organization?”11

The goal is to create a speech recognition strategy, in the same way the
enterprise devised a cloud strategy, an edge strategy, and an Internet of
Things (IoT) strategy.

Utilize Commercial Speech Recognition APIs

For an enterprise new to speech recognition technology, enterprise
planners should leverage commercial speech recognition application
programming interfaces (APIs), like the IBM Watson Speech to Text API.12

According to the vendor, “IBM Watson Speech to Text technology enables
fast and accurate speech transcription in multiple languages for a variety
of use cases, including but not limited to customer self-service, agent
assistance, and speech analytics.”
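Calling such a hosted API typically amounts to an authenticated HTTP POST of audio data. The sketch below builds, but does not send, such a request; the service URL and API key are placeholders, and the exact endpoint, parameters, and authentication scheme should be taken from the vendor’s API reference:

```python
# Sketch of preparing a request to a hosted speech-to-text REST
# endpoint such as IBM Watson Speech to Text. The URL and key are
# placeholders; consult the vendor's API reference for exact details.
import base64
import urllib.request

SERVICE_URL = "https://example.invalid/v1/recognize"  # from your credentials
API_KEY = "YOUR_API_KEY"                              # placeholder

def build_transcription_request(audio_bytes):
    """Build (without sending) an authenticated POST of WAV audio."""
    token = base64.b64encode(f"apikey:{API_KEY}".encode()).decode()
    return urllib.request.Request(
        SERVICE_URL,
        data=audio_bytes,
        headers={"Content-Type": "audio/wav",
                 "Authorization": f"Basic {token}"},
        method="POST")

req = build_transcription_request(b"\x00fake-audio")
print(req.get_method(), req.full_url)
```

In practice, the vendor’s SDK handles authentication, streaming, and response parsing, so raw HTTP like this is rarely necessary.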


References

1 “A Complete Guide to Speech Recognition Technology.” Summa
Linguae Technologies. June 11, 2021.

2-3 Erik J. Martin. “The 2022 State of Speech Engines.”
Speech Technology (magazine) | Information Today. February 17, 2022.

4-5 Cem Dilmegani. “Top 11 Speech Recognition Applications in
2022.” AIMultiple. April 22, 2022.

6 Ben Lutkevich and Karolina Kiwak. “Speech Recognition.” TechTarget. September 2021.

7 “Speech and Voice Recognition Market by Delivery Method,
Deployment Mode (On Cloud, On-Premises/Embedded), Technology (Speech
Recognition, Voice Recognition), Vertical and Geography (2021-2026).”
MarketsandMarkets. August 2021.

8 Joy Sallegue. “12 Best Language Translator Devices in the
Market Right Now.” Learn Languages from Home. February 18, 2022.

9 Adam Zewe. “Toward Speech Recognition for Uncommon Spoken
Languages.” Massachusetts Institute of Technology. November 4, 2021.

10 Bogdan Koretski. “Five Recent Trends in Natural Language
Processing You Need to Know.” YSBM Group sp. z o.o. November 5, 2019.

11 Cindy Gordon. “A Market to Harness: Speech Recognition
Artificial Intelligence (AI) Innovations on the Rise.” Forbes. December
23, 2021.

12 “A Complete Guide to Speech Recognition Technology.” Summa
Linguae Technologies. June 11, 2021.

About the Author


James G. Barr is a leading business continuity analyst
and business writer with more than 40 years’ IT experience. A member of
“Who’s Who in Finance and Industry,” Mr. Barr has designed, developed, and
deployed business continuity plans for a number of Fortune 500 firms. He
is the author of several books, including How to Succeed in Business
BY Really Trying, a member of Faulkner’s Advisory Panel, and a
senior editor for Faulkner’s Security Management Practices.
Mr. Barr can be reached via e-mail at jgbarr@faulkner.com.
