(Thanks to Mohammad Saleem)
The Past:
Speech and language, and the mysteries and magic surrounding them,
have a long and venerable history, reaching back into mythological
time. Only in this past half century have serious inroads been made
into understanding them well enough to be able to emulate them with
computer technology. Many top-notch researchers and engineers
worldwide have contributed critical pieces to these puzzles. The
examples discussed here illustrate just a few of the key milestones,
both technical and commercial. Successful research breakthroughs
eventually give rise to new products and applications, sometimes
quickly, though often not as soon as desired. The major contributors
to progress have been a growing understanding of the speech and
language processes themselves, in concert with ever more powerful and
less expensive computers.
Speech synthesis technology harkens back to von Kempelen's 1791
"talking machines," which could generate intelligible speech at the
hands of well-trained technicians skillfully manipulating a set of
bellows to force air through various tubes and apertures which
mimicked the shapes and cavities of the vocal tract. In the
mid-1870s, Alexander Graham Bell tried to create speech recognition to
provide an instrument for the deaf that would turn speech into text.
Failing that, he focused his energy on creating what, in 1876, became
the telephone!
Over the past half century, speech synthesis techniques have centered
on (1) extracting key characteristics, using formants, pitch, etc.
and/or other parameterizations, such as LPC (Linear Predictive
Coding), and then using these to generate intelligible playback (e.g.
formant synthesizers, LPC synthesizers, etc.), or (2) modeling the
sounds themselves, and combinations of them, and then seamlessly
joining them together (e.g. concatenative synthesis). The first set of
techniques, though trickier to implement well, has the virtue of
requiring low bit rates and much less computation; the second set of
techniques, though much more memory-intensive, typically generates
more natural-sounding speech output. Major commercial laboratories
(e.g. Bell Labs, NTT, etc.) as well as academic and government
laboratories (e.g. Univ. Amsterdam, JSRU, KTH, MIT, Univ. Tokyo)
spearheaded both basic speech production research and synthesis
methodologies. Numerous smaller laboratories also have contributed key
synthesis techniques and applications.
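
The contrast between the two families of techniques can be sketched in a few lines of Python. This is only a toy illustration, not any laboratory's actual method: the function names, the single filter coefficient, the pulse period, and the linear cross-fade are all assumptions chosen to keep the example small. The first function drives an all-pole (LPC-style) filter from a handful of parameters; the second splices stored waveform units together.

```python
def lpc_synthesize(coeffs, excitation):
    """Approach (1): drive an all-pole filter with an excitation signal.

    s[n] = e[n] + sum_k coeffs[k] * s[n - 1 - k]
    """
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out


def crossfade_join(unit_a, unit_b, overlap):
    """Approach (2): join two stored units with a linear cross-fade."""
    if not 0 < overlap <= min(len(unit_a), len(unit_b)):
        raise ValueError("overlap must fit inside both units")
    head = unit_a[:len(unit_a) - overlap]
    tail = unit_b[overlap:]
    blend = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight ramps toward unit_b
        a_sample = unit_a[len(unit_a) - overlap + i]
        blend.append((1 - w) * a_sample + w * unit_b[i])
    return head + blend + tail


# Parametric: a few numbers (coefficients plus a pitch period) describe
# the sound, which is why the bit rate can be low.
pulses = [1.0 if n % 4 == 0 else 0.0 for n in range(8)]
voiced = lpc_synthesize([0.5], pulses)

# Concatenative: the waveform samples themselves are stored and spliced,
# which is why the memory cost is higher.
joined = crossfade_join([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], overlap=2)
```

A real synthesizer would use many more coefficients per frame and far more careful unit selection and smoothing; the sketch only shows where the low bit rate of the first approach and the memory cost of the second come from.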
In 1936, U.K. Tel introduced a "speaking clock" to tell time. Homer
Dudley of Bell Labs demonstrated his "Voder" (a manually controlled
speech synthesizer) at the 1939 World's Fair. "Reading machines for
the blind" were introduced in the mid-1970s by Kurzweil in the U.S.
and NEC in Japan. In 1978, Texas Instruments introduced the very
popular "Speak & Spell" learning toy, which contained their new
TMC0280 integrated circuit (IC) chip. Laboratory text-to-speech
systems started evolving into commercial services and products, such
as MIT's "Klattalk," introduced in 1983 as "DECTalk." As processors
became more powerful, a host of new synthesizers became available in
software in many world languages. Starting in the late 1980s, large
scale concatenative synthesis (e.g. Sagisaka at ATR) became
progressively more prevalent. The same approach also became popular
for music synthesizers.
Speech recognition has been actively pursued globally by numerous
laboratories in commercial, academic, and government sectors. In 1922,
a sound-activated toy dog named "Rex" (from Elmwood Button Co.) could
be called by name from his doghouse. Small vocabulary recognition was
demonstrated for digits over the telephone by Bell Labs in 1952. At
the Seattle World's Fair in 1962, IBM demonstrated their "Shoebox"
recognizer with 16 words (digits plus command/control words)
interfaced with a mechanical calculator for performing arithmetic
computations by voice. Based on mathematical modeling and optimization
techniques learned at IDA (now the Center for Communications
Research, Princeton), Jim Baker introduced stochastic processing with
Hidden Markov Models (HMMs) to speech recognition while at
Carnegie-Mellon University in 1972. In the same time frame, Jelinek et
al., coming from a background of information theory, independently
developed HMM techniques for speech recognition at IBM. Over the next
10-15 years, as other labs gradually tested, understood, and applied
this methodology, it became the dominant approach to speech
recognition. Recent performance improvements have been achieved
through the incorporation of discriminative training (e.g. Cambridge
University, LIMSI, etc.) and large databases for training.
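
To give a flavor of what decoding with a Hidden Markov Model involves, here is a minimal sketch of the standard Viterbi algorithm on a toy model. Everything specific in it is an assumption made for illustration: the two states ("sil" and "speech"), the two observation symbols ("low" and "high" energy), and all of the probabilities are invented, and real recognizers use far larger models with continuous acoustic features.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state path and its probability."""
    # Each layer maps state -> (best path probability, previous state).
    layers = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev = layers[-1]
        layer = {}
        for s in states:
            p, back = max((prev[r][0] * trans_p[r][s], r) for r in states)
            layer[s] = (p * emit_p[s][o], back)
        layers.append(layer)

    # Trace back from the best final state.
    last = max(states, key=lambda s: layers[-1][s][0])
    path = [last]
    for layer in reversed(layers[1:]):
        path.append(layer[path[-1]][1])
    path.reverse()
    return path, layers[-1][last][0]


# Hypothetical toy model: silence vs. speech, observed as frame energy.
states = ["sil", "speech"]
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {
    "sil": {"sil": 0.7, "speech": 0.3},
    "speech": {"sil": 0.2, "speech": 0.8},
}
emit_p = {
    "sil": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}

path, prob = viterbi(["low", "high", "high", "low"],
                     states, start_p, trans_p, emit_p)
print(path)  # ['sil', 'speech', 'speech', 'sil']
```

The same dynamic-programming idea scales up when the states stand for pieces of phonemes and words rather than a simple silence/speech distinction.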
Starting in the 1970s, government funding agencies throughout the
world (e.g. Alvey, ATR, DARPA, Esprit, etc.) began making a major
impact on expanding and directing speech technology for strategic
purposes. These efforts have resulted in significant advances,
especially for speech recognition, and have created large,
widely available databases in many languages while fostering rigorous
comparative testing and evaluation methodologies.
In the mid-1970s, small vocabulary commercial recognizers utilizing
expensive custom hardware were introduced by Threshold Technology and
NEC, primarily for hands-free industrial applications. In the late
1970s, Verbex (a division of Exxon Enterprises), also using custom
special-purpose hardware systems, was commercializing small vocabulary
applications over the telephone, primarily for telephone toll
management and financial services (e.g. Fidelity fund inquiries). By
the mid-1990s, as computers became progressively more powerful, even
large vocabulary speech recognition applications progressed from
requiring hardware assists to running entirely in software. As
performance and capabilities increased, prices dropped.
In 1990, Dragon Systems introduced a general-purpose discrete
dictation system (i.e. requiring a pause between spoken words), and
in 1997, Dragon started shipping general-purpose continuous speech
dictation systems, allowing any user to speak naturally to their
computer instead of, or in addition to, typing. IBM rapidly followed
suit, as did Lernout & Hauspie (using technology acquired from
Kurzweil Applied Intelligence), Philips, and more recently, Microsoft.
Medical reporting and legal dictation are two of the largest market
segments for this technology. Although intended for use by typical PC
users, this technology has proven especially valuable to disabled or
physically impaired users, including many who suffer from Repetitive
Stress Injury (RSI).
AT&T introduced their automated operator system (recognizing spoken
requests such as "collect call" and "operator") in 1992. In 1996,
Nuance supplied recognition technology to allow customers of Charles
Schwab to get stock quotes and to engage in financial transactions
over the telephone. Similar recognition applications were also
supplied by SpeechWorks. Today, it is possible to book airline
reservations with British Airways, make a train reservation with
Amtrak, and obtain weather forecasts and telephone directory
information, all by using speech recognition technology.
Other important speech technologies include speaker
verification/identification and spoken language learning for both
literacy and interactive foreign language instruction. For information
search and retrieval applications (e.g. audio mining) by voice, large
vocabulary recognition preprocessing has proven highly effective,
preserving acoustic as well as statistical semantic/syntactic
information. This approach also has broad applications for speaker
identification, language identification, etc.
What's Coming:
Computer processing power will continue to increase, with lower costs
for both processor and memory components. The systems that support
even the most sophisticated speech applications will move from
centralized locales (e.g. computer center, or server) to distributed
configurations (i.e. with some processing done local to the user and
the balance done elsewhere), to primarily being located local to the
end user. This trend has been repeated many times (e.g. with
computers, telephones, etc.).
On the research side, a great deal of progress has been made, but a
great deal more remains to be done. Unfortunately, in the wake of the
economic downturn and heavy consolidation of speech technology
companies over the past five years, corporate and government funding
has declined. The technology presently is good enough for certain
products and services to be successfully sold and incrementally
improved. A great deal more opportunity will open up once the
fundamentals of the core technology can be thoroughly explored and
tested (not possible with previous processing limitations) to remove
known sub-optimizations and to enable major new applications.
Experienced researchers are not short of ideas to make fundamental
improvements; they are short of the resources to implement many of
them.
The promise and the opportunities to be realized for speech
technologies, and the time-frames for these, are gated by the
resources available to pursue these ideas. The first beneficiaries of
this new era in speech technology are likely to be the institutions
willing and able to look beyond short-term incremental gains to break
new ground. Until remedied, present performance limitations will
continue to inhibit the utility and commercial returns of products and
services. Nonetheless, some very exciting entrants are on the
near-term horizon!
We can expect that full, general-purpose continuous dictation systems
will become available in a variety of handheld devices. Speech
technologies will be embedded in handheld computers, cell phones,
remote controls, automotive navigation systems, appliances, foreign
language phrase books, toys, and a lot more!
Speech technology will gradually be incorporated into a wide range of
different services and products, becoming progressively more
pervasive. Multiple speech technologies (recognition, synthesis,
verification, etc.) will become increasingly better integrated and
bundled together. More natural language dialog systems with better
user interfaces should mean that many enterprise applications, such as
customer and technical support, can be conducted automatically with
huge cost savings, and eventually, greater customer satisfaction.
Lecture and meeting transcripts, as well as broadcast news and your
favorite TV shows, will be readily searchable by voice. Voice portals
will become better enabled with speech input and output. Speaker
verification will become a more prevalent technology, especially used
in combination with other security protections (passwords, hand
geometry, fingerprints, retinal scans, etc.). More systems will
incorporate natural language capabilities, directed dialogs, and
multilinguality as needed.
You will be able to talk and give orders to the characters in your
video and simulation adventure games. You can expect customized
pronunciation help when you are trying to learn a new foreign language
on your own. Children will be able to get personalized friendly
reading support on their own, as will adults in need of private
literacy instruction. In some stores, bus stations, and street
corners, you will be able to ask roving robot information kiosks for
information! Key components of each of these future
applications have already been demonstrated (at least in prototype
form). Speech isn't just for people anymore!
[By Janet M. Baker
Saras Institute/Dibner Institute at MIT
Published in SpeechTechMag]
Regards,
Team Revolutions