Speech Recognition Machines Master One Thousand Words

In response to incentives offered by the U.S. government and building on concepts from linguistics and mathematics, a group of computer scientists at Carnegie Mellon University designed speech recognition systems that achieved a vocabulary of one thousand words.


Summary of Event

Although speech synthesis was achieved as early as the mid-1930’s by researchers at the American Telephone and Telegraph Company’s (AT&T) Bell Telephone Laboratories, the more difficult task of getting machines to recognize speech was not accomplished until much later. The early achievements in mechanical speech synthesis at Bell Labs were useful, however, in helping researchers to understand the acoustic properties of human speech. Early research efforts in the United States regarding both speech synthesis and speech recognition were funded by government organizations.

In 1952, scientists at Bell Labs created the first system that could recognize speech, but it was limited to the numbers one through ten, spoken in English. Although the recognition of ten isolated words spoken by a single speaker was far from the goal of recognizing complete sentences, it was believed that the system could be useful in contexts in which users needed to keep their hands free while communicating numbers. Almost two decades later, a commercial product with only slightly more advanced capabilities appeared: the VIP 100, made by Threshold Technology. The ten digits were augmented with five control words that could be used to manipulate the numbers, and it was hoped that the system would be useful for entering depth readings at sea. Although these functions were rather limited, government agencies became interested in possible defense applications of the technology and began funding further research in this area.

Over time, it became obvious that solutions to speech recognition would require expertise from several disciplines, including mathematics, computer science, and linguistics as well as acoustics. A basic mechanical problem was the lack of computer hardware powerful enough to capture and store acoustic data, access the data, and perform the calculations needed to match the data against human speech patterns. Other challenges were linguistic, such as the problem of homophones, words that sound alike but have different meanings that can be determined only from context. Even though some limited success had been achieved by having a speaker pause between spoken numbers and by restricting the number of speakers, researchers still faced problems stemming from the lack of segmentation in normal speech and the tremendous variability in tonal qualities and habits of pronunciation.

During the 1960’s, linguists, led by Noam Chomsky, began describing the interpretation of spoken language as an active human faculty that changes and adjusts interpretation during the time of listening rather than existing as a static collection of words and structures. In other words, the mind continuously and actively “learns,” even during the course of casual conversation. As more and more of this complexity was revealed, computer scientists began to view speech recognition as a branch of artificial intelligence (AI).

One of the early centers for the formal study of artificial intelligence was Stanford University, which was able to purchase analog-to-digital conversion hardware for its research programs. Raj Reddy, a Stanford graduate student who was also interested in languages and robotics, saw the potential use of analog-to-digital conversion in pattern recognition, including both vision and speech, and volunteered to work on speech recognition as a project for one of his classes. After he joined the faculty at Carnegie Mellon University in Pittsburgh, Pennsylvania, in 1969, Reddy continued his research into automatic speech recognition.

The first few years of this work were supported by the Advanced Research Projects Agency (ARPA) of the U.S. Department of Defense. In 1971, ARPA created a study group to encourage work in this area and to set goals for a speech recognition system, including a vocabulary of one thousand words, a task-specific syntax, the ability to understand diverse speakers, and an accuracy rate of 90 percent or higher. The study group set a five-year deadline and had a budget of fifteen million dollars. The first two-year phase of the project was open to a relatively large pool of research teams, after which the field would be narrowed to the four most promising teams.

In response to the ARPA goals, Reddy and his graduate students at Carnegie Mellon University (CMU) developed several systems that achieved levels of accuracy and vocabulary sizes far beyond previous attempts. The team’s primary strategy was to reduce the number of steps needed for the computer to eliminate bad matches. The Hearsay I system, built in 1973, kept the Carnegie Mellon group in the running for continued funding. The other three competing teams were government contractors.

By 1976, as the government’s five-year deadline approached, the Carnegie Mellon group had developed three systems, each built around a different kind of search structure with its own advantages and disadvantages, but all of which contributed to subsequent developments in the field. The Hearsay II system was designed for very large vocabularies and higher levels of interpretation, but it took a long time to process the data.

The Dragon system, developed by CMU graduate student James Karl Baker, took a nonsegmented approach in order to handle continuous speech. Baker had proposed a system based on hidden Markov models, statistical models in which hidden states are inferred from observations by way of the probabilities of transitions between states. He believed that these models, named for mathematician Andrey Andreyevich Markov, were suitable for the dynamic, constantly shifting interpretation of audio phenomena, in which groups of spoken words typically transition smoothly from one phoneme (the smallest unit of sound that distinguishes meaning) to the next. The probability of these transitions could be estimated, forming the basis of the search strategy.
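To make the idea concrete, the following sketch shows Viterbi decoding, the standard dynamic-programming method for recovering the most probable sequence of hidden states from a hidden Markov model. This is a minimal illustration of the general technique, not of the Dragon system itself: the two phoneme-like states, the transition and emission probabilities, and the coarse acoustic labels are all invented for the example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for an observation sequence."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            prev, p = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = p * emit_p[s][observations[t]]
            back[t][s] = prev
    # Recover the path by walking the back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))

# Toy model: two phoneme-like states emitting coarse acoustic labels.
states = ["ph_S", "ph_IY"]            # hypothetical phonemes /s/ and /iy/
start_p = {"ph_S": 0.6, "ph_IY": 0.4}
trans_p = {
    "ph_S": {"ph_S": 0.7, "ph_IY": 0.3},
    "ph_IY": {"ph_S": 0.2, "ph_IY": 0.8},
}
emit_p = {
    "ph_S": {"hiss": 0.8, "tone": 0.2},
    "ph_IY": {"hiss": 0.1, "tone": 0.9},
}

print(viterbi(["hiss", "hiss", "tone", "tone"], states, start_p, trans_p, emit_p))
# -> ['ph_S', 'ph_S', 'ph_IY', 'ph_IY']
```

In an actual recognizer the observations would be acoustic feature vectors rather than discrete labels, and the state space would span the phonemes of the entire vocabulary, but the underlying dynamic-programming structure is the same.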

The system that came closest to meeting the ARPA project’s goals was the Harpy system, developed by another CMU graduate student, Bruce Lowerre, together with Reddy. This system used the “beam search” algorithm, in which the results of each step in the search are scored to determine the continued path (or “beam”) of the search. This made the searches more efficient, resulting in a high accuracy score.
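The pruning idea behind beam search can be sketched in a few lines. In the toy example below, each hypothesis is a partial word sequence, and only the few highest-scoring hypotheses survive each step; the successor table and word scores are hypothetical stand-ins for the kind of precompiled network of acoustic and lexical knowledge that Harpy searched.

```python
import heapq

def beam_search(expand, score, beam_width, steps):
    """Grow hypotheses step by step, keeping only the top scorers each step."""
    beam = [((), 0.0)]                     # (hypothesis, cumulative score)
    for _ in range(steps):
        candidates = [
            (nxt, score(nxt))
            for hyp, _ in beam
            for nxt in expand(hyp)         # every one-step extension
        ]
        # Prune: retain only the beam_width highest-scoring hypotheses.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return max(beam, key=lambda c: c[1])

# Hypothetical word lattice: which words may follow which, with log-scores.
word_scores = {"one": -0.1, "won": -0.9, "thousand": -0.2, "hundred": -0.7}
successors = {
    (): ["one", "won"],                    # sentence start
    "one": ["thousand", "hundred"],
    "won": ["thousand", "hundred"],
}

def expand(hyp):
    last = hyp[-1] if hyp else ()
    return [hyp + (w,) for w in successors.get(last, [])]

def score(hyp):
    return sum(word_scores[w] for w in hyp)

print(beam_search(expand, score, beam_width=2, steps=2))
# -> (('one', 'thousand'), -0.30000000000000004)
```

The beam width sets the trade-off: a narrow beam is fast but may prune the correct path too early, while a very wide beam approaches an exhaustive search. Harpy demonstrated that an aggressively pruned search over a precompiled network could still deliver high accuracy.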

Outside Carnegie Mellon University, only one other competing team was able to complete a viable system: the Hear What I Mean (HWIM) system, developed by William Woods and others at Bolt Beranek and Newman (now known as BBN Technologies) in Cambridge, Massachusetts. Although the HWIM system was innovative, its designers ran out of time to refine it, and its speed and accuracy were exceeded by the CMU systems.

Although the government’s evaluators agreed that CMU’s systems had achieved some of ARPA’s goals, including the one-thousand-word vocabulary, some within ARPA had doubts about the short-term feasibility of practical applications, and funding ceased for a few years. Important developments were occurring in related fields, however. In the same year that ARPA evaluated the speech recognition systems, Ray Kurzweil announced an important achievement in a related aspect of human-computer interaction: a computer-based system that could read printed text aloud to the blind. Like Reddy, Kurzweil was interested in optical pattern recognition and artificial intelligence.



Significance

After the ARPA project ended, James Baker, who had developed the Dragon system at CMU, continued his work on speech recognition while working for IBM and other organizations, in collaboration with his wife, Janet Baker, a biophysicist who shared his research interests. Convinced that a commercial product could be developed, they founded Dragon Systems in 1982 and took full advantage of the new personal computer industry to introduce the first voice recognition system available to consumers, running on the Apple II computer. As the capabilities of computer hardware expanded rapidly, the Bakers were able to increase the speed, accuracy, and vocabulary of their software. By 1989, their DragonDictate system achieved a vocabulary of thirty thousand words. Many competing commercial products appeared, including systems designed by IBM and by Kurzweil Applied Intelligence.

Carnegie Mellon University continued its research programs for speech recognition in the 1980’s, and government funding was restored. New heights of accuracy and speed were achieved, especially by the Sphinx system, designed by Kai-Fu Lee, Raj Reddy, Roni Rosenfeld, Xuedong Huang, and others. The CMU researchers shared their methods with the scientific community at large, contributing to similar programs at other universities around the world. By the 1990’s, the Language Technologies Institute (LTI) was established at CMU as the importance and interdisciplinary character of the field became increasingly clear.

In subsequent decades, mechanical voice recognition branched into many high-pressure fields that rely on specialized vocabulary, including medicine, law, automated phone answering systems, and stock trading. Consumers gradually became more familiar and comfortable with the experience of having their spoken words recognized by computers.



Further Reading

  • Baig, Edward C. “Teaching Personal Computers to Listen.” Fortune, December, 1985, 121. Describes a thousand-word-capable consumer system introduced by IBM in 1985. This system was based on one of the CMU prototypes.
  • Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, N.J.: Prentice Hall, 2000. Comprehensive college-level textbook is organized into four main sections covering words, syntax, semantics, and pragmatics. Includes diagrams, charts, and tables.
  • Robinson, Arthur L. “Communicating with Computers by Voice.” Science 203 (February, 1979): 734-736. Provides clear comparative description of the seminal speech recognition projects funded by ARPA, including brief technical explanations of each.
  • Weinschenk, Susan, and Dean T. Barker. Designing Effective Speech Interfaces. New York: John Wiley & Sons, 2000. Focuses on the steps to take in project design and on human factors rather than on engineering and mathematics. Includes tables, glossary, bibliography, and index.


See also

Intel Introduces the First “Computer on a Chip”

First Cray-1 Supercomputer Is Shipped to the Los Alamos National Laboratory

Apple II Becomes the First Successful Preassembled Personal Computer

IBM Introduces Its Personal Computer

IBM and Apple Agree to Make Compatible Computers

Sun Microsystems Introduces Java