This thesis examines how artificial neural networks can benefit a large vocabulary, speaker
independent, continuous speech recognition system. Currently, most speech recognition
systems are based on hidden Markov models (HMMs), a statistical framework that supports
both acoustic and temporal modeling. Despite their state-of-the-art performance, HMMs
make a number of suboptimal modeling assumptions that limit their potential effectiveness.
Enhancing Windows Speech Recognition with Macros
In this article, we discuss Windows Speech Recognition and how to extend its functionality using macros. We will learn how to create macros that perform tasks such as inserting predefined blocks of text, running programs with specific parameters, and sending keystrokes to arbitrary applications.
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition.
This book focuses primarily on speech recognition and related tasks such as speech enhancement and modeling. It comprises three sections and thirteen chapters written by eminent researchers from the USA, Brazil, Australia, Saudi Arabia, Japan, Ireland, Taiwan, Mexico, Slovakia and India. Section 1, on speech recognition, consists of seven chapters. Sections 2 and 3, on speech enhancement and speech modeling, contain three chapters each, supplementing Section 1.
Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information-rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word-based systems; new words can then be represented by combinations of sub-word units. Previous work heuristically created the sub-word lexicon from phonetic representations of text, using simple statistics to select common phone sequences. ...
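The "simple statistics" step described above can be sketched as counting phone n-grams over a phonetic corpus and keeping the most frequent ones as candidate sub-word units. The function name, the toy corpus, and the parameters below are illustrative assumptions, not the paper's actual procedure.

```python
from collections import Counter

def select_subword_units(phone_seqs, n=3, top_k=2):
    """Count all phone n-grams in a phonetic corpus and keep the most
    frequent ones as candidate sub-word units (a crude stand-in for the
    'common phone sequence' statistics mentioned in the abstract)."""
    counts = Counter()
    for seq in phone_seqs:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return [unit for unit, _ in counts.most_common(top_k)]

# Toy phonetic corpus (phone symbols are illustrative, not a real lexicon).
corpus = [
    ["t", "eh", "l", "ah", "f", "ow", "n"],
    ["t", "eh", "l", "ah", "g", "r", "ae", "f"],
    ["f", "ow", "n", "eh", "m"],
]
units = select_subword_units(corpus, n=3, top_k=2)
```

Shared prefixes like ("t", "eh", "l") surface as reusable units; a real system would also weight by coverage and unit length.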
Correcting errors in speech recognition output is important for increasing the performance of a speech translation system. This paper proposes a method for correcting errors using the statistical features of character co-occurrence, and evaluates the method. The proposed method comprises two successive correcting processes. The first process uses pairs of strings: the first string is an erroneous substring of the utterance predicted by speech recognition, and the second is the corresponding section of the actual utterance.
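A minimal sketch of applying such string pairs, assuming the (erroneous substring, actual substring) pairs have already been learned: the `correct` function and the example pairs are invented for illustration and ignore the character co-occurrence statistics the paper actually uses.

```python
def correct(hypothesis, correction_pairs):
    """Replace erroneous substrings with their corrections, longest
    errors first so overlapping rules do not clobber each other."""
    for wrong, right in sorted(correction_pairs, key=lambda p: -len(p[0])):
        hypothesis = hypothesis.replace(wrong, right)
    return hypothesis

# Hypothetical learned pairs: (recognizer output substring, actual substring).
pairs = [("recognice", "recognize"), ("speach", "speech")]
fixed = correct("speach recognice systems", pairs)
```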
We demonstrate that transformation-based learning can be used to correct noisy speech recognition transcripts in the lecture domain with an average word error rate reduction of 12.9%. Our method is distinguished from earlier related work by its robustness to small amounts of training data, and its resulting efficiency, in spite of its use of true word error rate computations as a rule scoring function.
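The "true word error rate computations" used as the rule scoring function can be reproduced with a standard word-level Levenshtein distance; this is a generic WER implementation, not the paper's code.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with word-level Levenshtein distance."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(r)][len(h)] / len(r)
```

Scoring each candidate rule by the WER change it induces on held-out transcripts is what makes the search expensive, hence the abstract's emphasis on efficiency.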
We describe a method for discriminative training of a language model that makes use of syntactic features. We follow a reranking approach, where a baseline recogniser is used to produce 1000-best output for each acoustic input, and a second “reranking” model is then used to choose an utterance from these 1000-best lists. The reranking model makes use of syntactic features together with a parameter estimation method that is based on the perceptron algorithm. We describe experiments on the Switchboard speech recognition task. ...
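A reranking perceptron of the kind described can be sketched as follows, assuming each n-best candidate is reduced to a feature dictionary and the oracle (lowest-WER) candidate is listed first; the feature names and the baseline initialization are illustrative assumptions, not the paper's feature set.

```python
def score(weights, feats):
    """Linear score of a candidate's feature dictionary."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def train_reranker(nbest_lists, init=None, epochs=10):
    """Perceptron updates: when the current weights prefer a candidate
    other than the oracle (index 0), move the weights toward the oracle's
    features and away from the wrongly preferred candidate's."""
    w = dict(init or {})
    for _ in range(epochs):
        for cands in nbest_lists:
            predicted = max(cands, key=lambda f: score(w, f))
            oracle = cands[0]
            if predicted is not oracle:
                for k, v in oracle.items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in predicted.items():
                    w[k] = w.get(k, 0.0) - v
    return w

# Toy n-best list: 'syntax_ok' and 'lm' are invented feature names.
nbest = [[{"syntax_ok": 1.0, "lm": 0.2}, {"syntax_ok": 0.0, "lm": 0.9}]]
w = train_reranker(nbest, init={"lm": 1.0}, epochs=5)
```

After training, the syntactic feature outweighs the baseline language-model score whenever the two disagree on this toy data.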
Large vocabulary continuous speech recognition of inflective languages, such as Czech, Russian or Serbo-Croatian, is heavily deteriorated by an excessive out-of-vocabulary rate. In this paper, we tackle the problem of vocabulary selection, language modeling and pruning for inflective languages. We show that by explicit reduction of the out-of-vocabulary rate we can achieve significant improvements in recognition accuracy while almost preserving the model size. Reported results are on Czech speech corpora. ...
This paper describes recent progress in speech recognition technology and the author's perspectives on it. Applications of speech recognition technology can be classified into two main areas: dictation and human-computer dialogue systems. In the dictation domain, automatic broadcast news transcription is now actively investigated, especially under the DARPA project.
Integration of language constraints into a large vocabulary speech recognition system often leads to prohibitive complexity. We propose to factor the constraints into two components. The first is characterized by a covering grammar which is small and easily integrated into existing speech recognizers. The recognized string is then decoded by means of an efficient language post-processor in which the full set of constraints is imposed to correct possible errors introduced by the speech recognizer. ...
This paper addresses two issues concerning lexical access in connected speech recognition: 1) the nature of the pre-lexical representation used to initiate lexical lookup 2) the points at which lexical look-up is triggered off this representation. The results of an experiment are reported which was designed to evaluate a number of access strategies proposed in the literature in conjunction with several plausible pre-lexical representations of the speech input.
We introduce a novel mechanism for incorporating articulatory dynamics into speech recognition with the theory of task dynamics. This system reranks sentence-level hypotheses by the likelihoods of their hypothetical articulatory realizations, which are derived from relationships learned with aligned acoustic/articulatory data.
Grounded language models represent the relationship between words and the non-linguistic context in which they are said. This paper describes how they are learned from large corpora of unlabeled video, and are applied to the task of automatic speech recognition of sports video. Results show that grounded language models improve perplexity and word error rate over text based language models, and further, support video information retrieval better than human generated speech transcriptions.
While speech recognition systems have come a long way in the last thirty years, there is still room for improvement. Although readily available, these systems are sometimes inaccurate and insufficient. The research presented here outlines a technique called Distributed Listening which demonstrates noticeable improvements to existing speech recognition methods. The Distributed Listening architecture introduces the idea of multiple, parallel, yet physically separate automatic speech recognizers called listeners. Distributed Listening also uses a piece of middleware called an interpreter.
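A toy version of the interpreter's job, merging hypotheses from several parallel listeners, can be sketched as per-position majority voting; this assumes the transcripts are already aligned and of equal length, and the real Distributed Listening interpreter is more sophisticated than this.

```python
from collections import Counter

def interpret(listener_outputs):
    """Merge hypotheses from parallel listeners by per-position majority
    vote; assumes the transcripts are pre-aligned and of equal length."""
    merged = []
    for words in zip(*(t.split() for t in listener_outputs)):
        merged.append(Counter(words).most_common(1)[0][0])
    return " ".join(merged)

listeners = [
    "turn on the light",   # listener near the speaker
    "turn on the night",   # listener with one substitution error
    "burn on the light",   # listener with a different error
]
consensus = interpret(listeners)
```

Because the two listeners err on different words, the vote recovers a transcript that neither recognizer alone produced correctly.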
Speech recognition problems are a reality in current spoken dialogue systems. In order to better understand these phenomena, we study dependencies between speech recognition problems and several higher-level dialogue factors that define our notion of student state: frustration/anger, certainty and correctness. We apply chi-square (χ²) analysis to a corpus of speech-based computer tutoring dialogues to discover these dependencies both within and across turns.
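The χ² statistic for one such dependency (say, recognition problems versus frustration) can be computed from a 2×2 contingency table; the counts below are made up for illustration and do not come from the paper's corpus.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table, e.g.
    rows = recognition problem / no problem,
    cols = frustrated turn / non-frustrated turn."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Made-up counts: recognition problems co-occurring with frustration.
stat = chi_square_2x2([[30, 10], [10, 30]])
```

With one degree of freedom, a statistic this large is far beyond the usual 3.84 significance threshold, i.e. the two factors would be judged dependent.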
This paper proposes a named entity recognition (NER) method for speech recognition results that uses automatic speech recognition (ASR) confidence as a feature. The ASR confidence feature indicates whether each word has been correctly recognized. The NER model is trained using ASR results with named entity (NE) labels as well as the corresponding transcriptions with NE labels.
Speech recognition in many morphologically rich languages suffers from a very high out-of-vocabulary (OOV) ratio. Earlier work has shown that vocabulary decomposition methods can practically solve this problem for a subset of these languages. This paper compares various vocabulary decomposition approaches to open vocabulary speech recognition, using Estonian speech recognition as a benchmark. Comparisons are performed utilizing large models of 60000 lexical items and smaller vocabularies of 5000 items. ...
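The effect of vocabulary decomposition on OOV rate can be illustrated with a greedy longest-match splitter over a small unit inventory; the Estonian-flavored example word and the unit set are illustrative assumptions, not the paper's morph inventory or algorithm.

```python
def oov_rate(tokens, vocab):
    """Fraction of running tokens not covered by the vocabulary."""
    return sum(t not in vocab for t in tokens) / len(tokens)

def decompose(token, units):
    """Greedy longest-match split of a word into sub-word units, falling
    back to single characters (a crude stand-in for morph decomposition)."""
    parts, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in units:
                parts.append(token[i:j])
                i = j
                break
        else:
            parts.append(token[i])
            i += 1
    return parts

units = {"maja", "de", "le"}          # hypothetical morph-like units
parts = decompose("majadele", units)  # an unseen inflected word form
```

The whole word form is OOV for a word-based vocabulary, but every unit of its decomposition is in-vocabulary, which is exactly how decomposition lowers the OOV ratio for morphologically rich languages.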
This paper presents the application of WordNet-based semantic relatedness measures to Automatic Speech Recognition (ASR) in multi-party meetings. Different word-utterance context relatedness measures and utterance-coherence measures are defined and applied to the rescoring of N-best lists. No significant improvements in terms of word error rate (WER) are achieved compared to a large word-based n-gram baseline model. We discuss our results and their relation to other work that achieved an improvement with such models for simpler tasks. ...