Classes in language modeling

Xem 1-20 trên 95 kết quả Classes in language modeling
  • Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation. This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events.

    pdf10p hongdo_1 12-04-2013 27 2   Download

  • In statistical language modeling, one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes. In this paper we investigate the effects of applying such a technique to higherorder n-gram models trained on large corpora.

    pdf8p hongphan_1 15-04-2013 31 1   Download

  • Active Learning (AL) is typically initialized with a small seed of examples selected randomly. However, when the distribution of classes in the data is skewed, some classes may be missed, resulting in a slow learning progress. Our contribution is twofold: (1) we show that an unsupervised language modeling based technique is effective in selecting rare class examples, and (2) we use this technique for seeding AL and demonstrate that it leads to a higher learning rate. The evaluation is conducted in the context of word sense disambiguation. ...

    pdf5p hongdo_1 12-04-2013 45 5   Download

  • We present a novel probabilistic classifier, which scales well to problems that involve a large number of classes and require training on large datasets. A prominent example of such a problem is language modeling. Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. The predictions of individual features can then be combined according to their predictive strength, resulting in a model, whose parameters can be reliably and efficiently estimated.

    pdf6p hongdo_1 12-04-2013 36 3   Download

  • In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid a data sparseness problem for spoken language in that it is difficult to collect training data. The Multi-Class Composite N-gram maintains an accurate word prediction capability and reliability for sparse data with a compact model size based on multiple word clusters, called MultiClasses. In the Multi-Class, the statistical connectivity at each position of the N-grams is regarded as word attributes, and one word cluster each is created to represent the positional attributes. ...

    pdf8p bunrieu_1 18-04-2013 47 3   Download

  • When automatically translating from a weakly inflected source language like English to a target language with richer grammatical features such as gender and dual number, the output commonly contains morpho-syntactic agreement errors. To address this issue, we present a target-side, class-based agreement model.

    pdf10p nghetay_1 07-04-2013 41 2   Download

  • In this paper, we address statistical machine translation of public conference talks. Modeling the style of this genre can be very challenging given the shortage of available in-domain training data. We investigate the use of a hybrid LM, where infrequent words are mapped into classes. Hybrid LMs are used to complement word-based LMs with statistics about the language style of the talks. Extensive experiments comparing different settings of the hybrid LM are reported on publicly available benchmarks based on TED talks, from Arabic to English and from English to French.

    pdf10p bunthai_1 06-05-2013 32 2   Download

  • We demonstrate an open-source natural language generation engine that produces descriptions of entities and classes in English and Greek from OWL ontologies that have been annotated with linguistic and user modeling information expressed in RDF . We also demonstrate an accompanying plug-in for the Prot´ g´ ontology editor, e e which can be used to create the ontology’s annotations and generate previews of the resulting texts by invoking the generation engine.

    pdf4p bunthai_1 06-05-2013 38 2   Download

  • In statistical natural language processing we always face the problem of sparse data. One way to reduce this problem is to group words into equivalence classes which is a standard method in statistical language modeling. In this paper we describe a method to determine bilingual word classes suitable for statistical machine translation. We develop an optimization criterion based on a maximumlikelihood approach and describe a clustering algorithm. We will show that the usage of the bilingual word classes we get can improve statistical machine translation. ...

    pdf6p bunthai_1 06-05-2013 33 2   Download

  • We present a syntax-based statistical translation model. Our model transforms a source-language parse tree into a target-language string by applying stochastic operations at each node. These operations capture linguistic differences such as word order and case marking. Model parameters are estimated in polynomial time using an EM algorithm. The model produces word alignments that are better than those produced by IBM Model 5. is conditioned only on word classes and positions in the string, and the duplication and translation are conditioned only on the word identity. ...

    pdf8p bunrieu_1 18-04-2013 48 4   Download

  • Conversation between two people is usually of MIXED-INITIATIVE, with CONTROL over the conversation being transferred from one person to another. We apply a set of rules for the transfer of control to 4 sets of dialogues consisting of a total of 1862 turns. The application of the control rules lets us derive domain-independent discourse structures. The derived structures indicate that initiative plays a role in the structuring of discourse.

    pdf9p bungio_1 03-05-2013 27 3   Download

  • Transliteration, a rich source of proper noun spelling variations, is usually recognized by phonetic- or spelling-based models. However, a single model cannot deal with different words from different language origins, e.g., “get” in “piaget” and “target.” Li et al. (2007) propose a method which explicitly models and classifies the source language origins and switches transliteration models accordingly. This model, however, requires an explicitly tagged training set with language origins. ...

    pdf5p hongdo_1 12-04-2013 31 2   Download

  • Since the early Sixties and Seventies it has been known that the regular and context-free languages are characterized by definability in the monadic second-order theory of certain structures. More recently, these descriptive characterizations have been used to obtain complexity results for constraint- and principle-based theories of syntax and to provide a uniform model-theoretic framework for exploring the relationship between theories expressed in disparate formal terms.

    pdf5p bunrieu_1 18-04-2013 41 2   Download

  • A natural next step in the evolution of constraint-based grammar formalisms from rewriting formalisms is to abstract fully away from the details of the grammar mechanism--to express syntactic theories purely in terms of the properties of the class of structures they license.

    pdf7p bunmoc_1 20-04-2013 47 2   Download

  • At least since Chomsky, the usual response to the projection problem has been to characterize knowledge of language as a grammar, and then proceed by restricting so severely the class of grammars available for acquisition that the induction task is greatly simplified - perhaps trivialized. The work reported here describes an implemented LISP program that explicitly reproduces this methodological approach to acquisitio,~ - but in a computational setting.

    pdf6p bungio_1 03-05-2013 35 2   Download

  • The Layered Domain Class system (LDC) is an experimental natural language processor being developed at Duke University which reached the prototype stage in M a y of 1983. Its primary goals are (I) to provide English-language retrieval capabilities for structured but unnormaUzed data files created by the user, (2) to allow very complex semantics, in terms of the information directly available from the physical data file; and (3) to enable users to customize the system to operate with new types of data. In this paper we shall discuss (a) the types of modifiers LDC provides for; (b) h o...

    pdf5p bungio_1 03-05-2013 42 2   Download

  • The choice of verb features is crucial for the learning of verb classes. This paper presents clustering experiments on 168 German verbs, which explore the relevance of features on three levels of verb description, purely syntactic frame types, prepositional phrase information and selectional preferences. In contrast to previous approaches concentrating on the sparse data problem, we present evidence for a linguistically defined limit on the usefulness of features which is driven by the idiosyncratic properties of the verbs and the specific attributes of the desired verb classification. ...

    pdf8p bunthai_1 06-05-2013 23 2   Download

  • Patent translation is a complex problem due to the highly specialized technical vocabulary and the peculiar textual structure of patent documents. In this paper we analyze patents along the orthogonal dimensions of topic and textual structure. We view different patent classes and different patent text sections such as title, abstract, and claims, as separate translation tasks, and investigate the influence of such tasks on machine translation performance.

    pdf11p bunthai_1 06-05-2013 30 2   Download

  • In previous work, supertag disambiguation has been presented as a robust, partial parsing technique. In this paper we present two approaches: contextual models, which exploit a variety of features in order to improve supertag performance, and class-based models, which assign sets of supertags to words in order to substantially improve accuracy with only a slight increase in ambiguity.

    pdf8p bunthai_1 06-05-2013 27 2   Download

  • In the past the evaluation of machine translation systems has focused on single system evaluations because there were only few systems available. But now there are several commercial systems for the same language pair. This requires new methods of comparative evaluation. In the paper we propose a black-box method for comparing the lexical coverage of MT systems. The method is based on lists of words from different frequency classes. It is shown how these word lists can be compiled and used for testing. We also present the results of using our method on 6 MT systems that translate...

    pdf8p bunthai_1 06-05-2013 36 2   Download



p_strKeyword=Classes in language modeling

nocache searchPhinxDoc


Đồng bộ tài khoản