This paper proposes a novel method for learning probability models of subcategorization preference of verbs. We consider the issues of case dependencies and noun class generalization in a uniform way by employing the maximum entropy modeling method. We also propose a new model selection algorithm which starts from the most general model and gradually examines more specific models.
In the quest for knowledge, it is not uncommon for researchers to push the limits
of simulation techniques to the point where they have to be adapted or totally new
techniques or approaches become necessary. True multiscale modeling techniques
are becoming increasingly necessary given the growing interest in materials and
processes on which large-scale properties are dependent or that can be tuned by their
low-scale properties. An example would be nanocomposites, where embedded nanostructures
completely change the matrix properties due to effects occurring at the
User simulations are shown to be useful in spoken dialog system development. Since most current user simulations deploy probability models to mimic human user behaviors, how to set up user action probabilities in these models is a key problem to solve. One generally used approach is to estimate these probabilities from human user data. However, when building a new dialog system, usually no data or only a small amount of data is available.
Language models for speech recognition typically use a probability model of the form $\Pr(a_n \mid a_1, a_2, \ldots, a_{n-1})$. Stochastic grammars, on the other hand, are typically used to assign structure to utterances. A language model of the above form is constructed from such grammars by computing the prefix probability $\sum_{w \in \Sigma^*} \Pr(a_1 \cdots a_n w)$, where $w$ represents all possible terminations of the prefix $a_1 \cdots a_n$. The main result in this paper is an algorithm to compute such prefix probabilities given a stochastic Tree Adjoining Grammar (TAG). The algorithm achieves the required computation in $O(n^6)$ time. ...
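As a brief illustration of why prefix probabilities suffice to build such a language model, the conditional probability of the next word can be written as a ratio of two prefix probabilities. This is a standard identity rather than something stated in the abstract, and the alphabet symbol $\Sigma$ is assumed notation here:

```latex
\Pr(a_n \mid a_1, \ldots, a_{n-1})
  = \frac{\sum_{w \in \Sigma^*} \Pr(a_1 \cdots a_{n-1}\, a_n\, w)}
         {\sum_{w \in \Sigma^*} \Pr(a_1 \cdots a_{n-1}\, w)}
```

The numerator is the prefix probability of $a_1 \cdots a_n$ and the denominator is that of $a_1 \cdots a_{n-1}$, so two prefix computations yield one language-model probability.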
This paper explores the use of clickthrough data for query spelling correction. First, large amounts of query-correction pairs are derived by analyzing users' query reformulation behavior encoded in the clickthrough data. Then, a phrase-based error model that accounts for the transformation probability between multi-term phrases is trained and integrated into a query speller system.
This paper compares a number of generative probability models for a wide-coverage Combinatory Categorial Grammar (CCG) parser. These models are trained and tested on a corpus obtained by translating the Penn Treebank trees into CCG normal-form derivations. According to an evaluation of unlabeled word-word dependencies, our best model achieves a performance of 89.9%, comparable to the figures given by Collins (1999) for a linguistically less expressive grammar. In contrast to Gildea (2001), we find a significant improvement from modeling word-word dependencies. ...
Econometricians, as well as other scientists, are engaged in learning from their
experience and data - a fundamental objective of science. Knowledge so obtained
may be desired for its own sake, for example to satisfy our curiosity about aspects
of economic behavior and/or for use in solving practical problems, for example
to improve economic policymaking. In the process of learning from experience
and data, description and generalization both play important roles.
There are many books written about statistics, some brief, some detailed, some humorous, some
colorful, and some quite dry. Each of these texts is designed for a specific audience. Too often, texts
about statistics have been rather theoretical and intimidating for those not practicing statistical
analysis on a routine basis. Thus, many engineers and scientists, who need to use statistics much
more frequently than calculus or differential equations, lack sufficient knowledge of the use of
Continuing improvements led to the furnace and bellows and provided the ability to smelt and forge native metals (those occurring naturally in relatively pure form). Gold, copper, silver, and lead were such early metals. The advantages of copper tools over stone, bone, and wooden tools were quickly apparent to early humans, and native copper was probably used from near the beginning of Neolithic times (about 8000 BC). Native copper does not naturally occur in large amounts, but copper ores are quite common, and some of them produce metal easily when burned in wood or charcoal fires.
This paper presents an algorithm for learning the probabilities of optional phonological rules from corpora. The algorithm is based on using a speech recognition system to discover the surface pronunciations of words in speech corpora; using an automatic system obviates expensive phonetic labeling by hand. We describe the details of our algorithm and show the probabilities the system has learned for ten common phonological rules which model reductions and coarticulation effects.
We investigate a number of simple methods for improving the word-alignment accuracy of IBM Model 1. We demonstrate reduction in alignment error rate of approximately 30% resulting from (1) giving extra weight to the probability of alignment to the null word, (2) smoothing probability estimates for rare words, and (3) using a simple heuristic estimation method to initialize, or replace, EM training of model parameters.
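For readers unfamiliar with the baseline, the following is a minimal sketch of plain IBM Model 1 EM training with a single null word. The `null_weight` parameter is a hypothetical illustration of point (1): it simply scales the null word's score, which is one simple way to favor null alignments and not necessarily the paper's formulation. The function names, the toy corpus, and the uniform initialization are likewise assumptions; the smoothing and heuristic initialization in points (2) and (3) are not shown.

```python
from collections import defaultdict

NULL = "<null>"  # hypothetical token standing in for the null word


def train_ibm1(pairs, iterations=10, null_weight=1.0):
    """Estimate t[(f, e)] = Pr(target word f | source word e) with EM.

    pairs: list of (source_tokens, target_tokens).
    null_weight: assumed multiplier on the null word's score (1.0 = plain Model 1).
    """
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # uniform start, as in standard Model 1

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links from e
        for src, tgt in pairs:
            src_null = [NULL] + list(src)
            for f in tgt:
                # E-step: posterior over which source position generated f
                scores = [
                    (null_weight if e == NULL else 1.0) * t[(f, e)]
                    for e in src_null
                ]
                z = sum(scores)
                for e, s in zip(src_null, scores):
                    count[(f, e)] += s / z
                    total[e] += s / z
        # M-step: renormalize expected counts per source word
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


def align(src, tgt, t, null_weight=1.0):
    """Return, for each target word, the best source index (0 means null)."""
    src_null = [NULL] + list(src)
    links = []
    for f in tgt:
        best = max(
            range(len(src_null)),
            key=lambda i: (null_weight if i == 0 else 1.0) * t[(f, src_null[i])],
        )
        links.append(best)
    return links


if __name__ == "__main__":
    toy = [
        (["the", "house"], ["das", "Haus"]),
        (["the", "book"], ["das", "Buch"]),
        (["a", "book"], ["ein", "Buch"]),
    ]
    table = train_ibm1(toy)
    print(align(["the", "house"], ["das", "Haus"], table))
```

Raising `null_weight` above 1.0 makes every target word more willing to align to the null word, which is the kind of adjustment the abstract credits with part of the error reduction.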
We present the PONG method to compute selectional preferences using part-of-speech (POS) N-grams. From a corpus labeled with grammatical dependencies, PONG learns the distribution of word relations for each POS N-gram. From the much larger but unlabeled Google N-grams corpus, PONG learns the distribution of POS N-grams for a given pair of words. We derive the probability that one word has a given grammatical relation to the other. PONG estimates this probability by combining both distributions, whether or not either word occurs in the labeled corpus. ...
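One plausible way to read "combining both distributions" is as a marginalization over POS N-grams. The formula below is only a sketch under that assumption: the symbols $r$ (grammatical relation), $g$ (POS N-gram), and $w_1, w_2$ (the word pair) are introduced here for illustration, and the abstract does not confirm this exact form.

```latex
\Pr(r \mid w_1, w_2) \;\approx\; \sum_{g} \Pr(r \mid g)\, \Pr(g \mid w_1, w_2)
```

Under this reading, $\Pr(r \mid g)$ would come from the dependency-labeled corpus and $\Pr(g \mid w_1, w_2)$ from the unlabeled N-gram corpus, which would explain why neither word needs to occur in the labeled data.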
Several attempts have been made to learn phrase translation probabilities for phrase-based statistical machine translation that go beyond pure counting of phrases in word-aligned training data. Most approaches report problems with overfitting. We describe a novel leaving-one-out approach to prevent overfitting that allows us to train phrase models that show improved translation performance on the WMT08 Europarl German-English task.
We propose a statistical method that finds the maximum-probability segmentation of a given text. This method does not require training data because it estimates probabilities from the given text itself. Therefore, it can be applied to any text in any domain. An experiment showed that the method is at least as accurate as a state-of-the-art text segmentation system.
Language modeling assigns a prior probability to a sequence of words and is a key part of many natural language applications such as speech recognition and statistical machine translation. In this paper, we present a language model based on a simple kind of dependency grammar. The grammar consists of head-dependent relations between words and can be learned automatically from a raw corpus using the reestimation algorithm that is also introduced in this paper. Our experiments show that the proposed model performs better than n-gram models at 11% to 11.
PCFGs can be accurate, they suffer from vocabulary coverage problems: treebanks are small and lexicons induced from them are limited. The reason for this treebank-centric view in PCFG learning is threefold: the English treebank is fairly large and English morphology is fairly simple, so that in English, the treebank does provide mostly adequate lexical coverage; lexicons enumerate analyses but do not provide probabilities for them; and, most importantly, the treebank and the external lexicon are likely to follow different annotation schemas, reflecting different linguistic perspectives.
We describe a novel method that extracts paraphrases from a bitext, for both the source and target languages. In order to reduce the search space, we decompose the phrase-table into sub-phrase-tables and construct separate clusters for source and target phrases. We convert the clusters into graphs, add smoothing/syntactic-information-carrier vertices, and compute the similarity between phrases with a random walk-based measure, the commute time.
In this paper, we extend the work on using latent cross-language topic models for identifying word translations across comparable corpora. We present a novel precision-oriented algorithm that relies on per-topic word distributions obtained by the bilingual LDA (BiLDA) latent topic model. The algorithm aims at harvesting only the most probable word translations across languages in a greedy fashion, without any prior knowledge about the language pair, relying on a symmetrization process and the one-to-one constraint.
In data-oriented language processing, an annotated language corpus is used as a stochastic grammar. The most probable analysis of a new sentence is constructed by combining fragments from the corpus in the most probable way. This approach has been successfully used for syntactic analysis, using corpora with syntactic annotations such as the Penn Treebank. If a corpus with semantically annotated sentences is used, the same approach can also generate the most probable semantic interpretation of an input sentence. The present paper explains this semantic interpretation method. ...