We present a statistical model of Japanese unknown words consisting of a set of length and spelling models classified by the character types that constitute a word. The idea is simple: different character types should be treated differently, and transitions between character types are highly informative, because Japanese script has both ideograms like Chinese characters (kanji) and phonograms like English letters (katakana). Both word segmentation accuracy and part-of-speech tagging accuracy are improved by the proposed model. ...
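To illustrate the character-type idea, here is a minimal Python sketch that classifies characters by standard Unicode blocks and splits a word into maximal runs of one type. The Unicode ranges are standard; the run-splitting helper is our own illustration of conditioning on type changes, not the paper's actual length/spelling model.

```python
# Classify Japanese characters by Unicode block and split a word into
# maximal same-type runs. Illustrative only, not the paper's model.
def char_type(ch: str) -> str:
    """Classify a character as kanji, hiragana, katakana, or other."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:    # CJK Unified Ideographs
        return "kanji"
    if 0x3040 <= code <= 0x309F:    # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:    # Katakana block
        return "katakana"
    return "other"

def type_runs(word: str) -> list[tuple[str, str]]:
    """Split a word into maximal runs of a single character type,
    so a model can condition on type transitions inside the word."""
    runs: list[tuple[str, str]] = []
    for ch in word:
        t = char_type(ch)
        if runs and runs[-1][0] == t:
            runs[-1] = (t, runs[-1][1] + ch)
        else:
            runs.append((t, ch))
    return runs
```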
We present several unsupervised statistical models for the prepositional phrase attachment task that approach the accuracy of the best supervised methods for this task. Our unsupervised approach uses a heuristic based on attachment proximity and trains from raw text that is annotated with only part-of-speech tags and morphological base forms, as opposed to attachment information. It is therefore less resource-intensive and more portable than previous corpus-based algorithms proposed for this task. ...
Traditional concatenative speech synthesis systems use a number of heuristics to define the target and concatenation costs, essential for the design of the unit selection component. In contrast to these approaches, we introduce a general statistical modeling framework for unit selection inspired by automatic speech recognition. Given appropriate data, techniques based on that framework can result in a more accurate unit selection, thereby improving the general quality of a speech synthesizer. They can also lead to a more modular and a substantially more efficient system. ...
This paper presents a novel statistical model for the automatic identification of English baseNP. It uses two steps: N-best Part-Of-Speech (POS) tagging and baseNP identification given the N-best POS sequences. Unlike other approaches, where the two steps are separated, we integrate them into a unified statistical framework. Our model also integrates lexical information. Finally, the Viterbi algorithm is applied to perform a global search over the entire sentence, allowing us to obtain linear complexity for the entire process. ...
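Since the abstract does not spell out the model's internals, the following is only a generic Viterbi sketch showing how a single global search yields the best tag sequence in time linear in sentence length; the states, transition scores, and emission scores are placeholders, not the paper's integrated baseNP/POS model.

```python
# Generic Viterbi decoder: O(n * |states|^2), i.e. linear in sentence
# length. Scores are log probabilities supplied by the caller.
def viterbi(obs, states, log_init, log_trans, log_emit):
    """Find the highest-scoring state sequence for `obs`.
    log_init[s], log_trans[(s, t)], log_emit[(s, o)] are log probs."""
    V = [{s: log_init[s] + log_emit[(s, obs[0])] for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptrs = {}, {}
        for t in states:
            best_s = max(states, key=lambda s: V[-1][s] + log_trans[(s, t)])
            scores[t] = V[-1][best_s] + log_trans[(best_s, t)] + log_emit[(t, o)]
            ptrs[t] = best_s
        V.append(scores)
        back.append(ptrs)
    # Trace back the best path from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))
```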
This paper presents a Bayesian decision framework that performs automatic story segmentation based on statistical modeling of one or more lexical chain features. Automatic story segmentation aims to locate the instances in time where a story ends and another begins. A lexical chain is formed by linking coherent lexical items chronologically. A story boundary is often associated with a significant number of lexical chains ending before it, starting after it, as well as a low count of chains continuing through it.
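The boundary statistic described above lends itself to a small sketch. Assuming lexical chains are represented as (start, end) spans, the hypothetical score below rewards chains ending just before and starting just after a candidate boundary and penalizes chains continuing through it; the window size and the linear combination are illustrative choices, not the paper's Bayesian decision framework.

```python
# Score a candidate story boundary from lexical-chain spans.
def boundary_score(chains, t, window=1.0):
    """chains: list of (start, end) spans; t: candidate boundary time."""
    ended = sum(1 for s, e in chains if t - window <= e < t)
    started = sum(1 for s, e in chains if t < s <= t + window)
    continuing = sum(1 for s, e in chains if s < t < e)
    # Boundaries attract chain ends/starts and repel continuing chains.
    return ended + started - continuing
```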
Short Messaging Service (SMS) texts behave quite differently from normal written texts and have some very special phenomena. To translate SMS texts, traditional approaches model such irregularities directly in Machine Translation (MT). ... One advantage of this pre-translation normalization is that the diversity in different user groups and domains can be modeled separately without accessing and adapting the language model of the MT system for each SMS application. Another advantage is ...
Part 2 of the book "The Statistical Mechanics of Financial Markets" covers: turbulence and foreign exchange markets, derivative pricing beyond Black–Scholes, microscopic market models, theory of stock exchange crashes, risk management, and economic and regulatory capital for financial institutions.
We present a set of algorithms that enable us to translate natural language sentences by exploiting both a translation memory and a statistical-based translation model. Our results show that an automatically derived translation memory can be used within a statistical framework to often find translations of higher probability than those found using solely a statistical model.
Most documents are about more than one subject, but many NLP and IR techniques implicitly assume documents have just one topic. We describe new clues that mark shifts to new topics, novel algorithms for identifying topic boundaries and the uses of such boundaries once identified. We report topic segmentation performance on several corpora as well as improvement on an IR task that benefits from good segmentation.

Introduction. Dividing documents into topically-coherent sections has many uses, but the primary motivation for this work comes from information retrieval (IR). ...
In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and high-level morphemic correspondences.
In this paper we investigate how to automatically determine whether two document collections are written from different perspectives. By perspective we mean a point of view, for example, that of Democrats or Republicans. We propose a test for different perspectives based on the distribution divergence between the statistical models of the two collections. Experimental results show that the test can successfully distinguish document collections of different perspectives from other types of collections. ...
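As a rough illustration of a distribution-divergence test, the sketch below fits add-one-smoothed unigram models to two collections and computes a symmetrized KL divergence; the paper's actual statistical models and divergence measure may differ.

```python
# Symmetrized KL divergence between unigram models of two collections.
from collections import Counter
import math

def unigram_model(docs, vocab):
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values()) + len(vocab)   # add-one smoothing
    return {w: (counts[w] + 1) / total for w in vocab}

def kl_divergence(p, q):
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def perspective_divergence(coll_a, coll_b):
    """coll_a, coll_b: lists of tokenized documents."""
    vocab = {w for d in coll_a + coll_b for w in d}
    p, q = unigram_model(coll_a, vocab), unigram_model(coll_b, vocab)
    # Symmetrize, since KL itself is asymmetric.
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```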
We propose a statistical method that finds the maximum-probability segmentation of a given text. This method does not require training data because it estimates probabilities from the given text. Therefore, it can be applied to any text in any domain. An experiment showed that the method is more accurate than or at least as accurate as a state-of-the-art text segmentation system.
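The search itself can be illustrated with standard dynamic programming. Assuming a cost function giving the negative log probability of treating sentences i..j−1 as one segment (in the paper, estimated from the given text itself), the sketch below finds the minimum-cost, i.e. maximum-probability, segmentation.

```python
# Dynamic-programming search for the minimum-cost segmentation.
# segment_cost is a placeholder for probabilities estimated from the text.
def best_segmentation(n, segment_cost):
    """n: number of sentences; segment_cost(i, j): cost of segment [i, j)."""
    best = [0.0] + [float("inf")] * n   # best[j]: min cost of sentences [0, j)
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + segment_cost(i, j)
            if c < best[j]:
                best[j], back[j] = c, i
    # Recover the segment boundaries by walking the backpointers.
    bounds, j = [], n
    while j > 0:
        bounds.append((back[j], j))
        j = back[j]
    return list(reversed(bounds))
```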
This paper presents a noisy-channel-based Korean preprocessor system that corrects word-spacing and typographical errors. The proposed algorithm corrects both error types simultaneously. Using an Eojeol transition-pattern dictionary and statistical data such as Eumjeol n-grams and Jaso transition probabilities, the algorithm minimizes the need for huge word dictionaries.
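In noisy-channel terms, the corrector picks the source string c maximizing P(c) · P(observed | c). The sketch below shows only that argmax; the candidate generator and both probability tables (standing in for the Eojeol, Eumjeol, and Jaso statistics) are placeholders, not the paper's components.

```python
# Noisy-channel selection: argmax over candidate corrections of
# log P(c) + log P(observed | c).
def best_correction(observed, candidates, log_lm, log_channel):
    """log_lm(c): log P(c); log_channel(observed, c): log P(observed | c)."""
    return max(candidates, key=lambda c: log_lm(c) + log_channel(observed, c))
```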
This paper examines the current performance of the stochastic tagger PARTS (Church 88) in handling phrasal verbs, describes a problem that arises from the statistical model used, and suggests a way to improve the tagger's performance. The solution involves a change in the definition of what counts as a word for the purpose of tagging phrasal verbs.
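One concrete way to read that suggestion: pre-merge known phrasal verbs into single tokens before tagging, so the tagger sees them as one word. The sketch below is our hypothetical reading with a tiny illustrative verb list, not the paper's actual solution.

```python
# Merge adjacent tokens that form a known phrasal verb into one token.
PHRASAL = {("give", "up"), ("look", "after"), ("put", "off")}  # tiny sample

def merge_phrasal_verbs(tokens):
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i].lower(), tokens[i + 1].lower()) in PHRASAL:
            out.append(tokens[i] + "_" + tokens[i + 1])   # treat as one word
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```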
This book is intended to introduce environmental scientists and managers to the statistical methods that will be useful for them in their work. A secondary aim was to produce a text suitable for a course in statistics for graduate students in the environmental science area. I wrote the book because it seemed to me that these groups should really learn about statistical methods in a special way. It is true that their needs are similar in many respects to those working in other ...
(1) Since the simpler model has fewer regressors than the larger model, its VIFs will be smaller than those of the larger model. Recall that VIF_j = 1 / (1 − R_j^2), where R_j^2 is the R^2 obtained by regressing the j-th regressor on all the other regressors: the more variables we include in the model, the greater the multicollinearity and hence the greater R_j^2, unless the omitted variables happen to be orthogonal to the regressors included in the simpler model. The simpler model, which omits relevant variables, produces biased estimates but with smaller variances. Consequently, there is a tradeoff between bias and precision.
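For concreteness, here is a small NumPy sketch of the VIF computation used in the argument above, regressing each column of the design matrix on the remaining columns; the function name and interface are ours.

```python
# Compute VIF_j = 1 / (1 - R_j^2) for each regressor via least squares.
import numpy as np

def vif(X):
    """X: (n, k) design matrix without the intercept column."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on an intercept plus the other columns.
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return out
```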
Statistical models are empirical. Although they are derived from observations, the relationship described must have a basis in our underlying understanding of processes if we are to have faith in the predictive capabilities of the model (National Research Council 2000).
Modeling Hydrologic Change: Statistical Methods is about modeling systems where change has affected data that will be used to calibrate and test models of the systems, and where models will be used to forecast system responses after change occurs. The focus is not on the hydrology; instead, hydrology serves as the discipline from which the applications are drawn to illustrate the principles of modeling and the detection of change. All four elements of the modeling process are discussed: conceptualization, formulation, calibration, and verification.
As your choice today will directly influence your picture tomorrow, it is important to choose your projector correctly for the field of application you intend to use it in. A digital cinema projector remains, after all, a cinema projector! When choosing a projector model, it is essential that it be correctly matched to the size of the screen it has to illuminate ...
In addition to covering statistical methods, most of the existing books on equating also focus on the practice of equating, the implications of test development and test use for equating practice and policies, and the daily equating challenges that need to be solved. In some sense, the scope of this book is narrower than that of other existing books: to view the equating and linking process as a statistical ...