We present a statistical model of Japanese unknown words consisting of a set of length and spelling models classified by the character types that constitute a word. The idea is simple: different character types should be treated differently, and transitions between character types are important cues, because Japanese script includes both ideograms like Chinese (kanji) and phonograms like English (katakana). The proposed model improves both word segmentation accuracy and part-of-speech tagging accuracy. ...
We present several unsupervised statistical models for the prepositional phrase attachment task that approach the accuracy of the best supervised methods for this task. Our unsupervised approach uses a heuristic based on attachment proximity and trains from raw text that is annotated with only part-of-speech tags and morphological base forms, as opposed to attachment information. It is therefore less resource-intensive and more portable than previous corpus-based algorithms proposed for this task. ...
Traditional concatenative speech synthesis systems use a number of heuristics to define the target and concatenation costs, essential for the design of the unit selection component. In contrast to these approaches, we introduce a general statistical modeling framework for unit selection inspired by automatic speech recognition. Given appropriate data, techniques based on that framework can result in a more accurate unit selection, thereby improving the general quality of a speech synthesizer. They can also lead to a more modular and a substantially more efficient system. ...
This paper presents a novel statistical model for automatic identification of English baseNPs. It uses two steps: N-best Part-Of-Speech (POS) tagging and baseNP identification given the N-best POS sequences. Unlike other approaches in which the two steps are separated, we integrate them into a unified statistical framework. Our model also integrates lexical information. Finally, the Viterbi algorithm is applied to perform a global search over the entire sentence, allowing us to obtain linear complexity for the entire process. ...
This paper presents a Bayesian decision framework that performs automatic story segmentation based on statistical modeling of one or more lexical chain features. Automatic story segmentation aims to locate the instances in time where a story ends and another begins. A lexical chain is formed by linking coherent lexical items chronologically. A story boundary is often associated with a significant number of lexical chains ending before it, starting after it, as well as a low count of chains continuing through it.
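The boundary cue described above (many chains ending before a point, many starting after it, few continuing through it) can be sketched as a simple score over chain intervals. This is an illustrative sketch under assumed names and linear weights, not the paper's Bayesian decision framework:

```python
# Illustrative sketch (not the paper's Bayesian framework): score a
# candidate story boundary from lexical-chain intervals (start, end)
# over sentence indices. Weights w_end, w_start, w_cont are assumed.

def boundary_score(chains, t, w_end=1.0, w_start=1.0, w_cont=1.0):
    """Score the candidate boundary between position t and t + 1."""
    ends = sum(1 for s, e in chains if e == t)        # chains ending before it
    starts = sum(1 for s, e in chains if s == t + 1)  # chains starting after it
    conts = sum(1 for s, e in chains if s <= t < e)   # chains spanning it
    return w_end * ends + w_start * starts - w_cont * conts

# Toy chains: two end at sentence 3, two start at sentence 4, one spans it.
chains = [(0, 3), (1, 3), (4, 7), (4, 6), (2, 5)]
scores = {t: boundary_score(chains, t) for t in range(8)}
best = max(scores, key=scores.get)  # strongest boundary candidate
```

With these toy chains the score peaks between sentences 3 and 4, where two chains end, two begin, and only one continues through.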
In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and high-level morphemic correspondences.
We propose a statistical method that finds the maximum-probability segmentation of a given text. This method does not require training data because it estimates probabilities from the given text. Therefore, it can be applied to any text in any domain. An experiment showed that the method is more accurate than or at least as accurate as a state-of-the-art text segmentation system.
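A maximum-probability segmentation of this kind can be found with a standard dynamic program over candidate boundaries. In this minimal sketch, `span_logprob` is a toy stand-in for the probabilities the method estimates from the text itself (here it simply favors homogeneous runs), and all names are assumptions:

```python
import math

# Illustrative DP sketch of maximum-probability text segmentation.
# span_logprob is a toy stand-in for probabilities estimated from the
# text itself: homogeneous spans score high, mixed spans score low.

words = list("aaabbbaa")

def span_logprob(i, j):
    return (j - i) - 1 if len(set(words[i:j])) == 1 else -10.0

def best_segmentation(n, span_logprob):
    """Return the segmentation of positions 0..n maximizing the total score."""
    best = [0.0] + [-math.inf] * n   # best[j]: best score for the prefix 0..j
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            score = best[i] + span_logprob(i, j)
            if score > best[j]:
                best[j], back[j] = score, i
    cuts, j = [], n                  # recover segments from back-pointers
    while j > 0:
        cuts.append((back[j], j))
        j = back[j]
    return list(reversed(cuts)), best[n]

cuts, score = best_segmentation(len(words), span_logprob)
```

On the toy sequence the optimal cuts fall at the run boundaries, i.e. segments (0, 3), (3, 6), and (6, 8).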
We propose a novel model to automatically extract transliteration pairs from parallel corpora. Our model is efficient, language pair independent and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models.
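The interpolation can be sketched as a two-component mixture; the mixture weight `lam` and the 0.5 posterior threshold below are assumptions for illustration, not values from the paper:

```python
# Illustrative sketch of the interpolated mining model: a word pair's
# probability mixes a transliteration sub-model and a non-transliteration
# sub-model. lam is an assumed mixture weight, not a value from the paper.

def pair_prob(p_translit, p_non_translit, lam=0.5):
    return lam * p_translit + (1.0 - lam) * p_non_translit

def is_transliteration(p_translit, p_non_translit, lam=0.5):
    # Posterior probability that the transliteration sub-model generated
    # the pair; mine the pair if the posterior exceeds 0.5.
    total = pair_prob(p_translit, p_non_translit, lam)
    return (lam * p_translit / total) > 0.5
```

A pair scored high by the transliteration sub-model and low by the non-transliteration one is mined; the reverse is rejected.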
Short Messaging Service (SMS) texts behave quite differently from normal written texts and have some very special phenomena. To translate SMS texts, traditional approaches model such irregularities directly in Machine Translation (MT). ... One advantage of this pre-translation normalization is that the diversity in different user groups and domains can be modeled separately without accessing and adapting the language model of the MT system for each SMS application. Another advantage is ...
This paper presents a noisy-channel-based Korean preprocessor system, which corrects word spacing and typographical errors. The proposed algorithm corrects both errors simultaneously. Using an Eojeol transition-pattern dictionary and statistical data such as Eumjeol n-grams and Jaso transition probabilities, the algorithm minimizes the use of huge word dictionaries.
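The noisy-channel decision rule such a corrector relies on (choose the correction c maximizing P(c) · P(observed | c)) can be sketched with toy English tables; the tiny language and channel models below are illustrative stand-ins for the Eojeol/Eumjeol/Jaso resources the paper uses:

```python
# Illustrative noisy-channel sketch: pick the correction c maximizing
# P(c) * P(obs | c). These toy tables stand in for the paper's Eojeol
# transition patterns, Eumjeol n-grams, and Jaso transition probabilities.

lm = {"the cat": 0.6, "theca t": 0.1, "the cut": 0.3}  # source model P(c)
channel = {                                            # channel model P(obs | c)
    ("thecat", "the cat"): 0.5,  # word-spacing error
    ("thecat", "theca t"): 0.3,  # spacing error, different split
    ("thecat", "the cut"): 0.1,  # spacing plus typographical error
}

def correct(obs):
    # Bayes decision: argmax over candidate corrections.
    return max(lm, key=lambda c: lm[c] * channel.get((obs, c), 0.0))
```

Both the spacing error and the typo are scored jointly, mirroring how the proposed algorithm corrects both error types simultaneously.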
Modeling Hydrologic Change: Statistical Methods is about modeling systems where
change has affected data that will be used to calibrate and test models of the systems
and where models will be used to forecast system responses after change occurs.
The focus is not on the hydrology. Instead, hydrology serves as the discipline from
which the applications are drawn to illustrate the principles of modeling and the
detection of change. All four elements of the modeling process are discussed:
conceptualization, formulation, calibration, and verification.
In addition to covering statistical methods, most of the existing books on
equating also focus on the practice of equating, the implications of test development
and test use for equating practice and policies, and the daily equating challenges
that need to be solved. In some sense, the scope of this book is narrower than that of
other existing books: to view the equating and linking process as a statistical ...
A collection of international chemistry research reports for readers interested in chemistry, on the topic: An efficient voice activity detection algorithm by combining statistical model and energy detection
Mathematical modelling is the process of formulating an abstract model
in terms of mathematical language to describe the complex behaviour of
a real system. Mathematical models are quantitative models and often
expressed in terms of ordinary differential equations and partial differential
equations. Mathematical models can also be statistical models,
fuzzy logic models and empirical relationships. In fact, any model description
using mathematical language can be called a mathematical model.
Hidden Markov models (HMMs) are powerful statistical models that have found successful applications in Information Extraction (IE). In current approaches to applying HMMs to IE, an HMM is used to model text at the document level. This modelling might cause undesired redundancy in extraction, in the sense that more than one filler is identified and extracted. We propose to use HMMs to model text at the segment level, in which the extraction process consists of two steps: a segment retrieval step followed by an extraction step. ...
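The two-step process can be sketched as retrieval followed by extraction. In this sketch a keyword scorer and a trigger rule stand in for the paper's segment-level HMMs, and all names and data are illustrative assumptions:

```python
# Illustrative sketch of two-step, segment-level extraction. A keyword
# scorer and a trigger rule stand in for the paper's HMMs: retrieving
# one segment first avoids extracting redundant fillers document-wide.

def retrieve_segment(segments, keywords):
    # Step 1: retrieve the segment most likely to contain a filler.
    return max(segments, key=lambda seg: sum(tok in keywords for tok in seg))

def extract(segment, trigger):
    # Step 2: extract only within the retrieved segment (here, the
    # token right after an assumed trigger token).
    for i, tok in enumerate(segment[:-1]):
        if tok == trigger:
            return segment[i + 1]
    return None

segments = [["the", "meeting", "was", "long"],
            ["speaker", ":", "Dr", "Smith", "will", "talk"]]
seg = retrieve_segment(segments, {"speaker", "talk"})
filler = extract(seg, ":")
```

Because extraction runs only inside the single retrieved segment, at most one filler is returned, which is the redundancy the segment-level approach targets.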
We present a set of algorithms that enable us to translate natural language sentences by exploiting both a translation memory and a statistical translation model. Our results show that an automatically derived translation memory can be used within a statistical framework to often find translations of higher probability than those found using solely a statistical model.
A collection of medical research reports published in the Wertheim medical journal, offering readers knowledge of medicine, on the topic: A statistical model that predicts the length from the left subclavian artery to the celiac axis; towards accurate intra-aortic balloon sizing...
Paraphrase generation (PG) is important in many NLP applications. However, research on PG is still far from sufficient. In this paper, we propose a novel method for statistical paraphrase generation (SPG), which can (1) serve various applications with a single uniform statistical model, and (2) naturally combine multiple resources to enhance PG performance. In our experiments, we use the proposed method to generate paraphrases for three different applications.