A complete practical tutorial for RStudio, designed with the needs of analysts and R developers alike in mind.
Step-by-step examples that apply the principles of reproducible research and good programming practices to R projects.
Learn to effectively generate reports, create graphics, perform analyses, and even build R packages with RStudio.
This series aims to capture new developments and summarize what is known over the whole spectrum of mathematical and computational biology and medicine. It seeks to encourage the integration of mathematical, statistical and computational methods into biology by publishing a broad range of textbooks, reference works and handbooks. The titles included in the series are meant to appeal to students, researchers and professionals in the mathematical, statistical and computational sciences, fundamental biology and bioengineering, as well as interdisciplinary researchers involved in the field....
This paper describes an extension to the hidden Markov model for part-of-speech tagging using second-order approximations for both contextual and lexical probabilities. This model increases the tagger's accuracy to state-of-the-art levels. These approximations make use of more contextual information than standard statistical systems. New methods of smoothing the estimated probabilities are also introduced to address the sparse-data problem.
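The smoothing described above can be illustrated concretely. The sketch below shows second-order (trigram) contextual probabilities combined by linear interpolation with lower-order estimates; the lambda weights and tiny tag corpus are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: smoothed trigram tag probability P(t3 | t1, t2) by
# linear interpolation of uni-, bi-, and trigram relative frequencies.
from collections import Counter

def trigram_prob(t1, t2, t3, uni, bi, tri, total, lambdas=(0.1, 0.3, 0.6)):
    """Interpolated estimate of P(t3 | t1, t2); lambdas are assumed weights."""
    l1, l2, l3 = lambdas
    p_uni = uni[t3] / total if total else 0.0
    p_bi = bi[(t2, t3)] / uni[t2] if uni[t2] else 0.0
    p_tri = tri[(t1, t2, t3)] / bi[(t1, t2)] if bi[(t1, t2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Toy tag sequence standing in for a training corpus.
tags = ["DT", "NN", "VB", "DT", "NN"]
uni = Counter(tags)
bi = Counter(zip(tags, tags[1:]))
tri = Counter(zip(tags, tags[1:], tags[2:]))
p = trigram_prob("DT", "NN", "VB", uni, bi, tri, len(tags))
```

The interpolation guarantees a non-zero estimate whenever the unigram count is non-zero, which is how such schemes address sparse data.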
This paper reports the ongoing research of a thesis project investigating a computational model of early language acquisition. The model discovers word-like units from cross-modal input data and builds continuously evolving internal representations within a cognitive model of memory. Current cognitive theories suggest that young infants employ general statistical mechanisms that exploit the statistical regularities within their environment to acquire language skills.
This paper presents a comparative study of five parameter estimation algorithms on four NLP tasks. Three of the five algorithms are well-known in the computational linguistics community: Maximum Entropy (ME) estimation with L2 regularization, the Averaged Perceptron (AP), and Boosting. We also investigate ME estimation with L1 regularization using a novel optimization algorithm, and BLasso, which is a version of Boosting with Lasso (L1) regularization. We first investigate all of our estimators on two re-ranking tasks: a parse selection task and a language model (LM) adaptation task. ...
In this paper we describe a novel data structure for phrase-based statistical machine translation which allows for the retrieval of arbitrarily long phrases while simultaneously using less memory than is required by current decoder implementations. We detail the computational complexity and average retrieval times for looking up phrase translations in our suffix array-based data structure. We show how sampling can be used to reduce the retrieval time by orders of magnitude with no loss in translation quality. ...
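The core idea of suffix-array phrase lookup can be sketched in a few lines: sort all suffix start positions of the tokenized corpus, then locate a phrase's occurrences by binary search. This is a simplified illustration, not the paper's implementation; a production decoder would avoid materializing the truncated suffixes.

```python
# Sketch of phrase lookup over a tokenized corpus via a suffix array.
import bisect

def build_suffix_array(tokens):
    """Suffix array: start indices sorted by the suffix beginning there."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_phrase(tokens, sa, phrase):
    """All corpus positions where `phrase` occurs, via binary search."""
    # Truncating each suffix to the phrase length preserves sorted order,
    # so bisect finds the contiguous block of matching suffixes.
    prefixes = [tokens[i:i + len(phrase)] for i in sa]
    lo = bisect.bisect_left(prefixes, phrase)
    hi = bisect.bisect_right(prefixes, phrase)
    return sorted(sa[lo:hi])

tokens = "the cat sat on the mat".split()
sa = build_suffix_array(tokens)
```

Because every occurrence of a phrase is a prefix of some suffix, one index supports lookups of arbitrarily long phrases, which is the memory advantage the abstract alludes to.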
The search space of Phrase-Based Statistical Machine Translation (PBSMT) systems can be represented in the form of a directed acyclic graph (a lattice). The quality of this search space can thus be evaluated by computing the best achievable hypothesis in the lattice, the so-called oracle hypothesis. For common SMT metrics, however, this problem is NP-hard and can only be solved approximately using heuristics.
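When the metric does decompose over lattice edges (unlike BLEU, for which the problem is NP-hard as noted above), the oracle hypothesis reduces to a max-score path in a DAG. The sketch below assumes such a decomposable per-edge score; the lattice encoding and scoring function are illustrative, not from the paper.

```python
# Sketch: oracle (best-achievable) hypothesis in an acyclic lattice,
# assuming the metric decomposes into per-edge scores.
def oracle_path(edges, topo_order, start, end, edge_score):
    """edges: {node: [(next_node, word), ...]}; topo_order: nodes in
    topological order. Returns (best_score, best_word_sequence)."""
    best = {start: (0.0, [])}
    for node in topo_order:
        if node not in best:
            continue
        score, words = best[node]
        for nxt, word in edges.get(node, []):
            cand = (score + edge_score(word), words + [word])
            if nxt not in best or cand[0] > best[nxt][0]:
                best[nxt] = cand
    return best.get(end)

# Toy lattice: two competing first words, then a shared continuation.
edges = {0: [(1, "the"), (1, "a")], 1: [(2, "cat")]}
reference = {"the", "cat"}
result = oracle_path(edges, [0, 1, 2], 0, 2,
                     lambda w: 1.0 if w in reference else 0.0)
```

The dynamic program visits each edge once, so the oracle is found in time linear in the lattice size whenever the metric permits this decomposition.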
Think Bayes is an introduction to Bayesian statistics using computational methods and the Python programming language. Bayesian statistics is usually presented mathematically, but many of the ideas are easier to understand computationally. Contents: Bayes's Theorem; Computational statistics; Tanks and Trains; Urns and Coins; Odds and addends; Hockey; The variability hypothesis; Hypothesis testing.
As part of its new Digital Government program, the National Science Foundation (NSF) requested that the Computer Science and Telecommunications Board (CSTB) undertake an in-depth study of how information technology research and development could more effectively support advances in the use of information technology in government.
This textbook was designed and developed to provide health care students, primarily health information management and health information technology students, and health care professionals with a rudimentary understanding of the terms, definitions, and formulae used in computing health care statistics, and to provide self-testing opportunities and applications of the statistical formulae.
We tackle the previously unaddressed problem of unsupervised determination of the optimal morphological segmentation for statistical machine translation (SMT) and propose a segmentation metric that takes into account both sides of the SMT training corpus. We formulate the objective function as the posterior probability of the training corpus according to a generative segmentation-translation model. We describe how the IBM Model-1 translation likelihood can be computed incrementally between adjacent segmentation states for efficient computation. ...
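For reference, the IBM Model-1 likelihood that the abstract proposes to compute incrementally has a simple closed form: each target word's probability is the average of its lexical translation probabilities over the source words (plus a NULL word). The sketch below computes it directly; the data layout and the small smoothing floor are assumptions for illustration, and the incremental update between segmentation states is not shown.

```python
# Sketch: IBM Model-1 log-likelihood of a parallel corpus given lexical
# translation probabilities t[(target_word, source_word)].
import math

def model1_log_likelihood(pairs, t, floor=1e-12):
    """pairs: list of (source_tokens, target_tokens); NULL added to source."""
    ll = 0.0
    for src, tgt in pairs:
        src = ["NULL"] + src
        for f in tgt:
            # P(f | src) = (1/|src|) * sum over source words of t(f | e).
            ll += math.log(sum(t.get((f, e), floor) for e in src) / len(src))
    return ll

t = {("le", "the"): 0.7, ("le", "NULL"): 0.1}
ll = model1_log_likelihood([(["the"], ["le"])], t)
```

Because the likelihood factorizes over target words, resegmenting one phrase only touches the terms involving the affected words, which is what makes the incremental computation between adjacent segmentation states cheap.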
Statistical models in machine translation exhibit spurious ambiguity. That is, the probability of an output string is split among many distinct derivations (e.g., trees or segmentations). In principle, the goodness of a string is measured by the total probability of its many derivations. However, finding the best string (e.g., during decoding) is then computationally intractable. Therefore, most systems use a simple Viterbi approximation that measures the goodness of a string using only its most probable derivation.
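The gap between the Viterbi approximation and the exact criterion is easy to demonstrate on toy data. The sketch below, with made-up derivation probabilities, shows the two criteria choosing different strings when one string's mass is split across derivations.

```python
# Sketch: Viterbi (max over derivations) vs. exact (sum over derivations)
# scoring of output strings, illustrating spurious ambiguity.
from collections import defaultdict

def best_string(derivations, viterbi=True):
    """derivations: list of (output_string, probability), one per derivation."""
    score = defaultdict(float)
    for s, p in derivations:
        score[s] = max(score[s], p) if viterbi else score[s] + p
    return max(score, key=score.get)

# String "b" has more total mass (0.6) but no single derivation beats "a".
derivs = [("a", 0.4), ("b", 0.3), ("b", 0.3)]
```

Here the Viterbi approximation picks "a" while summation picks "b": exactly the failure mode the abstract describes.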
State-of-the-art computer-assisted translation engines are based on a statistical prediction engine, which interactively provides completions to what a human translator types. The integration of human speech into a computer-assisted system is also a challenging area and is the aim of this paper. So far, only a few methods for integrating statistical machine translation (MT) models with automatic speech recognition (ASR) models have been studied. They were mainly based on an N-best rescoring approach. ...
We describe a new loss function, due to Jeon and Lin (2006), for estimating structured log-linear models on arbitrary features. The loss function can be seen as a (generative) alternative to maximum likelihood estimation with an interesting information-theoretic interpretation, and it is statistically consistent. It is substantially faster, by an order of magnitude or more, than maximum (conditional) likelihood estimation of conditional random fields (Lafferty et al., 2001).
In this paper we focus on how to improve pronoun resolution using statistics-based semantic compatibility information. We investigate two unexplored issues that influence the effectiveness of such information: the statistics source and the learning framework. Specifically, we propose for the first time to utilize the web and the twin-candidate model, in addition to the previous combination of the corpus and the single-candidate model, to compute and apply the semantic information.
In statistical machine translation, the generation of a translation hypothesis is computationally expensive. If arbitrary word reorderings are permitted, the search problem is NP-hard. On the other hand, if we restrict the possible word reorderings in an appropriate way, we obtain a polynomial-time search algorithm. In this paper, we compare two different reordering constraints, namely the ITG constraints and the IBM constraints.
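As a concrete illustration of one of the two constraint families, the IBM constraints permit a reordering only if each source position translated next is among the first k not-yet-covered positions. The checker below is a simplified sketch (k = 4 is a commonly cited choice, assumed here), and it says nothing about the ITG constraints, which restrict reorderings via binary bracketing instead.

```python
# Sketch: check whether a source-position permutation obeys the IBM
# reordering constraints with window size k.
def satisfies_ibm(permutation, k=4):
    """Each translated position must be one of the first k uncovered ones."""
    uncovered = sorted(permutation)
    for pos in permutation:
        if pos not in uncovered[:k]:
            return False
        uncovered.remove(pos)
    return True
```

Restricting search to such permutations is what yields the polynomial-time algorithm the abstract mentions, at the cost of ruling out some long-range reorderings.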
The processes through which readers evoke mental representations of phonological forms from print constitute a hotly debated issue in current psycholinguistics. In this paper we present a computational analysis of the grapho-phonological system of written French, and an empirical validation of some of the obtained descriptive statistics.
This paper introduces a novel generation system that composes humanlike descriptions of images from computer vision detections. By leveraging syntactically informed word co-occurrence statistics, the generator filters and constrains the noisy detections output from a vision system to generate syntactic trees that detail what the computer vision system sees. Results show that the generation system outperforms state-of-the-art systems, automatically generating some of the most natural image descriptions to date. ...
Since they cluster terms through statistical measures of context similarities, these tools exploit recurring situations. Since single-word terms denote broader concepts than multi-word terms, they appear more frequently in corpora and are therefore more appropriate for statistical clustering. The contribution of this paper is to propose an integrated platform for computer-aided term extraction and structuring that results from the combination of LEXTER, a Term Extraction tool (Bourigault et al., 1996), and FASTR, a Term Normalization tool (Jacquemin et al., 1997). ...
In this chapter, you learned to: Define the terms state of nature, event, decision alternatives, payoff, and utility; organize information in a payoff table or a decision tree; compute opportunity loss and utility function; find an optimal decision alternative based on a given decision criterion; assess the expected value of additional information.
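The chapter's closing objectives: payoff tables, expected-value criteria, and the value of additional information, can be condensed into a short sketch. The payoff table and probabilities below are illustrative assumptions, and "additional information" is treated in its idealized form as perfect information (EVPI).

```python
# Sketch: expected value of each decision alternative from a payoff table,
# and the expected value of perfect information (EVPI).
def expected_value(payoffs, probs):
    """payoffs: {alternative: {state: payoff}}; probs: {state: probability}."""
    return {a: sum(probs[s] * v for s, v in row.items())
            for a, row in payoffs.items()}

def evpi(payoffs, probs):
    """Expected payoff when the state is known before choosing,
    minus the best expected value achievable without that knowledge."""
    ev = expected_value(payoffs, probs)
    with_info = sum(probs[s] * max(row[s] for row in payoffs.values())
                    for s in probs)
    return with_info - max(ev.values())

# Toy payoff table: A is risky, B is a sure thing; states equally likely.
payoffs = {"A": {"good": 100, "bad": 20}, "B": {"good": 60, "bad": 60}}
probs = {"good": 0.5, "bad": 0.5}
ev = expected_value(payoffs, probs)
```

Here both alternatives have expected value 60, yet perfect information is worth 20, since a clairvoyant decision maker would pick A in the good state and B in the bad one.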