  • The lecture "Advanced Econometrics (Part II) - Chapter 6: Models for count data" covers the Poisson regression model, goodness of fit, overdispersion, the negative binomial regression model, and data with too many zeros.

  • It is well known that occurrence counts of words in documents are often modeled poorly by standard distributions like the binomial or Poisson. Observed counts vary more than simple models predict, prompting the use of overdispersed models like Gamma-Poisson or Beta-binomial mixtures as robust alternatives. Another deficiency of standard models is that most words never occur in a given document, resulting in a large number of zero counts. We propose using zero-inflated models to deal with this, and evaluate competing models on a Naive Bayes text classification task.

  • This book is intended primarily for use in a second-semester course in graduate econometrics, after a first course at the level of Goldberger (1991) or Greene (1997). Parts of the book can be used for special-topics courses, and it should serve as a general reference. My focus on cross section and panel data methods—in particular, what is often...

  • Web text has been successfully used as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we created a Web Corpus by downloading web pages to create a topic-diverse collection of 10 billion words of English. We show that for context-sensitive spelling correction the Web Corpus results are better than using a search engine. For thesaurus extraction, it achieved similar overall results to a corpus of newspaper text.

  • Several attempts have been made to learn phrase translation probabilities for phrase-based statistical machine translation that go beyond pure counting of phrases in word-aligned training data. Most approaches report problems with overfitting. We describe a novel leaving-one-out approach to prevent overfitting that allows us to train phrase models that show improved translation performance on the WMT08 Europarl German-English task.

  • Measure words in Chinese are used to indicate the count of nouns. Conventional statistical machine translation (SMT) systems do not perform well on measure word generation due to data sparseness and the potential long distance dependency between measure words and their corresponding head words. In this paper, we propose a statistical model to generate appropriate measure words of nouns for an English-to-Chinese SMT system. We model the probability of measure word generation by utilizing lexical and syntactic knowledge from both source and target sentences. ...

  • This paper proposes two ideas for adapting standard kinematic techniques to situations that do not naturally allow for the constraint of a fixed baseline. The first calls for extracting the information needed to resolve the integer ambiguity from the very data collected while the kinematic survey is in progress. The second idea addresses the use of the antenna exchange technique for mobile platforms where the original locations of the antennas are not likely to remain stationary during the physical exchange.

  • We describe our initial investigations into generating textual summaries of spatiotemporal data with the help of a prototype Natural Language Generation (NLG) system that produces pollen forecasts for Scotland. ... forecasts were written. An example of a pollen forecast text is shown in Figure 1, and its corresponding data are shown in Table 1. A pollen forecast in map form is shown in Figure 2. ‘Monday looks set to bring another day of relatively high pollen counts, with values up to a very high eight in the Central Belt. ...

