We describe experiments with a Naive Bayes text classifier in the context of anti-spam E-mail filtering, using two different statistical event models: a multi-variate Bernoulli model and a multinomial model. We introduce a family of feature ranking functions for feature selection in the multinomial event model that take account of the word frequency information. We present evaluation results on two publicly available corpora of legitimate and spam E-mails.
Recent approaches to text classification have used two
different first-order probabilistic models for classification,
both of which make the naive Bayes assumption.
Some use a multi-variate Bernoulli model, that is, a
Bayesian Network with no dependencies between words
and binary word features (e.g. Larkey and Croft 1996;
Koller and Sahami 1997). Others use a multinomial
model, that is, a uni-gram language model with integer
word counts (e.g. Lewis and Gale 1994; Mitchell 1997).
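The difference between the two event models can be made concrete with a minimal sketch of the class-conditional log-likelihood of a single document. The function names, the toy vocabulary, and the probability values are illustrative assumptions, not taken from the cited papers; smoothing and class priors are omitted.

```python
import math
from collections import Counter


def bernoulli_log_likelihood(doc_tokens, presence_probs):
    """log P(doc | class) under the multi-variate Bernoulli model.

    presence_probs maps each vocabulary word to P(word occurs | class).
    Every vocabulary word contributes a factor, whether it occurs or not,
    and repeated occurrences of a word are ignored.
    """
    present = set(doc_tokens)
    total = 0.0
    for word, p in presence_probs.items():
        total += math.log(p) if word in present else math.log(1.0 - p)
    return total


def multinomial_log_likelihood(doc_tokens, word_probs):
    """log P(doc | class) under the multinomial (unigram) model.

    word_probs maps each vocabulary word to P(word | class) and sums to 1.
    Only words that occur contribute, weighted by their integer counts.
    Factors that do not depend on the word parameters are dropped.
    """
    counts = Counter(t for t in doc_tokens if t in word_probs)
    return sum(n * math.log(word_probs[w]) for w, n in counts.items())


# Hypothetical toy parameters for a single class.
presence_probs = {"free": 0.5, "meeting": 0.25}
word_probs = {"free": 0.5, "meeting": 0.5}
doc = ["free", "free"]

bernoulli_score = bernoulli_log_likelihood(doc, presence_probs)
multinomial_score = multinomial_log_likelihood(doc, word_probs)
```

In this sketch the Bernoulli score is unchanged if "free" occurs once or twice, while the multinomial score doubles the contribution of "free" and ignores the absent word "meeting".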
In banking, especially in risk management, portfolio management, and structured finance, solid quantitative know-how becomes more and more important. We had a two-fold intention when writing this book: First, this book is designed to help mathematicians and physicists leaving the academic world and starting a profession as risk or portfolio managers to get quick access to the world of credit risk management. Second, our book is aimed at being helpful to risk managers looking for a more quantitative approach to credit risk. ...
We analyse the mathematical structure of portfolio credit risk models with particular
regard to the modelling of dependence between default events in these models. We
explore the role of copulas in latent variable models (the approach that underlies KMV
and CreditMetrics) and use non-Gaussian copulas to present extensions to standard
industry models. We explore the role of the mixing distribution in Bernoulli mixture
models (the approach underlying CreditRisk+) and derive large portfolio approximations
for the loss distribution.
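As a small illustration of how the mixing distribution drives default dependence in a Bernoulli mixture model, the sketch below assumes an exchangeable portfolio in which defaults are conditionally independent given a common default probability Q, and Q follows a Beta(a, b) distribution. The Beta choice, the function name, and the parameter values are assumptions made for this example only and are not claimed to be the specification used in the work described above.

```python
def beta_mixture_default_stats(a, b):
    """Default statistics for an exchangeable Beta-Bernoulli mixture.

    Given Q ~ Beta(a, b), obligors default independently with probability Q.
    Returns (pi1, pi2, rho):
      pi1 = E[Q]    marginal default probability
      pi2 = E[Q^2]  probability that two given obligors both default
      rho           default correlation between two obligors
    """
    pi1 = a / (a + b)
    pi2 = a * (a + 1) / ((a + b) * (a + b + 1))
    rho = (pi2 - pi1 ** 2) / (pi1 - pi1 ** 2)
    return pi1, pi2, rho


# Hypothetical parameters: a 10% marginal default probability.
pi1, pi2, rho = beta_mixture_default_stats(1.0, 9.0)
```

Under these assumptions the default correlation simplifies to 1 / (a + b + 1), so the same marginal default probability can be paired with stronger or weaker dependence by rescaling a and b together.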
Chapter 9: Words and maps covers global properties of words (N-letter strings from an M-letter alphabet), which are well-studied in classical combinatorics (because they model sequences of independent Bernoulli trials) and in classical applied algorithmics (because they model input sequences for hashing algorithms). The chapter also covers random maps (N-letter words from an N-letter alphabet) and discusses relationships with trees and permutations.
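The correspondence between words, maps, and permutations can be checked by brute force for small sizes. The sketch below is only an illustration; the function names are hypothetical and letters are represented as the integers 0 to M-1.

```python
from itertools import product
from math import factorial


def words(n, m):
    """All n-letter words over an m-letter alphabet, as tuples of ints."""
    return list(product(range(m), repeat=n))


def is_permutation(word):
    """True if an n-letter word over an n-letter alphabet, read as the map
    i -> word[i], is a bijection (every letter appears exactly once)."""
    return len(set(word)) == len(word)


# There are m**n words of length n over an m-letter alphabet.
binary_words = words(3, 2)

# A map on {0, 1, 2} is a 3-letter word over a 3-letter alphabet;
# the maps with no repeated letter are exactly the permutations.
maps = words(3, 3)
permutation_count = sum(is_permutation(w) for w in maps)
```

For N = 3 this enumerates N**N = 27 maps, of which N! = 6 are permutations.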