This paper develops an agglomerative clustering approach to multi-document personal name disambiguation. We start from an analysis of pointwise mutual information between a feature and the ambiguous name, which leads to a novel method for computing feature weights in clustering. Then a trade-off measure between within-cluster compactness and among-cluster separation is proposed as a stopping criterion for clustering. After that, we apply a labeling method to find representative features for each cluster. ...
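As a rough illustration of the quantity this abstract starts from, the standard definition of pointwise mutual information can be computed from document counts. This is only a minimal sketch of textbook PMI; the abstract does not spell out the paper's own weighting scheme, and the function name, parameters, and count-based estimation here are assumptions made for illustration.

```python
import math


def pmi(joint_count, feature_count, name_count, total_docs):
    """Textbook PMI between a feature and a name, estimated from document counts.

    All four arguments are hypothetical counts: documents containing both the
    feature and the name, the feature, the name, and all documents.
    """
    if joint_count == 0:
        # The pair never co-occurs, so PMI is negative infinity.
        return float("-inf")
    # P(f, n) / (P(f) * P(n)) rewritten in terms of raw counts.
    ratio = (joint_count * total_docs) / (feature_count * name_count)
    return math.log2(ratio)
```

A positive value means the feature co-occurs with the name more often than independence would predict, which is why such a score is a natural starting point for feature weighting.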
This paper presents an exploratory data analysis in lexical acquisition for adjective classes using clustering techniques. From a theoretical point of view, this approach provides large-scale empirical evidence for a sound classification. From a computational point of view, it helps develop a reliable automatic subclassification method. Results show that the features used in theoretical work can be successfully modelled in terms of shallow cues.
In this paper we present TroFi (Trope Finder), a system for automatically classifying literal and nonliteral usages of verbs through nearly unsupervised word-sense disambiguation and clustering techniques. TroFi uses sentential context instead of selectional constraint violations or paths in semantic hierarchies. It also uses literal and nonliteral seed sets acquired and cleaned without human supervision in order to bootstrap learning.
We propose a system which builds, in a semi-supervised manner, a resource that aims at helping an NER system to annotate corpus-specific named entities. This system is based on a distributional approach which uses syntactic dependencies for measuring similarities between named entities. The specificity of the presented method, however, is to combine a clique-based approach and a clustering technique that amounts to a soft clustering method.
This paper explores techniques to take advantage of the fundamental difference in structure between hidden Markov models (HMM) and hierarchical hidden Markov models (HHMM). The HHMM structure allows repeated parts of the model to be merged together. A merged model takes advantage of the recurring patterns within the hierarchy, and the clusters that exist in some sequences of observations, in order to increase the extraction accuracy.
Nowadays, huge amounts of multimedia data are constantly being generated in
various forms and in various places around the world. With the ever-increasing complexity
and variability of multimedia data, traditional rule-based approaches,
in which humans have to discover the domain knowledge and encode it into a
set of programming rules, are too costly and inadequate for analyzing the
contents of this glut of multimedia data and extracting intelligence from it.
The challenges in data complexity and variability have led to revolutions
in machine learning techniques.
One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors.
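The hashing idea described above can be sketched as follows. This is a minimal, assumption-laden illustration of the signed hashing trick, not the authors' implementation: the function names, the choice of `zlib.crc32` as the hash, and the default bucket count are all hypothetical.

```python
import zlib


def hash_features(features, n_buckets=16):
    """Map a sparse {feature_name: value} dict to a sparse {bucket: value} dict."""
    hashed = {}
    for name, value in features.items():
        data = name.encode("utf-8")
        # First hash picks the bucket (the reduced dimension).
        index = zlib.crc32(data) % n_buckets
        # Second hash picks a sign, so colliding features tend to cancel
        # rather than always add up.
        sign = 1.0 if zlib.crc32(b"sign:" + data) % 2 == 0 else -1.0
        hashed[index] = hashed.get(index, 0.0) + sign * value
    return hashed


def to_dense(hashed, n_buckets=16):
    """Expand a hashed sparse vector into a dense list, e.g. for a centroid."""
    dense = [0.0] * n_buckets
    for index, value in hashed.items():
        dense[index] = value
    return dense
```

A hashed input vector has at most as many non-zero entries as the original, so it stays sparse, while a dense centroid needs only `n_buckets` entries regardless of the size of the original feature space.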
Creates nested clusters
Agglomerative clustering algorithms vary in terms of how the proximity of two clusters is computed
MIN (single link): susceptible to noise/outliers
MAX/GROUP AVERAGE: may not work well with non-globular clusters
CURE algorithm tries to handle both problems
Often starts with a proximity matrix
A type of graph-based algorithm
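The proximity definitions listed above can be sketched in a few lines. This is a naive illustration under stated assumptions: the function names are hypothetical, Euclidean distance is assumed, and pairwise distances are recomputed on every pass instead of being read from a stored proximity matrix. CURE is not covered.

```python
import math


def cluster_distance(a, b, linkage):
    """Proximity of two clusters (lists of points) under a given linkage."""
    dists = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":    # MIN: closest pair of points
        return min(dists)
    if linkage == "complete":  # MAX: farthest pair of points
        return max(dists)
    if linkage == "average":   # GROUP AVERAGE: mean over all pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

For example, `agglomerate([(0, 0), (0, 1), (10, 0), (10, 1)], 2, "complete")` groups the two left-hand points and the two right-hand points. On well-separated data like this all three linkages agree; they differ mainly on noisy or non-globular data.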
We present a technique for automatic induction of slot annotations for subcategorization frames, based on induction of hidden classes in the EM framework of statistical estimation. The models are empirically evaluated by a general decision test. Induction of slot labeling for subcategorization frames is accomplished by a further application of EM, and applied experimentally on frame observations derived from parsing large corpora. We outline an interpretation of the learned representations as theoretical-linguistic decompositional lexical entries. ...
Statistical machine learning methods are employed to train a Named Entity Recognizer from annotated data. Methods like Maximum Entropy and Conditional Random Fields make use of features for training. These methods tend to overfit when the available training corpus is limited, especially if the number of features is large or the number of values for a feature is large. To overcome this, we propose two techniques for feature reduction based on word clustering and selection.
In statistical language modeling, one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes. In this paper we investigate the effects of applying such a technique to higher-order n-gram models trained on large corpora.
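One common way to use such equivalence classes is to factor the word probability as P(w | w_prev) ≈ P(c | c_prev) · P(w | c), where c and c_prev are the classes of w and w_prev. The sketch below shows that factorization for a bigram model only, with unsmoothed maximum-likelihood estimates. The function name and the hand-written word-to-class map are assumptions for illustration; the abstract does not specify the model actually studied.

```python
from collections import Counter


def train_class_bigram(tokens, word_to_class):
    """Return prob(word, prev_word) for a class-based bigram model."""
    classes = [word_to_class[w] for w in tokens]
    class_bigrams = Counter(zip(classes, classes[1:]))
    class_context = Counter(classes[:-1])  # classes that have a successor
    class_counts = Counter(classes)
    word_counts = Counter(tokens)

    def prob(word, prev_word):
        c, c_prev = word_to_class[word], word_to_class[prev_word]
        if class_context[c_prev] == 0 or class_counts[c] == 0:
            return 0.0
        p_transition = class_bigrams[(c_prev, c)] / class_context[c_prev]
        p_emission = word_counts[word] / class_counts[c]
        return p_transition * p_emission

    return prob


# Hypothetical toy data: "the dog" never occurs as a word bigram,
# but the class bigram DET -> N does, so it still gets probability mass.
tokens = ["the", "cat", "sat", "a", "dog", "ran"]
word_to_class = {"the": "DET", "a": "DET", "cat": "N",
                 "dog": "N", "sat": "V", "ran": "V"}
prob = train_class_bigram(tokens, word_to_class)
```

Here `prob("dog", "the")` is non-zero even though the word pair is unseen, which is the sparsity reduction the abstract refers to.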
This paper presents a hybrid approach to question answering in the clinical domain that combines techniques from summarization and information retrieval. We tackle a frequently occurring class of questions that takes the form “What is the best drug treatment for X?” Starting from an initial set of MEDLINE citations, our system first identifies the drugs under study. Abstracts are then clustered using semantic classes from the UMLS ontology. Finally, a short extractive summary is generated for each abstract to populate the clusters. ...
We present a clustering algorithm for Arabic words sharing the same root. Root-based clusters can substitute for dictionaries in indexing for IR. Modifying Adamson and Boreham (1974), our two-stage algorithm applies light stemming before calculating word-pair similarity coefficients using techniques sensitive to Arabic morphology. Tests show successful treatment of infixes and clustering accuracy of up to 94.06% for unedited Arabic text samples, without the use of dictionaries.
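For context, the Adamson and Boreham (1974) baseline measures word-pair similarity with Dice's coefficient over the sets of unique character bigrams in each word. The sketch below shows only that unmodified baseline; it does not include the light stemming or the Arabic-specific adjustments described in the abstract, and the function names are hypothetical.

```python
def bigrams(word):
    """Set of unique adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient: 2 * shared bigrams / (bigrams in a + bigrams in b)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Words shorter than two characters have no bigrams to compare.
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))
```

For "statistics" and "statistical" the unique bigram sets have 7 and 8 members with 6 in common, giving 12/15 = 0.8. Word pairs whose coefficient exceeds a chosen threshold would then be placed in the same cluster.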
In this paper we present a method to group adjectives according to their meaning, as a first step towards the automatic identification of adjectival scales. We discuss the properties of adjectival scales and of groups of semantically related adjectives and how they imply sources of linguistic knowledge in text corpora. We describe how our system exploits this linguistic knowledge to compute a measure of similarity between two adjectives, using statistical techniques and without having access to any semantic information about the adjectives. ...
Semantic clusters of a domain form an important feature that can be useful for performing syntactic and semantic disambiguation. Several attempts have been made to extract the semantic clusters of a domain by probabilistic or taxonomic techniques. However, not much progress has been made in evaluating the obtained semantic clusters. This paper focuses on an evaluation mechanism that can be used to evaluate semantic clusters produced by a system against those provided by human experts.
When you have completed this chapter, you will be able to: organize raw data into a frequency distribution; produce a histogram, a frequency polygon, and a cumulative frequency polygon from quantitative data; develop and interpret a stem-and-leaf display; present qualitative data using graphical techniques such as a clustered bar chart, a stacked bar chart, and a pie chart; detect graphic deceptions and use a graph to present data with clarity, precision, and efficiency.
The need for more rigorous and systematic research in public administration has grown as the
complexity of problems in government and nonprofit organizations has increased. This book
describes and explains the use of research methods that will strengthen the research efforts of
those solving government and nonprofit problems.
This book is aimed primarily at those studying research methods in master's- and doctoral-level
courses in curricula that concern the public and nonprofit sectors.
The use of ethanol for fuel was widespread in Europe and the United States
until the early 1900s (Illinois Corn Growers’ Association/Illinois Corn
Marketing Board). Because it became more expensive to produce than
petroleum-based fuel, especially after World War II, ethanol’s potential was
largely ignored until the Arab oil embargo of the 1970s. One response to the
embargo was increased use of the fuel extender “gasohol” (or E-10), a
mixture of one part ethanol made from corn mixed with nine parts gasoline.
The macadamia tree (often called Mac-ca) is a crop of high economic value. Macadamia is the common name for nine species of the genus Macadamia in the family Proteaceae, of which two have commercial value: M. integrifolia Maiden & Betche and M. tetraphylla L. Johnson. Both species were grown in the eastern coastal regions of Australia, from southern and south-eastern Queensland to northern New South Wales.