This paper discusses two problems that arise in the Generation of Referring Expressions: (a) numeric-valued attributes, such as size or location; (b) perspective-taking in reference. Both problems, it is argued, can be resolved if some structure is imposed on the available knowledge prior to content determination. We describe a clustering algorithm which is sufficiently general to be applied to these diverse problems, discuss its application, and evaluate its performance. ...
We present a clustering algorithm for Arabic words sharing the same root. Root-based clusters can substitute for dictionaries in indexing for IR. Modifying Adamson and Boreham (1974), our two-stage algorithm applies light stemming before calculating word-pair similarity coefficients using techniques sensitive to Arabic morphology. Tests show successful treatment of infixes and clustering accuracy of up to 94.06% on unedited Arabic text samples, without the use of dictionaries.
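The word-pair similarity step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the coefficient is the Dice coefficient over shared unique character bigrams, which is the measure usually associated with Adamson and Boreham (1974); the light stemming and the Arabic-specific adjustments described in the abstract are not reproduced here, and the function names `bigrams` and `dice` are illustrative only.

```python
def bigrams(word):
    """Return the set of unique adjacent character pairs in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient over unique character bigrams of two words.

    Assumption: this is the plain bigram measure, not the paper's
    morphology-sensitive variant.
    """
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Words shorter than two characters have no bigrams to compare.
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


if __name__ == "__main__":
    # "night" and "nacht" share one bigram ("ht") out of 4 + 4.
    print(dice("night", "nacht"))
```

Word pairs whose coefficient exceeds a chosen threshold would then be placed in the same cluster; the threshold itself is a tuning choice that the abstract does not state.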
A collection of international scientific research reports in chemistry, offered as reference reading for chemistry enthusiasts. Topic: Research Article: An Energy Consumption Optimized Clustering Algorithm for Radar Sensor Networks Based on an Ant Colony Algorithm
A collection of biology research reports published in the Journal of Biology. Topic: Research Article: A New-Fangled FES-k-Means Clustering Algorithm for Disease Discovery and Visual Analytics
Coreferencing entities across documents in a large corpus enables advanced document understanding tasks such as question answering. This paper presents a novel cross-document coreference approach that leverages the profiles of entities which are constructed by using information extraction tools and reconciled by using a within-document coreference module. We propose to match the profiles by using a learned ensemble distance function comprised of a suite of similarity specialists.
Our paper reports an attempt to apply an unsupervised clustering algorithm to a Hungarian treebank in order to obtain semantic verb classes. Starting from the hypothesis that semantic metapredicates underlie verbs’ syntactic realization, we investigate how one can obtain semantically motivated verb classes by automatic means. The 150 most frequent Hungarian verbs were clustered on the basis of their complementation patterns, yielding a set of basic classes and hints about the features that determine verbal subcategorization. ...
To cluster textual sequence types (discourse types/modes) in French texts, a K-means algorithm with high-dimensional embeddings and a fuzzy clustering algorithm were applied to clauses whose POS (part-of-speech) n-gram profiles had previously been extracted. Uni-, bi- and trigrams were used on four 19th-century French short stories by Maupassant. For the high-dimensional embeddings, power transformations on the chi-squared distances between clauses were explored.
In statistical language modeling, one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes. In this paper we investigate the effects of applying such a technique to higher-order n-gram models trained on large corpora.
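As a worked illustration of partitioning the vocabulary into equivalence classes, the standard class-based bigram factorization (in the style of Brown et al., 1992) is shown here. The abstract above does not state the exact model used, so this form is an assumption; $c(w)$ denotes the class assigned to word $w$.

```latex
P(w_i \mid w_{i-1}) \approx P\bigl(w_i \mid c(w_i)\bigr)\, P\bigl(c(w_i) \mid c(w_{i-1})\bigr)
```

With a vocabulary of $V$ words mapped to $C \ll V$ classes, the transition table shrinks from roughly $V^2$ to $C^2$ parameters (plus $V$ word-given-class probabilities), which is how the class mapping counters data sparsity.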
The EM clustering algorithm (Hofmann and Puzicha, 1998) used here is an unsupervised machine learning algorithm that has been applied in many NLP tasks, such as inducing a semantically labeled lexicon and determining lexical choice in machine translation (Rooth et al., 1998), automatic acquisition of verb semantic classes (Schulte im Walde, 2000) and automatic semantic labeling (Gildea and Jurafsky, 2002).
As digital libraries and the World Wide Web (WWW) continue to grow exponentially,
the ability to find useful information will greatly depend on the associated
underlying framework of the indexing infrastructure or search engine. The push to
get information on-line must be mediated by the design of automated techniques for
extracting that information for a variety of users and needs. What algorithms and
software environments are plausible for achieving both accuracy and speed in text
Creates nested clusters
Agglomerative clustering algorithms vary in terms of how the proximity of two clusters is computed
MIN (single link): susceptible to noise/outliers
MAX/GROUP AVERAGE: may not work well with non-globular clusters
CURE algorithm tries to handle both problems
Often starts with a proximity matrix
A type of graph-based algorithm
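The linkage choices listed above can be sketched as a naive agglomerative procedure. This is a minimal illustration on one-dimensional points, not an optimized or canonical implementation; the function name `agglomerative` and its parameters are hypothetical, and CURE is not covered.

```python
from itertools import combinations


def agglomerative(points, k, linkage="single"):
    """Merge the two closest clusters repeatedly until k clusters remain.

    `points` is a list of numbers; clusters are returned as lists of
    indices into `points`. Assumption: absolute difference is the
    point-to-point proximity.
    """
    def cluster_distance(c1, c2):
        pairwise = [abs(points[i] - points[j]) for i in c1 for j in c2]
        if linkage == "single":    # MIN: closest pair of members
            return min(pairwise)
        if linkage == "complete":  # MAX: farthest pair of members
            return max(pairwise)
        if linkage == "average":   # GROUP AVERAGE: mean over all pairs
            return sum(pairwise) / len(pairwise)
        raise ValueError(f"unknown linkage: {linkage}")

    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # Pick the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    pts = [0, 2, 4.5, 7.5, 11]  # gaps grow steadily: 2, 2.5, 3, 3.5
    print(agglomerative(pts, 2, "single"))
    print(agglomerative(pts, 2, "complete"))
```

On this input, single link chains the first four points together and leaves the last one alone, while complete link produces a more balanced split. That difference is a small-scale picture of the chaining behaviour that makes MIN sensitive to noise.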
We revisit the algorithm of Schütze (1995) for unsupervised part-of-speech tagging. The algorithm uses reduced-rank singular value decomposition followed by clustering to extract latent features from context distributions. As implemented here, it achieves state-of-the-art tagging accuracy at considerably less cost than more recent methods.
In this paper we describe an unsupervised method for semantic role induction which holds promise for relieving the data acquisition bottleneck associated with supervised role labelers. We present an algorithm that iteratively splits and merges clusters representing semantic roles, thereby leading from an initial clustering to a final clustering of better quality.
We combine multiple word representations based on semantic clusters extracted with the algorithm of Brown et al. (1992) and syntactic clusters obtained from the Berkeley parser (Petrov et al., 2006) in order to improve discriminative dependency parsing in the MSTParser framework (McDonald et al., 2005).
We present a simple and scalable algorithm for clustering tens of millions of phrases and use the resulting clusters as features in discriminative classifiers. To demonstrate the power and generality of this approach, we apply the method in two very different applications: named entity recognition and query classification. Our results show that phrase clusters offer significant improvements over word clusters. Our NER system achieves the best current result on the widely used CoNLL benchmark.
We present a novel framework for the discovery and representation of general semantic relationships that hold between lexical items. We propose that each such relationship can be identified with a cluster of patterns that captures this relationship. We give a fully unsupervised algorithm for pattern cluster discovery, which searches, clusters and merges high-frequency words-based patterns around randomly selected hook words. Pattern clusters can be used to extract instances of the corresponding relationships. ...
Effectively identifying events in unstructured text is a very difficult task. This is largely due to the fact that an individual event can be expressed by several sentences. In this paper, we investigate the use of clustering methods for the task of grouping the text spans in a news article that refer to the same event. The key idea is to cluster the sentences, using a novel distance metric that exploits regularities in the sequential structure of events within a document.
In this paper, we explore the power of randomized algorithms to address the challenge of working with very large amounts of data. We apply these algorithms to generate noun similarity lists from 70 million pages. We reduce the running time from quadratic to practically linear in the number of elements to be computed.
We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence data, and using the acquired word classes to improve the accuracy of syntactic disambiguation. We view this problem as that of estimating a joint probability distribution specifying the joint probabilities of word pairs, such as noun verb pairs. We propose an efficient algorithm based on the Minimum Description Length (MDL) principle for estimating such a probability distribution. Our method is a natural extension of those proposed in (Brown et al.