We present a novel approach to the automatic acquisition of a Verbnet like classiﬁcation of French verbs which involves the use (i) of a neural clustering method which associates clusters with features, (ii) of several supervised and unsupervised evaluation metrics and (iii) of various existing syntactic and semantic lexical resources. We evaluate our approach on an established test set and show that it outperforms previous related work with an Fmeasure of 0.70.
Query expansion is an effective technique to improve the performance of information retrieval systems. Although hand-crafted lexical resources, such as WordNet, could provide more reliable related terms, previous studies showed that query expansion using only WordNet leads to very limited performance improvement. One of the main challenges is how to assign appropriate weights to expanded terms.
This paper describes an on-going project concerning with an ontological lexical resource based on the abundant conceptual information grounded on Chinese characters. The ultimate goal of this project is set to construct a cognitively sound and computationally effective character-grounded machine-understandable resource. Philosophically, Chinese ideogram has its ontological status, but its applicability to the NLP task has not been expressed explicitly in terms of language resource.
Lexicon definition is one of the main bottlenecks in the development of new applications in the field of Information Extraction from text. Generic resources (e.g., lexical databases) are promising for reducing the cost of specific lexica definition, but they introduce lexical ambiguity. This paper proposes a methodology for building application-specific lexica by using WordNet. Lexical ambiguity is kept under control by marking synsets in WordNet with field labels taken from the Dewey Decimal Classification. tion requirement.
database maintained by the National Library of Medicine1 (NLM), which incorporates around 40,000 Health Sciences papers each month. Researchers depend on these electronic resources to keep abreast of their rapidly changing field. In order to maintain and update vital indexing references such as the Unified Medical Language System (UMLS) resources, the MeSH and SPECIALIST vocabularies, the NLM staff needs to review 400,000 highly-technical papers each year.
Recognizing entailment at the lexical level is an important and commonly-addressed component in textual inference. Yet, this task has been mostly approached by simpliﬁed heuristic methods. This paper proposes an initial probabilistic modeling framework for lexical entailment, with suitable EM-based parameter estimation. Our model considers prominent entailment factors, including differences in lexical-resources reliability and the impacts of transitivity and multiple evidence.
Dictionaries are now commonly used resources in NLP systems. However, different lexical resources are not uniform; they contain different types of information and do not assign words the same number of senses. One way in which this problem might be tackled is by producing mappings between the senses of different resources, the "dictionary mapping problem". However, this is a non-trivial problem, as examination of existing lexical resources demonstrates.
Thesauri and ontologies provide important value in facilitating access to digital archives by representing underlying principles of organization. Translation of such resources into multiple languages is an important component for providing multilingual access. However, the specificity of vocabulary terms in most ontologies precludes fully-automated machine translation using general-domain lexical resources. In this paper, we present an efficient process for leveraging human translations when constructing domain-specific lexical resources.
In this paper we present a methodology for extracting subcategorisation frames based on an automatic LFG f-structure annotation algorithm for the Penn-II Treebank. We extract abstract syntactic function-based subcategorisation frames (LFG semantic forms), traditional CFG categorybased subcategorisation frames as well as mixed function/category-based frames, with or without preposition information for obliques and particle information for particle verbs.
Substantial formal grammatical and lexical resources exist in various NLP systems and in the form of textbook specifications. In the present paper we report on experimental results obtained in manual, semi-antomatic and automatic migration of entire computational or textbook descriptions (as opposed to a more informal reuse of ideas or the design of a single "polytheoretic" representation) from a variety of formalisms into the ALEP formalism.
This paper describes automatic techniques for mapping 9611 entries in a database of English verbs to WordNet senses. The verbs were initially grouped into 491 classes based on syntactic features. Mapping these verbs into WordNet senses provides a resource that supports disambiguation in multilingual applications such as machine translation and cross-language information retrieval.
Monolingual translation probabilities have recently been introduced in retrieval models to solve the lexical gap problem. They can be obtained by training statistical translation models on parallel monolingual corpora, such as question-answer pairs, where answers act as the “source” language and questions as the “target” language. In this paper, we propose to use as a parallel training dataset the deﬁnitions and glosses provided for the same term by different lexical semantic resources.
This paper introduces a machine learning method based on bayesian networks which is applied to the mapping between deep semantic representations and lexical semantic resources. A probabilistic model comprising Minimal Recursion Semantics (MRS) structures and lexicalist oriented semantic features is acquired. Lexical semantic roles enriching the MRS structures are inferred, which are useful to improve the accuracy of deep semantic parsing.
This paper describes a reader-based experiment on lexical cohesion, detailing the task given to readers and the analysis of the experimental data. We conclude with discussion of the usefulness of the data in future research on lexical cohesion. Cohesive ties between items in a text draw on the resources of a language to build up the text’s unity (Halliday and Hasan, 1976). Lexical cohesive ties draw on the lexicon, i.e. word meanings.
This paper presents our work on accumulation o f lexical sets which includes acquisition o f dictionary resources and production o f new lexical sets from this. The method for the acquisition, using a context-free syntax-directed translator and text modification techniques, proves easy-to-use, flexible, and efficient. Categories of production are analyzed, and basic operations are proposed which make up a formalism for specifying and doing production.
A lexicon is an essential component in a generation system but few efforts have been made to build a rich, large-scale lexicon and make it reusable for different generation applications. In this paper, we describe our work to build such a lexicon by combining multiple, heterogeneous linguistic resources which have been developed for other purposes. Novel transformation and integration of resources is required to reuse them for generation.
We present U BY, a large-scale lexicalsemantic resource combining a wide range of information from expert-constructed and collaboratively constructed resources for English and German. It currently contains nine resources in two languages: English WordNet, Wiktionary, Wikipedia, FrameNet and VerbNet, German Wikipedia, Wiktionary and GermaNet, and multilingual OmegaWiki modeled according to the LMF standard. For FrameNet, VerbNet and all collaboratively constructed resources, this is done for the ﬁrst time.
Widely accepted resources for semantic parsing, such as PropBank and FrameNet, are not perfect as a semantic role labeling framework. Their semantic roles are not strictly deﬁned; therefore, their meanings and semantic characteristics are unclear. In addition, it is presupposed that a single semantic role is assigned to each syntactic argument. This is not necessarily true when we consider internal structures of verb semantics. We propose a new framework for semantic role annotation which solves these problems by extending the theory of lexical conceptual structure (LCS). ...
This paper deals with multilingual database generation from parallel corpora. The idea is to contribute to the enrichment of lexical databases for languages with few linguistic resources. Our approach is endogenous: it relies on the raw texts only, it does not require external linguistic resources such as stemmers or taggers. The system produces alignments for the 20 European languages of the ‘Acquis Communautaire’ Corpus.
We describe the ongoing construction of a large, semantically annotated corpus resource as reliable basis for the largescale acquisition of word-semantic information, e.g. the construction of domainindependent lexica. The backbone of the annotation are semantic roles in the frame semantics paradigm. We report experiences and evaluate the annotated data from the ﬁrst project stage. On this basis, we discuss the problems of vagueness and ambiguity in semantic annotation.