In this paper we present research in which we apply (i) the kind of intrinsic evaluation metrics that are characteristic of current comparative HLT evaluation, and (ii) extrinsic, human task-performance evaluations more in keeping with NLG traditions, to 15 systems implementing a language generation task. We analyse the evaluation results and find that there are no significant correlations between intrinsic and extrinsic evaluation measures for this task.
Previous studies evaluate simulated dialog corpora using evaluation measures which can be automatically extracted from the dialog systems' logs. However, the validity of these automatic measures has not been fully proven. In this study, we first recruit human judges to assess the quality of three simulated dialog corpora and then use human judgments as the gold standard to validate the conclusions drawn from the automatic measures. We observe that it is hard for the human judges to reach good agreement when asked to rate the quality of the dialogs from given perspectives.
This paper proposes a framework for training Conditional Random Fields (CRFs) to optimize multivariate evaluation measures, including non-linear measures such as F-score. Our proposed framework is derived from an error minimization approach that provides a simple solution for directly optimizing any evaluation measure. Specifically focusing on sequential segmentation tasks, i.e. text chunking and named entity recognition, we introduce a loss function that closely reflects the target evaluation measure for these tasks, namely, segmentation F-score. ...
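Segmentation F-score, the target measure named here, is the harmonic mean of precision and recall over exactly matching labeled spans. A minimal sketch of the metric itself (not the paper's loss function), over invented NER spans:

```python
def segment_fscore(gold, pred):
    """Exact-match F-score over (start, end, label) spans, as used in
    chunking and NER evaluation."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # spans matched exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one predicted span has the wrong entity type
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
pred = [(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")]
print(round(segment_fscore(gold, pred), 3))
```

Because F-score is a non-linear function of whole-sequence counts, it does not decompose over individual tag decisions, which is what makes direct optimization non-trivial for CRFs.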
Sentence compression is the task of producing a summary at the sentence level. This paper focuses on three aspects of this task which have not received detailed treatment in the literature: training requirements, scalability, and automatic evaluation. We provide a novel comparison between a supervised constituent-based and a weakly supervised word-based compression algorithm and examine how these models port to different domains (written vs. spoken text).
Error analysis in machine translation is a necessary step in order to investigate the strengths and weaknesses of the MT systems under development and allow fair comparisons among them. This work presents an application that shows how a set of heterogeneous automatic metrics can be used to evaluate a test bed of automatic translations. To do so, we have set up an online graphical interface for the ASIYA toolkit, a rich repository of evaluation measures working at different linguistic levels. ...
Dependency parsing is a central NLP task. In this paper we show that the common evaluation for unsupervised dependency parsing is highly sensitive to problematic annotations. We show that for three leading unsupervised parsers (Klein and Manning, 2004; Cohen and Smith, 2009; Spitkovsky et al., 2010a), a small set of parameters can be found whose modification yields a significant improvement in standard evaluation measures. These parameters correspond to local cases where no linguistic consensus exists as to the proper gold annotation. ...
We evaluate measures of contextual fitness on the task of detecting real-word spelling errors. For that purpose, we extract naturally occurring errors and their contexts from the Wikipedia revision history. We show that such natural errors are better suited for evaluation than the previously used artificially created errors. In particular, the precision of statistical methods has been largely over-estimated, while the precision of knowledge-based approaches has been under-estimated.
We present a large-scale meta evaluation of eight evaluation measures for both single-document and multi-document summarizers. To this end we built a corpus consisting of (a) 100 million automatic summaries using six summarizers and baselines at ten summary lengths in both English and Chinese, (b) more than 10,000 manual abstracts and extracts, and (c) 200 million automatic document and summary retrievals using 20 queries.
In this paper we argue that comparative evaluation in anaphora resolution has to be performed using the same pre-processing tools and on the same set of data. The paper proposes an evaluation environment for comparing anaphora resolution algorithms which is illustrated by presenting the results of the comparative evaluation of three methods on the basis of several evaluation measures.
Most state-of-the-art evaluation measures for machine translation assign high costs to movements of word blocks. In many cases, though, such movements still result in correct or almost correct sentences. In this paper, we will present a new evaluation measure which explicitly models block reordering as an edit operation. Our measure can be exactly calculated in quadratic time. Furthermore, we will show how some evaluation measures can be improved by the introduction of word-dependent substitution costs. ...
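For contrast, plain word-level Levenshtein distance (the kind of baseline such measures extend) charges a full set of deletions and insertions for every moved block. A sketch of that baseline, without the block-reordering edit operation the abstract proposes:

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance with uniform unit costs for
    insertion, deletion, and substitution. No block-move operation."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete hyp word
                          d[i][j - 1] + 1,        # insert ref word
                          d[i - 1][j - 1] + sub)  # substitute or match
    return d[m][n]

ref = "the cat sat on the mat".split()
hyp = "on the mat the cat sat".split()
print(word_edit_distance(hyp, ref))
```

Here the hypothesis is just the reference with two halves swapped, yet the baseline charges 6 edits for 6 words; a measure with an explicit block-reordering operation would assign such a swap a much lower cost.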
How can we evaluate the performance of a portfolio manager? It turns out that even average portfolio return is not as straightforward to measure as it might seem. In addition, adjusting average returns for risk presents a host of other problems. In this chapter, we begin with the measurement of portfolio returns. From there we move on to conventional approaches to risk adjustment. We identify the problems with these approaches when applied in various situations.
Excess abdominal fat, assessed by measurement of waist circumference or waist-to-hip ratio, is independently associated with higher risk for diabetes mellitus and cardiovascular disease. Measurement of the waist circumference is a surrogate for visceral adipose tissue and should be performed in the horizontal plane above the iliac crest. Cut points that define higher risk for men and women based on ethnicity have been proposed by the International Diabetes Federation (Table 75-3).
Having a vision, a mission, and a passion are invariably seen as conditions for success. The 1995 U.S. Department of Health and Human Services (DHHS) concept of a Metropolitan Medical Response System (MMRS) demonstrated that the leaders of DHHS had a vision for an effective response to a mass-casualty terrorism incident with a weapon of mass destruction. The mission was to expand the experimental model of the Metropolitan Medical Strike Team (MMST) established in Washington, D.C., and neighboring counties into a national ...
Studies assessing rating scales are very common in psychology and related fields, but are rare in NLP. In this paper we assess discrete and continuous scales used for measuring quality assessments of computer-generated language. We conducted six separate experiments designed to investigate the validity, reliability, stability, interchangeability and sensitivity of discrete vs. continuous scales. We show that continuous scales are viable for use in language evaluation, and offer distinct advantages over discrete scales. ...
This study has two main goals: to construct a measuring instrument with which to evaluate the structural quality of faceted thesauri; and to determine the validity and reliability of this measuring instrument. The measuring instrument consists of goals, objectives, and criteria against which to measure the structural quality of faceted thesauri.
We present a human-robot dialogue system that enables a robot to work together with a human user to build wooden construction toys. We then describe a study in which naïve subjects interacted with this system under a range of conditions and then completed a user-satisfaction questionnaire. The results of this study provide a wide range of subjective and objective measures of the quality of the interactions.
This paper presents a probabilistic framework, QARLA, for the evaluation of text summarisation systems. The input of the framework is a set of manual (reference) summaries, a set of baseline (automatic) summaries and a set of similarity metrics between summaries. It provides i) a measure to evaluate the quality of any set of similarity metrics, ii) a measure to evaluate the quality of a summary using an optimal set of similarity metrics, and iii) a measure to evaluate whether the set of baseline summaries is reliable or may produce biased results. ...
In particular, we examine whether the discourse coherence found in an essay, as defined by a measure of relative proportion of Rough-Shift transitions, might be a significant contributor to the accuracy of computer-generated essay scores. Our positive finding validates the role of the Rough-Shift transition and suggests a route for exploring Centering Theory's practical applicability to writing evaluation and instruction.
This paper presents methods for a qualitative, unbiased comparison of lexical association measures and the results we have obtained for adjective-noun pairs and preposition-noun-verb triples extracted from German corpora. In our approach, we compare the entire list of candidates, sorted according to the particular measures, to a reference set of manually identified "true positives". We also show how estimates for the very large number of hapax legomena and double occurrences can be inferred from random samples.
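Pointwise mutual information (PMI) is one widely used association measure of the kind compared in such studies. A minimal sketch over invented adjective-noun counts; PMI's well-known bias toward rare pairs is visible in the ranking:

```python
import math
from collections import Counter

def pmi(pair_counts, left_counts, right_counts, n_pairs):
    """PMI for each (w1, w2) candidate: log2(P(w1,w2) / (P(w1)*P(w2))).
    Candidates are then ranked by score, as in association-measure comparisons."""
    scores = {}
    for (w1, w2), c in pair_counts.items():
        p_joint = c / n_pairs
        p_indep = (left_counts[w1] / n_pairs) * (right_counts[w2] / n_pairs)
        scores[(w1, w2)] = math.log2(p_joint / p_indep)
    return scores

# Hypothetical adjective-noun cooccurrence counts
pairs = Counter({("strong", "tea"): 30, ("strong", "man"): 5, ("hot", "tea"): 10})
adjs = Counter({"strong": 35, "hot": 10})
nouns = Counter({"tea": 40, "man": 5})
n = 45
ranked = sorted(pmi(pairs, adjs, nouns, n).items(), key=lambda kv: -kv[1])
print([p for p, _ in ranked])
```

Evaluating a measure then amounts to checking how many manually identified true positives appear near the top of its ranked candidate list.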
We study distributional similarity measures for the purpose of improving probability estimation for unseen cooccurrences. Our contributions are three-fold: an empirical comparison of a broad range of measures; a classification of similarity functions based on the information that they incorporate; and the introduction of a novel function that is superior at evaluating potential proxy distributions.
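Jensen-Shannon divergence is one commonly used distributional similarity function of the kind such comparisons include (the abstract does not name its measures; this is just an illustrative instance). A sketch over invented cooccurrence distributions:

```python
import math

def kl(p, q):
    """KL divergence D(p || q) over dicts sharing the same key set."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: symmetric, bounded, and defined even
    when one distribution assigns zero probability (smaller = more similar)."""
    words = set(p) | set(q)
    p = {w: p.get(w, 0.0) for w in words}
    q = {w: q.get(w, 0.0) for w in words}
    m = {w: 0.5 * (p[w] + q[w]) for w in words}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical cooccurrence distributions for two target words
p = {"drink": 0.5, "cup": 0.3, "leaf": 0.2}   # e.g. contexts of "tea"
q = {"drink": 0.4, "cup": 0.4, "bean": 0.2}   # e.g. contexts of "coffee"
print(round(jensen_shannon(p, q), 3))
```

For the unseen-cooccurrence application, a word's missing probability estimate can be smoothed using the distributions of its nearest neighbors under such a similarity function.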