[ Team LiB ] Recipe 3.6 Combining Data in Tables from Heterogeneous Data Sources Problem You want to create a report that is based on data from tables in more than one data source. Solution Use ad-hoc connector names in SQL statements.
Inspired by the incremental TER alignment, we re-designed the Indirect HMM (IHMM) alignment, which is one of the best hypothesis alignment methods for conventional MT system combination, in an incremental manner. One crucial problem of incremental alignment is to align a hypothesis to a confusion network (CN).
In this paper we examine how the differences in modelling between different data driven systems performing the same NLP task can be exploited to yield a higher accuracy than the best individual system. We do this by means of an experiment involving the task of morpho-syntactic wordclass tagging. Four well-known tagger generators (Hidden Markov Model, Memory-Based, Transformation Rules and Maximum Entropy) are trained on the same corpus data. After comparison, their outputs are combined using several voting strategies and second stage classifiers. ...
We present a novel method to improve word alignment quality and eventually the translation performance by producing and combining complementary word alignments for low-resource languages. Instead of focusing on the improvement of a single set of word alignments, we generate multiple sets of diversiﬁed alignments based on different motivations, such as linguistic knowledge, morphology and heuristics.
Recently confusion network decoding shows the best performance in combining outputs from multiple machine translation (MT) systems. However, overcoming different word orders presented in multiple MT systems during hypothesis alignment still remains the biggest challenge to confusion network-based MT system combination. In this paper, we compare four commonly used word alignment methods, namely GIZA++, TER, CLA and IHMM, for hypothesis alignment.
Although indexes may overlap, the output of an automatic indexer is generally presented as a fiat and unstructured list of terms. Our purpose is to exploit term overlap and embedding so as to yield a substantial qualitative and quantitative improvement in automatic indexing through concept combination. The increase in the volume of indexing is 10.5% for free indexing and 52.3% for controlled indexing. The resulting structure of the indexed corpus is a partial conceptual analysis.
This paper presents a method to combine a set of unsupervised algorithms that can accurately disambiguate word senses in a large, completely untagged corpus. Although most of the techniques for word sense resolution have been presented as stand-alone, it is our belief that full-fledged lexical ambiguity resolution should combine several information sources and techniques. The set of techniques have been applied in a combined way to disambiguate the genus terms of two machine-readable dictionaries (MRD), enabling us to construct complete taxonomies for Spanish and French. ...
Handwritten text recognition is a difficult problem in the field of pattern recognition. This paper focuses on two aspects of the work on recognizing isolated handwritten Vietnamese characters, including feature extraction and classifier combination. For the first task, based on the work in  we will present how to extract features for Vietnamese characters based on gradient, structural, and concavity characteristics of optical character images. For the second task, we first develop a general framework of classifier combination under the context of optical character recognition. ...
This paper extends the training and tuning regime for phrase-based statistical machine translation to obtain ﬂuent translations into morphologically complex languages (we build an English to Finnish translation system). Our methods use unsupervised morphology induction. Unlike previous work we focus on morphologically productive phrase pairs – our decoder can combine morphemes across phrase boundaries. Morphemes in the target language may not have a corresponding morpheme or word in the source language.
Unsupervised word alignment is most often modeled as a Markov process that generates a sentence f conditioned on its translation e. A similar model generating e from f will make different alignment predictions. Statistical machine translation systems combine the predictions of two directional models, typically using heuristic combination procedures like grow-diag-ﬁnal. This paper presents a graphical model that embeds two directional aligners into a single model.
The state-of-the-art system combination method for machine translation (MT) is based on confusion networks constructed by aligning hypotheses with regard to word similarities. We introduce a novel system combination framework in which hypotheses are encoded as a confusion forest, a packed forest representing alternative trees.
We present minimum Bayes-risk system combination, a method that integrates consensus decoding and system combination into a uniﬁed multi-system minimum Bayes-risk (MBR) technique. Unlike other MBR methods that re-rank translations of a single SMT system, MBR system combination uses the MBR decision rule and a linear combination of the component systems’ probability distributions to search for the minimum risk translation among all the ﬁnite-length strings over the output vocabulary.
We combine multiple word representations based on semantic clusters extracted from the (Brown et al., 1992) algorithm and syntactic clusters obtained from the Berkeley parser (Petrov et al., 2006) in order to improve discriminative dependency parsing in the MSTParser framework (McDonald et al., 2005).
In this paper, we present a new word alignment combination approach on language pairs where one language has no explicit word boundaries. Instead of combining word alignments of different models (Xiang et al., 2010), we try to combine word alignments over multiple monolingually motivated word segmentation. Our approach is based on link confidence score defined over multiple segmentations, thus the combined alignment is more robust to inappropriate word segmentation.
The standard set of rules deﬁned in Combinatory Categorial Grammar (CCG) fails to provide satisfactory analyses for a number of syntactic structures found in natural languages. These structures can be analyzed elegantly by augmenting CCG with a class of rules based on the combinator D (Curry and Feys, 1958). We show two ways to derive the D rules: one based on unary composition and the other based on a logical characterization of CCG’s rule base (Baldridge, 2002).
This paper presents an innovative, complex approach to semantic verb classiﬁcation that relies on selectional preferences as verb properties. The probabilistic verb class model underlying the semantic classes is trained by a combination of the EM algorithm and the MDL principle, providing soft clusters with two dimensions (verb senses and subcategorisation frames with selectional preferences) as a result. A language-model-based evaluation shows that after 10 training iterations the verb class model results are above the baseline results.
This paper proposes to solve the bottleneck of finding training data for word sense disambiguation (WSD) in the domain of web queries, where a complete set of ambiguous word senses are unknown. In this paper, we present a combination of active learning and semi-supervised learning method to treat the case when positive examples, which have an expected word sense in web search result, are only given. The novelty of our approach is to use “pseudo negative examples” with reliable confidence score estimated by a classifier trained with positive and unlabeled examples.