Each textbook provides a speciﬁc perspective on the discipline that it aims
to introduce. Therefore, writing this book has not only been a challenge for
me because of the didactic demands that each textbook imposes on its
writer. It also forced me to rethink my own ideas on morphology in
confrontation with those of others, and to come up with a consistent
picture of what morphology is about. This perspective is summarized by
the title of this book, The Grammar of Words, which gives the linguistic
entity of the word a pivotal role in understanding morphology....
The wide geographical distribution of the rice
plant ( Oryza sativa L.) and its long history of
cultivation in Asian countries have led to the development
of a great diversity of varietal types.
Similarly, workers in various rice-growing countries
use different terms to designate identical
morphological and physiological characters, agronomic
traits, gene symbols, and cultural practices.
Whereas varietal diversity in germ plasm is desired
in rice breeding, variations in nomenclature hinder
scientific communication among the workers.
The aim of this work is to present some preliminary results of an investigation in course on the typology of the morphology of the native South American languages from the point of view of the formal language theory. With this object, we give two contrasting examples of descriptions of two Aboriginal languages ﬁnite verb forms morphology: Argentinean Quechua (quichua santiague˜ o) and Toba.
This paper presents an exponential model for translation into highly inﬂected languages which can be scaled to very large datasets. As in other recent proposals, it predicts targetside phrases and can be conditioned on sourceside context. However, crucially for the task of modeling morphological generalizations, it estimates feature parameters from the entire training set rather than as a collection of separate classiﬁers.
Arabic handwriting recognition (HR) is a challenging problem due to Arabic’s connected letter forms, consonantal diacritics and rich morphology. In this paper we isolate the task of identiﬁcation of erroneous words in HR from the task of producing corrections for these words. We consider a variety of linguistic (morphological and syntactic) and non-linguistic features to automatically identify these errors. Our best approach achieves a roughly ∼15% absolute increase in F-score over a simple but reasonable baseline. ...
Most previous studies of morphological disambiguation and dependency parsing have been pursued independently. Morphological taggers operate on n-grams and do not take into account syntactic relations; parsers use the “pipeline” approach, assuming that morphological information has been separately obtained. However, in morphologically-rich languages, there is often considerable interaction between morphology and syntax, such that neither can be disambiguated without the other.
We explore the contribution of morphological features – both lexical and inﬂectional – to dependency parsing of Arabic, a morphologically rich language. Using controlled experiments, we ﬁnd that deﬁniteness, person, number, gender, and the undiacritzed lemma are most helpful for parsing on automatically tagged input.
Turkish is an agglutinative language with complex morphological structures, therefore using only word forms is not enough for many computational tasks. In this paper we analyze the effect of morphology in a Named Entity Recognition system for Turkish. We start with the standard word-level representation and incrementally explore the effect of capturing syntactic and contextual properties of tokens. Furthermore, we also explore a new representation in which roots and morphological features are represented as separate tokens instead of representing only words as tokens. ...
We report in this paper our work on accurately generating case markers and sufﬁxes in English-to-Hindi SMT. Hindi is a relatively free word-order language, and makes use of a comparatively richer set of case markers and morphological sufﬁxes for correct meaning representation. From our experience of large-scale English-Hindi MT, we are convinced that ﬂuency and ﬁdelity in the Hindi output get an order of magnitude facelift if accurate case markers and sufﬁxes are produced.
Morphological processes in Semitic languages deliver space-delimited words which introduce multiple, distinct, syntactic units into the structure of the input sentence. These words are in turn highly ambiguous, breaking the assumption underlying most parsers that the yield of a tree for a given sentence is known in advance. Here we propose a single joint model for performing both morphological segmentation and syntactic disambiguation which bypasses the associated circularity.
In this paper we report our work on building a POS tagger for a morphologically rich language- Hindi. The theme of the research is to vindicate the stand that- if morphology is strong and harnessable, then lack of training corpora is not debilitating. We establish a methodology of POS tagging which the resource disadvantaged (lacking annotated corpora) languages can make use of.
In this paper we present Morphy, an integrated tool for German morphology, part-ofspeech tagging and context-sensitive lemmatization. Its large lexicon of more than 320,000 word forms plus its ability to process German compound nouns guarantee a wide morphological coverage. Syntactic ambiguities can be resolved with a standard statistical part-of-speech tagger. By using the output of the tagger, the lemmatizer can determine the correct root even for ambiguous word forms. The complete package is freely available and can be downloaded from the World Wide Web. ...
We present experiments with part-ofspeech tagging for Bulgarian, a Slavic language with rich inﬂectional and derivational morphology. Unlike most previous work, which has used a small number of grammatical categories, we work with 680 morpho-syntactic tags. We combine a large morphological lexicon with prior linguistic knowledge and guided learning from a POS-annotated corpus, achieving accuracy of 97.98%, which is a signiﬁcant improvement over the state-of-the-art for Bulgarian.
Morphological lexica are often implemented on top of morphological paradigms, corresponding to different ways of building the full inﬂection table of a word. Computationally precise lexica may use hundreds of paradigms, and it can be hard for a lexicographer to choose among them. To automate this task, this paper introduces the notion of a smart paradigm. It is a metaparadigm, which inspects the base form and tries to infer which low-level paradigm applies. If the result is uncertain, more forms are given for discrimination.
If unsupervised morphological analyzers could approach the effectiveness of supervised ones, they would be a very attractive choice for improving MT performance on low-resource inﬂected languages. In this paper, we compare performance gains for state-of-the-art supervised vs. unsupervised morphological analyzers, using a state-of-theart Arabic-to-English MT system.
We present a novel scheme to apply factored phrase-based SMT to a language pair with very disparate morphological structures. Our approach relies on syntactic analysis on the source side (English) and then encodes a wide variety of local and non-local syntactic structures as complex structural tags which appear as additional factors in the training data. On the target side (Turkish), we only perform morphological analysis and disambiguation but treat the complete complex morphological tag as a factor, instead of separating morphemes. ...
We propose a novel approach to translating from a morphologically complex language. Unlike previous research, which has targeted word inﬂections and concatenations, we focus on the pairwise relationship between morphologically related words, which we treat as potential paraphrases and handle using paraphrasing techniques at the word, phrase, and sentence level.
Translating compounds is an important problem in machine translation. Since many compounds have not been observed during training, they pose a challenge for translation systems. Previous decompounding methods have often been restricted to a small set of languages as they cannot deal with more complex compound forming processes. We present a novel and unsupervised method to learn the compound parts and morphological operations needed to split compounds into their compound parts.
We present a class-based language model that clusters rare words of similar morphology together. The model improves the prediction of words after histories containing outof-vocabulary words. The morphological features used are obtained without the use of labeled data. The perplexity improvement compared to a state of the art Kneser-Ney model is 4% overall and 81% on unknown histories.
We present a pointwise approach to Japanese morphological analysis (MA) that ignores structure information during learning and tagging. Despite the lack of structure, it is able to outperform the current state-of-the-art structured approach for Japanese MA, and achieves accuracy similar to that of structured predictors using the same feature set. We also ﬁnd that the method is both robust to outof-domain data, and can be easily adapted through the use of a combination of partial annotation and active learning. ...