Each textbook provides a specific perspective on the discipline that it aims
to introduce. Writing this book has therefore been a challenge for me not
only because of the didactic demands that each textbook imposes on its
writer; it also forced me to rethink my own ideas on morphology in
confrontation with those of others, and to come up with a consistent
picture of what morphology is about. This perspective is summarized by
the title of this book, The Grammar of Words, which gives the linguistic
entity of the word a pivotal role in understanding morphology....
Arabic handwriting recognition (HR) is a challenging problem due to Arabic's connected letter forms, consonantal diacritics and rich morphology. In this paper we isolate the task of identification of erroneous words in HR from the task of producing corrections for these words. We consider a variety of linguistic (morphological and syntactic) and non-linguistic features to automatically identify these errors. Our best approach achieves an approximately 15% absolute increase in F-score over a simple but reasonable baseline. ...
We describe an approach to simultaneous tokenization and part-of-speech tagging that is based on separating the closed- and open-class items, and focusing on the likelihood of the possible stems of the open-class words. By encoding some basic linguistic information, the machine learning task is simplified, while achieving state-of-the-art tokenization results and competitive POS results, although with a reduced tag set and some evaluation difficulties.
Morphological processes in Semitic languages deliver space-delimited words which introduce multiple distinct syntactic units into the structure of the input sentence. These words are in turn highly ambiguous, breaking the assumption underlying most parsers that the yield of a tree for a given sentence is known in advance. Here we propose a single joint model for performing both morphological segmentation and syntactic disambiguation which bypasses the associated circularity.
We present experiments with part-of-speech tagging for Bulgarian, a Slavic language with rich inflectional and derivational morphology. Unlike most previous work, which has used a small number of grammatical categories, we work with 680 morpho-syntactic tags. We combine a large morphological lexicon with prior linguistic knowledge and guided learning from a POS-annotated corpus, achieving accuracy of 97.98%, which is a significant improvement over the state-of-the-art for Bulgarian.
A system for the automatic production of controlled index terms is presented using linguistically motivated techniques. This includes a finite-state part-of-speech tagger, a derivational morphological processor for analysis and generation, and a unification-based shallow-level parser using transformational rules over syntactic patterns. The contribution of this research is the successful combination of parsing over a seed term list coupled with derivational morphology to achieve greater coverage of multi-word terms for indexing and retrieval. ...
Concerning different approaches to automatic PoS tagging: EngCG-2, a constraint-based morphological tagger, is compared in a double-blind test with a state-of-the-art statistical tagger on a common disambiguation task using a common tag set. The experiments show that for the same amount of remaining ambiguity, the error rate of the statistical tagger is an order of magnitude greater than that of the rule-based one. The two related issues of priming effects compromising the results and disagreement between human annotators are also addressed. ...
Our poster presents results and experiences from the application of the system to 300,000 word forms, a subpart of a larger corpus. The application of the system is carried out in two steps, an automatic lexical look-up followed by homograph separation, which is done partly automatically, partly manually. Lexical and morphological analysis and disambiguation of Swedish is a rather complicated task, a fact which should hold for several other languages as well. Below a sample text is given, showing both the amount of information that has to be specified for each word form and the degree of...
Due to Arabic's morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best-known stemmers do not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using links from Wikipedia hypertext to page titles.
We address the problem of translating from morphologically poor to morphologically rich languages by adding per-word linguistic information to the source language. We use the syntax of the source sentence to extract information for noun cases and verb persons and annotate the corresponding words accordingly. In experiments, we show improved performance for translating from English into Greek and Czech. For English–Greek, we reduce the error on the verb conjugation from 19% to 5.4% and noun case agreement from 9% to 6%. ...
This paper proposes the application of finite-state approximation techniques on a unification-based grammar of word formation for a language like German. A refinement of an RTN-based approximation algorithm is proposed, which extends the state space of the automaton by selectively adding distinctions based on the parsing history at the point of entering a context-free rule. The selection of history items exploits the specific linguistic nature of word formation.
We describe a computational framework for a grammar architecture in which different linguistic domains such as morphology, syntax, and semantics are treated not as separate components but as compositional domains. The framework is based on Combinatory Categorial Grammars and it uses the morpheme as the basic building block of the categorial lexicon.
In this paper, we present a morphological processor for Modern Greek. From the linguistic point of view, we try to elucidate the complexity of the inflectional system using a lexical model which follows the recent work by Lieber (1980), Selkirk (1982), Kiparsky (1982), and others. The implementation is based on the concept of "validation grammars" (Courtin 1977). The morphological processing is controlled by a finite automaton and it combines (a) a dictionary containing the stems for a representative fragment of Modern Greek and all the inflectional affixes with (b)
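The automaton-controlled combination of a stem dictionary with inflectional affixes can be illustrated with a toy lookup procedure. The sketch below is hypothetical: the two-state control (stem, then affix), the lexicon entries, and the tags are invented stand-ins, not actual data or code from the Modern Greek processor.

```python
# Toy dictionary-plus-affix morphological lookup controlled by a two-state
# automaton: state 0 expects a stem, state 1 expects an inflectional affix.
# All lexicon entries and tags below are hypothetical illustrations.

STEMS = {"anthrop": "N-stem", "graf": "V-stem"}              # hypothetical stem lexicon
AFFIXES = {"os": "nom.sg", "ou": "gen.sg", "o": "1sg.pres"}  # hypothetical affix list

def analyze(word):
    """Return all (stem, stem_tag, affix, affix_tag) segmentations of `word`."""
    analyses = []
    for i in range(1, len(word)):            # try every stem | affix split point
        stem, affix = word[:i], word[i:]
        if stem in STEMS and affix in AFFIXES:  # automaton accepts: stem then affix
            analyses.append((stem, STEMS[stem], affix, AFFIXES[affix]))
    return analyses

print(analyze("anthropos"))  # [('anthrop', 'N-stem', 'os', 'nom.sg')]
```

A real system would replace the naive split loop with a finite automaton traversing the word once, but the accepted language (stem followed by affix) is the same.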
We present a notation for the declarative statement of morphological relationships and lexical rules, based on the traditional notion of Word and Paradigm (cf. Hockett 1954). The phenomenon of blocking arises from a generalized version of Kiparsky's (1973) Elsewhere Condition, stated in terms of ordering by subsumption over paradigms. Orthographic constraints on morphemic alternation are described by means of string equations (Siekmann 1975).
The correction method distinguishes between orthographic errors and typographical errors.
• Typographical errors (or mistypings) are non-cognitive errors which do not follow linguistic criteria.
• Orthographic errors are cognitive errors which occur when the writer does not know or has forgotten the correct spelling of a word. They are more persistent because of their cognitive nature, they leave a worse impression and, finally, their treatment is an interesting application for language standardization purposes. ...
1. The purpose of this paper is the establishment of classes of verbals according to the morphemic alternations of base-form finals;
2. Verbals which are subject to morphemic alternation are treated as single entries instead of as multiple entries;
We present a novel method to improve word alignment quality and eventually the translation performance by producing and combining complementary word alignments for low-resource languages. Instead of focusing on the improvement of a single set of word alignments, we generate multiple sets of diversified alignments based on different motivations, such as linguistic knowledge, morphology and heuristics.
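One common way to combine complementary alignment sets is to intersect them for precision and then grow the intersection with neighbouring links from the union. The sketch below shows that generic grow-from-intersection idea on invented data; it is a simplified illustration of the technique, not the paper's actual combination method.

```python
# Combine two word-alignment sets: intersect for a high-precision core, then
# grow with union links adjacent (including diagonally) to links already kept.
# A simplified grow-diag-style heuristic; the alignment data is invented.

def combine(a1, a2):
    """a1, a2: sets of (source_index, target_index) alignment links."""
    combined = set(a1 & a2)                  # high-precision core
    candidates = (a1 | a2) - combined        # complementary links to consider
    added = True
    while added:                             # keep growing until fixpoint
        added = False
        for (i, j) in sorted(candidates - combined):
            # admit a candidate if it neighbours an accepted link
            if any(max(abs(i - ci), abs(j - cj)) == 1 for (ci, cj) in combined):
                combined.add((i, j))
                added = True
    return combined

a_stat = {(0, 0), (1, 1), (2, 2), (3, 2)}    # e.g. a statistical aligner's links
a_morph = {(0, 0), (1, 1), (2, 3)}           # e.g. a morphology-aware aligner's links
print(sorted(combine(a_stat, a_morph)))
```

Here the diverging links (2, 2), (3, 2) and (2, 3) all survive because each neighbours the agreed-upon core, so the combined alignment keeps the complementary information from both aligners.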
Hungarian is a stereotypical example of a morphologically rich, non-configurational language. Here, we introduce results on dependency parsing of Hungarian that employ an 80K, multi-domain, fully manually annotated corpus, the Szeged Dependency Treebank. We show that the results achieved by state-of-the-art data-driven parsers on Hungarian and English (which is at the other end of the configurational–non-configurational spectrum) are quite similar to each other in terms of attachment scores.
We investigate the controversial issue of the upper bound of inter-judge agreement in the use of a low-level grammatical representation. Pessimistic views suggest that several percent of words in running text are undecidable in terms of part-of-speech categories. Our experiments with 55,000 words of data give reason for optimism: linguists with only 30 hours' training apply the EngCG-2 morphological tags with almost 100% inter-judge agreement.
Assamese is a morphologically rich, agglutinative and relatively free word order Indic language. Although spoken by nearly 30 million people, very little computational linguistic work has been done for this language. In this paper, we present our work on part-of-speech (POS) tagging for Assamese using the well-known Hidden Markov Model. Since no well-defined suitable tagset was available, we develop a tagset of 172 tags in consultation with experts in linguistics.
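HMM POS tagging of the kind described above amounts to Viterbi decoding over transition and emission probabilities estimated from an annotated corpus. The following is a minimal sketch with a two-tag toy model; the tags, words and probabilities are all invented for illustration, and a real tagger (for Assamese, with 172 tags) would estimate these tables from training data.

```python
# Toy Viterbi decoder for an HMM POS tagger. The model below is a hypothetical
# two-tag illustration; all probabilities are invented, not estimated from data.

import math

TAGS = ["N", "V"]
START = {"N": 0.7, "V": 0.3}              # P(tag at sentence start)
TRANS = {"N": {"N": 0.4, "V": 0.6},       # P(next tag | current tag)
         "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"fish": 0.6, "swim": 0.1},  # P(word | tag); unseen words get 1e-6
        "V": {"fish": 0.2, "swim": 0.7}}

def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    # best[t] = (log-prob of best path ending in tag t, that path as a list)
    best = {t: (math.log(START[t] * EMIT[t].get(words[0], 1e-6)), [t]) for t in TAGS}
    for w in words[1:]:
        best = {t: max(((lp + math.log(TRANS[p][t] * EMIT[t].get(w, 1e-6)), path + [t])
                        for p, (lp, path) in best.items()), key=lambda x: x[0])
                for t in TAGS}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["fish", "swim"]))  # ['N', 'V']
```

Log-probabilities avoid numerical underflow on longer sentences, and the small default emission probability stands in for the smoothing a real tagger would need for out-of-vocabulary words.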