Corpus use

Showing results 1-20 of 322 for Corpus use
  • The BTL aims to stimulate research and training in translation and interpreting studies. The Library provides a forum for a variety of approaches (which may sometimes be conflicting) in a socio-cultural, historical, theoretical, applied and pedagogical context. The Library includes scholarly works, reference books, postgraduate textbooks and readers in the English language. EST Subseries: The European Society for Translation Studies (EST) Subseries is a publication channel within the Library to optimize EST’s function as a forum for the translation and interpreting research community.

  • We investigate the automatic detection of sentences containing linguistic hedges using corpus statistics and syntactic patterns. We take Wikipedia as an already annotated corpus using its tagged weasel words which mark sentences and phrases as non-factual. We evaluate the quality of Wikipedia as training data for hedge detection, as well as shallow linguistic features.
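
A minimal sketch of the hedge-cue idea described above, assuming a hand-picked cue lexicon (the entries below are invented for illustration; the paper instead derives its supervision from Wikipedia's tagged weasel words):

```python
# Hypothetical toy cue lexicon -- illustrative assumption, not the paper's
# learned features.
HEDGE_CUES = {"some", "many", "often", "arguably", "possibly", "it is said"}

def is_hedged(sentence: str) -> bool:
    """Flag a sentence as hedged if it contains any cue (single-word cues
    matched as tokens, multi-word cues as substrings)."""
    lowered = sentence.lower()
    tokens = set(lowered.replace(",", " ").replace(".", " ").split())
    single = any(c in tokens for c in HEDGE_CUES if " " not in c)
    multi = any(c in lowered for c in HEDGE_CUES if " " in c)
    return single or multi
```

A matcher like this would serve as the kind of shallow linguistic baseline against which corpus statistics and syntactic patterns are evaluated.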

  • Word alignment using recency-vector based approaches has recently become popular. One major advantage of these techniques is that, unlike other approaches, they perform well even if the size of the parallel corpora is small. This makes these algorithms worth studying for languages where resources are scarce. In this work we studied the performance of two very popular recency-vector based approaches, proposed in (Fung and McKeown, 1994) and (Somers, 1998), respectively, for word alignment in an English-Hindi parallel corpus.
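
The recency-vector representation at the heart of these approaches can be sketched as follows; the distance function is a deliberately crude stand-in for the dynamic-time-warping style comparison used in practice:

```python
def recency_vector(word, tokens):
    """Gaps between successive occurrences of `word` in a token stream --
    the core representation of recency-vector alignment."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [b - a for a, b in zip(positions, positions[1:])]

def gap_distance(v1, v2):
    """Crude dissimilarity: mean absolute gap difference over the common
    prefix (real systems align vectors of unequal length more carefully)."""
    n = min(len(v1), len(v2))
    if n == 0:
        return float("inf")
    return sum(abs(a - b) for a, b in zip(v1, v2)) / n
```

Because the vectors depend only on occurrence positions, they can be computed from quite small corpora, which is the advantage the abstract highlights.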

  • The quality of the part-of-speech (PoS) annotation in a corpus is crucial for the development of PoS taggers. In this paper, we experiment with three complementary methods for automatically detecting errors in the PoS annotation for the Icelandic Frequency Dictionary corpus. The first two methods are language independent and we argue that the third method can be adapted to other morphologically complex languages. Once possible errors have been detected, we examine each error candidate and hand-correct the corresponding PoS tag if necessary. ...
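
One language-independent way to surface PoS error candidates, in the spirit of the methods above, is to look for identical word contexts that received different tags (a simplified variation-n-gram check; the paper's exact methods may differ):

```python
from collections import defaultdict

def find_suspect_tags(tagged_sents):
    """Collect word trigram contexts that received more than one PoS tag
    across the corpus; such variation is a classic error candidate."""
    context_tags = defaultdict(set)
    for sent in tagged_sents:
        words = [w for w, _ in sent]
        for i, (w, tag) in enumerate(sent):
            prev_w = words[i - 1] if i > 0 else "<s>"
            next_w = words[i + 1] if i + 1 < len(words) else "</s>"
            context_tags[(prev_w, w, next_w)].add(tag)
    return {ctx: sorted(tags)
            for ctx, tags in context_tags.items() if len(tags) > 1}
```

Each returned context is then a candidate for the hand-correction pass the abstract describes.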

  • This paper describes novel and practical Japanese parsers that use decision trees. First, we construct a single decision tree to estimate modification probabilities: how one phrase tends to modify another. Next, we introduce a boosting algorithm in which several decision trees are constructed and then combined for probability estimation. The two constructed parsers are evaluated using the EDR Japanese annotated corpus. The single-tree method outperforms the conventional Japanese stochastic methods by 4%. ...

  • A third criterion for effective research on skills transfer is study over time. To be certain that students are transferring skills from their first language rather than using skills learned in their second language, researchers must study subjects who have received reading instruction in their first language prior to receiving it in their second language, and who have received sufficient first-language instruction to have developed a base of first-language skills that can be transferred.

  • We present work on the automatic generation of short indicative-informative abstracts of scientific and technical articles. The indicative part of the abstract identifies the topics of the document, while the informative part elaborates on some topics according to the reader's interest by motivating the topics, describing entities and defining concepts. We defined our method of automatic abstracting by studying a corpus of professional abstracts. The method also considers the reader's interest as essential in the process of abstracting. ...

  • We report on an investigation of the pragmatic category of topic in Danish dialog and its correlation to surface features of NPs. Using a corpus of 444 utterances, we trained a decision tree system on 16 features. The system achieved near-human performance with success rates of 84–89% and F1-scores of 0.63–0.72 in 10-fold cross validation tests (human performance: 89% and 0.78). The most important features turned out to be preverbal position, definiteness, pronominalisation, and non-subordination. We discovered that NPs in epistemic matrix clauses (e.g. “I think . . .
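
A one-level illustration of how such a feature-based decision tree picks its most informative split (real systems grow full trees over all 16 features; the two feature names below are borrowed from the abstract, the data is invented):

```python
def best_stump(examples):
    """One-level 'decision tree': pick the boolean feature whose value (or
    its negation) best predicts the topic label by accuracy.
    `examples` is a list of (feature_dict, label) pairs."""
    n = len(examples)
    best_feature, best_acc = None, -1.0
    for f in examples[0][0]:
        correct = sum(feats[f] == label for feats, label in examples)
        acc = max(correct, n - correct) / n   # allow inverted polarity
        if acc > best_acc:
            best_feature, best_acc = f, acc
    return best_feature, best_acc
```

On data where preverbal position perfectly predicts topichood, the stump recovers exactly the feature the paper found most important.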

  • Web text has been successfully used as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we created a Web Corpus by downloading web pages to create a topic-diverse collection of 10 billion words of English. We show that for context-sensitive spelling correction the Web Corpus results are better than using a search engine. For thesaurus extraction, it achieved similar overall results to a corpus of newspaper text.
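
The count-based idea behind context-sensitive spelling correction can be sketched like this, with made-up trigram counts standing in for lookups against the 10-billion-word Web Corpus:

```python
# Invented counts for illustration only.
TRIGRAM_COUNTS = {
    ("over", "there", "."): 900,
    ("over", "their", "."): 40,
}

def correct_word(left, right, confusion_set, counts):
    """Pick the confusion-set member whose trigram with the surrounding
    context is most frequent in the corpus."""
    return max(confusion_set, key=lambda w: counts.get((left, w, right), 0))
```

The abstract's point is that counts from a downloaded, topic-diverse corpus beat search-engine hit counts for exactly this kind of query.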

  • From the Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 638–646, Jeju, Republic of Korea, 8–14 July 2012 (© 2012 Association for Computational Linguistics). Systems using sentiment analysis are affected, as sentiment-suggestive terms appearing in metalanguage (especially in quotation, a form of the phenomenon (Maier, 2007)) are not necessarily reflective of the writer or speaker.

  • We present a new edition of the Google Books Ngram Corpus, which describes how often words and phrases were used over a period of five centuries, in eight languages; it reflects 6% of all books ever published. This new edition introduces syntactic annotations: words are tagged with their part-of-speech, and head-modifier relationships are recorded. The annotations are produced automatically with statistical models that are specifically adapted to historical text.
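
Assuming the familiar tab-separated layout of ngram count files (ngram, year, match count, volume count; an assumption about the distribution format, not something the abstract states), one row might be parsed as:

```python
def parse_ngram_line(line):
    """Split one tab-separated counts row into typed fields. The column
    order (ngram, year, match_count, volume_count) is assumed here."""
    ngram, year, matches, volumes = line.rstrip("\n").split("\t")
    return ngram, int(year), int(matches), int(volumes)
```

The syntactic annotations mentioned above surface as suffixes on the ngram field (e.g. a part-of-speech tag appended to the word).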

  • Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus for each language.
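
One hypothetical way an independent monolingual corpus per language can help denoise a lexicon is to require that a candidate pair's relative frequencies be comparable; this frequency filter is illustrative only and is not the paper's algorithm:

```python
import math

def filter_lexicon(candidates, freq_src, freq_tgt, max_log_ratio=1.0):
    """Keep (src, tgt) pairs whose relative frequencies in two independent
    monolingual corpora are within a log-ratio threshold."""
    total_s = sum(freq_src.values())
    total_t = sum(freq_tgt.values())
    kept = []
    for s, t in candidates:
        p_s = freq_src.get(s, 0) / total_s
        p_t = freq_tgt.get(t, 0) / total_t
        if p_s > 0 and p_t > 0 and abs(math.log(p_s / p_t)) <= max_log_ratio:
            kept.append((s, t))
    return kept
```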

  • This article presents the main points in the creation of the French TimeBank (Bittar, 2010), a reference corpus annotated according to the ISO-TimeML standard for temporal annotation. A number of improvements were made to the markup language to deal with linguistic phenomena not yet covered by ISO-TimeML, including cross-language modifications and others specific to French. An automatic preannotation system was used to speed up the annotation process.

  • We present an approach of expanding parallel corpora for machine translation. By utilizing Semantic role labeling (SRL) on one side of the language pair, we extract SRL substitution rules from existing parallel corpus. The rules are then used for generating new sentence pairs. An SVM classifier is built to filter the generated sentence pairs. The filtered corpus is used for training phrase-based translation models, which can be used directly in translation tasks or combined with baseline models. ...
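
The substitution step can be illustrated with a toy function that swaps one SRL argument span for another argument of the same role; the half-open span indices and whitespace tokenization are simplifying assumptions:

```python
def substitute_argument(tokens, span, replacement):
    """Swap the tokens of one argument span [start, end) for another
    same-role argument, yielding a new synthetic sentence."""
    start, end = span
    return tokens[:start] + replacement + tokens[end:]
```

In the full pipeline, pairs generated this way are then filtered by the SVM classifier before being added to the training corpus.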

  • Translations from a parallel corpus implicitly deal with the granularity problem, as finer sense distinctions are only relevant insofar as they are lexicalized in the target translations. The approach also facilitates the integration of WSD into multilingual applications such as multilingual Information Retrieval (IR) or Machine Translation (MT).
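
The core idea that target translations lexicalize sense distinctions can be sketched by grouping source-word occurrences by their aligned translation (the example words in use are illustrative):

```python
def senses_from_translations(aligned_occurrences):
    """Group source-word occurrences by their aligned target translation;
    each distinct translation stands for one lexicalized sense."""
    senses = {}
    for occurrence_id, translation in aligned_occurrences:
        senses.setdefault(translation, []).append(occurrence_id)
    return senses
```

For English "bank" aligned to French, occurrences translated as "banque" and "rive" fall into two senses, while distinctions French does not lexicalize are never created.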

  • Most spoken dialogue systems are still lacking in their ability to accurately model the complex process that is human turn-taking. This research analyzes a human-human tutoring corpus in order to identify prosodic turn-taking cues, with the hopes that they can be used by intelligent tutoring systems to predict student turn boundaries. Results show that while there was variation between subjects, three features were significant turn-yielding cues overall. In addition, a positive relationship between the number of cues present and the probability of a turn yield was demonstrated. ...
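
The reported relationship between cue count and turn yield can be estimated from (cue_count, yielded) observations like so:

```python
from collections import defaultdict

def yield_probability_by_cues(observations):
    """Estimate P(turn yield | number of prosodic cues present) from
    (cue_count, yielded) observations."""
    totals, yields = defaultdict(int), defaultdict(int)
    for n_cues, yielded in observations:
        totals[n_cues] += 1
        yields[n_cues] += int(yielded)
    return {n: yields[n] / totals[n] for n in totals}
```

A monotonically increasing mapping from cue count to probability is the "positive relationship" the abstract refers to.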

  • An important and well-studied problem is the production of semantic lexicons from a large corpus. In this paper, we present a system named ASIA (Automatic Set Instance Acquirer), which takes in the name of a semantic class as input (e.g., “car makers”) and automatically outputs its instances (e.g., “ford”, “nissan”, “toyota”). ASIA is based on recent advances in web-based set expansion: the problem of finding all instances of a set given a small number of “seed” instances.
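
A toy version of seed-driven set expansion: absorb any harvested candidate list (web systems mine HTML lists and tables) that already contains enough seeds. The threshold and data here are illustrative, not ASIA's actual scoring:

```python
def expand_set(seeds, harvested_lists, min_overlap=2):
    """Add every member of a harvested list to the set when the list
    already contains at least `min_overlap` seed instances."""
    expanded = set(seeds)
    for items in harvested_lists:
        if len(set(items) & set(seeds)) >= min_overlap:
            expanded |= set(items)
    return expanded
```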

  • We describe an unsupervised approach to multi-document sentence-extraction based summarization for the task of producing biographies. We utilize Wikipedia to automatically construct a corpus of biographical sentences and TDT4 to construct a corpus of non-biographical sentences. We build a biographical-sentence classifier from these corpora and an SVM regression model for sentence ordering from the Wikipedia corpus. We evaluate our work on the DUC2004 evaluation data and with human judges.

  • When multiple conversations occur simultaneously, a listener must decide which conversation each utterance is part of in order to interpret and respond to it appropriately. We refer to this task as disentanglement. We present a corpus of Internet Relay Chat (IRC) dialogue in which the various conversations have been manually disentangled, and evaluate annotator reliability. This is, to our knowledge, the first such corpus for internet chat. We propose a graph-theoretic model for disentanglement, using discourse-based features which have not been previously applied to this task. ...
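
The graph view of disentanglement can be sketched with union-find over pairwise links between utterances; the linking rule below (same speaker or close timing) is a crude placeholder for the paper's learned, discourse-based features:

```python
def disentangle(utterances, max_gap=30):
    """Link (timestamp, speaker) utterances by shared speaker or close
    timing, then read off connected components as conversations."""
    n = len(utterances)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            (t1, s1), (t2, s2) = utterances[i], utterances[j]
            if s1 == s2 or abs(t2 - t1) <= max_gap:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```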

  • Finding temporal and causal relations is crucial to understanding the semantic structure of a text. Since existing corpora provide no parallel temporal and causal annotations, we annotated 1000 conjoined event pairs, achieving inter-annotator agreement of 81.2% on temporal relations and 77.8% on causal relations. We trained machine learning models using features derived from WordNet and the Google N-gram corpus, and they outperformed a variety of baselines, achieving an F-measure of 49.0 for temporals and 52.4 for causals. ...
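
For reference, the F-measure figures quoted above combine precision and recall as follows (scaled by 100 to match the paper's reporting):

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure from true positive, false positive and false
    negative counts, on a 0-100 scale."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100 * 2 * precision * recall / (precision + recall)
```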
