
Word tokenization

Showing results 1-20 of 30 for "Word tokenization"
  • In this paper, with these algorithms, artificially intelligent support becomes much more sophisticated, and people can communicate naturally with a robot, not only in English but in Vietnamese as well. We introduce the specification and the processing of intelligent interaction in natural Vietnamese.

    PDF, 8 pages · vidoctorstrange · 06-05-2023 · Download

  • Information retrieval techniques: Lecture 8. The main topics covered in this chapter include: parsing a document; complications of format and language; precision and recall; tokenization; numbers; language issues in tokenization; stop words;... Please refer to the content of the document.

    PPT, 16 slides · tieuvulinhhoa · 22-09-2022 · Download
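
    As a minimal illustration of two of the topics above, the sketch below is not from the lecture; the stop-word list and relevance sets are invented. It shows naive regex tokenization with stop-word removal and set-based precision/recall:

        import re

        STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # toy list

        def tokenize(text):
            # Lowercase, keep alphanumeric runs, drop stop words.
            tokens = re.findall(r"[a-z0-9]+", text.lower())
            return [t for t in tokens if t not in STOP_WORDS]

        def precision_recall(retrieved, relevant):
            # Set-based IR metrics over document IDs.
            retrieved, relevant = set(retrieved), set(relevant)
            hits = retrieved & relevant
            p = len(hits) / len(retrieved) if retrieved else 0.0
            r = len(hits) / len(relevant) if relevant else 0.0
            return p, r

        print(tokenize("The tokenization of a document"))  # ['tokenization', 'document']
        print(precision_recall({1, 2, 3, 4}, {2, 4, 5}))   # (0.5, 0.666...)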

  • Lecture Compiler construction: Lesson 10 - Sohail Aslam. The main topics covered in this chapter include: using a generated scanner, tokenizing the input, Flex input for C++, an ISO C++ lexical analyzer, and a front-end parser that checks the stream of words and their parts of speech for grammatical correctness,...

    PPT, 33 slides · youzhangjing_1909 · 28-04-2022 · Download
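
    A toy sketch of what such a generated scanner does (the token rules below are invented, not the lecture's Flex specification): match patterns in priority order and emit (kind, lexeme) pairs.

        import re

        TOKEN_SPEC = [          # ordered: earlier patterns win ties
            ("NUMBER", r"\d+"),
            ("IDENT",  r"[A-Za-z_]\w*"),
            ("OP",     r"[+\-*/=]"),
            ("SKIP",   r"\s+"),
        ]
        MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

        def scan(source):
            for m in MASTER.finditer(source):
                if m.lastgroup != "SKIP":
                    yield m.lastgroup, m.group()

        print(list(scan("x = 42 + y")))
        # [('IDENT', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('IDENT', 'y')]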

  • In this paper, we apply a tokenizer for Vietnamese text to build the lexicon; hence, each record of the lexicon may contain several single words. On the basis of this method, we decrease the size of the lexicon and improve the precision of search while maintaining the complexity of the process.

    PDF, 8 pages · viengland2711 · 23-07-2019 · Download
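
    The paper's exact algorithm is not reproduced here; a common scheme with the same effect is greedy longest-match over syllables against a multi-syllable lexicon (the sample entries are invented):

        LEXICON = {"sinh viên", "đại học", "thông tin"}  # invented entries
        MAX_LEN = 3  # longest lexicon entry, in syllables

        def tokenize_vi(sentence):
            syllables = sentence.split()
            words, i = [], 0
            while i < len(syllables):
                # Try the longest span first, then shrink to one syllable.
                for n in range(min(MAX_LEN, len(syllables) - i), 0, -1):
                    span = " ".join(syllables[i:i + n])
                    if n == 1 or span in LEXICON:
                        words.append(span)
                        i += n
                        break
            return words

        print(tokenize_vi("sinh viên đại học"))  # ['sinh viên', 'đại học']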

  • This paper reviews the background of lexical borrowing in the Vietnamese context and investigates the scale of borrowability of English tokens that occurred in magazine issues. The findings show that the syntactic system of the Vietnamese language has influenced how English word types are borrowed.

    PDF, 9 pages · miulovesmile4 · 19-11-2018 · Download

  • In the last three iterations, 23 subjects performed 107 dialogues in all, with 28 different scenarios, using a total of 4455 words. The constraints (1) and (2) above on vocabulary size and on maximum and average user utterance length have been met. In the last iteration only 3 user utterances out of 881 contained more than 10 tokens, and the average number of tokens per user turn was 1.85.

    PDF, 1 page · buncha_1 · 08-05-2013 · Download
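
    The reported figures are plain corpus statistics; a sketch of the computation (the sample utterances are invented):

        utterances = ["yes", "next page", "show me flights to Oslo"]
        counts = [len(u.split()) for u in utterances]
        avg = sum(counts) / len(counts)
        long_turns = sum(1 for c in counts if c > 10)
        print(f"avg tokens/turn: {avg:.2f}, turns over 10 tokens: {long_turns}")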

  • Work at the Unit for Computer Research on the English Language at the University of Lancaster has been directed towards producing a grammatically annotated version of the Lancaster-Oslo/Bergen (LOB) Corpus of written British English texts as the preliminary stage in developing computer programs and data files for providing a grammatical analysis of unrestricted English text. From 1981-83, a suite of PASCAL programs was devised to automatically produce a single level of grammatical description, with one word tag representing the word class or part of speech of each word token in the corpus.

    PDF, 7 pages · buncha_1 · 08-05-2013 · Download

  • We apply topic modelling to automatically induce word senses of a target word, and demonstrate that our word sense induction method can be used to automatically detect words with emergent novel senses, as well as token occurrences of those senses. We start by exploring the utility of standard topic models for word sense induction (WSI), with a pre-determined number of topics (=senses). We next demonstrate that a non-parametric formulation that learns an appropriate number of senses per word actually performs better at the WSI task. ...

    PDF, 11 pages · bunthai_1 · 06-05-2013 · Download
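
    A minimal sketch of the topic-models-as-senses idea (not the authors' exact setup; uses gensim's LdaModel, toy contexts invented): each context of the target word is a "document", and its dominant topic is read as the induced sense of that token occurrence.

        from gensim import corpora
        from gensim.models import LdaModel

        contexts = [            # bag-of-words contexts of "bank"
            ["river", "water", "shore"],
            ["money", "loan", "account"],
            ["deposit", "account", "interest"],
            ["shore", "fishing", "river"],
        ]
        dictionary = corpora.Dictionary(contexts)
        corpus = [dictionary.doc2bow(c) for c in contexts]
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                       passes=50, random_state=0)

        # Label each token occurrence's context with its dominant topic.
        for c, bow in zip(contexts, corpus):
            topic, prob = max(lda.get_document_topics(bow), key=lambda t: t[1])
            print(c, "-> sense", topic, f"({prob:.2f})")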

  • Although a lot of progress has been made recently in word segmentation and POS tagging for Chinese, the output of current state-of-the-art systems is too inaccurate to allow for syntactic analysis based on it. We present an experiment in improving the output of an off-the-shelf module that performs segmentation and tagging, the tokenizer-tagger from Beijing University (PKU). Our approach is based on transformation-based learning (TBL).

    PDF, 9 pages · bunthai_1 · 06-05-2013 · Download
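
    TBL learns an ordered list of context-triggered relabelling rules by greedy error reduction; the application step looks roughly like the sketch below (rules and tags invented, learning omitted):

        rules = [
            # (from_tag, to_tag, condition on the previous tag)
            ("NN", "VV", lambda prev: prev == "AD"),
        ]

        def apply_tbl(tagged):
            # tagged: list of (word, tag); apply each rule in order.
            out = list(tagged)
            for frm, to, cond in rules:
                for i, (word, tag) in enumerate(out):
                    prev = out[i - 1][1] if i > 0 else "<s>"
                    if tag == frm and cond(prev):
                        out[i] = (word, to)
            return out

        print(apply_tbl([("他", "PN"), ("慢慢", "AD"), ("走", "NN")]))
        # [('他', 'PN'), ('慢慢', 'AD'), ('走', 'VV')]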

  • The Web contains vast amounts of linguistic data. One key issue for linguists and language technologists is how to access it. Commercial search engines give highly compromised access. An alternative is to crawl the Web ourselves, which also allows us to remove duplicates and near-duplicates, navigational material, and a range of other kinds of non-linguistic matter. We can also tokenize, lemmatise and part-of-speech tag the corpus, and load the data into a corpus query tool which supports sophisticated linguistic queries.

    PDF, 4 pages · bunthai_1 · 06-05-2013 · Download
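
    One standard near-duplicate test in this kind of cleaning pipeline is word shingling with a Jaccard-overlap threshold (the threshold and documents below are invented; the paper does not specify this exact method):

        def shingles(text, n=5):
            words = text.lower().split()
            return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

        def near_duplicates(a, b, threshold=0.8):
            sa, sb = shingles(a), shingles(b)
            if not sa or not sb:
                return False
            jaccard = len(sa & sb) / len(sa | sb)
            return jaccard >= threshold

        doc1 = "the quick brown fox jumps over the lazy dog again and again"
        doc2 = doc1 + " today"
        print(near_duplicates(doc1, doc2))  # True (Jaccard = 8/9)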

  • We present an adaptive technique that enables users to produce a high-quality dictionary parsed into its lexicographic components (headwords, pronunciations, parts of speech, translations, etc.) using an extremely small amount of user-provided training data. We use transformation-based learning (TBL) as a post-processor at two points in our system to improve performance.

    PDF, 8 pages · bunthai_1 · 06-05-2013 · Download

  • We describe a word alignment platform which ensures the text pre-processing (tokenization, POS-tagging, lemmatization, chunking, sentence alignment) required by an accurate word alignment. The platform combines two different methods, producing distinct alignments. The basic word aligners are described in some detail and are individually evaluated. The union of the individual alignments is subject to a filtering post-processing phase. Two different filtering methods are also presented. The evaluation shows that the combined word alignment contains 10...

    PDF, 8 pages · bunthai_1 · 06-05-2013 · Download
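
    A schematic sketch of combining two aligners by union plus filtering (the filter here, keeping a union-only link only when it neighbours an agreed link, is one common heuristic, not necessarily the paper's; the links are invented):

        a1 = {(0, 0), (1, 1), (2, 3)}   # aligner 1: (src, tgt) links
        a2 = {(0, 0), (1, 1), (2, 2)}   # aligner 2

        agreed = a1 & a2
        union_only = (a1 | a2) - agreed

        def adjacent(link, links):
            s, t = link
            return any(abs(s - s2) <= 1 and abs(t - t2) <= 1
                       for s2, t2 in links)

        filtered = agreed | {l for l in union_only if adjacent(l, agreed)}
        print(sorted(filtered))  # [(0, 0), (1, 1), (2, 2)]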

  • Research has been under way at the Unit for Computer Research on the English Language at the University of Lancaster, England, to develop a suite of computer programs which provide a detailed grammatical analysis of the LOB corpus, a collection of about 1 million words of British English texts available in machine-readable form. The first phase of the project, completed in September 1983, produced a grammatically annotated version of the corpus giving a tag showing the word class of each word token. ...

    PDF, 6 pages · bungio_1 · 03-05-2013 · Download

  • We present a new approach to disambiguating syntactically ambiguous words in context, based on Variable Memory Markov (VMM) models. In contrast to fixed-length Markov models, which predict based on fixed-length histories, variable memory Markov models dynamically adapt their history length based on the training data, and hence may use fewer parameters. In a test of a VMM based tagger on the Brown corpus, 95.81% of tokens are correctly classified.

    PDF, 7 pages · bunmoc_1 · 20-04-2013 · Download
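
    A minimal sketch of the variable-memory idea (invented tags and counts; a real VMM also prunes contexts by a statistical criterion): predict from the longest tag-history suffix seen in training, backing off to shorter ones.

        from collections import defaultdict, Counter

        class VMMTagger:
            def __init__(self, max_order=3):
                self.max_order = max_order
                self.counts = defaultdict(Counter)  # history -> tag counts

            def train(self, tag_sequences):
                for tags in tag_sequences:
                    for i, tag in enumerate(tags):
                        for k in range(self.max_order + 1):
                            if i - k >= 0:
                                self.counts[tuple(tags[i - k:i])][tag] += 1

            def predict(self, history):
                # Longest matching suffix wins (variable memory).
                for k in range(min(self.max_order, len(history)), -1, -1):
                    ctx = tuple(history[len(history) - k:])
                    if ctx in self.counts:
                        return self.counts[ctx].most_common(1)[0][0]

        tagger = VMMTagger()
        tagger.train([["DT", "NN", "VB"], ["DT", "JJ", "NN"]])
        print(tagger.predict(["DT", "JJ"]))  # -> 'NN'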

  • For languages whose character set is very large and whose orthography does not require spacing between words, such as Japanese, tokenizing and part-of-speech tagging are often the difficult parts of any morphological analysis. Practical systems tackling this problem rely primarily on uncontrolled heuristics. The use of information on character sorts, however, mitigates this difficulty. This paper presents our method of incorporating character clustering based on mutual information into decision-tree, dictionary-less morphological analysis.

    PDF, 5 pages · bunrieu_1 · 18-04-2013 · Download
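
    The character-association signal such methods build on can be illustrated with pointwise mutual information between adjacent characters, estimated from raw counts (the toy corpus is invented):

        import math
        from collections import Counter

        corpus = "東京都に住む東京の人"  # toy corpus
        chars = Counter(corpus)
        pairs = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
        n_c, n_p = sum(chars.values()), sum(pairs.values())

        def pmi(pair):
            # log p(xy) / (p(x) p(y)); high values suggest cohesion.
            p_xy = pairs[pair] / n_p
            return math.log(p_xy / ((chars[pair[0]] / n_c) *
                                    (chars[pair[1]] / n_c)))

        for pair in pairs:
            print(pair, f"{pmi(pair):.2f}")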

  • We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the high nineties.

    PDF, 8 pages · bunbo_1 · 17-04-2013 · Download
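
    A schematic sketch of the selection step (candidates and predictions invented): score each analyzer candidate by agreement with the per-feature classifier predictions and keep the best.

        candidates = [  # hypothetical analyses of one Arabic word
            {"pos": "noun", "gender": "fem", "number": "sg"},
            {"pos": "verb", "gender": "fem", "number": "sg"},
            {"pos": "noun", "gender": "masc", "number": "pl"},
        ]
        # Stand-ins for the per-feature classifiers' predictions in context:
        predicted = {"pos": "noun", "gender": "fem", "number": "sg"}

        def agreement(analysis):
            return sum(analysis[f] == v for f, v in predicted.items())

        print(max(candidates, key=agreement))
        # {'pos': 'noun', 'gender': 'fem', 'number': 'sg'}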

  • This paper addresses two remaining challenges in Chinese word segmentation. The challenge in HLT is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition is to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges.

    PDF, 4 pages · hongvang_1 · 16-04-2013 · Download

  • A distributional method for part-of-speech induction is presented which, in contrast to most previous work, determines the part-of-speech distribution of syntactically ambiguous words without explicitly tagging the underlying text corpus. This is achieved by assuming that the word pair consisting of the left and right neighbor of a particular token is characteristic of the part of speech at this position, and by clustering the neighbor pairs on the basis of their middle words as observed in a large corpus.

    PDF, 4 pages · hongvang_1 · 16-04-2013 · Download
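
    A small sketch of this context representation (toy sentences invented; uses scikit-learn): represent every token by its left/right neighbours and cluster the resulting vectors.

        from sklearn.cluster import KMeans
        from sklearn.feature_extraction import DictVectorizer

        sentences = [["the", "dog", "runs"], ["the", "cat", "sleeps"],
                     ["a", "dog", "sleeps"], ["a", "cat", "runs"]]

        tokens, contexts = [], []
        for s in sentences:
            padded = ["<s>"] + s + ["</s>"]
            for i in range(1, len(padded) - 1):
                tokens.append(padded[i])
                contexts.append({"L=" + padded[i - 1]: 1,
                                 "R=" + padded[i + 1]: 1})

        X = DictVectorizer().fit_transform(contexts)
        labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
        for tok, lab in zip(tokens, labels):
            print(tok, "-> cluster", lab)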

  • (Figure 1: Intuitive illustration of a variety of successive tokens and a word boundary.) ...segmentation by formalizing the uncertainty of successive tokens via the branching entropy (which we mathematically define in the next section). Our intention in this paper is above all to study the fundamental and scientific statistical property underlying language data, so that it can be applied to language engineering. The above assumption (A) dates back to the fundamental work done by Harris (Harris, 1955), where he says that when the number of different tokens coming after every prefix of a word marks...

    PDF, 8 pages · hongvang_1 · 16-04-2013 · Download
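
    A compact sketch of the branching-entropy criterion (toy corpus invented): estimate the entropy of the next character after each prefix; a rise in entropy suggests a word boundary, after Harris.

        import math
        from collections import Counter, defaultdict

        corpus = ["natural", "nature", "nation", "native", "nativity"]
        succ = defaultdict(Counter)  # prefix -> counts of next character
        for w in corpus:
            for i in range(len(w)):
                succ[w[:i]][w[i]] += 1

        def branching_entropy(prefix):
            counts = succ[prefix]
            total = sum(counts.values())
            return -sum(c / total * math.log2(c / total)
                        for c in counts.values())

        for i in range(1, len("nativity")):
            prefix = "nativity"[:i]
            print(prefix, f"H = {branching_entropy(prefix):.2f}")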

  • We experiment with splitting words into their stem and suffix components for modeling morphologically rich languages. We show that using a morphological analyzer and disambiguator results in a significant perplexity reduction in Turkish. We present flexible n-gram models, FlexGrams, which assume that the n−1 tokens that determine the probability of a given token can be chosen anywhere in the sentence rather than the preceding n−1 positions. Our final model achieves 27% perplexity reduction compared to the standard n-gram model. ...

    PDF, 4 pages · hongphan_1 · 15-04-2013 · Download
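
    For context, a sketch of the baseline such models are compared against: perplexity of an add-one-smoothed bigram model (toy morpheme-split sentences invented; FlexGrams differ by letting the conditioning tokens sit anywhere in the sentence):

        import math
        from collections import Counter

        train = [["<s>", "el", "ma", "lar", "</s>"],
                 ["<s>", "el", "ler", "</s>"]]
        unigrams, bigrams = Counter(), Counter()
        for sent in train:
            unigrams.update(sent)
            bigrams.update(zip(sent, sent[1:]))
        V = len(unigrams)

        def prob(prev, word):
            # Add-one smoothed bigram probability.
            return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

        test = ["<s>", "el", "ler", "</s>"]
        logp = sum(math.log2(prob(p, w)) for p, w in zip(test, test[1:]))
        print("perplexity:", 2 ** (-logp / (len(test) - 1)))  # ≈ 3.34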
