
Word tokenization

Showing results 1-20 of 30 for "Word tokenization"
  • In this paper, with these algorithms, artificially intelligent support becomes much more sophisticated, and people can communicate naturally with a robot, not only in English but in Vietnamese as well. We introduce the specification and the processing of intelligent interaction in natural Vietnamese.

    PDF, 8 pages · vidoctorstrange · 06-05-2023 · Download

  • Information retrieval techniques: Lecture 8. The main topics covered in this chapter include: parsing a document; complications of format and language; precision and recall; tokenization; numbers; language issues in tokenization; stop words;... Please refer to the content of the document.

    PPT, 16 slides · tieuvulinhhoa · 22-09-2022 · Download
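
    As a minimal illustration of two of the topics above, the sketch below is not from the lecture; the stop-word list and relevance sets are invented. It shows naive regex tokenization with stop-word removal and set-based precision/recall:

        import re

        STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # toy list

        def tokenize(text):
            # Lowercase, keep alphanumeric runs, drop stop words.
            tokens = re.findall(r"[a-z0-9]+", text.lower())
            return [t for t in tokens if t not in STOP_WORDS]

        def precision_recall(retrieved, relevant):
            # Set-based IR metrics over document IDs.
            retrieved, relevant = set(retrieved), set(relevant)
            hits = retrieved & relevant
            p = len(hits) / len(retrieved) if retrieved else 0.0
            r = len(hits) / len(relevant) if relevant else 0.0
            return p, r

        print(tokenize("The tokenization of a document"))  # ['tokenization', 'document']
        print(precision_recall({1, 2, 3, 4}, {2, 4, 5}))   # (0.5, 0.666...)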

  • Lecture Compiler construction: Lesson 10 - Sohail Aslam. The main topics covered in this chapter include: using a generated scanner, tokenizing the input, Flex input for C++, an ISO C++ lexical analyzer, and a front-end parser that checks the stream of words and their parts of speech for grammatical correctness,...

    PPT, 33 slides · youzhangjing_1909 · 28-04-2022 · Download
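
    A toy sketch of what such a generated scanner does (the token rules below are invented, not the lecture's Flex specification): match patterns in priority order and emit (kind, lexeme) pairs.

        import re

        TOKEN_SPEC = [          # ordered: earlier patterns win ties
            ("NUMBER", r"\d+"),
            ("IDENT",  r"[A-Za-z_]\w*"),
            ("OP",     r"[+\-*/=]"),
            ("SKIP",   r"\s+"),
        ]
        MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

        def scan(source):
            for m in MASTER.finditer(source):
                if m.lastgroup != "SKIP":
                    yield m.lastgroup, m.group()

        print(list(scan("x = 42 + y")))
        # [('IDENT', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('IDENT', 'y')]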

  • In this paper, we apply a tokenizer for Vietnamese text to build the lexicon; hence, each record of the lexicon may contain several single words. On the basis of this method, we decrease the size of the lexicon and improve the precision of search while maintaining the complexity of the process.

    PDF, 8 pages · viengland2711 · 23-07-2019 · Download
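
    The paper's exact algorithm is not reproduced here; a common scheme with the same effect is greedy longest-match over syllables against a multi-syllable lexicon (the sample entries are invented):

        LEXICON = {"sinh viên", "đại học", "thông tin"}  # invented entries
        MAX_LEN = 3  # longest lexicon entry, in syllables

        def tokenize_vi(sentence):
            syllables = sentence.split()
            words, i = [], 0
            while i < len(syllables):
                # Try the longest span first, then shrink to one syllable.
                for n in range(min(MAX_LEN, len(syllables) - i), 0, -1):
                    span = " ".join(syllables[i:i + n])
                    if n == 1 or span in LEXICON:
                        words.append(span)
                        i += n
                        break
            return words

        print(tokenize_vi("sinh viên đại học"))  # ['sinh viên', 'đại học']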

  • This paper reviews the background of lexical borrowing in the Vietnamese context and investigates the scale of borrowability of English tokens that occurred in magazine issues. The findings show that the syntactic system of the Vietnamese language has influenced how English word types are borrowed.

    PDF, 9 pages · miulovesmile4 · 19-11-2018 · Download

  • In the last three iterations, 23 subjects performed 107 dialogues in all, with 28 different scenarios, using a total of 4455 words. The constraints (1) and (2) above on vocabulary size and on maximum and average user utterance length have been met. In the last iteration only 3 user utterances out of 881 contained more than 10 tokens, and the average number of tokens per user turn was 1.85.

    PDF, 1 page · buncha_1 · 08-05-2013 · Download
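
    The reported figures are plain corpus statistics; a sketch of the computation (the sample utterances are invented):

        utterances = ["yes", "next page", "show me flights to Oslo"]
        counts = [len(u.split()) for u in utterances]
        avg = sum(counts) / len(counts)
        long_turns = sum(1 for c in counts if c > 10)
        print(f"avg tokens/turn: {avg:.2f}, turns over 10 tokens: {long_turns}")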

  • Work at the Unit for Computer Research on the English Language at the University of Lancaster has been directed towards producing a grammatically annotated version of the Lancaster-Oslo/Bergen (LOB) Corpus of written British English texts as the preliminary stage in developing computer programs and data files for providing a grammatical analysis of unrestricted English text. From 1981-83, a suite of PASCAL programs was devised to automatically produce a single level of grammatical description, with one word tag representing the word class or part of speech of each word token in the corpus.

    PDF, 7 pages · buncha_1 · 08-05-2013 · Download

  • We apply topic modelling to automatically induce word senses of a target word, and demonstrate that our word sense induction method can be used to automatically detect words with emergent novel senses, as well as token occurrences of those senses. We start by exploring the utility of standard topic models for word sense induction (WSI), with a pre-determined number of topics (=senses). We next demonstrate that a non-parametric formulation that learns an appropriate number of senses per word actually performs better at the WSI task. ...

    PDF, 11 pages · bunthai_1 · 06-05-2013 · Download
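
    A minimal sketch of the topic-models-as-senses idea (not the authors' exact setup; uses gensim's LdaModel, toy contexts invented): each context of the target word is a "document", and its dominant topic is read as the induced sense of that token occurrence.

        from gensim import corpora
        from gensim.models import LdaModel

        contexts = [            # bag-of-words contexts of "bank"
            ["river", "water", "shore"],
            ["money", "loan", "account"],
            ["deposit", "account", "interest"],
            ["shore", "fishing", "river"],
        ]
        dictionary = corpora.Dictionary(contexts)
        corpus = [dictionary.doc2bow(c) for c in contexts]
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                       passes=50, random_state=0)

        # Label each token occurrence's context with its dominant topic.
        for c, bow in zip(contexts, corpus):
            topic, prob = max(lda.get_document_topics(bow), key=lambda t: t[1])
            print(c, "-> sense", topic, f"({prob:.2f})")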

  • Although a lot of progress has been made recently in word segmentation and POS tagging for Chinese, the output of current state-of-the-art systems is too inaccurate to allow for syntactic analysis based on it. We present an experiment in improving the output of an off-the-shelf module that performs segmentation and tagging, the tokenizer-tagger from Beijing University (PKU). Our approach is based on transformation-based learning (TBL).

    PDF, 9 pages · bunthai_1 · 06-05-2013 · Download
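
    TBL learns an ordered list of context-triggered relabelling rules by greedy error reduction; the application step looks roughly like the sketch below (rules and tags invented, learning omitted):

        rules = [
            # (from_tag, to_tag, condition on the previous tag)
            ("NN", "VV", lambda prev: prev == "AD"),
        ]

        def apply_tbl(tagged):
            # tagged: list of (word, tag); apply each rule in order.
            out = list(tagged)
            for frm, to, cond in rules:
                for i, (word, tag) in enumerate(out):
                    prev = out[i - 1][1] if i > 0 else "<s>"
                    if tag == frm and cond(prev):
                        out[i] = (word, to)
            return out

        print(apply_tbl([("他", "PN"), ("慢慢", "AD"), ("走", "NN")]))
        # [('他', 'PN'), ('慢慢', 'AD'), ('走', 'VV')]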

  • The Web contains vast amounts of linguistic data. One key issue for linguists and language technologists is how to access it. Commercial search engines give highly compromised access. An alternative is to crawl the Web ourselves, which also allows us to remove duplicates and near-duplicates, navigational material, and a range of other kinds of non-linguistic matter. We can also tokenize, lemmatise and part-of-speech tag the corpus, and load the data into a corpus query tool which supports sophisticated linguistic queries.

    PDF, 4 pages · bunthai_1 · 06-05-2013 · Download
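
    One standard near-duplicate test in this kind of cleaning pipeline is word shingling with a Jaccard-overlap threshold (the threshold and documents below are invented; the paper does not specify this exact method):

        def shingles(text, n=5):
            words = text.lower().split()
            return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

        def near_duplicates(a, b, threshold=0.8):
            sa, sb = shingles(a), shingles(b)
            if not sa or not sb:
                return False
            jaccard = len(sa & sb) / len(sa | sb)
            return jaccard >= threshold

        doc1 = "the quick brown fox jumps over the lazy dog again and again"
        doc2 = doc1 + " today"
        print(near_duplicates(doc1, doc2))  # True (Jaccard = 8/9)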

  • We present an adaptive technique that enables users to produce a high-quality dictionary parsed into its lexicographic components (headwords, pronunciations, parts of speech, translations, etc.) using an extremely small amount of user-provided training data. We use transformation-based learning (TBL) as a post-processor at two points in our system to improve performance.

    PDF, 8 pages · bunthai_1 · 06-05-2013 · Download

  • We describe a word alignment platform which ensures the text pre-processing (tokenization, POS-tagging, lemmatization, chunking, sentence alignment) required by an accurate word alignment. The platform combines two different methods, producing distinct alignments. The basic word aligners are described in some detail and are individually evaluated. The union of the individual alignments is subject to a filtering post-processing phase. Two different filtering methods are also presented. The evaluation shows that the combined word alignment contains 10...

    PDF, 8 pages · bunthai_1 · 06-05-2013 · Download
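
    A schematic sketch of combining two aligners by union plus filtering (the filter here, keeping a union-only link only when it neighbours an agreed link, is one common heuristic, not necessarily the paper's; the links are invented):

        a1 = {(0, 0), (1, 1), (2, 3)}   # aligner 1: (src, tgt) links
        a2 = {(0, 0), (1, 1), (2, 2)}   # aligner 2

        agreed = a1 & a2
        union_only = (a1 | a2) - agreed

        def adjacent(link, links):
            s, t = link
            return any(abs(s - s2) <= 1 and abs(t - t2) <= 1
                       for s2, t2 in links)

        filtered = agreed | {l for l in union_only if adjacent(l, agreed)}
        print(sorted(filtered))  # [(0, 0), (1, 1), (2, 2)]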

  • Research has been under way at the Unit for Computer Research on the English Language at the University of Lancaster, England, to develop a suite of computer programs which provide a detailed grammatical analysis of the LOB corpus, a collection of about 1 million words of British English texts available in machine-readable form. The first phase of the project, completed in September 1983, produced a grammatically annotated version of the corpus giving a tag showing the word class of each word token. ...

    PDF, 6 pages · bungio_1 · 03-05-2013 · Download

  • We present a new approach to disambiguating syntactically ambiguous words in context, based on Variable Memory Markov (VMM) models. In contrast to fixed-length Markov models, which predict based on fixed-length histories, variable memory Markov models dynamically adapt their history length based on the training data, and hence may use fewer parameters. In a test of a VMM based tagger on the Brown corpus, 95.81% of tokens are correctly classified.

    PDF, 7 pages · bunmoc_1 · 20-04-2013 · Download
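
    A minimal sketch of the variable-memory idea (invented tags and counts; a real VMM also prunes contexts by a statistical criterion): predict from the longest tag-history suffix seen in training, backing off to shorter ones.

        from collections import defaultdict, Counter

        class VMMTagger:
            def __init__(self, max_order=3):
                self.max_order = max_order
                self.counts = defaultdict(Counter)  # history -> tag counts

            def train(self, tag_sequences):
                for tags in tag_sequences:
                    for i, tag in enumerate(tags):
                        for k in range(self.max_order + 1):
                            if i - k >= 0:
                                self.counts[tuple(tags[i - k:i])][tag] += 1

            def predict(self, history):
                # Longest matching suffix wins (variable memory).
                for k in range(min(self.max_order, len(history)), -1, -1):
                    ctx = tuple(history[len(history) - k:])
                    if ctx in self.counts:
                        return self.counts[ctx].most_common(1)[0][0]

        tagger = VMMTagger()
        tagger.train([["DT", "NN", "VB"], ["DT", "JJ", "NN"]])
        print(tagger.predict(["DT", "JJ"]))  # -> 'NN'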

  • For languages whose character set is very large and whose orthography does not require spacing between words, such as Japanese, tokenizing and part-of-speech tagging are often the difficult parts of any morphological analysis. Practical systems tackling this problem rely primarily on uncontrolled heuristics. The use of information on character sorts, however, mitigates this difficulty. This paper presents our method of incorporating character clustering based on mutual information into decision-tree, dictionary-less morphological analysis.

    PDF, 5 pages · bunrieu_1 · 18-04-2013 · Download
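
    The character-association signal such methods build on can be illustrated with pointwise mutual information between adjacent characters, estimated from raw counts (the toy corpus is invented):

        import math
        from collections import Counter

        corpus = "東京都に住む東京の人"  # toy corpus
        chars = Counter(corpus)
        pairs = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
        n_c, n_p = sum(chars.values()), sum(pairs.values())

        def pmi(pair):
            # log p(xy) / (p(x) p(y)); high values suggest cohesion.
            p_xy = pairs[pair] / n_p
            return math.log(p_xy / ((chars[pair[0]] / n_c) *
                                    (chars[pair[1]] / n_c)))

        for pair in pairs:
            print(pair, f"{pmi(pair):.2f}")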

  • We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the high nineties.

    PDF, 8 pages · bunbo_1 · 17-04-2013 · Download
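
    A schematic sketch of the selection step (candidates and predictions invented): score each analyzer candidate by agreement with the per-feature classifier predictions and keep the best.

        candidates = [  # hypothetical analyses of one Arabic word
            {"pos": "noun", "gender": "fem", "number": "sg"},
            {"pos": "verb", "gender": "fem", "number": "sg"},
            {"pos": "noun", "gender": "masc", "number": "pl"},
        ]
        # Stand-ins for the per-feature classifiers' predictions in context:
        predicted = {"pos": "noun", "gender": "fem", "number": "sg"}

        def agreement(analysis):
            return sum(analysis[f] == v for f, v in predicted.items())

        print(max(candidates, key=agreement))
        # {'pos': 'noun', 'gender': 'fem', 'number': 'sg'}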

  • This paper addresses two remaining challenges in Chinese word segmentation. The challenge in HLT is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition is to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges.

    PDF, 4 pages · hongvang_1 · 16-04-2013 · Download

  • A distributional method for part-of-speech induction is presented which, in contrast to most previous work, determines the part-of-speech distribution of syntactically ambiguous words without explicitly tagging the underlying text corpus. This is achieved by assuming that the word pair consisting of the left and right neighbor of a particular token is characteristic of the part of speech at this position, and by clustering the neighbor pairs on the basis of their middle words as observed in a large corpus.

    PDF, 4 pages · hongvang_1 · 16-04-2013 · Download
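
    A small sketch of this context representation (toy sentences invented; uses scikit-learn): represent every token by its left/right neighbours and cluster the resulting vectors.

        from sklearn.cluster import KMeans
        from sklearn.feature_extraction import DictVectorizer

        sentences = [["the", "dog", "runs"], ["the", "cat", "sleeps"],
                     ["a", "dog", "sleeps"], ["a", "cat", "runs"]]

        tokens, contexts = [], []
        for s in sentences:
            padded = ["<s>"] + s + ["</s>"]
            for i in range(1, len(padded) - 1):
                tokens.append(padded[i])
                contexts.append({"L=" + padded[i - 1]: 1,
                                 "R=" + padded[i + 1]: 1})

        X = DictVectorizer().fit_transform(contexts)
        labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
        for tok, lab in zip(tokens, labels):
            print(tok, "-> cluster", lab)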

  • (Figure 1: Intuitive illustration of a variety of successive tokens and a word boundary.) ...segmentation by formalizing the uncertainty of successive tokens via the branching entropy (which we mathematically define in the next section). Our intention in this paper is above all to study the fundamental and scientific statistical property underlying language data, so that it can be applied to language engineering. The above assumption (A) dates back to the fundamental work done by Harris (Harris, 1955), where he says that when the number of different tokens coming after every prefix of a word marks...

    PDF, 8 pages · hongvang_1 · 16-04-2013 · Download
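
    A compact sketch of the branching-entropy criterion (toy corpus invented): estimate the entropy of the next character after each prefix; a rise in entropy suggests a word boundary, after Harris.

        import math
        from collections import Counter, defaultdict

        corpus = ["natural", "nature", "nation", "native", "nativity"]
        succ = defaultdict(Counter)  # prefix -> counts of next character
        for w in corpus:
            for i in range(len(w)):
                succ[w[:i]][w[i]] += 1

        def branching_entropy(prefix):
            counts = succ[prefix]
            total = sum(counts.values())
            return -sum(c / total * math.log2(c / total)
                        for c in counts.values())

        for i in range(1, len("nativity")):
            prefix = "nativity"[:i]
            print(prefix, f"H = {branching_entropy(prefix):.2f}")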

  • We experiment with splitting words into their stem and suffix components for modeling morphologically rich languages. We show that using a morphological analyzer and disambiguator results in a significant perplexity reduction in Turkish. We present flexible n-gram models, FlexGrams, which assume that the n−1 tokens that determine the probability of a given token can be chosen anywhere in the sentence rather than the preceding n−1 positions. Our final model achieves 27% perplexity reduction compared to the standard n-gram model. ...

    PDF, 4 pages · hongphan_1 · 15-04-2013 · Download
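
    For context, a sketch of the baseline such models are compared against: perplexity of an add-one-smoothed bigram model (toy morpheme-split sentences invented; FlexGrams differ by letting the conditioning tokens sit anywhere in the sentence):

        import math
        from collections import Counter

        train = [["<s>", "el", "ma", "lar", "</s>"],
                 ["<s>", "el", "ler", "</s>"]]
        unigrams, bigrams = Counter(), Counter()
        for sent in train:
            unigrams.update(sent)
            bigrams.update(zip(sent, sent[1:]))
        V = len(unigrams)

        def prob(prev, word):
            # Add-one smoothed bigram probability.
            return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

        test = ["<s>", "el", "ler", "</s>"]
        logp = sum(math.log2(prob(p, w)) for p, w in zip(test, test[1:]))
        print("perplexity:", 2 ** (-logp / (len(test) - 1)))  # ≈ 3.34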
