
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYEN MINH THUAN
ENHANCING THE QUALITY OF MACHINE TRANSLATION
SYSTEM USING CROSS-LINGUAL WORD EMBEDDING
MODELS
Major: Computer Science
Code: 8480101.01
SUMMARY OF COMPUTER SCIENCE MASTER THESIS
SUPERVISOR: Associate Professor Nguyen Phuong Thai
Publication: Minh-Thuan Nguyen, Van-Tan Bui, Huy-Hien Vu, Phuong-Thai
Nguyen, Chi-Mai Luong, Enhancing the quality of Phrase-table in Statistical Machine
Translation for Less-Common and Low-Resource Languages, in the 2018 International
Conference on Asian Language Processing (IALP 2018).
Hanoi, 10/2018

Chapter 1: Introduction
This chapter introduces the motivation of the thesis, related
work, and our proposed models. Machine translation (MT)
systems have achieved considerable success in practice, and two
widely used approaches are phrase-based statistical machine
translation (PBSMT) and neural machine translation (NMT).
In PBSMT, a good phrase-table can substantially improve
translation quality. However, building a rich phrase-table is
challenging, since the phrase-table is extracted and trained from
large bilingual corpora, which require much effort and financial
support to create, especially for less-common languages such as
Vietnamese and Lao. In NMT, to reduce computational
complexity, conventional systems often limit their vocabularies
to the 30K-80K most frequent words in the source and target
languages, and all words outside the vocabulary, called
unknown words, are replaced with a single unk symbol. As a
result, the system cannot generate proper translations for these
unknown words during testing.
Recently, several approaches have been proposed to address
these impediments. In particular, techniques using word
embeddings have received much interest from the natural
language processing community. A word embedding is a vector
representation of a word that preserves semantic information
about the word and its contexts. Moreover, embeddings can be
used to represent words in different vector spaces. Cross-lingual
word embedding models, which learn representations of words
in a joint embedding space to represent meaning and transfer
knowledge across languages, are also receiving a lot of interest.
Inspired by the advantages of cross-lingual embedding models,
we propose a model that enhances the quality of a phrase-table
by recomputing the phrase weights and generating new phrase
pairs, and a model that addresses the unknown word problem
in NMT by replacing each unknown word with the most
appropriate in-vocabulary word.
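To make the replacement idea concrete, the following minimal sketch picks, for an unknown word, the in-vocabulary word whose embedding is nearest by cosine similarity. The vocabulary, the 3-dimensional vectors, and the unknown word's vector are all invented for illustration; real systems would use embeddings learned from large corpora (and, in the cross-lingual setting, projected into a shared space):

```python
import math

# Toy in-vocabulary embeddings (hypothetical 3-d vectors for illustration;
# real embeddings have hundreds of dimensions and are learned from data).
vocab_embeddings = {
    "city":    [0.9, 0.1, 0.0],
    "town":    [0.7, 0.3, 0.2],
    "eat":     [0.0, 0.9, 0.3],
    "quickly": [0.1, 0.0, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def replace_unknown(unk_vector, embeddings):
    """Return the in-vocabulary word whose embedding is closest
    (by cosine similarity) to the unknown word's embedding."""
    return max(embeddings, key=lambda w: cosine(unk_vector, embeddings[w]))

# An unknown word (say, "metropolis") with an embedding near "city".
unk = [0.85, 0.15, 0.05]
print(replace_unknown(unk, vocab_embeddings))  # → city
```

In an NMT pipeline, this lookup would run once per unk token before decoding, so the decoder only ever sees in-vocabulary words.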
The rest of this thesis is organized as follows: Chapter 2
gives an overview of the related background. Chapter 3
describes our two proposed models: one enhances the quality
of the phrase-table in SMT, and the other tackles the unknown
word problem in NMT. The settings and results of our
experiments are shown in Chapter 4. We present our
conclusions and future work in Chapter 5.

Chapter 2: Literature review
2.1 Machine Translation
This section reviews the history of MT, its main approaches,
evaluation methods, and open-source toolkits.
2.1.1 History
In the mid-1930s, Georges Artsrouni attempted to build
“translation machines” that used paper tape to create an
automatic dictionary. Shortly afterwards, Peter Troyanskii
proposed a model comprising a bilingual dictionary and a
method for handling grammatical differences between
languages based on Esperanto's grammatical system. During
the 2000s, research in MT saw major changes. Much research
focused on example-based machine translation and statistical
machine translation (SMT). Researchers also showed growing
interest in hybridization, incorporating morphological and
syntactic knowledge into statistical systems, as well as
combining statistics with existing rule-based systems.
Recently, a major trend in MT has been the use of large
artificial neural networks, an approach called neural machine
translation (NMT). In 2014, Cho et al. (2014) published the
first paper on using neural networks in MT, followed by a
large body of research in the following few years.
2.1.2 Approaches
In this section, we describe typical approaches to MT based
on linguistic rules, statistics, and neural networks. These are
Rule-based Machine Translation (RBMT), Statistical Machine
Translation (SMT), Example-based Machine Translation
(EBMT), and Neural Machine Translation (NMT).

2.1.3 Evaluation
This section describes BLEU, a popular method for
automatically evaluating MT output that is quick, inexpensive,
and language-independent. The basic idea of this method is to
compare n-grams of the MT output with n-grams of a
reference translation and count the number of matches: the
more matches, the better the MT output.
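The n-gram matching at the heart of BLEU can be sketched as follows. This computes only the clipped ("modified") n-gram precision component; full BLEU additionally combines the precisions for n = 1..4 with a geometric mean and a brevity penalty, and supports multiple references, which are omitted here for brevity. The example sentences are invented:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision as used in BLEU: each candidate n-gram
    counts only up to the number of times it occurs in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matches = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return matches / total if total else 0.0

cand = "the cat is on the mat".split()
ref = "the cat sat on the mat".split()
print(modified_precision(cand, ref, 1))  # 5/6 matching unigrams ≈ 0.833
print(modified_precision(cand, ref, 2))  # 3/5 matching bigrams = 0.6
```

The clipping step is what stops a degenerate output like "the the the the" from scoring a perfect unigram precision against a reference containing "the".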
2.1.4 Open-Source Machine Translation
This subsection introduces a list of free, complete toolkits
for MT and describes the two MT systems used in our work:
Moses, an open-source system for SMT, and OpenNMT, an
open-source system for NMT.
2.2 Word Embedding
In this section, we introduce monolingual and cross-lingual
word embedding models.
2.2.1 Monolingual Word Embedding Models
This subsection introduces models used for estimating
continuous representations of words from monolingual data.
2.2.2 Cross-Lingual Word Embedding Models
This subsection introduces models used for learning
cross-lingual representations of words in a joint embedding
space, in order to represent meaning and transfer knowledge
in cross-lingual applications.
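One common family of such models learns a linear map that projects source-language vectors onto their target-language counterparts using a small seed dictionary, after which both languages live in one shared space. The sketch below fits a 2x2 mapping matrix by gradient descent on toy 2-dimensional vectors; the vectors, dimensionality, learning rate, and iteration count are all invented for illustration:

```python
# Seed-dictionary pairs: (source-language vector, target-language vector).
# Hypothetical 2-d embeddings; real models use hundreds of dimensions.
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.0, 1.0], [-1.0, 0.0]),
    ([1.0, 1.0], [-1.0, 1.0]),
]

W = [[0.0, 0.0], [0.0, 0.0]]  # 2x2 mapping matrix, initialised to zero

def apply_map(W, x):
    """Matrix-vector product W x for the 2x2 case."""
    return [W[0][0] * x[0] + W[0][1] * x[1],
            W[1][0] * x[0] + W[1][1] * x[1]]

# Gradient descent on the squared mapping error sum_i ||W x_i - y_i||^2.
lr = 0.1
for _ in range(500):
    for x, y in pairs:
        pred = apply_map(W, x)
        err = [pred[0] - y[0], pred[1] - y[1]]
        for i in range(2):
            for j in range(2):
                W[i][j] -= lr * 2 * err[i] * x[j]

# The learned W approximates a 90-degree rotation, so a source vector
# is mapped close to its target-language counterpart.
print(apply_map(W, [1.0, 0.0]))
```

Once W is learned, translation candidates for a source word can be retrieved by a nearest-neighbour search over target-language vectors, exactly as in the unknown-word replacement sketch earlier.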