
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYEN MINH THUAN
ENHANCING THE QUALITY OF MACHINE TRANSLATION
SYSTEM USING CROSS-LINGUAL WORD EMBEDDING
MODELS
Major: Computer Science
Code: 8480101.01
SUMMARY OF COMPUTER SCIENCE MASTER THESIS
SUPERVISOR: Associate Professor Nguyen Phuong Thai
Publication: Minh-Thuan Nguyen, Van-Tan Bui, Huy-Hien Vu, Phuong-Thai
Nguyen, Chi-Mai Luong, Enhancing the quality of Phrase-table in Statistical Machine
Translation for Less-Common and Low-Resource Languages, in the 2018 International
Conference on Asian Language Processing (IALP 2018).
Hanoi, 10/2018

Chapter 1: Introduction
This chapter introduces the motivation of the thesis, related
work, and our proposed models. Machine translation (MT)
systems have achieved considerable success in practice, and two
widely used approaches are phrase-based statistical machine
translation (PBSMT) and neural machine translation (NMT).
In PBSMT, a good phrase-table can substantially improve
translation quality. However, building a rich phrase-table is
challenging, since the phrase-table is extracted and trained from
large bilingual corpora, which require much effort and financial
support to create, especially for less-common languages such as
Vietnamese and Lao. In NMT, to reduce computational
complexity, conventional systems often limit their vocabularies
to the 30K-80K most frequent words in the source and target
languages, and all words outside the vocabulary, called
unknown words, are replaced with a single unk symbol. As a
result, the system cannot generate proper translations for these
unknown words during testing.
Recently, several approaches have been proposed to address
these impediments. In particular, techniques using word
embeddings have received much interest from the natural
language processing community. A word embedding is a vector
representation of a word that preserves semantic information
about the word and its contexts. Moreover, embeddings can be
used to represent words in different vector spaces. Cross-lingual
word embedding models, which learn representations of words
in a joint embedding space to represent meaning and transfer
knowledge across languages, are also receiving a lot of interest.
Inspired by the advantages of cross-lingual embedding models,
we propose a model that enhances the quality of a phrase-table
by recomputing the phrase weights and generating new phrase
pairs, and a model that addresses the unknown word problem
in NMT by replacing each unknown word with the most
appropriate in-vocabulary word.
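To make the replacement idea concrete, the following minimal sketch picks, for an unknown word, the in-vocabulary word whose embedding is nearest by cosine similarity. The vocabulary, the 3-dimensional vectors, and the unknown word's vector are all invented for illustration; real systems would use embeddings learned from large corpora (and, in the cross-lingual setting, projected into a shared space):

```python
import math

# Toy in-vocabulary embeddings (hypothetical 3-d vectors for illustration;
# real embeddings have hundreds of dimensions and are learned from data).
vocab_embeddings = {
    "city":    [0.9, 0.1, 0.0],
    "town":    [0.7, 0.3, 0.2],
    "eat":     [0.0, 0.9, 0.3],
    "quickly": [0.1, 0.0, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def replace_unknown(unk_vector, embeddings):
    """Return the in-vocabulary word whose embedding is closest
    (by cosine similarity) to the unknown word's embedding."""
    return max(embeddings, key=lambda w: cosine(unk_vector, embeddings[w]))

# An unknown word (say, "metropolis") with an embedding near "city".
unk = [0.85, 0.15, 0.05]
print(replace_unknown(unk, vocab_embeddings))  # → city
```

In an NMT pipeline, this lookup would run once per unk token before decoding, so the decoder only ever sees in-vocabulary words.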
The rest of this thesis is organized as follows: Chapter 2
gives an overview of the related background. Chapter 3
describes our two proposed models: one enhances the quality
of the phrase-table in SMT, and the other tackles the unknown
word problem in NMT. The settings and results of our
experiments are shown in Chapter 4. We present our
conclusions and future work in Chapter 5.

Chapter 2: Literature review
2.1 Machine Translation
This section reviews the history of MT, its main approaches,
evaluation methods, and open-source toolkits.
2.1.1 History
In the mid-1930s, Georges Artsrouni attempted to build
“translation machines” that used paper tape to create an
automatic dictionary. Shortly afterwards, Peter Troyanskii
proposed a model comprising a bilingual dictionary and a
method for handling grammatical differences between
languages based on Esperanto's grammatical system. During
the 2000s, research in MT saw major changes. Much research
focused on example-based machine translation and statistical
machine translation (SMT). Researchers also showed growing
interest in hybridization, incorporating morphological and
syntactic knowledge into statistical systems, as well as
combining statistics with existing rule-based systems.
Recently, a major trend in MT has been the use of large
artificial neural networks, an approach called neural machine
translation (NMT). In 2014, Cho et al. (2014) published the
first paper on using neural networks in MT, followed by a
large body of research in the following few years.
2.1.2 Approaches
In this section, we describe typical approaches to MT based
on linguistic rules, statistics, and neural networks. These are
Rule-based Machine Translation (RBMT), Statistical Machine
Translation (SMT), Example-based Machine Translation
(EBMT), and Neural Machine Translation (NMT).

2.1.3 Evaluation
This section describes BLEU, a popular method for
automatically evaluating MT output that is quick, inexpensive,
and language-independent. The basic idea of this method is to
compare n-grams of the MT output with n-grams of a
reference translation and count the number of matches: the
more matches, the better the MT output.
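The n-gram matching at the heart of BLEU can be sketched as follows. This computes only the clipped ("modified") n-gram precision component; full BLEU additionally combines the precisions for n = 1..4 with a geometric mean and a brevity penalty, and supports multiple references, which are omitted here for brevity. The example sentences are invented:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision as used in BLEU: each candidate n-gram
    counts only up to the number of times it occurs in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matches = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return matches / total if total else 0.0

cand = "the cat is on the mat".split()
ref = "the cat sat on the mat".split()
print(modified_precision(cand, ref, 1))  # 5/6 matching unigrams ≈ 0.833
print(modified_precision(cand, ref, 2))  # 3/5 matching bigrams = 0.6
```

The clipping step is what stops a degenerate output like "the the the the" from scoring a perfect unigram precision against a reference containing "the".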
2.1.4 Open-Source Machine Translation
This subsection introduces a list of free, complete toolkits
for MT and describes the two MT systems used in our work:
Moses, an open-source system for SMT, and OpenNMT, an
open-source system for NMT.
2.2 Word Embedding
In this section, we introduce monolingual and cross-lingual
word embedding models.
2.2.1 Monolingual Word Embedding Models
This subsection introduces models used for estimating
continuous representations of words from monolingual data.
2.2.2 Cross-Lingual Word Embedding Models
This subsection introduces models used for learning
cross-lingual representations of words in a joint embedding
space, in order to represent meaning and transfer knowledge
in cross-lingual applications.
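One common family of such models learns a linear map that projects source-language vectors onto their target-language counterparts using a small seed dictionary, after which both languages live in one shared space. The sketch below fits a 2x2 mapping matrix by gradient descent on toy 2-dimensional vectors; the vectors, dimensionality, learning rate, and iteration count are all invented for illustration:

```python
# Seed-dictionary pairs: (source-language vector, target-language vector).
# Hypothetical 2-d embeddings; real models use hundreds of dimensions.
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.0, 1.0], [-1.0, 0.0]),
    ([1.0, 1.0], [-1.0, 1.0]),
]

W = [[0.0, 0.0], [0.0, 0.0]]  # 2x2 mapping matrix, initialised to zero

def apply_map(W, x):
    """Matrix-vector product W x for the 2x2 case."""
    return [W[0][0] * x[0] + W[0][1] * x[1],
            W[1][0] * x[0] + W[1][1] * x[1]]

# Gradient descent on the squared mapping error sum_i ||W x_i - y_i||^2.
lr = 0.1
for _ in range(500):
    for x, y in pairs:
        pred = apply_map(W, x)
        err = [pred[0] - y[0], pred[1] - y[1]]
        for i in range(2):
            for j in range(2):
                W[i][j] -= lr * 2 * err[i] * x[j]

# The learned W approximates a 90-degree rotation, so a source vector
# is mapped close to its target-language counterpart.
print(apply_map(W, [1.0, 0.0]))
```

Once W is learned, translation candidates for a source word can be retrieved by a nearest-neighbour search over target-language vectors, exactly as in the unknown-word replacement sketch earlier.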