
Exploring Composite Indexes for Domain Adaptation
in Neural Machine Translation
Nhan Vo Minh, Khue Nguyen Tran Minh, Long H. B. Nguyen* and Dien Dinh
Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
Vietnam National University, Ho Chi Minh City, Vietnam
*Corresponding author: nhblong@fit.hcmus.edu.vn; long.hb.nguyen@gmail.com
Received 2 April 2023
Revised 24 July 2023
Accepted 22 August 2023
Published 23 September 2023
Domain adaptation in neural machine translation (NMT) tasks often involves working with
datasets that have a different distribution from the training data. In such scenarios, k-nearest-
neighbor machine translation (kNN-MT) has been shown to be effective in retrieving relevant
information from large datastores. However, the high-dimensional context vectors of large
neural machine translation models result in high computational costs for distance computation
and storage. To address this issue, index optimization techniques have been proposed, including
the combination of an inverted file index (IVF) with product vector quantization (PQ), called IVFPQ.
In this paper, we explore recent index techniques for efficient machine translation domain
adaptation and combine multiple index structures to improve the efficiency of nearest-neighbor
search in domain adaptation datasets for the machine translation task. Specifically, we evaluate
the effectiveness of combining optimized product quantization (OPQ) and hierarchical navigable
small-world (HNSW) indexing with IVFPQ. Our study aims to provide insights into the most
suitable composite index methods for efficient nearest-neighbor search in domain adaptation
datasets, with a focus on improving both accuracy and speed.
Keywords: Domain adaptation in neural machine translation; k-nearest-neighbor machine
translation; index method.
1. Introduction
Modern neural machine translation (NMT) models are mostly parametric, meaning
that the output for each input depends only on a fixed number of model parameters
obtained from some training data in the same domain.^{1,2}
However, in real-world scenarios, the input to the NMT system often comes from domains
that were not part of the training data, resulting in suboptimal translations. One solution
to this issue is to train or fine-tune the entire model or a part of it for each domain, but
this can be costly and may result in catastrophic forgetting.^{3}
Recently, a promising approach has been to augment parametric models with a
retrieval component, leading to semi-parametric models.^{4–6} These models construct a
datastore based on a set of source/target sentences or word-level contexts, i.e.
translation memories, and retrieve similar examples from the datastore during the
generation process. This allows for the use of a single model for multiple domains.
However, the runtime of the model increases with the size of the domain's
datastore, and searching for related examples on large datastores can be
computationally expensive. For instance, when retrieving 64 neighbors from the
datastore, the model may become two orders of magnitude slower.^{6}
To address this issue, recent works have proposed methods that aim to make the
retrieval process more efficient. For example, Meng et al.^{7} proposed constructing a
separate datastore for each source sentence by first searching for the neighbors of the
source tokens, while He et al.^{8} suggested various techniques such as adaptive retrieval
and dimension reduction for k-nearest-neighbor machine translation (kNN-MT).
In this paper, we adapt the dimension reduction and cache methods proposed
by He et al.^{8} and Martins et al.^{9} to kNN-MT, and further examine various composite
index methods that improve the performance of approximate k-nearest-neighbor search
over the datastore. The index is implemented using FAISS,^{10} which was previously
proposed by Khandelwal et al.^{6} to implement approximate nearest-neighbor (ANN)
search in kNN-MT. The index method basically combines an inverted file index^{11} with
product vector quantization,^{12} which reduces memory usage by storing only quantized
vectors and accelerates kNN search by pre-clustering the datastore vectors.
We examine various composite index methods for the efficient nearest-neighbor
(NN) retrieval of information from large NLP databases. Our focus is to improve the
existing indexing methods and to explore the trade-offs between indexing speed,
accuracy and memory usage.
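To make the index structures concrete, the following sketch shows how such composite
indexes can be declared with the FAISS index factory. It is a minimal illustration rather
than our experimental setup: the dimensionality, the number of inverted lists, the number
of PQ sub-quantizers, the HNSW connectivity and the nprobe value are placeholder
choices made only for the example.

    import numpy as np
    import faiss

    d = 1024                                             # dimensionality of the context vectors (placeholder)
    keys = np.random.rand(100_000, d).astype("float32")  # stand-in for the datastore keys

    # Baseline: inverted file index with product quantization (IVFPQ).
    ivfpq = faiss.index_factory(d, "IVF1024,PQ64")

    # OPQ learns a rotation of the vectors before PQ to reduce quantization error.
    opq_ivfpq = faiss.index_factory(d, "OPQ64,IVF1024,PQ64")

    # HNSW graph used as the coarse quantizer that assigns vectors to inverted lists.
    hnsw_ivfpq = faiss.index_factory(d, "IVF1024_HNSW32,PQ64")

    for index in (ivfpq, opq_ivfpq, hnsw_ivfpq):
        index.train(keys)                                # learn codebooks / clustering
        index.add(keys)                                  # encode and store the keys
        faiss.extract_index_ivf(index).nprobe = 32       # inverted lists visited per query

    # Retrieve the 8 nearest stored keys for a batch of query context vectors.
    queries = np.random.rand(4, d).astype("float32")
    distances, ids = ivfpq.search(queries, 8)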
2. Related Studies
The primary objective of NMT is to develop a neural network model that can
automatically translate text from a source language to a target language. Usually, the
NMT model performs well on sentences that have the same data distribution as
the training data, i.e. in-domain data. However, the model struggles when dealing with
out-of-domain data. That is why domain adaptation in NMT is used to improve the
performance of the model in a target domain, by adapting it from a source domain, when
the target-domain training data may be insufficient or unavailable. In this section, we
present various related studies that have been published on the approaches of domain
adaptation in NMT.
Based on Refs. 3 and 13, domain adaptation for NMT is classified into three main
approaches: data-, model- and inference time-centric approaches. Figure 1 shows
these three approaches.
Fig. 1. Overview of domain adaptation for the NMT task, based on Refs. 3 and 13.
The data-centric approach focuses on the data being used, which can be in-domain
monolingual corpora, synthetic parallel corpora or out-of-domain parallel corpora:
• In-domain monolingual corpora: They are a cheaper data source compared to
parallel bilingual sentence pairs and can be used to improve model performance in
MT. In statistical machine translation (SMT), monolingual data can be used
directly, while in NMT, more complex approaches are required to fuse LMs and
translation models.^{14} For instance, the monolingual data can be used to train the
decoder as the LM and the NMT model using multi-task learning.^{15} On the source
side, the monolingual data can be used to strengthen the encoder via multi-task
learning with both translation and reordering of source sentences.^{16}
• Synthetic parallel corpora: Back-translation of monolingual in-domain target
sentences can be used to generate a synthetic parallel corpus, which can further
enhance the decoder performance.^{17} This technique can be applied to both the
source and target corpora.
• Out-of-domain parallel corpora: They are often cheaper and more widely available
than bilingual in-domain data. Therefore, it is desirable to use both types of data
when training the NMT model to improve its performance on in-domain data while
achieving a solid baseline on out-of-domain data. The multi-domain method^{18} uses
tags to inform the NMT system as to whether a sentence is in-domain or out-of-domain.
The NMT system is trained on the out-of-domain data together with the smaller amount
of in-domain data, while oversampling the in-domain sentences. Data selection methods
from SMT systems have limited effects on NMT because their selection criteria do not
transfer well to NMT. However, the NMT system's internal sentence embeddings can be
used as a measure of similarity between in-domain and out-of-domain data.^{19} In
NMT, dynamically introducing in-domain data through gradual fine-tuning can
lead to significant increases in BLEU scores.^{20} The training data is scored based on
its relevance to in-domain translation, and starting with the entire dataset, more
and more relevant sentences are selected over several epochs, leading to smaller
subsets with more specific data.
The model-centric approach focuses on changing the NMT models for domain
adaptation, which can involve changing the training objective or procedure, the NMT
architecture or the decoding algorithm.
Training objective-centric technique of the model-centric group: Training objectives
can be manipulated to achieve better in-domain translation. Instance weighting
assigns each training instance a weight, computed from the cross-entropy under two
LMs, one in-domain and one out-of-domain, to reflect this goal.^{21} However, this is
challenging in NMT as it involves nonlinear models due to the activation functions in
the neural network. Another method is fine-tuning, which involves pre-training an
NMT system on a larger, parallel out-of-domain corpus and then optimizing the NMT
parameters according to a smaller in-domain corpus.^{17} A refinement of this method is
mixed fine-tuning, where the NMT model is trained exclusively on an out-of-domain
corpus until convergence and then continues training on a mix of out-of-domain and
in-domain data, with oversampling of in-domain sentences. This approach has been
shown to outperform both the multi-domain and fine-tuning methods.^{18}
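As an illustration of the instance-weighting idea (the exact scheme varies across works,
and the form below is only one plausible choice, not necessarily that of Ref. 21), each
sentence pair can scale the translation loss by a weight derived from the two LMs:

\mathcal{L}(\theta) = -\sum_{(s,t)} w(s)\, \log p_{\theta}(t \mid s), \qquad w(s) \propto \exp\big(H_{\mathrm{out}}(s) - H_{\mathrm{in}}(s)\big),

where $H_{\mathrm{in}}(s)$ and $H_{\mathrm{out}}(s)$ are the per-token cross-entropies of $s$
under the in-domain and out-of-domain LMs, so sentence pairs that look more in-domain
receive larger weights.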
Model architecture-centric technique of the model-centric group: It involves the two
approaches discussed below:
• One way to do this is through fusion approaches, where an in-domain Recurrent
Neural Network Language Model (RNNLM) is trained and combined with an
NMT model. Shallow fusion combines the scores from the NMT model and the LM to
choose the most suitable translation, while deep fusion integrates the RNNLM into the
NMT architecture to merge their internal representations, i.e. their hidden states, and
translate based on this fused representation. This can be done by training the LM and
the NMT model separately or jointly.^{14,15}
• Another approach is to introduce a Domain Discriminator as a discriminative
method, e.g. a feed-forward neural network, on top of the encoder to predict the
domain of the source sentence.^{22}
Decoding-centric technique of the model-centric group: This technique aims at modifying
the decoding algorithm, which makes it complementary to other model-centric
approaches:
• Shallow fusion is an approach in this technique that combines the LM and NMT
scores during hypothesis generation^{14} (see the formulation sketched after this list).
When extending an existing subhypothesis, the possible next words are evaluated
based on a weighted sum of the NMT and LM probabilities.
• Another approach is ensembling models trained on out-of-domain data with a
fine-tuned in-domain model, which helps prevent degrading model
performance on out-of-domain translations.^{23} This method combines the outputs of
multiple models, such as an out-of-domain model and an in-domain model, to generate
a single translation output.
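For reference, shallow fusion can be written as a weighted combination of the two
log-probabilities at each decoding step; the weight $\beta$ and the notation below are
used only for illustration, not taken from Ref. 14:

\log p(t_i \mid s, t_{1:i-1}) \propto \log p_{\mathrm{NMT}}(t_i \mid s, t_{1:i-1}) + \beta\, \log p_{\mathrm{LM}}(t_i \mid t_{1:i-1}),

and the hypotheses with the highest combined scores are kept during beam search.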
Inference time-centric group: This group makes changes at inference time; for instance,
a separate model could be assigned to each domain and the models combined at
inference. One example of this approach is the k-nearest-neighbor machine translation
proposed by Khandelwal et al.^{6} Here, the authors build a domain-specific datastore
that maps NMT decoder representations to in-domain target tokens. Then, at each
inference step, the NMT model's predictions are interpolated with a distribution over
the k-nearest-neighbor tokens. Their approach is robust since it overcomes the
catastrophic forgetting problem of parametric NMT models. However, its drawback lies
in the slow model runtime and search time when the datastore is large.
Our paper is based on the kNN-MT method proposed by Khandelwal et al.^{6} within this
inference-time approach.
3. Background
3.1. The k-nearest-neighbor machine translation
The kNN-MT is a retrieval-based MT method that augments a pre-trained MT
model with a nearest-neighbor retrieval mechanism. This allows the model to have
direct access to a datastore of cached translation examples, improving its translation
quality. At each time step, the model finds the k most similar contexts in the
datastore and computes a distribution over the corresponding target tokens, which is
then interpolated with the output distribution from the pre-trained MT model.
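Concretely, with an interpolation weight $\lambda \in [0, 1]$ (a hyperparameter), the
next-token distribution is

p(t_i \mid s, t_{1:i-1}) = \lambda\, p_{\mathrm{kNN}}(t_i \mid s, t_{1:i-1}) + (1 - \lambda)\, p_{\mathrm{MT}}(t_i \mid s, t_{1:i-1}),

where $p_{\mathrm{kNN}}$ is computed from the retrieved neighbors, e.g. by a softmax over
their negative distances to the current context representation, aggregated by target token.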
To elaborate, NMT models take a source language input sequence, $s = (s_1, \ldots, s_{M_1})$,
and generate a target language output sequence, $t = (t_1, \ldots, t_{M_2})$.
Autoregressive decoders condition the output distribution for each target token $t_i$
on both the source sequence and the previous target tokens, expressed as
$p(t_i \mid s, t_{1:i-1})$. The translation context is denoted as $(s, t_{1:i-1})$ and the target token as $t_i$.
Khandelwal et al.^{6} constructed the datastore consisting of key–value pairs computed
offline. The keys represent high-dimensional representations of the translation
context computed by the MT decoder using a mapping function $f(s, t_{1:i-1})$ from
input to intermediate representation, while the values are the corresponding ground-
truth target tokens $t_i$. For a parallel text collection $(S, T)$, representations are
generated for each example with a single forward pass, and the complete datastore is
defined as

(K, V) = \{(f(s, t_{1:i-1}), t_i)\ \forall t_i \in t \mid (s, t) \in (S, T)\}.

Note that source language tokens are not stored as values in the datastore, but are
implicitly conditioned on via the keys, while the values correspond to target language
tokens.
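The sketch below illustrates this construction step for a generic encoder-decoder model.
It is a simplified outline, not the original implementation: the callable passed as
decoder_states is a hypothetical stand-in for a forward hook that returns the decoder's
intermediate representation $f(s, t_{1:i-1})$ at every target position during one
teacher-forced pass.

    import numpy as np

    def build_datastore(decoder_states, corpus, dim):
        """Collect (key, value) pairs: keys are decoder context representations
        f(s, t_{1:i-1}); values are the ground-truth target token ids t_i.

        decoder_states(src_tokens, tgt_tokens) is assumed to return an array of
        shape (len(tgt_tokens), dim), one decoder state per target position."""
        keys, values = [], []
        for src_tokens, tgt_tokens in corpus:                # (s, t) in (S, T)
            states = decoder_states(src_tokens, tgt_tokens)  # one forward pass
            for i, t_i in enumerate(tgt_tokens):
                keys.append(states[i])                       # key   = f(s, t_{1:i-1})
                values.append(t_i)                           # value = t_i
        K = np.stack(keys).astype("float32")                 # shape: (N, dim)
        V = np.array(values, dtype="int64")                  # shape: (N,)
        return K, V

The key matrix K is then added to an index such as the composite FAISS indexes
discussed earlier, while V is kept as a parallel array so that a retrieved key id can be
mapped back to its target token.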
During generation at test time, given a source sequence $x$, the NMT model
produces a distribution $p_{\mathrm{MT}}(y_i \mid x, \hat{y}_{1:i-1})$ over the target vocabulary for the target