MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF SCIENCE AND TECHNOLOGY
BUI HAI PHONG
ENHANCING PERFORMANCE OF MATHEMATICAL
EXPRESSION DETECTION IN SCIENTIFIC
DOCUMENT IMAGES
Major: Computer Science
Code: 9480101
ABSTRACT OF DOCTORAL DISSERTATION
COMPUTER SCIENCE
Hanoi 2021
This study is completed at:
Hanoi University of Science and Technology
Supervisors:
1. Assoc. Prof. Hoang Manh Thang
2. Assoc. Prof. Le Thi Lan
Reviewer 1:
Reviewer 2:
Reviewer 3:
This dissertation will be defended before the approval committee
at Hanoi University of Science and Technology:
Time , date month year 2021
This dissertation can be found at:
1. Ta Quang Buu Library - Hanoi University of Science and Technology
2. Vietnam National Library
INTRODUCTION
Motivation
A huge number of scientific documents have been produced to date, and they provide valuable
information for the research community. These documents need to be digitized so that users
can retrieve information efficiently. Most recent documents are published in PDF format;
however, a large number of documents are still available only in raster format. PDF
processing techniques cannot be applied to such raster document images, so image processing
is needed to digitize them. The key steps of document digitization are document analysis,
optical character recognition and content searching [2]. The digitization of standard
text-rich documents is considered a solved problem. However, the digitization of scientific
documents that are rich in mathematical expressions (MEs) is a non-trivial task. Scientific
documents usually consist of heterogeneous components: tables, figures, text and MEs. MEs
may be mixed with the other components, and their sizes and styles vary frequently.
Improving the accuracy of ME detection and recognition is therefore an important step in
the digitization of scientific documents. Motivated by these observations, the thesis aims
to improve the accuracy of the detection and recognition of MEs in scientific document images.
Introduction to ME detection and recognition in document images
In mathematics, an expression (or mathematical expression) is a finite combination of
symbols that is well-formed according to rules that depend on the context [5]. In scientific
documents, MEs fall into two categories: isolated (displayed) and inline (embedded)
expressions. Isolated expressions are displayed on separate lines, whereas inline expressions
are mixed with other components of the document page, e.g. text and figures.
The detection of expressions aims to locate MEs in document images, while the recognition
of MEs aims to convert expressions from image format to a string (a LaTeX representation).
An example of ME detection and recognition is illustrated in Figure 1. The detection and
recognition of MEs in document images are closely related: accurate detection is a
prerequisite for accurate recognition, whereas incorrect detection may cause errors in the
recognition of MEs.
Figure 1 Example of the detection (a) and a detected ME in a document image (b). Isolated
and inline MEs are denoted in red and blue, respectively. The extracted ME is recognized and
represented using LaTeX (c).
The thesis makes the following assumptions: (1) The thesis focuses on the detection and
recognition of MEs in scientific document images that are written in a formal way. It aims
to detect MEs in the body of documents; the detection of MEs contained in other document
components, such as tables and figures, is investigated in other problems (table or figure
detection). Moreover, the size of an ME should not exceed the size of the whole document.
(2) Scientific documents can be generated in various ways: camera-captured images,
handwritten documents, scanned pages or PDF conversion. Moreover, the detection accuracy
depends heavily on the quality of the documents. Like conventional methods in document
analysis, the thesis focuses on the detection of MEs in document images that are scanned at
high resolution and are not skewed. (3) Detected MEs are represented by bounding boxes. The
detected MEs are then recognized and represented in LaTeX format [4].
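The output assumed above (a bounding box per ME, later paired with a LaTeX string) can be pictured as a small record type. The following is only an illustrative sketch: the names `MEKind`, `DetectedME` and `fits_in_page`, the pixel-coordinate convention and the sample values are assumptions made for this example and are not taken from the thesis.

```python
from dataclasses import dataclass
from enum import Enum


class MEKind(Enum):
    ISOLATED = "isolated"  # displayed on its own line
    INLINE = "inline"      # embedded in a line of text


@dataclass(frozen=True)
class DetectedME:
    # Hypothetical convention: pixel coordinates, (x0, y0) is the top-left
    # corner and (x1, y1) the bottom-right corner (exclusive).
    x0: int
    y0: int
    x1: int
    y1: int
    kind: MEKind
    latex: str = ""  # filled in later by the recognition stage

    def area(self) -> int:
        return max(0, self.x1 - self.x0) * max(0, self.y1 - self.y0)

    def fits_in_page(self, page_width: int, page_height: int) -> bool:
        # Mirrors assumption (1): an ME does not exceed the document image.
        return (0 <= self.x0 < self.x1 <= page_width
                and 0 <= self.y0 < self.y1 <= page_height)


# Made-up sample values, for illustration only.
me = DetectedME(120, 300, 480, 352, MEKind.ISOLATED, r"E = mc^{2}")
```

Keeping the kind (isolated or inline) next to the box lets a downstream recognizer treat the two categories differently if needed.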
The main challenges of the recognition of MEs are as follows: (1) Accurate recognition of a
large number of mathematical symbols is a difficult task. (2) Some symbols in MEs may play
different roles in different contexts. (3) Operator symbols can be explicit or implicit.
When consecutive operator symbols exist in an expression, operator precedence rules can be
applied to group the symbols into units. (4) Mathematical notation has many dialects. As
with natural languages, it is impossible to design a system that can recognize all dialects.
As a result, our systems are developed based on a subset of the mathematical notation only.
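Challenge (3) can be made concrete with a small sketch of precedence-based grouping. The code below uses precedence climbing, a standard parsing technique chosen here only for illustration; it is not claimed to be the method used in the thesis. The precedence table, the treatment of adjacent operands as an implicit product, and the function names are all assumptions.

```python
# Hypothetical precedence table covering a tiny subset of notation.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOC = {"^"}


def insert_implicit_times(tokens):
    """Make an implicit product explicit: ['2', 'x'] -> ['2', '*', 'x']."""
    out = []
    for tok in tokens:
        if out and out[-1] not in PRECEDENCE and tok not in PRECEDENCE:
            out.append("*")
        out.append(tok)
    return out


def group(tokens):
    """Group a flat token list into nested (operator, left, right) tuples."""
    pos = 0

    def parse_expr(min_prec):
        nonlocal pos
        left = tokens[pos]  # operands are single tokens in this sketch
        pos += 1
        while (pos < len(tokens) and tokens[pos] in PRECEDENCE
               and PRECEDENCE[tokens[pos]] >= min_prec):
            op = tokens[pos]
            pos += 1
            # Right-associative operators may recurse at the same level.
            next_min = PRECEDENCE[op] if op in RIGHT_ASSOC else PRECEDENCE[op] + 1
            left = (op, left, parse_expr(next_min))
        return left

    return parse_expr(1)


tree = group(insert_implicit_times(["2", "x", "+", "1"]))
```

Here `tree` groups the implicit product `2x` into one unit before the addition is applied, which is the kind of grouping the precedence rules are meant to provide.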
Contributions
The main scientific contributions of the thesis are threefold: (1) First, a two-stage hybrid
method is proposed for the effective detection of MEs. Both hand-crafted and deep learning
features are extensively investigated and combined to improve the detection accuracy. The
merit of the method is that it operates directly on the ME images without employing
character recognition. (2) Second, an end-to-end framework for mathematical expression
detection in scientific document images is proposed that does not use any Optical Character
Recognition (OCR) or document analysis techniques, unlike conventional methods. The distance
transform is first applied to the input document images in order to take advantage of the
distinctive features of the spatial layout of MEs. The transformed images are then fed into
a Faster Region-based Convolutional Neural Network (Faster R-CNN) that has been optimized to
improve the detection accuracy. (3) Finally, the detection and recognition of MEs are
integrated into a single system: the MEs in document images are detected and recognized, and
the recognition results are represented in LaTeX.
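To give a feel for the distance transform mentioned in contribution (2), the sketch below computes a two-pass city-block distance transform on a binarized page. This section does not state which distance metric or implementation the thesis uses, so the city-block metric, the convention that ink pixels have value 1, and the function name `distance_transform` are assumptions made for illustration.

```python
def distance_transform(binary):
    """City-block distance from every pixel to the nearest ink pixel (value 1).

    `binary` is a list of rows of 0/1 values. If the image has no ink at
    all, every entry stays at the sentinel value h + w.
    """
    h, w = len(binary), len(binary[0])
    inf = h + w  # larger than any city-block distance inside the image
    dist = [[0 if binary[y][x] else inf for x in range(w)] for y in range(h)]

    # Forward pass: propagate distances from the top and left neighbours.
    for y in range(h):
        for x in range(w):
            if y > 0:
                dist[y][x] = min(dist[y][x], dist[y - 1][x] + 1)
            if x > 0:
                dist[y][x] = min(dist[y][x], dist[y][x - 1] + 1)

    # Backward pass: propagate distances from the bottom and right neighbours.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                dist[y][x] = min(dist[y][x], dist[y + 1][x] + 1)
            if x < w - 1:
                dist[y][x] = min(dist[y][x], dist[y][x + 1] + 1)
    return dist


page = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
dt = distance_transform(page)
```

The transformed image encodes how far each pixel lies from the nearest stroke, so it reflects the spacing between symbols and lines, which is a property of the spatial layout.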
Thesis structure
The Introduction first presents the basic information and definitions of ME detection and
recognition. It then presents the scope of the thesis and summarizes the main contributions.
Chapter 1 reviews significant works related to the detection and recognition of MEs; the
contributions of the thesis are proposed based on the limitations of these works. Chapter 2
presents ME detection using the fusion of hand-crafted and deep learning features. Chapter 3
presents ME detection using the combination of the Distance Transform (DT) of images and
Faster R-CNN; this framework achieves high detection accuracy in an end-to-end way. Chapter 4
presents the system for ME detection and recognition. The Conclusion summarizes the thesis
and discusses future work.
CHAPTER 1
LITERATURE REVIEW
In this chapter, significant works on the detection and recognition of MEs in document
images are analysed.
1.1 Document analysis
Traditional approaches to ME detection in document images normally consist of two steps
[10, 14]: document analysis and ME detection. The first step focuses on obtaining the text
lines and words of text paragraphs, whereas the second focuses on separating MEs from normal
text. Document layout analysis can be defined as the task of segmenting a given document
into semantically meaningful regions. Page segmentation, a well-researched topic of document
analysis, aims to identify regions in documents and classify them into physical components
such as tables, figures and text. It remains an active research topic and has attracted
increasing attention in recent years. First, image preprocessing (noise removal and skew
correction) is performed. Then, each component (e.g. text, figure or table) is separated
based on its structural layout. Traditional page segmentation techniques can be divided into
four types: top-down, bottom-up, multi-scale resolution and hybrid methods. In recent years,
deep learning approaches have been applied to page segmentation. Their advantage is that the
page segmentation task is performed without prior knowledge of the document structure.
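As a rough illustration of the first step (obtaining text lines), the sketch below splits a binarized, deskewed page into horizontal bands using a row-wise ink check, i.e. a simple horizontal projection profile. This is a classical technique shown only as an example; it is not claimed to be the procedure used in the cited works or in the thesis, and the function name `text_line_bands` is hypothetical.

```python
def text_line_bands(binary):
    """Return (top, bottom) row ranges, bottom exclusive, of rows that contain ink.

    `binary` is a list of rows of 0/1 values, where 1 marks an ink pixel.
    """
    bands = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                  # a new band begins
        elif not has_ink and start is not None:
            bands.append((start, y))   # the current band ends at a blank row
            start = None
    if start is not None:
        bands.append((start, len(binary)))  # band touching the bottom edge
    return bands


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]
bands = text_line_bands(page)
```

A sketch like this relies on the preprocessing mentioned above: on noisy or skewed pages the blank rows between lines disappear, and the bands merge.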