

Multi-task Learning for Newspaper Image
Segmentation and Baseline Detection Using
Attention-Based U-Net Architecture
Anukriti Bansal1⋆, Prerana Mukherjee2⋆, Divyansh Joshi2, Devashish Tripathi2, and Arun Pratap Singh2
1 The LNM Institute of Information Technology, Jaipur, Rajasthan, India
anukriti.bansal@lnmiit.ac.in
2 School of Engineering, Jawaharlal Nehru University, Delhi, India
divyan95 soe@jnu.ac.in, devash76 soe@jnu.ac.in, arun70 soe@jnu.ac.in,
prerana@jnu.ac.in
⋆ The authors have contributed equally.
Abstract. In this work, we propose an end-to-end, language-agnostic, multi-task learning based U-Net framework for performing text block segmentation and baseline detection in document images. We leverage the performance of U-Net by augmenting attention layers between the contracting and expansive paths via skip connections. The generalization ability of the model is also validated on handwritten images. We perform exhaustive experiments on the ICPR2020 challenge dataset and obtain test accuracies of 96.09% and 99.44% for simple-track baseline detection and text block segmentation, respectively, and 97.47% and 98.51% for complex-track baseline detection and text block segmentation, respectively. The source code is made publicly available at https://github.com/divyanshjoshi/Attention-U-Net-Newspaper-Text-Block-Segmentation.
Keywords: Multi-task learning · Newspaper document images · attention · text block segmentation · text baseline detection.
1 Introduction
With the increase in information dissemination, the usage of e-newspapers and digital content has become more prolific. This, in turn, drives the demand for more research in the domain of document image processing. Thus, it also becomes imperative to maintain and archive such diverse kinds of digital documents at minimal cost and in the most efficient manner. Among these documents, newspaper images have the most complex layout, as they contain an abundance of both text and graphics. There are two main challenges when dealing with newspaper document images: i) the structure of old newspapers is more complex than that of recent e-newspapers, as their pages are subject to more degradation due to issues such as poor scan quality and wear and tear of pages, and ii) there are huge variations in the layout structure across different publishing houses, as well as within the same edition across different issues.
In order to exploit the information present in such documents, it becomes necessary to segment these articles in a way that makes them more decipherable. Once this complex layout is extracted, many interesting applications, such as indexing and retrieval tasks [26, 27], become possible. In this work, we address two relevant problems in the page segmentation domain: i) text block segmentation, which enables separating the text and graphical components and treating them independently, and ii) text baseline detection, which helps in identifying the lines belonging to each block component. Motivated by the above-mentioned challenges and applications, we develop a novel method for logical labelling of such documents, particularly newspaper images in our context. We leverage the multi-task learning paradigm to jointly learn shareable parameters for performing two complementary tasks. Inspired by the much-celebrated U-Net [24, 25] architecture prominently utilized in medical imaging, we augment an attention block in the pipeline of the modified U-Net architecture catering to multi-task learning. The core idea is to provide successive layers as opposed to a purely contracting network; here, pooling operations are replaced by up-sampling operations, thus increasing the resolution of the newspaper page layout. High-resolution features from the contracting path, when combined with the up-sampled output and followed by a convolution layer, yield a more precise segmentation result. The large number of feature channels allows the propagation of contextual information across successive layers. The expansive path is symmetric to the contracting path and connected to it by skip connections and the attention block. The attention block is a self-attention, grid-based gating module which helps concentrate the attention coefficients on localized regions. Finally, the output is bifurcated into two branches: the text block output and the baseline output.
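To make the shared-trunk, two-headed design concrete, we give below a minimal PyTorch sketch of an attention-gated U-Net with a text-block head and a baseline head. The framework choice, network depth, channel widths, and the additive gating formulation used here are illustrative assumptions rather than the exact configuration of our model; the full implementation is available in the repository linked in the abstract.

```python
# Minimal sketch (illustrative, not the exact published configuration) of an
# attention-gated U-Net with two task-specific heads sharing one trunk.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch):
    # Standard U-Net building block: two 3x3 convolutions with ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )


class AttentionGate(nn.Module):
    """Additive attention gate applied to a skip connection."""

    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.theta = nn.Conv2d(skip_ch, inter_ch, 1)  # transforms skip features
        self.phi = nn.Conv2d(gate_ch, inter_ch, 1)    # transforms gating signal
        self.psi = nn.Conv2d(inter_ch, 1, 1)          # produces the attention map

    def forward(self, skip, gate):
        # The gating signal comes from the coarser decoder level; upsample it
        # to the skip resolution, then compute per-pixel attention coefficients.
        g = F.interpolate(self.phi(gate), size=skip.shape[2:],
                          mode="bilinear", align_corners=False)
        alpha = torch.sigmoid(self.psi(torch.relu(self.theta(skip) + g)))
        return skip * alpha                           # re-weighted skip features


class MultiTaskAttentionUNet(nn.Module):
    def __init__(self, in_ch=1, base=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.att2 = AttentionGate(base * 2, base * 4, base)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.att1 = AttentionGate(base, base * 2, base // 2)
        self.dec1 = conv_block(base * 2, base)
        # Two 1x1 task-specific heads bifurcate the shared decoder output.
        self.block_head = nn.Conv2d(base, 1, 1)       # text-block mask
        self.baseline_head = nn.Conv2d(base, 1, 1)    # baseline mask

    def forward(self, x):
        e1 = self.enc1(x)                             # contracting path
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), self.att2(e2, b)], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), self.att1(e1, d2)], dim=1))
        return (torch.sigmoid(self.block_head(d1)),
                torch.sigmoid(self.baseline_head(d1)))
```

A forward pass on a single-channel slice, e.g. MultiTaskAttentionUNet()(torch.rand(1, 1, 256, 256)), returns the text-block and baseline probability maps at the input resolution.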
1.1 Related Work
Page segmentation and baseline detection have long been important and active areas of research [23]. Here we review some of the recent work based on both hand-crafted features and advanced machine learning and deep learning algorithms.
1.2 Hand-crafted feature based methods
Document layout analysis is a very important research direction and has several applications in optical character recognition for geometric and logical analysis. Geometric layout analysis [7], typically referred to as page segmentation, requires splitting the page into homogeneous regions consisting of text or graphics. Most techniques attempt to solve it by detecting ruling lines (horizontal or vertical). Logical layout analysis [8, 13] requires segregation into logical units, such as headlines or paragraphs, and then forming consistent relationships amongst them. Article segmentation is highly dependent on the task's complexity [5]. In [22], the authors solve the problem of article detection in digitised newspaper images. Whereas most prior works assume a fixed ordering of text segments, the authors of [22] propose a 2D Markovian process to encode the appropriate reading order inside the geometric text blocks. Bansal et al. [4] utilized fixed point models to solve the task of article segmentation. Other works include text line segmentation in unconstrained printed text documents [14] and straight-line-based segmentation [2], which rely on handcrafted features for performing the requisite tasks.
1.3 Deep learning feature-based methods
In [12], the authors utilized a cascaded instance-aware segmentation technique based on a multi-scale fully convolutional network (FCN). It consists of two major components: i) a text block region segmentation framework, and ii) a rotation-invariant instance-aware segmentation stage which further disintegrates the text block regions into the requisite text or word lines. In [11], the authors provide an end-to-end framework for page segmentation performing three types of instance segmentation: text blocks, tables and figures. They propose a multi-scale, multi-task FCN learning framework which enables page segmentation and element-wise contour detection. On one hand, the semantic segmentation task performs pixel-wise prediction of the various elements, whereas on the other hand, the contour detection pipeline identifies the nearby edges around each element. A conditional random field network is trained on the outputs of the semantic segmentation and contour detection branches, which further improves the segmentation output. They also utilize some heuristic rule-based post-processing to identify the individual table elements. Lee et al. [17] proposed trainable multiplication layers (TMLs) in the standard U-Net convolutional neural network. These TMLs extract co-occurrence features across the layers to detect the presence of any recurring periodic textual element (such as tables, text line structures, etc.) or textural similarities among various elements in the text. In [19], the authors proposed a machine learning approach for page segmentation in which the first step generates classification scores for the various page components and the second step applies connected component analysis to group the semantically and spatially close components in the page layout.
1.4 Major Contributions
In view of the above discussions, our contributions can be summarized in the
following aspects:
1. We propose a novel end-to-end learning based deep neural network with two complementary branches, which is utilized to solve the problems of text block segmentation and baseline detection in document images. To the best of the authors' knowledge, this is the first work that segments text blocks as well as detects baselines in arbitrary types of documents in a unified multi-task learning based framework (an illustrative sketch of the joint training objective is given after this list).
2. We augment an attention block in the U-Net pipeline, which consists of a convolutional neural network with shared feature learning for text block segmentation and baseline detection, so that the network focuses on the correct regions of interest.
3. We evaluate the proposed approach on both tasks and demonstrate superior performance on the Text Block Segmentation Competition ICPR2020-NewsEye dataset as compared to existing frameworks. The generalization ability of the model to detect baselines on various types of document images is also shown on real handwritten documents from the ICDAR 2017 handwritten baseline detection dataset.
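As an illustration of the joint training referred to in contribution 1, the following sketch combines the two branch outputs into a single training objective. The use of binary cross-entropy and equal task weights is an assumption made only for illustration, not a statement of the exact loss used in our experiments.

```python
# Hypothetical joint objective for the two branches: a weighted sum of
# per-task binary cross-entropy losses (weights and loss type are assumptions).
import torch.nn.functional as F


def multitask_loss(block_pred, baseline_pred, block_gt, baseline_gt,
                   w_block=0.5, w_baseline=0.5):
    """Combine the text-block and baseline branch losses into one scalar."""
    loss_block = F.binary_cross_entropy(block_pred, block_gt)
    loss_baseline = F.binary_cross_entropy(baseline_pred, baseline_gt)
    return w_block * loss_block + w_baseline * loss_baseline
```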
The organization of the paper is as follows. In Section 2, we explain the proposed multi-task learning framework with attention gates for text block segmentation and baseline detection. Details of the experimental evaluations on various standard datasets are discussed in Section 3. Finally, we conclude the paper in Section 4 and provide avenues for future research.
2 Proposed multi-task learning-based framework
[Fig. 1 pipeline diagram: Image → Slice generation → Normalization → Modified U-Net → Text Block / Baseline outputs → Final outputs after stitching slices]
Fig. 1. Overview of the proposed method: Input document images are first augmented using a unique image slicing-based method. The intensity values of each slice are normalized (max-normalization) to the range 0 to 1 before being fed into an attention-based U-Net architecture. The model outputs two images, one containing text blocks and the other containing baselines.
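To illustrate the slice-and-normalize step in Fig. 1, the sketch below cuts a page into fixed-size tiles, max-normalizes each tile into [0, 1], and stitches per-slice predictions back into a full-page mask. The tile size, the non-overlapping grid, and the zero padding are assumptions made only for this illustration.

```python
# Illustrative slice generation, max-normalization and stitching (Fig. 1).
# Tile size, non-overlapping grid and zero padding are assumptions.
import numpy as np


def make_slices(page, slice_h=512, slice_w=512):
    """Cut a greyscale page (H x W array) into fixed-size slices, padding the border."""
    h, w = page.shape
    padded = np.pad(page, ((0, (-h) % slice_h), (0, (-w) % slice_w)),
                    mode="constant")
    slices, positions = [], []
    for y in range(0, padded.shape[0], slice_h):
        for x in range(0, padded.shape[1], slice_w):
            tile = padded[y:y + slice_h, x:x + slice_w].astype(np.float32)
            if tile.max() > 0:
                tile = tile / tile.max()    # max-normalization into [0, 1]
            slices.append(tile)
            positions.append((y, x))
    return slices, positions, padded.shape


def stitch_slices(pred_slices, positions, padded_shape, page_shape):
    """Reassemble per-slice predictions into a full-page mask and drop the padding."""
    canvas = np.zeros(padded_shape, dtype=np.float32)
    for tile, (y, x) in zip(pred_slices, positions):
        canvas[y:y + tile.shape[0], x:x + tile.shape[1]] = tile
    return canvas[:page_shape[0], :page_shape[1]]
```

At inference time each slice is passed through the network and the two predicted masks (text blocks and baselines) are stitched back separately before any thresholding.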

