

Multi-task Learning for Newspaper Image
Segmentation and Baseline Detection Using
Attention-Based U-Net Architecture
Anukriti Bansal1⋆, Prerana Mukherjee2⋆, Divyansh Joshi2, Devashish Tripathi2, and Arun Pratap Singh2
1 The LNM Institute of Information Technology, Jaipur, Rajasthan, India
anukriti.bansal@lnmiit.ac.in
2 School of Engineering, Jawaharlal Nehru University, Delhi, India
divyan95 soe@jnu.ac.in, devash76 soe@jnu.ac.in, arun70 soe@jnu.ac.in,
prerana@jnu.ac.in
⋆ The authors have contributed equally.
Abstract. In this work, we propose an end-to-end, language-agnostic, multi-task learning based U-Net framework for performing text block segmentation and baseline detection in document images. We leverage the performance of U-Net by augmenting attention layers between the contracting and expansive paths via skip connections. The generalization ability of the model is also validated on handwritten images. We perform exhaustive experiments on the ICPR2020 challenge dataset and obtain test accuracies of 96.09% and 99.44% for simple-track baseline detection and text block segmentation, respectively, and 97.47% and 98.51% for complex-track baseline detection and text block segmentation, respectively. The source code is made publicly available at https://github.com/divyanshjoshi/Attention-U-Net-Newspaper-Text-Block-Segmentation.
Keywords: Multi-task learning · Newspaper document images · attention · text block segmentation · text baseline detection.
1 Introduction
With the increase in information dissemination, the usage of e-newspapers and digital content has become more prolific. This, in turn, drives the demand for more research in the domain of document image processing. Thus, it also becomes imperative to maintain and archive such diverse kinds of digital documents at minimal cost and in the most efficient manner. Among these documents, newspaper images have the most complex layout, as they contain an abundance of both text and graphics. There are two main challenges when dealing with newspaper document images: i) the structure of old newspapers is more complex than that of recent e-newspapers, as their pages are subject to more degradation due to issues such as poor scan quality and wear and tear of pages, and ii) there are huge variations in the layout structure across different publishing houses, as well as within the same edition across different issues.
In order to exploit the information present in such documents, it becomes necessary to segment these articles in a way that makes them more decipherable. Once this complex layout is extracted, many interesting applications, such as indexing and retrieval tasks [26, 27], become possible. In this work, we address two relevant problems in the page segmentation domain: i) text block segmentation, which enables separating the text and graphical components and treating them independently, and ii) text baseline detection, which helps in identifying the lines belonging to each block component. Motivated by the above-mentioned challenges and applications, we develop a novel method for logical labelling of such documents, particularly newspaper images in our context. We leverage the multi-task learning paradigm to jointly learn shareable parameters for performing two complementary tasks. Inspired by the much-celebrated U-Net [24, 25] architecture prominently utilized in medical imaging, we augment an attention block in the pipeline of the modified U-Net architecture catering to multi-task learning. The core idea is to provide successive layers as opposed to a purely contracting network; here, pooling operations are replaced by up-sampling operations, thus increasing the resolution of the newspaper page layout. High-resolution features from the contracting path, when combined with the up-sampled output and followed by a convolution layer, yield a more precise segmentation result. The large number of feature channels allows the propagation of contextual information across successive layers. The expansive path is symmetric to the contracting path and connected to it by skip connections and the attention block. The attention block is a self-attention, grid-based gating module which helps concentrate the attention coefficients on localized regions. Finally, the output is bifurcated into two branches: the text block output and the baseline output.
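To make the shared-trunk, two-headed design concrete, we give below a minimal PyTorch sketch of an attention-gated U-Net with a text-block head and a baseline head. The framework choice, network depth, channel widths, and the additive gating formulation used here are illustrative assumptions rather than the exact configuration of our model; the full implementation is available in the repository linked in the abstract.

```python
# Minimal sketch (illustrative, not the exact published configuration) of an
# attention-gated U-Net with two task-specific heads sharing one trunk.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch):
    # Standard U-Net building block: two 3x3 convolutions with ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )


class AttentionGate(nn.Module):
    """Additive attention gate applied to a skip connection."""

    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.theta = nn.Conv2d(skip_ch, inter_ch, 1)  # transforms skip features
        self.phi = nn.Conv2d(gate_ch, inter_ch, 1)    # transforms gating signal
        self.psi = nn.Conv2d(inter_ch, 1, 1)          # produces the attention map

    def forward(self, skip, gate):
        # The gating signal comes from the coarser decoder level; upsample it
        # to the skip resolution, then compute per-pixel attention coefficients.
        g = F.interpolate(self.phi(gate), size=skip.shape[2:],
                          mode="bilinear", align_corners=False)
        alpha = torch.sigmoid(self.psi(torch.relu(self.theta(skip) + g)))
        return skip * alpha                           # re-weighted skip features


class MultiTaskAttentionUNet(nn.Module):
    def __init__(self, in_ch=1, base=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.att2 = AttentionGate(base * 2, base * 4, base)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.att1 = AttentionGate(base, base * 2, base // 2)
        self.dec1 = conv_block(base * 2, base)
        # Two 1x1 task-specific heads bifurcate the shared decoder output.
        self.block_head = nn.Conv2d(base, 1, 1)       # text-block mask
        self.baseline_head = nn.Conv2d(base, 1, 1)    # baseline mask

    def forward(self, x):
        e1 = self.enc1(x)                             # contracting path
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), self.att2(e2, b)], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), self.att1(e1, d2)], dim=1))
        return (torch.sigmoid(self.block_head(d1)),
                torch.sigmoid(self.baseline_head(d1)))
```

A forward pass on a single-channel slice, e.g. MultiTaskAttentionUNet()(torch.rand(1, 1, 256, 256)), returns the text-block and baseline probability maps at the input resolution.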
1.1 Related Work
Page segmentation and baseline detection have long been important and active areas of research [23]. Here we review some of the recent work based on both hand-crafted features and advanced machine learning and deep learning algorithms.
1.2 Hand-crafted feature based methods
Document layout analysis is a very important research direction and has several applications in optical character recognition for geometric and logical analysis. Geometric layout analysis [7], typically referred to as page segmentation, requires splitting the page into homogeneous regions consisting of text or graphics. Most techniques attempt to solve it by detecting ruling lines (horizontal or vertical). Logical layout analysis [8, 13] requires segregation into logical units, such as headlines or paragraphs, and then forming consistent relationships amongst them. Article segmentation is highly dependent on the task's complexity [5]. In [22], the authors solve the problem of article detection in digitised newspaper images. Whereas most prior works assume a fixed ordering of text segments, the authors of [22] propose a 2D Markovian process to encode the appropriate reading order inside the geometric text blocks. Bansal et al. [4] utilized fixed point models to solve the task of article segmentation. Other works include text line segmentation in unconstrained printed text documents [14] and straight-line-based segmentation [2], which rely on handcrafted features for performing the requisite tasks.
1.3 Deep learning feature-based methods
In [12], the authors utilized a cascaded instance-aware segmentation technique based on a multi-scale fully convolutional network (FCN). It consists of two major components: i) a text block region segmentation framework, and ii) a rotation-invariant instance-aware segmentation stage which further disintegrates the text block regions into the requisite text or word lines. In [11], the authors provide an end-to-end framework for page segmentation performing three types of instance segmentation: text blocks, tables and figures. They propose a multi-scale, multi-task FCN learning framework which enables page segmentation and element-wise contour detection. On one hand, the semantic segmentation task performs pixel-wise prediction of the various elements, whereas on the other hand, the contour detection pipeline identifies the nearby edges around each element. A conditional random field network is trained on the outputs of the semantic segmentation and contour detection branches, which further improves the segmentation output. They also utilize some heuristic rule-based post-processing to identify the individual table elements. Lee et al. [17] proposed trainable multiplication layers (TMLs) in the standard U-Net convolutional neural network. These TMLs extract co-occurrence features across the layers to detect the presence of any recurring periodic textual element (such as tables, text line structures, etc.) or textural similarities among various elements in the text. In [19], the authors proposed a machine learning approach for page segmentation in which the first step generates classification scores for the various page components and the second step applies connected component analysis to group the semantically and spatially close components in the page layout.
1.4 Major Contributions
In view of the above discussions, our contributions can be summarized in the
following aspects:
1. We propose a novel end-to-end learning based deep neural network with two complementary branches, which is utilized to solve the problems of text block segmentation and baseline detection in document images. To the best of the authors' knowledge, this is the first work that segments text blocks as well as detects baselines in arbitrary types of documents in a unified multi-task learning based framework (an illustrative sketch of the joint training objective is given after this list).
2. We augment an attention block in the U-Net pipeline, which consists of a convolutional neural network with shared feature learning for text block segmentation and baseline detection, so that the network focuses on the correct regions of interest.
3. We evaluate the proposed approach on both tasks and demonstrate superior performance on the Text Block Segmentation Competition ICPR2020-NewsEye dataset as compared to existing frameworks. The generalization ability of the model to detect baselines on various types of document images is also shown on real handwritten documents from the ICDAR 2017 handwritten baseline detection dataset.
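As an illustration of the joint training referred to in contribution 1, the following sketch combines the two branch outputs into a single training objective. The use of binary cross-entropy and equal task weights is an assumption made only for illustration, not a statement of the exact loss used in our experiments.

```python
# Hypothetical joint objective for the two branches: a weighted sum of
# per-task binary cross-entropy losses (weights and loss type are assumptions).
import torch.nn.functional as F


def multitask_loss(block_pred, baseline_pred, block_gt, baseline_gt,
                   w_block=0.5, w_baseline=0.5):
    """Combine the text-block and baseline branch losses into one scalar."""
    loss_block = F.binary_cross_entropy(block_pred, block_gt)
    loss_baseline = F.binary_cross_entropy(baseline_pred, baseline_gt)
    return w_block * loss_block + w_baseline * loss_baseline
```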
The organization of the paper is as follows. In Section 2, we explain the proposed multi-task learning framework with attention gates for text block segmentation and baseline detection. Details of the experimental evaluations on various standard datasets are discussed in Section 3. Finally, we conclude the paper in Section 4 and provide avenues for future research.
2 Proposed multi-task learning-based framework
[Fig. 1 pipeline diagram: Image → Slice generation → Normalization → Modified U-Net → Text Block / Baseline outputs → Final outputs after stitching slices]
Fig. 1. Overview of the proposed method: Input document images are first augmented using a unique image slicing-based method. The intensity values of each slice are normalized (max-normalization) to the range 0 to 1 before being fed into an attention-based U-Net architecture. The model outputs two images, one containing text blocks and the other containing baselines.
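To illustrate the slice-and-normalize step in Fig. 1, the sketch below cuts a page into fixed-size tiles, max-normalizes each tile into [0, 1], and stitches per-slice predictions back into a full-page mask. The tile size, the non-overlapping grid, and the zero padding are assumptions made only for this illustration.

```python
# Illustrative slice generation, max-normalization and stitching (Fig. 1).
# Tile size, non-overlapping grid and zero padding are assumptions.
import numpy as np


def make_slices(page, slice_h=512, slice_w=512):
    """Cut a greyscale page (H x W array) into fixed-size slices, padding the border."""
    h, w = page.shape
    padded = np.pad(page, ((0, (-h) % slice_h), (0, (-w) % slice_w)),
                    mode="constant")
    slices, positions = [], []
    for y in range(0, padded.shape[0], slice_h):
        for x in range(0, padded.shape[1], slice_w):
            tile = padded[y:y + slice_h, x:x + slice_w].astype(np.float32)
            if tile.max() > 0:
                tile = tile / tile.max()    # max-normalization into [0, 1]
            slices.append(tile)
            positions.append((y, x))
    return slices, positions, padded.shape


def stitch_slices(pred_slices, positions, padded_shape, page_shape):
    """Reassemble per-slice predictions into a full-page mask and drop the padding."""
    canvas = np.zeros(padded_shape, dtype=np.float32)
    for tile, (y, x) in zip(pred_slices, positions):
        canvas[y:y + tile.shape[0], x:x + tile.shape[1]] = tile
    return canvas[:page_shape[0], :page_shape[1]]
```

At inference time each slice is passed through the network and the two predicted masks (text blocks and baselines) are stitched back separately before any thresholding.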

