MINISTRY OF EDUCATION
AND TRAINING
VIETNAM ACADEMY OF
SCIENCE AND TECHNOLOGY
GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
-----------------------------
VU QUOC TUAN
DISCOVERING FUNCTIONAL DEPENDENCIES
AND RELAXED FUNCTIONAL DEPENDENCIES
IN DATABASES
Major: Math Fundamentals for Informatics
Code: 9 46 01 10
SUMMARY OF MATHEMATICS DOCTORAL THESIS
Ha Noi - 2019
This work is completed at:
Graduate University of Science and Technology
Vietnam Academy of Science and Technology
Supervisor 1: Assoc. Prof. Dr. Ho Thuan
Supervisor 2: Assoc. Prof. Dr. Nguyen Thanh Tung
Reviewer 1: …..................................................
Reviewer 2: …..................................................
Reviewer 3: …..................................................
This thesis will be officially presented in front of the Doctoral thesis
Grading Committee, meeting at:
Graduate University of Science and Technology
Vietnam Academy of Science and Technology
At ............. hrs ....... day ....... month....... year .......
This thesis is available at:
1. Library of Graduate University of Science and Technology
2. National Library of Vietnam.
1
INTRODUCTION
Data dependencies play important roles in database design, data
quality management and knowledge representation. Dependencies in
knowledge discovery are extracted from the existing data of the
database. This extraction process is called dependency discovery.
The aim of dependency discovery is to find important
dependencies holding on the data of the database. These discovered
dependencies represent domain knowledge and can be used to verify
database design and assess data quality.
Dependency discovery has attracted a lot of research interests
from scientists since early 1980s. At the present time, the problem of
discovering data dependencies on big data sets becomes more
important because these big data sets contain a lot of valuable
knowledge.
Currently, with the development of digital devices, especially
social networks and smart phone applications, the amount of data in
the applications increases very quickly, these arise proplems in data
storage, data management, especially the problem of knowledge
discovery from those big data sets. The problem of discovering FDs
and RFDs in databases is one of important proplems of knowledge
discovery. Three typical types of data dependencies which are
interested in discovering are FD, AFD and CFD. AFD is an
extension of FD, the "approximation" is based on a degree of
satisfaction or an error measure; CFD is an extension of FD which
aims to capture inconsistencies in data.
The research directions which solve the problem of RFD
discovery in databases, firstly focus on FD discovery because FD is
the separate case of all types of RFD, the research results about FD
2
discovery can be adapted to discover other types of data
dependencies (such as AFD). The general model of FD discovery
problem includes steps: generating a search space of FDs, verifying
the satisfaction of each FD, pruning the search space, outputting the
set of satisfied FDs and reducing redundancies in this set of satisfied
FDs. In the FD discovery problem, the key discovery is the special
case and is also an important problem in normalizing relational
databases.
The time complexity of the FD discovery problem is polynomial
in the number of tuples in the relation but is exponential in the
number of attributes of that relation. Therefore, for reducing the
processing time, effective pruning rules should be developed. Among
the proposed pruning rules, it is important to prune keys, and if a key
is discovered then it is possible to prune (delete) all sets containing
the key in the search space. However, the disadvantage of existing
key pruning rules is to find keys on the entire set of attributes of
the database (this is really a very difficult problem because the time
complexity can be exponential in number of attributes of ). So is
there any way to find keys in a proper subset of ? This question is
one of basic motivations of this thesis.
After the set of data dependencies is discovered, this set can be
very large and difficult to use because it contains unnecessary
redundancies. The important problem is how to eliminate (as much
as possible) the redundancy in the set of discovered data
dependencies. This is also a problem interested in the thesis.
Another research direction in the thesis is to focus on discovering
two typical types of RFD, namely AFD and CFD. Both AFD and
CFD have many applications and occurences in relational databases,
3
especially CFD is also a powerful tool for dealing with data cleaning
problems. For AFD, the most important problem is to improve and
develop techniques for computing approximate measures; For CFD,
in addition to discovering them, the research about a unified
hierarchy between CFD and other types of data dependencies is also
a very interesting problem.
The research content in the thesis is the current problems which
are renewed with a series of works of foreign authors; while in the
country (in Vietnam), there are many published works related to
methods and algorithms finding reducts of a decision table by
different approaches.
The objective of the thesis is to research some analyzed problems
in range of relational databases. The main contents of the thesis are
described as follows:
Chapter 1. An overview of relational data model, concepts of
functional dependency, closure of a set of attributes, key for a
relational schema, etc. This chapter also focuses on RFD and the
generalization of methods used for discovering FDs and RFDs.
Chapter 2. The presentation of AFD and CFD (two typical types
of RFD) and some related results.
Chapter 3. The presentation of the closure computing algorithms
of a set of attributes under a set of FDs, reducing the key finding
problem of a relation schema and some related results.
Chapter 4. The presentation of an effective preprocessing
transformation for sets of FDs (to reduce redundancies in a given set
of FDs) and some related results.