Vol:.(1234567890)
Applied Intelligence (2024) 54:4012–4042
https://doi.org/10.1007/s10489-024-05296-2
Efficient algorithms tomine concise representations offrequent high
utility occupancy patterns
HaiDuong1· HuyPham1· TinTruong1· PhilippeFournier‑Viger2
Accepted: 28 January 2024 / Published online: 18 March 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024
Abstract
Identifying all frequent high utility occupancy itemsets (FHUOIs) in a quantitative transaction dataset is a new trend in data
mining. By combining both factors of frequency and utility occupancy, these patterns are more suitable for several applica-
tions in the real world. These patterns not only reflect the interests of most users but also contribute a high proportion of the
utility in supporting transactions. Nonetheless, the set of all discovered FHUOIs may be very large, especially for large and
dense datasets or for low values of predefined minimum thresholds. For this reason, it is often quite challenging for users
to analyze and use the obtained patterns. To address this issue, this paper proposes two novel algorithms named MaxCloF-
HUOIM and CloFHUOIM to extract compact representations of FHUOIs. The former is designed to simultaneously mine
two concise representations of FHUOIs that consist of all closed FHUOIs and all maximal FHUOIs, whereas the latter only
discovers the closed FHUOIs, which provide a lossless summary of all FHUOIs. The proposed algorithms rely on a novel
weak upper bound on utility occupancy, to reduce the search space by quickly pruning itemsets with low utility occupancy.
Especially, the algorithms integrate two new efficient strategies to prune non-closed FHUOI candidate branches early in the
prefix search tree. Results from an in-depth experimental evaluation conducted on several benchmark real-life and synthetic
quantitative datasets demonstrate that MaxCloFHUOIM and CloFHUOIM have excellent performance in terms of runtime,
memory usage, and scalability. In particular, the proposed algorithms are up to two orders of magnitude faster than a baseline
algorithm.
Keywords Frequent high utility occupancy itemset· Upper-bound· Weak upper-bound· Concise representations
1 Introduction
Frequent pattern mining (FPM) [1] in transaction databases
is a traditional problem in data mining, which has many
practical applications such as analyzing customer transac-
tions to study the customers’ purchase behavior. Algorithms
for mining frequent patterns in a binary transaction database
can identify patterns that frequently appear in a database
and represent the common interests of several customers
(e.g. the items that customers like to purchase together).
Since FPM plays an essential role in association rule min-
ing [14], many FPM algorithms have been put forward
[57]. However, the frequency of a pattern alone is often
insufficient to obtain a good understanding of the data. In
particular, traditional FPM does not consider the proportion
of a transaction that is occupied by a pattern in its support-
ing transaction, which is the ratio of the number of items in
the pattern to the total number of items in the transaction
where it appears. This information plays an important role
in many real-life situations and can be captured by a measure
called occupancy, which is the relative length of a pattern to
those of its supporting transactions [8]. By considering both
frequency and occupancy, a user can mine not only frequent
but also dominant patterns. This concept is valuable in many
real-world applications, such as recommending print-areas
* Philippe Fournier-Viger
philfv@szu.edu.cn
Hai Duong
haidv@dlu.edu.vn
Huy Pham
huypq@dlu.edu.vn
Tin Truong
tintc@dlu.edu.vn
1 Department ofMathematics andComputer Science, Dalat
University, Dalat, Vietnam
2 Shenzhen University, Shenzhen, China
4013
Efficient algorithms tomine concise representations offrequent high utility occupancy…
of web pages, and recommending investment portfolios
based on a database of financial assets [8], or making travel
landscape recommendations [9]. For instance, when recom-
mending print-areas, such patterns may not only reflect the
preferences of most users but also ensure that no relevant
items are left out of a recommended pattern. Similarly, when
recommending investment portfolios, a new investor seeking
a balanced and safe investment plan may prefer an invest-
ment pattern that is frequently selected for other customers
and covers a substantial portion of their investments. The
problem of mining high occupancy patterns has attracted the
attention of several researchers, such as [1013].
However, the conventional concept of occupancy is typi-
cally applicable only to traditional binary databases. Such
datasets lack critical information about the utility of items,
including quantities and profits associated with items in
transactions, which is often present in quantitative databases
(QDBs) reflecting real-world scenarios, such as customer
transaction records in a supermarket's basket database. The
utility of a pattern is a measure of the benefits it offers, such
as profits earned from item sales or dividends generated by
investments [1416]. Nevertheless, the (absolute) utility of a
pattern in each supporting transaction fails to assess the pro-
portion of its utility contribution to the transaction's overall
utility. To address this limitation and extend the concept of
occupancy to utility, Shen etal. [17] introduced a new meas-
ure known as utility occupancy for QDBs. This measure
quantifies the utility contribution of an itemset within each
supporting transaction. The utility occupancy of an itemset
A
in a transaction
T
containing it, denoted as
occ(A,T)
, rep-
resents the ratio of the utility contributed by
A
to the utility
of
T
(or the relative utility of
A
in
T
). The utility occupancy
occ(A)
of an itemset
A
in a QDB is then the average of the
utility occupancies of
A
in all its supporting transactions in
the QDB. This measure finds practical applications, such
as constructing recommendation systems for QDBs based
on customers' mobile app traffic profiles, or analyzing web
browsing logs [17, 18]. As example, consider a scenario
where a mobile provider, named X-Telecom, offers diverse
apps, including video streaming, social media, gaming, and
productivity tools, and it gathers usage data from users who
opt in. If a user User1 uses Video Streaming for 2h daily or
60h monthly and Social Media for 1h daily or 30h monthly.
X-Telecom treats these usage profiles as transactions, view-
ing each user's app activity over daily or monthly periods as
transactions. Each app is an item, and its utility represents
the usage time measured in hours. Then, the utility occu-
pancy of a set
A
of apps per profile (or transaction) repre-
sents the proportion of the usage time of
A
compared to the
total usage time of all apps within that profile. Therefore, the
utility occupancy of
A
on X-Telecom can be viewed as the
average of relative usage time (or time occupancy of each
user who uses
A
) per user across the platform. Analyzing
this data allows X-Telecom to discern popular apps among
users and identify usage trends, enabling improvements
catering to customer preferences.
Additionally, relying solely on either frequency or util-
ity occupancy separately may not fully address the user's
requirements in numerous real-world scenarios. This is
because a frequent pattern may have low utility occupancy,
and conversely, a high utility occupancy pattern may be
infrequent. Therefore, combining both the concepts of fre-
quency and utility occupancy for patterns is more practi-
cal and valuable in real-world applications [17, 18]. This
approach allows the discovery of valuable patterns that not
only occur frequently but also account for a significant por-
tion of the total utility of supporting transactions. Such pat-
terns enable users to answer questions like "Which important
mobile apps consume most of the customers’ data traffic?"
[17]. These are questions that traditional patterns with either
high utility or high frequency alone cannot address. This dif-
ference arises because the traditional utility measure relies
on the absolute utility of a pattern in a supporting transac-
tion, irrespective of the transaction's total utility, whereas
utility occupancy employs the relative utility of a pattern
within each transaction to reflect its contribution to the trans-
action's utility.
1.1 Motivation
While frequent and high utility occupancy (FHUO) itemsets
(FHUOIs) are valuable in numerous applications, the set of
such itemsets can become exceptionally large, especially in
the case of large and dense quantitative databases (QDBs)
with low minimum support (
ms
) and utility occupancy
(
muo
) thresholds. Consequently, analyzing the discovered
itemsets and effectively using them for decision-making or
gaining insights can be a daunting task.
To tackle this challenge, this paper proposes the min-
ing of two subsets of FHUOIs,
CFHUOI
and
,
which represent all closed FHUOIs and maximal FHUOIs,
respectively. These subsets offer concise and informative
representations of FHUOIs. In other words, discovering
these subsets from the entire pool of FHUOIs has the advan-
tage of reducing the number of presented itemsets to users
while preserving the essential information from all FHUOIs.
Furthermore, these concise representations can be utilized
to generate non-redundant FHUOI-based association rules.
Notably, closed FHUOIs hold particular practical signifi-
cance as they are the largest itemsets with the highest utility
occupancy and are common to specific groups of transac-
tions (e.g., groups of customers). Therefore, closed FHUOIs
can be regarded as the most relevant itemsets within each
equivalence class, composed of itemsets sharing the same
group of transactions. Nevertheless, to the best of our
4014 H.Duong et al.
knowledge, there has been no prior work addressing the
problem of mining the two sets,
CFHUOI
and
MFHUOI
.
Although frequent and high utility occupancy (FHUO)
itemsets (FHUOIs) are useful in many applications, the set
of such itemsets can become extremely large, especially for
large and dense QDBs with low minimum support
ms
and
utility occupancy
muo
thresholds. Consequently, it can be
challenging to analyze the discovered itemsets and use them
effectively for decision-making or gain insights. To address
this issue, this paper proposes mining only two subsets of
FHUOIs,
CFHUOI
and
MFHUOI
of all closed FHUOIs
and maximal FHUOIs, that are concise and informative rep-
resentations of FHUOIs. In other words, discovering these
subsets of all FHUOIs has the benefit of reducing the set
of itemsets that are presented to the user while preserving
the information of all FHUOIs. Additionally, these con-
cise representations could be used to generate non-redun-
dant FHUOI-based association rules. In particular, closed
FHUOIs can have an interesting practical meaning as they
are the largest itemsets that have the highest utility occu-
pancy and are common to some groups of transactions (e.g. a
group of customers). Thus, closed FHUOIs can be viewed as
the most qualified itemsets in each equivalence class, which
consists of itemsets sharing the same group of transactions.
However, to the best of our knowledge, there is so far not
any work for solving the problem of mining the two sets
CFHUOI
and
.
1.2 Challenges
Note that since the cardinality of
CFHUOI
is often much
smaller than that of
FHUOI
, and
is a subset of
CFHUOI
, the main difficulty lies in solving the problem of
mining the
CFHUOI
set, which brings three key challenges.
Firstly, considering that all closed FHUOIs are also
FHUOIs, the first question is how to efficiently prune all
infrequent or low utility occupancy itemsets in the min-
ing process. Eliminating infrequent itemsets can be easily
accomplished by utilizing the downward closure (or anti-
monotonic) property of the support measure. However,
because the utility occupancy (
occ
) is not anti-monotonic,
the removal of low utility occupancy branches from the pre-
fix search tree necessitates the development of upper bounds
on
occ
, such as
Φ
in [17, 19], which satisfies an anti-mono-
tone-like property. In simpler terms, this means that if
Φ(A)
is less than
muo
, then the entire branch
Br(A)
, which con-
sists of all forward extensions
B
of
A
, is pruned as
occ(B)
is
also less than
muo
. However, as the value of
Φ
is often too
large compared to the actual value of
occ
, it limits the prun-
ing effectiveness of
Φ
. Hence, devising a novel weak upper
bound on
occ
that is much tighter than
Φ
becomes crucial
for the earlier pruning of low utility occupancy branches.
Secondly, for the discovery of the
CFHUOI
set, a
post-processing-based and straightforward approach
for mining
CFHUOI
involves initially identifying
the set
FHUOI
of all FHUOIs, and then filtering out
closed FHUOIs
A
according to the definition. In other
words, for each FHUOI
A
in
FHUOI
, the condition
[∄BFHUOI BAsupp(B)=supp(A)]
is checked.
This approach, however, allows only the pruning of one
FHUOI/node
A
at a time. Moreover, since
|FHUOI|
is often
exceedingly large, especially for large and dense datasets
with low
ms
and
muo
thresholds, this approach proves inef-
ficient in terms of time and memory. Consequently, the sec-
ond question is whether we can replace the costly checking
condition “there is no itemset
B
” in the extensive
FHUOI
set with a simpler condition in a much smaller subset of
FHUOI
, such as the gradually constructed
CFHUOI
set.
The response is yes. In more detail,
CFHUOI
is first initial-
ized as an empty set. Then, each newly generated FHUOI is
gradually added to
CFHUOI
if it has no 1-forward and no
super-itemset in
CFHUOI
with the same support. In this
context, an itemset
B
is called a 1-forward of
A
if
B
is an
extension of
A
with an item and shares the same support as
supp(A)
.
Thirdly, in the process of constructing
CFHUOI
, the next
question revolves around the condition under which we can
prune the whole branch
Br(A)
instead of eliminating only
one node
A
, as in the post-processing-based approach. Fortu-
nately, the proposed method allows us to state that, for each
newly generated FHUOI
A
, if there exists a super-itemset
in
CFHUOI
with the same support as
supp(A)
, not only
the itemset/node
A
can be eliminated but the entire branch
Br(A)
can be also removed from the prefix tree. Furthermore,
this paper also proposes another new condition, based solely
on the support of two itemsets and often occurring at two
successive levels of the prefix tree, under which non-closed
FHUO branches can be pruned safely. The two branch prun-
ing techniques significantly enhance the performance of the
MaxCloFHUOIM algorithm to simultaneously mine both
the
CFHUOI
and
sets.
The process of constructing the
CFHUOI
set of all
closed FHUOIs begins by initializing it as an empty set.
Subsequently, each newly generated FHUOI
A
is assessed
against inclusion criteria for
CFHUOI
. If
A
satisfies these
criteria, it becomes a part of the set. Notably, FHUOIs that
meet these conditions and are already present in
CFHUOI
are certainly closed FHUOIs, eliminating the need for fur-
ther closure verification. This incremental addition pro-
cess gradually expands the
CFHUOI
result set with closed
FHUOIs. In other words, the size of
CFHUOI
grows as
the mining process progresses. Conversely, if
A
meets any
of the two presented pruning conditions, the entire branch
Br(A)
cannots contain closed FHUOIs, resulting in early
pruning. Consequently, all redundant candidate patterns
B
4015
Efficient algorithms tomine concise representations offrequent high utility occupancy…
in
Br(A)
are not generated in this scenario. In contrast, for
the post-processing approach, all such redundant candi-
date patterns
B
in
Br(A)
must be generated during the first
phase. These key features define the proposed method:
gradually constructing the
CFHUOI
set of all closed
FHUOIs during the mining process without generating
such candidate patterns
B
.
1.3 Contributions
The main contributions of this paper are fourfold.
(i). The design of a novel weak upper bound named
wubocc
on utility occupancy (
occ
), which is tighter
than the well-known upper bound
Φ
, and used in a
new pruning strategy
SDPS
to remove low utility
occupancy branches from the prefix search tree.
(ii). A method for gradually constructing the set of closed
FHUOIs (CFHUOIs) without generating candi-
dates, which is a basis for designing the following
two efficient strategies to prune non-closed FHUO
branches.
(iii). The development of a novel efficient strategy for
pruning non-closed FHUO branches during back-
ward checking and the introduction of a local prun-
ing strategy to detect and remove non-closed FHUO
branches at two successive levels of the prefix
search tree without checking itemset subset rela-
tions. These theoretical results, which are strictly
proven, lead to the introduction of two novel algo-
rithms, CloFHUOIM and MaxCloFHUOIM. The
CloFHUOIM algorithm is designed to identify only
all closed FHUOIs, while MaxCloFHUOIM simulta-
neously discovers both closed FHUOIs and maximal
FHUOIs.
(iv). Extensive experiments are conducted on real-life
and synthetic QDBs to evaluate the efficiency and
effectiveness of the proposed algorithms in terms of
runtime, memory consumption, and scalability.
The remainder of this paper is organized as follows: Sec-
tion2 provides an overview of the literature on the problem
of mining FHUOIs. Then, Section3 introduces fundamen-
tal concepts and notations related to that task. Afterward,
Section4 presents the new theoretical contributions of this
paper, including a novel weak upper bound on the utility
occupancy, general and local pruning strategies for elimi-
nating a large number of unpromising candidates, and
two novel algorithms, called CloFHUOIM and MaxCloF-
HUOIM. Thereafter, Section5 presents the experimental
evaluation of the proposed algorithms. And lastly, conclu-
sions are drawn, and opportunities for future work are dis-
cussed in Section6.
2 Related work
2.1 Mining frequent andhigh occupancy itemsets
A fundamental problem in data mining is frequent pattern
mining (FPM) in a binary database (BDB) [1, 20]. The
goal is to find all patterns with a support no less than a
minimum threshold selected by the user. The Apriori algo-
rithm [1] is the first algorithm for frequent itemset mining.
Since it performs multiple database scans, it consumes
much time to provide the result. To compress data and
reduce the computational cost of FPM, Han etal. [5] pro-
posed a tree structure named frequent pattern tree (FP-tree)
and the FP-Growth algorithm to mine frequent itemsets
more efficiently with only two database scans. Although
frequent itemset mining is useful for many domains, the
support of a pattern only indicates that it appears often.
Hence, frequent itemsets may represent the common inter-
ests of many users, but the occupancy of each pattern in
the transactions (records) is not taken into account, that is,
the ratio of the number of items in a pattern to that of the
supporting transactions. The occupancy of a pattern in a
database is a concept first proposed by Tang etal. [8], and
defined as the harmonic average value of its occupancies
in all its supporting transactions in the BDB. A pattern is
called qualified if it is frequent and has a high occupancy
(FHO), i.e. its support and occupancy are no less than two
predefined minimum support and occupancy thresholds,
respectively. The goal of the problem of
FHOPM
(fre-
quent high occupancy pattern mining) is to mine all FHO
patterns. Tang etal. proposed an algorithm named DOFIA
for
FHOPM
, which uses upper bounds on the occupancy
to reduce the search space.
FHOPM
has been applied in
various real-life applications, such as recommending print-
areas of webpages or investment portfolios from a transac-
tion database containing sets of financial assets [8]. For the
recommendation of webpage print-areas, it was shown that
FHO patterns are more effective than maximal frequent
patterns [8]. The concept of occupancy was also extended
to mining sequential patterns in sequence databases and
applied to the problem of travel landscape recommenda-
tions by using an algorithm named DOFRA [9].
However, selecting appropriate minimum support and
occupancy thresholds can be a challenging task. If the
thresholds are not selected carefully, it can lead to either
a very large number of patterns or no results at all. To
address this issue, Deng [10] proposed a new occupancy
measure, where the occupancy of a pattern is defined as
the sum (instead of the average) of its occupancies in all
of its supporting transactions [10]. Deng also designed
an algorithm named HEP to mine such high occupancy
itemsets in BDBs, which relies on an anti-monotonic
4016 H.Duong et al.
upper bound named UBO and a new data structure called
Occupancy-Lists. To improve the performance of HEP,
the authors in [13] proposed two algorithms, FHOI and
DFHOI, to mine high occupancy itemsets by applying the
equivalence class concept and new candidate elimination
strategies. This new occupancy measure has been used to
analyse incremental data in intelligent systems [12].
2.2 Mining high utility itemsets andhigh utility
occupancy itemsets
Although support and occupancy measures are commonly
used to mine useful patterns in BDBs, these measures do not
take into account the additional information found in quanti-
tative databases (QDBs) about the utility of items, which is
available in many real-world applications. QDBs are more
general than BDBs as they provide valuable additional infor-
mation about the number of occurrences (or internal utility)
of items in transactions and their relative importance (or
external utility such as unit profit).
To find patterns in QDBs, researchers have studied the
problem of high utility pattern mining (HUPM) in quantita-
tive transaction databases (QTDBs) [21]. One of the main
challenges of HUPM is that the anti-monotonic (or Down-
ward Closure) property, commonly used to efficiently reduce
the search space for pattern mining, does not hold for the
utility function
u
. To overcome this issue, researchers have
proposed upper bounds on
u
, such as the TWU [22], which
is anti-monotonic, and a tighter upper bound that relies on
the remaining utility of a pattern in its supporting transac-
tions with the corresponding HUI-Miner algorithm using a
utility-list-based structure [15]. Many algorithms have been
developed to solve the HUPM problem such as UP-Growth
[14] based on a utility pattern tree (UP-Tree) with only two
database scans, EFIM [23] with two UBs named sub-tree
utility and local utility, HMiner [24] using a compact util-
ity list and virtual hyperlink data structure to store itemset
information, and Hamm [25], a one-phase algorithm based
on a novel data structure named TV (prefix Tree and utility
Vector). An extension of the problem of mining high utility
itemsets is to discover all high utility sequential patterns in
quantitative sequential databases [2630]. Another type of
HUPM is the problem of mining high average utility patterns
in QDBs [3137].
Although HUPM has been applied in many applications,
the traditional utility
u
of an itemset
A
in a supporting trans-
action
T
(or absolute utility of
A
in
T
[38]) does not reflect
the utility occupancy level of
A
in
T
, which emphasizes
the importance of
A
in terms of utility for each customer
T
. This intrinsic requirement is important in several real-
life applications such as recommendation for mobile app
traffic profiles or webpage recommendation based on web
browsing logs [17]. Shen etal. [17] extended the concept of
occupancy to handle binary quantitative databases (BQBs)
by proposing the concept of utility occupancy in QTDBs.
The utility occupancy of an itemset
A
in a supporting trans-
action
T
is defined as the ratio of the utility of
A
in
T
to the
transaction utility of
T
(or the relative utility of
A
in
T
),
which emphasizes the importance of
A
in terms of utility for
each individual customer
T
. The utility occupancy of
A
in a
QTDB, denoted as
occ(A)
, is the arithmetic average of the
utility occupancies of
A
in all transactions
T
containing
A
.
A pattern is called a frequent high utility occupancy itemset
(FHUOI) if its support and utility occupancy are respectively
no less than two predefined minimum support
ms
and util-
ity occupancy
muo
thresholds. The problem
FHUOIM
of
mining all FHUOIs has been used in several applications,
such as for recommendation based on mobile app traffic
profiles, where a mobile app is viewed as an item and its
traffic usage as its utility. Discovered FHUOIs allow the user
to answer questions such as "Which important mobile apps
consume most of the customers’ data traffic?" [17]. These
are questions that traditional patterns with either high utility
or high frequency alone cannot address. Another application
is webpage recommendation based on web browsing logs,
where each website and its browsing time can be viewed as
an item and its utility, respectively. To find patterns using
the utility occupancy, an important challenge is that the
occ
measure is not anti-monotonic (as the utility). There-
fore, to reduce the search space for mining all FHUOIs, the
authors in [17] proposed an upper bound (UB) on
occ
and
the OCEAN algorithm for
FHUOIM
. To further improve
the performance of algorithms for
FHUOIM
, Gan etal.
[19] proposed an algorithm named HUOPM for mining the
complete set of FHUOIs, which efficiently prunes infrequent
or low utility occupancy itemsets based on the anti-mono-
tonicity of the support and anti-monotonicity w.r.t. forward
extensions of an upper bound named
Φ
on
occ
. Recently, He
etal. [18] proposed a novel algorithm named SHO to effi-
ciently mine all FHUOIs in massive data using an approach
of suffix-based partitioning. The task of
FHUOIM
has been
modified and extended in several ways, such as to mine all
potential FHUOIs in uncertain QTDBs [39], or all frequent
high utility occupancy sequential patterns in quantitative
sequence databases [40].
2.3 Concise representations forhigh utility itemsets
High utility itemsets (HUIs) are commonly used in various
real-life applications, but the cardinality of the set of HUIs
can be very large, especially for large and dense QDBs or for
low minimum utility threshold values. For this reason, it can
be challenging and time-consuming to analyze HUIs and uti-
lize them effectively. To address this issue, researchers have
proposed algorithms that discover compact representations
of HUIs that often have a much smaller cardinality and can