
tells us that total sales of all Aardvark lnodels in all colors, over all time at all
dealers is 198.000 cars for
a
total price of $3,521,727,000.
Consider how to answer
a
query in \\-hich we specify conditions on certain
attributes of the Sales relation and group by some other attributes, n-hile
asking for the sum, count, or average price. In the relation
are r sales),
we
look for those tuples
t
with the fo1lov;ing properties:
1. If the query specifies
a
value
v
for attribute
a;
then tuple
t
has
v
in its
component for
a.
2. If the query groups by an attribute
a,
then
t
has any non-* value in its
conlponent for
a.
3.
If the query neither groups by attribute
a
nor specifies a value for
a.
then
t
has
*
in its component for
a.
Each tuple
t
has tlie sum and count for one of the desired groups. If n-e \%-ant
the average price, a division is performed on the sum and count conlponents of
each tuple
t.
Example
20.18
:
The query
SELECT color, AVG(price)
FROM Sales
WHERE model
=
'Gobi'
GROUP
BY
color;
is ansn-ered by looking for all tuples of
sales)
~vith the form
('Gobi',
C.
*,
*,
21,
n)
here
c
is any specific color. In this tuple,
v
will be the sum of sales of Gobis
in that color, while
n
will be the nlini!)cr of sales of Gobis in that color. Tlie
average price. although not an attribute of Sales or
sales)
directly. is
v/n.
Tlie answer to the query is the set of
(c,
vln)
pairs obtained fi-om all
('Gobi'.
c,
*,
*.
v.
n)
tuples.
20.5.2
Cube ImplementaOion
by
Materialized Views
11%
suggested in Fig. 20.17 that adding
aggregations
to the cube doesn't cost
much in tcrms of space. and saves a lot in time \vhen the common kincis of
decision-support queries are asked. Ho~vever: our analysis is based on the as-
sumption that queries choose either to aggregate completely in a dimension
or not to aggregate at all. For some dime~isions. there are many degrees of
granularity that could be chosen for a grouping on that dimension.
Uc have already mentioned thc case of time. xvl-here numerolls options such
as aggregation by weeks, months: quarters, or ycars exist,, in addition to the
all-or-nothing choices of grouping by day or aggregating over all time. For
another esanlple based on our running automobile database, Ive could choose
to aggregate dealers completely or not aggregate them at all. Hon-ever, we could
also choose to aggregate by city, by state, or perhaps by other regions, larger
or smaller. Thus: there are at least sis choices of grouping for time and at least
four for dealers.
l\Tllen the number of choices for grouping along each dimension grows, it
becomes increasingly expensive to store the results of aggregating by every
possible conlbination of groupings. Sot only are there too many of them, but
they are not as easily organized as the structure of Fig. 20.17 suggests for tlle
all-or-nothing case. Thus, commercial data-cube systems may help the user to
choose some
n~aterialized
views
of the data cube.
A
materialized view is the
result of some query, which we choose to store in the database, rather than
reconstructing (parts of) it as needed in response to queries. For the data cube,
the vie~vs we n-ould choose to materialize xi11 typically be aggregations of the
full data cube.
The coarser the partition implied by the grouping, the less space the mate-
rialized view takes. On the other hand, if ire ~vant to use a view to answer a
certain query, then the view must not partition any dimension more coarsely
than the query does. Thus, to maximize the utility of materialized views, we
generally n-ant some large \-iers that group dimensions into a fairly fine parti-
tion. In addition, the choice of vien-s to materialize is heavily influenced by the
kinds of queries that the analysts are likely
to
ask.
.in
example will suggest tlie
tradeoffs in\-011-ed.
INSERT INTO SalesVl
SELECT model, color, month, city,
SUM(va1) AS val, SUM(cnt) AS cnt
FROM Sales JOIN Dealers
ON
dealer
=
name
GROUP
BY
model, color, month, city;
Figure 20.18: The materialized vien. SalesVl
Example
20.19
:
Let us return to the data cube
Sales (model, color, date, dealer, val
,
cnt)
that ne de\-eloped in Esample 20.17. One possible materialized vie\\- groups
dates by nionth and dealers by city. This view. 1%-hich
1%-e
call SalesV1, is
constlucted
by
the query in Fig. 20.18. This query is not strict
SQL.
since n-e
imagine that dates and their grouping units such as months are understood
by the data-cube system n-ithout being told to join Sales with the imaginary
relation rep~esenting dajs that \ve discussed in Example 20.14.

CHAPTER
20.
IiYFORI\IATIOAr IArTEGR.4TION
20.5.
DdT.4 CUBES
1055
INSERT INTO SalesV2
SELECT model, week, state,
SUM(va1) AS val, SUM(cnt) AS cnt
FROM Sales JOIN Dealers
ON
dealer
=
name
GROUP
BY
model, week, state;
Figure 20.19: Another materialized view,
SalesV2
Another possible materialized view aggregates colors completely, aggregates
time into u-eeks, and dealers by states. This view,
SalesV2,
is defined by the
query in Fig. 20.19. Either view
SalesVl
or
SalesV2
can be used to ansn-er a
query that partitions no more finely than either in any dimension. Thus, the
query
41:
SELECT model, SUM(va1)
FROM Sales
GROUP
BY
model;
can be answered either by
SELECT model, SUM(va1)
FROM SalesVl
GROUP
BY
model;
SELECT model, SUM(va1)
FROM SalesV2
GROUP BY model;
On the other hand, the query
42: SELECT model, year, state, SUM(va1)
FROM Sales JOIN Dealers
ON
dealer
=
name
GROUP
BY
model, year, state;
can on1 be ans\vered from
SalesV1.
as
SELECT model, year, state, SUM(va1)
FROM SalesVl
GROUP
BY
model, year, state;
Incidentally. the query inmediately above. like the qu'rics that nggregate time
units, is not strict
SQL.
That is.
state
is not ari attribute of
SalesVl:
only
city
is. \Ye rmust assume that the data-cube systenl knol\-s how to perform the
aggregation of cities into states, probably by accessing the dimension table for
dealers.
\Ye
cannot answer Q2 from
SalesV2.
Although we could roll-up cities into
states (i.e.. aggregate the cities into their states) to use
SalesV1,
we
carrrlot
roll-up ~veeks into years, since years are not evenly divided into weeks. and
data from a week beginning. say, Dec.
29,
2001. contributes to years 2001 and
2002 in a way we carinot tell from the data aggregated by weeks.
Finally, a query like
43:
SELECT model, color, date, ~~~(val)
FROM Sales
GROUP BY model, color, date;
can be anslvered from neither
SalesVl
nor
SalesV2.
It cannot be answered
from
Salesvl
because its partition of days by ~nonths is too coarse to recover
sales by day, and it cannot be ans~vered from
SalesV2
because that view does
not group by color. We would have to answer this query directly from the full
data cube.
20.5.3
The Lattice
of
Views
To formalize the cbservations of Example 20.10. it he!ps to think of a lattice of
possibl~ groupings for each dimension of the cube. The points of the lattice are
the ways that we can partition the ~alucs of a dimension
by
grouping according
to one or more attributes of its dimension table.
nB
say that partition
PI
is
belo~v partition
P2.
written
PI
5
P2
if and only if each group of
Pl
is contained
within some group of
PZ.
All
Years
/
1
I
Quarters
I
Weeks Months
Days
Figure 20.20:
A
lattice of partitions for time inter\-als
Example
20.20:
For the lattice of time partitions n-e might choose the dia-
gram of Fig. 20.20.
-4
path from some node
fi
dotvn to
PI
means that
PI
5
4.
These are not the only possible units of time, but they
\\-ill
serve as an example

of what units a s~stern might support. Sotice that daks lie below both \reeks
and months, but weeks do not lie below months. The reason is that while a
group of events that took place in one day surely took place within one \reek
and within one month. it is not true that a group of events taking place in one
week necessarily took place in any one month. Similarly, a week's group need
not be contained within the group cor~esponding to one quarter or to one year.
At
tlie top is a partition we call "all," meaning that events are grouped into a
single group; i.e.. we niake no distinctions among diffeient times.
All
I
State
I
City
I
Dealer
Figure 20.21:
A
lattice of partitions for automobile dealers
Figure 20.21 shows another lattice, this time for the dealer dimension of our
automobiles example. This lattice is siniplcr: it shows that partitioning sales by
dealer gives a finer partition than partitioning by the city of the dealer. i<-hich is
in turn finer than partitioning by tlie state of tlie dealer. The top of tlle ldrtice
is the partition that places all dealers in one group.
Having a lattice for each dimension,
15-12
can now define a lattice for all the
possible materialized views of a data cube that can be formed by grouping
according to some partition in each dimension. If
15
and
1%
are two views
formed by choosing a partition (grouping) for each dimension, then
1;
5
11
means that in each dimension, the partition
Pl
that ~ve use in
1;
is at least as
fine as the partition
Pl
that n.e use for that dimension in
Ti;
that is.
Pl
5
P?
Man) OLAP queries can also be placed in the lattice of views
In
fact. fie-
quently an OLAP query has the same form as the views we have described: the
query specifies some pa~titioning (possibly none or all) for each of the dimen-
sions. Other OL.iP queiics involve tliis same soit of grouping, and then "slice
tlie cube to focus
011
a subset of the data. as nas suggested
by
the diag~ani in
Fig. 20.15. The general rule is.
I\c can ansn-er a quciy
Q
using view
1-
if and o~ily if
1-
5
Q.
Example 20.21
:
Figure 20.22 takes the vielvs and queries of Example 20.19
and places them in a lattice. Sotice that the Sales data cube itself is technically
a view. corresponding to tlie finest possible partition along each climensio~l. As
we observed in the original example,
QI
can be ans~vered from either SalesVl or
Sales
Figure 20.22: The lattice of views and queries from Example 20.19
SalesV2; of course it could also be answered froni the full data cube Sales, but
there is no reason to want to do so if one of the other views is materialized.
Q2
can be answered from either SalesVl or Sales, while
Q3
can only be answered
from Sales. Each of these relationships is expressed in Fig. 20.22 by the paths
downxard from the queries to their supporting vie~vs.
Placing queries in the lattice of views helps design data-cube databases.
Some recently developed design tools for data-cube systems start with a set of
queries that they regard as ..typical" of the application at hand. They then
select
a
set of views to materialize so that each of these queries is above at least
one of the riel\-s, preferably identical to it or very close (i.e., the query and the
view use the same grouping in most of the dimensions).
20.5.4
Exercises
for
Section
20.5
Exercise 20.5.1
:
IVhat is the ratio of the size of CCBE(F) to the size of
F
if
fact table
F
has the follorving characteristics?
*
a)
F
has ten dimension attributes, each with ten different values.
b)
F
has ten dimension attributes. each with two differcnt values.
Exercise 20.5.2:
Let us use the cube ~nBE(Sa1es) from Example 20.17,
~vhich was built from the relation
Sales (model, color, date, dealer, val,
cnt)
Tcll I\-hat tuples of the cube n-e 15-ould use to answer tlle follon-ing queries:
*
a) Find the total sales of I~lue cars for each dealer.
b) Find the total nurnber of green Gobis sold by dealer .'Smilin' Sally."
c) Find the average number of Gobis sold on each day of March, 2002 by
each dealer.

1088
CHAPTER
20.
ISFORJlATIOS IXTEGRA4TIOS
*!
Exercise
20.5.3:
In Exercise 20.4.1 lve spoke of PC-order data organized as
a cube. If we are to apply the CCBE operator, we might find it convenient to
break several dimensions more finely. For example, instead of one processor
dimension, we might have one dimension for the type (e.g., AlID Duron or
Pentium-IV), and another d~mension for the speed. Suggest a set of dimrnsions
and dependent attributes that will allow us to obtain answers to a variety of
useful aggregation queries. In particular, what role does the customer play?
.Also, the price in Exercise 20.4.1 referred to the price of one macll~ne, while
several identical machines could be ordered in a single tuple. What should the
dependent attribute(s) be?
Exercise
20.5.4
:
What tuples of the cube from Exercise 20.5.3 would you use
to answer the following queries?
a) Find, for each processor speed, the total number of computers ordered in
each month of the year 2002.
b) List for each type of hard disk (e.g., SCSI or IDE) and eacli processor
type the number of computers ordered.
c) Find the average price of computers with 1500 megahertz processors for
each month from Jan., 2001.
!
Exercise
20.5.5
:
The computers described in the cube of Exercise 20.5.3 do
not include monitors. IVhat dimensions would you suggest to represent moni-
tors? You may assume that the price of the monitor is included in the price of
the computer.
Exercise
20.5.6
:
Suppose that a cube has 10 dimensions. and eacli dimension
has
5
options for granularity of aggregation. including "no aggregation" and
"aggregate fully.'' How many different views can we construct by clioosing a
granularity in each dinlension?
Exercise
20.5.7
:
Show how to add the following time units to the lattice of
Fig. 20.20: hours, minutes, seconds, fortnights (two-week periods). decades.
and centuries.
Exercise
20.5.8:
How 15-onld you change the dealer lattice of Fig. 20.21 to
include -regions." ~f:
a)
A
region is a set of states.
*
b) Regions are not com~liensurate with states. but each city is in only one
region.
c) Regions are like area codes: each region is contained \vithin a state. some
cities are in two or more regions. and some regions ha~e several cities.
20.6.
DATA
-111-YIA-G
1089
!
Exercise
20.5.9:
In Exercise 20.5.3 ne designed a cube suitable for use ~vith
the CCBE operator.
Horn-ever.
some of the dimensions could also be given a non-
trivial lattice structure. In particular, the processor type could be organized by
manufacturer (e
g., SUT, Intel. .AND. llotorola). series (e.g..
SUN
Ult~aSparc.
Intel Pentium or Celeron. AlID rlthlon, or llotorola G-series), and model (e.g.,
Pentiuni-I\- or G4).
a) Design tlie lattice of processor types following the examples described
above.
b) Define a view that groups processors by series, hard disks by type, and
removable disks by speed, aggregating everything else.
c) Define a view that groups processors by manufacturer, hard disks by
speed. and aggregates everything else except memory size.
d) Give esamples of qneries that can be ansn-ered from the view of (11) only,
the vieiv of (c) only, both, and neither.
*!!
Exercise
20.5.10:
If the fact table
F
to n-hicli n-e apply the
CuBE
operator is
sparse (i.e.. there are inany fen-er tuples in
F
than the product of the number
of possihle values along each dimension), then tlie ratio of the sizes of CCBE(F)
and
F
can be very large. Hon large can it be?
20.6
Data
Mining
A
family of database applications cal!ed
data
rnin,ing
or
knowledge discovery
in
dntnbases
has captured considerable interest because of opportunities to learn
surprising facts fro111 esisting databases. Data-mining queries can be thought
of as an estended form of decision-support querx, although the distinction is in-
formal (see the box on -Data-llining Queries and Decision-Support Queries").
Data nli11i11:. stresses both the cpcry-optimization and data-management com-
ponents of a traditional database system, as 1%-ell as suggesting some important
estensions to database languages, such as language primitix-es that support effi-
cient sampling of data. In this section, we shall esamine the principal directions
data-mining applications have taken. Me then focus on tlie problem called "fre-
quc'iit iteinsets." n-hich has 1-eceiwd the most attention from the database point
of view.
20.6.1
Data-Iblining Applications
Broadly. data-mining queries ask for a useful summary of data, often ~vithout
suggcstir~g the values of para~netcrs that would best yield such a summary.
This family of problems thus requires rethinking the nay database systems are
to be used to provide snch insights abo~it the data. Below are some of tlie
applications and problems that are being addressed using very large amounts


