Database Systems: The Complete Book

1080

CHAPTER

20.

IXFORlI.4TION IXTEGRATIO-\-

it appears. Figure 20.17 suggests the process of adding a border to the cube

in each dimension, to represent the

ulue and the aggregated values that

it implies. In this figure u;e see three din~ensions. with the lightest shading

representing aggregates in one dimension, darker shading for aggregates over

two dimensions, and tlle darkest cube in the corner for aggregation over all

three dimensions. Notice that if the number of values along each dirnension is

reasonably large, but not so large that most poil~ts in tlle cube are unoccupied.

then the "border" represents only a small addition to the volume of the cube

(i.e., the number of tuples in the fact table).

that case, the size of the stored

data

CCBE(F)

is not much greater than tlic size of

itself.

Figure 20.17: The cube operator augments a data cube with

border of aggre-

gations in all combinations of ctinien.ions

tuple of the table

CLBE;(F)

that has

in one or more dimensions TI-ill

have for each dependent attribute the sum (or another aggregate f~~nction) of

the values of that attribute in all the tuples that xve can obtain by replacing

the *'s by real values.

effect.

build into the data the result of aggregating

along any set of dimensions. Sotice. holvever. that the

CUBE

operator does

not support <\ggregation at intermediate levels of granularity based on values in

the dirnension tables For instance. ne may either leave data broken dovi-11 by

day (or whatever the finest granularity for time is). or xve may aggregate time

completely, but \re cannot, with thc

CCBE

operator alone, aggregate by weeks.

months. or years.

Example

20.17

Let us reconsider the -1ardvark database from Esarnple 20.12

in the light of ~vhat the

Ct-BE

oprr;i~or

can

givc us. Recall the fact table from

that exiumplc, is

Sales(serialN0, date, dealer, price)

Hoxvever, the dimension represented by

serialNo

is not well suited for the cube.

since the serial number is a key for

Sales.

Thus. sumning the price over all

dates, or over all dealers, but keeping the serial ~lumbrr fixed has

110

effect:

n-e

n-ould still gct the "sum" for the one auto ~vith that serial number.

Illole

useful data cube would replace the serial number by the txo attributes

model

and color

to which the serial number connects

Sales

via the dimension table

Autos.

Sotice that if we replace

serialNo

model

and

color,

then tile cube

no longer has a key among its dimensions. Thus, an entry of the cube ~vould

hare the total sales price for all automobiles of a given model. with a given

color, by a given dealer, on a given date.

There is another change that is useful for the data-cube implementation

of the

Sales

fact table. Since the

CUBE

operator normally sums dependent

variables, and

13-e

might want to get average prices for sales in some category,

n-e need both the sum of the prices for each category of automobiles (a given

model of a given color sold on a given day by a given dealer) and the total

number of sales in that category. Thus, the relation

Sales

to which we apply

the

CCBE

operator is

Sales(mode1, color, date, dealer, val, cnt)

The attribute

val

is intended to be the total price of all automobiles for the

given model, color. date. and dealer, while

cnt

is the total number of automo-

biles in that category. Xotice that in this data cube. individual cars are not

identified: they only affect the value and count for their category.

Son-. let us consider the relation

cC~~(Sa1es).

.-I

hypothetical tuple that

n-ould be in both

Sales

and

ti lo sales).

('Gobi', 'red', '2001-05-21', 'Friendly Fred', 45000, 2)

The interpretation is that

May 21; 2001. dealer Friendly Fled sold two red

Gobis for a total of $45.000. The tuple

('Gobi',

'2001-05-21', 'Friendly Fred', 152000,

says that on SIay 21, 2001. Friendly Fred sold seven Gobis of all colors, for

a total price of S152.000. Sote that this tuple is in

sales)

but not in

Sales.

Relation

sales)

also contains tuples that represent the aggregation

over more than one attribute. For instance.

('Gobi',

'2001-05-21',

2348000, 100)

says rliat on \la!- 21. 2001. rllei-e n-ere 100 Gobis sold

all the dealers. and

the total price of tliose Gobis Tvas S2.348.000.

('Gobi',

1339800000, 58000)

Says that over all time, dealers. and colors. 58.000 Gobis have been sold for a

total price of S1.339.800.000. Lastly. the tuple

Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.

tells us that total sales of all Aardvark lnodels in all colors, over all time at all

dealers is 198.000 cars for

total price of $3,521,727,000.

Consider how to answer

query in \\-hich we specify conditions on certain

attributes of the Sales relation and group by some other attributes, n-hile

asking for the sum, count, or average price. In the relation

are r sales),

look for those tuples

with the fo1lov;ing properties:

1. If the query specifies

value

for attribute

then tuple

has

in its

component for

2. If the query groups by an attribute

then

has any non-* value in its

conlponent for

If the query neither groups by attribute

nor specifies a value for

then

has

in its component for

Each tuple

has tlie sum and count for one of the desired groups. If n-e \%-ant

the average price, a division is performed on the sum and count conlponents of

each tuple

Example

20.18

The query

SELECT color, AVG(price)

FROM Sales

WHERE model

'Gobi'

GROUP

color;

is ansn-ered by looking for all tuples of

sales)

~vith the form

('Gobi',

21,

here

is any specific color. In this tuple,

will be the sum of sales of Gobis

in that color, while

will be the nlini!)cr of sales of Gobis in that color. Tlie

average price. although not an attribute of Sales or

sales)

directly. is

v/n.

Tlie answer to the query is the set of

(c,

vln)

pairs obtained fi-om all

('Gobi'.

tuples.

20.5.2

Cube ImplementaOion

Materialized Views

11%

suggested in Fig. 20.17 that adding

aggregations

to the cube doesn't cost

much in tcrms of space. and saves a lot in time \vhen the common kincis of

decision-support queries are asked. Ho~vever: our analysis is based on the as-

sumption that queries choose either to aggregate completely in a dimension

or not to aggregate at all. For some dime~isions. there are many degrees of

granularity that could be chosen for a grouping on that dimension.

Uc have already mentioned thc case of time. xvl-here numerolls options such

as aggregation by weeks, months: quarters, or ycars exist,, in addition to the

all-or-nothing choices of grouping by day or aggregating over all time. For

another esanlple based on our running automobile database, Ive could choose

to aggregate dealers completely or not aggregate them at all. Hon-ever, we could

also choose to aggregate by city, by state, or perhaps by other regions, larger

or smaller. Thus: there are at least sis choices of grouping for time and at least

four for dealers.

l\Tllen the number of choices for grouping along each dimension grows, it

becomes increasingly expensive to store the results of aggregating by every

possible conlbination of groupings. Sot only are there too many of them, but

they are not as easily organized as the structure of Fig. 20.17 suggests for tlle

all-or-nothing case. Thus, commercial data-cube systems may help the user to

choose some

n~aterialized

views

of the data cube.

materialized view is the

result of some query, which we choose to store in the database, rather than

reconstructing (parts of) it as needed in response to queries. For the data cube,

the vie~vs we n-ould choose to materialize xi11 typically be aggregations of the

full data cube.

The coarser the partition implied by the grouping, the less space the mate-

rialized view takes. On the other hand, if ire ~vant to use a view to answer a

certain query, then the view must not partition any dimension more coarsely

than the query does. Thus, to maximize the utility of materialized views, we

generally n-ant some large \-iers that group dimensions into a fairly fine parti-

tion. In addition, the choice of vien-s to materialize is heavily influenced by the

kinds of queries that the analysts are likely

ask.

.in

example will suggest tlie

tradeoffs in\-011-ed.

INSERT INTO SalesVl

SELECT model, color, month, city,

SUM(va1) AS val, SUM(cnt) AS cnt

FROM Sales JOIN Dealers

dealer

name

GROUP

model, color, month, city;

Figure 20.18: The materialized vien. SalesVl

Example

20.19

Let us return to the data cube

Sales (model, color, date, dealer, val

cnt)

that ne de\-eloped in Esample 20.17. One possible materialized vie\\- groups

dates by nionth and dealers by city. This view. 1%-hich

1%-e

call SalesV1, is

constlucted

the query in Fig. 20.18. This query is not strict

SQL.

since n-e

imagine that dates and their grouping units such as months are understood

by the data-cube system n-ithout being told to join Sales with the imaginary

relation rep~esenting dajs that \ve discussed in Example 20.14.

Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.

CHAPTER

20.

IiYFORI\IATIOAr IArTEGR.4TION

20.5.

DdT.4 CUBES

1055

INSERT INTO SalesV2

SELECT model, week, state,

SUM(va1) AS val, SUM(cnt) AS cnt

FROM Sales JOIN Dealers

dealer

name

GROUP

model, week, state;

Figure 20.19: Another materialized view,

SalesV2

Another possible materialized view aggregates colors completely, aggregates

time into u-eeks, and dealers by states. This view,

SalesV2,

is defined by the

query in Fig. 20.19. Either view

SalesVl

SalesV2

can be used to ansn-er a

query that partitions no more finely than either in any dimension. Thus, the

query

41:

SELECT model, SUM(va1)

FROM Sales

GROUP

model;

can be answered either by

SELECT model, SUM(va1)

FROM SalesVl

GROUP

model;

SELECT model, SUM(va1)

FROM SalesV2

GROUP BY model;

On the other hand, the query

42: SELECT model, year, state, SUM(va1)

FROM Sales JOIN Dealers

dealer

name

GROUP

model, year, state;

can on1 be ans\vered from

SalesV1.

SELECT model, year, state, SUM(va1)

FROM SalesVl

GROUP

model, year, state;

Incidentally. the query inmediately above. like the qu'rics that nggregate time

units, is not strict

SQL.

That is.

state

is not ari attribute of

SalesVl:

only

city

is. \Ye rmust assume that the data-cube systenl knol\-s how to perform the

aggregation of cities into states, probably by accessing the dimension table for

dealers.

\Ye

cannot answer Q2 from

SalesV2.

Although we could roll-up cities into

states (i.e.. aggregate the cities into their states) to use

SalesV1,

carrrlot

roll-up ~veeks into years, since years are not evenly divided into weeks. and

data from a week beginning. say, Dec.

29,

2001. contributes to years 2001 and

2002 in a way we carinot tell from the data aggregated by weeks.

Finally, a query like

43:

SELECT model, color, date, ~~~(val)

FROM Sales

GROUP BY model, color, date;

can be anslvered from neither

SalesVl

nor

SalesV2.

It cannot be answered

from

Salesvl

because its partition of days by ~nonths is too coarse to recover

sales by day, and it cannot be ans~vered from

SalesV2

because that view does

not group by color. We would have to answer this query directly from the full

data cube.

20.5.3

The Lattice

Views

To formalize the cbservations of Example 20.10. it he!ps to think of a lattice of

possibl~ groupings for each dimension of the cube. The points of the lattice are

the ways that we can partition the ~alucs of a dimension

grouping according

to one or more attributes of its dimension table.

say that partition

belo~v partition

P2.

written

if and only if each group of

is contained

within some group of

PZ.

All

Years

Quarters

Weeks Months

Days

Figure 20.20:

lattice of partitions for time inter\-als

Example

20.20:

For the lattice of time partitions n-e might choose the dia-

gram of Fig. 20.20.

-4

path from some node

dotvn to

means that

These are not the only possible units of time, but they

\\-ill

serve as an example

Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.

of what units a s~stern might support. Sotice that daks lie below both \reeks

and months, but weeks do not lie below months. The reason is that while a

group of events that took place in one day surely took place within one \reek

and within one month. it is not true that a group of events taking place in one

week necessarily took place in any one month. Similarly, a week's group need

not be contained within the group cor~esponding to one quarter or to one year.

tlie top is a partition we call "all," meaning that events are grouped into a

single group; i.e.. we niake no distinctions among diffeient times.

All

State

City

Dealer

Figure 20.21:

lattice of partitions for automobile dealers

Figure 20.21 shows another lattice, this time for the dealer dimension of our

automobiles example. This lattice is siniplcr: it shows that partitioning sales by

dealer gives a finer partition than partitioning by the city of the dealer. i<-hich is

in turn finer than partitioning by tlie state of tlie dealer. The top of tlle ldrtice

is the partition that places all dealers in one group.

Having a lattice for each dimension,

15-12

can now define a lattice for all the

possible materialized views of a data cube that can be formed by grouping

according to some partition in each dimension. If

and

are two views

formed by choosing a partition (grouping) for each dimension, then

means that in each dimension, the partition

that ~ve use in

is at least as

fine as the partition

that n.e use for that dimension in

Ti;

that is.

Man) OLAP queries can also be placed in the lattice of views

fact. fie-

quently an OLAP query has the same form as the views we have described: the

query specifies some pa~titioning (possibly none or all) for each of the dimen-

sions. Other OL.iP queiics involve tliis same soit of grouping, and then "slice

tlie cube to focus

011

a subset of the data. as nas suggested

the diag~ani in

Fig. 20.15. The general rule is.

I\c can ansn-er a quciy

using view

if and o~ily if

Example 20.21

Figure 20.22 takes the vielvs and queries of Example 20.19

and places them in a lattice. Sotice that the Sales data cube itself is technically

a view. corresponding to tlie finest possible partition along each climensio~l. As

we observed in the original example,

can be ans~vered from either SalesVl or

Sales

Figure 20.22: The lattice of views and queries from Example 20.19

SalesV2; of course it could also be answered froni the full data cube Sales, but

there is no reason to want to do so if one of the other views is materialized.

can be answered from either SalesVl or Sales, while

can only be answered

from Sales. Each of these relationships is expressed in Fig. 20.22 by the paths

downxard from the queries to their supporting vie~vs.

Placing queries in the lattice of views helps design data-cube databases.

Some recently developed design tools for data-cube systems start with a set of

queries that they regard as ..typical" of the application at hand. They then

select

set of views to materialize so that each of these queries is above at least

one of the riel\-s, preferably identical to it or very close (i.e., the query and the

view use the same grouping in most of the dimensions).

20.5.4

Exercises

for

Section

20.5

Exercise 20.5.1

IVhat is the ratio of the size of CCBE(F) to the size of

fact table

has the follorving characteristics?

has ten dimension attributes, each with ten different values.

has ten dimension attributes. each with two differcnt values.

Exercise 20.5.2:

Let us use the cube ~nBE(Sa1es) from Example 20.17,

~vhich was built from the relation

Sales (model, color, date, dealer, val,

cnt)

Tcll I\-hat tuples of the cube n-e 15-ould use to answer tlle follon-ing queries:

a) Find the total sales of I~lue cars for each dealer.

b) Find the total nurnber of green Gobis sold by dealer .'Smilin' Sally."

c) Find the average number of Gobis sold on each day of March, 2002 by

each dealer.

Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.

1088

CHAPTER

20.

ISFORJlATIOS IXTEGRA4TIOS

Exercise

20.5.3:

In Exercise 20.4.1 lve spoke of PC-order data organized as

a cube. If we are to apply the CCBE operator, we might find it convenient to

break several dimensions more finely. For example, instead of one processor

dimension, we might have one dimension for the type (e.g., AlID Duron or

Pentium-IV), and another d~mension for the speed. Suggest a set of dimrnsions

and dependent attributes that will allow us to obtain answers to a variety of

useful aggregation queries. In particular, what role does the customer play?

.Also, the price in Exercise 20.4.1 referred to the price of one macll~ne, while

several identical machines could be ordered in a single tuple. What should the

dependent attribute(s) be?

Exercise

20.5.4

What tuples of the cube from Exercise 20.5.3 would you use

to answer the following queries?

a) Find, for each processor speed, the total number of computers ordered in

each month of the year 2002.

b) List for each type of hard disk (e.g., SCSI or IDE) and eacli processor

type the number of computers ordered.

c) Find the average price of computers with 1500 megahertz processors for

each month from Jan., 2001.

Exercise

20.5.5

The computers described in the cube of Exercise 20.5.3 do

not include monitors. IVhat dimensions would you suggest to represent moni-

tors? You may assume that the price of the monitor is included in the price of

the computer.

Exercise

20.5.6

Suppose that a cube has 10 dimensions. and eacli dimension

has

options for granularity of aggregation. including "no aggregation" and

"aggregate fully.'' How many different views can we construct by clioosing a

granularity in each dinlension?

Exercise

20.5.7

Show how to add the following time units to the lattice of

Fig. 20.20: hours, minutes, seconds, fortnights (two-week periods). decades.

and centuries.

Exercise

20.5.8:

How 15-onld you change the dealer lattice of Fig. 20.21 to

include -regions." ~f:

region is a set of states.

b) Regions are not com~liensurate with states. but each city is in only one

region.

c) Regions are like area codes: each region is contained \vithin a state. some

cities are in two or more regions. and some regions ha~e several cities.

20.6.

DATA

-111-YIA-G

1089

Exercise

20.5.9:

In Exercise 20.5.3 ne designed a cube suitable for use ~vith

the CCBE operator.

Horn-ever.

some of the dimensions could also be given a non-

trivial lattice structure. In particular, the processor type could be organized by

manufacturer (e

g., SUT, Intel. .AND. llotorola). series (e.g..

SUN

Ult~aSparc.

Intel Pentium or Celeron. AlID rlthlon, or llotorola G-series), and model (e.g.,

Pentiuni-I\- or G4).

a) Design tlie lattice of processor types following the examples described

above.

b) Define a view that groups processors by series, hard disks by type, and

removable disks by speed, aggregating everything else.

c) Define a view that groups processors by manufacturer, hard disks by

speed. and aggregates everything else except memory size.

d) Give esamples of qneries that can be ansn-ered from the view of (11) only,

the vieiv of (c) only, both, and neither.

*!!

Exercise

20.5.10:

If the fact table

to n-hicli n-e apply the

CuBE

operator is

sparse (i.e.. there are inany fen-er tuples in

than the product of the number

of possihle values along each dimension), then tlie ratio of the sizes of CCBE(F)

and

can be very large. Hon large can it be?

20.6

Data

Mining

family of database applications cal!ed

data

rnin,ing

knowledge discovery

dntnbases

has captured considerable interest because of opportunities to learn

surprising facts fro111 esisting databases. Data-mining queries can be thought

of as an estended form of decision-support querx, although the distinction is in-

formal (see the box on -Data-llining Queries and Decision-Support Queries").

Data nli11i11:. stresses both the cpcry-optimization and data-management com-

ponents of a traditional database system, as 1%-ell as suggesting some important

estensions to database languages, such as language primitix-es that support effi-

cient sampling of data. In this section, we shall esamine the principal directions

data-mining applications have taken. Me then focus on tlie problem called "fre-

quc'iit iteinsets." n-hich has 1-eceiwd the most attention from the database point

of view.

20.6.1

Data-Iblining Applications

Broadly. data-mining queries ask for a useful summary of data, often ~vithout

suggcstir~g the values of para~netcrs that would best yield such a summary.

This family of problems thus requires rethinking the nay database systems are

to be used to provide snch insights abo~it the data. Below are some of tlie

applications and problems that are being addressed using very large amounts

Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.

Database Systems: The Complete Book- P12

Chủ đề:

Tài liệu liên quan

Tài liêu mới

AI tóm tắt

Giới thiệu tài liệu

Đối tượng sử dụng

Từ khoá chính

Nội dung tóm tắt

Hỗ trợ

Phương thức thanh toán

Theo dõi chúng tôi