# Database Systems: The Complete Book- P12

Chia sẻ: Thanh Cong | Ngày: | Loại File: PDF | Số trang:20

0
50
lượt xem
6

## Database Systems: The Complete Book- P12

Mô tả tài liệu

Database Systems: The Complete Book- P12: Database Systems and Database Design and Application courses offered at the junior, senior and graduate levels in Computer Science departments. Written by well-known computer scientists, this introduction to database systems offers a comprehensive approach, focusing on database design, database use, and implementation of database applications and database management systems

Chủ đề:

Bình luận(0)

Lưu

## Nội dung Text: Database Systems: The Complete Book- P12

1. 1080 CHAPTER 20. IXFORlI.4TION IXTEGRATIO-\- it appears. Figure 20.17 suggests the process of adding a border to the cube useful data cube would replace the serial number by the t x o attributes - model in each dimension, to represent the * u l u e and the aggregated values that and color - to which the serial number connects S a l e s via the dimension table it implies. In this figure u;e see three din~ensions.with the lightest shading Autos. Sotice that if we replace s e r i a l N o by model and c o l o r , then tile cube representing aggregates in one dimension, darker shading for aggregates over no longer has a key among its dimensions. Thus, an entry of the cube ~vould two dimensions, and tlle darkest cube in the corner for aggregation over all hare the total sales price for all automobiles of a given model. with a given three dimensions. Notice that if the number of values along each dirnension is color, by a given dealer, on a given date. reasonably large, but not so large that most poil~ts tlle cube are unoccupied. in There is another change that is useful for the data-cube implementation then the "border" represents only a small addition to the volume of the cube of the S a l e s fact table. Since the C U B E operator normally sums dependent (i.e., the number of tuples in the fact table). In that case, the size of the stored variables, and 13-e might want to get average prices for sales in some category, data C C B E ( Fis not much greater than tlic size of F itself. ) n-e need both the sum of the prices for each category of automobiles (a given model of a given color sold on a given day by a given dealer) and the total number of sales in that category. Thus, the relation S a l e s to which we apply the C C B E operator is Sales(mode1, c o l o r , d a t e , d e a l e r , v a l , c n t ) The attribute v a l is intended t o be the total price of all automobiles for the given model, color. date. and dealer, while c n t is the total number of automo- biles in that category. Xotice that in this data cube. individual cars are not identified: they only affect the value and count for their category. Son-. let us consider the relation c C ~ ~ ( S a 1 e s. I. hypothetical tuple that - ) n-ould be in both S a l e s and ti lo sales). is Figure 20.17: The cube operator augments a data cube with a border of aggre- gations in all combinations of ctinien.ions ('Gobi', 'red', '2001-05-21', ' F r i e n d l y F r e d ' , 45000, 2) A tuple of the table CLBE;(F)that has * in one or more dimensions TI-ill The interpretation is that on May 21; 2001. dealer Friendly Fled sold two red have for each dependent attribute the sum (or another aggregate f ~ ~ n c t i o n ) of Gobis for a total of $45.000. The tuple the values of that attribute in all the tuples that xve can obtain by replacing ('Gobi', *, '2001-05-21', ' F r i e n d l y F r e d ' , 152000, 7) the *'s by real values. In effect. we build into the data the result of aggregating along any set of dimensions. Sotice. holvever. that the C U B E operator does says that on SIay 21, 2001. Friendly Fred sold seven Gobis of all colors, for not support 2. tells us that total sales of all Aardvark lnodels in all colors, over all time at all all-or-nothing choices of grouping by day or aggregating over all time. For dealers is 198.000 cars for a total price of$3,521,727,000. another esanlple based on our running automobile database, Ive could choose to aggregate dealers completely or not aggregate them a t all. Hon-ever, we could Consider how to answer a query in \\-hich we specify conditions on certain also choose to aggregate by city, by state, or perhaps by other regions, larger attributes of the S a l e s relation and group by some other attributes, n-hile or smaller. Thus: there are a t least sis choices of grouping for time and at least asking for the sum, count, or average price. In the relation are r sales), we four for dealers. look for those tuples t with the fo1lov;ing properties: l\Tllen the number of choices for grouping along each dimension grows, it becomes increasingly expensive to store the results of aggregating by every 1. If the query specifies a value v for attribute a ; then tuple t has v in its possible conlbination of groupings. S o t only are there too many of them, but component for a. they are not as easily organized as the structure of Fig. 20.17 suggests for tlle 2. If the query groups by an attribute a , then t has any non-* value in its all-or-nothing case. Thus, commercial data-cube systems may help the user t o conlponent for a. choose some n~aterializedviews of the data cube. A materialized view is the result of some query, which we choose t o store in the database, rather than 3. If the query neither groups by attribute a nor specifies a value for a. then reconstructing (parts of) it as needed in response t o queries. For the data cube, t has * in its component for a. the vie~vswe n-ould choose to materialize xi11 typically be aggregations of the full data cube. Each tuple t has tlie sum and count for one of the desired groups. If n-e \%-ant The coarser the partition implied by the grouping, the less space the mate- the average price, a division is performed on the sum and count conlponents of rialized view takes. On the other hand, if ire ~vantto use a view to answer a each tuple t. certain query, then the view must not partition any dimension more coarsely E x a m p l e 20.18 : The query than the query does. Thus, to maximize the utility of materialized views, we generally n-ant some large \-iers that group dimensions into a fairly fine parti- SELECT c o l o r , AVG(price) tion. In addition, the choice of vien-s to materialize is heavily influenced by the F O Sales RM kinds of queries that the analysts are likely to ask. .in example will suggest tlie W E E model = 'Gobi' HR tradeoffs in\-011-ed. GROUP BY c o l o r ; INSERT INTO SalesVl is ansn-ered by looking for all tuples of sales) ~ v i t h form the SELECT model, c o l o r , month, c i t y , ('Gobi', C. *, *, 21, n) SUM(va1) A v a l , SUM(cnt) A c n t S S F O S a l e s JOIN Dealers ON d e a l e r = name RM here c is any specific color. In this tuple, v will be the sum of sales of Gobis GROUP BY model, c o l o r , month, c i t y ; in that color, while n will be the nlini!)cr of sales of Gobis in that color. Tlie average price. although not a n attribute of S a l e s or sales) directly. is v / n . Tlie answer to the query is the set of ( c ,vln) pairs obtained fi-om all Figure 20.18: The materialized vien. SalesVl ('Gobi'. c , *, *. v. n ) tuples. Example 20.19 : Let us return to the data cube 20.5.2 Cube ImplementaOion by Materialized Views 11% suggested in Fig. 20.17 that adding aggregations to the cube doesn't cost S a l e s (model, c o l o r , d a t e , d e a l e r , v a l , c n t ) much in tcrms of space. and saves a lot in time \vhen the common kincis of decision-support queries are asked. Ho~vever:our analysis is based on the as- that n e de\-eloped in Esample 20.17. One possible materialized vie\\- groups sumption that queries choose either t o aggregate completely in a dimension dates by nionth and dealers by city. This view. 1%-hich call SalesV1, is 1%-e or not to aggregate a t all. For some dime~isions.there are many degrees of constlucted by the query in Fig. 20.18. This query is not strict SQL. since n-e granularity that could be chosen for a grouping on that dimension. imagine that dates and their grouping units such as months are understood U c have already mentioned thc case of time. xvl-herenumerolls options such by the data-cube system n-ithout being told to join S a l e s with the imaginary as aggregation by weeks, months: quarters, or ycars exist,, in addition to the relation rep~esenting a j s that \ve discussed in Example 20.14. d Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
3. CHAPTER 20. IiYFORI\IATIOAr IArTEGR.4TION 20.5. DdT.4 CUBES 1055 aggregation of cities into states, probably by accessing the dimension table for INSERT INTO SalesV2 dealers. SELECT model, week, s t a t e , \Ye cannot answer Q2 from SalesV2. Although we could roll-up cities into SUM(va1) A v a l , SUM(cnt) A c n t S S states (i.e.. aggregate the cities into their states) to use SalesV1, we carrrlot F O S a l e s JOIN D e a l e r s ON d e a l e r = name RM roll-up ~veeksinto years, since years are not evenly divided into weeks. and GROUP BY model, week, s t a t e ; data from a week beginning. say, Dec. 29, 2001. contributes to years 2001 and 2002 in a way we carinot tell from the data aggregated by weeks. Figure 20.19: Another materialized view, SalesV2 Finally, a query like 43: SELECT model, c o l o r , d a t e , ~ ~ ~ ( v a l ) Another possible materialized view aggregates colors completely, aggregates F O Sales RM time into u-eeks, and dealers by states. This view, SalesV2, is defined by the GROUP B model, c o l o r , d a t e ; Y query in Fig. 20.19. Either view S a l e s V l or SalesV2 can be used to ansn-er a query that partitions no more finely than either in any dimension. Thus, the can b e anslvered from neither SalesVl nor SalesV2. It cannot be answered query from S a l e s v l because its partition of days by ~nonths too coarse to recover is sales by day, and it cannot be ans~vered from SalesV2 because that view does 41: SELECT model, SUM(va1) not group by color. We would have to answer this query directly from the full FO Sales RM data cube. G O P BY model; RU can be answered either by 20.5.3 The Lattice of Views To formalize the cbservations of Example 20.10. it he!ps to think of a lattice of SELECT model, SUM(va1) possibl~ groupings for each dimension of the cube. The points of the lattice are F O SalesVl RM the ways that we can partition the ~ a l u c of a dimension by grouping according s G O P BY model; RU t o one or more attributes of its dimension table. nB say that partition PI is belo~v partition P2.written PI 5 P2 if and only if each group of Pl is contained within some group of PZ. SELECT model, SUM(va1) All F O SalesV2 RM G O P B model; RU Y On the other hand, the query / Years 1 42: SELECT model, y e a r , s t a t e , SUM(va1) F O S a l e s JOIN D e a l e r s ON d e a l e r = name RM G O P BY model, y e a r , s t a t e ; RU I Weeks Quarters I Months can o n 1 be ans\vered from SalesV1. as Days SELECT model, y e a r , s t a t e , SUM(va1) F O SalesVl RM Figure 20.20: A lattice of partitions for time inter\-als G O P BY model, y e a r , s t a t e ; RU Incidentally. the query inmediately above. like the qu'rics that nggregate time Example 20.20: For the lattice of time partitions n-e might choose the dia- units, is not strict SQL. That is. s t a t e is not ari attribute of SalesVl: only gram of Fig. 20.20. -4 path from some node f i dotvn to PI means that PI 5 4. c i t y is. \Ye rmust assume that the data-cube systenl knol\-s how to perform the These are not the only possible units of time, but they \\-ill serve as an example Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
4. of what units a s ~ s t e r n might support. Sotice that daks lie below both \reeks and months, but weeks do not lie below months. The reason is that while a group of events that took place in one day surely took place within one \reek and within one month. it is not true that a group of events taking place in one week necessarily took place in any one month. Similarly, a week's group need not be contained within the group cor~esponding one quarter or to one year. to Sales At tlie top is a partition we call "all," meaning that events are grouped into a single group; i.e.. we niake no distinctions among diffeient times. Figure 20.22: The lattice of views and queries from Example 20.19 All I SalesV2; of course it could also be answered froni the full data cube S a l e s , but there is no reason to want to do so if one of the other views is materialized. Q2 State can be answered from either SalesVl or S a l e s , while Q 3 can only be answered I from S a l e s . Each of these relationships is expressed in Fig. 20.22 by the paths City downxard from the queries to their supporting vie~vs. I Placing queries in the lattice of views helps design data-cube databases. Dealer Some recently developed design tools for data-cube systems start with a set of Figure 20.21: A lattice of partitions for automobile dealers queries that they regard as ..typical" of the application a t hand. They then select a set of views to materialize so that each of these queries is above a t least Figure 20.21 shows another lattice, this time for the dealer dimension of our one of the riel\-s, preferably identical to it or very close (i.e., the query and the automobiles example. This lattice is siniplcr: it shows that partitioning sales by view use the same grouping in most of the dimensions). dealer gives a finer partition than partitioning by the city of the dealer. i
5. 1088 CHAPTER 20. ISFORJlATIOS IXTEGRA4TIOS 20.6. DATA -111-YIA-G 1089 *! Exercise 20.5.3: In Exercise 20.4.1 lve spoke of PC-order data organized as ! Exercise 20.5.9: In Exercise 20.5.3 n e designed a cube suitable for use ~ v i t h a cube. If we are to apply the CCBE operator, we might find it convenient to the CCBE operator. Horn-ever. some of the dimensions could also b e given a non- break several dimensions more finely. For example, instead of one processor trivial lattice structure. In particular, the processor type could b e organized by dimension, we might have one dimension for the type (e.g., AlID Duron or manufacturer (e g., SUT, Intel. .AND. llotorola). series (e.g.. SUN Ult~aSparc. Pentium-IV), and another d~mension the speed. Suggest a set of dimrnsions for Intel Pentium or Celeron. AlID rlthlon, or llotorola G-series), and model (e.g., and dependent attributes that will allow us to obtain answers to a variety of Pentiuni-I\- or G4). useful aggregation queries. In particular, what role does the customer play? .Also, the price in Exercise 20.4.1 referred to the price of one macll~ne,while a) Design tlie lattice of processor types following the examples described several identical machines could be ordered in a single tuple. What should the above. dependent attribute(s) be? b) Define a view that groups processors by series, hard disks by type, and Exercise 20.5.4 : What tuples of the cube from Exercise 20.5.3 would you use removable disks by speed, aggregating everything else. to answer the following queries? c) Define a view that groups processors by manufacturer, hard disks by a) Find, for each processor speed, the total number of computers ordered in speed. and aggregates everything else except memory size. each month of the year 2002. d) Give esamples of qneries that can be ansn-ered from the view of (11) only, the vieiv of (c) only, both, and neither. b) List for each type of hard disk (e.g., SCSI or IDE) and eacli processor type the number of computers ordered. *!! Exercise 20.5.10: If the fact table F to n-hicli n-e apply the C u B E operator is c) Find the average price of computers with 1500 megahertz processors for sparse (i.e.. there are inany fen-er tuples in F than the product of the number each month from Jan., 2001. of possihle values along each dimension), then tlie ratio of the sizes of CCBE(F) and F can be very large. Hon large can it be? ! Exercise 20.5.5 : The computers described in the cube of Exercise 20.5.3 do not include monitors. IVhat dimensions would you suggest to represent moni- tors? You may assume that the price of the monitor is included in the price of 20.6 Data Mining the computer. A family of database applications cal!ed data rnin,ing or knowledge discovery in dntnbases has captured considerable interest because of opportunities to learn Exercise 20.5.6 : Suppose that a cube has 10 dimensions. and eacli dimension surprising facts fro111esisting databases. Data-mining queries can be thought has 5 options for granularity of aggregation. including "no aggregation" and of as an estended form of decision-support querx, although the distinction is in- "aggregate fully.'' How many different views can we construct by clioosing a formal (see the box on -Data-llining Queries and Decision-Support Queries"). granularity in each dinlension? Data nli11i11:. stresses both the cpcry-optimization and data-management com- Exercise 20.5.7 : Show how t o add the following time units to the lattice of ponents of a traditional database system, as 1%-ell suggesting some important as Fig. 20.20: hours, minutes, seconds, fortnights (two-week periods). decades. estensions to database languages, such as language primitix-es that support effi- and centuries. cient sampling of data. In this section, we shall esamine the principal directions data-mining applications have taken. Me then focus on tlie problem called "fre- Exercise 20.5.8: How 15-onld you change the dealer lattice of Fig. 20.21 to quc'iit iteinsets." n-hich has 1-eceiwd the most attention from the database point include -regions." ~ f : of view. a ) A region is a set of states. 20.6.1 Data-Iblining Applications * b) Regions are not com~liensuratewith states. but each city is in only one Broadly. data-mining queries ask for a useful summary of data, often ~vithout region. suggcstir~gthe values of para~netcrsthat would best yield such a summary. This family of problems thus requires rethinking the n a y database systems are c) Regions are like area codes: each region is contained \vithin a state. some to be used to provide snch insights a b o ~ i tthe data. Below are some of tlie cities are in two or more regions. and some regions h a ~ several cities. e applications and problems that are being addressed using very large amounts Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
6. Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
10. 1098 C H A P T E R 20. IXFORJIIATION IhTTEGR.4TI0.\' 20.8. REFERENCES FOR CH-APTER 20 1099 + ROLAP and AIOLAP: It is frequently useful when building a warehouse [4] is a survey of data-mining techniques, and [13] is a n on-line survey of for OLAP, to think of the data as residing in a multidimensional space. data mining. The a-priori algorithm was del-eloped in [I] and 121. with diniensions corresponding t o independent aspects of the data repre- sented. Systems that support such a vie~v data take either a relational of 1. R. Agranal, T. Imielinski, and A. Sn-ami: '.lIining association rules be- point of view (ROLAP, or relational OLAP systems), or use the special- tween sets of items in large databases," Proc. -ACAi SIGAlOD Intl. Conf. ized data-cube model (lIOL.AP, or multidimensional OLAP systems). on ibfanagement of Data (1993), pp. 203-216. + Star Schernas: In a star schema, each data element (e.g., a sale of an item) 2. R. Agrawal, and R. Srikant, "Fast algorithms for mining association rules," Proc. Intl. Conf. on V e q Large Databa.ses (1994), pp. 487-199. is represented in one relation, called tlie fact table, while inforniation helping to interpret the values along each dimension (e.g.. what kind of 3. S. Chaudhuri and U.Dayal, .'Ail overview of data warehousing and OLAP product is iten1 1234?) is stored in a diinension table for each diinension. technology," SIGAJOD Record 26: 1 (1997), pp. 63-74. + The Cube Operator: A specialized operator called CCBE pre-aggregates 4. U. 52. Fayyad, G. Piatetsky-Shapiro. P. Smyth, and R. Uthurusamy, Ad- the fact table along all subsets of dimensions. It may add little to the space Lances i n Knowledge Discovery and Data hlznzng. AAAI Press, hlenlo needed by the fact table, and greatly increases the speed with which many Park CA, 1996. OLAP queries can be answered. 3. H. Garcia-llolina, Y. Papakonstalltinou. D. Quass. -1. Rajalaman, Y. Sa- + Dzmenszon Lattices and Alaterialized Vzews: A more polverful approach giv. V. \Bssalos. J. D. Ullman, and J . n7idorn) The TSIlIlIIS approach than the CLBE operator, used by some data-cube implementations. is to to mediation: data nlodels and languages. J. Intellzgent Informatzon Sys- establish a lattice of granularities for aggregation along each dimension tems 8:2 (1997), pp. 117-132. (e.g., different time units like days, months, and years). The ~vareliouse is then designed by materializing certain v i e w that aggregate in different 6. J. S. Gray, A. Bosworth, A. Layman. and H. Pirahesh, .'Data cube: a \va!.s along the different dimensions, and the rien- with the closest fit is relational aggregation operator generalizing group-by. cross-tab, and sub- used t o answer a given query. totals." Proc. Intl. Conf. on Data Englneerzng (1996). pp. 132-139. + Data Mining: IVareliouses are also used to ask broad questions that in- 7. -1.Gupta and I. S. SIumick. A.laterioltzed Vieccs: Technzques, Implemcn- tatzons, and Applzcatzons. l I I T Pres4. Cambridge 11-1. 1999 volve not only aggregating on command. as in OL.1P queries, but search- ing for the "right" aggregation. Common types of data mining include 8. V. Harinarayan, - .1Rajaraman, and J . D. Ullman. ~~Implementiiig data clustering data into similar groups. designing decision trees to predict one cubes efficiently." Proc. ACAf SIGilfOD Intl. Conf. on Management of attribute based on the value of others. and finding sets of items that occur Data (1996). pp. 205-216. together frequently. 9. D. Loniet and J. U-idom (eds.). Special i ~ s u e materialized l-ie~vsand on + The A-Priori Algorithm: -An efficiellt \\-a?; to find frequent itemsets is to data warehouses. IEEE Data Erlg?ilcerlng Builet~n18:2 (1395). use the a-priori algorithm. This technique exploits the fact that if a set occurs frequently. then so do all of its subsets. 10. I*.Papakonstantinou. H. Garcia-llolina. arid J . n'idom. "Object ex- change across heterogeneous information sources." Proc. Intl. Conf. on Data Englneerlng (1993). pp 251-260. 20.8 References for Chapter 20 11. I-.Papakonstantinou. . .Gupta. and L. Haas. "Capnl>ilities-base query I Recent smveys of \varehonsing arid related technologics are in [9]. [3]. and [ T I . ren-riting in mediator s!-stems." Conference 011 Par(111el and Distributed Federated systems are surveyed 111 11'21. The concept of tlic mediato1 conies Informntion Systc~ns (1996). ,\l-;lil~il~le as: from [14]. Implementation of mediators and \\-rappers, especially tlie mapper-genera- tor approach. is covered in [5]. Capabilities-based optilnization for iriediators n-as explored in [ll. 131. 12. .A. P. Sheth and J . -1. Larson. "Federated databases for managing dis- The cube operator was proposed in 161. The i~iipleinentationof cubes by tributed. heterogeneous. and autonomous databases." Cornputzng Surreys materialized vie\\-s appeared in 181. 22:3 (1990), pp. 183-236. Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
11. 14. G. \Viederhold: "Mediators in the architecture of future information sys- terns." IEEE Computer C-25:l (1992), pp. 38-49. Index 15. R. Yerneni, C. Li, H. Garcia-3Iolina, and J. D. Ullman, "Computing capa- bilities of mediators," Proc. ACM SIGMOD Intl. Conf. on Management of Data (1999), pp. 443-454. Alias See AS Abiteboul, S. 21, 187, 1099 ALL 266, 278,437 .Abort 885, 970, 1017, 1026 ALTER TABLE294.334-335 See also Rollback Xnomaiy ABSOLUTE361 See Deletion anomaly, Redun- --ichilles, X.-C. 21 dancy, Update anomaly ACID properties 14 -Anonymous variable 466 See also Atomicity, Consistency, -ASS1 239 Durability, Isolation Antisemijoin 213 .ACR schedule ANY 266 See Cascading rollback .Application server 7 .Action 340 .A-priori algolithm 1093-1096 -ADA 350 Apt, I
15. Howard, I. H. 129 Inheritance 132, 134-135 See JDBC Lattice, of views 1085-1087 Hsu, 11. 916 See also Isa relationship, Llul- JDBC 349, 393-397 Layman. A 1099 HTXlL 629 tiple inheritance, Subclass Join 112-113,192-193.254-255.270- Leader election 1027 Hull, R. 21 Input action 881, 918 272, j05-506 Leaf 174, 633-634 Hybrid hash join 753-755 Input attribute 802 See also .antisemijoin, CROSS JOIN, Least fixedpoint 481-486,488,499 Insensitive cursor 360 Equijoin, Satural join, Sested- Least-recently used 767-768 Insertion 286-288,410,598-599,615- loop join, Outerjoin, Selec- LEAVE 371 I D 183 620,630,639-642,650-651, tivity, of a join. Semijoin, Left outerjoin 228, 273 Idempotence 230,891, 998 653-660,677-679,691,697- Theta-join, Zig-zag join Left-deep join tree 848-849, 853 Identity 555 698, 708 Join ordcring 818, 8-17-859 Left-recursion 484 IDREF 183 See also Modification Join tree 848 Legacy database 9, 175, 1065 I F 368 Instance, of a relation 64, 66 Joined tuple 198 Legality, of schedules 933-934,941- Imielinski, T. 1099 Instance, of an entity set 27 Juke box 512 942, 947 Immediate constraint checking 323- Instance variable 132 Lewis, P. 11. 11 1013 325 INSTEAD OF344-345 Ley, 11. 21 Immutable object 133 Integer 292-293, 569, 650 Kaiser, G. E. 1044 Li, C. 1100 Impedence mismatch 350-351 Intensional predicate 469 I i a ~ ~ e l l a kP.s188 ~ . LIKE 246-248 I N 266-267,430 Intention lock 959 I