Beginning Database Design- P11

Shared by: Cong Thanh | Date: | File type: PDF | Pages: 20



Document description

Beginning Database Design- P11:This book focuses on the relational database model from a beginning perspective. The title is, therefore, Beginning Database Design. A database is a repository for data. In other words, you can store lots of information in a database. A relational database is a special type of database using structures called tables. Tables are linked together using what are called relationships. You can build tables with relationships between those tables, not only to organize your data, but also to allow later retrieval of information from the database....


Text content: Beginning Database Design- P11

Understanding Data Warehouse Database Modeling

The demands of the modern global economy and the Internet dictate that end-user operational applications be active 24/7, 365 days a year. There is no window for any type of batch activity, because when people are asleep in Europe, others are awake down under in Australia. The global economy requires instant and acceptable servicing of the needs of a global user population. In reality, the most significant difference between OLTP databases and data warehouses extends all the way down to the hardware layer. OLTP databases need highly efficient sharing of critical resources such as onboard memory (RAM), and have very small I/O requirements. Data warehouses are completely opposite. Data warehouses can consume large portions of RAM by transferring between disk and memory, to the detriment of an OLTP database running on the same machine. Where OLTP databases need resource sharing, data warehouses need to hog those resources for extended periods of time. So, a data warehouse hogs machine resources, while an OLTP database attempts to share those same resources. Run both on the same machine and each is likely to suffer unacceptable response times, because there are not enough basic I/O resources for both database types. The result, therefore, is a requirement for a complete separation between operational (OLTP) and decision-support (data warehouse) activity. This is why data warehouses exist!

The Relational Database Model and Data Warehouses

The traditional OLTP (transactional) type of relational database model does not cater for data warehouse requirements. The relational database model is too granular. "Granular" implies too many little pieces. Processing through all those little pieces, joining them together, is too time consuming for large transactions. Similar to the object database model, the relational database model removes duplication and creates granularity.
This type of database model is efficient for front-end application performance, involving small amounts of data that are accessed frequently and concurrently by many users at once. This is what an OLTP database does. Data warehouses, on the other hand, need throughput of huge amounts of data by relatively few users. Data warehouses process large quantities of data at once, mainly for reporting and analytical processing. Also, data warehouses are regularly updated, but usually in large batch operations. OLTP databases need lightning-quick response to many individual users. Data warehouses perform enormous amounts of I/O activity over copious quantities of data; therefore, the needs of OLTP and data warehouse databases are completely contrary to each other, down to the lowest layer of hardware resource usage. Hardware resource usage is the most critical consideration. Software rests quite squarely on the shoulders of your hardware. Proper use of memory (RAM), disk storage, and CPU time is the critical layer for all activity. OLTP and data warehouse database differences extend all the way down to this most critical of layers. OLTP databases require intensely sharable hardware structures (commonly known as concurrency), needing highly efficient use of memory and processor time allocations. Data warehouses need huge amounts of disk space, and processing power as well, but all dedicated to long-running programs (commonly known as batch operations, or throughput). A data warehouse simply cannot cope with a standard OLTP relational database model. Something else is needed for a data warehouse.
Chapter 7

Surrogate Keys in a Data Warehouse

Surrogate keys, as you already know, are replacement key values. A surrogate key usually makes database access more efficient. In data warehouse databases, surrogate keys are possibly more important in terms of gluing together different data, even from different databases, perhaps even from different database engines. Sometimes different databases could be keyed on different values, or even contain different key values, which in the non-computerized world are actually identical. For example, a customer in one department of a company could be uniquely identified by the customer's name. In a second department, within the same company, the same customer could be identified by the name of a contact, or even perhaps by the phone number of that customer. A third department could identify the same customer by a fixed-length character coding system. All three definitions identify exactly the same customer. If this single company is to have meaningful data across all departments, it must recognize the three separate formats as all representing the same customer in the data warehouse. A surrogate key is the perfect solution, using the same surrogate key value for each repetition of the same customer, across all departments. Surrogate key use is prominent in data warehouse database modeling.

Referential Integrity in a Data Warehouse

Data warehouse data modeling is essentially a form of relational database modeling, albeit a simplistic form. Referential integrity still applies to data warehouse databases; however, even though referential integrity applies, it is not essential to create primary keys, foreign keys, and their inter-table referential links. It is important to understand that a data warehouse database generally has two distinct activities. The first activity is updating, with large numbers of records added at once, sometimes also with large numbers of records changed.
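The key-consolidation step can be sketched in a few lines of Python. The identifiers and the lookup logic here are invented purely for illustration; a real warehouse load process would draw on its own key-matching rules.

```python
# Sketch: assigning one surrogate key to the "same" customer identified
# three different ways by three departments. All names are illustrative.

surrogate_map = {}   # natural identifier -> surrogate key
next_key = 1

def surrogate_for(natural_id):
    """Return the surrogate key for a natural identifier, allocating a new one if unseen."""
    global next_key
    if natural_id not in surrogate_map:
        surrogate_map[natural_id] = next_key
        next_key += 1
    return surrogate_map[natural_id]

# Three departments key the same customer differently; the load process
# must first recognize them as one customer...
aliases = ["Acme Hardware", "Joe Smith (contact)", "CUST00042"]
# ...and then map every alias onto a single surrogate key value:
canonical = surrogate_for(aliases[0])
for alias in aliases[1:]:
    surrogate_map[alias] = canonical

print({a: surrogate_for(a) for a in aliases})   # all three map to key 1
```

Every fact record referencing any of the three aliases would then carry the same surrogate key, gluing the departmental data together.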
It is always best to only add or remove data in a data warehouse. Changing existing data warehouse table records can be extremely inefficient, simply because of the sheer size of data warehouses. Referential integrity is best implemented and enforced when updating tables. The second activity of a data warehouse is the reading of data. When data is read, referential integrity does not need to be verified, because no changes are occurring to records in tables. On the other hand, because referential integrity implies creation of primary and foreign keys, and because the best database model designs make extensive use of primary and foreign key fields in SQL code, it makes sense to leave referential integrity intact for a data warehouse. So, now you know the origin of data warehouses and why they were devised. What is the data warehouse dimensional database model?
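The two activity windows can be illustrated with SQLite, which lets a session switch foreign key enforcement on and off. The table names are invented, and a real warehouse load would use the bulk facilities of its own engine; this is only a sketch of the principle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer (customer_id),
        sale_price  REAL
    );
    INSERT INTO customer VALUES (1, 'Acme Hardware');
""")

# Batch update window: enforcement on, so referential links are verified.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("INSERT INTO sale VALUES (1, 1, 19.95)")        # valid parent row
try:
    conn.execute("INSERT INTO sale VALUES (2, 999, 5.00)")   # no such customer
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# Read window: queries change nothing, so there is nothing for
# referential integrity to verify, and enforcement can stay off.
conn.execute("PRAGMA foreign_keys = OFF")
rows = conn.execute("SELECT COUNT(*) FROM sale").fetchone()[0]
print("sales loaded:", rows)    # only the valid row was stored
```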
The Dimensional Database Model

A standard, normalized, relational database model is completely inappropriate to the requirements of a data warehouse. Even a denormalized relational database model doesn't make the cut. An entirely different modeling technique, called a dimensional database model, is needed for data warehouses. A dimensional model contains what are called facts and dimensions. A fact table contains historical transactions, such as all invoices issued to all customers for the last five years. That could be a lot of records. Dimensions describe facts. The easiest way to describe the dimensional model is to demonstrate by example. Figure 7-1 shows a relational table structure for both static book data and dynamic (transactional) book data. The grayed-out tables in Figure 7-1 are static data tables; the others contain data that is in a constant state of change. Static tables are the equivalent of dimensions, describing facts (the equivalent of transactions). So, in Figure 7-1, the dimensions are grayed out and the facts are not.

[Figure 7-1: The OLTP relational database model for books.]
What Is a Star Schema?

The most effective approach for a data warehouse database model (using dimensions and facts) is called a star schema. Figure 7-2 shows a simple star schema for the REVIEW fact table shown in Figure 7-1.

[Figure 7-2: The REVIEW table fact-dimensional structure (the REVIEW fact table linked directly to the PUBLISHER, AUTHOR, PUBLICATION, and CUSTOMER dimensions).]

A more simplistic equivalent diagram to that of Figure 7-2 is shown by the star schema structure in Figure 7-3.
[Figure 7-3: The REVIEW fact-dimensional structure is a star schema (one-to-many relationships from the PUBLISHER, AUTHOR, BOOK, and CUSTOMER dimensions to the REVIEW fact table).]

A star schema contains a single fact table plus a number of small dimensional tables. If there is more than one fact table, effectively there is more than one star schema. Fact tables contain transactional records, which over a period of time can come to contain very large numbers of records. Dimension tables, on the other hand, remain relatively constant in record numbers. The objective is to enhance SQL query join performance, where joins are executed between a single fact table and multiple dimensions, all on a single hierarchical level. So, a star schema is a single, very large, very changeable fact table, connected directly to a single layer of multiple, statically sized dimensional tables.
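A minimal runnable version of such a star, loosely based on the REVIEW schema of Figure 7-2 (columns trimmed, data invented), shows the single-hop join pattern using SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One layer of small, static dimension tables...
    CREATE TABLE author   (author_id INTEGER PRIMARY KEY, author TEXT);
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer TEXT);
    -- ...and a single large fact table referencing them directly.
    CREATE TABLE review (
        review_id   INTEGER PRIMARY KEY,
        author_id   INTEGER REFERENCES author (author_id),
        customer_id INTEGER REFERENCES customer (customer_id),
        review_date TEXT,
        text        TEXT
    );
    INSERT INTO author   VALUES (1, 'Isaac Asimov');
    INSERT INTO customer VALUES (1, 'Acme Hardware');
    INSERT INTO review   VALUES (1, 1, 1, '2005-03-01', 'A fine read');
""")

# Every join is fact-to-dimension, one hop deep: the defining star property.
row = conn.execute("""
    SELECT a.author, c.customer, r.review_date
    FROM review r
    JOIN author a   ON a.author_id = r.author_id
    JOIN customer c ON c.customer_id = r.customer_id
""").fetchone()
print(row)   # ('Isaac Asimov', 'Acme Hardware', '2005-03-01')
```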
What Is a Snowflake Schema?

A snowflake schema is shown in Figure 7-4. A snowflake schema is a normalized star schema, such that dimension entities are normalized (dimensions are separated into multiple tables). Normalized dimensions have all duplication removed from each dimension, such that the result is a single fact table, connected directly to some of the dimensions. Not all of the dimensions are directly connected to the fact table. In Figure 7-4, the dimensions are grayed out in two shades of gray. The lighter shade of gray represents dimensions connected directly to the fact table (BOOK, AUTHOR, SUBJECT, SHIPPER, and CUSTOMER). The darker-shaded dimensional tables are normalized subset dimensional tables, not connected to the fact table directly (PUBLISHER, PUBLICATION, and CATEGORY).

[Figure 7-4: The SALE table fact-dimensional structure.]
[Figure 7-5: The SALE fact-dimensional structure is a snowflake schema.]

The problem with snowflake schemas isn't too many tables but too many layers. Data warehouse fact tables can become incredibly large, running to millions, billions, even trillions of records. The critical factor in creating star and snowflake schemas, instead of using standard "nth" Normal Form layers, is decreasing the number of tables in SQL query joins. The more tables in a join, the more complex the query, and the slower it will execute. When fact tables contain enormous record counts, reports can take hours or days, not minutes. Adding just one more table to a fact-dimensional query join at that level of database size could make the query run for weeks. That's no good!
The solution is an obvious one. Convert (denormalize) a normalized snowflake schema into a star schema, as shown in Figure 7-6. In Figure 7-6, the PUBLISHER and PUBLICATION tables have been denormalized into the BOOK table, and the CATEGORY table has been denormalized into the SUBJECT table.

[Figure 7-6: A denormalized SALE table fact-dimensional structure.]

A more simplistic equivalent diagram to that of Figure 7-6 is shown by the star schema in Figure 7-7.
[Figure 7-7: The SALE fact-dimensional structure denormalized into a star schema (AUTHOR, CUSTOMER, BOOK, SHIPPER, and SUBJECT dimensions in one-to-many relationships with the SALE fact table).]

What does all this prove? Not much, you might say. On the contrary, two things are achieved by using fact-dimensional structures and star schemas:

❑ Figure 7-1 shows a highly normalized table structure, useful for high-concurrency, precision record-searching databases (an OLTP database). Replacing this structure with a fact-dimensional structure (as shown in Figure 7-2, Figure 7-4, and Figure 7-6) reduces the number of tables. As you already know, reducing the number of tables is critical to SQL query performance. Data warehouses consist of large quantities of data, batch updates, and incredibly complex queries. The fewer tables, the better. It just makes things so much easier with fewer tables, especially because there is so much data. The following code is a SQL join query for the snowflake schema, joining all nine tables in the snowflake schema shown in Figure 7-5.

SELECT *
FROM SALE SAL
  JOIN AUTHOR AUT
  JOIN CUSTOMER CUS
  JOIN SHIPPER SHP
  JOIN SUBJECT SUB
  JOIN CATEGORY CAT
  JOIN BOOK BOO
  JOIN PUBLISHER PBS
  JOIN PUBLICATION PBL
WHERE ... GROUP BY ... ORDER BY ... ;

❑ If the SALE fact table has 1 million records, and all dimensions contain 10 records each, a Cartesian product would return 10^6 multiplied by 10^9 records. That makes for 10^15 records. That is a lot of records for any CPU to process.
A Cartesian product is a worst-case scenario.

❑ Now look at the next query.

SELECT *
FROM SALE SAL
  JOIN AUTHOR AUT
  JOIN CUSTOMER CUS
  JOIN SHIPPER SHP
  JOIN SUBJECT SUB
  JOIN BOOK BOO
WHERE ... GROUP BY ... ORDER BY ... ;

❑ Using the star schema from Figure 7-7, assuming the same number of records, a join occurs between one fact table and six dimensional tables. That is a Cartesian product of 10^6 multiplied by 10^6, resulting in 10^12 records returned. The difference between 10^12 and 10^15 is three decimal orders of magnitude. Three orders of magnitude is not just three zeroes, and thus 1,000 records. The difference is actually 1,000,000,000,000,000 – 1,000,000,000,000 = 999,000,000,000,000, which is effectively just a little less than 10^15 itself. The difference between six dimensions and nine dimensions is more or less infinite, from the perspective of counting all those zeros. Fewer dimensions make for faster queries. That's why it is so essential to denormalize snowflake schemas into star schemas.

❑ Take another quick glance at the snowflake schema in Figure 7-4 and Figure 7-5. Then examine the equivalent denormalized star schema in Figure 7-6 and Figure 7-7. Now put yourself into the shoes of a hustled, harried, and very busy executive, trying to get a quick report. Think as an end-user, one interested only in results. Which diagram is easier to decipher as to content and meaning? The diagram in Figure 7-7 is far less complex than the diagram in Figure 7-5. After all, being an end-user, you are probably not too interested in understanding the complexities of how to build SQL join queries. You have bigger fish to fry. The point is this: the less complex the table structure, the easier it will be to use. This is because a star schema is more representative of the real world than a snowflake schema. Look at it this way: a snowflake schema is more deeply normalized than a star schema, and, therefore, by definition more mathematical.
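The arithmetic in these bullets is easy to verify:

```python
# Cartesian products from the text: a 10^6-record fact table against
# dimension combinations of 10^9 (snowflake) and 10^6 (star) rows.
snowflake = 10**6 * 10**9    # 10^15
star      = 10**6 * 10**6    # 10^12

print(f"{snowflake:,}")          # 1,000,000,000,000,000
print(f"{star:,}")               # 1,000,000,000,000
print(f"{snowflake - star:,}")   # 999,000,000,000,000
print(snowflake // star)         # the star result is 1,000 times smaller
```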
Something more mathematical is generally of more use to a mathematician than it is to an executive manager. The executive is trying to get a quick overall impression of whether his company will sell more cans of lima beans, or more cans of string beans, over the course of the next ten years. If you are a computer programmer, you will quite probably not agree with this analogy. That tells us the very basics of data warehouse database modeling. How can a data warehouse database model be constructed?

How to Build a Data Warehouse Database Model

Now you know what star schemas for data warehouse database models look like. As you can see, a star schema is quite different from a standard relational database model (Figure 7-1). The next step is to examine the process, or the steps, by which a data warehouse database model can be built.
Data Warehouse Modeling Step by Step

The primary objective of a data warehouse (or any database) is to service end-users. The end-users are the people who read reports produced by data warehouse SQL queries. End-users utilize a data warehouse to search for patterns, and attempt to forecast trends from masses of historical information. From that perspective, there is a sequence of steps in approaching data warehouse database design, beginning with the end-user perspective. The end-user looks at a company from a business process, or operational, perspective:

❑ Business processes — Establish the subject areas of a business. How can the business be divided up? The result is the fact tables. Fact tables contain records of historical transactions.

❑ Granularity — Granularity is the level of detail required. In other words, should a data warehouse store every single transaction? Should it summarize transactions as a single record for each day, month, year, and so on? The more granularity the data warehouse contains, the bigger the fact tables are, because the more records they contain. The safest option is to include all historical data down to the lowest level of granularity. This ensures that any possible future requirements for detailed analysis can always be met, without needed data turning out to be missing. Missing data might make your executive managers a little irate, and that's usually best avoided. There are specialized objects, such as materialized views, that can create summaries at a later stage. When you do not know the precise requirements for future use of your data warehouse, to be on the safe side, it is best to store all levels of detail (assuming hardware storage capacity allows it). If you miss a level of detail in any specific area and it is later requested, you won't be able to comply.
In other words, store every transaction, if you have the physical disk space and general hardware-processing capacity.

❑ Identify and build dimensions — Dimensions contain static information. Dimensions describe facts by storing static details about transactions in fact tables. Dimensions must be built before facts because facts contain foreign key references to dimension tables.

❑ Build facts — As previously mentioned, facts are transactional records, going back even many years. Fact tables are built after all dimensions are decided upon because, as you already know, facts are dependent on dimensions.

How Long to Keep Data in a Data Warehouse?

The amount of time you keep data in a data warehouse depends on end-user requirements. Typically, when designing a data warehouse, at the point of creating table ERD diagrams, it is impossible to tell how much detail is required. The best option is to retain every single transaction without summarizing anything; however, that can chew up a humungous amount of disk space. If you have the space, why not use it? If you run out of space later, you can always begin summarizing and destroying detail-level records at a later stage. Be warned! Summarizing data warehouse records into aggregated records, and deleting detail records, can be a seriously time-consuming effort if done when the data warehouse has grown to be extremely large. Data warehouses sometimes retain all data forever. When a data warehouse becomes too difficult to manage, there will have to be some deletion of older data, or summarizing (or both). It all depends on hardware storage capacity, the power of computers, and how much you can spend on continually expanding the capacity of existing hardware. Upgrading a large data warehouse to new hardware and software can also be very time-consuming.
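Because facts carry foreign key references to dimensions, the dimensions must exist before the fact tables can be created or loaded. A minimal SQLite sketch of that build order, with columns abbreviated from the SALE star schema (this is an illustration, not a complete build script):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# 1. Dimensions first: small, static, with no dependencies of their own.
conn.executescript("""
    CREATE TABLE author  (author_id INTEGER PRIMARY KEY, author TEXT);
    CREATE TABLE subject (subject_id INTEGER PRIMARY KEY, category TEXT, subject TEXT);
""")

# 2. The fact table last: it cannot reference dimensions that do not exist yet.
conn.execute("""
    CREATE TABLE sale (
        sale_id    INTEGER PRIMARY KEY,
        author_id  INTEGER NOT NULL REFERENCES author (author_id),
        subject_id INTEGER NOT NULL REFERENCES subject (subject_id),
        sale_price REAL,
        sale_date  TEXT
    )
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)   # ['author', 'sale', 'subject']
```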
Types of Dimension Tables

The data warehouse database models for the REVIEW and SALE tables, shown previously in this chapter, are actually inadequate. Many data warehouse databases have standard requirements based on how end-users need to analyze data. Typical additions to data warehouse databases are dimensions such as dates, locations, and products. These extra dimensions can be built from most types of transactional data. Any types of transactions involving products or services are usually dated, such as orders, invoices, or payments. Therefore, every transaction can be placed (or dated) into specific periods. Locations can usually be assessed, either generally or in detail, from customers, suppliers, shippers, and other companies that the business is in contact with. The REVIEW and SALE fact tables shown in Figure 7-2 and Figure 7-4 both contain date fields (REVIEW.REVIEW_DATE and SALE.SALE_DATE). A date or timestamp dimension might look something similar to that shown in Figure 7-8. Date fields in the tables would be replaced with identifiers, as shown in Figure 7-9. Note how the date values in the TIME table, as shown in Figure 7-8, are reconstituted into specific time window periods, such as MONTH, QUARTER, and YEAR.

[Figure 7-8: A time dimension entity (TIME_ID, MONTH, QUARTER, YEAR).]

Figure 7-9 shows fact tables for both book reviews and sales, where the date fields have been replaced with foreign key links to the TIME table. The result of this is that facts can be analyzed by month, quarter, and year. In this case, removing the date fields from the fact tables has actually decreased the detail level. Now data can only be analyzed by month. Months are the lowest level of detail. To retain the ability to analyze on a daily basis, or with further detail such as hours and minutes, either the TIME table should be expanded to accommodate more detail periods, or the date fields should be retained in the fact tables.
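One way to derive a TIME dimension like that of Figure 7-8 from raw transaction dates can be sketched with SQLite. The load routine here is illustrative, not a standard warehouse utility:

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time (
        time_id INTEGER PRIMARY KEY,
        month   INTEGER,
        quarter INTEGER,
        year    INTEGER
    );
    CREATE TABLE sale (
        sale_id    INTEGER PRIMARY KEY,
        time_id    INTEGER REFERENCES time (time_id),
        sale_price REAL
    );
""")

def time_key(d: date) -> int:
    """Find or create the TIME record for a date's month/quarter/year."""
    month, quarter, year = d.month, (d.month - 1) // 3 + 1, d.year
    row = conn.execute(
        "SELECT time_id FROM time WHERE month=? AND quarter=? AND year=?",
        (month, quarter, year)).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO time (month, quarter, year) VALUES (?,?,?)",
                       (month, quarter, year))
    return cur.lastrowid

# The fact row stores a TIME_ID foreign key instead of the raw date,
# so analysis drops to month granularity, exactly as the text warns.
conn.execute("INSERT INTO sale (time_id, sale_price) VALUES (?, ?)",
             (time_key(date(2005, 8, 14)), 29.95))

print(conn.execute("SELECT month, quarter, year FROM time").fetchall())
# [(8, 3, 2005)]
```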
Another commonly used dimension involves locations, which can be states, cities, countries, continents, regions, and others. Location details for review and sale facts can be gleaned from customer and shipper address details. The resulting LOCATION table would look something like that shown in Figure 7-10. Some silly fields have been added just to give this conversation a little color. The equivalent star schema changes are shown in Figure 7-11.
[Figure 7-9: Adding the time dimension to the facts (dates in the REVIEW and SALE fact tables are replaced with a TIME_ID foreign key to the TIME dimension).]

[Figure 7-10: Location dimensions are commonly used to analyze data by region (LOCATION_ID, CITY, STATE, COUNTRY, CONTINENT, REGION, STAR_SYSTEM, GALAXY, UNIVERSE).]
[Figure 7-11: Adding a location dimension to facts (location values become LOCATION_ID foreign keys to a LOCATION dimension in both the REVIEW and SALE fact tables).]

Other types of dimensions, common to many data warehouses, cover topic areas such as products or product categories. There are many other possible dimensions. Dimensions typically depend on the data content of a data warehouse, and also on how the data warehouse is used for analysis.
Try It Out: Creating a Data Warehouse Database Model

Figure 7-12 shows the tables of a database containing details of bands, their released CDs, and the tracks on each of those CDs, followed by three tables containing royalty amounts earned by each track from radio plays, live performances, and recordings by other artists. Create an equivalent data warehouse snowflake schema. Then, from the snowflake schema, denormalize into a star schema.

1. Identify the fact tables. Facts are generally the business processes of the day-to-day functions of a business. Transactional data records those business functions.
2. Granularity can be ignored in this case.
3. What are the dimensions? Dimensions are the static tables in a data warehouse database model.
4. Create the dimension tables, followed by the fact tables.

[Figure 7-12: An OLTP relational database model (BAND, CD, TRACK, RADIO, PERFORMANCE, and RECORDING tables).]
How It Works

Go through the process as follows:

1. In Figure 7-12, you should be able to distinguish the transactional tables (those constantly changing) from the static tables. The transactional tables are the royalty tables: the RADIO, PERFORMANCE, and RECORDING tables. These three tables are the only tables with amount and date values in them. This makes them transactional tables, and thus fact tables.
2. In Figure 7-12, dimensions are all the tables containing static data, not previously assessed as being transactional facts. The dimension tables in Figure 7-12 are the BAND, CD, and TRACK tables.
3. The next step is to begin by building dimensions. Build facts after that. Figure 7-13 shows a simplistic picture of all static dimensions and facts, in the form of a dimensionally normalized snowflake schema.

[Figure 7-13: A data warehouse snowflake schema.]

Figure 7-14 shows the ERD equivalent of the snowflake schema shown in Figure 7-13. Note how date and location values are replaced by the TIME and LOCATION dimensions (shown in both Figure 7-13 and Figure 7-14).
[Figure 7-14: A data warehouse snowflake schema ERD.]

4. Figure 7-15 and Figure 7-16 show a conversion of the snowflake schema to a star schema. The BAND, CD, and TRACK tables are all denormalized into the single table called TRACK. The result is a single-level hierarchy of dimensions.

[Figure 7-15: A data warehouse star schema.]
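The denormalization in step 4 can be mimicked in SQL. A SQLite sketch with invented data, collapsing BAND and CD into a flattened TRACK dimension:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (snowflake) dimension layers, as in Figure 7-14.
    CREATE TABLE band  (band_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cd    (cd_id INTEGER PRIMARY KEY, band_id INTEGER, title TEXT);
    CREATE TABLE track (track_id INTEGER PRIMARY KEY, cd_id INTEGER, track TEXT);
    INSERT INTO band  VALUES (1, 'The Sketches');
    INSERT INTO cd    VALUES (1, 1, 'First Pressing');
    INSERT INTO track VALUES (1, 1, 'Opening Number');
""")

# Denormalize: collapse BAND and CD into a single flattened TRACK
# dimension, giving the single-level hierarchy of Figure 7-15.
conn.execute("""
    CREATE TABLE track_dim AS
    SELECT t.track_id, b.name AS band_name, c.title AS cd_title, t.track
    FROM track t
    JOIN cd c   ON c.cd_id = t.cd_id
    JOIN band b ON b.band_id = c.band_id
""")

print(conn.execute("SELECT * FROM track_dim").fetchone())
# (1, 'The Sketches', 'First Pressing', 'Opening Number')
```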
[Figure 7-16: A data warehouse star schema ERD.]

Understanding Fact Tables

Facts are not actually the tables themselves, but the fields within those tables; fact table fields determine the content and field types of fact tables. Facts are generally numeric in nature, allowing aggregation of those numeric values into analytical data warehouse reporting structures. Foreign keys in fact tables that point to dimensions are not facts. Fact table foreign keys (often surrogate keys) simply link to more descriptive dimension tables. Dimension tables contain detailed descriptions about fact values, such as the time and location details of when and where a particular transaction took place. There are other, less commonly used dimensions, such as with whom and what (customers, authors, shippers, books).
Fact tables can contain detailed transactional histories, summaries (aggregations), or both. Details and aggregations can be separated into different fact tables. Many relational databases allow use of materialized views to contain fact table aggregations, containing less detailed aggregations of their underlying fact tables. One of the most significant factors with fact tables, specifically because of their potentially enormous size, is how fact tables are changed. How is data changed in fact tables? Fact tables can be added to or changed, and even deleted from. The least obstructive changes to fact tables are made by adding new data, preferably during low end-user activity time windows. Deletion from fact tables is necessary when upgrading to more summarized details, or perhaps when deleting old data that is no longer useful. Fact tables are thus regarded as being additive, semi-additive, or non-additive. The most efficient data warehouses are those that never change, containing only non-additive fact tables. Additive tables are more practical, because the more current data warehouse facts are, the more useful they should become for analysis.
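Because the fact field (the royalty AMOUNT in Figure 7-16) is numeric and additive, summaries fall out of simple aggregate queries, which is exactly the kind of result a materialized view would hold precomputed. A SQLite sketch over a trimmed ROYALTY star, with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time (time_id INTEGER PRIMARY KEY, quarter INTEGER, year INTEGER);
    CREATE TABLE royalty (
        royalty_id INTEGER PRIMARY KEY,
        time_id    INTEGER REFERENCES time (time_id),
        amount     REAL       -- the fact: numeric, therefore additive
    );
    INSERT INTO time VALUES (1, 1, 2005), (2, 2, 2005);
    INSERT INTO royalty VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 75.0);
""")

# Aggregate the numeric fact by a dimension attribute.
totals = conn.execute("""
    SELECT t.year, t.quarter, SUM(r.amount)
    FROM royalty r JOIN time t ON t.time_id = r.time_id
    GROUP BY t.year, t.quarter
    ORDER BY t.year, t.quarter
""").fetchall()
print(totals)   # [(2005, 1, 150.0), (2005, 2, 75.0)]
```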
Summary

In this chapter, you learned about:

❑ The importance of understanding the origin of data warehouses, especially why they were invented and how they are used
❑ The inadequacy of the transactional (OLTP) database model for producing forecasts and predictions from huge amounts of out-of-date data, because an OLTP database model has too much granularity
❑ Specialized database modeling techniques required by data warehouses, in the form of facts and dimensions
❑ Star schemas as single dimensional layers, and how they provide the best efficiency and clarity to end-users
❑ Snowflake schemas as normalized, less efficient forms of star schemas
❑ Variable data, the transactions or history of company transactional activities, which compose the facts
❑ How dimensions contain static, descriptive data; dimensional data describes fact data
❑ How dimensions describe facts by normalizing only static (dimensional) data, preferably into a single dimensional layer
❑ How fact-dimensional or dimensional database modeling is either a star schema or a snowflake schema
❑ Data warehouses containing one or more fact tables, linked to the same dimensions, or to a subset of the dimensions
❑ The various steps and processes used in the creation of a data warehouse database model
❑ How the contents of a data warehouse database and model are both static and historical
❑ How static (dimensional) data changes infrequently
❑ How historical or archived (factual) data is continually added to, and can grow enormously over time
This chapter has described the specialized database modeling techniques required for the building of data warehouses. The next chapter examines one of the most important factors of database modeling — performance. If a database underperforms, a company can lose customers and, in extreme situations, lose all its customers. Guess what happens after that? Look, no job!

Exercises

Use the ERDs in Figure 7-14 and Figure 7-16 to help you answer these questions:

1. Create scripts to create tables for the snowflake schema in Figure 7-14.
2. Change the tables created for the previous question, converting the snowflake schema in Figure 7-14 to the star schema in Figure 7-16. Use CREATE TABLE and DROP TABLE commands.