Chapter 3 Data modelling and databases OLTP & OLAP
1
Outline
• Overview
• OLTP vs OLAP
• Data warehouse modeling
• Data warehouse design
• Data warehouse Implementation
2
Heterogeneous data sources
3
Why data integration
• To facilitate information access and reuse through a
single information access point
• Data from different complementing information systems is to be combined to gain a more comprehensive basis to satisfy the need • Improve decision making • Improve customer experience • Increase competitiveness, Streamline operations • Increase productivity • Predict the future
4
Data integration challenges
• Physical systems
• Logical structures
• Various hardwares, standards • Distributed deployment • Various data format
• Different data models • Different data schemas • Business organization
5
• Data security and privacy • Business rules and requirements • Different administrative zones in the business organization
Data Warehouse
• A single, complete and consistent store of data obtained from a
variety of different sources made available to end users in a what they can understand and use in a business context. [Barry Devlin]
• A data warehouse is a copy of transaction data specifically
structured for query and analysis [Ralph Kimball]
• Data from several operational sources (OLTP) are extracted,
transformed, and loaded (ETL) into a data warehouse
Data Warehouse usage
• Three kinds of data warehouse applications
• supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
• Information processing
• multidimensional analysis of data warehouse data • supports basic OLAP operations, slice-dice, drilling, pivoting
• Analytical processing
• knowledge discovery from hidden patterns • supports associations, constructing analytical models,
performing classification and prediction, and presenting the mining results using visualization tools
7
• Data mining
Data Warehouse usage
Which are our lowest/highest margin customers ?
Who are my customers and what products are they buying?
What is the most effective distribution channel?
What product prom- -otions have the biggest impact on revenue?
Which customers are most likely to go to the competition ?
What impact will new products/services have on revenue and margins?
Advantages
• High query performance
• Does not interfere with local processing at sources
• But not necessary most current information
9
• Complex queries at warehouse • OLTP at information sources
Characteristics of Data warehouse
• Subject-Oriented
• Integrated
• Time-variant
• Non-volatile
10
Subject-Oriented
• Offer information regarding a theme instead of
companies' ongoing operations
• Emphasis on modeling and analysis of data
for decision making
• Subjects can be sales, marketing, distributions, etc. • A data warehouse never focuses on the ongoing operations
• Provide a simple and concise view around the specific
11
subject by excluding data which not helpful
Integrated
• Data needs to be stored in the Datawarehouse in a common and
universally acceptable manner
• This integration helps in effective analysis of data
• Constructed by integrating multiple, heterogeneous data sources
• Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources • E.g., Hotel price: currency, tax, breakfast covered, etc.
• When data is moved to the warehouse, it is converted.
12
• Data cleaning and data integration techniques are applied.
Time-Variant
• The time horizon for the data warehouse is significantly
longer than that of operational systems • Operational database: current value data • Data warehouse data: provide information from a historical
• Every key structure in the data warehouse
perspective (e.g., past 5-10 years)
• Contains an element of time, explicitly or implicitly • But the key of operational data may or may not contain “time
• Data warehouse is loaded daily, hourly, or on some
other periodic basis, and does not change within that period
Time
Data
01/97
Data for January
02/97
Data for February
03/97
Data for March
13
Data Warehouse
element”
Non-volatile
• Historical data in a data warehouse should never be
altered
• Helps to analyze historical data and understand what & when
• Data is read-only
happened
• Does not require transaction process, recovery and
concurrency control mechanisms
• Only two types of data operations
• Delete, update, and insert are omitted
14
• Data loading • Data access (reading)
Data Warehousing
• A process for assembling and managing data from
various sources for the purpose of answering business questions
15
ETL
• Extract
• Transform
• Get the data from source system as efficiently as possible
• Load
• Perform calculations on data
16
• Load the data in the target storage
Staging area
• An intermediate storage area used for data processing during the extract, transform and load (ETL) process
• Mainly required for timing reasons (optional)
• If it is not feasible to extract all the data from all Operational
• Objectives
databases at exactly the same time
• Consolidation: as a large "bucket" in which data from
multiple source systems can be temporarily placed for further processing
• Alignment: standardization of reference data, validation of
relationships between records and data elements from different sources
17
• Minimizing contention: efficient data transfer from sources • Independent scheduling • Change detection • Cleansing data
Why is ETL (System) Important?
• Adds value to data
• Removes mistakes and corrects data • Documented measures of confidence in data • Captures the flow of transactional data • Adjusts data from multiple sources to be used together
(conforming)
18
• Structures data to be usable by BI tools • Enables subsequent business / analytical data procesing
ETL market
19
Metadata Repository
• Metadata is the data defining warehouse objects. It stores: • Description of the structure of the data warehouse
• schema, view, dimensions, hierarchies, derived data defn, data
mart locations and contents
• Operational meta-data
• data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
• The algorithms used for summarization • The mapping from operational environment to the data
warehouse
• Data related to system performance
• warehouse schema, view and derived data definitions
• Business data
• business terms and definitions, ownership of data, charging
policies
20
OLTP vs OLAP
21
OLTP (Online Transaction processing)
• OLTP is characterized by a large number of short on-
line transactions (INSERT, UPDATE, DELETE).
• The main emphasis for OLTP systems is put on very fast query processing, maintaining data integrity in multi-access environments and an effectiveness measured by number of transactions per second.
• In OLTP database there is detailed and current data, and schema used to store transactional databases is the entity model (usually 3NF).
22
OLAP (Online analytical processing)
• Is characterized by relatively low volume of
transactions.
• Queries are often very complex and involve
aggregations.
• For OLAP systems a response time is an effectiveness
measure.
• OLAP applications are widely used by Data Mining
techniques.
• In OLAP database there is aggregated, historical data,
stored in multi-dimensional schemas (usually star schema).
23
OLTP vs OLAP
• targets one specific process • integrates data from different
processes • Many short transactions
(queries + updates) • often makes use of historical
data • Examples
• Long transactions, complex,
ad hoc queries
• Examples • Update account balance • Enroll in course • Add book to shopping cart
• Queries touch a small • Report total sales for each department in each month
• Identify top selling books amounts of data (one record or a few records)
• Queries touch large amounts • Updates are frequent
of data
• Updates are infrequent • Concurrency is biggest performance concern
• Individual queries can require
24
lots of resources
OLTP vs. OLAP
25
OLAP & OLTP: Different performance requirements
• Transaction processing (OLTP) • Fast response time important • Data must be up-to-date, consistent at all times
• Data analysis (OLAP)
• OLAP queries would degrade operational DB
• Queries can consume lots of resources • Operating on static “snapshot” of data
26
• Analysis query asks for sum of all sales • Acquires lock on sales table • New sales transaction is blocked
OLAP & OLTP: Different data modeling requirements • OLTP
• OLAP
• Normalized schema for consistency • Complex data models, many tables • Limited number of standardized queries and updates
• Fewer joins -> improved query performance • Fewer tables -> schema is easier to understand
27
• Simplicity of data model is important • De-normalized schemas are common
Data Warehouse Modeling Data Cube and OLAP models and operations
28
Multi-dimensional data model
• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
• A data cube allows data to be modeled and viewed in
multiple dimensions.
• optimized for very quick data analysis
• Fact tables contain measures of interest (such as dollars sold) and keys to each of the related dimension tables.
29
• Dimension tables provide the context of the measures such as item (item name, brand), product, location or time(day, week, month, quarter, year).
Multi-dimensional data representation
• Relational table only represents multi-dimensional data in
two dimensions
Date
Sales
ID
Product
Country
1Qtr
100
1
TV
US
4Qtr
500
2
PC
Canada
2Qtr
30
3
CAR
US
3Qtr
200
4
PC
UK
1Qtr
20
5
CAR
UK
2Qtr
15
6
CAR
UK
4Qtr
80
7
TV
Canada
• Cube represents data as cells in an array
• Each side of cube is a dimension
30
From Tables to Data Cubes
Total annual sales of TV in U.S.A. 2Qtr 1Qtr sum
Date 3Qtr
4Qtr TV
U.S.A PC
Product
CAR
sum Canada
y r t n u o C
U.K.
31
sum
Multidimensional Data
• Sales volume as a function of product, month, and
region
• For each dimension, the set of values can be
organized in a hierachy
Dimensions: Product, Location, Time Hierarchical summarization paths
Region
Industry Region Year
Category Country Quarter
t c u d o r P
Product City Month Week
Month
32
Office Day
Browsing a Data Cube
34
Cube: A Lattice of Cuboids
all
0-D (apex) cuboid
product
product,date
product,country
date, country
country date 1-D cuboids
2-D cuboids
product, date, country
35
3-D (base) cuboid
An example of cuboids
36
Cube: A Lattice of Cuboids
all 0-D (apex) cuboid
time item location supplier
1-D cuboids
time,item time,location
item,location
location,supplier
2-D cuboids
time,supplier
item,supplier
time,location,supplier
time,item,location
3-D cuboids
item,location,supplier
time,item,supplier
4-D (base) cuboid
time, item, location, supplier
37
Data Warehouse Schemas (Conceptual) • Star Schema
• A fact table in the middle connected to a set of dimension
• Snowflake Schema
tables
• A refinement of star schema where some dimensional
• Fact Constellations
hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
38
• Multiple fact tables sharing dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
The Star Schema
time
Sales Fact Table
time_key day day_of_the_week month quarter year
item item_key item_name brand type supplier_type
time_key
item_key
branch_key
location
location_key
branch branch_key branch_name branch_type
units_sold
location_key street city province_or_street country
dollars_sold
avg_sales
39
Measures
The Star Schema: An Example
product
prodId name price
store
p1 p2
bolt nut
10 5
storeId c1 c2 c3
city nyc sfo la
sale oderId date o100 1/7/97 o102 2/7/97 3/8/97 105
custId prodId storeId p1 p2 p1
53 53 111
c1 c1 c3
qty 1 2 5
amt 12 11 50
customer
custId 53 81 111
name joe fred sally
address 10 main 12 main 80 willow
city sfo sfo la
40
The Snowflake Schema
time
supplier supplier_key supplier_type
Sales Fact Table
item item_key item_name brand type supplier_key
time_key day day_of_the_week month quarter year
time_key
item_key
branch_key
location
location_key
location_key street city_key
units_sold
branch branch_key branch_name branch_type
city
dollars_sold
avg_sales
city_key city province_or_street country
41
Measures
The Fact Constellation Schema
time
Shipping Fact Table
time_key
Sales Fact Table
item_key
time_key
time_key day day_of_the_week month quarter year
item_key item_name brand type supplier_type
shipper_key
item_key
from_location
branch_key
to_location
branch
location_key
location
dollars_cost
units_sold
units_shipped
branch_key branch_name branch_type
dollars_sold
avg_sales
location_key street city province_or_street country
shipper
Measures
shipper_key shipper_name location_key shipper_type
42
item
Concept Hierarchy
• Define a sequence of mappings from a set of very
sepecific, low-level concepts to more general, higher- level concepts
Location
all
Europe
...
North_America
region
Germany
...
Spain
Canada
...
country
Vancouver
...
city
Frankfurt
...
Toronto
L. Chan
...
M. Wind
office
43
• E.g. concept of Location
Concept Hierarchy
• Concept hierarchies are useful to perform OLAP • Data are organized in multiple dimensions where each
• It give flexibility to summarize data on various levels of
granularity
• And OLAP operations enable materialization of such views
dimension contains multiple levels of abstraction defined by concept hierarchies
Set-Grouping Hierarchy
[$0 - $1000]
inexpensive
moderate
expensive
[$0 - $150]
45
Typical OLAP Operations
• Roll up (drill-up): summarize data
• Drill down (roll down): reverse of roll-up
• by climbing up hierarchy or by dimension reduction
• from higher level summary to lower-level summary or detailed
data, or introducing new dimensions • Slice and dice: project and select • Pivot (rotate):
47
• reorient the cube, visualization, 3D to series of 2D planes
49
50
Roll-Up & Drill-Down
Higher Level of Aggregation
D r i l l - D o w n
p U
l l
o R
v Sales Channel v Region v Country v State v Location Address v Sales Representative
Low-level Details
51
52
53
Pivot (1)
• Pivot is also called a
rotation.
• Pivot rotates the data
axes in view to provide an alternative presentation of the data.
• It may contain
swapping the rows and columns or moving one of the row-dimensions into the column dimensions.
54
Pivot (2)
55
Pivot (3)
56
A Star-Net Query Model
Customer Orders
Shipping Method
Customer
CONTRACTS
AIR-EXPRESS
ORDER
TRUCK
PRODUCT LINE
Time
Product
ANNUALY
QTRLY
DAILY
PRODUCT ITEM
PRODUCT GROUP
CITY
SALES PERSON
COUNTRY
DISTRICT
REGION
DIVISION
Location
Promotion
Organization
57
Each circle is called a footprint
Data Warehouse Design
58
Design of Data Warehouse
• A Business Analysis Framework
• Four views regarding the design of a data warehouse
• allows selection of the relevant information necessary for the
data warehouse
• Top-down view
• exposes the information being captured, stored, and managed
by operational systems
• Data source view
• consists of fact tables and dimension tables
• Data warehouse view
• sees the perspectives of data in the warehouse from the view of
end-user
59
• Business query view
Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination of both • Top-down: Starts with overall design and planning (mature) • Bottom-up: Starts with experiments and prototypes (rapid)
• From software engineering point of view
• Waterfall: structured and systematic analysis at each step before
proceeding to the next
• Spiral: rapid generation of increasingly functional systems, short
turn around time, quick turn around
• Typical data warehouse design process
• Choose a business process to model, e.g., orders, invoices, etc. • Choose the grain (atomic level of data) of the business process • Choose the dimensions that will apply to each fact table record • Choose the measure that will populate each fact table record
60
3-tier data warehouse architecture
61
Enterprise warehouse & data marts
• Enterprise warehouse
• collects all of the information about subjects spanning the
• Data mart
entire organization
• Holds data only for a specific department or line of business,
62
such as sales, finance, or human resources. • A data warehouse can feed data to a data mart • A data mart can feed a data warehouse.
A Recommended Approach
Multi-Tier Data Warehouse
Distributed Data Marts
Data Mart
Data Mart
Enterprise Data Warehouse
Model refinement
Model refinement
Define a high-level corporate data model
63
Data warehouse implementation
64
Efficient Data Cube Computation
■ Data cube can be viewed as a lattice of cuboids ■ The bottom-most cuboid is the base cuboid ■ The top-most cuboid (apex) contains only one cell ■ How many cuboids in an n-dimensional cube with L
levels?
■ Materialization of data cube
■ Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)
■ Selection of which cuboids to materialize
■ Based on size, sharing, access frequency, etc.
65
The “Compute Cube” Operator
■ Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
■
compute cube sales
()
Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
(city)
(item)
(year)
(city, item)
(city, year)
(item, year)
FROM SALES
(city, item, year)
66
CUBE BY item, city, year ■ Need compute the following Group-Bys (date, product, customer), (date,product),(date, customer), (product, customer), (date), (product), (customer) ()
Indexing OLAP Data: Bitmap Index
■
■
■
■
■ not suitable for high cardinality domains ■ A recent bit compression technique, Word-Aligned Hybrid (WAH),
Index on a particular column Each value in the column has a bit vector: bit-op is fast The length of the bit vector: # of records in the base table The i-th bit is set if the i-th row of the base table has the value for the indexed column
makes it work for high cardinality domain as well [Wu, et al. TODS’06]
Base table
Index on Region
Index on Type
67
Indexing OLAP Data: Join Indexing
• Used for cross table searches
• Most commonly used to join fact table with a dimension
table in the start schema
• A join index on city maintains for each distinct city a list of R-IDs
of the tuples recording the Sales in the city • Join indices can span multiple dimensions
68
• E.g. fact table: Sales and two dimensions city and product
69
OLAP Server Architectures
• Relational OLAP (ROLAP)
• Use relational or extended-relational DBMS to store and
manage warehouse data and OLAP middle ware
• Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
• Multidimensional OLAP (MOLAP)
• Greater scalability
• Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer) • Flexibility, e.g., low level: relational, high-level: array
• Specialized SQL servers (e.g., Redbricks)
• Sparse array-based multidimensional storage engine • Fast indexing to pre-computed summarized data
• Specialized support for SQL queries over star/snowflake
71
schemas
Summary
■ Data warehousing: A multi-dimensional model of a data warehouse
■ A data cube consists of dimensions & measures ■ Star schema, snowflake schema, fact constellations ■ OLAP operations: drilling, rolling, slicing, dicing and pivoting
■ Data Warehouse Architecture, Design, and Usage
■ Multi-tiered architecture
■ Business analysis design framework
■
■
Information processing, analytical processing, data mining
■
Implementation: Efficient computation of data cubes ■ Partial vs. full vs. no materialization
■ OLAP query processing ■ OLAP servers: ROLAP, MOLAP, HOLAP
72
Indexing OALP data: Bitmap index and join index
References (I)
■
S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB’96 ■ D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data
warehouses. SIGMOD’97
■ R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases.
■
■
■
■
■
■
■
ICDE’97 S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997 E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July 1993. J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997. A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999. J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107, 1998. V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD’96 J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97
73
References (II)
■ C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design:
Relational and Dimensional Techniques. John Wiley, 2003 ■ W. H. Inmon. Building the Data Warehouse. John Wiley, 1996 ■ R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to
■
Dimensional Modeling. 2ed. John Wiley, 2002 P. O’Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record, 24:8–11, Sept. 1995. P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97
■ ■ Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
■
http://www.microsoft.com/data/oledb/olap, 1998 S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94 A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00.
■ ■ D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with
■
■
■
aggregation using views. VLDB'96 P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987. J. Widom. Research problems in data warehousing. CIKM’95 K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans. on Database Systems (TODS), 31(1): 1-38, 2006
74
Thank you for your attention!!!
75