Data Warehouse
Two major classes of DB workloads:
* OLTP (OnLine Transaction Processing)
  - Teller machines
  - Reservation systems
  - Tracking systems
  - ...
* Decision Support queries - use a data warehouse
OLTP was the primary challenge for DBMSs; it is now largely a solved problem.

Common Terms
- A "data warehouse" is an organization-wide snapshot of data, typically used for decision-making.
- A DBMS that runs these decision-making queries efficiently is sometimes called a "Decision Support System" (DSS).
- A "data mart" is a mini-warehouse -- typically an extract from the warehouse for a particular application or organizational unit.

Warehouse/DSS Properties
- Very large: 100 gigabytes to many terabytes (or as big as you can go).
- Tends to include historical data.
- Workload: mostly complex queries that access lots of data and do many scans, joins, and aggregations.

More Terms
- One class of DSS queries is called On-Line Analytic Processing (OLAP), to distinguish it from the more typical batch-processed query.
- OLAP can be implemented as Relational OLAP (ROLAP), directly on the RDBMS, or as Multi-dimensional OLAP (MOLAP), an extract from the RDBMS stored in a separate "multi-dimensional" database.

Warehouse Loading
- Data is loaded from multiple data sources, so it needs translation/integration.
- Frequency of load is an issue; once a night is typical.
- A snapshot source is the common case, but we could also have:
  - Cooperative (active DB) source
  - Queryable source
  - Logged source

Data Preparation on Load
- Data Migration: simple transformation rules (e.g., replace "gender" with "sex").
- Data Scrubbing: use domain-specific knowledge (e.g., zip codes) to modify data; try parsing and fuzzy matching against multiple sources.
- Data Auditing: discover rules and relationships in the data (or signal violations thereof).

Load Options
- Bring down the DB and load the new data.
  - Typical in commercial installations today. But a data load can take a long time! (Sorting, indexing, summarization, constraint checking, ...)
- Incremental loading makes sense for big warehouses, but the transaction model is more complex: the load must be broken into many transactions that commit periodically, to avoid locking everything. Be careful to keep metadata & indices consistent along the way.

Typical Warehouse Schema
- One "fact" table, with one record per (real-life) transaction.
  - E.g., productid, time, location, units sold, total.
- Multiple auxiliary (dimension) tables.
  - Create a location hierarchy.
  - Relate productid to supplier, warehouse, ...
  - Relate supplier to country.
- (A SQL sketch of such a schema appears in the example slides below.)

Star Schema
- The fact table sits in the middle, joined directly to one (denormalized) table per dimension.

Snowflake Schema
- Like a star schema, but the dimension tables are further normalized into hierarchies of tables (e.g., product -> supplier -> country).

Multi-Dimensional Data Model
- To facilitate analysis and visualization, data is often modeled multidimensionally.
- Think n-dimensional spreadsheet rather than relational table; e.g., the dimensions are time, location, product, ...
- Dimensions can be organized hierarchically into more detail.
- For each "grid point", store numeric measure(s): units sold, revenue, etc.
- This is called a "Data Cube". The number of dimensions is typically 3-6.

Roll Up / Drill Down
- Use a dimension's hierarchy to "roll up" information in that dimension.
  - E.g., time_of_sale may be "rolled up" into day-month-quarter-year: store numeric measures per quarter rather than per day.
  - Provides appropriately summarized info (see the example query below).
- "Drill down" is the opposite of "roll up": break a measure into its component pieces.

MOLAP
- Multidimensional OLAP actually stores things in a multi-d format; special index structures are used for this.
- Much of the cube is empty! (No sales of purple Chevy trucks in June in Reno.) So identify dense regions.
- Essentially everything is precomputed.
- Products: Essbase (Arbor), Express (Oracle), Lightship (Pilot).
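Example: Star Schema in SQL
A minimal sketch of the warehouse schema described above. All table and column names here are illustrative assumptions, not taken from the notes or any particular system.

  -- Dimension tables.
  CREATE TABLE product (
    productid   INTEGER PRIMARY KEY,
    name        VARCHAR(40),
    supplierid  INTEGER   -- in a snowflake schema this would reference a
                          -- separate supplier table (which names a country)
  );

  CREATE TABLE location (
    locationid  INTEGER PRIMARY KEY,
    city        VARCHAR(40),
    state       VARCHAR(20),
    country     VARCHAR(20)  -- the location hierarchy used for roll-up
  );

  -- Fact table: one row per real-life sales transaction.
  CREATE TABLE sales (
    productid    INTEGER REFERENCES product,
    locationid   INTEGER REFERENCES location,
    time_of_sale DATE,
    units_sold   INTEGER,
    total        DECIMAL(10,2)
  );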
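Example: Roll Up as SQL Aggregation
In ROLAP terms, roll-up and drill-down are just aggregation at coarser or finer granularity. A sketch against the hypothetical schema above, rolling daily sales up to per-month, per-country totals (EXTRACT is standard SQL, though some systems spell date functions differently):

  SELECT l.country,
         EXTRACT(YEAR FROM s.time_of_sale)  AS yr,
         EXTRACT(MONTH FROM s.time_of_sale) AS mo,
         SUM(s.units_sold) AS units,
         SUM(s.total)      AS revenue
  FROM sales s, location l
  WHERE s.locationid = l.locationid
  GROUP BY l.country,
           EXTRACT(YEAR FROM s.time_of_sale),
           EXTRACT(MONTH FROM s.time_of_sale);

Drilling down is the same query with a finer grouping: e.g., adding l.city, or grouping by day instead of month.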
ROLAP
- Relational OLAP uses standard relational tables & engine to do OLAP.
- Uses standard relational query processing, with lots of indexes and precomputation.
- Products: MicroStrategy, Metacube (Informix), Information Advantage.
- HOLAP (Hybrid OLAP) tries to combine the best of both, e.g., Informix/Essbase: MOLAP ADTs inside an ORDBMS.

SQL GROUP BY
  SELECT A, B, AVG(C), MAX(D)
  FROM REL
  GROUP BY A, B;
* Divides the relation into pieces and computes one aggregate tuple per piece.
* No result tuples for "pieces" that are empty.

The Need for "ALL"
  SELECT A, 'ALL', AVG(C), MAX(D)
  FROM REL
  GROUP BY A;
* Aggregates over all values of B, specifying only the A value.
* 'ALL' is a context-sensitive token: it represents the set of all values for the attribute.

The Data Cube
- Gray et al. present a ROLAP language approach to cubing.
- The full data cube is all the base data in the cube, plus all the subaggregates obtained by projection.
- CUBE as a relational operator implies the need to model projection with a magic ALL value.
- Computing the cube efficiently is a problem in its own right (Zhao/Deshpande/Naughton).

The CUBE Operator
  SELECT A, B, AVG(C), MAX(D)
  FROM REL
  GROUP BY CUBE (A, B);
- Computes the ordinary group-by (A, B),
- and also the aggregates obtained by projecting out every combination of 1, 2, ... dimensions:
  - A, 'ALL'
  - 'ALL', B
  - 'ALL', 'ALL'
- (A concrete example of how 'ALL' surfaces in SQL systems appears at the end of these notes.)

The ROLLUP Operator
Similar to CUBE, but computes rollups only in the specified attribute order.
  SELECT A, B, AVG(C), MAX(D)
  FROM REL
  GROUP BY ROLLUP (A, B);
Computes the groupings:
  - A, B
  - A, 'ALL'
  - 'ALL', 'ALL'

We can mix and match the above:
  SELECT ...
  FROM REL
  GROUP BY A, ROLLUP (B, C), CUBE (D, E);
This query computes answers for the following groupings (* stands for 'ALL'):
  - ABCDE, ABCD*, ABC*E, ABC**
  - AB*DE, AB*D*, AB**E, AB***
  - A**DE, A**D*, A***E, A****

Query Processing in Warehouses
- Lots of joins are needed. Delay using the fact table as long as possible: compute everything else first, and join with the fact table last.
- But what if a selection must be pushed through the fact table?

Join Index
- For R1 join F (using attribute A2 of R1 in the join): given a value for attribute A1 of R1, find the tuples of F that join with the selected tuples of R1.
- In a snowflake schema, this may have to go through multiple levels of joins!

BitMap Indices
- Store a bit vector for each unique attribute value, with one entry per tuple in the relation; an entry is 1 if its tuple has the specified value for the attribute, 0 otherwise.
- Exercise: compute the total storage in a regular (value-list) index and compare with a bitmapped index; also compare access times.
- Bitmap indices are very fast to manipulate (predicates become bitwise AND/OR).
- Bit-slice indices are used when an attribute may take too many distinct values.
- Bitmap join indices are frequently used in data warehouses (see the sketch at the end of these notes).

Other Indices
- A projection index basically does column partitioning.
- A groupset index allows efficient identification of groups of rows that evaluate to a common value for a grouping predicate.

Precompute Aggregates
- There is no room to store them all, so choose the ones that would be most beneficial.
- To compute any answer, further aggregate the smallest precomputed aggregate that groups things into finer categories (i.e., one that has at least all the info you need); see the materialized-view sketch below.
- "Most beneficial" is a tradeoff of the savings in processing time against the cost (size).
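Example: CUBE and 'ALL' in SQL:1999
In SQL systems that adopted the Gray et al. operator (standardized in SQL:1999), the 'ALL' token surfaces as NULL in the result, and the GROUPING() function distinguishes a cube NULL from a NULL that was really in the data. A sketch against the hypothetical sales table from earlier:

  SELECT productid,
         locationid,
         GROUPING(productid)  AS prod_is_all,  -- 1 when productid was projected out
         GROUPING(locationid) AS loc_is_all,   -- 1 when locationid was projected out
         SUM(units_sold)      AS units
  FROM sales
  GROUP BY CUBE (productid, locationid);

The result contains one row per (productid, locationid) group, plus the (productid, 'ALL'), ('ALL', locationid), and ('ALL', 'ALL') subaggregates.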
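Example: Bitmap and Bitmap Join Indices
Bitmap index DDL is vendor-specific; the statements below use Oracle-style syntax purely as an illustration. For a feel of the storage argument: with 1 million fact tuples and an attribute taking 4 distinct values, the bitmaps occupy 4 x 1M bits = 500 KB (before compression), and evaluating a predicate is a word-at-a-time bitwise scan.

  -- One bit vector per distinct locationid value.
  CREATE BITMAP INDEX sales_loc_bm ON sales (locationid);

  -- A bitmap join index precomputes the join out to a dimension
  -- attribute, so a predicate like country = 'USA' can select fact
  -- rows directly without touching the location table at query time.
  CREATE BITMAP INDEX sales_country_bm
  ON sales (location.country)
  FROM sales, location
  WHERE sales.locationid = location.locationid;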
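Example: Precomputed Aggregate as a Materialized View
A sketch of precomputing one coarse aggregate (CREATE MATERIALIZED VIEW is common but not universal syntax). Queries at quarterly or yearly granularity can then further aggregate this view -- the smallest stored aggregate that is still finer than what they need -- instead of scanning the fact table:

  CREATE MATERIALIZED VIEW monthly_sales AS
  SELECT productid,
         locationid,
         EXTRACT(YEAR FROM time_of_sale)  AS yr,
         EXTRACT(MONTH FROM time_of_sale) AS mo,
         SUM(units_sold) AS units,
         SUM(total)      AS revenue
  FROM sales
  GROUP BY productid, locationid,
           EXTRACT(YEAR FROM time_of_sale),
           EXTRACT(MONTH FROM time_of_sale);

  -- Yearly totals per product, answered from the monthly aggregate.
  SELECT productid, yr, SUM(units) AS units
  FROM monthly_sales
  GROUP BY productid, yr;

This works because SUM is distributive; an AVG measure would have to be stored as a SUM plus a COUNT so it can be re-aggregated correctly.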