Data Warehouse
Two major classes of DB workloads:
* OLTP (OnLine Transaction Processing)
  - Teller machines
  - Reservation systems
  - Tracking systems
  - ...
* Decision Support queries - use a data warehouse
OLTP was the primary challenge for DBMSs; it is now largely a solved problem.

Common Terms
- A "data warehouse" is an organization-wide snapshot of data, typically used for decision-making.
- A DBMS that runs these decision-making queries efficiently is sometimes called a "Decision Support System" (DSS).
- A "data mart" is a mini-warehouse -- typically an extract from the warehouse for a particular application or organizational unit.

Warehouse/DSS Properties
- Very large: 100 gigabytes to many terabytes (or as big as you can go).
- Tends to include historical data.
- Workload: mostly complex queries that access lots of data and do many scans, joins, and aggregations.

More Terms
- One class of DSS queries is called On-Line Analytic Processing (OLAP), to distinguish it from the more typical batch-processed query.
- OLAP can be implemented as Relational OLAP (ROLAP), directly on the RDBMS, or as Multi-dimensional OLAP (MOLAP), an extract from the RDBMS stored in a separate "multi-dimensional" database.

Warehouse Loading
- Data is loaded from multiple data sources, so it needs translation/integration.
- Frequency of load is an issue; once a night is typical.
- A snapshot source is the common case, but we could also have:
  - Cooperative (active DB) source
  - Queryable source
  - Logged source

Data Preparation on Load
- Data Migration: simple transformation rules (e.g., replace "gender" with "sex").
- Data Scrubbing: use domain-specific knowledge (e.g., zip codes) to modify data; try parsing and fuzzy matching against multiple sources.
- Data Auditing: discover rules and relationships in the data (or signal violations thereof).

Load Options
- Bring down the DB and load the new data.
  - Typical in commercial installations today. But a data load can take a long time! (Sorting, indexing, summarization, constraint checking, ...)
- Incremental loading makes sense for big warehouses, but the transaction model is more complex: the load must be broken into many transactions that commit periodically, to avoid locking everything. Be careful to keep metadata & indices consistent along the way.

Typical Warehouse Schema
- One "fact" table, with one record per (real-life) transaction.
  - E.g., productid, time, location, units sold, total.
- Multiple auxiliary (dimension) tables.
  - Create a location hierarchy.
  - Relate productid to supplier, warehouse, ...
  - Relate supplier to country.
- (A SQL sketch of such a schema appears in the example slides below.)

Star Schema
- The fact table sits in the middle, joined directly to one (denormalized) table per dimension.

Snowflake Schema
- Like a star schema, but the dimension tables are further normalized into hierarchies of tables (e.g., product -> supplier -> country).

Multi-Dimensional Data Model
- To facilitate analysis and visualization, data is often modeled multidimensionally.
- Think n-dimensional spreadsheet rather than relational table; e.g., the dimensions are time, location, product, ...
- Dimensions can be organized hierarchically into more detail.
- For each "grid point", store numeric measure(s): units sold, revenue, etc.
- This is called a "Data Cube". The number of dimensions is typically 3-6.

Roll Up / Drill Down
- Use a dimension's hierarchy to "roll up" information in that dimension.
  - E.g., time_of_sale may be "rolled up" into day-month-quarter-year: store numeric measures per quarter rather than per day.
  - Provides appropriately summarized info (see the example query below).
- "Drill down" is the opposite of "roll up": break a measure into its component pieces.

MOLAP
- Multidimensional OLAP actually stores things in a multi-d format; special index structures are used for this.
- Much of the cube is empty! (No sales of purple Chevy trucks in June in Reno.) So identify dense regions.
- Essentially everything is precomputed.
- Products: Essbase (Arbor), Express (Oracle), Lightship (Pilot).
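Example: Star Schema in SQL
A minimal sketch of the warehouse schema described above. All table and column names here are illustrative assumptions, not taken from the notes or any particular system.

  -- Dimension tables.
  CREATE TABLE product (
    productid   INTEGER PRIMARY KEY,
    name        VARCHAR(40),
    supplierid  INTEGER   -- in a snowflake schema this would reference a
                          -- separate supplier table (which names a country)
  );

  CREATE TABLE location (
    locationid  INTEGER PRIMARY KEY,
    city        VARCHAR(40),
    state       VARCHAR(20),
    country     VARCHAR(20)  -- the location hierarchy used for roll-up
  );

  -- Fact table: one row per real-life sales transaction.
  CREATE TABLE sales (
    productid    INTEGER REFERENCES product,
    locationid   INTEGER REFERENCES location,
    time_of_sale DATE,
    units_sold   INTEGER,
    total        DECIMAL(10,2)
  );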
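Example: Roll Up as SQL Aggregation
In ROLAP terms, roll-up and drill-down are just aggregation at coarser or finer granularity. A sketch against the hypothetical schema above, rolling daily sales up to per-month, per-country totals (EXTRACT is standard SQL, though some systems spell date functions differently):

  SELECT l.country,
         EXTRACT(YEAR FROM s.time_of_sale)  AS yr,
         EXTRACT(MONTH FROM s.time_of_sale) AS mo,
         SUM(s.units_sold) AS units,
         SUM(s.total)      AS revenue
  FROM sales s, location l
  WHERE s.locationid = l.locationid
  GROUP BY l.country,
           EXTRACT(YEAR FROM s.time_of_sale),
           EXTRACT(MONTH FROM s.time_of_sale);

Drilling down is the same query with a finer grouping: e.g., adding l.city, or grouping by day instead of month.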
ROLAP
- Relational OLAP uses standard relational tables & engine to do OLAP.
- Uses standard relational query processing, with lots of indexes and precomputation.
- Products: MicroStrategy, Metacube (Informix), Information Advantage.
- HOLAP (Hybrid OLAP) tries to combine the best of both, e.g., Informix/Essbase: MOLAP ADTs inside an ORDBMS.

SQL GROUP BY
  SELECT A, B, AVG(C), MAX(D)
  FROM REL
  GROUP BY A, B;
* Divides the relation into pieces and computes one aggregate tuple per piece.
* No result tuples for "pieces" that are empty.

The Need for "ALL"
  SELECT A, 'ALL', AVG(C), MAX(D)
  FROM REL
  GROUP BY A;
* Aggregates over all values of B, specifying only the A value.
* 'ALL' is a context-sensitive token: it represents the set of all values for the attribute.

The Data Cube
- Gray et al. present a ROLAP language approach to cubing.
- The full data cube is all the base data in the cube, plus all the subaggregates obtained by projection.
- CUBE as a relational operator implies the need to model projection with a magic ALL value.
- Computing the cube efficiently is a problem in its own right (Zhao/Deshpande/Naughton).

The CUBE Operator
  SELECT A, B, AVG(C), MAX(D)
  FROM REL
  GROUP BY CUBE (A, B);
- Computes the ordinary group-by (A, B),
- and also the aggregates obtained by projecting out every combination of 1, 2, ... dimensions:
  - A, 'ALL'
  - 'ALL', B
  - 'ALL', 'ALL'
- (A concrete example of how 'ALL' surfaces in SQL systems appears at the end of these notes.)

The ROLLUP Operator
Similar to CUBE, but computes rollups only in the specified attribute order.
  SELECT A, B, AVG(C), MAX(D)
  FROM REL
  GROUP BY ROLLUP (A, B);
Computes the groupings:
  - A, B
  - A, 'ALL'
  - 'ALL', 'ALL'

We can mix and match the above:
  SELECT ...
  FROM REL
  GROUP BY A, ROLLUP (B, C), CUBE (D, E);
This query computes answers for the following groupings (* stands for 'ALL'):
  - ABCDE, ABCD*, ABC*E, ABC**
  - AB*DE, AB*D*, AB**E, AB***
  - A**DE, A**D*, A***E, A****

Query Processing in Warehouses
- Lots of joins are needed. Delay using the fact table as long as possible: compute everything else first, and join with the fact table last.
- But what if a selection must be pushed through the fact table?

Join Index
- For R1 join F (using attribute A2 of R1 in the join): given a value for attribute A1 of R1, find the tuples of F that join with the selected tuples of R1.
- In a snowflake schema, this may have to go through multiple levels of joins!

BitMap Indices
- Store a bit vector for each unique attribute value, with one entry per tuple in the relation; an entry is 1 if its tuple has the specified value for the attribute, 0 otherwise.
- Exercise: compute the total storage in a regular (value-list) index and compare with a bitmapped index; also compare access times.
- Bitmap indices are very fast to manipulate (predicates become bitwise AND/OR).
- Bit-slice indices are used when an attribute may take too many distinct values.
- Bitmap join indices are frequently used in data warehouses (see the sketch at the end of these notes).

Other Indices
- A projection index basically does column partitioning.
- A groupset index allows efficient identification of groups of rows that evaluate to a common value for a grouping predicate.

Precompute Aggregates
- There is no room to store them all, so choose the ones that would be most beneficial.
- To compute any answer, further aggregate the smallest precomputed aggregate that groups things into finer categories (i.e., one that has at least all the info you need); see the materialized-view sketch below.
- "Most beneficial" is a tradeoff of the savings in processing time against the cost (size).
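Example: CUBE and 'ALL' in SQL:1999
In SQL systems that adopted the Gray et al. operator (standardized in SQL:1999), the 'ALL' token surfaces as NULL in the result, and the GROUPING() function distinguishes a cube NULL from a NULL that was really in the data. A sketch against the hypothetical sales table from earlier:

  SELECT productid,
         locationid,
         GROUPING(productid)  AS prod_is_all,  -- 1 when productid was projected out
         GROUPING(locationid) AS loc_is_all,   -- 1 when locationid was projected out
         SUM(units_sold)      AS units
  FROM sales
  GROUP BY CUBE (productid, locationid);

The result contains one row per (productid, locationid) group, plus the (productid, 'ALL'), ('ALL', locationid), and ('ALL', 'ALL') subaggregates.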
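Example: Bitmap and Bitmap Join Indices
Bitmap index DDL is vendor-specific; the statements below use Oracle-style syntax purely as an illustration. For a feel of the storage argument: with 1 million fact tuples and an attribute taking 4 distinct values, the bitmaps occupy 4 x 1M bits = 500 KB (before compression), and evaluating a predicate is a word-at-a-time bitwise scan.

  -- One bit vector per distinct locationid value.
  CREATE BITMAP INDEX sales_loc_bm ON sales (locationid);

  -- A bitmap join index precomputes the join out to a dimension
  -- attribute, so a predicate like country = 'USA' can select fact
  -- rows directly without touching the location table at query time.
  CREATE BITMAP INDEX sales_country_bm
  ON sales (location.country)
  FROM sales, location
  WHERE sales.locationid = location.locationid;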
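Example: Precomputed Aggregate as a Materialized View
A sketch of precomputing one coarse aggregate (CREATE MATERIALIZED VIEW is common but not universal syntax). Queries at quarterly or yearly granularity can then further aggregate this view -- the smallest stored aggregate that is still finer than what they need -- instead of scanning the fact table:

  CREATE MATERIALIZED VIEW monthly_sales AS
  SELECT productid,
         locationid,
         EXTRACT(YEAR FROM time_of_sale)  AS yr,
         EXTRACT(MONTH FROM time_of_sale) AS mo,
         SUM(units_sold) AS units,
         SUM(total)      AS revenue
  FROM sales
  GROUP BY productid, locationid,
           EXTRACT(YEAR FROM time_of_sale),
           EXTRACT(MONTH FROM time_of_sale);

  -- Yearly totals per product, answered from the monthly aggregate.
  SELECT productid, yr, SUM(units) AS units
  FROM monthly_sales
  GROUP BY productid, yr;

This works because SUM is distributive; an AVG measure would have to be stored as a SUM plus a COUNT so it can be re-aggregated correctly.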