Query Optimization

- Relational algebra equivalences are used to manipulate query expressions, e.g. commutativity, associativity, the cascading rule, ...
- Manipulate the query tree to obtain equivalent alternatives, and choose the cheapest.
- Some changes are always good, e.g. pushing selections in: σ_{R.a=5}(R ⋈ S) ≡ σ_{R.a=5}(R) ⋈ S shrinks the join input.
- Most changes may be good or bad depending on the data.
- For a single relation, we can access through a scan or an index. Use index-only access where possible.
- Left-deep plans only for joins (see the enumeration sketch at the end of these notes).
  - The access method for the left (outer) relation is chosen independently of the join, but we need to consider the ordering of its output.
  - The access method for the right (inner) relation is chosen based on the join. Again, alternative output orders must be considered.
- Bushy plans are (sometimes) considered in commercial systems. Their costs are harder to estimate -- resource (buffer) sharing is an issue. Lots of heuristics.

Cost Estimation -- The Big Picture

- Cost-based query optimization chooses the plan with EXPECTED least cost.
- The expectation is formed on the basis of meta-data.
- Costs are driven by the sizes of (intermediate result) data sets, which are unknown.
- Do not forget the impact of data organization on cost.
- Estimate unknown costs somehow.

Meta-Data

- Use meta-data (data statistics) for estimation.
- In DB design/tuning, there is a tradeoff in deciding how much statistical info to store.
  - Fewer stats can cause poor estimates, and hence a poor choice of query plan.
  - More stats require more storage overhead, and also introduce more maintenance overhead.
- What information do you keep in meta-data?
  - Data structure
  - Data statistics
- The System Catalog is a special set of tables:
  - The Schema Table has one row for each attribute in each relation, with domain info.
  - The Storage Table maps relations to physical storage, file ids, etc.
  - The Index Table has index name, type of structure, relation indexed, search key, etc.
  - Also, info on integrity constraints, on triggers, on views, ...

The Selinger Paper

- Introduced the idea of cost-based query optimization, including simple cost estimation rules.
- Many nice simplifications, e.g. cost = page_fetches + w * tuples_processed.
- Simple guesses at selectivities (sketched in code at the end of these notes):
  - 1/10 if you don't know any better (or 1/3 for ranges, or 1/2 for set inclusion).
  - 1/cardinality (assumes a uniform distribution).
  - Treat predicates as independent probabilities to compute more complex formulae.
  - 1/max(cardinalities) for column equality (join condition) selectivity.
- Key decisions include:
  - Only binary joins.
  - Only left-deep plans.
  - Only single-block queries.

Data Statistics

- Sizes of (base) relations - pretty basic!!
- Number of distinct values for (each) attribute, possibly obtained from an index.
- Range of values for (each) attribute.
- Single-attribute histograms (construction sketched at the end of these notes):
  - Equi-width
  - Equi-depth
  - End-biased (compressed)
  - Optimal
- Multi-attribute histograms - needed to capture correlations.

Using Statistics

- Reconstructing counts for: Attr = val, Attr > val, Attr IN set, Attr1 = Attr2 (sketched at the end of these notes).
- Uniformity assumption within each histogram bucket.
- Sensible combination rules, using the independence assumption, for selectivities a and b:
  - Intersection: a * b
  - Negation: 1 - a
  - Union: a + b - a * b
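
Sketch: left-deep plan enumeration. The left-deep restriction above is what makes System R-style enumeration tractable: only a base relation may appear on the inner side, so a bottom-up dynamic program over subsets of relations suffices. The Python below is a minimal sketch under toy assumptions: the function name, the card/join_sel inputs, and the cost model (sum of intermediate result sizes, not the paper's page/CPU formula) are all illustrative, and interesting orders are ignored.

    from itertools import combinations

    def best_left_deep(relations, card, join_sel):
        # card: relation name -> row count
        # join_sel(left_names, inner_name) -> selectivity estimate
        # best maps a frozenset of joined relations to
        # (join order, cumulative cost, estimated output rows).
        best = {frozenset([r]): ((r,), 0.0, float(card[r])) for r in relations}
        for size in range(2, len(relations) + 1):
            for subset in combinations(relations, size):
                s = frozenset(subset)
                for inner in subset:          # single base relation on the right
                    left = s - {inner}
                    order, cost, rows = best[left]
                    out = rows * card[inner] * join_sel(left, inner)
                    cand = (order + (inner,), cost + out, out)
                    if s not in best or cand[1] < best[s][1]:
                        best[s] = cand
        return best[frozenset(relations)]

    # Toy usage with a flat selectivity guess:
    card = {"R": 1000, "S": 100, "T": 10}
    order, cost, rows = best_left_deep(["R", "S", "T"], card,
                                       lambda left, inner: 0.01)

To respect interesting orders, the table would instead keep one best plan per (subset, output order) pair rather than one per subset.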
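
Sketch: Selinger-style estimation rules. Read literally, the rules listed under "The Selinger Paper" turn into a handful of constants. A minimal sketch, with hypothetical function names and optional distinct-count arguments; only the constants 1/10, 1/3, 1/2, 1/cardinality, and 1/max come from the notes.

    def cost(page_fetches, tuples_processed, w=0.5):
        # cost = page_fetches + w * tuples_processed; w trades CPU work
        # against I/O (0.5 is an arbitrary placeholder, w is tunable).
        return page_fetches + w * tuples_processed

    def eq_selectivity(distinct=None):
        # Attr = val: 1/cardinality when the distinct count is known
        # (uniformity assumed), else the 1/10 default guess.
        return 1.0 / distinct if distinct else 1.0 / 10

    def range_selectivity():
        # Attr > val with no usable statistics: the 1/3 guess.
        return 1.0 / 3

    def in_selectivity():
        # Attr IN set with no usable statistics: the 1/2 guess.
        return 1.0 / 2

    def join_selectivity(distinct1=None, distinct2=None):
        # Attr1 = Attr2: 1/max(cardinalities) of the two join columns.
        cards = [c for c in (distinct1, distinct2) if c]
        return 1.0 / max(cards) if cards else 1.0 / 10

Treating predicates as independent, a conjunction such as Attr1 = v AND Attr2 > w is then estimated as eq_selectivity(d1) * range_selectivity().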
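
Sketch: histogram construction. The first two histogram variants differ only in what they hold fixed: equi-width fixes the value range per bucket, while equi-depth fixes the tuple count per bucket, so skewed regions get narrower buckets and better resolution. A minimal sketch, assuming a numeric column with at least as many values as buckets; the ((low, high), count) bucket format is an illustrative choice.

    def equi_width(values, n_buckets):
        # Equal value ranges per bucket; counts vary with the data.
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_buckets or 1.0   # guard: constant column
        counts = [0] * n_buckets
        for v in values:
            i = min(int((v - lo) / width), n_buckets - 1)
            counts[i] += 1
        return [((lo + i * width, lo + (i + 1) * width), counts[i])
                for i in range(n_buckets)]

    def equi_depth(values, n_buckets):
        # (Roughly) equal counts per bucket; value ranges vary with the data.
        s = sorted(values)
        depth = len(s) // n_buckets
        buckets = []
        for i in range(n_buckets):
            chunk = s[i * depth:(i + 1) * depth] if i < n_buckets - 1 else s[i * depth:]
            buckets.append(((chunk[0], chunk[-1]), len(chunk)))
        return buckets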
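
Sketch: reconstructing counts. To reconstruct counts as in "Using Statistics", assume values are spread uniformly within each bucket; complex predicates then combine via independence. This sketch reuses the ((low, high), count) format from the histogram sketch above; the distinct_per_bucket parameter is an illustrative assumption (real systems track distinct counts per bucket or per column).

    def estimate_eq(hist, val, distinct_per_bucket):
        # Attr = val: the containing bucket's count, spread uniformly
        # over the distinct values assumed to live in that bucket.
        for (lo, hi), count in hist:
            if lo <= val <= hi:
                return count / distinct_per_bucket
        return 0.0

    def estimate_gt(hist, val):
        # Attr > val: whole buckets above val, plus a uniform fraction
        # of the bucket that straddles val.
        total = 0.0
        for (lo, hi), count in hist:
            if lo > val:
                total += count
            elif hi > val:
                frac = (hi - val) / (hi - lo) if hi > lo else 1.0
                total += count * frac
        return total

    def estimate_in(hist, vals, distinct_per_bucket):
        # Attr IN set: sum of the point estimates over the members.
        return sum(estimate_eq(hist, v, distinct_per_bucket) for v in vals)

    # Combination rules under independence, for selectivities a and b:
    def and_sel(a, b): return a * b          # intersection
    def not_sel(a):    return 1.0 - a        # negation
    def or_sel(a, b):  return a + b - a * b  # union

The remaining case, Attr1 = Attr2 (the join condition), falls back to the 1/max(cardinalities) rule sketched earlier.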