Query Optimization

- Relational algebra equivalences are used to manipulate query expressions, e.g. commutativity, associativity, the cascading rule, ...
- Manipulate the query tree to obtain equivalent alternatives, and choose the cheapest.
- Some changes are always good, e.g. pushing selections in: σ_{R.a=5}(R ⋈ S) ≡ σ_{R.a=5}(R) ⋈ S shrinks the join input.
- Most changes may be good or bad depending on the data.
- For a single relation, we can access through a scan or an index. Use index-only access where possible.
- Left-deep plans only for joins (see the enumeration sketch at the end of these notes).
  - The access method for the left (outer) relation is chosen independently of the join, but we need to consider the ordering of its output.
  - The access method for the right (inner) relation is chosen based on the join. Again, alternative output orders must be considered.
- Bushy plans are (sometimes) considered in commercial systems. Their costs are harder to estimate -- resource (buffer) sharing is an issue. Lots of heuristics.

Cost Estimation -- The Big Picture

- Cost-based query optimization chooses the plan with EXPECTED least cost.
- The expectation is formed on the basis of meta-data.
- Costs are driven by the sizes of (intermediate result) data sets, which are unknown.
- Do not forget the impact of data organization on cost.
- Estimate unknown costs somehow.

Meta-Data

- Use meta-data (data statistics) for estimation.
- In DB design/tuning, there is a tradeoff in deciding how much statistical info to store.
  - Fewer stats can cause poor estimates, and hence a poor choice of query plan.
  - More stats require more storage overhead, and also introduce more maintenance overhead.
- What information do you keep in meta-data?
  - Data structure
  - Data statistics
- The System Catalog is a special set of tables:
  - The Schema Table has one row for each attribute in each relation, with domain info.
  - The Storage Table maps relations to physical storage, file ids, etc.
  - The Index Table has index name, type of structure, relation indexed, search key, etc.
  - Also, info on integrity constraints, on triggers, on views, ...

The Selinger Paper

- Introduced the idea of cost-based query optimization, including simple cost estimation rules.
- Many nice simplifications, e.g. cost = page_fetches + w * tuples_processed.
- Simple guesses at selectivities (sketched in code at the end of these notes):
  - 1/10 if you don't know any better (or 1/3 for ranges, or 1/2 for set inclusion).
  - 1/cardinality (assumes a uniform distribution).
  - Treat predicates as independent probabilities to compute more complex formulae.
  - 1/max(cardinalities) for column equality (join condition) selectivity.
- Key decisions include:
  - Only binary joins.
  - Only left-deep plans.
  - Only single-block queries.

Data Statistics

- Sizes of (base) relations - pretty basic!!
- Number of distinct values for (each) attribute, possibly obtained from an index.
- Range of values for (each) attribute.
- Single-attribute histograms (construction sketched at the end of these notes):
  - Equi-width
  - Equi-depth
  - End-biased (compressed)
  - Optimal
- Multi-attribute histograms - needed to capture correlations.

Using Statistics

- Reconstructing counts for: Attr = val, Attr > val, Attr IN set, Attr1 = Attr2 (sketched at the end of these notes).
- Uniformity assumption within each histogram bucket.
- Sensible combination rules, using the independence assumption, for selectivities a and b:
  - Intersection: a * b
  - Negation: 1 - a
  - Union: a + b - a * b
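
Sketch: left-deep plan enumeration. The left-deep restriction above is what makes System R-style enumeration tractable: only a base relation may appear on the inner side, so a bottom-up dynamic program over subsets of relations suffices. The Python below is a minimal sketch under toy assumptions: the function name, the card/join_sel inputs, and the cost model (sum of intermediate result sizes, not the paper's page/CPU formula) are all illustrative, and interesting orders are ignored.

    from itertools import combinations

    def best_left_deep(relations, card, join_sel):
        # card: relation name -> row count
        # join_sel(left_names, inner_name) -> selectivity estimate
        # best maps a frozenset of joined relations to
        # (join order, cumulative cost, estimated output rows).
        best = {frozenset([r]): ((r,), 0.0, float(card[r])) for r in relations}
        for size in range(2, len(relations) + 1):
            for subset in combinations(relations, size):
                s = frozenset(subset)
                for inner in subset:          # single base relation on the right
                    left = s - {inner}
                    order, cost, rows = best[left]
                    out = rows * card[inner] * join_sel(left, inner)
                    cand = (order + (inner,), cost + out, out)
                    if s not in best or cand[1] < best[s][1]:
                        best[s] = cand
        return best[frozenset(relations)]

    # Toy usage with a flat selectivity guess:
    card = {"R": 1000, "S": 100, "T": 10}
    order, cost, rows = best_left_deep(["R", "S", "T"], card,
                                       lambda left, inner: 0.01)

To respect interesting orders, the table would instead keep one best plan per (subset, output order) pair rather than one per subset.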
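
Sketch: Selinger-style estimation rules. Read literally, the rules listed under "The Selinger Paper" turn into a handful of constants. A minimal sketch, with hypothetical function names and optional distinct-count arguments; only the constants 1/10, 1/3, 1/2, 1/cardinality, and 1/max come from the notes.

    def cost(page_fetches, tuples_processed, w=0.5):
        # cost = page_fetches + w * tuples_processed; w trades CPU work
        # against I/O (0.5 is an arbitrary placeholder, w is tunable).
        return page_fetches + w * tuples_processed

    def eq_selectivity(distinct=None):
        # Attr = val: 1/cardinality when the distinct count is known
        # (uniformity assumed), else the 1/10 default guess.
        return 1.0 / distinct if distinct else 1.0 / 10

    def range_selectivity():
        # Attr > val with no usable statistics: the 1/3 guess.
        return 1.0 / 3

    def in_selectivity():
        # Attr IN set with no usable statistics: the 1/2 guess.
        return 1.0 / 2

    def join_selectivity(distinct1=None, distinct2=None):
        # Attr1 = Attr2: 1/max(cardinalities) of the two join columns.
        cards = [c for c in (distinct1, distinct2) if c]
        return 1.0 / max(cards) if cards else 1.0 / 10

Treating predicates as independent, a conjunction such as Attr1 = v AND Attr2 > w is then estimated as eq_selectivity(d1) * range_selectivity().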
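
Sketch: histogram construction. The first two histogram variants differ only in what they hold fixed: equi-width fixes the value range per bucket, while equi-depth fixes the tuple count per bucket, so skewed regions get narrower buckets and better resolution. A minimal sketch, assuming a numeric column with at least as many values as buckets; the ((low, high), count) bucket format is an illustrative choice.

    def equi_width(values, n_buckets):
        # Equal value ranges per bucket; counts vary with the data.
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_buckets or 1.0   # guard: constant column
        counts = [0] * n_buckets
        for v in values:
            i = min(int((v - lo) / width), n_buckets - 1)
            counts[i] += 1
        return [((lo + i * width, lo + (i + 1) * width), counts[i])
                for i in range(n_buckets)]

    def equi_depth(values, n_buckets):
        # (Roughly) equal counts per bucket; value ranges vary with the data.
        s = sorted(values)
        depth = len(s) // n_buckets
        buckets = []
        for i in range(n_buckets):
            chunk = s[i * depth:(i + 1) * depth] if i < n_buckets - 1 else s[i * depth:]
            buckets.append(((chunk[0], chunk[-1]), len(chunk)))
        return buckets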
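
Sketch: reconstructing counts. To reconstruct counts as in "Using Statistics", assume values are spread uniformly within each bucket; complex predicates then combine via independence. This sketch reuses the ((low, high), count) format from the histogram sketch above; the distinct_per_bucket parameter is an illustrative assumption (real systems track distinct counts per bucket or per column).

    def estimate_eq(hist, val, distinct_per_bucket):
        # Attr = val: the containing bucket's count, spread uniformly
        # over the distinct values assumed to live in that bucket.
        for (lo, hi), count in hist:
            if lo <= val <= hi:
                return count / distinct_per_bucket
        return 0.0

    def estimate_gt(hist, val):
        # Attr > val: whole buckets above val, plus a uniform fraction
        # of the bucket that straddles val.
        total = 0.0
        for (lo, hi), count in hist:
            if lo > val:
                total += count
            elif hi > val:
                frac = (hi - val) / (hi - lo) if hi > lo else 1.0
                total += count * frac
        return total

    def estimate_in(hist, vals, distinct_per_bucket):
        # Attr IN set: sum of the point estimates over the members.
        return sum(estimate_eq(hist, v, distinct_per_bucket) for v in vals)

    # Combination rules under independence, for selectivities a and b:
    def and_sel(a, b): return a * b          # intersection
    def not_sel(a):    return 1.0 - a        # negation
    def or_sel(a, b):  return a + b - a * b  # union

The remaining case, Attr1 = Attr2 (the join condition), falls back to the 1/max(cardinalities) rule sketched earlier.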