Indexing

What is an Index
* Partition data in buckets
* Label each bucket
* Use label to determine whether bucket is relevant to query
* May organize buckets into a tree.
At leaf-level, have to decide what goes in the bucket - data items or pointers 
   to them.  Also, if pointers, is it one per entry or a whole list.
Clustered vs. unclustered indices.
Sparse (clustered) indices as option.
Primary vs secondary indices.
Range queries vs. exact match queries.

Some Indices
  Hash Index
  Inverted file
  Trie
  B Tree
  * Basic properties:
    - high fanout
    - balanced
    - dynamically updated.
  * Side pointers at leaves
  * B+ tree versus B tree.

B+ Tree dynamic adjustment
   * Upon node overflow
     - Split node in two.
     - Increase size of parent by one extra child ref.
     - Overflow may propagate up.
   * Upon node underflow
     -  merge with sibling
     - if merged node too big, split evenly

Concurrency Control in Indices
  * Any update that affects leaf node could propagate to the top of the tree.
  * Upper portions of tree used by everyone
    - cannot afford to allow long locks
  * Come up with special concurrency control mechanisms that work for indices.
  * E.g. ordered-tree locking protocol.
  * But now have to redesign recovery system to do the right thing.

A Generalized Search Tree
New indices needed for new apps...
- find all supersets of S
- find all molecules that bind to M
- your favorite query here (multimedia?)
...and for new queries over old domains:
- find all points in region from 12 to 1 o?clock
- find all strings that match R. E.
 
Extensible, efficient search tree structure
* Balanced tree of (p, ptr) pairs
  - each pair is an entry E
* p is a key ?predicate?
* p holds for all objects below ptr
* keys on a page may overlap
* Key predicates: a user-defined class
  - This is the only extensibility required!

Key Methods
* Search
  - Consistent(E,q): evaluate E.p && q
* return a no or a maybe (not necessarily yes)
* Labeling
  - Union(P): new key that holds for all tuples in P.
* Partitioning
  - Penalty(E1,E2): penalty of inserting E2 in subtree E1
  - PickSplit(P): split P into two groups of entries

Tree Traversal
* Search
  - traverse tree where Consistent is TRUE
* Insert
  - descend tree along least increase in Penalty 
     if there's room at leaf, insert there 
     else split according to PickSplit and propagate changes using Union
* Delete
  - Find the entry via Search, and delete it.
    Propagate changes using Union