What is it?
  Emerging hot area linking DBMSs with AI and statistics.
  Goal: figure out "interesting" things about the data in the warehouse.

Applications
  Marketing
  Fraud detection
  Science (astronomy, remote sensing, protein sequencing)
  NBA stats (IBM Advanced Scout)

Techniques
  Clustering/Segmentation: group together similar items (separating dissimilar items).
  Data Summarization: find compact descriptions of subsets of the data.
  Change and Deviation Analysis: find significant changes in sequences of data (e.g. stock data). Sequence-oriented.
  Predictive Modeling: predict the value of some attributes based on the values of others (and previous "training").
    e.g. classification: assign items to pre-specified categories.
  Dependency Modeling: model joint probability densities (relationships among columns).
    * Important example: Association Rules.

Classification
  Ask the minimum number of questions needed to classify each instance into one of n (usually 2) categories.
  The next question depends on the answer to the current question.
  Build a decision tree. There are many algorithms for building good trees.

Clustering
  What is a good cluster?
  k-means (and other "spherical" clustering algorithms)
  Link-based clustering algorithms
  How many clusters should be formed?
  Top-down vs. bottom-up clustering

Association Rule Problem
  Given a "purchase" transaction as a "basket" of items purchased together:
  Association Rule: X ==> Y (X and Y are disjoint sets of items)
    confidence c: c% of the transactions that contain X also contain Y (rule-specific)
    support s: s% of all transactions contain both X and Y (relative to all the data)
  Goal: efficiently find all rules with support > minsup and confidence > minconf.
    * Anecdote -- diapers & beer

A Priori Intuition
  Two phases:
    * first find all large itemsets (those with sufficient support)
    * then find the rules with sufficient confidence
  Use the monotonicity property:
    * Every subset of a large itemset must itself be large.
    * So consider an itemset a candidate for largeness only if all its subsets (of size one smaller than itself) are large.
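The support/confidence definitions above can be sketched directly in a few lines of Python. This is a toy illustration of the definitions only, not a mining algorithm; the basket contents are made up (echoing the diapers & beer anecdote):

```python
# Toy transaction database; items are illustrative, not from real data.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(X, Y, baskets):
    """Of the baskets that contain X, the fraction also containing Y."""
    return support(X | Y, baskets) / support(X, baskets)

# Rule {diapers} ==> {beer}
print(support({"diapers"} | {"beer"}, baskets))    # s = 2/4 = 0.5
print(confidence({"diapers"}, {"beer"}, baskets))  # c = 2/3, about 0.67
```

Note that support is measured relative to all transactions, while confidence is conditional on the antecedent X, matching the "rule-specific" vs. "relative to all data" distinction above.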
A Priori Algorithm

  L0 = { {} }  // the unique itemset with zero items in it
  for (k = 1; Lk-1 != empty; k++) {
      Ck = apriori-gen(Lk-1);    // candidate k-itemsets
      forall transactions t in the database {
          Ct = subset(Ck, t);    // candidates from Ck contained in t
          forall candidates c in Ct
              c.count++;
      }
      Lk = { c in Ck | c.count >= minsup }
  }

apriori-gen(Lk-1), for k > 1:
  join:
    insert into Ck
    select p.item1, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1 and ... and p.itemk-2 = q.itemk-2
      and p.itemk-1 < q.itemk-1
  prune:
    delete itemsets such that some (k-1)-subset is not in Lk-1

Implementing Subset: the hash tree
  Leaves contain a list of itemsets.
  Internal nodes contain a hash table, with each bucket pointing to a child.
  The root is at depth 1. When inserting at an interior node at depth d, choose the child by applying the hash function to the dth item in the itemset.

Variation: AprioriTid
  Generates a new "database" at each step k, recording for each transaction the candidate k-itemsets it contains.
  * Use this in place of the original database for the next step.
  Benefit: the database gets fewer rows at each stage.
  Cost: the generated databases may be big.

Hybrid: AprioriHybrid
  Run Apriori for a while, then switch to AprioriTid when the generated DB would fit in memory.
  AprioriHybrid is fastest, beating older algorithms and matching Apriori and AprioriTid in the cases where either one wins.

The Second Phase
  Consider every subset A of every large itemset L.
  Check if A -> (L - A) has enough confidence.
  Hopefully inexpensive -- we already have counts for all large itemsets.
  Optimize by not considering all choices of A -- we only want to report the smallest A for which the rule holds. (If the rule holds for A, it must hold for all supersets of A.)

Why Is This Popular?
  It is statistically "light" (no background required).
  It has a relational query-processing flavor.
  Probably for both of these reasons, there have been a ton of follow-on papers (arguably too many...).
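The pseudocode above can be sketched as runnable Python. This is a simplified sketch, not the paper's implementation: candidate containment is checked by brute force rather than with a hash tree, minsup is an absolute count rather than a percentage, and all names are mine. Itemsets are sorted tuples so the join step's "first k-2 items match" condition is easy to express:

```python
from itertools import combinations

def apriori_gen(prev_large, k):
    """Join pairs of (k-1)-itemsets sharing their first k-2 items,
    then prune candidates with a (k-1)-subset that is not large."""
    prev = sorted(prev_large)
    candidates = set()
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:   # join step
                c = p + (q[-1],)
                # prune step: every (k-1)-subset must be large
                if all(s in prev_large for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

def apriori(transactions, minsup):
    """Return all itemsets contained in at least minsup transactions."""
    items = sorted({i for t in transactions for i in t})
    large = {(i,) for i in items
             if sum(i in t for t in transactions) >= minsup}  # L1
    all_large = set(large)
    k = 2
    while large:
        candidates = apriori_gen(large, k)
        # One pass over the database to count each surviving candidate.
        counts = {c: sum(set(c) <= t for t in transactions)
                  for c in candidates}
        large = {c for c, n in counts.items() if n >= minsup}
        all_large |= large
        k += 1
    return all_large

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(sorted(apriori(baskets, 2)))
```

On this toy database every pair is large (each appears twice) but {a, b, c} appears only once, so the prune and count steps correctly stop at k = 2.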
Some Later Work
  If you have an IsA hierarchy, do association rules over it.
  * You may not be able to say that Pampers -> beer, but if you know Pampers IsA diaper and Luvs IsA diaper, maybe you'll find that diapers -> beer.
  Quantitative association rules
  * "10% of married people between age 50 and 60 have at least 2 cars."
  Online association rules: you should be able to change the support and confidence thresholds on the fly.
  Incremental computation of association rules as the base data changes.
  More efficient computation algorithms.
  Parallel versions of all of the above.
  Definition variants.