What is it?
  Emerging hot area linking DBMSs with AI and statistics.
  Goal: figure out "interesting" things about the data in the warehouse.

Applications
  Marketing
  Fraud detection
  Science (astronomy, remote sensing, protein sequencing)
  NBA stats (IBM Advanced Scout)

Techniques
  Clustering/Segmentation: group together similar items (separating dissimilar items).
  Data Summarization: find compact descriptions of subsets of the data.
  Change and Deviation Analysis: find significant changes in sequences of data (e.g. stock data). Sequence-oriented.
  Predictive Modeling: predict the value of some attributes based on the values of others (and previous "training").
    e.g. classification: assign items to pre-specified categories.
  Dependency Modeling: model joint probability densities (relationships among columns).
    * Important example: Association Rules.

Classification
  Ask the minimum number of questions needed to classify each instance into one of n (usually 2) categories.
  The next question depends on the answer to the current question.
  Build a decision tree. There are many algorithms for building good trees.

Clustering
  What is a good cluster?
  k-means (and other "spherical" clustering algorithms)
  Link-based clustering algorithms
  How many clusters should be formed?
  Top-down vs. bottom-up clustering

Association Rule Problem
  Given a "purchase" transaction as a "basket" of items purchased together:
  Association Rule: X ==> Y (X and Y are disjoint sets of items)
    confidence c: c% of the transactions that contain X also contain Y (rule-specific)
    support s: s% of all transactions contain both X and Y (relative to all the data)
  Goal: efficiently find all rules with support > minsup and confidence > minconf.
    * Anecdote -- diapers & beer

A Priori Intuition
  Two phases:
    * first find all large itemsets (those with sufficient support)
    * then find the rules with sufficient confidence
  Use the monotonicity property:
    * Every subset of a large itemset must itself be large.
    * So consider an itemset a candidate for largeness only if all its subsets (of size one smaller than itself) are large.
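The support/confidence definitions above can be sketched directly in a few lines of Python. This is a toy illustration of the definitions only, not a mining algorithm; the basket contents are made up (echoing the diapers & beer anecdote):

```python
# Toy transaction database; items are illustrative, not from real data.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(X, Y, baskets):
    """Of the baskets that contain X, the fraction also containing Y."""
    return support(X | Y, baskets) / support(X, baskets)

# Rule {diapers} ==> {beer}
print(support({"diapers"} | {"beer"}, baskets))    # s = 2/4 = 0.5
print(confidence({"diapers"}, {"beer"}, baskets))  # c = 2/3, about 0.67
```

Note that support is measured relative to all transactions, while confidence is conditional on the antecedent X, matching the "rule-specific" vs. "relative to all data" distinction above.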
A Priori Algorithm

  L0 = { {} }  // the unique itemset with zero items in it
  for (k = 1; Lk-1 != empty; k++) {
      Ck = apriori-gen(Lk-1);    // candidate k-itemsets
      forall transactions t in the database {
          Ct = subset(Ck, t);    // candidates from Ck contained in t
          forall candidates c in Ct
              c.count++;
      }
      Lk = { c in Ck | c.count >= minsup }
  }

apriori-gen(Lk-1), for k > 1:
  join:
    insert into Ck
    select p.item1, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1 and ... and p.itemk-2 = q.itemk-2
      and p.itemk-1 < q.itemk-1
  prune:
    delete itemsets such that some (k-1)-subset is not in Lk-1

Implementing Subset: the hash tree
  Leaves contain a list of itemsets.
  Internal nodes contain a hash table, with each bucket pointing to a child.
  The root is at depth 1. When inserting at an interior node at depth d, choose the child by applying the hash function to the dth item in the itemset.

Variation: AprioriTid
  Generates a new "database" at each step k, recording for each transaction the candidate k-itemsets it contains.
  * Use this in place of the original database for the next step.
  Benefit: the database gets fewer rows at each stage.
  Cost: the generated databases may be big.

Hybrid: AprioriHybrid
  Run Apriori for a while, then switch to AprioriTid when the generated DB would fit in memory.
  AprioriHybrid is fastest, beating older algorithms and matching Apriori and AprioriTid in the cases where either one wins.

The Second Phase
  Consider every subset A of every large itemset L.
  Check if A -> (L - A) has enough confidence.
  Hopefully inexpensive -- we already have counts for all large itemsets.
  Optimize by not considering all choices of A -- we only want to report the smallest A for which the rule holds. (If the rule holds for A, it must hold for all supersets of A.)

Why Is This Popular?
  It is statistically "light" (no background required).
  It has a relational query-processing flavor.
  Probably for both of these reasons, there have been a ton of follow-on papers (arguably too many...).
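The pseudocode above can be sketched as runnable Python. This is a simplified sketch, not the paper's implementation: candidate containment is checked by brute force rather than with a hash tree, minsup is an absolute count rather than a percentage, and all names are mine. Itemsets are sorted tuples so the join step's "first k-2 items match" condition is easy to express:

```python
from itertools import combinations

def apriori_gen(prev_large, k):
    """Join pairs of (k-1)-itemsets sharing their first k-2 items,
    then prune candidates with a (k-1)-subset that is not large."""
    prev = sorted(prev_large)
    candidates = set()
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:   # join step
                c = p + (q[-1],)
                # prune step: every (k-1)-subset must be large
                if all(s in prev_large for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

def apriori(transactions, minsup):
    """Return all itemsets contained in at least minsup transactions."""
    items = sorted({i for t in transactions for i in t})
    large = {(i,) for i in items
             if sum(i in t for t in transactions) >= minsup}  # L1
    all_large = set(large)
    k = 2
    while large:
        candidates = apriori_gen(large, k)
        # One pass over the database to count each surviving candidate.
        counts = {c: sum(set(c) <= t for t in transactions)
                  for c in candidates}
        large = {c for c, n in counts.items() if n >= minsup}
        all_large |= large
        k += 1
    return all_large

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(sorted(apriori(baskets, 2)))
```

On this toy database every pair is large (each appears twice) but {a, b, c} appears only once, so the prune and count steps correctly stop at k = 2.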
Some Later Work
  If you have an IsA hierarchy, do association rules over it.
  * You may not be able to say that Pampers -> beer, but if you know Pampers IsA diaper and Luvs IsA diaper, maybe you'll find that diapers -> beer.
  Quantitative association rules
  * "10% of married people between age 50 and 60 have at least 2 cars."
  Online association rules: you should be able to change the support and confidence thresholds on the fly.
  Incremental computation of association rules as the base data changes.
  More efficient computation algorithms.
  Parallel versions of all of the above.
  Definition variants.