Why Benchmarks

* Figure out how well a (database) system performs.
* Results are specific to the hardware configuration and the specific parameters tuned.
* The user cares about how well "my application" will do -- which system should I buy?
* Cannot test with the actual application: too complex, too costly.
* Use standard benchmark numbers as a guide.

Required Characteristics

* Must be simple to specify and implement.
* Must be representative of some class of real applications.
* Must be carefully specified -- cheating should be hard. Companies consider bragging rights very important, and will often tune their product to suit the benchmark, at times even at the cost of other applications.

Three Benchmarks

* TP
* WISS (Wisconsin)
* XMark

Transaction Processing

* The most famous benchmark in databases.
* Proposed by Jim Gray and friends, but published as by "anon et al." so as to be able to dodge some bullets.
* Since codified by the Transaction Processing Performance Council as TPC-A.
* TPC-A and TPC-B are no longer interesting benchmarks, since you can get multiple xacts per dollar.

TPC Benchmarks

* TPC-C deals with a relatively complex operational database.
* TPC-D dealt with a data warehouse; it evolved into TPC-R (business reporting) and TPC-H (ad hoc queries).
* TPC-W deals with a web back-end database.
* Visit http://www.tpc.org -- test-data-generating code and benchmark specifications are available for free download.
* All benchmarks have a "scale factor" as a central feature.

TP1 Benchmark

* Supposedly a stylized statement of a cash-withdrawal (or check-cashing) transaction. Banks complain that it is not quite how they do it in practice.
* Not a water-tight spec; the TPC fixed much of this.
* But used very widely, and central to RDBMS development in the early days.

How to Measure Things

* Time: wall-clock time on an unloaded system.
* Cost: compute the 5-year cost of ownership, using straight-line depreciation and a zero interest rate, to determine the cost per second of execution time on the system.
* Commercial benchmarks typically do not count cost per second, but rather total system cost.
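The cost metric above is easy to state concretely. Here is a minimal sketch, assuming straight-line depreciation over 5 years and a zero interest rate as stated; the function name and the $1,000,000 example price are illustrative, not from the benchmark specs.

```python
# Cost-per-second metric: 5-year cost of ownership, straight-line
# depreciation, zero interest rate -- every second of the system's
# lifetime costs the same amount.

SECONDS_PER_YEAR = 365 * 24 * 3600  # ignoring leap years for simplicity

def cost_per_second(total_system_cost, years=5):
    """Dollars per second of execution time on the system."""
    return total_system_cost / (years * SECONDS_PER_YEAR)

# e.g. a hypothetical $1,000,000 system:
print(f"${cost_per_second(1_000_000):.6f} per second")
```

Dividing a transaction's elapsed time by this figure gives its dollar cost, which is how price/performance numbers like $/TPS were derived.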
* Costs not counted: outages, software development, ...

The Benchmarks

* Sort: sort one million 100-byte records stored on disk, using the first ten bytes as the key.
* Scan: scan one million 100-byte records in 1000 equal transactions, each of which reads and writes back 1000 records, with locking.

The DebitCredit Benchmark

* Transactions per second: TPS = the number of DebitCredit transactions that can be run on the system per second, with a response time of less than 1 second for 95% of the xacts.
* Do not worry about system start-up, crash recovery, etc. (though all are required).
* Rewards simplicity -- you don't pay a performance penalty for fancy functions.

100 TPS Setup

* 10,000 tellers, with 100 sec. think time each.
* 1000 branches.
* 10,000,000 accounts.
* One 100-byte record for each of the above, in 3 separate tables, randomly accessed.
* One 50-byte history record per transaction, in a 10 GB sequential file.

The Transaction

* Read message from terminal.
* Read-modify-write account.
* Write history.
* Read-modify-write teller.
* Read-modify-write branch.
* Write message to terminal.
* Pick a branch, and a teller in that branch, randomly. Pick a random account in that branch 85% of the time.

Wisconsin Benchmark

* An engineer's benchmark, as opposed to a user's benchmark.
* Does not model any application -- rather, it is a stylized synthetic database with queries designed to test specific features of the database system.
* Minimizes randomness by deriving attributes in a stylized way.
* Actually predates the TP1 benchmark (1983 vs. 1985)!!

XML Benchmarks

* What is the application? What is the data set?
* Some suggestions, notably XMark. http://www.xml-benchmark.org/
* Customizable benchmarks -- ToXgene: http://www.cs.toronto.edu/tox/toxgene (used in XBench at Waterloo).
* MBench is an engineer's benchmark for XML, designed after the Wisconsin benchmark but, of necessity, more complex. http://www.eecs.umich.edu/db/mbench

Parallelism

* Measure how much performance improves by using n processors. Ideally you would like it to grow by a factor of n.
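The DebitCredit transaction's database work can be sketched as below. This is an illustrative in-memory toy, not the benchmark implementation: tables are plain dicts, the counts are scaled down from the 100 TPS setup, and all names are assumptions. It does show the read-modify-write sequence and the 85% same-branch account rule.

```python
import random

# Scaled-down toy configuration (the real 100 TPS setup is 1000 branches,
# 10,000 tellers, 10,000,000 accounts).
N_BRANCHES, TELLERS_PER_BRANCH, ACCOUNTS_PER_BRANCH = 10, 10, 1_000

branches = {b: 0 for b in range(N_BRANCHES)}
tellers  = {t: 0 for t in range(N_BRANCHES * TELLERS_PER_BRANCH)}
accounts = {a: 0 for a in range(N_BRANCHES * ACCOUNTS_PER_BRANCH)}
history  = []  # one (50-byte) history record per transaction

def debit_credit(delta):
    # Pick a branch, and a teller in that branch, randomly.
    branch = random.randrange(N_BRANCHES)
    teller = branch * TELLERS_PER_BRANCH + random.randrange(TELLERS_PER_BRANCH)
    if random.random() < 0.85:  # 85%: account in the teller's own branch
        account = branch * ACCOUNTS_PER_BRANCH + random.randrange(ACCOUNTS_PER_BRANCH)
    else:                       # 15%: any account
        account = random.randrange(len(accounts))
    accounts[account] += delta                        # read-modify-write account
    history.append((branch, teller, account, delta))  # write history
    tellers[teller]  += delta                         # read-modify-write teller
    branches[branch] += delta                         # read-modify-write branch
    return account

debit_credit(-100)  # e.g. a $100 withdrawal
```

Terminal I/O, locking, and durability are omitted here, but they are exactly what the real benchmark stresses.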
* In practice, it usually grows by something less than that.
* Speedup = single-processor time / multi-processor time. Compare this against the number of processors used.
* Scaleup = elapsed time measured as the problem size is scaled linearly with the number of processors.
* Sizeup = elapsed time measured as the problem size is scaled linearly (with no change to the hardware configuration).
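The two ratio metrics above can be sketched directly; the example timings are made-up numbers, not measurements.

```python
def speedup(t_single, t_multi):
    """Same problem on 1 vs. n processors: ideally speedup == n."""
    return t_single / t_multi

def scaleup(t_base, t_scaled):
    """Problem size and processor count both multiplied by n:
    ideally elapsed time stays flat, i.e. scaleup == 1."""
    return t_base / t_scaled

# e.g. 100 s on 1 processor vs. 30 s on 4 processors:
print(speedup(100.0, 30.0))  # ~3.33, below the ideal of 4
```

Sizeup is the same ratio as scaleup, except that only the problem size grows while the hardware stays fixed, so elapsed time is expected to rise roughly linearly rather than stay flat.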