RECOVERY

Motivation
- Atomicity: Transactions may abort ("Rollback").
- Durability: What if the DBMS stops running? (Causes?)

Types of Failure
- Transaction failure
- System failure
- Media failure
  * managed by having redundant media
    + mirrored disks for log files
    + RAID

Handling the Buffer Pool
- Force write to disk at commit?
  * Poor response time.
  * But provides durability.
- NO FORCE (enforcing Durability is hard)
  * What if the system crashes before a modified page is written to disk?
  * Write info, in a convenient place, at commit time, to support REDOing modifications.
- Steal buffer-pool frames from uncommitted Xacts?
  * If not, poor throughput.
  * If so, how can we ensure atomicity?
- STEAL (enforcing Atomicity is hard)
  * To steal frame F: the current page in F (say P) is written to disk; some Xact T holds an X lock on P.
  * What if T aborts?
  * Must remember the old value of P at steal time (to support UNDOing the write to page P).

Basic Idea: Logging
- Record REDO and UNDO information, for every update, in a log.
- Sequential writes to the log (put it on a separate disk).
- Minimal info (diff) written to log, so multiple updates fit in a single log page.
- Log: an ordered list of REDO/UNDO actions.
- Log record contains <XID, pageID, offset, length, old data, new data> and additional control info (see the log-record sketch below).

Log Data
- Log records can be logical (e.g. "inserted a tuple t into relation R"), or physical (e.g. "byte 74 of page 255 used to be 'r' and now is 's'").
- Log data is "physiological" -- i.e. some is physical (e.g. B-tree page splits), and some is logical (e.g. heap-tuple insertions).

Write-Ahead Logging (WAL)
- The Write-Ahead Logging Protocol:
  * #1: Must force the log record for an update before the corresponding data page gets to disk.
  * #2: Must write all log records for a Xact before commit.
- #1 guarantees Atomicity.
- #2 guarantees Durability.

WAL & the Log
- Each log record has a unique Log Sequence Number (LSN).
  * LSNs are always increasing.
- Each data page contains a pageLSN.
  * The LSN of the most recent log record for an update to that page.
- The system keeps track of flushedLSN.
  * The max LSN flushed so far.
- WAL: before a page is written, ensure that pageLSN <= flushedLSN (see the WAL sketch below).

Log Records
Possible log record types:
- Update
- Commit
- Abort
- End (signifies end of commit or abort)
- Compensation Log Records (CLRs)
  * for UNDO actions

Other Log-Related State
- Transaction Table:
  * One entry per active Xact.
  * Contains TID, status (prepared/unprepared), and lastLSN (the last LSN written by the Xact).
- Dirty Page Table:
  * One entry per dirty page in the buffer pool.
  * Contains recLSN -- the LSN of the log record that first caused the page to be dirty.

Checkpointing
- To speed up recovery, it's nice to have "checkpoint" records that limit the amount of log that needs to be processed during recovery. It can be tricky to do efficient checkpoints.
- Write to log:
  * begin_checkpoint record
  * end_checkpoint record: contains the current Xact table and dirty page table.
- Collecting all this info may take time. This is a "fuzzy checkpoint": the tables are not recorded atomically (see the checkpoint sketch below).
- Store the LSN of the checkpoint record in a safe place (the master record).
- No attempt is made to force dirty pages to disk; the effectiveness of a checkpoint is limited by the oldest unwritten change to a dirty page.
- So it's a good idea to flush dirty pages to disk periodically.

Summary of Logging/Recovery
- The Recovery Manager guarantees Atomicity & Durability.
- Use WAL to allow STEAL/NO-FORCE without sacrificing correctness.
- LSNs identify log records; records are linked into backward chains per transaction (via prevLSN).
- pageLSN allows comparison of data page and log records.
- Checkpointing: a quick way to limit the amount of log to scan on recovery.
- Recovery works in 3 phases (sketched below):
  * Analysis: forward from checkpoint.
  * Redo: forward from oldest recLSN.
  * Undo: backward from the end to the first LSN of the oldest Xact alive at the crash.
- Upon Undo, write CLRs.
- Redo "repeats history": simplifies the logic!
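A minimal Python sketch of the log-record fields and the per-Xact backward chain described above. The field names, the in-memory list standing in for the log disk, and the flush bookkeeping are illustrative assumptions, not a real implementation:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical log-record layout following the fields listed above;
# an LSN here is simply the record's index in an in-memory list.
@dataclass
class LogRecord:
    lsn: int                  # unique, monotonically increasing
    prev_lsn: Optional[int]   # previous record by the same Xact (backward chain)
    xid: int
    kind: str                 # 'UPDATE', 'COMMIT', 'ABORT', 'END', 'CLR', ...
    page_id: Optional[int] = None
    offset: int = 0
    before: bytes = b""       # old data: UNDO information
    after: bytes = b""        # new data: REDO information

class Log:
    def __init__(self):
        self.records = []      # in-memory tail, stands in for stable storage
        self.flushed_lsn = -1  # max LSN known to be on stable storage
        self.last_lsn = {}     # per-Xact lastLSN (as in the Xact table)

    def append(self, xid, kind, **fields):
        lsn = len(self.records)
        rec = LogRecord(lsn, self.last_lsn.get(xid), xid, kind, **fields)
        self.records.append(rec)
        self.last_lsn[xid] = lsn   # extend this Xact's backward chain
        return lsn

    def flush(self, upto_lsn):
        # Force the log through upto_lsn (sketch: just advance the watermark).
        self.flushed_lsn = max(self.flushed_lsn, upto_lsn)
```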
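Building on the Log sketch above, here is how the two WAL rules might be enforced at steal and commit time; the Frame class and the elided disk write are stand-ins:

```python
# Hypothetical buffer-pool frame; the WAL checks are the point of the sketch.
class Frame:
    def __init__(self, page_id):
        self.page_id = page_id
        self.page_lsn = -1    # pageLSN: latest logged update to this page
        self.dirty = False

def steal_frame(frame, log):
    """WAL rule #1: force the log up to pageLSN before the page hits disk."""
    if frame.page_lsn > log.flushed_lsn:
        log.flush(frame.page_lsn)
    assert frame.page_lsn <= log.flushed_lsn
    # ... the actual disk write of the page image would go here ...
    frame.dirty = False

def commit(xid, log):
    """WAL rule #2: all of the Xact's log records reach disk before commit."""
    lsn = log.append(xid, 'COMMIT')
    log.flush(lsn)   # flushing through the COMMIT record also covers
                     # every earlier record of this Xact
    log.append(xid, 'END')
```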
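A fuzzy-checkpoint sketch in the same vein: bracket an approximate copy of the two tables between begin/end records and remember the checkpoint's LSN in a master record. The dictionary shapes and the pseudo-Xact id 0 are assumptions:

```python
def take_checkpoint(log, xact_table, dirty_page_table, master_record):
    # xid 0 is an illustrative pseudo-Xact for system records.
    begin_lsn = log.append(xid=0, kind='BEGIN_CHECKPOINT')
    # Other Xacts keep running while the tables are copied, so the
    # snapshot is only approximately current -- a "fuzzy checkpoint".
    snapshot = {
        'xacts': dict(xact_table),              # TID -> (status, lastLSN)
        'dirty_pages': dict(dirty_page_table),  # pageID -> recLSN
    }
    end_lsn = log.append(xid=0, kind='END_CHECKPOINT',
                         after=repr(snapshot).encode())
    log.flush(end_lsn)
    # The master record lives in a well-known, safe disk location.
    master_record['checkpoint_lsn'] = begin_lsn
```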
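Finally, a deliberately simplified three-phase recovery sketch over the same structures. Real ARIES also records undoNextLSN in CLRs, compares pageLSNs before redoing, and treats committed-but-unended Xacts specially; apply_redo and apply_undo are hypothetical page-level helpers stubbed out here:

```python
def apply_redo(rec):   # hypothetical: reapply rec.after to the page
    pass

def apply_undo(rec):   # hypothetical: restore rec.before on the page
    pass

def recover(log, master_record):
    # --- Analysis: scan forward from the checkpoint, rebuilding the
    # Xact table (loser Xacts) and the Dirty Page Table.
    losers, dirty = {}, {}
    for rec in log.records[master_record['checkpoint_lsn']:]:
        if rec.kind in ('BEGIN_CHECKPOINT', 'END_CHECKPOINT'):
            continue
        losers[rec.xid] = rec.lsn                 # lastLSN seen so far
        if rec.kind == 'END':
            del losers[rec.xid]                   # finished: not a loser
        if rec.kind == 'UPDATE' and rec.page_id not in dirty:
            dirty[rec.page_id] = rec.lsn          # recLSN
    # --- Redo: "repeat history" forward from the oldest recLSN.
    start = min(dirty.values(), default=len(log.records))
    for rec in log.records[start:]:
        if rec.kind in ('UPDATE', 'CLR'):
            apply_redo(rec)
    # --- Undo: walk each loser's prevLSN chain backward, logging CLRs.
    for xid, last_lsn in losers.items():
        lsn = last_lsn
        while lsn is not None:
            rec = log.records[lsn]
            if rec.kind == 'UPDATE':
                log.append(xid, 'CLR', page_id=rec.page_id, after=rec.before)
                apply_undo(rec)
            lsn = rec.prev_lsn
        log.append(xid, 'END')
```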
Replication is Important
- Used extensively for performance, particularly when reads dominate writes.
- Also used to enhance availability.
- Critical for disaster recovery:
  * a two-safe copy is expensive, but replicates the current state exactly
  * a one-safe copy is cheaper, but could lose recent transactions

Replication Challenge
- Updates have to be reflected at multiple sites.
- What are the options?
  * Eager vs. lazy
  * Group vs. master

Eager/Lazy
- Eager replication is transactional
  * needed for serializability
  * expensive
- Lazy replication allows copies to diverge
  * eventually, updates do get propagated
  * meanwhile, xacts may have read a stale value of the copy, and these xacts must be "reconciled"

Group/Master
- Master (or primary copy) replication has one site responsible for any data item.
  * The master site always has the "true" value of the data item.
  * All updates must be performed (eagerly) at the master.
- Group replication permits updates "anywhere" to any copy.
  * More complex reconciliation task.

The Replication Challenge
- Replication looks fine on small demos, but dies when you scale up.
- Deadlock/reconciliation rates grow steeply -- cubically in the number of nodes, as detailed below -- with the replication factor.
- This gets even worse with disconnected operation (mobile computers).

Eager Master
- All updates are serialized through the master copy transactionally.
- No performance benefits due to distribution.
- May seem bad, but is actually better than the other 3 options.
- If there are different masters for different objects, then the problems are the same as for Eager Group.

Eager Group
- Every update generates an update at all nodes.
- Deadlocks grow like nodes^3, assuming DB size does not change.
- High replication factors will bring the system to its knees.
- Distributed xacts may run longer, and deadlocks grow as actions^5.

Lazy Master Replication
- Master updates are transactional.
- The master must be read-locked for an up-to-date read value, if serializability is desired.
- Slave updates to replicas are not transactional.
- The deadlock rate, considering only master updates, grows as nodes^2.

Lazy Group Replication
- A xact that would wait under eager replication needs to be reconciled here.
- So reconciliation is much more common than deadlock.
- Reconciliations grow like nodes^3.
  * (The denominator is much smaller than in the deadlock rate, so reconciliations happen far more often.)
- With disconnected operation, reconciliations grow with disconnection time (but "only" as nodes^2).

Real Products Today
- Lazy group replication is the only viable answer.
- Postpone reconciliations and focus on achieving eventual consistency (convergence).
- Lotus Notes uses timestamps for this purpose, and suffers from the potential of lost updates (see the sketch below).
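The Lotus Notes point can be made concrete with a tiny last-writer-wins sketch: timestamp-per-item merging makes replicas converge, but one of two concurrent updates is silently dropped. All names and values here are illustrative:

```python
# Each replica stores (value, timestamp) per item; merging keeps the
# entry with the newest timestamp ("last writer wins").
def merge(local, remote):
    for key, (val, ts) in remote.items():
        if key not in local or ts > local[key][1]:
            local[key] = (val, ts)   # newer write wins

# Two replicas update the same item while out of touch:
a = {'x': ('v0', 0)}
b = {'x': ('v0', 0)}
a['x'] = ('from-a', 10)
b['x'] = ('from-b', 11)
merge(a, b)
merge(b, a)
assert a == b                    # replicas converge, but...
assert a['x'][0] == 'from-b'     # ...replica a's update was silently lost
```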
Two-Tier Replication
- The world consists of:
  * base nodes: always connected, few in number
  * mobile nodes: often disconnected, large in number
- Each data item has a master (which could be on a mobile node).
- Base transactions work only with master data items, and update masters and all base copies transactionally.
- Tentative transactions are executed at mobile nodes while disconnected.
- Mobile nodes contain 2 versions of objects:
  * a (possibly stale) master version
  * a tentative version
- Upon reconnect (sketched below):
  * tentative versions are removed
  * tentative xacts are rerun as base xacts; before committing the base xacts, an acceptance criterion is used to make sure the results are close enough to the original tentative versions
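A sketch of the reconnect step under stated assumptions: each tentative xact records the result it computed while disconnected, is rerun against the master state, and commits only if an application-defined acceptance criterion holds. The 10% slack, the account example, and all names are invented for illustration; a real system would also roll back rejected reruns:

```python
def acceptable(base_result, tentative_result, slack=0.10):
    """Acceptance criterion (invented for illustration): the rerun's
    result must be within 10% of what the tentative xact computed."""
    return abs(base_result - tentative_result) <= slack * abs(tentative_result or 1)

def reconnect(tentative_xacts, master_state):
    committed, rejected = [], []
    for xact in tentative_xacts:                 # rerun each as a base xact
        base_result = xact['op'](master_state)   # NOTE: sketch skips rollback
        if acceptable(base_result, xact['tentative_result']):
            committed.append(xact)               # commit at the masters
        else:
            rejected.append(xact)                # originating user is notified
    return committed, rejected

# A mobile node debited 20 against a stale balance of 100 (result: 80),
# but the true master balance had meanwhile dropped to 80:
master = {'balance': 80}

def debit_20(state):
    state['balance'] -= 20
    return state['balance']

tentative = [{'op': debit_20, 'tentative_result': 80}]
print(reconnect(tentative, master))   # 60 vs. 80 is outside 10% -> rejected
```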