Information about this Distribution (2.2.7.rio.v1)
--------------------------------------------------

The distribution for FreeBSD-Rio can be downloaded from
http://www.eecs.umich.edu/Rio/software.html.  The distribution is
provided as a complete kernel source tree (/usr/src/sys/).  There are
not too many changes to the standard FreeBSD 2.2.7, so applying the
diffs to other versions should be straightforward.

This version of Rio is based on FreeBSD 2.2.7.  It is an alpha release.
We have used it on our desktops for about 1 year, and we have been
using it on our file server for about 1 month.  We have verified it on
thousands of induced crashes, but we have relatively little experience
with actual crashes, since FreeBSD doesn't crash very often.  As with
any alpha release of file system code, exercise caution when using
FreeBSD-Rio: try it first with data that is disposable, and do frequent
tape backups.

E-mail questions, bug reports, etc. to rio@eecs.umich.edu.

Reliable Main Memory
--------------------

The goal of the Rio (RAM I/O) project is to provide a new storage
medium: reliable main memory.  Reliable main memory is ordinary main
memory that has been made as safe as disk.

Making memory as safe as disk means protecting it from its two main
sources of failure: power loss and software crashes.  Power loss is
easily protected against with an uninterruptible power supply.
However, software errors are trickier to protect against.  Typical
operating systems assume all data in memory is lost during a crash.
This forces them to write permanent data frequently through to disk,
which causes a drastic loss in performance.

The Rio file cache enables memory to survive OS crashes as reliably as
disks do.
We accomplish this with a variety of techniques:

  * virtual memory protection to prevent accidental corruption of file
    cache data via wild stores and other bugs
  * redundant information to catalogue the contents of memory (we call
    this the registry)
  * a write-back routine (invoked during a crash) that operates in
    isolation from the rest of the system (we call this safe sync)
  * modifications to the low-level interrupt handlers to provide a
    reliable way to invoke safe sync on a deadlocked system

With these modifications, we have verified that the Rio file cache
survives OS crashes as reliably as disks do (see the papers referenced
below).  This reliability enables us to turn off all
reliability-induced writes to disk (such as those caused by 30-second
delayed writes and synchronous metadata updates).  As a result, Rio
improves performance by a factor of 3 over standard UFS (for
I/O-intensive workloads).  Note that Rio also improves reliability over
UFS, because Rio provides write-through semantics (just as if you had
mounted the file system with the sync option).  Rio is about 10x faster
than a write-through file system with equivalent reliability.

We have implemented Rio on two operating systems.  Our Digital Unix
implementation is described in ASPLOS 1996
(http://www.eecs.umich.edu/Rio/papers/rioFileCache.ps).  Our FreeBSD
implementation is described in a paper we just wrote
(http://www.eecs.umich.edu/Rio/papers/rioPC.ps).

The FreeBSD implementation differs from the Digital Unix one in how we
restore data to disk when the system crashes.  In Digital Unix, we
restore data while the system comes up (warm reboot).  In FreeBSD, we
restore data while the system goes down (safe sync).  We changed to
safe sync because PC hardware tends to reset memory during reboot.
Also, DEC Alphas have a convenient reset button that inserts a software
halt.  The corresponding button on PCs resets the entire machine
(including memory), so we had to provide another method.

The two papers also describe the design methodology we used to verify
that the resulting file system is as safe as a write-through file
system.  The safe sync in this release is the "enhanced safe sync"
mentioned in the paper rather than the BIOS safe sync, because enhanced
safe sync is more portable.

Turning On/Off Disk Writes
--------------------------

We provide a method to dynamically turn on and off reliability-induced
disk writes on a per-file-system basis.  Reliability-induced writes are
those caused by "fsync" calls and the update(4) daemon.  Calling "sync"
will still write all dirty blocks to disk.

To turn off reliability-induced disk writes, mount the file system
"async".  If the file system is mounted without the "async" flag,
reliability-induced disk writes will be on.  Note that this has nothing
to do with the "sync" flag (although it probably is inconsistent to
mount with both "sync" and "async").

For example, to turn disk writes off for the file system on /dev/wd0a:

    mount -u -o update,async /dev/wd0a

To turn them back on:

    mount -u -o update /dev/wd0a

There seems to be a bug in FreeBSD that causes a panic if you issue
these mount commands too quickly in succession.  Running sleep(1) for
a second after each mount command works around this.

If you want to use Rio as the default, add the "async" flag in the
options field for that file system in /etc/fstab.  This way, these file
systems will be mounted at boot time with reliability-induced disk
writes turned off.

Resetting the Machine
---------------------

Safe sync will be invoked automatically if the machine panics.  This
requires using a kernel that does NOT have DDB; otherwise the machine
will drop into the kernel debugger.

If the machine hangs (e.g. gets into a deadlock), press Control-Alt-r
(the 'r' stands for Rio/reboot/reset; take your pick).  The keyboard
interrupt handler detects this and calls panic (which does a safe
sync).

DON'T just turn off the power.
You really should only run FreeBSD-Rio on machines with a reliable
power supply (e.g. a UPS).

Kernel Configuration File
-------------------------

Rio requires no changes to the kernel configuration file.  Note that
you should not configure the kernel to use DDB.  If you use DDB, DDB
gets control on panic and safe sync won't work (unless you exit the
debugger).

Amount of Dirty File Data
-------------------------

The main point of Rio is to allow file writes to stay in memory.  As a
matter of policy, it's sometimes good to allow a lot of dirty file data
in memory.

FreeBSD's default policy is to only allow dirty file data to reside in
the buffer cache (unless you're using mmap'ed files).  Clean data can
live in the buffer cache or the VM cache.  Actually all file data is in
the VM cache; we say it lives in the buffer cache if it has a buffer
header.  The FreeBSD buffer cache is usually 10% of memory, and this is
wired in memory (so increasing it causes problems).

To allow more dirty file data, we changed FreeBSD's policy so that
dirty file data can live in the VM cache (no buffer header).  It will
be written out during a "sync" or when the VM cache gets too full.
This policy allows potentially all of memory to fill up with dirty file
data (so sync may take a long time).

It would probably be a good idea to write dirty file data to disk when
the machine is otherwise idle.  This shouldn't be hard (e.g. have a
daemon process call sync when the machine has been idle for 5 minutes).

Known Limitations
-----------------

The UFS file system uses ordered writes to maintain file system
consistency.  Rio does not change this basic mechanism.  Although
reliable memory makes this inexpensive to fix (e.g. with transactions),
we have not yet done so.  Symptoms of this mechanism crop up on normal
FreeBSD (and on FreeBSD-Rio).  For example, fsck during reboot will
occasionally find that link counts are wrong for a few files.
I think this is because the rename and truncate system calls manipulate
the link counts, but do not commit the change to the buffer cache with
VOP_UPDATE.  Also, fsck will complain about the file system not being
clean, or about the free bit map being wrong.  These errors are
harmless; fsck fixes them just fine.  Some day we may get around to
fixing this.

This release of Rio is over-aggressive at keeping dirty file data in
memory.  A more realistic configuration would write dirty data to disk
during idle times (see above section on "Amount of Dirty File Data").

Rio is only implemented for UFS file systems.

Detailed List of Changes
------------------------

If you're interested in the details of our changes, here is a summary
of all changes we made to the kernel.  They are tagged by a first guess
as to how the changes could be integrated into the mainstream FreeBSD
releases.

    (D) part of standard FreeBSD distribution
    (R) part of standard FreeBSD distribution, enabled by RIO flag
    (P) private to Rio project at Michigan

change i386/isa/wd.c
    (R) add wd_setup() and wd_write()

change scsi/sd.c
    (R) add sd_setup() and sd_write()

change i386/isa/syscons.c
    (R) change scgetc to call panic if it sees ctrl-alt-escape
    (D) change scintr to declare and zero ticks_since_scintr

change i386/isa/vector.s
    (D) add and register new interrupt handler for timer

change kern/kern_shutdown.c
    (R) have panic() call rio_dump (only on the first panic), using
        own stack

add kern/rio.c (D) (also add in i386/conf/files.i386)

change kern/vfs_bio.c
    (R) define NO_B_MALLOC so that buffers are always a full page
    (R) change bufinit() to call rio_init, print miscellaneous info
        and do checks
    (D?) change call to vm_map_insert to not call with MAP_NOFAULT
    (P) change vfs_update() to not sync ASYNC file systems on
        30-second sync.  Another solution would be to stop 30-second
        sync by using sysctl to change kern.update, but this would
        turn off 30-second sync for non-UFS file systems, too (e.g.
        NFS)
    (D) fix bug in vm_hold_load_pages, vm_hold_free_pages
    (R) change getnewbuf to transfer dirty file pages to VM cache
        (instead of always writing them out to disk)
    (R) change getnewbuf to not write dirty pages when buffer header
        is recycled

change kern/vfs_syscalls.c
    (P) change sync to not sync ASYNC file systems on 30-second sync
    (P) change fsync to not sync ASYNC file systems

add sys/rio.h (D)

change ufs/ffs/ffs_alloc.c
       ufs/ffs/ffs_balloc.c
       ufs/ffs/ffs_inode.c
       ufs/ffs/ffs_vfsops.c
       ufs/ufs/ufs_lookup.c
    (R) call rio routines
    (D) call bdwrite instead of bwrite for async file systems

change ufs/ffs/ffs_balloc.c
    (R) change ffs_balloc() to only set IN_CHANGE|IN_UPDATE if the
        size or block number changes

change ufs/ufs/ufs_vnops.c
    (D) have ufs_remove truncate file if link count is 0 (and not
        currently opened).

change ufs/ufs/ufs_readwrite.c
    (D) call VOP_UPDATE for all file systems (but only wait for
        completion for SYNC file systems)
    (R) call rio routines
    (D) call bdwrite instead of bwrite for async file systems
    (D) test for sync write (and call bwrite) FIRST, then check for
        async write SECOND
    (P) if file system is async, ignore the sync ioflag file
    (R) mark B_DIRTY for I/O issued by pageout
        (vnode_pager_leaf_putpages) (make sure B_DIRTY isn't used by
        anybody else)
    (R) change WRITE to not update file modification time if writing
        VMIO data
    (R) change WRITE to use bread instead of ffs_balloc to get a buf
        pointer when writing VMIO data

change vm/vm_fault.c
    (R) call rio_mmap

change vm/vm_page.c
    (R) call rio_remove

change vm/vnode_pager.c
    (D) make vnode_pager_addr able to be called by an outside
        function (rio_mmap)

change i386/i386/trap.c
    (D) change line #259 (so that write traps return the write code
        to signal handlers).  We use this in our other software
        (e.g. Vista, Discount Checking)
    (P) change syscall() to report EFAULT.  This is helpful in
        debugging Discount Checking.

change i386/i386/pmap.c
    (D) pmap_protect(): use eager update of protection instead of
        lazy (fixed in 3.0, but maybe not)
    (R) call rio_mmap when page is made writable via mprotect

change vm/vm_kern.c
    (D) change kmem_init() to map kernel text read/execute (not
        writable).  Not needed in FreeBSD 3.0, since 3.0 correctly
        avoids copy-on-write on pages not marked copy-on-write

change vm/vm_map.c
    (D) In vm_map_protect(), always change the physical page PTE,
        even if there's no change in address map protection
        settings.  May not be needed in 3.0.
    (R) change mmap() to update the modification time of a file when
        it's mmap'ed and can be written
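
Example Helper Script (Sketch)
------------------------------

The mount commands from "Turning On/Off Disk Writes" and the idle-time
write-back suggested in "Amount of Dirty File Data" could be wrapped in
a small shell script along the following lines.  This is only a sketch
and is not part of the distribution: the function names (rio_run,
rio_writes, is_idle, idle_sync_loop) and the RIO_DRY_RUN variable are
made up for illustration, and treating "idle" as a low 1-minute load
average read from uptime(1) is an assumption, as are the default
threshold and polling interval.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from FreeBSD-Rio itself.

# Run a command, or only print it when RIO_DRY_RUN is set.
rio_run() {
    if [ -n "${RIO_DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rio_writes on|off DEVICE
# Toggle reliability-induced disk writes with the README's mount commands.
rio_writes() {
    case "$1" in
        off) rio_opts=update,async ;;
        on)  rio_opts=update ;;
        *)   echo "usage: rio_writes on|off device" >&2; return 2 ;;
    esac
    rio_run mount -u -o "$rio_opts" "$2" || return 1
    # README workaround: pause so that back-to-back mount commands
    # do not trigger the panic.  Skipped in dry-run mode.
    [ -n "${RIO_DRY_RUN:-}" ] || sleep 1
}

# is_idle LOAD THRESHOLD
# Succeed when LOAD is numerically below THRESHOLD.
is_idle() {
    awk -v cur="$1" -v max="$2" 'BEGIN { exit !(cur + 0 < max + 0) }'
}

# idle_sync_loop [THRESHOLD] [IDLE_SECONDS]
# Call sync once the machine has looked idle for IDLE_SECONDS
# (default 300, i.e. the 5 minutes suggested in the README).
idle_sync_loop() {
    threshold=${1:-0.10}
    need=${2:-300}
    step=60          # polling interval in seconds (assumption)
    quiet=0          # seconds of continuous idleness seen so far
    while :; do
        # Assumption: idleness is judged by the 1-minute load average.
        load=$(uptime | sed 's/.*load averages*: *\([0-9.]*\).*/\1/')
        if is_idle "$load" "$threshold"; then
            quiet=$((quiet + step))
        else
            quiet=0
        fi
        if [ "$quiet" -ge "$need" ]; then
            rio_run sync
            quiet=0
        fi
        sleep "$step"
    done
}
```

With these functions loaded, "rio_writes off /dev/wd0a" would turn
reliability-induced writes off and "rio_writes on /dev/wd0a" would turn
them back on; setting RIO_DRY_RUN=1 first only prints the commands.
Running "idle_sync_loop 0.10 300 &" from a boot script is one way the
idle-time sync could be started.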