Energy Efficiency in Enterprise Data Centers

MPowered Data Centers: Power- and Cooling-Aware Self-Managing Data Centers

Enterprise data centers consume an alarmingly high fraction of the energy produced in the United States. The US Environmental Protection Agency estimates that data center energy consumption will reach over 100 billion kWh by 2011, 2.5% of US power generation, resulting in an estimated annual electricity cost of $7.4 billion. As much as 40% of this energy is wasted because of two key inefficiencies: (1) the substantial energy used by idle equipment that is powered on but not performing useful work, and (2) inefficiency in data-center cooling infrastructure arising from a poor match between where heat is generated and where cool air is supplied.

Concern over idle-energy waste has prompted calls for a fundamental redesign of each system component to consume energy in proportion to utilization. We propose an alternative energy-conservation approach - PowerNap - where the entire system transitions rapidly between a high-performance active state and a near-zero-power idle state in response to instantaneous load. Rather than requiring energy-proportional operation from each component, PowerNap instead calls for minimizing idle power and transition time, which are simpler optimization goals.
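
The effect of these two goals on average power can be illustrated with a back-of-the-envelope model. The sketch below is not the PowerNap design itself. It assumes a server that naps during every idle period and spends a fixed transition time at full active power each time. The function name, parameters, and numbers are hypothetical.

```python
def average_power(utilization, active_w, nap_w, transition_s, idle_period_s):
    """Average power (watts) of a server that naps whenever it is idle.

    Hypothetical model: each idle period of length idle_period_s spends
    transition_s entering and leaving the nap state at full active power,
    and the remainder of the period at nap power.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    if idle_period_s <= 0:
        raise ValueError("idle_period_s must be positive")
    # Fraction of each idle period lost to transitions (capped at 1).
    wasted = min(1.0, transition_s / idle_period_s)
    idle_w = wasted * active_w + (1.0 - wasted) * nap_w
    return utilization * active_w + (1.0 - utilization) * idle_w


# Illustrative numbers only: 20% utilization, 300 W active, 10 W nap,
# 1 ms transitions, 100 ms idle periods.
print(average_power(0.2, 300.0, 10.0, 0.001, 0.1))
```

Under this model, average power approaches the energy-proportional ideal (utilization times active power) as nap power and transition time shrink, without any individual component being energy-proportional.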

To address cooling inefficiencies, our team is developing a data-center-wide management system that controls IT equipment, power, and cooling infrastructure in real time to save energy in two ways. First, it actively consolidates computing tasks onto fewer systems, allowing idle systems to be powered down. Second, it moves computing tasks to systems that can be cooled most efficiently. This control system, dubbed the Data Center Hypervisor (DCH), monitors the environmental conditions in the data center and shifts virtualized workloads among physical servers while adjusting data center environmental controls. DCH saves energy by increasing CRAC set points, reducing blower speeds, or taking portions of the data center out of service in response to changing load.
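
The consolidation step can be sketched as a bin-packing problem. The example below uses first-fit decreasing, a standard heuristic chosen here for illustration. It is not the DCH scheduling algorithm, and it ignores the cooling-efficiency criterion entirely. The function name and the integer "percent of one server" load units are assumptions.

```python
def consolidate(loads, capacity):
    """Pack workload demands onto servers using first-fit decreasing.

    loads: workload demands (hypothetical units: percent of one server).
    capacity: demand a single server can host.
    Returns a list of servers, each a list of the loads placed on it.
    """
    servers = []
    for load in sorted(loads, reverse=True):
        if load > capacity:
            raise ValueError("workload exceeds server capacity")
        for server in servers:
            # Place the workload on the first server with enough headroom.
            if sum(server) + load <= capacity:
                server.append(load)
                break
        else:
            # No existing server fits, so power on another one.
            servers.append([load])
    return servers


placement = consolidate([50, 20, 40, 30, 10], capacity=100)
print(len(placement), placement)
```

Any server left out of the returned placement hosts no work and is a candidate to be powered down.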

This project is supported by the National Science Foundation under grant CCF-0811320, and equipment donations from Intel and Mentor Graphics (Flomerics).

Maelstrom: Real-Time Tracking of Data Center Thermal Topology

The key challenge facing the Data Center Hypervisor's cooling-aware scheduling algorithm is that the airflow dynamics in a data center are complicated by the many high-velocity air streams produced by equipment fans - hot exhaust from a server may travel several meters before it is ingested by another server. Hence, hot spots may not correlate with utilization density. To manage server loads and cooling parameters effectively, the DCH requires an accurate model of the airflow dynamics of the data center. To this end, we propose Maelstrom, a computational fluid dynamics (CFD) model that tracks data center thermal topology from epoch to epoch in real time with the assistance of sensor readings, and can provide the DCH with estimates of the thermal response to "what-if" scenarios in its epoch strategy.

This project is supported by the National Science Foundation under grant CCF-0811320, and software/equipment donations from Intel and Mentor Graphics (Flomerics).

Beyond Solid State Disks: Saving Energy with Flash in Enterprise Systems

Energy efficiency is rapidly becoming a key constraint in the design of enterprise systems. By 2011, yearly data center energy consumption in the United States is projected to grow to over 100 billion kWh at a cost of $7.4 billion. As much as 40% of this energy is consumed by DRAM and disks. Portable consumer devices, where battery life has long been a key concern, instead use faster and more energy-efficient Flash storage. To exploit Flash's energy, latency, and bandwidth advantages in the enterprise market, storage vendors have recently announced high-capacity Flash solid-state disks (SSDs). However, because they are accessed through archaic block-device interfaces designed for legacy rotating disks, SSDs fail to fully exploit the low latency and high bandwidth Flash can provide. Furthermore, replacing conventional disks with SSDs does not address the growing power consumption of servers' DRAM. In this project, we propose further opportunities to save energy with Flash in enterprise systems. Instead of placing Flash behind traditional I/O interfaces, we integrate Flash with the server's memory system, making it directly accessible within processors' physical address space.

This project is supported by the National Science Foundation under grant CNS-0834403, and gifts from HP Labs.

Multiprocessor Memory System Design

Polymorphic Multicore Cache Architecture

The semiconductor industry has hit a wall - chip-level power and cooling constraints have slowed the march of clock frequency, forcing industry to instead bet on multicore to provide energy-efficient performance scalability. Although the multicore trend poses daunting challenges for application developers, it also creates new opportunities unavailable in traditional multi-chip multiprocessors: the drastic change in the relative costs of on-chip communication and computation enables application designs with tightly-coupled threads and frequent sharing that would prove latency- and bandwidth-prohibitive in traditional multiprocessors. Unfortunately, current multicore memory systems are inflexible and poorly suited to support coordinated execution, as they provide no direct means for core-to-core communication or for optimizing data placement on chip. Moreover, intra-chip access patterns vary drastically across applications - there is no one-size-fits-all static cache architecture. We are designing the Polymorphic Multicore Cache Architecture (PMCA) - a modular on-chip cache design where software configures primitive hardware mechanisms to provide a cache architecture suited to a specific workload.

Spatio-Temporal Memory Streaming (STeMS)

http://www.ece.cmu.edu/~stems

While advances in semiconductor fabrication have enabled phenomenal increases in processor speeds, advances in DRAM fabrication have primarily increased density, providing only modest improvements in access latency. Conventional processors bridge the processor/memory performance gap with an on-chip cache hierarchy where each level provides progressively slower access to a larger subset of data. However, as the processor/memory performance gap grows and on-chip storage capacity increases, the simple heuristic policies of conventional caches (e.g., LRU replacement) are becoming less effective at preventing processor stalls due to off-chip accesses. The performance penalty of ineffective cache management is particularly acute in commercial server applications, where frequent traversals of linked-data structures result in long chains of dependent off-chip misses. Instead of accessing data individually upon processor request, we propose Spatio-Temporal Memory Streaming (STeMS), a memory system design where data are managed in the form of spatio-temporal memory streams: data groups whose accesses are correlated in space or time. STeMS dynamically constructs streams from memory access sequences that exhibit a repetitive layout in memory (spatial correlation) or that recur over the course of program execution (temporal correlation). By fetching stream elements in parallel using recorded history, STeMS increases memory-level parallelism for both independent and dependent access sequences. By throttling stream transfer to stay ahead of processor requests, STeMS hides main memory access latency while improving utilization of pin bandwidth and on-chip storage.
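
The idea of temporal correlation can be illustrated with a toy software model. The sketch below is not the STeMS hardware design. It covers only temporal correlation and omits spatial correlation and throttling. It assumes a simple scheme that logs miss addresses in order and, when a miss recurs, returns the addresses that followed it last time as prefetch candidates. The class name, method name, and `depth` parameter are hypothetical.

```python
class TemporalStreamer:
    """Toy model of temporal streaming over a recorded miss history."""

    def __init__(self, depth=4):
        self.depth = depth     # number of stream elements fetched per lookup
        self.history = []      # miss addresses in the order they occurred
        self.last_seen = {}    # address -> index of its most recent miss

    def on_miss(self, addr):
        """Record a miss and return the addresses that followed it last time."""
        prev = self.last_seen.get(addr)
        stream = []
        if prev is not None:
            # These addresses can be fetched in parallel, even if the program
            # would reach them only through a chain of dependent loads.
            stream = self.history[prev + 1 : prev + 1 + self.depth]
        self.last_seen[addr] = len(self.history)
        self.history.append(addr)
        return stream


streamer = TemporalStreamer(depth=3)
for addr in (0x10, 0x80, 0x44, 0x20):   # first traversal: nothing recorded yet
    streamer.on_miss(addr)
print([hex(a) for a in streamer.on_miss(0x10)])  # second traversal begins
```

On the second traversal, the first miss alone identifies the rest of the recorded sequence, which is how recorded history can expose parallelism in a pointer-chasing access pattern.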

Performance Evaluation Methodology

SimFlex

http://www.ece.cmu.edu/~simflex

Computer architects have long relied on software simulation to measure dynamic performance metrics (e.g., CPI) of a proposed design. Unfortunately, detailed software simulators have become four or more orders of magnitude slower than their hardware counterparts, rendering hardware measurement methodologies impracticable for simulation studies of large-scale commercial server systems. The SimFlex project is advancing the state of the art in simulation tools and measurement methodology to enable fast, accurate, and flexible simulation of large-scale systems. SimFlex combines component-based software design, full-system simulation, statistical sampling, and simulation state checkpointing to enable rapid system evaluation with commercially relevant benchmark applications, such as online transaction processing databases, while validating performance estimates with statistical measures of confidence. The SimFlex project has recently released two new computer architecture simulation tools to the academic/industrial community. TurboSMARTSim integrates rigorous statistical sampling methodology with live-points, a per-benchmark library of minimal reusable machine state, to accelerate microarchitecture simulation turnaround by 250x over previous simulation sampling approaches while maintaining high accuracy and confidence in estimates. Flexus is a family of component-based C++ computer architecture simulators that enable full-system timing-accurate simulation of uni- and multiprocessor systems running unmodified commercial applications and operating systems.
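
The role of statistical sampling can be illustrated with a short calculation. The sketch below is not code from SimFlex, TurboSMARTSim, or Flexus. It assumes CPI has been measured over a set of short, independently sampled simulation windows and computes a normal-approximation confidence interval for the mean. The function name and sample values are hypothetical.

```python
import math
import statistics


def cpi_estimate(samples, z=1.96):
    """Return (mean, half_width) of an approximate confidence interval.

    z=1.96 corresponds to roughly 95% confidence under a normal
    approximation, which is reasonable only for a large number of samples.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    # Standard error of the mean, from the sample standard deviation.
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, z * stderr


# Made-up per-window CPI measurements, for illustration only.
mean, half_width = cpi_estimate([1.0, 2.0, 3.0, 2.0])
print(f"CPI = {mean:.2f} +/- {half_width:.2f}")
```

Because the half-width shrinks with the square root of the number of samples, a sampled simulation can stop as soon as the interval is tight enough rather than simulating the entire benchmark in detail.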