Energy Efficiency in Enterprise Data Centers
|
Multiprocessor Memory System Design

Polymorphic Multicore Cache Architecture

The semiconductor industry has hit a wall: chip-level power and cooling constraints have slowed the march of clock frequency, forcing the industry to bet instead on multicore designs for energy-efficient performance scalability. Although the multicore trend poses daunting challenges for application developers, it also creates opportunities unavailable in traditional multi-chip multiprocessors. The drastic change in the relative costs of on-chip communication and computation enables application designs with tightly coupled threads and frequent sharing that would be latency- and bandwidth-prohibitive in traditional multiprocessors. Unfortunately, current multicore memory systems are inflexible and poorly suited to coordinated execution, as they provide no direct means for core-to-core communication or for optimizing data placement on chip. Moreover, intra-chip access patterns vary drastically across applications, so there is no one-size-fits-all static cache architecture. We are designing the Polymorphic Multicore Cache Architecture (PMCA), a modular on-chip cache design in which software configures primitive hardware mechanisms to provide a cache architecture suited to a specific workload.

Spatio-Temporal Memory Streaming (STeMS)
http://www.ece.cmu.edu/~stems

While advances in semiconductor fabrication have enabled phenomenal increases in processor speeds, advances in DRAM fabrication have primarily increased density, providing only modest improvements in access latency. Conventional processors bridge the processor/memory performance gap with an on-chip cache hierarchy where each level provides progressively slower access to a larger subset of data.
However, as the processor/memory performance gap grows and on-chip storage capacity increases, the simple heuristic policies of conventional caches (e.g., LRU replacement) are becoming less effective at preventing processor stalls due to off-chip accesses. The performance penalty of ineffective cache management is particularly acute in commercial server applications, where frequent traversals of linked data structures result in long chains of dependent off-chip misses. Instead of accessing data individually upon processor request, we propose Spatio-Temporal Memory Streaming (STeMS), a memory system design where data are managed in the form of spatio-temporal memory streams: groups of data whose accesses are correlated in space or time. STeMS dynamically constructs streams from memory access sequences that exhibit a repetitive layout in memory (spatial correlation) or that recur over the course of program execution (temporal correlation). By fetching stream elements in parallel using recorded history, STeMS increases memory-level parallelism for both independent and dependent access sequences. By throttling stream transfer to stay just ahead of processor requests, STeMS hides main memory access latency while improving utilization of pin bandwidth and on-chip storage.
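
To make the idea of temporal correlation concrete, the following is a minimal software sketch, not the STeMS hardware design. It assumes a toy model in which every access would otherwise miss off chip, logs the access order, and on each access replays the addresses that followed the same address last time. The names `TemporalStreamer`, `depth`, and `count_covered` are hypothetical and chosen only for illustration; spatial correlation and stream throttling are not modeled.

```python
class TemporalStreamer:
    """Toy model of temporal streaming (an illustrative assumption, not STeMS itself)."""

    def __init__(self, depth=4):
        self.depth = depth    # how many following addresses to stream per lookup
        self.history = []     # ordered log of accessed addresses
        self.last_pos = {}    # address -> index of its most recent entry in history
        self.buffer = set()   # addresses already streamed in ahead of the request

    def access(self, addr):
        """Return True if addr was streamed in before the processor asked for it."""
        covered = addr in self.buffer
        self.buffer.discard(addr)

        # If this address was seen before, fetch what followed it last time.
        pos = self.last_pos.get(addr)
        if pos is not None:
            self.buffer.update(self.history[pos + 1 : pos + 1 + self.depth])

        # Record this access so a later recurrence can replay from here.
        self.last_pos[addr] = len(self.history)
        self.history.append(addr)
        return covered


def count_covered(trace, depth=4):
    """Count accesses in the trace that the streamer had already fetched."""
    streamer = TemporalStreamer(depth)
    return sum(streamer.access(addr) for addr in trace)


if __name__ == "__main__":
    # A pointer-chasing traversal that visits the same nodes twice.
    traversal = [10, 50, 30, 90, 20]
    print(count_covered(traversal * 2))
```

In this toy trace the first traversal trains the history, and in the second traversal only the first access goes uncovered, because it is the recurrence that triggers the stream. A chain of dependent misses is thereby turned into a set of fetches that can be issued in parallel.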
|
Performance Evaluation Methodology

SimFlex
http://www.ece.cmu.edu/~simflex

Computer architects have long relied on software simulation to measure dynamic performance metrics (e.g., CPI) of a proposed design. Unfortunately, detailed software simulators have become four or more orders of magnitude slower than their hardware counterparts, rendering hardware measurement methodologies impracticable for simulation studies of large-scale commercial server systems. The SimFlex project is advancing the state of the art in simulation tools and measurement methodology to enable fast, accurate, and flexible simulation of large-scale systems. SimFlex combines component-based software design, full-system simulation, statistical sampling, and simulation state checkpointing to enable rapid system evaluation with commercially relevant benchmark applications, such as online transaction processing databases, while validating performance estimates with statistical measures of confidence.

The SimFlex project has recently released two new computer architecture simulation tools to the academic and industrial community. TurboSMARTSim integrates rigorous statistical sampling methodology with live-points, a per-benchmark library of minimal reusable machine state, to accelerate microarchitecture simulation turnaround by 250x over previous simulation sampling approaches while maintaining high accuracy and confidence in estimates. Flexus is a family of component-based C++ computer architecture simulators that enable full-system, timing-accurate simulation of uni- and multiprocessor systems running unmodified commercial applications and operating systems.
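
The statistical sampling idea can be illustrated with standard sampling theory. The sketch below is not code from SimFlex, TurboSMARTSim, or Flexus. It assumes that CPI has been measured on a set of randomly chosen short simulation windows, and it uses a normal approximation, which is reasonable only for large sample counts. The function names `z_score`, `estimate_cpi`, and `required_samples` are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def z_score(confidence):
    """Two-sided standard-normal quantile for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def estimate_cpi(samples, confidence=0.95):
    """Return (mean CPI, half-width of the confidence interval).

    Assumes the samples are independent measurements of CPI and that the
    sample count is large enough for the normal approximation to hold.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate variance")
    half_width = z_score(confidence) * stdev(samples) / math.sqrt(n)
    return mean(samples), half_width


def required_samples(cv, rel_error, confidence=0.95):
    """Samples needed so the interval half-width is within rel_error of the mean.

    cv is the coefficient of variation (standard deviation divided by mean)
    of the per-window metric, typically estimated from a pilot run.
    """
    return math.ceil((z_score(confidence) * cv / rel_error) ** 2)


if __name__ == "__main__":
    cpi_windows = [1.0, 1.2, 0.8, 1.1, 0.9]  # made-up illustrative values
    cpi, err = estimate_cpi(cpi_windows)
    print(f"CPI = {cpi:.3f} +/- {err:.3f}")
    print(required_samples(cv=1.0, rel_error=0.03, confidence=0.997))
```

The point of the second function is that the number of windows to simulate depends on the variability of the metric and the target confidence, not on the length of the benchmark, which is why sampling can shorten simulation turnaround so substantially.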
|
