Our research is driven by big challenges and big opportunities in computer
system design. A potentially big opportunity to address the extreme power challenges of the future lies in a careful relaxation of the correctness constraints of hardware. Prior work on implementing ASICs with relaxed correctness requirements report energy benefits approaching an order of magnitude. The primary mandate of our research group is to approach such benefits for programmable processors.
Programmable Stochastic Computing: Embracing Errors in Design and Architecture of Processors and Applications
Programmable hardware has always been designed and optimized for correct operation under all conditions. However, the power cost of guaranteeing correctness is increasingly becoming prohibitive, especially in presence of hardware and environmental variations; present hardware is also necessarily over-optimized for the error tolerant applications of future. Our recent research explores the feasibility and power benefits of programmable computing machines (stochastic processors) where (a) hardware errors are allowed even during nominal operation, and (b) software and hardware are optimized to maximize power savings, while delivering acceptable outputs in spite of errors. Specifically, we have focused on two questions: a) can increased error tolerance in software and hardware result in significant power savings? If yes, how should hardware be designed and architected to maximize power savings in presence of increased error tolerance? b) given that errors escape to software (inadvertently or by design), how should software be re-designed to allow robust, acceptable operation in spite of errors. Below we discuss our research accomplishments.
Programmable Stochastic Processors
Hardware errors must be allowed only if there are significant power benefits to relaxing hardware correctness. We showed [HPCA2010] that conventional processor designs do not allow useful tradeoffs between voltage and timing error rate (here, error rate refers to the fraction of cycles in which at least one timing error occurs). This is because timing slack distribution in conventional designs looks like a wall (i.e., a large fraction of timing paths are critical or near critical). Slack walls can be attributed to a) the dominance of regular structures (e.g., caches, register files, etc.) whose timing paths tend to be critical and of similar lengths and b) the tendency of conventional physical design tools to reclaim timing slack to save power. Slack wall causes a large number of timing errors in presence of voltage overscaling (or manufacturing and dynamic variations) making it impossible for the errors to be tolerated meaningfully by an application or a hardware timing error resilience mechanism. The paper surprised many as it effectively shows that useful power/reliability tradeoffs may not be possible from many previously proposed hardware and software error tolerance techniques unless slack walls are dealt with.
In order to allow useful power reliability tradeoffs in processors, therefore, we developed a physical design methodology [HPCA2010, ASPDAC2010] that generates a gradual slack distribution. A gradual slack distribution will cause relatively small number of timing violations when voltage is overscaled by a certain magnitude (or in presence of variations) making it potentially easier for the errors to be tolerated meaningfully by an application or a hardware timing error resilience mechanism. Note, however, that a gradual slack distribution requires increasing the slack of near-critical paths in the design. However, increasing path slacks increases power, so we risk canceling out the power reductions obtained from extra voltage scaling. Herein lies the challenge of our design objective. We want to increase the slack of paths in the design, but we also want to perform this optimization without incurring a large power overhead at zero error rate. To do this, we rely on the following observation: frequently-exercised paths contribute the most to the error rate when voltage is scaled, however the number of such paths in a design is low w.r.t. the total number of paths. Our insight is that we can keep the power overhead at zero error rate low by optimizing frequently exercised paths and de-optimizing rarely exercised paths while simultaneously creating a gradual slack distribution that allows deeper voltage overscaling for a target error rate. We observed that a sacrifice of 2% in error rate saved 30% power for an OpenSPARC processor with less than 2% power overhead at zero error rate. More importantly, the resulting processor degrades gracefully in presence of variations allowing the costs of software or hardware error resilience to be manageable. For Razor-based error resilience, 21% energy savings was achieved over the conventional Razorized design allowing the methodology to be used as a low power technique at no cost to reliability. This is the first physical design methodology, to the best of our knowledge, to produce hardware designs that allow effective tradeoffs between power and timing error rate.
The above technique for slack redistribution assumes a fixed microarchitecture. Our recent work [CASES 2011, ICCD2011] shows that microarchitecture affects slack distribution and the frequency at which different timing paths are exercised. For example, the slack distribution for processors with large register files and caches looks closer to a slack wall. Similarly, utilization-enhancing microarchitectural techniques (e.g., superscalar processing, pipeling) affect error rate for different timing critical structures in presence of variations. Microarchitectural dependence of the error distribution suggests that deeper voltage scaling may be possible for a given timing error rate through microarchitectural optimizations (e.g., move heavily utilized structures off the critical path to permit more scaling). It also suggests that one would make different, sometimes counterintuitive, microarchitectural design choices when optimizing a processor specifically for a non-zero error rate (e.g., for an error resilience mechanism) than when optimizing for a zero error rate. We showed that significant energy benefits are indeed possible from such error resilience-aware microarchitectural optimizations. For Razor-based error resilience, 29% energy savings was achieved over conventional Razorized-designs allowing the optimizations to be used as a low power technique at no cost to reliability. This is the first work, to the best of our knowledge, to make a case for rethinking microarchitecture to allow effective tradeoffs between power and timing error rate.
Timing error rate in presence of variations can also be manipulated through compiler optimizations [DAC 2012a]. Since the program binary, in conjunction with the processor architecture, determines a processor's activity distribution, compiler optimizations can be used to manipulate the utilization of different microarchitectural units based on their timing slack distribution. For example, compiler optimizations can be used to change the set of frequently exercised paths in a processor to avoid activating the longest paths. Similarly, compiler optimizations can be used to reduce error rate by throttling activity in structures of the processor that cause the most errors. Other possibilities include optimizations to overlap errors in a cycle to reduce the effective error rate and optimizations to redistribute errors in the processors to reduce the effective error recovery overhead. For a conventional Razorized design, up to 39% energy savings was achieved. This is the first work on compiler techniques that require no additional hardware support to reduce energy of execution on timing speculative processors.
Making Software Robust to Errors
Ideally, the errors produced by stochastic hardware get tolerated by software. Unfortunately, while some applications are robust to certain kinds of errors (our IEEE Transactions on Multimedia 2012 paper exploits error resilience of certain GPU applications to improve performance on commodity GPUs), faults often cause undesirable outputs for most other applications. One research direction in our group focuses on making software robust to hardware errors.
We focus on algorithmic techniques to make applications robust to numerical errors. One key technique [DSN-DCCS 2010] involves recasting an application, including applications such as sorting and graph matching that have discrete output(s), as an optimization problem, which can then be solved reliably through well-known, error tolerant solvers. The key insight is that numerical optimization problems are tolerant to numerical errors (for example, gradient descent-based algorithms are guaranteed to converge even when there is an error in the calculation of the gradient). As such, converting arbitrary applications into numerical optimization problems may make them error tolerant. A large class of applications can be robustified using this approach. For example, since linear programming, which is P-complete, can be implemented this way, all applications in class P can employ the proposed approach. Non-linear (NP) problems such as job-scheduling, etc. can also employ the proposed approach, but we cannot provide theoretical guarantees of the generality of the approach for NP problems. For many applications, the runtime and, therefore, scalability of the proposed approach to large problems may be an issue, but the significantly higher parallelism in the new implementations can be exploited using accelerators [SASP 2011] to deliver performance comparable to the baseline. For other applications, energy benefits larger than 1 order of magnitude may be possible [DSN-DCCS 2010]. To the best of our knowledge, this is the first algorithmic methodology for transforming arbitrary applications into their error tolerant form.
We also consider domain-specific algorithmic techniques. We focus on sparse linear algebra -based applications as sparse-linear algebra-based kernels dominate many RMS applications. Our recent paper [DSN-DCCS 2012] shows that previously proposed algorithmic techniques that focused on detecting errors in dense linear operations have high overhead in the context of sparse problems due to lower timing complexity of sparse problems. We propose a set of algorithmic techniques that minimize the overhead of fault detection for sparse problems. The techniques are based on two insights. First, many sparse problems are well structured (e.g. diagonal), which allows for sampling to produce good approximations of the checks used for fault detection. These approximate checks may be acceptable for many sparse linear algebra applications. Second, many linear applications have enough reuse that preconditioning techniques can be used to make these applications more amenable to low-cost algorithmic checks. The proposed techniques are shown to yield up to 2x reductions in performance overhead over traditional checks for a spectrum of sparse problems. To the best of our knowledge, this is the first work to address application-level fault tolerance in the context of general sparse linear algebra. We are currently working on extending the techniques to allow algorithmic correction [DAC2012b, Under Submission].
New Directions in Architecture and Design of Energy-Efficient Microprocessors
Programmable stochastic computing is a fairly disruptive and potentially long-term approach to addressing the power challenge of the future. Thankfully, in the short term, some opportunities still exist to significantly reduce processor power by identifying a discontinuity between conventional design/microarchitectural decisions and the energy efficiency needs of future. Below we present two examples.
Power Balanced Pipelines
Since the onset of pipelined processors, balancing the delay of the microarchitectural pipeline stages such that each microarchitectural pipeline stage has an equal delay has been a primary design objective as it maximizes instruction throughput. Unfortunately, this causes significant energy inefficiency in processors as each microarchitectural pipeline stage gets the same amount of time to complete irrespective of its size or complexity. Our recent work (HPCA 2012- ) showed that the inefficiency manifests itself as a significant imbalance in power consumption of different microarchitectural pipestages. for many processors (1-2 orders of magnitude of difference was routinely observed in the power consumption of the least power microarchitectural pipeline stage and the highest power pipestage). We propose the concept of power balanced pipelines - i.e., processor pipelines in which different delays are assigned to different microarchitectural pipestages to reduce the power disparity between the stages while guaranteeing the same processor frequency/performance. A specific implementation of the concept uses cycle time stealing to deliberately redistribute cycle time from low-power microarchitectural pipeline stages to power-hungry stages, relaxing their timing constraints and allowing them to operate at reduced voltages or use smaller, less leaky cells. Since the change in power of each stage is proportional to the area and original power of the stage, the increase in power for the simple, low-power stage should be less than the decrease in power for the complex, high-power stage, thus generating a net power savings for the processor for the same guaranteed performance. We demonstrate that balancing pipeline power rather than delay at the microarchitectural level can result in up to 46% processor power reduction with no loss in processor frequency. To the best of our knowledge, this is the first such work on microarchitecture-level power reduction that guarantees the same performance.
Peak Power Management
Much research has been done on decreasing average power dissipation of processors, which most directly addresses cooling costs and reliability. However, much less has been done to decrease peak power, which most directly impacts the processor design, packaging, and power delivery. We performed the first academic study, to the best of our knowledge, on mechanisms and policies for peak power management for single-core [MICRO2009] and many-core [DATE2009] processors. For single core processors, we proposed a highly adaptive processor core whose components can be configured at different levels, but because they are centrally controlled, the architecture can guarantee that they are never all configured maximally at the same time. The mechanisms for transitioning between allowed configurations maximize performance within a peak power constraint. Such an architecture can cut peak power by 25% with less than 5% performance loss. For many-core processors, we proposed decentralized techniques that enable the placement of several more cores on a die than what the peak power budget would normally allow. Up to 47% throughput improvement was achieved for a fixed peak power budget.
Low Power Memory Systems
System power, especially in context of HPC systems and embedded systems,
is increasingly getting dominated not by processors, but by memories. While
there are several reasons for this, an important one is the fact that memories
are very susceptible to transient and permanent errors and, as such, require
a costly investment (e.g., chipkill, ECC) to support reliable execution.
We believe that current approaches to providing ECC/chipkill support are
extremely energy-inefficient and that opportunities exist to significantly
reduce the cost of building relible memories. We have started a slew of projects
in this area. Below we present one example.
Adaptive Reliability Chipkill Correct
Chipkill correct is an advanced type of error correction
in memory that is popular among servers. Large field studies of memories have
shown that chipkill correct reduces uncorrectable error rate by 4X to 36X
SECDED. Unfortunately, there is a strong trade-off between
effective capacity, power, and reliability for existing chipkill
correct memories. Commercial available chipkill correct memories with a storage overhead of 12.5% are optimized for
memory capacity; however, they are very power hungry. On
the other hand, recent proposals on chipkill correct such as
LOT-ECC optimize for low power consumption but require
26.5% storage overhead. LOT-ECC may also increase uncorrectable error rate by 4.25X compared to some commercially
available chipkill correct memories. We recently presented [HPCA 2013]
Adaptive Reliability Chipkill Correct (ARCC) - a framework
for symbol-based chipkill correct memories that signi?cantly
improves on the tradeoff between effective capacity, power,
and reliability. ARCC is based on the observation that, on
average, only a tiny fraction of memory experiences any type
of faults during the typical operational lifespan of a server.
As such, it proposes relaxing the strength of chipkill correct
in the beginning and then adaptively increasing the strength
as needed on a page by page basis in order to reap the bene?t of lower power consumption during the majority of the
lifetime of a memory system. When optimized for memory
capacity, ARCC reduces the power consumption of memory
by 36% compared to commercially available chipkill correct
memories, without increasing storage overhead or degrading
reliability. When optimized for low power consumption, it can
offer matching uncorrectable error rate as that of any commercially available chipkill correct memory for the same ECC
overhead and similar power characteristics as an oracular
LOT-ECC  implementation.
Select Other Work
Our ICCAD 2012 paper presents logic synthesis techniques for timing speculative processors. Our IEEE Transactions on Multimedia 2012 paper presents a GPU-based prototype system where errors are selectively introduced into applications for significantly higher performance. Our IEEE Transactions on VLSI 2012 paper proposes techniques to enhance the lifetime energy efficiency of DVFS-based designs. Our HPCA 2011 paper quantifies the inefficiency of message passing applications on shared memory multi-core architectures and proposes techniques to significantly enhance their efficiency. Our DATE 2011 paper re-evaluates the effectiveness of architecture-level techniques for NBTI management and demonstrates that the potential benefits from these techniques are, for the most part, smaller than what has been previously suggested, and that guardbanding may still be an efficient way to deal with aging. Our DAC 2010 paper presents techniques to minimize power at a target timing error rate. Our ISQED2010 paper makes a case for new binning metrics and strategies for multi-core architectures as the binning overhead may become prohibitive in future due to variations.
Please send us an email if you need more details.