PASSAT GROUP @ UIUC

 

 

 

Current projects in the Passat group fall into one of two categories: computing with stochastic processors, and the design and architecture of multiscalar systems. Below we spotlight the former.

 

Computing with Stochastic Processors:

Background

The primary driver of innovation in computer systems has been the phenomenal scalability of the semiconductor manufacturing process. Governed by Moore's Law, it has allowed us to literally print circuits and systems whose capacity has grown exponentially over the last three decades. The resulting exponential reduction in cost per function has led to an unprecedented penetration of technology into homes and beyond, with profound impacts on society and quality of life.

Moore's Law is now under threat, however, because the same scaling has exponentially worsened the effects of material properties on chip reliability and power. As transistors become smaller (the oxide in a 22 nm process is only five atomic layers thick, and the gate length is only 42 atoms across), it is becoming increasingly expensive for current design and manufacturing technology to keep transistors functioning deterministically, even under normal operating conditions. There are three primary sources of non-determinism. First, decreasing transistor sizes lead to different transistors being doped differently during manufacturing, giving them non-deterministic electrical characteristics. Second, transistors have become smaller (by 6X) than the wavelength of the light used to pattern them, which causes non-determinism in the dimensions and characteristics of the manufactured transistors. Finally, the unprecedented increase in chip power density, coupled with time- and context-dependent variation in temperature and utilization across the chip, causes voltage and timing variations in the circuits. These variations are dynamic and largely non-deterministic.

The most immediate impact of such non-determinism is on chip yields: a growing number of parts are thrown away because they do not meet timing and power specifications. A 5% yield loss on a 90 nm process today directly translates into a cost that exceeds 2x the design cost for a typical cell-phone manufacturer, whose chips are arguably among the highest-volume parts. Clearly the status quo cannot continue. Left unaddressed, the problem will soon leave the entire computing and information technology industry facing the prospect of parts that scale in neither capability nor cost. We must find a solution to the non-determinism problem if semiconductor technology and the semiconductor industry are to remain viable drivers of scientific innovation and technological capability in the future.

Research Vision

Paradoxically, the problem is not non-determinism per se, but how computer system designers treat it. Chip components no longer behave like the precisely chiseled machines of the past; yet the basic approach to designing and operating computing machines has remained unchanged. While there have been many swings in computing platform paradigms, such as from general-purpose to specialized and from single-core to multi-core, the contract between hardware and software has stayed the same. The contract guarantees that hardware will return the correct value for every computation under all conditions. In other words, we demand that hardware be overdesigned to accommodate the mindsets of past computer systems and software design. The resulting guard-band not only increases cost for the hardware designer, because extracting the last bit of performance incurs excessive area and power overhead (especially if performance must be optimized for all possible computations), but also leaves enormous performance and energy potential untapped, since software assumes lower performance than the majority of instances of a platform can deliver most of the time.

Our goal is to fundamentally rethink the correctness contract between hardware and software. Instead of computing machines where the spatiotemporal variations in hardware specifications are hidden from the software behind conservative specifications or through overdesign, we present a vision of computing machines where the hardware is deliberately underdesigned, with relaxed design and manufacturing constraints, so that it may produce errors and only stochastically correct results even under nominal conditions. The software is aware of hardware errors and proactively self-adapts. We call such "under-designed" processors, which rely on software adaptability and architectural resilience to tolerate errors, stochastic processors. We call applications that have been implemented to be adaptively error-tolerant stochastic applications. The goal of our research is to explore approaches to architecting and designing stochastic processors and stochastic applications.

Note that reliable / robust systems have been studied for decades. Our hardware work is largely distinguished by the fact that we are not focused on error detection / correction. Instead, we are largely motivated by the question: given that an error resilience mechanism is available, how would you design and architect the processors differently? In addition, stochastic processor hardware may produce errors even during nominal operation. Similarly, our software work is largely focused on generic algorithmic approaches to building applications that are robust to transient hardware errors caused by manufacturing and environmental variations.

Project Overview

Realizing the vision of computing with stochastic processors will require a fundamental re-examination of the architecture and design principles for processors as well as applications. Our research has three thrusts.

Rethinking Design for Stochastic Processors. Conventional design methodologies are required to produce designs that work correctly under nominal conditions for all workloads. Therefore, such methodologies are agnostic of dynamic or workload-specific information. Such methodologies are also unconcerned about the impact of an error on the application or the cost of hardware or software-based error resilience. A stochastic processor design methodology, on the other hand, may allow errors for certain inputs or operating conditions. Therefore, such methodologies will exploit dynamic / functional information to optimize power or performance for a given workload or operating conditions. We are looking at stochastic processor design methodologies for different objectives for different sources of variation and different error resilience mechanisms. Specific stochastic processor design methodologies being investigated include optimizing hardware for gradual degradation under variations (soft processor design), minimizing power for a target error rate (recovery-driven design), and design optimizations for multi-modal stochastic processors (multi-modal stochastic design).
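The recovery-driven objective mentioned above (minimize power for a target error rate) can be illustrated with a small sketch. Everything specific in it is an assumption made for illustration: the table of operating points, the quadratic power model, and the function names are hypothetical and are not measurements or tools from this project.

```python
# Hypothetical operating points: (supply voltage in volts, timing-error rate per cycle).
# The numbers are invented for illustration only.
OPERATING_POINTS = [
    (1.00, 0.0),
    (0.95, 1e-6),
    (0.90, 1e-4),
    (0.85, 1e-2),
    (0.80, 2e-1),
]


def relative_dynamic_power(voltage, nominal_voltage=1.0):
    """Assumed model: dynamic power scales with the square of the supply voltage."""
    return (voltage / nominal_voltage) ** 2


def lowest_power_point(points, target_error_rate):
    """Return the (voltage, error_rate) pair with the lowest modeled power whose
    error rate does not exceed the target, or None if no point qualifies."""
    feasible = [p for p in points if p[1] <= target_error_rate]
    if not feasible:
        return None
    return min(feasible, key=lambda p: relative_dynamic_power(p[0]))


# Example: the error resilience mechanism is assumed to tolerate 1e-3 errors per cycle.
choice = lowest_power_point(OPERATING_POINTS, target_error_rate=1e-3)
print(choice, relative_dynamic_power(choice[0]))
```

A conventional methodology corresponds to a target error rate of zero, which in this toy table forces the nominal voltage; relaxing the target is what opens up the lower-power points.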

Rethinking Architecture for Stochastic Processors. Traditionally, architectural decisions have been made in the context of correctness. As such, architectural optimizations aim to maximize processor efficiency during correct operation while disregarding scenarios where the correctness constraint is relaxed and an error resilience mechanism is available. We are investigating the differences between architectural design for correctness and architecting for error resilience. We are attempting to identify key properties of an architecture that influence efficiency at different error rates, exploring how architectural and structural transformations influence these properties as well as energy efficiency, and translating this knowledge into prescriptive advice for how to optimize the architecture of an error-resilient design. Specific directions being pursued include architectures optimized for specific error rates (resilience-optimized architectures) and architectures whose stochasticity is configured based on execution characteristics (dynamic error management architectures).

Compiling Programs for Stochastic Processors. The rate and cost of timing errors produced by a processor strongly depend on the activity distribution of the processor. The activity distribution of the processor describes how often paths are toggled, and thus determines the frequency of errors caused by a path when it has negative slack. Together, the slack and activity distributions dictate the error distribution of a processor, i.e., the locations and frequencies of errors produced in an overscaled processor. Since the program binary, in conjunction with the processor architecture, determines a processor's activity distribution, we are investigating program optimizations to increase the efficiency of a design that exploits timing error resilience.
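As a rough illustration of how slack and activity combine into an error rate, the sketch below estimates the per-cycle error probability of a handful of paths. The path data, the independence assumption, and the function name are hypothetical; a real analysis would work from timing and switching data for the actual design.

```python
from math import prod  # available in Python 3.8+


def per_cycle_error_rate(paths):
    """Estimate the probability that at least one timing error occurs in a cycle.

    `paths` is a list of (slack, toggle_probability) pairs. A path can only
    produce an error when its slack is negative, and then only in cycles in
    which it toggles. Assumption: paths toggle independently of one another.
    """
    no_error = prod(1.0 - activity for slack, activity in paths if slack < 0)
    return 1.0 - no_error


# Hypothetical paths: slack in nanoseconds, toggle probability per cycle.
baseline = [(0.10, 0.50), (-0.02, 0.20), (-0.05, 0.10), (0.03, 0.40)]

# Same slack distribution, but a hypothetical recompiled binary that
# exercises the negative-slack paths less often.
optimized = [(0.10, 0.55), (-0.02, 0.05), (-0.05, 0.02), (0.03, 0.45)]

print(per_cycle_error_rate(baseline), per_cycle_error_rate(optimized))
```

In this toy model the hardware is identical in both cases; only the activity on the negative-slack paths changes, which is the lever a program optimization would pull.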

Application Robustification. Errors produced by a stochastic processor may escape to software by design or by accident (even when a hardware-based error resilience mechanism is used). Such scenarios critically require the application to be robust to errors. We are developing methodologies to robustify applications instead of relying on their intrinsic robustness. The goal is to identify methodologies generic enough that the same methodology can be used to develop error-tolerant versions both of applications that require precisely correct outputs and of applications that do not. One specific technique being investigated involves converting applications into error-tolerant stochastic optimization problems and then solving them using provably robust (under some assumptions) gradient descent-based solvers. Another involves the use of checksums to robustify sparse linear algebra.
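To make the checksum idea concrete, here is a minimal sketch of a checksum-protected sparse matrix-vector product. It uses a standard column-sum check (the sum of the entries of y = A x should equal the column sums of A dotted with x) and is not necessarily the scheme used in this project. The function names, the retry policy, and the fault injector are assumptions made for illustration, and the sketch assumes the checksum itself is computed reliably.

```python
def spmv(rows, x):
    """Sparse matrix-vector product; `rows` is a list of {column: value} dicts."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]


def column_checksum(rows, n_cols):
    """Column sums of the matrix (the vector 1^T A), computed once up front."""
    sums = [0.0] * n_cols
    for row in rows:
        for j, v in row.items():
            sums[j] += v
    return sums


def checked_spmv(rows, x, checksum, multiply=spmv, tol=1e-9, max_tries=3):
    """Recompute y = A x until sum(y) matches checksum . x, or give up."""
    expected = sum(c * xj for c, xj in zip(checksum, x))
    for _ in range(max_tries):
        y = multiply(rows, x)
        if abs(sum(y) - expected) <= tol * max(1.0, abs(expected)):
            return y
    raise RuntimeError("checksum mismatch persisted after retries")


def make_faulty_once(multiply):
    """Wrap `multiply` so that its first call returns a corrupted result
    (a stand-in for a transient hardware error)."""
    state = {"calls": 0}

    def wrapped(rows, x):
        state["calls"] += 1
        y = multiply(rows, x)
        if state["calls"] == 1:
            y[0] += 1.0  # inject a single-element error
        return y

    return wrapped, state


# Small hypothetical 3x3 sparse matrix and input vector.
A = [{0: 2.0, 2: 1.0}, {1: 3.0}, {0: 1.0, 1: 1.0, 2: 4.0}]
x = [1.0, 2.0, 3.0]

faulty_spmv, call_state = make_faulty_once(spmv)
y = checked_spmv(A, x, column_checksum(A, 3), multiply=faulty_spmv)
print(y, call_state["calls"])
```

A single scalar check like this cannot catch errors that happen to cancel in the sum, and it only detects a fault rather than locating it, so it should be read as the simplest member of the family rather than a complete scheme.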

Note that the above work on hardware architecture and design methodologies does not require the errors to be exposed to software (e.g., optimizations can be performed for a given hardware error resilience mechanism). In fact, our hardware work may also be viewed as a set of novel techniques for minimizing power and maximizing yield. This makes the deliverables from our work attractive to industry and computing organizations even in the short term.

Please send us an email if you need more details.