Current projects in the Passat group fall into one of two categories: computing with stochastic processors, and the design and architecture of multiscalar systems. Below we spotlight the former.
Computing with Stochastic Processors:
Background
The primary driver for innovations in computer systems has been the phenomenal scalability of the
semiconductor manufacturing process. Governed by Moore's Law, it has allowed us to literally
print circuits and systems whose capacity has grown exponentially for the last three decades. The
resulting exponential reduction in cost per function has led to an unprecedented penetration of
technology in homes and beyond, with profound impacts on society and quality of life.
Moore's Law is now under threat, however, due to the exponentially deteriorating
effects of material properties on chip reliability and power at these scales. As transistors become
smaller (the oxide in a 22 nm process is only five atomic layers thick, and the gate length is only
42 atoms across), it is becoming increasingly expensive for current design and manufacturing
technology to keep the transistors functioning deterministically, even under normal operating conditions.
There are three primary sources of non-determinism. First, decreasing transistor sizes lead to different transistors being
doped differently during the manufacturing process, causing them to have non-deterministic electrical
characteristics. Second, transistors have become smaller than the wavelength of the light used to pattern
them (by 6X). This causes non-determinism in the dimensions and characteristics of the manufactured
transistors. Finally, the unprecedented increase in the power density of the chips, coupled with
the time- and context-dependent variation in temperature and utilization across the chip, causes voltage and
timing variations in the circuits. These variations are dynamic and largely non-deterministic.
The most immediate impact of such non-determinism is on chip yields: a growing number
of parts are thrown away because they do not meet timing- and power-related specifications. A 5%
yield loss on a 90nm process today directly translates into a cost to the manufacturer that exceeds 2x the
design cost for a typical cell-phone chip, arguably one of the highest-volume parts. Clearly
the status quo cannot continue. Left unaddressed, the problem will soon leave the entire computing and
information technology industry facing the prospect of parts that scale in neither capability nor cost.
We must find a solution to the non-determinism problem if semiconductor technology and industry are to
remain viable drivers of scientific innovation and technological capability for the future.
Research Vision
Paradoxically, the problem is not non-determinism per se, but how the computer system designers treat
it. The chip components no longer behave like the precisely chiseled machines of the past; yet, the basic
approach to designing and operating computing machines has remained unchanged. While there have
been many swings in computing platform paradigms, such as from general-purpose to specialized, and
from single-core to multi-core, the contract between hardware and software has remained unchanged. The
contract guarantees that hardware will return the correct value for every computation under all conditions.
In other words, we demand that hardware be overdesigned to satisfy the mindsets of past computer systems
and software design. The resulting guard-band not only increases cost, because getting the last bit of
performance incurs too much area and power overhead (especially if performance is to be optimized for
all possible computations), but also leaves enormous performance and energy potential untapped, as the
software assumes lower performance than a majority of instances of that platform may be capable of
most of the time.
Our goal is to fundamentally rethink the correctness contract between hardware and software.
Instead of computing machines where the spatiotemporal variations in hardware specifications are hidden from the software behind conservative specifications or through overdesign, we present a vision
of computing machines where the hardware is deliberately underdesigned with relaxed design and
manufacturing constraints to allow errors and produce stochastically correct results even
under nominal conditions. The software is aware of hardware errors and proactively self-adapts. We
call such "under-designed" processors, which may produce only stochastically correct results even under
nominal conditions and which rely on software adaptability and architectural resilience to tolerate errors,
stochastic processors. We call applications that have been implemented to be adaptively error-tolerant
stochastic applications. The goal of our research is to explore approaches to architect and
design stochastic processors and stochastic applications.
Note that reliable / robust systems have been studied for decades. Our hardware work is largely distinguished by the fact that we are not
focused on error detection / correction. Instead, we are largely motivated by the question: given that an
error resilience mechanism is available, how would you design and architect the processors differently?
Also, stochastic processor hardware may produce errors even during nominal operation. Similarly, our
software work is largely focused on generic algorithmic approaches to building applications that are
robust to transient hardware errors caused by manufacturing and environmental variations.
Project Overview
Realizing the vision of computing with stochastic processors will require a fundamental re-examination
of the architecture and design principles for processors as well as applications. Our research
has the following thrusts.
Rethinking Design for Stochastic Processors. Conventional design methodologies are required
to produce designs that work correctly under nominal conditions for all workloads. Therefore,
such methodologies are agnostic to dynamic or workload-specific information. They
are also unconcerned about the impact of an error on the application or the cost of hardware-
or software-based error resilience. A stochastic processor design methodology, on the other hand,
may allow errors for certain inputs or operating conditions. Therefore, such methodologies will
exploit dynamic / functional information to optimize power or performance for a given workload
or operating conditions. We are looking at stochastic processor design methodologies
for different objectives, different sources of variation, and different error resilience mechanisms.
Specific stochastic processor design methodologies being investigated include optimizing hardware
for gradual degradation under variations (soft processor design), minimizing power for a target
error rate (recovery-driven design), and design optimizations for multi-modal stochastic processors
(multi-modal stochastic design).
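As a rough illustration of the recovery-driven idea (minimize power subject to a target error rate), the sketch below selects an operating voltage from a table of characterized points. Everything in it is an assumption made for illustration, not the group's methodology: the voltage and error-rate numbers are invented, supply voltage is the only knob considered, the cost of error recovery is ignored, and dynamic power is approximated as proportional to the square of the supply voltage at a fixed frequency.

```python
# Hypothetical characterization data: (supply voltage in volts, timing errors per cycle).
# These numbers are invented for illustration only.
OPERATING_POINTS = [
    (1.0, 0.0),
    (0.9, 0.001),
    (0.8, 0.02),
    (0.7, 0.2),
]


def lowest_voltage_for_target(points, target_error_rate):
    """Return the (voltage, error_rate) point with the lowest voltage
    whose error rate does not exceed the target."""
    feasible = [(v, e) for v, e in points if e <= target_error_rate]
    if not feasible:
        raise ValueError("no operating point meets the target error rate")
    return min(feasible, key=lambda point: point[0])


def relative_dynamic_power(voltage, nominal_voltage):
    """Dynamic power relative to nominal, assuming P ~ V^2 at fixed frequency."""
    return (voltage / nominal_voltage) ** 2


if __name__ == "__main__":
    v, e = lowest_voltage_for_target(OPERATING_POINTS, target_error_rate=0.01)
    print(f"voltage={v} V, error rate={e}, "
          f"relative power={relative_dynamic_power(v, 1.0):.2f}")
```

With this invented table, loosening the target from 0.01 to 0.05 moves the selected point from 0.9 V to 0.8 V. A fuller model would also charge each error the energy and time of whatever recovery mechanism is assumed.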
Rethinking Architecture for Stochastic Processors. Traditionally, architectural decisions have been
made in the context of correctness. As such, architectural optimizations aim to maximize processor
efficiency during correct operation while disregarding scenarios where the correctness constraint is
relaxed and an error resilience mechanism is available. We are investigating the differences
between architectural design for correctness and architecting for error resilience. We are
attempting to identify key properties of an architecture that influence efficiency at different
error rates, exploring how
architectural and structural transformations influence these properties as well as energy efficiency,
and translating this knowledge into prescriptive advice for how to optimize the architecture of an
error-resilient design. Specific directions being pursued include architectures optimized for specific
error rates (resilience-optimized architectures) and architectures whose stochasticity is configured
based on execution characteristics (dynamic error management architectures).
Compiling Programs for Stochastic Processors. The rate and cost of timing errors produced by a
processor strongly depend on the activity distribution of the processor. The activity distribution
describes how often paths are toggled, and thus determines the frequency of errors caused by a path
when it has negative slack (that is, when its delay exceeds the available clock period). Together,
the slack and activity distributions dictate the error distribution of a processor, i.e., the
locations and frequencies of errors produced in an overscaled processor. Since the program binary,
in conjunction with the processor architecture, determines a processor's activity distribution, we
are investigating program optimizations to increase the efficiency of a design that exploits timing
error resilience.
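The relationship between slack, activity, and errors can be sketched in a few lines. The model below is a deliberately simplified assumption, not the group's tooling: each path has a single fixed delay, a path with negative slack is assumed to produce an error every time it toggles, and the overall error rate is estimated with a union bound. The `Path` class, the path names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    delay_ns: float     # path delay at the overscaled operating point
    toggle_rate: float  # fraction of cycles in which the path is exercised


def error_distribution(paths, clock_period_ns):
    """Map each path with negative slack to its error frequency (errors per cycle).

    Slack is the clock period minus the path delay. A path with negative
    slack is assumed to produce an error every time it toggles.
    """
    return {
        p.name: p.toggle_rate
        for p in paths
        if clock_period_ns - p.delay_ns < 0
    }


def error_rate_bound(paths, clock_period_ns):
    """Union-bound estimate of errors per cycle, capped at 1."""
    return min(1.0, sum(error_distribution(paths, clock_period_ns).values()))


# Same hypothetical hardware (identical delays), two hypothetical binaries
# that exercise the paths with different frequencies.
BINARY_A = [
    Path("alu_carry", 1.2, 0.25),
    Path("multiplier", 1.5, 0.125),
    Path("load", 0.9, 0.5),
]
BINARY_B = [
    Path("alu_carry", 1.2, 0.0625),
    Path("multiplier", 1.5, 0.0),
    Path("load", 0.9, 0.75),
]

if __name__ == "__main__":
    for label, binary in (("A", BINARY_A), ("B", BINARY_B)):
        print(label, error_distribution(binary, 1.0), error_rate_bound(binary, 1.0))
```

Both binaries see the same slack distribution at a 1.0 ns clock, but binary B shifts activity away from the two failing paths and therefore has a much lower estimated error rate. That is the kind of effect a program optimization would aim for under this model.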
Application Robustification. Errors produced by a stochastic processor may escape to software by
design or by accident (even when a hardware-based error resilience mechanism is used). Such scenarios
critically require the application to be robust to errors. We are developing methodologies to
robustify applications instead of relying on their intrinsic robustness. The goal is to identify
methodologies that are generic enough to produce error-tolerant versions both of applications that
require precisely correct outputs and of applications that do not. One specific technique being
investigated involves converting applications into error-tolerant stochastic optimization problems
and then solving them using provably robust (under some assumptions) gradient descent-based solvers.
Another involves the use of checksums to robustify sparse linear algebra.
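To make the checksum idea concrete, here is a minimal sketch for sparse matrix-vector multiplication. It relies on the identity that the sum of the entries of y = A x equals the dot product of the column sums of A with x. The sketch is an illustration in the general style of algorithm-based fault tolerance, not the group's actual technique, and it rests on stated assumptions: the checksum itself is computed reliably, rows are stored as `{column: value}` dictionaries, a mismatch is handled by simply recomputing, and the function names and tolerance are hypothetical.

```python
def spmv(rows, x):
    """Sparse matrix-vector product; each row is a {column: value} dict."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]


def column_sums(rows, n_cols):
    """Sum of each column of the sparse matrix."""
    sums = [0.0] * n_cols
    for row in rows:
        for j, v in row.items():
            sums[j] += v
    return sums


def checked_spmv(rows, x, multiply=spmv, rel_tol=1e-9, max_attempts=3):
    """Run `multiply` and accept the result only if its checksum matches.

    The expected checksum (column sums of A dotted with x) is assumed to be
    computed on reliable hardware. `multiply` may be unreliable.
    """
    sums = column_sums(rows, len(x))
    expected = sum(s * xj for s, xj in zip(sums, x))
    for _ in range(max_attempts):
        y = multiply(rows, x)
        if abs(sum(y) - expected) <= rel_tol * max(1.0, abs(expected)):
            return y
    raise RuntimeError("checksum mismatch persisted after retries")


if __name__ == "__main__":
    A = [{0: 2.0, 2: 1.0}, {1: 3.0}, {0: 1.0, 1: 1.0, 2: 1.0}]
    x = [1.0, 2.0, 3.0]
    print(checked_spmv(A, x))
```

A single scalar checksum detects an error but does not locate it, and errors that cancel in the sum go unnoticed. Retrying is only one possible response to a mismatch.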
Note that the above work on hardware architecture and design methodologies does not require the
errors to be exposed to software (e.g., optimizations can be performed for a given hardware error resilience
mechanism). In fact, our hardware work may also be viewed as a set of novel techniques
for minimizing power and maximizing yield. This makes the deliverables from our work attractive to
industry and computing organizations even in the short term.
Please send us an email if you need more details.