What Is a Compiler When The Architecture Is Not Hard(ware) ?

What Is a Compiler When The
Architecture Is Not Hard(ware) ?
Krishna V Palem
This work was supported in part by awards from Hewlett-Packard Corporation, IBM, Panasonic AVC, and
by DARPA under Contract No. DABT63-96-C-0049 and Grant No. 25-74100-F0944.
Portions of this presentation were given by the speaker as a keynote at the
ACM LCTES2001 and as an invited speaker at EMSOFT01
UCB November 8, 2001
Krishna V Palem
Georgia Tech
2
Embedded Computing
Why ?
What ?
How ?
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
The Nature of Embedded Systems
Visible Computing
View is of end application
(Hidden computing element)
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
3
4
Favorable Trends
 Supported by Moore’s (second) law
– Computing power doubles every eighteen months
Corollary: cost per unit of computing halves every eighteen months
 From hundreds of millions to billions of units
 Projected by market research firms (VDC) to be a
50 billion+ space over the next five years
 High volume, relatively low per unit $ margin
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
5
Embedded Systems Desiderata
Low Power
- High battery life
Small size or footprint
Real-time constraints
Performance
comparable to or
surpassing leading edge
COTS technology
UCB – November 8, 2001
Rapid time-to-market
Krishna V Palem
Georgia Tech
6
Timing Example
Video-On-Demand
Predictable Timing Behavior
UCB – November 8, 2001
Unpredictable Timing
Behavior
Krishna V Palem
Georgia Tech
7
Current Art
Vertical application domains
Telecom
Industrial
Automation
Select computational kernels



To ASIC
Meet desiderata while overcoming NRE cost hurdles through
volume
High migration inertia across applications
Long time to market
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
8
Subtle but Sure Hurdles
 For Moore’s corollary to be true
– Non-recurring engineering (NRE) cost must be amortized
over high-volume
– Else prohibitively high per unit costs
 Implies “uniform designs” over large workload
classes
– (Eg). Numerical, integer, signal processing
 Demands of embedded systems
–
–
–
–
“Non uniform” or application specific designs
Per application volume might not be high
High NRE costs  infeasible cost/unit
Time to market pressure
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
The Embedded Systems Challenge
Multiple application domains
3D
Industrial
Graphics Automation
Medtronics
E-Textiles Telecom
Rapidly changing application
kernels in moderate volume
Custom computing solution
meeting constraints

Sustain Moore’s corollary
– Keep NRE costs down
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
9
10
Responding Via Automation
Multiple application domains
3D
Industrial
Graphics Automation
Medtronics
E-Textiles Telecom
Rapidly changing application
kernels in low volume
Power
Size
Automatic
Tools
Timing
Application specific design
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
11
Three Active Approaches
 Custom microprocessors
 Architecture exploration and synthesis
 Architecture assembly for reconfigurable computing
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
Custom Processor Implementation
Application
analysis
Application
Proprietary ISA,
Fabricate
Architecture
Processor
Specification
Proprietary
Tools
Language
with custom
extensions
Compiler
Binary
Custom
Processor
implementation
Proprietary
ISA
Time Intensive




High performance implementation
Customized in silicon for particular application domain
O(months) of design time
Once designed, programmable like standard processors
Tensilica, HP-ST Microelectronics approach
UCB – November 8, 2001
12
Krishna V Palem
Georgia Tech
Architecture Exploration and Synthesis
The PICO Vision
Program In
Automatic synthesis of
application specific
parallel / VLIW
ULSI microprocessors
And their compilers
for embedded computing
Chip Out
“Computer design for the masses”
“A custom system architecture in 1 week tape-out in 4 weeks”
B Ramakrishna Rau “The Era of Embedded Computing”, Invited talk,
CASES 2000.
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
13
14
Custom Microprocessors
Application(s)
define workload
Optimizing
Compiler
Microprocessor
(eg) Itanium +
UCB – November 8, 2001
Analyze
Define Compiler
Optimizations
Define ISA extension
(eg) IA 64+
Design
Implementation
Krishna V Palem
Georgia Tech
15
Application Specific Design
Applications
Single
Application
Analyze
Program Analysis
Extended EPIC
Compiler
Technology
Explore and
Synthesize
implementations
Library of possible
implementations
(Bypass ISA)
VLIW Core +
Non programmable extension
Application specific processor
runs single application
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
The Compiler Optimization Trajectory
Compiler
Hardware
Frontend and Optimizer
Superscalar
Determine Dependences
Dataflow
Determine Dependences
Determine Independences
Indep. Arch.
Determine Independences
Bind Operations to Function Units
VLIW
Bind Operations to Function Units
Bind Transports to Busses
Bind Transports to Busses
TTA
Execute
B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel: History overview, and perspective.
The Journal of Supercomputing, 7(1-2):9-50, May 1993.
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
16
What Is the Compiler’s Target ``ISA’’?
Compiler
Frontend and Optimizer
Determine Dependences
Determine Independences
Hardware

Superscalar
Dataflow
Indep. Arch.
Bind Operations to Function Units
VLIW
Bind Transports to Busses
Determine Dependences

Determine Independences

Bind Operations to Function Units
Bind Transports to Busses


TTA
Execute
B. Ramakrishna Rau and Joseph A. Fisher.
Instruction-level parallel: History overview, and
perspective. The Journal of Supercomputing,
7(1-2):9-50, May 1993.
UCB – November 8, 2001



Target is a range of
architectures and their building
blocks
Compiler reaches into a
constrained space of silicon
Explores architectural
implementations
O(days – weeks) of design
time
Exploration sensitive to
application specific hardware
modules
Fixed function silicon is the
result
Verification NRE costs still
there
One approach to overcoming
time to market
Krishna V Palem
Georgia Tech
17
18
Choices of Silicon
High level
design/synthesis
(EDIF) Netlist
Fixed silicon implementation
Standard cell design etc.
UCB – November 8, 2001
Emulated i.e.
Reconfigurable target
Krishna V Palem
Georgia Tech
19
Reconfigurable Computing
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
FPGAs As an Alternative Choice for Customization
 Frequent (re)configuration and hence frequent
recustomization
 Fabrication process is steadily improving
 Gate densities are going up
 Performance levels are acceptable
 Amortize large NRE investments by using COTS
platform
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
20
21
Major Motivations
 Poor compilation times
– Lack of correspondence between standard IR and final
configurations
– Place and route inherently complex
 Would like to have compile times of the order of
seconds to minutes
 Provide customization of data-path by automatic
analysis and optimization in software
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
22
Adaptive EPIC
Adaptive Explicitly Parallel Instruction Computing
Krishna V. Palem and Surendranath Talla, Courant Institute of
Mathematical Sciences; Patrick W. Devaney, Panasonic AVC American
Laboratories Inc.
Proceedings of the 4th Australasian Computer Architecture Conference,
Auckland, NZ. January 1999
Adaptive Explicitly Parallel Instruction Computing
Surendranath Talla
Department of Computer Science, New York University
PhD Thesis, May 2001
Janet Fabri award for outstanding dissertation.
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
Compiler-Processor Interface
ISA
Source
program
format
Compiler
ADD
semantics
LD
format
semantics
Registers
Exceptions..
Executable
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
23
Redefining Processor-Compiler Interface
Source
program
ISA
format
mysub semantics
myxor
format
semantics
Compiler
Executable
Let compiler determine the
instruction sets (and their realization on chip)
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
24
25
EPIC execution model
Record of execution
Functional units
ILP
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
26
Adaptive EPIC execution model
Record of execution
ILP-1
Reconfigure datapath
Configured
Functional
units
ILP-2
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
27
What can be efficiently compiled for today?
FPGA
Complex ASIC
DPGA
RAW
RaPiD
CVH
Parallelism
TRACE (Multiscalar)
GARP
MultiChip
Adaptive EPIC
SMT
SuperSpeculative
SuperScalar Dataflow
Early x86
Simple ASIC
0
4
16
EPIC/VLIW
TTA
VECTOR
Simple
Pipelined/
Embedded
32
64
128-512
1K-10K
100K-1M
>1M
Approximate instruction packet size
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
28
Adaptive EPIC
 AEPIC = EPIC processing + Datapath reconfiguration
 Reconfiguration through three basic operations:
– add a functional unit to the datapath
– remove a functional unit from the datapath
– execute an operation on resident functional unit
RLA
F1
UCB – November 8, 2001
F2
Krishna V Palem
Georgia Tech
29
AEPIC Machine Organization
I-Cache
F/D
Configuration cache
CRF
GPR/FPR/…
Level-1
configuration
cache
Multi-context
Reconfigurable Logic
Array
Array Register File
L1
L2
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
30
AEPIC Compilation
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
31
Key Tasks For AEPIC Compiler
 Partitioning
– Identifying code sections that convert to custom instructions
 Operation synthesis
– Efficient CFU implementations for identified partitions
 Instruction selection
– Multiple CFU choices for code partitions
 Resource allocation
– Allocation of C-cache and MRLA resources for CFUs
 Scheduling
– When and where to schedule adaptive component instructions
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
32
Compiler Modules
Source program
Front-end processing
Partitioning
High-level
optimizations
(EPIC core)
Configuration
Library
Mapping
(Adaptive)
Pre-pass scheduling
Resource allocation
Machine
description
Post-pass scheduling
Code generation
Simulation
Performance statistics
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
33
Resource allocation for configurations
Reconfigurable logic
can accommodate
only two
configurations
simultaneously
Record of execution
B
G
R
UCB – November 8, 2001
Overlapping
live-range
region
RL
Processor
Krishna V Palem
Georgia Tech
34
Managing Reconfigurable Resource (contd.)

Simpler version related to register allocation problem

New parameters
– No need to save to memory on spill
– configurations immutable
– Load costs different for different configurations
– C-cache, MRLA = multi-level register file

Adapted register allocation techniques to configuration allocation
– Non-uniform sizes of configurations: graph multi-coloring
– Adapted Chaitin’s coloring based techniques
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
35
The Problem With Long Reconfiguration Times
Configuration
load time
Configuration
execution time
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
36
Speculating Configuration Loads
f1
f2
Speculated
configuration loads
Configuration
execution time
• Mask reconfiguration times!
• Need to know when and where to speculate
• If f1>>f2 do not speculate to “red” empty load slot
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
37
Sample Compiler Topics
 Configuration cache management
 Power/clock rate vs. Performance tradeoff
 Bit-width optimizations
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
38
More Generally Architecture Assembly
Applications

An ISA view

Synthesis and other
hardware design off-line

Much closer to compiler “Compiler” selects assembles
and optimizes program
optimizations implies
Build off-line
(synthesis, place
and route)
Program
Prebuilt
Implementations
faster compile time
Data path
Storage
Interconnect
Dynamically variable ISA
Architecture implementation
UCB – November 8, 2001
Also applicable to yield fixed
implementations in silicon
Krishna V Palem
Georgia Tech