
Predictable Programming on a
Precision Timed Architecture
Ben Lickly, Isaac Liu, Hiren Patel, Edward Lee, University of California, Berkeley
Sungjun Kim, Stephen Edwards, Columbia University, New York
Presented By
Ashutosh Dhekne
PhD Student, University of Illinois at Urbana-Champaign
Goal of the Paper
• Rethink processor architecture to provide predictable timing
• Why such a stance?
  • Current computers are optimized for average-case performance
  • Too many time-saving tricks complicate WCET analysis
• How to achieve it?
  • Exposed memory hierarchies
  • Thread-interleaved pipelining
  • Deadline instructions
[Figure: a CPU and RAM decorated with the average-case performance tricks – caching, pipelined execution, virtual memory, frequency scaling]
Words that Stick
[link]
The Familiar Architecture (x86)
[Figure: a CISC processor – instruction pipeline, ALUs, internal registers, task-switch registers, MMU, and cache – attached to main memory with paging, plus DMA-based I/O to an HDD; a cache try may hit (low latency) or miss (high latency), all of it transparent to the program]
External Material – Drawn from memory
The PRET Architecture
[Figure: a RISC processor with a thread-interleaved pipeline – thread controller, ALUs, MMU, and one register file per thread – plus scratchpad memory that is part of the memory address space; a memory wheel with slots 0–5 arbitrates access to main memory (code and data), DMA, and memory-mapped I/O. A callout marks the paper's innovations.]
The Memory Wheel
I am feeling lucky!
• Access Main Memory only through the Memory Wheel
• Memory access is slotted: each slot is 13 cycles long
• TDMA access can make the memory look busy to a thread even when no other thread is using it
• In the worst case, 90 cycles are required to access memory – a bounded worst case
Memory map of Main Memory (one wheel slot, 0–5, per thread):
0x00000000 – 0x00000FFF   Boot code used by each thread on startup; initializes all the registers
0x3F800000 – 0x3FFFFFFF   Shared data (8 MB) between multiple threads
0x40000000 – 0x405FFFFF   Thread-local instructions and data (1 MB per thread: 512 KB for instructions, 512 KB for data)
0x80000000 – 0xFFFFFFFF   Memory-mapped I/O
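Two quick arithmetic checks (my own, assuming the six wheel slots shown in the figure; the paper's exact cycle accounting may differ slightly):

$t_{\text{worst}} = \underbrace{(6 \times 13 - 1)}_{\text{just missed my slot}} + \underbrace{13}_{\text{access}} = 77 + 13 = 90 \text{ cycles}$

$\mathtt{0x40600000} - \mathtt{0x40000000} = \mathtt{0x600000} = 6\,\text{MB} = 6 \text{ threads} \times 1\,\text{MB}, \quad \mathtt{0x40000000} - \mathtt{0x3F800000} = \mathtt{0x800000} = 8\,\text{MB of shared data}$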
Instruction Pipelines
• Can we keep the pipeline always running?
• What about Data Hazards, Control Hazards, Structural Hazards?
[Figure: a classic five-stage pipeline diagram – Instructions 0–7 each pass through Fetch, Decode, Execute, Memory, Writeback (F D E M W), one stage per cycle, over cycles 0–11]
External Material – Drawn from memory
Thread-Interleaved Pipelines
• What if we thread-interleave the pipeline instead?
• Can we avoid all pipeline hazards?
[Figure: the same five-stage pipeline (F D E M W) over cycles 0–11, but fetch slots rotate round-robin over Threads 0–4 (then 0–2 again), so each thread has at most one instruction in flight]
Derived from: Precision Timed Machines, Isaac Liu
Hazardless Pipeline – Not Quite
• Can we ensure no hazards in a thread-interleaved pipeline?
• Always fill the pipeline with instructions from distinct threads (sketched after this slide)
  • No explicit control dependencies between threads – no control hazard
  • Long-latency instructions: prevent two from the same thread in flight – no data hazard
  • Very few concurrent threads: push in NOPs – no data hazard
  • Access to multi-cycle shared resources (e.g., memory) – still a structural hazard
  • TDMA access to the shared resources removes the timing dependencies
• Nonetheless, removing interdependence between pipeline units eases timing analysis
Derived from: Precision Timed Machines, Isaac Liu
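A toy sketch (mine, not the paper's) of why round-robin issue removes data hazards: with N threads and a five-stage pipeline, consecutive instructions from the same thread are N cycles apart, so for N >= 5 a thread's next instruction enters Fetch only after its previous one has written back.

    #include <stdio.h>

    #define THREADS 6   /* PRET-style six-thread round-robin */
    #define STAGES  5   /* F, D, E, M, W */

    int main(void) {
        const char *stage = "FDEMW";
        for (int cycle = 0; cycle < 12; cycle++) {
            printf("cycle %2d:", cycle);
            /* The instruction in stage s this cycle was fetched at
               (cycle - s) and belongs to thread (cycle - s) % THREADS,
               so the same thread never occupies two stages at once. */
            for (int s = 0; s < STAGES && s <= cycle; s++)
                printf("  T%d:%c", (cycle - s) % THREADS, stage[s]);
            printf("\n");
        }
        return 0;
    }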
Deadline Handling
[Figure: a task running against its deadline – the deadline may be hit or missed]
1A) Finish the task and detect at the end if the deadline was missed
1B) Immediately handle a missed deadline: preempt into a Deadline Miss Handler (marked Future Work)
2A) Continue with the next task
2B) Stall before the next task
Derived from: Precision Timed Machines, Isaac Liu
The Deadline Instruction
• A per-thread deadline register ti
• DEAD(x) blocks until ti reaches zero
• It then loads the value x into the register and executes the next instruction
• The deadline register is checked in the register-access stage, and the instruction is replayed until the register reaches zero
• The paper does not handle missed deadlines

Producer:

    int main() {
        DEAD(28);                        /* register ti is loaded with 28 */
        volatile unsigned int *buf =
            (unsigned int *)0x3F800200;  /* buffer in the shared-data region */
        unsigned int i = 0;
        for (i = 0; ; i++) {
            DEAD(26);                    /* wait here until ti reaches zero, then
                                            load 26; each loop iteration may wait again */
            *buf = i;
        }
        return 0;
    }
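The paper's second sample is the consumer side; its listing is not on the slide, so what follows is a minimal sketch of mine, assuming the same DEAD macro, the same shared buffer at 0x3F800200, a matching period of 26, and a made-up starting offset of 41 to phase-shift reads between the producer's writes. It doubles as an example of the "simple synchronization using deadline instructions" mentioned later.

    /* Hypothetical consumer sketch – not the paper's listing. */
    int main() {
        DEAD(41);                        /* made-up offset so reads land
                                            between the producer's writes */
        volatile unsigned int *buf =
            (unsigned int *)0x3F800200;  /* same shared-data buffer */
        unsigned int v = 0;
        for (;;) {
            DEAD(26);                    /* same period as the producer */
            v = *buf;                    /* read the latest value */
        }
        return 0;
    }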
Example Game
[Figure: three threads in a pipeline. The Game Logic Thread writes commands into double-buffered command queues (even/odd buffer): (1) a sync request announces new graphics are available, (2) the sync completes, (3) the command queues are swapped. The Graphics Controller Thread turns the commands into pixel data: (1) a VSync request asks to refresh the screen, (2) VSync occurs, (3) the frame buffer is swapped. The Video Driver Thread pushes pixel data out, sixteen pixels at a time, under VGA real-time constraints (VGA VSync time, VGA HSync time).]
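To make the VGA constraint concrete, here is a sketch of mine of what the video-driver loop might look like; DEAD is the deadline-instruction wrapper from the earlier listings, and PIXELS_PER_LINE, the MMIO address, and the per-batch cycle budget are placeholders, not the paper's values.

    #define PIXELS_PER_LINE 640
    #define BATCH            16
    #define BATCH_BUDGET    200   /* placeholder cycle budget per batch */

    /* Hypothetical MMIO pixel port in the memory-mapped I/O region. */
    static volatile unsigned int *const vga_out =
        (unsigned int *)0x80000100;

    void drive_line(const unsigned int *pixels) {
        for (int i = 0; i < PIXELS_PER_LINE; i += BATCH) {
            DEAD(BATCH_BUDGET);          /* pace each batch to the line timing */
            for (int j = 0; j < BATCH; j++)
                *vga_out = pixels[i + j]; /* emit sixteen pixels per batch */
        }
    }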
Experiences from the Two Samples
• It is possible to provide timing guarantees using the PRET architecture
• But timing calculations by hand are error-prone
  • Automated tools are promised as future work
• The underlying architecture lacks synchronization primitives
  • Simple synchronization can be achieved using the deadline instructions
Comparison with the LEON3
• Average-case execution-time degradation is studied
• PRET shows significant degradation when there are not enough parallel threads to fill its pipeline
• None of the special PRET features are used
• The degradation factor is below 6 – is that the advantage of a hazard-free pipeline?
Conclusions
• The paper builds a remarkable architecture as a SystemC model
• It introduces a new instruction for one type of deadline
• PRET keeps the memory hierarchy and its timing differences exposed to the user
• The model runs actual C programs and a small game
• The closing comparison between LEON3 and PRET is somewhat unfair
It is possible to modify a RISC processor to have predictable timing
Some Observations
• With a project of this scale, it is difficult to fit all the details in a paper
  • I had to refer to one of the authors' theses to gain insight
• The memory wheel assumes all threads use memory equally
• I would suggest shortening the LEON3 comparison and including more fundamental insights instead
• Overall, the work is commendable
  • It offers ideas not discussed in any previous paper
  • A true systems-level work
Can off-the-shelf architectures provide a strict WCET mode?
Thanks!