PhD Thesis – ICMC/USP - Merlintec Computadores

Adaptive compilation for an object-oriented and
reconfigurable architecture
Jecel Mattos de Assumpção Júnior
SERVIÇO DE PÓS-GRADUAÇÃO DO ICMC-USP
Data de Depósito:
Assinatura:______________________________
Adaptive compilation for an object-oriented and
reconfigurable architecture
Jecel Mattos de Assumpção Júnior
Advisor: Prof. Dr. Eduardo Marques
Doctoral dissertation submitted to the Instituto de
Ciências Matemáticas e de Computação - ICMC-USP,
in partial fulfillment of the requirements for the degree
of the Doctorate Program in Computer Science and
Computational Mathematics. EXAMINATION BOARD
PRESENTATION COPY.
USP – São Carlos
May 2015
SERVIÇO DE PÓS-GRADUAÇÃO DO ICMC-USP
Data de Depósito:
Assinatura:______________________________
Compilação adaptativa para uma arquitetura orientada a
objetos e reconfigurável
Jecel Mattos de Assumpção Júnior
Orientador: Prof. Dr. Eduardo Marques
Tese apresentada ao Instituto de Ciências Matemáticas
e de Computação - ICMC-USP, como parte dos
requisitos para obtenção do título de Doutor em
Ciências - Ciências de Computação e Matemática
Computacional. EXEMPLAR DE DEFESA
USP – São Carlos
Maio de 2015
Ficha catalográfica elaborada pela Biblioteca Prof. Achille Bassi
e Seção Técnica de Informática, ICMC/USP,
com os dados fornecidos pelo(a) autor(a)
A851a
Assumpção Júnior, Jecel Mattos de
Adaptive compilation for an object-oriented and
reconfigurable architecture / Jecel Mattos de
Assumpção Júnior; orientador Eduardo Marques. -- São
Carlos, 2015.
83 p.
Tese (Doutorado - Programa de Pós-Graduação em
Ciências de Computação e Matemática Computacional) - Instituto de Ciências Matemáticas e de Computação,
Universidade de São Paulo, 2015.
1. Arquitetura de Computadores. I. Marques,
Eduardo, orient. II. Título.
Resumo
A crescente complexidade dos sistemas embarcados tem aumentado o interesse no uso de linguagens orientadas a objetos dinâmicas, como Python ou Smalltalk, em sua implementação.
Para tornar isso prático, a redução do consumo de energia e dos custos dos recursos computacionais necessários para tais linguagens se tornou um tópico de pesquisa bem interessante.
Este projeto aborda estes temas via o projeto de um processador específico para a linguagem
Smalltalk-80, pela otimização deste processador para a compilação adaptativa, pelo uso do paralelismo tanto de alta como baixa granularidade para realizar mais com relógios de frequências
mais baixas e ao aproveitar a reconfigurabilidade das FPGAs (Field Programmable Gate Arrays,
que são componentes que estão cada vez mais presentes nos sistemas embarcados) para adaptar
o hardware durante o tempo de execução adequando-o a cargas computacionais variáveis.
Abstract
As the complexity of embedded systems grows, so does the attraction of using object-oriented
dynamic languages, like Python or Smalltalk, to implement them. To make this practical, a
reduction in the energy and cost of the required computing resources for such languages has become a hot research topic. This project addresses these issues by designing a processor specifically for Smalltalk-80, by optimizing this processor for adaptive compilation, by using both fine-grained and coarse-grained parallelism to do more work at lower clock speeds and by taking advantage of the reconfigurability of Field Programmable Gate Arrays (which are increasingly
present in embedded systems) to adapt the hardware at runtime to variable computing loads.
Contents

1 Introduction
   1.1 Project Goal
   1.2 Contributions
   1.3 Organization

2 Theory and Related Works
   2.1 Language-Specific Processors
      2.1.1 Algol computers
      2.1.2 SYMBOL
      2.1.3 Smalltalk computers
      2.1.4 Lisp Machines
      2.1.5 Forth processors
      2.1.6 Java Computers
   2.2 Adaptive Compilation
      2.2.1 Evolution of Adaptive Compilation
      2.2.2 Uses of Adaptive Compilation
   2.3 Parallelism
      2.3.1 Shared Memory
      2.3.2 CSP and Occam
      2.3.3 Asynchronous Messages
      2.3.4 Synchronism by Necessity
      2.3.5 Linda
      2.3.6 Concurrent Aggregates
   2.4 Reconfiguration
      2.4.1 Dynamic and Partial Reconfiguration
   2.5 Summary

3 Implementation
   3.1 Language-Specific Processor: SiliconSqueak
      3.1.1 Level 1 caches and virtual level 2 caches
      3.1.2 Microcode cache
      3.1.3 Bytecode and data caches
      3.1.4 Stack cache
      3.1.5 Virtual registers
      3.1.6 Fetch
      3.1.7 PIC
   3.2 Adaptive Compilation: Cog and Sista
   3.3 Parallelism: ALU Matrix coprocessor
   3.4 Reconfiguration: runtime reload
   3.5 Summary

4 Experimental Results
   4.1 Language-Specific Processors
   4.2 Adaptive Compilation
   4.3 Parallelism
   4.4 Reconfiguration
   4.5 Summary

5 Conclusion
   5.1 Future Work
      5.1.1 Experiments
      5.1.2 Smalltalk Zero
      5.1.3 Multi-level Tracing and Partial Evaluation
      5.1.4 Wafer Scale for Massive Parallelism
      5.1.5 Native Compilation to Hardware
      5.1.6 Non Von Neumann Architectures
   5.2 Discussion and Limitations

Bibliography

A SiliconSqueak Assembly Language
      A.0.1 000: Registers t0 to t31
      A.0.2 001: Registers i0 to i31
      A.0.3 010: Registers s0 to s31
      A.0.4 011: Registers x0 to x31
      A.0.5 100: Pseudo Registers #0 to #31
      A.0.6 101: Pseudo Registers #-1 to #-32
      A.0.7 110: Pseudo Registers #o0 to #o31
      A.0.8 111: Registers L0 to L31
   A.1 Directives
      A.1.1 org expression
      A.1.2 def name expression
   A.2 Syntax
   A.3 Operations
      A.3.1 Arithmetic (00)
      A.3.2 Comparison (01)
      A.3.3 Logic (10)
      A.3.4 Shifts (11)
   A.4 Fetch
   A.5 Streams
   A.6 Context
   A.7 Thread
   A.8 Image
   A.9 System

B ALU Matrix Assembly Language
List of Figures

2.1 Dorado block diagram and microcode instruction format
2.2 J-Machine with 1024 processors
2.3 JOP block diagram
2.4 Programming language implementation techniques
2.5 Dynamic Compilation
2.6 How Polymorphic Inline Caches work
2.7 Adaptive Compilation
2.8 Parallelism models
3.1 Squeak’s implementation
3.2 SiliconSqueak pipeline stages
3.3 Organization of the ALU Matrix coprocessor
3.4 Switching between different FPGA configurations
3.5 Time to execute code on different FPGA configurations
4.1 Cog generated code for a PIC with 6 different types
4.2 PIC for SiliconSqueak with any number of different types
4.3 PIC causes these cache entries for six types
5.1 Slang code for pushTemporaryVariableBytecode
5.2 Microcode for pushTemporaryVariable 3 bytecode

List of Tables

2.1 Parallelism Models
A.1 Registers for Stream Units
B.1 ALU Matrix operations
CHAPTER 1
Introduction
Of the various ways of classifying high level computer programming languages, a significant
one is the division into static (including Fortran, Pascal, C, Modula-2, Ada, Occam, and C++)
and dynamic (such as Lisp, Smalltalk, APL, Python, Ruby, Lua, the various Unix shell languages, JavaScript and TCL) languages. Given that Fortran was introduced in 1957 and Lisp
in 1958 (published, but the first implementation was in 1960) this division is nearly as old as
computing itself. Some languages, such as Cecil and Dart, allow the programmer to select how
dynamic a given application will be while other languages, like Java, mix characteristics from
both sides.
Static languages allow for simpler compilers and have small runtime software infrastructures
that make good use of available computing resources, and so are popular where the cost of
such resources is a major factor. That is the case at both the high end (multi-million dollar
supercomputers) and the low end (embedded systems costing just a few dollars or less).
In contrast, dynamic languages save programming time at the cost of wasting some computing resources. A traditional way of getting the best of both worlds is to develop a prototype
of an application using a dynamic language, such as Lisp, and when it is fully tested rewriting
the application from scratch in a static language, like Fortran. The obvious alternative is to reduce the gap in resource requirements between the two alternatives and this has been an active
research topic since the 1970s.
Dynamic translation (Deutsch e Schiffman, 1984), known as “JIT compilation” in the Java
world, and adaptive compilation (Hölzle, 1994) greatly reduced the runtime performance gap
at a cost of an increase in other computing resources, such as memory. While adaptive compilation technology has been very successful, it is very demanding in terms of
engineering resources so even some popular dynamic languages are still implemented as simple interpreters. This was the case for JavaScript up to 2008, when Google, Microsoft and the
Mozilla Foundation entered a performance race for that language backed by significant financial
resources.
An older research direction was the development of computer architectures optimized for the
features present in high level languages. The brief commercial success of the Lisp Machines in
the late 1980s and early 1990s was the most visible result of this research, but their replacement
by Unix graphical workstations and then generic PCs (partly due to the success of technologies
similar to adaptive compilation) has mostly eliminated interest in language-specific architectures. When energy use and not just top performance is used as a metric, however, this research
direction can still be useful (especially in combination with, instead of as an alternative to, other
technologies).
Reducing energy use of computing systems is a top priority in embedded and portable systems (where battery life must be maximized), in supercomputers and data centers (where the
electric bill is by far the main operating cost) and even in regular PCs (since performance is
limited by the need to not melt down the processor – this is known as the “power wall”). A
well known way to increase the amount of computation per energy unit while maintaining performance is the use of parallelism. The limited number of components available in the past and
the lack of established programming models other than the sequential one had limited extensive
parallelism to specialized applications. The new energy limitations and high component budgets are now making parallelism the norm whether programmers are ready or not. Fortunately,
languages based on message passing (such as Occam, Smalltalk-80 and Erlang) can be more
easily adapted for parallelism than those based on direct manipulation of shared memory (“imperative languages”, such as C and most others). Additionally, a computer architecture designed
for such languages can better integrate support for parallelism.
Field programmable electronic circuits had been available since the 1970s, but the introduction of Field Programmable Gate Arrays (FPGA) in the mid 1980s and the continuous growth
in their capacity, made possible by “Moore’s Law” (Moore, 1965), allowed the creation of
reconfigurable computers. These machines can have their hardware changed, possibly at runtime, to be optimized for different applications. FPGAs were once mostly used for prototypes
and high-end communication systems due to their high cost, but have recently become more
common in mainstream products including cost sensitive embedded systems. Runtime reconfiguration could be one way to get more computing done with less energy but there are a few
obstacles. Adaptive compilation can address one of these obstacles since it uses some of the
same kind of runtime information that is needed to trigger the reconfigurations.
Another way of classifying computer language environments which is particularly relevant
for embedded applications is the contrast between native development and cross development.
In the native case the same computer that is used to create a program is also the one to run
it. The language can be either dynamic or static. For cross development the two machines are
separate and this is only interesting for static languages since it allows the computer that only
runs the program to be more limited (and so cheaper) because the runtime environment can
be very simple. As memory has become cheaper and low-end 32 bit microcontrollers cost the
same as traditional 8 bit ones, very complex runtime environments (such as complete Linux
operating systems) are becoming normal. This eliminates the most significant advantage of
cross development, and so eliminates a barrier to using dynamic languages.
1.1 Project Goal
The thesis is that it is possible to extend the use of object-oriented dynamic languages to areas
in computing that have been traditionally limited to static languages through a combination of a
language-specific processor architecture, support and use of adaptive compilation, parallelism
and reconfiguration. The focus of this project is on embedded systems, but the results should
also apply to supercomputing and other areas.
1.2 Contributions
The project incorporates a number of inventions which extend the state of the art in computer
architecture and language implementation and also uses novel combinations of existing techniques. Some of the most notable inventions are:
architecture for bytecode-based language implementations: SiliconSqueak is an overall processor architecture with features to speed up the interpretation of bytecodes and to efficiently support adaptive compilation by offering a lower level “microcode” as a compiler
target
PIC instruction and special instruction cache: Polymorphic Inline Caches (PIC) are a central feature of adaptive compilation as they both speed up message sends and accumulate
type information for later compilation phases. SiliconSqueak’s instruction cache can handle parallel search of all possible receiver types in contrast with the sequential search on
conventional processors
stack cache: register windows (used in the original RISC and RISC II, SOAR, Sparc, AMD29000
and Altera’s NIOS 1) are very effective in reducing the costs of calls and returns but must
be flushed on every thread switch, which is not the case for SiliconSqueak’s stack cache
virtual level 2 caches: the hardware implements the cache refill mechanism by transferring
data between the internal level 1 caches and the region of memory designated as the
virtual level 2 cache. Software handles misses in the virtual L2 caches, and so gets to
define the policy used, adding an interesting level of reconfigurability to the system
runtime reconfiguration for adapting processing grain: a given area in an FPGA can implement a number of simple SiliconSqueak cores for coarse-grain parallelism or a single
SiliconSqueak core with an attached ALU Matrix coprocessor for fine-grain parallelism.
Which is most efficient varies at runtime and reconfiguring the FPGA accordingly will
result in the most computing per energy unit
1.3 Organization
Each of the main chapters in this text talks about the four topics of this project in this order:
language-specific processors, adaptive compilation, parallelism and hardware reconfiguration.
• Chapter 2 serves as an introduction and historical overview for the topics in the form of a
comparison with related works.
• A description of the project and details about the implementation can be found in Chapter 3.
• Chapter 4 shows the experiments designed to validate the implementation and the results
of those experiments.
• Finally, the conclusion is presented in Chapter 5, which also mentions plans for the
project’s future.
CHAPTER 2
Theory and Related Works
For each of the topics related to this project, a historical overview and a description of the state
of the art are presented in the form of references to related projects. The history presented is
not meant as an exhaustive survey of each area, but as a selective overview to provide a context
for the following chapters.
2.1 Language-Specific Processors
One characteristic of a practical programming language is that it should be “Turing complete”.
That allows it to, in theory, do anything that can be done in any other language. In practice, it
might be so inefficient or so awkward to emulate one language’s feature in another that nobody
actually does it. This is known as the “Turing tarpit”. While in theory there should be no
reason to optimize a processor architecture for a given programming language, in practice much
research has been dedicated to doing just that.
Biases are very hard to perceive from within the contexts in which they occur. Many architectures considered to be “universal” or “language neutral” (which is presented as a very
desirable design goal) are not really such but happen to be optimized for the languages their users
are interested in. The original member of the extremely popular x86 architecture, the Intel
8086, could run assembly and Pascal code very well but was limited when running C, which
happened to be much less popular at the time. The programmer could write proper C code if
it could fit in 64KB of memory, wasting the rest of the 1MB address space. Or non-standard notations called “far pointers” could be used to access all memory, but at the cost of becoming incompatible with C on all other computers. So it simply was not possible to port C code between
a PC and a DEC Vax computer, for example, because what PCs could run wasn’t really C. The
next member of that architecture was the 80286 and it was optimized to run Modula-2 (Pascal’s
successor) and similar languages which were the most popular when the project was started but
had lost out to C by the time the product reached the market. It was the full and proper support
for C in the 80386 by extending the architecture from 16 to 32 bits that saved Intel from losing
the microprocessor wars.
The 8086 was actually a quick and dirty project meant to fill in for the delayed iAPX432
architecture, Intel’s first 32 bit design. That processor was optimized to run programs written
in the Ada programming language created to standardize military applications. Many of its
features, especially in the area of security, are unmatched by modern alternatives. Unfortunately,
performance was very disappointing due to the limitations of integrated circuit technology at
the time and a lack of optimizations. With the current negative perception of language-specific
architectures, Intel has rewritten history in all its official documents so that the 80386 is now its
first 32 bit design and the iAPX432 is never mentioned at all.
2.1.1 Algol computers
The definition of the Algol-60 language generated interest in the development of computer
architectures that could efficiently implement this language’s features. Nearly all computer
architectures of the time were one-address machines with a single accumulator (a design inherited
from the days of mechanical calculators). Researchers trying to actually develop compilers for
Algol proposed that a stack machine would better match the nested blocks of that language.
The most successful effort was the Burroughs B5000, created by the group led by Bob Barton in 1961. The B5500 version (1964) implemented 8-bit instructions which operated
with tagged data, which has been the model used in the design of virtual machines for the
Smalltalk-76 (and later) and Java programming languages. Stacks not only simplified the compiler and made object code more compact, but they also became an important part of virtual
memory and multi-tasking.
The English Electric KDF9 was another stack machine released in 1964 with the goal of
efficiently running Algol-60.
2.1.2 SYMBOL
This architecture, published in 1971, defined a high level language called SYMBOL and designed hardware that could run it directly. The compiler and even text editor and debugger were
integrated into the hardware. Though it did not have much of an impact, this project was the
most extreme attempt to close the “semantic gap” between high level programming languages
and hardware.
2.1.3 Smalltalk computers
This project includes the design of an architecture for running Smalltalk-80 programs, so the
various Smalltalk computer projects of the past are the most closely related.
Xerox
The Alto computer (Thacker & al., 1981) developed in the Xerox Palo Alto Research Center
(PARC) is an example of the use of microcode to reconfigure computers. Alan Kay’s group at
PARC called the machine the “Interim Dynabook” for its role as a research vehicle for future
commercial (and portable, as in laptops and tablets) machines, allowing software development
to be started as early as possible. It was the test platform for the Ethernet network, as a file
server, a print server and a graphical workstation for thousands of researchers within Xerox and
a few collaborating institutes.
Of the improved versions that followed, the one with the highest performance was the Dorado. Its block diagram (as seen by a programmer working at the microcode level) is shown
in figure 2.1. Beyond the architectural advances relative to the Alto, it was the use of high-speed bipolar Emitter Coupled Logic (ECL) technology which gave the Dorado its speed. Unfortunately it also caused a very high energy consumption (and, as a consequence, it was very noisy) and made it very bulky, besides costing nearly ten times as much as the Alto. But for many years it was considered the best Smalltalk computer in the world.

Figure 2.1: Dorado block diagram and microcode instruction format
Swamp, Sword32 and AI32
Given the very detailed specification of the Smalltalk-80 virtual machine, it was unavoidable
that some groups would try to implement it as a physical microprocessor. Swamp (Lewis & al.,
1986) was an academic project while Sword32 and its successor AI32 (Nojiri & al., 1986) were
developed by Hitachi as a commercial product. As with the iAPX432 object-oriented processor
(Intel’s first 32 bit architecture launched in 1979) these projects suffered due to the limitations
of integrated circuit technology of the time, especially with regard to internal memory. Even so,
the results that were obtained were very interesting when compared with the Dorado.
SOAR - Smalltalk On A RISC
After the success of a student project at Berkeley to design a processor to run programs written
in C (the success was such that the project name, Reduced Instruction Set Computer or RISC,
became the generic term for this whole kind of architecture) a follow up project studied the
possibility of using the same ideas to execute object-oriented languages. Smalltalk On A RISC
(SOAR), described in Samples & al. (1986) and Ungar & al. (1984), abandoned the bytecodes
used until then as a target for translating Smalltalk source code in favor of an instruction set with 32
bits. Many of the results of this project were the inspiration for the creation of Sun’s Sparc
and the experience gained had a great impact in the software implementation of object-oriented
languages, as described in section 2.2.
Mushroom
The design of a Smalltalk computer at the University of Manchester (Williams, 1989) was the
first to make use of FPGAs for reconfigurability (very limited due to the small sizes of the
available components). The original plan was to build a bytecode based machine like Swamp,
but the positive results of the implementation of the Self programming language on conventional
hardware encouraged the adoption of a RISC style instruction set.
The most interesting feature of this project was its memory system. The cache used virtual
addresses instead of physical ones such that on a cache hit there was no need to consult the
object table. This combined the advantages of having a table with the performance of direct
pointers. In addition, the garbage collector was integrated with the virtual memory and with the
cache. A new object would always be created in the cache and could eventually be collected
without ever having been written to main memory (not to mention being swapped to disk).
J-Machine
The goal of Bill Dally’s “Jellybean Machine” (or J-Machine) project at MIT was to research
the consequences of having processors as cheap as jellybeans. It is a good example (like the
Alto and Dorado) of using Moore’s Law to obtain today a computer that will be typical in the
future. Figure 2.2 shows the 1024 processor version, each one with local memory and high
speed communication channels to each of its six neighbors (the overall architecture is a 3D
mesh).
A version of Smalltalk called ConcurrentSmalltalk (there was another project in Japan with
exactly the same name) was developed, but it used a Lisp syntax. Active messages were implemented, which are sent as they are being built and, when received, directly invoke the code that is to process them.
One of the most interesting experiments in the software side of the project was the invention
of concurrent aggregates. These allow a very simple transformation of sequential code into parallel code by replacing the use of traditional data structures (aggregates) such as Array with the functionally equivalent ParallelArray. So the bulk of the complexity is dealt with by the creator of
such classes and not by their users. The success was so great that a new language, Concurrent
Aggregates (CA), was created to explore this concept further.
Figure 2.2: J-Machine with 1024 processors
2.1.4 Lisp Machines
The use of some Altos at the artificial intelligence laboratory at MIT inspired researchers to
create a similar computer but optimized for the execution of the Lisp programming language.
Before that the Digital Equipment Corporation (DEC) PDP-10 computer had been considered
ideal for this type of application. But the high cost of its operation meant that it had to be shared
among many users, which combined with its address space limit of 256K words of 36 bits
(around 1MB) to make it hard to run large applications. By using instead a processor dedicated
to each user and able to address 16M words, the Lisp Machine was meant to allow advances in artificial intelligence research. With such a focused project it was possible to optimize the architecture even more for Lisp (by using tagged data like in the B5500, for example).
The technology was licensed to two companies founded by researchers from MIT itself:
Symbolics and Lisp Machine Inc (LMI). Later on they were joined by Texas Instruments. This
was a great incentive to move artificial intelligence applications, especially expert systems, from
the laboratory into the field. The increase in the capabilities of Unix workstations coincided with
a strong reduction in the market for expert systems (the so called “AI winter”) and this caused
the Lisp Machine market to collapse. Less than a decade later these same Unix workstations
were eliminated by the increasing performance of regular PCs.
2.1.5 Forth processors
In Philip J. Koopman (1989) the following 16 bit architectures optimized for the Forth programming language are compared: WISC CPU/16, MISC M17, Novix NC4016 and Harris RTX
2000. Later chapters compare these 32 bit processors: FRISC 3, RTX 32P, SF1. The Novix
4016 was particularly interesting because it was the first hardware effort by Chuck Moore, the
creator of the Forth programming language. Though Forth’s two stack virtual machine is almost
trivial to implement in normal processors, Moore felt an optimized design could be simpler,
cheaper and more energy efficient. A more commercially successful version of this architecture
is the Harris RTX 2000, several of which were recently used to land a probe on a comet.
While the Novix design took advantage of the microcode style instruction format to combine multiple high level Forth instructions into a single machine word, the following efforts
replaced this with independent and narrow instructions. The five bit instructions are enough for
all Forth primitives, forming a Minimum Instruction Set Computer (MISC). Other researchers
have explored different designs for Forth processors, though variations on MISC are very popular. The recent J1 went back to the Novix style but taking advantage of the features of modern
reconfigurable chips (FPGAs).
2.1.6 Java Computers
The rising popularity of the Java programming language led to research into architectures optimized for it. The most important virtual machines nearly always become real machines, as has been the case for Lisp, Smalltalk, Forth, Prolog (the famous Fifth Generation Project in Japan back in the 1980s) and Modula-2; even UCSD Pascal had commercial hardware versions.
The first, and most famous attempt came from Sun itself (Java’s creator) in the form of
PicoJava. It used some very sophisticated instruction grouping techniques to work around the
traditional performance limitations of stack based architectures. This project was also the company’s first release of sources to the public, initially in a rather restricted form but later as free
hardware.
Studies such as Narayanan (1998) evaluated the difficulties of implementing Java in hardware. Many of the results obtained are valid for any object-oriented language that uses bytecodes. The following attempt by Sun was called MAJC and replaced bytecodes with an instruction set typical of media processors in a way very similar to what had been done in the
SOAR project. The delay in this product’s release reduced the performance advantages relative
to Sun’s own Sparc processor and this caused it to be eliminated.
The Java Oriented Processor (JOP) (Schoeberl, 2005) was an academic project that became
relatively successful commercially. The data flow part of its design is illustrated in figure 2.3
and was the third attempt by its creator. The first one was inspired by SOAR’s results and
was a simple RISC processor that ran the Java virtual machine entirely in software. The second
version added to the RISC just a few instructions to speed up stack manipulation, which resulted
in a significant increase in performance. The third (and final) version replaced the RISC with microcode, though this microcode is very different from the traditional one as it uses very narrow instructions that directly correspond to the simplest and most popular Java bytecodes.

Figure 2.3: JOP block diagram
The evolution of JOP in terms of performance does not mean that speed is the project’s
priority. The focus is on real time applications and consistent run times are more important than
small run times. The small FPGA area required by a single JOP processor has encouraged
research into multicore solutions.
2.2 Adaptive Compilation
The Squeak Smalltalk language is also the operating system for this project. It inherited its syntax and semantics from Smalltalk-80, and more recently also advanced implementation technologies such as adaptive compilation from Self.
2.2.1 Evolution of Adaptive Compilation
Smalltalk-72 (Goldberg e Kay, 1976) was the first implementation of a purely object-oriented
programming language (Simula was a hybrid language). It was extremely dynamic and, as
a result, very slow. Even the syntax itself was defined at runtime, which made compilation
impossible and made it hard to understand programs written by other people. A radical change was made for the 1976 version of the language, as shown in figure 2.4: a simple and fixed syntax was defined and the implementation was split into two phases – the source text was compiled to the machine language of a "virtual machine" and then an interpreter simulated this virtual machine on the physical computer at runtime. The virtual instructions define a simple stack machine (plus some more complicated instructions for message sending) and are called "bytecodes" due to their length. Most current Smalltalk implementations use an interpreter for the virtual machine and for this reason pure object-oriented languages are considered inefficient by many people.

Figure 2.4: Programming language implementation techniques (interpreter: Smalltalk-72, traditional JavaScript and many others; virtual machine: Smalltalk-76 and -80, embedded Java, UCSD Pascal)
The implementation of Squeak Smalltalk (Ingalls & al., 1997) in 1996, made use of the
fact that Goldberg e Robson (1983) included a complete implementation of the virtual machine
written in Smalltalk-80 itself. But the style of that code did not make use of many Smalltalk
features so that it could be used as a reference for an implementor who wished to write the
equivalent in Pascal or C. For the Squeak project, a tool was created that could automatically
translate this subset of Smalltalk (called Slang to distinguish it from normal Smalltalk and with
no relation to other languages named Slang) to C.
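To give a feel for this style, the sketch below shows an invented Slang-like method (as a comment) and the kind of C such a translator could emit; the method, the names and the translation are hypothetical illustrations and are not taken from the actual Squeak virtual machine sources.

    #include <stdint.h>

    /* Hypothetical illustration of Slang-style translation.
     * A Slang method is written as ordinary Smalltalk, restricted to
     * constructs with a direct C equivalent, for example:
     *
     *   fetchByte
     *       | byte |
     *       byte := memory at: instructionPointer.
     *       instructionPointer := instructionPointer + 1.
     *       ^byte
     *
     * A translator in the spirit of Slang could turn it into plain C
     * that manipulates the interpreter's global state directly:
     */

    static uint8_t  memory[1 << 16];      /* simplified object memory */
    static uint32_t instructionPointer;   /* interpreter register     */

    static int fetchByte(void) {
        int byte = memory[instructionPointer];
        instructionPointer = instructionPointer + 1;
        return byte;
    }

    int main(void) {
        memory[0] = 42;
        instructionPointer = 0;
        return fetchByte() == 42 ? 0 : 1;  /* trivial self-check */
    }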
Dynamic Compilation
Some projects tried, in the early 1980s, to implement the virtual machine directly in hardware.
They did not achieve the same increase in performance as obtained by Smalltalk On A RISC
(SOAR) Ungar & al. (1984) at the University of California at Berkeley. This group included
some rather minimal hardware support (some of which is included in Sun’s Sparc processors)
and compiled the source text directly to native machine code. An interesting technique used in
this system is the “inline cache” invented by L. P. Deutsch (Deutsch e Schiffman, 1984) (the
top two parts of figure 2.6 show how it works). This replaces the costly search in every message send with a simple subroutine call whenever possible. This works well in code where the
flexibility of Smalltalk’s polymorphism isn’t really used.
The inline cache works by compiling a message send initially into a subroutine call to code
that does the search for methods. At runtime when the selected method is found, the original
call to search is replaced with a direct call to the method that was found. The call actually goes
to a short header which verifies that the receiver’s class is the expected one (and falls back to
the search if not). So the next time this code runs the system expects the search to produce the
same result as the previous time.
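A minimal sketch of this patching idea, written in C instead of generated machine code and with invented names: the call site starts out bound to the lookup routine, the lookup patches it to point at the method it found, and the method's short header re-checks the receiver's class and falls back to the lookup when it does not match.

    #include <stdio.h>

    typedef struct { int class_id; } Object;

    static void lookup_and_call(Object *receiver);   /* full method search */
    static void print_point(Object *receiver);       /* a compiled method  */

    /* The "call site": initially bound to the generic lookup routine. */
    static void (*send_print)(Object *) = lookup_and_call;
    static int cached_class = -1;                     /* class seen last time */

    static void print_point(Object *receiver) {
        if (receiver->class_id != cached_class) {     /* inline cache guard   */
            lookup_and_call(receiver);                /* wrong class: search  */
            return;
        }
        printf("a Point\n");                          /* the method body      */
    }

    static void lookup_and_call(Object *receiver) {
        /* Pretend the method dictionary search selected print_point. */
        cached_class = receiver->class_id;            /* remember the class   */
        send_print   = print_point;                   /* patch the call site  */
        printf("a Point (found by lookup)\n");
    }

    int main(void) {
        Object p = { 7 };
        send_print(&p);  /* first send: goes through the lookup and patches     */
        send_print(&p);  /* later sends: hit the inline cache, no search needed */
        return 0;
    }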
A serious problem with directly compiling Smalltalk text to RISC machine code was the
size of the resulting executables. A Smalltalk system has a lot of code and an expansion of
four times the size can actually reduce performance due to an increase in the virtual memory
activity. A compromise is the use of dynamic translation (Deutsch e Schiffman, 1984), called
“dynamic compilation” (or Just In Time or JIT compilation in the Java world). The source text
is translated to bytecode and the first time that a method is called it is not interpreted but rather
translated to native machine code which is saved in a special software cache and then the code
is executed. The next time the same method is invoked it can be directly executed from the
cache at a considerable gain in average performance. If the cache becomes too full then some
methods can simply be discarded. They will have to be recompiled if they are called again later.
Given the current speed difference between processors and disk, this solution is faster than trying
to save the methods that are eliminated from cache.
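The overall flow can be sketched as follows (names and structures are invented for illustration, not taken from any particular VM): the first invocation of a method translates it and stores the result in a software code cache, later invocations reuse the cached code, and when the cache fills up entries are simply discarded and recompiled on demand.

    #include <stdio.h>
    #include <string.h>

    #define CACHE_SIZE 4

    typedef void (*NativeCode)(void);

    typedef struct {
        const char *selector;   /* which method this entry holds           */
        NativeCode  code;       /* "translated" native code for the method */
    } CacheEntry;

    static CacheEntry code_cache[CACHE_SIZE];

    static void compiled_hello(void) { printf("hello from native code\n"); }

    /* Stand-in for the bytecode-to-native translator. */
    static NativeCode translate(const char *selector) {
        printf("translating %s...\n", selector);
        return compiled_hello;
    }

    static void invoke(const char *selector) {
        for (int i = 0; i < CACHE_SIZE; i++)          /* hit: reuse the code  */
            if (code_cache[i].selector &&
                strcmp(code_cache[i].selector, selector) == 0) {
                code_cache[i].code();
                return;
            }
        for (int i = 0; i < CACHE_SIZE; i++)          /* miss: translate once */
            if (!code_cache[i].selector) {
                code_cache[i].selector = selector;
                code_cache[i].code = translate(selector);
                code_cache[i].code();
                return;
            }
        /* Cache full: discard everything; methods are recompiled on demand. */
        memset(code_cache, 0, sizeof code_cache);
        invoke(selector);
    }

    int main(void) {
        invoke("hello");   /* translated on first use */
        invoke("hello");   /* served from the cache   */
        return 0;
    }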
The Self Language and Compiler Technology
The Self programming language (Ungar e Smith, 1987) is a Smalltalk variation based on a
smaller, but more general set of concepts. It can be described as a prototype based language with
message passing as the most basic operation. Self objects are structured as a set of associations
between names and values. When an object receives a message it looks up the association
with the message name and either simply returns the associated value if it is a normal object,
executes the code if it is a method object or changes a corresponding association (or “slot”) if
it is an assignment. The slot with the name “x:” changes the value of the slot with the name
“x”, for example. In this model each object completely describes itself without the need of
the concept of classes. The lack of classes implies that objects can only be created by copying
(cloning) pre-existing objects (called prototypes) and then changing them. The assignment slots
are sufficient for changing the state of an object, but the language includes a few “primitives”
(like “AddSlot:” and “DeleteSlot:”) which allow more fundamental changes to an object.
To avoid code duplication, an object can “inherit” from one or more parent objects. When
a message is received and the object does not have a corresponding slot, the search continues
(recursively) in the parents but when a method is executed the object that received the message
originally is used as the context no matter where the method was found. Parents are indicated
using slots that have their names ending in “*”. As these parent slots also work as normal
slots it is possible to use assignment to replace parents at runtime. This is known as “dynamic
inheritance” and shows how flexible this model is, but is not a feature that has proved very
useful.
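The lookup just described can be sketched roughly as below: an object is a set of name/value slots, some marked as parents, and a message not found locally is searched for recursively in the parents (the part where the found method still runs with the original receiver as "self" is not modeled). This is a simplified illustration, not the actual Self implementation.

    #include <stdio.h>
    #include <string.h>

    #define MAX_SLOTS 8

    typedef struct Obj Obj;

    typedef struct {
        const char *name;
        int         is_parent;  /* parent slots are searched on lookup misses */
        Obj        *object;     /* the slot's value (everything is an object) */
    } Slot;

    struct Obj {
        const char *label;
        int         nslots;
        Slot        slots[MAX_SLOTS];
    };

    /* Look for a slot named msg in obj, then recursively in its parents. */
    static Obj *lookup(Obj *obj, const char *msg) {
        for (int i = 0; i < obj->nslots; i++)
            if (strcmp(obj->slots[i].name, msg) == 0)
                return obj->slots[i].object;
        for (int i = 0; i < obj->nslots; i++)
            if (obj->slots[i].is_parent) {
                Obj *found = lookup(obj->slots[i].object, msg);
                if (found) return found;
            }
        return NULL;
    }

    int main(void) {
        Obj printMethod = { .label = "print method" };
        Obj traitsPoint = { .label = "traits point", .nslots = 1,
                            .slots = {{ "print", 0, &printMethod }} };
        Obj aPoint      = { .label = "a point", .nslots = 1,
                            .slots = {{ "parent*", 1, &traitsPoint }} };

        Obj *m = lookup(&aPoint, "print");   /* not local, found in the parent */
        printf("found: %s\n", m ? m->label : "nothing");
        return 0;
    }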
While the Smalltalk virtual machine looks like a typical computer with a few extensions,
the first implementations of Self were very radical in their exclusive use of message passing.
The Smalltalk code “x := x + 1” would be translated to:
    push instance variable (x)
    push constant (1)
    send message (+)
    pop and store in instance variable (x)
The equivalent in Self would be written as “x: x + 1” which means “self x: self x + 1”
(hence the language’s name), which would be translated to these bytecodes:
    self send message (x)
    push constant (1)
    send message (+)
    self send message (x:)
As message sending is the bane of implementations of purely object-oriented programming languages, it would seem that Self would be even slower than Smalltalk, and the advantages of this style (a much simpler virtual machine with only eight bytecodes and the incentive for a programming style with greater code reuse) would not be sufficient to make up for this loss in performance.

Figure 2.5: Dynamic Compilation (VisualWorks Smalltalk, Self 1 and 2, Java JIT, Squeak Cog; timeline comparing interpreted code, native code from the Self 1.0 compiler and optimized native code from the Self 2.0 compiler)
With dynamic compilation, however, all message sends that would access a data slot or an
assignment slot could have their effects directly included in the generated native code for a speed
increase of 4 to 160 times (Chambers, 1992). This is only possible because the receiver for these
messages is “self” which has its type known at runtime (when the compiler is called). When
inheritance is taken into account, this is no longer true because “self” is of the type of object that
originally received the message and not the object where the method was found. One solution is
to generate different versions of the native code for each case where the same original method is
inherited, which is known as customized compilation (Chambers e Ungar, 1989). This scheme
is only practical because the compiler is only invoked for the cases that actually show up during
execution which are a tiny fraction of all possible combinations. Without this customization
Self would be three times slower since few optimizations would work equally in the different contexts in which a method can be called.

Figure 2.6: How Polymorphic Inline Caches work (code as initially compiled calls the global lookup routine; after the first call the send is patched to call the found method directly, guarded by a check of the receiver's type; when new types of objects show up, the send is redirected to a PIC that switches on the receiver type, for example point, rectangle or circle, and falls back to the lookup routine by default)
Polymorphic Inline Cache and Adaptive Compilation
As the sophistication of Self’s compiler grew (Chambers e Ungar, 1990) (Chambers & al.,
1989), performance came closer and closer to highly optimized languages such as C but the
interactive use became worse as the compiler-induced pauses became longer, as shown in Figure 2.5. Since Self programs have no explicit information about types, a compiler has to work very hard to extract all implicit information in order to generate high quality code. An interesting optimization used by the Self compiler is an extension of the inline cache previously used in
Smalltalk: the Polymorphic Inline Cache (PIC) (Hölzle & al., 1991) is shown in figure 2.6. This
replaced the call to the method header (originally a call to the search routine) with a sequence
of type tests, similar to the “switch” statement in C. If the message receiver has the same type
as one of the previous sends of this message, then the correct native code can be called directly.
If not, the normal search occurs and then the PIC is extended to include the new type. This
extends the advantages of the inline cache to polymorphic message sends as well, and these are very common in applications written in an object-oriented style. If many different types appear at a given call site, the PIC can either start to eliminate the older types or the PIC itself can be abandoned in favor of a direct call to the search routine (which ends up being faster for these "megamorphic" call sites).

Figure 2.7: Adaptive Compilation (Self 3 and 4, Java HotSpot, StrongTalk: a fast non-inlining compiler first produces native methods with PICs, and their type feedback later drives a simple inlining compiler that produces optimized native methods)
A side effect of the PICs is that they accumulate information about the types of objects
that actually appear in different places in the compiled code. This information is a subset of
that obtained by the compiler through sophisticated analysis as mentioned before. This makes
“adaptive compilation” a particularly effective strategy: the methods are initially compiled by
a quick and dirty compiler and executed for a while, and then recompiled using PICs as a
source of type information. Both the first and second compilers can be simple and fast as seen
in Figure 2.7. Only methods that are extensively used need to go through the recompilation
process, which dedicates processor time to the parts of the program with the greatest effect.
With adaptive compilation the code ends up calibrated for the actual runtime conditions,
including the characteristics of input data. The native code generated for an application might
be different, for example, depending on whether the user is editing black and white images or
colored ones. This technology can make use of information that would be too costly to obtain
statically.
From the programmer’s viewpoint there is only one type: the object. But the compiler must
be more detailed if it is to generate acceptable code. Observation of actual use patterns in Self shows that most objects are exact clones of some other object except for the values in their
data slots. A practical implementation can separate objects into two parts: one with just the
values of the data slots and another with all the rest (called the object’s “map”). When an object
is cloned, only the first part needs to be copied. Sharing maps between many objects not only
saves a lot of memory space but also makes up for the lack of classes by allowing the compiler
to treat these “clone families” as its notion of type. The full flexibility of Self remains since any
object can be changed at runtime using the programming primitives. In this case a new map is
created and the changed object ends up being the start of a new clone family, which leaves all
other instances of its previous type exactly as they were.
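A rough sketch of this split, with invented structures: each object carries a pointer to a shared map describing its layout, and cloning copies only the slot values, so all members of a clone family share one map, which the compiler can then use as its notion of type.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_SLOTS 4

    /* The map holds everything that is identical across a clone family. */
    typedef struct {
        const char *slot_names[MAX_SLOTS];
        int         nslots;
    } Map;

    /* An object is just a map pointer plus its own slot values. */
    typedef struct {
        Map *map;
        int  values[MAX_SLOTS];
    } Object;

    static Object *clone(const Object *proto) {
        Object *copy = malloc(sizeof *copy);
        copy->map = proto->map;                 /* the map is shared, not copied */
        memcpy(copy->values, proto->values, sizeof proto->values);
        return copy;
    }

    int main(void) {
        Map point_map = { { "x", "y" }, 2 };
        Object origin = { &point_map, { 0, 0 } };

        Object *p = clone(&origin);             /* same clone family      */
        p->values[0] = 3;                       /* only the values differ */

        printf("same map (same type for the compiler): %s\n",
               p->map == origin.map ? "yes" : "no");
        printf("p has %d slots, x = %d\n", p->map->nslots, p->values[0]);

        free(p);
        return 0;
    }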
2.2.2 Uses of Adaptive Compilation
The most direct use of adaptive compilation technology is the performance gain for object-oriented
languages. After the development of Self, part of that research group left Sun to create a company called Animorphics in order to make commercial use of that technology. They developed
a high performance Smalltalk (called StrongTalk) and created a demonstration of a high performance Java implementation. As the second compiler only deals with the most critical code, the
technology was named “Hot Spot”. Sun bought the company and the technology was incorporated into Java 1.2.
One problem in parallel systems is to match the degree of parallelism of the application with
that of the hardware. If many more processes are created than there are physical processors, performance is wasted in switching between processes. If too few processes are used,
then part of the machine will remain idle. As adaptive compilation takes into account actual
runtime conditions, it can be used to adapt the level of parallelism (de Assumpção Júnior, 1994).
The first compiler would generate code that is as parallel as possible and during execution the
PICs would accumulate not only type information but also about useless task switching. The
second compiler could eliminate the excess parallelism while doing its optimizations. A similar
alternative is presented in Diniz e Rinard (1997), which is to alternate between the evaluation
and production phases to select the best strategy for compiling critical code fragments in parallel
applications.
2.3 Parallelism
With the creation of integrated circuits and their increasing density it became obvious that it
would be possible to eventually put a whole processor into a single chip. With the low cost of
these microprocessors it became possible to increase a computer's performance by simply adding
more processors. This caused a significant increase in research into parallel programming techniques in the 1980s. The problems encountered combined with the exponential increase in
performance of single microprocessors (due to a combination of ever higher clock rates and
the use of larger transistor budgets for elaborate architectural tricks) killed off the interest in
such research in the 1990s. Around 2004 the excessive heat dissipation made increasing clock
rates impractical while at the same time the architectural tricks yielded smaller and smaller results (Asanovic & al., 2006). The solution was to use the extra transistors to place additional
processors on a single die. This made the results of the 1980s research relevant once more.
Many parallelism models have been developed and in Marr (2013) a survey of the models
and their use in current programming languages was undertaken to identify a minimum set of
primitives upon which all these models (and, hopefully, any future ones) can be implemented.
Table 2.1 separates the models into those that have been used for a long time (indicated as
“prior art”), those that are normally implemented as normal library functions for programming
languages, those that can be used to increase performance and, finally, those that require semantic support for their implementation.
Table 2.1: Parallelism Models (Asynchronous Operations, Atomic Primitives, Agents, APGAS, Active Objects, Atoms, Barriers, Co-routines, Concurrent Objects, Condition Variables, Critical Sections, Fences, Global Address Spaces, Global Interpreter Lock, Green Threads, Immutability, Event-Loops, Events, Far-References, Futures, Guards, MVars, Message Queue, Parallel Bulk Operations, Join, Locks, Reducers, Memory Model, Method Invocation, Single Blocks, State Reconciliation, Race-And-Repair, Thread Pools, Thread-local Variables, Threads, Volatiles, Wrapper Objects, Actors, Asynchronous Invocation, Clocks, Data Movement, Axum-Domains, Data-Flow Graphs, By-Value, Data-Flow Variables, Channels, Fork/Join, Data Streams, Implicit Parallelism, Isolation, Locality, Map/Reduce, Mirrors, Message sends, One-sided Communication, Non-Intercession, Persistent Data Structures, Ownership, PGAS, Replication, Vector Operations, Side-Effect Free, Speculative Execution, Transactions, Tuple Spaces, Vats)
2.3.1 Shared Memory
The parallelism model closest to the sequential model familiar to most programmers is implemented by simply having two or more processors use the same memory. To avoid
errors due to the interference among the processors, some explicit synchronization structures
(such as semaphores) must be used. These simpler mechanisms are hard to use correctly, which
has led to the development of more sophisticated systems such as monitors and transactions
(which are in turn implemented in terms of semaphores or equivalent). When computers have
cache memories it becomes more complicated to maintain coherency, which limits the scalability of this model. With multicore chips now including eight or more processors these limits
are becoming a problem. The semaphore model is used in Squeak Smalltalk, but it was not
considered sufficient for this project.
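A minimal sketch of this model using POSIX threads: two threads update the same counter and a mutex (standing in for the semaphore) serializes the updates; forgetting to take the lock is exactly the kind of error that makes this model hard to use correctly.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                   /* memory shared by both threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);         /* explicit synchronization       */
            counter++;                         /* the protected critical section */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld\n", counter);    /* 200000 with the lock in place */
        return 0;
    }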
2.3.2 CSP and Occam
An alternative that requires more radical changes relative to traditional programming is the
implementation of the system as a set of sequential programs connected through fixed communication channels. This Communicating Sequential Processes (CSP) model was proposed by Hoare
(1978). The Occam programming language was designed for the Transputer microprocessor
and both were built around this model. The popular Message Passing Interface (MPI) library
can be used to structure parallel applications like this. Erlang and Ada are two languages which
use variations of synchronous message passing and this model can also be found in the design
of microkernel based operating systems and Remote Procedure Call (RPC) libraries like in the
Common Object Request Broker (CORBA) standard.
2.3.3 Asynchronous Messages
The family of Actor languages (Hewitt & al., 1973) use a model of unidirectional messages.
While the previous model suspends execution at the sender until a reply has been received, with
asynchronous messages the sender continues executing as soon as the message has been sent.
To receive a reply the other process must send a second message back to the first one, which
is only possible if the identity of the sender has been included as one of the arguments in the
original message.
While asynchronous messages transfer information, they can’t be used as a synchronization
mechanism. Information about events must be implemented separately, as an additional layer
on top of messages. This model is extremely flexible, but this greatly increases the possibility
of programming errors which is why the previous model tends to be more popular.
2.3.4
Synchronism by Necessity
The parallelism model described in de Assumpção Júnior (1993) is the “synchronism by necessity”
originally created for the Eiffel programming language by Caromel (1993). This model is compared with
the previous two in Figure 2.8. When a message is sent, a “future object” is immediately returned as a
temporary result and execution continues at the sender just like with asynchronous messages, which
tends to allow more parallelism. When the other process finally returns a reply, the future object is
replaced by the new result. If any process tries to use the future object before this transformation,
it must stop executing until that happens. This means that the semantics are exactly the same as with
synchronous messages, which combines the advantages of the two previous models.

Figure 2.8: Parallelism models (synchronous messages, asynchronous messages and wait by necessity)
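The behavior of the future object can be sketched in ordinary Squeak as a proxy that blocks on its
first real use; FutureValue below is purely illustrative (a real implementation would subclass
ProtoObject so that nearly every message reaches doesNotUnderstand:):

    Object subclass: #FutureValue
        instanceVariableNames: 'result ready'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Sketch-Parallelism'.

    FutureValue >> initialize
        ready := Semaphore new

    FutureValue >> fulfill: anObject
        "Called by the worker process when the reply finally arrives."
        result := anObject.
        ready signal

    FutureValue >> doesNotUnderstand: aMessage
        "First attempt to use the future: wait for the reply, then forward the message to it."
        ready wait.
        ready signal.                 "let any other waiting processes through as well"
        ^ aMessage sendTo: result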
2.3.5 Linda
Linda is a coordination language rather than a programming one, which means its primitives can be added
to any sequential language to transform it into a parallel version, such as Linda Basic, Linda Pascal,
Linda C or Linda Smalltalk. The idea is that a set of processes, possibly on different machines, use
these primitives to access, in a controlled way, a shared associative memory built on data tuples as
described in Gelernter and Bernstein (1982). All communication is indirect and based on the contents of
the tuples. As with MPI, the Linda primitives can be used both for synchronous and asynchronous
messages as needed.
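A hedged sketch of how these primitives might look from Smalltalk, assuming a hypothetical TupleSpace
class with the classic out: (write), in: (blocking take) and rd: (blocking read) operations; nothing
below is part of standard Squeak:

    | space |
    space := TupleSpace new.                             "hypothetical class"
    space out: (Array with: #task with: 7).              "write a task tuple into the space"
    [ | tuple |
      tuple := space in: (Array with: #task with: nil).  "take a matching tuple (blocking)"
      space out: (Array with: #result with: tuple second squared) ] fork.
    Transcript show: (space rd: (Array with: #result with: nil)) printString; cr.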
2.3.6 Concurrent Aggregates
In the J-Machine parallel Smalltalk computer (Noakes & al., 1993), one of the parallelism models used is
based on concurrent aggregates. Besides the traditional Array class of Smalltalk-80, for example,
ConcurrentSmalltalk also includes a ConcurrentArray. The memory of such structures is spread out through
the local memories of the processors and all operations are done in parallel, but in a coordinated way.
In Ungar and Adams (2010) some very similar data structures are used, but the focus is on exploring the
possibility of increasing parallelism in exchange for a reduction in the precision of the replies.
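The effect of such an aggregate can be approximated in plain Squeak; the sketch below (an illustration
only, not ConcurrentSmalltalk code) applies a block to every element in a separate process and joins
before answering:

    parallelCollect: aBlock over: anArray
        | results done |
        results := Array new: anArray size.
        done := Semaphore new.
        1 to: anArray size do: [:i |
            [results at: i put: (aBlock value: (anArray at: i)).
             done signal] fork].
        anArray size timesRepeat: [done wait].
        ^ results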
2.4 Reconfiguration
The first electronic computer, ENIAC from the University of Pennsylvania (1946), was “programmed” by reconfiguring its hardware. Cables like in the old telephone exchanges would
connect different parts to solve specific problems. For a different problem the cables had to be
connected in a different way. The following generation of computers used the so-called “Von Neumann
architecture” (still used by current computers), where programming was done by simply changing numbers
in the machine’s memory. The configuration of the hardware remained fixed independently of the problem
to be solved.
Unlike traditional computers, a reconfigurable machine has its hardware adjusted to improve
performance for a specific program, as proposed by Gerald Estrin in 1960 (Estrin, 2002). At
another moment the hardware might be adjusted in a different way to solve a second problem.
Where there is no limit on cost or energy consumption it is possible to use large scale reconfigurable
computers to solve problems typical of supercomputers. For embedded systems, where these limits are
quite strict, reconfigurable computers are an interesting alternative to traditional processors and
digital signal processors (DSPs).
The invention of reconfigurable integrated circuits, especially Field Programmable Gate Arrays (FPGAs),
allowed the creation of a wide variety of reconfigurable computers, as shown in Compton and Hauck
(2002). The main advantage of this kind of machine is the use of components that are available
commercially at a competitive cost. This did not eliminate research into slightly different
architectures with building blocks that are larger than the tiny lookup tables (LUTs) used by FPGAs. In
Hartenstein (2001) the alternatives described have complete arithmetic and logic units as the building
block. The ADRES architecture (Bouwens & al., 2007) is a more structured variation of this idea.
2.4.1 Dynamic and Partial Reconfiguration
One of the most interesting features of the XC6000 family from Xilinx was dynamic and partial
reconfiguration. This was originally developed as an academic project and later commercialized by
Algotronix before being bought by Xilinx. The configuration bits could be addressed as normal memory by
some external device, such as a microprocessor. These bits could be changed even during normal
operation of the FPGA and only the part being addressed would be affected by the change.
When the XC6000 was retired, many of the most interesting projects of evolvable hardware
and reconfigurable computing became more complicated. In systems with multiple FPGAs,
such as Teixeira (2002), it is possible to change the configuration of one of the FPGAs without
disturbing the others. The introduction of the Virtex family by Xilinx brought back a more
restricted form of partial reconfiguration which has been available in all of that company’s
components since then. The configuration file was no longer a big blob of bits but rather a
sequence of independent frames. Each frame starts with an address to indicate which part of
the FPGA it is meant to configure, so the frames can be sent out of order as described in Xilinx
(2007).
For certain components, such as the Virtex II and newer, Xilinx guarantees that a partial configuration
which follows certain compatibility rules with respect to the previous configuration can be loaded into
the FPGA without disturbing operation in the rest of the chip. This allows dynamic reconfiguration
where part of the FPGA continues working normally while a different part is being changed. In
components in which the ICAP hardware block is present, external hardware is no longer needed since the
FPGA itself can generate the bits for a frame.
In the Virtex, Virtex II and early Spartan generations a frame would configure a whole
column. Since the input and output blocks surround the chip, each column has two such blocks
(one at the top and another at the bottom) in addition to the logic blocks. Other columns include
RAM blocks. Starting with the Virtex 4 each column only contains a single type of block and,
additionally, a frame configures a fixed fraction of a column. The frame address indicates a
column, the top or bottom half, and then a unique region of 16 logic blocks within that half.
In Sedcole (2006) the FPGA’s capability of actually calculating the bits to configure part
of a column is used to allow the reconfiguration of even finer grained regions (shorter than 16
logic blocks). The exclusive or function shows the difference between the old configuration and
the desired one, and it avoids changing bits for areas that are not meant to be affected.
For partial reconfiguration to work, it is necessary that the communication between the
block that is being changed and the rest of the system be standardized. A very limited set of
“bus macros” must be used for this and the software tools must be aware of how things connect.
The designer needs to have more detailed control over the operation of these tools compared to
normal projects.
2.5 Summary
Both language-specific processors and parallelism were hot topics in the 1980s but were mostly
ignored in the following decade. They have become interesting once more in the context of
modern computing, which is why most references in this area are either very old or very new.
Adaptive compilation and reconfigurable computing became significant in the 1990s and have
become increasingly relevant since then.
Few projects combined two or more of these topics, which made it possible to present each
history completely separately.
CHAPTER 3
Implementation
In this chapter, the design choices for this project are presented and justified in terms of the theory.
Each topic will be discussed separately even though there is a lot of synergy between them: the choices
for reconfiguration, for example, may depend heavily on the choices for parallelism. Often the choices
in a given area are only available at all because of a choice made in a different area.
The general philosophy for the project was to reuse as much existing technology as possible,
not only to save time but also to align the project with the goals of existing communities. The
need to push beyond the state of the art to achieve this project’s goals, however, demanded the
invention of several new techniques.
3.1 Language-Specific Processor: SiliconSqueak
SiliconSqueak is a processor architecture created specifically to run programs written in the
Squeak Smalltalk language. Squeak is implemented as a bytecode-based virtual machine.
While this VM is still evolving, as described in Section 3.2, the basic design has remained
unchanged since 1976 which makes it an attractive target for a hardware implementation.
Figure 3.1: Squeak’s implementation (the interpreter, object memory and primitives are written in
Slang; the Slang code can run as a simulated image inside Squeak for development, be translated to C
and compiled with gcc for conventional hardware and an OS, or run as bytecodes directly on
SiliconSqueak)
Figure 3.1 shows that the Squeak VM can be divided into roughly three parts: the bytecode
interpreter, the primitives and the memory manager. The source for these components is written
in Slang, a subset of Smalltalk-80 that can easily be translated into C or a similar language. This
allows the code to be developed and debugged within Squeak itself, so that the translation only has to
happen once the code is stable. So the code in the VM exists in four different forms:
hand written Slang, bytecodes (used for debugging and development) translated from Slang,
C translated from Slang and machine language for some processor (translated from C). This
means that if an architecture has some trick for running bytecodes at a very high speed then the
machine language code is not needed, as the bytecodes would yield the same result.
The primitives and even the memory manager could be run as bytecodes, but not the bytecode interpreter itself since that would lead to infinite recursion. In theory the bytecodes don’t
have to be Turing complete since they can rely on the primitives to implement things like integer arithmetic. That is the case for Self and Little Smalltalk, for example. Fortunately, Squeak
bytecode has a set of instructions that are redundant with the send bytecodes but are meant to
save space. So instead of compiling a normal send bytecode with access to a literal #+ as the
message name, there is an arithmetic plus bytecode that doesn’t waste space in the method with
such a common selector. So the hardware can know that an addition is being requested. The full
semantics is that if both arguments are SmallIntegers then the hardware can directly add them
(taking into account the presence of tag bits) and if not then it is exactly like a send bytecode.
When this instruction is generated from Slang it is always the case that both arguments are
SmallIntegers, so the bytecodes can be considered to be Turing complete.
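The SmallInteger fast path can be sketched as follows, assuming the usual Squeak convention of a 1 in
the low bit of an oop marking a SmallInteger; the selector and the fallback helper are illustrative,
not the actual VM or hardware description:

    addOop: rcvrOop with: argOop
        ((rcvrOop bitAnd: 1) = 1 and: [(argOop bitAnd: 1) = 1])
            ifTrue: [
                "both tagged SmallIntegers: add the raw words and subtract one copy
                 of the tag so the result stays tagged (overflow check omitted)"
                ^ rcvrOop + argOop - 1]
            ifFalse: [
                "otherwise the bytecode behaves exactly like a send of #+"
                ^ self sendPlusTo: rcvrOop with: argOop    "hypothetical send path"]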
Figure 3.2: SiliconSqueak pipeline stages (fetch bytecode, fetch microcode, calculate addresses, fetch
operands, execute and write back, operating with the bytecode, microcode, stack and data caches, the
microcode RAM and the optional ALU Matrix coprocessor)
The bytecodes for the memory management and the primitives can run on the interpreter
supported by the hardware shown in Figure 3.2. The bytecodes can be divided into five categories: push, pop, send/return, jump, and arithmetic (as mentioned above). SiliconSqueak has
a lower level machine code that is called microcode (see Appendix A). The push, pop and jump
bytecodes always correspond to just one or two microcode instructions. So do the arithmetic
bytecodes when compiled from Slang, though in the general case they have to handle everything
the send bytecodes do. The functionality of the send and return bytecodes is very complex, but
when compiled from Slang they can be simple call/return instructions or even eliminated by
using inlining. This is what breaks the infinite recursion: since the bytecode interpreter is known to
be written in Slang, the send/return bytecodes used in its own implementation can be inlined away.
The hardware that supports the bytecode interpreter is shown in Figure 3.2. It is a pipeline
with five and a half stages that operates with four cache memories. In a sequence of bytecodes where each one corresponds to a single microcode instruction, a new bytecode can start
executing on every clock cycle.
3.1.1 Level 1 caches and virtual level 2 caches
The four level 1 caches allow the system to distinguish between different kinds of memory
access and to do different optimizations of each one. The level 2 caches don’t actually exist,
but are just special memory regions managed by the software. There are three level 2 virtual
caches: objects, microcode and stack. The first one is shared between the bytecode level 1 cache
and the data level 1 cache. When there is a miss in any of the level 1 caches, the hardware will
attempt to load the missing data from a block of memory in the virtual level 2 cache. That might
also miss, in which case a special software handler is called to update the virtual level 2 cache
and the load to level 1 cache is retried. The performance of the software is not critical since
it is only invoked on level 2 misses and it effectively does the job of a reconfigurable memory
management unit. It defines whether object tables or direct pointers are used as well as details
about the garbage collector.
3.1.2 Microcode cache
At least 4KB of the microcode cache is protected from being flushed automatically and it holds
the code fragments associated with each bytecode. The starting address for the microcode
interpretation of a given bytecode is simply the value of that bytecode multiplied by 16. This
potentially wastes cache space, but it avoids the cost of looking up the start address. The rest of
the cache is organized as three sets. A level 2 microcode cache miss means that a PIC entry or
the native code for some method is missing. The compiler is invoked to deal with this situation.
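Since there are 256 possible bytecode values, this dispatch rule fills exactly the protected 4KB if the
addresses are in bytes: each bytecode then owns a 16 byte slot, that is, four 32 bit microcode
instructions. A one-line illustration of the rule (the selector is hypothetical):

    microcodeEntryFor: bytecode
        ^ bytecode * 16      "e.g. bytecode 16r76 dispatches to microcode address 16r760"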
3.1.3 Bytecode and data caches
Since Squeak methods are just normal objects, the bytecode and data caches share a single level
2 cache. It is the software’s job to maintain coherency between them by flushing the bytecode
cache whenever a method is loaded into the data cache.
3.1.4 Stack cache
The stack cache is organized as a doubly linked list of small, fixed-size blocks. The lists from
different threads can be interleaved both at level 1 and level 2, so there is no need to flush either
cache when switching between threads, as is necessary in processor architectures with register windows.
3.1.5 Virtual registers
The microcode instruction set is very similar to a RISC and seems to apply an operation to the
values of two registers and store the result in a third register. The registers are organized as four
groups of 32 registers each for the first operand and for the destination, and eight groups for the
second operand. But most of these groups are not implemented as physical registers at all
but instead are specially mapped regions of the caches. This is reflected by the stage to calculate
addresses in the pipeline between the fetch microcode and fetch operands stages. So SiliconSqueak is effectively a memory to memory architecture rather than a load/store one, but having
this closely coupled with the caches combines the advantages of both kinds of architectures.
3.1.6 Fetch
Each microcode instruction not only selects three “registers” and an operation, but also explicitly
defines how the next instruction is to be fetched. The “next” option is the most common and allows
sequences of microcode instructions. The last instruction of a sequence can use the fetch8 option,
which warns the bytecode fetch stage to calculate the starting address for the next bytecode’s
microcode. This effectively forms a zero overhead inner loop for the bytecode interpreter.
3.1.7 PIC
Two of the fetch options use the result of the instruction to indicate the type of a message
receiver. They combine this with the address of the current instruction (or a specified one) to
probe a unique cache entry. This implements an infinitely growable PIC with a constant access
time, which is something that can’t be matched by software implementations.
3.2 Adaptive Compilation: Cog and Sista
The advanced adaptive compilation technology developed for Self had an important limitation:
the effort to port it to a new processor was considerable. The project started using 68000-based
Sun workstations but moved to Sparc-based ones as soon as these became popular. It became
difficult to keep supporting the 68000 processor and soon Self was Sparc only. A port to the
PowerPC was done when Macintosh laptops became practical while a port to the x86 languished
for nearly a decade before Apple’s switch to that family gave the Mac laptop users the incentive
to finish it.
Squeak was created to be portable through its technology of translating the Slang sources
into C and using a simple interpreter. With nothing in the code being processor specific, it
was quickly ported to a number of different processors. Several projects have tried to bring
the advantages of Self technology to Squeak without losing the portability. Ian Piumarta, who
had originally ported Squeak from the Macintosh to various Unix workstations and Linux on
the PC, created a simple dynamic compiler that used Forth-like threaded code as its target. His
second effort was more Self-like and targeted the PowerPC and x86 architectures, but it never
became a standard part of Squeak.
Another project was Exupery by Bryce Kampjes. It used Smalltalk code to translate bytecodes into x86 machine language, and had some modifications in the VM to allow this code
to be invoked. The goal was to explore interesting compilation techniques such as continuation passing representations but the project could be expanded into an adaptive compiler if type
feedback were added to it.
Sista (Speculative Inlining Smalltalk Architecture) is a project to rewrite code at the bytecode level
to implement a series of optimizations. The idea is that if the underlying system compiles the
bytecodes instead of interpreting them, the combination of the two technologies can have results
competitive with optimizing at the native code level at a fraction of the complexity.
Cog, by Eliot Miranda, is the current dynamic compilation project for Squeak that has been
adopted by Pharo, Squeak and NewSpeak. Its development followed a series of steps so usable
results could be available at the end of each step:
Closure compiler: Smalltalk-80 inherited from Smalltalk-76 an odd implementation of blocks
that made certain kinds of programming awkward. Since other Smalltalks had dumped
this trick for proper closures, this step actually made Squeak more compatible rather than
less
Stack Interpreter: Full Context objects for every message send and return are not only costly
in themselves but also stress the garbage collector. This version uses the processor’s
native stack whenever possible
Stack Compiler: The interpreter is replaced with a simple compiler that translates each bytecode individually to x86 code. As a result, the stack semantics is fully preserved
Register Compiler: By dealing with groups of bytecodes at once it is possible to generate code
that makes better use of the registers
Spur: This is a redesign of the memory management part of the Squeak VM to take advantage
of 64 bit processors
There are many more steps planned and related projects such as the port of Cog to the
ARM processor. One of the steps is the integration of Cog and Sista to move from a dynamic
compiler to an adaptive one. Cog already includes all the necessary hooks in the form of PICs
and performance counters.
The initial adaptive compiler for this project is a port of Cog to SiliconSqueak similar to the
ARM port. The change from one stack to two is very simple, but the use of the PIC fetch option
affects a lot of code.
3.3 Parallelism: ALU Matrix coprocessor
One level of parallelism for SiliconSqueak is the use of multiple cores in a single chip. An
expansion of this kind of parallelism can be obtained by connecting two or more such chips
using high speed communication channels such as those shown in Figure 3.4. The message
passing hardware allows remote memory and cores to be easily accessed.
For intensive numerical algorithms there is a different level of parallelism in the form of a
coprocessor tightly coupled to a single SiliconSqueak core.
For the initial implementation in this project, an ALU Matrix coprocessor with 64 ALUs
was selected. This is shown in Figure 3.3. Each ALU can execute instructions that do one
data transfer and one operation with two registers in a single clock cycle. There are 64 types
Figure 3.3: Organization of the ALU Matrix coprocessor (each 8 bit ALU has 32 registers, a multiplier
with high and low result halves, rA and rB inputs from the data path, and connections to its Up, Down,
Left and Right neighbors)
of operations, though most are variations on addition and subtraction (signed or unsigned, normal or
saturating, absolute differences). The 8 bit wide ALUs have carry bits to allow for concatenation into
wider words, up to 512 bits wide, either horizontally or vertically.
The selection of 8 bits as the granularity of the ALUs was based on the balance between
flexibility and the amount of RAM needed for the instructions. Such RAM was also the motivation for the choice of 64 ALUs, since this configuration needs roughly the same number of
Block RAMs as two simpler SiliconSqueak cores (which have 20KB of cache each).
3.4 Reconfiguration: runtime reload
With an initial configuration as in the left of Figure 3.4, the newly generated hardware would
replace some of the existing cores both in terms of FPGA area and as an element in the ring
networks. If that particular processor was exclusively executing code that will now be done
by hardware, there will be a gain in performance. If, on the other hand, it was also executing
Figure 3.4: Switching between different FPGA configurations (coarse grain processing with several
SiliconSqueak cores versus fine grain processing with the ALU Matrix coprocessor; each node keeps its
router/switch, SDRAM controller, FIFOs, USB and HDMI interfaces and links to other nodes across the
reconfiguration)
unrelated code that must now be moved to the other cores then there will be a performance
loss no matter how much faster the hardware is than the optimized code. The scheduler should
group related code under heavy loads to make it simpler to detect the situation where a software
block has one or more cores dedicated to it and so is a candidate for a hardware replacement.
Given that an FPGA that is being reconfigured does not execute anything, the scheduler
should deal with time frames N times longer than this inactive period. Besides the reconfiguration time itself, there is the time needed to save all current state to external memory and then the
time to restore it (adapting to the new configuration). Since a single core with an ALU Matrix
takes up the same FPGA resources as three simple cores, any code which doesn’t make use of
the coprocessor will run roughly three times slower. Any code that does take advantage of the
ALU Matrix (code generated by the new compiler), on the other hand, will run X times faster.
This is illustrated in Figure 3.5 where the high level code always takes the same amount of time
(either serially on a single core or in parallel with multiple cores) while the numeric code can
use the coprocessor to take far less time.
    N > 1.4 × (1 + (1 − α) × 3N + α × N/X)                                (3.1)

    α > (N/1.4 − 3N − 1) / (N/X − 3N)                                     (3.2)
Figure 3.5: Time to execute code on different FPGA configurations (three SiliconSqueak cores versus one
SiliconSqueak core with the ALU Matrix; low level and high level code shown separately)

Here α is the percentage of time that code that could use the coprocessor takes on the configuration
with three simple cores. To avoid needlessly switching back and forth between configurations, a factor
of 1.4 adds some hysteresis to the system. Equation 3.1 shows under what
conditions it is profitable to replace three simple cores with a single one having a coprocessor.
Equation 3.2 solves for α given X (notice that N/X − 3N < 0 given that X > 1). So if X = 6
(code becomes six times faster with the ALU Matrix) and N = 10 (the scheduling time frame
is ten times the reconfiguration time) then α > 84%.
    N > 1.4 × (1 + (1 − β) × N/3 + β × N × X)                             (3.3)

    β < (N/1.4 − N/3 − 1) / (N × X − N/3)                                 (3.4)
In equation 3.3 we have the condition where it is a good idea to replace a SiliconSqueak
core including an ALU Matrix with three simple cores. Here β is the percent of the time in
which code that uses the ALU Matrix executes in the original configuration (this is different from, but
related to, α). Given the same X = 6 and N = 10, then β < 5%.
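As a worked check of the two thresholds, the snippet below evaluates equations 3.2 and 3.4, in plain
Smalltalk and purely for illustration, with the values used in the text:

    | n x alphaMin betaMax |
    n := 10.0.  x := 6.0.
    alphaMin := ((n / 1.4) - (3 * n) - 1) / ((n / x) - (3 * n)).     "equation 3.2"
    betaMax  := ((n / 1.4) - (n / 3) - 1) / ((n * x) - (n / 3)).     "equation 3.4"
    Transcript show: 'switch to the coprocessor when alpha > ', alphaMin printString; cr.   "about 0.84"
    Transcript show: 'switch back to simple cores when beta < ', betaMax printString; cr.   "about 0.05"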
3.5 Summary
This chapter described the implementation details of SiliconSqueak, a unique processor architecture created for this project to optimize the execution of programs written in the Squeak
Smalltalk language. It explained how the Cog dynamic compiler can be combined with the
Sista bytecode to bytecode optimizer to implement an adaptive compilation system and how
they can make use of SiliconSqueak’s features. The option to use multiple SiliconSqueak cores
or fewer cores but with the ALU Matrix coprocessor allows the parallelism in the hardware to
match the parallelism of the application, and by taking advantage of reconfigurable hardware
this match can be maintained even as the application varies.
CHAPTER 4
Experimental Results
The materials and methods used for the experiments in this project are described here.
4.1 Language-Specific Processors
The basic cache size of 4KB was selected for SiliconSqueak as a good tradeoff between the typical sizes
of Block RAMs in a wide range of FPGAs and the logic needed to implement the
core’s functionality. The microcode cache needs an extra 4KB to hold the bytecode interpreter.
All of the caches can be increased for better performance when more memory is available
relative to logic.
4.2 Adaptive Compilation
The Cog compiler handles send bytecodes by generating inline caches in x86 code. The listing
in Figure 4.1 shows what the PIC looks like after sending messages to objects of six different
classes. Each new type encountered causes the code to be rewritten.
In contrast, SiliconSqueak implements PICs as shown in Figure 4.2. Actually, only the first two lines
are needed to trigger the PIC mechanism in the microcode cache. The rest is a
subroutine to extract the class for a given object pointer. Similar code is also needed for the Cog
case but is not shown in Figure 4.1 to save space.
Some of the code in the Cog version is also needed for SiliconSqueak, but it is spread out
through different entries in the virtual level 2 microcode cache. To match the example in the
Cog version, Figure 4.3 shows the cache lines that would be used when the same six types
are encountered during execution. Some memory is wasted because the cache lines are only
partially full.
4.3 Parallelism
To test the effect of fine grained parallelism, the same benchmark was written both in pure SiliconSqueak assembly language and in a mix of SiliconSqueak and ALU Matrix code. The benchmark
selected is the pixel region comparison using sum of absolute differences (SAD) since that was
implemented in hardware in de Assumpção Júnior (2009). The ALU Matrix can process pixels
at 8 times the rate of the basic core with the access to memory being the limiting factor.
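For reference, the operation being accelerated is simple; a scalar Smalltalk version, illustrative only
and assuming the two regions are same-sized ByteArrays of 8 bit pixels, is:

    sadOf: regionA with: regionB
        | sum |
        sum := 0.
        1 to: regionA size do: [:i |
            sum := sum + ((regionA at: i) - (regionB at: i)) abs].
        ^ sum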
4.4 Reconfiguration
The two development boards used for the experiments in this project were selected because they
allow reconfiguration to be triggered from within the FPGA itself. The Xilinx ML401 uses a
SystemACE chip to read configuration bitstreams from an attached Compact Flash memory card
formatted as a FAT32 file system with directories organized into up to eight different projects.
Simply writing a number from 0 to 7 into a special register in the SystemACE causes the whole
Virtex 4 FPGA to be reconfigured from a bitstream found in the selected directory.
The other board is a Parallella, created to demonstrate the use of the Epiphany 16 core
floating point chip. Attached to that is a Xilinx Zynq Z7020 chip with two ARM cores and
FPGA resources based on the Artix 7 family. When running Linux on the ARM processors, it is
possible to reconfigure the FPGA part by simply copying the bitstream file to /dev/xdevcfg. This
can be a full configuration, but the same mechanism can also handle partial reconfigurations.
The tricky part is keeping the functionality expected by Linux intact while changing other parts.
The time that SystemACE takes to reconfigure the whole Virtex 4 LX25 is just under four seconds, which
is not surprising given the several slow connections between the memory card and the FPGA. Copying a
bitstream from the SD memory card on the Parallella board using the Linux “cat” command can be done in
just 85ms, an improvement of 47 times, not even taking into account the difference in configuration
file sizes between the two FPGAs.
4.5 Summary
Initial testing validated the design choices made for this project.
5F98: nArgs: 0  type: 4  blksiz: A0  selctr: 3F199C=#basicNew  cPICNumCases: 6
00005fb0: xorl %ecx, %ecx
00005fb2: call .+0xffffa699 (0x00000650=cePICAbort)
00005fb7: nop
entry:
00005fb8: movl %edx, %eax
00005fba: andl $0x00000001, %eax
00005fbd: jnz .+0x00000010 (0x00005fcf=basicNew@37)
00005fbf: movl %ds:(%edx), %eax
00005fc1: shrl $0x0a, %eax
00005fc4: andl $0x0000007c, %eax
00005fc7: jnz .+0x00000006 (0x00005fcf=basicNew@37)
00005fc9: movl %ds:0xfffffffc(%edx), %eax
00005fcc: andl $0xfffffffc, %eax
00005fcf: cmpl %ecx, %eax
00005fd1: jnz .+0x0000000a (0x00005fdd=basicNew@45)
00005fd3: movl $0x00000000, %ebx
00005fd8: jmp .+0xfffff9b6 (0x00005993=basicNew@3B)
ClosedPICCase0:
00005fdd: cmpl $0x00139164=SharedQueue class, %eax
00005fe2: movl $0x00000000, %ebx
00005fe7: jz .+0xfffff9a6 (0x00005993=basicNew@3B)
ClosedPICCase1:
00005fed: cmpl $0x0013f76c=Delay class, %eax
00005ff2: movl $0x00000000, %ebx
00005ff7: jz .+0xfffff996 (0x00005993=basicNew@3B)
ClosedPICCase2:
00005ffd: cmpl $0x0013da5c=OrderedCollection class, %eax
00006002: movl $0x00000000, %ebx
00006007: jz .+0xfffff986 (0x00005993=basicNew@3B)
ClosedPICCase3:
0000600d: cmpl $0x0013dd94=Set class, %eax
00006012: movl $0x00000000, %ebx
00006017: jz .+0xfffff976 (0x00005993=basicNew@3B)
ClosedPICCase4:
0000601d: cmpl $0x001404b8=UnixFileDirectory class, %eax
00006022: movl $0x00000000, %ebx
00006027: jz .+0xfffff966 (0x00005993=basicNew@3B)
ClosedPICCase5:
0000602d: movl $0x00005f98=basicNew@0, %ecx
00006032: jmp .+0xffffa681 (0x000006b8=ceCPICMissTrampoline)
Figure 4.1: Cog generated code for a PIC with 6 different types
0000440c: x27 := t5 -> call 00005fb0      ; save the receiver in x27 and get class
00004414: x28 := x28 -> PIC               ; enter PIC mode for class in register x28
00005fb0: x31 := x27 & #1 -> skipOnZero   ; check for integer tag
00005fb4: x28 := SmallIntegerOop -> return ; x28 will hold the receiver class
00005fb8: s4 := x27                       ; set up stream unit 0 to read from the receiver
00005fbc: s0 := #0                        ; the basic header word
00005fc0: x28 := s7 >> #10                ; the compact class index
00005fc4: s0 := #-4                       ; the extended header word
00005fc4: x28 := x28 & #h7C -> skipOnOne  ; mask compact class index
00005fc8: x28 := s7                       ; non compact class
00005fcc: x28 := x28 & #-4 -> return      ; mask class oop
Figure 4.2: PIC for SiliconSqueak with any number of different types
0013de38:00004414: x28 := x28 -> call 00005fdd   ; code for Dictionary>>basicNew
00139164:00004414: x28 := x28 -> call 00005993   ; code for SharedQueue>>basicNew
0013f76c:00004414: x28 := x28 -> call 00005993   ; code for Delay>>basicNew
0013da5c:00004414: x28 := x28 -> call 00005993   ; code for OrderedCollection>>basicNew
0013dd94:00004414: x28 := x28 -> call 00005993   ; code for Set>>basicNew
001404b8:00004414: x28 := x28 -> call 00005993   ; code for UnixFileDirectory>>basicNew
Figure 4.3: PIC causes these cache entries for six types
CHAPTER 5
Conclusion
Some embedded applications are still simple enough that the traditional development method of cross
compilation of static computer languages and debugging by inserting statements that print to a serial
port will get the job done. That is especially true when the program is just a variation on something
that has been implemented many times before. The decreasing cost of computational resources, however,
makes the alternative of using complete operating systems and dynamic object-oriented programming
languages very attractive. At the prototype phase this can be achieved by simply including a normal PC
in a robot (or having it close by, talking wirelessly to the robot’s sensors and actuators) or putting
a whole mini datacenter in the trunk of an autonomous vehicle, for example. A more specialized solution
with lower costs and energy use is needed for the final product.
The project described in this text addresses this issue with a new object-oriented architecture that makes good use of adaptive compilation, parallelism and reconfigurable hardware. The
processor architecture is called SiliconSqueak since it is optimized for the Squeak implementation of the Smalltalk-80 programming language, but it is flexible enough to support well any
language implemented with the popular bytecode virtual machine technology. It has features
to both support simple interpretation (to reduce energy costs for infrequent code) and adaptive
compilation (for high performance of frequent code, or “hot spots”).
Multiple SiliconSqueak cores can fit into a single chip, even a reconfigurable one (an FPGA),
which allows coarse grained parallelism for a better performance / energy use ratio. This is scalable to very large systems as multiple chips, each with its own local memory, can be connected
with very high speed communication channels that match the message passing paradigm on
which Smalltalk is based.
Each SiliconSqueak core can support a coprocessor, called the ALU Matrix, which uses fine
grained parallelism to speed up the execution of numeric code kernels. Ideally, the code for the
coprocessor can be generated as needed by the adaptive compiler. But given Squeak’s use of
hand coded primitives and “plug-ins” the coprocessor can also be exploited manually if needed.
A heterogeneous mix of simple SiliconSqueak cores and ones with coprocessors can efficiently execute code which is a mix of high level Smalltalk code and primitives. Given that the
same FPGA area that can implement a single SiliconSqueak core and ALU Matrix could be
used instead for several (three, for example) simple cores, the ideal mix will vary depending on
the nature of the code being executed and whether it is mostly interpreted or mostly compiled.
Since this variation happens at runtime, dynamic reconfiguration of the FPGA to change this
mix can optimize the energy efficiency of the system.
5.1 Future Work
The work described in this text is only the latest step in a project that was started by the author
in 1982 (known as “the Merlin Project” from 1984 to 2008). That, in turn, is closely related to
Alan Kay’s ongoing “Dynabook Project” which was started in 1968. To achieve the goals of the
original project there are many more steps planned for the next few years. In addition, there are
projects by other groups that could be greatly enhanced by the results obtained here and which
will be available to all. The most important planned steps are described here.
5.1.1 Experiments
Smalltalk was originally an internal project at Xerox PARC and due to company policy it was
secretive. For the 1980 version a selected group of external companies was involved and a set of
benchmarks were created to compare their results, which were published in what became known
as the “green book” (Krasner, 1983). When combined with the results published for the Self 2
(Chambers, 1992) dynamic compiler and Self 4 (Hölzle, 1994) adaptive compiler, the Smalltalk
family of languages is one of the best studied in terms of performance of implementations. The
benchmarks in the green book were classified as “micro benchmarks” which tested one simple
operation and “macro benchmarks” which were realistic applications. In this project only micro
benchmarks have been used so far. Since it is known that micro benchmark results do not necessarily
predict those of macro benchmarks, and it is the latter that more closely match user perceptions of
performance, the system must be subjected to further experiments in the near future, as soon as support
for complete applications is finished.
SiliconSqueak is not a trivial architecture and includes detailed support for some Smalltalk
features. The inclusion of such support was based on experience with other projects, but experiments to objectively evaluate the costs and advantages of each feature are planned. Such
experiments involve small redesigns of the hardware and corresponding changes in all compilation systems. A typical example is the support for stack operations. With the stack support
hardware, the following microcode instruction is enough to implement the functionality of the
pushTemporaryVariable bytecode for the case where the argument is 3.
dPush := t3 -> fetch8      ; microcode for pushTemporaryVariable 3
Such support is not reflected in the Slang sources. As an example, the high level code that
implements the pushTemporaryVariable bytecode is shown in Figure 5.1. This will generate
16 different pieces of microcode due to the expandCases annotation, and each case will be a single piece
of code due to inlining. With special hardware support, there is a pattern matching pass in
the compilation process that will detect common sequences and replace them with the simpler
equivalent.
StackInterpreter methods for internal interpreter access

temporary: offset in: theFP
    "See StackInterpreter class >> initializeFrameIndices"
    <inline: true>
    ^ stackPages longAt: theFP + offset * BytesPerWord

internalPush: object
    "In the SiSq StackInterpreter stacks grow up."
    stackPages longAtPointer: (localSP := localSP + BytesPerWord)
        put: object

StackInterpreter methods for stack bytecodes

pushTemporaryVariable: temporaryIndex
    self internalPush: (self temporary: temporaryIndex in: localFP)

pushTemporaryVariableBytecode
    <expandCases>
    self fetchNextBytecode.
    "this bytecode will be expanded so that refs to
     currentBytecode below will be constant"
    self pushTemporaryVariable: (currentBytecode bitAnd: 16rF)

Figure 5.1: Slang code for pushTemporaryVariableBytecode

If SiliconSqueak had no special hardware support for stack operations (but with hardware
support of memory read and write similar to most RISC processors), a microcode version of the
above for the specific case where “currentBytecode bitAnd: 16rF” is 3 would be something like
Figure 5.2 using x27 to x29 as scratch registers.
; microcode for pushTemporaryVariable 3
def localFP      x27
def localSP      x28
def stackPages   x29            ; these are set elsewhere
def BytesPerWord 4              ; defined globally
x30 := localFP + 12             ; offset * BytesPerWord = 3 * 4
x30 := stackPages memRead x30
localSP := localSP + BytesPerWord
memWrite x30 stackPages localSP -> fetch8   ; internalPush
Figure 5.2: Microcode for pushTemporaryVariable 3 bytecode
At first glance, it would seem to save three instructions and two memory accesses. The latter
are an illusion because the caches are accessed behind the scenes by the SiliconSqueak pipeline.
The saving in clock cycles for such a frequent operation must be balanced against any loss in
clock frequency due to the extra complexity of the support hardware. Though very complex to
set up, an experiment such as this must be done for each SiliconSqueak special feature.
A third set of experiments involves comparisons with similar systems. Since Squeak can run
just fine on Sparc processors, for example, an interesting experiment would implement one or
more Leon 3 cores (an open source Sparc design) on the same FPGA boards that SiliconSqueak
is being tested on. This allows the metric of performance per FPGA area to more realistically
demonstrate the value of the innovations introduced in this project.
5.1.2 Smalltalk Zero
SiliconSqueak is flexible enough to not only support the Squeak Virtual Machine (VM) but also
other languages that use their own, but similar VMs. It is even possible to switch between VMs
at runtime. Two popular languages used with children, Scratch 1.4 from the M.I.T. Media Lab
and Etoys, are implemented as a layer on top of Squeak. In addition to this, the same Squeak
VM is used both by languages derived from Squeak (such as Pharo or Cuis) and by entirely
new languages, such as NewSpeak. Other existing languages, like Self, could be implemented
to use the Squeak VM.
Given that all these different systems can run on SiliconSqueak at the same time and even
interact with each other, there is a plan to develop a simple and powerful language on top of
the Squeak VM, called Smalltalk Zero for now, to take greater advantage of SiliconSqueak’s
parallelism than existing languages do. It would complement them instead of necessarily being a
replacement.
5.1.3 Multi-level Tracing and Partial Evaluation
The use of Cog (dynamic compiler) and Sista (bytecode to bytecode optimizer), an existing
framework for adaptive compilation in the Squeak VM, leverages many years of engineering
effort. But it also limits how much experimentation can be done. In Marr & al. (2014) two
modern alternatives to hand crafted compiling VMs are evaluated. RPython, developed as part
of the PyPy project, allows an interpreted VM to be written (normally for bytecodes) and then
uses simple annotations to drive a two level tracer which can automatically generate compiled
code. Interpreters are far simpler to write and debug than compilers, so this lowers the development costs for new VMs, freeing time for the exploration of alternatives. Truffle is a system
that uses partial evaluation (Futamura, 1999) of an interpreter based on Abstract Syntax Trees
(ASTs) to generate code.
A future research project will investigate the possibility of extending the two level tracing
from PyPy into a multi-level tracer that could be combined with partial evaluation, as in Truffle,
to create an alternative to Cog and Sista for implementing VMs on SiliconSqueak.
5.1.4 Wafer Scale for Massive Parallelism
The relatively small size of a SiliconSqueak core allows even low end FPGAs to be used for
multi-core systems. If implemented as a custom chip (Application Specific Integrated Circuit ASIC) using a modern fabrication process it would be possible to fit a large number of cores on
a die with a commercially viable size. Machines with more cores can be built by connecting a
number of such chips to each other using high speed serial communication channels. For such
applications, however, the overhead of splitting a wafer into separate dies, testing the dies and
discarding the defective ones, packaging the dies and then soldering them onto a board so that they are
once again connected to each other is very wasteful. An alternative is to build systems
from whole wafers. After significant research progress in this area in the 1980s, this direction
was abandoned due to commercial reasons which became less relevant later on. Perfecting this
technology would allow the construction of research machines with features similar to what
commercial ones will have a few years later so that software development can lead, rather than
follow, hardware development.
5.1.5 Native Compilation to Hardware
Besides the very coarse grained hardware reconfiguration of switching between a SiliconSqueak
core with an ALU Matrix coprocessor and a number of simple SiliconSqueak cores, there is a
level of reconfiguration where the compiler can generate different code to be loaded in the
ALU Matrix. This extends adaptive compilation one step beyond what is normally used. An
additional and even more extreme step would be the compilation of software objects into FPGA
configuration bits. This would require information that the FPGA vendors are not interested
in publishing, though there are ways around that and in an ASIC implementation (such as the
Wafer Scale mentioned above) it would be possible to include FPGA-like areas with a known
design. As long as the new hardware objects have the same message passing interface to the
rest of the system as the software objects they replace, the only effect would be an increase in
performance.
Closely related to the issue of secret configuration bitstreams is the limitation of cross development. Since the bit files are a black box generated only by the FPGA vendor’s own tools,
the fact that these tools run on normal PCs and not the system with the FPGA does not really
matter. But if the information needed to generate new configurations at runtime is available, then
it is desirable to do this generation natively instead. Given that adaptive compilation was the
solution to the growing pauses in sophisticated dynamic compilers, native bitstream generation
would be a real problem in terms of pauses if the same algorithms as in the vendor’s tools are
used. Parallelism can be used to partially hide these pauses, but simpler incremental algorithms
can also help.
5.1.6 Non Von Neumann Architectures
The Squeak VM is a traditional stack based Von Neumann architecture. This limits the amount
of parallelism which can be extracted from the code. As early as 1984, the Merlin project
investigated alternative execution models such as Dataflow architectures. Though the current
focus is on SiliconSqueak, there are plans to continue the previous work (mostly in 1990 and
1997) in this direction.
5.2 Discussion and Limitations
This project optimizes the typical execution of dynamic object-oriented languages so that they
can be used to implement embedded systems. The cost is an increase in the variation of execution time and this is a problem in hard real-time systems. With parallelism, however, enough
of the cost of adaptive compilation can be hidden that soft real-time systems become practical.
In the same way, the increased variation due to dynamic reconfiguration can be hidden
with partial reconfiguration. So even though the focus of this thesis was embedded systems that
are not real-time, at least the soft real-time option can be achieved with some extra effort.
Bibliography
ASANOVIC, K.; CATANZARO, B. C.; PATTERSON, D. A.; YELICK, K. A. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, December 2006.

BOUWENS, F. J.; BEREKOVIC, M.; KANSTEIN, A.; GAYDADJIEV, G. N. Architectural Exploration of the ADRES Coarse-Grained Reconfigurable Array. In: Proceedings of the International Workshop on Applied Reconfigurable Computing, 2007, pages 1–13.

CAROMEL, D. Toward a method of object-oriented concurrent programming. Communications of the ACM, New York, NY, USA: ACM, volume 36, no. 9, pages 90–102, September 1993.

CHAMBERS, C. The Design and Implementation of the Self Compiler, an Optimizing Compiler for Object-Oriented Programming Languages. PhD thesis, Stanford University, 1992.

CHAMBERS, C.; UNGAR, D. Customization: Optimizing compiler technology for Self, a dynamically-typed object-oriented programming language. In: Proceedings of the SIGPLAN'89 Conference on Programming Language Design and Implementation, 1989, pages 146–160.

CHAMBERS, C.; UNGAR, D. Iterative type analysis and extended message splitting: Optimizing dynamically-typed object-oriented programs. In: Proceedings of the SIGPLAN'90 Conference on Programming Language Design and Implementation, 1990, pages 150–164.

CHAMBERS, C.; UNGAR, D.; LEE, E. An efficient implementation of Self, a dynamically-typed object-oriented language based on prototypes. In: Proceedings of the 4th Annual ACM Conference on Object-Oriented Programming, Systems, Languages and Applications, 1989, pages 49–70.

COMPTON, K.; HAUCK, S. Reconfigurable Computing: A Survey of Systems and Software. ACM Computing Surveys, volume 34, no. 2, pages 171–210, June 2002.

DE ASSUMPÇÃO JÚNIOR, J. M. O Sistema Orientado a Objetos Merlin em Máquinas Paralelas. In: V SBAC-PAD: Simpósio Brasileiro de Arquitetura de Computadores, Processamento de Alto Desempenho, 1993, pages 304–312.

DE ASSUMPÇÃO JÚNIOR, J. M. Adaptive Compilation in the Merlin System for Parallel Machines. In: WHPC'94 Proceedings - IEEE/USP International Workshop on High Performance Computing, 1994, pages 155–166.

DE ASSUMPÇÃO JÚNIOR, J. M. Projeto de um sistema de desvio de obstáculos para robôs móveis baseado em computação reconfigurável. Master's thesis, ICMC – Universidade de São Paulo, 2009.

DEUTSCH, L. P.; SCHIFFMAN, A. M. Efficient implementation of the Smalltalk-80 system. In: Conference Record of the Eleventh Annual ACM Symposium on Principles of Programming Languages, 1984, pages 279–302.

DINIZ, P. C.; RINARD, M. C. Dynamic Feedback: An Effective Technique for Adaptive Computing. In: SIGPLAN Conference on Programming Language Design and Implementation, 1997, pages 71–84.

ESTRIN, G. Reconfigurable Computer Origins: The UCLA Fixed-Plus-Variable (F + V). IEEE Annals of the History of Computing, volume 24, no. 4, pages 3–9, October 2002.

FUTAMURA, Y. Partial Evaluation of Computation Process - An Approach to a Compiler-Compiler. Higher-Order and Symbolic Computation, volume 2, no. 5, pages 381–391, December 1999.

GELERNTER, D.; BERNSTEIN, A. J. Distributed communication via global buffer. In: Proceedings of the First ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, pages 10–18, August 1982.

GOLDBERG, A.; KAY, A. Smalltalk-72 instruction manual. Technical report, Xerox PARC, 1976.

GOLDBERG, A.; ROBSON, D. Smalltalk-80: the language and its implementation. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1983.

HARTENSTEIN, R. A decade of reconfigurable computing: a visionary retrospective. In: DATE '01: Proceedings of the Conference on Design, Automation and Test in Europe, Piscataway, NJ, USA: IEEE Press, 2001, pages 642–649.

HEWITT, C.; BISHOP, P.; STEIGER, R. A universal modular ACTOR formalism for artificial intelligence. In: Proceedings of the 3rd International Joint Conference on Artificial Intelligence, Stanford, USA: Morgan Kaufmann Publishers Inc., 1973, pages 235–245.

HOARE, C. A. R. Communicating sequential processes. Communications of the ACM, volume 21, no. 8, pages 666–677, August 1978.

HÖLZLE, U. Adaptive optimization for Self: reconciling high performance with exploratory programming. PhD thesis, Stanford University, 1994.

HÖLZLE, U.; CHAMBERS, C.; UNGAR, D. Optimizing dynamically-typed object-oriented programming languages with polymorphic inline caches. In: ECOOP'91 Conference Proceedings, 1991, pages 21–38.

INGALLS, D.; KAEHLER, T.; MALONEY, J.; WALLACE, S.; KAY, A. Back to the future: The story of Squeak, a practical Smalltalk written in itself. In: Proceedings OOPSLA '97, ACM SIGPLAN Notices, ACM Press, 1997, pages 318–326.

KRASNER, G. Smalltalk-80: bits of history, words of advice. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1983.

LEWIS, D. M.; GALLOWAY, D. R.; FRANCIS, R. J.; THOMSON, B. W. Swamp: A Fast Processor for Smalltalk-80. In: Proceedings of the 1986 Conference on Object-Oriented Programming Systems, Languages, and Applications, 1986, pages 131–139.

MARR, S. Supporting concurrency abstractions in high-level language virtual machines. PhD thesis, Software Languages Lab, Vrije Universiteit Brussel, Brussels, Belgium, 2013.

MARR, S.; PAPE, T.; DE MEUTER, W. Are we there yet? Simple language implementation techniques for the 21st century. IEEE Software, volume 31, no. 5, pages 60–67, September 2014.

MOORE, G. E. Cramming More Components onto Integrated Circuits. Electronics, volume 38, no. 8, pages 114–117, April 1965.

NARAYANAN, V. Issues in the Design of a Java Processor Architecture. PhD thesis, Department of Computer Science and Engineering, University of South Florida, 1998.

NOAKES, M.; WALLACH, D. A.; DALLY, W. J. The J-Machine Multicomputer: An Architectural Evaluation. In: Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993, pages 224–235.

NOJIRI, T.; KAWASAKI, S.; SAKODA, K. Microprogrammable processor for object-oriented architecture. SIGARCH Computer Architecture News, New York, NY, USA: ACM, volume 14, no. 2, pages 74–81, May 1986.

KOOPMAN, P. J., Jr. Stack computers: the new wave. Ellis Horwood Series in Computers and Their Applications. Ellis Horwood Ltd, 1989.

SAMPLES, A. D.; UNGAR, D.; HILFINGER, P. SOAR: Smalltalk Without Bytecodes. In: Proceedings of the 1986 Conference on Object-Oriented Programming Systems, Languages, and Applications, 1986, pages 107–118.

SCHOEBERL, M. JOP: A Java Optimized Processor for Embedded Real-Time Systems. PhD thesis, Vienna University of Technology, 2005.

SEDCOLE, N. P. Reconfigurable Platform-Based Design in FPGAs for Video Image Processing. PhD thesis, Department of Electrical and Electronic Engineering, Imperial College of Science, Technology and Medicine, University of London, 2006.

TEIXEIRA, M. A. Técnicas de reconfigurabilidade dos FPGAs da família APEX 20K - Altera. Master's thesis, ICMC – Universidade de São Paulo, 2002.

THACKER, C.; MCCREIGHT, E.; LAMPSON, B.; SPROULL, R.; BOGGS, D. Alto: A personal computer. In: SIEWIOREK, D. P.; BELL, C. G.; NEWELL, A. (editors), Computer Structures: Principles and Examples. Second edition. New York: McGraw-Hill, 1981, pages 549–572.

UNGAR, D.; ADAMS, S. S. Harnessing emergence for manycore programming: early experience integrating ensembles, adverbs, and object-based inheritance. In: SPLASH '10: Proceedings of the ACM International Conference Companion on Object-Oriented Programming Systems Languages and Applications Companion, New York, NY, USA: ACM, 2010.

UNGAR, D.; BLAU, R.; FOLEY, P.; SAMPLES, A. D.; PATTERSON, D. Architecture of SOAR: Smalltalk On A RISC. In: Proceedings of the Eleventh Annual International Symposium on Computer Architecture, 1984, pages 188–197.

UNGAR, D.; SMITH, R. B. Self: The Power of Simplicity. In: Proceedings of the 2nd Annual ACM Conference on Object-Oriented Programming, Systems, Languages and Applications, 1987, pages 227–241.

WILLIAMS, I. W. The Mushroom Machine - An Architecture for Symbolic Processing. In: IEE Colloquium on VLSI and Architectures for Symbolic Processing, 1989.

XILINX. Virtex-4 Configuration Guide. Technical Report UG071, 2007.
APPENDIX A
SiliconSqueak Assembly Language
In theory, the assembly language for SiliconSqueak would be the bytecodes described in the Smalltalk-80
“blue book” (Goldberg and Robson, 1983). This is the instruction set for a stack based architecture,
which is also known as a zero address architecture. There is some hardware dedicated to dealing with
such instructions, but since these bytecodes are documented elsewhere they will not be described
further here. The Squeak system has tools for examining these bytecodes, but the programmer has no need
to write code at that level.
At an even lower level, there is an instruction set which is implemented directly by SiliconSqueak hardware and which is used both to execute the bytecodes and as a target for adaptive
compilation from the bytecodes. This is what is called “assembly language” in this text, though
it is very tempting to call it “microcode”. Like microcode in many other machines, there is only
one instruction format and it combines operations and control in a four address architecture. The
32 bit binary format for these instructions is:
FddDDDDD FaaAAAAA MFFXXXXX bbbBBBBB
33222222 22221111 11111100 00000000
10987654 32109876 54321098 76543210
The four bit F field (which is spread about in bits 31, 23, 14 and 13) is described in Section A.4
and is the control part of the instruction. The X field selects the operation, which can
either be a simple operation on two 32 bit operands producing a 32 bit result or, depending on the
value of M, can select a very complex operation in the optional ALU Matrix coprocessor as
described in Appendix B.
The destination is selected by the D field while the inputs are selected by the A and B fields. In
the format above, each of these three fields starts with a few bits in lower case, which select the
kind of register to be used, followed by five bits in upper case which select the actual register.
While D and A can use any of four kinds of registers (register numbers 0 to 127), field B can
select four additional kinds (register numbers 0 to 255). The following subsections describe the
encoding of bbb, which is the same as 0dd and 0aa.
A.0.1  000: Registers t0 to t31
Like most of the registers described here, these are not physical registers but rather aliases for
positions in the stack cache. They are called the “temporary registers” and correspond to the
arguments and temporary variables of a Smalltalk-80 method. It is extremely rare for methods
to require more than 32 temporaries; when they do, the code to access the extra ones is more
complicated, requiring the use of the stream units.
A.0.2  001: Registers i0 to i31
Registers i1 to i31 correspond to the first 31 instance variables of the receiver object of the
currently executing method. Register i0 corresponds to the last (and often only) word of the
header of the receiver object. These are not physical registers but aliases for words in the data
cache. If the object has more than 31 instance variables or more than one header word then
these can only be accessed through the stream units.
A.0.3  010: Registers s0 to s31
Section A.5 describes the four stream units (two for reading and two for writing), which are
essentially convenient Direct Memory Access (DMA) hardware blocks. Each unit is controlled
by eight registers. Streams are widely used in Smalltalk-80 programs and these units support
both that and simple array access. They also allow access to parts of memory which aren’t
mapped into any of the other registers.
A.0.4  011: Registers x0 to x31
Sections A.6, A.7, A.8 and A.9 give the details about these registers. They are used to control
SiliconSqueak and, for several of them, changing their value has the side effect of remapping
other registers.
A.0.5  100: Pseudo Registers #0 to #31
These 32 values are not registers at all, but constants that can be conveniently referenced directly
in an instruction. It wouldn’t make sense to have a constant as a destination nor, in most cases,
a second constant in the same instruction (since then the result could have been calculated
at compile time). This is why only operand B can encode them. The actual small positive
integers 0 through 31 can have two different 32 bit encodings, one as the raw value and the
other as a tagged value. That depends on the selection of tags by the instruction, as described
in Section A.8.
A.0.6  101: Pseudo Registers #-1 to #-32
This is exactly the same as the previous case except that the constants are the first 32 small
negative integer values.
A.0.7  110: Pseudo Registers #o0 to #o31
32 well known constant objects can be directly referenced in any instruction. The actual object
references are fetched from the SpecialObjectsArray which is pointed to by register x20. This
indirection can be costly, even if the relevant part of the SpecialObjectsArray happens to be
present in the data cache. So three critical entries are cached in some of the special registers:
• x23 holds the oop for the class SmallInteger (entry 6, #o5 in assembly)
• x21 holds the oop for false (entry 2, #o1 in assembly)
• x22 holds the oop for true (entry 3, #o2 in assembly)
These registers are not loaded automatically and the hardware will not know to use them in
place of an expensive cache access if the constant version is used in an instruction.
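For example (a sketch, not code from the actual system; it assumes a class oop has been placed in t1), a class check can subtract the cached register instead of paying for the indirect #o5 access:

      x31 := t1 - SmallIntegerOop   ; zero result when t1 holds the SmallInteger class (x23 is an ordinary register)
      x31 := t1 - #o5               ; the same test, but #o5 is fetched through the SpecialObjectsArray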
The last eight constants don’t reference their respective entries in the SpecialObjectsArray
but instead indicate special hardware objects when loaded into a stream unit:
def bytecodeCache   #o24
def microcodeCache  #o25
def dataCache       #o26
def stackCache      #o27
def microcodeL2     #o28
def dataL2          #o29
def stackL2         #o30
def rawMemory       #o31   ; access to the physical RAM

A.0.8  111: Registers L0 to L31
A Smalltalk-80 method object contains not only the bytecodes to be executed but also a set
of constants (called “literals”) which can be referenced by those bytecodes. Registers L0 to
L31 are equivalent to i0 to i31 but map to the bytecode cache and the method object instead of
the receiver object. If a method has more than 31 literals then the stream units must be used
to access them. For methods with fewer than 31 literals, the last few L registers will map to
bytecodes instead but that is not the proper way to access them.
A.1  Directives
Traditional assemblers have a number of directives that control the generation of code and the
formatting of listings. Since the assembler for SiliconSqueak microcode is a Squeak application
which runs in an environment in which text formatting can easily be done with related tools, only
two directives were defined.
A.1.1  org expression
This directive sets the current PC of the code being assembled. Code is generated with an
offset, which defaults to zero if not sent as an argument to the assembler. The current PC and
the last PC are initially set to the offset value and the generated code is initially empty. If an
org directive would set the current PC to a value greater than the last PC then nils are added to
the code until the last PC reaches the desired value. If org sets the current PC to a value below
the last PC, the assembler will overwrite previously generated code. If that code was just nils
then it is simply replaced, but if it was anything else then a warning is generated.
A.1.2  def name expression
This directive adds the name to the symbol table and associates it with the value of the expression.
One important use of this directive is to make register names more readable. Any
expression from 0 to 127 (or 0 to 255 for the second source, or 0 to 31 for the ALU Matrix
instructions) can be used to indicate a register. The names t0 to t31, i0 to i31, s0 to s31, x0 to
x31, L0 to L31, #0 to #31, #-1 to #-32, #o0 to #o31 and m0 to m31 are predefined with the correct
values before assembly begins, so they can be used instead of raw numbers to indicate registers.
But it is good practice to define more meaningful names; the suggested names are used for the
special registers throughout this text.
Labels (part of the core assembly syntax) are equivalent to defining the name with the current
PC as its value. While it is good practice to define a name before it is used in an instruction, that
is not possible for forward jumps. So there is a general scheme for handling undefined names
and this makes multiple definitions of the same name an error.
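As a sketch of the two directives and a label together (the org address and the label name are made up; the register definitions follow Sections A.6 and A.7):

      org 1024               ; continue assembling at microcode address 1024
      def IP x2              ; the suggested name for the bytecode instruction pointer
      def dTop x9            ; the suggested name for the top of the data stack
demo: dTop := IP + #1        ; replace the top of the data stack with IP plus one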
A.2  Syntax

label: rD := rA op rB -> fetch    ; comment
The label (text without spaces before the first colon) is optional and if present defines the
name as having the value of the current program counter. The comment (any text after the first
semicolon in the line) is also optional and is ignored by the assembler.
The fetch part of the instruction (after the right arrow) is always present in the generated
code, but the default value "-> next" can be omitted. This encodes the F field in the instruction
as described in Section A.4. Some types of fetch fields (all those with the highest bit set) involve
a 32 bit constant, which is present in the word following the instruction.
The basic instruction has the syntax of an assignment to a register of a binary operation between
two other registers, a style that is more popular for microcode than the traditional “operation
operand, operand, operand” form. A few operations are actually unary and, in those cases, rA
is omitted.
A.3  Operations
The 32 operations possible between two registers can be divided into four groups as indicated
by the top two bits of the X field. Within each group, the eight instructions will be indicated by
the bottom three bits of the same X field.
In addition, the M bit can replace the operation with one performed by the ALU Matrix coprocessor
(if present). At the assembly language level this is done by replacing the characters for the
operation with a number between square brackets: [0] to [31]. This is detailed in Appendix B.
A.3.1  Arithmetic (00)
These instructions use a simple 32 bit adder/subtractor. Bit 2 of the X field is used to mask
operand A, which would be equivalent to setting it to #0 if it were not the case that only B can
encode that. Bit 1 of the X field inverts the bits of B while Bit 0 is used as the carry in.
000  rD := rB              ; move
This is a simple copy of one register to another, including the value of constants.
001  rD := rB + 1          ; increment
The destination receives a value one greater than the source.
010  rD := ~rB             ; not
This is a logical inversion, also known as one's complement.
011  rD := -rB             ; negate
This is a mathematical inversion, also known as two's complement.
100  rD := rA + rB         ; add
The values of the two sources are added and the result is stored in the destination.
101  rD := rA + rB + 1     ; add with carry set
One more than the addition of the two sources is stored in the destination. This option allows
for extending results beyond 32 bits.
110  rD := rA - rB - 1     ; sub borrow
One less than the subtraction of the second operand from the first is stored in the destination.
This option allows for extending results beyond 32 bits.
111  rD := rA - rB         ; subtract
The second source is subtracted from the first and stored in the destination.
A.3.2  Comparison (01)
Every operation described so far actually produces two results. One is a 32 bit result which
is stored in the indicated destination. The second is a single bit which indicates if the other
result was zero or not. This is used for conditional branching and is sufficient to test equality.
Some math code, however, needs more details about the effects of adding two operands. So six
additional instructions add (or subtract) the two operands but redefine the meaning of the single
bit result.
010  rD := rA +? rB        ; carry
Indicates that the addition resulted in a carry from the highest bits. Can be used to extend
additions beyond 32 bits.
011  rD := rA +?? rB       ; overflow
Indicates that the addition caused an overflow condition.
100  rD := rA < rB         ; less than
Indicates that the subtraction of the two signed values had a negative result.
101  rD := rA <= rB        ; less or equal
Indicates that the subtraction of the two signed values had a negative or zero result.
110  rD := rA $< rB        ; unsigned less
Indicates that the subtraction of the two unsigned values had a negative result.
111  rD := rA $<= rB       ; unsigned less eq
Indicates that the subtraction of the two unsigned values had a negative or zero result.
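As a sketch of how the single bit result is consumed (the choice of registers is illustrative), the maximum of t1 and t2 can be left in t3 with a comparison followed by a conditional skip:

      t3 := t1                       ; assume the first value is the larger one
      x31 := t1 < t2 -> skipOnZero   ; when t1 is not less than t2 the next instruction is skipped
      t3 := t2                       ; t2 was larger, so it replaces the assumption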
A.3.3  Logic (10)
These instructions implement simple bitwise logical operations. Given two bits, there are a total
of 16 possible logical operations, but several of these are already implemented by the move
and not arithmetic instructions. Bit 0 of the X field is used to invert the bits of the result while
bits 2 and 1 select between four logic blocks.
000  rD := rA | rB         ; or
The destination receives the bitwise inclusive or of the two operands.
001  rD := rA ~| rB        ; nor
This is the same as the previous operation but with the result bits inverted.
010  rD := rA ^ rB         ; exclusive or
The destination receives the bitwise exclusive or of the two operands.
011  rD := rA ~^ rB        ; equivalence
This is the same as the previous operation but with the result bits inverted. The opposite of
the exclusive or is the equivalence operation.
100  rD := rA & rB         ; and
The destination receives the bitwise and of the two operands.
101  rD := rA ~& rB        ; nand
This is the same as the previous operation but with the result bits inverted.
110  rD := rA &~ rB        ; and invert
The destination receives the bitwise and of the first operand with the inverse of the second
operand.
111  rD := rA ~&~ rB       ; nand invert
This is the same as the previous operation but with the result bits inverted.
A.3.4  Shifts (11)
This set of operations makes use of the DSP units present in modern FPGAs. Bit 2 of the X
field selects between shifting or multiplying by changing the encoding of the second operand
before it reaches the multiplier (a shift by 9 is the same as a multiplication by 512). Bit 1 is
used to select between signed and unsigned versions of the instructions (except that signed shift
left becomes rotate since it would otherwise be the same as unsigned shift left) while bit 0 selects
the direction.
000  rD := rA <> rB        ; rotate
The bits of the first operand are shifted left as indicated by the second operand and the left
over bits are brought back into the result.
001  rD := rA >> rB        ; arith shift right
The bits of the first operand are shifted right as indicated by the second operand with the top
bit filling the result.
010  rD := rA << rB        ; shift left
The bits of the first operand are shifted left as indicated by the second operand while the
new bits are filled with zero.
011  rD := rA $>> rB       ; shift right
The bits of the first operand are shifted right as indicated by the second operand while the
new bits are filled with zero.
100  rD := rA * rB         ; multiply
The two 32 bit operands are treated as signed values which are multiplied, and the bottom
32 bits of the result are stored in the destination.
101  rD := rA / rB         ; divide
The first operand is divided by the second operand, with both being treated as signed values.
110  rD := rA $* rB        ; unsigned mult
The two 32 bit operands are treated as unsigned values which are multiplied, and the bottom
32 bits of the result are stored in the destination.
111  rD := rA $/ rB        ; unsigned div
The first operand is divided by the second operand, with both being treated as unsigned
values.
A.4  Fetch
The four bits of the F field are used to define how the next microinstruction is to be fetched.
The top bit selects between single word instructions and those that require a second 32 bit word
(as a destination address, for example).
0000  -> next
This is the default case if no fetch part is included in an assembly level instruction. The
uPC is incremented by one so that the instruction that follows in the L2 microcode cache gets
executed.
0001  -> return
A 32 bit value is popped from the return stack and microcode execution continues from the
indicated address.
0010  -> skipOnZero
If the single bit result of the instruction is zero, then all effects of executing the following
instruction (whether one or two words long) are canceled. By the time the result is determined,
one or more of the following instructions will probably have already been fetched and be in the
pipeline but if the instruction that might be skipped could cause a branch then all fetching is
halted until this is resolved.
0011  -> skipOnOne
This is exactly like the previous case, but the skip happens if the single bit result is one.
0100  -> PICuPC
This option isn’t used in the bytecode interpreter but only in code generated by the adaptive
compiler. The associated instruction generates a value that represents the class of an object
which is to receive a message. The current value of the uPC represents the “send site” and
is combined via a hash function with the receiver’s class to point to a place in the virtual L2
microcode cache. That line might be allocated or not and if allocated it might belong to a
different class/send site pair. In those cases the lookup routine is invoked. Otherwise execution
can continue with the fetched code.
Note that though PIC means “Polymorphic Inline Cache”, this is not actually inline. Traditional adaptive compilers will create an inline cache for a single entry and then change that to
a call to a non inline switch statement which is the actual PIC. In SiliconSqueak the sequential
tests of the switch statement are replaced with a hashed search.
0101  -> matrix row col start
The next few words are not microcode at all but instructions to be loaded into the ALU
Matrix coprocessor (if present). The first word includes a count of ALU Matrix instructions
and once these are all fetched the word following that is again interpreted as SiliconSqueak
microcode. If the associated instruction happens to be an ALU Matrix operation, then that operation number is set to start at this code fragment, as described in Appendix B. SiliconSqueak
itself treats all these words the same as skipped instructions.
The area selection allows a compact representation and shorter reconfiguration times when
code is more SIMD (single instruction, multiple data) in nature. The row and column can be *
to indicate all (treated as 0.15), a single number (treated as n.n) or a pair of numbers separated
by a period. This allows a rectangular subset of the Matrix to be selected. The start tells the
system where to load the code into the program memories of the selected ALUs and can be from 0 to
1023.
0110  -> fetch8
The last microcode instruction of the sequence that implements a given bytecode should have
this option so that the following bytecode will be fetched and the address of the first microcode
corresponding to that new bytecode can be calculated. Many simple bytecodes can have a
microcode sequence of just a single instruction.
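For example (a sketch; the real bytecode table may differ), the bytecode that pushes true could be a single instruction using the registers defined in Sections A.7 and A.8:

pushTrue: dPush := trueOop -> fetch8   ; push the cached true oop and start the next bytecode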
0111  -> tagAndFetch8
The tag units will verify that the two operands are properly encoded small integers and
convert them into raw 32 bit values. The result will be converted back into an encoded small
integer if possible. When all this works out, then this option is just like the previous one and
a new bytecode is fetched and its corresponding microcode sequence is executed. If any of
the detagging or retagging operations fail, however, execution continues with the following
microcode instruction. This is normally some cleanup code followed by a more general version
of the operation that was attempted.
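A sketch of how this could be used by a SmallInteger addition bytecode (the stack discipline, the scratch register and the slowAdd label are assumptions, not the actual bytecode table):

addBytecode: x27 := dPop                          ; pop the argument into a scratch register
             dTop := dTop + x27 -> tagAndFetch8   ; detag both, add, retag; on success run the next bytecode
             dPush := x27 -> call slowAdd         ; tag failure: push the argument back and call a
                                                  ; hypothetical general version of the addition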
1000  -> fetch4 destination
The byte following the currently executing bytecode is fetched and its top four bits replace
the corresponding bits in the 32 bit word following the current microcode instruction and the
result is used as the new uPC. This results in a 16 way branch to locations spread 16 words apart
around the address indicated by destination.
1001  -> fetch2 destination
The byte following the currently executing bytecode is fetched and its top two bits replace
the corresponding bits in the 32 bit word following the current microcode instruction and the
result is used as the new uPC. This results in a 4 way branch to locations spread 64 words apart
around the address indicated by destination.
1010  -> jumpOnZero destination
When the single bit result is zero, the uPC is set to destination. Otherwise the following
instruction is executed.
1011  -> jumpOnOne destination
When the single bit result is one, the uPC is set to destination. Otherwise the following
instruction is executed.
1100  -> jump destination
The uPC is set to the destination. This option is used not only for loops and (in combination with the jumpOnXX instructions) for conditional execution but also to deal with memory
fragmentation. Due to the fetchX limitations there will often be partially used code fragments
while other fragments overflow.
1101  -> PICx associateduPC
One problem with the PICuPC option is that different call sites can’t share PIC entries even
if the adaptive compiler figures out that they could. This option is exactly the same except that
the following word is used for the hash instead of the current value of the uPC. This is just a
way to save memory and time, but does not add any functionality.
1110  -> call destination
This is just like jump except that the previous value of the uPC is pushed on the return stack.
If a return is encountered later on then execution will continue with the instruction following
the call. Since the return stack is shared with the send and return bytecodes, it is important that
microcode level calls and returns be perfectly balanced within a single bytecode.
1111  -> tagOrCall destination
The tag units will verify that the two operands are properly encoded small integers and
convert them into raw 32 bit values. The result will be converted back into an encoded small
integer if possible. When all this works out, execution continues with the following instruction.
If there are any problems, then the subroutine indicated by destination is called and its job is to
deal with the failure.
A.5  Streams
SiliconSqueak has four DMA (direct memory access) engines that are presented as a set of
registers s0 to s31. This is the only way to access instance variables beyond 31 in the receiver,
to access the extended header words or to access an object other than the receiver. The DMA
units will access the data cache if the needed information is there but will bypass the cache if it
is not.
The index registers select which field in the object will be accessed. The step registers
hold the value to be added to the index after the operation. The limit registers indicate the first
value which is not acceptable for the index and the base registers indicate the first value that
is acceptable (this is not checked when the registers are initially set but is only used for wrap
around).
register name     read0   read1   write0   write1
index             s0      s8      s16      s24
step              s1      s9      s17      s25
base              s2      s10     s18      s26
limit             s3      s11     s19      s27
reset             s4      s12     s20      s28
resetByte         s5      s13     s21      s29
atEnd             s6      s14     s22      s30
next / nextPut    s7      s15     s23      s31

Table A.1: Registers for Stream Units

When an object pointer is stored into the reset registers or the resetByte registers then the
base and index registers are set to the first valid word or byte in that object and the limit to the
word or byte after the last one while the step will be set to 1. This makes setting up the most
common configuration very efficient but it doesn’t complicate other configurations.
The atEnd registers will read as false until the first time the index has to wrap around due to
exceeding the limit. It will read as true from then on until the DMA unit is reset.
The next registers allow data to be read from the streams while the nextPut registers allow
data to be written to the streams. The data can be either 32 bit words or 8 bit bytes, depending
on which register was used to reset the stream unit. The unit has a small internal buffer to make
memory access more efficient.
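As a sketch (the temporary register holding the object is an assumption, and the register numbers follow Table A.1), reading the first two fields of an arbitrary object with read stream 0 looks like this:

      def objReset s4        ; reset register of read stream 0
      def objNext s7         ; next register of read stream 0
      objReset := t1         ; point read stream 0 at the object held in t1
      t2 := objNext          ; read the first field of that object and advance the index
      t3 := objNext          ; read the second field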
A.6  Context
The runtime environment presented to the Smalltalk-80 programmer is a set of threads, each
of which has a stack built from a linked list of Context objects (known in other languages as
activations or as stack frames). So the current context provides information which is important
for the proper execution of instructions. It is replaced by a different context either when a send or
return bytecode is executed or when the scheduler switches to a different thread.
Context objects allow high level implementation of tools such as the debugger but they can
greatly reduce performance and can complicate adaptive compilation. An alternative is to
use a conventional stack most of the time and only allocate actual Context objects when some
code references them. SiliconSqueak has two separate hardware stacks (which share a single
cache): the data stack and the return stack. Five registers are associated with every context
(independently of whether the actual object has been allocated or not) and they are pushed to
and popped from the return stack on message sends and returns.
def self x0              ; affects i0-i31
def method x1            ; affects L0-L31
def IP x2
def context x3           ; (or nil)
def framePointer x4      ; affects t0-t31
Special register x0 has been defined with the nicer name “self” in this example. Changing
its value will remap registers i0 to i31 to a different part of main memory. In theory this register
should have the same value as register t0, but this doesn’t happen when blocks are involved.
Register IP is normally incremented automatically by the bytecode fetch hardware, but can be
set to a new value to cause a jump at the bytecode level. When the method register is changed,
the IP register must be updated as well. The context register is ignored by the hardware and is
set to nil unless an actual Context object is allocated, in which case it is set to point to that.
The framePointer indicates the memory position to which register t0 is mapped. This is a
location in the data stack.
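For instance (a sketch; the offset is illustrative and it assumes that fetch8 continues from the updated IP), the microcode for an unconditional short branch bytecode could simply adjust IP:

shortJump: IP := IP + #3 -> fetch8   ; skip forward three bytecodes and continue interpretation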
A.7  Thread
Though threads share object address space, each has its own stack (or pair of stacks in the case
of SiliconSqueak). The hardware stacks are allocated from a region in memory that is divided
into blocks of 32 words each. The blocks are joined using doubly linked lists to allow stacks to
grow to arbitrary sizes within the region.
def dBlockHigh x5
def dBlockLow x6
def rBlock x7
These three registers select which blocks the hardware can access directly. dBlockHigh and
dBlockLow together define a 64 word region for the data stack. rBlock defines a 32 word region
for the return stack. This simpler scheme is possible because there is no equivalent to registers
t0 to t31 for the return stack.
def stackPointer x8      ; affects dBlockHigh and dBlockLow
This register points to a word that normally corresponds to the region defined by dBlockLow.
If it is incremented to where it moves from the dBlockLow region to the dBlockHigh one
then dBlockHigh is copied to dBlockLow and the linked list is followed to find a new value for
dBlockHigh (possibly invoking software to allocate a new block). If the stackPointer is decremented when it already points to the first word in dBlockLow then dBlockLow is copied to
dBlockHigh and the linked list is followed to find a new value for dBlockLow (always possible
unless there is a stack underflow error).
def dTop x9
def rTop x10
def currentByte x11
The dTop register will return the value in the word pointed to by stackPointer. rTop is
exactly the same, but for the return stack. The currentByte register returns the value pointed to
by the combination of the method and IP registers.
def returnPointer x12    ; affects rBlock
This is like stackPointer, but for the return stack instead of the data stack. When it tries to
move beyond or before the region indicated by rBlock then that register is replaced by following
the linked list (possibly invoking software to allocate a new block).
def dPop x13             ; affects stackPointer
def rPop x14             ; affects returnPointer
def nextByte x15         ; affects IP
These registers are exactly like dTop, rTop and currentByte respectively except they cause
the associated pointers to be incremented or decremented. In the case of dPop and rPop they
cause their respective stack pointers to be decremented after use when they are operands and to
be incremented before use when they are the destination. In the latter case it might be nice to define
alternative names for these registers:
def dPush x13            ; affects stackPointer
def rPush x14            ; affects returnPointer
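A small sketch (the temporary registers are arbitrary):

      dPush := t1            ; push the contents of t1 onto the data stack
      t2 := dPop + #1        ; pop it back, adding one on the way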
A.8  Image
For a given Squeak system, there are some global settings that are valid for all threads. A single
core might run more than one image (even for other languages such as Java or Python), in which
case these registers, as well as all previously mentioned registers, would have to be updated on
each switch and the caches would have to be flushed.
def tagConfig x16
def bcTable x17
def L2ConfigHigh x18
def L2ConfigLow x19
Tag configuration defines the operation of the two detagging units (associated with operands
A and B) and the retagging unit (associated with the destination). The lowest 16 bits of the
register indicate which combinations of d31, d30, d1 and d0 are valid SmallIntegers. The next
higher 4 bits are ANDed with d31, d30, d1 and d0 when converting from tagged to untagged
SmallIntegers, while 4 other bits are ORed with d31, d30, d1 and d0 when converting from
untagged to tagged SmallIntegers. Two bits indicate how much to shift right when converting
from tagged to untagged SmallIntegers, and the same bits indicate how much to shift left for the
reverse operation. The top 6 bits are undefined.
For Squeak the bottom bit is set to 1 for SmallIntegers, so this register must be set to hex
value 011EAAAA. The AAAA defines all odd values as valid SmallIntegers. The E will clear
d0 when converting to raw bits and the bottom 1 will set it when retagging. The top 1 will divide
the tagged value by 2 and multiply it back when retagging. For Self the bottom two bits are 0
for SmallIntegers, so this register must be set to hex value 020F1111. An option that works well
in hardware but is complicated to deal with in software is when the top two bits must match in
SmallIntegers. This can be handled by setting this register to hex value 000FF00F.
There is a reserved region of 4KB in the microcode cache which is loaded manually and
is never flushed. When a new bytecode starts to be executed, the uPC is set to the value of the
bytecode shifted left by 2 in order to directly start executing the sequence of 16 bytes (4 words)
of microinstructions corresponding to that bytecode. If the region needs to be reloaded when
switching between images then register bcTable has the address from which it should be loaded.
The rest of the L2 virtual microcode cache follows that in memory.
The two L2Config registers are not yet defined, so for now the L2 virtual caches are simply
the 32 bit address space.
The following registers have already been described and simply cache values that can be
accessed using the constant objects option for operand B. They are just regular registers as far
as the hardware is concerned and must be manually reloaded when there is an image switch.
def specialOop x20
def falseOop x21
def trueOop x22
def SmallIntegerOop x23

A.9  System
Only three very low level hardware registers are used by the system as a whole instead of
individual images. The nocData and nocLast/nocStatus registers offer very low level access to
the Network On Chip. Normally this is not needed as all resources can be accessed as memory.
The now register allows simple access to a precise timer, but with only 32 bits it overflows
rather quickly.
The remaining registers are labeled “scratch” and since they are widely used their values
can’t be counted on except in very short code fragments. They are used mostly in the microcode
for the send and return bytecodes. x31 is, by covention, the destination whenever the result of
an operation is not needed.
def nocData x24
def nocLast x25          ; or nocStatus on read
def now x26              ; increments on every clock cycle
; scratch registers: x27 to x31
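A sketch (the registers being used are illustrative) of timing a short fragment with the now register:

      t5 := now              ; record the cycle counter before the work
      t6 := t6 * t7          ; the work being measured (a single multiply in this example)
      t5 := now - t5         ; elapsed clock cycles, valid as long as the 32 bit counter did not wrap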
APPENDIX B
ALU Matrix Assembly Language
The optional coprocessor ALU Matrix is meant to run numerically intensive code faster than
the main SiliconSqueak core. It is tightly controlled by the core but offers a level of indirection
that allows some deviation from a single instruction, multiple data (SIMD) execution. When
the core selects ALU Matrix operation 12, for example, this is translated into a 10 bit address
for the ALU program memories. The content of these memories can be different for each
ALU (though it often isn’t) so that one might be adding two registers while its neighbor is
doing an inclusive or of two different ones. Just like instruction flow control is a part of each
SiliconSqueak microcode instruction, communication is a part of each ALU Matrix instruction.
The coprocessor operates in place of the ALU of the core and has access to the same
operands A and B. 32 bits of its results are always saved to the indicated destination in the
core (or to x31 if this result is unwanted).
The size of the Matrix can be up to 16 by 16 and doesn’t have to be square. A smaller 8 by
8 Matrix is used in this text. Each ALU is 8 bits wide, but neighbors can combine their results
using the carry options shown in Table B.1.
Though the syntax for ALU Matrix assembly language is slightly different from the SiliconSqueak microcode, the same tool is used for both and source code mixes them. The last
SiliconSqueak instruction before a sequence of ALU Matrix instructions must use the “-> matrix[row,col,start]” fetch option. In the binary code, the first word has this format:
xxxxXXXX yyyyYYYY CCCCCCSS SSSSSSSS
33222222 22221111 11111100 00000000
10987654 32109876 54321098 76543210
The x field indicates the first column and X the last column into which the following code
will be loaded. The y and Y fields define the rows, so the four fields together define a rectangular
subset of a 16 by 16 Matrix. The C field indicates how many words of ALU Matrix instructions
follow while the S field is the address into which the code is to be loaded in the program memories of the selected ALUs. If the SiliconSqueak instruction with the “-> matrix[row,col,start]”
option invoked a matrix operation then that operation number is reset to start at S.
The next C words are the actual ALU Matrix instructions and have the following format:
TTTDDDDD AAAAAXXX XXXBBBBB CCCCCIII
33222222 22221111 11111100 00000000
10987654 32109876 54321098 76543210
At the source level, the assembly syntax is:
{condition} mD := mA op mB, mC := INP    ; comment
The comment (any text after the first semicolon in the line) is optional and is ignored by
the assembler. The only other part that can be omitted is the condition (the part between curly
brackets) which is equivalent to true if not present.
The basic instruction is in the format of two register assignments separated by a comma.
The leftmost assignment takes values from two registers and combines them using the named
operation while the second assignment stores the value from the named external source into
a register. The registers are named m0 to m31. When the indicated condition is false, the
leftmost assignment does not happen. By convention, m31 is used as a scratch register so it is
the destination of assignments which do not matter.
The eight possible sources for the second assignment are (indicated by field I): up, down,
left, right, rA, rB, multLow and multHigh. The first four are the outputs of a neighbor, rA and
rB are a byte from the respective register of the core and the last two options select a byte from
half of the result of a 16x16 bit multiplier. Each multiplier is shared between four 8 bit ALUs.
The eight possible conditions (indicated by field T) are {m0}, {!m0}, {m1}, {!m1}, {m2},
{!m2}, {true} and {false}. The {m0} condition is true if register m0 has a value that is different
from zero while {!m0} is true when register m0 holds zero.
There can be up to 64 different operations (selected by field X), not including multiplications
and shifts which are handled by circuits between the ALUs. The "$" indicates unsigned, "."
indicates saturating, "r" indicates carry from right and "u" carry from up.
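Putting the pieces together, a sketch of loading one instruction into every ALU and binding it to an operation number (the operation number, the load address and the register choices are assumptions, and the assembler is assumed to fill in the instruction count):

      t5 := t6 [3] t7 -> matrix[*,*,100]   ; bind matrix operation 3 to address 100 in every ALU
      {true} m2 := m2 +r m3, m4 := rA      ; the single ALU instruction loaded: add m3 into m2 with
                                           ; carry from the right neighbor and latch a byte of operand A
      t8 := t6 [3] t9                      ; later invocations of operation 3 reuse the loaded code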
                       000   001    010    011    100   101    110    111
Additions (000xxx)     +     +.     $+.    -      -.    $-.    |-|    |$-|
Additions (001xxx)     +r    +.r    $+.r   -r     -.r   $-.r   |-|r   |$-|r
Additions (010xxx)     +u    +.u    $+.u   -u     -.u   $-.u   |-|u   |$-|u
Logic (011xxx)         |     ~|     ^      ~^     &     ~&     &~     ~&~
Comparisons (100xxx)   =     0?     +?     +??    <     <=     $<     $<=
Comparisons (101xxx)   =r    0?r    +?r    +??r   <r    <=r    $<r    $<=r
Comparisons (110xxx)   =u    0?u    +?u    +??u   <u    <=u    $<u    $<=u

Table B.1: ALU Matrix operations
The versions of subtraction indicated by |-| are absolute differences. The comparisons
with +? test for carry out while those with +?? test for overflow. For non-saturating arithmetic
the results for signed and unsigned operations are the same. Absolute differences also don't
make sense with saturating arithmetic.