Adaptive compilation for an object-oriented and reconfigurable architecture

Jecel Mattos de Assumpção Júnior
Advisor: Prof. Dr. Eduardo Marques

Doctoral dissertation submitted to the Instituto de Ciências Matemáticas e de Computação - ICMC-USP, in partial fulfillment of the requirements for the degree of the Doctorate Program in Computer Science and Computational Mathematics.

USP, São Carlos, May 2015

Abstract

As the complexity of embedded systems grows, so does the attraction of using object-oriented dynamic languages, like Python or Smalltalk, to implement them. To make this practical, reducing the energy and cost of the computing resources required by such languages has become a hot research topic. This project addresses these issues by designing a processor specifically for Smalltalk-80, by optimizing this processor for adaptive compilation, by using both fine-grained and coarse-grained parallelism to do more work at lower clock speeds, and by taking advantage of the reconfigurability of Field Programmable Gate Arrays (which are increasingly present in embedded systems) to adapt the hardware at runtime to variable computing loads.

Contents

1 Introduction
    1.1 Project Goal
    1.2 Contributions
    1.3 Organization
2 Theory and Related Works
    2.1 Language-Specific Processors
        2.1.1 Algol computers
        2.1.2 SYMBOL
        2.1.3 Smalltalk computers
        2.1.4 Lisp Machines
        2.1.5 Forth processors
        2.1.6 Java Computers
    2.2 Adaptive Compilation
        2.2.1 Evolution of Adaptive Compilation
        2.2.2 Uses of Adaptive Compilation
    2.3 Parallelism
        2.3.1 Shared Memory
        2.3.2 CSP and Occam
        2.3.3 Asynchronous Messages
        2.3.4 Synchronism by Necessity
        2.3.5 Linda
        2.3.6 Concurrent Aggregates
    2.4 Reconfiguration
        2.4.1 Dynamic and Partial Reconfiguration
    2.5 Summary
3 Implementation
    3.1 Language-Specific Processor: SiliconSqueak
        3.1.1 Level 1 caches and virtual level 2 caches
        3.1.2 Microcode cache
        3.1.3 Bytecode and data caches
        3.1.4 Stack cache
        3.1.5 Virtual registers
        3.1.6 Fetch
        3.1.7 PIC
    3.2 Adaptive Compilation: Cog and Sista
    3.3 Parallelism: ALU Matrix coprocessor
    3.4 Reconfiguration: runtime reload
    3.5 Summary
4 Experimental Results
    4.1 Language-Specific Processors
    4.2 Adaptive Compilation
    4.3 Parallelism
    4.4 Reconfiguration
    4.5 Summary
5 Conclusion
    5.1 Future Work
        5.1.1 Experiments
        5.1.2 Smalltalk Zero
        5.1.3 Multi-level Tracing and Partial Evaluation
        5.1.4 Wafer Scale for Massive Parallelism
        5.1.5 Native Compilation to Hardware
        5.1.6 Non Von Neumann Architectures
    5.2 Discussion and Limitations
Bibliography
A SiliconSqueak Assembly Language
        A.0.1 000: Registers t0 to t31
        A.0.2 001: Registers i0 to i31
        A.0.3 010: Registers s0 to s31
        A.0.4 011: Registers x0 to x31
        A.0.5 100: Pseudo Registers #0 to #31
        A.0.6 101: Pseudo Registers #-1 to #-32
        A.0.7 110: Pseudo Registers #o0 to #o31
        A.0.8 111: Registers L0 to L31
    A.1 Directives
        A.1.1 org expression
        A.1.2 def name expression
    A.2 Syntax
    A.3 Operations
        A.3.1 Arithmetic (00)
        A.3.2 Comparison (01)
        A.3.3 Logic (10)
        A.3.4 Shifts (11)
    A.4 Fetch
    A.5 Streams
    A.6 Context
    A.7 Thread
    A.8 Image
    A.9 System
B ALU Matrix Assembly Language

List of Figures

2.1 Dorado block diagram and microcode instruction format
2.2 J-Machine with 1024 processors
2.3 JOP block diagram
2.4 Programming language implementation techniques
2.5 Dynamic Compilation
2.6 How Polymorphic Inline Caches work
2.7 Adaptive Compilation
2.8 Parallelism models
3.1 Squeak's implementation
3.2 SiliconSqueak pipeline stages
3.3 Organization of the ALU Matrix coprocessor
3.4 Switching between different FPGA configurations
3.5 Time to execute code on different FPGA configurations
4.1 Cog generated code for a PIC with 6 different types
4.2 PIC for SiliconSqueak with any number of different types
4.3 PIC causes these cache entries for six types
5.1 Slang code for pushTemporaryVariableBytecode
5.2 Microcode for pushTemporaryVariable 3 bytecode

List of Tables

2.1 Parallelism Models
A.1 Registers for Stream Units
B.1 ALU Matrix operations

Chapter 1
Introduction

Of the various ways of classifying high level computer programming languages, a significant one is the division into static (including Fortran, Pascal, C, Modula-2, Ada, Occam, and C++) and dynamic (such as Lisp, Smalltalk, APL, Python, Ruby, Lua, the various Unix shell languages, JavaScript and TCL) languages. Given that Fortran was introduced in 1957 and Lisp in 1958 (published, though the first implementation came in 1960), this division is nearly as old as computing itself. Some languages, such as Cecil and Dart, allow the programmer to select how dynamic a given application will be, while other languages, like Java, mix characteristics from both sides.

Static languages allow for simpler compilers and have small runtime software infrastructures that make good use of available computing resources, and so are popular where the cost of such resources is a major factor. That is the case at both the high end (multi-million dollar supercomputers) and the low end (embedded systems costing just a few dollars or less). In contrast, dynamic languages save programming time at the cost of wasting some computing resources. A traditional way of getting the best of both worlds is to develop a prototype of an application using a dynamic language, such as Lisp, and, when it is fully tested, rewrite the application from scratch in a static language, like Fortran. The obvious alternative is to reduce the gap in resource requirements between the two alternatives, and this has been an active research topic since the 1970s. Dynamic translation (Deutsch and Schiffman, 1984), known as "JIT compilation" in the Java world, and adaptive compilation (Hölzle, 1994) greatly reduced the runtime performance gap at the cost of an increase in other computing resources, such as memory. While adaptive compilation has been very successful technically, it is very demanding in terms of engineering resources, so even some popular dynamic languages are still implemented as simple interpreters. This was the case for JavaScript up to 2008, when Google, Microsoft and the Mozilla Foundation entered a performance race for that language backed by significant financial resources.

An older research direction was the development of computer architectures optimized for the features present in high level languages.
The brief commercial success of the Lisp Machines in the late 1980s and early 1990s was the most visible result of this research, but their replacement by Unix graphical workstations and then generic PCs (partly due to the success of technologies similar to adaptive compilation) has mostly eliminated interest in language-specific architectures. When energy use, and not just top performance, is used as a metric, however, this research direction can still be useful (especially in combination with, instead of as an alternative to, other technologies). Reducing the energy use of computing systems is a top priority in embedded and portable systems (where battery life must be maximized), in supercomputers and data centers (where the electric bill is by far the main operating cost) and even in regular PCs (since performance is limited by the need to not melt down the processor, which is known as the "power wall").

A well known way to increase the amount of computation per energy unit while maintaining performance is the use of parallelism. The limited number of components available in the past and the lack of established programming models other than the sequential one had limited extensive parallelism to specialized applications. The new energy limitations and high component budgets are now making parallelism the norm whether programmers are ready or not. Fortunately, languages based on message passing (such as Occam, Smalltalk-80 and Erlang) can be more easily adapted for parallelism than those based on direct manipulation of shared memory ("imperative languages", such as C and most others). Additionally, a computer architecture designed for such languages can better integrate support for parallelism.

Field programmable electronic circuits had been available since the 1970s, but the introduction of Field Programmable Gate Arrays (FPGA) in the mid 1980s and the continuous growth in their capacity, made possible by "Moore's Law" (Moore, 1965), allowed the creation of reconfigurable computers. These machines can have their hardware changed, possibly at runtime, to be optimized for different applications. FPGAs were once mostly used for prototypes and high-end communication systems due to their high cost, but have recently become more common in mainstream products, including cost sensitive embedded systems. Runtime reconfiguration could be one way to get more computing done with less energy, but there are a few obstacles. Adaptive compilation can address one of these obstacles since it uses some of the same kind of runtime information that is needed to trigger the reconfigurations.

Another way of classifying computer language environments, which is particularly relevant for embedded applications, is the contrast between native development and cross development. In the native case the same computer that is used to create a program is also the one that runs it. The language can be either dynamic or static. For cross development the two machines are separate, and this is only interesting for static languages since it allows the computer that only runs the program to be more limited (and so cheaper) because the runtime environment can be very simple. As memory has become cheaper and low-end 32 bit microcontrollers cost the same as traditional 8 bit ones, very complex runtime environments (such as complete Linux operating systems) are becoming normal. This eliminates the most significant advantage of cross development, and so eliminates a barrier to using dynamic languages.
1.1 Project Goal

The thesis is that it is possible to extend the use of object-oriented dynamic languages to areas in computing that have been traditionally limited to static languages through a combination of a language-specific processor architecture, support and use of adaptive compilation, parallelism and reconfiguration. The focus of this project is on embedded systems, but the results should also apply to supercomputing and other areas.

1.2 Contributions

The project incorporates a number of inventions which extend the state of the art in computer architecture and language implementation and also uses novel combinations of existing techniques. Some of the most notable inventions are:

• architecture for bytecode-based language implementations: SiliconSqueak is an overall processor architecture with features to speed up the interpretation of bytecodes and to efficiently support adaptive compilation by offering a lower level "microcode" as a compiler target.

• PIC instruction and special instruction cache: Polymorphic Inline Caches (PIC) are a central feature of adaptive compilation as they both speed up message sends and accumulate type information for later compilation phases. SiliconSqueak's instruction cache can handle a parallel search of all possible receiver types, in contrast with the sequential search on conventional processors.

• stack cache: register windows (used in the original RISC and RISC II, SOAR, Sparc, AMD29000 and Altera's NIOS 1) are very effective in reducing the costs of calls and returns, but must be flushed on every thread switch, which is not the case for SiliconSqueak's stack cache.

• virtual level 2 caches: the hardware implements the cache refill mechanism by transferring data between the internal level 1 caches and the region of memory designated as the virtual level 2 cache. Software handles misses in the virtual L2 caches, and so gets to define the policy used, adding an interesting level of reconfigurability to the system.

• runtime reconfiguration for adapting processing grain: a given area in an FPGA can implement a number of simple SiliconSqueak cores for coarse-grain parallelism or a single SiliconSqueak core with an attached ALU Matrix coprocessor for fine-grain parallelism. Which is most efficient varies at runtime, and reconfiguring the FPGA accordingly will result in the most computing per energy unit.

1.3 Organization

Each of the main chapters in this text covers the four topics of this project in this order: language-specific processors, adaptive compilation, parallelism and hardware reconfiguration.

• Chapter 2 serves as an introduction and historical overview for the topics in the form of a comparison with related works.

• A description of the project and details about the implementation can be found in Chapter 3.

• Chapter 4 shows the experiments designed to validate the implementation and the results of those experiments.

• Finally, the conclusion is presented in Chapter 5, which also mentions plans for the project's future.

Chapter 2
Theory and Related Works

For each of the topics related to this project, a historical overview and a description of the state of the art are presented in the form of references to related projects. The history presented is not meant as an exhaustive survey of each area, but as a selective overview to provide a context for the following chapters.

2.1 Language-Specific Processors

One characteristic of a practical programming language is that it should be "Turing complete".
That allows it to, in theory, do anything that can be done in any other language. In practice, it might be so inefficient or so awkward to emulate one language's feature in another that nobody actually does it. This is known as the "Turing tarpit". While in theory there should be no reason to optimize a processor architecture for a given programming language, in practice much research has been dedicated to doing just that.

Biases are very hard to perceive from within the contexts in which they occur. Many architectures considered to be "universal" or "language neutral" (which is presented as a very desirable design goal) are not really such, but happen to be optimized for the languages their users are interested in. The original member of the extremely popular x86 architecture, the Intel 8086, could run assembly and Pascal code very well but was limited when running C, which happened to be much less popular at the time. The programmer could write proper C code if it could fit in 64KB of memory, wasting the rest of the 1MB address space. Or non-standard notations called "far pointers" could be used to access all memory, but at the cost of becoming incompatible with C on all other computers. So it simply was not possible to port C code between a PC and a DEC Vax computer, for example, because what PCs could run wasn't really C. The next member of that architecture was the 80286, and it was optimized to run Modula-2 (Pascal's successor) and similar languages, which were the most popular when the project was started but had lost out to C by the time the product reached the market. It was the full and proper support for C in the 80386, by extending the architecture from 16 to 32 bits, that saved Intel from losing the microprocessor wars.

The 8086 was actually a quick and dirty project meant to fill in for the delayed iAPX432 architecture, Intel's first 32 bit design. That processor was optimized to run programs written in the Ada programming language, created to standardize military applications. Many of its features, especially in the area of security, are still unmatched by modern alternatives. Unfortunately, performance was very disappointing due to the limitations of integrated circuit technology at the time and a lack of optimizations. With the current negative perception of language-specific architectures, Intel has rewritten history in all its official documents so that the 80386 is now its first 32 bit design and the iAPX432 is never mentioned at all.

2.1.1 Algol computers

The definition of the Algol-60 language generated interest in the development of computer architectures that could efficiently implement this language's features. Nearly all computer architectures of the time were one-address machines with a single accumulator (a design inherited from the days of mechanical calculators). Researchers trying to actually develop compilers for Algol proposed that a stack machine would better match the nested blocks of that language. The most successful effort was the Burroughs B5000, created by the group led by Bob Barton in 1961. The B5500 version (1964) implemented 8 bit instructions which operated on tagged data, a model later used in the design of the virtual machines for the Smalltalk-76 (and later) and Java programming languages. Stacks not only simplified the compiler and made object code more compact, but they also became an important part of virtual memory and multi-tasking.
The English Electric KDF9 was another stack machine released in 1964 with the goal of efficiently running Algol-60.

2.1.2 SYMBOL

This architecture, published in 1971, defined a high level language called SYMBOL and hardware that could run it directly. The compiler and even the text editor and debugger were integrated into the hardware. Though it did not have much of an impact, this project was the most extreme attempt to close the "semantic gap" between high level programming languages and hardware.

2.1.3 Smalltalk computers

This project includes the design of an architecture for running Smalltalk-80 programs, so the various Smalltalk computer projects of the past are the most closely related.

Xerox

The Alto computer (Thacker & al., 1981), developed at the Xerox Palo Alto Research Center (PARC), is an example of the use of microcode to reconfigure computers. Alan Kay's group at PARC called the machine the "Interim Dynabook" for its role as a research vehicle for future commercial (and portable, as in laptops and tablets) machines, allowing software development to be started as early as possible. It was the test platform for the Ethernet network and served as a file server, a print server and a graphical workstation for thousands of researchers within Xerox and a few collaborating institutes.

Of the improved versions that followed, the one with the highest performance was the Dorado. Its block diagram (as seen by a programmer working at the microcode level) is shown in figure 2.1. Beyond the architectural advances relative to the Alto, it was the use of high speed bipolar Emitter Coupled Logic (ECL) technology which gave the Dorado its speed. Unfortunately it also caused a very high energy consumption (and, as a consequence, it was very noisy) and made it very bulky, besides costing nearly ten times as much as the Alto. But for many years it was considered the best Smalltalk computer in the world.

Figure 2.1: Dorado block diagram and microcode instruction format

Swamp, Sword32 and AI32

Given the very detailed specification of the Smalltalk-80 virtual machine, it was unavoidable that some groups would try to implement it as a physical microprocessor. Swamp (Lewis & al., 1986) was an academic project, while Sword32 and its successor AI32 (Nojiri & al., 1986) were developed by Hitachi as a commercial product. As with the iAPX432 object-oriented processor (Intel's first 32 bit architecture, launched in 1979), these projects suffered due to the limitations of the integrated circuit technology of the time, especially with regard to internal memory. Even so, the results that were obtained were very interesting when compared with the Dorado.

SOAR - Smalltalk On A RISC

After the success of a student project at Berkeley to design a processor to run programs written in C (the success was such that the project name, Reduced Instruction Set Computer or RISC, became the generic term for this whole kind of architecture), a follow-up project studied the possibility of using the same ideas to execute object-oriented languages. Smalltalk On A RISC (SOAR), described in Samples & al. (1986) and Ungar & al. (1984), abandoned the bytecodes used until then as the compilation target for Smalltalk source code in favor of a 32 bit instruction set.
Many of the results of this project were the inspiration for the creation of Sun's Sparc, and the experience gained had a great impact on the software implementation of object-oriented languages, as described in section 2.2.

Mushroom

The design of a Smalltalk computer at the University of Manchester (Williams, 1989) was the first to make use of FPGAs for reconfigurability (very limited due to the small sizes of the available components). The original plan was to build a bytecode based machine like Swamp, but the positive results of the implementation of the Self programming language on conventional hardware encouraged the adoption of a RISC style instruction set.

The most interesting feature of this project was its memory system. The cache used virtual addresses instead of physical ones, such that on a cache hit there was no need to consult the object table. This combined the advantages of having a table with the performance of direct pointers. In addition, the garbage collector was integrated with the virtual memory and with the cache. A new object would always be created in the cache and could eventually be collected without ever having been written to main memory (not to mention being swapped to disk).

J-Machine

The goal of Bill Dally's "Jellybean Machine" (or J-Machine) project at MIT was to research the consequences of having processors as cheap as jellybeans. It is a good example (like the Alto and Dorado) of using Moore's Law to obtain today a computer that will be typical in the future. Figure 2.2 shows the 1024 processor version, each one with local memory and high speed communication channels to each of its six neighbors (the overall architecture is a 3D mesh).

A version of Smalltalk called ConcurrentSmalltalk (there was another project in Japan with exactly the same name) was developed, but it used a Lisp syntax. Active messages were implemented, which are sent as they are being built and which, when received, directly invoke the code that is to process them. One of the most interesting experiments on the software side of the project was the invention of concurrent aggregates. These allow a very simple transformation of sequential code into parallel code by replacing a traditional data structure (aggregate) such as Array with the functionally equivalent ParallelArray. So the bulk of the complexity is dealt with by the creator of such classes and not by their users. The success was so great that a new language, Concurrent Aggregates (CA), was created to explore this concept further.

Figure 2.2: J-Machine with 1024 processors

2.1.4 Lisp Machines

The use of some Altos at the artificial intelligence laboratory at MIT inspired researchers to create a similar computer but optimized for the execution of the Lisp programming language. Before that, the Digital Equipment Corporation (DEC) PDP-10 computer had been considered ideal for this type of application. But the high cost of its operation meant that it had to be shared among many users, which combined with its address space limit of 256K words of 36 bits (around 1MB) to make it hard to run large applications. By using instead a processor dedicated to each user and able to address 16M words, the Lisp Machine was meant to allow advances in artificial intelligence research. With such a focused project it was possible to optimize the architecture even further for Lisp (by using tagged data like in the B5500, for example).
The technology was licensed to two companies founded by researchers from MIT itself: Symbolics and Lisp Machine Inc (LMI). Later on they were joined by Texas Instruments. This was a great incentive to move artificial intelligence applications, especially expert systems, from the laboratory into the field. The increase in the capabilities of Unix workstations coincided with a strong reduction in the market for expert systems (the so called "AI winter") and this caused the Lisp Machine market to collapse. Less than a decade later these same Unix workstations were eliminated by the increasing performance of regular PCs.

2.1.5 Forth processors

In Philip J. Koopman (1989) the following 16 bit architectures optimized for the Forth programming language are compared: WISC CPU/16, MISC M17, Novix NC4016 and Harris RTX 2000. Later chapters compare these 32 bit processors: FRISC 3, RTX 32P and SF1. The Novix 4016 was particularly interesting because it was the first hardware effort by Chuck Moore, the creator of the Forth programming language. Though Forth's two stack virtual machine is almost trivial to implement on normal processors, Moore felt an optimized design could be simpler, cheaper and more energy efficient. A more commercially successful version of this architecture is the Harris RTX 2000, several of which were recently used to land a probe on a comet.

While the Novix design took advantage of the microcode style instruction format to combine multiple high level Forth instructions into a single machine word, the following efforts replaced this with independent and narrow instructions. Five bit instructions are enough for all Forth primitives, forming a Minimum Instruction Set Computer (MISC). Other researchers have explored different designs for Forth processors, though variations on MISC are very popular. The recent J1 went back to the Novix style, but takes advantage of the features of modern reconfigurable chips (FPGAs).

2.1.6 Java Computers

The rising popularity of the Java programming language led to research into architectures optimized for it. The most important virtual machines nearly always become real machines: Lisp, Smalltalk, Forth, Prolog (the famous Fifth Generation Project in Japan back in the 1980s), Modula-2 and even UCSD Pascal all had commercial hardware versions.

The first, and most famous, attempt came from Sun itself (Java's creator) in the form of PicoJava. It used some very sophisticated instruction grouping techniques to work around the traditional performance limitations of stack based architectures. This project was also the company's first release of sources to the public, initially in a rather restricted form but later as free hardware. Studies such as Narayanan (1998) evaluated the difficulties of implementing Java in hardware. Many of the results obtained are valid for any object-oriented language that uses bytecodes. The following attempt by Sun was called MAJC and replaced bytecodes with an instruction set typical of media processors, in a way very similar to what had been done in the SOAR project. The delay in this product's release reduced the performance advantages relative to Sun's own Sparc processor and this caused it to be eliminated.

The Java Optimized Processor (JOP) (Schoeberl, 2005) was an academic project that became relatively successful commercially. The data flow part of its design is illustrated in figure 2.3; it was the third attempt by its creator.
The first one was inspired by SOAR's results and was a simple RISC processor that ran the Java virtual machine entirely in software. The second version added to the RISC just a few instructions to speed up stack manipulation, which resulted in a significant increase in performance. The third (and final) version replaced the RISC with microcode, though this microcode is very different from the traditional kind as it uses very narrow instructions that directly correspond to the simplest and most popular Java bytecodes.

Figure 2.3: JOP block diagram

The evolution of JOP in terms of performance does not mean that speed is the project's priority. The focus is on real time applications, and consistent run times are more important than short run times. The small FPGA area required by a single JOP processor has encouraged research into multicore solutions.

2.2 Adaptive Compilation

The Squeak Smalltalk language is also the operating system for this project. It inherited its syntax and semantics from Smalltalk-80 and, more recently, advanced implementation technologies such as adaptive compilation from Self.

2.2.1 Evolution of Adaptive Compilation

Smalltalk-72 (Goldberg and Kay, 1976) was the first implementation of a purely object-oriented programming language (Simula was a hybrid language). It was extremely dynamic and, as a result, very slow. Even the syntax itself was defined at runtime, which made compilation impossible and made it hard to understand programs written by other people. A radical change was made for the 1976 version of the language, as shown in figure 2.4: a simple and fixed syntax was defined and the implementation was split into two phases: the source text was compiled to the machine language of a "virtual machine" and then an interpreter simulated this virtual machine on the physical computer at runtime. The virtual instructions define a simple stack machine (plus some more complicated instructions for message sending) and are called "bytecodes" due to their length. Most current Smalltalk implementations use an interpreter for the virtual machine, and for this reason pure object-oriented languages are considered inefficient by many people.

Figure 2.4: Programming language implementation techniques (interpreter: Smalltalk-72, traditional JavaScript and many others; bytecode virtual machine: Smalltalk-76 and -80, embedded Java, UCSD Pascal)

The implementation of Squeak Smalltalk (Ingalls & al., 1997) in 1996 made use of the fact that Goldberg and Robson (1983) included a complete implementation of the virtual machine written in Smalltalk-80 itself. But the style of that code did not make use of many Smalltalk features, so that it could be used as a reference for an implementor who wished to write the equivalent in Pascal or C. For the Squeak project, a tool was created that could automatically translate this subset of Smalltalk (called Slang to distinguish it from normal Smalltalk, and with no relation to other languages named Slang) to C.

Dynamic Compilation

Some projects tried, in the early 1980s, to implement the virtual machine directly in hardware. They did not achieve the same increase in performance as obtained by Smalltalk On A RISC (SOAR) (Ungar & al., 1984) at the University of California at Berkeley.
This group included some rather minimal hardware support (some of which is included in Sun's Sparc processors) and compiled the source text directly to native machine code. An interesting technique used in this system is the "inline cache" invented by L. P. Deutsch (Deutsch and Schiffman, 1984); the top two parts of figure 2.6 show how it works. This replaces the costly search on every message send with a simple subroutine call whenever possible. It works well in code where the flexibility of Smalltalk's polymorphism isn't really used. The inline cache works by initially compiling a message send into a subroutine call to code that does the search for methods. At runtime, when the selected method is found, the original call to the search routine is replaced with a direct call to the method that was found. The call actually goes to a short header which verifies that the receiver's class is the expected one (and falls back to the search if not). So the next time this code runs, the system expects the search to produce the same result as the previous time.

A serious problem with directly compiling Smalltalk text to RISC machine code was the size of the resulting executables. A Smalltalk system has a lot of code, and an expansion to four times the size can actually reduce performance due to an increase in virtual memory activity. A compromise is the use of dynamic translation (Deutsch and Schiffman, 1984), called "dynamic compilation" (or Just In Time or JIT compilation in the Java world). The source text is translated to bytecodes, and the first time a method is called it is not interpreted but rather translated to native machine code, which is saved in a special software cache, and then that code is executed. The next time the same method is invoked it can be directly executed from the cache, at a considerable gain in average performance. If the cache becomes too full then some methods can simply be discarded; they will have to be recompiled if they are called again later. Given the current speed difference between processors and disk, this solution is faster than trying to save the methods that are eliminated from the cache.
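To make the inline cache mechanism concrete, the following sketch models it in C. The names and data structures are invented for illustration (a real VM patches native call instructions in the generated code rather than C structs), but the behaviour is the one described above: the first send goes through the full lookup and patches the site, and later sends only pay for a single class comparison.

    #include <stdio.h>

    /* Illustrative only: the compiled "send site" is modeled as a C struct
       and native methods as C function pointers. */
    typedef struct Class  Class;
    typedef struct Object { Class *klass; } Object;
    typedef Object *(*Method)(Object *self);

    struct Class { const char *name; Method print_method; };

    /* Slow path: the full method search (trivial here, costly in a real VM). */
    static Method lookup(Class *klass) { return klass->print_method; }

    /* One inline cache per send site: the receiver class seen last time and
       the native code that was found for it. */
    typedef struct { Class *expected; Method target; } InlineCache;

    static Object *send_print(InlineCache *ic, Object *rcvr)
    {
        if (ic->target && rcvr->klass == ic->expected)
            return ic->target(rcvr);          /* hit: one compare plus a direct call */
        ic->expected = rcvr->klass;           /* miss: do the search, repatch the site */
        ic->target   = lookup(rcvr->klass);
        return ic->target(rcvr);
    }

    static Object *print_point(Object *self) { puts("a Point");     return self; }
    static Object *print_rect (Object *self) { puts("a Rectangle"); return self; }

    int main(void)
    {
        Class point = { "Point", print_point }, rect = { "Rectangle", print_rect };
        Object p = { &point }, r = { &rect };
        InlineCache site = { 0, 0 };      /* one compiled send site, initially unpatched */
        send_print(&site, &p);            /* first send: miss, site patched for Point    */
        send_print(&site, &p);            /* hit                                         */
        send_print(&site, &r);            /* new receiver class: the cache misses again  */
        return 0;
    }

The last send in main shows the weakness of the monomorphic cache: as soon as a second receiver class appears at the same site it starts to miss, which is what motivates the Polymorphic Inline Cache described below.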
The Self Language and Compiler Technology

The Self programming language (Ungar and Smith, 1987) is a Smalltalk variation based on a smaller, but more general, set of concepts. It can be described as a prototype based language with message passing as the most basic operation. Self objects are structured as a set of associations between names and values. When an object receives a message it looks up the association with the message name and either simply returns the associated value if it is a normal object, executes the code if it is a method object, or changes a corresponding association (or "slot") if it is an assignment. The slot with the name "x:" changes the value of the slot with the name "x", for example. In this model each object completely describes itself without the need for the concept of classes. The lack of classes implies that objects can only be created by copying (cloning) pre-existing objects (called prototypes) and then changing them. The assignment slots are sufficient for changing the state of an object, but the language includes a few "primitives" (like "AddSlot:" and "DeleteSlot:") which allow more fundamental changes to an object.

To avoid code duplication, an object can "inherit" from one or more parent objects. When a message is received and the object does not have a corresponding slot, the search continues (recursively) in the parents, but when a method is executed the object that originally received the message is used as the context no matter where the method was found. Parents are indicated using slots that have names ending in "*". As these parent slots also work as normal slots, it is possible to use assignment to replace parents at runtime. This is known as "dynamic inheritance" and shows how flexible this model is, but it is not a feature that has proved very useful.

While the Smalltalk virtual machine looks like a typical computer with a few extensions, the first implementations of Self were very radical in their exclusive use of message passing. The Smalltalk code "x := x + 1" would be translated to:

    push instance variable (x)
    push constant (1)
    send message (+)
    pop and store in instance variable (x)

The equivalent in Self would be written as "x: x + 1", which means "self x: self x + 1" (hence the language's name), and would be translated to these bytecodes:

    self
    send message (x)
    push constant (1)
    send message (+)
    self
    send message (x:)

As message sending is the bane of implementations of purely object-oriented programming languages, it would seem that Self would be even slower than Smalltalk. And the advantages of this style (a much simpler virtual machine with only eight bytecodes, and the incentive for a programming style with greater code reuse) would not be sufficient to make up for this loss in performance.

Figure 2.5: Dynamic Compilation (VisualWorks Smalltalk, Self 1 and 2, Java JIT compilers, Squeak Cog)

With dynamic compilation, however, all message sends that would access a data slot or an assignment slot could have their effects directly included in the generated native code, for a speed increase of 4 to 160 times (Chambers, 1992). This is only possible because the receiver for these messages is "self", which has its type known at runtime (when the compiler is called). When inheritance is taken into account, this is no longer true because "self" is of the type of the object that originally received the message and not the object where the method was found. One solution is to generate different versions of the native code for each case where the same original method is inherited, which is known as customized compilation (Chambers and Ungar, 1989). This scheme is only practical because the compiler is only invoked for the cases that actually show up during execution, which are a tiny fraction of all possible combinations. Without this customization Self would be three times slower, since few optimizations would work equally in the different
contexts in which a method can be called.

Figure 2.6: How Polymorphic Inline Caches work (code as initially compiled; after the first call; when new types of objects show up)

Polymorphic Inline Cache and Adaptive Compilation

As the sophistication of Self's compiler grew (Chambers and Ungar, 1990) (Chambers & al., 1989), performance came closer and closer to that of highly optimized languages such as C, but interactive use became worse as the compiler induced pauses became longer, as shown in Figure 2.5. Since Self programs have no explicit information about types, a compiler has to work very hard to extract all the implicit information in order to generate high quality code.

An interesting optimization used by the Self compiler is an extension of the inline cache previously used in Smalltalk: the Polymorphic Inline Cache (PIC) (Hölzle & al., 1991), shown in figure 2.6. This replaced the call to the method header (originally a call to the search routine) with a sequence of type tests, similar to the "switch" statement in C. If the message receiver has the same type as one of the previous sends of this message, then the correct native code can be called directly. If not, the normal search occurs and then the PIC is extended to include the new type. This extends the advantages of the inline cache to polymorphic message sends as well, and these are very common in applications written in an object-oriented style. If many different types appear at a given call site, the PIC can either start to eliminate the older types or the PIC itself can be abandoned in favor of a direct call to the search routine (which ends up being faster for these "megamorphic" call sites).

Figure 2.7: Adaptive Compilation (Self 3 and 4, Java HotSpot, StrongTalk)

A side effect of the PICs is that they accumulate information about the types of objects that actually appear in different places in the compiled code. This information is a subset of that obtained by the compiler through sophisticated analysis as mentioned before. This makes "adaptive compilation" a particularly effective strategy: the methods are initially compiled by a quick and dirty compiler and executed for a while, and then recompiled using the PICs as a source of type information. Both the first and second compilers can be simple and fast, as seen in Figure 2.7. Only methods that are extensively used need to go through the recompilation process, which dedicates processor time to the parts of the program with the greatest effect.

With adaptive compilation the code ends up calibrated for the actual runtime conditions, including the characteristics of the input data. The native code generated for an application might be different, for example, depending on whether the user is editing black and white images or colored ones. This technology can make use of information that would be too costly to obtain statically.
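The dual role of the PIC (fast dispatch plus a record of the receiver types actually seen) and the counter that triggers recompilation can be sketched as follows. This is an illustrative C fragment, not code from Self or any real VM; the threshold, the PIC size limit and the helper routines are all invented for the example.

    #include <stddef.h>

    /* Illustrative fragment only: a real system stores the PIC as patched
       native code rather than a data structure walked by a loop. */
    #define PIC_LIMIT     4        /* beyond this the call site is "megamorphic"  */
    #define HOT_THRESHOLD 1000     /* executions before the second compiler runs  */

    typedef struct Class Class;
    typedef void *(*Method)(void *receiver);

    typedef struct {
        Class  *seen[PIC_LIMIT];   /* receiver classes observed at this call site */
        Method  code[PIC_LIMIT];   /* native code found for each of them          */
        int     entries;
        long    count;             /* how often this send site has been executed  */
    } PicSite;

    Method slow_lookup(Class *klass);          /* full method search (elsewhere)     */
    void   recompile_with_types(PicSite *pic); /* optimizing compiler; reads `seen`  */

    void *pic_send(PicSite *site, Class *klass, void *receiver)
    {
        if (++site->count == HOT_THRESHOLD)    /* hot enough: recompile using the    */
            recompile_with_types(site);        /* type feedback gathered so far      */

        for (int i = 0; i < site->entries; i++)         /* the sequence of type tests */
            if (site->seen[i] == klass)
                return site->code[i](receiver);         /* direct call                */

        Method m = slow_lookup(klass);                  /* miss: normal search        */
        if (site->entries < PIC_LIMIT) {                /* extend the PIC             */
            site->seen[site->entries] = klass;
            site->code[site->entries] = m;
            site->entries++;
        }                                               /* else: stay megamorphic     */
        return m(receiver);
    }

When recompile_with_types runs, the classes accumulated in seen play the role of the type feedback described above: the optimizing compiler can inline or specialize for exactly the cases that actually occurred.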
From the programmer's viewpoint there is only one type: the object. But the compiler needs more detail if it is to generate acceptable code. One observation of actual use patterns in Self is that most objects are exact clones of some other object except for the values in their data slots. A practical implementation can separate objects into two parts: one with just the values of the data slots and another with all the rest (called the object's "map"). When an object is cloned, only the first part needs to be copied. Sharing maps between many objects not only saves a lot of memory space but also makes up for the lack of classes by allowing the compiler to treat these "clone families" as its notion of type. The full flexibility of Self remains, since any object can be changed at runtime using the programming primitives. In this case a new map is created and the changed object ends up being the start of a new clone family, which leaves all other instances of its previous type exactly as they were.

2.2.2 Uses of Adaptive Compilation

The most direct use of adaptive compilation technology is the performance gain for object-oriented languages. After the development of Self, part of that research group left Sun to create a company called Animorphics in order to make commercial use of that technology. They developed a high performance Smalltalk (called StrongTalk) and created a demonstration of a high performance Java implementation. As the second compiler only deals with the most critical code, the technology was named "HotSpot". Sun bought the company and the technology was incorporated into Java 1.2.

One problem in parallel systems is matching the degree of parallelism of the application to that of the hardware. If many more processes are created than there are physical processors, performance is wasted in switching between processes. If too few processes are used, then part of the machine will remain idle. As adaptive compilation takes into account actual runtime conditions, it can be used to adapt the level of parallelism (de Assumpção Júnior, 1994). The first compiler would generate code that is as parallel as possible and, during execution, the PICs would accumulate not only type information but also information about useless task switching. The second compiler could then eliminate the excess parallelism while doing its optimizations. A similar alternative is presented in Diniz and Rinard (1997), which alternates between evaluation and production phases to select the best strategy for compiling critical code fragments in parallel applications.

2.3 Parallelism

With the creation of integrated circuits and their increasing density it became obvious that it would eventually be possible to put a whole processor into a single chip. With the low cost of these microprocessors it became possible to increase a computer's performance by simply adding more processors. This caused a significant increase in research into parallel programming techniques in the 1980s. The problems encountered, combined with the exponential increase in performance of single microprocessors (due to a combination of ever higher clock rates and the use of larger transistor budgets for elaborate architectural tricks), killed off the interest in such research in the 1990s. Around 2004 the excessive heat dissipation made increasing clock rates impractical, while at the same time the architectural tricks yielded smaller and smaller results (Asanovic & al., 2006). The solution was to use the extra transistors to place additional processors on a single die.
This made the results of the 1980s research relevant once more.

Many parallelism models have been developed, and in Marr (2013) a survey of the models and their use in current programming languages was undertaken to identify a minimum set of primitives upon which all these models (and, hopefully, any future ones) can be implemented. Table 2.1 separates the models into those that have been used for a long time (indicated as "prior art"), those that are normally implemented as normal library functions for programming languages, those that can be used to increase performance and, finally, those that require semantic support for their implementation.

Table 2.1: Parallelism Models. The original table groups the models into four columns (Prior Art, Library, Performance and Semantics); the models listed include: Asynchronous Operations, Atomic Primitives, Agents, APGAS, Active Objects, Atoms, Barriers, Co-routines, Concurrent Objects, Condition Variables, Critical Sections, Fences, Global Address Spaces, Global Interpreter Lock, Green Threads, Immutability, Event-Loops, Events, Far-References, Futures, Guards, MVars, Message Queue, Parallel Bulk Operations, Join, Locks, Reducers, Memory Model, Method Invocation, Single Blocks, State Reconciliation, Race-And-Repair, Thread Pools, Thread-local Variables, Threads, Volatiles, Wrapper Objects, Actors, Asynchronous Invocation, Clocks, Data Movement, Axum-Domains, Data-Flow Graphs, By-Value, Data-Flow Variables, Channels, Fork/Join, Data Streams, Implicit Parallelism, Isolation, Locality, Map/Reduce, Mirrors, Message Sends, One-sided Communication, Non-Intercession, Persistent Data Structures, Ownership, PGAS, Replication, Vector Operations, Side-Effect Free, Speculative Execution, Transactions, Tuple Spaces and Vats.

2.3.1 Shared Memory

The parallelism model closest to the sequential model familiar to most programmers is implemented by simply having two or more processors use the same memory. To avoid errors due to interference among the processors, some explicit synchronization structures (such as semaphores) must be used. These simple mechanisms are hard to use correctly, which has led to the development of more sophisticated systems such as monitors and transactions (which are in turn implemented in terms of semaphores or equivalents). When computers have cache memories it becomes more complicated to maintain coherency, which limits the scalability of this model. With multicore chips now including eight or more processors these limits are becoming a problem. The semaphore model is used in Squeak Smalltalk, but it was not considered sufficient for this project.

2.3.2 CSP and Occam

An alternative that requires more radical changes relative to traditional programming is the implementation of the system as a set of sequential programs connected through fixed communication channels. This Communicating Sequential Processes (CSP) model was proposed by Hoare (1978). The Occam programming language was designed for the Transputer microprocessor and both were built around this model. The popular Message Passing Interface (MPI) library can be used to structure parallel applications in this way. Erlang and Ada are two languages which use variations of synchronous message passing, and this model can also be found in the design of microkernel based operating systems and Remote Procedure Call (RPC) libraries such as the Common Object Request Broker Architecture (CORBA) standard.
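As an illustration of the rendezvous behaviour of CSP channels, the sketch below implements a one-slot synchronous channel on top of POSIX threads. It is not the Occam or MPI interface, just the semantics: the sender does not return until a receiver has taken the value.

    #include <pthread.h>

    /* A one-slot synchronous channel. Purely illustrative; Occam and MPI
       have their own primitives. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  changed;
        int             value;
        int             full;     /* 1 while a value is waiting to be taken */
    } Channel;

    void channel_init(Channel *c)
    {
        pthread_mutex_init(&c->lock, NULL);
        pthread_cond_init(&c->changed, NULL);
        c->full = 0;
    }

    void channel_send(Channel *c, int value)
    {
        pthread_mutex_lock(&c->lock);
        while (c->full)                          /* wait for the previous value to go   */
            pthread_cond_wait(&c->changed, &c->lock);
        c->value = value;
        c->full  = 1;
        pthread_cond_broadcast(&c->changed);
        while (c->full)                          /* block until a receiver has taken it */
            pthread_cond_wait(&c->changed, &c->lock);
        pthread_mutex_unlock(&c->lock);
    }

    int channel_receive(Channel *c)
    {
        pthread_mutex_lock(&c->lock);
        while (!c->full)
            pthread_cond_wait(&c->changed, &c->lock);
        int value = c->value;
        c->full   = 0;
        pthread_cond_broadcast(&c->changed);
        pthread_mutex_unlock(&c->lock);
        return value;
    }

A producer thread calling channel_send and a consumer calling channel_receive then advance in lockstep, much like two Occam processes joined by a channel; replacing the single slot with a queue would give the asynchronous behaviour of the next model.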
2.3.3 Asynchronous Messages

The family of Actor languages (Hewitt & al., 1973) uses a model of unidirectional messages. While the previous model suspends execution at the sender until a reply has been received, with asynchronous messages the sender continues executing as soon as the message has been sent. To receive a reply the other process must send a second message back to the first one, which is only possible if the identity of the sender has been included as one of the arguments of the original message. While asynchronous messages transfer information, they can't be used as a synchronization mechanism: information about events must be implemented separately, as an additional layer on top of the messages. This model is extremely flexible, but that greatly increases the possibility of programming errors, which is why the previous model tends to be more popular.

2.3.4 Synchronism by Necessity

The parallelism model described in de Assumpção Júnior (1993) is the "synchronism by necessity" originally created for the Eiffel programming language (Caromel, 1993). This model is compared with the previous two in figure 2.8. When a message is sent, a "future object" is immediately returned as a temporary result and execution continues at the sender just like with asynchronous messages, which tends to allow more parallelism. When the other process finally returns a reply, the future object is replaced by the new result. If any process tries to use the future object before this transformation, it must stop executing until it happens. This means that the semantics are exactly the same as for synchronous messages, which combines the advantages of the two previous models.

Figure 2.8: Parallelism models (synchronous messages, asynchronous messages, wait by necessity)

2.3.5 Linda

Linda is a coordination language rather than a programming one, which means its primitives can be added to any sequential language to transform it into a parallel version, like Linda Basic, Linda Pascal, Linda C or Linda Smalltalk. The idea is that a set of processes, possibly on different machines, use these primitives to access, in a controlled way, a shared associative memory built of data tuples, as described in Gelernter and Bernstein (1982). All communication is indirect and is based on the contents of the tuples. As with MPI, the Linda primitives can be used both for synchronous and asynchronous messages as needed.

2.3.6 Concurrent Aggregates

In the J-Machine parallel Smalltalk computer (Noakes & al., 1993), one of the parallelism models used is based on concurrent aggregates. Besides the traditional Array class of Smalltalk-80, for example, ConcurrentSmalltalk also includes a ConcurrentArray. The memory of such structures is spread out through the local memories of the processors and all operations are done in parallel, but in a coordinated way. In Ungar and Adams (2010) some very similar data structures are used, but the focus is on exploring the possibility of increasing parallelism in exchange for a reduction in the precision of the replies.
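The future objects of the wait-by-necessity model (Section 2.3.4) can be sketched in the same POSIX-threads style. The version below is purely illustrative (in the real model the future is transparent to the program, which never manipulates it explicitly): a reply resolves the future, and a process only blocks if it touches the value before the reply has arrived.

    #include <pthread.h>

    /* Illustrative future object for "wait by necessity". */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  resolved;
        int             ready;
        void           *value;
    } Future;

    void future_init(Future *f)
    {
        pthread_mutex_init(&f->lock, NULL);
        pthread_cond_init(&f->resolved, NULL);
        f->ready = 0;
        f->value = NULL;
    }

    /* Called by the replying process when the real result is available. */
    void future_resolve(Future *f, void *value)
    {
        pthread_mutex_lock(&f->lock);
        f->value = value;
        f->ready = 1;
        pthread_cond_broadcast(&f->resolved);
        pthread_mutex_unlock(&f->lock);
    }

    /* Called when the result is actually needed: returns at once if the reply
       has already arrived, otherwise blocks "by necessity" until it does. */
    void *future_value(Future *f)
    {
        pthread_mutex_lock(&f->lock);
        while (!f->ready)
            pthread_cond_wait(&f->resolved, &f->lock);
        void *value = f->value;
        pthread_mutex_unlock(&f->lock);
        return value;
    }

Code that never calls future_value before the reply arrives runs with the full overlap of asynchronous messages, while code that does gets exactly the synchronous semantics, which is the combination of advantages described in Section 2.3.4.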
The following generation of computers used the so-called "Von Neumann architecture" (still used by current computers), where the programming was done simply by changing numbers in the machine's memory. The configuration of the hardware remained fixed independently of the problem to be solved. Unlike traditional computers, a reconfigurable machine has its hardware adjusted to improve performance for a specific program, as proposed by Gerald Estrin in 1960 (Estrin, 2002). At another moment the hardware might be adjusted in a different way to solve a second problem. Where there is no limit on cost or energy consumption it is possible to use large-scale reconfigurable computers to solve problems typical of supercomputers. For embedded systems, where these limits are quite strict, reconfigurable computers are an interesting alternative to traditional processors and digital signal processors (DSPs).

The invention of reconfigurable integrated circuits, especially Field Programmable Gate Arrays (FPGAs), allowed the creation of a wide variety of reconfigurable computers, as shown in Compton and Hauck (2002). The main advantage of this kind of machine is the use of components that are available commercially at a competitive cost. This did not eliminate research into slightly different architectures with building blocks that are larger than the tiny lookup tables (LUTs) used by FPGAs. In Hartenstein (2001) the alternatives described have complete arithmetic and logic units as the building block. The ADRES architecture (Bouwens et al., 2007) is a more structured variation of this idea.

2.4.1 Dynamic and Partial Reconfiguration

One of the most interesting features of the XC6000 family from Xilinx was dynamic and partial reconfiguration. This was originally developed as an academic project and later commercialized by Algotronix before being bought by Xilinx. The configuration bits could be addressed as normal memory by some external device, such as a microprocessor. These bits could be changed even during the normal operation of the FPGA and only the part being addressed would be affected by the change. When the XC6000 was retired, many of the most interesting projects in evolvable hardware and reconfigurable computing became more complicated. In systems with multiple FPGAs, such as that of Teixeira (2002), it is possible to change the configuration of one of the FPGAs without disturbing the others.

The introduction of the Virtex family by Xilinx brought back a more restricted form of partial reconfiguration which has been available in all of that company's components since then. The configuration file was no longer a big blob of bits but rather a sequence of independent frames. Each frame starts with an address to indicate which part of the FPGA it is meant to configure, so the frames can be sent out of order, as described in Xilinx (2007). For certain components, like the Virtex II and newer, Xilinx guarantees that a partial configuration which follows certain compatibility rules with respect to the previous configuration can be loaded into the FPGA without disturbing the operation of the rest of the chip. This allows dynamic reconfiguration, where part of the FPGA continues working normally while a different part is being changed. In components in which the ICAP hardware block is present, external hardware is no longer needed since the FPGA itself can generate the bits for a frame.

In the Virtex, Virtex II and early Spartan generations a frame would configure a whole column.
Since the input and output blocks surround the chip, each column has two such blocks (one at the top and another at the bottom) in addition to the logic blocks. Other columns include RAM blocks. Starting with the Virtex 4, each column only contains a single type of block and, additionally, a frame configures a fixed fraction of a column. The frame address indicates a column, the top or bottom half, and then a unique region of 16 logic blocks within that half. In Sedcole (2006) the FPGA's capability of actually calculating the bits to configure part of a column is used to allow the reconfiguration of even finer grained regions (shorter than 16 logic blocks). The exclusive-or function shows the difference between the old configuration and the desired one, and it avoids changing bits for areas that are not meant to be affected.

For partial reconfiguration to work, it is necessary that the communication between the block that is being changed and the rest of the system be standardized. A very limited set of "bus macros" must be used for this and the software tools must be aware of how things connect. The designer needs to have more detailed control over the operation of these tools compared to normal projects.

2.5 Summary

Both language-specific processors and parallelism were hot topics in the 1980s but were mostly ignored in the following decade. They have become interesting once more in the context of modern computing, which is why most references in this area are either very old or very new. Adaptive compilation and reconfigurable computing became significant in the 1990s and have become increasingly relevant since then. Few projects combined two or more of these topics, which made it possible to present each history completely separately.

Chapter 3: Implementation

In this chapter, the design choices for this project are presented and justified in terms of the theory. Each topic will be discussed separately even though there is a lot of synergy between them; the choices for reconfiguration, for example, might be very dependent on the choices for parallelism. Often the choices in a given area are only available at all because of a choice made in a different area. The general philosophy for the project was to reuse as much existing technology as possible, not only to save time but also to align the project with the goals of existing communities. The need to push beyond the state of the art to achieve this project's goals, however, demanded the invention of several new techniques.

3.1 Language-Specific Processor: SiliconSqueak

SiliconSqueak is a processor architecture created specifically to run programs written in the Squeak Smalltalk language. Squeak is implemented as a bytecode-based virtual machine. While this VM is still evolving, as described in Section 3.2, the basic design has remained unchanged since 1976, which makes it an attractive target for a hardware implementation.

Figure 3.1: Squeak's implementation (diagram relating the "Back to the Future" simulated image, the interpreter, object memory and primitives written in Slang, their translation via gcc to C for hardware plus an operating system, and the same interpreter, object memory and primitives running on SiliconSqueak with user images on top of a system image).

Figure 3.1 shows that the Squeak VM can be divided into roughly three parts: the bytecode interpreter, the primitives and the memory manager. The source for these components is written in Slang, a subset of Smalltalk-80 that can easily be translated into C or a similar language.
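As a purely illustrative sketch of what this subset looks like (the method below is invented for this text and is not part of the real VM sources), a Slang method restricts itself to operations that map directly onto C, such as SmallInteger arithmetic; the <inline: true> pragma and the BytesPerWord constant follow the conventions that can be seen later in Figure 5.1.

alignedSize: byteCount
	"Round byteCount up to the next multiple of BytesPerWord. Smalltalk binary
	operators evaluate strictly left to right, which also translates directly to C."
	<inline: true>
	^ byteCount + BytesPerWord - 1 // BytesPerWord * BytesPerWord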
Writing the VM in Slang allows the code to be developed and debugged within Squeak itself, so that the translation only has to happen once the code is stable. So the code in the VM exists in four different forms: hand-written Slang, bytecodes translated from Slang (used for debugging and development), C translated from Slang, and machine language for some processor (translated from the C). This means that if an architecture has some trick for running bytecodes at a very high speed then the machine language code is not needed, as the bytecodes would yield the same result. The primitives and even the memory manager could be run as bytecodes, but not the bytecode interpreter itself, since that would lead to infinite recursion.

In theory the bytecodes don't have to be Turing complete since they can rely on the primitives to implement things like integer arithmetic. That is the case in Self and Little Smalltalk, for example. Fortunately, Squeak bytecode has a set of instructions that are redundant with the send bytecodes but are meant to save space. So instead of compiling a normal send bytecode with access to a literal #+ as the message name, there is an arithmetic plus bytecode that doesn't waste space in the method with such a common selector. So the hardware can know that an addition is being requested. The full semantics is that if both arguments are SmallIntegers then the hardware can add them directly (taking into account the presence of tag bits); if not, then it behaves exactly like a send bytecode. When this instruction is generated from Slang it always operates on two SmallIntegers, so the bytecodes can be considered to be Turing complete.

Figure 3.2: SiliconSqueak pipeline stages (fetch bytecode, fetch microcode, calculate addresses, fetch operands, execute and write back), the four level 1 caches (microcode, bytecode, data and stack), the microcode RAM, the method instruction pointer, the stack and frame pointers, the receiver, and the optional ALU Matrix coprocessor.

The bytecodes for the memory management and the primitives can run on the interpreter supported by the hardware shown in Figure 3.2. The bytecodes can be divided into five categories: push, pop, send/return, jump, and arithmetic (as mentioned above). SiliconSqueak has a lower level machine code that is called microcode (see Appendix A). The push, pop and jump bytecodes always correspond to just one or two microcode instructions. So do the arithmetic bytecodes when compiled from Slang, though in the general case they have to handle everything the send bytecodes do. The functionality of the send and return bytecodes is very complex, but when compiled from Slang they can be simple call/return instructions or even eliminated by using inlining. This is what breaks the infinite recursion: as the bytecode interpreter is known to be Slang, the send/return bytecodes used in their own implementation can be inlined away.

The hardware that supports the bytecode interpreter is shown in Figure 3.2. It is a pipeline with five and a half stages that operates with four cache memories. In a sequence of bytecodes where each one corresponds to a single microcode instruction, a new bytecode can start executing on every clock cycle.

3.1.1 Level 1 caches and virtual level 2 caches

The four level 1 caches allow the system to distinguish between different kinds of memory access and to apply different optimizations to each one.
The level 2 caches don't actually exist, but are just special memory regions managed by the software. There are three level 2 virtual caches: objects, microcode and stack. The first one is shared between the bytecode level 1 cache and the data level 1 cache. When there is a miss in any of the level 1 caches, the hardware will attempt to load the missing data from a block of memory in the virtual level 2 cache. That might also miss, in which case a special software handler is called to update the virtual level 2 cache, and the load into the level 1 cache is retried. The performance of this software is not critical, since it is only invoked on level 2 misses, and it effectively does the job of a reconfigurable memory management unit. It defines whether object tables or direct pointers are used, as well as the details of the garbage collector.

3.1.2 Microcode cache

At least 4KB of the microcode cache is protected from being flushed automatically and it holds the code fragments associated with each bytecode. The starting address for the microcode interpretation of a given bytecode is simply the value of that bytecode multiplied by 16. This potentially wastes cache space, but it avoids the cost of looking up the start address. The rest of the cache is organized as three sets. A level 2 microcode cache miss means that a PIC entry or the native code for some method is missing. The compiler is invoked to deal with this situation.

3.1.3 Bytecode and data caches

Since Squeak methods are just normal objects, the bytecode and data caches share a single level 2 cache. It is the software's job to maintain coherency between them by flushing the bytecode cache whenever a method is loaded into the data cache.

3.1.4 Stack cache

The stack cache is organized as a doubly linked list of small, fixed-size blocks. The lists from different threads can be interleaved both at level 1 and level 2, so there is no need to flush either cache when switching between threads, as there is in processor architectures with register windows.

3.1.5 Virtual registers

The microcode instruction set is very similar to a RISC one and appears to apply an operation to the values of two registers and store the result in a third register. The registers are organized as four groups of 32 registers each in the case of one operand and the destination, and eight groups for the second operand. But most of these groups are not implemented as physical registers at all; instead they are specially mapped regions of the caches. This is reflected in the pipeline by the calculate addresses stage between the fetch microcode and fetch operands stages. So SiliconSqueak is effectively a memory-to-memory architecture rather than a load/store one, but having this closely coupled with the caches combines the advantages of both kinds of architectures.

3.1.6 Fetch

Each microcode instruction not only selects three "registers" and an operation, but also explicitly defines how the next instruction is to be fetched. The next option is the most common and allows sequences of microcode. The last instruction of a sequence can use the fetch8 option, which warns the bytecode fetch stage to calculate the starting address of the next microcode sequence. This effectively forms a zero-overhead inner loop for the bytecode interpreter.

3.1.7 PIC

Two of the fetch options use the result of the instruction to indicate the type of a message receiver.
They combine this with the address of the current instruction (or a specified one) to probe a unique cache entry. This implements an infinitely growable PIC with a constant access time, which is something that can't be matched by software implementations.

3.2 Adaptive Compilation: Cog and Sista

The advanced adaptive compilation technology developed for Self had an important limitation: the effort to port it to a new processor was considerable. The project started using 68000-based Sun workstations but moved to Sparc-based ones as soon as these became popular. It became difficult to keep supporting the 68000 processor and soon Self was Sparc only. A port to the PowerPC was done when Macintosh laptops became practical, while a port to the x86 languished for nearly a decade before Apple's switch to that family gave the Mac laptop users the incentive to finish it.

Squeak was created to be portable through its technology of translating the Slang sources into C and using a simple interpreter. With nothing in the code being processor specific, it was quickly ported to a number of different processors. Several projects have tried to bring the advantages of the Self technology to Squeak without losing this portability.

Ian Piumarta, who had originally ported Squeak from the Macintosh to various Unix workstations and Linux on the PC, created a simple dynamic compiler that used Forth-like threaded code as its target. His second effort was more Self-like and targeted the PowerPC and x86 architectures, but it never became a standard part of Squeak.

Another project was Exupery, by Bryce Kampjes. It used Smalltalk code to translate bytecodes into x86 machine language, and had some modifications in the VM to allow this code to be invoked. The goal was to explore interesting compilation techniques such as continuation-passing representations, but the project could be expanded into an adaptive compiler if type feedback were added to it.

Sista (Speculative Inlining Smalltalk Architecture) is a project to rewrite code at the bytecode level to implement a series of optimizations. The idea is that if the underlying system compiles the bytecodes instead of interpreting them, the combination of the two technologies can have results competitive with optimizing at the native code level at a fraction of the complexity.

Cog, by Eliot Miranda, is the current dynamic compilation project for Squeak and has been adopted by Pharo, Squeak and NewSpeak. Its development followed a series of steps so that usable results could be available at the end of each step:

Closure compiler: Smalltalk-80 inherited from Smalltalk-76 an odd implementation of blocks that made certain kinds of programming awkward. Since other Smalltalks had dumped this trick in favor of proper closures, this step actually made Squeak more compatible rather than less.

Stack Interpreter: Full Context objects for every message send and return are not only costly in themselves but also stress the garbage collector. This version uses the processor's native stack whenever possible.

Stack Compiler: The interpreter is replaced with a simple compiler that translates each bytecode individually to x86 code.
As a result, the stack semantics is fully preserved.

Register Compiler: By dealing with groups of bytecodes at once it is possible to generate code that makes better use of the registers.

Spur: This is a redesign of the memory management part of the Squeak VM to take advantage of 64-bit processors.

There are many more steps planned, as well as related projects such as the port of Cog to the ARM processor. One of the steps is the integration of Cog and Sista to move from a dynamic compiler to an adaptive one. Cog already includes all the necessary hooks in the form of PICs and performance counters. The initial adaptive compiler for this project is a port of Cog to SiliconSqueak similar to the ARM port. The change from one stack to two is very simple, but the use of the PIC fetch option affects a lot of code.

3.3 Parallelism: ALU Matrix coprocessor

One level of parallelism for SiliconSqueak is the use of multiple cores in a single chip. An expansion of this kind of parallelism can be obtained by connecting two or more such chips using high-speed communication channels such as those shown in Figure 3.4. The message passing hardware allows remote memory and cores to be easily accessed.

For intensive numerical algorithms there is a different level of parallelism in the form of a coprocessor tightly coupled to a single SiliconSqueak core. For the initial implementation in this project, an ALU Matrix coprocessor with 64 ALUs was selected. This is shown in Figure 3.3. Each ALU can execute instructions that do one data transfer and one operation with two registers in a single clock cycle. There are 64 types of operations, though most are variations on addition and subtraction (signed or unsigned, normal or saturating, absolute differences). The 8-bit-wide ALUs have carry bits to allow for concatenation into wider words, up to 512 bits wide, either horizontally or vertically. The selection of 8 bits as the granularity of the ALUs was based on the balance between flexibility and the amount of RAM needed for the instructions. Such RAM was also the motivation for the choice of 64 ALUs, since this configuration needs roughly the same number of Block RAMs as two simpler SiliconSqueak cores (which have 20KB of cache each).

Figure 3.3: Organization of the ALU Matrix coprocessor (each 8-bit ALU has 32 registers, a multiplier with separate high and low result halves, its own instruction RAM, rA and rB inputs from the data path, and connections to its Up, Down, Left and Right neighbors).

3.4 Reconfiguration: runtime reload

Figure 3.4: Switching between different FPGA configurations (each configuration includes an HDMI and a USB interface, an SDRAM controller and a router/switch with FIFOs to other nodes, plus a reconfiguration block; one configuration is devoted to coarse grain processing and the other to fine grain processing).

With an initial configuration as in the left of Figure 3.4, the newly generated hardware would replace some of the existing cores, both in terms of FPGA area and as an element in the ring networks. If that particular processor was exclusively executing code that will now be done by hardware, there will be a gain in performance. If, on the other hand, it was also executing unrelated code that must now be moved to the other cores, then there will be a performance loss no matter how much faster the hardware is than the optimized code.
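The break-even conditions for this decision are formalized in Equations 3.1 to 3.4 just below. As a quick check of the arithmetic they can be evaluated in Squeak itself; in the sketch that follows, n, x and the 1.4 hysteresis factor are the quantities defined in the text, while everything else (variable names, printed wording) is purely illustrative.

| n x alphaThreshold betaThreshold |
n := 10.	"scheduling window, in units of the reconfiguration time"
x := 6.		"speedup of code that can use the ALU Matrix"
alphaThreshold := (n / 1.4 - (3 * n) - 1) / ((n / x) - (3 * n)).
betaThreshold := (n / 1.4 - (n / 3) - 1) / ((n * x) - (n / 3)).
Transcript
	show: 'replace three simple cores when alpha > ',
		(alphaThreshold * 100) rounded printString, '%'; cr;
	show: 'return to three simple cores when beta < ',
		(betaThreshold * 100) rounded printString, '%'; cr.
"With N = 10 and X = 6 this prints 84% and 5%, matching the worked examples in the text."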
The scheduler should group related code under heavy loads to make it simpler to detect the situation where a software block has one or more cores dedicated to it and so is a candidate for a hardware replacement. Given that an FPGA that is being reconfigured does not execute anything, the scheduler should deal with time frames N times longer than this inactive period. Besides the reconfiguration time itself, there is the time needed to save all current state to external memory and then the time to restore it (adapting it to the new configuration).

Since a single core with an ALU Matrix takes up the same FPGA resources as three simple cores, any code which doesn't make use of the coprocessor will run roughly three times slower. Any code that does take advantage of the ALU Matrix (code generated by the new compiler), on the other hand, will run X times faster. This is illustrated in Figure 3.5, where the high level code always takes the same amount of time (either serially on a single core or in parallel with multiple cores) while the numeric code can use the coprocessor to take far less time.

Figure 3.5: Time to execute code on different FPGA configurations (three SiliconSqueak cores versus one SiliconSqueak core with an ALU Matrix, for a mix of low level and high level code).

N > 1.4 × (1 + ((1 − α) × 3N) + (α × N/X))    (3.1)

α > (N/1.4 − 3N − 1) / (N/X − 3N)    (3.2)

Here α is the percentage of time that code that could use the coprocessor takes on the configuration with three simple cores. To avoid needlessly switching back and forth between configurations, a factor of 1.4 adds some hysteresis to the system. Equation 3.1 shows under what conditions it is profitable to replace three simple cores with a single one having a coprocessor. Equation 3.2 solves for α given X (notice that N/X − 3N < 0 given that X > 1). So if X = 6 (code becomes six times faster with the ALU Matrix) and N = 10 (the scheduling time frame is ten times the reconfiguration time) then α > 84%.

N > 1.4 × (1 + ((1 − β) × N/3) + (β × N × X))    (3.3)

β < (N/1.4 − N/3 − 1) / (N × X − N/3)    (3.4)

In Equation 3.3 we have the condition under which it is a good idea to replace a SiliconSqueak core including an ALU Matrix with three simple cores. Here β is the percentage of the time in which code that uses the ALU Matrix executes in the original configuration (this is different from, but related to, α). Given the same X = 6 and N = 10, then β < 5%.

3.5 Summary

This chapter described the implementation details of SiliconSqueak, a unique processor architecture created for this project to optimize the execution of programs written in the Squeak Smalltalk language. It explained how the Cog dynamic compiler can be combined with the Sista bytecode-to-bytecode optimizer to implement an adaptive compilation system and how they can make use of SiliconSqueak's features. The option to use multiple SiliconSqueak cores, or fewer cores but with the ALU Matrix coprocessor, allows the parallelism in the hardware to match the parallelism of the application, and by taking advantage of reconfigurable hardware this match can be maintained even as the application varies.

Chapter 4: Experimental Results

The materials and methods used for the experiments in this project are described here.

4.1 Language-Specific Processors

The basic cache size of 4KB was selected for SiliconSqueak as a good tradeoff between the typical sizes of Block RAMs in a wide range of FPGAs and the logic needed to implement the core's functionality.
The microcode cache needs an extra 4KB to hold the bytecode interpreter. All of the caches can be increased for better performance when more memory is available relative to logic.

4.2 Adaptive Compilation

The Cog compiler handles send bytecodes by generating inline caches in x86 code. The listing in Figure 4.1 shows what the PIC looks like after sending messages to objects of six different classes. Each new type encountered causes the code to be rewritten. In contrast, SiliconSqueak implements PICs as shown in Figure 4.2. Actually, only the first two lines are needed to trigger the PIC mechanism in the microcode cache. The rest is a subroutine to extract the class for a given object pointer. Similar code is also needed for the Cog case but is not shown in Figure 4.1 to save space. Some of the code in the Cog version is also needed for SiliconSqueak, but it is spread out through different entries in the virtual level 2 microcode cache. To match the example in the Cog version, Figure 4.3 shows the cache lines that would be used when the same six types are encountered during execution. Some memory is wasted because the cache lines are only partially full.

4.3 Parallelism

To test the effect of fine-grained parallelism, the same benchmark was written in pure SiliconSqueak assembly language and in a mix of SiliconSqueak and ALU Matrix code. The benchmark selected is the pixel region comparison using the sum of absolute differences (SAD), since that was implemented in hardware in de Assumpção Júnior (2009). The ALU Matrix can process pixels at 8 times the rate of the basic core, with access to memory being the limiting factor.

4.4 Reconfiguration

The two development boards used for the experiments in this project were selected because they allow reconfiguration to be triggered from within the FPGA itself. The Xilinx ML401 uses a SystemACE chip to read configuration bitstreams from an attached Compact Flash memory card formatted as a FAT32 file system with directories organized into up to eight different projects. Simply writing a number from 0 to 7 into a special register in the SystemACE causes the whole Virtex 4 FPGA to be reconfigured from a bitstream found in the selected directory.

The other board is a Parallella, created to demonstrate the use of the Epiphany 16-core floating point chip. Attached to that is a Xilinx Zynq Z7020 chip with two ARM cores and FPGA resources based on the Artix 7 family. When running Linux on the ARM processors, it is possible to reconfigure the FPGA part by simply copying the bitstream file to /dev/xdevcfg. This can be a full configuration, but the same mechanism can also handle partial reconfigurations. The tricky part is keeping the functionality expected by Linux intact while changing other parts.

The time that SystemACE takes to reconfigure the whole Virtex 4 LX25 is just under four seconds, which is not surprising due to several slow connections between the memory card and the FPGA. Copying a bitstream from the SD memory card on the Parallella board using the Linux "cat" command can be done in just 85ms, which is an improvement of 47 times, not even taking into account the difference in file sizes between the two FPGAs.

4.5 Summary

Initial testing validated the design choices made for this project.

5F98 nArgs: 0 type: 4 blksiz: A0 selctr: 3F199C=#basicNew cPICNumCases: 6
00005fb0: xorl %ecx, %ecx
00005fb2: call .+0xffffa699 (0x00000650=cePICAbort)
00005fb7: nop
entry:
00005fb8: movl %edx, %eax
00005fba: andl $0x00000001, %eax
00005fbd: jnz .+0x00000010 (0x00005fcf=basicNew@37)
00005fbf: movl %ds:(%edx), %eax
00005fc1: shrl $0x0a, %eax
00005fc4: andl $0x0000007c, %eax
00005fc7: jnz .+0x00000006 (0x00005fcf=basicNew@37)
00005fc9: movl %ds:0xfffffffc(%edx), %eax
00005fcc: andl $0xfffffffc, %eax
00005fcf: cmpl %ecx, %eax
00005fd1: jnz .+0x0000000a (0x00005fdd=basicNew@45)
00005fd3: movl $0x00000000, %ebx
00005fd8: jmp .+0xfffff9b6 (0x00005993=basicNew@3B)
ClosedPICCase0:
00005fdd: cmpl $0x00139164=SharedQueue class, %eax
00005fe2: movl $0x00000000, %ebx
00005fe7: jz .+0xfffff9a6 (0x00005993=basicNew@3B)
ClosedPICCase1:
00005fed: cmpl $0x0013f76c=Delay class, %eax
00005ff2: movl $0x00000000, %ebx
00005ff7: jz .+0xfffff996 (0x00005993=basicNew@3B)
ClosedPICCase2:
00005ffd: cmpl $0x0013da5c=OrderedCollection class, %eax
00006002: movl $0x00000000, %ebx
00006007: jz .+0xfffff986 (0x00005993=basicNew@3B)
ClosedPICCase3:
0000600d: cmpl $0x0013dd94=Set class, %eax
00006012: movl $0x00000000, %ebx
00006017: jz .+0xfffff976 (0x00005993=basicNew@3B)
ClosedPICCase4:
0000601d: cmpl $0x001404b8=UnixFileDirectory class, %eax
00006022: movl $0x00000000, %ebx
00006027: jz .+0xfffff966 (0x00005993=basicNew@3B)
ClosedPICCase5:
0000602d: movl $0x00005f98=basicNew@0, %ecx
00006032: jmp .+0xffffa681 (0x000006b8=ceCPICMissTrampoline)

Figure 4.1: Cog generated code for a PIC with 6 different types
0000440c: x27 := t5 -> call 00005fb0       ; save the receiver in x27 and get class
00004414: x28 := x28 -> PIC                ; enter PIC mode for class in register x28

00005fb0: x31 := x27 & #1 -> skipOnZero    ; check for integer tag
00005fb4: x28 := SmallIntegerOop -> return ; x28 will hold the receiver class
00005fb8: s4 := x27                        ; set up stream unit 0 to read from the receiver
00005fbc: s0 := #0                         ; the basic header word
00005fc0: x28 := s7 >> #10                 ; the compact class index
00005fc4: s0 := #-4                        ; the extended header word
00005fc4: x28 := x28 & #h7C -> skipOnOne   ; mask compact class index
00005fc8: x28 := s7                        ; non compact class
00005fcc: x28 := x28 & #-4 -> return       ; mask class oop

Figure 4.2: PIC for SiliconSqueak with any number of different types

0013de38:00004414: x28 := x28 -> call 00005fdd ; code for Dictionary>>basicNew
00139164:00004414: x28 := x28 -> call 00005993 ; code for SharedQueue>>basicNew
0013f76c:00004414: x28 := x28 -> call 00005993 ; code for Delay>>basicNew
0013da5c:00004414: x28 := x28 -> call 00005993 ; code for OrderedCollection>>basicNew
0013dd94:00004414: x28 := x28 -> call 00005993 ; code for Set>>basicNew
001404b8:00004414: x28 := x28 -> call 00005993 ; code for UnixFileDirectory>>basicNew

Figure 4.3: PIC causes these cache entries for six types

Chapter 5: Conclusion

Some embedded applications are still simple enough that the traditional development method of cross compilation of static computer languages and debugging by inserting statements that print to a serial port will get the job done. That is especially true when the program is just a variation on something that has been implemented many times before. The decreasing cost of computational resources, however, makes the alternative of using complete operating systems and dynamic object-oriented programming languages very attractive. At the prototype phase this can be achieved by simply including a normal PC in a robot (or having it close by, talking wirelessly to its sensors and actuators) or putting a whole mini datacenter in the trunk of an autonomous vehicle, for example. A more specialized solution with lower costs and energy use is needed for the final product. The project described in this text addresses this issue with a new object-oriented architecture that makes good use of adaptive compilation, parallelism and reconfigurable hardware.

The processor architecture is called SiliconSqueak since it is optimized for the Squeak implementation of the Smalltalk-80 programming language, but it is flexible enough to support well any language implemented with the popular bytecode virtual machine technology. It has features to support both simple interpretation (to reduce energy costs for infrequent code) and adaptive compilation (for high performance of frequent code, or "hot spots"). Multiple SiliconSqueak cores can fit into a single chip, even a reconfigurable one (an FPGA), which allows coarse-grained parallelism for a better performance / energy use ratio. This is scalable to very large systems, as multiple chips, each with its own local memory, can be connected with very high-speed communication channels that match the message passing paradigm on which Smalltalk is based.
Each SiliconSqueak core can support a coprocessor, called the ALU Matrix, which uses fine-grained parallelism to speed up the execution of numeric code kernels. Ideally, the code for the coprocessor can be generated as needed by the adaptive compiler. But given Squeak's use of hand-coded primitives and "plug-ins", the coprocessor can also be exploited manually if needed. A heterogeneous mix of simple SiliconSqueak cores and ones with coprocessors can efficiently execute code which is a mix of high level Smalltalk code and primitives. Given that the same FPGA area that can implement a single SiliconSqueak core and ALU Matrix could be used instead for several (three, for example) simple cores, the ideal mix will vary depending on the nature of the code being executed and whether it is mostly interpreted or mostly compiled. Since this variation happens at runtime, dynamic reconfiguration of the FPGA to change this mix can optimize the energy efficiency of the system.

5.1 Future Work

The work described in this text is only the latest step in a project that was started by the author in 1982 (known as "the Merlin Project" from 1984 to 2008). That, in turn, is closely related to Alan Kay's ongoing "Dynabook Project", which was started in 1968. To achieve the goals of the original project there are many more steps planned for the next few years. In addition, there are projects by other groups that could be greatly enhanced by the results obtained here and which will be available to all. The most important planned steps are described here.

5.1.1 Experiments

Smalltalk was originally an internal project at Xerox PARC and due to company policy it was secretive. For the 1980 version a selected group of external companies was involved and a set of benchmarks was created to compare their results, which were published in what became known as the "green book" (Krasner, 1983). When combined with the results published for the Self 2 dynamic compiler (Chambers, 1992) and the Self 4 adaptive compiler (Hölzle, 1994), the Smalltalk family of languages is one of the best studied in terms of the performance of its implementations.

The benchmarks in the green book were classified as "micro benchmarks", which tested one simple operation, and "macro benchmarks", which were realistic applications. In this project so far only micro benchmarks have been used, but it is known that these are not necessarily reflected in the results of macro benchmarks, and it is the latter that more closely match user perceptions of performance, so the system must be subjected to more experiments in the near future, as soon as support for complete applications is finished.

SiliconSqueak is not a trivial architecture and includes detailed support for some Smalltalk features. The inclusion of such support was based on experience with other projects, but experiments to objectively evaluate the costs and advantages of each feature are planned. Such experiments involve small redesigns of the hardware and corresponding changes in all compilation systems. A typical example is the support for stack operations. With the stack support hardware, the following microcode instruction is enough to implement the functionality of the pushTemporaryVariable bytecode for the case of argument 3.

dPush := t3 -> fetch8    ; microcode for pushTemporaryVariable 3

Such support is not reflected in the Slang sources. As an example, the high level code that implements the pushTemporaryVariable bytecode is shown in Figure 5.1.
This will generate 16 different pieces of microcode due to the expandCases pragma, and each case will be a single piece of code due to inlining. With special hardware support, there is a pattern-matching pass in the compilation process that will detect common sequences and replace them with the simpler equivalent.

StackInterpreter methods for internal interpreter access

temporary: offset in: theFP
	"See StackInterpreter class >> initializeFrameIndices"
	<inline: true>
	^ stackPages longAt: theFP + (offset * BytesPerWord)

internalPush: object
	"In the SiSq StackInterpreter stacks grow up."
	stackPages longAtPointer: (localSP := localSP + BytesPerWord) put: object

StackInterpreter methods for stack bytecodes

pushTemporaryVariable: temporaryIndex
	self internalPush: (self temporary: temporaryIndex in: localFP)

pushTemporaryVariableBytecode
	<expandCases>
	self fetchNextBytecode.
	"this bytecode will be expanded so that refs to currentBytecode below will be constant"
	self pushTemporaryVariable: (currentBytecode bitAnd: 16rF)

Figure 5.1: Slang code for pushTemporaryVariableBytecode

If SiliconSqueak had no special hardware support for stack operations (but did have hardware support for memory reads and writes similar to most RISC processors), a microcode version of the above for the specific case where "currentBytecode bitAnd: 16rF" is 3 would be something like Figure 5.2, using x27 to x29 as scratch registers.

; microcode for pushTemporaryVariable 3
def localFP      x27
def localSP      x28
def stackPages   x29     ; these are set elsewhere
def BytesPerWord 4       ; defined globally

x30 := localFP + 12                        ; offset * BytesPerWord = 3 * 4
x30 := stackPages memRead x30
localSP := localSP + BytesPerWord
memWrite x30 stackPages localSP -> fetch8  ; internalPush

Figure 5.2: Microcode for pushTemporaryVariable 3 bytecode

At first glance, it would seem to save three instructions and two memory accesses. The latter are an illusion because the caches are accessed behind the scenes by the SiliconSqueak pipeline. The saving in clock cycles for such a frequent operation must be balanced against any loss in clock frequency due to the extra complexity of the support hardware. Though very complex to set up, an experiment such as this must be done for each SiliconSqueak special feature.

A third set of experiments involves comparisons with similar systems. Since Squeak can run just fine on Sparc processors, for example, an interesting experiment would implement one or more Leon 3 cores (an open source Sparc design) on the same FPGA boards that SiliconSqueak is being tested on. This allows the metric of performance per FPGA area to more realistically demonstrate the value of the innovations introduced in this project.

5.1.2 Smalltalk Zero

SiliconSqueak is flexible enough to support not only the Squeak Virtual Machine (VM) but also other languages that use their own, similar VMs. It is even possible to switch between VMs at runtime. Two popular languages used with children, Scratch 1.4 from the M.I.T. Media Lab and Etoys, are implemented as a layer on top of Squeak.
In addition to this, the same Squeak VM is used both by languages derived from Squeak (such as Pharo or Cuis) and by entirely new languages, such as NewSpeak. Other existing languages, like Self, could be implemented to use the Squeak VM. Given that all these different systems can run on SiliconSqueak at the same time and even interact with each other, there is a plan to develop a simple and powerful language on top of the Squeak VM, called Smalltalk Zero for now, to take greater advantage of SiliconSqueak's parallelism than existing languages do. It would complement them instead of necessarily being a replacement.

5.1.3 Multi-level Tracing and Partial Evaluation

The use of Cog (dynamic compiler) and Sista (bytecode-to-bytecode optimizer), an existing framework for adaptive compilation in the Squeak VM, leverages many years of engineering effort. But it also limits how much experimentation can be done. In Marr et al. (2014) two modern alternatives to hand-crafted compiling VMs are evaluated. RPython, developed as part of the PyPy project, allows a VM to be written as an interpreter (normally for bytecodes) and then uses simple annotations to drive a two-level tracer which can automatically generate compiled code. Interpreters are far simpler to write and debug than compilers, so this lowers the development costs for new VMs, freeing time for the exploration of alternatives. Truffle is a system that uses partial evaluation (Futamura, 1999) of an interpreter based on Abstract Syntax Trees (ASTs) to generate code. A future research project will investigate the possibility of extending the two-level tracing from PyPy into a multi-level tracer that could be combined with partial evaluation, as in Truffle, to create an alternative to Cog and Sista for implementing VMs on SiliconSqueak.

5.1.4 Wafer Scale for Massive Parallelism

The relatively small size of a SiliconSqueak core allows even low-end FPGAs to be used for multi-core systems. If implemented as a custom chip (Application Specific Integrated Circuit, ASIC) using a modern fabrication process, it would be possible to fit a large number of cores on a die with a commercially viable size. Machines with more cores can be built by connecting a number of such chips to each other using high-speed serial communication channels. For such applications, however, the overhead of splitting a wafer into separate dies, testing the dies and discarding the defective ones, encapsulating the dies and then soldering them on a board so that they are once again connected to each other is very wasteful. An alternative is to build systems from whole wafers. After significant research progress in this area in the 1980s, this direction was abandoned due to commercial reasons which became less relevant later on. Perfecting this technology would allow the construction of research machines with features similar to what commercial ones will have a few years later, so that software development can lead, rather than follow, hardware development.

5.1.5 Native Compilation to Hardware

Besides the very coarse-grained hardware reconfiguration of switching between a SiliconSqueak core with an ALU Matrix coprocessor and a number of simple SiliconSqueak cores, there is a level of reconfiguration where the compiler can generate different code to be loaded in the ALU Matrix. This extends adaptive compilation one step beyond what is normally used. An additional and even more extreme step would be the compilation of software objects into FPGA configuration bits.
This would require information that the FPGA vendors are not interested in publishing, though there are ways around that and in an ASIC implementation (such as the Wafer Scale mentioned above) it would be possible to include FPGA-like areas with a known 5.2. DISCUSSION AND LIMITATIONS 55 design. As long as the new hardware objects have the same message passing interface to the rest of the system as the software objects they replace, the only effect would be an increase in performance. Closely related to the issue of secret configuration bitstreams is the limitation of cross development. Since the bit files are a black box generated only by the FPGA vendor’s own tools, the fact that these tools run on normal PCs and not the system with the FPGA does not really matter. But if the information needed to generate new configuration at runtime is available, then it is desirable to do this generation natively instead. Given that adaptive compilation was the solution to the growing pauses in sophisticated dynamic compilers, native bitstream generation would be a real problem in terms of pauses if the same algorithms as in the vendor’s tools are used. Parallelism can be used to partially hide these pauses, but simpler incremental algorithms can also help. 5.1.6 Non Von Neumann Architectures The Squeak VM is a traditional stack based Von Neumann architecture. This limits the amount of parallelism which can be extracted from the code. As early as 1984, the Merlin project investigated alternative execution models such as Dataflow architectures. Though the current focus is on SiliconSqueak, there are plans to continue the previous work (mostly in 1990 and 1997) in this direction. 5.2 Discussion and Limitations This project optimizes the typical execution of dynamic object-oriented languages so that they can be used to implement embedded systems. The cost is an increase in the variation of execution time and this is a problem in hard real-time systems. With parallelism, however, enough of the cost of adaptive compilation can be hidden that soft real-time systems become practical. In the same way, the cost in increased variation due to dynamic reconfiguration can be hidden with partial reconfiguration. So even though the focus of this thesis was embedded systems that are not real-time, at least the soft real-time option can be achieved with some extra effort. Bibliography A SANOVIC , K.; C ATANZARO , B. C.; PATTERSON , D. A.; Y ELICK , K. A. The Landscape of Parallel Computing Research : A View from Berkeley. EECS Department University of California Berkeley Tech Rep UCBEECS2006183, volume 18, no. UCB/EECS-2006-183, pages 19, december, 2006. B OUWENS , F. J.; B EREKOVIC , M.; K ANSTEIN , A.; G AYDADJIEV, G. N. Exploration of the {ADRES} Coarse-Grained Reconfigurable Array. Architectural In: Proceedings of International Workshop on Applied Reconfigurable Computing, 2007, page 1–13. C AROMEL , D. Toward a method of object-oriented concurrent programming. Commun. ACM, New York, NY, USA: ACM, volume 36, no. 9, page 90–102, september, 1993. C HAMBERS , C. The Design and Implementation of the Self Compiler, an Optimizing Compiler for Object-Oriented Programming Languages. 1992. phd thesis, Stanford University, 1992. C HAMBERS , C.; U NGAR , D. Customization: Optimizing compiler technology for {Self}, a dynamically-typed object-oriented programming language. In: Proceedings of the SIG- PLAN’89 Conference on Programming Language Design and Implementation. 1989. page 146–160. C HAMBERS , C.; U NGAR , D. 
Iterative type analysis and extended message splitting: Optimizing dynamically-typed object-oriented programs. In: Proceedings of the SIGPLAN’90 57 58 BIBLIOGRAPHY Conference on Programming Language Design and Implementation. 1990. page 150–164. C HAMBERS , C.; U NGAR , D.; L EE , E. An efficient implementation of Self, a dynamically-typed object-oriented language based on prototypes. In: Proceedings of the 4th annual ACM conference on object-oriented programming, systems, languages and applications. 1989. page 49–70. C OMPTON , K.; H AUCK , S. Reconfigurable Computing: A Survey of Systems and Software. ACM Computing Surveys, volume 34, no. 2, page 171–210, june, 2002. DE A SSUMPÇÃO J ÚNIOR , J. M. O Sistema Orientado a Objetos Merlin em Máquinas Parale- las. In: V SBAC-PAD: Simpósio Brasileiro de Arquitetura de Computadores, Processamento de Alto Desempenho. 1993. page 304–312. DE A SSUMPÇÃO J ÚNIOR , J. M. Machines. Adaptive Compilation in the Merlin System for Parallel In: WHPC’94 Proceedings - IEEE/USP International Workshop on High Performance Computing. 1994. page 155–166. DE Projeto de um sistema de desvio de obstáculos para A SSUMPÇÃO J ÚNIOR , J. M. robôs móveis baseado em computação reconfigurável. 2009. master thesis, ICMC – Universidade de São Paulo, 2009. D EUTSCH , L. P.; S CHIFFMAN , A. M. Efficient implementation of the Smalltalk-80 system. In: Conference Record of the Eleventh Annual ACM Symposium on Principles of Programming Languages. 1984. page 279–302. D INIZ , P. C.; R INARD , M. C. Dynamic Feedback: An Effective Technique for Adaptive Computing. In: {SIGPLAN} Conference on Programming Language Design and Implementation, 1997, page 71–84. E STRIN , G. Reconfigurable Computer Origins : The UCLA Fixed-Plus-Variable ( F + V ). Ieee Annals Of The History Of Computing, volume 24, no. 4, page 3–9, october, 2002. BIBLIOGRAPHY F UTAMURA , Y. 59 Partial Evaluation of Computation Process - An Approach to a Compiler-Compiler. Higher-Order and Symbolic Computation, volume 2, no. 5, page 381–391, december, 1999. G ELERNTER , D.; B ERNSTEIN , A. J. Distributed communication via global buffer. Pro- ceedings of the first ACM SIGACT-SIGOPS symposium on Principles of distributed computing, page 10—-18, august, 1982. G OLDBERG , A.; K AY, A. Smalltalk-72 instruction manual. Technical report, Xerox PARC, 1976. G OLDBERG , A.; ROBSON , D. Smalltalk-80: the language and its implementation. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1983. H ARTENSTEIN , R. A decade of reconfigurable computing: a visionary retrospective. In: DATE ’01: Proceedings of the conference on Design, automation and test in Europe, Piscataway, NJ, USA: IEEE Press, 2001, page 642–649. H EWITT, C.; B ISHOP, P.; S TEIGER , R. A universal modular ACTOR formalism for artificial intelligence. In: Proceedings of the 3rd international joint conference on Artificial intelligence, Stanford, USA: Morgan Kaufmann Publishers Inc., 1973, page 235—-245. H OARE , C. A. R. Communicating sequential processes. Commun. ACM, volume 21, no. 8, page 666—-677, august, 1978. H ÖLZLE , U. Adaptive optimization for Self: reconciling high performance with ex- ploratory programming. 1994. phd thesis, Stanford University, 1994. H ÖLZLE , U.; C HAMBERS , C.; U NGAR , D. Optimizing dynamically-typed object-oriented programming languages with polymorphic inline caches. In: ECOOP’91 Conference Proceedings. 1991. page 21–38. I NGALLS , D.; K AEHLER , T.; M ALONEY, J.; WALLACE , S.; K AY, A. 
Back to the future: The story of Squeak, A practical Smalltalk written in itself. In: Proceedings OOPSLA ’97, ACM SIGPLAN Notices, ACM Press, 1997, page 318–326. 60 K RASNER , G. BIBLIOGRAPHY Smalltalk-80: bits of history, words of advice. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1983. L EWIS , D. M.; G ALLOWAY, D. R.; F RANCIS , R. J.; T HOMSON , B. W. Swamp: A Fast Processor for Smalltalk-80. In: Proceedings of the 1986 conference on Object-oriented programming systems, languages, and applications, 1986, page 131–139. M ARR , S. Supporting concurrency abstractions in high-level language virtual machines. 2013. phd thesis, Software Languages Lab, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium, 2013. M ARR , S.; PAPE , T.; D E M EUTER , W. Are we there yet? simple language implementation techniques for the 21st century. IEEE Software, volume 31, no. 5, page 60–67, September, 2014. M OORE , G. E. Cramming More Components onto Integrated Circuits. Electronics, vol- ume 38, no. 8, page 114–117, april, 1965. NARAYANAN , V. Issues in the Design of a Java Processor Architecture. 1998. phd thesis, Department of Computer Science and Engineering, University of Sourth Florida, 1998. N OAKES , M.; WALLACH , D. A.; DALLY, W. J. The J-Machine Multicomputer: An Architectural Evaluation. In: Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993, page 224–235. N OJIRI , T.; K AWASAKI , S.; S AKODA , K. Microprogrammable processor for object-oriented architecture. SIGARCH Comput. Archit. News, New York, NY, USA: ACM, volume 14, no. 2, page 74–81, may, 1986. P HILIP J. KOOPMAN , J. Stack computers: the new wave. Ellis Horwood Series in Computers and Their Applications. Ellis Horwood Ltd, 1989. S AMPLES , A. D.; U NGAR , D.; H ILFINGER , P. SOAR: Smalltalk Without Bytecodes. In: Proceedings of the 1986 conference on Object-oriented programming systems, languages, and applications, 1986, page 107–118. BIBLIOGRAPHY 61 S CHOEBERL , M. JOP: A Java Optimized Processor for Embedded Real-Time Systems. 2005. phd thesis, Vienna University of Technology, 2005. S EDCOLE , N. P. Reconfigurable Platform-Based Design in FPGAs for Video Image Processing. 2006. phd thesis, Department of Electrical and Electronic Engineering, Imperial College of Science, Technology and Medicine, University of London, 2006. T EIXEIRA , M. A. Técnicas de reconfigurabilidade dos FPGAs da família APEX 20K - Altera. 2002. master thesis, ICMC – Universidade de São Paulo, 2002. T HACKER , C.; M C C REIGHT, E.; L AMPSON , B.; S PROULL , R.; B OGGS , D. Alto: A personal computer. In: S IEWIOREK , D. P.; B ELL , C. G.; N EWELL , A., editors Computer Structures: Principles and Examples. Second edition. New York: McGraw-Hill, 1981. page 549–572. U NGAR , D.; A DAMS , S. S. Harnessing emergence for manycore programming: early ex- perience integrating ensembles, adverbs, and object-based inheritance. In: SPLASH ’10 Proceedings of the ACM international conference companion on Object oriented programming systems languages and applications companion, New York, NY, USA: ACM, 2010. U NGAR , D.; B LAU , R.; F OLEY, P.; S AMPLES , A. D.; PATTERSON , D. SOAR: Smalltalk On A RISC. Architecture of In: Proceedings of the Eleventh Annual International Symposium on Computer Architecture. 1984. page 188–197. U NGAR , D.; S MITH , R. B. Self: The Power of Simplicity. In: Proceedings of the 2nd annual ACM conference on object-oriented programming, systems, languages and applications. 1987. 
page 227–241. W ILLIAMS , I. W. The Mushroom Machine - An Architecture for Symbolic Processing. In: IEE Colloquium on VLSI and Architectures for Symbolic Processing, 1989. X ILINX Virtex-4 Configuration Guide. Technical report UG071, 2007. A PPENDIX A SiliconSqueak Assembly Language In theory, the assembly language for SiliconSqueak would be the bytecodes described in the Smalltalk-80 “blue book” Goldberg e Robson (1983). This is the instruction set for a stack based architecture, which is also known as a zero address architecture. And there is some hardware dedicated to dealing with such instructions, but since these bytecodes are documented elsewhere they will not be described further here. The Squeak system has tools for examining these bytecodes but the programmer has no need to write code at that level. At an even lower level, there is an instruction set which is implemented directly by SiliconSqueak hardware and which is used both to execute the bytecodes and as a target for adaptive compilation from the bytecodes. This is what is called “assembly language” in this text, though it is very tempting to call it “microcode”. Like microcode in many other machines, there is only one instruction format and it combines operations and control in a four address architecure. The 32 bit binary format for these instructions is: FddDDDDD FaaAAAAA MFFXXXXX bbbBBBBB 33222222 22221111 11111100 00000000 10987654 32109876 54321098 76543210 The four bit F field (which is spread about in bits 31, 23, 14 and 13) is described in Sec63 64 APPENDIX A. SILICONSQUEAK ASSEMBLY LANGUAGE tion A.4 and is the control part of the instruction. The X field selects the operation, which can either be a simple operation on two 32 bit data and with a 32 bit results or, depending on the value of M, can select a very complex operation in the optional ALU Matrix coprocessor as described in Appendix B. The destination is selected by the D field while the inputs are selected by A and B. Each of these three fields is indicated by some initial bits in lower case, which select the kind of register which will be used, followed by five bits in upper case which indicate the actual register. While D and A can be any of four kinds of registers (for a total of 127), field B can select four additional kinds (for a total of 256). The following indicates the encoding of bbb, which is the same as 0dd and 0aa. A.0.1 000: Registers t0 to t31 Like most of the registers described here, these are not physical registers but rather aliases for positions in the stack cache. They are called the “temporary registers” and correspond to the arguments and temporary variables of a Smalltalk-80 method. It is extremely rare for methods to require more than 32 temporaries and the code to access them is complicated requiring the use of the stream units. A.0.2 001: Registers i0 to i31 Registers i1 to i31 correspond to the first 31 instance variables of the receiver object of the currently executing method. Register i0 corresponds to the last (and often only) word of the header of the receiver object. These are not physical registers but aliases for words in the data cache. If the object has more than 31 instance variables or more than one header word then these can only be accesses through the stream units. A.0.3 010: Registers s0 to s31 Section A.5 describes the four stream units (two for reading and two for writing) which are essentially convenient Direct Memory Access (DMA) hardware blocks. Each unit is controled by eight registers. 
Streams are widely used in Smalltalk-80 programs and these units support both streams and simple array access. They also allow access to parts of memory which aren’t mapped into any of the other registers.

A.0.4 011: Registers x0 to x31

Sections A.6, A.7, A.8 and A.9 give the details about these registers. They are used to control SiliconSqueak and, for several of them, changing their value has the side effect of remapping other registers.

A.0.5 100: Pseudo Registers #0 to #31

These 32 values are not registers at all, but constants that can be conveniently referenced directly in an instruction. It wouldn’t make sense to have a constant as a destination nor, in most cases, a second constant in the same instruction (since then the result could have been calculated at compile time). This is why only operand B can encode them. The actual small positive integers 0 through 31 can have two different 32 bit encodings, one as the raw value and the other as a tagged value. That depends on the selection of tags by the instruction, as described in Section A.8.

A.0.6 101: Pseudo Registers #-1 to #-32

This is exactly the same as the previous case except that the constants are the first 32 small negative integer values.

A.0.7 110: Pseudo Registers #o0 to #o31

32 well known constant objects can be directly referenced in any instruction. The actual object references are fetched from the SpecialObjectsArray, which is pointed to by register x20. This indirection can be costly, even if the relevant part of the SpecialObjectsArray happens to be present in the data cache. So three critical entries are cached in some of the special registers:

• x23 holds the oop for the class SmallInteger (entry 6, #o5 in assembly)
• x21 holds the oop for false (entry 2, #o1 in assembly)
• x22 holds the oop for true (entry 3, #o2 in assembly)

These registers are not loaded automatically and the hardware will not know to use them in place of an expensive cache access if the constant version is used in an instruction. The last eight constants don’t reference their respective entries in the SpecialObjectsArray but instead indicate special hardware objects when loaded into a stream unit:

def bytecodeCache  #o24
def microcodeCache #o25
def dataCache      #o26
def stackCache     #o27
def microcodeL2    #o28
def dataL2         #o29
def stackL2        #o30
def rawMemory      #o31   ; access to the physical RAM

A.0.8 111: Registers L0 to L31

A Smalltalk-80 method object contains not only the bytecodes to be executed but also a set of constants (called “literals”) which can be referenced by those bytecodes. Registers L0 to L31 are equivalent to i0 to i31 but map to the bytecode cache and the method object instead of the receiver object. If a method has more than 31 literals then the stream units must be used to access them. For methods with fewer than 31 literals, the last few L registers will map to bytecodes instead, but that is not the proper way to access them.

A.1 Directives

Traditional assemblers have a number of directives that control the generation of code and the formatting of listings. Since the assembler for SiliconSqueak microcode is a Squeak application which runs in an environment in which text formatting can easily be done with related tools, only two directives were defined.

A.1.1 org expression

This directive sets the current PC of the code being assembled. Code is generated with an offset, which defaults to zero if not sent as an argument to the assembler. The current PC and
the last PC are initially set to the offset value and the generated code is initially empty. If an org directive would set the current PC to a value greater than the last PC then nils are added to the code until the last PC reaches the desired value. If org sets the current PC to a value below the last PC, then the assembler will overwrite previously generated code. If that code was just nils then it is simply replaced, but if it was anything else then a warning is generated.

A.1.2 def name expression

This directive adds the name to the symbol table and associates it with the value of the expression. One important use of this directive is to make register names more readable. Any expression from 0 to 127 (or 0 to 255 for the second source or 0 to 31 for the ALU Matrix instructions) can be used to indicate a register. The names t0 to t31, i0 to i31, s0 to s31, x0 to x31, L0 to L31, #0 to #31, #-1 to #-32, #o0 to #o31 and m0 to m31 are defined to the right values before assembly begins, so they can be used instead of raw numbers to indicate registers. But it is good practice to define more meaningful names. Suggested names are used for the special registers in this text. Labels (part of the core assembly syntax) are equivalent to defining the name with the current PC as its value. While it is good practice to define a name before it is used in an instruction, that is not possible for forward jumps. So there is a general scheme for handling undefined names, and this makes multiple definitions of the same name an error.

A.2 Syntax

label:  rD := rA op rB -> fetch   ; comment

The label (text without spaces before the first colon) is optional and if present defines the name as having the value of the current program counter. The comment (any text after the first semicolon in the line) is also optional and is ignored by the assembler. The fetch part of the instruction (after the right arrow) is always present in the generated code, but the default value "-> next" can be omitted. This encodes the F field in the instruction as described in Section A.4. Some types of fetch fields (all those with the highest bit set) involve a 32 bit constant, which is present in the word following the instruction.

The basic instruction has the syntax of an assignment to a register of a binary operation between two other registers, a style that is more popular for microcode than the traditional "operation operand, operand, operand". A few operations are actually unary and, in those cases, rA is omitted.

A.3 Operations

The 32 operations possible between two registers can be divided into four groups as indicated by the top two bits of the X field. Within each group, the eight instructions are indicated by the bottom three bits of the same X field. In addition, the M bit can replace the operation by one in the ALU Matrix coprocessor (if present). At the assembly level this is done by replacing the characters for the operation with a number between square brackets: [0] to [31]. This is detailed in Appendix B.

A.3.1 Arithmetic (00)

These instructions use a simple 32 bit adder/subtractor. Bit 2 of the X field is used to mask operand A, which would be equivalent to setting it to #0 if it were not the case that only B can encode that. Bit 1 of the X field inverts the bits of B while bit 0 is used as the carry in.

000   rD := rB   ; move

This is a simple copy of one register to another, including the value of constants.
001   rD := rB + 1   ; increment

The destination receives a value one greater than the source.

010   rD := ~rB   ; not

This is a logical inversion, also known as one's complement.

011   rD := -rB   ; negate

This is a mathematical inversion, also known as two's complement.

100   rD := rA + rB   ; add

The values of the two sources are added and the result is stored in the destination.

101   rD := rA + rB + 1   ; add with carry set

One more than the addition of the two sources is stored in the destination. This option allows for extending results beyond 32 bits.

110   rD := rA - rB - 1   ; sub borrow

One less than the subtraction of the second operand from the first is stored in the destination. This option allows for extending results beyond 32 bits.

111   rD := rA - rB   ; subtract

The second source is subtracted from the first and stored in the destination.

A.3.2 Comparison (01)

Every operation described so far actually produces two results. One is a 32 bit result which is stored in the indicated destination. The second is a single bit which indicates whether the other result was zero or not. This is used for conditional branching and is sufficient to test equality. Some math code, however, needs more details about the effects of adding two operands. So six additional instructions add (or subtract) the two operands but redefine the meaning of the single bit result (a short example appears at the end of Section A.3.3).

010   rD := rA +? rB   ; carry

Indicates that the addition resulted in a carry from the highest bits. Can be used to extend additions beyond 32 bits.

011   rD := rA +?? rB   ; overflow

Indicates that the addition caused an overflow condition.

100   rD := rA < rB   ; less than

Indicates that the subtraction of the two signed values had a negative result.

101   rD := rA <= rB   ; less or equal

Indicates that the subtraction of the two signed values had a negative or zero result.

110   rD := rA $< rB   ; unsigned less

Indicates that the subtraction of the two unsigned values had a negative result.

111   rD := rA $<= rB   ; unsigned less eq

Indicates that the subtraction of the two unsigned values had a negative or zero result.

A.3.3 Logic (10)

These instructions implement simple bitwise logical operations. Given two bits, there are a total of 16 possible logical operations, but several of these are already implemented with the move and not arithmetic instructions. Bit 0 of the X field is used to invert the bits of the result while bits 2 and 1 select between four logic blocks.

000   rD := rA | rB   ; or

The destination receives the bitwise inclusive or of the two operands.

001   rD := rA ~| rB   ; nor

This is the same as the previous operation but with the result bits inverted.

010   rD := rA ^ rB   ; exclusive or

The destination receives the bitwise exclusive or of the two operands.

011   rD := rA ~^ rB   ; equivalence

This is the same as the previous operation but with the result bits inverted. The opposite of the exclusive or is the equivalence operation.

100   rD := rA & rB   ; and

The destination receives the bitwise and of the two operands.

101   rD := rA ~& rB   ; nand

This is the same as the previous operation but with the result bits inverted.

110   rD := rA &~ rB   ; and invert

The destination receives the bitwise and of the first operand with the inverse of the second operand.

111   rD := rA ~&~ rB   ; nand invert

This is the same as the previous operation but with the result bits inverted.
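The carry comparison together with the skipOnZero fetch option described in Section A.4 makes multiple precision arithmetic straightforward. The fragment below is a purely illustrative sketch (the register assignments are invented for the example, and it assumes the carry indicator of +? reads as one exactly when the low word addition carried): a 64 bit addition of the pair t1:t0 with the pair t3:t2 into t5:t4.

            t5 := t1 + t3                   ; high words, assuming no carry for the moment
            t4 := t0 +? t2 -> skipOnZero    ; low words; the single bit result is the carry out
            t5 := t5 + 1                    ; only executed when the low word addition carried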
A.3.4 Shifts (11)

This set of operations makes use of the DSP units present in modern FPGAs. Bit 2 of the X field selects between shifting and multiplying by changing the encoding of the second operand before it reaches the multiplier (a shift by 9 is the same as a multiplication by 512). Bit 1 selects between signed and unsigned versions of the instructions (except that signed shift left becomes rotate, since it would otherwise be the same as unsigned shift left) while bit 0 selects the direction.

000   rD := rA <> rB   ; rotate

The bits of the first operand are shifted left as indicated by the second operand and the bits that are shifted out are brought back in at the other end.

001   rD := rA >> rB   ; arith shift right

The bits of the first operand are shifted right as indicated by the second operand, with the top bit filling the vacated positions.

010   rD := rA << rB   ; shift left

The bits of the first operand are shifted left as indicated by the second operand while the new bits are filled with zero.

011   rD := rA $>> rB   ; shift right

The bits of the first operand are shifted right as indicated by the second operand while the new bits are filled with zero.

100   rD := rA * rB   ; multiply

The two 32 bit operands are treated as signed values which are multiplied, and the bottom 32 bits of the result are stored in the destination.

101   rD := rA / rB   ; divide

The first operand is divided by the second operand, with both being treated as signed values.

110   rD := rA $* rB   ; unsigned mult

The two 32 bit operands are treated as unsigned values which are multiplied, and the bottom 32 bits of the result are stored in the destination.

111   rD := rA $/ rB   ; unsigned div

The first operand is divided by the second operand, with both being treated as unsigned values.

A.4 Fetch

The four bits of the F field define how the next microinstruction is to be fetched. The top bit selects between single word instructions and those that require a second 32 bit word (as a destination address, for example).

0000  -> next

This is the default case if no fetch part is included in an assembly level instruction. The uPC is incremented by one so that the instruction that follows in the L2 microcode cache gets executed.

0001  -> return

A 32 bit value is popped from the return stack and microcode execution continues from the indicated address.

0010  -> skipOnZero

If the single bit result of the instruction is zero, then all effects of executing the following instruction (whether one or two words long) are canceled. By the time the result is determined, one or more of the following instructions will probably already have been fetched and be in the pipeline, but if the instruction that might be skipped could cause a branch then all fetching is halted until this is resolved.

0011  -> skipOnOne

This is exactly like the previous case, but the skip happens if the single bit result is one.

0100  -> PICuPC

This option isn't used in the bytecode interpreter but only in code generated by the adaptive compiler. The associated instruction generates a value that represents the class of an object which is to receive a message. The current value of the uPC represents the "send site" and is combined via a hash function with the receiver's class to point to a place in the virtual L2 microcode cache. That line might be allocated or not, and if allocated it might belong to a different class/send site pair.
In those cases the lookup routine is invoked. Otherwise execution can continue with the fetched code. Note that though PIC means "Polymorphic Inline Cache", this is not actually inline. Traditional adaptive compilers will create an inline cache for a single entry and then change that to a call to a non inline switch statement, which is the actual PIC. In SiliconSqueak the sequential tests of the switch statement are replaced with a hashed search.

0101  -> matrix[row,col,start]

The next few words are not microcode at all but instructions to be loaded into the ALU Matrix coprocessor (if present). The first word includes a count of ALU Matrix instructions, and once these are all fetched the word following that is again interpreted as SiliconSqueak microcode. If the associated instruction happens to be an ALU Matrix operation, then that operation number is set to start at this code fragment, as described in Appendix B. SiliconSqueak itself treats all these words the same as skipped instructions. The area selection allows a compact representation and shorter reconfiguration times when code is more SIMD (single instruction, multiple data) in nature. The row and column can be * to indicate all (treated as 0.15), a single number (treated as n.n) or a pair of numbers separated by a period. This allows a rectangular subset of the Matrix to be selected. The start tells the system where to load the code in the selected ALU's program memory and can be from 0 to 1023.

0110  -> fetch8

The last microcode instruction of a sequence that implements a given bytecode should have this option so that the following bytecode will be fetched and the address of the first microcode instruction corresponding to that new bytecode can be calculated. Many simple bytecodes can have a microcode sequence of just a single instruction.

0111  -> tagAndFetch8

The tag units will verify that the two operands are properly encoded small integers and convert them into raw 32 bit values. The result will be converted back into an encoded small integer if possible. When all this works out, then this option is just like the previous one and a new bytecode is fetched and its corresponding microcode sequence is executed. If any of the detagging or retagging operations fail, however, execution continues with the following microcode instruction. This is normally some cleanup code followed by a more general version of the operation that was attempted.

1000  -> fetch4 destination

The byte following the currently executing bytecode is fetched and its top four bits replace the corresponding bits in the 32 bit word following the current microcode instruction, and the result is used as the new uPC. This results in a 16 way branch to locations spread 16 words apart around the address indicated by destination.

1001  -> fetch2 destination

The byte following the currently executing bytecode is fetched and its top two bits replace the corresponding bits in the 32 bit word following the current microcode instruction, and the result is used as the new uPC. This results in a 4 way branch to locations spread 64 words apart around the address indicated by destination.

1010  -> jumpOnZero destination

When the single bit result is zero, the uPC is set to destination. Otherwise the following instruction is executed.

1011  -> jumpOnOne destination

When the single bit result is one, the uPC is set to destination. Otherwise the following instruction is executed.

1100  -> jump destination

The uPC is set to the destination. This option is used not only for loops and (in combination with the jumpOnXX options) for conditional execution but also to deal with memory fragmentation. Due to the fetchX limitations there will often be partially used code fragments while other fragments overflow.

1101  -> PICx associateduPC

One problem with the PICuPC option is that different call sites can't share PIC entries even if the adaptive compiler figures out that they could. This option is exactly the same except that the following word is used for the hash instead of the current value of the uPC. This is just a way to save memory and time; it does not add any functionality.

1110  -> call destination

This is just like jump except that the previous value of the uPC is pushed on the return stack. If a return is encountered later on then execution will continue with the instruction following the call. Since the return stack is shared with the send and return bytecodes, it is important that microcode level calls and returns be perfectly balanced within a single bytecode.

1111  -> tagOrCall destination

The tag units will verify that the two operands are properly encoded small integers and convert them into raw 32 bit values. The result will be converted back into an encoded small integer if possible. When all this works out, execution continues with the following instruction. If there are any problems, then the subroutine indicated by destination is called and its job is to deal with the failure.
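To make the interplay between the tag units and the fetch field concrete, the following is a purely illustrative sketch: it is not taken from the actual microcode tables, it uses the scratch register x27 described in Section A.9, and it assumes that the stack registers dPop and dTop of Section A.7 can be combined in this way. It shows a fast path for a bytecode that adds the two SmallIntegers on top of the data stack.

smallAdd:   x27 := dPop                          ; pop the argument into a scratch register
            dTop := dTop + x27 -> tagAndFetch8   ; detag both operands, add, retag the result and
                                                 ; fetch the next bytecode; on any tag failure
                                                 ; execution falls through
            ; slower, more general code for the failure case would follow here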
A.5 Streams

SiliconSqueak has four DMA (direct memory access) engines that are presented as a set of registers s0 to s31. This is the only way to access instance variables beyond 31 in the receiver, to access the extended header words or to access an object other than the receiver. The DMA units will access the data cache if the needed information is there but will bypass the cache if it is not.

register    index  step  base  limit  reset  resetByte  atEnd  next  nextPut
read0       s0     s1    s2    s3     s4     s5         s6     s7    -
read1       s8     s9    s10   s11    s12    s13        s14    s15   -
write0      s16    s17   s18   s19    s20    s21        s22    -     s23
write1      s24    s25   s26   s27    s28    s29        s30    -     s31

Table A.1: Registers for Stream Units

The index registers select which field in the object will be accessed. The step registers hold the value to be added to the index after the operation. The limit registers indicate the first value which is not acceptable for the index and the base registers indicate the first value that is acceptable (this is not checked when the registers are initially set but is only used for wrap around). When an object pointer is stored into the reset registers or the resetByte registers then the base and index registers are set to the first valid word or byte in that object and the limit to the word or byte after the last one, while the step will be set to 1. This makes setting up the most common configuration very efficient without complicating other configurations.

The atEnd registers will read as false until the first time the index has to wrap around due to exceeding the limit. They will read as true from then on until the DMA unit is reset. The next registers allow data to be read from the streams while the nextPut registers allow data to be written to the streams. The data can be either 32 bit words or 8 bit bytes, depending on which register was used to reset the stream unit. The unit has a small internal buffer to make memory access more efficient.
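As a purely illustrative sketch of how a stream unit is used (the register choices and the loop structure are invented for the example; it assumes that x21 has been loaded with the false oop as described in Section A.8, that the single bit result reads as one exactly when the 32 bit result is non zero, and it uses the dPush register defined in Section A.7), the following fragment sums the words of an array whose oop is in t1:

            s4 := t1                             ; resetting read0 with the oop sets index, base, limit and step
            t2 := #0                             ; running total
loop:       x31 := s6 - x21 -> jumpOnOne done    ; atEnd no longer reads as false? then finish
            t2 := t2 + s7 -> jump loop           ; read the next element and accumulate
done:       dPush := t2                          ; leave the result on the data stack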
A.6 Context

The runtime environment presented to the Smalltalk-80 programmer is a set of threads, each of which has a stack built from a linked list of Context objects (known in other languages as activations or as stack frames). So the current context provides information which is important for the proper execution of instructions. It is replaced by a different context either when a send or return bytecode is executed or when the scheduler switches to a different thread.

Context objects allow high level implementation of tools such as the debugger, but they can greatly reduce performance and can complicate adaptive compilation. An alternative is to use a conventional stack most of the time and only allocate actual Context objects when some code references them. SiliconSqueak has two separate hardware stacks (which share a single cache): the data stack and the return stack. Five registers are associated with every context (independently of whether the actual object has been allocated or not) and they are pushed to and popped from the return stack on message sends and returns.

def self          x0   ; affects i0-i31
def method        x1   ; affects L0-L31
def IP            x2
def context       x3   ; (or nil)
def framePointer  x4   ; affects t0-t31

Special register x0 has been defined with the nicer name "self" in this example. Changing its value will remap registers i0 to i31 to a different part of main memory. In theory this register should have the same value as register t0, but this doesn't happen when blocks are involved. Register IP is normally incremented automatically by the bytecode fetch hardware, but can be set to a new value to cause a jump at the bytecode level. When the method register is changed, the IP register must be updated as well. The context register is ignored by the hardware and is set to nil unless an actual Context object is allocated, in which case it is set to point to that object. The framePointer indicates the memory position to which register t0 is mapped. This is a location in the data stack.

A.7 Thread

Though threads share the object address space, each has its own stack (or pair of stacks in the case of SiliconSqueak). The hardware stacks are allocated from a region in memory that is divided into blocks of 32 words each. The blocks are joined using doubly linked lists to allow stacks to grow to arbitrary sizes within the region.

def dBlockHigh  x5
def dBlockLow   x6
def rBlock      x7

These three registers select which blocks the hardware can access directly. dBlockHigh and dBlockLow together define a 64 word region for the data stack. rBlock defines a 32 word region for the return stack. This simpler scheme is possible because there is no equivalent to registers t0 to t31 for the return stack.

def stackPointer  x8   ; affects dBlockHigh and dBlockLow

This register points to a word that normally corresponds to the region defined by dBlockLow. If it is incremented to where it moves from the dBlockLow region to the dBlockHigh one, then dBlockHigh is copied to dBlockLow and the linked list is followed to find a new value for dBlockHigh (possibly invoking software to allocate a new block).
If the stackPointer is decremented when it already points to the first word in dBlockLow, then dBlockLow is copied to dBlockHigh and the linked list is followed to find a new value for dBlockLow (always possible unless there is a stack underflow error).

def dTop         x9
def rTop         x10
def currentByte  x11

The dTop register will return the value in the word pointed to by stackPointer. rTop is exactly the same, but for the return stack. The currentByte register returns the value pointed to by the combination of the method and IP registers.

def returnPointer  x12   ; affects rBlock

This is like stackPointer, but for the return stack instead of the data stack. When it tries to move beyond or before the region indicated by rBlock then that register is replaced by following the linked list (possibly invoking software to allocate a new block).

def dPop      x13   ; affects stackPointer
def rPop      x14   ; affects returnPointer
def nextByte  x15   ; affects IP

These registers are exactly like dTop, rTop and currentByte respectively, except that they cause the associated pointers to be incremented or decremented. In the case of dPop and rPop, the respective stack pointers are decremented after use when they are operands and incremented before use when they are destinations. In the latter case it might be nice to define alternative names for these registers:

def dPush  x13   ; affects stackPointer
def rPush  x14   ; affects returnPointer

A.8 Image

For a given Squeak system, there are some global settings that are valid for all threads. A single core might run more than one image (even for other languages such as Java or Python), in which case these registers, as well as all previously mentioned registers, would have to be updated on each switch and the caches would have to be flushed.

def tagConfig     x16
def bcTable       x17
def L2ConfigHigh  x18
def L2ConfigLow   x19

Tag configuration defines the operation of the two detagging units (associated with operands A and B) and the retagging unit (associated with the destination). The lowest 16 bits of the register indicate the valid SmallInteger combinations of d31, d30, d1 and d0. The next higher 4 bits are ANDed with d31, d30, d1 and d0 when converting from tagged to untagged SmallIntegers, while 4 other bits are ORed with d31, d30, d1 and d0 when converting from untagged to tagged SmallIntegers. 2 bits indicate how much to shift right when converting from tagged to untagged SmallIntegers, and the same bits indicate how much to shift left for the reverse operation. The top 6 bits are undefined.

For Squeak the bottom bit is set to 1 for SmallIntegers, so this register must be set to the hex value 011EAAAA. The AAAA defines all odd values as valid SmallIntegers. The E will clear d0 when converting to raw bits and the bottom 1 will set it when retagging. The top 1 will divide the tagged value by 2 and multiply it back when retagging. For Self the bottom two bits are 0 for SmallIntegers, so this register must be set to the hex value 020F1111. An option that works well in hardware but is complicated to deal with in software is when the top two bits must match in SmallIntegers. This can be handled by setting this register to the hex value 000FF00F.
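Reading the Squeak value against the field layout just described gives the following breakdown (the exact bit positions of the shift field are inferred from the field sizes listed above; the lines are written as assembler comments for reference):

; breakdown of the Squeak tag configuration, hex 011EAAAA (tag bit in d0):
;   bits 15-0    AAAA   every combination with d0 = 1 is a valid SmallInteger
;   bits 19-16   E      AND mask 1110 clears d0 when removing the tag
;   bits 23-20   1      OR mask 0001 sets d0 when adding the tag back
;   bits 25-24   01     shift right by 1 when detagging, shift left by 1 when retagging
;   bits 31-26   0      undefined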
There is a reserved region of 4KB in the microcode cache which is loaded manually and is never flushed. When a new bytecode starts to be executed, the uPC is set to the value of the bytecode shifted left by 2 in order to directly start executing the sequence of 16 bytes (4 words) of microinstructions corresponding to that bytecode. If the region needs to be reloaded when switching between images then register bcTable has the address from which it should be loaded. The rest of the L2 virtual microcode cache follows that in memory. The two L2Config registers are not yet defined, so for now the L2 virtual caches are simply the 32 bit address space.

The following registers have already been described and simply cache values that can be accessed using the constant objects option for operand B. They are just regular registers as far as the hardware is concerned and must be manually reloaded when there is an image switch.

def specialOop       x20
def falseOop         x21
def trueOop          x22
def SmallIntegerOop  x23

A.9 System

Only three very low level hardware registers are used by the system as a whole instead of by individual images. The nocData and nocLast/nocStatus registers offer very low level access to the Network On Chip. Normally this is not needed, as all resources can be accessed as memory. The now register allows simple access to a precise timer, but with only 32 bits it overflows rather quickly. The remaining registers are labeled "scratch" and since they are widely used their values can't be counted on except in very short code fragments. They are used mostly in the microcode for the send and return bytecodes. x31 is, by convention, the destination whenever the result of an operation is not needed.

def nocData  x24
def nocLast  x25   ; or nocStatus on read
def now      x26   ; increments on every clock cycle
; scratch registers: x27 to x31
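The now register makes simple cycle counting possible. The fragment below is a purely illustrative sketch (x27 and x28 are scratch registers, the measured instruction is arbitrary, and a possible wrap around of the 32 bit counter is ignored):

            x27 := now            ; remember the starting cycle count
            t4 := t2 * t3         ; some code being measured
            x28 := now - x27      ; elapsed clock cycles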
APPENDIX B

ALU Matrix Assembly Language

The optional ALU Matrix coprocessor is meant to run numerically intensive code faster than the main SiliconSqueak core. It is tightly controlled by the core but offers a level of indirection that allows some deviation from pure single instruction, multiple data (SIMD) execution. When the core selects ALU Matrix operation 12, for example, this is translated into a 10 bit address for the ALU program memories. The content of these memories can be different for each ALU (though it often isn't), so that one ALU might be adding two registers while its neighbor is doing an inclusive or of two different ones.

Just like instruction flow control is a part of each SiliconSqueak microcode instruction, communication is a part of each ALU Matrix instruction. The coprocessor operates in place of the ALU of the core and has access to the same operands A and B. 32 bits of its results are always saved to the indicated destination in the core (or to x31 if this result is unwanted). The size of the Matrix can be up to 16 by 16 and doesn't have to be square. A smaller 8 by 8 Matrix is used in this text. Each ALU is 8 bits wide, but neighbors can combine their results using the carry options shown in Table B.1.

Though the syntax for ALU Matrix assembly language is slightly different from the SiliconSqueak microcode, the same tool is used for both and source code mixes them. The last SiliconSqueak instruction before a sequence of ALU Matrix instructions must use the "-> matrix[row,col,start]" fetch option.

In the binary code, the first word has this format:

xxxxXXXX yyyyYYYY CCCCCCSS SSSSSSSS
33222222 22221111 11111100 00000000
10987654 32109876 54321098 76543210

The x field indicates the first column and X the last column into which the following code will be loaded. The y and Y fields define the rows, so the four fields together define a rectangular subset of a 16 by 16 Matrix. The C field indicates how many words of ALU Matrix instructions follow, while the S field is the address at which the code is to be loaded in the program memories of the selected ALUs. If the SiliconSqueak instruction with the "-> matrix[row,col,start]" option invoked a Matrix operation then that operation number is reset to start at S.

The next C words are the actual ALU Matrix instructions and have the following format:

TTTDDDDD AAAAAXXX XXXBBBBB CCCCCIII
33222222 22221111 11111100 00000000
10987654 32109876 54321098 76543210

At the source level, the assembly syntax is:

{condition}  mD := mA op mB , mC := INP   ; comment

The comment (any text after the first semicolon in the line) is optional and is ignored by the assembler. The only other part that can be omitted is the condition (the part between curly brackets), which is equivalent to true if not present. The basic instruction is in the format of two register assignments separated by a comma. The leftmost assignment takes values from two registers and combines them using the named operation, while the second assignment stores the value from the named external source into a register. The registers are named m0 to m31. When the indicated condition is false, the leftmost assignment does not happen. By convention, m31 is used as a scratch register, so it is the destination of assignments which do not matter.

The eight possible sources for the second assignment (indicated by field I) are: up, down, left, right, rA, rB, multLow and multHigh. The first four are the outputs of a neighbor, rA and rB are a byte from the respective register of the core, and the last two options select a byte from half of the result of a 16x16 bit multiplier. Each multiplier is shared between four 8 bit ALUs. The eight possible conditions (indicated by field T) are {m0}, {!m0}, {m1}, {!m1}, {m2}, {!m2}, {true} and {false}. The {m0} condition is true if register m0 has a value that is different from zero, while {!m0} is true if register m0 holds zero.

There can be up to 64 different operations (selected by field X), not including multiplications and shifts, which are handled by circuits between the ALUs. The "$" indicates unsigned, "." indicates saturating, "r" indicates carry from the right and "u" carry from above.

                       000    001    010     011    100    101     110     111
Additions   (000xxx)   +      +.     $+.     -      -.     $-.     |-|     |$-|
Additions   (001xxx)   +r     +.r    $+.r    -r     -.r    $-.r    |-|r    |$-|r
Additions   (010xxx)   +u     +.u    $+.u    -u     -.u    $-.u    |-|u    |$-|u
Logic       (011xxx)   |      ~|     ^       ~^     &      ~&      &~      ~&~
Comparisons (100xxx)   =      0?     +?      +??    <      <=      $<      $<=
Comparisons (101xxx)   =r     0?r    +?r     +??r   <r     <=r     $<r     $<=r
Comparisons (110xxx)   =u     0?u    +?u     +??u   <u     <=u     $<u     $<=u
?           (111xxx)

Table B.1: ALU Matrix operations

The versions of subtraction indicated by |-| are absolute differences. The comparisons with +? test for carry out while those with +?? test for overflow. For non saturating arithmetic the results for signed and unsigned operations are the same. Absolute differences also don't make sense with saturating arithmetic.
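To show how the two languages mix in one source file, the following fragment is a purely illustrative sketch: the operation number, the load address and the register choices are invented, and it assumes the assembler derives the header word (including the instruction count) from the matrix fetch option and the ALU Matrix lines that follow it.

        x31 := x31 [3] x31 -> matrix[*,*,40]   ; load the fragment below into every ALU at address 40
                                               ; and bind Matrix operation 3 to it
        m1 := m2 +. m3 , m31 := rA             ; one ALU Matrix instruction: saturating add while
                                               ; latching a byte of operand A
        t4 := t1 [3] t2                        ; later: run operation 3 with t1 and t2 as its operands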