A 32 POINT MONOLITHIC FFT PROCESSOR CHIP

Guy D. Covert
Senior Systems Engineer
TRW LSI Products
La Jolla, CA 92038

CH1746-7/82/0000-1081 $00.75 © 1982 IEEE

ABSTRACT

The Discrete Fourier Transform (DFT) is used in a wide variety of digital signal processing applications. The algorithms used to implement this transform require intensive arithmetic computation as well as complex control and sequence functions. The designer of VLSI components is faced with the problem of identifying requirements and architectures for chips which directly support the DFT. Design goals of these chips include minimum chip count to implement an entire transform, very high speed and low power dissipation. This paper discusses a monolithic CMOS device that was fabricated to perform 32 point Fast Fourier Transforms at very high data rates. All data memory, arithmetic and control circuitry is contained on this single low power chip.

INTRODUCTION

The TMC2032 is a monolithic, completely self contained Fourier Transform processor which is capable of computing both forward and inverse Discrete Fourier Transforms (DFT) on 32 complex valued data samples. The device has been fabricated using a TRW proprietary 2-micron bulk CMOS process technology that offers very high circuit density and low power dissipation plus the extremely high speed operation that has previously been associated only with bipolar devices. Approximately 27,000 FET devices were used on a 236 x 248 die and the device dissipates about 900 milliwatts from a single five volt power supply. Total time required to perform a 32 point complex-to-complex DFT is 47.0 microseconds when the maximum master clock frequency of 50 MHz is used.

The algorithm implemented by the TMC2032 to compute the DFT is an in-place decimation-in-time (DIT) FFT using radix-2 butterflies. Five passes with sixteen butterflies per pass are then required for one complete 32 point transform. All input/output and arithmetic operations are performed with a sixteen bit, fractional two's complement fixed point numeric format that is common to many existing high speed digital signal processing systems. This format offers approximately 96 dB of overall dynamic range.

DEVICE ARCHITECTURE

A block diagram of the FFT processor is shown in Figure 1.

Figure 1. Block Diagram of the TMC2032

As can be seen, the arithmetic unit of the TMC2032 consists of a 16 x 16 bit Multiplier Accumulator (MAC) and a separate 17 bit carry-lookahead adder. Together, these form a one-quarter butterfly circuit that, under microprogram control, is sequenced through twenty-four cycles of the master clock to complete one full complex FFT butterfly operation every 480 nanoseconds. The MAC circuit uses a Booth coded multiply scheme. One input is connected to a sine-cosine ROM that provides the complex twiddle factors required by both the forward and inverse transforms. These factors are stored in Booth coded form so that they can be used directly by the MAC. This resulted in a significant saving of devices in the MAC circuitry at the lesser cost of requiring a 24 bit wide ROM look up table rather than a 16 bit width. Output from the quarter butterfly circuit may be right shifted up to one bit under control of an external signal. This allows scaling of data as required to prevent arithmetic overflows. Arithmetic rounding is applied to the final butterfly output by adding 0.5 to the least significant bit of each output word.

All input/output data as well as interim results are stored in a 64 word by 16 bit RAM. This memory may read from one address while writing to another in a single memory cycle. A memory cycle corresponds to four cycles of the master clock.
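The arithmetic described above can be illustrated with a small numeric model. What follows is a minimal Python sketch, not the chip's microcode. It assumes a radix-2 DIT butterfly of the form A' = A + W·B and B' = A - W·B, with 16 bit fractional (Q15) words held as Python integers. The names `butterfly`, `round_shift` and the `scale` flag are hypothetical. Saturating the result on overflow is also an assumption, since the text only states that an overflow is flagged. The Booth coded ROM format and the twenty-four cycle sequencing are not modelled.

```python
Q = 15                           # fractional bits of the 16 bit word
MAX16, MIN16 = 32767, -32768     # two's complement limits of a 16 bit word


def round_shift(value, shift):
    """Arithmetic right shift that first adds half of the output LSB."""
    return (value + (1 << (shift - 1))) >> shift


def butterfly(a, b, w, scale=False):
    """One radix-2 DIT butterfly on Q15 complex values.

    a, b and w are (real, imag) tuples of Q15 integers.
    scale=True applies the optional one bit right shift to the outputs.
    Returns (a_out, b_out, overflow).
    """
    ar, ai = a
    br, bi = b
    wr, wi = w

    # Complex product b * w: four real multiplies, results are in Q30.
    pr = br * wr - bi * wi
    pi = br * wi + bi * wr

    # Align a to Q30, then form the sum and the difference.
    raw = ((ar << Q) + pr, (ai << Q) + pi,
           (ar << Q) - pr, (ai << Q) - pi)

    # Return to Q15 (plus the optional scaling shift) with rounding.
    shift = Q + (1 if scale else 0)
    out = [round_shift(v, shift) for v in raw]

    # Flag any result that no longer fits in 16 bits.
    overflow = any(v > MAX16 or v < MIN16 for v in out)

    # Assumption: saturate out-of-range results.
    out = [max(MIN16, min(MAX16, v)) for v in out]
    return (out[0], out[1]), (out[2], out[3]), overflow


# Example call: W = -j is represented exactly as (0, -32768) in Q15.
a_out, b_out, ovf = butterfly((16384, 0), (8192, 4096), (0, -32768), scale=True)
```

In this model the one bit shift and the rounding are folded into a single step, which keeps the half-LSB rounding exact for both scaled and unscaled outputs.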
All control and sequence functions in the TMC2032 are performed by a PLA based microprogrammed control unit. This unit is easily programmed by a final mask step. It cycles at the master clock rate and generates all the signals required to step through the 80 butterflies required by a 32 point transform. These signals include twiddle factor ROM addresses, RAM read/write addresses and butterfly unit states. An instruction decoder circuit allows the PLA to receive and process macro level instructions via the off-chip interface.

The off-chip interface includes separate 16 bit input and output ports, an instruction input and a status register output. All outputs have three-state buffers to give added flexibility when interfacing to bus oriented systems. Instructions to the chip include:

1. Load complex data samples over the input/output port sequentially into the dual port RAM, then perform a 32 point FFT.

2. Output complex data in bit reversed addressing order.

3. Output complex data in natural sequential order.

4. Load complex data and perform 32 point inverse FFT.

5. Right shift all data values by one bit during the next sequential pass.

6. Return status.

The status register consists of five bits. Three of these indicate which of the five FFT passes is currently in progress. The fourth bit indicates that the chip is busy and the fifth bit indicates that an arithmetic overflow has occurred during the current FFT pass.

DATA SCALING

In the implementation of any fixed point FFT, provisions must be made for scaling data points to prevent arithmetic overflows which may be caused by normal word growth within the algorithm. The TMC2032 accomplishes this scaling by use of two external signals as follows:

1. A bit is available in the status register which indicates that an arithmetic overflow occurred on the current of the five FFT passes. This signal is reset at the beginning of each pass and latched whenever an overflow occurs.

2. An instruction may be input which causes the TMC2032 to right shift all data points by one bit before they are output from the next sequential pass. This signal is latched at the start of each pass.

The simplest application using these two signals to prevent arithmetic overflows is a fixed scaling procedure wherein an external circuit monitors the pass counter and asserts the right shift instruction in a predetermined fixed sequence. Here, the overflow bit becomes an error flag. In order to use this method effectively, some a priori knowledge about the structure of the input signal is required. For example, if the input signal is essentially Gaussian noise, the optimum fixed scaling is usually a right shift on every even numbered pass.

A more flexible approach to data scaling requires an external circuit to monitor the state of the overflow bit and determine which passes of the FFT must be right shifted in order to prevent overflows. Non-valid data will come out of the first few FFTs, while the appropriate right-shift pattern is developed. As long as the input signal characteristics do not change significantly, the minimum shift, no overflow scaling case will soon be found and valid output data will result from that point onward.

32 POINT REAL FFTS

The TMC2032 performs a complex to complex Fourier transform. However, in many potential applications for this chip, real data only is being processed and 32 points of real data must be transformed into sixteen complex valued frequencies. Here, the TMC2032 may be used to compute two real-to-complex transforms in the same amount of time required to compute a single complex-to-complex transform. The following computational procedure applies (1):

1. Load the first 32 real valued data points into the real array of the TMC2032. Load the second 32 real valued data points into the imaginary array of the TMC2032.

2. Execute the 32 point complex-to-complex FFT macro instruction.

At this point, we must realize that the Fourier transform of real only data will have a real part that is an even function of frequency and an imaginary part that is an odd function of frequency. Correspondingly, the transform of imaginary only data will have an even imaginary part and an odd real part. Therefore, the two sets of sixteen complex frequencies may be generated by the simple additions and subtractions required to sort out the even and odd parts.

Using the above procedure, the effective processing bandwidth of the FFT chip may be doubled when processing real data by the addition of fairly slow add and subtract elements. Therefore, we can now transform real data with an input sample rate of up to 1.36 MHz.

BUILDING A 1024 POINT FFT

The TMC2032 was designed to be used as a building block for the construction of larger size transforms. A 1024 point FFT may be constructed using the following computational method (2):

1. First of all, we must take the 1024 input complex time samples and arrange them into a two dimensional matrix with the following format:

$$
\begin{bmatrix}
0 & 1 & 2 & 3 & \cdots & 31 \\
32 & 33 & 34 & 35 & \cdots & 63 \\
\vdots & & & & & \vdots \\
992 & & \cdots & & & 1023
\end{bmatrix}
$$

Further, we will define M (M = 0 through 31) as the column index and L (L = 0 through 31) as the row index. Generation of the final 1024 point transform will now be performed by using 32 point FFT's on the rows and columns of this matrix, then reformatting back to 1024 complex frequencies.

2. Using the TMC2032, perform a 32 point FFT on each of the 32 columns.

3. Every element must now be multiplied by a complex twiddle factor depending on its location in the matrix. This factor is:

$$ W^{ML} $$

where M and L are the column and row indices and W is:

$$ W = e^{-j 2\pi / 1024} $$

4. Using the TMC2032, compute the 32 point FFT's of each of the 32 rows.

5. We now have completed the 1024 point transform computation and, in the process, transposed the original matrix. Therefore, we must now read out our frequencies with F(0) being located at position (0,0), F(1) at (0,1), F(2) at (0,2), F(32) at (1,0) etc.

As can be seen, the above procedure requires the computation of 64 different 32 point FFT's as well as 1024 complex multiplies. The system designer has the flexibility of selecting combinations of parallel and serial structures which implement the required processing within his speed constraints. For example, maximum speed will be attained using 64 TMC2032's and the minimum hardware system will use a single chip sequenced through all 64 FFT's.

A block diagram of one possible implementation of the 1024 point transform is given in Figure 2. Here, 16 TMC2032's are each sequenced through four FFT's to compute a single 1024 point transform. Complex multiplication is performed using two multiplier-accumulator chips. Each chip is sequenced twice to generate a single complex product. Finally, a total of three frame store memories are used to store intermediate results and read them out in row or column order as required. These memories are double buffered to allow sustained rate processing. This system is capable of producing a new 1024 point FFT every 188 microseconds, subject to a latency time of 752 microseconds.

REFERENCES

(1) L. D. Enochson and R. K. Otnes, "Digital Time Series Analysis", 1972.

(2) L. R. Rabiner and B. Gold, "Theory and Application of Digital Signal Processing", Prentice-Hall, 1975; pp. 371-379.
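The five step 1024 point procedure given under BUILDING A 1024 POINT FFT can be checked numerically. The following is a minimal Python sketch in floating point, not a model of the hardware. A direct DFT (the hypothetical helper `dft`) stands in for each 32 point TMC2032 transform, and the forward transform convention W = e^{-j2π/1024} is assumed. The function name `dft_by_rows_and_columns` and its `size` parameter are also hypothetical. Read-out positions are taken in (M, L), that is (column, row), order, which is the reading under which the result appears transposed.

```python
import cmath


def dft(x):
    """Direct DFT of a short sequence, standing in for one 32 point chip."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def dft_by_rows_and_columns(x, size=32):
    """Compute a (size * size) point DFT from size point DFTs."""
    n = size * size
    if len(x) != n:
        raise ValueError("input length must equal size * size")

    # Step 1: arrange the samples so that mat[L][M] = x[size * L + M].
    mat = [[x[size * L + M] for M in range(size)] for L in range(size)]

    # Step 2: transform each of the columns (index L varies).
    for M in range(size):
        col = dft([mat[L][M] for L in range(size)])
        for L in range(size):
            mat[L][M] = col[L]

    # Step 3: multiply every element by the twiddle factor W^(M * L).
    for L in range(size):
        for M in range(size):
            mat[L][M] *= cmath.exp(-2j * cmath.pi * M * L / n)

    # Step 4: transform each of the rows (index M varies).
    for L in range(size):
        mat[L] = dft(mat[L])

    # Step 5: transposed read-out, F(k) sits at row k % size, column k // size.
    return [mat[k % size][k // size] for k in range(n)]


# Example call on an arbitrary 1024 point complex test sequence.
samples = [complex((7 * t) % 11 - 5, (3 * t) % 5 - 2) for t in range(1024)]
spectrum = dft_by_rows_and_columns(samples)
```

The sketch performs 64 short transforms and 1024 complex multiplies, matching the operation count stated in the text. How those operations are distributed across chips is a separate system design choice.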