Sample Integrated Fourier Transform (SIFT): An approach to low-latency, low-power, high-performance FT for real-time signal processing

Trang K. Ta1, Walter Pelton2, Pochang Hsu, Nipa Yossakda3
1Fujitsu Microelectronics, CA, USA – email: [email protected]
2Agnitio Technologies, CA, USA – email: [email protected]
3Northwestern Polytechnic University, CA, USA

The Fast Fourier Transform (FFT) is widely used in DSP applications because of its computationally efficient algorithm. The significant advantage of the FFT over a direct implementation of the Discrete Fourier Transform (DFT) is the number of required Multiply and ACcumulate (MAC) cycles, as Table 1 clearly shows. However, the FFT is a batch process and requires more than N*Log(N) cycles [1, 2] to complete an N-point transform. Any FFT hardware implementation must provide a bit-reversal sorting mechanism, which is complicated [3] regardless of the number of points, and the batch process cannot begin until data collection is complete. This paper introduces a synergistic algorithm-plus-architecture implementation of the Fourier coefficients based on the DFT: the SIFT paradigm. SIFT uses a transactional process that requires no sample storage or register-addressing hardware to compute the Fourier Transform, attaining very low latency, low power, and high performance. The computation provides the same number of coefficients as the number of samples: for all cases where N is even and the samples are real-valued, at fi = 0 or fi = N/2 the factor sin(2*π*fi*xj/N) is zero (and the cosine factor is ±1), so the corresponding Bi vanish. A 64-point discrete-IC SIFT was designed; its micrograph is shown in figure 14. An ASIC core of the design was completed and fully synthesized in a 0.18 µm process. The simulated results of the core are presented in the following sections of this paper.
A 1024-point complex FT, based on 32-point SIFT cells, completes a transform in 3.2 µs, performing 20 GMACs/s while dissipating 500 mW and occupying an area of 16 mm2 when fabricated in a 0.18 µm process.

An overview of DFT and FFT

The FFT, based on decimation, and the FFTW, based on divide and conquer, remain the most efficient algorithms for converting information from the time domain to the frequency domain on Von Neumann, Harvard, RISC and DSP processors. This important analytical tool has been used in many fields, such as acoustics, communications, and signal and image processing. An overview of the DFT, the FFT, and SIFT, with implementation results, is presented in this paper. Equations (1) and (2) are used to compute the coefficients of the DFT; this digital approximation to the Fourier Series is what is actually used in most analog situations:

Ai = Σ(j = 1..N) S(xj) * cos(2*π*fi*xj/N)    (1)
Bi = Σ(j = 1..N) S(xj) * sin(2*π*fi*xj/N)    (2)

The function S(xj) is a finite set of evenly spaced samples. These samples are assembled in a memory to be processed. The number of samples in a group, N, is commonly a power of two, i.e. 8, 16, 64, …, 1024. Two coefficients are calculated for the frequency whose period equals the time of N samples (f0); this is called the base frequency. The same format repeats for 2*f0 up to N/2*f0. In the cases fi = N/2*f0 and fi = 0*f0, the Bi are zero by identity. The DFT procedure is to multiply each sample by the sine and by the cosine of the value of the independent variable times the rate, and to sum over all of the samples. This requires N MACs for each of N coefficients, or N-squared MACs per DFT. Examination of the process shows that many of the multiplier values repeat due to the periodicity of the trigonometric functions. Samples to be multiplied by equal values can be multiplied once, and the product reused in the locations that share that value. This is the basis of the butterfly, which reduces the number of multiplies from N-squared to N*Log(N) in the FFT. Table 1. # MAC cycles vs.
# Points in Transform

N (# of points)    8     16     32      64      128      256      512       1024
DFT: N*N          64    256   1,024   4,096   16,384   65,536   262,144   1,048,576
FFT: N*Log(N)     24     64     160     384      896    2,048     4,608     10,240

FFT butterfly: bit-reversal sorting algorithm and sample dependency

In current systems, a multiply-add may require the same time as an add. The butterfly, however, requires a complex structure [1, 2, 3] for storing and addressing the samples and the values of the functions of the independent variables. The complexity exists because the FFT butterfly operates by first decomposing an N-point time-domain signal into N single-point time-domain signals using the bit-reversal sorting algorithm; the N-point frequency spectrum is then combined in the exact reverse of the order in which the time-domain decomposition took place. As a result, the FFT requires hardware to accommodate the sorting algorithm and more than N*Log(N) cycles to complete an N-point transform [1, 2]. The sorting hardware already exists in general-purpose computing architectures. The graph in figure 2 shows the number of cycles required for SIFT and for the FFT butterfly in hardware implementations. For a 32-point transform, to complete 32*Log(32) = 160 calculations, the FFT butterfly needs 733 cycles in single precision (the FFT(s) curve) and 2097 cycles in double precision [1] (the FFT(d) curve), versus 1024 cycles for SIFT. When the number of points drops below 16, SIFT requires the smallest number of cycles per transform. The sorting and switching hardware present in a general-purpose computer adds to the cycle time and power requirements. As will be shown in the 1024-point structure, parallelism comes easily with SIFT, and the number of cycles may be reduced dramatically. Note that the multiplies in the FFT butterfly require a specific sample sequence that is very different from the order of arrival, and calculation of the value of each coefficient requires input from all of the samples.
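The reordering that the butterfly depends on can be made concrete. Below is a minimal Python sketch of the bit-reversal permutation (the function name is illustrative, not from the paper):

```python
def bit_reversed_order(n_points):
    """Index permutation required by the radix-2 FFT butterfly: sample i is
    moved to the position given by reversing the bits of i. n_points must be
    a power of two."""
    bits = n_points.bit_length() - 1           # log2(n_points)
    order = []
    for i in range(n_points):
        rev = 0
        for b in range(bits):
            rev = (rev << 1) | ((i >> b) & 1)  # peel LSBs of i, push onto rev
        order.append(rev)
    return order

# For 8 points, samples must be shuffled into the order 0,4,2,6,1,5,3,7
# before the in-place butterfly stages can run.
order = bit_reversed_order(8)
```

Because index i and its bit-reversed counterpart must both be live at swap time, a hardware realization needs addressable sample storage, which is exactly the overhead the SIFT approach avoids.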
Figure 1 shows that a load instruction is always required before an execute instruction in an FFT machine, whereas SIFT allows continuous execution. In an FFT design, all of the samples must be present before the process can begin [3]; this batch requirement derives from the FFT's bit-reversal sorting algorithm. The time from the arrival of the last sample until the availability of the coefficients is referred to as the latency (hidden time).

Sample Integrated Fourier Transform paradigm

Expansion of equation (1) yields a set of N coefficients A1, A2, …, AN; a similar expansion can be done for equation (2):

A1 = S1 cos(2πf1*1/N) + S2 cos(2πf1*2/N) + … + SN cos(2πf1*N/N)
A2 = S1 cos(2πf2*1/N) + S2 cos(2πf2*2/N) + … + SN cos(2πf2*N/N)
…
AN = S1 cos(2πfN*1/N) + S2 cos(2πfN*2/N) + … + SN cos(2πfN*N/N)

Noticing that the first term of every coefficient is contributed by the first sample, it is possible to compute and store the first component of all coefficients upon the arrival of the first sample. If this can be done before the arrival of the second sample, it follows that each coefficient can be updated again before the arrival of the third sample. Extending this procedure to the Nth sample (figure 5), we can complete and output the N coefficients by the time the first sample of the next set arrives. Hence the name Sample Integrated Fourier Transform. The SIFT paradigm yields several advantages for hardware implementation. The first advantage is shown in figure 2: if the available computational resource is just capable of completing these steps as the samples arrive, it finishes at the end of the Nth sample time. If the same resource is instead dedicated to a computation that begins only after the last sample arrives, it finishes at the last sample time of the next sample period. The new paradigm has lower latency than the FFT because it is a transactional process, whereas the FFT algorithm is a batch process.
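The per-sample accumulation described above can be sketched as follows. This is a software model of the arithmetic only, not the register-level design; the function name and the indexing convention (samples numbered 1..N, as in the expansion above) are illustrative:

```python
import math

def sift_stream(samples):
    """Fold each arriving sample's contribution into every coefficient, then
    discard the sample: A_i += S_j*cos(2*pi*f_i*j/N) and
    B_i += S_j*sin(2*pi*f_i*j/N). After the Nth sample the full coefficient
    set is ready; total work is N MACs per sample, N*N overall."""
    N = len(samples)
    A = [0.0] * (N // 2 + 1)                  # cosine coefficients, f_0 .. (N/2)*f_0
    B = [0.0] * (N // 2 + 1)                  # sine coefficients
    for j, s in enumerate(samples, start=1):  # one update pass per arriving sample
        for fi in range(N // 2 + 1):
            A[fi] += s * math.cos(2 * math.pi * fi * j / N)
            B[fi] += s * math.sin(2 * math.pi * fi * j / N)
    return A, B

# Two full cycles of a sine wave should land entirely in B[2].
N = 8
samples = [math.sin(2 * math.pi * 2 * j / N) for j in range(1, N + 1)]
A, B = sift_stream(samples)
```

Note that no sample is retained after its update pass; keeping the last N samples would additionally permit the sliding update (subtract the oldest contribution, add the newest) at the same per-sample cost.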
Secondly, when a sample arrives it is used to update the N coefficients, and then that sample is no longer needed. There is no need to store or address samples except while each is the current sample. Because the samples are simply consumed on the fly, none of the elaborate special-purpose hardware to store, address and process them is required; the situation is illustrated in figure 11. This yields an algorithm with a lower processing requirement. Thirdly, each coefficient may be updated on the arrival of the next sample by subtracting the oldest contribution and adding the contribution from the current sample. This permits a sequence of complete transforms, updated at each sample time, without additional computational overhead; the situation is illustrated in figure 6. The beginning of the window has moved over: the convolution with the aperture is different, but the wave represented is the same. The information content is latent in this case; there is no loss of information compared to the classical Fourier Transform.

Hardware Implementation and Results

Figure 4 shows the block diagram of the 64-point SIFT. The chip is divided into five units: the Control Unit, Aspect Generator, Multiplier, Adder/Subtractor and RAM. The memory is a 64-entry, 18-bit RAM [7]. The Control Unit generates all the control signals for the chip. A 13-bit binary counter in the Control Unit governs the timing of the other blocks. The LSB of the counter is used as a timing signal in the design. Counter bits [6:1] sequentially select the sixty-four coefficients to be updated by the current sample, and bits [12:7] assign the sample number and the appropriate aspect function. The maximum significant value enables the output of the coefficients as they respectively complete their final updates. The Aspect Generator (figure 9) consists of a six-bit counter that generates the sequences of cosine and sine values.
The sample number from the main counter is used as the increment value for the six-bit counter. To generate the first sequence, the counter increments by one count; for the 2nd, 3rd, 4th, …, Nth sequences it increments by 2, 3, 4, …, N. The output of the Aspect Generator, called the "sequence," is 6 bits wide. The two MSBs, bit 5 and bit 4, are used to decode the quadrant of the aspect value. The Multiplier unit multiplies the aspect function by the absolute value of the sample, then divides the product by 128 for normalization. The division is a seven-bit shift achieved by wiring offset, which simplifies the hardware implementation and allows the use of a pure integer multiplier. The Adder/Subtractor unit consists of a fast adder and a controlled inverter to perform 2's-complement arithmetic. The simulated waveforms of the forward SIFT for sin(x) are shown in figure 10. Figures 12 and 13 show the inverse FT waveforms of A*cos(2π*f1*x/N) + B*sin(2π*f2*x/N) at the frequencies (f1, f2) = (16, 63) and (31, 63), respectively. Functional results have been verified against MATLAB simulations and against hardware results from the discrete-IC SIFT board design shown in figure 14. In a 0.18 µm process, the 64-point SIFT runs at 330 MHz, has zero latency, attains a 13 µs execution time, and dissipates 8 mW. The 64-point SIFT core has an area of 0.21 mm2 and is shown in figure 7. Figure 8 shows the top-level diagram of a 1024-complex-point SIFT. This design is built from 64 32-point SIFT cells. In a 0.18 µm CMOS process, the design delivers a 20 GMACs/s sustained rate and occupies an area of 16 mm2. At its maximum throughput of 320 million samples per second it delivers 320 million coefficients per second, producing a full transform every 3.2 µs. At this sample rate it dissipates 500 mW; at lower sample rates the power is proportionately less.
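The construction of a 1024-point transform from 32-point cells can be modeled numerically. In the sketch below, plain complex DFTs stand in for the SIFT cells, and the inter-phase twiddle multiplication is our assumption about how the two passes are stitched together (the standard row/column, i.e. Cooley–Tukey, factorization); the function names and the small 4×4 demonstration size are illustrative:

```python
import cmath

def dft(x):
    """Direct complex DFT, standing in for one SIFT cell."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n)
                for j in range(n))
            for k in range(n)]

def two_phase_transform(x, rows, cols):
    """N = rows*cols point transform via two phases of small transforms,
    as in a two-dimensional FT (rows = cols = 32 gives the 1024-point case)."""
    n = rows * cols
    # Phase 1: 'cols' interleaved transforms over sample sets strided by cols.
    grid = [dft([x[r * cols + c] for r in range(rows)]) for c in range(cols)]
    # Twiddle factors between the two phases (our assumption; see above).
    grid = [[grid[c][k1] * cmath.exp(-2j * cmath.pi * k1 * c / n)
             for k1 in range(rows)]
            for c in range(cols)]
    # Phase 2: 'rows' transforms taken in the other direction.
    out = [0j] * n
    for k1 in range(rows):
        col = dft([grid[c][k1] for c in range(cols)])
        for k2 in range(cols):
            out[k1 + rows * k2] = col[k2]
    return out
```

Because phase 1 operates on strided, interleaved sample sets as they arrive, each small cell can consume its samples in SIFT fashion; only the intermediate grid between the two phases must be held.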
The structure operates by splitting the problem into two phases. The incoming samples are handled as 32 interleaved transforms; the resulting coefficients are then taken, level by level, as 32 interleaved transforms in the other direction, as in the case of a two-dimensional Fourier Transform. This introduces a latency of one. At all sample rates the design is the most energy-efficient and smallest 1024-point Fourier Transform solution currently announced.

References:

[1] Mike Hannah, et al., "Implementation of the Double Precision Complex FFT for the TMS320C54x DSP," Texas Instruments Application Report SPRA554B, August 1999.
[2] Guy R. L. Sohie, et al., "Implementation of Fast Fourier Transform on Motorola's Digital Signal Processors," Motorola High Performance DSP Technology, APR4/D, rev. 3, Sections 4, 6, 8.
[3] Steven W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing, pp. 228-232, California Technical Publishing, 1997.
[4] Bevan M. Baas, "A 9.5 mW 330 µsec 1024-point FFT processor," CICC, 1998.
[5] R. N. Bracewell, The Fourier Transform and Its Applications, 2nd ed., New York, 1986.
[6] J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Mathematics of Computation, vol. 19, pp. 297-301, April 1965.
[7] Trang K. Ta, et al., "Dual Port SRAM Design: An Overview," IBM Circuit TTL Conference, 1991.

Figure 1: SIFT pipeline vs. FFT butterfly pipeline
Figure 2: Number of cycles vs.
points for SIFT, single-precision FFT(s), and double-precision FFT(d)
Figure 3: SIFT allows continuous execution as samples arrive
Figure 4: Block diagram of the 64-point SIFT
Figure 5: Each sample contributes to the coefficients upon arrival
Figure 6: Coefficients can be updated by subtracting the oldest contribution
Figure 7: Micrograph of the 64-point FT core
Figure 8: Architecture of the 1024-point FT
Figure 9: Main portion of the Aspect Generator
Figure 10: FT of sin(x)
Figure 11: No cache or buffers required for SIFT, as opposed to the FFT butterfly
Figure 12: Inverse FT of A*cos(2πf1x/N) + B*sin(2πf2x/N) at frequencies (16, 63)
Figure 13: Inverse FT of E*cos(2πf1x/N) + F*sin(2πf2x/N) at frequencies (31, 63)
Figure 14: Micrograph of the discrete-IC 64-point SIFT design