Progress In Science and Engineering Research Journal ISSN 2347-6680 (E)

VARIABLE PRECISION FLOATING POINT MULTIPLICATION, DIVISION AND SQUARE ROOT IMPLEMENTATION ON FPGA USING VHDL LANGUAGE

Varun Singh Sengar1, Dr. Vinod Kapse2
M-Tech Scholar, Professor & Head, Department of EC, GGITS, Jabalpur

Abstract: Computers were originally built as fast, reliable and accurate computing machines. However large computers get, one of their main tasks will always be to perform computation. Most of these computations need real numbers, an essential category in any real-world calculation. The set of real numbers is infinite; therefore no finite representation method is capable of representing all real numbers, even within a small range. Thus, most real values have to be represented in an approximate manner. The scope of this paper includes the study and implementation of Adder/Subtractor, Multiplication, Division and Square root functional units using an HDL, for computing arithmetic operations and functions suited to hardware implementation. Floating point arithmetic is widely used in a large set of scientific and signal processing computations. Hardware implementation of floating point arithmetic is more complex than for fixed point numbers, and this puts a performance limit on several of these applications. Several works have also focused on their implementation on FPGA platforms.

Index Terms: VHDL, FPGA, Variable Precision Floating Point.

I. INTRODUCTION

Many scientific applications rely on floating point (FP) computation, often requiring a variable precision (V.P.) format that generalizes the formats specified by the IEEE standard 754. The use of the V.P. data type improves the accuracy and dynamic range of the computation, but it simultaneously increases the complexity and lowers the performance of the arithmetic modules. The design of high performance floating point units (FPUs) is thus of interest in this domain.

Among all arithmetic operations, the performance of multiplication, division and square root can differ based on the algorithm implemented. They can strongly affect the total performance of the application running them. In this paper, we improve these three operations using a table based approach. Our implementation is written in Very High Speed Integrated Circuits Hardware Description Language (VHDL) and implemented on FPGAs. These components have been implemented using the development environment of Altera, a major FPGA vendor. These implementations also provide good tradeoffs among hardware resource utilization, maximum clock frequency and latency. Users can change the latency by adjusting the parameters of the components. In addition to supporting the IEEE 754 standard representations, which include single and double precision, these components can be customized by the user to specify the length of the exponent and the mantissa. This will contribute to more resource and energy efficiency in systems where variable precision is used.

Floating point arithmetic implementations involve processing the sign, exponent and mantissa parts separately, and then combining them after rounding and normalization. The IEEE standard for floating point (IEEE-754) specifies how single precision (32 bit) and double precision (64 bit) floating point numbers are to be represented.

Corresponding Author:
1. Mr. Varun Singh Sengar, M-Tech Scholar, Department of EC, Gyan Ganga Institute of Technology & Sciences, Jabalpur, India
Email Id: [email protected]
2. Dr. Vinod Kapse, Professor & Head, Department of EC, Gyan Ganga Institute of Technology & Sciences, Jabalpur

II.
LAYOUT FLOW

Algorithms Description: Here we have designed a library of variable precision floating point units written in VHDL. Components include floating point arithmetic (addition, subtraction, multiplication, division, square root, accumulation).

© 2015 PISER Journal http://.piserjournal.org/ PISER 18, Vol.03, Issue: 02/06 March – April ; Bimonthly International Journal Page(s) 0018-023

Fig.1. Hierarchy for the proposed design of variable length floating point arithmetic

Fig.2. Combinations of expected inputs and outputs for variable length floating point arithmetic

Operands can be variable precision floating point numbers beyond the standard IEEE 754 formats. Any bit width of exponent or mantissa is supported. Figure 1 shows the standard input and output ports for components in the circuit. Each component has inputs READY, STALL, ROUND and EXCEPTION IN and outputs DONE and EXCEPTION OUT specifically to handle pipelining. The READY and DONE signals are used for determining when the inputs are ready and when the results are available for use. STALL allows a bubble to be inserted into the pipeline if needed. ROUND has two modes: round to zero (truncation) and round to nearest. The exception signals propagate an exception flag, together with a value that may be incorrect, through the pipeline. For reciprocal and division, an exception is identified if the divisor is zero. For square root, an exception is identified if the operand is a negative number. Otherwise, the exception input is propagated to the exception output.

Division

To find the reciprocal $1/Y$ or the quotient $X/Y$, the algorithm needs three steps: reduction, evaluation, and post-processing. Generally speaking, the reduction step and the evaluation step are the same for both reciprocal and division. The only difference is within the post-processing step.

In the reduction step, after normalizing, the fractional part of the floating point number satisfies $1 \le Y < 2$. Assume $Y$ has an $m$ bit significand and $k = \lfloor (m+2)/4 \rfloor + 1$; $Y^{(k)}$ represents the truncation of $Y$ to $k$ bits. Define $R$ as the reciprocal of $Y^{(k)}$. Define $M = R$ for computing the reciprocal and division. For example, in double precision, based on the equation above, $m = 53$ and $k = 14$. So $R$ can be determined using a look-up table with a 14 bit address. The number of bits for the look-up table output for $R$ is 16.

In the evaluation step, $B$ is defined as the Taylor series expansion of $f(A) = 1/(1 + A)$, where $A$ is defined as $(Y \times R) - 1$. Also note that $-2^{-k} < A < 2^{-k}$. For $z = 2^{-k}$, $A$ can be represented as:

$$A = A_2 z^2 + A_3 z^3 + A_4 z^4 + \cdots$$

where $|A_i| \le 2^k - 1$. We ignore the smaller terms that do not contribute to the final result.

Using the Taylor series expansion,

$$\begin{aligned}
B = f(A) &= C_0 + C_1 A + C_2 A^2 + C_3 A^3 + C_4 A^4 + \cdots \\
&\approx C_0 + C_1 (A_2 z^2 + A_3 z^3 + A_4 z^4) + C_2 (A_2 z^2 + A_3 z^3 + A_4 z^4)^2 \\
&\quad + C_3 (A_2 z^2 + A_3 z^3 + A_4 z^4)^3 + C_4 (A_2 z^2 + A_3 z^3 + A_4 z^4)^4 \\
&\approx C_0 + C_1 A + C_2 A_2^2 z^4 + 2 C_2 A_2 A_3 z^5 + C_3 A_2^3 z^6
\end{aligned}$$

Here, $C_i = 1$ when $i$ is even and $C_i = -1$ when $i$ is odd. Simplifying, we get:

$$\begin{aligned}
B = f(A) &\approx 1 - A_2 z^2 - A_3 z^3 + (-A_4 + A_2^2) z^4 + 2 A_2 A_3 z^5 - A_2^3 z^6 \\
&\approx (1 - A) + A_2^2 z^4 + 2 A_2 A_3 z^5 - A_2^3 z^6
\end{aligned}$$

The equation above is used in the implementation of reciprocal and division.

In the post-processing step, the final result is given by multiplication. For reciprocal, the result $1/Y$ is given by the product of $M$ and $B$:

$$1/Y = M \times B.$$

For division, the result $X/Y$ is given by the product of $M$, $B$ and $X$:

$$X/Y = M \times B \times X.$$
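The reduction, evaluation and post-processing steps for reciprocal and division can be checked numerically with a small software model. The Python sketch below is only an illustration under stated assumptions, not the VHDL implementation: the function names are hypothetical, exact rationals stand in for fixed-point hardware, the table output $R$ is assumed to carry $k + 2$ fractional bits (16 when $k = 14$), and the cubic $1 - A + A^2 - A^3$ is evaluated directly rather than through the reduced partial products in $A_2$ and $A_3$.

```python
import math
from fractions import Fraction


def table_parameters(m: int):
    """Return (k, r_bits) for an m-bit significand.

    k follows k = floor((m + 2) / 4) + 1 from the text; r_bits = k + 2 is an
    assumption chosen so that r_bits = 16 when k = 14.
    """
    k = (m + 2) // 4 + 1
    return k, k + 2


def reduce_operand(y: Fraction, m: int = 53):
    """Reduction step: return (R, A) for a mantissa 1 <= y < 2."""
    k, r_bits = table_parameters(m)
    y_k = Fraction(math.floor(y * 2**k), 2**k)   # Y truncated to k fractional bits
    r_int = round(Fraction(2**r_bits) / y_k)     # models one look-up table entry
    r = Fraction(r_int, 2**r_bits)
    a = y * r - 1                                # A = Y * R - 1, roughly 2**-k in size
    return r, a


def reciprocal(y: Fraction, m: int = 53) -> Fraction:
    """Evaluation and post-processing: 1/Y = M * B with M = R."""
    r, a = reduce_operand(y, m)
    b = 1 - a + a**2 - a**3                      # truncated Taylor series of 1 / (1 + A)
    return r * b


def divide(x: Fraction, y: Fraction, m: int = 53) -> Fraction:
    """Post-processing for division: X/Y = M * B * X."""
    return x * reciprocal(y, m)
```

In this model the product of the result and $Y$ equals $1 - A^4$ exactly, so the error of the mantissa reciprocal is governed by the size of $A$ that the table look-up leaves behind.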
Square Root

For computing square root, there are three steps similar to the reciprocal computation. The first step is reduction, the second is evaluation, and the last step is post-processing.

In the reduction step, the difference between computing square root and reciprocal is that after getting $R$, $M$ is assigned $1/\sqrt{R}$. So another look-up table to compute the inverse square root of $R$ is needed. Notice that if the exponent is odd, there will be $\sqrt{2}$ as a coefficient to multiply the result by at the last step. For the last step, we check the last bit of the exponent. 0 means the exponent is an even number, so we assign $1/\sqrt{R}$ to $M$. If not, $M$ equals $\sqrt{2}/\sqrt{R}$. To compute $\sqrt{2}/\sqrt{R}$, we create another look-up table.

In the evaluation step, $f(A)$ is not $1/(1 + A)$ as above. Instead let

$$f(A) = \sqrt{1 + A}$$

The Taylor series expansion of $f(A)$ still has the same form as that of the reciprocal:

$$\begin{aligned}
B = f(A) &= C_0 + C_1 A + C_2 A^2 + C_3 A^3 + C_4 A^4 + \cdots \\
&\approx C_0 + C_1 (A_2 z^2 + A_3 z^3 + A_4 z^4) + C_2 (A_2 z^2 + A_3 z^3 + A_4 z^4)^2 \\
&\quad + C_3 (A_2 z^2 + A_3 z^3 + A_4 z^4)^3 + C_4 (A_2 z^2 + A_3 z^3 + A_4 z^4)^4 \\
&\approx C_0 + C_1 A + C_2 A_2^2 z^4 + 2 C_2 A_2 A_3 z^5 + C_3 A_2^3 z^6
\end{aligned}$$

However, the coefficients change accordingly: $C_0 = 1$, $C_1 = 1/2$, $C_2 = -1/8$, $C_3 = 1/16$. Thus,

$$B = f(A) = 1 + \frac{A}{2} - \frac{1}{8} A_2^2 z^4 - \frac{1}{4} A_2 A_3 z^5 + \frac{1}{16} A_2^3 z^6$$

In the post-processing step, the final result of the square root is given by the product of $M$ and $B$:

$$\sqrt{Y} = M \times B.$$

Implementation

Overview

Variable precision multiplier, division and square root are designed as part of the library. Figure 3 shows the dataflow in the top level component of the library. The component consists of the module to de-normalize the input floating point numbers, the core computation (multiplication, division or square root), and the round to normal component for adjusting to the standard floating point format. Two main modules are common to all components: de-normalizing and round to normal. So in this section we first discuss the implementation of the common elements and then the core components of multiplier, division and square root separately.

Figure 3: Top Level Hierarchy of Components

De-normalizer

An IEEE 754 standard floating point number consists of a sign bit, an exponent and a mantissa. As described, the stored part of the mantissa does not include the leading '1' as the integer part. Since this '1' is required for computation, the de-normalizer component adds it as the most significant bit of the fractional part. Based on the number of operands, we might need one de-normalizer for reciprocal and square root or two for division.

Round to Normal

There are four rounding modes specified in the IEEE standard: round to zero, round to nearest, round to negative infinity and round to positive infinity. In our library, there are two options for rounding. One is round to zero, which is truncation. The other is round to nearest. If the rounding input signal to a library component is zero, round to nearest is implemented; otherwise it is round to zero. The round to normal component also removes the integer '1' of the mantissa used in the computation. After rounding and normalizing in this component, the floating point result in IEEE standard format will be on the output port.

Multiplier

Figure 4 shows the components and dataflow within the reciprocal. This corresponds to the algorithm described above. After obtaining the de-normalized number, the sign bit of the result is the same as that of the de-normalized number. For the exponent part, first we need to obtain the actual value of the exponent by subtracting the bias from the exponent. The second step is to subtract it from zero. The third step is to add the corresponding bias to the number obtained from the second step. This is how to get the exponent of the result.
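The de-normalizer and the exponent steps for the reciprocal can be written out as integer arithmetic. The following Python sketch is a model under stated assumptions: the helper names are hypothetical, the bias is taken as $2^{e-1} - 1$ for an $e$ bit exponent as in IEEE 754, and any further exponent adjustment needed when the reciprocal mantissa falls below 1 is assumed to belong to the normalization stage, which the text does not detail.

```python
def exponent_bias(exp_bits: int) -> int:
    """Bias of an exp_bits-wide exponent field (IEEE 754 convention, assumed)."""
    return 2 ** (exp_bits - 1) - 1


def denormalize(stored_mantissa: int, man_bits: int) -> int:
    """Restore the hidden leading '1' above the man_bits stored fraction bits."""
    return (1 << man_bits) | stored_mantissa


def reciprocal_exponent(stored_exp: int, exp_bits: int) -> int:
    """Biased exponent of 1/Y before normalization, following the three steps."""
    bias = exponent_bias(exp_bits)
    actual = stored_exp - bias    # step 1: remove the bias
    negated = 0 - actual          # step 2: subtract from zero
    return negated + bias         # step 3: add the bias back
```

For a double precision operand (11 exponent bits, bias 1023) with stored exponent 1025, that is an actual exponent of 2, the function returns 1021, an actual exponent of -2. Because the field widths are parameters, the same helpers apply to any exponent and mantissa length the library supports.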
Figure 4: Multiplier

Take the double precision reciprocal operation as an example. The component uses the reciprocal look-up table to calculate R, plus 4 multipliers. The look-up table has 14 bits of address and 16 output bits for a total of 32K bytes. The four multipliers have the following sizes. Multiplier YR has 17 bit and 58 bit inputs and a 71 bit output; its purpose is to multiply Y and R in order to get A. Multiplier S has two 14 bit inputs and a 28 bit output; its purpose is to compute A2*A2 and A2*A3. Multiplier M has 28 bit and 14 bit inputs and a 42 bit output; it computes the cube of A2. Multiplier L has one 16 bit and one 58 bit input and a 74 bit output; it computes R*B. The pipeline depth of those multipliers can be adjusted by a parameter. This is one way to modify the clock cycle latency of the components.

Floating Point Division

For floating point division, we need to de-normalize the two input operands, compute the sign bit, compute the exponent and perform the division of the significands. The sign bit of the result is the XOR of the signs of the two inputs. The exponent of the result is the difference between the exponents of the inputs with the bias suitably adjusted. Figure 5 shows the components and dataflow within the division.

Figure 5: Division

Division consists of the reciprocal followed by the multiplication described above. After getting the reciprocal of the divisor, the next step is to multiply the reciprocal with the mantissa of the dividend. This multiplier is called the mantissa multiplier XY. For example, in double precision the inputs to the mantissa multiplier are 53 bits and 57 bits wide, and the output bit width is 110.

Square Root

The first step in computing square root is to check that the sign bit is positive. If not, the exception out is set high to indicate that an exception situation has occurred. For computing the exponent, first get the exponent value by subtracting the bias from the de-normalized exponent. Second, if the intermediate number is even, then divide it by 2. Otherwise, subtract 1 and then divide by 2. Next, add the bias to the number obtained from the second step, and this is the temporary exponent part of the result. If the exponent is an odd number, an extra factor of √2 is multiplied into the fractional part of the mantissa result.
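The square root exponent handling described in this section, combined with the table-based mantissa evaluation, can be modelled in a few lines of Python. This is a sketch under stated assumptions rather than the VHDL design: the function names are hypothetical, the parity test is applied to the unbiased exponent, the table output R is assumed to have k + 2 fractional bits, and the table holding 1/√R or √2/√R is assumed to be truncated to 54 fractional bits (55 bits including the integer bit).

```python
import math
from fractions import Fraction


def sqrt_exponent(stored_exp: int, exp_bits: int):
    """Return (temporary biased exponent, odd flag) for the square root."""
    bias = 2 ** (exp_bits - 1) - 1
    actual = stored_exp - bias        # step 1: remove the bias
    odd = actual % 2 == 1             # Python's % is non-negative, so this also covers negatives
    if odd:
        actual -= 1                   # odd exponent: subtract 1 before halving
    return actual // 2 + bias, odd    # halve, then add the bias back


def sqrt_mantissa(y: Fraction, odd: bool, m: int = 53, m_frac_bits: int = 54) -> Fraction:
    """Approximate sqrt(y), or sqrt(2 * y) when odd, for a mantissa 1 <= y < 2."""
    k = (m + 2) // 4 + 1
    r_bits = k + 2                                  # assumed precision of table R
    y_k = Fraction(math.floor(y * 2**k), 2**k)      # Y truncated to k fractional bits
    r_int = round(Fraction(2**r_bits) / y_k)        # entry of look-up table R
    a = y * Fraction(r_int, 2**r_bits) - 1          # A = Y * R - 1
    # Entry of the second table: 1/sqrt(R), or sqrt(2)/sqrt(R) for an odd exponent,
    # truncated to m_frac_bits fractional bits using an exact integer square root.
    scale = r_bits + (1 if odd else 0) + 2 * m_frac_bits
    m_int = math.isqrt((1 << scale) // r_int)
    b = 1 + a / 2 - a**2 / 8 + a**3 / 16            # Taylor series of sqrt(1 + A)
    return Fraction(m_int, 2**m_frac_bits) * b      # post-processing: M * B
```

For example, an operand 1.5 × 2^3 = 12 in double precision has stored exponent 1026; `sqrt_exponent(1026, 11)` gives a temporary exponent of 1024 with the odd flag set, and `sqrt_mantissa(Fraction(3, 2), True)` approximates √3, so the result is about √3 × 2 = √12.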
Figure 6: Square Root

Figure 6 shows the components and dataflow of the square root. This corresponds to the algorithm described above. The square root component uses the reciprocal look-up table R to calculate R, look-up table M to get 1/√R when the exponent is even, the M Mul2 look-up table to get √2/√R when the exponent is odd, and 4 multipliers.

For example, in double precision operation, the look-up table R has 14 address bits and 16 output bits for a total of 32K bytes. The look-up table M has 14 address bits and 55 output bits. For look-up table M Mul2, the input is 14 bits and the output is 58 bits. For the four multipliers, multiplier YR has 17 bit and 58 bit inputs and a 71 bit output; its purpose is to multiply Y and R in order to get A. Multiplier S has two 14 bit inputs and a 28 bit output; Multiplier M has 28 bit and 14 bit inputs and a 42 bit output; Multiplier L has one 16 bit and one 58 bit input and a 74 bit output.

III. CONCLUSION

In this section, we have described the implementation of multiplication, division and square root based on the table-based method described in the paper written by Ercegovac and Lang [5]. In the next section, we present results from implementing these components on Altera.

Experimental Results

Our implementation of the algorithms for variable precision floating point is written in VHDL and targets one of the most popular commercial FPGA vendors, Altera. For our experimental results, we built our reciprocal, division and square root components and then simulated and synthesized them on Altera's platform. We used a range of precisions supported by our variable precision floating point format. For Altera, we synthesized with the Altera IDE tool and targeted a Stratix V device. The designs make use of embedded multipliers and RAMs, which require using the intellectual property components provided with each set of tools. For Altera these are called Megacores. We also provide the clock cycle latency, maximum clock frequency and resource usage for Altera.

Hardware Components on Stratix

Stratix II Series FPGAs [1] are Altera's 28-nm FPGAs which are optimized for high performance, high bandwidth applications. Figure 6 shows the architecture and features of Stratix II. The Stratix II device family contains GT, GX, GS and E sub-families.

Parameters                   Multiplication   Division     Sqr. root
Top level entity name        Fpmul_clk        FPdiv_clk    FPsqrt_clk
Family                       Stratix II       Stratix II   Stratix II
Met timing requirements      Yes              Yes          Yes
Logic utilization            5%               2%           3%
Comb. ALUTs                  450              274          280
Dedicated logic registers    444              148          188
Total registers              444              148          188
Total pins                   67               67           45
Block memory bits / DSP block 9-bit elements: 88 0 42 0 2 0

Table1. Resource summary of variable length Floating point arithmetic (Multiplication, Division, Square root)

The core logic fabric consists of high performance adaptive logic modules (ALMs). Each ALM has eight inputs with a fracturable LUT, two embedded adders and four dedicated registers. Variable-precision DSP blocks contain two 18 by 18 multipliers. The results can be summed, forming a multiply accumulator. M20K memory blocks provide good memory block performance and can simplify floor planning and routability. Optional hard error correction code (ECC) protection enables reliable delivery of the data.

Conclusions and Future Work

We have presented variable precision floating point multiplication, division and square root implementations that produce hardware components with a good trade off of maximum clock frequency, clock cycle latency and resource utilization. These implementations are cross-platform, meaning they can be implemented on both Altera and Xilinx FPGAs. Our designs also make use of the embedded multiplier cores and embedded RAM cores commonly found in modern FPGA fabric. In the future, we plan to improve our library even further. In particular, we will focus on improving the division frequency by optimizing the critical path and trying different levels of pipelining.

REFERENCES

[1] Altera Corp. Stratix II website. http://www.altera.com/devices/fpga/stratix-fpgas/stratix-v/stxv-index.jsp.
[2] I. S. Committee et al. 754-2008 IEEE standard for floating-point arithmetic. IEEE Computer Society Std, 2008.
[3] F. de Dinechin and B. Pasca. Designing custom arithmetic data paths with FloPoCo. Design & Test of Computers, IEEE, 28(4):18–27, 2011.
[4] J. Detrey and F. de Dinechin. A tool for unbiased comparison between logarithmic and floating-point arithmetic. The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 49(1):161–175, 2007.
[5] M. D. Ercegovac, T. Lang, J.-M. Muller, and A. Tisserand. Reciprocation, square root, inverse square root, and some elementary functions using small multipliers. Computers, IEEE Transactions on, 49(7):628–637, 2000.
[6] R. Goldberg, G. Even, and P.-M. Seidel. An FPGA implementation of pipelined multiplicative division with IEEE rounding. In Field-Programmable Custom Computing Machines, 2007. FCCM 2007. 15th Annual IEEE Symposium on, pages 185–196. IEEE, 2007.
[7] P. Hung, H. Fahmy, O. Mencer, and M. J. Flynn. Fast division algorithm with a small lookup table. In Signals, Systems, and Computers, 1999. Conference Record of the Thirty-Third Asilomar Conference on, volume 2, pages 1465–1468. IEEE, 1999.
[8] M. K. Jaiswal and R. C. Cheung. High performance reconfigurable architecture for double precision floating point division. In Reconfigurable Computing: Architectures, Tools and Applications, pages 302–313. Springer, 2012.
[9] M. E. Louie and M. D. Ercegovac. A digit-recurrence square root implementation for field programmable gate arrays. In FPGAs for Custom Computing Machines, 1993. Proceedings. IEEE Workshop on, pages 178–183. IEEE, 1993.
[10] B. Pasca. Correctly rounded floating-point division for DSP-enabled FPGAs. In Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, pages 249–254. IEEE, 2012.
[11] P. Soderquist and M. Leeser. Area and performance tradeoffs in floating-point divide and square-root implementations. ACM Computing Surveys (CSUR), 28(3):518–564, 1996.
[12] P. Soderquist and M. Leeser. Division and square root: choosing the right implementation. Micro, IEEE, 17(4):56–66, 1997.