Progress In Science and Engineering Research Journal
ISSN 2347-6680 (E)
VARIABLE PRECISION FLOATING POINT
MULTIPLICATION, DIVISION AND SQUARE ROOT
IMPLEMENTATION ON FPGA USING VHDL LANGUAGE
Varun Singh Sengar (1), Dr. Vinod Kapse (2)
(1) M.Tech Scholar, (2) Professor & Head, Department of EC, GGITS, Jabalpur
Corresponding Author:
1. Mr. Varun Singh Sengar, M.Tech Scholar, Department of EC, Gyan Ganga Institute of Technology & Sciences, Jabalpur, India
Email Id: [email protected]
2. Dr. Vinod Kapse, Professor & Head, Department of EC, Gyan Ganga Institute of Technology & Sciences, Jabalpur
Abstract: Computers were originally built as fast, reliable and accurate computing machines. However large computers become, one of their main tasks will always be to perform computation. Most of these computations need real numbers, an essential category in any real-world calculation. The set of real numbers is not finite; therefore no finite representation method is capable of representing all real numbers, even within a small range. Thus, most real values have to be represented in an approximate manner. The scope of this paper includes the study and implementation of adder/subtractor, multiplication, division and square root functional units in HDL, computing arithmetic operations and functions suited to hardware implementation. Floating point arithmetic is widely used in a large set of scientific and signal processing computations. Hardware implementation of floating point arithmetic is more complex than for fixed point numbers, and this puts a performance limit on several of these applications. Several works have also focused on its implementation on FPGA platforms.
Index Terms: VHDL, FPGA, Variable Precision Floating Point.
I. INTRODUCTION
Floating point arithmetic implementations process the sign, exponent and mantissa parts separately and then combine them after rounding and normalization. The IEEE standard for floating point arithmetic (IEEE-754) specifies how single precision (32 bit) and double precision (64 bit) floating point numbers are to be represented.
Many of the scientific applications described above rely on floating point (FP) computation, often requiring a variable precision (V.P.) format based on the IEEE standard 754. The use of a V.P. data type improves the accuracy and dynamic range of the computation, but it also increases the complexity, and affects the performance, of the arithmetic modules. The design of high performance floating point units (FPUs) is thus of interest in this domain.
Among all arithmetic operations, the performance of multiplication, division and square root can differ based on the algorithm implemented, and these operations can strongly affect the total performance of the application running them. In this paper, we improve these three operations using a table based approach. Our implementation is written in Very High Speed Integrated Circuits Hardware Description Language (VHDL) and implemented on FPGAs. The components have been implemented using the development environment of Altera, a major FPGA vendor. These implementations provide good tradeoffs among hardware resource utilization, maximum clock frequency and latency. Users can change the latency by adjusting the parameters of the components. In addition to supporting the IEEE 754 standard representations, which include single and double precision, these components can be customized by the user to specify the length of the exponent and the mantissa. This contributes to better resource and energy efficiency in systems where variable precision is used.
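The split into sign, exponent and mantissa with user-chosen field widths can be illustrated with a small software model. This is only a sketch in Python, not the paper's VHDL: the function names `unpack` and `to_float` are hypothetical, an IEEE-style bias of 2^(exp_width-1) - 1 is assumed, and only normalized numbers are handled (no zero, subnormal, infinity or NaN).

```python
def unpack(bits: int, exp_width: int, man_width: int):
    """Split a raw bit pattern into (sign, biased exponent, stored mantissa)."""
    man = bits & ((1 << man_width) - 1)
    exp = (bits >> man_width) & ((1 << exp_width) - 1)
    sign = (bits >> (man_width + exp_width)) & 1
    return sign, exp, man


def to_float(bits: int, exp_width: int, man_width: int) -> float:
    """Value of a normalized number in a custom (exp_width, man_width) format."""
    sign, exp, man = unpack(bits, exp_width, man_width)
    bias = (1 << (exp_width - 1)) - 1          # assumed IEEE-style bias
    significand = 1 + man / (1 << man_width)   # restore the hidden leading 1
    return (-1) ** sign * significand * 2.0 ** (exp - bias)


# Single precision uses 8 exponent bits and 23 mantissa bits.
print(to_float(0x3FC00000, 8, 23))
# A narrower custom format: 5 exponent bits and 10 mantissa bits.
print(to_float(0xC000, 5, 10))
```

The same two functions cover any exponent and mantissa width, which is the sense in which the format is variable precision.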
II. LAYOUT FLOW
Algorithms Description: Here we have designed a library of variable precision floating point units written in VHDL. Components include floating point arithmetic (addition, subtraction, multiplication, division, square root, accumulation).
© 2015 PISER Journal
http://.piserjournal.org/
PISER 18, Vol.03, Issue: 02/06 March – April ; Bimonthly International Journal
Page(s) 0018-023
Fig. 1. Hierarchy for the proposed design of variable length floating point arithmetic
Fig. 2. Combinations of expected inputs and outputs for variable length floating point arithmetic
Operands can be variable precision floating point numbers beyond the standard IEEE 754 formats. Any bit width of exponent or mantissa is supported. Fig. 2 shows the standard input and output ports for components in the circuit. Each component has inputs READY, STALL, ROUND and EXCEPTION IN and outputs DONE and EXCEPTION OUT, specifically to handle pipelining. The READY and DONE signals determine when the inputs are ready and when the results are available for use. STALL allows a bubble to be inserted into the pipeline if needed. ROUND has two modes: round to zero (truncate) and round to nearest. The exception signals propagate through the pipeline an exception flag that marks a value that may be incorrect. For multiplication and division, an exception is identified if the divisor is zero. For square root, an exception is identified if the operand is a negative number. Otherwise, the exception input is propagated to the exception output.
Division
To find the reciprocal 1/Y or the quotient X/Y, the algorithm needs three steps: reduction, evaluation, and post-processing. Generally speaking, the reduction and evaluation steps are the same for both reciprocal and division. The only difference is within the post-processing step.
In the reduction step, after normalizing, the fractional part of the floating point number satisfies $1 \le Y < 2$. Assume Y has an m-bit significand and $k = \lfloor (m+2)/4 \rfloor + 1$; $Y^{(k)}$ represents the truncation of Y to k bits. Define R as the reciprocal of $Y^{(k)}$. Define M = R for computing the reciprocal and division. For example, in double precision, based on the equation above, m = 53 and k = 14. So R can be determined using a look-up table with a 14-bit address. The look-up table output for R is 16 bits wide.
In the evaluation step, B is defined as the Taylor series expansion of
$$f(A) = \frac{1}{1 + A}$$
where A is defined as $(Y \times R) - 1$. Also note that $-2^{-k} < A < 2^{-k}$. For $z = 2^{-k}$, A can be represented as:
$$A = A_2 z^2 + A_3 z^3 + A_4 z^4 + \cdots$$
where $|A_i| \le 2^k - 1$. We ignore the smaller terms that do not contribute to the final result.
Using the Taylor series expansion,
$$\begin{aligned}
B = f(A) &= C_0 + C_1 A + C_2 A^2 + C_3 A^3 + C_4 A^4 + \cdots \\
&\approx C_0 + C_1 (A_2 z^2 + A_3 z^3 + A_4 z^4) + C_2 (A_2 z^2 + A_3 z^3 + A_4 z^4)^2 \\
&\quad + C_3 (A_2 z^2 + A_3 z^3 + A_4 z^4)^3 + C_4 (A_2 z^2 + A_3 z^3 + A_4 z^4)^4 \\
&\approx C_0 + C_1 A + C_2 A_2^2 z^4 + 2 C_2 A_2 A_3 z^5 + C_3 A_2^3 z^6
\end{aligned}$$
Here, $C_i = 1$ when i is even and $C_i = -1$ when i is odd. Simplifying, we get:
$$\begin{aligned}
B = f(A) &\approx 1 - A_2 z^2 - A_3 z^3 + (-A_4 + A_2^2) z^4 + 2 A_2 A_3 z^5 - A_2^3 z^6 \\
&\approx (1 - A) + A_2^2 z^4 + 2 A_2 A_3 z^5 - A_2^3 z^6
\end{aligned}$$
The equation above is used in the implementation of reciprocal and division.
In the post-processing step, the final result is given by multiplication. For the reciprocal, the result 1/Y is given by the product of M and B:
$$1/Y = M \times B.$$
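The three steps for the reciprocal can be checked with an exact-arithmetic software model. The following is a minimal sketch in Python, not the VHDL implementation: the function name `reciprocal_significand` is hypothetical, the look-up table is modelled by computing each entry on the fly, and two assumptions are made that the text does not state, namely that $Y^{(k)}$ keeps k fractional bits and that R is rounded down to k + 2 fractional bits.

```python
from fractions import Fraction


def reciprocal_significand(y: Fraction, m: int = 53) -> Fraction:
    """Approximate 1/y for a normalized significand 1 <= y < 2 with m bits."""
    assert 1 <= y < 2
    k = (m + 2) // 4 + 1                  # k = 14 for double precision
    z = Fraction(1, 2 ** k)

    # Reduction: truncate y to k fractional bits and "look up" R ~ 1/Y(k).
    # The table entry is computed here, rounded down to k + 2 fractional bits
    # (an assumption of this sketch).
    y_k = Fraction(int(y * 2 ** k), 2 ** k)
    r = Fraction(int(2 ** (k + 2) / y_k), 2 ** (k + 2))

    # A = y*R - 1 satisfies |A| < 2**-k. Split it into signed k-bit digits
    # A2, A3, A4 at weights z**2, z**3, z**4; lower bits are dropped.
    a = y * r - 1
    sign = -1 if a < 0 else 1
    mag = int(abs(a) * 2 ** (4 * k))
    mask = 2 ** k - 1
    a4 = sign * (mag & mask)
    a3 = sign * ((mag >> k) & mask)
    a2 = sign * ((mag >> (2 * k)) & mask)
    a_trunc = a2 * z ** 2 + a3 * z ** 3 + a4 * z ** 4

    # Evaluation: B ~ 1/(1 + A), keeping the terms up to z**6.
    b = 1 - a_trunc + a2 ** 2 * z ** 4 + 2 * a2 * a3 * z ** 5 - a2 ** 3 * z ** 6

    # Post-processing: 1/y = M * B with M = R.
    return r * b


y = Fraction(3, 2)
error = abs(reciprocal_significand(y) - 1 / y)
print(float(error))   # expected to be far below 2**-50 for double precision
```

Using `Fraction` keeps every intermediate value exact, so the only error left is the one introduced by the truncated series and the dropped low-order bits of A.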
For division, the result X/Y is given by the product of M, B and X:
$$X/Y = M \times B \times X.$$
Square Root
Computing the square root takes three steps similar to the reciprocal computation. The first step is reduction, the second is evaluation, and the last is post-processing.
In the reduction step, the difference from the reciprocal is that after getting R, M is assigned 1/√R. So another look-up table to compute the inverse square root of R is needed. Notice that if the exponent is odd, there is a coefficient of √2 to multiply into the result at the last step. For the last step, we check the last bit of the exponent. 0 means the exponent is an even number, so we assign 1/√R to M. If not, M equals √2/√R. To compute √2/√R, we create another look-up table.
In the evaluation step, f(A) is not 1/(1 + A) as above. Instead let
$$f(A) = \sqrt{1 + A}$$
The Taylor series expansion of f(A) has the same form as that of the reciprocal:
$$\begin{aligned}
B = f(A) &= C_0 + C_1 A + C_2 A^2 + C_3 A^3 + C_4 A^4 + \cdots \\
&\approx C_0 + C_1 (A_2 z^2 + A_3 z^3 + A_4 z^4) + C_2 (A_2 z^2 + A_3 z^3 + A_4 z^4)^2 \\
&\quad + C_3 (A_2 z^2 + A_3 z^3 + A_4 z^4)^3 + C_4 (A_2 z^2 + A_3 z^3 + A_4 z^4)^4 \\
&\approx C_0 + C_1 A + C_2 A_2^2 z^4 + 2 C_2 A_2 A_3 z^5 + C_3 A_2^3 z^6
\end{aligned}$$
However, the coefficients change accordingly: $C_0 = 1$, $C_1 = 1/2$, $C_2 = -1/8$, $C_3 = 1/16$. Thus,
$$B = f(A) \approx 1 + \frac{A}{2} - \frac{1}{8} A_2^2 z^4 - \frac{1}{4} A_2 A_3 z^5 + \frac{1}{16} A_2^3 z^6$$
In the post-processing step, the final result of the square root is given by the product of M and B:
$$\sqrt{Y} = M \times B.$$
Implementation
Overview
Variable precision multiplier, division and square root are designed as part of the library. Figure 3 shows the dataflow in the top level component of the library. The component consists of the module to de-normalize the input floating point numbers, the core computation, which is multiplication, division or square root, and the round to normal component for adjusting to the standard floating point format. Two modules are common to all components: de-normalizing and round to normal. In this section we therefore first discuss the implementation of the common elements and then the core components of multiplier, division and square root separately.
Figure 3: Top Level Hierarchy of Components
De-normalizer
An IEEE 754 standard floating point number consists of a sign bit, an exponent and a mantissa. As described, the stored part of the mantissa does not include the leading '1' as the integer part. Since this '1' is required for computation, the de-normalizer component adds it as the most significant bit of the fractional part. Based on the number of operands, we might need one de-normalizer for multiplier and square root or two for division.
Round to Normal
There are four rounding modes specified in the IEEE standard: round to zero, round to nearest, round to negative infinity and round to positive infinity. In our library, there are two options for rounding. One is round to zero, which is truncation. The other is round to nearest. If the input signal ROUND to a library component is zero, round to nearest is implemented; otherwise it is round to zero. The round to normal component also removes the integer '1' of the mantissa used in the computation. After rounding and normalizing in this component, the floating point result in IEEE standard format will be on the output port.
Multiplier
Figure 4 shows the components and dataflow within the reciprocal. This corresponds to the algorithm described above. After obtaining the de-normalized number, the sign bit of the result is the same as that of the de-normalized number. For the exponent part, first we obtain the actual value of the exponent by subtracting the bias from the exponent. The second step is to subtract it from zero. The third step adds the corresponding bias to the number obtained from the second step. This gives the exponent of the result.
Figure 4: Multiplier
Take the double precision multiplier operation as an example. The component uses the look-up table to calculate R, and 4 multipliers. The look-up table has 14 bits of address and 16 output bits for a total of 32K bytes. The four multipliers have the following sizes. Multiplier YR has 17 bit and 58 bit inputs and a 71 bit output; its purpose is to multiply Y and R in order to get A. Multiplier S has two 14 bit inputs and a 28 bit output; its purpose is to compute A2*A2 and A2*A3. Multiplier M has 28 bit and 14 bit inputs and a 42 bit output; it computes the cube of A2. Multiplier L has one 16 bit and one 58 bit input and a 74 bit output; it computes R*B. The pipeline depth of these multipliers can be adjusted by a parameter, which is one way to modify the clock cycle latency of the components.
Floating Point Division
For floating point division, we need to de-normalize the two input operands, compute the sign bit, compute the exponent and perform the division of the significands. The sign bit of the result is the XOR of the signs of the two inputs. The exponent of the result is the difference between the exponents of the inputs with the bias suitably adjusted. Figure 5 shows the components and dataflow within the division.
Figure 5: Division
Division consists of the reciprocal followed by the multiplication described above. After getting the reciprocal of the divisor, the next step is to multiply the reciprocal with the mantissa of the dividend. This multiplier is called the mantissa multiplier XY. For example, in double precision the inputs to the mantissa multiplier are 53 bits and 57 bits wide, and the output bit width is 110.
Square Root
The first step in computing the square root is to check that the sign bit is positive. If not, EXCEPTION OUT is set high to indicate that an exception has occurred. For computing the exponent, first get the exponent value by subtracting the bias from the de-normalized exponent. Second, if the intermediate number is even, divide it by 2. Otherwise, subtract 1 and then divide by 2. Next, add the bias to the number obtained from the second step; this is the temporary exponent part of the result. If the exponent is an odd number, an extra factor of √2 is multiplied into the fractional part of the mantissa result.
Figure 6: Square Root
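The sign and exponent rules described above for the reciprocal, division and square root can be summarized in a short software sketch. It is written in Python as an illustration only; the function names are hypothetical, an IEEE-style bias of 2^(exp_width-1) - 1 is assumed, and the results are the exponents before any adjustment that normalizing the significand may later require.

```python
def bias(exp_width: int) -> int:
    """IEEE-style exponent bias for a given exponent width (assumed)."""
    return (1 << (exp_width - 1)) - 1


def reciprocal_exponent(e: int, exp_width: int) -> int:
    """Remove the bias, negate, then add the bias back."""
    actual = e - bias(exp_width)
    return (0 - actual) + bias(exp_width)


def division_sign_exponent(sx: int, ex: int, sy: int, ey: int, exp_width: int):
    """Sign is the XOR of the input signs; exponent is the re-biased difference."""
    return sx ^ sy, (ex - ey) + bias(exp_width)


def sqrt_exponent(e: int, exp_width: int):
    """Halve the unbiased exponent; report whether it was odd.

    When the flag is True, the significand result carries an extra sqrt(2).
    """
    actual = e - bias(exp_width)
    odd = actual % 2 == 1                  # also True for negative odd values
    half = (actual - 1) // 2 if odd else actual // 2
    return half + bias(exp_width), odd


# Double precision has an 11-bit exponent (bias 1023).
print(reciprocal_exponent(1024, 11))               # exponent field of 2**1
print(division_sign_exponent(0, 1025, 1, 1023, 11))
print(sqrt_exponent(1026, 11))                     # unbiased exponent 3 (odd)
```

Subtracting 1 before halving an odd exponent keeps the halved value an integer, and the removed factor of 2 reappears as the √2 applied to the significand.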
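The significand part of the square root algorithm can be modelled in the same exact-arithmetic style. This is a hedged sketch in Python rather than the VHDL design: the name `sqrt_significand` and the parameter `table_bits` are hypothetical, both look-up tables are modelled by computing their entries on the fly, and the table precisions (k + 2 fractional bits for R, 56 fractional bits for M) are assumptions of the sketch.

```python
from fractions import Fraction
from math import isqrt


def sqrt_significand(y: Fraction, odd_exponent: bool = False,
                     m: int = 53, table_bits: int = 56) -> Fraction:
    """Approximate sqrt(y), or sqrt(2*y) for an odd exponent, with 1 <= y < 2."""
    assert 1 <= y < 2
    k = (m + 2) // 4 + 1
    z = Fraction(1, 2 ** k)

    # Reduction: R ~ 1/Y(k), modelled as a table entry rounded down.
    y_k = Fraction(int(y * 2 ** k), 2 ** k)
    r = Fraction(int(2 ** (k + 2) / y_k), 2 ** (k + 2))

    # M = 1/sqrt(R) for an even exponent, sqrt(2)/sqrt(R) for an odd one,
    # rounded down to table_bits fractional bits with an integer square root.
    factor = 2 if odd_exponent else 1
    m_scaled = isqrt((factor * r.denominator << (2 * table_bits)) // r.numerator)
    big_m = Fraction(m_scaled, 2 ** table_bits)

    # A = y*R - 1, split into signed k-bit digits A2, A3, A4.
    a = y * r - 1
    sign = -1 if a < 0 else 1
    mag = int(abs(a) * 2 ** (4 * k))
    mask = 2 ** k - 1
    a4 = sign * (mag & mask)
    a3 = sign * ((mag >> k) & mask)
    a2 = sign * ((mag >> (2 * k)) & mask)
    a_trunc = a2 * z ** 2 + a3 * z ** 3 + a4 * z ** 4

    # Evaluation: B ~ sqrt(1 + A) with C1 = 1/2, C2 = -1/8, C3 = 1/16.
    b = (1 + a_trunc / 2
         - Fraction(a2 ** 2, 8) * z ** 4
         - Fraction(a2 * a3, 4) * z ** 5
         + Fraction(a2 ** 3, 16) * z ** 6)

    # Post-processing: sqrt(y) = M * B.
    return big_m * b


s = sqrt_significand(Fraction(3, 2))
print(float(abs(s * s - Fraction(3, 2))))   # residual of the squared result
```

Squaring the result and comparing it with y (or 2y in the odd case) avoids needing a reference square root, since every quantity in the check stays rational.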
Figure 6 shows the components and dataflow of the square root. This corresponds to the algorithm described above. The square root component uses the reciprocal look-up table R to calculate R, look-up table M to get 1/√R when the exponent is even, the M Mul2 look-up table to get √2/√R when the exponent is odd, and 4 multipliers.
For example, in double precision operation, look-up table R has 14 address bits and 16 output bits for a total of 32K bytes. Look-up table M has 14 address bits and 55 output bits. For look-up table M Mul2, the input is 14 bits and the output is 58 bits. Of the four multipliers, multiplier YR has 17 bit and 58 bit inputs and a 71 bit output; its purpose is to multiply Y and R in order to get A. Multiplier S has two 14 bit inputs and a 28 bit output; multiplier M has 28 bit and 14 bit inputs and a 42 bit output; multiplier L has one 16 bit and one 58 bit input and a 74 bit output.
III. CONCLUSION
In this section, we have described the implementation of multiplication, division and square root based on the table-based method described in the paper by Ercegovac and Lang [5]. In the next section, we present results from implementing these components on Altera.
Experimental Results
Our implementation of the algorithms for variable precision floating point is written in VHDL and targets a popular commercial FPGA vendor, Altera. For our experimental results, we built our reciprocal, division and square root components and then simulated and synthesized them on Altera's FPGA platform. We used a range of precisions supported by our variable precision floating point format. For Altera, we synthesized with the Altera IDE tool and targeted a Stratix V device. The designs make use of embedded multipliers and RAMs, which require using the intellectual property components provided with each set of tools. For Altera these are called Megacores. We also provide the clock cycle latency, maximum clock frequency and resource usage for Altera.
Hardware Components on Stratix
Stratix II Series FPGAs [1] are Altera's 28-nm FPGAs, which are optimized for high performance, high bandwidth applications. Figure 6 shows the architecture and features of Stratix II. The Stratix II device family contains GT, GX, GS and E sub-families.

Parameters                  Multiplication   Division    Sqr. root
Top level entity name       Fpmul_clk        FPdiv_clk   FPsqrt_clk
Family                      Stratix II       Stratix II  Stratix II
Met timing requirements     Yes              Yes         Yes
Logic utilization           5%               2%          3%
Comb. ALUTs                 450              274         280
Dedicated logic registers   444              148         188
Total registers             444              148         188
Total pins                  67               67          45
Block memory bits and DSP block 9-bit elements (values in source order): 88, 0, 42, 0, 2, 0

Table 1. Resource summary of variable length floating point arithmetic (multiplication, division, square root)
The core logic fabric consists of high performance adaptive logic modules (ALMs). Each ALM has eight inputs with a fracturable LUT, two embedded adders and four dedicated registers. Variable-precision DSP blocks contain two 18 by 18 multipliers. The results can be summed, forming a multiply accumulator. M20K memory blocks provide good memory block performance and can simplify floor planning and routability. Optional hard error correction code (ECC) protection enables reliable delivery of the data.
Conclusions and Future Work
We have presented variable precision floating point multiplication, division and square root implementations that produce hardware components with a good tradeoff among maximum clock frequency, clock cycle latency and resource utilization. These implementations are cross-platform, meaning they can be implemented on both Altera and Xilinx FPGAs. Our designs also make use of the embedded multiplier cores and embedded RAM cores commonly found in modern FPGA fabric. In the future, we plan to improve our library even further. In particular, we will focus on improving the division frequency by optimizing the critical path and trying different levels of pipelining.
REFERENCES
[1] Altera Corp. Stratix II website. http://www.altera.com/devices/fpga/stratix-fpgas/stratix-v/stxv-index.jsp.
[2] I. S. Committee et al. 754-2008 IEEE standard for floating-point arithmetic. IEEE Computer Society Std, 2008.
[3] F. de Dinechin and B. Pasca. Designing custom arithmetic data paths with FloPoCo. Design & Test of Computers, IEEE, 28(4):18–27, 2011.
[4] J. Detrey and F. de Dinechin. A tool for unbiased comparison between logarithmic and floating-point arithmetic. The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 49(1):161–175, 2007.
[5] M. D. Ercegovac, T. Lang, J.-M. Muller, and A. Tisserand. Reciprocation, square root, inverse square root, and some elementary functions using small multipliers. Computers, IEEE Transactions on, 49(7):628–637, 2000.
[6] R. Goldberg, G. Even, and P.-M. Seidel. An FPGA implementation of pipelined multiplicative division with IEEE rounding. In Field-Programmable Custom Computing Machines, 2007. FCCM 2007. 15th Annual IEEE Symposium on, pages 185–196. IEEE, 2007.
[7] P. Hung, H. Fahmy, O. Mencer, and M. J. Flynn. Fast division algorithm with a small lookup table. In Signals, Systems, and Computers, 1999. Conference Record of the Thirty-Third Asilomar Conference on, volume 2, pages 1465–1468. IEEE, 1999.
[8] M. K. Jaiswal and R. C. Cheung. High performance reconfigurable architecture for double precision floating point division. In Reconfigurable Computing: Architectures, Tools and Applications, pages 302–313. Springer, 2012.
[9] M. E. Louie and M. D. Ercegovac. A digit-recurrence square root implementation for field programmable gate arrays. In FPGAs for Custom Computing Machines, 1993. Proceedings. IEEE Workshop on, pages 178–183. IEEE, 1993.
[10] B. Pasca. Correctly rounded floating-point division for DSP-enabled FPGAs. In Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, pages 249–254. IEEE, 2012.
[11] P. Soderquist and M. Leeser. Area and performance tradeoffs in floating-point divide and square-root implementations. ACM Computing Surveys (CSUR), 28(3):518–564, 1996.
[12] P. Soderquist and M. Leeser. Division and square root: choosing the right implementation. Micro, IEEE, 17(4):56–66, 1997.