High Speed FIR-Filter Architectures with Scalable Sample Rates
Martin Vaupel, Heinrich Meyr
Abstract
FIR (finite impulse response) filters are widely used in digital signal processing. In this
paper, new architectures for high speed FIR filters with programmable coefficients are
presented. Special efforts are undertaken to develop a structure that is well suited to
different data rates and may therefore be used within a tool (filter generator) that generates
demand-driven, dedicated filter structures. The presented structure leads to highly efficient
designs that are usable within different environments. The basic design structure is
introduced and implementation considerations are discussed. Results of synthesis runs are
presented.
1 Introduction
FIR (finite impulse response) filters are widely used in digital signal processing. Their
applications often demand high speed computation. To satisfy these requirements, dedicated
and efficient filter architectures are needed for each target domain. In order to free the
system designer from designing a filter architecture with unique properties (number of taps,
length of input words and coefficients) for each application, considerable efforts have been
undertaken to develop filter generators that are able to deal with a variety of these
implementation parameters and generate an efficiently scalable architecture [1,2,3,4,5]. Only
a small part of this work is concerned with filters with programmable coefficients, which are
used e.g. within equalizers or adaptive filters. However, each of these generators delivers an
architecture with one fixed sample rate only. If this does not meet the system designer's
actual specifications, the designer has to accept an efficiency loss or design a new
architecture satisfying the requirements, thereby giving up the advantages of a generic
design approach.
An interesting approach to realizing programmable coefficients is reported by Khoo et al. [6].
Their design enables filtering at different sample rates. However, this effect was not
originally intended: it is a side effect of the way the coefficients are encoded and cannot be
controlled independently. Another drawback is that the values of the coefficients cannot be
updated during filtering in such a way that only values of the same set of coefficients
contribute to the output value for each time instance (synchronous update). Other designs of
high speed programmable FIR filters with a fixed sample rate [7,8] suffer from the same
drawback.
The authors are with Aachen University of Technology, ISS 611810, Templergraben 55,
52056 Aachen, Germany, Tel.: +49-241807880, Fax: +49-241-8888195, email: [email protected]
This work was partially supported by DFG under contract no. Me651/12-1
Noll et al. [9] have provided a full custom architecture of a programmable FIR filter, which
is best suited for high speed applications. It is based on a semi-systolic array of full
adders with carry-save arithmetic. With this architecture a synchronous coefficient update is
possible.
Our goal was to develop a fully synchronous design based on standard cells. This implies a
lower importance of regularity compared to a full custom approach. Therefore some
restrictions on the selection between different architectural alternatives are relaxed. The
designer is thus able to use, e.g., special irregularities in order to decrease the overall
area.
Based on Noll's architecture we developed an approach that delivers efficient high speed
programmable FIR filters suitable for a relatively large range of sample rates. We changed
the known architecture to meet the special requirements of a scalable solution and to obtain
an additional efficiency gain of approximately 30% compared to an equivalent implementation
of [9] with standard cells and a single phase clock.
The paper is organized as follows: In section two, different approaches for the
implementation of programmable high speed filters are considered and the new strategy is
developed. Section three is concerned with implementation aspects which reduce the chip
area; the novel architecture is explained in detail. In section four, general aspects and
results of synthesis runs are discussed. Final remarks and an outlook on further work
conclude this paper.
2 Algorithm
A FIR filter with $L$ taps is described by its transfer function:
$$
G(z) = \frac{Y(z)}{X(z)} = \sum_{i=0}^{L-1} c_i z^{-i} \qquad (1)
$$
Let us assume fixed and positive coefficients first. Each coefficient $c_i$ can be split
into $m$ single bits $c_i^j$. Then the transfer function is of the form
$$
G(z) = \sum_{i=0}^{L-1} \left( z^{-i} \sum_{j=0}^{m-1} c_i^j 2^j \right)
\quad \text{with} \quad c_i = \sum_{j=0}^{m-1} c_i^j 2^j \qquad (2)
$$
Figure 1: Flow chart of an accumulation free filter (legend: s = shift right 1 bit, s+ = shift left 3 bit, dotted squares = possible pipeline slices)
Figure 2: Structure of a fully pipelined filter with bitplanes
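For a filter with three coefficients of four bits each, this can be written as:
$$
\begin{aligned}
G(z) = {} & \left(c_0^0 + 2\left(c_0^1 + 2\left(c_0^2 + 2 c_0^3\right)\right)\right) \\
& + z^{-1} \left(c_1^0 + 2\left(c_1^1 + 2\left(c_1^2 + 2 c_1^3\right)\right)\right) \\
& + z^{-2} \left(c_2^0 + 2\left(c_2^1 + 2\left(c_2^2 + 2 c_2^3\right)\right)\right)
\end{aligned} \qquad (3)
$$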
An implementation of this form can be realized with a structure like Fig. 1 (transposed
direct form). Each adder is implemented by a row of 1-bit full adder cells, whose number
equals the actual word length of the intermediate result. The multipliers (triangles in
Fig. 1) compute the partial products and are in fact simple AND gates. Between rows of adder
cells a hardwired shift of the result is performed (shift and add). The output of one tap is
fed into free adder inputs of the following tap. Therefore no additional explicit adders are
needed (accumulation free filter [10]).
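
The nested form of Eq. (3) can be checked numerically. The following Python sketch is an illustration only and not part of the described design flow; the function names `fir_direct` and `fir_shift_add` are hypothetical. It assumes non-negative coefficients of m bits, as in the text, and evaluates the brackets MSB first in software, so it models the arithmetic of Eq. (3) rather than the exact shift direction or word lengths of the hardware in Fig. 1.

```python
def fir_direct(x, coeffs):
    """Reference: y[n] = sum_i c_i * x[n - i], with x[n] = 0 for n < 0."""
    return [
        sum(c * x[n - i] for i, c in enumerate(coeffs) if n - i >= 0)
        for n in range(len(x))
    ]


def fir_shift_add(x, coeffs, m):
    """Same filter, using only AND-style selection, additions and 1-bit shifts.

    Each coefficient is split into m bits; the contribution of a tap is built
    up as c^0 x + 2 (c^1 x + 2 (c^2 x + ...)), as in Eq. (3).
    """
    y = []
    for n in range(len(x)):
        total = 0
        for i, c in enumerate(coeffs):
            xn = x[n - i] if n - i >= 0 else 0
            acc = 0
            for j in reversed(range(m)):     # innermost bracket (MSB) first
                bit = (c >> j) & 1
                partial = xn if bit else 0   # partial product: bit AND input word
                acc = (acc << 1) + partial   # shift by one position, then add
            total += acc
        y.append(total)
    return y
```

For example, `fir_shift_add([1, 2, 3, 4, 5], [5, 9, 14], 4)` gives the same output sequence as `fir_direct` with the same arguments.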
In order to speed up the architecture, it is possible to implement pipeline slices by
inserting registers behind each adder and correspondingly into the input line (dotted
squares in Fig. 1). A more efficient solution in terms of area is to reorder the additions:
1) adding all partial products of the lowest significance first, 2) performing a shift of
the result, and 3) adding the partial products of the next significance. This results in
so-called bitplanes [11]. Because the lowest partial product values are added first, the
upper bound on the value of the intermediate results grows more slowly from left to right
than in the structure in Fig. 1. This leads to a design that has the minimum possible
number of adder cells, due to the lower increase of the wordlength required in each line.
Another advantage is that after each bitplane the lowest bit of the result is computed
completely and may be truncated without side effects on the upper bits if desired. The
corresponding form of the transfer function equals
$$
\begin{aligned}
G(z) = {} & (c_0^3 2^3 z^{-9} + z^{-1}(c_1^3 2^3 z^{-9} + z^{-1}(c_2^3 2^3 z^{-9} \\
& + z^{-1}(c_0^2 2^2 z^{-6} + z^{-1}(c_1^2 2^2 z^{-6} + z^{-1}(c_2^2 2^2 z^{-6} \\
& + z^{-1}(c_0^1 2^1 z^{-3} + z^{-1}(c_1^1 2^1 z^{-3} + z^{-1}(c_2^1 2^1 z^{-3} \\
& + z^{-1}(c_0^0 2^0 z^{-0} + z^{-1}(c_1^0 2^0 z^{-0} + z^{-1}(c_2^0 2^0 z^{-0}) \cdots )
\end{aligned} \qquad (4)
$$
The introduced pipeline steps result in an increased latency of the filter. The corresponding flow graph is outlined in Fig. 2.
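
The latency introduced by the pipeline steps can be made explicit by expanding the nested brackets of Eq. (4). The following Python sketch is an illustration under stated assumptions, not the authors' tooling: the function name `bitplane_impulse_response` is hypothetical, and the input delay of L*j samples for bitplane j generalizes the exponents z^{-0}, z^{-3}, z^{-6}, z^{-9} that Eq. (4) shows for L = 3.

```python
def bitplane_impulse_response(coeffs, m):
    """Expand the nested form of Eq. (4) into a dict {delay: weight}.

    Terms are visited from the innermost bracket (LSB plane, last tap) to the
    outermost one (MSB plane, first tap). Every bracket level contributes one
    pipeline delay z^-1; the input of bitplane j is assumed to be delayed by
    L*j samples.
    """
    L = len(coeffs)
    poly = {}                                    # polynomial in z^-1
    for j in range(m):                           # bitplane 0 (LSB) is innermost
        for i in reversed(range(L)):             # last tap is innermost
            poly = {d + 1: w for d, w in poly.items()}   # multiply by z^-1
            bit = (coeffs[i] >> j) & 1
            d_in = L * j                         # input delay of this bitplane
            poly[d_in] = poly.get(d_in, 0) + bit * (1 << j)
    return poly
```

For three 4-bit coefficients the expansion contains only the delays 9, 10 and 11, weighted with c_0, c_1 and c_2. In other words, Eq. (4) equals the transfer function of Eq. (1) delayed by nine samples.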
To reduce chip area at the cost of decreased throughput, Noll has introduced modified
bitplanes, which are a mixture of the approaches regarded so far. Instead of pipelining
each adder, pipeline registers are inserted only after every second adder in Fig. 1, and
the structure is rearranged according to Fig. 3. The number of adder cells between registers
will be called pipeline depth in the following. The drawback of this solution is that the
minimum number of cells required cannot be reached.
In order to avoid this disadvantage, our new approach is to retain the underlying structure
of Fig. 2 as it is, but to alter the scheduling of the input words to the inputs of the
multipliers accordingly. This leads to considerable area savings, especially with larger
pipeline depths, due to the lowest possible increase of the word length. For instance, an
implementation of a filter with a typical parameter set (eight coefficients, input and
coefficient word length of eight and four bit, respectively) and pipeline depth four needs
33% more adder cells if implemented according to Fig. 1 and 11% more adder cells when
implemented according to Fig. 3, compared to the newly introduced structure. The transfer
function now has the following form (shown for pipeline depth 2):
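$$
\begin{aligned}
G(z) = {} & (c_0^3 2^3 z^{-4} + c_1^3 2^3 z^{-5} \\
& + z^{-1}(c_2^3 2^3 z^{-5} + c_0^2 2^2 z^{-3} \\
& + z^{-1}(c_1^2 2^2 z^{-3} + c_2^2 2^2 z^{-4} \\
& + z^{-1}(c_0^1 2^1 z^{-1} + c_1^1 2^1 z^{-2} \\
& + z^{-1}(c_2^1 2^1 z^{-2} + c_0^0 2^0 z^{-0} \\
& + z^{-1}(c_1^0 2^0 z^{-0} + c_2^0 2^0 z^{-1}) \cdots )
\end{aligned} \qquad (5)
$$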
The corresponding structure is shown in Fig. 4.
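
The altered input scheduling can be illustrated by a small calculation. The following Python sketch is not the scheduling program used for the implementation; it is a minimal reconstruction under the assumption that the partial products keep the bitplane order of Fig. 2 and that one register follows every `pipeline_depth` adders. The name `input_delays` and its parameters are hypothetical.

```python
def input_delays(num_taps, num_bits, pipeline_depth):
    """Return ({(bit j, tap i): input delay}, latency) for the nested form.

    Partial products are ordered from the outermost bracket (MSB plane,
    tap 0) to the innermost one (LSB plane, last tap). Term number n is
    assumed to sit behind n // pipeline_depth registers on its way to the
    output. The input delay of each term is chosen so that tap i always
    sees a total delay of latency + i.
    """
    raw = {}
    for j in range(num_bits):
        for i in range(num_taps):
            n = (num_bits - 1 - j) * num_taps + i   # position in the adder chain
            stage = n // pipeline_depth             # registers between term and output
            raw[(j, i)] = i - stage
    latency = -min(raw.values())                    # shift so the smallest delay is 0
    return {key: d + latency for key, d in raw.items()}, latency
```

With three taps, four bits and pipeline depth 2 this yields the exponents of Eq. (5), for example a delay of 4 for $c_0^3$ and 0 for $c_0^0$, with an overall latency of four samples. With pipeline depth 1 it yields the delays 3j of Eq. (4) and a latency of nine samples.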
3 Architecture
Figure 3: Modified bitplanes at pipeline depth 2
Figure 4: The structure with minimum number of cells at pipeline depth 2
In order to decrease the area without increasing the minimum clock period, we have
implemented (modified) Booth encoding of the coefficients. This reduces the length of the
array by a factor of two and increases the array width by one bit. Using the newly proposed
architecture, a synchronous update of the coefficients is still possible. A group of three
successive coefficient bits is encoded into two magnitude bits and one sign bit. These are
fed into modified partial product gates, which are able to perform a shift and a one's
complement (inversion) of the input bits depending on the values of the encoded bits. The
resulting structure of a filter with three taps, six bit wide coefficients and an input
wordlength of three bit is outlined in Fig. 5 for a pipeline depth of two.
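
The recoding step can be sketched in software as follows. This is an illustration of standard radix-4 (modified) Booth recoding, not the cell-level encoding used in the architecture; the function names and the bit assignment `(sign, one, two)` are assumptions made for this example.

```python
def booth_radix4(c, nbits):
    """Recode an nbits-wide two's complement integer c (nbits even).

    Returns a list of (sign, one, two) triples, least significant digit
    first. Each triple encodes a digit in {-2, -1, 0, +1, +2}: `one` and
    `two` select magnitude 1 or 2, and sign = 1 marks a negative digit.
    """
    u = c & ((1 << nbits) - 1)                        # raw two's complement pattern
    bits = [0] + [(u >> k) & 1 for k in range(nbits)]  # bits[0] is the implicit 0 below the LSB
    digits = []
    for k in range(0, nbits, 2):                      # overlapping groups of three bits
        b_lo, b_mid, b_hi = bits[k], bits[k + 1], bits[k + 2]
        value = b_lo + b_mid - 2 * b_hi               # digit value in -2..+2
        sign = 1 if value < 0 else 0
        digits.append((sign, int(abs(value) == 1), int(abs(value) == 2)))
    return digits


def partial_product(x, digit):
    """Shift and one's complement as selected by the encoded bits,
    then add the sign bit to complete the two's complement."""
    sign, one, two = digit
    magnitude = (x if one else 0) + ((x << 1) if two else 0)  # select 0, x or 2x
    inverted = ~magnitude if sign else magnitude              # one's complement
    return inverted + sign                                    # +1 completes the negation


def booth_multiply(x, c, nbits):
    """Sum of the partial products, each weighted by 4**t."""
    return sum(partial_product(x, d) << (2 * t)
               for t, d in enumerate(booth_radix4(c, nbits)))
```

A six bit coefficient is recoded into three digits, so only three partial products per tap remain instead of six, while the possible factor of two widens each partial product by one bit. In this sketch the sign bit is added immediately inside `partial_product`; in hardware this addition can be deferred to a later adder cell.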
This architecture is able to deal with two's complement numbers as inputs and coefficients.
It consists mainly of full adder cells and registers. The input bits are fed into the array
in parallel. Note that the input bits are drawn for the first bitplane only. In rows where
the wordlength of the intermediate result need not be increased, so-called
carry-overflow-correction (COC) cells [9,12] are implemented. These are necessary to
correct an overflow of the carry word, which is possible although the sum of carry and sum
word fits into the given word length. This leads to considerable savings in terms of the
required wordlength. Another advantage of using COC cells together with the proposed
bitplane structure is that at most one bit of sign extension is necessary in each row.
Therefore no additional buffers are needed to drive large sign extension lines (cf. [2]).
Within the rightmost full adder cell of each row, the sign bit of the preceding Booth
encoding cell is added in order to complete the computation of the two's complement.
Postponing this operation leads to a decreased word length, too.
The upper bits of the final sum and carry words are added using a vector merging adder
(VMA). It consists of a pipelined array of full and half adder cells as well. The internal
structure of the VMA is changed for different pipeline depths to reach the minimum possible
area.
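
The division of labour between the carry-save array and the VMA can be illustrated with a short sketch. It is a generic model of carry-save addition with unbounded Python integers, not a description of the cells used here: word length limits, COC cells and pipelining are not modeled, and the function names are hypothetical.

```python
def carry_save_add(s, c, operand):
    """One row of full adders: compress three words into a sum and a carry word."""
    new_sum = s ^ c ^ operand                                   # bitwise sum without carry ripple
    new_carry = ((s & c) | (s & operand) | (c & operand)) << 1  # carries move one bit to the left
    return new_sum, new_carry


def accumulate(operands):
    """Add all operands in carry-save form; the result stays split in two words."""
    s, c = 0, 0
    for op in operands:
        s, c = carry_save_add(s, c, op)
    return s, c


def vector_merge(s, c):
    """Final carry-propagate addition of sum and carry word (the task of the VMA)."""
    return s + c
```

Inside the array no carry has to ripple along a row, which keeps the critical path short; a single carry-propagate addition at the output then yields the conventional result, e.g. `vector_merge(*accumulate([13, 7, 22, 5]))` equals 47.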
In order to shorten the critical path, additional registers are inserted after the modified
partial product gates of rows 1, 3, 5, ... This has to be taken into account when delaying
the input words correctly. As a result, for each pipeline depth pd the critical path
consists of pd full adder cells only.
Figure 5: Detailed structure of the new filter architecture (number of taps: 3, wordlength of coefficients: 6, wordlength of inputs: 3, pipeline depth: 2; cell types: modified partial product gate, modified Booth encoder, full adder, half adder (H), full adder with carry overflow correction (COC), register, VMA)
4 Implementation results
As the array is semi-systolic, the achievable sample rate becomes independent of the desired
implementation parameters, because communication between cells is mainly local and
broadcasting of data is avoided. This frees the system designer from considering an
additional parameter during system optimization.
The structural description of the filter was done in VHDL. It is fully parameterizable in
the number of taps and the wordlength of inputs and coefficients. Furthermore, due to the
generic properties of VHDL, it was possible to describe the different structures (depending
on the value of the pipeline depth) within a single 'architecture'. For each desired sample
rate, a C program computes the scheduling of the delayed input words and writes the results
into a VHDL package. Within the VHDL architecture, the underlying global structure and the
insertion of additional registers after partial product gates are described. Information
from package and entity/architecture is fed into a synthesis tool, which, steered by a
synthesis script, produces a netlist.
Figure 6: Area over minimum clock period at different pipeline depths (x-axis: clock period / ns; y-axis: area / mm^2; number of taps: 8, wordlength of input: 6, wordlength of coeff.: 6; upper curve: implementation according to [9]; lower curve: new architecture; pd: pipeline depth; reference line: A*T = const.)
In Fig. 6 accumulated cell areas and sample rates for different pipeline depths are shown
(8 taps, 6 bit wide coefficients and input words). The library used was the ES2 1 µm CMOS
standard cell library. Synthesis was undertaken with SYNOPSYS. (Operating conditions were
set to worst case.) It can easily be seen that our approach is well suited to deliver highly
efficient designs for different requirements. Note the almost constant area-time product
between 10 ns and 25 ns clock period. For applications with sample rates below 40 MHz the
efficiency decreases slowly; in this case resource sharing should therefore be considered.
The proposed architecture leads to considerable savings in area compared to a direct
approach following [9]. (Comparing the points for pd = 3 and pd = 4 of the upper curve, the
effect of the discussed suboptimal number of adder cells is visible, because it is not
covered by savings in terms of registers of pipeline slices.) At different sets of
parameters the curves are shifted vertically only, which means that the sample rate is a
function of the pipeline depth exclusively. Functional parameters and implementation
parameters that affect throughput are independent. Additionally, due to the mainly regular
array structure, the area may well be estimated without the need for synthesis runs. These
are advantages especially with regard to making use of the proposed structure within a
system design environment.
5 Conclusion
In this paper a novel architecture of a programmable high-speed digital FIR filter was
proposed, which results in efficient designs. The advantages of Noll's architecture (high
data rates, synchronous updating of the coefficients, and mainly local communication within
the array, and therefore independence of the maximum sample rate from the actually chosen
functional parameters) are preserved. Beyond this, the newly proposed structure is well
suited for throughput scaling and additionally saves about 30 percent of area compared to a
straightforward semi-custom realization of [9] without decreasing the possible throughput.
The splitting of flip-flops into edge triggered latches, together with a two phase clock, is
possible. Further investigations have shown that additional area savings are possible for
applications with linear phase.
References
[1] R. Jain, P. Yang, and T. Yoshino, "FIRGEN: A Computer-Aided Design System for High Performance FIR Filter Integrated Circuits," IEEE Transactions on Signal Processing, vol. 39, pp. 1655-1668, July 1991.
[2] R. Hawley, T. Lin, and H. Samueli, "A silicon compiler for high-speed CMOS multirate FIR digital filters," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 3, pp. 1348-1351, May 1992.
[3] R. Hartley, P. Corbett, P. Jacob, and S. Karr, "A High Speed FIR Filter Designed by Compiler," in Proceedings of the Custom Integrated Circuits Conference, (San Diego), pp. 20.2.1-20.2.4, 1989.
[4] P. Cappello and C. Wu, "Computer-aided Design of VLSI FIR Filters," Proceedings of the IEEE, vol. 75, pp. 1260-1271, September 1987.
[5] F. F. Yassa, J. R. Jasica, R. I. Hartley, and S. R. Noujaim, "A Silicon Compiler for Digital Signal Processing: Methodology, Implementation, and Applications," Proceedings of the IEEE, vol. 75, pp. 1272-1282, September 1987.
[6] K.-Y. Khoo, A. Kwentus, and A. N. Willson, "An Efficient 175MHz Programmable FIR Digital Filter," in Proceedings of International Symposium on Circuits and Systems, pp. 72-75, 1993.
[7] C. Joanblanq et al., "A 54 MHz CMOS Programmable Video Signal Processor for HDTV Applications," IEEE Journal on Solid State Circuits, pp. 730-734, 1990.
[8] M. Hatamian and S. K. Rao, "A 100 MHz 40-Tap Programmable FIR Filter Chip," in Proceedings of International Symposium on Circuits and Systems, (New Orleans), pp. 3053-3056, 1990.
[9] T. G. Noll, "Semi-Systolic Maximum Rate Transversal Filters with Programmable Coefficients," in Systolic Arrays (W. M. et al., ed.), pp. 103-112, Bristol: Adam Hilger, 1987.
[10] P. R. Cappello and K. Steiglitz, "A Note on 'free accumulation' in VLSI Filter Architectures," IEEE Trans. on Circuits and Systems, vol. CAS-32, pp. 291-296, March 1985.
[11] P. B. Denyer and D. Myers, "Carry-Save Arrays for VLSI Signal Processing," in Proc. of first Int. Conf. on VLSI, (Edinburgh), pp. 151-160, Aug. 1981.
[12] T. G. Noll, "Carry-Save Architectures for High-Speed Digital Signal Processing," J. VLSI Signal Processing, no. 1-2, pp. 121-140, 1991.