Marc-Julien Objois, Catherine Single, Charlena Fong, and Mariya Shterngartz
It may be necessary to create FIR filters that act on wide data. Doing so
using registers and multiple multipliers gets very expensive. For instance, a
20-bit lpm_mult requires 1259 logic units on a FLEX10K device.
For each tap of the filter, incoming data needs to be stored in a 20-bit
register. Each filter coefficient also needs 20 bits. Since each bit
requires a logic unit, a multi-tap filter can get extremely large.
However, the same filter can be implemented using a single lpm_mult and
SRAM storage.
The lpm_mult megafunction is most efficient when used with a pipelining
value of 4. On a FLEX10K20RC-4, it can be clocked at approximately 24 MHz.
This speed is not entirely necessary, and the master 25 MHz clock can be
divided by two to clock the lpm_mult. The main blocks and general flow of
data are presented in Fig. 1.
The trick is not to wait for lpm_mult to finish one multiplication before
feeding it new values. Using a counter or two, this can be accomplished
easily. One process can feed each wave data value and its corresponding
coefficient into the multiplier, while a delayed counter can start adding
values to an initially cleared accumulator as soon as data starts to emerge
from the lpm_mult. The coefficients of the FIR filter will be referred to
as h, and the values of the wave data will be referred to as n.
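The feed-and-accumulate scheme can be sketched as follows. This is a minimal illustration, not the original design. The lpm_mult is replaced by a behavioural pipeline model so that the block simulates on its own. The entity name, the generics TAPS and LATENCY, the start/done handshake and the assumption that the h and n memories deliver their data in the same cycle as feed_addr are all hypothetical. The delay between the feed counter and the accumulate counter is realized here with a shifted valid flag, which is only one of several ways to do it.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir_mac is
  generic (
    N       : positive := 20;  -- data width
    TAPS    : positive := 8;   -- number of taps (hypothetical value)
    LATENCY : positive := 4    -- multiplier pipeline depth
  );
  port (
    clk       : in  std_logic;
    start     : in  std_logic;              -- pulse once per output sample,
                                            -- only while no sum is in flight
    h_k       : in  signed(N-1 downto 0);   -- coefficient at feed_addr
    n_k       : in  signed(N-1 downto 0);   -- wave data at feed_addr
    feed_addr : out natural range 0 to TAPS-1;
    sum       : out signed(2*N-1 downto 0);
    done      : out std_logic               -- high for one clock when sum is complete
  );
end entity fir_mac;

architecture rtl of fir_mac is
  type pipe_t is array (1 to LATENCY) of signed(2*N-1 downto 0);
  signal pipe     : pipe_t := (others => (others => '0'));
  signal valid    : std_logic_vector(1 to LATENCY) := (others => '0');
  signal feeding  : std_logic := '0';
  signal feed_cnt : natural range 0 to TAPS-1 := 0;
  signal acc_cnt  : natural range 0 to TAPS-1 := 0;
  signal acc      : signed(2*N-1 downto 0) := (others => '0');
begin
  feed_addr <= feed_cnt;
  sum       <= acc;

  process (clk)
  begin
    if rising_edge(clk) then
      done <= '0';

      -- Behavioural stand-in for the pipelined multiplier: a new pair is
      -- accepted every clock, and each product emerges LATENCY clocks later.
      pipe(1)  <= h_k * n_k;
      valid(1) <= feeding;
      for i in 2 to LATENCY loop
        pipe(i)  <= pipe(i-1);
        valid(i) <= valid(i-1);
      end loop;

      -- Delayed side: accumulate as soon as products start to emerge.
      if valid(LATENCY) = '1' then
        acc <= acc + pipe(LATENCY);
        if acc_cnt = TAPS-1 then
          done    <= '1';
          acc_cnt <= 0;
        else
          acc_cnt <= acc_cnt + 1;
        end if;
      end if;

      -- Feed side: step through the taps without waiting for results.
      if start = '1' then
        feeding  <= '1';
        feed_cnt <= 0;
        acc_cnt  <= 0;
        acc      <= (others => '0');  -- accumulator starts out cleared
      elsif feeding = '1' then
        if feed_cnt = TAPS-1 then
          feeding <= '0';
        else
          feed_cnt <= feed_cnt + 1;
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```

With these assumptions, one output takes roughly TAPS + LATENCY clocks instead of TAPS * LATENCY, because the multiplier pipeline stays full while the taps are being fed.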
Note that the output width of the multiplier is 2X that of the input data.
This precision is needed all the way to the end. Thus, the adder must
support 2X the data width as well.
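The width requirement can be checked with a short bound. The symbols T (number of taps) and k (tap index) are introduced here for illustration only, and two's-complement data of width N is assumed.

```latex
% Assumptions: two's-complement values, T taps, N-bit h_k and n_k.
\[
  |h_k|,\ |n_k| \le 2^{N-1}
  \quad\Longrightarrow\quad
  |h_k\, n_k| \le 2^{2N-2} < 2^{2N-1},
\]
so every product fits in a signed $2N$-bit word. For the accumulated sum,
\[
  \Bigl|\sum_{k=0}^{T-1} h_k\, n_k\Bigr|
  \le 2^{N-1} \sum_{k=0}^{T-1} |h_k| ,
\]
which stays below $2^{2N-1}$ whenever $\sum_k |h_k| < 2^{N}$.
```

Under that condition on the coefficients, a 2*N-bit adder cannot overflow. For arbitrary coefficients, the worst case could need up to ceil(log2(T)) additional guard bits.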
At the beginning of the cycle, the adder is reset to 0. After the lpm_mult
has finished providing new data, the correct output value for the filtered
result is contained in the adder. However, it is then necessary to take it
back down to the proper size (the data width). The output of the adder can
be mapped to the proper output value as follows. Take the width of the
adder to be 2*N, and the master data width to be N. The adder's output is
adder_sum.
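    signal truncated_sum : std_logic_vector(N-1 downto 0);
    . . .
    -- keep the sign bit of the wide sum
    truncated_sum(N-1) <= adder_sum(2*N-1);
    -- keep the low N-1 bits; both sides of the assignment are N-1 bits wide
    truncated_sum(N-2 downto 0) <= adder_sum(N-2 downto 0);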
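For reference, the mapping above can be wrapped in a small self-contained entity so that it compiles on its own. The entity name and the generic default are hypothetical. The lower slice is written with an explicit range so that both sides of the assignment are N-1 bits wide. The sketch assumes N is at least 2 and that the filtered result actually fits in N bits, since any higher-order bits of adder_sum are discarded.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper around the truncation step.
entity truncate_sum is
  generic (N : positive := 20);  -- master data width (assumed N >= 2)
  port (
    adder_sum     : in  std_logic_vector(2*N-1 downto 0);
    truncated_sum : out std_logic_vector(N-1 downto 0)
  );
end entity truncate_sum;

architecture rtl of truncate_sum is
begin
  -- The sign bit comes from the top of the wide sum.
  truncated_sum(N-1)          <= adder_sum(2*N-1);
  -- The remaining N-1 bits come from the bottom of the wide sum.
  truncated_sum(N-2 downto 0) <= adder_sum(N-2 downto 0);
end architecture rtl;
```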