Matrix Arithmetic Logic Unit (MALU)

The core of the MAC is the Matrix Arithmetic Logic Unit (MALU). The overall functionality of the MALU is to perform the matrix operations and write the output to memory. Once the last keypad entry, which is the matrix operation, has been entered, the control logic interprets the operation required. According to the operation entered, the control will enable the appropriate operator module. The operator module will read the matrix elements from RAM, perform the required operations, and store the result into RAM. There are 3 sets of RAM implemented on the FPGA. They correspond to the two input matrices, and the matrix resulting from the arithmetic operations. The MALU is composed of a control path and a data path. Neither path, however, can be discussed until the underlying matrix operations algorithms are understood.

Algorithms

Matrix Addition/Subtraction

The algorithms for adding and subtracting matrices are the same. The main idea is that similarly addressed cells of the two input matrices are added together or subtracted from each other. The result is then placed at the same address (cell) in the output matrix. Given three 4x4 matrices, A, B, and C, the following pseudocode describes matrix addition. The same pseudocode can be used to describe matrix subtraction by replacing the "+" sign with a "-" sign.

FOR i = 1 TO 4 DO

    FOR j = 1 TO F DO

        C[i, j] := A[i, j] + B[i, j];

    END FOR;

END FOR;

A possible hardware implementation requires a counter, an adder/subtractor, three RAM modules (each 16 words deep), and several registers. The basic idea is values from two RAM modules will be passed to the adder/subtractor module. The result of the adder/subtractor will then be written to the third RAM module. The counter will be used to address the three RAM modules while the registers will delay the address signal to the output RAM. These registers will compensate for the delay caused by the adder/subtractor logic. This is important to ensure that the result is written to the proper address. Figure 7 illustrates the matrix addition/subtraction hardware.

figure7.gif (12935 bytes)

 

Figure 7 – Hardware implementation of a Matrix Adder.

 

Matrix Multiplication

Matrix multiplication is slightly more complex than addition and subtraction. Basically, every row of the first matrix must be multiplied by every column of the second matrix. For a row-column pair, every element of the row is multiplied by the similarly addressed element of the column. The sum of the four products forms one element of the resulting matrix. The following pseudocode describes this process:

FOR k = 1 TO 4 DO

    FOR j = 1 TO 4 DO

        FOR i = 1 TO 4 DO

            C[k, j] := C[k, j] + A[j, i] * B[i, j];

        END FOR;

    END FOR;

END FOR;

For hardware implementation, one must consider the interdependencies of elements. For example, each element in the resulting matrix requires four separate multiplications and a sum of all four products. For efficiency purposes, it is advantageous to perform all possible multiplications in one pass, then sum all products in a second pass. One can observe a pattern when performing multiplication. If all matrices are addressed from 0 to 15 as illustrated in figure 8:

0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

Figure 8 – Address representation of a matrix.

then element 0 from one matrix must be multiplied by elements 0, 1, 2, and 3 from the second matrix. The complete pattern is shown in table 3 where each element in the first column must be multiplied by each element in the second column.

0, 4, 8, 12

0, 1, 2, 3

1, 5, 9, 13

4, 5, 6, 7

2, 6, 10, 14

8, 9, 10, 11

3, 7, 11, 15

12, 13, 14, 15

Table 3 - Multiplication Pattern

Once all multiplications are performed, the products are added together to form the final output. Figure 9 illustrates the two-step process of multiplying matrices. The figure is also representative of the RAM organization in the actual hardware implementation. The first level in the figure represents the two input matrices in RAM. The second level represents temporary memory storing values after the first step in the multiplication process. The sum of all elements in a row (in the second level of the figure) forms the final value of the resulting matrix. The resulting matrix is illustrated in the third level of the figure.

Level 1:

Address

 

Matrix A

 

Matrix B

0   A0   B0
1   A1   B1
2   A2   B2
3   A3   B3
4   A4   B4
5   A5   B5
6   A6   B6
7   A7   B7
8   A8   B8
9   A9   B9
10   A10   B10
11   A11   B11
12   A12   B12
13   A13   B13
14   A14   B14
15   A15   B15
         
Level 2:

Address

 

Temp A

 

Temp B

 

Temp C

 

Temp D

0   A0 x B0   A1 x B4   A2 x B8   A3 x B12
1   A0 x B1   A1 x B5   A2 x B9   A3 x B13
2   A0 x B2   A1 x B6   A2 x B10   A3 x B14
3   A0 x B3   A1 x B7   A2 x B11   A3 x B15
4   A4 x B0   A5 x B4   A6 x B8   A7 x B12
5   A4 x B1   A5 x B5   A6 x B9   A7 x B13
6   A4 x B2   A5 x B6   A6 x B10   A7 x B14
7   A4 x B3   A5 x B7   A6 x B11   A7 x B15
8   A8 x B0   A9 x B4   A10 x B8   A11 x B12
9   A8 x B1   A9 x B5   A10 x B9   A11 x B13
10   A8 x B2   A9 x B6   A10 x B10   A11 x B14
11   A8 x B3   A9 x B7   A10 x B11   A11 x B15
12   A12 x B0   A13 x B4   A14 x B8   A15 x B12
13   A12 x B1   A13 x B5   A14 x B9   A15 x B13
14   A12 x B2   A13 x B6   A14 x B10   A15 x B14
15 A12 x B3 A13 x B7 A14 x B11 A15 x B15
Level 3:

Address

 

Matrix C

0   A0 x B0 + A1 x B4 + A2 x B8 + A3 x B12
1   A0 x B1 + A1 x B5 + A2 x B9 + A3 x B13
2   A0 x B2 + A1 x B6 + A2 x B10 + A3 x B14
3   A0 x B3 + A1 x B7 + A2 x B11 + A3 x B15
4   A4 x B0 + A5 x B4 + A6 x B8 + A7 x B12
5   A4 x B1 + A5 x B5 + A6 x B9 + A7 x B13
6   A4 x B2 + A5 x B6 + A6 x B10 + A7 x B14
7   A4 x B3 + A5 x B7 + A6 x B11 + A7 x B15
8   A8 x B0 + A9 x B4 + A10 x B8 + A11 x B12
9   A8 x B1 + A9 x B5 + A10 x B9 + A11 x B13
10   A8 x B2 + A9 x B6 + A10 x B10 + A11 x B14
11   A8 x B3 + A9 x B7 + A10 x B11 + A11 x B15
12   A12 x B0 + A13 x B4 + A14 x B8 + A15 x B12
13   A12 x B1 + A13 x B5 + A14 x B9 + A15 x B13
14   A12 x B2 + A13 x B6 + A14 x B10 + A15 x B14
15   A12 x B3 + A13 x B7 + A14 x B11 + A15 x B15

Figure 9 – Two step matrix multiplication process.

A possible hardware implementation of a matrix multiplier involves seven memory modules (each 16 words deep), a multiplier, three adders, four different counters, a 2-4 decoder, four 2-1 multiplexers, and several 4-bit registers. Figure 10 illustrates a hardware implementation of a matrix multiplier. This implementation assumes that the input matrices have already been initialized. Counters A and B will control the addressing of Matrices A and B respectively. Counter A will increment at one-fourth the frequency of Counter B. Counter C is a two-bit counter that selects one of the four temporary RAM modules for writing. Each of the output bits from the decoder is multiplexed with a logic low and sent to the write-enable of one of the temporary memories. When write is enabled, the decoder determines which of the four temporary memories is written to. The following truth table describes the behaviour of the decoder. The decoder input is the two-bit counter.

Counter

Decoder

00

0001

01

0010

10

0100

11

1000

Counter D has two addressing modes depending on the read/write state of the temporary memory. When in write-mode, the counter increments every four clock-cycles to ensure that each temporary memory module is written to prior to an address change. When in read mode, the counter increments every clock cycle. This addressing scheme is used to read from the temporary memories and write to the final result memory. The address bus controlling the output RAM must be pipelined to compensate for the combinational logic delay of the adders.

Two stages of adders are used to add all the products to form an element of the result matrix. The first stage is made up of two adders, each of which adds two of the four products. The second stage consists of one adder that adds the outputs of the previous adders together. The output of this third adder is an element of the final matrix product and is written to the output RAM.

figure10.gif (20594 bytes)

Data Path

Due to the limited number of logic cells, the MALU is based on the multiplier hardware implementation. The control path can bypass the multiplier and the first two adders for addition and subtraction operations. Basically, data is channeled directly to the third adder. The devices used in this data path are identical to those described above except for two new introductions. First of all, registers are placed between the input RAM modules and the multiplier. These registers are used to enable and disable the multiplier. The second addition is a custom designed multiplexer that selects one of two 16-bit busses. Using a control bit, the multiplexer is used to select the output of the multiplier logic block or data from the input matrices. This allows data to bypass the multiplier and allows addition and subtraction operations to be performed.

Control Path

The control path is responsible for retrieving data from the BCD-to-Binary converter (input) and placing it into the appropriate memory locations. Basically, the first sixteen inputs make up Matrix A, the second sixteen inputs constitute Matrix B, and the last input represents the desired matrix operation. The control path determines the operation and directs data from memory accordingly. The control path generates all address signals, all read/write signals, and handles all handshaking signals. Control Path also echoes the numbers being input to RAM to the LCD logic. When the matrix operation is complete, the Control Path outputs the resulting matrix from RAM to the LCD logic for display. Figure 11 is the state diagram for the control path.

figure11.gif (5137 bytes)

Figure 11 - State Diagram of MALU's Control Path