# Transmission Gate-based Approximate Adders for Inexact Computing

Zhixi Yang and Jie Han, Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada (zhixi@ualberta.ca, jhan8@ualberta.ca)

Fabrizio Lombardi, Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115, USA (lombardi@ece.neu.edu)

Abstract—Power dissipation has become a significant concern for integrated circuit design in nanometric CMOS technology. To reduce power consumption, approximate implementations of a circuit have been considered as a potential solution for applications in which strict exactness is not required. In approximate computing, power reduction is achieved through the relaxation of the often demanding requirement of accuracy. In this paper, new approximate adders are proposed for low-power imprecise applications by using logic reduction at the gate level as an approach to relaxing numerical accuracy. Transmission gates are utilized in the designs of two approximate full adders with reduced complexity. A further positive feature of the proposed designs is the reduction of the critical path delay. The approximate adders show advantages in terms of power dissipation over accurate and recently proposed approximate adders. An image processing application is presented using the proposed approximate adders to evaluate the efficiency in power and delay at application level.

Keywords—approximate computing; error rate; low power; error distance; approximate adder

## I. INTRODUCTION

Commonly used multimedia applications rely on digital signal processing (DSP) blocks as core components. Most of these DSP blocks implement algorithms whose output is an image or a video for human perception and analysis. The limited perception of human senses allows the outputs of these algorithms to be numerically approximate rather than accurate [1]. This relaxation of numerical exactness permits imprecise or approximate computation, and the development of imprecise, simplified arithmetic units provides an additional layer of power saving beyond conventional low-power design techniques.

Adders have been investigated for approximate implementations [2]. Speculative approximate adders are proposed in [3, 4] to achieve better performance in terms of area, power and delay than accurate adders. The basic principle of these designs is to truncate the long carry chain of a multi-bit adder by using several sub-adders to calculate the sum. In [5], OR gates are used to add the less significant bits (LSBs), while the more significant bits (MSBs) use accurate adders to preserve accuracy. Logic reduction is considered in [6], where transistors are removed from a mirror adder to obtain simplified approximate mirror adders (AMAs). In [7], pass transistors are used in XOR/XNOR-based approximate adders. However, these designs suffer from severe signal degradation, because a PMOS transistor passes a weak '0' and an NMOS transistor passes a weak '1'.

This paper proposes two new multiplexer-based approximate adder designs. In contrast to [7], transmission gates (TGs) are used as circuit components in the proposed designs. The TG is a promising replacement for the pass transistor and is commonly used to implement the lookup tables (LUTs) of field programmable gate arrays (FPGAs) [8]. TG-based multiplexers are utilized because they dissipate less power than conventional CMOS multiplexers [9], [10].

In this paper, the approximate adders are designed for reducing circuit complexity (in the number of transistors) at the gate level by removing some of the gates from the original full adder. Additionally, the node capacitances and thus the dynamic power are also reduced in the proposed circuits. Delay, power, area and power-delay product are measured and the proposed designs are compared to a truncated adder. The metrics of error distance (ED) and mean error distance (MED) [11] are used to compare the proposed designs with other approximate adders in terms of accuracy. Extensive simulation results are provided to show the effectiveness of the proposed designs.

This paper is organized as follows. Section II presents a brief review, and Section III presents the two new transmission gate-based approximate adders (TGAs), followed by a comparative study in Section IV. Section V presents an image processing application, and Section VI concludes the paper.

## II. PRELIMINARIES AND REVIEW

In this section, accuracy metrics are introduced for error analysis with a brief review of transmission gates (TGs) and TG-based multiplexers.

### A. Error Metrics

The error distance (ED) and mean error distance (MED) are proposed in [11] to characterize the accuracy of approximate arithmetic circuits. The ED is defined as the absolute difference between the accurate and approximate output values, i.e.,

$$ED = |R' - R|, \tag{1}$$

where $R'$ ($R$) denotes the output value of an approximate (accurate) circuit.

The MED is defined as the average of EDs for a set of input values, i.e.,

$$MED = E[ED] = \sum ED_i P(ED_i), \tag{2}$$

where $P(ED_i)$ is the probability that the error distance takes the value $ED_i$.

The error rate (ER) is defined as the percentage of incorrect output values among all outputs, i.e.,

$$ER = \frac{\#\,\text{incorrect outputs}}{\#\,\text{total outputs}}. \tag{3}$$

Accordingly, the pass rate (PR) is defined as 1 - ER. Circuit-level metrics such as power and delay are also used in the evaluation.
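The three accuracy metrics can be illustrated with a short Python sketch; it is an illustration, not the evaluation code used for this paper. The hypothetical helper `error_metrics` takes paired accurate and approximate output values and assumes that all input combinations are equally likely, so that $P(ED_i)$ in (2) reduces to a relative frequency.

```python
from fractions import Fraction

def error_metrics(accurate, approximate):
    """Return (list of EDs, MED, ER, PR) for paired outputs.

    Assumes every input combination is equally likely, so the
    weighted sum of Eq. (2) becomes a plain average.
    """
    if len(accurate) != len(approximate) or not accurate:
        raise ValueError("need two non-empty lists of equal length")
    eds = [abs(r_approx - r) for r, r_approx in zip(accurate, approximate)]  # Eq. (1)
    n = len(eds)
    med = Fraction(sum(eds), n)                       # Eq. (2), uniform inputs
    er = Fraction(sum(1 for e in eds if e != 0), n)   # Eq. (3)
    return eds, med, er, 1 - er                       # PR = 1 - ER

# Toy example: four outputs, two of which are wrong (by 1 and by 3).
eds, med, er, pr = error_metrics([3, 5, 7, 9], [3, 6, 7, 6])
print(eds, med, er, pr)  # [0, 1, 0, 3] 1 1/2 1/2
```

Exact fractions are used so that small exhaustive evaluations give MED and ER values without rounding.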

### B. Transmission Gate based Multiplexer

A PMOS or an NMOS transistor can be used as an imperfect switch; this structure is commonly referred to as a pass transistor. A pass transistor suffers from a loss of signal strength due to the threshold voltage drop [12], which determines how closely the output signal approaches an ideal voltage source: a PMOS transistor passes a degraded 0, while an NMOS transistor passes a degraded 1. This degradation can be severe enough to violate the noise margin of the next stage [12]. A transmission gate consists of an NMOS and a PMOS transistor connected in parallel, with their gates controlled by complementary signals (Fig. 1).

A transmission gate passes both a strong 0 and a strong 1. Hence, transmission gates are often used to implement multiplexers: a gate-level multiplexer requires rather complex circuitry, so a transmission gate-based multiplexer is a better option for reducing both power and delay. The implementation of a transmission gate-based multiplexer is shown in Fig. 2; the signal *sel* is inverted (using an additional inverter) to provide the complementary control signal for the two transmission gates. TG-based XOR and XNOR gates are shown in Fig. 3. Compared with Fig. 2, a TG-based XOR/XNOR gate is a special case of a multiplexer whose two data inputs are complementary signals.
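At the logic level, the behaviour of Figs. 2 and 3 can be sketched in Python. This is only a Boolean model, assuming that *sel* = 0 routes the first data input to the output; it does not capture the electrical advantage of TGs over pass transistors, and the function names are hypothetical.

```python
def tg_mux(sel: int, in0: int, in1: int) -> int:
    """Boolean model of the TG multiplexer of Fig. 2.

    One transmission gate conducts when sel = 0 and passes in0; the
    inverter-generated complement of sel turns on the other gate when
    sel = 1, which passes in1.
    """
    return in1 if sel else in0

def tg_xor(x: int, y: int) -> int:
    # XOR as a multiplexer with complementary data inputs: x selects y or NOT y.
    return tg_mux(x, y, 1 - y)

def tg_xnor(x: int, y: int) -> int:
    # XNOR simply swaps the two complementary data inputs.
    return tg_mux(x, 1 - y, y)

for x in (0, 1):
    for y in (0, 1):
        print(x, y, tg_xor(x, y), tg_xnor(x, y))
```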



Fig. 1. (a) A transmission gate (TG) and (b) symbol for TG.



Fig. 2. A transmission gate based multiplexer.



Fig. 3. Transmission gate based (a) XOR (b) XNOR gates.

## III. PROPOSED APPROXIMATE ADDERS

In this section, two approximate adders are proposed. It is shown in [13] that a transmission gate based full adder exhibits good power and delay performance with a rather simple circuit. Fig. 4 shows a transmission gate-based full adder.

The adder consists of three modules (as enclosed in the red, blue and brown blocks in Fig. 4).

- The first module is an XOR gate with inputs X and Y.
- The second module is an XOR gate for generating Sum.
- The last module is a MUX for generating Cout.

This implementation is based on transmission gates and several inverters. Table I shows the truth table of an accurate full adder, where each entry CS gives $C_{out}$ followed by *Sum*.



Fig. 4. Transmission gate based accurate full adder [13].



Fig. 5. Transmission gate based full adder by inserting inverters.



TABLE I. TRUTH TABLE OF AN ACCURATE FULL ADDER (ENTRIES ARE CS)

| C<sub>in</sub> \ XY | 00 | 01 | 11 | 10 |
|---------------------|----|----|----|----|
| 0                   | 00 | 01 | 10 | 01 |
| 1                   | 01 | 10 | 11 | 10 |

However, cascading more than two transmission gates in series significantly increases the delay unless buffers are inserted [12]. Hence, buffers (each consisting of two inverters) are added in Fig. 5, rather than connecting the transmission gates in series as in the full adder of Fig. 4. Because the transmission gate-based multiplexer has a simple structure, it can be used to realize complex logic functions efficiently. The proposed approximate adders are obtained either by removing some transistors or by changing some of the signals that generate *Sum* or $C_{out}$.
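The module decomposition of Fig. 4 can be checked against Table I with a small Boolean model. The sketch assumes that module 1 produces P = X ⊕ Y, that module 2 forms *Sum* by selecting $C_{in}$ or its complement with P, and that module 3 selects X when P = 0 (in which case X = Y) and $C_{in}$ when P = 1; the exact signal assignment in Fig. 4 may differ, and all names are illustrative.

```python
# Table I as dictionaries: TABLE_I[cin][column] is the string "CoutSum",
# with the XY columns in the order used by the table.
COLUMNS = ["00", "01", "11", "10"]
TABLE_I = {0: ["00", "01", "10", "01"], 1: ["01", "10", "11", "10"]}

def tg_mux(sel, in0, in1):
    return in1 if sel else in0

def accurate_tg_fa(x, y, cin):
    p = x ^ y                      # module 1: XOR of X and Y
    s = tg_mux(p, cin, 1 - cin)    # module 2: Sum = P XOR Cin
    cout = tg_mux(p, x, cin)       # module 3: pass Cin if P = 1, else X (= Y)
    return cout, s

def check(cell, table):
    # Compare a (Cout, Sum) cell model with a truth table entry by entry.
    for cin in (0, 1):
        for col, xy in enumerate(COLUMNS):
            cout, s = cell(int(xy[0]), int(xy[1]), cin)
            assert f"{cout}{s}" == table[cin][col], (xy, cin)
    return True

print(check(accurate_tg_fa, TABLE_I))  # True
```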

Figs. 6 and 7 show the proposed approximate adders using transmission gates (denoted as TGA1 and TGA2). The feature common to both TGAs is that the first module (stage) is implemented by a single XOR gate, which reduces the node capacitance and thus lowers both power dissipation and delay.



Fig. 6. Transmission gate based approximate adder 1 (TGA1).

### A. Approximate Adder TGA1

In Table I, $C_{out}$ equals Y in six of the eight cases, so a simpler design is obtained by connecting Y directly to $C_{out}$. Connecting an input to $C_{out}$ also removes the carry propagation path from $C_{in}$ to $C_{out}$. The *Sum* signal is modified as well. In Fig. 4, an additional inverter is used to invert $C_{in}$ when generating *Sum*; TGA1 removes this inverter and instead connects input X directly to the transmission gate-based multiplexer (Fig. 6). These simplifications produce two incorrect results out of eight, so the ER of TGA1 is 2/8 for CS. Table II shows the truth table of TGA1, and (4) and (5) give its logic functions.

$$Sum = (X \overline{\oplus} Y)C_{in} + X\overline{Y}, \tag{4}$$

$$C_{out} = Y \tag{5}$$

TABLE II. TRUTH TABLE FOR TGA1 (ENTRIES ARE CS)

| C<sub>in</sub> \ XY | 00 | 01 | 11 | 10 |
|---------------------|----|----|----|----|
| 0                   | 00 | 10 | 10 | 01 |
| 1                   | 01 | 10 | 11 | 01 |
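A quick way to confirm that (4) and (5) reproduce Table II, and to locate the two errors relative to Table I, is to enumerate all eight input combinations. The following Python sketch does this; the helper names are illustrative.

```python
COLUMNS = ["00", "01", "11", "10"]   # XY column order used in Tables I-III
TABLE_I = {0: ["00", "01", "10", "01"], 1: ["01", "10", "11", "10"]}
TABLE_II = {0: ["00", "10", "10", "01"], 1: ["01", "10", "11", "01"]}

def tga1(x, y, cin):
    xnor = 1 - (x ^ y)
    s = (xnor & cin) | (x & (1 - y))  # Eq. (4)
    cout = y                          # Eq. (5)
    return cout, s

def truth_table(cell):
    # Map each Cin row to the list of "CoutSum" strings in COLUMNS order.
    return {cin: ["".join(map(str, cell(int(xy[0]), int(xy[1]), cin))) for xy in COLUMNS]
            for cin in (0, 1)}

table = truth_table(tga1)
errors = [(cin, xy) for cin in (0, 1) for col, xy in enumerate(COLUMNS)
          if table[cin][col] != TABLE_I[cin][col]]
print(table == TABLE_II, errors)  # True [(0, '01'), (1, '10')]
```

Because $X\overline{Y} = X$ whenever $X \oplus Y = 1$, (4) is the multiplexer function that passes $C_{in}$ when X and Y are equal and passes X otherwise, which is consistent with connecting X directly to the *Sum* multiplexer.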

### B. Approximate Adder TGA2

In Table I, when X or Y is "1", $C_{out}$ is "1" in four of the six cases; thus an approximate adder can use only an OR gate to generate $C_{out}$. In this case, only two incorrect $C_{out}$ values are generated (shown in bold in Table III): when the input combination $XYC_{in}$ is "010" or "100", the accurate adder generates a $C_{out}$ of "0", while TGA2 generates "1". The ER of TGA2 is therefore two out of eight for CS.

As in TGA1, $C_{out}$ depends only on X and Y, so there is no carry propagation in TGA2, which reduces the delay. For *Sum*, the multiplexer input that is connected to X in TGA1 is connected to Gnd in TGA2; as a result, TGA2 produces two incorrect *Sum* values. Fig. 7 shows TGA2, and (6) and (7) give its output functions.

$$Sum = (X \overline{\oplus} Y)C_{in} \tag{6}$$

$$C_{out} = X + Y \tag{7}$$



Fig. 7. Transmission gate based approximate adder 2 (TGA2).

TABLE III. TRUTH TABLE FOR TGA2 (ENTRIES ARE CS)

| C<sub>in</sub> \ XY | 00 | 01     | 11 | 10     |
|---------------------|----|--------|----|--------|
| 0                   | 00 | **10** | 10 | **10** |
| 1                   | 01 | 10     | 11 | 10     |
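The errors of TGA2 can be enumerated in the same way. The Python sketch below compares the 2-bit value $2C_{out} + Sum$ produced by (6) and (7) with the exact value $X + Y + C_{in}$; input strings follow the $XYC_{in}$ order used in the text, and the names are illustrative.

```python
from itertools import product

def tga2(x, y, cin):
    s = (1 - (x ^ y)) & cin  # Eq. (6)
    cout = x | y             # Eq. (7)
    return cout, s

wrong = []
for x, y, cin in product((0, 1), repeat=3):   # inputs in XYCin order
    cout, s = tga2(x, y, cin)
    ed = abs(2 * cout + s - (x + y + cin))    # Eq. (1) for a single cell
    if ed:
        wrong.append((f"{x}{y}{cin}", ed))
print(wrong)  # [('010', 1), ('100', 1)]
```

Both erroneous cases overestimate the exact value by one, which is the only single-cell ED that TGA2 produces.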

## IV. SIMULATION AND COMPARISON

In this section, the two proposed approximate adders are evaluated and compared in terms of power, delay and accuracy. The approximate mirror adders (AMAs) in [6] are used for comparison; AMA1-4 are obtained by removing transistors from the accurate mirror full adder. In our simulations, 8- and 16-bit adders are built by replacing the less significant half of a ripple carry adder (RCA) with TGAs or AMA1-4. For the more significant half of the RCA, accurate mirror adders are used in the AMA-based adders, and the TG-based full adder (Fig. 5) is used in the TGA-based adders. The lower-part OR adder (LOA) [5], with accurate mirror adders in its more significant part, is also included in the comparison.
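The adder construction described above (approximate cells in the less significant part, accurate full adders elsewhere) can be modelled at the logic level to estimate MED and PR. In this hypothetical Python sketch, the carry-in of the least significant cell is assumed to be 0, the accurate cells are modelled by the Boolean full adder equations, and the final carry-out is kept as an extra output bit; the model is not claimed to reproduce Table IV.

```python
from fractions import Fraction

def exact_fa(x, y, cin):
    # Accurate full adder (Boolean model of the Fig. 5 cell); returns (Cout, Sum).
    return (x & y) | (cin & (x ^ y)), x ^ y ^ cin

def tga1(x, y, cin):
    # Eqs. (5) and (4): Cout = Y, Sum = (X XNOR Y)Cin + X(NOT Y).
    return y, ((1 - (x ^ y)) & cin) | (x & (1 - y))

def tga2(x, y, cin):
    # Eqs. (7) and (6): Cout = X + Y, Sum = (X XNOR Y)Cin.
    return x | y, (1 - (x ^ y)) & cin

def hybrid_rca(a, b, n, k, approx_cell, cin=0):
    """n-bit RCA whose k least significant cells use approx_cell.

    Returns the (n + 1)-bit result, i.e. the final carry-out is kept.
    """
    result, carry = 0, cin
    for i in range(n):
        x, y = (a >> i) & 1, (b >> i) & 1
        cell = approx_cell if i < k else exact_fa
        carry, s = cell(x, y, carry)
        result |= s << i
    return result | (carry << n)

def exhaustive_med_pr(n, k, approx_cell):
    # Exhaustive, uniformly weighted inputs: MED as in Eq. (2), PR = 1 - ER.
    eds = [abs(hybrid_rca(a, b, n, k, approx_cell) - (a + b))
           for a in range(2 ** n) for b in range(2 ** n)]
    return Fraction(sum(eds), len(eds)), Fraction(eds.count(0), len(eds))

print(exhaustive_med_pr(4, 1, tga1))  # (Fraction(1, 4), Fraction(3, 4))
print(exhaustive_med_pr(4, 1, tga2))  # (Fraction(1, 2), Fraction(1, 2))
```

For the 8-bit configuration with the less significant half approximated, one would call `exhaustive_med_pr(8, 4, tga1)`; enumerating all $2^{16}$ input pairs is still feasible, whereas a 16-bit RCA calls for random sampling instead.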

A truncation-based adder is used as a baseline for assessing the approximate designs. In this adder, the Sum outputs of the truncated (less significant) section are connected to ground; since this section produces no sum, the carry-in of the higher section is connected to one of the most significant inputs of the truncated part. The number of truncated bits must be selected carefully for a fair comparison. Power dissipation is therefore fixed as the reference for comparing accuracy and delay: the truncated adder is configured so that its power is similar to that of the TGA1-based approximate adder (because TGA1 has the lowest power). Under this condition, 3 and 6 bits are truncated for the 8-bit and 16-bit adders, resulting in power consumptions of 2.03 μW and 4.69 μW, respectively. Power and delay are measured with the Cadence Ultrasim SPICE simulator at a clock frequency of 100 MHz.
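A minimal Python model of the truncated baseline is sketched below. It assumes that the carry into the upper section is taken from bit t-1 of operand `a` (one possible reading of "one of the most significant inputs" of the truncated part); the function names are hypothetical.

```python
from fractions import Fraction

def truncated_add(a, b, t):
    """Hypothetical model of the truncated baseline adder.

    The t least significant sum bits are tied to 0, and the carry into the
    upper section is taken from bit t-1 of operand a (assumed choice).
    """
    if t == 0:
        return a + b
    carry_in = (a >> (t - 1)) & 1
    upper = (a >> t) + (b >> t) + carry_in  # exact upper section, carry-out kept
    return upper << t                        # truncated sum bits are all 0

def med_pr(n, t):
    # Exhaustive n-bit evaluation with uniformly distributed inputs.
    eds = [abs(truncated_add(a, b, t) - (a + b))
           for a in range(2 ** n) for b in range(2 ** n)]
    return Fraction(sum(eds), len(eds)), Fraction(eds.count(0), len(eds))

med, pr = med_pr(8, 3)
print(med, pr)  # 29/8 5/64
```

Under these assumptions the model gives MED = 3.625 and PR ≈ 7.8% for the 8-bit adder with 3 truncated bits; these are close to, but not identical with, the Table IV entries, so the model should be read only as an approximation of the simulated circuit.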

Accuracy comparison in terms of MED and PR is also assessed. A comprehensive comparison by considering both accuracy and power/delay is presented to show significant savings with an acceptable accuracy for the TGAs.

### A. Accuracy

Table IV summarizes the comparison results, including accuracy (MED and PR) and circuit characteristics. All input combinations are applied exhaustively to analyze the 8-bit RCA, while one million randomly generated inputs are used to evaluate the 16-bit RCA.

- For the 8-bit RCA, TGA1 and TGA2 have a significantly higher PR (i.e., 1-ER) and lower MED than AMA1-4 and the truncated adder.
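- TGA2 has the lowest MED for both 8- and 16-bit RCAs; it also has the largest PR for the 16-bit RCA (equal to that of AMA1) and, apart from LOA, the largest PR for the 8-bit RCA. Thus, overall, TGA2 is the most accurate design in terms of MED and ER.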
- AMA2 and the truncated adder have the lowest PR for 8 and 16 bit RCAs respectively, while AMA3 has the largest MED for both 8 and 16 bit RCAs.

### B. Power

The power consumption is found by using the Cadence Ultrasim simulator with an STM 65nm CMOS standard cell library. For an 8 (16) bit RCA, 5k (25k) randomly generated inputs are used for evaluating the power. A fanout of four standard-sized inverters is applied as load. As shown in Table IV, TGA1 has the lowest power dissipation for both the 8-bit and 16-bit approximate adders. AMA1 has the largest power dissipation compared to the other approximate adder designs.

### C. Delay

The delays of the TGAs, AMAs and LOA are also measured with Ultrasim using the STM 65nm standard cell library and are reported in Table IV for both 8- and 16-bit RCAs. Four standard-sized inverters are used as the output load. These results lead to the following conclusions.

- TGA1/TGA2 and AMA4 have the shortest delays for the 8-bit and 16-bit RCAs, respectively, while AMA2 has the largest delay in both cases.
- The delay of AMA2 is roughly 1.7 to 1.8 times that of AMA4. The critical path of the AMA4-based RCA is about half that of an accurate *n*-bit mirror adder (i.e., about *n*/2 stages, where *n* is the length of the RCA), whereas the carry propagation path of AMA2 is the same as that of the accurate mirror full adder, so its critical path delay is nearly that of the *n*-bit mirror adder.
- AMA4 has a shorter delay than TGA1 and TGA2 for the 16-bit RCA; however, its delay is larger than those of TGA1 and TGA2 for the 8-bit RCA.

In summary, TGA1 and TGA2 have shorter delays than the AMAs, except AMA4 in the 16-bit adder.

TABLE IV. COMPARISON OF 8-BIT (8B) AND 16-BIT (16B) RCAS WITH APPROXIMATE ADDERS

| Design     | Power 8b (μW) | Power 16b (μW) | Delay 8b (ns) | Delay 16b (ns) | PDP 8b (10<sup>-15</sup> J) | PDP 16b (10<sup>-15</sup> J) | MED 8b | MED 16b | PR 8b (%) | PR 16b (%) |
|------------|---------------|----------------|---------------|----------------|-----------------------------|------------------------------|--------|---------|-----------|------------|
| Truncation | 2.03          | 4.69           | 0.463         | 0.975          | 0.94                        | 4.57                         | 3.61   | 33.92   | 7.69      | 0.81       |
| TGA1       | 1.97          | 4.30           | 0.445         | 0.934          | 0.88                        | 4.02                         | 2.94   | 44.71   | 20.77     | 10         |
| TGA2       | 2.28          | 4.95           | 0.447         | 0.937          | 1.02                        | 4.63                         | 2.67   | 32.31   | 23.08     | 14.1       |
| AMA1       | 2.79          | 6.22           | 0.778         | 1.597          | 2.17                        | 9.60                         | 3.04   | 34.63   | 15.38     | 14.1       |
| AMA2       | 2.74          | 5.92           | 0.809         | 1.601          | 2.22                        | 9.48                         | 3.68   | 59.65   | 3.46      | 10.1       |
| AMA3       | 2.55          | 5.49           | 0.768         | 1.562          | 1.96                        | 8.58                         | 4.57   | 71.27   | 8.08      | 2.44       |
| AMA4       | 2.53          | 5.72           | 0.471         | 0.895          | 1.23                        | 5.12                         | 4.36   | 66.29   | 10        | 2.42       |
| LOA        | 2.05          | 4.59           | 0.499         | 0.923          | 1.02                        | 4.24                         | 2.83   | 47.81   | 31.15     | 9.98       |

### D. Power-Delay Product (PDP)

As shown in Table IV, the proposed designs have lower PDPs than the AMAs, and TGA1 has the lowest PDP among all designs. LOA also outperforms the AMAs (i.e., it has a lower PDP) owing to its lower power and, compared with AMA1-3, its shorter delay. Overall, TGA1 performs best in terms of PDP, while LOA has a PDP lower than those of the AMAs and no higher than that of TGA2.

### E. Comprehensive Comparison

In this section, accuracy and circuit-related metrics are assessed jointly using the power-delay-MED product (PDMP) and the power-delay-ER product (PDEP); both are normalized for better presentation.

The normalized PDMP for a group of designs is defined as:

$$NPDMP_i = \frac{PDMP_i}{PDMP_{max}}, \tag{8}$$

where *i* denotes one of the designs in the group consisting of the proposed approximate adders and the reference designs, i.e., TGA1-2, AMA1-4 and LOA, and $PDMP_{max}$ is the largest PDMP in the group. The normalized PDEP (NPDEP) is defined similarly.
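The normalization in (8) can be sketched in Python using the 16-bit columns of Table IV. The sketch assumes that PDMP is the product of the PDP column and MED, that PDEP is the product of the PDP column and ER = 1 - PR, and that the truncated adder (TRU) is included in the group as in Fig. 8; the resulting ordering can be compared with Figs. 8 and 9.

```python
# 16-bit columns of Table IV: (PDP in 1e-15 J, MED, PR in %). TRU = truncated adder.
TABLE_IV_16B = {
    "TRU":  (4.57, 33.92, 0.81),
    "TGA1": (4.02, 44.71, 10.0),
    "TGA2": (4.63, 32.31, 14.1),
    "AMA1": (9.60, 34.63, 14.1),
    "AMA2": (9.48, 59.65, 10.1),
    "AMA3": (8.58, 71.27, 2.44),
    "AMA4": (5.12, 66.29, 2.42),
    "LOA":  (4.24, 47.81, 9.98),
}

def normalize(values):
    # Divide every entry by the largest one, as in Eq. (8).
    top = max(values.values())
    return {name: v / top for name, v in values.items()}

pdmp = {d: pdp * med for d, (pdp, med, _) in TABLE_IV_16B.items()}
pdep = {d: pdp * (1 - pr / 100) for d, (pdp, _, pr) in TABLE_IV_16B.items()}  # ER = 1 - PR

npdmp, npdep = normalize(pdmp), normalize(pdep)
for name in sorted(npdmp, key=npdmp.get):  # ascending NPDMP, as in Fig. 8
    print(f"{name:5s} NPDMP = {npdmp[name]:.3f}  NPDEP = {npdep[name]:.3f}")
```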

Figs. 8 and 9 show the normalized PDMP and PDEP comparisons for 16-bit RCAs, respectively, each sorted in ascending order.



Fig. 8. Normalized PDMP comparison for 16-bit RCA (TRU represents the truncated adder).



Fig. 9. Normalized PDEP comparison for 16 bit RCA.

As shown in Fig. 8, TGA2 has the lowest PDMP, while TGA1 has a larger PDMP than the truncated adder (despite its lower PDP in Table IV) because of its higher MED. AMA3 has the largest PDMP, owing to its largest MED. Taking the truncated adder as the baseline, TGA2 outperforms it in PDMP because of its better accuracy (its 16-bit PDP is slightly higher, as shown in Table IV). In contrast, AMA1-4 all have larger PDMPs than the baseline, while LOA performs better than the AMAs.

For the PDEP (Fig. 9), TGA1 and TGA2 outperform the AMAs owing to their lower PDPs and, relative to AMA3 and AMA4, their lower error rates (higher PRs in Table IV). Moreover, TGA1 has a lower PDEP than LOA because of its lower PDP.

## V. IMAGE PROCESSING

Using the proposed approximate adders, an image sharpening algorithm is implemented in Matlab on a computer with an Intel Core i5 processor and 4 GB of RAM. The quality of the sharpened images is measured by the peak signal-to-noise ratio (*PSNR*). The PSNR is commonly used to measure the quality of a reconstruction process that involves information loss, and it is based on the mean square error (MSE). For an accurate image *I* and an image *K* generated by an approximate process (*I* and *K* are monochrome images with $m \times n$ pixels), the *MSE* is defined as

$$MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2. \tag{9}$$

The PSNR is given by

$$PSNR = 20 \log_{10}\left(\frac{MAX_I}{\sqrt{MSE}}\right). \tag{10}$$

The term  $MAX_I$  is the maximum possible pixel value of the image; for example, when a pixel is encoded by 8 bits, its maximum value is 255.
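Equations (9) and (10) translate directly into code. The following Python sketch operates on images given as lists of rows of integer pixel values; the function names are illustrative, and `max_i` defaults to 255 for 8-bit pixels.

```python
import math

def mse(img_i, img_k):
    """Eq. (9): mean squared error between two m x n monochrome images."""
    m, n = len(img_i), len(img_i[0])
    total = sum((img_i[r][c] - img_k[r][c]) ** 2 for r in range(m) for c in range(n))
    return total / (m * n)

def psnr(img_i, img_k, max_i=255):
    """Eq. (10); returns infinity for identical images (MSE = 0)."""
    e = mse(img_i, img_k)
    return math.inf if e == 0 else 20 * math.log10(max_i / math.sqrt(e))

ref = [[10, 20], [30, 40]]
approx = [[10, 22], [30, 38]]   # two pixels off by 2, so MSE = 8 / 4 = 2
print(mse(ref, approx), round(psnr(ref, approx), 2))  # 2.0 45.12
```

The special case for MSE = 0 avoids a division by zero when the approximate process introduces no error at all.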

In the image sharpening algorithm [14, 15], multiplication is performed using carry save adders (CSAs) followed by a ripple carry adder (RCA). Subtraction is performed as two's complement addition, so an RCA serves both as the subtractor and as the final adder of the multiplier. In this image processing application, TGAs and AMAs are compared by replacing the lower bits of the CSAs and RCAs with approximate adders.
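The reuse of an adder as a subtractor can be sketched as follows: in two's complement, a - b = a + (NOT b) + 1, so the same RCA is used with its carry-in set to 1. This is a hypothetical Python model of an exact RCA; the 9-bit width is chosen only for illustration, and the operands are assumed to be non-negative integers whose difference fits in the signed n-bit range. In an approximate design, the least significant cells of `rca` would be replaced by approximate cells.

```python
def rca(a, b, n, cin=0):
    """Exact n-bit ripple carry adder; returns (sum mod 2**n, carry-out)."""
    result, carry = 0, cin
    for i in range(n):
        x, y = (a >> i) & 1, (b >> i) & 1
        s = x ^ y ^ carry
        carry = (x & y) | (carry & (x ^ y))   # uses the carry of the previous stage
        result |= s << i
    return result, carry

def subtract(a, b, n):
    """a - b in n-bit two's complement: add the bitwise complement of b with cin = 1."""
    not_b = ~b & ((1 << n) - 1)
    diff, _ = rca(a, not_b, n, cin=1)
    return diff - (1 << n) if diff >> (n - 1) else diff   # reinterpret as signed

print(subtract(100, 37, 9), subtract(37, 100, 9))  # 63 -63
```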

An 8 × 6 multiplier is used, with its 7 LSBs implemented by approximate adders in both the CSAs and the RCA. The multiplication results in 25 terms that are added using a 16-bit RCA whose lower 8 bits are replaced by approximate adders. For the subtractor, a 9-bit RCA is used with its lower 5 bits approximated. Fig. 10 shows the images processed using the approximate adders, together with the corresponding PSNR values. The images processed with the TGAs have higher quality (i.e., higher PSNR values) than those processed with the AMAs and show only slight quality degradation.



Fig. 10. Images processed by: (a) AMA1 with PSNR=27.81; (b) AMA2 with PSNR=29.91; (c) AMA3 with PSNR=28.27; (d) AMA4 with PSNR=23.19; (e) TGA1 with PSNR=32.89; (f) TGA2 with PSNR=30.56.

## VI. CONCLUSION

In this paper, two novel approximate full adders using transmission gate-based multiplexers are proposed. Extensive simulation results have been presented for a comprehensive evaluation of accuracy and electrical figures of merit (such as power dissipation and delay). The proposed approximate adder designs show significant savings in power and delay. At the same time, they generate results that incur only a marginal degradation of accuracy. While TGA2 exhibits the best accuracy in terms of error rate and mean error distance, an image sharpening application shows that TGA1 provides excellent performance in terms of PSNR compared with other approximate adders using logic reduction techniques.

## REFERENCES

- [1] J. Han and M. Orshansky. "Approximate Computing: An Emerging Paradigm for Energy-Efficient Design." in ETS'13, Proc. of the 18th IEEE European Test Symposium, Avignon, France, May 2013, pp. 1-6.
- [2] H. Jiang, J. Han and F. Lombardi. "A Comparative Review and Evaluation of Approximate Adders," in *GLSVLSI'15, Proc. of the 25th IEEE/ACM Great Lakes Symposium on VLSI*, Pittsburgh, PA, USA, 2015, pp. 343-348.
- [3] S. L. Lu. "Speeding up processing with approximation circuits." Computer, vol. 37, no. 3, pp. 67–73, 2004.
- [4] A.K. Verma, P. Brisk, and P. Ienne. "Variable latency speculative addition: A new paradigm for arithmetic circuit design." In Proc. DATE 2008, pp. 1250–1255.

- [5] H.R. Mahdiani, A. Ahmadi, S.M. Fakhraie and C. Lucas. "Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft computing applications." IEEE Trans. Circuits and Systems I: Regular Papers, vol. 57, no. 4, pp. 850-862, April 2010.
- [6] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy. "Low-power digital signal processing using approximate adders." IEEE Trans. on CAD of Int. Circuits and Systems, vol. 32, no.1, pp. 124-137, 2013.
- [7] Z. Yang, A. Jain, J. Liang, J. Han, and F. Lombardi. "Approximate XOR/XNOR-based adders for inexact computing," In IEEE Intl. Conf. on Nanotechnology. Beijing, China, 2013, pp. 690-693.
- [8] C. Chiasson and V. Betz. "Should FPGAs abandon the pass-gate?" in Proc. FPL'13, the 23rd International Conference on Field Programmable Logic and Applications, Sep. 2013, pp. 1-8.
- [9] T. Pi and P.J. Crotty. "FPGA lookup table with transmission gate structure for reliable low-voltage operation." U.S. Patent No. 6,667,635. 23 Dec. 2003.
- [10] R. Zimmermann and W. Fichtner. "Low-power logic styles: CMOS versus pass-transistor logic." IEEE Journal of Solid-State Circuits, vol. 32, no. 7, pp. 1079-1090, Jul. 1997.
- [11] J. Liang, J. Han and F. Lombardi. "New metrics for the reliability of approximate and probabilistic adders." IEEE Trans. on Computers, vol. 62, no. 9, pp. 1760-1771, Sept. 2013.
- [12] N.H.E. Weste and D.M. Harris. "CMOS VLSI design: a circuits and systems perspective." Pearson Education India. Third edition, 2005.
- [13] A. M. Shams, T.K. Darwish and M.A. Bayoumi "Performance analysis of low-power 1-bit CMOS full adder cells." IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol.10, no.1, pp. 20-29, Feb. 2002.
- [14] M.S. Lau, K.V. Ling and Y.C. Chu. "Energy-aware probabilistic multiplier: design and analysis." In Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems, Grenoble, France, Oct. 2009, pp. 281-290.
- [15] C. Liu, J. Han and F. Lombardi. "An Analytical Framework for Evaluating the Error Characteristics of Approximate Adders." IEEE Trans. on Computers, vol. 64, no. 5, pp. 1268 - 1281, 2015.