

# Low-Power Complex Multiplier Pin Assignment based on Spatial and Temporal Signal Properties

Downloaded from: https://research.chalmers.se, 2025-03-12 15:48 UTC

Citation for the original published paper (version of record):

Larsson-Edefors, P., Börjeson, E. (2025). Low-Power Complex Multiplier Pin Assignment based on Spatial and Temporal Signal Properties. IEEE International Symposium on Circuits and Systems

N.B. When citing this work, cite the original published paper.

© 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, or reuse of any copyrighted component of this work in other works.

# Low-Power Complex Multiplier Pin Assignment Based on Spatial and Temporal Signal Properties

Per Larsson-Edefors and Erik Börjeson, Chalmers University of Technology, Gothenburg, Sweden. Email: perla@chalmers.se

Abstract—Fixed-point integer multipliers are power-intensive components that are integral to many systems in computing and digital signal processing. Operating on complex fixed-point numbers, the complex multiplier is critical to a wide range of applications including communication systems. Since channel properties drift over time, communication systems require adaptive processing blocks which have to be designed for the worst-case scenario. This raises the question of how we can take advantage of performance variations of a system to reduce power dissipation. We describe how knowledge on variations in both dynamic range (the spatial dimension) and switching frequency (the temporal dimension) can be used to assign pins of complex multipliers in order to minimize power dissipation. Using netlist synthesis based on the predictive 7-nm ASAP7 cell library, we find that, for instance, if one of two 12-bit input signals of the complex multiplier has a 2-bit reduced dynamic range and a 50% reduced switching frequency, we decrease the energy per operation by 20% by selecting the optimal pin assignment.

#### I. INTRODUCTION

Design approaches that jointly consider the arithmetic and the data it operates on are becoming highly relevant for low-power implementations [1]. Interestingly, digital filters operate on input signals which are inherently different; one being the data to be filtered, the other being weights representing the filter's frequency- or time-domain response. The data are continuously changing their values within the allotted fixed-point dynamic range, but the weights are not necessarily that dynamic; they can even be static. Depending on the application, both the dynamic range (the spatial dimension) and the update frequency (the temporal dimension) of the weights can drift over time. For example, in an FIR filter of an equalizer, all tap weights potentially use their full dynamic range and are frequently updated when the equalizer is converging, but once it has reached a steady state, many weights converge to a reduced dynamic range. After convergence, the equalizer tracks dynamic properties of the transmission channel in real time, which offers opportunities to reduce weight update frequency or to simplify the calculation of weights [2].

The complex multiplier is a workhorse in many digital signal processing (DSP) algorithms, in which it operates on real- and imaginary-valued data representations. Complex multipliers often dominate digital filter implementations, so optimizing their power dissipation can have a large impact on overall DSP power. For example, the DSP power dissipation of high-throughput optical coherent receivers is dominated by the multiplier-intensive equalizer [3]. Additionally, a high-throughput MIMO equalizer is dominated by complex multipliers which take up 70% of the equalizer area [4].

We will analyze how we can exploit information on spatial and temporal properties of input signals when we implement DSP circuits that rely on complex multipliers. In particular, we will investigate what power reductions are possible for a number of design scenarios, where the two multiplier input signals exhibit distinctly different behaviors with respect to dynamic range and switching frequency.

#### II. BACKGROUND

Complex multiplications can be straightforwardly implemented using four real multiplications, one addition and one subtraction, according to

$$Z_r = A_r B_r - A_i B_i \tag{1}$$

$$Z_i = A_r B_i + A_i B_r. \tag{2}$$

Fig. 1 shows the corresponding implementation of such a complex multiplier. There exist implementation alternatives which involve only three real multipliers. While this trick can reduce hardware resource usage, it tends to incur a longer delay for fixed-point numbers as shown in [5].
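As an illustration of (1) and (2), the following minimal Python sketch computes a complex product with four real multiplications and compares it with one common three-multiplication factorization. The function names are hypothetical, and the three-multiplication form shown here is only one of several known variants; it is not necessarily the one evaluated in [5].

```python
def cmul4(ar, ai, br, bi):
    """Eqs. (1)-(2): four real multiplications, one subtraction, one addition."""
    zr = ar * br - ai * bi
    zi = ar * bi + ai * br
    return zr, zi


def cmul3(ar, ai, br, bi):
    """One common three-multiplication factorization (illustrative variant)."""
    k1 = br * (ar + ai)  # the pre-addition widens one operand by a bit
    k2 = ar * (bi - br)
    k3 = ai * (br + bi)
    zr = k1 - k3         # = ar*br - ai*bi
    zi = k1 + k2         # = ar*bi + ai*br
    return zr, zi


# (3 - 2j) * (-5 + 7j) = -1 + 31j
print(cmul4(3, -2, -5, 7))  # (-1, 31)
print(cmul3(3, -2, -5, 7))  # (-1, 31)
```

The three-multiplication form trades one multiplier for three extra additions, two of which sit in front of the multipliers; this is in line with the longer delay mentioned above.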



Fig. 1: Complex multiplier.

Fig. 2: Booth-encoded multiplier.

As shown in Fig. 2, each real integer multiplier is implemented using an encoding stage (enc) and a partial-product (PP) generation stage (PP gen). A final carry-propagation adder (CPA) performs the final addition of the carry-save representation that is output from the PP reduction tree (PP tree). The inputs to the real multiplier constitute the multiplicand operand A, represented as $a_{n-1}a_{n-2}\cdots a_0$, and the multiplier operand B, represented as $b_{n-1}b_{n-2}\cdots b_0$. These inputs are 2's complement *n*-bit numbers, in which the most significant bit acts as sign bit. The output of the real multiplier in Fig. 2 is a (2n - 1)-bit product, while for the complex multiplier in Fig. 1 the internal addition/subtraction extends the complex product to 2n bits.

In the radix-4 modified Booth technique [6], [7], [8], three consecutive bits of the multiplier operand B, i.e., $b_{i+1}b_ib_{i-1}$, are encoded, resulting in fewer generated PPs ($\frac{n}{2}+1$ instead of n). The rationale here is that by adding some encoding logic, we can save PP hardware and hopefully reduce delay.

The decoder circuit internal to PP gen selects one of 2A, A, 0, -A, or -2A to generate each PP row. Importantly, strings of '000' and '111' in the applied B will lead to the generation of PPs that evaluate to zero, which is desirable as this decreases power dissipation. Now consider a signal that does not utilize the nominal dynamic range (DR) of the implemented wordlength, $20 \log_{10}(2^n - 1)$ dB. An effect of a reduced DR is that bit strings with consecutive '0's and '1's become more common in higher significance bits of the signal representation. This observation has been used in schemes that introduce extra circuits (which is not really practical [9]) to dynamically multiplex input operands to reduce power dissipation [10], [11], [12]. However, if we know during design that the dynamic range of one of the multiplier input signals is likely to be lower than the other, then we should route this signal to the encoded input (i.e., B in Fig. 2).
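The effect of a reduced dynamic range on the encoded input can be sketched in a few lines of Python. The code below is a behavioral model only, under the assumption of textbook radix-4 Booth recoding with $b_{-1} = 0$; the function names are hypothetical, and the model produces $\lceil n/2 \rceil$ digits without any additional correction row that a hardware implementation may add.

```python
def booth_radix4_digits(b, n):
    """Recode an n-bit 2's complement integer b into radix-4 Booth digits.

    Returns digits in {-2, -1, 0, 1, 2}, least significant first, such that
    b == sum(d * 4**j for j, d in enumerate(digits)).
    """
    assert -(1 << (n - 1)) <= b < (1 << (n - 1)), "b does not fit in n bits"
    m = n + (n % 2)                      # sign-extend to an even bit count
    u = b & ((1 << m) - 1)               # 2's complement bit pattern
    bits = [0] + [(u >> i) & 1 for i in range(m)]  # bits[0] is b_{-1} = 0
    digits = []
    for j in range(m // 2):
        b_lo, b_mid, b_hi = bits[2 * j], bits[2 * j + 1], bits[2 * j + 2]
        digits.append(b_lo + b_mid - 2 * b_hi)     # '000' and '111' give 0
    return digits


def nonzero_rows(b, n):
    """Number of partial-product rows that are not evaluated to zero."""
    return sum(d != 0 for d in booth_radix4_digits(b, n))


print(booth_radix4_digits(3, 12))   # [-1, 1, 0, 0, 0, 0]
print(booth_radix4_digits(-3, 12))  # [1, -1, 0, 0, 0, 0]
```

For n = 12, any operand whose dynamic range is reduced by 2 bits has $b_{11} = b_{10} = b_9$, so the most significant Booth digit is always zero. With a 1-bit reduction only $b_{11} = b_{10}$ is guaranteed, and the top digit can still be nonzero.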

A reduced dynamic range of a signal does not impact the switching frequency of the 2's complement word whose bits represent the signal. This is because of how the 2's complement sign is handled: the upper bits of a reduced-range word are copies of the sign bit and toggle whenever the sign changes. Previous work involving FIR filters [13] and FFTs [14] exploited the fact that switching frequency is orthogonal to dynamic range to reduce complex multiplier power, by routing infrequently updated filter weights to the encoded multiplier input. There are many other examples of DSP applications where data signals are multiplied with slowly changing weights estimated from drifting channel properties, for example, for phase recovery [15].
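The orthogonality of dynamic range and switching frequency can be checked with a small experiment. The sketch below assumes uniformly distributed random signed values, which is not necessarily how the vector sets in this work were generated, and uses hypothetical helper names. It measures, per bit, the fraction of cycles with a 0-to-1 transition for a full-range and a reduced-range 12-bit signal.

```python
import random


def zero_to_one_activity(words, n):
    """Per-bit fraction of cycles in which a bit switches from 0 to 1."""
    counts = [0] * n
    for prev, cur in zip(words, words[1:]):
        rising = ~prev & cur               # bits that were 0 and are now 1
        for i in range(n):
            counts[i] += (rising >> i) & 1
    return [c / (len(words) - 1) for c in counts]


def random_words(n, dr_reduction, count, rng):
    """Uniform signed values using n - dr_reduction bits, sign-extended to n bits."""
    k = n - dr_reduction
    mask = (1 << n) - 1                    # keep the n-bit 2's complement pattern
    return [rng.randrange(-(1 << (k - 1)), 1 << (k - 1)) & mask
            for _ in range(count)]


rng = random.Random(1)                     # fixed seed for repeatability
full = zero_to_one_activity(random_words(12, 0, 20000, rng), 12)
reduced = zero_to_one_activity(random_words(12, 2, 20000, rng), 12)
print(["%.3f" % a for a in full])
print(["%.3f" % a for a in reduced])
```

Under these assumptions, every bit in both vector sets shows an activity close to 0.25, including the three identical upper bits of the reduced-range signal. The reduced dynamic range changes the spatial correlation between bits, not how often they switch.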

#### III. EVALUATION METHOD

We simultaneously develop complex multipliers in VHDL for different input wordlengths n and generate several vector sets for  $A_r$ ,  $A_i$ ,  $B_r$ ,  $B_i$ , each with a specific dynamic range and switching frequency. For each wordlength n, we have one baseline vector set whose data are randomly switching and utilize the full dynamic range. We use Cadence Genus [16] to synthesize the VHDL code under a timing constraint, where we balance timing and area at a 10% relaxation of the strictest constraint possible for that particular input wordlength n.

As the cell library, we use the open-source ASAP7 library [17] and its regular-VT cells. This library was developed by Arizona State University in collaboration with ARM Ltd. to represent a predictive 7-nm FinFET process technology. All synthesized gate netlists are verified for logic functionality using Cadence Xcelium [18] by comparing the outputs to vector sets generated as reference for $Z_r$ and $Z_i$.

Using the clock rate f as reference, we define  $\alpha_i$  as switching activity. This is the fraction of clock cycles when a circuit node i with capacitance  $C_i$  switches from 0 to 1. Assuming N nodes, the switching power is defined as

$$P_{sw} = f V_{DD}^{2} \sum_{i=1}^{N} C_{i} \alpha_{i} \tag{3}$$

where  $V_{DD}$  is the supply voltage.

During functional verification, we simulate all netlists with 10,000 input vectors from the vector sets previously generated. Power analysis is done at netlist level in Genus using back-annotated data from Xcelium, which contributes switching statistics used to calculate $\alpha_i$ for all circuit nodes. We assume f = 1 GHz and $V_{DD} = 0.7 \text{ V}$. Energy per operation can be calculated as $E_{op} = P_{sw}/f$, as switching power dominates arithmetic-intensive applications.
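The relation between (3) and the energy per operation can be made concrete with a short Python sketch. The node count and capacitance values below are purely hypothetical placeholders chosen for easy arithmetic; they are not taken from the synthesized netlists.

```python
def switching_power(f_hz, vdd, caps_farad, alphas):
    """Eq. (3): P_sw = f * VDD^2 * sum(C_i * alpha_i), in watts."""
    if len(caps_farad) != len(alphas):
        raise ValueError("one switching activity is needed per node")
    return f_hz * vdd ** 2 * sum(c * a for c, a in zip(caps_farad, alphas))


def energy_per_operation(p_sw, f_hz):
    """E_op = P_sw / f, in joules, assuming one operation per clock cycle."""
    return p_sw / f_hz


# Hypothetical example: 1000 nodes of 1 fF each, all with alpha = 0.25,
# at the f = 1 GHz and VDD = 0.7 V assumed in the text.
f, vdd = 1e9, 0.7
p_sw = switching_power(f, vdd, [1e-15] * 1000, [0.25] * 1000)
e_op = energy_per_operation(p_sw, f)
print(f"P_sw = {p_sw * 1e6:.1f} uW, E_op = {e_op * 1e15:.1f} fJ")
```

Since f cancels in $E_{op}$, the energy per operation depends only on $V_{DD}$ and the activity-weighted capacitance, which is why the pin assignment affects it through $\alpha_i$ alone.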

Our baseline vector set is using random data, which correspond to  $\alpha = 0.25$ . This assumption overestimates the power dissipation of practical scenarios, but it is easy to replicate.

#### IV. RESULTS

Our evaluations are done for complex multipliers with input wordlengths from 6 to 16 bits, which is a range relevant for many DSP applications. In fixed-point implementations, we avoid using the full-precision complex multiplier output of 2n product bits, as this causes data wordlengths of successive operations to grow too fast. Truncation or rounding then becomes necessary, but either forces the designer to keep track of the binary point of the data and align it correctly between operations. To avoid data wordlength growth, we implement complex multipliers whose outputs are truncated to n bits.

##### A. Pin Assignment – Dynamic Range

Fig. 3 shows $E_{op}$ for complex multipliers with n = 8, 10, 12, 14 bits as a function of dynamic range reduction.



Fig. 3: Energy per operation as function of dynamic range reduction.

By choice of vector sets during simulation, we here reduce the dynamic range (DR) in steps of 1 bit (or 6 dB) for each of the A and B inputs (Fig. 2). It is clear that pin assignment based on DR has a significant impact on energy dissipation, in particular for DR reductions of an even number of bits [19]. The reason we see no gains for n = 8 is that smaller multipliers are not using Booth encoding [9].

##### B. Pin Assignment – Switching Frequency

Both the spatial and temporal properties of signals clearly impact multiplier power dissipation. While low-power pin assignments based on dynamic range were illustrated in Fig. 3, low-power pin assignments based on *input switching frequency* are shown in Fig. 4. Here, we show $E_{op}$ for different $\alpha$ as a function of wordlength. Using $\alpha = 0.25$ as baseline, we also explore the impact of further reductions of $\alpha$ by a factor of 2 on A and B, respectively.

Fig. 4: Energy per operation as function of input wordlength.

As can be expected, there will be power reductions regardless of pin assignment, since a reduced switching activity  $\alpha$ on any input will reduce signal switching inside the multiplier. However, we gain significantly from assigning the less active signal to *B*, which is the encoded input in Fig. 2. This is because a reduced switching encoder activity will impact all downstream logic in PP gen, PP tree, and CPA.
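One simple way to obtain an input with reduced switching activity is to hold each random word for several clock cycles. The sketch below is an assumption made for illustration, since the text does not specify how the reduced-activity vector sets were generated; the function names are hypothetical. Holding each word for 2 cycles halves $\alpha$ from 0.25 to 0.125, and holding for 4 cycles halves it again.

```python
import random


def held_random_words(n, hold, count, rng):
    """Random n-bit words, each held for `hold` consecutive clock cycles."""
    words = []
    while len(words) < count:
        w = rng.getrandbits(n)
        words.extend([w] * hold)
    return words[:count]


def mean_activity(words, n):
    """Average 0->1 switching activity per bit and clock cycle."""
    rising = sum(bin(~p & c).count("1") for p, c in zip(words, words[1:]))
    return rising / (n * (len(words) - 1))


rng = random.Random(7)                     # fixed seed for repeatability
for hold in (1, 2, 4):
    alpha = mean_activity(held_random_words(12, hold, 20000, rng), 12)
    print(f"hold = {hold}: alpha = {alpha:.3f}")
```

Applied to the encoded input B, such a vector set leaves the Booth digits unchanged in every held cycle, which is consistent with the observation that reduced encoder activity propagates to all downstream logic.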

We include in this graph the case of n = 6 to show an extended dependence of $E_{op}$ on wordlength. As already noted for Fig. 3, for shorter even-valued wordlengths, Booth encoding is not used during synthesis.

##### C. Optimal Pin Assignment – Spatial and Temporal Properties

It is relevant to quantitatively evaluate the relative impact of the signal's spatial and temporal properties on complex multiplier power dissipation. We now consider the case when we *optimally* assign the vector sets to the complex multiplier's input pins: when one input signal has a reduced dynamic range and/or switching frequency, we assume it is assigned to input B of Fig. 2, which represents the encoded input.

Fig. 5 shows how the different optimal pin assignments compare for a range of even-valued wordlengths from 8 up to 16 bits. The DR bars correspond to a reduction of 2 bits on the encoded input, while U represents half the baseline switching frequency on the encoded input. The DR+U bars signify a vector set which has both its dynamic range and switching frequency reduced, by 2 bits and by 2 times, respectively.

Fig. 5: Pin-optimal $E_{op}$ for different input wordlengths. DR denotes 2 bits lower dynamic range, whereas U denotes $\alpha = 0.0625$.

For our assumptions on the DR and U data, it is clear that the switching frequency has a larger impact on power than dynamic range. A 2-bit DR reduction corresponds to a loss of 12 dB, which is substantial in DSP applications. In contrast, reducing $\alpha$ to half its baseline value yields a relatively higher payoff in terms of power reductions.
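As a worked check of the 12 dB figure, the nominal dynamic range expression $20 \log_{10}(2^n - 1)$ dB from Section II can be evaluated for a 2-bit reduction, here with n = 12 taken as an example:

$$\mathrm{DR}(n) - \mathrm{DR}(n-2) = 20 \log_{10}\frac{2^{n}-1}{2^{n-2}-1} \approx 20 \log_{10} 4 \approx 12.04\ \text{dB}$$

$$n = 12:\quad 20 \log_{10}(4095) - 20 \log_{10}(1023) \approx 72.25 - 60.20 = 12.05\ \text{dB}$$

Each removed bit thus corresponds to roughly 6 dB, consistent with the 1-bit steps used for Fig. 3.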

Fig. 6 shows the same categories of data as shown in the previous graph, but here we focus on shorter wordlengths, which include also odd wordlengths.



Fig. 6: Pin-optimal  $E_{op}$  for shorter input wordlengths.

The graph in Fig. 6 reinforces that complex multipliers with shorter even-valued input wordlengths n do not offer us any big opportunities to perform low-power pin assignment. But we can see that this is not the case for shorter odd-valued wordlengths, where n = 7 and n = 9 represent exceptions. While we can reduce power by doing pin assignments, we should note that the complex multipliers for n = 7 and n = 9 have baseline power values that are closer to those of their (n + 1)-bit counterparts than to those of their (n - 1)-bit counterparts. As we can see for n = 7, the way we assign signals to inputs can have a significant impact on the resulting power.

Activity-based pin assignment as it applies to integer multipliers with shorter wordlengths is discussed in detail in [20].

##### D. Optimal vs Non-Optimal Pin Assignment

While it may be difficult to practically control how signals are assigned to the inputs of complex multipliers during implementation, it is useful for a designer to know what power savings are possible, should pin assignment be considered.

Using again the vector set of DR+U, where the dynamic range and switching frequency have been reduced (by 2 bits and 2 times, respectively), Fig. 7 shows the energy per operation for three cases: 1) when baseline vectors are used for A and B, 2) when DR+U vectors are applied to A and baseline vectors are used for B, and 3) when DR+U vectors are applied to B and baseline vectors are used for A.



Fig. 7: Energy per operation for different input wordlengths when optimal and non-optimal pin assignment is used.

As we can see, power reductions from around 25% and upwards are possible with optimal pin assignment. In contrast, if the inputs are swapped, power reductions never exceed 10%. If the switching frequency of input vectors is reduced by another 2 times, optimal pin assignment reduces power by more than 35% for  $n \leq 14$ .

#### V. CONCLUSION

There is plenty of information at the algorithm and architecture levels which is discarded in conventional implementation flows, but which can be used to guide the implementation of power-efficient circuits. We have shown how information on how different signals behave over time can be exploited to reduce switching power in complex-valued multipliers.

It is well known that for integer multipliers based on Booth recoding, a lower-than-nominal input dynamic range (DR) can be translated into power savings. If optimal pin assignment is used, the lower DR will yield significantly fewer signal transitions, saving multiplier power. Also a lower-than-nominal input switching activity ($\alpha$) can be used to reduce power, but just as in the case of reduced DRs, the power savings depend on how we assign signals to the input pins.

We have investigated quantitatively how to optimally assign input pins on complex multipliers, which are key components in DSP and scientific applications, operating on signals whose dynamic range and switching frequency are lower than nominal. Our results show that, for cases when Booth encoding is used (n = 7, 9, 10, 12, 14, 16), an optimal pin assignment lowers energy per operation by between 18% and 27% compared to the swapped assignment, when one input signal has a dynamic range reduction of 12 dB and a 50% reduced switching frequency. There is no circuit or performance overhead of this scheme, but it does require the designer to have information on the spatial and temporal behavior of the signals to be processed.

#### ACKNOWLEDGEMENT

We thank the Swedish Foundation for Strategic Research (SSF) for funding through the classIC and HotOptics projects.

#### REFERENCES

- [1] S. Coward, T. Drane, E. Morini, and G. A. Constantinides, "Combining power and arithmetic optimization via datapath rewriting," in *IEEE Int. Symp. on Computer Arithmetic*, 2024, pp. 24–31.
- [2] C. Fougstedt, P. Johannisson, L. Svensson, and P. Larsson-Edefors, "Dynamic equalizer power dissipation optimization," in *Optical Fiber Communication Conf.*, 2016, p. W4A.2.
- [3] C. Fougstedt, O. Gustafsson, C. Bae, E. Börjeson, and P. Larsson-Edefors, "ASIC design exploration for DSP and FEC of 400-Gbit/s coherent data-center interconnect receivers," in *Optical Fiber Communication Conf.*, 2020, p. Th2A.38.
- [4] E. Börjeson, E. Deriushkina, M. Mazur, M. Karlsson, and P. Larsson-Edefors, "Circuit implementation of pilot-based dynamic MIMO equalization for coupled-core fibers," in *Optical Fiber Communication Conf.*, Mar. 2024, p. W1E.4.
- [5] E. E. Swartzlander and H. H. Saleh, "Floating-point implementation of complex multiplication," in *Asilomar Conf. on Signals, Systems and Computers*, 2009, pp. 926–929.
- [6] A. D. Booth, "A signed binary multiplication technique," *Q. J. Mech. Appl. Math.*, vol. 4, no. 2, pp. 236–240, 1951.
- [7] O. L. MacSorley, "High speed arithmetic in binary computers," *Proc. IRE*, vol. 49, no. 1, pp. 67–97, Jan. 1961.
- [8] L. Rubinfield, "A proof of the modified Booth's algorithm for multiplication," *IEEE Trans. Computers*, vol. C-24, no. 10, pp. 1014–1015, 1975.
- [9] R. Zimmermann, "Datapath synthesis for standard-cell design," in *IEEE Int. Symp. on Computer Arithmetic*, 2009, pp. 207–211.
- [10] P.-M. Seidel, "Dynamic operand modification for reduced power multiplication," in *Asilomar Conf. on Signals, Systems and Computers*, vol. 1, 2002, pp. 52–56.
- [11] N.-Y. Shen and O.-C. Chen, "Low-power multipliers by minimizing switching activities of partial products," in *IEEE Int. Symp. on Circuits and Systems*, vol. 4, 2002, pp. 93–96.
- [12] J. Park, S. Kim, and Y.-S. Lee, "A low-power Booth multiplier using novel data partition method," in *IEEE Asia-Pacific Conf. on Advanced System Integrated Circuits*, 2004, pp. 54–57.
- [13] C. J. Nicol and P. Larsson, "Low power multiplication for FIR filters," in *Int. Symp. on Low Power Electronics and Design*, 1997, pp. 76–79.
- [14] O. Meteer and M. J. G. Bekooij, "Low-power Booth multiplication without dynamic range detection in FFTs for FMCW radar signal processing," in *Asia-Pacific Signal and Information Processing Association Annual Summit and Conf.*, 2021, pp. 44–48.
- [15] E. Börjeson, C. Fougstedt, and P. Larsson-Edefors, "VLSI implementations of carrier phase recovery algorithms for M-QAM fiber-optic systems," *IEEE J. Lightw. Technol.*, vol. 38, no. 14, pp. 3616–3623, 2020.
- [16] Cadence® Genus®, v. 18.14, Cadence Design Systems, Inc., 2019.
- [17] V. Vashishtha, M. Vangala, and L. T. Clark, "ASAP7 predictive design kit development and cell design technology co-optimization," in *IEEE/ACM Int. Conf. Computer-Aided Design*, Nov. 2017, pp. 992–998.
- [18] Cadence® Xcelium®, v. 22.09, Cadence Design Systems, Inc., 2023.
- [19] Z. Yu, L. Wasserman, and A. Willson, "A painless way to reduce power dissipation by over 18% in Booth-encoded carry-save array multipliers for DSP," in *IEEE Workshop on Signal Processing Systems*, 2000, pp. 571–580.
- [20] P. Larsson-Edefors and E. Börjeson, "Activity-based input operand assignment for reduced multiplier power dissipation," in *IEEE Latin American Symp. on Circuits and Systems (LASCAS)*, 2025.