4
21 978-1-4244-2332-3/08/$25.00 © 2008 IEEE A Recursive Multifunction Circuit for Leading-Digit Detection and Comparison Sébastien Roy Département de génie électrique et de génie informatique Université Laval, Québec, Canada, G1K 7P4 T: 1-418-656-2131 ext. 2981, F: 1-418-656-3159, e-mail: [email protected] Abstract— It is known that fast, fully combinational leading- digit detector circuits can be generated efficiently by recognizing their inherent hierarchical structure. It is shown herein that this structure is not only hierarchical, but also recursive. This recur- sivity fully defines a minimal-complexity circuit, thus garanteeing optimal circuit synthesis. Such a circuit having an N -bit operand generates all output bits with log 2 (N ) combinational stages. It also makes possible a recursive parameterizable description in VHDL or other hardware description languages supporting recursion. For standard cell generation, it is amenable to efficient, automatic recursive routing. Furthermore, the same recursive structure can serve as the basis of other useful arithmetic functions, such as a fast comparator (log 2 (N ) stages). Therefore, such a multifunction circuit could be employed in the design of fast, low-complexity arithmetic-logic units (ALUs) inside mi- croprocessors, digital signal processors, or application-specific system-on-chip (SoC) designs. I. I NTRODUCTION A. Combination synthesis It is well known that some combinational functions are difficult to implement in an efficient manner and / or with a short critical path when the operands are wide because each output bit is a function of most or all input bits. The classical case is that of addition where the naive case (ripple-carry adder) does not scale well, having a delay that is proportional to operand width. This problem was identified very early in the history of computing and fundamental techniques, based on the concepts of propagating and generating positions, were proposed dating back to the works of Babbage in 1851. One possibility proposed at the time is the carry-lookahead concept, where the carry at each position is generated as an optimal two-stage sum-of-products (SOP) or product-of- sum (POS) combinational circuit using e.g. Karnaugh maps or their automated equivalent, the Quine-McClusky algorithm. Unfortunately, this approach provides little gain since carry positions of high weight comprise gates of proportionally high fan-in, thus leading to a significant slowdown in conventional circuits (e.g. CMOS) in spite of the fact that only two levels of gates are traversed. Furthermore, the complexity (gate and / or transistor count) of each individual circuit grows proportionally with the weight of the position, eventually making wide adders based on this technique impractical. This shows that for such circuits, the use of automated synthesis does not necessarily lead to a good solution. From a basic truth table description, an optimal two-level synthesis process would yield the carry-lookahead structure, with all of its drawbacks. If multi-level synthesis is used, it is not clear what would be generated, since there is no clear optimality criterion beyond two levels. It is certain, however, that the natural structure of the problem would not be recognized. This requires recognizing the inherent hierarchy in such problems, leading to the synthesis of circuits yielding intermediate si- gnals across a number of stages. This would lead to one of the many known parallel-prefix adder structures. This shows that finding the optimal architecture for a given circuit function through automated synthesis remains an open problem. However, Verma, Brisk and Lenne [1] have proposed a progressive decomposition approach to impose hierarchy in the automated synthesis. While this seems promising, it does not reach the level of optimality possible by applying hierarchy prior to automated synthesis. Aside from the difficulty in recognizing hierarchy in some circuits, automated combinational synthesis suffers from two drawbacks : 1. When the number of logic levels is allowed to exceed 2 (it is generally the case), no unambiguous definition of optimality exists. 2. When multiple functions are implemented, it is not clear how to identify common intermediate signals / circuit modules which could potentially be shared. In other words, there is currently no synthesis engine to which one can say : “Here are two circuits realizing two distinct functions which we believe to be related ; synthesize a single circuit which realizes both functions by maximizing the degree of sharing.” B. Leading-digit detectors In any floating-point processor, operand normalization is a crucial operation. It consists of shifting the operand to the left until the most significant non-zero digit is in the left-most position. The number of positions to shift is a function of the position of the most significant (or leading) non-zero digit. This is one of many instances where a special circuit is called for to detect the position of the leading digit. For normalisation, it makes sense to implement this circuit in such a way as to obtain the position of the leading-digit by counting from the left. The said position then corresponds to the number of leading zeros and gives exactly the number of positions to shift. In this context, the circuit is referred to as a Leading

[IEEE 2008 Joint International IEEE Northeast Workshop on Circuits and Systems (NEWCAS) and TAISA Conference (NEWCAS-TAISA) - Montreal, QC, Canada (2008.06.22-2008.06.25)] 2008 Joint

Embed Size (px)

Citation preview

Page 1: [IEEE 2008 Joint International IEEE Northeast Workshop on Circuits and Systems (NEWCAS) and TAISA Conference (NEWCAS-TAISA) - Montreal, QC, Canada (2008.06.22-2008.06.25)] 2008 Joint

21

978-1-4244-2332-3/08/$25.00 © 2008 IEEE

A Recursive Multifunction Circuit for Leading-Digit

Detection and Comparison

Sébastien Roy

Département de génie électrique et de génie informatique

Université Laval, Québec, Canada, G1K 7P4

T: 1-418-656-2131 ext. 2981, F: 1-418-656-3159, e-mail: [email protected]

Abstract— It is known that fast, fully combinational leading-digit detector circuits can be generated efficiently by recognizingtheir inherent hierarchical structure. It is shown herein that thisstructure is not only hierarchical, but also recursive. This recur-sivity fully defines a minimal-complexity circuit, thus garanteeingoptimal circuit synthesis. Such a circuit having an N -bit operandgenerates all output bits with log

2(N) combinational stages.

It also makes possible a recursive parameterizable descriptionin VHDL or other hardware description languages supportingrecursion. For standard cell generation, it is amenable to efficient,automatic recursive routing. Furthermore, the same recursivestructure can serve as the basis of other useful arithmeticfunctions, such as a fast comparator (log

2(N) stages). Therefore,

such a multifunction circuit could be employed in the designof fast, low-complexity arithmetic-logic units (ALUs) inside mi-croprocessors, digital signal processors, or application-specificsystem-on-chip (SoC) designs.

I. INTRODUCTION

A. Combination synthesis

It is well known that some combinational functions are

difficult to implement in an efficient manner and / or with

a short critical path when the operands are wide because each

output bit is a function of most or all input bits. The classical

case is that of addition where the naive case (ripple-carry

adder) does not scale well, having a delay that is proportional

to operand width. This problem was identified very early in

the history of computing and fundamental techniques, based

on the concepts of propagating and generating positions, were

proposed dating back to the works of Babbage in 1851.

One possibility proposed at the time is the carry-lookahead

concept, where the carry at each position is generated as

an optimal two-stage sum-of-products (SOP) or product-of-

sum (POS) combinational circuit using e.g. Karnaugh maps

or their automated equivalent, the Quine-McClusky algorithm.

Unfortunately, this approach provides little gain since carry

positions of high weight comprise gates of proportionally high

fan-in, thus leading to a significant slowdown in conventional

circuits (e.g. CMOS) in spite of the fact that only two levels

of gates are traversed. Furthermore, the complexity (gate

and / or transistor count) of each individual circuit grows

proportionally with the weight of the position, eventually

making wide adders based on this technique impractical.

This shows that for such circuits, the use of automated

synthesis does not necessarily lead to a good solution. From

a basic truth table description, an optimal two-level synthesis

process would yield the carry-lookahead structure, with all of

its drawbacks. If multi-level synthesis is used, it is not clear

what would be generated, since there is no clear optimality

criterion beyond two levels. It is certain, however, that the

natural structure of the problem would not be recognized. This

requires recognizing the inherent hierarchy in such problems,

leading to the synthesis of circuits yielding intermediate si-

gnals across a number of stages. This would lead to one of

the many known parallel-prefix adder structures.

This shows that finding the optimal architecture for a given

circuit function through automated synthesis remains an open

problem. However, Verma, Brisk and Lenne [1] have proposed

a progressive decomposition approach to impose hierarchy in

the automated synthesis. While this seems promising, it does

not reach the level of optimality possible by applying hierarchy

prior to automated synthesis.

Aside from the difficulty in recognizing hierarchy in some

circuits, automated combinational synthesis suffers from two

drawbacks :

1. When the number of logic levels is allowed to exceed

2 (it is generally the case), no unambiguous definition of

optimality exists.

2. When multiple functions are implemented, it is not clear

how to identify common intermediate signals / circuit

modules which could potentially be shared.

In other words, there is currently no synthesis engine to

which one can say :

“Here are two circuits realizing two distinct functions which

we believe to be related ; synthesize a single circuit which

realizes both functions by maximizing the degree of sharing.”

B. Leading-digit detectors

In any floating-point processor, operand normalization is

a crucial operation. It consists of shifting the operand to the

left until the most significant non-zero digit is in the left-most

position. The number of positions to shift is a function of the

position of the most significant (or leading) non-zero digit.

This is one of many instances where a special circuit is

called for to detect the position of the leading digit. For

normalisation, it makes sense to implement this circuit in such

a way as to obtain the position of the leading-digit by counting

from the left. The said position then corresponds to the number

of leading zeros and gives exactly the number of positions to

shift. In this context, the circuit is referred to as a Leading

Page 2: [IEEE 2008 Joint International IEEE Northeast Workshop on Circuits and Systems (NEWCAS) and TAISA Conference (NEWCAS-TAISA) - Montreal, QC, Canada (2008.06.22-2008.06.25)] 2008 Joint

22

Zero Detector (LZD) since it counts the number of leading

zeros. The term Leading-Digit Detector (LDD) is used herein

because it is more generic in the sense that : (a) it does not

necessarily imply computing the position from the left ; (b)

the leading-digit is not necessarily a ’1’, depending on the

application and the arithmetic representation.

Fully combinational design of an LDD is no light affair

since each output bit is a complicated function of all the inputs.

It is therefore related to the addition problem since all output

bits are high fan-in functions. Classical logic minimization

techniques such as Karnaugh maps would prove cumbersome

with this problem, and would not yield an efficient solu-

tion. The similarity with addition suggests that hierarchical,

parallel-prefix-like structures are possible. Furthermore, it will

be seen that the optimal solution involves joint generation

of all output bits. However, even software synthesis tools

which have the capability of exploiting commonalities between

many combinational functions will not recognize the simple

recursive structure presented herein.

Okobdzija [2] has identified the hierarchical nature of the

problem. Exploiting that structure, he has shown that parti-

tioning the problem prior to logic synthesis was better than

logic synthesis alone both in terms of area and propagation

delay. Other authors have published similar solutions [3],

[4] afterwards. Furthermore, there are many patents on priority

encoders (synonymous with lead-digit detector), most notably

one by Xilinx [5] from 2006.

None of these works, however, have clearly shown that the

circuit is in fact recursive, i.e. it is self-similar across stages.

This fact leads to a systematic design procedure which can be

embodied in a parameterizable HDL description. Furthermore,

it will be shown that comparison is a related function which

can be handled by the same circuit with little additional logic,

thus yielding a multifunction recursive circuit suitable for

automated recursive synthesis such as described in [6].

II. CIRCUIT ARCHITECTURE

A. Leading-digit detector

To bring out the recursive structure of the leading-digit

detection circuit, we start by examining the case of a 2-

bit operand. In the following discussion, we will assume

without loss of generality that the position Pn we will extract

corresponds to the bit with weight 2n, i.e. the position is

counted from the right, starting with 0. We will further assume

that the leading digit to be detected is a ’1’. The 2-bit LDD

has two outputs : position P and existence E. The existence

output indicates whether there is in fact a ’1’ in the operand.

Hence, if E = 0, the operand is "00" and the P output has no

meaning. The circuit operation as described leads to a fairly

simple truth table (see Table I).

A single OR-gate suffices to implement the 2-bit LZD, as

shown in Figure 1a. This circuit constitutes what we will call

the LDD foundation block and is represented henceforth by

the symbol of Figure 1b.

We can build a 4-bit LZD by replicating the 2-bit LZD

and adding some glue logic. We obtain the circuit of Figure

I1 I0 P E

0 0 X 00 1 0 11 0 1 11 1 1 1

TABLE I

TRUTH TABLE FOR 2-BIT LDD WITH INPUT VECTOR (I1I0).

I1 I0

EP

(a) circuit

LDDFB

P E

I1 I0

(b) symbol

Fig. 1. 2-bit leading-digit detector

2, where the added logic amounts to one gate and one

multiplexer. All outputs are generated with a 2-stage delay.

An arborescent structure becomes apparent when studying a

16-bit LDD circuit, such as pictured in Figure 3. The existence

bits at all levels are generated by a binary tree of two-input OR

gates. From the existence bits at the various levels, position bit

P0 is provided through a binary tree of two-input multiplexers.

A shorter, but otherwise identical, multiplexer tree generates

P1. Likewise, P2 is generated by a single multiplexer. Finally,

P3 is directly equal to E(1, 3) (existence bit of subtree 1 at

stage 3), i.e. it is generated from a zero-height tree.

It is noteworthy that much circuitry is shared by all outputs,

yet P0, P1 and P2 are generated from relatively independent

multiplexer subcircuits. The only common circuitry they share

is the OR-gate tree which generates their inputs. Furthermore,

it should be obvious that the structure can readily be generali-

zed to any desired depth h to handle operands of width 2h. All

outputs are produced with an h-stage delay. In technologies

where multiplexers have shorter delays than logic gates (e.g.

transmission-gate based multiplexers in CMOS), the position

vector would be produced least-significant bit first.

However, the recursive nature of this circuit is not im-

mediately apparent. The OR-gate tree by itself is definitely

recursive, being composed of two identical subtrees of h − 1stages. The various multiplexer trees are also dyadic trees,

E

Sel0

1

I1 I0I3 I2

P0

P1

Fig. 2. 4-bit leading-digit detector circuit.

Page 3: [IEEE 2008 Joint International IEEE Northeast Workshop on Circuits and Systems (NEWCAS) and TAISA Conference (NEWCAS-TAISA) - Montreal, QC, Canada (2008.06.22-2008.06.25)] 2008 Joint

23

Sel0

1

Sel0

1

Sel0

1

Sel0

1

I1 I0I3 I2I5 I4I7 I6I9 I8I11 I10I13 I12I15 I14

Sel0

1

Sel0

1

Sel0

1

Sel0

1

E

Sel0

1

P0

Sel0

1

Sel0

1

P1

P2

P3

Fig. 3. 16-bit leading-digit detector circuit.

LDD, h

LDD, h − 1 LDD, h − 1

Sel01

E E

Ph−1 E

h − 1

P P

(Ph−2 · · ·P0)

(I2h−1−1 · · · I0)

I

(I2h−1 · · · I2h−1)

I

Fig. 4. Recursive structure of 2h-bits, h-stages LDD.

but complications arise because their inputs are taken at

various points and stages of the OR-gate tree. Nonetheless,

this conceptual difficulty can be lifted, simply by representing

the circuit as a single tree. The recursive structure is illustrated

in Figure 4. It can be seen the right-hand side h − 1-stages

LDD takes I2h−1−1 through I0 as inputs, while the left-hand

side handles inputs I2h−1 through I2h−1 .

B. Comparator

The same structure can act as a comparator by interlacing

the two operands and keeping only the least-significant bit

(P0) of the position output vector (see Figure 5). The output

P0 now indicates which operand is larger while the existence

output E indicates whether the two operands are equal, in

which case output P0 becomes irrelevant. Comparing Figure

5 with the LDD circuit of Figure 4, the only difference is the

foundation block and the absence of the subtrees for P1 and

P2.

The circuit works by extracting the position of the leading

differing digit in the interlaced operands. Only the positions

where the two operands differ are of interest, therefore the

foundation block takes the form of an exclusive-OR gate

(see Figure 5). Indeed, the existence output E will then

Sel0

1

Sel0

1

Sel0

1

Sel0

1

A0 B0A1 B1A2 B2A3 B3A4 B4A5 B5A6 B6A7 B7

Sel0

1

Sel0

1

E

Sel0

1

P0

Fig. 5. 8-bit comparator circuit.

indicate whether all bits are equal (in which case E = 0).

Furthermore, the output P0 will yield the least-significant bit

of the position of the most significant digit which is different in

both operands. Referring to Figure 5, an odd position indicates

that operand A = (A7A6A5 · · ·A0) is larger, while an even

position indicates that A < B.

With slight modifications, such a circuit can easily support

two’s complement operands. It can be verified that if the

sign bits are inverted and fed to the comparator as standard

msb’s, all combinations of signs will be correctly treated.

In other words, the operands in Figure 5 would become(

A7A6A5 · · ·A0

)

and(

B7B6B5 · · ·B0

)

. Another option is to

treat the sign bits separately and feed only the remaining bits

to the tree structure.

C. Generalized comparator / LDD

A slightly more sophisticated tree structure can be evolved

by incorporating two modifiers called “flip” and “invert.”

Both modifiers have advantages for both comparator and LDD

functions. Hence, they will be described here in the context

of a bifunction circuit capable of both comparisons and lead-

digit detection. This duality in functionality affects only the

foundation blocks. The rest of the circuit remains unchanged,

although it is slightly modified from the previous sections to

accommodate the “flip” modifier.

The new foundation block is detailed in Figure 6. Three

new input signals appear which were not present in either the

LDD or comparator circuits : 1- signal F , which corresponds

to the “flip” modifier ; 2- signal Inv, which corresponds to

the “invert” modifier, and 3- signal C/L̄ which selects the

comparator or LDD function.

While the functionality of the third signal should be obvious

(circuit acts as a comparator if it is high, as an LDD otherwise),

the other two signals require some explanations. When Invis low, the circuit functions normally. But when Inv = 1,

all inputs are inverted. For the LDD function, this implies

that the position of the leading zero will be isolated instead

of the leading one. This can be useful for instance when

working with a negative two’s complement operand. Likewise,

the comparator function would then isolate the leading zero

Page 4: [IEEE 2008 Joint International IEEE Northeast Workshop on Circuits and Systems (NEWCAS) and TAISA Conference (NEWCAS-TAISA) - Montreal, QC, Canada (2008.06.22-2008.06.25)] 2008 Joint

24

An Bn

Sel01

Sel01

Inv C/L̄

Sel0

1

F

P

Sel0 1

E

′0′

Fig. 6. Foundation block for generalized comparator / LDD function

C / LDD, h

C / LDD, h − 1 C / LDD, h − 1

Sel01

E E

Ph−1 E

h − 1

Sel0 1

F

FF

P P

(Ph−2 · · ·P0)

(I2h−1−1 · · · I0)

I

(I2h−1 · · · I2h−1)

I

Fig. 7. Recursive structure of 2h bits, h-stages generalized comparator /

LDD circuit

in the positions that differ in the two interlaced operands.

This identifies the larger magnitude of two negative two’s

complement operands.

The signal F flips the logical weight ordering of the circuit

so that trailing digit positions are identified. With the LDD

function and Inv = 1, this makes it possible to find the

trailing zero position, a prerequisite for a fast incrementer

function (described below). Likewise, combining F = 1 and

Inv = 0 isolates the trailing ’1’ position, a prerequisite for

decrementation.

Figure 7 shows the overall recursive structure of the ge-

neralized circuit. With respect to the comparator and LDD

functions, the only addition in the unifying components is a

2-to-1 multiplexer, necessary to correctly implement the flip

modifier. The Inv and C/L̄ inputs are omitted for clarity.

III. SYNTHESIS RESULTS

Table II shows synthesis results on a Virtex-4 target, more

specifically a 4VFX140 device with a speed grade of 11, for

a 32-bit LDD circuit, 16-bit comparator circuit, and 32-bit

comp / LDD circuit. Table III show similar synthesis results

in CMOS for a 0.18 µm technology by TSMC with a supply

voltage of 1.8V. It can be seen that moving to the joint Comp

/ LDD circuit has no significant impact on the critical path.

On the other hand, it may seem surprising that the number

of FPGA slices is slightly more for the Comp / LDD circuit

than the sum of the two elementary functions. This is easy

Circuit slices LUTs critical path

LDD 32 bits 19 33 8.925 ns

Comp 16 bits 13 22 9.072 ns

Comp / LDD 39 71 11.275 ns

TABLE II

SYNTHESIS RESULTS ON VIRTEX-4 (4VFX140-11) FPGA TARGET

Circuit area critical path

LDD 32 bits 701.9 µm2 2.30 ns

Comp 16 bits 901.5 µm2 1.64 ns

Comp / LDD 3153.4 µm2 2.47 ns

TABLE III

SYNTHESIS RESULTS ON TSMC 0.18 µM CMOS

to explain given the considerable added complexity of the

foundation block which provides much additional flexibility

with its Flip and Inv signals. However, the area of the Comp

/ LDD circuit in CMOS is surprisingly high and warrants

additional investigation in the future.

IV. CONCLUSION

Three recursive circuits have been presented for fundamen-

tal arithmetic operations which are amenable to automatic

scaling and synthesis through structural recursion in VHDL.

This highlights the recursive structure inherent in many arith-

metic operations which are considered to be problematic for

automated synthesis. Exploiting this recursive structure leads

to an elegant design which exhibits many desirable proper-

ties : scalability, minimum depth, equal depth for all outputs,

constant and low fan-in and fan-out. Also, the combined

Comp / LDD circuit shows how two distinct functions can be

combined advantageously in a single circuit. The full potential

of this circuit, which has many secondary modes, has not been

fully explored due to lack of space. It can, among other things,

be used as a basis to produce other functions, such as fast

incrementation / decrementation.

Finally, the concepts presented herein are substantiated by

synthesis results in both Virtex-4 FPGA and 0.18 µm CMOS

technologies.

REFERENCES

[1] A. K. Verma, P. Brisk, and P. Ienne, “Progressive decomposition : Aheuristic to structure arithmetic circuits,” in Proc. Design and Automa-

tion Conference (DAC), San Diego, California, June 4-8, 2007.

[2] V. G. Oklobdzija, “An algorithmic and novel design of a leading zerodetector circuit : comparison with logic synthesis,” IEEE Trans. VLSI

Systems, vol. 2, no. 1, March 1994.

[3] H. Suzuki and al., “Leading-zero anticipatory logic for high-speedfloating point addition,” IEEE J. Solid-State Circuits, vol. 31, pp. 1157–1169, Aug. 1996.

[4] E. Antelo et al., “A novel design of a two-operand normalization circuit,”IEEE Transactions VLSI Systems, vol. 6, no. 1, March 1998.

[5] S. Secatch and J. E. Ogden, “Binary priority encoder,” U. S. Patent7,057,546 B1 assigned to Xilinx Inc., granted June 6, 2006.

[6] P. J. Ashenden, “Recursive and Repetitive Hardware Models in VHDL,”Technical Report TR 160/12/93/ECE, Dept. of Electrical and ComputerEng., U. Cincinnati, Ohio, USA ; also published as Technical Report93–19, Dept. of Computer Science, U. Adelaide, South Australia,Australia