### Floating point revisited

In floating point, numbers are represented by a sign bit, an exponent, and a mantissa. The value of a normal floating point number is $\pm 1.m_2 \cdot 2^{e}$ (actually, *e* is stored with a bias in order to be able to treat it like an unsigned number most of the time, but let's not get distracted by that kind of detail). By using an exponent, a wide range of numbers can be represented at a constant relative accuracy.

There are some non-normal floating point numbers. When *e* is maximal, the number is either considered infinity or "not a number", depending on *m*. When *e* is minimal, it represents a *sub-normal* number: either a *denormal* or zero.
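To make the case distinction concrete, here's a minimal sketch in Python that classifies a single-precision bit pattern by looking only at its exponent and mantissa fields. The function names `classify_float32` and `float32_to_bits` are made up for this illustration.

```python
import struct

def float32_to_bits(x: float) -> int:
    """Return the 32-bit pattern of x rounded to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def classify_float32(bits: int) -> str:
    """Classify a single-precision bit pattern by its exponent and mantissa."""
    e = (bits >> 23) & 0xFF   # 8 exponent bits (stored with a bias)
    m = bits & 0x7FFFFF       # 23 mantissa bits
    if e == 0xFF:             # maximal exponent: infinity or NaN, depending on m
        return "infinity" if m == 0 else "NaN"
    if e == 0:                # minimal exponent: sub-normal (zero or denormal)
        return "zero" if m == 0 else "denormal"
    return "normal"

for value in (1.0, 0.0, float("inf"), float("nan"), 1e-40):
    print(value, classify_float32(float32_to_bits(value)))
```

The sign bit plays no role in the classification, which is why negative zero and negative infinity land in the same buckets as their positive counterparts.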

Denormals can be confusing at first, but their justification is actually quite simple. Let's take single-precision floating point as an example, where there are 8 exponent bits and 23 mantissa bits. The smallest positive normal single-precision floating point number is $1.00000000000000000000000_2 \cdot 2^{-126}$. The next larger representable number is $1.00000000000000000000001_2 \cdot 2^{-126}$. Those numbers are not equal, but their difference is not representable as a normal single-precision floating point number. It would be rather odd if the difference between non-equal numbers were equal to zero, as it would be if we had to round the difference to zero!

When *e* is minimal, the represented number is (in the case of single-precision floating point) $\pm 0.m_2 \cdot 2^{-126}$, which means that the difference between the smallest normal numbers, $0.00000000000000000000001_2 \cdot 2^{-126}$, can still be represented.

Note how with floating point numbers, the relative accuracy with which numbers can be represented is constant for almost the entire range of representable numbers. Once you get to sub-normal numbers, the accuracy drops very quickly. At the other end, the drop is even more extreme with a sudden jump to infinity.
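The argument for denormals can be checked with exact arithmetic. The sketch below uses Python's `fractions` module; `float32_value` is a hypothetical helper that decodes a finite single-precision bit pattern according to the two formulas above.

```python
from fractions import Fraction

def float32_value(bits: int) -> Fraction:
    """Exact value of a finite single-precision bit pattern."""
    sign = -1 if bits >> 31 else 1
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    if e == 0xFF:
        raise ValueError("infinity or NaN has no finite value")
    if e == 0:
        # sub-normal: 0.m * 2^-126
        return sign * Fraction(m, 2**23) * Fraction(1, 2**126)
    # normal: 1.m * 2^(e - bias), with a bias of 127
    return sign * (1 + Fraction(m, 2**23)) * Fraction(2) ** (e - 127)

smallest_normal = float32_value(0x00800000)   # 1.00...0 * 2^-126
next_normal = float32_value(0x00800001)       # 1.00...1 * 2^-126
difference = next_normal - smallest_normal

# The difference is exactly the smallest positive denormal.
print(difference == float32_value(0x00000001))
```

Without denormals, the closest representable value to `difference` would be zero.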

### Posits

The basic idea of posits is to vary the size of the mantissa and to use a variable-length hybrid encoding of the exponent that mixes unary with binary encodings. The variable-length exponent encoding is shorter for exponents close to zero, so that more bits of mantissa are available for numbers close to one.

Posits have a fixed number of binary exponent bits *e* (except in the extreme ranges), and a posit system is characterized by that number, called *es*. A typical choice appears to be *es = 3*. The unary part of the exponent is encoded by the *r* bits. For positive posits, $10_1$ encodes 0, $110_1$ encodes 1, $01_1$ encodes -1, $001_1$ encodes -2, and so on. The overall encoded number is then $\pm 1.m_2 \cdot 2^{r \cdot 2^{es} + e}$.

Let's look at some examples of 16-bit posits with *es = 3*.

- 0 10 000 0000000000 is $1.0_2 \cdot 2^{0 \cdot 2^3 + 0} = 1$.
- 0 10 000 1000000000 is $1.1_2 \cdot 2^{0 \cdot 2^3 + 0} = 1.5$.
- 0 10 001 0000000000 is $1.0_2 \cdot 2^{0 \cdot 2^3 + 1} = 2$.
- 0 10 111 0000000000 is $1.0_2 \cdot 2^{0 \cdot 2^3 + 7} = 128$.
- 0 110 000 000000000 is $1.0_2 \cdot 2^{1 \cdot 2^3 + 0} = 256$.
- 0 1110 000 00000000 is $1.0_2 \cdot 2^{2 \cdot 2^3 + 0} = 65536$.
- 0 111111111110 000 is $1.0_2 \cdot 2^{10 \cdot 2^3 + 0} = 2^{80}$. Note that there is no mantissa anymore! The next larger number is:
- 0 111111111110 001 is $1.0_2 \cdot 2^{10 \cdot 2^3 + 1} = 2^{81}$.
- 0 111111111110 111 is $1.0_2 \cdot 2^{10 \cdot 2^3 + 7} = 2^{87}$.
- 0 1111111111110 00 is $1.0_2 \cdot 2^{11 \cdot 2^3 + 0} = 2^{88}$. Now the number of binary exponent bits starts shrinking. The missing bits are implicitly zero, so the next larger number is:
- 0 1111111111110 01 is $1.0_2 \cdot 2^{11 \cdot 2^3 + 2} = 2^{90}$.
- 0 1111111111110 11 is $1.0_2 \cdot 2^{11 \cdot 2^3 + 6} = 2^{94}$.
- 0 11111111111110 0 is $1.0_2 \cdot 2^{12 \cdot 2^3 + 0} = 2^{96}$.
- 0 11111111111110 1 is $1.0_2 \cdot 2^{12 \cdot 2^3 + 4} = 2^{100}$.
- 0 111111111111110 is $1.0_2 \cdot 2^{13 \cdot 2^3 + 0} = 2^{104}$. There are no binary exponent bits left, but the presentation in the slides linked above still allows for one larger normal number:
- 0 111111111111111 is $1.0_2 \cdot 2^{14 \cdot 2^3 + 0} = 2^{112}$.

Going in the other direction, we get:

- 0 01 111 0000000000 is $1.0_2 \cdot 2^{-1 \cdot 2^3 + 7} = 1/2 = 0.5$.
- 0 01 000 0000000000 is $1.0_2 \cdot 2^{-1 \cdot 2^3 + 0} = 1/256 = 0.00390625$.
- 0 001 000 000000000 is $1.0_2 \cdot 2^{-2 \cdot 2^3 + 0} = 1/65536 = 0.0000152587890625$.
- 0 000000000001 111 is $1.0_2 \cdot 2^{-11 \cdot 2^3 + 7} = 2^{-81}$.
- 0 000000000001 000 is $1.0_2 \cdot 2^{-11 \cdot 2^3 + 0} = 2^{-88}$.
- 0 0000000000001 11 is $1.0_2 \cdot 2^{-12 \cdot 2^3 + 6} = 2^{-90}$.
- 0 0000000000001 00 is $1.0_2 \cdot 2^{-12 \cdot 2^3 + 0} = 2^{-96}$.
- 0 00000000000001 0 is $1.0_2 \cdot 2^{-13 \cdot 2^3 + 0} = 2^{-104}$.
- 0 000000000000001 is $1.0_2 \cdot 2^{-14 \cdot 2^3 + 0} = 2^{-112}$. This is the smallest positive normal number, since we have no choice but to treat 0 specially:
- 0 000000000000000 is 0.
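The decoding rules used in these examples are mechanical enough to write down. Here's a minimal sketch in Python, assuming the encoding exactly as described in this post (unary part, then up to *es* binary exponent bits with missing bits read as zero, then mantissa). The name `decode_posit` and its parameters are made up for this illustration, and only posits with a sign bit of 0 are handled.

```python
from fractions import Fraction

def decode_posit(bits: int, n: int = 16, es: int = 3) -> Fraction:
    """Exact value of a non-negative n-bit posit given as an unsigned integer."""
    if bits >> (n - 1):
        raise ValueError("this sketch only handles posits with sign bit 0")
    if bits == 0:
        return Fraction(0)                     # zero is the one special case
    body = format(bits, f"0{n - 1}b")          # the n-1 bits after the sign bit
    first = body[0]
    run = len(body) - len(body.lstrip(first))  # length of the leading run
    # A run of k ones encodes r = k - 1, a run of k zeros encodes r = -k.
    r = run - 1 if first == "1" else -run
    rest = body[run + 1:]                      # skip the bit that ends the run
    exp_bits = rest[:es].ljust(es, "0")        # missing exponent bits are zero
    e = int(exp_bits, 2) if es else 0
    mant = rest[es:]                           # whatever is left is mantissa
    m = Fraction(int(mant, 2), 1 << len(mant)) if mant else Fraction(0)
    return (1 + m) * Fraction(2) ** (r * 2**es + e)

print(decode_posit(0b0_10_000_1000000000))   # 1.1 in binary, i.e. 3/2
print(decode_posit(0b0_1111111111110_01) == Fraction(2) ** 90)
```

Using `Fraction` keeps the results exact even at the extreme ends of the range, where the values would not survive a round trip through a 16-bit float.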

For values close to 1, the accuracy is the same as for half-precision floating point numbers (which have 5 exponent and 10 mantissa bits). Half-precision floating point numbers do have slightly higher accuracy at the extreme ends of their dynamic range, but the dynamic range of posits is *much* higher. This is a very tempting trade-off for many applications.

By the way: if we had set *es = 2*, we could have larger accuracy for values close to 1, while still having a higher dynamic range than half-precision floating point.
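A quick back-of-the-envelope check of that trade-off, as a small Python sketch. The helper names are hypothetical; the half-precision figures (10 mantissa bits, largest normal just below $2^{16}$, smallest positive denormal $2^{-24}$) are the standard ones.

```python
def posit_max_exponent(n: int, es: int) -> int:
    """Exponent of the largest positive n-bit posit: r = n - 2 and e = 0."""
    return (n - 2) * 2**es

def posit_mantissa_bits_near_one(n: int, es: int) -> int:
    """Bits left after the sign bit, a 2-bit unary part, and es exponent bits."""
    return n - 1 - 2 - es

HALF_MANTISSA_BITS = 10
HALF_LARGEST_EXPONENT = 15     # the largest normal half is just below 2^16
HALF_SMALLEST_EXPONENT = -24   # the smallest positive denormal half is 2^-24

print(f"half:        {HALF_MANTISSA_BITS} mantissa bits, "
      f"2^{HALF_SMALLEST_EXPONENT} .. 2^{HALF_LARGEST_EXPONENT}")
for es in (2, 3):
    top = posit_max_exponent(16, es)
    bits = posit_mantissa_bits_near_one(16, es)
    print(f"posit es={es}: {bits} mantissa bits near 1, 2^-{top} .. 2^{top}")
```

With *es = 2* the range shrinks to $2^{-56}$ through $2^{56}$, which is still wider than half precision, and one more mantissa bit becomes available near 1.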

You'll note that we have not encountered an infinity. Gustafson's proposal here is to do away with the distinction between positive and negative zero and infinity. Instead, his proposal is to think of the real numbers projectively, and use a two's complement representation, meaning that negating a posit is the same operation at the bit level as negating an integer. For example:

- 1 111111111111111 is $-1.0_2 \cdot 2^{-14 \cdot 2^3 + 0} = -2^{-112}$.
- 1 10 000 0000000000 is $-1.0_2 \cdot 2^{0 \cdot 2^3 + 0} = -1$. The next smaller number (larger in absolute magnitude) is:
- 1 01 111 1111111111 is $-1.0000000001_2 \cdot 2^{0 \cdot 2^3 + 0}$.
- 1 01 111 1000000000 is $-1.1_2 \cdot 2^{0 \cdot 2^3 + 0} = -1.5$.
- 1 000000000000001 is $-1.0_2 \cdot 2^{14 \cdot 2^3 + 0} = -2^{112}$.

The bit pattern 1 000000000000000 (which, like 0, is its own inverse in two's complement negation) would then represent infinity.
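At the bit level, this negation is ordinary 16-bit two's complement negation. A minimal sketch in Python, where `negate_posit16` is a name invented for this example:

```python
def negate_posit16(bits: int) -> int:
    """Negate a 16-bit posit: the same operation as negating a 16-bit integer."""
    return (-bits) & 0xFFFF

one = 0b0_10_000_0000000000
print(format(negate_posit16(one), "016b"))      # bit pattern of -1
print(format(negate_posit16(0x0000), "016b"))   # zero maps to itself
print(format(negate_posit16(0x8000), "016b"))   # so does the infinity pattern
```

Zero and 1 000000000000000 are the only two patterns that map to themselves, which is what makes the latter a natural choice for the single unsigned infinity.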

There's an elegance to thinking projectively in this way. Comparison of posits is the same as comparison of signed integers at the bit level (except for infinity, which is unordered). Even better, it's great that the smallest and largest normal numbers are multiplicative inverses of each other.
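The comparison property can be spot-checked on the example bit patterns from this post. The sketch below lists them in increasing order of the value they encode and reinterprets each pattern as a signed 16-bit integer; `as_signed16` is a hypothetical helper, and this is a check on a handful of examples, not a proof.

```python
def as_signed16(bits: int) -> int:
    """Reinterpret a 16-bit pattern as a two's complement signed integer."""
    return bits - 0x10000 if bits & 0x8000 else bits

# (bit pattern, encoded value), sorted by encoded value
examples = [
    (0b1_000000000000001, "-2^112"),
    (0b1_01_111_1000000000, "-1.5"),
    (0b1_10_000_0000000000, "-1"),
    (0b1_111111111111111, "-2^-112"),
    (0b0_000000000000000, "0"),
    (0b0_000000000000001, "2^-112"),
    (0b0_10_000_0000000000, "1"),
    (0b0_10_000_1000000000, "1.5"),
    (0b0_111111111111111, "2^112"),
]

keys = [as_signed16(bits) for bits, _ in examples]
# True if the signed integers are strictly increasing, like the values
is_ordered = all(a < b for a, b in zip(keys, keys[1:]))
print(is_ordered)
```

The infinity pattern 1 000000000000000 is deliberately left out, since it is unordered.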

But to people used to floating point, not having a "sign + magnitude" representation is surprising. I also imagine that it could be annoying for a hardware implementation, so let's look into that.

### Hardware implementations

In his presentations, Gustafson claims that by reducing the number of special cases, posits are easier to implement than floating point. No doubt there are fewer special cases (no denorms, no NaNs), but at the cost of a more complicated normal case.

Let's take a look at a floating point multiply. The basic structure is conceptually quite simple, since all parts of a floating point number can be treated separately.

By far the most expensive part here is the multiplication of the mantissas. There are of course a bunch of special cases that need to be accounted for: the inputs could be zero, infinity, or NaN, and the multiplication could overflow. Each of these cases is easily detected and handled with a little bit of comparatively inexpensive boolean logic.
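The separate treatment of sign, exponent, and mantissa can be sketched in software. The Python below is a deliberately simplified model of the normal case only: it assumes both inputs and the result are normal single-precision numbers, skips every special case mentioned above, and truncates instead of rounding. All function names are made up for this illustration; it is not how any particular hardware implements it.

```python
import struct

def bits_of(x: float) -> int:
    """32-bit pattern of x as a single-precision float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def float_of(bits: int) -> float:
    """Value of a single-precision bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def multiply_normal_float32(a: int, b: int) -> int:
    """Multiply two normal float32 bit patterns (normal result, truncating)."""
    sign = (a ^ b) >> 31                     # signs: a single XOR
    exp_a = (a >> 23) & 0xFF
    exp_b = (b >> 23) & 0xFF
    exp = exp_a + exp_b - 127                # exponents: add, remove one bias
    man_a = (a & 0x7FFFFF) | 0x800000        # 1.m as a 24-bit integer
    man_b = (b & 0x7FFFFF) | 0x800000
    product = man_a * man_b                  # the expensive part: 24x24 multiply
    if product >> 47:                        # product in [2, 4): renormalize
        product >>= 1
        exp += 1
    mantissa = (product >> 23) & 0x7FFFFF    # drop the hidden 1 and the low bits
    return (sign << 31) | (exp << 23) | mantissa

print(float_of(multiply_normal_float32(bits_of(1.5), bits_of(2.5))))
```

Everything except the `man_a * man_b` line is a handful of narrow adders, shifts by a fixed amount, and bit masks.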

Where it starts to get complicated is when handling the possibility that an input is denormal, or when the multiplication of two normal numbers results in a denormal.

When an input is denormal, the corresponding input for the multiply is *0.m* instead of *1.m*. Some logic has to decide whether the most significant input bit to the multiply is 0 or 1. This could potentially add to the latency of the computation. Luckily, deciding whether the input is denormal is fairly simple, and only the most significant input bit is affected. Because of carries, the less significant input bits tend to be more critical for latency. Conversely, this means that the latency of determining the most significant input bit can be hidden well.

On the output side, the cost is higher, both in terms of the required logic and in terms of the added latency, because a shifter is needed to shift the output into the correct position. Two cases need to be considered:

- When a multiplication of two normal numbers results in a denormal, the output has to be shifted to the right an appropriate number of places.
- When a denormal is multiplied by a normal number, the output needs to be shifted to the left or the right, depending on the exponent of the normal number. Additionally, the number of leading zeros of either the denormal input or of the multiplication output is required to determine the exponent of the final result. Since the area cost is the same either way, I would expect implementations to count the leading zeros of the denormal input, since that allows for better latency hiding.

(The design space for floating point multipliers is larger than I've shown here. For example, you could deal with denormals by shifting their mantissa into place *before* the multiply. That seems like a waste of hardware considering that you cannot avoid the shifter after the multiply, but my understanding of hardware design is limited, so who knows.)

So there is a bit more hardware required than just what is shown in the diagram above: a leading-zero-count and a shifter, plus a bit more random logic. But now compare to the effort required for a posit multiply:

First of all, there is unavoidable latency in front of the multiplier. Every single bit of mantissa input may be masked off, depending on the variable size of the exponent's unary part. The exponents themselves need to be decoded in order to add them, and then the resulting exponent needs to be encoded again. Finally, the multiplication result needs to be shifted into place; this was already required for floating point multiplication, but the determination of the shift becomes more complicated since it depends on the exponent size. Also, each output bit needs a multiplexer since it can originate from either the exponent or the mantissa.

From my non-expert glance, here's the hardware you need in addition to the multiplier and exponent addition:

- two leading-bit counts to decode the unary exponent parts (floating-point multiply only needs a single leading-zero count for a denormal input)
- two shifters to shift the binary input exponent parts into place
- logic for masking the input mantissas
- one leading bit encoder
- one shifter to shift the binary output exponent part into place
- one shifter to shift the mantissa into place (floating-point also needs this)
- multiplexer logic to combine the variable-length output parts

Depending on *es*, the mantissa multiplier itself may also have to be wider: the mantissa of posits close to 1 can be larger than the mantissa of floating point numbers.

On the other hand, the additional shifters don't have to be large, since they only need to shift *es* bits. The additional hardware is almost certainly dominated by the cost of the mantissa multiplier. Still, the additional latency could be a problem - though obviously, I have no actual experience designing floating point multipliers.
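To illustrate why the exponent path needs a decode and a re-encode, here's a small Python model of just that path. It assumes the unary and binary exponent parts have already been extracted as integers *r* and *e*, ignores a possible carry out of the mantissa product, and ignores the truncation at the extreme ends of the range. The function names are hypothetical.

```python
def add_posit_exponents(r_a: int, e_a: int, r_b: int, e_b: int, es: int = 3):
    """Combine the exponents of two posits and split the sum into (r, e) again."""
    scale = (r_a * 2**es + e_a) + (r_b * 2**es + e_b)   # decode, then add
    r_out, e_out = divmod(scale, 2**es)                 # split for re-encoding
    return r_out, e_out

def encode_unary_part(r: int) -> str:
    """Unary part for positive posits: 10 is 0, 110 is 1, 01 is -1, 001 is -2."""
    return "1" * (r + 1) + "0" if r >= 0 else "0" * (-r) + "1"

# 128 * 2: (r=0, e=7) times (r=0, e=1) gives 2^8, so r=1, e=0
r, e = add_posit_exponents(0, 7, 0, 1)
print(r, e, encode_unary_part(r))
```

The length of the string returned by `encode_unary_part` is what determines how many bits remain for the binary exponent and the mantissa, which is why the output needs the multiplexer logic listed above.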

There's also the issue of the proposed two's complement representation for negative posits. This may not be too bad for the mantissa multiplication, since one can probably treat it as a signed integer multiplication and automatically get the correct signs for the resulting mantissa. However, I would expect some more overhead for the decoding and encoding of the exponent.

The story should be similar for posit vs. floating point addition. When building a multiply-accumulate unit, the latency that is added for masking the input based on the variable exponent length can likely be hidden quite well, but there does not appear to be a way around the decoding and encoding of exponents.

### Closing thoughts

As explained above, I expect posit hardware to be more expensive than floating point hardware. However, the gain in dynamic range and accuracy is nothing to sneeze at. It's worth giving posits a fair shot, since the trade-off may be worth it.

There is a lot of legacy software that relies on floating point behavior. Luckily, a posit ALU contains all the pieces of a floating point ALU, so it should be possible to build an ALU that can do both at pretty much the cost of a posit-only ALU. This makes a painless transition feasible.

Posits have an elegant design based on thinking about numbers projectively, but the lack of NaNs, the two's complement representation, and not having signed zeros and infinities may be alien to some floating point practitioners. I don't know how much of an issue this really is, but it's worth pointing out that a simple modification to posits could accommodate all these concerns. Using again the example of 16-bit posits with *es = 3*, we could designate bit patterns at the extreme ends as NaN and infinity:

- 0 111111111111111 is *+inf* (instead of $2^{112}$).
- 0 000000000000001 is *+NaN* (instead of $2^{-112}$).

We could then treat the sign bit independently, like in floating point, giving us *±0*, *±inf*, and *±NaN*. The neat properties related to thinking projectively would be lost, but the smallest and largest positive normal numbers would still be multiplicative inverses of each other. The hardware implementation may even be smaller, thanks to not having to deal with two's complement exponents.

The inertia of floating point is massive, and I don't expect it to be unseated anytime soon. But it's awesome to see people rethinking such fundamental building blocks of computing and coming up with solid new ideas. Posits aren't going to happen quickly, if at all, but it's worth taking them seriously.