Octuple-precision floating-point format

In computing, octuple precision is a binary floating-point-based computer number format that occupies 32 bytes (256 bits) in computer memory. It is intended for applications that require results in higher than quadruple precision. The format is rarely (if ever) used, and very few environments support it.

IEEE 754 octuple-precision binary floating-point format: binary256

In its 2008 revision, the IEEE 754 standard specifies a binary256 format among the interchange formats (it is not a basic format), as having:

Sign bit: 1 bit
Exponent width: 19 bits
Significand precision: 237 bits (236 explicitly stored)

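The format is written with an implicit leading bit with value 1 unless the exponent field is all zeros. Thus only 236 bits of the significand appear in the memory format, but the total precision is 237 bits (approximately 71 decimal digits: log₁₀(2²³⁷) ≈ 71.344). From most to least significant, the bits are laid out as the sign bit (bit 255), the 19 exponent bits (bits 254 to 236), and the 236 stored significand bits (bits 235 to 0).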
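As a minimal sketch of this layout, the following Python splits a 256-bit pattern into its three fields. It assumes the usual IEEE 754 ordering (sign in the most significant bit, then exponent, then stored significand); the names split_fields and from_hex are illustrative and not taken from any library.

```python
# Field widths as listed above; all names here are illustrative.
SIGN_BITS, EXP_BITS, FRAC_BITS = 1, 19, 236

def split_fields(bits: int) -> tuple:
    """Split a 256-bit pattern into (sign, biased exponent, stored significand)."""
    if not 0 <= bits < 1 << 256:
        raise ValueError("expected a 256-bit unsigned integer")
    frac = bits & ((1 << FRAC_BITS) - 1)                # low 236 bits
    exp = (bits >> FRAC_BITS) & ((1 << EXP_BITS) - 1)   # next 19 bits
    sign = bits >> (FRAC_BITS + EXP_BITS)               # top bit
    return sign, exp, frac

def from_hex(text: str) -> int:
    """Parse a space-separated hexadecimal pattern of 64 hex digits."""
    return int(text.replace(" ", ""), 16)

one = from_hex("3fff f000" + " 0000" * 14)  # bit pattern of the value 1
print(split_fields(one))  # (0, 262143, 0)
```

Python integers are arbitrary-precision, so the whole 256-bit pattern fits in a single int and plain shifts and masks suffice.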

Exponent encoding

The octuple-precision binary floating-point exponent is encoded using an offset binary representation with a zero offset of 262143, known as the exponent bias in the IEEE 754 standard.

E_min = −262142
E_max = 262143
Exponent bias = 3FFFF₁₆ = 262143

Thus, as defined by the offset binary representation, the true exponent is obtained by subtracting the offset of 262143 from the stored exponent.

The stored exponents 00000₁₆ and 7FFFF₁₆ are interpreted specially.

Exponent	Significand zero	Significand non-zero	Equation
00000₁₆	0, −0	subnormal numbers	(-1)^signbit × 2^−262142 × 0.significandbits₂
00001₁₆, ..., 7FFFE₁₆	normalized value		(-1)^signbit × 2^(exponent bits₂ − 262143) × 1.significandbits₂
7FFFF₁₆	±∞	NaN (quiet, signalling)
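The table can be read as a small decoding procedure. The sketch below is one possible Python rendering that returns the exact value as a Fraction; the function name decode and the string results for infinities and NaN are assumptions made for illustration. Because Fraction has no signed zero, −0 decodes to plain 0 here.

```python
from fractions import Fraction

BIAS = 0x3FFFF       # 262143
EXP_MAX = 0x7FFFF    # all-ones exponent field
FRAC_BITS = 236

def decode(sign: int, exp: int, frac: int):
    """Return the exact value as a Fraction, or a string for infinities and NaN."""
    if exp == EXP_MAX:
        if frac:
            return "NaN"
        return "-inf" if sign else "+inf"
    if exp == 0:
        # Zero or subnormal: no implicit leading 1, exponent fixed at E_min.
        significand = Fraction(frac, 1 << FRAC_BITS)
        true_exp = 1 - BIAS            # -262142
    else:
        # Normal number: implicit leading 1, bias subtracted from the stored exponent.
        significand = 1 + Fraction(frac, 1 << FRAC_BITS)
        true_exp = exp - BIAS
    magnitude = significand * Fraction(2) ** true_exp
    return -magnitude if sign else magnitude

print(decode(0, 262143, 0))         # 1
print(decode(0, 262144, 1 << 235))  # 3
print(decode(1, 262142, 0))         # -1/2
print(decode(0, 0x7FFFF, 0))        # +inf
```

Values near the ends of the exponent range are exact but have tens of thousands of decimal digits, so they are better compared than printed.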

The minimum strictly positive (subnormal) value is 2^−262378 ≈ 10⁻⁷⁸⁹⁸⁴ and has a precision of only one bit. The minimum positive normal value is 2^−262142 ≈ 2.4824 × 10⁻⁷⁸⁹¹³. The maximum representable value is 2²⁶²¹⁴⁴ − 2²⁶¹⁹⁰⁷ ≈ 1.6113 × 10⁷⁸⁹¹³.
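The decimal approximations above can be checked without building the huge numbers themselves by working with base-10 logarithms. The following is a rough sketch under that approach; the helper name power_of_two_sci is hypothetical, and double-precision logarithms limit it to roughly ten correct digits at this magnitude.

```python
import math

def power_of_two_sci(exponent: int, digits: int = 4) -> str:
    """Approximate 2**exponent in scientific notation using log10."""
    log10_value = exponent * math.log10(2)
    decimal_exp = math.floor(log10_value)
    mantissa = 10 ** (log10_value - decimal_exp)   # lies in [1, 10)
    return f"{mantissa:.{digits}f}e{decimal_exp:+d}"

print(power_of_two_sci(-262378))  # smallest subnormal: 2.2480e-78984
print(power_of_two_sci(-262142))  # smallest normal:    2.4824e-78913
# The largest finite value is 2**262144 * (1 - 2**-237); the second factor
# is far too close to 1 to affect the leading digits shown here.
print(power_of_two_sci(262144))   # 1.6113e+78913
```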

Octuple-precision examples

These examples are given in bit representation, in hexadecimal, of the floating-point value. This includes the sign, (biased) exponent, and significand.

0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000₁₆ = +0
8000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000₁₆ = −0

7fff f000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000₁₆ = +infinity
ffff f000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000₁₆ = −infinity

0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001₁₆
= 2^−262142 × 2⁻²³⁶ = 2^−262378
≈ 2.24800708647703657297018614776265182597360918266100276294348974547709294462 × 10⁻⁷⁸⁹⁸⁴
  (smallest positive subnormal number)

0000 0fff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff₁₆
= 2^−262142 × (1 − 2⁻²³⁶)
≈ 2.4824279514643497882993282229138717236776877060796468692709532979137875392 × 10⁻⁷⁸⁹¹³
  (largest subnormal number)

0000 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000₁₆
= 2^−262142
≈ 2.48242795146434978829932822291387172367768770607964686927095329791378756168 × 10⁻⁷⁸⁹¹³
  (smallest positive normal number)

7fff efff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff₁₆
= 2²⁶²¹⁴³ × (2 − 2⁻²³⁶)
≈ 1.61132571748576047361957211845200501064402387454966951747637125049607182699 × 10⁷⁸⁹¹³
  (largest normal number)

3fff efff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff₁₆
= 1 − 2⁻²³⁷
≈ 0.999999999999999999999999999999999999999999999999999999999999999999999995472
  (largest number less than one)

3fff f000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000₁₆
= 1 (one)

3fff f000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001₁₆
= 1 + 2⁻²³⁶
≈ 1.00000000000000000000000000000000000000000000000000000000000000000000000906
  (smallest number larger than one)
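The patterns above can be reproduced by packing the three fields into one integer and printing it as 64 hexadecimal digits in groups of four. This is an illustrative Python sketch; pack_hex is a hypothetical helper, not part of any library.

```python
EXP_BITS, FRAC_BITS = 19, 236
ALL_ONES = (1 << FRAC_BITS) - 1   # significand field with every bit set

def pack_hex(sign: int, exp: int, frac: int) -> str:
    """Assemble sign, biased exponent and stored significand into grouped hex."""
    bits = (sign << (EXP_BITS + FRAC_BITS)) | (exp << FRAC_BITS) | frac
    digits = f"{bits:064x}"
    return " ".join(digits[i:i + 4] for i in range(0, 64, 4))

print(pack_hex(0, 0x3FFFF, 1))          # smallest number larger than one
print(pack_hex(0, 0x3FFFE, ALL_ONES))   # largest number less than one
print(pack_hex(1, 0x7FFFF, 0))          # negative infinity
```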

By default, 1/3 rounds down, as in double precision, because the significand has an odd number of bits (237). The bits beyond the rounding point are then 0101..., which is less than 1/2 of a unit in the last place.
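The rounding direction can be checked with exact rational arithmetic. The sketch below rounds a positive fraction to a given number of significant bits using round-to-nearest, ties-to-even, and reports the direction. It ignores exponent range limits, and the function name round_to_precision is an assumption for illustration.

```python
from fractions import Fraction

def round_to_precision(x: Fraction, precision: int):
    """Round positive x to `precision` significant bits (ties to even).

    Returns (rounded value, direction) with direction 'down', 'up' or 'exact'.
    """
    # Find e such that 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    ulp = Fraction(2) ** (e - precision + 1)     # unit in the last place
    scaled = x / ulp
    lower = scaled.numerator // scaled.denominator
    remainder = scaled - lower                   # fraction of an ulp cut off
    if remainder == 0:
        return x, "exact"
    half = Fraction(1, 2)
    round_up = remainder > half or (remainder == half and lower % 2 == 1)
    rounded = (lower + 1 if round_up else lower) * ulp
    return rounded, "up" if round_up else "down"

third = Fraction(1, 3)
print(round_to_precision(third, 53)[1])    # double precision: down
print(round_to_precision(third, 237)[1])   # octuple precision: down
print(round_to_precision(third, 24)[1])    # single precision (even width): up
```

In the two odd-width cases the discarded part is exactly 1/3 of a unit in the last place, which matches the 0101... tail described above.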

Implementations

Octuple precision is rarely implemented because it is so rarely needed. Apple Inc. had an implementation of addition, subtraction and multiplication of octuple-precision numbers with a 224-bit two's complement significand and a 32-bit exponent.[1] One can use general arbitrary-precision arithmetic libraries to obtain octuple (or higher) precision, but specialized octuple-precision implementations may achieve higher performance.

Hardware support

There is no known hardware implementation of octuple precision.

References

[1] Crandall, Richard E.; Papadopoulos, Jason S. (2002-05-08). "Octuple-precision floating point on Apple G4 (archived copy on web.archive.org)" (PDF). Archived from the original on 2006-07-28. (8 pages)
