.\" Copyright (c) 1985, 1991 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" from: @(#)floor.3 6.5 (Berkeley) 4/19/91
.\" $Id: float.3,v 1.3 2004/12/02 18:29:12 scp Exp $
.\"
.Dd March 28, 2007
.Dt FLOAT 3
.ds up \fIulp\fR
.ds nn \fINaN\fR
.Os
.Sh NAME
.Nm float
.Nd description of floating-point types available on OS X
.Sh DESCRIPTION
This page describes the available C floating-point types. For a list of math library functions
that operate on these types, see the page on the math library, "man math".
.Sh TERMINOLOGY
Floating-point numbers are represented in three parts: a \fBsign\fR, a \fBmantissa\fR (or \fBsignificand\fR),
and an \fBexponent\fR. Given such a representation with sign
.Fa s ,
mantissa
.Fa m ,
and exponent
.Fa e ,
the corresponding numerical value is
.Fa s*m*2**e .
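.Pp
As a minimal sketch of this formula, the hypothetical helper below (fp_value is an illustrative name, not part of any
system header) evaluates s*m*2**e by repeated scaling, taking s as +1 or -1. The standard ldexp() function from
<math.h> performs the same scaling in a single call.
.Bd -literal -offset indent
```c
/* Illustrative helper (hypothetical name): evaluates s*m*2**e,
   where s is +1 or -1. Scaling by 2 or by 0.5 is exact unless
   the result overflows or underflows. */
double fp_value(int s, double m, int e)
{
    double v = s * m;

    for (; e > 0; e--)
        v *= 2.0;
    for (; e < 0; e++)
        v *= 0.5;
    return v;
}
```
.Ed
For example, fp_value(1, 1.5, 3) is 12.0 and fp_value(-1, 1.0, -2) is -0.25.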
.Pp
Floating-point types differ in the number of bits of accuracy in the mantissa (called the \fBprecision\fR),
and in the set of available exponents (the \fBexponent range\fR).
.Pp
Floating-point numbers with the maximum available exponent are reserved operands, denoting an \fBinfinity\fR if the
significand is precisely zero, and a Not-a-Number, or \fBNaN\fR, otherwise.
.Pp
Floating-point numbers with the minimum available exponent are \fBzero\fR if the significand is precisely zero,
and \fBdenormal\fR otherwise. Note that zero is signed: +0 and -0 are distinct floating-point numbers.
.Pp
Floating-point numbers with exponents other than the maximum and minimum available are called \fBnormal\fR numbers.
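.Pp
The categories above can be queried with the C99 fpclassify() macro from <math.h>. The sketch below assumes a C99
compiler; the function name fp_class_name is hypothetical and chosen only for illustration. C99 calls denormal
numbers "subnormal", hence the FP_SUBNORMAL constant.
.Bd -literal -offset indent
```c
#include <math.h>

/* Maps a double to the category names used on this page.
   fp_class_name is an illustrative name, not a system function. */
const char *fp_class_name(double x)
{
    switch (fpclassify(x)) {
    case FP_INFINITE:
        return "infinity";
    case FP_NAN:
        return "NaN";
    case FP_ZERO:
        return "zero";
    case FP_SUBNORMAL:
        return "denormal";
    default:
        return "normal";
    }
}
```
.Ed
Both +0 and -0 fall in the zero category, since the classification ignores the sign.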
.Sh PROPERTIES OF IEEE-754 FLOATING-POINT
Basic arithmetic operations in IEEE-754 floating-point are \fBcorrectly rounded\fR: this means that the result delivered
is the same as the result that would be achieved by computing the exact real-number operation on the operands,
then rounding the real-number result to a floating-point value.
.Pp
\fBOverflow\fR occurs when the value of the exact result is too large in magnitude to be represented in the
floating-point type in which the computation is being performed; doing so would require an exponent outside of the
exponent range of the type. By default, computations that result in overflow return a signed infinity.
.Pp
\fBUnderflow\fR occurs when the value of the exact result is too small in magnitude to be represented as a normal
number in the floating-point type in which the computation is being performed. By default, underflow is gradual,
and produces a denormal number or a zero.
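.Pp
The default overflow and underflow results can be observed with the following sketch, which assumes the default
floating-point environment (no flush-to-zero mode). The function names are hypothetical; the volatile qualifiers are
there only to keep the compiler from evaluating the expressions in a wider format.
.Bd -literal -offset indent
```c
#include <float.h>

/* The exact result 2*FLT_MAX is too large for a float, so the
   operation overflows and returns +infinity by default. */
float overflow_result(void)
{
    volatile float big = FLT_MAX;
    volatile float r = big * 2.0f;

    return r;
}

/* The exact result FLT_MIN/4 lies below the normal range, so the
   operation returns a denormal number rather than zero. */
float underflow_result(void)
{
    volatile float tiny = FLT_MIN;
    volatile float r = tiny / 4.0f;

    return r;
}
```
.Ed
Here overflow_result() compares greater than FLT_MAX, and underflow_result() is positive but smaller than FLT_MIN.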
.Pp
All floating-point numbers of a given type are integer multiples of the smallest non-zero floating-point number of
that type; however, the converse is not true: not every such multiple is representable. Because underflow is gradual,
this means that, in the default mode, (x-y) = 0 only if x = y.
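.Pp
A small illustration of this property, assuming the default mode with gradual underflow: the two operands below are
distinct normal doubles whose difference is smaller than DBL_MIN, yet the difference is a non-zero denormal. The
helper name difference is hypothetical.
.Bd -literal -offset indent
```c
#include <float.h>

/* Returns x - y, stored through a volatile so that the result is
   rounded to double before it is returned. */
double difference(double x, double y)
{
    volatile double d = x - y;

    return d;
}
```
.Ed
With x = 1.5 * DBL_MIN and y = DBL_MIN, difference(x, y) is exactly 0.5 * DBL_MIN, a denormal number, and not zero.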
.Pp
The sign of zero transforms correctly through multiplication and division, and is preserved by addition of
zeros with like signs, but, in the default rounding mode, x - x yields +0 for every finite floating-point number x.
The only operations that reveal the sign of a zero are x/(\(+-0) and copysign(x,\(+-0). In particular, comparisons
(x > y, x != y, etc.) are not affected by the sign of zero.
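.Pp
The following sketch shows the behavior of signed zeros under the default rounding mode. All three function names are
hypothetical. Division by zero raises the divide-by-zero flag but, by default, does not stop the program.
.Bd -literal -offset indent
```c
/* Comparisons do not distinguish +0 from -0. */
int zeros_compare_equal(void)
{
    volatile double pz = 0.0, nz = -0.0;

    return pz == nz;
}

/* Dividing by a zero reveals its sign: the result is +infinity
   for +0 and -infinity for -0. */
double reciprocal(double z)
{
    volatile double d = z;

    return 1.0 / d;
}

/* x - x is +0 for every finite x in the default rounding mode. */
double self_difference(double x)
{
    volatile double v = x;

    return v - v;
}
```
.Ed
For instance, reciprocal(-0.0) is negative, while reciprocal(self_difference(-3.5)) is positive.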
.Pp
The sign of infinity transforms correctly through multiplication and division, and infinities are unaffected by addition
or subtraction of any finite floating-point number. But Inf-Inf, Inf*0, and Inf/Inf are, like 0/0 or sqrt(-3), invalid
operations that produce NaN.
.Pp
NaNs are the default results of invalid operations, and they propagate through subsequent arithmetic operations.
If x is a NaN, then x != x is TRUE, and every other comparison predicate (x > y, x == y, x <= y, etc.) evaluates to
FALSE, regardless of the value of y. Additionally, predicates that entail an ordered comparison (rather than mere
equality or inequality) signal Invalid Operation when one of the arguments is NaN.
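.Pp
These rules can be checked with a short sketch that uses the C99 INFINITY and NAN macros and the isnan() macro from
<math.h>. The function names are hypothetical, and the code assumes it is not compiled with options that relax
IEEE-754 semantics (for example -ffast-math).
.Bd -literal -offset indent
```c
#include <math.h>

/* Inf - Inf is an invalid operation and produces NaN. */
int inf_minus_inf_is_nan(void)
{
    volatile double inf = INFINITY;
    volatile double r = inf - inf;

    return isnan(r);
}

/* Infinity is unaffected by adding or subtracting a finite f. */
int inf_absorbs_finite(double f)
{
    volatile double inf = INFINITY;

    return inf + f == inf && inf - f == inf;
}

/* For a NaN x, only x != x is true; every other predicate is
   false, whatever the value of y. */
int nan_is_unordered(double y)
{
    volatile double x = NAN;

    return (x != x) && !(x == y) && !(x < y) && !(x > y)
        && !(x <= y) && !(x >= y);
}
```
.Ed
The test x != x is therefore a portable way to detect a NaN, although isnan() states the intent more clearly.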
.Pp
IEEE-754 provides five kinds of floating-point \fBexceptions\fR, listed below:
.Pp
.nf
.ta \w'Invalid Operation'u+6n +\w'Gradual Underflow'u+2n
Exception	Default Result
.tc \(ru
		
.tc
Invalid Operation	NaN or FALSE
Overflow	\(+-Infinity
Divide by Zero	\(+-Infinity
Underflow	Gradual Underflow
Inexact	Rounded Value
.ta
.fi
.Pp
NOTE: An exception is not an error unless it is handled incorrectly. What makes a class of exceptions exceptional
is that no single default response can be satisfactory in every instance. On the other hand, because a default
response will serve most instances of the exception satisfactorily, simply aborting the computation cannot be
justified.
.Pp
For each kind of floating-point exception, IEEE-754 provides a flag that is raised each time its exception is
signaled, and remains raised until the program resets it. Programs may test, save, and restore the flags, or a subset
thereof.
.Sh PRECISION AND EXPONENT RANGE OF SPECIFIC FLOATING-POINT TYPES
On both Intel and PPC Macs, the type
.Fa float
corresponds to IEEE-754 single precision. A single-precision number is represented in 32 bits, and has a precision
of 24 significant bits, roughly equivalent to 7 significant decimal digits. Eight bits are used to encode the exponent,
which gives an exponent range from -126 to 127, inclusive.
.Pp
The header <float.h> defines several useful constants for the float type:
.br
.Fa FLT_MANT_DIG
- the number of binary digits in the significand of a float.
.br
.Fa FLT_MIN_EXP
- one more than the smallest exponent available in the float type.
.br
.Fa FLT_MAX_EXP
- one more than the largest exponent available in the float type.
.br
.Fa FLT_DIG
- the precision in decimal digits of a float. A decimal value with this many digits, stored as a float, always
yields the same value up to this many digits when converted back to decimal notation.
.br
.Fa FLT_MIN_10_EXP
- the smallest n such that 10**n is a non-zero normal number as a float.
.br
.Fa FLT_MAX_10_EXP
- the largest n such that 10**n is finite as a float.
.br
.Fa FLT_MIN
- the smallest positive normal float.
.br
.Fa FLT_MAX
- the largest finite float.
.br
.Fa FLT_EPSILON
- the difference between 1.0 and the smallest float bigger than 1.0.
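.Pp
As a quick check of these constants, the sketch below compares them with the values implied by the description of
IEEE-754 single precision above, and verifies the defining property of FLT_EPSILON. The function names are
hypothetical, and the expected values assume that float is IEEE-754 single precision, as it is on the platforms
described here.
.Bd -literal -offset indent
```c
#include <float.h>

/* 24 significant bits; exponents -126..127, so the "one more
   than" constants are -125 and 128. */
int float_matches_ieee_single(void)
{
    return FLT_RADIX == 2 && FLT_MANT_DIG == 24
        && FLT_MIN_EXP == -125 && FLT_MAX_EXP == 128;
}

/* 1 + FLT_EPSILON is the next float above 1.0; adding only half
   of FLT_EPSILON rounds back to 1.0 in the default mode. */
int epsilon_is_gap_above_one(void)
{
    volatile float next = 1.0f + FLT_EPSILON;
    volatile float same = 1.0f + FLT_EPSILON / 2.0f;

    return next > 1.0f && same == 1.0f;
}
```
.Ed
Both functions return 1 when float has the single-precision format described above.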
.Pp
On both Intel and PPC Macs, the type
.Fa double
corresponds to IEEE-754 double precision. A double-precision number is represented in 64 bits, and has a precision
of 53 significant bits, roughly equivalent to 16 significant decimal digits. Eleven bits are used to encode the exponent,
which gives an exponent range from -1022 to 1023, inclusive.
.Pp
The header <float.h> defines several useful constants for the double type:
.br
.Fa DBL_MANT_DIG
- the number of binary digits in the significand of a double.
.br
.Fa DBL_MIN_EXP
- one more than the smallest exponent available in the double type.
.br
.Fa DBL_MAX_EXP
- one more than the largest exponent available in the double type.
.br
.Fa DBL_DIG
- the precision in decimal digits of a double. A decimal value with this many digits, stored as a double, always
yields the same value up to this many digits when converted back to decimal notation.
.br
.Fa DBL_MIN_10_EXP
- the smallest n such that 10**n is a non-zero normal number as a double.
.br
.Fa DBL_MAX_10_EXP
- the largest n such that 10**n is finite as a double.
.br
.Fa DBL_MIN
- the smallest positive normal double.
.br
.Fa DBL_MAX
- the largest finite double.
.br
.Fa DBL_EPSILON
- the difference between 1.0 and the smallest double bigger than 1.0.
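.Pp
The round-trip guarantee behind DBL_DIG can be illustrated with the sketch below, which converts a decimal string to a
double with strtod() and prints it back with DBL_DIG significant digits. The function name is hypothetical, and the
string comparison only works for inputs written the way the %g conversion prints them (no trailing zeros, no exponent
for moderate magnitudes).
.Bd -literal -offset indent
```c
#include <float.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the decimal string is reproduced exactly after
   conversion to double and back at DBL_DIG significant digits. */
int survives_round_trip(const char *decimal)
{
    char buf[64];
    double d = strtod(decimal, NULL);

    snprintf(buf, sizeof buf, "%.*g", DBL_DIG, d);
    return strcmp(buf, decimal) == 0;
}
```
.Ed
With DBL_DIG equal to 15, the 15-digit string "0.123456789012345" survives the round trip, while the 17-digit string
"0.12345678901234567" does not, because it carries more digits than the guarantee covers.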
.Pp
On Intel Macs, the type
.Fa long double
corresponds to IEEE-754 double extended precision. A double extended number is represented in 80 bits, and has a
precision of 64 significant bits, roughly equivalent to 19 significant decimal digits. Fifteen bits are used to encode
the exponent, which gives an exponent range from -16382 to 16383, inclusive.
.Pp
The header <float.h> defines several useful constants for the long double type:
.br
.Fa LDBL_MANT_DIG
- the number of binary digits in the significand of a long double.
.br
.Fa LDBL_MIN_EXP
- one more than the smallest exponent available in the long double type.
.br
.Fa LDBL_MAX_EXP
- one more than the largest exponent available in the long double type.
.br
.Fa LDBL_DIG
- the precision in decimal digits of a long double. A decimal value with this many digits, stored as a long double,
always yields the same value up to this many digits when converted back to decimal notation.
.br
.Fa LDBL_MIN_10_EXP
- the smallest n such that 10**n is a non-zero normal number as a long double.
.br
.Fa LDBL_MAX_10_EXP
- the largest n such that 10**n is finite as a long double.
.br
.Fa LDBL_MIN
- the smallest positive normal long double.
.br
.Fa LDBL_MAX
- the largest finite long double.
.br
.Fa LDBL_EPSILON
- the difference between 1.0 and the smallest long double bigger than 1.0.
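.Pp
Because the long double format differs between the platforms described on this page, portable code may want to check
LDBL_MANT_DIG before relying on a particular precision. The sketch below maps a significand width to the formats
mentioned on this page; the function name and the returned strings are hypothetical.
.Bd -literal -offset indent
```c
#include <float.h>

/* Names the long double format implied by a significand width,
   typically LDBL_MANT_DIG. Only the formats discussed on this
   page are recognized. */
const char *long_double_format(int mant_dig)
{
    switch (mant_dig) {
    case 53:
        return "same as double";
    case 64:
        return "x87 double extended";
    case 106:
        return "128-bit head and tail pair";
    default:
        return "other";
    }
}
```
.Ed
A typical call is long_double_format(LDBL_MANT_DIG), which selects the second case on Intel Macs.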
.Sh LONG DOUBLE ON POWERPC MACS
On PowerPC Macs, by default the type
.Fa long double
is mapped to IEEE-754 double precision, described above. If additional precision is required, a non-IEEE-754 128-bit
long double format is also available. To use this format, compile with the
.Fa -mlong-double-128
flag. If you wish to use the <math.h> functions, you must include the linker flag
.Fa -lmx
as well as the usual
.Fa -lm .
The -mlong-double-128 flag is only valid when the target architecture is ppc or ppc64.
.Pp
Each 128-bit long double is made up of two IEEE doubles (head and tail). The value of the long double is the sum of the
values of the two parts (unless the head double has value -0.0, in which case the value of the long double is -0.0).
The value of the head is required to be the value of the long double rounded to the nearest double. If the head is an
infinity, the value of the long double is the value of the head, and the tail must be \(+-0.0. The tail of a NaN can be
any double value. There are many 128-bit patterns that are not valid as long doubles. These do not represent any
value.
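.Pp
To make the head and tail relationship concrete, the sketch below builds such a pair from the exact sum of two doubles
using the classic error-free addition known as TwoSum. It illustrates the representation only and is not the system's
implementation; the type dd_pair and the function dd_from_sum are hypothetical names, and overflow and NaN inputs are
not handled.
.Bd -literal -offset indent
```c
/* Hypothetical head and tail pair; the value is head + tail. */
typedef struct {
    double head;   /* the sum rounded to the nearest double */
    double tail;   /* the rounding error of that sum */
} dd_pair;

/* TwoSum: head is a + b rounded to double, and tail is the exact
   difference between the true sum and head. */
dd_pair dd_from_sum(double a, double b)
{
    dd_pair r;
    volatile double s = a + b;
    volatile double bb = s - a;
    volatile double err = (a - (s - bb)) + (b - bb);

    r.head = s;
    r.tail = err;
    return r;
}
```
.Ed
For example, dd_from_sum(1.0, 1e-30) has head 1.0 and tail 1e-30: together the two doubles hold a sum that a single
double cannot.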
.Pp
The 128-bit long double format provides 106 significant bits, roughly equivalent to 31 significant decimal digits. It
has the same exponent range as double, from -1022 to 1023, inclusive. The usual constants are provided from <float.h>,
as described above.
.Pp
In the 128-bit long double format, addition and subtraction have a relative error bound of one \fBulp\fR, or 2**-106.
Multiplication has a relative error within 2 ulps, and division a relative error within 3 ulps.
.Sh SEE ALSO
.Xr math 3 ,
.Xr complex 3
.Sh STANDARDS
Floating-point arithmetic conforms to the ISO/IEC 9899:1999(E) standard.