.\" Copyright (c) 1985, 1991 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"      This product includes software developed by the University of
.\"      California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" from: @(#)floor.3 6.5 (Berkeley) 4/19/91
.\" $Id: float.3,v 1.3 2004/12/02 18:29:12 scp Exp $
.\"
.Dd March 28, 2007
.Dt FLOAT 3
.ds up \fIulp\fR
.ds nn \fINaN\fR
.Os
.Sh NAME
.Nm float
.Nd description of floating-point types available on OS X
.Sh DESCRIPTION
This page describes the available C floating-point types. For a list of math library functions
that operate on these types, see the page on the math library, "man math".
.Sh TERMINOLOGY
Floating-point numbers are represented in three parts: a \fBsign\fR, a \fBmantissa\fR (or \fBsignificand\fR),
and an \fBexponent\fR. Given such a representation with sign
.Fa s ,
mantissa
.Fa m ,
and exponent
.Fa e ,
the corresponding numerical value is
.Fa s*m*2**e .
.Pp
Floating-point types differ in the number of bits of accuracy in the mantissa (called the \fBprecision\fR),
and in the set of available exponents (the \fBexponent range\fR).
.Pp
Floating-point numbers with the maximum available exponent are reserved operands, denoting an \fBinfinity\fR if the
significand is precisely zero, and a Not-a-Number, or \fBNaN\fR, otherwise.
.Pp
Floating-point numbers with the minimum available exponent are \fBzero\fR if the significand is precisely zero,
and \fBdenormal\fR otherwise. Note that zero is signed: +0 and -0 are distinct floating-point numbers.
.Pp
Floating-point numbers with exponents other than the maximum and minimum available are called \fBnormal\fR numbers.
.Sh PROPERTIES OF IEEE-754 FLOATING-POINT
Basic arithmetic operations in IEEE-754 floating-point are \fBcorrectly rounded\fR: this means that the result delivered
is the same as the result that would be achieved by computing the exact real-number operation on the operands,
then rounding the real-number result to a floating-point value.
.Pp
\fBOverflow\fR occurs when the value of the exact result is too large in magnitude to be represented in the
floating-point type in which the computation is being performed; doing so would require an exponent outside of the
exponent range of the type. By default, computations that result in overflow return a signed infinity.
.Pp
\fBUnderflow\fR occurs when the value of the exact result is too small in magnitude to be represented as a normal
number in the floating-point type in which the computation is being performed. By default, underflow is gradual,
and produces a denormal number or a zero.
.Pp
All floating-point numbers of a given type are integer multiples of the smallest non-zero floating-point number of
that type; however, the converse is not true. This means that, in the default mode, (x-y) = 0 only if x = y.
.Pp
The sign of zero transforms correctly through multiplication and division, and is preserved by addition of
zeros with like signs, but x - x yields +0 for every finite floating-point number x in the default rounding mode.
The only operations that reveal the sign of a zero are x/(\(+-0) and copysign(x,\(+-0). In particular,
comparisons (x > y, x != y, etc) are not affected by the sign of zero.
.Pp
The sign of infinity transforms correctly through multiplication and division, and infinities are unaffected by addition
or subtraction of any finite floating-point number. But Inf-Inf, Inf*0, and Inf/Inf are, like 0/0 or sqrt(-3), invalid
operations that produce NaN.
.Pp
NaNs are the default results of invalid operations, and they propagate through subsequent arithmetic operations.
If x is a NaN, then x != x is TRUE, and every other comparison predicate (x > y, x == y, x <= y, etc) evaluates to
FALSE, regardless of the value of y. Additionally, predicates that entail an ordered comparison (rather than mere
equality or inequality) signal Invalid Operation when one of the arguments is NaN.
.Pp
IEEE-754 provides five kinds of floating-point \fBexceptions\fR, listed below:
.Pp
.nf
.ta \w'Invalid Operation'u+6n +\w'Gradual Underflow'u+2n
Exception	Default Result
.tc \(ru
		
.tc
Invalid Operation	NaN or FALSE
Overflow	\(+-Infinity
Divide by Zero	\(+-Infinity
Underflow	Gradual Underflow
Inexact	Rounded Value
.ta
.fi
.Pp
NOTE: An exception is not an error unless it is handled incorrectly. What makes a class of exceptions exceptional
is that no single default response can be satisfactory in every instance. On the other hand, because a default
response will serve most instances of the exception satisfactorily, simply aborting the computation cannot be
justified.
.Pp
For each kind of floating-point exception, IEEE-754 provides a flag that is raised each time its exception is
signaled, and remains raised until the program resets it. Programs may test, save, and restore the flags, or a subset
thereof.
.Sh PRECISION AND EXPONENT RANGE OF SPECIFIC FLOATING-POINT TYPES
On both Intel and PPC Macs, the type
.Fa float
corresponds to IEEE-754 single precision. A single-precision number is represented in 32 bits, and has a precision
of 24 significant bits, roughly like 7 significant decimal digits. 8 bits are used to encode the exponent, which gives
an exponent range from -126 to 127, inclusive.
.Pp
The header <float.h> defines several useful constants for the float type:
.br
.Fa FLT_MANT_DIG
- The number of binary digits in the significand of a float.
.br
.Fa FLT_MIN_EXP
- One more than the smallest exponent available in the float type.
.br
.Fa FLT_MAX_EXP
- One more than the largest exponent available in the float type.
.br
.Fa FLT_DIG
- the precision in decimal digits of a float. A decimal value with this many digits, stored as a float, always
yields the same value up to this many digits when converted back to decimal notation.
.br
.Fa FLT_MIN_10_EXP
- the smallest n such that 10**n is a non-zero normal number as a float.
.br
.Fa FLT_MAX_10_EXP
- the largest n such that 10**n is finite as a float.
.br
.Fa FLT_MIN
- the smallest positive normal float.
.br
.Fa FLT_MAX
- the largest finite float.
.br
.Fa FLT_EPSILON
- the difference between 1.0 and the smallest float bigger than 1.0.
.Pp
On both Intel and PPC Macs, the type
.Fa double
corresponds to IEEE-754 double precision. A double-precision number is represented in 64 bits, and has a precision
of 53 significant bits, roughly like 16 significant decimal digits. 11 bits are used to encode the exponent, which gives
an exponent range from -1022 to 1023, inclusive.
.Pp
The header <float.h> defines several useful constants for the double type:
.br
.Fa DBL_MANT_DIG
- The number of binary digits in the significand of a double.
.br
.Fa DBL_MIN_EXP
- One more than the smallest exponent available in the double type.
.br
.Fa DBL_MAX_EXP
- One more than the largest exponent available in the double type.
.br
.Fa DBL_DIG
- the precision in decimal digits of a double. A decimal value with this many digits, stored as a double, always
yields the same value up to this many digits when converted back to decimal notation.
.br
.Fa DBL_MIN_10_EXP
- the smallest n such that 10**n is a non-zero normal number as a double.
.br
.Fa DBL_MAX_10_EXP
- the largest n such that 10**n is finite as a double.
.br
.Fa DBL_MIN
- the smallest positive normal double.
.br
.Fa DBL_MAX
- the largest finite double.
.br
.Fa DBL_EPSILON
- the difference between 1.0 and the smallest double bigger than 1.0.
.Pp
On Intel Macs, the type
.Fa long double
corresponds to IEEE-754 double extended precision.
A double extended number is represented in 80 bits, and has a
precision of 64 significant bits, roughly like 19 significant decimal digits. 15 bits are used to encode the exponent,
which gives an exponent range from -16382 to 16383, inclusive.
.Pp
The header <float.h> defines several useful constants for the long double type:
.br
.Fa LDBL_MANT_DIG
- The number of binary digits in the significand of a long double.
.br
.Fa LDBL_MIN_EXP
- One more than the smallest exponent available in the long double type.
.br
.Fa LDBL_MAX_EXP
- One more than the largest exponent available in the long double type.
.br
.Fa LDBL_DIG
- the precision in decimal digits of a long double. A decimal value with this many digits, stored as a long double,
always yields the same value up to this many digits when converted back to decimal notation.
.br
.Fa LDBL_MIN_10_EXP
- the smallest n such that 10**n is a non-zero normal number as a long double.
.br
.Fa LDBL_MAX_10_EXP
- the largest n such that 10**n is finite as a long double.
.br
.Fa LDBL_MIN
- the smallest positive normal long double.
.br
.Fa LDBL_MAX
- the largest finite long double.
.br
.Fa LDBL_EPSILON
- the difference between 1.0 and the smallest long double bigger than 1.0.
.Sh LONG DOUBLE ON POWERPC MACS
On PowerPC Macs, by default the type
.Fa long double
is mapped to IEEE-754 double precision, described above. If additional precision is required, a non-IEEE-754
128-bit long double format is also available. To use this format, compile with the
.Fa -mlong-double-128
flag. If you wish to use the <math.h> functions, you must include the linker flag
.Fa -lmx
as well as the usual
.Fa -lm .
The -mlong-double-128 flag is only valid when the target architecture is ppc or ppc64.
.Pp
Each 128-bit long double is made up of two IEEE doubles (head and tail).
The value of the long double is the sum of the
values of the two parts (unless the head double has value -0.0, in which case the value of the long double is -0.0).
The value of the head is required to be the value of the long double rounded to the nearest double. If the head is an
infinity, the value of the long double is the value of the head, and the tail must be \(+-0.0. The tail of a NaN can be
any double value. There are many 128-bit bit patterns that are not valid as long doubles. These do not represent any
value.
.Pp
The 128-bit long double format provides 106 significant bits, which is roughly like 31 significant decimal digits. It
has the same exponent range as double, from -1022 to 1023, inclusive. The usual constants are provided from <float.h>,
as described above.
.Pp
In the 128-bit long double format, addition and subtraction have a relative error bound of one \fBulp\fR, or 2**-106.
Multiplication has a relative error within 2 ulps, and division a relative error within 3 ulps.
.Sh SEE ALSO
.Xr math 3 ,
.Xr complex 3
.Sh STANDARDS
Floating-point arithmetic conforms to the ISO/IEC 9899:1999(E) standard.