
Computer Arithmetics and Round-off Methods¶

In the ideal mathematical world, operations like $1+2=3$, $4\times 3 = 12$, and $(\sqrt{2})^2 = 2$ are unambiguously defined. When numbers are represented in a computer, however, this is no longer true. The main reason is so-called finite arithmetic, which is the way a computer performs basic operations. Some features of finite arithmetic are stated below:

Only a finite subset of the integers and rational numbers can be represented exactly.
The set of elements on which arithmetic is performed is necessarily finite.
The result of any arithmetic operation between two or more numbers of this set must be another element of the set.
Non-representable numbers, such as irrational numbers, are approximated by the closest rational number within the defined set.
Extremely large numbers produce overflows, and extremely small numbers produce underflows, which are taken as zero.
Operations on non-representable numbers are not exact.
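
The features above can be observed directly in Python, whose built-in `float` is a double-precision machine number on common platforms. The following is a minimal sketch; the variable names are illustrative and not part of the original notebook.

```python
import math

# (sqrt(2))^2 is not exactly 2, because sqrt(2) is irrational and must be rounded
squared = math.sqrt(2) ** 2
print(squared == 2.0)            # False

# 0.1, 0.2 and 0.3 have no exact binary representation
print(0.1 + 0.2 == 0.3)          # False

# A product that is too large overflows to infinity
overflowed = 1e308 * 10
print(overflowed)                # inf

# A quotient that is too small underflows to zero
underflowed = 1e-308 / 1e100
print(underflowed)               # 0.0
```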

In spite of this, if the set of elements on which the computer operates is defined adequately, round-off errors can usually be neglected, yielding correct results within reasonable error margins. In some pathological cases, such as when a massive number of iterations is required, these errors must be taken into account more seriously.

Binary machine numbers
- Single-precision numbers
- Double-precision numbers
Finite Arithmetic
- Addition
- Multiplication

In [2]:

import numpy as np

Binary machine numbers¶

Modern computation is based on binary numbers. The binary or base-2 numeral system is the simplest of the existing numeral bases. Because electronic devices are built from logic circuits (circuits operating with logic gates), implementing a binary base is straightforward. Moreover, any other numeral system can be reduced to a binary representation.

According to the IEEE 754-2008 standard, real numbers can be represented in several formats; single precision and double precision are the most widely used.
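
Because machine numbers are stored in base 2, a decimal fraction is stored exactly only when it is a finite sum of powers of two. The sketch below uses the standard `fractions` module to inspect the exact value held by a Python `float` (a double-precision number on common platforms); the variable names are illustrative.

```python
from fractions import Fraction

# Fraction(x) returns the exact rational value stored for the float x
stored_tenth = Fraction(0.1)
print(stored_tenth == Fraction(1, 10))   # False: 0.1 is rounded in base 2
print((0.1).hex())                       # 0x1.999999999999ap-4

# 0.5 = 2^-1 is a power of two, so it is stored exactly
stored_half = Fraction(0.5)
print(stored_half == Fraction(1, 2))     # True
```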

Single-precision numbers¶

Single-precision numbers are used when very accurate results are not needed and/or memory must be saved. These numbers are represented by a binary number 32 bits (BInary digiTs) long, in which the real number is stored according to the following rules:

The first bit (called s) indicates the sign of the number (s=0 means a positive number, s=1 a negative one).
The next 8 bits represent the exponent of the number.
The last 23 bits represent the fractional part of the number.

The formula for recovering the real number is then given by:

$$r = (-1)^s\times \left( 1 + \sum_{i=1}^{23}b_{23-i}2^{-i} \right)\times 2^{e-127}$$

where $s$ is the sign bit, $b_{23-i}$ are the fraction bits (with the bits numbered from $b_0$, the rightmost, to $b_{31}$, the leftmost) and $e$ is given by:

$$e = \sum_{i=0}^7 b_{23+i}2^i$$

The following short routine calculates the value represented by a 32-bit binary string:

In [2]:

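def number32(binary):
    """Return the value of a 32-character string of 0s and 1s (sign, exponent, fraction)."""
    # Reverse the string so that binary[k] is bit b_k (b_0 is the last fraction bit)
    binary = binary[::-1]
    # Fractional part: 1 + sum of b_{23-i} * 2^{-i}
    dec = 1
    for i in range(1, 24):
        dec += int(binary[23-i]) * 2**-i
    # Exponent part: e = sum of b_{23+i} * 2^i
    exp = 0
    for i in range(0, 8):
        exp += int(binary[23+i]) * 2**i
    # Full number: sign * 2^(e - 127) * fractional part
    number = (-1)**int(binary[31]) * 2**(exp-127) * dec
    return number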

In [3]:

number32( "00111110001000000000000000000000" )

Out[3]:

0.15625
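
As a cross-check, the conversion can be run in the opposite direction. The sketch below relies on the standard `struct` module to obtain the single-precision bit pattern of a value; the helper name `float32_to_bits` is hypothetical and not part of the original notebook.

```python
import struct

def float32_to_bits(value):
    """Return the 32-bit IEEE 754 pattern of value as a string of 0s and 1s."""
    # Pack as a big-endian single, then reinterpret the 4 bytes as an unsigned integer
    (as_int,) = struct.unpack(">I", struct.pack(">f", value))
    return format(as_int, "032b")

bits = float32_to_bits(0.15625)
# Sign bit, exponent field and fraction field
print(bits[0], bits[1:9], bits[9:])   # 0 01111100 01000000000000000000000
```

The exponent field `01111100` is 124, so the number is $(1 + 2^{-2})\times 2^{124-127} = 0.15625$.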

The single-precision system can represent real numbers with magnitudes between roughly $10^{-38}$ and $10^{38}$, with $7$ to $8$ significant decimal digits.

In [3]:

# Decimal digits
print("\n")
print("Decimal digits contributions for single precision number")
print(2**-23., 2**-15., 2**-5., "\n")

# Largest and smallest exponent
suma = 0
for i in range(0, 8):
    suma += 2**i
print("Largest and smallest exponent for single precision number")
print(2**(suma-127.), 2**(-127.), "\n")


Decimal digits contributions for single precision number
1.1920928955078125e-07 3.0517578125e-05 0.03125 

Largest and smallest exponent for single precision number
3.402823669209385e+38 5.877471754111438e-39
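
The two powers of two printed above bound the single-precision range. IEEE 754 reserves the exponent fields $e=0$ and $e=255$ for special values (zero, subnormal numbers, infinities and NaN), so the largest and smallest normalized numbers differ slightly from these bounds. A small sketch, assuming the standard `struct` module and a hypothetical helper `bits_to_float32`:

```python
import struct

def bits_to_float32(bits):
    """Interpret a string of 32 '0'/'1' characters as an IEEE 754 single."""
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]

# Largest normalized number: exponent field 254, all fraction bits set
largest = bits_to_float32("0" + "11111110" + "1" * 23)
# Smallest positive normalized number: exponent field 1, fraction zero
smallest = bits_to_float32("0" + "00000001" + "0" * 23)

print(largest)    # about 3.4028e+38, equal to (2 - 2**-23) * 2**127
print(smallest)   # about 1.1755e-38, equal to 2**-126
```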

Double-precision numbers¶

Double-precision numbers are used when high accuracy is required. These numbers are represented by a binary number 64 bits long, in which the real number is stored according to the following rules:

The first bit (called s) indicates the sign of the number (s=0 means a positive number, s=1 a negative one).
The next 11 bits represent the exponent of the number.
The last 52 bits represent the fractional part of the number.

The formula for recovering the real number is then given by:

$$r = (-1)^s\times \left( 1 + \sum_{i=1}^{52}b_{52-i}2^{-i} \right)\times 2^{e-1023}$$

where $s$ is the sign bit, $b_{52-i}$ are the fraction bits (with the bits numbered from $b_0$, the rightmost, to $b_{63}$, the leftmost) and $e$ is given by:

$$e = \sum_{i=0}^{10} b_{52+i}2^i$$

The double-precision system can represent real numbers with magnitudes between roughly $10^{-308}$ and $10^{308}$, with $16$ to $17$ significant decimal digits.
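
On common platforms the Python `float` type is an IEEE 754 double, so these limits can be inspected through the standard `sys.float_info` structure. This is only a sketch for checking the figures quoted above:

```python
import sys

info = sys.float_info
print(info.mant_dig)   # 53 significant bits: 52 stored plus the implicit leading 1
print(info.max)        # 1.7976931348623157e+308
print(info.min)        # 2.2250738585072014e-308 (smallest normalized number)
print(info.epsilon)    # 2.220446049250313e-16, i.e. 2**-52
```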

ACTIVITY¶

1. Write a Python script that calculates the double-precision number represented by a 64-bit binary string.

2. What is the number represented by:

0 10000000011 1011100100001111111111111111111111111111111111111111

**ANSWER:** $27.56640625 - 2^{-48} \approx 27.56640625$
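
One possible way to check a hand-written decoder is to let the standard `struct` module interpret the same 64 bits. The sketch below is not a substitute for the formula-based script requested in the activity; the helper name `bits_to_float64` is hypothetical.

```python
import struct

def bits_to_float64(bits):
    """Interpret a string of 64 '0'/'1' characters as an IEEE 754 double."""
    return struct.unpack(">d", int(bits, 2).to_bytes(8, "big"))[0]

# Sign, exponent and fraction fields of the pattern in the activity
# (the fraction is 101110010000 followed by 40 ones)
pattern = "0" + "10000000011" + "101110010000" + "1" * 40
value = bits_to_float64(pattern)

print(value)
print(value == 27.56640625 - 2.0**-48)   # True
```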

Finite Arithmetic¶

The most basic arithmetic operations are addition and multiplication. Further operations such as subtraction, division and exponentiation are secondary, as they can be obtained by iteratively applying the former.

Addition¶

As mentioned before, arithmetic operations are not exact in a computer because of the inherent limitations of number representation. Even when adding two numbers that are already approximate, say a pair of single-precision numbers, the result may not be a representable number, so approximation rules must be applied.

In [4]:

N = 9
x = 0.0
for i in range(N):
    # Each term is rounded to half precision; the sum is accumulated in double precision
    x += float(np.float16(1.0/N))
print(x)

0.999755859375

Note that the successive addition of rounded-off numbers produces a less precise final result.
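
The same effect appears in double precision, only at a smaller scale. The sketch below adds ten copies of 0.1 with a plain loop and compares the result with `math.fsum` from the standard library, which keeps track of the low-order bits lost in each addition. It is an illustration of accumulated round-off, not a method used in the original notebook.

```python
import math

terms = [0.1] * 10

# Plain accumulation: the running sum is rounded after every addition
naive = 0.0
for t in terms:
    naive += t

# math.fsum compensates for the intermediate rounding errors
compensated = math.fsum(terms)

print(naive)         # 0.9999999999999999
print(compensated)   # 1.0
```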

In [4]:

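print("5/7", np.float32(5/7.))
print("1/3", np.float32(1/3.))
print(np.float32(5/7.+1/3.), 22/21.)
# Convert to a Python float so that the difference is evaluated in double precision
print("Error:", float(np.float32(5/7.+1/3.)) - 22/21.)

5/7 0.71428573
1/3 0.33333334
1.0476191 1.0476190476190477
Error: 5.676632830464712e-08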

Although float16, or half precision, is part of the IEEE 754-2008 standard, many devices do not support it well.

Multiplication¶

Multiplication follows the same round-off rules as addition. Be aware, however, that multiplicative errors propagate more quickly than additive errors.

In [2]:

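N = 20
x = 1.0
for i in range(N):
    # Each factor is rounded to half precision; the product is accumulated in double precision
    x *= float(np.float16(2.0**(1.0/N)))
# Show 12 significant digits of the product
print(f"{x:.12g}", np.float16(5/7.))

1.99580530418 0.7144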

The final result has an error in the third decimal digit, one digit earlier than in the case of addition.
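
A first-order estimate helps explain the size of this error. Assume each factor $a$ is stored as $\tilde{a} = a(1+\delta)$ with a small relative error $\delta$, and that the product itself is accumulated without further rounding. Then

$$\tilde{a}^N = a^N (1+\delta)^N \approx a^N (1 + N\delta)$$

so the relative error of the product grows roughly linearly with the number of factors. In the example above, $a = 2^{1/20} \approx 1.0352649$ is rounded in half precision to $\tilde{a} = 1.03515625$ (a value computed here, not printed in the notebook), which gives $\delta \approx -1.05\times 10^{-4}$. With $N = 20$,

$$2\,(1 + 20\,\delta) \approx 1.9958$$

in agreement with the printed result.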

ACTIVITY

Find the error associated with the finite representation in the following operations

$$ x-u, \frac{x-u}{w}, (x-u)*v, u+v $$

considering the values

$$ x = \frac{5}{7}, y = \frac{1}{3}, u = 0.71425 $$
$$ v = 0.98765\times 10^5, w = 0.111111\times 10^{-4} $$