Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
— Linus Torvalds
This chapter introduces the basic data types and data structures of Python. Although the Python interpreter itself already brings a rich variety of data structures with it, NumPy and other libraries add to these in a valuable fashion.
The chapter is organized as follows:

Basic data types
The first section introduces basic data types such as int, float, and string.

Basic data structures
The next section introduces the fundamental data structures of Python (e.g., list objects) and illustrates control structures, functional programming paradigms, and anonymous functions.

NumPy data structures
The third section is devoted to the NumPy ndarray class and illustrates some of the benefits of this class for scientific and financial applications. With NumPy's array class, vectorized code is easily implemented, leading to more compact and also better-performing code.
The spirit of this chapter is to provide a general introduction to Python specifics when it comes to data types and structures. If you are equipped with a background from another programming language, say C or Matlab, you should be able to easily grasp the differences that Python usage might bring along. The topics introduced here are all important and fundamental for the chapters to come.
Python is a dynamically typed language, which means that the Python interpreter infers the type of an object at runtime. In comparison, compiled languages like C are generally statically typed. In these cases, the type of an object has to be attached to the object before compile time.^{[18]}

One of the most fundamental data types is the integer, or int:
In [1]: a = 10
        type(a)
Out[1]: int
The built-in function type provides type information for all objects with standard and built-in types, as well as for newly created classes and objects. In the latter case, the information provided depends on the description the programmer has stored with the class. There is a saying that "everything in Python is an object." This means, for example, that even simple objects like the int object we just defined have built-in methods. For example, you can get the number of bits needed to represent the int object in memory by calling the method bit_length:
In [2]: a.bit_length()
Out[2]: 4
You will see that the number of bits needed increases with the integer value that we assign to the object:

In [3]: a = 100000
        a.bit_length()
Out[3]: 17
In general, there are so many different methods that it is hard to memorize all methods of all classes and objects. Advanced Python environments, like IPython, provide tab completion capabilities that show all methods attached to an object. You simply type the object name followed by a dot (e.g., a.) and then press the Tab key. This then provides a collection of methods you can call on the object. Alternatively, the Python built-in function dir gives a complete list of the attributes and methods of any object.

A specialty of Python is that integers can be arbitrarily large. Consider, for example, the googol number 10^{100}. Python has no problem with such large numbers, which are technically long objects:
In [4]: googol = 10 ** 100
        googol
Out[4]: 10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000L

In [5]: googol.bit_length()
Out[5]: 333
Python integers can be arbitrarily large. The interpreter simply uses as many bits/bytes as needed to represent the numbers.
It is important to note that mathematical operations on int objects return int objects. This can sometimes lead to confusion and/or hard-to-detect errors in mathematical routines. The following expression yields the expected result:
In [6]: 1 + 4
Out[6]: 5
However, the next case may return a somewhat surprising result:
In [7]: 1 / 4
Out[7]: 0

In [8]: type(1 / 4)
Out[8]: int
For the last expression to return the generally desired result of 0.25, we must operate on float objects, which brings us naturally to the next basic data type. Adding a dot to an integer value, like in 1. or 1.0, causes Python to interpret the object as a float. Expressions involving a float also return a float object in general:^{[19]}
In [9]: 1. / 4
Out[9]: 0.25

In [10]: type(1. / 4)
Out[10]: float
A float is a bit more involved in that the computerized representation of rational or real numbers is in general not exact and depends on the specific technical approach taken. To illustrate what this implies, let us define another float object:
In [11]: b = 0.35
         type(b)
Out[11]: float
float objects like this one are always represented internally up to a certain degree of accuracy only. This becomes evident when adding 0.1 to b:

In [12]: b + 0.1
Out[12]: 0.44999999999999996
The reason for this is that floats are internally represented in binary format; that is, a decimal number 0 < n < 1 is represented by a series of the form n = x/2 + y/4 + z/8 + … with x, y, z, … ∈ {0, 1}. For certain floating-point numbers the binary representation might involve a large number of elements or might even be an infinite series. However, given a fixed number of bits used to represent such a number (i.e., a fixed number of terms in the representation series), inaccuracies are the consequence. Other numbers can be represented perfectly and are therefore stored exactly even with a finite number of bits available. Consider the following example:
In [13]: c = 0.5
         c.as_integer_ratio()
Out[13]: (1, 2)
One half, i.e., 0.5, is stored exactly because it has an exact (finite) binary representation as 0.5 = 1/2. However, for b = 0.35 we get something different than the expected rational number 0.35 = 7/20:

In [14]: b.as_integer_ratio()
Out[14]: (3152519739159347, 9007199254740992)
The precision is dependent on the number of bits used to represent the number. In general, all platforms that Python runs on use the IEEE 754 double-precision standard (i.e., 64 bits) for internal representation.^{[20]} This translates into a 15-digit relative accuracy.

Since this topic is of high importance for several application areas in finance, it is sometimes necessary to ensure the exact, or at least best possible, representation of numbers. For example, the issue can be of importance when summing over a large set of numbers. In such a situation, a certain kind and/or magnitude of representation error might, in aggregate, lead to significant deviations from a benchmark value.
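A small sketch (not from the text, standard library only) makes this aggregation effect tangible: summing many copies of the inexactly representable 0.1 with plain sum accumulates error, while math.fsum uses error-compensated summation.

```python
# Hypothetical illustration: accumulated representation error when summing.
import math

values = [0.1] * 1000  # 1,000 copies of the inexactly representable 0.1

naive = sum(values)        # plain left-to-right summation
exact = math.fsum(values)  # error-compensated summation

print(naive == 100.0)  # False due to accumulated error
print(exact == 100.0)  # True
```

The deviation is tiny per element but systematic, which is exactly the situation described above when benchmarking aggregated values.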
The module decimal provides an arbitrary-precision object for floating-point numbers and several options to address precision issues when working with such numbers:

In [15]: import decimal
         from decimal import Decimal

In [16]: decimal.getcontext()
Out[16]: Context(prec=28, rounding=ROUND_HALF_EVEN, Emin=-999999999, Emax=999999999, capitals=1, flags=[], traps=[Overflow, InvalidOperation, DivisionByZero])
In [17]: d = Decimal(1) / Decimal(11)
         d
Out[17]: Decimal('0.09090909090909090909090909091')
You can change the precision of the representation by changing the respective attribute value of the Context object:

In [18]: decimal.getcontext().prec = 4  # lower precision than default
In [19]: e = Decimal(1) / Decimal(11)
         e
Out[19]: Decimal('0.09091')
In [20]: decimal.getcontext().prec = 50  # higher precision than default
In [21]: f = Decimal(1) / Decimal(11)
         f
Out[21]: Decimal('0.090909090909090909090909090909090909090909090909091')
If needed, the precision can in this way be adjusted to the exact problem at hand and one can operate with floating-point objects that exhibit different degrees of accuracy:

In [22]: g = d + e + f
         g
Out[22]: Decimal('0.27272818181818181818181818181909090909090909090909')
Now that we can represent natural and floating-point numbers, we turn to text. The basic data type to represent text in Python is the string. The string object has a number of really helpful built-in methods. In fact, Python is generally considered to be a good choice when it comes to working with text files of any kind and any size. A string object is generally defined by single or double quotation marks or by converting another object using the str function (i.e., using the object's standard or user-defined string representation):
In [23]: t = 'this is a string object'
With regard to the built-in methods, you can, for example, capitalize the first word in this object:

In [24]: t.capitalize()
Out[24]: 'This is a string object'
Or you can split it into its single-word components to get a list object of all the words (more on list objects later):

In [25]: t.split()
Out[25]: ['this', 'is', 'a', 'string', 'object']
You can also search for a word and, in the successful case, get back the position (i.e., index value) of the first letter of the word:

In [26]: t.find('string')
Out[26]: 10
If the word is not in the string object, the method returns -1:

In [27]: t.find('Python')
Out[27]: -1
Replacing characters in a string is a typical task that is easily accomplished with the replace method:

In [28]: t.replace(' ', '')
Out[28]: 'thisisastringobject'
The stripping of strings (i.e., deletion of certain leading/lagging characters) is also often necessary:

In [29]: 'http://www.python.org'.strip('htp:/')
Out[29]: 'www.python.org'
Table 4-1 lists a number of helpful methods of the string object.

Table 4-1. Selected string methods

Method       Arguments                  Returns/result
capitalize   ()                         Copy of the string with first letter capitalized
count        (sub[, start[, end]])      Count of the number of occurrences of substring
decode       ([encoding[, errors]])     Decoded version of the string, using encoding
encode       ([encoding[, errors]])     Encoded version of the string
find         (sub[, start[, end]])      (Lowest) index where substring is found
join         (seq)                      Concatenation of strings in sequence seq
replace      (old, new[, count])        Replaces old by new the first count times
split        ([sep[, maxsplit]])        List of words in string with sep as separator
splitlines   ([keepends])               Separated lines with line ends/breaks if keepends is True
strip        (chars)                    Copy of string with leading/lagging characters in chars removed
upper        ()                         Copy with all letters capitalized
A powerful tool when working with string objects is regular expressions. Python provides such functionality in the module re:

In [30]: import re
Suppose you are faced with a large text file, such as a comma-separated value (CSV) file, which contains certain time series and respective date-time information. More often than not, the date-time information is delivered in a format that Python cannot interpret directly. However, the date-time information can generally be described by a regular expression. Consider the following string object, containing three date-time elements, three integers, and three strings. Note that triple quotation marks allow the definition of strings over multiple rows:
In [31]: series = """
         '01/18/2014 13:00:00', 100, '1st';
         '01/18/2014 13:30:00', 110, '2nd';
         '01/18/2014 14:00:00', 120, '3rd'
         """
The following regular expression describes the format of the date-time information provided in the string object:^{[21]}

In [32]: dt = re.compile("'[0-9/:\s]+'")  # datetime
Equipped with this regular expression, we can go on and find all the date-time elements. In general, applying regular expressions to string objects also leads to performance improvements for typical parsing tasks:

In [33]: result = dt.findall(series)
         result
Out[33]: ["'01/18/2014 13:00:00'", "'01/18/2014 13:30:00'", "'01/18/2014 14:00:00'"]
When parsing string objects, consider using regular expressions, which can bring both convenience and performance to such operations.

The resulting string objects can then be parsed to generate Python datetime objects (cf. Appendix C for an overview of handling date and time data with Python). To parse the string objects containing the date-time information, we need to provide information on how to parse them, again as a string object:
In [34]: from datetime import datetime
         pydt = datetime.strptime(result[0].replace("'", ""),
                                  '%m/%d/%Y %H:%M:%S')
         pydt
Out[34]: datetime.datetime(2014, 1, 18, 13, 0)

In [35]: pydt
Out[35]: 2014-01-18 13:00:00

In [36]: type(pydt)
Out[36]: <type 'datetime.datetime'>
Later chapters provide more information on date-time data, the handling of such data, and datetime objects and their methods. This is just meant to be a teaser for this important topic in finance.

As a general rule, data structures are objects that contain a possibly large number of other objects. Among those that Python provides as built-in structures are:

tuple
list
dict
set
A tuple is an advanced data structure, yet it's still quite simple and limited in its applications. It is defined by providing objects in parentheses:

In [37]: t = (1, 2.5, 'data')
         type(t)
Out[37]: tuple
You can even drop the parentheses and provide multiple objects separated by commas:

In [38]: t = 1, 2.5, 'data'
         type(t)
Out[38]: tuple
Like almost all data structures in Python, the tuple has a built-in index with the help of which you can retrieve single or multiple elements of the tuple. It is important to remember that Python uses zero-based numbering, such that the third element of a tuple is at index position 2:

In [39]: t[2]
Out[39]: 'data'
In [40]: type(t[2])
Out[40]: str
In contrast to some other programming languages like Matlab, Python uses zero-based numbering schemes. For example, the first element of a tuple object has index value 0.
There are only two special methods that this object type provides: count and index. The first counts the number of occurrences of a certain object and the second gives the index value of its first appearance:

In [41]: t.count('data')
Out[41]: 1

In [42]: t.index(1)
Out[42]: 0
tuple objects are not very flexible since, once defined, they cannot be changed easily.
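A brief sketch (standard Python semantics, not from the text) shows both the immutability and a convenient consequence of the tuple's fixed layout, namely unpacking:

```python
# Tuples cannot be changed in place, but they unpack nicely.
t = (1, 2.5, 'data')

try:
    t[0] = 100  # item assignment raises TypeError
except TypeError as e:
    print('immutable:', e)

a, b, c = t  # unpacking into separate names works fine
print(a, c)
```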
Objects of type list are much more flexible and powerful in comparison to tuple objects. From a finance point of view, you can achieve a lot working only with list objects, such as storing stock price quotes and appending new data. A list object is defined through brackets, and its basic capabilities and behavior are similar to those of tuple objects:
In [43]: l = [1, 2.5, 'data']
         l[2]
Out[43]: 'data'
list objects can also be defined or converted by using the function list. The following code generates a new list object by converting the tuple object from the previous example:

In [44]: l = list(t)
         l
Out[44]: [1, 2.5, 'data']

In [45]: type(l)
Out[45]: list
In addition to the characteristics of tuple objects, list objects are also expandable and reducible via different methods. In other words, whereas string and tuple objects are immutable sequence objects (with indexes) that cannot be changed once created, list objects are mutable and can be changed via different operations. You can append list objects to an existing list object, and more:

In [46]: l.append([4, 3])  # append list at the end
         l
Out[46]: [1, 2.5, 'data', [4, 3]]
In [47]: l.extend([1.0, 1.5, 2.0])  # append elements of list
         l
Out[47]: [1, 2.5, 'data', [4, 3], 1.0, 1.5, 2.0]
In [48]: l.insert(1, 'insert')  # insert object before index position
         l
Out[48]: [1, 'insert', 2.5, 'data', [4, 3], 1.0, 1.5, 2.0]
In [49]: l.remove('data')  # remove first occurrence of object
         l
Out[49]: [1, 'insert', 2.5, [4, 3], 1.0, 1.5, 2.0]
In [50]: p = l.pop(3)  # removes and returns object at index
         l, p
Out[50]: ([1, 'insert', 2.5, 1.0, 1.5, 2.0], [4, 3])
Slicing is also easily accomplished. Here, slicing refers to an operation that breaks down a data set into smaller parts (of interest):

In [51]: l[2:5]  # 3rd to 5th elements
Out[51]: [2.5, 1.0, 1.5]
Table 4-2 provides a summary of selected operations and methods of the list object.

Table 4-2. Selected operations and methods of list objects

Method        Arguments                   Returns/result
l[i] = x      [i]                         Replaces ith element by x
l[i:j:k] = s  [i:j:k]                     Replaces every kth element from i to j - 1 by elements of s
append        (x)                         Appends x to object
count         (x)                         Number of occurrences of object x
del l[i:j:k]  [i:j:k]                     Deletes elements with index values i to j - 1, step k
extend        (s)                         Appends all elements of s to object
index         (x[, i[, j]])               First index of x between elements i and j - 1
insert        (i, x)                      Inserts x at/before index i
remove        (x)                         Removes first occurrence of element x
pop           (i)                         Removes element with index i and returns it
reverse       ()                          Reverses all items in place
sort          ([cmp[, key[, reverse]]])   Sorts all items in place
Although a topic in itself, control structures like for loops are maybe best introduced in Python based on list objects. This is due to the fact that looping in general takes place over list objects, which is quite different from what is often the standard in other languages. Take the following example. The for loop loops over the elements of the list object l with index values 2 to 4 and prints the square of each element. Note the importance of the indentation (whitespace) in the second line:
In [52]: for element in l[2:5]:
             print element ** 2
Out[52]: 6.25
         1.0
         2.25
This provides a really high degree of flexibility in comparison to the typical counter-based looping. Counter-based looping is also an option with Python, but is accomplished based on the (standard) list object range:

In [53]: r = range(0, 8, 1)  # start, end, step width
         r
Out[53]: [0, 1, 2, 3, 4, 5, 6, 7]
In [54]: type(r)
Out[54]: list
For comparison, the same loop is implemented using range as follows:

In [55]: for i in range(2, 5):
             print l[i] ** 2
Out[55]: 6.25
         1.0
         2.25
In Python you can loop over arbitrary list objects, no matter what the content of the object is. This often avoids the introduction of a counter.

Python also provides the typical (conditional) control elements if, elif, and else. Their use is comparable to that in other languages:
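When the element index is needed alongside the element itself, the built-in enumerate still avoids a manual counter. A minimal sketch (not from the text, shown in Python 3 syntax):

```python
# enumerate yields (index, element) pairs while looping.
l = [1, 'insert', 2.5, 1.0, 1.5, 2.0]

for i, element in enumerate(l[2:5]):
    print(i, element ** 2)
```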
In [56]: for i in range(1, 10):
             if i % 2 == 0:  # % is for modulo
                 print "%d is even" % i
             elif i % 3 == 0:
                 print "%d is multiple of 3" % i
             else:
                 print "%d is odd" % i
Out[56]: 1 is odd
         2 is even
         3 is multiple of 3
         4 is even
         5 is odd
         6 is even
         7 is odd
         8 is even
         9 is multiple of 3
Similarly, while provides another means to control the flow:

In [57]: total = 0
         while total < 100:
             total += 1
         total
Out[57]: 100
A specialty of Python is so-called list comprehensions. Instead of looping over existing list objects, this approach generates list objects via loops in a rather compact fashion:

In [58]: m = [i ** 2 for i in range(5)]
         m
Out[58]: [0, 1, 4, 9, 16]
In a certain sense, this already provides a first means to generate “something like” vectorized code in that loops are rather more implicit than explicit (vectorization of code is discussed in more detail later in this chapter).
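List comprehensions also allow an optional filtering condition, which further condenses loop-plus-if patterns. A minimal sketch (not from the text): squares of the even numbers only.

```python
# the trailing if clause filters which i enter the result list
m = [i ** 2 for i in range(10) if i % 2 == 0]
print(m)  # [0, 4, 16, 36, 64]
```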
Python provides a number of tools for functional programming support as well, i.e., the application of a function to a whole set of inputs (in our case list objects). Among these tools are filter, map, and reduce. However, we need a function definition first. To start with something really simple, consider a function f that returns the square of the input x:

In [59]: def f(x):
             return x ** 2
         f(2)
Out[59]: 4
Of course, functions can be arbitrarily complex, with multiple input/parameter objects and even multiple outputs (return objects). However, consider the following function:

In [60]: def even(x):
             return x % 2 == 0
         even(3)
Out[60]: False
The return object is a Boolean. Such a function can be applied to a whole list object by using map:

In [61]: map(even, range(10))
Out[61]: [True, False, True, False, True, False, True, False, True, False]
To this end, we can also provide a function definition directly as an argument to map, by using lambda or anonymous functions:

In [62]: map(lambda x: x ** 2, range(10))
Out[62]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Functions can also be used to filter a list object. In the following example, the filter returns elements of a list object that match the Boolean condition as defined by the even function:

In [63]: filter(even, range(15))
Out[63]: [0, 2, 4, 6, 8, 10, 12, 14]
Finally, reduce helps when we want to apply a function to all elements of a list object that returns a single value only. An example is the cumulative sum of all elements in a list object (assuming that summation is defined for the objects contained in the list):

In [64]: reduce(lambda x, y: x + y, range(10))
Out[64]: 45
An alternative, non-functional implementation could look like the following:

In [65]: def cumsum(l):
             total = 0
             for elem in l:
                 total += elem
             return total
         cumsum(range(10))
Out[65]: 45
It can be considered good practice to avoid loops on the Python level as far as possible. list comprehensions and functional programming tools like map, filter, and reduce provide means to write code without loops that is both compact and in general more readable. lambda or anonymous functions are also powerful tools in this context.
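For many such tasks the functional tools and comprehensions are interchangeable; a brief sketch comparing the two styles (note: in Python 3, map and filter return iterators, hence the list() calls):

```python
# map/lambda versus an equivalent comprehension
squares_map = list(map(lambda x: x ** 2, range(5)))
squares_comp = [x ** 2 for x in range(5)]
print(squares_map == squares_comp)  # True

# filter versus a comprehension with a condition
evens_filter = list(filter(lambda x: x % 2 == 0, range(10)))
evens_comp = [x for x in range(10) if x % 2 == 0]
print(evens_filter == evens_comp)  # True
```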
dict objects are dictionaries, and also mutable sequences, that allow data retrieval by keys that can, for example, be string objects. They are so-called key-value stores. While list objects are ordered and sortable, dict objects are unordered and unsortable. An example best illustrates further differences from list objects. Curly brackets are what define dict objects:

In [66]: d = {
             'Name' : 'Angela Merkel',
             'Country' : 'Germany',
             'Profession' : 'Chancelor',
             'Age' : 60
             }
         type(d)
Out[66]: dict
In [67]: print d['Name'], d['Age']
Out[67]: Angela Merkel 60
Again, this class of objects has a number of built-in methods:

In [68]: d.keys()
Out[68]: ['Country', 'Age', 'Profession', 'Name']

In [69]: d.values()
Out[69]: ['Germany', 60, 'Chancelor', 'Angela Merkel']

In [70]: d.items()
Out[70]: [('Country', 'Germany'), ('Age', 60), ('Profession', 'Chancelor'), ('Name', 'Angela Merkel')]
In [71]: birthday = True
         if birthday is True:
             d['Age'] += 1
         d['Age']
Out[71]: 61
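One further method worth knowing (standard Python, not shown in the text) is get, which retrieves a value with a fallback default instead of raising a KeyError for a missing key:

```python
# get() with a default avoids KeyError for absent keys
d = {'Name': 'Angela Merkel', 'Country': 'Germany', 'Age': 61}

print(d.get('Country', 'n/a'))  # Germany
print(d.get('Salary', 'n/a'))   # key absent, default returned: n/a
```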
There are several methods to get iterator objects from the dict object. The objects behave like list objects when iterated over:

In [72]: for item in d.iteritems():
             print item
Out[72]: ('Country', 'Germany')
         ('Age', 61)
         ('Profession', 'Chancelor')
         ('Name', 'Angela Merkel')
In [73]: for value in d.itervalues():
             print type(value)
Out[73]: <type 'str'>
         <type 'int'>
         <type 'str'>
         <type 'str'>
Table 4-3 provides a summary of selected operations and methods of the dict object.

Table 4-3. Selected operations and methods of dict objects

Method       Arguments   Returns/result
d[k]         [k]         Item of d with key k
d[k] = x     [k]         Sets item key k to x
del d[k]     [k]         Deletes item with key k
clear        ()          Removes all items
copy         ()          Makes a copy
has_key      (k)         True if k is a key
items        ()          Copy of all key-value pairs
iteritems    ()          Iterator over all items
iterkeys     ()          Iterator over all keys
itervalues   ()          Iterator over all values
keys         ()          Copy of all keys
pop          (k)         Returns and removes item with key k
update       ([e])       Updates items with items from e
values       ()          Copy of all values
The last data structure we will consider is the set object. Although set theory is a cornerstone of mathematics and also finance theory, there are not too many practical applications for set objects. The objects are unordered collections of other objects, containing every element only once:

In [74]: s = set(['u', 'd', 'ud', 'du', 'd', 'du'])
         s
Out[74]: {'d', 'du', 'u', 'ud'}
In [75]: t = set(['d', 'dd', 'uu', 'u'])
With set objects, you can implement operations as you are used to in mathematical set theory. For example, you can generate unions, intersections, and differences:

In [76]: s.union(t)  # all of s and t
Out[76]: {'d', 'dd', 'du', 'u', 'ud', 'uu'}
In [77]: s.intersection(t)  # both in s and t
Out[77]: {'d', 'u'}
In [78]: s.difference(t)  # in s but not t
Out[78]: {'du', 'ud'}
In [79]: t.difference(s)  # in t but not s
Out[79]: {'dd', 'uu'}
In [80]: s.symmetric_difference(t)  # in either one but not both
Out[80]: {'dd', 'du', 'ud', 'uu'}
One application of set objects is to get rid of duplicates in a list object. For example:

In [81]: from random import randint
         l = [randint(0, 10) for i in range(1000)]
         # 1,000 random integers between 0 and 10
         len(l)  # number of elements in l
Out[81]: 1000
In [82]: l[:20]
Out[82]: [8, 3, 4, 9, 1, 7, 5, 5, 6, 7, 4, 4, 7, 1, 8, 5, 0, 7, 1, 9]
In [83]: s = set(l)
         s
Out[83]: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
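Beyond the operations shown, set objects also support membership and subset/superset tests directly; a brief sketch (restating the s and t objects from the set-theory examples so it is self-contained):

```python
# membership and subset/superset tests on sets
s = set(['u', 'd', 'ud', 'du'])
t = set(['d', 'dd', 'uu', 'u'])

print('u' in s)                  # True
print(s.intersection(t) <= s)    # an intersection is always a subset: True
print(s.issuperset(['u', 'd']))  # True
```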
The previous section shows that Python provides some quite useful and flexible general data structures. In particular, list objects can be considered a real workhorse with many convenient characteristics and application areas. However, scientific and financial applications generally have a need for high-performing operations on special data structures. One of the most important data structures in this regard is the array. Arrays generally structure other (fundamental) objects in rows and columns.

Assume for the moment that we work with numbers only, although the concept generalizes to other types of data as well. In the simplest case, a one-dimensional array then represents, mathematically speaking, a vector of, in general, real numbers, internally represented by float objects. It then consists of a single row or column of elements only. In a more common case, an array represents an i × j matrix of elements. This concept generalizes to i × j × k cubes of elements in three dimensions as well as to general n-dimensional arrays of shape i × j × k × l × ….
Mathematical disciplines like linear algebra and vector space theory illustrate that such mathematical structures are of high importance in a number of disciplines and fields. It can therefore prove fruitful to have available a specialized class of data structures explicitly designed to handle arrays conveniently and efficiently. This is where the Python library NumPy comes into play, with its ndarray class.
Before we turn to NumPy, let us first construct arrays with the built-in data structures presented in the previous section. list objects are particularly suited to accomplishing this task. A simple list can already be considered a one-dimensional array:

In [84]: v = [0.5, 0.75, 1.0, 1.5, 2.0]  # vector of numbers
Since list objects can contain arbitrary other objects, they can also contain other list objects. In that way, two- and higher-dimensional arrays are easily constructed by nested list objects:

In [85]: m = [v, v, v]  # matrix of numbers
         m
Out[85]: [[0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0]]
We can also easily select rows via simple indexing or single elements via double indexing (whole columns, however, are not so easy to select):

In [86]: m[1]
Out[86]: [0.5, 0.75, 1.0, 1.5, 2.0]

In [87]: m[1][0]
Out[87]: 0.5
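Selecting a whole column from such a nested list requires an explicit loop or comprehension; a minimal sketch (restating v and m so it stands alone) extracts the second column:

```python
# column selection from a nested list "matrix" needs a comprehension
v = [0.5, 0.75, 1.0, 1.5, 2.0]
m = [v, v, v]

column = [row[1] for row in m]
print(column)  # [0.75, 0.75, 0.75]
```

This asymmetry between rows and columns is one of the inconveniences that the NumPy ndarray class, introduced below, removes.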
Nesting can be pushed further for even more general structures:

In [88]: v1 = [0.5, 1.5]
         v2 = [1, 2]
         m = [v1, v2]
         c = [m, m]  # cube of numbers
         c
Out[88]: [[[0.5, 1.5], [1, 2]], [[0.5, 1.5], [1, 2]]]
In [89]: c[1][1][0]
Out[89]: 1
Note that combining objects in the way just presented generally works with reference pointers to the original objects. What does that mean in practice? Let us have a look at the following operations:

In [90]: v = [0.5, 0.75, 1.0, 1.5, 2.0]
         m = [v, v, v]
         m
Out[90]: [[0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0]]
Now change the value of the first element of the v object and see what happens to the m object:

In [91]: v[0] = 'Python'
         m
Out[91]: [['Python', 0.75, 1.0, 1.5, 2.0],
          ['Python', 0.75, 1.0, 1.5, 2.0],
          ['Python', 0.75, 1.0, 1.5, 2.0]]
This can be avoided by using the deepcopy function of the copy module:

In [92]: from copy import deepcopy
         v = [0.5, 0.75, 1.0, 1.5, 2.0]
         m = 3 * [deepcopy(v), ]
         m
Out[92]: [[0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0]]
In [93]: v[0] = 'Python'
         m
Out[93]: [[0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0]]
Obviously, composing array structures with list objects works, somewhat. But it is not really convenient, and the list class has not been built with this specific goal in mind. It has rather been built with a much broader and more general scope. From this point of view, some kind of specialized class could therefore be really beneficial for handling array-type structures.

Such a specialized class is numpy.ndarray, which has been built with the specific goal of handling n-dimensional arrays both conveniently and efficiently, i.e., in a highly performing manner. The basic handling of instances of this class is again best illustrated by examples:
In [94]: import numpy as np

In [95]: a = np.array([0, 0.5, 1.0, 1.5, 2.0])
         type(a)
Out[95]: numpy.ndarray
In [96]: a[:2]  # indexing as with list objects in 1 dimension
Out[96]: array([ 0. ,  0.5])
A major feature of the numpy.ndarray class is the multitude of built-in methods. For instance:

In [97]: a.sum()  # sum of all elements
Out[97]: 5.0
In [98]: a.std()  # standard deviation
Out[98]: 0.70710678118654757
In [99]: a.cumsum()  # running cumulative sum
Out[99]: array([ 0. ,  0.5,  1.5,  3. ,  5. ])
Another major feature is the (vectorized) mathematical operations defined on ndarray objects:

In [100]: a * 2
Out[100]: array([ 0.,  1.,  2.,  3.,  4.])

In [101]: a ** 2
Out[101]: array([ 0.  ,  0.25,  1.  ,  2.25,  4.  ])
In [102]: np.sqrt(a)
Out[102]: array([ 0.        ,  0.70710678,  1.        ,  1.22474487,  1.41421356])
The transition to more than one dimension is seamless, and all features presented so far carry over to the more general cases. In particular, the indexing system is made consistent across all dimensions:

In [103]: b = np.array([a, a * 2])
          b
Out[103]: array([[ 0. ,  0.5,  1. ,  1.5,  2. ],
                 [ 0. ,  1. ,  2. ,  3. ,  4. ]])
In [104]: b[0]  # first row
Out[104]: array([ 0. ,  0.5,  1. ,  1.5,  2. ])

In [105]: b[0, 2]  # third element of first row
Out[105]: 1.0
In [106]: b.sum()
Out[106]: 15.0
In contrast to our list object-based approach to constructing arrays, the numpy.ndarray class knows axes explicitly. Selecting either rows or columns from a matrix is essentially the same:

In [107]: b.sum(axis=0)  # sum along axis 0, i.e., column-wise sum
Out[107]: array([ 0. ,  1.5,  3. ,  4.5,  6. ])

In [108]: b.sum(axis=1)  # sum along axis 1, i.e., row-wise sum
Out[108]: array([  5.,  10.])
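Other aggregation methods accept the axis parameter in the same way; a brief sketch (restating the b array so it is self-contained) with column-wise means and row-wise running sums:

```python
# the axis parameter works uniformly across ndarray methods
import numpy as np

b = np.array([[0., 0.5, 1., 1.5, 2.],
              [0., 1., 2., 3., 4.]])

print(b.mean(axis=0))    # column-wise means
print(b.cumsum(axis=1))  # row-wise running sums
```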
There are a number of ways to initialize (instantiate) a numpy.ndarray object. One is as presented before, via np.array. However, this assumes that all elements of the array are already available. In contrast, one would maybe like to have the numpy.ndarray objects instantiated first, to populate them later with results generated during the execution of code. To this end, we can use the following functions:

In [109]: c = np.zeros((2, 3, 4), dtype='i', order='C')  # also: np.ones()
          c
Out[109]: array([[[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]],
                 [[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]]], dtype=int32)
In [110]: d = np.ones_like(c, dtype='f16', order='C')  # also: np.zeros_like()
          d
Out[110]: array([[[ 1.0,  1.0,  1.0,  1.0],
                  [ 1.0,  1.0,  1.0,  1.0],
                  [ 1.0,  1.0,  1.0,  1.0]],
                 [[ 1.0,  1.0,  1.0,  1.0],
                  [ 1.0,  1.0,  1.0,  1.0],
                  [ 1.0,  1.0,  1.0,  1.0]]], dtype=float128)
With all these functions we provide the following information:

shape
Either an int, a sequence of ints, or a reference to another numpy.ndarray

dtype (optional)
A numpy.dtype; these are NumPy-specific data types for numpy.ndarray objects

order (optional)
The order in which to store elements in memory: 'C' for C-like (i.e., row-wise) or 'F' for Fortran-like (i.e., column-wise)
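A few further common instantiation routines, not shown above but all standard NumPy, round out the picture:

```python
# additional standard ways to create ndarray objects
import numpy as np

a = np.arange(8)            # like range(), but returns an ndarray
b = a.reshape((2, 4))       # view the 8 elements as a 2 x 4 matrix
c = np.eye(3)               # 3 x 3 identity matrix
d = np.linspace(0., 1., 5)  # 5 evenly spaced points from 0 to 1

print(b.shape)  # (2, 4)
print(d)
```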
Here, it becomes obvious how NumPy specializes the construction of arrays with the numpy.ndarray class, in comparison to the list-based approach:

The shape/length/size of the array is homogeneous across any given dimension.
It allows only a single data type (numpy.dtype) for the whole array.

The role of the order parameter is discussed later in the chapter. Table 4-4 provides an overview of numpy.dtype objects (i.e., the basic data types NumPy allows).
Table 4-4. NumPy dtype objects

dtype   Description               Example
t       Bit field                 t4 (4 bits)
b       Boolean                   b (true or false)
i       Integer                   i8 (64 bit)
u       Unsigned integer          u8 (64 bit)
f       Floating point            f8 (64 bit)
c       Complex floating point    c16 (128 bit)
O       Object                    O (pointer to object)
S, a    String                    S24 (24 characters)
U       Unicode                   U24 (24 Unicode characters)
V       Other                     V12 (12-byte data block)
NumPy provides a generalization of regular arrays that loosens at least the dtype restriction, but let us stick with regular arrays for a moment and see what the specialization brings in terms of performance.

As a simple exercise, suppose we want to generate a matrix/array of shape 5,000 × 5,000 elements, populated with (pseudo)random, standard normally distributed numbers. We then want to calculate the sum of all elements. First, the pure Python approach, where we make heavy use of list comprehensions and functional programming methods as well as lambda functions:
In [111]: import random
          I = 5000

In [112]: %time mat = [[random.gauss(0, 1) for j in range(I)] \
                       for i in range(I)]  # a nested list comprehension
Out[112]: CPU times: user 36.5 s, sys: 408 ms, total: 36.9 s
          Wall time: 36.4 s
In [113]: %time reduce(lambda x, y: x + y, \
                       [reduce(lambda x, y: x + y, row) \
                       for row in mat])
Out[113]: CPU times: user 4.3 s, sys: 52 ms, total: 4.35 s
          Wall time: 4.07 s
          678.5908519876674
Let us now turn to NumPy
and see how the same problem is solved there. For convenience, the NumPy
sublibrary random
offers a multitude of functions to initialize a numpy.ndarray
object and populate it at the same time with (pseudo)random numbers:
In [114]: %time mat = np.random.standard_normal((I, I))
Out[114]: CPU times: user 1.83 s, sys: 40 ms, total: 1.87 s Wall time: 1.87 s
In [115]: %time mat.sum()
Out[115]: CPU times: user 36 ms, sys: 0 ns, total: 36 ms Wall time: 34.6 ms 349.49777911439384
We observe the following:

Syntax
    Although we use several approaches to compactify the pure Python code, the NumPy version is even more compact and readable.

Performance
    The generation of the numpy.ndarray object is roughly 20 times faster and the calculation of the sum is roughly 100 times faster than the respective operations in pure Python.
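The equivalence of the two approaches is easy to check directly. The following sketch uses a reduced size (I = 100, so it runs in a fraction of a second) and the built-in sum() in place of reduce (which lives in functools as of Python 3):

```python
import random
import numpy as np

I = 100  # much smaller than in the text, for a quick check
mat = [[random.gauss(0, 1) for j in range(I)] for i in range(I)]

# pure Python: nested sum over the list of lists
py_sum = sum(sum(row) for row in mat)

# NumPy: convert once, then one vectorized call
arr = np.array(mat)
np_sum = arr.sum()

# both approaches agree up to floating-point rounding
assert abs(py_sum - np_sum) < 1e-9
```

The numerical results coincide; only where the looping happens (Python level vs. compiled NumPy code) differs.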
The specialization of the numpy.ndarray class obviously brings a number of really valuable benefits with it. However, a too-narrow specialization might turn out to be too large a burden to carry for the majority of array-based algorithms and applications. Therefore, NumPy provides structured arrays that allow us to have a different NumPy data type per column, at least. What does "per column" mean? Consider the following initialization of a structured array object:
In [116]: dt = np.dtype([('Name', 'S10'), ('Age', 'i4'),
                         ('Height', 'f'), ('Children/Pets', 'i4', 2)])
          s = np.array([('Smith', 45, 1.83, (0, 1)),
                        ('Jones', 53, 1.72, (2, 2))], dtype=dt)
          s
Out[116]: array([('Smith', 45, 1.8300000429153442, [0, 1]),
                 ('Jones', 53, 1.7200000286102295, [2, 2])],
                dtype=[('Name', 'S10'), ('Age', '<i4'), ('Height', '<f4'),
                       ('Children/Pets', '<i4', (2,))])
In a sense, this construction comes quite close to the operation for initializing tables in a SQL
database. We have column names and column data types, with maybe some additional information (e.g., maximum number of characters per string
object). The single columns can now be easily accessed by their names:
In [117]: s['Name']
Out[117]: array(['Smith', 'Jones'], dtype='S10')
In [118]: s['Height'].mean()
Out[118]: 1.7750001
Having selected a specific row and record, respectively, the resulting objects mainly behave like dict
objects, where one can retrieve values via keys:
In [119]: s[1]['Age']
Out[119]: 53
In summary, structured arrays are a generalization of the regular numpy.ndarray
object types in that the data type only has to be the same per column, as one is used to in the context of tables in SQL
databases. One advantage of structured arrays is that a single element of a column can be another multidimensional object and does not have to conform to the basic NumPy
data types.
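A short sketch of this flexibility, reusing the record layout from the example above:

```python
import numpy as np

dt = np.dtype([('Name', 'S10'), ('Age', 'i4'),
               ('Height', 'f'), ('Children/Pets', 'i4', 2)])
s = np.array([('Smith', 45, 1.83, (0, 1)),
              ('Jones', 53, 1.72, (2, 2))], dtype=dt)

# column-wise access works like a SQL projection
ages = s['Age']

# the multidimensional field is itself a 2 x 2 sub-array
pets = s['Children/Pets']
print(pets.shape)        # (2, 2)

# aggregate over a column, just as with a regular array
print(s['Age'].mean())   # 49.0
```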
NumPy
provides, in addition to regular arrays, structured arrays that allow the description and handling of rather complex arrayoriented data structures with a variety of different data types and even structures per (named) column. They bring SQL
tablelike data structures to Python
, with all the benefits of regular numpy.ndarray
objects (syntax, methods, performance).
Vectorization of code is a strategy to get more compact code that is possibly executed faster. The fundamental idea is to conduct an operation on or to apply a function to a complex object “at once” and not by iterating over the single elements of the object. In Python
, the functional programming tools map
, filter
, and reduce
provide means for vectorization. In a sense, NumPy
has vectorization built in deep down in its core.
As we learned in the previous section, simple mathematical operations can be implemented on numpy.ndarray objects directly. For example, we can add two NumPy arrays element-wise as follows:
In [120]: r = np.random.standard_normal((4, 3))
          s = np.random.standard_normal((4, 3))
In [121]: r + s
Out[121]: array([[1.94801686, 0.6855251 , 2.28954806], [ 0.33847593, 1.97109602, 1.30071653], [1.12066585, 0.22234207, 2.73940339], [ 0.43787363, 0.52938941, 1.38467623]])
NumPy
also supports what is called broadcasting. This allows us to combine objects of different shape within a single operation. We have already made use of this before. Consider the following example:
In [122]: 2 * r + 3
Out[122]: array([[ 2.54691692, 1.65823523, 8.14636725], [ 4.94758114, 0.25648128, 1.89566919], [ 0.41775907, 0.58038395, 2.06567484], [ 0.67600205, 3.41004636, 1.07282384]])
In this case, the r object is multiplied by 2 element-wise and then 3 is added element-wise; the 3 is broadcast, or stretched, to the shape of the r object. Broadcasting works with differently shaped arrays as well, up to a certain point:
In [123]: s = np.random.standard_normal(3)
          r + s
Out[123]: array([[ 0.23324118, 1.09764268, 1.90412565], [ 1.43357329, 1.79851966, 1.22122338], [0.83133775, 1.63656832, 1.13622055], [0.70221625, 0.22173711, 1.63264605]])
This broadcasts the one-dimensional array of size 3 to a shape of (4, 3). The same does not work, for example, with a one-dimensional array of size 4:
In [124]: s = np.random.standard_normal(4)
          r + s
Out[124]: ValueError: operands could not be broadcast together with shapes (4,3) (4,)
However, transposing the r
object makes the operation work again. In the following code, the transpose
method transforms the ndarray
object with shape (4, 3) into an object of the same type with shape (3, 4):
In [125]: r.transpose() + s
Out[125]: array([[0.63380522, 0.5964174 , 0.88641996, 0.86931849], [1.07814606, 1.74913253, 0.9677324 , 0.49770367], [ 2.16591995, 0.92953858, 1.71037785, 0.67090759]])
In [126]: np.shape(r.T)
Out[126]: (3, 4)
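Transposing r is not the only remedy: reshaping the size-4 vector into a column vector of shape (4, 1) also satisfies the broadcasting rules, so each row of r gets its own scalar added. A small sketch (with fresh sample data, so the numbers differ from those above):

```python
import numpy as np

r = np.random.standard_normal((4, 3))
s = np.random.standard_normal(4)

# s.reshape(-1, 1) has shape (4, 1); broadcasting stretches it
# across the three columns of r
res = r + s.reshape(-1, 1)
print(res.shape)  # (4, 3)

# each row i of r gets s[i] added to all of its elements
assert np.allclose(res[0], r[0] + s[0])
```

This variant keeps the result in the original (4, 3) orientation instead of the transposed one.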
As a general rule, custom-defined Python functions work with numpy.ndarray objects as well. If the implementation allows, arrays can be used with functions just as int or float objects can. Consider the following function:
In [127]: def f(x):
              return 3 * x + 5
We can pass standard Python
objects as well as numpy.ndarray
objects (for which the operations in the function have to be defined, of course):
In [128]: f(0.5)  # float object
Out[128]: 6.5
In [129]: f(r)  # NumPy array
Out[129]: array([[ 4.32037538, 2.98735285, 12.71955087], [ 7.9213717 , 0.88472192, 3.34350378], [ 1.1266386 , 1.37057593, 3.59851226], [ 1.51400308, 5.61506954, 2.10923576]])
What NumPy does is simply apply the function f to the object element-wise. In that sense, by using this kind of operation we do not avoid loops; we only avoid them on the Python level and delegate the looping to NumPy. On the NumPy level, looping over the numpy.ndarray object is taken care of by highly optimized code, most of it written in C and therefore generally much faster than pure Python. This explains the "secret" behind the performance benefits of using NumPy for array-based use cases.
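That the vectorized call and an explicit Python-level loop produce exactly the same numbers can be verified directly; a quick sketch with fresh sample data:

```python
import numpy as np

def f(x):
    return 3 * x + 5

r = np.random.standard_normal((4, 3))

# NumPy applies the operations in f element-wise, in compiled code
vec = f(r)

# equivalent, but much slower, Python-level double loop
loop = np.array([[f(v) for v in row] for row in r])

assert np.allclose(vec, loop)
```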
When working with arrays, one has to take care to call the right functions on the respective objects. For example, the sin
function from the standard math
module of Python
does not work with NumPy
arrays:
In [130]: import math
          math.sin(r)
Out[130]: TypeError: only length-1 arrays can be converted to Python scalars
The function is designed to handle, for example, float objects—i.e., single numbers, not arrays. NumPy provides the respective counterparts as so-called ufuncs, or universal functions:
In [131]: np.sin(r)  # array as input
Out[131]: array([[0.22460878, 0.62167738, 0.53829193], [ 0.82702259, 0.98025745, 0.52453206], [0.96114497, 0.93554821, 0.45035471], [0.91759955, 0.20358986, 0.82124413]])
In [132]: np.sin(np.pi)  # float as input
Out[132]: 1.2246467991473532e16
NumPy
provides a large number of such ufuncs that generalize typical mathematical functions to numpy.ndarray
objects.^{[22]}
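The ufunc and a Python-level loop over math.sin agree numerically; only where the looping happens differs. A brief sketch:

```python
import math
import numpy as np

r = np.random.standard_normal(1000)

# pure Python: apply math.sin element by element
loop_result = np.array([math.sin(x) for x in r])

# NumPy ufunc: one call; the looping happens in compiled code
ufunc_result = np.sin(r)

assert np.allclose(loop_result, ufunc_result)
```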
Be careful when using the from library import * approach to importing. With this approach, the reference to the ufunc numpy.sin might be replaced by the reference to the math function math.sin (whichever is imported last wins). You should, as a rule, import both libraries by name to avoid confusion: import numpy as np; import math. Then you can use math.sin alongside np.sin.
When we first initialized numpy.ndarray objects by using numpy.zeros, we provided an optional argument for the memory layout. This argument specifies, roughly speaking, which elements of an array get stored in memory next to each other. When working with small arrays, this has hardly any measurable impact on the performance of array operations. However, when arrays get large the story is somewhat different, depending on the operations to be implemented on the arrays.
To illustrate this important point about the memory-wise handling of arrays in science and finance, consider the following construction of multidimensional numpy.ndarray objects:
In [133]: x = np.random.standard_normal((5, 10000000))
          y = 2 * x + 3  # linear equation y = a * x + b
          C = np.array((x, y), order='C')
          F = np.array((x, y), order='F')
          x = 0.0; y = 0.0  # memory cleanup
In [134]: C[:2].round(2)
Out[134]: array([[[0.51, 1.14, 1.07, ..., 0.2 , 0.18, 0.1 ], [1.22, 0.68, 1.83, ..., 1.23, 0.27, 0.16], [ 0.45, 0.15, 0.01, ..., 0.75, 0.91, 1.12], [0.16, 1.4 , 0.79, ..., 0.33, 0.54, 1.81], [ 1.07, 1.07, 0.37, ..., 0.76, 0.71, 0.34]], [[ 1.98, 0.72, 0.86, ..., 3.4 , 2.64, 3.21], [ 0.55, 4.37, 6.66, ..., 5.47, 2.47, 2.68], [ 3.9 , 3.29, 3.03, ..., 1.5 , 4.82, 0.76], [ 2.67, 5.8 , 1.42, ..., 2.34, 4.09, 6.63], [ 5.14, 0.87, 2.27, ..., 1.48, 4.43, 3.67]]])
Let’s look at some really fundamental examples and use cases for both types of ndarray
objects:
In [135]: %timeit C.sum()
Out[135]: 10 loops, best of 3: 123 ms per loop
In [136]: %timeit F.sum()
Out[136]: 10 loops, best of 3: 123 ms per loop
When summing up all elements of the arrays, there is no performance difference between the two memory layouts. However, consider the following example with the C-like memory layout:
In [137]: %timeit C[0].sum(axis=0)
Out[137]: 10 loops, best of 3: 102 ms per loop
In [138]: %timeit C[0].sum(axis=1)
Out[138]: 10 loops, best of 3: 61.9 ms per loop
Summing five large vectors and getting back a single large results vector obviously is slower in this case than summing 10,000,000 small ones and getting back an equal number of results. This is due to the fact that the single elements of the small vectors—i.e., the rows—are stored next to each other in memory. With the Fortran-like memory layout, the relative performance changes considerably:
In [139]: %timeit F.sum(axis=0)
Out[139]: 1 loops, best of 3: 801 ms per loop
In [140]: %timeit F.sum(axis=1)
Out[140]: 1 loops, best of 3: 2.23 s per loop
In [141]: F = 0.0; C = 0.0  # memory cleanup
In this case, operating on a few large vectors performs better than operating on a large number of small ones. The elements of the few large vectors are stored in memory next to each other, which explains the relative performance advantage. However, overall the operations are absolutely much slower when compared to the C-like variant.
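The memory layout of a given array can be inspected, and converted, at runtime via its flags attribute; a short sketch (array sizes chosen small here purely for illustration):

```python
import numpy as np

C = np.zeros((5, 1000), order='C')
F = np.zeros((5, 1000), order='F')

# each layout reports itself as contiguous in its own ordering
print(C.flags['C_CONTIGUOUS'])  # True
print(F.flags['F_CONTIGUOUS'])  # True

# convert between layouts when an algorithm prefers one of them
F2C = np.ascontiguousarray(F)   # C-ordered copy of F
print(F2C.flags['C_CONTIGUOUS'])  # True
```

Such a conversion copies the data, so it pays off only when the array is reused in many subsequent operations.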
Python provides, in combination with NumPy, a rich set of flexible data structures. From a finance point of view, the following can be considered the most important ones:

Basic data types
    The classes int, float, and string provide the atomic data types.

Standard data structures
    The classes tuple, list, dict, and set have many application areas in finance, with list being the most flexible workhorse in general.

Arrays
    For array-based use cases, NumPy provides the specialized class numpy.ndarray, which provides both convenience and compactness of code as well as high performance.
This chapter shows that both the basic data structures and the NumPy
ones allow for highly vectorized implementation of algorithms. Depending on the specific shape of the data structures, care should be taken with regard to the memory layout of arrays. Choosing the right approach here can speed up code execution by a factor of two or more.
This chapter focuses on those issues that might be of particular importance for finance algorithms and applications. However, it can only represent a starting point for the exploration of data structures and data modeling in Python
. There are a number of valuable resources available to go deeper from here.
Here are some Internet resources to consult:

The Python documentation is always a good starting point: http://www.python.org/doc/.

For details on NumPy arrays as well as related methods and functions, see http://docs.scipy.org/doc/.

The SciPy lecture notes are also a good source to get started: http://scipylectures.github.io/.
Good references in book form are:
^{[18]} The Cython library brings static typing and compiling features to Python that are comparable to those in C. In fact, Cython is a hybrid language of Python and C.
^{[19] }Here and in the following discussion, terms like float, float object, etc. are used interchangeably, acknowledging that every float is also an object. The same holds true for other object types.
^{[21] }It is not possible to go into details here, but there is a wealth of information available on the Internet about regular expressions in general and for Python
in particular. For an introduction to this topic, refer to Fitzgerald, Michael (2012): Introducing Regular Expressions. O’Reilly, Sebastopol, CA.
^{[22] }Cf. http://docs.scipy.org/doc/numpy/reference/ufuncs.html for an overview.