Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
— Linus Torvalds
This chapter introduces the basic data types and data structures of Python. Although the Python interpreter itself already brings a rich variety of data structures with it, NumPy and other libraries add to these in a valuable fashion.
The chapter is organized as follows:

Basic data types
The first section introduces basic data types such as int, float, and string.

Basic data structures
The next section introduces the fundamental data structures of Python (e.g., list objects) and illustrates control structures, functional programming paradigms, and anonymous functions.

NumPy data structures
The third section is devoted to the NumPy ndarray class and illustrates some of the benefits of this class for scientific and financial applications. With NumPy's array class, vectorized code is easily implemented, leading to more compact and also better-performing code.
The spirit of this chapter is to provide a general introduction to Python specifics when it comes to data types and structures. If you are equipped with a background from another programming language, say C or Matlab, you should be able to easily grasp the differences that Python usage might bring along. The topics introduced here are all important and fundamental for the chapters to come.
Python is a dynamically typed language, which means that the Python interpreter infers the type of an object at runtime. In comparison, compiled languages like C are generally statically typed. In these cases, the type of an object has to be attached to the object before compile time.^{[18]}

One of the most fundamental data types is the integer, or int:
In [1]: a = 10
        type(a)
Out[1]: int
The built-in function type provides type information for all objects with standard and built-in types, as well as for newly created classes and objects. In the latter case, the information provided depends on the description the programmer has stored with the class. There is a saying that "everything in Python is an object." This means, for example, that even simple objects like the int object we just defined have built-in methods. For example, you can get the number of bits needed to represent the int object in memory by calling the method bit_length:
In [2]: a.bit_length()
Out[2]: 4
You will see that the number of bits needed increases with the integer value that we assign to the object:

In [3]: a = 100000
        a.bit_length()
Out[3]: 17
In general, there are so many different methods that it is hard to memorize all methods of all classes and objects. Advanced Python environments, like IPython, provide tab completion capabilities that show all methods attached to an object. You simply type the object name followed by a dot (e.g., a.) and then press the Tab key. This then provides a collection of methods you can call on the object. Alternatively, the Python built-in function dir gives a complete list of the attributes and methods of any object.

A specialty of Python is that integers can be arbitrarily large. Consider, for example, the googol number 10^{100}. Python has no problem with such large numbers, which are technically long objects:
In [4]: googol = 10 ** 100
        googol
Out[4]: 10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000L

In [5]: googol.bit_length()
Out[5]: 333
Python integers can be arbitrarily large. The interpreter simply uses as many bits/bytes as needed to represent the numbers.
It is important to note that mathematical operations on int objects return int objects. This can sometimes lead to confusion and/or hard-to-detect errors in mathematical routines. The following expression yields the expected result:
In [6]: 1 + 4
Out[6]: 5
However, the next case may return a somewhat surprising result:
In [7]: 1 / 4
Out[7]: 0

In [8]: type(1 / 4)
Out[8]: int
For the last expression to return the generally desired result of 0.25, we must operate on float objects, which brings us naturally to the next basic data type. Adding a dot to an integer value, like in 1. or 1.0, causes Python to interpret the object as a float. Expressions involving a float also return a float object in general:^{[19]}
In [9]: 1. / 4
Out[9]: 0.25

In [10]: type(1. / 4)
Out[10]: float
A float is a bit more involved in that the computerized representation of rational or real numbers is in general not exact and depends on the specific technical approach taken. To illustrate what this implies, let us define another float object:
In [11]: b = 0.35
         type(b)
Out[11]: float
float objects like this one are always represented internally up to a certain degree of accuracy only. This becomes evident when adding 0.1 to b:

In [12]: b + 0.1
Out[12]: 0.44999999999999996
The reason for this is that floats are internally represented in binary format; that is, a decimal number 0 < n < 1 is represented by a series of the form n = x/2 + y/4 + z/8 + … with x, y, z, … ∈ {0, 1}. For certain floating-point numbers the binary representation might involve a large number of elements or might even be an infinite series. However, given a fixed number of bits used to represent such a number (i.e., a fixed number of terms in the representation series), inaccuracies are the consequence. Other numbers can be represented perfectly and are therefore stored exactly even with a finite number of bits available. Consider the following example:
In [13]: c = 0.5
         c.as_integer_ratio()
Out[13]: (1, 2)
One half, i.e., 0.5, is stored exactly because it has an exact (finite) binary representation as 0.5 = 1/2. However, for b = 0.35 we get something different than the expected rational number 0.35 = 7/20:

In [14]: b.as_integer_ratio()
Out[14]: (3152519739159347, 9007199254740992)
The precision is dependent on the number of bits used to represent the number. In general, all platforms that Python runs on use the IEEE 754 double-precision standard (i.e., 64 bits) for internal representation.^{[20]} This translates into a 15-digit relative accuracy.

Since this topic is of high importance for several application areas in finance, it is sometimes necessary to ensure the exact, or at least best possible, representation of numbers. For example, the issue can be of importance when summing over a large set of numbers. In such a situation, a certain kind and/or magnitude of representation error might, in aggregate, lead to significant deviations from a benchmark value.
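A small sketch (not from the text, standard library only) makes this aggregation effect tangible: summing many copies of the inexactly representable 0.1 with plain sum accumulates error, while math.fsum uses error-compensated summation.

```python
# Hypothetical illustration: accumulated representation error when summing.
import math

values = [0.1] * 1000  # 1,000 copies of the inexactly representable 0.1

naive = sum(values)        # plain left-to-right summation
exact = math.fsum(values)  # error-compensated summation

print(naive == 100.0)  # False due to accumulated error
print(exact == 100.0)  # True
```

The deviation is tiny per element but systematic, which is exactly the situation described above when benchmarking aggregated values.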
The module decimal provides an arbitrary-precision object for floating-point numbers and several options to address precision issues when working with such numbers:

In [15]: import decimal
         from decimal import Decimal

In [16]: decimal.getcontext()
Out[16]: Context(prec=28, rounding=ROUND_HALF_EVEN, Emin=-999999999, Emax=999999999, capitals=1, flags=[], traps=[Overflow, InvalidOperation, DivisionByZero])
In [17]: d = Decimal(1) / Decimal(11)
         d
Out[17]: Decimal('0.09090909090909090909090909091')
You can change the precision of the representation by changing the respective attribute value of the Context object:

In [18]: decimal.getcontext().prec = 4  # lower precision than default
In [19]: e = Decimal(1) / Decimal(11)
         e
Out[19]: Decimal('0.09091')
In [20]: decimal.getcontext().prec = 50  # higher precision than default
In [21]: f = Decimal(1) / Decimal(11)
         f
Out[21]: Decimal('0.090909090909090909090909090909090909090909090909091')
If needed, the precision can in this way be adjusted to the exact problem at hand and one can operate with floating-point objects that exhibit different degrees of accuracy:

In [22]: g = d + e + f
         g
Out[22]: Decimal('0.27272818181818181818181818181909090909090909090909')
Now that we can represent natural and floating-point numbers, we turn to text. The basic data type to represent text in Python is the string. The string object has a number of really helpful built-in methods. In fact, Python is generally considered to be a good choice when it comes to working with text files of any kind and any size. A string object is generally defined by single or double quotation marks or by converting another object using the str function (i.e., using the object's standard or user-defined string representation):
In [23]: t = 'this is a string object'
With regard to the built-in methods, you can, for example, capitalize the first word in this object:

In [24]: t.capitalize()
Out[24]: 'This is a string object'
Or you can split it into its single-word components to get a list object of all the words (more on list objects later):

In [25]: t.split()
Out[25]: ['this', 'is', 'a', 'string', 'object']
You can also search for a word and, in the successful case, get back the position (i.e., index value) of the first letter of the word:

In [26]: t.find('string')
Out[26]: 10
If the word is not in the string object, the method returns -1:

In [27]: t.find('Python')
Out[27]: -1
Replacing characters in a string is a typical task that is easily accomplished with the replace method:

In [28]: t.replace(' ', '')
Out[28]: 'thisisastringobject'
The stripping of strings (i.e., deletion of certain leading/lagging characters) is also often necessary:

In [29]: 'http://www.python.org'.strip('htp:/')
Out[29]: 'www.python.org'
Table 4-1 lists a number of helpful methods of the string object.

Table 4-1. Selected string methods

Method       Arguments                  Returns/result
capitalize   ()                         Copy of the string with first letter capitalized
count        (sub[, start[, end]])      Count of the number of occurrences of substring
decode       ([encoding[, errors]])     Decoded version of the string, using encoding
encode       ([encoding[, errors]])     Encoded version of the string
find         (sub[, start[, end]])      (Lowest) index where substring is found
join         (seq)                      Concatenation of strings in sequence seq
replace      (old, new[, count])        Replaces old by new the first count times
split        ([sep[, maxsplit]])        List of words in string with sep as separator
splitlines   ([keepends])               Separated lines with line ends/breaks if keepends is True
strip        (chars)                    Copy of string with leading/lagging characters in chars removed
upper        ()                         Copy with all letters capitalized
A powerful tool when working with string objects is regular expressions. Python provides such functionality in the module re:

In [30]: import re
Suppose you are faced with a large text file, such as a comma-separated value (CSV) file, which contains certain time series and respective date-time information. More often than not, the date-time information is delivered in a format that Python cannot interpret directly. However, the date-time information can generally be described by a regular expression. Consider the following string object, containing three date-time elements, three integers, and three strings. Note that triple quotation marks allow the definition of strings over multiple rows:
In [31]: series = """
         '01/18/2014 13:00:00', 100, '1st';
         '01/18/2014 13:30:00', 110, '2nd';
         '01/18/2014 14:00:00', 120, '3rd'
         """
The following regular expression describes the format of the date-time information provided in the string object:^{[21]}

In [32]: dt = re.compile("'[0-9/:\s]+'")  # datetime
Equipped with this regular expression, we can go on and find all the date-time elements. In general, applying regular expressions to string objects also leads to performance improvements for typical parsing tasks:

In [33]: result = dt.findall(series)
         result
Out[33]: ["'01/18/2014 13:00:00'", "'01/18/2014 13:30:00'", "'01/18/2014 14:00:00'"]
When parsing string objects, consider using regular expressions, which can bring both convenience and performance to such operations.

The resulting string objects can then be parsed to generate Python datetime objects (cf. Appendix C for an overview of handling date and time data with Python). To parse the string objects containing the date-time information, we need to provide information on how to parse them, again as a string object:
In [34]: from datetime import datetime
         pydt = datetime.strptime(result[0].replace("'", ""),
                                  '%m/%d/%Y %H:%M:%S')
         pydt
Out[34]: datetime.datetime(2014, 1, 18, 13, 0)

In [35]: pydt
Out[35]: 2014-01-18 13:00:00

In [36]: type(pydt)
Out[36]: <type 'datetime.datetime'>
Later chapters provide more information on date-time data, the handling of such data, and datetime objects and their methods. This is just meant to be a teaser for this important topic in finance.

As a general rule, data structures are objects that contain a possibly large number of other objects. Among those that Python provides as built-in structures are:

tuple
list
dict
set
A tuple is an advanced data structure, yet it's still quite simple and limited in its applications. It is defined by providing objects in parentheses:

In [37]: t = (1, 2.5, 'data')
         type(t)
Out[37]: tuple
You can even drop the parentheses and provide multiple objects separated by commas:

In [38]: t = 1, 2.5, 'data'
         type(t)
Out[38]: tuple
Like almost all data structures in Python, the tuple has a built-in index with the help of which you can retrieve single or multiple elements of the tuple. It is important to remember that Python uses zero-based numbering, such that the third element of a tuple is at index position 2:

In [39]: t[2]
Out[39]: 'data'
In [40]: type(t[2])
Out[40]: str
In contrast to some other programming languages like Matlab, Python uses zero-based numbering schemes. For example, the first element of a tuple object has index value 0.
There are only two special methods that this object type provides: count and index. The first counts the number of occurrences of a certain object and the second gives the index value of its first appearance:

In [41]: t.count('data')
Out[41]: 1

In [42]: t.index(1)
Out[42]: 0
tuple objects are not very flexible since, once defined, they cannot be changed easily.
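A brief sketch (standard Python semantics, not from the text) shows both the immutability and a convenient consequence of the tuple's fixed layout, namely unpacking:

```python
# Tuples cannot be changed in place, but they unpack nicely.
t = (1, 2.5, 'data')

try:
    t[0] = 100  # item assignment raises TypeError
except TypeError as e:
    print('immutable:', e)

a, b, c = t  # unpacking into separate names works fine
print(a, c)
```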
Objects of type list are much more flexible and powerful in comparison to tuple objects. From a finance point of view, you can achieve a lot working only with list objects, such as storing stock price quotes and appending new data. A list object is defined through brackets, and its basic capabilities and behavior are similar to those of tuple objects:
In [43]: l = [1, 2.5, 'data']
         l[2]
Out[43]: 'data'
list objects can also be defined or converted by using the function list. The following code generates a new list object by converting the tuple object from the previous example:

In [44]: l = list(t)
         l
Out[44]: [1, 2.5, 'data']

In [45]: type(l)
Out[45]: list
In addition to the characteristics of tuple objects, list objects are also expandable and reducible via different methods. In other words, whereas string and tuple objects are immutable sequence objects (with indexes) that cannot be changed once created, list objects are mutable and can be changed via different operations. You can append list objects to an existing list object, and more:

In [46]: l.append([4, 3])  # append list at the end
         l
Out[46]: [1, 2.5, 'data', [4, 3]]
In [47]: l.extend([1.0, 1.5, 2.0])  # append elements of list
         l
Out[47]: [1, 2.5, 'data', [4, 3], 1.0, 1.5, 2.0]
In [48]: l.insert(1, 'insert')  # insert object before index position
         l
Out[48]: [1, 'insert', 2.5, 'data', [4, 3], 1.0, 1.5, 2.0]
In [49]: l.remove('data')  # remove first occurrence of object
         l
Out[49]: [1, 'insert', 2.5, [4, 3], 1.0, 1.5, 2.0]
In [50]: p = l.pop(3)  # removes and returns object at index
         l, p
Out[50]: ([1, 'insert', 2.5, 1.0, 1.5, 2.0], [4, 3])
Slicing is also easily accomplished. Here, slicing refers to an operation that breaks down a data set into smaller parts (of interest):

In [51]: l[2:5]  # 3rd to 5th elements
Out[51]: [2.5, 1.0, 1.5]
Table 4-2 provides a summary of selected operations and methods of the list object.

Table 4-2. Selected operations and methods of list objects

Method        Arguments                   Returns/result
l[i] = x      [i]                         Replaces ith element by x
l[i:j:k] = s  [i:j:k]                     Replaces every kth element from i to j - 1 by elements of s
append        (x)                         Appends x to object
count         (x)                         Number of occurrences of object x
del l[i:j:k]  [i:j:k]                     Deletes elements with index values i to j - 1, step k
extend        (s)                         Appends all elements of s to object
index         (x[, i[, j]])               First index of x between elements i and j - 1
insert        (i, x)                      Inserts x at/before index i
remove        (x)                         Removes first occurrence of element x
pop           (i)                         Removes element with index i and returns it
reverse       ()                          Reverses all items in place
sort          ([cmp[, key[, reverse]]])   Sorts all items in place
Although a topic in itself, control structures like for loops are maybe best introduced in Python based on list objects. This is due to the fact that looping in general takes place over list objects, which is quite different from what is often the standard in other languages. Take the following example. The for loop loops over the elements of the list object l with index values 2 to 4 and prints the square of each element. Note the importance of the indentation (whitespace) in the second line:
In [52]: for element in l[2:5]:
             print element ** 2
Out[52]: 6.25
         1.0
         2.25
This provides a really high degree of flexibility in comparison to the typical counter-based looping. Counter-based looping is also an option with Python, but is accomplished based on the (standard) list object range:

In [53]: r = range(0, 8, 1)  # start, end, step width
         r
Out[53]: [0, 1, 2, 3, 4, 5, 6, 7]
In [54]: type(r)
Out[54]: list
For comparison, the same loop is implemented using range as follows:

In [55]: for i in range(2, 5):
             print l[i] ** 2
Out[55]: 6.25
         1.0
         2.25
In Python you can loop over arbitrary list objects, no matter what the content of the object is. This often avoids the introduction of a counter.

Python also provides the typical (conditional) control elements if, elif, and else. Their use is comparable to that in other languages:
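When the element index is needed alongside the element itself, the built-in enumerate still avoids a manual counter. A minimal sketch (not from the text, shown in Python 3 syntax):

```python
# enumerate yields (index, element) pairs while looping.
l = [1, 'insert', 2.5, 1.0, 1.5, 2.0]

for i, element in enumerate(l[2:5]):
    print(i, element ** 2)
```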
In [56]: for i in range(1, 10):
             if i % 2 == 0:  # % is for modulo
                 print "%d is even" % i
             elif i % 3 == 0:
                 print "%d is multiple of 3" % i
             else:
                 print "%d is odd" % i
Out[56]: 1 is odd
         2 is even
         3 is multiple of 3
         4 is even
         5 is odd
         6 is even
         7 is odd
         8 is even
         9 is multiple of 3
Similarly, while provides another means to control the flow:

In [57]: total = 0
         while total < 100:
             total += 1
         total
Out[57]: 100
A specialty of Python is so-called list comprehensions. Instead of looping over existing list objects, this approach generates list objects via loops in a rather compact fashion:

In [58]: m = [i ** 2 for i in range(5)]
         m
Out[58]: [0, 1, 4, 9, 16]
In a certain sense, this already provides a first means to generate “something like” vectorized code in that loops are rather more implicit than explicit (vectorization of code is discussed in more detail later in this chapter).
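List comprehensions also allow an optional filtering condition, which further condenses loop-plus-if patterns. A minimal sketch (not from the text): squares of the even numbers only.

```python
# the trailing if clause filters which i enter the result list
m = [i ** 2 for i in range(10) if i % 2 == 0]
print(m)  # [0, 4, 16, 36, 64]
```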
Python provides a number of tools for functional programming support as well, i.e., the application of a function to a whole set of inputs (in our case list objects). Among these tools are filter, map, and reduce. However, we need a function definition first. To start with something really simple, consider a function f that returns the square of the input x:

In [59]: def f(x):
             return x ** 2
         f(2)
Out[59]: 4
Of course, functions can be arbitrarily complex, with multiple input/parameter objects and even multiple outputs (return objects). However, consider the following function:

In [60]: def even(x):
             return x % 2 == 0
         even(3)
Out[60]: False
The return object is a Boolean. Such a function can be applied to a whole list object by using map:

In [61]: map(even, range(10))
Out[61]: [True, False, True, False, True, False, True, False, True, False]
To this end, we can also provide a function definition directly as an argument to map, by using lambda or anonymous functions:

In [62]: map(lambda x: x ** 2, range(10))
Out[62]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Functions can also be used to filter a list object. In the following example, the filter returns elements of a list object that match the Boolean condition as defined by the even function:

In [63]: filter(even, range(15))
Out[63]: [0, 2, 4, 6, 8, 10, 12, 14]
Finally, reduce helps when we want to apply a function to all elements of a list object that returns a single value only. An example is the cumulative sum of all elements in a list object (assuming that summation is defined for the objects contained in the list):

In [64]: reduce(lambda x, y: x + y, range(10))
Out[64]: 45
An alternative, non-functional implementation could look like the following:

In [65]: def cumsum(l):
             total = 0
             for elem in l:
                 total += elem
             return total
         cumsum(range(10))
Out[65]: 45
It can be considered good practice to avoid loops on the Python level as far as possible. list comprehensions and functional programming tools like map, filter, and reduce provide means to write code without loops that is both compact and in general more readable. lambda or anonymous functions are also powerful tools in this context.
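For many such tasks the functional tools and comprehensions are interchangeable; a brief sketch comparing the two styles (note: in Python 3, map and filter return iterators, hence the list() calls):

```python
# map/lambda versus an equivalent comprehension
squares_map = list(map(lambda x: x ** 2, range(5)))
squares_comp = [x ** 2 for x in range(5)]
print(squares_map == squares_comp)  # True

# filter versus a comprehension with a condition
evens_filter = list(filter(lambda x: x % 2 == 0, range(10)))
evens_comp = [x for x in range(10) if x % 2 == 0]
print(evens_filter == evens_comp)  # True
```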
dict objects are dictionaries, and also mutable sequences, that allow data retrieval by keys that can, for example, be string objects. They are so-called key-value stores. While list objects are ordered and sortable, dict objects are unordered and unsortable. An example best illustrates further differences from list objects. Curly brackets are what define dict objects:

In [66]: d = {
             'Name' : 'Angela Merkel',
             'Country' : 'Germany',
             'Profession' : 'Chancelor',
             'Age' : 60
             }
         type(d)
Out[66]: dict
In [67]: print d['Name'], d['Age']
Out[67]: Angela Merkel 60
Again, this class of objects has a number of built-in methods:

In [68]: d.keys()
Out[68]: ['Country', 'Age', 'Profession', 'Name']

In [69]: d.values()
Out[69]: ['Germany', 60, 'Chancelor', 'Angela Merkel']

In [70]: d.items()
Out[70]: [('Country', 'Germany'), ('Age', 60), ('Profession', 'Chancelor'), ('Name', 'Angela Merkel')]
In [71]: birthday = True
         if birthday is True:
             d['Age'] += 1
         d['Age']
Out[71]: 61
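One further method worth knowing (standard Python, not shown in the text) is get, which retrieves a value with a fallback default instead of raising a KeyError for a missing key:

```python
# get() with a default avoids KeyError for absent keys
d = {'Name': 'Angela Merkel', 'Country': 'Germany', 'Age': 61}

print(d.get('Country', 'n/a'))  # Germany
print(d.get('Salary', 'n/a'))   # key absent, default returned: n/a
```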
There are several methods to get iterator objects from the dict object. The objects behave like list objects when iterated over:

In [72]: for item in d.iteritems():
             print item
Out[72]: ('Country', 'Germany')
         ('Age', 61)
         ('Profession', 'Chancelor')
         ('Name', 'Angela Merkel')
In [73]: for value in d.itervalues():
             print type(value)
Out[73]: <type 'str'>
         <type 'int'>
         <type 'str'>
         <type 'str'>
Table 4-3 provides a summary of selected operations and methods of the dict object.

Table 4-3. Selected operations and methods of dict objects

Method       Arguments   Returns/result
d[k]         [k]         Item of d with key k
d[k] = x     [k]         Sets item key k to x
del d[k]     [k]         Deletes item with key k
clear        ()          Removes all items
copy         ()          Makes a copy
has_key      (k)         True if k is a key
items        ()          Copy of all key-value pairs
iteritems    ()          Iterator over all items
iterkeys     ()          Iterator over all keys
itervalues   ()          Iterator over all values
keys         ()          Copy of all keys
pop          (k)         Returns and removes item with key k
update       ([e])       Updates items with items from e
values       ()          Copy of all values
The last data structure we will consider is the set object. Although set theory is a cornerstone of mathematics and also finance theory, there are not too many practical applications for set objects. The objects are unordered collections of other objects, containing every element only once:

In [74]: s = set(['u', 'd', 'ud', 'du', 'd', 'du'])
         s
Out[74]: {'d', 'du', 'u', 'ud'}
In [75]: t = set(['d', 'dd', 'uu', 'u'])
With set objects, you can implement operations as you are used to in mathematical set theory. For example, you can generate unions, intersections, and differences:

In [76]: s.union(t)  # all of s and t
Out[76]: {'d', 'dd', 'du', 'u', 'ud', 'uu'}
In [77]: s.intersection(t)  # both in s and t
Out[77]: {'d', 'u'}
In [78]: s.difference(t)  # in s but not t
Out[78]: {'du', 'ud'}
In [79]: t.difference(s)  # in t but not s
Out[79]: {'dd', 'uu'}
In [80]: s.symmetric_difference(t)  # in either one but not both
Out[80]: {'dd', 'du', 'ud', 'uu'}
One application of set objects is to get rid of duplicates in a list object. For example:

In [81]: from random import randint
         l = [randint(0, 10) for i in range(1000)]
         # 1,000 random integers between 0 and 10
         len(l)  # number of elements in l
Out[81]: 1000
In [82]: l[:20]
Out[82]: [8, 3, 4, 9, 1, 7, 5, 5, 6, 7, 4, 4, 7, 1, 8, 5, 0, 7, 1, 9]
In [83]: s = set(l)
         s
Out[83]: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
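Beyond the operations shown, set objects also support membership and subset/superset tests directly; a brief sketch (restating the s and t objects from the set-theory examples so it is self-contained):

```python
# membership and subset/superset tests on sets
s = set(['u', 'd', 'ud', 'du'])
t = set(['d', 'dd', 'uu', 'u'])

print('u' in s)                  # True
print(s.intersection(t) <= s)    # an intersection is always a subset: True
print(s.issuperset(['u', 'd']))  # True
```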
The previous section shows that Python provides some quite useful and flexible general data structures. In particular, list objects can be considered a real workhorse with many convenient characteristics and application areas. However, scientific and financial applications generally have a need for high-performing operations on special data structures. One of the most important data structures in this regard is the array. Arrays generally structure other (fundamental) objects in rows and columns.

Assume for the moment that we work with numbers only, although the concept generalizes to other types of data as well. In the simplest case, a one-dimensional array then represents, mathematically speaking, a vector of, in general, real numbers, internally represented by float objects. It then consists of a single row or column of elements only. In a more common case, an array represents an i × j matrix of elements. This concept generalizes to i × j × k cubes of elements in three dimensions as well as to general n-dimensional arrays of shape i × j × k × l × ….
Mathematical disciplines like linear algebra and vector space theory illustrate that such mathematical structures are of high importance in a number of disciplines and fields. It can therefore prove fruitful to have available a specialized class of data structures explicitly designed to handle arrays conveniently and efficiently. This is where the Python library NumPy comes into play, with its ndarray class.
Before we turn to NumPy, let us first construct arrays with the built-in data structures presented in the previous section. list objects are particularly suited to accomplishing this task. A simple list can already be considered a one-dimensional array:

In [84]: v = [0.5, 0.75, 1.0, 1.5, 2.0]  # vector of numbers
Since list objects can contain arbitrary other objects, they can also contain other list objects. In that way, two- and higher-dimensional arrays are easily constructed by nested list objects:

In [85]: m = [v, v, v]  # matrix of numbers
         m
Out[85]: [[0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0]]
We can also easily select rows via simple indexing or single elements via double indexing (whole columns, however, are not so easy to select):

In [86]: m[1]
Out[86]: [0.5, 0.75, 1.0, 1.5, 2.0]

In [87]: m[1][0]
Out[87]: 0.5
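Selecting a whole column from such a nested list requires an explicit loop or comprehension; a minimal sketch (restating v and m so it stands alone) extracts the second column:

```python
# column selection from a nested list "matrix" needs a comprehension
v = [0.5, 0.75, 1.0, 1.5, 2.0]
m = [v, v, v]

column = [row[1] for row in m]
print(column)  # [0.75, 0.75, 0.75]
```

This asymmetry between rows and columns is one of the inconveniences that the NumPy ndarray class, introduced below, removes.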
Nesting can be pushed further for even more general structures:

In [88]: v1 = [0.5, 1.5]
         v2 = [1, 2]
         m = [v1, v2]
         c = [m, m]  # cube of numbers
         c
Out[88]: [[[0.5, 1.5], [1, 2]], [[0.5, 1.5], [1, 2]]]
In [89]: c[1][1][0]
Out[89]: 1
Note that combining objects in the way just presented generally works with reference pointers to the original objects. What does that mean in practice? Let us have a look at the following operations:

In [90]: v = [0.5, 0.75, 1.0, 1.5, 2.0]
         m = [v, v, v]
         m
Out[90]: [[0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0]]
Now change the value of the first element of the v object and see what happens to the m object:

In [91]: v[0] = 'Python'
         m
Out[91]: [['Python', 0.75, 1.0, 1.5, 2.0],
          ['Python', 0.75, 1.0, 1.5, 2.0],
          ['Python', 0.75, 1.0, 1.5, 2.0]]
This can be avoided by using the deepcopy function of the copy module:

In [92]: from copy import deepcopy
         v = [0.5, 0.75, 1.0, 1.5, 2.0]
         m = 3 * [deepcopy(v), ]
         m
Out[92]: [[0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0]]
In [93]: v[0] = 'Python'
         m
Out[93]: [[0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0]]
Obviously, composing array structures with list objects works, somewhat. But it is not really convenient, and the list class has not been built with this specific goal in mind. It has rather been built with a much broader and more general scope. From this point of view, some kind of specialized class could therefore be really beneficial for handling array-type structures.

Such a specialized class is numpy.ndarray, which has been built with the specific goal of handling n-dimensional arrays both conveniently and efficiently, i.e., in a highly performing manner. The basic handling of instances of this class is again best illustrated by examples:
In [94]: import numpy as np

In [95]: a = np.array([0, 0.5, 1.0, 1.5, 2.0])
         type(a)
Out[95]: numpy.ndarray
In [96]: a[:2]  # indexing as with list objects in 1 dimension
Out[96]: array([ 0. ,  0.5])
A major feature of the numpy.ndarray class is the multitude of built-in methods. For instance:

In [97]: a.sum()  # sum of all elements
Out[97]: 5.0
In [98]: a.std()  # standard deviation
Out[98]: 0.70710678118654757
In [99]: a.cumsum()  # running cumulative sum
Out[99]: array([ 0. ,  0.5,  1.5,  3. ,  5. ])
Another major feature is the (vectorized) mathematical operations defined on ndarray objects:

In [100]: a * 2
Out[100]: array([ 0.,  1.,  2.,  3.,  4.])

In [101]: a ** 2
Out[101]: array([ 0.  ,  0.25,  1.  ,  2.25,  4.  ])
In [102]: np.sqrt(a)
Out[102]: array([ 0.        ,  0.70710678,  1.        ,  1.22474487,  1.41421356])
The transition to more than one dimension is seamless, and all features presented so far carry over to the more general cases. In particular, the indexing system is made consistent across all dimensions:

In [103]: b = np.array([a, a * 2])
          b
Out[103]: array([[ 0. ,  0.5,  1. ,  1.5,  2. ],
                 [ 0. ,  1. ,  2. ,  3. ,  4. ]])
In [104]: b[0]  # first row
Out[104]: array([ 0. ,  0.5,  1. ,  1.5,  2. ])

In [105]: b[0, 2]  # third element of first row
Out[105]: 1.0
In [106]: b.sum()
Out[106]: 15.0
In contrast to our list object-based approach to constructing arrays, the numpy.ndarray class knows axes explicitly. Selecting either rows or columns from a matrix is essentially the same:

In [107]: b.sum(axis=0)  # sum along axis 0, i.e., column-wise sum
Out[107]: array([ 0. ,  1.5,  3. ,  4.5,  6. ])

In [108]: b.sum(axis=1)  # sum along axis 1, i.e., row-wise sum
Out[108]: array([  5.,  10.])
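Other aggregation methods accept the axis parameter in the same way; a brief sketch (restating the b array so it is self-contained) with column-wise means and row-wise running sums:

```python
# the axis parameter works uniformly across ndarray methods
import numpy as np

b = np.array([[0., 0.5, 1., 1.5, 2.],
              [0., 1., 2., 3., 4.]])

print(b.mean(axis=0))    # column-wise means
print(b.cumsum(axis=1))  # row-wise running sums
```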
There are a number of ways to initialize (instantiate) a numpy.ndarray object. One is as presented before, via np.array. However, this assumes that all elements of the array are already available. In contrast, one would maybe like to have the numpy.ndarray objects instantiated first, to populate them later with results generated during the execution of code. To this end, we can use the following functions:

In [109]: c = np.zeros((2, 3, 4), dtype='i', order='C')  # also: np.ones()
          c
Out[109]: array([[[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]],
                 [[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]]], dtype=int32)
In [110]: d = np.ones_like(c, dtype='f16', order='C')  # also: np.zeros_like()
          d
Out[110]: array([[[ 1.0,  1.0,  1.0,  1.0],
                  [ 1.0,  1.0,  1.0,  1.0],
                  [ 1.0,  1.0,  1.0,  1.0]],
                 [[ 1.0,  1.0,  1.0,  1.0],
                  [ 1.0,  1.0,  1.0,  1.0],
                  [ 1.0,  1.0,  1.0,  1.0]]], dtype=float128)
With all these functions we provide the following information:

shape
Either an int, a sequence of ints, or a reference to another numpy.ndarray

dtype (optional)
A numpy.dtype; these are NumPy-specific data types for numpy.ndarray objects

order (optional)
The order in which to store elements in memory: 'C' for C-like (i.e., row-wise) or 'F' for Fortran-like (i.e., column-wise)
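A few further common instantiation routines, not shown above but all standard NumPy, round out the picture:

```python
# additional standard ways to create ndarray objects
import numpy as np

a = np.arange(8)            # like range(), but returns an ndarray
b = a.reshape((2, 4))       # view the 8 elements as a 2 x 4 matrix
c = np.eye(3)               # 3 x 3 identity matrix
d = np.linspace(0., 1., 5)  # 5 evenly spaced points from 0 to 1

print(b.shape)  # (2, 4)
print(d)
```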
Here, it becomes obvious how NumPy specializes the construction of arrays with the numpy.ndarray class, in comparison to the list-based approach:

The shape/length/size of the array is homogeneous across any given dimension.
It allows only a single data type (numpy.dtype) for the whole array.

The role of the order parameter is discussed later in the chapter. Table 4-4 provides an overview of numpy.dtype objects (i.e., the basic data types NumPy allows).
Table 4-4. NumPy dtype objects

dtype   Description               Example
t       Bit field                 t4 (4 bits)
b       Boolean                   b (true or false)
i       Integer                   i8 (64 bit)
u       Unsigned integer          u8 (64 bit)
f       Floating point            f8 (64 bit)
c       Complex floating point    c16 (128 bit)
O       Object                    O (pointer to object)
S, a    String                    S24 (24 characters)
U       Unicode                   U24 (24 Unicode characters)
V       Other                     V12 (12-byte data block)
NumPy provides a generalization of regular arrays that loosens at least the dtype restriction, but let us stick with regular arrays for a moment and see what the specialization brings in terms of performance.

As a simple exercise, suppose we want to generate a matrix/array of shape 5,000 × 5,000 elements, populated with (pseudo)random, standard normally distributed numbers. We then want to calculate the sum of all elements. First, the pure Python approach, where we make heavy use of list comprehensions and functional programming methods as well as lambda functions:
In [111]: import random
          I = 5000

In [112]: %time mat = [[random.gauss(0, 1) for j in range(I)] \
                       for i in range(I)]  # a nested list comprehension
Out[112]: CPU times: user 36.5 s, sys: 408 ms, total: 36.9 s
          Wall time: 36.4 s
In [113]: %time reduce(lambda x, y: x + y, \
                       [reduce(lambda x, y: x + y, row) \
                       for row in mat])
Out[113]: CPU times: user 4.3 s, sys: 52 ms, total: 4.35 s
          Wall time: 4.07 s
          678.5908519876674
Let us now turn to NumPy
and see how the same problem is solved there. For convenience, the NumPy
sublibrary random
offers a multitude of functions to initialize a numpy.ndarray
object and populate it at the same time with (pseudo)random numbers:
In [114]: %time mat = np.random.standard_normal((I, I))
Out[114]: CPU times: user 1.83 s, sys: 40 ms, total: 1.87 s Wall time: 1.87 s
In [115]: %time mat.sum()
Out[115]: CPU times: user 36 ms, sys: 0 ns, total: 36 ms Wall time: 34.6 ms 349.49777911439384
We observe the following:

Syntax
    Although we use several approaches to compactify the pure Python code, the NumPy version is even more compact and readable.

Performance
    The generation of the numpy.ndarray object is roughly 20 times faster and the calculation of the sum is roughly 100 times faster than the respective operations in pure Python.
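The equivalence of the two approaches is easy to check directly. The following sketch uses a reduced size (I = 100, so it runs in a fraction of a second) and the built-in sum() in place of reduce (which lives in functools as of Python 3):

```python
import random
import numpy as np

I = 100  # much smaller than in the text, for a quick check
mat = [[random.gauss(0, 1) for j in range(I)] for i in range(I)]

# pure Python: nested sum over the list of lists
py_sum = sum(sum(row) for row in mat)

# NumPy: convert once, then one vectorized call
arr = np.array(mat)
np_sum = arr.sum()

# both approaches agree up to floating-point rounding
assert abs(py_sum - np_sum) < 1e-9
```

The numerical results coincide; only where the looping happens (Python level vs. compiled NumPy code) differs.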
The specialization of the numpy.ndarray class obviously brings a number of really valuable benefits with it. However, a too-narrow specialization might turn out to be too large a burden to carry for the majority of array-based algorithms and applications. Therefore, NumPy provides structured arrays that allow us to have a different NumPy data type per column, at least. What does "per column" mean? Consider the following initialization of a structured array object:
In [116]: dt = np.dtype([('Name', 'S10'), ('Age', 'i4'),
                         ('Height', 'f'), ('Children/Pets', 'i4', 2)])
          s = np.array([('Smith', 45, 1.83, (0, 1)),
                        ('Jones', 53, 1.72, (2, 2))], dtype=dt)
          s
Out[116]: array([('Smith', 45, 1.8300000429153442, [0, 1]),
                 ('Jones', 53, 1.7200000286102295, [2, 2])],
                dtype=[('Name', 'S10'), ('Age', '<i4'), ('Height', '<f4'),
                       ('Children/Pets', '<i4', (2,))])
In a sense, this construction comes quite close to the operation for initializing tables in a SQL
database. We have column names and column data types, with maybe some additional information (e.g., maximum number of characters per string
object). The single columns can now be easily accessed by their names:
In [117]: s['Name']
Out[117]: array(['Smith', 'Jones'], dtype='S10')
In [118]: s['Height'].mean()
Out[118]: 1.7750001
Having selected a specific row and record, respectively, the resulting objects mainly behave like dict
objects, where one can retrieve values via keys:
In [119]: s[1]['Age']
Out[119]: 53
In summary, structured arrays are a generalization of the regular numpy.ndarray
object types in that the data type only has to be the same per column, as one is used to in the context of tables in SQL
databases. One advantage of structured arrays is that a single element of a column can be another multidimensional object and does not have to conform to the basic NumPy
data types.
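A short sketch of this flexibility, reusing the record layout from the example above:

```python
import numpy as np

dt = np.dtype([('Name', 'S10'), ('Age', 'i4'),
               ('Height', 'f'), ('Children/Pets', 'i4', 2)])
s = np.array([('Smith', 45, 1.83, (0, 1)),
              ('Jones', 53, 1.72, (2, 2))], dtype=dt)

# column-wise access works like a SQL projection
ages = s['Age']

# the multidimensional field is itself a 2 x 2 sub-array
pets = s['Children/Pets']
print(pets.shape)        # (2, 2)

# aggregate over a column, just as with a regular array
print(s['Age'].mean())   # 49.0
```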
NumPy
provides, in addition to regular arrays, structured arrays that allow the description and handling of rather complex arrayoriented data structures with a variety of different data types and even structures per (named) column. They bring SQL
tablelike data structures to Python
, with all the benefits of regular numpy.ndarray
objects (syntax, methods, performance).
Vectorization of code is a strategy to get more compact code that is possibly executed faster. The fundamental idea is to conduct an operation on or to apply a function to a complex object “at once” and not by iterating over the single elements of the object. In Python
, the functional programming tools map
, filter
, and reduce
provide means for vectorization. In a sense, NumPy
has vectorization built in deep down in its core.
As we learned in the previous section, simple mathematical operations can be implemented on numpy.ndarray objects directly. For example, we can add two NumPy arrays element-wise as follows:
In [120]: r = np.random.standard_normal((4, 3))
          s = np.random.standard_normal((4, 3))
In [121]: r + s
Out[121]: array([[1.94801686, 0.6855251 , 2.28954806], [ 0.33847593, 1.97109602, 1.30071653], [1.12066585, 0.22234207, 2.73940339], [ 0.43787363, 0.52938941, 1.38467623]])
NumPy
also supports what is called broadcasting. This allows us to combine objects of different shape within a single operation. We have already made use of this before. Consider the following example:
In [122]: 2 * r + 3
Out[122]: array([[ 2.54691692, 1.65823523, 8.14636725], [ 4.94758114, 0.25648128, 1.89566919], [ 0.41775907, 0.58038395, 2.06567484], [ 0.67600205, 3.41004636, 1.07282384]])
In this case, the r object is multiplied by 2 element-wise and then 3 is added element-wise; the 3 is broadcast, or stretched, to the shape of the r object. Broadcasting works with differently shaped arrays as well, up to a certain point:
In [123]: s = np.random.standard_normal(3)
          r + s
Out[123]: array([[ 0.23324118, 1.09764268, 1.90412565], [ 1.43357329, 1.79851966, 1.22122338], [0.83133775, 1.63656832, 1.13622055], [0.70221625, 0.22173711, 1.63264605]])
This broadcasts the one-dimensional array of size 3 to a shape of (4, 3). The same does not work, for example, with a one-dimensional array of size 4:
In [124]: s = np.random.standard_normal(4)
          r + s
Out[124]: ValueError: operands could not be broadcast together with shapes (4,3) (4,)
However, transposing the r
object makes the operation work again. In the following code, the transpose
method transforms the ndarray
object with shape (4, 3) into an object of the same type with shape (3, 4):
In [125]: r.transpose() + s
Out[125]: array([[0.63380522, 0.5964174 , 0.88641996, 0.86931849], [1.07814606, 1.74913253, 0.9677324 , 0.49770367], [ 2.16591995, 0.92953858, 1.71037785, 0.67090759]])
In [126]: np.shape(r.T)
Out[126]: (3, 4)
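Transposing r is not the only remedy: reshaping the size-4 vector into a column vector of shape (4, 1) also satisfies the broadcasting rules, so each row of r gets its own scalar added. A small sketch (with fresh sample data, so the numbers differ from those above):

```python
import numpy as np

r = np.random.standard_normal((4, 3))
s = np.random.standard_normal(4)

# s.reshape(-1, 1) has shape (4, 1); broadcasting stretches it
# across the three columns of r
res = r + s.reshape(-1, 1)
print(res.shape)  # (4, 3)

# each row i of r gets s[i] added to all of its elements
assert np.allclose(res[0], r[0] + s[0])
```

This variant keeps the result in the original (4, 3) orientation instead of the transposed one.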
As a general rule, custom-defined Python functions work with numpy.ndarray objects as well. If the implementation allows, arrays can be used with functions just as int or float objects can. Consider the following function:
In [127]: def f(x):
              return 3 * x + 5
We can pass standard Python
objects as well as numpy.ndarray
objects (for which the operations in the function have to be defined, of course):
In [128]: f(0.5)  # float object
Out[128]: 6.5
In [129]: f(r)  # NumPy array
Out[129]: array([[ 4.32037538, 2.98735285, 12.71955087], [ 7.9213717 , 0.88472192, 3.34350378], [ 1.1266386 , 1.37057593, 3.59851226], [ 1.51400308, 5.61506954, 2.10923576]])
What NumPy does is simply apply the function f to the object element-wise. In that sense, by using this kind of operation we do not avoid loops; we only avoid them on the Python level and delegate the looping to NumPy. On the NumPy level, looping over the numpy.ndarray object is taken care of by highly optimized code, most of it written in C and therefore generally much faster than pure Python. This explains the "secret" behind the performance benefits of using NumPy for array-based use cases.
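That the vectorized call and an explicit Python-level loop produce exactly the same numbers can be verified directly; a quick sketch with fresh sample data:

```python
import numpy as np

def f(x):
    return 3 * x + 5

r = np.random.standard_normal((4, 3))

# NumPy applies the operations in f element-wise, in compiled code
vec = f(r)

# equivalent, but much slower, Python-level double loop
loop = np.array([[f(v) for v in row] for row in r])

assert np.allclose(vec, loop)
```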
When working with arrays, one has to take care to call the right functions on the respective objects. For example, the sin
function from the standard math
module of Python
does not work with NumPy
arrays:
In [130]: import math
          math.sin(r)
Out[130]: TypeError: only length-1 arrays can be converted to Python scalars
The function is designed to handle, for example, float objects—i.e., single numbers, not arrays. NumPy provides the respective counterparts as so-called ufuncs, or universal functions:
In [131]: np.sin(r)  # array as input
Out[131]: array([[0.22460878, 0.62167738, 0.53829193], [ 0.82702259, 0.98025745, 0.52453206], [0.96114497, 0.93554821, 0.45035471], [0.91759955, 0.20358986, 0.82124413]])
In [132]: np.sin(np.pi)  # float as input
Out[132]: 1.2246467991473532e16
NumPy
provides a large number of such ufuncs that generalize typical mathematical functions to numpy.ndarray
objects.^{[22]}
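The ufunc and a Python-level loop over math.sin agree numerically; only where the looping happens differs. A brief sketch:

```python
import math
import numpy as np

r = np.random.standard_normal(1000)

# pure Python: apply math.sin element by element
loop_result = np.array([math.sin(x) for x in r])

# NumPy ufunc: one call; the looping happens in compiled code
ufunc_result = np.sin(r)

assert np.allclose(loop_result, ufunc_result)
```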
Be careful when using the from library import * approach to importing. With this approach, the reference to the ufunc numpy.sin might be replaced by the reference to the math function math.sin (whichever is imported last wins). You should, as a rule, import both libraries by name to avoid confusion: import numpy as np; import math. Then you can use math.sin alongside np.sin.
When we first initialized numpy.ndarray objects by using numpy.zeros, we provided an optional argument for the memory layout. This argument specifies, roughly speaking, which elements of an array get stored in memory next to each other. When working with small arrays, this has hardly any measurable impact on the performance of array operations. However, when arrays get large the story is somewhat different, depending on the operations to be implemented on the arrays.
To illustrate this important point about the memory-wise handling of arrays in science and finance, consider the following construction of multidimensional numpy.ndarray objects:
In [133]: x = np.random.standard_normal((5, 10000000))
          y = 2 * x + 3  # linear equation y = a * x + b
          C = np.array((x, y), order='C')
          F = np.array((x, y), order='F')
          x = 0.0; y = 0.0  # memory cleanup
In [134]: C[:2].round(2)
Out[134]: array([[[0.51, 1.14, 1.07, ..., 0.2 , 0.18, 0.1 ], [1.22, 0.68, 1.83, ..., 1.23, 0.27, 0.16], [ 0.45, 0.15, 0.01, ..., 0.75, 0.91, 1.12], [0.16, 1.4 , 0.79, ..., 0.33, 0.54, 1.81], [ 1.07, 1.07, 0.37, ..., 0.76, 0.71, 0.34]], [[ 1.98, 0.72, 0.86, ..., 3.4 , 2.64, 3.21], [ 0.55, 4.37, 6.66, ..., 5.47, 2.47, 2.68], [ 3.9 , 3.29, 3.03, ..., 1.5 , 4.82, 0.76], [ 2.67, 5.8 , 1.42, ..., 2.34, 4.09, 6.63], [ 5.14, 0.87, 2.27, ..., 1.48, 4.43, 3.67]]])
Let’s look at some really fundamental examples and use cases for both types of ndarray
objects:
In [135]: %timeit C.sum()
Out[135]: 10 loops, best of 3: 123 ms per loop
In [136]: %timeit F.sum()
Out[136]: 10 loops, best of 3: 123 ms per loop
When summing up all elements of the arrays, there is no performance difference between the two memory layouts. However, consider the following example with the C-like memory layout:
In [137]: %timeit C[0].sum(axis=0)
Out[137]: 10 loops, best of 3: 102 ms per loop
In [138]: %timeit C[0].sum(axis=1)
Out[138]: 10 loops, best of 3: 61.9 ms per loop
Summing five large vectors and getting back a single large results vector obviously is slower in this case than summing 10,000,000 small ones and getting back an equal number of results. This is due to the fact that the single elements of the small vectors—i.e., the rows—are stored next to each other in memory. With the Fortran-like memory layout, the relative performance changes considerably:
In [139]: %timeit F.sum(axis=0)
Out[139]: 1 loops, best of 3: 801 ms per loop
In [140]: %timeit F.sum(axis=1)
Out[140]: 1 loops, best of 3: 2.23 s per loop
In [141]: F = 0.0; C = 0.0  # memory cleanup
In this case, operating on a few large vectors performs better than operating on a large number of small ones. The elements of the few large vectors are stored in memory next to each other, which explains the relative performance advantage. However, overall the operations are absolutely much slower when compared to the C-like variant.
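The memory layout of a given array can be inspected, and converted, at runtime via its flags attribute; a short sketch (array sizes chosen small here purely for illustration):

```python
import numpy as np

C = np.zeros((5, 1000), order='C')
F = np.zeros((5, 1000), order='F')

# each layout reports itself as contiguous in its own ordering
print(C.flags['C_CONTIGUOUS'])  # True
print(F.flags['F_CONTIGUOUS'])  # True

# convert between layouts when an algorithm prefers one of them
F2C = np.ascontiguousarray(F)   # C-ordered copy of F
print(F2C.flags['C_CONTIGUOUS'])  # True
```

Such a conversion copies the data, so it pays off only when the array is reused in many subsequent operations.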
Python provides, in combination with NumPy, a rich set of flexible data structures. From a finance point of view, the following can be considered the most important ones:

Basic data types
    The classes int, float, and string provide the atomic data types.

Standard data structures
    The classes tuple, list, dict, and set have many application areas in finance, with list being the most flexible workhorse in general.

Arrays
    For array-based use cases, NumPy provides the specialized class numpy.ndarray, which provides both convenience and compactness of code as well as high performance.
This chapter shows that both the basic data structures and the NumPy
ones allow for highly vectorized implementation of algorithms. Depending on the specific shape of the data structures, care should be taken with regard to the memory layout of arrays. Choosing the right approach here can speed up code execution by a factor of two or more.
This chapter focuses on those issues that might be of particular importance for finance algorithms and applications. However, it can only represent a starting point for the exploration of data structures and data modeling in Python
. There are a number of valuable resources available to go deeper from here.
Here are some Internet resources to consult:

The Python documentation is always a good starting point: http://www.python.org/doc/.

For details on NumPy arrays as well as related methods and functions, see http://docs.scipy.org/doc/.

The SciPy lecture notes are also a good source to get started: http://scipylectures.github.io/.
Good references in book form are:
^{[18]} The Cython library brings static typing and compiling features to Python that are comparable to those in C. In fact, Cython is a hybrid language of Python and C.
^{[19] }Here and in the following discussion, terms like float, float object, etc. are used interchangeably, acknowledging that every float is also an object. The same holds true for other object types.
^{[21] }It is not possible to go into details here, but there is a wealth of information available on the Internet about regular expressions in general and for Python
in particular. For an introduction to this topic, refer to Fitzgerald, Michael (2012): Introducing Regular Expressions. O’Reilly, Sebastopol, CA.
^{[22] }Cf. http://docs.scipy.org/doc/numpy/reference/ufuncs.html for an overview.