O'Reilly logo

Data Wrangling with Python by Katharine Jarmul, Jacqueline Kazil

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Chapter 4. Working with Excel Files

Unlike the previous chapter’s data, not all the data in this and the following chapter will easily import into Python without a little work. This is because some data formats were made to be machine readable, while others, such as the ones we’ll look at next, were meant to be interacted with through desktop tools. In this chapter and the next, we will look at two example file types—Excel files and PDFs—and provide some generic instructions to follow if you encounter a different file type.

So far in this book, the solutions you’ve learned about for importing data have been pretty standard. In this chapter, we will begin to learn about processes which will vary greatly each time you perform them. Although the processes are more difficult, the end goal is the same: to extract the useful information and put it into a usable format in Python.

The examples we use in this chapter and the next contain data from UNICEF’s 2014 report on The State of the World’s Children. The data is available in PDF and Excel format.

When you have to extract data from files in these more difficult formats, you might think there is someone out there who hates you, because it can be painful. We assure you in most cases, the person who generated the file with the data inside simply did not identify the importance of also releasing it in a machine-readable format.

Installing Python Packages

Before we can continue, we need to learn how to install external Python packages (or libraries). Up until this point, we were using Python libraries that came standard with Python when you installed it. Do you remember importing the csv and json packages in Chapter 3? Those were packages in the standard library—they came with your Python installation.

Python comes with a set of frequently used libraries. Because many libraries serve a niche purpose, you have to explicitly install them. This is so your computer doesn’t get bloated with every Python library available.

Python packages are collected in an online directory called PyPI, which stores the packages along with their metadata and any documentation.

In this chapter, we are looking at Excel files. If you visit PyPI in your browser, you can search for libraries relating to Excel and see lists of matching package results you can download. This is one way to explore which package you should use.

We will be using pip from this point forward to install packages. There are multiple ways to install pip, and you should have already done so in Chapter 1 of this book.

First, we will be evaluating Excel data. Let’s install the package to do that— xlrd. To install the package, we use pip in the following way:

pip install xlrd

To remove the package, we would run the uninstall command:

pip uninstall xlrd

Try installing, uninstalling, and then reinstalling xlrd. It’s good to get a handle on the pip commands, as you’ll be using them throughout this book and your data wrangling career.

Why did we choose xlrd when there are many possible packages? Choosing a Python library is an imperfect process. There are different ways to go about your selection. Don’t worry about trying to figure out what is the right library. When you are perfecting your skills and you find a couple of options, use the library that makes sense to you.

The first thing we recommend is searching the Web to see which libraries other people recommend. If you search for “parse excel using python”, you will find the xlrd library surfaces at the top of the search results.

However, the answer is not always obvious. In Chapter 13, we will learn more about the selection process when looking into Twitter libraries.

Parsing Excel Files

Sometimes the easiest way to extract data from an Excel sheet is finding a better way to get the data. There are times when parsing is not the answer. Before you start parsing, ask yourself the following questions:

  • Have you tried to find the data in another form? Sometimes other forms might be available from the same source.

  • Have you tried to use a phone to figure out if the data is available in another form? Check out Chapter 6 for more tips.

  • Have you tried to export the tab or tabs into CSV form from Excel (or your document reader)? This is a good solution if you only have a couple of tabs of data or isolated data in one tab on the Excel sheet.

If you have exhausted these options and you still don’t have the data you need, you’ll need to use Python to parse your Excel file.

Getting Started with Parsing

The library we identified for parsing Excel files is xlrd. This library is part of a series of libraries for working with Excel files in Python.

There are three main libraries for handling Excel files:

xlrd

Reads Excel files

xlwt

Writes and formats Excel files

xlutils

A set of tools for more advanced operations in Excel (requires xlrd and xlwt)

You’ll need to install each separately if you want to use them; however, in this chapter we will only use xlrd. Because we want to read Excel files into Python, you’ll need to make sure you have xlrd installed before continuing:

pip install xlrd
Warning

If you get the following error, that means you don’t have pip installed:

-bash: pip: command not found

For installation instructions, refer to “Install pip” or https://pip.pypa.io/en/latest/installing/.

Set up your work environment for this Excel file by doing the following (or something like it, depending on your organizational system):

  1. Create a folder for your Excel work.

  2. Create a new Python file called parse_excel.py and put it in the folder you created.

  3. Place the Excel file from the book’s repository called SOWC 2014 Stat Tables_Table 9.xlsx in the same folder.

From this folder, type the following command in your terminal to run the script from the command line:

python parse_excel.py

By the end of this chapter, we will write a script to parse child labor and marriage data stored in this Excel file.

To start our script, we need to import xlrd and open our Excel workbook in Python. We store the opened file in the book variable:

import xlrd

book = xlrd.open_workbook('SOWC 2014 Stat Tables_Table 9.xlsx')

Unlike CSVs, Excel books can have multiple tabs or sheets. To get at our data, we are going to pull out only the sheet with the data we want.

If you have a couple of sheets, you could just guess at the index, but that won’t work if you have lots of sheets. So, you should know about the command book.sheet_by_name(somename), where somename is the name of the sheet you want to access.

Let’s check out the names of the sheets we have:

import xlrd

book = xlrd.open_workbook('SOWC 2014 Stat Tables_Table 9.xlsx')

for sheet in book.sheets():
    print sheet.name

The sheet that we are looking for is Table 9. So, let’s put that into our script:

import xlrd

book = xlrd.open_workbook('SOWC 2014 Stat Tables_Table 9.xlsx')
sheet = book.sheet_by_name('Table 9')

print sheet

If you run that code, it exits with an error that provides you with the following information:

xlrd.biffh.XLRDError: No sheet named <'Table 9'>

At this point, you might be really confused. The problem lies in the difference between what we see and what actually exists.

If you open up your Excel workbook and select the name of the sheet by double-clicking it, you will find that there is an extra space at the end. This space is not visible to users in the browser. In Chapter 7, we will learn how to troubleshoot this in Python. For now, update your code to reflect the space.

Change this line:

sheet = book.sheet_by_name('Table 9')

to this:

sheet = book.sheet_by_name('Table 9 ')

Now, if we run our script it should work. You will see output similar to this:

<xlrd.sheet.Sheet object at 0x102a575d0>

Let’s explore what we can do with a sheet. Add the following after you assign the sheet variable and rerun your script:

print dir(sheet)

In the returned list, you’ll see a method called nrows. We will use this method to iterate over all rows. If we write print sheet.nrows, the total number of rows will be returned.

Try this now:

print sheet.nrows

You should have gotten back 303. We need to iterate over each row, which means we need a for loop. As we learned in “How to Import CSV Data”, for loops iterate over items in a list, so we need to turn 303 into a list we can iterate over 303 times. To do this, we will use the range function.

With the addition of the range function we can transform 303 into a list our for loop can iterate over, our script should look like the following:

import xlrd

book = xlrd.open_workbook('SOWC 2014 Stat Tables_Table 9.xlsx')
sheet = book.sheet_by_name('Table 9 ')

for i in range(sheet.nrows):         1
    print i                          2
1

Loops over the index i in range(303), which will be a list of 303 integers incrementing by one.

2

Outputs i, which will be the numbers from 0 to 302.

From here, we need to do a lookup on each of the rows to pull out the contents of each row instead of just printing the number. To do a lookup, we will use i as an index reference to take the nth row.

To get each row’s values we will use row_values, which was another method returned by dir(sheet) earlier. We can see from the row_values documentation that the method expects an index number and returns the corresponding row’s values. Update your for loop to reflect this and rerun your script:

for i in range(sheet.nrows):
    print sheet.row_values(i)        1
1

Uses i as the index to look up the row’s values. Because it is in a for loop that spans the length of the sheet, we call this method for each row in our sheet.

When you run this code, you will see a list for each row. The following is a subset of the data you’ll see:

['', u'TABLE 9. CHILD PROTECTION', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '']
['', '', u'TABLEAU 9. PROTECTION DE L\u2019ENFANT', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['', '', '', u'TABLA 9. PROTECCI\xd3N INFANTIL', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '']
['', u'Countries and areas', '', '', u'Child labour (%)+\n2005\u20132012*',
'', '', '', '', '', u'Child marriage (%)\n2005\u20132012*', '', '', '',
u'Birth registration (%)+\n2005\u20132012*', '', u'Female genital mutilation/
cutting (%)+\n2002\u20132012*', '', '', '', '', '', u'Justification of wife
beating (%)\n 2005\u20132012*', '', '', '', u'Violent discipline
(%)+\n2005\u20132012*', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '']

Now that we can see each row, we need to pull out the information we want. To help us determine what information we need and how to get it, it’s much easier to open up the file in a program for displaying Excel files, such as Microsoft Excel on Windows or Numbers on Mac. If you visit the second tab on the spreadsheet, you will notice quite a few header rows.

Note

In our code, we will aim to grab the English text. However, if you want an extra challenge, try to pull out the French or Spanish headings and countries.

On the second tab, look at the information you can extract and think about how to best organize it. We provide one possible way to do this here, but there are many different ways using different data structures.

For this exercise, we will pull out child labor and child marriage statistics. The following is one way to organize the data—we’ll use this as an example to work toward:

{
    u'Afghanistan': {
        'child_labor': {
            'female': [9.6, ''], 1
            'male': [11.0, ''],
            'total': [10.3, '']},
        'child_marriage': {
            'married_by_15': [15.0, ''],
            'married_by_18': [40.4, '']
        }
    },
    u'Albania': {
        'child_labor': {
            'female': [9.4, u'  '],
            'male': [14.4, u'  '],
            'total': [12.0, u'  ']},
        'child_marriage': {
            'married_by_15': [0.2, ''],
            'married_by_18': [9.6, '']
        }
    },
    ...
}
1

If you are viewing the data in Excel, some of these numbers might appear off. This is because Excel will often round numbers. We are showing the numbers you will find when you use Python to parse the cells.

Tip

Planning what you want the outcome to look like and writing an example of your data will save you time as you begin coding. Once you have identified how you’d like to format your data, you can ask yourself, “What do I need to do next to get there?” This is especially helpful when you feel blocked on your next step.

There are two Python constructs we are going to use to pull the data out. The first method we will use is a nested for loop, which is a for loop inside another for loop. This is often used when you have x rows that contain y objects. To access each object you need a for loop for each row, then another for loop for each object. We also used a nested for loop in an example in Chapter 3.

We are going to use a nested for loop to output each cell from each row. This will output the items we saw earlier, where each row was listed.

for i in xrange(sheet.nrows):
    row = sheet.row_values(i)       1
    for cell in row:                2
        print cell                  3
1

Takes the list that is each row and saves it to the row variable. This makes our code more readable.

2

Loops over each item in the list, which represents each cell for the current row.

3

Outputs the cell value.

If you run your complete code with the nested for loop, you will notice your output is not so helpful anymore. That brings us to the second mechanism to explore our Excel file—a counter.

Let’s add a counter to our code so we can step through the cells and rows to find what we want to pull out. Be careful where you place the counter—you will have very different results if you place it on the cell level versus the row level.

Reset your for loop to look like the code shown here:

count = 0
for i in xrange(sheet.nrows):
    if count < 10:
        row = sheet.row_values(i)
        print i, row                1

    count += 1
1

Outputs i and the row so we can actually see which row number has which information

Now, if we go back to what we want our final output to look like, what we really need to figure out is where the country names begin. Remember, the country names are the first keys of our output dictionary:

{
    u'Afghanistan': {...},
    u'Albania': {...},
    ...
}

If you run your script with the counter in it where count < 10, you will see from the output that we have not yet reached the row where the country names start.

Because we are skipping a few lines to get to the data we are interested in, we are looking to identify which row number we will need to start our data collection. From our previous attempt, we know the country names start past row 10. But how can we tell where to start?

The answer is in the next code example, but before you look, try updating the counter to start at the row where the country names start. (There are multiple ways to do this, so if your answer is slightly different than what we have in the following code example, that’s OK.)

After you identify the proper row number, you will need to add an if statement to begin pulling out values after that row. This is so we only work with the data below that line.

If you were able to get that working, your code should like something like this:

count = 0

for i in xrange(sheet.nrows):
    if count < 20:                          1
        if i >= 14:                         2
            row = sheet.row_values(i)
            print i, row
        count += 1
1

This line will iterate through the first 20 rows to identify which row the country names begin on.

2

This if statement starts the output at the point where the country rows appear.

At this point, you should have output that looks like this:

14 ['', u'Afghanistan', u'Afghanistan', u'Afganist\xe1n', 10.3, '', 11.0, '',
9.6, '', 15.0, '', 40.4, '', 37.4, '', u'\u2013', '', u'\u2013', '',
u'\u2013', '', u'\u2013', '', 90.2, '', 74.4, '', 74.8, '', 74.1, '', '', '',
'', '', '', '', '', '', '', '', '', '']
15 ['', u'Albania', u'Albanie', u'Albania', 12.0, u'  ', 14.4, u'  ', 9.4,
u'  ', 0.2, '', 9.6, '', 98.6, '', u'\u2013', '', u'\u2013', '', u'\u2013',
'', 36.4, '', 29.8, '', 75.1, '', 78.3, '', 71.4, '', '', '', '', '', '', '',
'', '', '', '', '', '']
16 ['', u'Algeria', u'Alg\xe9rie', u'Argelia', 4.7, u'y', 5.5, u'y', 3.9,
u'y', 0.1, '', 1.8, '', 99.3, '', u'\u2013', '', u'\u2013', '', u'\u2013', '',
u'\u2013', '', 67.9, '', 87.7, '', 88.8, '', 86.5, '', '', '', '', '', '', '',
'', '', '', '', '', '']
.... more

Now, we need to turn each row into our dictionary format. This will make the data more meaningful to us when we try to do other things with it in future chapters.

Looking back at our earlier example of how we want our output to be organized, we are going to need a dictionary and we are going to use countries as keys. To pull out the country names, we will need to do some indexing.

Let’s add a dictionary to our code, and then pull out the country name from each row and add it as a key to our dictionary.

Update your for loop to reflect this:

count = 0
data = {}                                   1

for i in xrange(sheet.nrows):
    if count < 10:
        if i >= 14:
            row = sheet.row_values(i)
            country = row[1]                2
            data[country] = {}              3
    count += 1

print data                                  4
1

This creates an empty dictionary to store our data.

2

row[1] pulls out the country from each row we iterate over.

3

data[country] adds the country as a key to the data dictionary. We set the value to another dictionary, because that is where we are going to store our data in the following steps.

4

This outputs the data, so we can see what it looks like.

At this point, your output should look something like this:

{u'Afghanistan': {}, u'Albania': {}, u'Angola': {}, u'Algeria': {},
u'Andorra': {}, u'Austria': {}, u'Australia': {}, u'Antigua and Barbuda': {},
u'Armenia': {}, u'Argentina': {}}

Now, we need to match up each of the values in the rest of the row with the proper values in the spreadsheet, then store them in our dictionary.

Note

As you try to pull out all the values and check them against your Excel sheet, you will make lots of errors. That is fine and expected. This process should be embraced—it means you’re working your way through the problem.

First, let’s create an empty version of our data structure where we can store our data. Let’s also remove our counter, as we know that the rows of data start at line 14. Because we know xrange can accept a start and end point, we can begin our counting at 14 and end at the end of the file. Let’s take a look at our updated code:

data = {}

for i in xrange(14, sheet.nrows):   1
    row = sheet.row_values(i)
    country = row[1]

    data[country] = {               2
        'child_labor': {            3
            'total': [],            4
            'male': [],
            'female': [],
        },
        'child_marriage': {
            'married_by_15': [],
            'married_by_18': [],
        }
    }

print data['Afghanistan']           5
1

We can remove all of our references to the counter and just begin our for loop starting at the 14th row of our sheet. This line begins the loop with a value of 14, so we automatically skip the lines we don’t need for our dataset.

2

This line expands the dictionary to multiple lines to fill out the other data points.

3

This creates the key child_labor and sets it equal to another dictionary.

4

The dictionary has strings to explain each part of the data it holds. For each of these keys, the values are lists.

5

This outputs the values associated with the key Afghanistan.

Our output data for Afghanistan looks like this:

{
    'child_labor': {'total': [], 'male': [], 'female': []},
    'child_marriage': {'married_by_18': [], 'married_by_15': []}
}

Let’s now populate the data. Because we have access to each column of each row using the index, we can populate these lists with the values from the sheet. By looking at our sheet and lining up which columns relate to which parts of the data, we can update the data dictionary to reflect the following:

    data[country] = {
        'child_labor': {
            'total': [row[4], row[5]],              1
            'male': [row[6], row[7]],
            'female': [row[8], row[9]],
        },
        'child_marriage': {
            'married_by_15': [row[10], row[11]],
            'married_by_18': [row[12], row[13]],
        }
    }
1

Because there are two cells for each of the columns, our code stores both values. Because in this line our child labor totals are the fifth and sixth columns and we know Python is zero-indexed, we want the fourth and fifth indexes.

When we run our code again, we get output like this:

{
  'child_labor': {'female': [9.6, ''], 'male': [11.0, ''], 'total': [10.3, '']},
  'child_marriage': {'married_by_15': [15.0, ''], 'married_by_18': [40.4, '']}}
}
Warning

Before you continue, output a couple of records and check the number in the dictionary. It is easy to end up one index off and ruin the rest of your data.

Finally, to preview our data, we can use pprint instead of a print statement. In complicated data structures (like dictionaries), this makes it a lot easier to review the output. Add the following to the end of your file to preview the data in a formatted fashion:

import pprint               1
pprint.pprint(data)         2
1

Imports the pprint library. Normally, import statements come at the beginning of the file, but we are putting it here for simplicity. After you are done, you will want to delete these lines, because they are not critical to your script.

2

Passes data to the pprint.pprint() function.

If you scroll through your output, you will notice the majority of it looks good. But there are a couple of records that seem out of place.

If you look at the spreadsheet, you should note the last row for countries is Zimbabwe. So, we want to look for when the country is equal to 'Zimbabwe', and exit there. To exit, we add a break to our code, which is how we prematurely break the for loop to continue with the rest of script. Let’s add that as our stopping point. At the end of the for loop, add the following and rerun your code:

        if country == 'Zimbabwe':   1
            break                   2
1

If the country is equal to Zimbabwe…

2

Exits out of the for loop.

Warning

After adding the break, did you end up with a NameError: name 'country' is not defined error? If so, check your indentation. The if statement should be indented four spaces to be in the for loop.

Stepping through code can be helpful in identifying an issue. If you need to troubleshoot to figure out what a variable, such as country, is equal to in a for loop, try adding print statements inside the for loop and watching the values before your script exits with an error. They will likely give you a hint as to what is happening.

At this point, our script’s output matches our end goal. The last thing we want to do to our script is to make sure we document it with some comments.

Your script should now look something like this:

"""
    This is a script to parse child labor and child marriage data.      1
    The Excel file used in this script can be found here:
        http://www.unicef.org/sowc2014/numbers/
"""

import xlrd
book = xlrd.open_workbook('SOWC 2014 Stat Tables_Table 9.xlsx')

sheet = book.sheet_by_name('Table 9 ')

data = {}
for i in xrange(14, sheet.nrows):
    # Start at 14th row, because that is where the country data begins   2

    row = sheet.row_values(i)

    country = row[1]

    data[country] = {
        'child_labor': {
            'total': [row[4], row[5]],
            'male': [row[6], row[7]],
            'female': [row[8], row[9]],
        },
        'child_marriage': {
            'married_by_15': [row[10], row[11]],
            'married_by_18': [row[12], row[13]],
        }
    }

    if country == 'Zimbabwe':
        break

import pprint
pprint.pprint(data)                                                     3
1

This is a multiline comment used to generally describe what is going on in this script.

2

This is a single-line comment to document why we start on line 14 and not earlier.

3

We can and should remove these lines as we move beyond simple parsing of data into data analysis.

At this point, we have a similar output to the previous chapter’s data. In the next chapter, we will take this a step further and parse the same data from a PDF.

Summary

The Excel format is an odd in-between category that is kind of machine readable. Excel files were not meant to be read by programs, but they are parsable.

To handle this nonstandard format, we had to install external libraries. There are two ways to find libraries: by looking on PyPI, the Python package index, or by searching for tutorials and how-tos to see what other people have done.

Once you have identified the library you want to install, use the pip install command to do it; to uninstall a library, use pip uninstall.

Besides learning how to parse Excel using the xlrd library, we also learned a few new Python programming concepts, which are summarized in Table 4-1.

Table 4-1. New Python programming concepts
Concept Purpose

range and xrange

This turns a number into a consecutive list of numbers. Example: range(3) will output [0, 1, 2].

Counting starts at 0, not 1

This is a computer construct to be aware of; it occurs throughout programming. It is important to note when using range, indexing, or slicing.

Indexing and slicing

Use this to pull out a specific subset of a string or list.

Counters

Use this as a tool to control for loops.

Nested for loops

Use when iterating over a data structure within a data structure, such as a list of lists, a list of dictionaries, or a dictionary of dictionaries.

pprint

pprint is a way to output data into the terminal in a nice format. This is good to use when programming with complicated data structures.

break

You can exit from a for loop early by using break. This will stop executing the loop and continue on to the next part of the script.

Commenting

It is important to keep all your code commented, so you know what is happening for future reference.

As you read on and dig into PDFs, you will learn the importance of exploring alternatives to the data you have or finding alternate ways to locate and find data you need to answer your research questions.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required