
Data Wrangling with Python by Katharine Jarmul, Jacqueline Kazil

Chapter 1. Introduction to Python

Whether you are a journalist, an analyst, or a budding data scientist, you likely picked up this book because you want to learn how to analyze data programmatically, summarize your findings, and clearly communicate those findings to others. You might show your findings in a report, a graphic, or summarized statistics. Essentially, you are trying to tell a story.

Traditional storytelling or journalism often uses an individual story to paint a relatable face on overall findings or trends. In that type of storytelling, the data becomes a secondary feature. However, other storytellers, such as Christian Rudder, author of Dataclysm (Broadway Books) and one of the founders of OkCupid, argue the data itself is and should be the primary subject.

To begin, you need to identify the topic you want to explore. Perhaps you are interested in exploring communication habits of different people or societies, in which case you might start with a specific question (e.g., what are the qualities of successful information sharing among people on the Web?). Or you might be interested in historical baseball statistics and question whether they show changes in the game over time.

After you have identified your area of interest, you need to find data you can examine to explore your topic further. In the case of human behavior, you could investigate what people share on Twitter, drawing data from the Twitter API. If you want to delve into baseball history, you could use Sean Lahman’s Baseball Database.

The Twitter and baseball datasets are examples of large, general datasets which should be filtered and analyzed in manageable chunks to answer your specific questions. Sometimes smaller datasets are just as interesting and meaningful, especially if your topic touches on a local or regional issue. Let’s consider an example.

While writing this book, one of the authors read an article about her public high school (see footnote 1), which had reportedly begun charging a $20 fee to graduating seniors and $200 a row for prime seating at the graduation ceremony.

According to the local news report, “the new fees are a part of an effort to cover an estimated $12,000 in graduation costs for Manatee High School after the financially strapped school district pulled its $3,400 contribution this year.”

The article explains the reason why the graduation costs are so high in comparison to the school district’s budget. However, it does not explain why the school district was unable to make its usual contribution. The question remained: Why is the Manatee County School District so financially strapped that it cannot make its regular contribution to the graduating class?

The initial questions you have in your investigation will often lead to deeper questions that define a problem. For example: What has the district been spending money on? How have the district’s spending patterns changed over time?

Identifying our specific topic area and the questions we want to answer allows us to identify the data we will need to find. After formulating these questions, the first dataset we need to look for is the spending and budget data for the Manatee County School District.

Before we continue, let’s look at a brief overview of the entire process, from initial identification of a problem all the way to the final story (see Figure 1-1).

Once you have identified your questions, you can begin to ask questions about your data, such as: Which datasets best tell the story I want to communicate? Which datasets explore the subject in depth? What is the overall theme? What are some datasets associated with those themes? Who might be tracking or keeping this data? Are these datasets publicly available?

Tip

When you begin the storytelling process, you should focus on researching the questions you want to answer. Then you can figure out which datasets are most valuable to you. In this initial stage, don’t get too caught up in the tools you’ll use to analyze the data or the data wrangling process.

Figure 1-1. Data handling process

Once you have identified the datasets you want and acquired them, you’ll need to get them into a usable format. In Chapters 3, 4, and 5, you will learn various techniques for programmatically acquiring data and transforming data from one form to another. Those chapters also show how to extract data from CSV, Excel, XML, JSON, and PDF files. Chapter 6 will look at some of the logistics behind human-to-human interaction with regard to data acquisition and lightly touch on legalities. In Chapters 11, 12, and 13, you will learn how to extract data from websites and APIs.

Note

If you don’t recognize some of these acronyms, don’t worry! They will be explained thoroughly as we encounter them, as will other technical terms with which you may not be familiar.

After you have acquired and transformed the data, you will begin your initial data exploration. Here, you will seek stories the data might expose—all while determining what is useful and what can be thrown away. You will play with the data by manipulating it into groups and looking at trends among the fields. Then you’ll combine datasets to connect the dots and expose larger trends and uncover underlying inconsistencies. Through this process you will learn how to clean the data and identify and resolve issues hidden in your datasets.

While learning how to parse and clean data in Chapters 7 and 8, you will not only use Python but also explore other open source tools. As we cover data issues you may encounter, you will learn how to determine whether to write a cleanup script or use a ready-made approach. In Chapter 7, we’ll cover how to fix common errors such as duplicate records, outliers, and formatting problems.

After you have identified the story you want to tell, cleaned the data, and processed it, we will explore how to present the data using Python. You will learn to tell the story in multiple formats and compare different publication options. In Chapter 10, you will find basic means of presenting and organizing data on a website.

Chapter 14 will help you scale your data-analysis processes to cover more data in less time. We will analyze methods to store and access your data, and review scaling your data in the cloud.

Chapter 14 will also cover how to take a one-off project and automate it so the project can drive itself. By automating the processes, you can take what would be a one-time special report and make it an annual one. This automation lets you focus on refining your storytelling process, move on to another story, or at least refill your coffee.

Throughout this book the main tool used is the Python programming language. It will help us work through each part of the storytelling process, from initial exploration to standardization and automation.

Why Python

There are many programming languages, so why does this book use Python? Depending on what your background is, you may have heard of one or more of the following alternatives: R, MATLAB, Java, C/C++, HTML, JavaScript, and Ruby. Each of these has one or more primary uses, and some of them can be used for data wrangling. You can also execute a data wrangling process in a program like Excel. You can often program Excel and Python to give you the same output, but one will likely be more efficient. In some cases, though, a program like Excel can’t handle the task. We chose Python over the other options because Python is easy to get started with and handles data wrangling tasks in a simple and straightforward way.

If you would like to learn the more technical labeling and classification of Python and other languages, check out Appendix A. Those explanations will enable you to converse with other analysts or developers about why you’re using Python. We believe that, as a new developer, you will benefit from Python’s accessibility, and we hope this book will be one of many useful references in your data wrangling toolbox.

Aside from the benefits of Python as a language, it also has one of the most open and helpful communities. No community is perfect, but the Python community works to create a supportive environment for newcomers: sometimes this is with locally hosted tutorials, free classes, and meetups, and at other times it is with larger conferences that bring people together to solve problems and share knowledge.

Having a larger community has obvious benefits—there are people who can answer your questions, people who can help brainstorm your code’s or module’s structure, people you can learn from, shared code you can build upon. To learn more, check out Appendix B.

The community exists because people support it. When you are first starting out with Python, you will take from the community more than you contribute. However, there is quite a lot the greater community can learn from individuals who are not experts. We encourage you to share your problems and solutions. This will help the next person who has the same problems, and you may uncover a bug that needs to be addressed in an open source tool.

Note

Many members of the Python community no longer have the fresh eyes you currently possess. As you begin typing Python, you should consider yourself part of the programming community. Your contributions are as valuable as those of the individuals who have been programming for 20 years.

Without further ado, let’s get started with Python!

Getting Started with Python

Your initial steps with programming are the most difficult (not dissimilar to the first steps you take as a human!). Think about times you started a new hobby or sport. Getting started with Python (or any other programming language) will share some similar angst and hiccups. Perhaps you are lucky and have an amazing mentor to help you through the first stages. If not, maybe you have experience taking on similar challenges. Regardless of how you get through the initial steps, if you do encounter difficulties, remember this is often the hardest part.

Note

We hope this book can be a guide for you, but it’s no substitute for good mentorship or broader experiences with Python. Along the way, we’ll provide tips on some resources and places to look if a problem you encounter isn’t addressed.

To avoid getting bogged down in an extensive or advanced setup, we will use a very minimal initial setup for our Python environment. In the following sections, we will select a Python version, install Python and a tool to help us with external code and libraries, and install a code editor so we can write and run our code.

Which Python Version

You will need to choose which version of Python to use. Python versions are actually versions of something called the Python interpreter. The interpreter allows you to read, write, and run Python on your computer. Wikipedia describes it as follows:

In computer science, an interpreter is a computer program that directly executes, i.e. performs, instructions written in a programming or scripting language, without previously compiling them into a machine language program.

No one is going to ask you to memorize this definition, so don’t worry if you do not completely understand this. When Jackie first got started in programming, this was the part in introductory books where she felt that she would never get anywhere, because she didn’t understand what “batch compiling” meant. If she didn’t understand that, how could she program? We will talk about compiling later, but for now let’s summarize the definition like so:

An interpreter is the computer program that reads and executes your Python code.

There are two major Python versions (or interpreters), Python 2.X and Python 3.X. The most recent version of Python 2.X is 2.7, which is the Python version used in this book. At the time of writing, the most recent version of Python 3.X is Python 3.5, which is also the newest Python version available. For now, assume code you write for 2.7 will not work in 3.X. Programmers describe this by saying that Python 3 breaks backward compatibility.
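To make the compatibility break concrete, here is a minimal sketch that asks whichever interpreter runs it to parse two spellings of the same line. The helper name compiles and the two sample strings are illustrative assumptions rather than code from this book. The sketch relies only on the built-in compile function.

```python
import sys


def compiles(source):
    """Return True if this interpreter can parse the given source text."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


py2_style = 'print "hello"'        # print statement: Python 2 only
portable_style = 'print("hello")'  # parses on both 2.7 and 3.X

major = sys.version_info[0]
py2_ok = compiles(py2_style)
portable_ok = compiles(portable_style)

print("Running Python %d" % major)
print("print statement accepted: %s" % py2_ok)
print("print() call accepted:    %s" % portable_ok)
```

On Python 2.7 both lines parse. On Python 3.X the first is rejected with a SyntaxError, which is one visible symptom of the break.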

You can write code to work with both 2.7 and 3.4; however, this is not a requirement nor the focus of this book. Getting preoccupied with doing this at the beginning is like living in Florida and worrying about how to drive in snow. One day, you might need this skill, but it’s not a concern at this point in time.

Some people reading this book are probably asking themselves why we decided to use Python 2.7 and not Python 3.4. This is a highly debated topic within the Python community. Python 2.7 is a well-utilized release, while 3.X is currently being adopted. We want to make sure you can find easy-to-read and easy-to-access resources and that your operating system and services support the Python version you use.

Note

Quite a lot of the code written in this book will work with Python 3. If you’d like to try out some of the examples with Python 3, feel free; however, we’d rather you focus on learning Python 2.7 and move on to Python 3 after completing this book. For more information on the changes required to make code Python 3–compliant, take a look at the change documentation.

As you move through this book, you will use both self-written code and code written by other (awesome) people. Most of these external pieces of code will work for Python 2.7, but might not yet work for 3.4. If you were using Python 3, you would have to rewrite them—and if you spend a lot of time rewriting and editing every piece of code you touch, it will be very difficult to finish your first project.

Think of your first pieces of code like a rough draft. Later, you can go back and improve them with further revisions. For now, let’s begin by installing Python.

Setting Up Python on Your Machine

The good news is Python can run on any operating system. The bad news is not all operating systems have the same setup. There are two major operating systems we will discuss, in order of popularity with respect to programming Python: Mac OS X and Windows. If you are running Mac OS X or Linux, you likely already have Python installed. For a more complete installation, we recommend searching the Web for your flavor of Linux along with “advanced Python setup” for more advice.

Note

OS X and Linux are a bit easier to install and run Python code on than Windows. For a deeper understanding of why these differences exist, we recommend reading the history of Windows versus Unix-based operating systems. Compare the Unix-favoring view presented in Hadeel Tariq Al-Rayes’s “Studying Main Differences Between Linux & Windows Operating Systems” to Microsoft’s “Functional Comparison of UNIX and Windows”.

If you use Windows, you should be able to execute all the code; however, Windows setups may need additional installation for code compilers, additional system libraries, and environment variables.

To set up your computer to use Python, follow the instructions for your operating system. We will run through a series of tests to make sure things are working for you the way they should before moving on to the next chapter.

Mac OS X

Start by opening up Terminal, which is a command-line interface that allows you to interact with your computer. When PCs were first introduced, command-line interfaces were the only way to interact with computers. Now most people use graphical interface operating systems, as they are more easily accessible and widely distributed.

There are two ways to find Terminal on your machine. The first is through OS X’s Spotlight. Click on the Spotlight icon—the magnifying glass in the upper-right corner of your screen—and type “Terminal.” Then select the option that comes up next to the Applications classification.

After you select it, a little window will pop up that looks like Figure 1-2 (note that your version of Mac OS X might look different).

You can also launch Terminal through the Finder. Terminal is located in your Utilities folder: Applications → Utilities → Terminal.

After you select and launch Terminal, you should see something like Figure 1-3.

At this time it is a good idea to create an easily accessible shortcut to Terminal in a place that works well for you, like in the Dock. To do so, simply right-click on the Terminal icon in your Dock and choose Options and then “Keep in Dock.” Each time you execute an exercise in this book, you will need to access Terminal.

Figure 1-3. A newly opened Terminal window

And you’re done. Macs come with Python preinstalled, which means you do not need to do anything else. If you’d like to get your computer set up for future advanced library usage, take a look at Appendix D.

Windows 8 and 10

Windows does not come with Python installed, but Python has a special Windows installer. You’ll need to determine if you are running 32- or 64-bit Windows. If you are running 64-bit Windows, you will need to download the x86-64 MSI Installer from the downloads page. If not, you can use the x86 MSI Installer.

Once you have downloaded the installer, simply double-click on it and step through the prompts to install. We recommend installing for all users. Click on the boxes next to the options to select them all, and also choose to install the feature on your hard drive (see Figure 1-4).

After you’ve successfully installed Python, you’ll want to add Python to your environment settings. This allows you to interact with Python in your cmd utility (the Windows command-line interface). To do so, simply search your computer for “environment variable.” Select the option “Edit the system environment variables,” then click the Environment Variables…button (see Figure 1-5).

Figure 1-4. Adding features using the installer

Figure 1-5. Editing environment variables

Scroll down in the “System variables” list and select the Path variable, then click “Edit.” (If you don’t have a Path variable listed, click “New” to create a new one.)

Add this to the end of your Path value, ensuring you have a semicolon separating each of the paths (including at the end of the existing value, if there was one):

C:\Python27;C:\Python27\Lib\site-packages\;C:\Python27\Scripts\;

The end of your Path variable should look similar to Figure 1-6. Once you are done editing, click “OK” to save your settings.

Figure 1-6. Adding Python to Path
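If you want to double-check a Path value once Python is running, the semicolon-separated format described above is easy to take apart. The following is a rough sketch under stated assumptions: the function names are made up for illustration, and sample_path is a hypothetical value (one pre-existing Windows entry followed by the three Python entries), not something read from your machine.

```python
def path_entries(path_value, separator=";"):
    """Split a Path-style string into its non-empty entries."""
    return [entry for entry in path_value.split(separator) if entry]


def normalize(entry):
    """Lowercase an entry and drop trailing backslashes for comparison."""
    return entry.rstrip("\\").lower()


def has_entry(path_value, wanted):
    """Report whether `wanted` appears among the entries of `path_value`."""
    return normalize(wanted) in [normalize(e) for e in path_entries(path_value)]


# Hypothetical Path value: one existing entry plus the three Python entries.
sample_path = (
    "C:\\Windows\\system32;"
    "C:\\Python27;C:\\Python27\\Lib\\site-packages\\;C:\\Python27\\Scripts\\;"
)

for wanted in ["C:\\Python27", "C:\\Python27\\Scripts"]:
    print("%s on Path: %s" % (wanted, has_entry(sample_path, wanted)))
```

To inspect your real setting instead of the sample, you could add import os and pass os.environ["PATH"] in place of sample_path.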

Test Driving Python

At this point, you should be on the command line (Terminal or cmd; see footnote 2) and ready to launch Python. You should see a line ending with a $ on a Mac or a > on Windows. After that prompt, type python, and press the Return (or Enter) key:

$ python

If everything is working correctly, you should receive a Python prompt (>>>), as seen in Figure 1-7.

Figure 1-7. Python prompt

For Windows users, if you don’t see this prompt, make sure your Path variable is properly set up (as described in the preceding section) and everything installed correctly. If you’re using the 64-bit version, you may need to uninstall Python (you can use the install MSI you downloaded to modify, uninstall, and repair your installation) and try installing the 32-bit version. If that doesn’t work, we recommend searching for the specific error you see during the installation.

>>> Versus $ or >

The Python prompt is different from the system prompt ($ on Mac/Linux, > on Windows). Beginners often make the mistake of typing Python commands into the default terminal prompt and typing terminal commands into the Python interpreter. This will always return errors. If you receive an error, keep this in mind and check to make sure you are entering Python commands only in the Python interpreter.

If you type a command into your Python interpreter that should be typed in your system terminal, you will probably get a NameError or SyntaxError. If you type a Python command into your system terminal, you will probably get a shell error such as “command not found.”
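As a small illustration of the first case, the sketch below feeds two terminal commands used in this chapter to Python and reports which kind of error comes back. The classify helper is a hypothetical name introduced for this example. It uses only the built-in compile and exec.

```python
def classify(command):
    """Name the kind of error Python reports for a line of text."""
    try:
        code = compile(command, "<stdin>", "exec")
    except SyntaxError:
        # Python could not even parse the line.
        return "SyntaxError"
    try:
        exec(code, {})
    except NameError:
        # The line parsed, but it refers to a name Python does not know.
        return "NameError"
    return "no error"


for command in ["pwd", "cd ~/Downloads"]:
    print("%-16s -> %s" % (command, classify(command)))
```

Here pwd parses as an ordinary name that was never defined, so it produces a NameError, while cd ~/Downloads is not valid Python at all and produces a SyntaxError.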

When the Python interpreter starts, we’re given a few helpful lines of information. One of those helpful hints shows the Python version we are using (Figure 1-7 shows Python 2.7.5). This is important in the troubleshooting process, as sometimes there are commands or tools you can use with one Python version that don’t work in another.
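The version is also available from inside the interpreter, which can be handy when the startup lines have scrolled out of view. This is a minimal sketch using the standard library's sys and platform modules. The variable names and the printed messages are our own.

```python
import platform
import sys

version_text = platform.python_version()  # for example "2.7.5"
major, minor = sys.version_info[0], sys.version_info[1]

print("Interpreter version: %s" % version_text)
if (major, minor) == (2, 7):
    print("This matches the version used in this book.")
else:
    print("This differs from 2.7; some examples may need changes.")
```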

Now, let’s test our Python installation by using a quick import statement. Type the following into your Python interpreter:

import sys
import pprint
pprint.pprint(sys.path)

The output you should receive is a list of directories or locations on your computer. This list shows where Python is looking for Python files. This set of commands can be a useful tool when you are trying to troubleshoot Python import errors.

Here is one example output (your list will be a little different from this; note also that some lines have been wrapped to fit this book’s page constraints):

['',
 '/usr/local/lib/python2.7/site-packages/setuptools-4.0.1-py2.7.egg',
 '/usr/local/lib/python2.7/site-packages/pip-1.5.6-py2.7.egg',
 '/usr/local/Cellar/python/2.7.7_1/Frameworks/Python.framework/Versions/2.7/
   lib/python27.zip',
 '/usr/local/Cellar/python/2.7.7_1/Frameworks/Python.framework/Versions/2.7/
   lib/python2.7',
 '/usr/local/Cellar/python/2.7.7_1/Frameworks/Python.framework/Versions/2.7/
   lib/python2.7/lib-tk',
 '/Library/Python/2.7/site-packages',
 '/usr/local/lib/python2.7/site-packages']
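When troubleshooting, it can help to go one step beyond printing the list. The sketch below is one possible approach, not a required step. It separates the entries that are real directories on your machine from the rest, then shows where one standard module was loaded from. The variable names are illustrative.

```python
import os
import pprint
import sys

# Keep the entries that are real directories on this machine. The empty
# string '' stands for the current directory and is skipped here.
existing_dirs = [entry for entry in sys.path if entry and os.path.isdir(entry)]

# Everything else: for example zip archives or folders that do not exist.
other_entries = [entry for entry in sys.path
                 if entry and not os.path.isdir(entry)]

print("%d directories, %d other entries"
      % (len(existing_dirs), len(other_entries)))

# A module loaded from a file records that file's location in __file__.
print("pprint was loaded from: %s" % pprint.__file__)
```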

If your code was unsuccessful, you will have received an error. The easiest way to debug Python errors is to read them. For example, if you type in import sus instead of import sys, you will get the following output:

>>> import sus
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named sus

Read the last line: ImportError: No module named sus. This line tells you there is an import error, because there is no sus module in Python. Python has searched through the files on your computer and cannot find an importable Python file or folder of files called sus.
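An ImportError can also be handled in code rather than just read on screen. As a brief sketch (the variable name outcome is our own), a try/except block lets a script notice the missing module and carry on:

```python
try:
    import sus  # deliberately misspelled, as in the example above
    outcome = "imported"
except ImportError as error:
    # The exception carries the text shown on the traceback's last line.
    outcome = "ImportError: %s" % error

print(outcome)
```

The exact wording of the message varies slightly between Python versions, but it names the module that could not be found.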

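If you make a typo in the code you transfer from this book, you will likely get a syntax error. In the following example, we purposely mistyped pprint.pprint and instead entered pprint.print(sys.path()). In Python 2.7, print is a reserved word, so the interpreter rejects the line before running any of it: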

>>> pprint.print(sys.path())
  File "<stdin>", line 1
    pprint.print(sys.path())
               ^
SyntaxError: invalid syntax

We mistyped it on purpose here, but while writing this book, one of the authors made the same mistake by accident. You need to get comfortable troubleshooting errors as they arise, and to accept that errors are part of the learning process as a developer. We want to make sure you are comfortable seeing errors; you should treat them as opportunities to learn something new about Python and programming.

Import errors and syntax errors are some of the most common you will see while developing code, and they are the easiest to troubleshoot. When you come across an error, web search engines will be useful to help you fix it.

Before you continue, make sure to exit from the Python interpreter. This takes you back to the Terminal or cmd prompt. To exit, type the following:

exit()

Now your prompt should return to $ (Mac/Linux) or > (Windows). We will play more with the Python interpreter in the next chapter. For now, let’s move on to installing a tool called pip.

Install pip

pip is a command-line tool used to manage shared Python code and libraries. Programmers often solve the same problems, so folks end up sharing their code to help others. That is one key part of the open source software culture.

Mac users can install pip by running a simple downloadable Python script, get-pip.py, in Terminal. You will need to be in the same folder you downloaded the script into. For example, if you downloaded the script into your Downloads folder, you will need to change into that folder from your Terminal. One easy shortcut on a Mac is to press the Command key (Cmd) and then drag your Downloads folder onto your Terminal. Another is to type some simple bash commands (for a more comprehensive introduction to bash, check out Appendix C). Begin by typing this into your Terminal:

cd ~/Downloads

This tells your computer to change directory into the Downloads subfolder in your home folder. To make sure you are in your Downloads folder, type the following into your Terminal:

pwd

This asks the Terminal to show your present working directory, the folder you are currently in. It should output something like the following:

/Users/your_name/Downloads

If your output looks similar, you can run the file by simply using this command:

sudo python get-pip.py

Because you are running a sudo command (meaning you are using special permissions to run the command so it can install packages in restricted places), you will be prompted to type in your password. You should then see a series of messages installing the package.
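The cd and pwd steps above have rough counterparts inside Python, which can be useful for confirming that a script is where you expect it to be. This is a sketch only. The Downloads location is built from your home folder as an assumption, and the code checks for get-pip.py without running it.

```python
import os

# "~" expands to your home folder, as it does in the Terminal.
downloads = os.path.join(os.path.expanduser("~"), "Downloads")
current = os.getcwd()  # Python's counterpart to pwd

print("Home Downloads folder: %s" % downloads)
print("Current directory:     %s" % current)

# get-pip.py is the script name used above; this only checks it is present.
script_here = os.path.isfile(os.path.join(current, "get-pip.py"))
print("get-pip.py found here: %s" % script_here)
```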

Note

On Windows, you likely already have pip installed (it comes with the Windows installation package). To check, you can type pip install ipython into your cmd utility. If you receive an error, download the pip installation script and use chdir C:\Users\YOUR_NAME\Downloads to change into your Downloads folder (substituting your computer’s home directory name for YOUR_NAME). Then, you should be able to execute the downloaded file by typing python get-pip.py. You will need to be an administrator on your computer to properly install everything.

When you use pip, your computer searches PyPI for the specified code package or library, downloads it to your machine, and installs it. This means you do not have to use a browser to download libraries, which can be cumbersome.
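After an install, a quick way to confirm that a library is available is to try importing it. The helper below is an illustrative sketch. The name is_installed and the module names in the loop are our own choices, and the last name is deliberately fake. Keep in mind that the name you import is not always identical to the name you pass to pip.

```python
def is_installed(module_name):
    """Return True if `module_name` can be imported by this interpreter."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False


for name in ["sys", "pip", "not_a_real_module_name"]:
    print("%-24s importable: %s" % (name, is_installed(name)))
```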

We’re almost done with the setup. The final step is installing our code editor.

Install a Code Editor

When writing Python, you’ll need a code editor, as Python requires special spacing, indentation, and character encoding to run properly. There are many code editors to choose from. One of the authors of this book uses Sublime. It is free, but suggests a nominal fee after a certain time period to help support current and future development. You can download Sublime from its website. Another completely free and cross-platform text editor is Atom.

Some people are particular about their code editors. While you do not have to use the editors we recommend, we suggest avoiding Vim, Vi, or Emacs unless you are already using these tools. Some programming purists use these tools exclusively for their code (one of the authors among them), because they can navigate the editor completely by keyboard. However, if you choose one of these editors without having any experience with it, you’ll likely have trouble making it through this book as you’ll be learning two things at once.

Tip

Learn one thing at a time, and feel free to try several editors until you find one that lets you code easily and freely. For Python development, the most important thing is having an editor you feel comfortable with that supports many file types (look for Unicode and UTF-8 support).
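To see why encoding support matters, consider a short sketch of what happens to a single accented word when it is saved as UTF-8. The sample word and variable names are our own. The accented letter is written as an escape (\u00e9) so the example does not depend on how your editor saves the file.

```python
text = u"caf\u00e9"             # four characters; the last is an accented e
encoded = text.encode("utf-8")  # the bytes an editor would write to disk

print("%d characters, %d bytes" % (len(text), len(encoded)))

# Decoding with the same encoding recovers the original text exactly.
round_trip = encoded.decode("utf-8")
print("Round trip matches: %s" % (round_trip == text))
```

The word has four characters but takes five bytes, because the accented letter needs two bytes in UTF-8. An editor that assumes a different encoding would misread those bytes.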

After you have downloaded and installed your editor of choice, launch the program to make sure the installation was successful.

Optional: Install IPython

If you’d like to install a slightly more advanced Python interpreter, we recommend installing a library called IPython. We review some benefits and use cases as well as how to install IPython in Appendix F. Again, this is not required, but it can be a useful tool in getting started with Python.

Summary

In this chapter, we learned about the two popular Python versions. We also completed some initial setup so we can move forward with data wrangling:

  1. We installed and tested Python.

  2. We installed pip.

  3. We installed a code editor.

This is the most basic setup required to get started. As you learn more about Python and programming, you will discover more complex setups. Our aim here was to get you started as quickly as possible without getting too overwhelmed by the setup process. If you’d like to take a look at a more advanced Python setup, check out Appendix D.

As you work through this book, you might encounter tools you need that require a more advanced setup; in that event we will show you how to create a more complex setup from your current basic one. For now, your first steps in Python require only what we’ve shown here.

Congratulations—you have completed your initial setup and run your first few lines of Python code! In the next chapter, we will start learning basic Python concepts.

1 Public high schools in the United States are government-run schools funded largely by taxes from the local community, meaning children can attend and be educated at little to no cost to their parents.

2 To open the cmd utility in Windows, simply search for Command Prompt or open All Programs and select Accessories and then Command Prompt.
