Today, images and video are everywhere. Online photo-sharing sites and social networks host them by the billions. Search engines will return images for just about any conceivable query. Practically all phones and computers come with built-in cameras. It is not uncommon for people to have many gigabytes of photos and videos on their devices.
Programming a computer and designing algorithms for understanding what is in these images is the field of computer vision. Computer vision powers applications like image search, robot navigation, medical image analysis, photo management, and many more.
The idea behind this book is to give an easily accessible entry point to hands-on computer vision with enough understanding of the underlying theory and algorithms to be a foundation for students, researchers, and enthusiasts. The Python programming language, the language of choice for this book, comes with many freely available, powerful modules for handling images, mathematical computing, and data mining.
When writing this book, I have used the following principles as a guideline. The book should:
Be written in an exploratory style and encourage readers to follow the examples on their computers as they are reading the text.
Promote and use free and open software with a low learning threshold. Python was the obvious choice.
Be complete and self-contained. The book does not cover all of computer vision; rather, it should be complete in the sense that all code is presented and explained. The reader should be able to reproduce the examples and build upon them directly.
Be broad rather than detailed, inspiring and motivational rather than theoretical.
In short, it should act as a source of inspiration for those interested in programming computer vision applications.
This book looks at theory and algorithms for a wide range of applications and problems. Here is a short summary of what to expect.
Basic programming experience. You need to know how to use an editor and run scripts, how to structure code as well as basic data types. Familiarity with Python or other scripting languages like Ruby or Matlab will help.
Basic mathematics. To make full use of the examples, it helps if you know about matrices, vectors, matrix multiplication, and standard mathematical functions and concepts like derivatives and gradients. Some of the more advanced mathematical examples can be easily skipped.
Hands-on programming with images using Python.
Computer vision techniques behind a wide variety of real-world applications.
Many of the fundamental algorithms and how to implement and apply them yourself.
The code examples in this book will show you object recognition, content-based image retrieval, image search, optical character recognition, optical flow, tracking, 3D reconstruction, stereo imaging, augmented reality, pose estimation, panorama creation, image segmentation, de-noising, image grouping, and more.
Chapter 1. Introduces the basic tools for working with images and the central Python modules used in the book. This chapter also covers many fundamental examples needed for the remaining chapters.
Chapter 2. Explains methods for detecting interest points in images and how to use them to find corresponding points and regions between images.
Chapter 3. Describes basic transformations between images and methods for computing them. Examples range from image warping to creating panoramas.
Chapter 4. Introduces how to model cameras, generate image projections from 3D space to image features, and estimate the camera viewpoint.
Chapter 5. Explains how to work with several images of the same scene, the fundamentals of multiple-view geometry, and how to compute 3D reconstructions from images.
Chapter 6. Introduces a number of clustering methods and shows how to use them for grouping and organizing images based on similarity or content.
Chapter 7. Shows how to build efficient image retrieval techniques that can store image representations and search for images based on their visual content.
Chapter 8. Describes algorithms for classifying image content and how to use them to recognize objects in images.
Chapter 9. Introduces different techniques for dividing an image into meaningful regions using clustering, user interactions, or image models.
Chapter 10. Shows how to use the Python interface for the commonly used OpenCV computer vision library and how to work with video and camera input.
There is also a bibliography at the back of the book. Citations of bibliographic entries are made by number in square brackets.
Computer vision is the automated extraction of information from images. Information can mean anything from 3D models, camera position, object detection and recognition to grouping and searching image content. In this book, we take a wide definition of computer vision and include things like image warping, de-noising, and augmented reality.
Sometimes computer vision tries to mimic human vision, sometimes it uses a data and statistical approach, and sometimes geometry is the key to solving problems. We will try to cover all of these angles in this book.
Practical computer vision contains a mix of programming, modeling, and mathematics and is sometimes difficult to grasp. I have deliberately tried to present the material with a minimum of theory in the spirit of “as simple as possible but no simpler.” The mathematical parts of the presentation are there to help readers understand the algorithms. Some chapters are by nature very math-heavy (Chapter 4 and Chapter 5, mainly). Readers can skip the math if they like and still use the example code.
Python is the programming language used in the code examples throughout this book. Python is a clear and concise language with good support for input/output, numerics, images, and plotting. The language has some peculiarities, such as indentation and compact syntax, that take getting used to. The code examples assume you have Python 2.6 or later, as most packages are only available for these versions. The upcoming Python 3.x version has many language differences and is not backward compatible with Python 2.x or compatible with the ecosystem of packages we need (yet).
Some familiarity with basic Python will make the material more accessible. For beginners to Python, Mark Lutz's book Learning Python and the online documentation at http://www.python.org/ are good starting points.
When programming computer vision, we need representations of vectors and matrices and operations on them. This is handled by Python's NumPy module, where both vectors and matrices are represented by the array type. This is also the representation we will use for images. A good NumPy reference is Travis Oliphant's free book Guide to NumPy. The documentation at http://numpy.scipy.org/ is also a good starting point if you are new to NumPy. For visualizing results, we will use the Matplotlib module, and for more advanced mathematics, we will use SciPy. These are the central packages you will need; they are introduced and explained in Chapter 1.
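As a rough illustration of the array representation described above, the following sketch builds a vector, a matrix, and a small synthetic grayscale "image" with NumPy. The variable names and the 4x6 gradient are assumptions made up for this example; a real image would normally be loaded from a file.

```python
import numpy as np

# A vector and a matrix are both NumPy arrays; only the shape differs.
v = np.array([1.0, 2.0, 3.0])            # shape (3,)
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.5]])          # shape (3, 3)
Av = np.dot(A, v)                        # matrix-vector product

# A grayscale image is a 2D array of intensities indexed as [row, column].
# This synthetic gradient stands in for a real photo (assumption).
row = np.arange(0, 256, 51, dtype=np.uint8)   # 0, 51, ..., 255
im = np.tile(row, (4, 1))                     # 4 rows, 6 columns

# Array arithmetic applies to every pixel at once, here inverting the image.
inverted = 255 - im

print(Av)
print(im.shape, im.dtype)
print(inverted[0])
```

Because images, vectors, and matrices share one type, the same indexing and arithmetic apply to all of them.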
Besides these central packages, there will be many other free Python packages used for specific purposes like reading JSON or XML, loading and saving data, generating graphs, graphics programming, web demos, classifiers, and many more. These are usually only needed for specific applications or demos and can be skipped if you are not interested in that particular application.
It is worth mentioning IPython, an interactive Python shell that makes debugging and experimentation easier. Documentation and downloads are available at http://ipython.org/.
Code looks like this:
    # some points
    x = [100,100,400,400]
    y = [200,500,200,500]

    # plot the points
    plot(x,y)
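The snippet above calls plot without showing where it comes from. A minimal self-contained sketch, assuming Matplotlib is installed, might look like the following. The explicit imports, the non-interactive Agg backend, and the output file name points.png are assumptions added for illustration and are not part of the snippet above.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: draw to a file so no display is needed
import matplotlib.pyplot as plt

# some points
x = [100, 100, 400, 400]
y = [200, 500, 200, 500]

# plot the points (with no format string, they are joined by line segments)
fig, ax = plt.subplots()
(line,) = ax.plot(x, y)

# save the figure; the temporary directory and file name are hypothetical
out_path = os.path.join(tempfile.gettempdir(), "points.png")
fig.savefig(out_path)
print("figure written to", out_path)
```

In an interactive session, you would typically drop the Agg line and call plt.show() instead of saving to a file.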
The following typographical conventions are used in this book:
Used for definitions, filenames, and variable names.
Used for functions, Python modules, and code examples. It is also used for console printouts.
Used for URLs.
Used for everything else.
Mathematical formulas are given inline like this $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$ or centered independently:
and are only numbered when a reference is needed.
In the mathematical sections, we will use lowercase ($s$, $r$, $\lambda$, $\theta$, ...) for scalars, uppercase ($A$, $V$, $H$, ...) for matrices (including $I$ for the image as an array), and lowercase bold ($\mathbf{t}$, $\mathbf{c}$, ...) for vectors. We will use $\mathbf{x} = [x, y]$ and $\mathbf{X} = [X, Y, Z]$ to mean points in 2D (images) and 3D, respectively.
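To connect this notation to code, here is a small sketch of how the scalar, vector, and matrix conventions might map onto NumPy arrays. The function name f, the specific numbers, and the rotation matrix H are assumptions chosen for illustration only.

```python
import numpy as np

def f(x, w, b):
    """Evaluate the linear function f(x) = w^T x + b for vectors w, x and scalar b."""
    return float(np.dot(w, x) + b)

# Vectors (lowercase bold in the text) are 1D arrays.
w = np.array([1.0, 2.0])
x = np.array([3.0, 4.0])        # a 2D point x = [x, y]
X = np.array([1.0, 2.0, 3.0])   # a 3D point X = [X, Y, Z]

# Scalars (lowercase) are plain numbers.
b = 0.5
value = f(x, w, b)              # 1*3 + 2*4 + 0.5

# Matrices (uppercase) are 2D arrays; this one rotates a 2D point by 90 degrees.
H = np.array([[0.0, -1.0],
              [1.0,  0.0]])
x_rot = np.dot(H, x)

print(value)
print(x_rot)
```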
This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Programming Computer Vision with Python by Jan Erik Solem (O’Reilly). Copyright © 2012 Jan Erik Solem, 978-1-449-31654-9.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at email@example.com.
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
(800) 998-9938 (in the United States or Canada)
(707) 829-0515 (international or local)
(707) 829-0104 (fax)
We have a web page for this book, where we list errata, examples, links to the code and data sets used, and any additional information. You can access this page at:
To comment or ask technical questions about this book, send email to:
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Safari Books Online (www.safaribooksonline.com) is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.
I’d like to express my gratitude to everyone involved in the development and production of this book. The whole O’Reilly team has been helpful. Special thanks to Andy Oram (O’Reilly) for editing, and Paul Anagnostopoulos (Windfall Software) for efficient production work.
Many people commented on the various drafts of this book as I shared them online. Klas Josephson and Håkan Ardö deserve lots of praise for their thorough comments and feedback. Fredrik Kahl and Pau Gargallo helped with fact checks. Thank you all readers for encouraging words and for making the text and code examples better. Receiving emails from strangers sharing their thoughts on the drafts was a great motivator.
Finally, I’d like to thank my friends and family for support and understanding when I spent nights and weekends on writing. Most thanks of all to my wife Sara, my long-time supporter.
Note: examples such as image warping, de-noising, and augmented reality produce new images and are more image processing than actually extracting information from images.