Chapter 5. Contingency Tables Using Sparse Coordinate Matrices

I like sparseness. There’s something about that minimalist feel that can make something have an immediate impact and make it unique. I’ll probably always work with that formula. I just don’t know how.

Britt Daniel, lead singer of Spoon

Many real-world matrices are sparse, which means that most of their values are zero.

Using NumPy arrays to manipulate sparse matrices wastes a lot of time and energy storing zeros and multiplying many, many values by 0. Instead, we can use SciPy’s sparse module to work with these matrices efficiently, storing and examining only the nonzero values. In addition to helping solve these “canonical” sparse matrix problems, sparse can be used for problems that are not obviously related to sparse matrices.
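To make the storage difference concrete, here is a minimal sketch that is not taken from the chapter's own code. The array size and the three nonzero entries are arbitrary assumptions chosen for illustration. It builds a mostly-zero array, converts it to SciPy's CSR format, and checks that a matrix-vector product gives the same result either way.

```python
import numpy as np
from scipy import sparse

# A mostly-zero 1000 x 1000 array with only three nonzero entries
# (the size and values are arbitrary, chosen for illustration).
dense = np.zeros((1000, 1000))
dense[0, 1] = 2.0
dense[10, 500] = 3.0
dense[999, 999] = 4.0

# CSR keeps only the nonzero values plus index arrays locating them.
sp = sparse.csr_matrix(dense)

# The same matrix-vector product, computed both ways.
v = np.ones(1000)
dense_result = dense @ v   # touches all one million entries
sparse_result = sp @ v     # touches only the stored nonzeros

print(sp.nnz)                           # number of stored values
print(dense.nbytes, sp.data.nbytes)     # bytes used for the values
```

The dense array reserves space for every entry, while the sparse version stores the three values and a small amount of index bookkeeping.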

One such problem is the comparison of image segmentations. (Review Chapter 3 for a definition of segmentation.)

The code sample motivating this chapter uses sparse matrices twice. First, we use code nominated by Andreas Mueller to compute a contingency matrix that counts the correspondence of labels between two segmentations. Then, with suggestions from Jaime Fernández del Río and Warren Weckesser, we use that contingency matrix to compute the variation of information, which measures the differences between segmentations.

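import numpy as np
from scipy import sparse

def variation_of_information(x, y):
    # compute contingency matrix, aka joint probability matrix
    # (COO sums duplicate (row, col) entries when converted to CSR)
    n = x.size
    Pxy = sparse.coo_matrix((np.full(n, 1/n), (x.ravel(), y.ravel())),
                            dtype=float).tocsr()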

    # compute ...
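The contingency-matrix step above can be tried on a tiny input. The sketch below uses two made-up 2 x 3 segmentations, `x` and `y`, which are hypothetical data and not from the book. It relies on the fact that a COO matrix sums values at repeated coordinates when converted to another format, so each pixel contributes 1/n to the cell for its label pair. The last part computes variation of information from the textbook identity VI = 2H(X,Y) - H(X) - H(Y) on a dense copy of the matrix. This is an illustrative assumption, not the sparse implementation the chapter goes on to develop.

```python
import numpy as np
from scipy import sparse

# Two hypothetical segmentations of the same 2 x 3 image.
x = np.array([[0, 0, 1],
              [0, 1, 1]])
y = np.array([[0, 0, 0],
              [1, 1, 1]])

n = x.size
# Each pixel adds 1/n at position (label in x, label in y).
# Duplicate coordinates are summed during the conversion to CSR.
Pxy = sparse.coo_matrix((np.full(n, 1 / n), (x.ravel(), y.ravel())),
                        dtype=float).tocsr()
joint = Pxy.toarray()   # small enough to inspect densely
print(joint)

# Dense, illustrative variation of information (in bits), using
# VI = 2*H(X,Y) - H(X) - H(Y). Not the chapter's sparse version.
def entropy(p):
    p = p[p > 0]                      # skip zeros: 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

h_xy = entropy(joint.ravel())
h_x = entropy(joint.sum(axis=1))      # marginal over labels in x
h_y = entropy(joint.sum(axis=0))      # marginal over labels in y
vi = 2 * h_xy - h_x - h_y
print(vi)
```

The entries of `joint` sum to 1, so the matrix can be read as a joint probability distribution over label pairs, and identical segmentations would give a variation of information of 0.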
