Chapter 2. Distributions

One of the best ways to describe a variable is to report the values that appear in the dataset and how many times each value appears. This description is called the distribution of the variable.

The most common representation of a distribution is a histogram, which is a graph that shows the frequency of each value. In this context, “frequency” means the number of times the value appears.

In Python, an efficient way to compute frequencies is with a dictionary. Given a sequence of values, t:

hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1

The result is a dictionary that maps from values to frequencies. Alternatively, you could use the Counter class defined in the collections module:

from collections import Counter
counter = Counter(t)

The result is a Counter object, which is a subclass of dict.
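As a quick check that the two approaches agree, the following sketch builds both from the same sample and compares them. The values in t are made up for illustration. It also shows two conveniences of Counter that a plain dictionary lacks: most_common, which returns values ordered by frequency, and a lookup that returns 0 for a missing key instead of raising KeyError.

```python
from collections import Counter

# Illustrative sample; any sequence of hashable values works.
t = [1, 2, 2, 3, 5]

# Dictionary-based frequency count.
hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1

# Counter-based frequency count.
counter = Counter(t)

# Converting the Counter to a plain dict lets us compare the two results.
same = dict(counter) == hist

# most_common(n) returns the n most frequent (value, frequency) pairs.
top = counter.most_common(1)

# Looking up a key that was never counted returns 0 in a Counter.
missing = counter[4]

print(hist)
print(same)
print(top)
print(missing)
```

Either structure maps values to frequencies; the choice mostly depends on whether you want the extra Counter methods.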

Another option is to use the pandas method value_counts, which we saw in the previous chapter. But for this book I created a class, Hist, that represents histograms and provides the methods that operate on them.

Representing Histograms

The Hist constructor can take a sequence, dictionary, pandas Series, or another Hist. You can instantiate a Hist object like this:

>>> import thinkstats2
>>> hist = thinkstats2.Hist([1, 2, 2, 3, 5])
>>> hist
Hist({1: 1, 2: 2, 3: 1, 5: 1})

Hist objects provide Freq, which takes a value and returns its frequency:

>>> hist.Freq(2)
2

The bracket operator does the same thing:

>>> hist[2]
2

If you look up a value that has never appeared, the frequency ...
