Preprocessing the Data

We'll start from the beginning: like many websites, FaceStat runs on an SQL database. The judgment interface takes user judgments and saves them as a set of (face ID, attribute, judgment) triples. The first thing we do is extract these triples, 10 million rows in all, from the database. This gives us a file that looks like:

face_id   key          value
149777    describe     serious
18717     trustworthy  3
140467    attractive   2
149777    describe     five-head
...

We're interested in exploring the relationships between different types of perceived attributes. One interesting question is, "How old do I look?" The very first thing to do is to look at the responses that people have given. Unix command-line tools make it easy to quickly see a histogram of responses. The most common responses look like reasonable ages, but we also see a problem:

Look at                    $ cat data.tsv  |
age judgments'                 grep "age"  |
values                         cut -f3  |
and count how many times       sort  |
each value occurs,             uniq -c  |
and order by this count.       sort -nr
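
One caveat about the pipeline above: grep "age" keeps any row that contains the letters "age" anywhere, so a describe row whose value happens to be a word like "teenage" would be counted too. A minimal sketch of a stricter variant follows. It assumes the file is tab-separated with the columns face_id, key, value (consistent with cut -f3) and that the key is literally named age; the function name count_values is hypothetical.

```shell
# Count how often each value occurs for one key, matching the key column exactly.
# Assumption: tab-separated input with columns face_id, key, value.
count_values() {
    # $1 = key name (for example "age"), $2 = path to the TSV file
    awk -F'\t' -v key="$1" '$2 == key { print $3 }' "$2" |
        sort |        # group identical values together
        uniq -c |     # prefix each distinct value with its count
        sort -nr      # most frequent values first
}

# Usage: count_values age data.tsv
```

Comparing the key column exactly also skips the header row, since its key field is the word "key" rather than "age".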

Here's the output of this shell pipeline. On each line, the first number is the frequency count. The second string is the response value: exactly what the user typed in the web form in response to the question "How old do I look?" Most often, she typed in a number, but there are some issues:

70472 19
70021 22
69387 18
68423 17
...
27 24\r\n
27 17\r\n
23 01
21 16\r\n
...
1 old enough to know better
1 hopefully over 21
1 e
1 ??
...
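
One possible way to tidy responses like these is sketched below; it is not necessarily the cleanup the authors used. It assumes the "\r\n" in the listing is either a literal backslash sequence left in the export or a real carriage return, that a plausible age is one to three digits, and that "01" should be read as 1, which is a judgment call. The function name clean_ages is hypothetical.

```shell
# Normalize raw age responses (one per line on stdin) and keep only integers.
# Assumptions: "\r\n" debris is either literal text or a real carriage return,
# and a leading zero carries no meaning.
clean_ages() {
    tr -d '\r' |                       # drop real carriage returns
        sed -e 's/\\r\\n$//' |         # drop a literal backslash-r backslash-n suffix
        grep -E '^[0-9]{1,3}$' |       # keep lines that are 1 to 3 digits and nothing else
        sed -e 's/^0*\([0-9]\)/\1/'    # strip leading zeros, so "01" becomes "1"
}

# Usage: cut -f3 ages.tsv | clean_ages | sort | uniq -c | sort -nr
# (ages.tsv is a placeholder for a file holding only the age rows)
```

Free-text answers such as "old enough to know better" are simply dropped here; whether to discard them or interpret them by hand depends on the analysis.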

FaceStat has existed for eight months and undergone many changes, so data has been collected ...
