Preprocessing the Data
We'll start from the beginning: like many websites, FaceStat runs on an SQL database. The judgment interface takes user judgments and saves them as a set of (face ID, attribute, judgment) triples. The first thing we do is extract these triples, 10 million rows in all, from the database. This gives us a file that looks like:
face_id  key          value
149777   describe     serious
18717    trustworthy  3
140467   attractive   2
149777   describe     five-head
...
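Because every row carries its attribute name in the second column, a quick sanity check on the extract is to count rows per attribute. The following is a minimal sketch, not the authors' own code: the helper name `count_keys` is hypothetical, and it assumes the file is tab-separated and that the `face_id key value` line is a header row in the file itself.

```shell
# Hypothetical helper: read the extract on stdin and count rows per
# attribute key (column 2), most frequent first.
# Assumes tab-separated fields and a single header line to skip.
count_keys() {
  awk -F'\t' 'NR > 1 { n[$2]++ } END { for (k in n) printf "%d\t%s\n", n[k], k }' |
    sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Sample rows copied from the excerpt above; the real file has
# about 10 million rows.
printf '%s\t%s\t%s\n' \
  face_id key value \
  149777 describe serious \
  18717 trustworthy 3 \
  140467 attractive 2 \
  149777 describe five-head |
  count_keys
```

On the four sample rows this lists `describe` with a count of 2, followed by `attractive` and `trustworthy` with a count of 1 each. On the full file it would be run as `count_keys < data.tsv`.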
We're interested in exploring the relationships between different types of perceived attributes. One interesting question is, "How old do I look?" The very first thing to do is to look at the responses that people have given. Unix command-line tools make it easy to quickly see a histogram of responses. The most common responses look like reasonable ages, but we also see a problem:
$ cat data.tsv |     # Look at
    grep "age" |     # age judgments'
    cut -f3 |        # values
    sort |           # and count how many times
    uniq -c |        # each value occurs,
    sort -nr         # and order by this count.
Here's the output of this shell pipeline. On each line, the first number is the frequency count and the string after it is the response value: exactly what the user typed into the web form in response to the question "How old do I look?" Most often she typed in a number, but there are some issues:
  70472 19
  70021 22
  69387 18
  68423 17
  ...
     27 24\r\n
     27 17\r\n
     23 01
     21 16\r\n
  ...
      1 old enough to know better
      1 hopefully over 21
      1 e
      1 ??
  ...
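One way to deal with responses like these is to normalize the values before counting them. The sketch below is an illustration under stated assumptions, not the method the chapter goes on to use: the helper name `clean_age_histogram` is hypothetical, and it assumes the `\r\n` seen above is stored as four literal characters (backslash, r, backslash, n) at the end of the value rather than as a real line break. It strips that suffix, keeps only values made up entirely of digits, and rebuilds the histogram.

```shell
# Hypothetical helper: read raw age responses on stdin, one per line,
# and print a histogram of the ones that are plain integers.
clean_age_histogram() {
  sed 's/\\r\\n$//' |         # drop a trailing literal \r\n (assumed literal)
    grep -E '^[0-9]+$' |      # keep values consisting only of digits
    sort | uniq -c | sort -nr # count each value, most frequent first
}

# Toy input built from response strings shown above; the counts here
# are made up for illustration and are not the real frequencies.
printf '%s\n' 19 19 '24\r\n' 24 24 '17\r\n' \
  'old enough to know better' 'hopefully over 21' e '??' |
  clean_age_histogram
```

With this toy input, `24\r\n` is folded into `24`, and the free-text answers are dropped. A value such as `01` would still pass the digits-only filter, so deciding what to do with implausible ages needs a separate check. In the pipeline above, the helper would take the place of the final `sort | uniq -c | sort -nr` after `cut -f3`. Selecting rows with `awk -F'\t' '$2 == "age"'` instead of `grep "age"` would also avoid picking up rows where "age" merely appears inside another field, assuming the key is literally `age`.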
FaceStat has existed for eight months and undergone many changes, so data has been collected ...