Which Words Are Gendered?

Many social theorists have wondered to what extent gender is reflected in language. Our data set lets us explore this at the word level: we can find which description tags are most characteristic of male or female faces. We could simply count the words that occur most often for men and the words that occur most often for women, but that mostly surfaces words that are frequent for everyone. A better approach is to score each tag by the share of its occurrences that fall on one gender. That is, to determine how characteristic a tag T is for gender G, look at:

    score(T, G) = (number of occurrences of T on faces of gender G) / (total number of occurrences of T)

This score has a flaw: rare tags introduce noise. For example, any tag that appears just once automatically gets a perfect score of 1 for whichever gender it appeared with. (This is another instance of the small-sample error we saw with sparse age buckets.) A simple way around this is to use a frequency threshold. In this case, we'll only look at tags that occur more than 100 times.
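To make the scoring and the threshold concrete, here is a minimal Python sketch. It is not the code used for the analysis: the function name `gender_scores`, the `(tag, gender)` input format, and the `min_count` parameter are our own assumptions for illustration. The "bald" counts are taken from the table of male-characteristic words (343 of 350 occurrences on male faces), and the single-occurrence tag is made up to show the small-sample problem.

```python
from collections import Counter

def gender_scores(observations, gender, min_count=100):
    """Score each tag by the fraction of its occurrences that fall on `gender`.

    observations: iterable of (tag, gender) pairs, one per tag occurrence
                  (a hypothetical input format, assumed for this sketch).
    Only tags whose total count is greater than `min_count` are scored.
    Returns (tag, score) pairs sorted from highest to lowest score.
    """
    total = Counter()        # occurrences of each tag overall
    with_gender = Counter()  # occurrences of each tag on faces of `gender`
    for tag, g in observations:
        total[tag] += 1
        if g == gender:
            with_gender[tag] += 1
    scores = {
        tag: with_gender[tag] / n
        for tag, n in total.items()
        if n > min_count
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# "bald": 343 of 350 occurrences on male faces (counts from the table).
# "rare_tag": a made-up tag that occurs exactly once.
observations = (
    [("bald", "male")] * 343
    + [("bald", "female")] * 7
    + [("rare_tag", "female")]
)

# Without a threshold, the one-off tag gets a "perfect" score.
unfiltered = gender_scores(observations, "female", min_count=0)

# With the threshold, only tags seen more than 100 times survive.
filtered = gender_scores(observations, "male", min_count=100)

print(unfiltered)
print(filtered)
```

With `min_count=0`, the one-off tag ranks first for "female" purely because of its single occurrence; with the threshold of 100, it drops out and only "bald" is scored.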

Calculating these scores—in statistical terminology, they're maximum likelihood estimates of the conditional probabilities Pr(G|T)—we get the following tables.

Words most characteristic of men are shown in the following table.

 

Tag        G (male)   T (total)   Ratio
daddy      122        122         1.0000000
fatherly   115        115         1.0000000
fratboy    177        177         1.0000000
father     172        173         0.9942197
dad        341        343         0.9941691
douche     229        231         0.9913420
Handsome   110        111         0.9909910
scruffy    149        151         0.9867550
bald       343        350         0.9800000
jock       395        404         0.9777228
...
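As a quick sanity check, the Ratio column can be recomputed from the two count columns. The Python sketch below copies the counts from the table above; the variable names are ours, and the seven-decimal formatting is chosen only to match how the table is printed.

```python
# (tag, count on male faces, total count), copied from the table above.
rows = [
    ("daddy", 122, 122),
    ("fatherly", 115, 115),
    ("fratboy", 177, 177),
    ("father", 172, 173),
    ("dad", 341, 343),
    ("douche", 229, 231),
    ("Handsome", 110, 111),
    ("scruffy", 149, 151),
    ("bald", 343, 350),
    ("jock", 395, 404),
]

# Ratio = count on male faces / total count, for each tag.
recomputed = [(tag, g / t) for tag, g, t in rows]

for tag, ratio in recomputed:
    print(f"{tag:<10}{ratio:.7f}")
```

Every listed tag clears the threshold of more than 100 total occurrences, which is why none of the perfect scores comes from a one-off tag.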
