Analysis of Deviance with Count Data

In our next example the response variable is a count of infected blood cells per mm² on microscope slides prepared from randomly selected individuals. The explanatory variables are smoker (logical, yes or no), age (three levels, under 20, 21 to 59, 60 and over), sex (male or female) and body mass score (three levels, normal, overweight, obese).

count<-read.table("c:\\temp\\cells.txt",header=TRUE)
attach(count)
names(count)

[1] "cells" "smoker" "age" "sex" "weight"

It is always a good idea with count data to get a feel for the overall frequency distribution of counts using table:

table(cells)
  0   1   2   3   4   5  6  7
314  75  50  32  18  13  7  2

Most subjects (314 of the 511) showed no infected cells, and the maximum count of 7 was observed in just two subjects.
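As a rough check on what this frequency table implies, the counts can be rebuilt from the tabulated frequencies alone and their mean compared with their variance. The sketch below assumes nothing beyond the table output shown above; the object names (values, freq, rebuilt) are illustrative and are not part of the original session.

```r
# Rebuild the vector of counts from the frequency table printed above.
values <- 0:7
freq   <- c(314, 75, 50, 32, 18, 13, 7, 2)
rebuilt <- rep(values, times = freq)

n <- length(rebuilt)   # total number of subjects
m <- mean(rebuilt)     # overall mean count
v <- var(rebuilt)      # sample variance of the counts

# Under a Poisson distribution the variance equals the mean, and the
# expected proportion of zeros is exp(-mean).
obs_zero <- freq[1] / n
exp_zero <- exp(-m)

print(c(n = n, mean = m, variance = v))
print(c(observed_zero = obs_zero, poisson_zero = exp_zero))
```

The variance comes out at more than twice the mean (about 2.17 against about 0.91), and the observed proportion of zeros is well above the value a Poisson distribution with that mean would give. This describes only the pooled counts, ignoring the explanatory variables, so it is a hint of possible overdispersion rather than a test.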

We begin data inspection by tabulating the main effect means:

tapply(cells,smoker,mean)

    FALSE      TRUE
0.5478723 1.9111111

tapply(cells,weight,mean)

   normal       obese        over
0.5833333   1.2814371   0.9357143

tapply(cells,sex,mean)

   female       male
0.6584507  1.2202643

tapply(cells,age,mean)
      mid         old        young
0.8676471   0.7835821    1.2710280

It looks as if smokers had a substantially higher mean count than non-smokers, overweight and obese subjects had higher counts than those of normal weight, males had a higher count than females, and young subjects had a higher mean count than middle-aged or older people. We need to test whether any of these differences are significant and to assess whether there are interactions between the explanatory ...
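One common way to carry out such tests on count data is a generalized linear model with Poisson errors, followed by an analysis of deviance table. The sketch below is only an illustration of that general approach: it uses simulated stand-in data (the data frame sim and its assumed means are hypothetical, not the cells.txt data) and only two of the explanatory variables, and it should not be read as the analysis that follows in the text.

```r
# Simulated stand-in data; NOT the cells.txt data described in the text.
set.seed(1)
n <- 200
sim <- data.frame(
  smoker = sample(c(FALSE, TRUE), n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE))
)
# Assumed group means, chosen purely for illustration.
sim$cells <- rpois(n, lambda = ifelse(sim$smoker, 2, 0.5))

# Poisson GLM (log link) with both main effects and their interaction.
model <- glm(cells ~ smoker * sex, family = poisson, data = sim)

# Analysis of deviance: terms are added sequentially and each reduction
# in deviance is compared with a chi-squared distribution.
dev_table <- anova(model, test = "Chisq")
print(dev_table)

# Rough overdispersion check: residual deviance divided by residual
# degrees of freedom; values well above 1 suggest overdispersion.
dispersion <- deviance(model) / df.residual(model)
print(dispersion)
```

If the dispersion ratio is well above 1, refitting with family = quasipoisson and using F tests in place of chi-squared tests is a standard alternative.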
