Preface

One thing I (Sarah) have learned over the last 20 or so years is that a sure way to derail a promising conversation at a party is to tell people what I do for a living. And rest assured that I’m neither a tax auditor nor captain of a sludge barge. No, I’m merely a biostatistician and statistics instructor, a revelation which invariably provokes a response such as “statistics was my worst class in school” or the sudden inspiration to quote that old chestnut popularized by Mark Twain that there are three kinds of lies: lies, damned lies, and statistics.

Personally, I find statistics fascinating and I love working in this field. I like teaching statistics as well, and I like to believe that I communicate some of this enthusiasm to my students, most of whom are physicians or other healthcare professionals required to take my classes as part of their fellowship studies. It’s often an uphill battle, however: some of them arrive with a negative attitude toward everything statistical, possibly augmented by the belief that statistics is some kind of magical procedure that will do their thinking for them, or a set of tricks and manipulations whose purpose is to twist reality in order to mislead other people.

I’m not sure how statistics got such a bad reputation, or why so many people have a negative attitude toward it. I do know that most of them can’t afford it: competence in statistics is fast becoming a necessity in many fields of work. It’s also becoming a requirement for thoughtful participation in modern society, as we are bombarded daily by statistical information and arguments, many of questionable merit. I have long since ceased to hope that I can keep everyone from misusing statistics; instead, I have placed my hopes in cultivating a statistics-educated populace who will be able to recognize when statistics are being misused and discount the speaker’s credibility accordingly. We (Sarah and Paul) have tried to address both concerns in this book: statistics as a professional necessity, and statistics as part of the intellectual content required for informed citizenship.

What Is Statistics?

Before we jump into the technical details of learning and using statistics, let’s step back for a minute and consider what can be meant by the word “statistics.” Don’t worry if you don’t understand all the vocabulary immediately: it will become clear over the course of this book.

When people speak of statistics, they usually mean one or more of the following:

  1. Numerical data such as the unemployment rate, the number of persons who die annually from bee stings, or the racial makeup of the population of New York City in 2006 as compared to 1906.

  2. Numbers used to describe samples (subsets) of data, such as the mean (average), as opposed to numbers used to describe populations (entire sets of data); for instance, if we work for an advertising firm interested in the average age of people who subscribe to Sports Illustrated, we can draw a sample of subscribers and calculate the mean of that sample (a statistic), which is an estimate of the mean of the entire population of subscribers.

  3. Particular procedures used to analyze data, and the results of those procedures, such as the t statistic or the chi-square statistic.

  4. A field of study that develops and uses mathematical procedures to describe data and make decisions regarding it.

The type of statistics referred to in definition #1 is not the primary concern of this book: if you simply want to find the latest figures on unemployment, health, or any of the myriad other topics on which governments and private organizations regularly release statistical data, your best bet is to consult a reference librarian or subject expert. If, however, you want to know how to interpret those figures (to understand why the mean is often misleading as a statement of average value, for instance, or the difference between crude and standardized mortality rates), Statistics in a Nutshell can definitely help you out.
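As a small illustration of why the mean can mislead as a statement of average value, here is a minimal Python sketch. The income figures are invented for the example and do not come from any real data set.

```python
import statistics

# Hypothetical annual incomes (in thousands of dollars): nine employees
# and one highly paid executive. These values are made up for illustration.
incomes = [32, 35, 38, 40, 41, 43, 45, 48, 50, 900]

mean_income = statistics.mean(incomes)      # sensitive to the one extreme value
median_income = statistics.median(incomes)  # middle of the sorted values

# Count how many people earn less than the mean.
below_mean = sum(1 for income in incomes if income < mean_income)

print(f"Mean:   {mean_income}")
print(f"Median: {median_income}")
print(f"{below_mean} of {len(incomes)} incomes fall below the mean")
```

In this made-up group, nearly everyone earns less than the mean, because a single large value pulls it upward; the median stays close to what a typical member of the group earns.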

The concepts included in definition #2 will be discussed in Chapter 7, which introduces inferential statistics, but they also permeate this book. It is partly a question of vocabulary (statistics are numbers that describe samples, while parameters are numbers that describe populations), but also underscores a fundamental point about the practice of statistics. The concept of using information gained from studying a sample to make statements about a population is the basis of inferential statistics, and inferential statistics is the primary focus of this book (as it is of most books about statistics).
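The distinction between a statistic and a parameter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is simulated (hypothetical subscriber ages drawn uniformly between 18 and 80), and the function name is ours, not something defined elsewhere in this book.

```python
import random
import statistics


def sample_mean_vs_population_mean(population, sample_size, seed=0):
    """Return (sample mean, population mean) for a simple random sample.

    The sample mean is a statistic; the population mean is the parameter
    it estimates. The seed makes the illustration repeatable.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)  # sampling without replacement
    return statistics.mean(sample), statistics.mean(population)


# Hypothetical population: ages of 10,000 magazine subscribers (simulated).
population_rng = random.Random(42)
population_ages = [population_rng.randint(18, 80) for _ in range(10_000)]

sample_stat, population_param = sample_mean_vs_population_mean(
    population_ages, sample_size=200
)

print(f"Sample mean (statistic):     {sample_stat:.2f}")
print(f"Population mean (parameter): {population_param:.2f}")
```

In practice we rarely have access to the whole population, which is exactly why we rely on the sample mean as an estimate; the simulation has both only so that the two numbers can be compared.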

Definition #3 is also fundamental to most chapters of this book. The process of learning statistics is to some extent the process of learning particular statistical procedures, including how to calculate and interpret them, how to choose the appropriate statistic for a given situation, and so on. In fact, many new students of statistics subscribe to this definition: learning statistics to them means learning to execute a set of statistical procedures. This is not an invalid approach to statistics so much as it is incomplete: learning to execute statistical procedures is a necessary part of the practice of statistics, but it is far from being the entire story. What’s more, since computer software has made it increasingly easy for anyone, regardless of mathematical background, to produce statistical analyses, the need to understand and interpret statistics has far outstripped the need to learn how to do the calculations themselves.

Definition #4 is nearest to my heart, since I chose statistics as my professional field. If you are a secondary or post-secondary student you are probably aware of this definition of statistics, as many universities and colleges today either have a separate department of statistics or include statistics as a field of specialization within mathematics. Statistics is increasingly taught in high school as well: in the U.S., enrollment in the A.P. (Advanced Placement) Statistics classes is increasing more rapidly than enrollment in any other A.P. area.

Statistics is too important to be left to the statisticians, however, and university study in many subjects requires one or more semesters of statistics classes. Many basic techniques in modern statistics have been developed by people who learned and used statistics as part of their studies in another field. For instance, Stephen Raudenbush, a pioneer in the development of hierarchical linear modeling, studied Policy Analysis and Evaluation Research at Harvard, and Edward Tufte, perhaps the world’s leading expert on statistical graphics, began his career as a political scientist: his Ph.D. dissertation at Yale was on the American Civil Rights Movement.

With the increasing use of statistics in many professions, and at all levels from top to bottom, basic knowledge of statistics has become a necessity for many people who have been out of school for years. Such individuals are often ill-served by textbooks aimed at introductory college courses, which are too specialized, too focused on calculation, and too expensive.

Finally, statistics cannot be left to the statisticians because it’s also a necessity to understand much of what you read in the newspaper or hear on television and the radio. A working knowledge of statistics is the best check against the proliferation of misleading or outright false claims (whether by politicians, advertisers, or social reformers), which seem to occupy an ever-increasing portion of our daily news diet. There’s a reason that Darrell Huff’s 1954 classic How to Lie with Statistics (W.W. Norton) remains in print: statistics are easy to misuse, the common techniques of statistical distortion have been around for decades, and the best defense against those who would lie with statistics is to educate yourself so you can spot the lies and stop the lying liars in their tracks.

The Focus of This Book

There are so many statistics books already on the market that you might well wonder why we feel the need to add another to the pile. The primary reason is that we haven’t found any statistics books that answer the needs we have addressed in Statistics in a Nutshell. In fact, if I may wax poetic for a moment, the situation is, to paraphrase the plight of Coleridge’s Ancient Mariner, “books, books everywhere, nor any with which to learn.” The issues we have tried to address with this book are:

  1. The need for a book that focuses on using and understanding statistics in a research or applications context, not as a discrete set of mathematical techniques but as part of the process of reasoning with numbers.

  2. The need to integrate discussion of issues such as measurement and data management into an introductory statistics text.

  3. The need for a book that isn’t focused on a particular subject area. Elementary statistics is largely the same across subjects (a t-test is pretty much the same whether the data comes from medicine, finance, or criminal justice), so there’s no need for a proliferation of texts presenting the same information with slightly different spin.

  4. The need for an introductory statistics book that is compact, inexpensive, and easy for beginners to understand without being condescending or overly simplistic.

So who is the intended audience of Statistics in a Nutshell? We see three in particular:

  1. Students taking introductory statistics classes in high schools, colleges, and universities.

  2. Adults who need to learn statistics as part of their current jobs or in order to be eligible for promotion.

  3. People who are interested in learning about statistics out of intellectual curiosity.

Our focus throughout Statistics in a Nutshell is not on particular techniques, although many are taught within this work, but on statistical reasoning. You might say that our focus is not on doing statistics, but on thinking statistically. What does that mean? It means focusing on the process of thinking with numbers and, more particularly, on thinking about data and using statistics to aid in that process.

Statistics in the Age of Information

It’s become fashionable to say that we’re living in the Age of Information, where so many facts are collected and disseminated that no one could possibly keep up with them. Well, this is one of those clichés that is based on truth: we are drowning in data and the problem is only going to get worse. Wide access to computing technology and electronic means of data storage and dissemination have made information easier to access, which is great from the researcher’s point of view, since you no longer have to travel to a particular library or archive to peruse printed copies of records.

Whether your interest is the U.S. population in 1790, annual oil production and consumption in different countries, or the worldwide burden of disease, an Internet search will point you to data sources that can be accessed electronically, often directly from your home computer. However, data has no meaning in and of itself: it has to be organized and interpreted by human beings. So participating fully in the Information Age requires becoming fluent in understanding data, including the ways it is collected, analyzed, and interpreted. And because the same data can often be interpreted in many ways, to support radically different conclusions, even people who don’t engage in statistical work themselves need to understand how statistics work and how to spot valid versus invalid claims, however solidly they may seem to be backed by numbers.

Organization of This Book

Statistics in a Nutshell is organized into four parts: introductory material (Chapters 1–6) that lays the necessary foundation for the chapters that follow; elementary inferential statistical techniques (Chapters 7–11); more advanced techniques (Chapters 12–16); and specialized techniques (Chapters 17–19).

Here’s a more detailed breakdown of the chapters:

Chapter 1, Basic Concepts of Measurement

Discusses foundational issues for statistics, including levels of measurement, operationalization, proxy measurement, random and systematic error, measures of agreement, and types of bias. Statistics demonstrated include percent agreement and kappa.

Chapter 2, Probability

Introduces the basic vocabulary and laws of probability, including trials, events, independence, mutual exclusivity, the addition and multiplication laws, and conditional probability. Procedures demonstrated include calculation of basic probabilities, permutations and combinations, and Bayes’s theorem.

Chapter 3, Data Management

Discusses practical issues in data management, including procedures to troubleshoot an existing file, methods for storing data electronically, data types, and missing data.

Chapter 4, Descriptive Statistics and Graphics

Explains the differences between descriptive and inferential statistics and between populations and samples, and introduces common measures of central tendency and variability and frequently used graphs and charts. Statistics demonstrated include mean, median, mode, range, interquartile range, variance, and standard deviation. Graphical methods demonstrated include frequency tables, bar charts, pie charts, Pareto charts, stem and leaf plots, boxplots, histograms, scatterplots, and line graphs.

Chapter 5, Research Design

Discusses observational and experimental studies, common elements of good research designs, the steps involved in data collection, types of validity, and methods to limit or eliminate the influence of bias.

Chapter 6, Critiquing Statistics Presented by Others

Offers guidelines for reviewing the use of statistics, including a checklist of questions to ask of any statistical presentation and examples of when legitimate statistical procedures may be manipulated to appear to support questionable conclusions.

Chapter 7, Inferential Statistics

Introduces the basic concepts of inferential statistics, including probability distributions, independent and dependent variables and the different names under which they are known, common sampling designs, the central limit theorem, hypothesis testing, Type I and Type II error, confidence intervals and p-values, and data transformation. Procedures demonstrated include converting raw scores to Z-scores, calculation of binomial probabilities, and the square-root and log data transformations.

Chapter 8, The t-Test

Discusses the t-distribution, the different types of t-tests, and the influence of effect size on power in t-tests. Statistics demonstrated include the one-sample t-test, the two independent samples t-test, the two repeated measures t-test, and the unequal variance t-test.

Chapter 9, The Correlation Coefficient

Introduces the concept of association with graphics displaying different strengths of association between two variables, and discusses common statistics used to measure association. Statistics demonstrated include Pearson’s product-moment correlation, the t-test for statistical significance of Pearson’s correlation, the coefficient of determination, Spearman’s rank-order coefficient, the point-biserial coefficient, and phi.

Chapter 10, Categorical Data

Reviews the concepts of categorical and interval data, including the Likert scale, and introduces the R × C table. Statistics demonstrated include the chi-squared tests for independence, equality of proportions, and goodness of fit, Fisher’s exact test, McNemar’s test, gamma, Kendall’s tau-a, tau-b, and tau-c, and Somers’s d.

Chapter 11, Nonparametric Statistics

Discusses when to use nonparametric rather than parametric statistics, and presents nonparametric statistics for between-subjects and within-subjects designs. Statistics demonstrated include the Wilcoxon Rank Sum and Mann-Whitney U tests, the median test, the Kruskal-Wallis H test, the Wilcoxon matched pairs signed rank test, and the Friedman test.

Chapter 12, Introduction to the General Linear Model

Introduces linear regression and ANOVA through the concept of the General Linear Model, and discusses assumptions made when using these designs. Statistical procedures demonstrated include simple (bivariate) regression, one-way ANOVA, and post-hoc testing.

Chapter 13, Extensions of Analysis of Variance

Discusses more complex ANOVA designs. Statistical procedures demonstrated include two-way and three-way ANOVA, MANOVA, ANCOVA, repeated measures ANOVA, and mixed designs.

Chapter 14, Multiple Linear Regression

Extends the ideas introduced in Chapter 12 to models with multiple predictors. Topics covered include relationships among predictor variables, standardized coefficients, dummy variables, methods of model building, and violations of assumptions of linear regression, including nonlinearity, autocorrelation, and heteroscedasticity.

Chapter 15, Other Types of Regression

Extends the technique of regression to data with binary outcomes (logistic regression) and nonlinear models (polynomial regression), and discusses the problem of overfitting a model.

Chapter 16, Other Statistical Techniques

Demonstrates several advanced statistical procedures, including factor analysis, cluster analysis, discriminant function analysis, and multidimensional scaling, including discussion of the types of problems for which each technique may be useful.

Chapter 17, Business and Quality Improvement Statistics

Demonstrates statistical procedures commonly used in business and quality improvement contexts. Analytical and statistical procedures covered include construction and use of simple and composite indexes, time series, the minimax, maximax, and maximin decision criteria, decision making under risk, decision trees, and control charts.

Chapter 18, Medical and Epidemiological Statistics

Introduces concepts and demonstrates statistical procedures particularly relevant to medicine and epidemiology. Concepts and statistics covered include the definition and use of ratios, proportions, and rates, measures of prevalence and incidence, crude and standardized rates, direct and indirect standardization, measures of risk, confounding, the simple and Mantel-Haenszel odds ratio, and precision, power, and sample size calculations.

Chapter 19, Educational and Psychological Statistics

Introduces concepts and statistical procedures commonly used in the fields of education and psychology. Concepts and procedures demonstrated include percentiles, standardized scores, methods of test construction, the true score model of classical test theory, reliability of a composite test, measures of internal consistency including coefficient alpha, and procedures for item analysis. An overview of item response theory is also provided.

Two appendixes cover topics that are a necessary background to the material covered in the main text, and a third provides references to supplemental reading:

Appendix A

Provides a self-test and review of basic arithmetic and algebra for people whose memory of their last math course is fast receding on the distant horizon. Topics covered include the laws of arithmetic, exponents, roots and logs, methods to solve equations and systems of equations, fractions, factorials, permutations, and combinations.

Appendix B

Provides an introduction to some of the most common computer programs used for statistical applications, demonstrates basic analyses in each program, and discusses their relative strengths and weaknesses. Programs covered include Minitab, SPSS, SAS, and R; the use of Microsoft Excel (not a statistical package) for statistical analysis is also discussed.

Appendix C

An annotated bibliography organized by chapter, which includes published works and websites cited in the text and others that are good starting points for people researching a particular topic.

You should think of these chapters as tools, whose best use depends on the individual reader’s background and needs. Even the introductory chapters may not be relevant immediately to everyone: for instance, many introductory statistics classes do not require students to master topics such as data management or measurement theory. In that case, these chapters can serve as references when the topics become necessary (expertise in data management is often an expectation of research assistants, for instance, although it is rarely directly taught).

Classification of what is “elementary” and what is “advanced” depends on an individual’s background and purposes. We designed Statistics in a Nutshell to answer the needs of many different types of users. For this reason, there’s no perfect way to organize the material to meet everyone’s needs, which brings us to an important point: there’s no reason you should feel the need to read the chapters in the order they are presented here. Statistics presents many chicken-and-egg dilemmas: for instance, you can’t design experiments without knowing what statistics are available to you, but you can’t understand how statistics are used without knowing something about research design. Similarly, it might seem that a chapter on data management would be most useful to individuals who have already done some statistical analysis, but I’ve advised many research assistants and project managers who are put in charge of large data sets before they’ve had a single course in statistics. So use the chapters in the way that best facilitates your specific purposes, and don’t be shy about skipping around and focusing on whatever meets your particular needs.

Some of the later chapters are also specialized and not relevant to everyone, most obviously Chapters 17–19, which are written with particular subject areas in mind. Chapters 15 and 16 also cover topics that are not often included in introductory statistics texts, but that are the statistical procedure of choice in particular contexts. Because we have planned this book to be useful for consumers of statistics and working professionals who deal with statistics even if they don’t compute them themselves, we have included these topics, although beginning students may not feel the need to tackle them in their first statistics course.

It’s wise to keep an open mind regarding what statistics you need to know. You may currently believe that you will never have the need to conduct a nonparametric test or a logistic regression analysis. However, you never know what will come in handy in the future. It’s also a mistake to compartmentalize too much by subject field: because statistical techniques are ultimately about numbers rather than content, techniques developed in one field often prove to be useful in another. For instance, control charts (covered in Chapter 17) were developed in a manufacturing context, but are now used in many fields from medicine to education.

We have included more advanced material in other chapters, when it serves to illustrate a principle or make an interesting point. These sections are clearly identified as digressions from the main thread of the book, and beginners can skip over them without feeling that they are missing any vital concepts of basic statistics.

Symbols Used in This Book

Symbol    Meaning

Names of statistics

μ         Mean of a population
σ         Standard deviation of a population
σ²        Variance of a population
π         Proportion of a population
x̄         Mean of a sample
s         Standard deviation of a sample
s²        Variance of a sample
n         Number of cases in a sample
p         Proportion of a sample
κ         Kappa (measure of agreement)
χ²        Chi-squared (statistic, distribution)

Statistical formulas

Σ         Summation
!         Factorial
C         Combination
P         Permutation
E         Expected value
O         Observed value
x_ij      Value of variable x for case ij

Set theory, Bayes’s theorem

~         Not
|         Conditional probability
∪         Union
∩         Intersection

Other

α         Alpha (significance level; probability of Type I error)
β         Beta (probability of Type II error)
R         Number of rows in a table
C         Number of columns in a table

Conventions Used in This Book

The following typographical conventions are used in this book:

Plain text

Indicates menu titles, menu options, menu buttons, and keyboard accelerators (such as Alt and Ctrl).

Italic

Indicates new terms, URLs, email addresses, filenames, file extensions, pathnames, directories, and Unix utilities.

Constant width

Indicates examples.

Tip

This icon signifies a tip, suggestion, or general note.

We’d Like to Hear From You

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:

http://www.oreilly.com/catalog/9780596510497

To comment or ask technical questions about this book, send email to:

For more information about our books, conferences, Resource Centers, and the O’Reilly Network, see our website at:

http://www.oreilly.com

Safari® Books Online

When you see a Safari® Books Online icon on the cover of your favorite technology book, that means the book is available online through the O’Reilly Network Safari Bookshelf.

Safari offers a solution that’s better than e-books. It’s a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://safari.oreilly.com.

Acknowledgments

Only two authors are listed on the cover of this book, but the contributions of many people played a role in its creation.

Sarah Boslaugh

I would like to thank my agent, Neil Salkind, for his continued guidance and support; my colleagues at Washington University and BJC HealthCare for their willingness to share their wisdom and experience; the crew at O’Reilly, including Mary Treseler, Isabel Kunkle, Rachel Monaghan, and Colleen Gorman; and the statisticians who assisted in the technical review process, especially Dave McArthur at UCLA who is never shy about sharing his suggestions. I would also like to thank all my friends who keep pestering me to explain statistical concepts to them, and thus encouraged me to write this book. On a personal note, I would like to thank my colleague Rand Ross at Washington University for helping me remain sane throughout the writing process, and my husband Dan Peck for being the very model of a modern supportive spouse.

Paul Watters

Firstly, I would like to thank the academics who managed to make learning statistics interesting: Professor Rachel Heath (University of Newcastle) and Mr. James Alexander (University of Tasmania). An inspirational teacher is a rare and wonderful thing, especially in statistics! Secondly, a big thank you to my colleagues at the School of ITMS at the University of Ballarat, and our partners at Westpac, IBM, and the Victorian government, for their ongoing research support. Finally, I would like to acknowledge the patience of my wife Maya, and daughters Arwen and Bounty, as writing a book invariably takes away time from family.
