Data Analysis and Graphics Using R, Third Edition

Book Description

Discover what you can do with R! Introducing the R system, covering standard regression methods, then tackling more advanced topics, this book guides users through the practical, powerful tools that the R system provides. The emphasis is on hands-on analysis, graphical display, and interpretation of data. The many worked examples, from real-world research, are accompanied by commentary on what is done and why. The companion website has code and datasets, allowing readers to reproduce all analyses, along with solutions to selected exercises and updates. Assuming basic statistical knowledge and some experience with data analysis (but not R), the book is ideal for research scientists, final-year undergraduate or graduate-level students of applied statistics, and practising statisticians. It is both for learning and for reference. This third edition expands upon topics such as Bayesian inference for regression, errors in variables, generalized linear mixed models, and random forests.
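The datasets used in the book's worked examples are distributed in the DAAG package on CRAN (the book's companion package; dataset names below, such as `roller`, are examples and may differ between editions). A minimal session to get started:

```r
# Install the companion DAAG package from CRAN (one-time step)
install.packages("DAAG")

# Load the package and inspect one of the example datasets
library(DAAG)
head(roller)   # lawn roller data, used in the single-predictor regression chapter
?roller        # documentation for the dataset
```

Code to reproduce the analyses, along with updates and solutions to selected exercises, is on the companion website referred to above.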

Table of Contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright
  5. Contents
  6. Preface
  7. Content – how the chapters fit together
  8. 1. A brief introduction to R
    1. 1.1 An overview of R
      1. 1.1.1 A short R session
      2. 1.1.2 The uses of R
      3. 1.1.3 Online help
      4. 1.1.4 Input of data from a file
      5. 1.1.5 R packages
      6. 1.1.6 Further steps in learning R
    2. 1.2 Vectors, factors, and univariate time series
      1. 1.2.1 Vectors
      2. 1.2.2 Concatenation – joining vector objects
      3. 1.2.3 The use of relational operators to compare vector elements
      4. 1.2.4 The use of square brackets to extract subsets of vectors
      5. 1.2.5 Patterned data
      6. 1.2.6 Missing values
      7. 1.2.7 Factors
      8. 1.2.8 Time series
    3. 1.3 Data frames and matrices
      1. 1.3.1 Accessing the columns of data frames – with() and attach()
      2. 1.3.2 Aggregation, stacking, and unstacking
      3. 1.3.3* Data frames and matrices
    4. 1.4 Functions, operators, and loops
      1. 1.4.1 Common useful built-in functions
      2. 1.4.2 Generic functions, and the class of an object
      3. 1.4.3 User-written functions
      4. 1.4.4 if Statements
      5. 1.4.5 Selection and matching
      6. 1.4.6 Functions for working with missing values
      7. 1.4.7* Looping
    5. 1.5 Graphics in R
      1. 1.5.1 The function plot() and allied functions
      2. 1.5.2 The use of color
      3. 1.5.3 The importance of aspect ratio
      4. 1.5.4 Dimensions and other settings for graphics devices
      5. 1.5.5 The plotting of expressions and mathematical symbols
      6. 1.5.6 Identification and location on the figure region
      7. 1.5.7 Plot methods for objects other than vectors
      8. 1.5.8 Lattice (trellis) graphics
      9. 1.5.9 Good and bad graphs
      10. 1.5.10 Further information on graphics
    6. 1.6 Additional points on the use of R
    7. 1.7 Recap
    8. 1.8 Further reading
    9. 1.9 Exercises
  9. 2. Styles of data analysis
    1. 2.1 Revealing views of the data
      1. 2.1.1 Views of a single sample
      2. 2.1.2 Patterns in univariate time series
      3. 2.1.3 Patterns in bivariate data
      4. 2.1.4 Patterns in grouped data – lengths of cuckoo eggs
      5. 2.1.5* Multiple variables and times
      6. 2.1.6 Scatterplots, broken down by multiple factors
      7. 2.1.7 What to look for in plots
    2. 2.2 Data summary
      1. 2.2.1 Counts
      2. 2.2.2 Summaries of information from data frames
      3. 2.2.3 Standard deviation and inter-quartile range
      4. 2.2.4 Correlation
    3. 2.3 Statistical analysis questions, aims, and strategies
      1. 2.3.1 How relevant and how reliable are the data?
      2. 2.3.2 How will results be used?
      3. 2.3.3 Formal and informal assessments
      4. 2.3.4 Statistical analysis strategies
      5. 2.3.5 Planning the formal analysis
      6. 2.3.6 Changes to the intended plan of analysis
    4. 2.4 Recap
    5. 2.5 Further reading
    6. 2.6 Exercises
  10. 3. Statistical models
    1. 3.1 Statistical models
      1. 3.1.1 Incorporation of an error or noise component
      2. 3.1.2 Fitting models – the model formula
    2. 3.2 Distributions: models for the random component
      1. 3.2.1 Discrete distributions – models for counts
      2. 3.2.2 Continuous distributions
    3. 3.3 Simulation of random numbers and random samples
      1. 3.3.1 Sampling from the normal and other continuous distributions
      2. 3.3.2 Simulation of regression data
      3. 3.3.3 Simulation of the sampling distribution of the mean
      4. 3.3.4 Sampling from finite populations
    4. 3.4 Model assumptions
      1. 3.4.1 Random sampling assumptions – independence
      2. 3.4.2 Checks for normality
      3. 3.4.3 Checking other model assumptions
      4. 3.4.4 Are non-parametric methods the answer?
      5. 3.4.5 Why models matter – adding across contingency tables
    5. 3.5 Recap
    6. 3.6 Further reading
    7. 3.7 Exercises
  11. 4. A review of inference concepts
    1. 4.1 Basic concepts of estimation
      1. 4.1.1 Population parameters and sample statistics
      2. 4.1.2 Sampling distributions
      3. 4.1.3 Assessing accuracy – the standard error
      4. 4.1.4 The standard error for the difference of means
      5. 4.1.5* The standard error of the median
      6. 4.1.6 The sampling distribution of the t-statistic
    2. 4.2 Confidence intervals and tests of hypotheses
      1. 4.2.1 A summary of one- and two-sample calculations
      2. 4.2.2 Confidence intervals and tests for proportions
      3. 4.2.3 Confidence intervals for the correlation
      4. 4.2.4 Confidence intervals versus hypothesis tests
    3. 4.3 Contingency tables
      1. 4.3.1 Rare and endangered plant species
      2. 4.3.2 Additional notes
    4. 4.4 One-way unstructured comparisons
      1. 4.4.1 Multiple comparisons
      2. 4.4.2 Data with a two-way structure, i.e., two factors
      3. 4.4.3 Presentation issues
    5. 4.5 Response curves
    6. 4.6 Data with a nested variation structure
      1. 4.6.1 Degrees of freedom considerations
      2. 4.6.2 General multi-way analysis of variance designs
    7. 4.7 Resampling methods for standard errors, tests, and confidence intervals
      1. 4.7.1 The one-sample permutation test
      2. 4.7.2 The two-sample permutation test
      3. 4.7.3* Estimating the standard error of the median: bootstrapping
      4. 4.7.4 Bootstrap estimates of confidence intervals
    8. 4.8* Theories of inference
      1. 4.8.1 Maximum likelihood estimation
      2. 4.8.2 Bayesian estimation
      3. 4.8.3 If there is strong prior information, use it!
    9. 4.9 Recap
    10. 4.10 Further reading
    11. 4.11 Exercises
  12. 5. Regression with a single predictor
    1. 5.1 Fitting a line to data
      1. 5.1.1 Summary information – lawn roller example
      2. 5.1.2 Residual plots
      3. 5.1.3 Iron slag example: is there a pattern in the residuals?
      4. 5.1.4 The analysis of variance table
    2. 5.2 Outliers, influence, and robust regression
    3. 5.3 Standard errors and confidence intervals
      1. 5.3.1 Confidence intervals and tests for the slope
      2. 5.3.2 SEs and confidence intervals for predicted values
      3. 5.3.3* Implications for design
    4. 5.4 Assessing predictive accuracy
      1. 5.4.1 Training/test sets and cross-validation
      2. 5.4.2 Cross-validation – an example
      3. 5.4.3* Bootstrapping
    5. 5.5 Regression versus qualitative anova comparisons – issues of power
    6. 5.6 Logarithmic and other transformations
      1. 5.6.1* A note on power transformations
      2. 5.6.2 Size and shape data – allometric growth
    7. 5.7 There are two regression lines!
    8. 5.8 The model matrix in regression
    9. 5.9* Bayesian regression estimation using the MCMCpack package
    10. 5.10 Recap
    11. 5.11 Methodological references
    12. 5.12 Exercises
  13. 6. Multiple linear regression
    1. 6.1 Basic ideas: a book weight example
      1. 6.1.1 Omission of the intercept term
      2. 6.1.2 Diagnostic plots
    2. 6.2 The interpretation of model coefficients
      1. 6.2.1 Times for Northern Irish hill races
      2. 6.2.2 Plots that show the contribution of individual terms
      3. 6.2.3 Mouse brain weight example
      4. 6.2.4 Book dimensions, density, and book weight
    3. 6.3 Multiple regression assumptions, diagnostics, and efficacy measures
      1. 6.3.1 Outliers, leverage, influence, and Cook’s distance
      2. 6.3.2 Assessment and comparison of regression models
      3. 6.3.3 How accurately does the equation predict?
    4. 6.4 A strategy for fitting multiple regression models
      1. 6.4.1 Suggested steps
      2. 6.4.2 Diagnostic checks
      3. 6.4.3 An example – Scottish hill race data
    5. 6.5 Problems with many explanatory variables
      1. 6.5.1 Variable selection issues
    6. 6.6 Multicollinearity
      1. 6.6.1 The variance inflation factor
      2. 6.6.2 Remedies for multicollinearity
    7. 6.7 Errors in x
    8. 6.8 Multiple regression models – additional points
      1. 6.8.1 Confusion between explanatory and response variables
      2. 6.8.2 Missing explanatory variables
      3. 6.8.3* The use of transformations
      4. 6.8.4* Non-linear methods – an alternative to transformation?
    9. 6.9 Recap
    10. 6.10 Further reading
    11. 6.11 Exercises
  14. 7. Exploiting the linear model framework
    1. 7.1 Levels of a factor – using indicator variables
      1. 7.1.1 Example – sugar weight
      2. 7.1.2 Different choices for the model matrix when there are factors
    2. 7.2 Block designs and balanced incomplete block designs
      1. 7.2.1 Analysis of the rice data, allowing for block effects
      2. 7.2.2 A balanced incomplete block design
    3. 7.3 Fitting multiple lines
    4. 7.4 Polynomial regression
      1. 7.4.1 Issues in the choice of model
    5. 7.5* Methods for passing smooth curves through data
      1. 7.5.1 Scatterplot smoothing – regression splines
      2. 7.5.2* Roughness penalty methods and generalized additive models
      3. 7.5.3 Distributional assumptions for automatic choice of roughness penalty
      4. 7.5.4 Other smoothing methods
    6. 7.6 Smoothing with multiple explanatory variables
      1. 7.6.1 An additive model with two smooth terms
      2. 7.6.2* A smooth surface
    7. 7.7 Further reading
    8. 7.8 Exercises
  15. 8. Generalized linear models and survival analysis
    1. 8.1 Generalized linear models
      1. 8.1.1 Transformation of the expected value on the left
      2. 8.1.2 Noise terms need not be normal
      3. 8.1.3 Log odds in contingency tables
      4. 8.1.4 Logistic regression with a continuous explanatory variable
    2. 8.2 Logistic multiple regression
      1. 8.2.1 Selection of model terms, and fitting the model
      2. 8.2.2 Fitted values
      3. 8.2.3 A plot of contributions of explanatory variables
      4. 8.2.4 Cross-validation estimates of predictive accuracy
    3. 8.3 Logistic models for categorical data – an example
    4. 8.4 Poisson and quasi-Poisson regression
      1. 8.4.1 Data on aberrant crypt foci
      2. 8.4.2 Moth habitat example
    5. 8.5 Additional notes on generalized linear models
      1. 8.5.1* Residuals, and estimating the dispersion
      2. 8.5.2 Standard errors and z- or t-statistics for binomial models
      3. 8.5.3 Leverage for binomial models
    6. 8.6 Models with an ordered categorical or categorical response
      1. 8.6.1 Ordinal regression models
      2. 8.6.2* Loglinear models
    7. 8.7 Survival analysis
      1. 8.7.1 Analysis of the Aids2 data
      2. 8.7.2 Right-censoring prior to the termination of the study
      3. 8.7.3 The survival curve for male homosexuals
      4. 8.7.4 Hazard rates
      5. 8.7.5 The Cox proportional hazards model
    8. 8.8 Transformations for count data
    9. 8.9 Further reading
    10. 8.10 Exercises
  16. 9. Time series models
    1. 9.1 Time series – some basic ideas
      1. 9.1.1 Preliminary graphical explorations
      2. 9.1.2 The autocorrelation and partial autocorrelation function
      3. 9.1.3 Autoregressive models
      4. 9.1.4* Autoregressive moving average models – theory
      5. 9.1.5 Automatic model selection?
      6. 9.1.6 A time series forecast
    2. 9.2* Regression modeling with ARIMA errors
    3. 9.3* Non-linear time series
    4. 9.4 Further reading
    5. 9.5 Exercises
  17. 10. Multi-level models and repeated measures
    1. 10.1 A one-way random effects model
      1. 10.1.1 Analysis with aov()
      2. 10.1.2 A more formal approach
      3. 10.1.3 Analysis using lmer()
    2. 10.2 Survey data, with clustering
      1. 10.2.1 Alternative models
      2. 10.2.2 Instructive, though faulty, analyses
      3. 10.2.3 Predictive accuracy
    3. 10.3 A multi-level experimental design
      1. 10.3.1 The anova table
      2. 10.3.2 Expected values of mean squares
      3. 10.3.3* The analysis of variance sums of squares breakdown
      4. 10.3.4 The variance components
      5. 10.3.5 The mixed model analysis
      6. 10.3.6 Predictive accuracy
    4. 10.4 Within- and between-subject effects
      1. 10.4.1 Model selection
      2. 10.4.2 Estimates of model parameters
    5. 10.5 A generalized linear mixed model
    6. 10.6 Repeated measures in time
      1. 10.6.1 Example – random variation between profiles
      2. 10.6.2 Orthodontic measurements on children
    7. 10.7 Further notes on multi-level and other models with correlated errors
      1. 10.7.1 Different sources of variance – complication or focus of interest?
      2. 10.7.2 Predictions from models with a complex error structure
      3. 10.7.3 An historical perspective on multi-level models
      4. 10.7.4 Meta-analysis
      5. 10.7.5 Functional data analysis
      6. 10.7.6 Error structure in explanatory variables
    8. 10.8 Recap
    9. 10.9 Further reading
    10. 10.10 Exercises
  18. 11. Tree-based classification and regression
    1. 11.1 The uses of tree-based methods
      1. 11.1.1 Problems for which tree-based regression may be used
    2. 11.2 Detecting email spam – an example
      1. 11.2.1 Choosing the number of splits
    3. 11.3 Terminology and methodology
      1. 11.3.1 Choosing the split – regression trees
      2. 11.3.2 Within and between sums of squares
      3. 11.3.3 Choosing the split – classification trees
      4. 11.3.4 Tree-based regression versus loess regression smoothing
    4. 11.4 Predictive accuracy and the cost–complexity trade-off
      1. 11.4.1 Cross-validation
      2. 11.4.2 The cost–complexity parameter
      3. 11.4.3 Prediction error versus tree size
    5. 11.5 Data for female heart attack patients
      1. 11.5.1 The one-standard-deviation rule
      2. 11.5.2 Printed information on each split
    6. 11.6 Detecting email spam – the optimal tree
    7. 11.7 The randomForest package
    8. 11.8 Additional notes on tree-based methods
    9. 11.9 Further reading and extensions
    10. 11.10 Exercises
  19. 12. Multivariate data exploration and discrimination
    1. 12.1 Multivariate exploratory data analysis
      1. 12.1.1 Scatterplot matrices
      2. 12.1.2 Principal components analysis
      3. 12.1.3 Multi-dimensional scaling
    2. 12.2 Discriminant analysis
      1. 12.2.1 Example – plant architecture
      2. 12.2.2 Logistic discriminant analysis
      3. 12.2.3 Linear discriminant analysis
      4. 12.2.4 An example with more than two groups
    3. 12.3* High-dimensional data, classification, and plots
      1. 12.3.1 Classifications and associated graphs
      2. 12.3.2 Flawed graphs
      3. 12.3.3 Accuracies and scores for test data
      4. 12.3.4 Graphs derived from the cross-validation process
    4. 12.4 Further reading
    5. 12.5 Exercises
  20. 13. Regression on principal component or discriminant scores
    1. 13.1 Principal component scores in regression
    2. 13.2* Propensity scores in regression comparisons – labor training data
      1. 13.2.1 Regression comparisons
      2. 13.2.2 A strategy that uses propensity scores
    3. 13.3 Further reading
    4. 13.4 Exercises
  21. 14. The R system – additional topics
    1. 14.1 Graphical user interfaces to R
      1. 14.1.1 The R Commander’s interface – a guide to getting started
      2. 14.1.2 The rattle GUI
      3. 14.1.3 The creation of simple GUIs – the fgui package
    2. 14.2 Working directories, workspaces, and the search list
      1. 14.2.1* The search path
      2. 14.2.2 Workspace management
      3. 14.2.3 Utility functions
    3. 14.3 R system configuration
      1. 14.3.1 The R Windows installation directory tree
      2. 14.3.2 The library directories
      3. 14.3.3 The startup mechanism
    4. 14.4 Data input and output
      1. 14.4.1 Input of data
      2. 14.4.2 Data output
      3. 14.4.3 Database connections
    5. 14.5 Functions and operators – some further details
      1. 14.5.1 Function arguments
      2. 14.5.2 Character string and vector functions
      3. 14.5.3 Anonymous functions
      4. 14.5.4 Functions for working with dates (and times)
      5. 14.5.5 Creating groups
      6. 14.5.6 Logical operators
    6. 14.6 Factors
    7. 14.7 Missing values
    8. 14.8* Matrices and arrays
      1. 14.8.1 Matrix arithmetic
      2. 14.8.2 Outer products
      3. 14.8.3 Arrays
    9. 14.9 Manipulations with lists, data frames, matrices, and time series
      1. 14.9.1 Lists – an extension of the notion of “vector”
      2. 14.9.2 Changing the shape of data frames (or matrices)
      3. 14.9.3* Merging data frames – merge()
      4. 14.9.4 Joining data frames, matrices, and vectors – cbind()
      5. 14.9.5 The apply family of functions
      6. 14.9.6 Splitting vectors and data frames into lists – split()
      7. 14.9.7 Multivariate time series
    10. 14.10 Classes and methods
      1. 14.10.1 Printing and summarizing model objects
      2. 14.10.2 Extracting information from model objects
      3. 14.10.3 S4 classes and methods
    11. 14.11 Manipulation of language constructs
      1. 14.11.1 Model and graphics formulae
      2. 14.11.2 The use of a list to pass arguments
      3. 14.11.3 Expressions
      4. 14.11.4 Environments
      5. 14.11.5 Function environments and lazy evaluation
    12. 14.12* Creation of R packages
    13. 14.13 Document preparation – Sweave() and xtable()
    14. 14.14 Further reading
    15. 14.15 Exercises
  22. 15. Graphs in R
    1. 15.1 Hardcopy graphics devices
    2. 15.2 Plotting characters, symbols, line types, and colors
    3. 15.3 Formatting and plotting of text and equations
      1. 15.3.1 Symbolic substitution of symbols in an expression
      2. 15.3.2 Plotting expressions in parallel
    4. 15.4 Multiple graphs on a single graphics page
    5. 15.5 Lattice graphics and the grid package
      1. 15.5.1 Groups within data, and/or columns in parallel
      2. 15.5.2 Lattice parameter settings
      3. 15.5.3 Panel functions, strip functions, strip labels, and other annotation
      4. 15.5.4 Interaction with lattice (and other) plots – the playwith package
      5. 15.5.5 Interaction with lattice plots – focus, interact, unfocus
      6. 15.5.6 Overlaid plots with different scales
    6. 15.6 An implementation of Wilkinson’s Grammar of Graphics
    7. 15.7 Dynamic graphics – the rgl and rggobi packages
    8. 15.8 Further reading
  23. Epilogue
  24. References
  25. Index of R symbols and functions
  26. Index of terms
  27. Index of authors
  28. Plates