R in a Nutshell, 2nd Edition

By Joseph Adler. Published by O'Reilly Media, Inc.

Subset Selection and Shrinkage Methods

Modeling functions like lm will include every variable specified in the formula, calculating a coefficient for each one. Unfortunately, this means that lm may calculate coefficients for variables that aren’t needed. You can manually tune a model using diagnostics like summary and lm.influence. However, you can also use some other statistical techniques to reduce the effect of insignificant variables or remove them from a model altogether.

Stepwise Variable Selection

A simple technique for selecting the most important variables is stepwise variable selection. The stepwise algorithm works by repeatedly adding or removing variables from the model, trying to “improve” the model at each step. When the algorithm can no longer improve the model by adding or subtracting variables, it stops and returns the new (and usually smaller) model.

Note that “improvement” does not just mean reducing the residual sum of squares (RSS) for the fitted model. Adding a variable to a model will never increase the RSS (see a statistics book for an explanation of why), but it does increase model complexity. Typically, AIC (Akaike’s information criterion) is used to measure the value of each additional variable. The AIC is defined as AIC = −2 × log(L) + k × edf, where L is the likelihood, edf is the equivalent degrees of freedom, and k is the penalty applied per degree of freedom (k = 2 gives the standard AIC). A lower AIC indicates a better trade-off between fit and complexity.
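The trade-off between RSS and AIC can be checked directly. The following is a minimal sketch, assuming the built-in mtcars data set and two arbitrarily chosen nested models; manual_aic is a hypothetical helper written for this illustration, not a function from R. It rebuilds the AIC from the definition with k = 2 and uses the degrees of freedom reported by logLik, which for an lm fit count the coefficients plus the error variance.

```r
# Sketch: compare two nested linear models on the built-in mtcars data.
small <- lm(mpg ~ wt, data = mtcars)
big   <- lm(mpg ~ wt + qsec, data = mtcars)

# For an lm fit, deviance() returns the residual sum of squares.
rss_small <- deviance(small)
rss_big   <- deviance(big)

# Hypothetical helper: AIC = -2 * log(L) + k * edf.
manual_aic <- function(fit, k = 2) {
  ll <- logLik(fit)
  -2 * as.numeric(ll) + k * attr(ll, "df")
}

aic_small <- manual_aic(small)
aic_big   <- manual_aic(big)

# The larger model cannot have a larger RSS than the smaller one.
print(c(rss_small = rss_small, rss_big = rss_big))

# The hand-computed values should agree with R's AIC() function.
print(c(manual = aic_small, builtin = AIC(small)))
print(c(manual = aic_big, builtin = AIC(big)))
```

Whether the larger model also has the lower AIC depends on the data; the extra variable is only worth keeping when the gain in likelihood outweighs the penalty of k per added degree of freedom.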

In R, you perform stepwise selection through the step function:

step(object, scope, scale = 0, direction = c("both", "backward", "forward"), ...
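As an illustration of a typical call, here is a minimal sketch of backward selection, assuming the built-in mtcars data set and a starting model that uses every other column as a predictor. The trace = 0 argument, which suppresses the step-by-step log, comes from the stats package documentation for step rather than from the truncated signature shown above.

```r
# Sketch: backward stepwise selection on the built-in mtcars data.
# Start from a model that includes every available predictor.
full <- lm(mpg ~ ., data = mtcars)

# Drop variables one at a time while doing so lowers the AIC.
# trace = 0 silences the per-step output.
reduced <- step(full, direction = "backward", trace = 0)

# Inspect which terms survived.
print(formula(reduced))

# extractAIC() returns c(edf, AIC) on the scale that step() uses.
# For lm fits this differs from AIC() by an additive constant,
# so compare models using the same function throughout.
full_aic    <- extractAIC(full)
reduced_aic <- extractAIC(reduced)
print(full_aic)
print(reduced_aic)
```

The returned object is an ordinary fitted model, so summary, predict, and the other usual functions work on it unchanged.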
