Data Science in R

Book Description

Effectively Access, Transform, Manipulate, Visualize, and Reason about Data and Computation

Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving illustrates the details involved in solving real computational problems encountered in data analysis. It reveals the dynamic and iterative process by which data analysts approach a problem and reason about different ways of implementing solutions.

The book’s collection of projects, comprehensive sample solutions, and follow-up exercises encompass practical topics pertaining to data processing, including:

  • Non-standard, complex data formats, such as robot logs and email messages
  • Text processing and regular expressions
  • Newer technologies, such as Web scraping, Web services, Keyhole Markup Language (KML), and Google Earth
  • Statistical methods, such as classification trees, k-nearest neighbors, and naïve Bayes
  • Visualization and exploratory data analysis
  • Relational databases and Structured Query Language (SQL)
  • Simulation
  • Algorithm implementation
  • Large data and efficiency

Suitable for self-study or as supplementary reading in a statistical computing course, the book enables instructors to incorporate interesting problems into their courses so that students gain valuable experience and data science skills. Students learn how to acquire and work with unstructured or semistructured data as well as how to narrow down and carefully frame the questions of interest about the data.

Blending computational details with statistical and data analysis concepts, this book provides readers with an understanding of how professional data scientists think about daily computational tasks. It will improve readers’ computational reasoning about real-world data analyses.

Table of Contents

  1. Preliminaries
  2. Series
  3. Dedication
  4. Preface
    1. Goals of the Book
    2. Using These Case Studies in Statistical Computing Courses
    3. Broad Topics
    4. Target Audience
    5. The Themes of the Three Parts
    6. Typographic Conventions
    7. Available Materials
  5. Acknowledgments
  6. Authors
  7. Co-Authors
  8. Part I Data Manipulation and Modeling
    1. Chapter 1 Predicting Location via Indoor Positioning Systems
      1. 1.1 Introduction
        1. 1.1.1 Computational Topics
      2. 1.2 The Raw Data
        1. 1.2.1 Processing the Raw Data
      3. 1.3 Cleaning the Data and Building a Representation for Analysis
        1. 1.3.1 Exploring Orientation
        2. 1.3.2 Exploring MAC Addresses
        3. 1.3.3 Exploring the Position of the Hand-Held Device
        4. 1.3.4 Creating a Function to Prepare the Data
      4. 1.4 Signal Strength Analysis
        1. 1.4.1 Distribution of Signal Strength
        2. 1.4.2 The Relationship between Signal and Distance
      5. 1.5 Nearest Neighbor Methods to Predict Location
        1. 1.5.1 Preparing the Test Data
        2. 1.5.2 Choice of Orientation
        3. 1.5.3 Finding the Nearest Neighbors
        4. 1.5.4 Cross-Validation and Choice of k
      6. 1.6 Exercises
      7. Bibliography
        1. Figure 1.1
        2. Figure 1.2
        3. Figure 1.3
        4. Figure 1.4
        5. Figure 1.5
        6. Figure 1.6
        7. Figure 1.7
        8. Figure 1.8
        9. Figure 1.9
        10. Figure 1.10
        11. Figure 1.11
        12. Figure 1.12
        13. Figure 1.13
    2. Chapter 2 Modeling Runners' Times in the Cherry Blossom Race
      1. 2.1 Introduction
        1. 2.1.1 Computational Topics
      2. 2.2 Reading Tables of Race Results into R
      3. 2.3 Data Cleaning and Reformatting Variables
      4. 2.4 Exploring the Run Time for All Male Runners
        1. 2.4.1 Making Plots with Many Observations
        2. 2.4.2 Fitting Models to Average Performance
        3. 2.4.3 Cross-Sectional Data and Covariates
      5. 2.5 Constructing a Record for an Individual Runner across Years
      6. 2.6 Modeling the Change in Running Time for Individuals
      7. 2.7 Scraping Race Results from the Web
      8. 2.8 Exercises
      9. Bibliography
        1. Figure 2.1
        2. Figure 2.2
        3. Figure 2.3
        4. Figure 2.4
        5. Figure 2.5
        6. Figure 2.6
        7. Figure 2.7
        8. Figure 2.8
        9. Figure 2.9
        10. Figure 2.10
        11. Figure 2.11
        12. Figure 2.12
        13. Figure 2.13
        14. Figure 2.14
        15. Figure 2.15
        16. Figure 2.16
        17. Figure 2.17
        18. Figure 2.18
        19. Figure 2.19
        20. Figure 2.20
        21. Figure 2.21
        1. Table 1.1
    3. Chapter 3 Using Statistics to Identify Spam
      1. 3.1 Introduction
        1. 3.1.1 Computational Topics
      2. 3.2 Anatomy of an email Message
      3. 3.3 Reading the email Messages
      4. 3.4 Text Mining and Naïve Bayes Classification
      5. 3.5 Finding the Words in a Message
        1. 3.5.1 Splitting the Message into Its Header and Body
        2. 3.5.2 Removing Attachments from the Message Body
        3. 3.5.3 Extracting Words from the Message Body
        4. 3.5.4 Completing the Data Preparation Process
      6. 3.6 Implementing the Naïve Bayes Classifier
        1. 3.6.1 Test and Training Data
        2. 3.6.2 Probability Estimates from Training Data
        3. 3.6.3 Classifying New Messages
        4. 3.6.4 Computational Considerations
      7. 3.7 Recursive Partitioning and Classification Trees
      8. 3.8 Organizing an email Message into an R Data Structure
        1. 3.8.1 Processing the Header
        2. 3.8.2 Processing Attachments
        3. 3.8.3 Testing Our Code on More email Data
        4. 3.8.4 Completing the Process
      9. 3.9 Deriving Variables from the email Message
        1. 3.9.1 Checking Our Code for Errors
      10. 3.10 Exploring the email Feature Set
      11. 3.11 Fitting the rpart() Model to the email Data
      12. 3.12 Exercises
      13. Bibliography
        1. Figure 3.1
        2. Figure 3.2
        3. Figure 3.3
        4. Figure 3.4
        5. Figure 3.5
        6. Figure 3.6
        7. Figure 3.7
        8. Figure 3.8
        9. Figure 3.9
        1. Table 3.1
    4. Chapter 4 Processing Robot and Sensor Log Files: Seeking a Circular Target
      1. 4.1 Description
        1. 4.1.1 Computational Topics
      2. 4.2 The Data
        1. 4.2.1 Reading an Entire Log File
        2. 4.2.2 Exploring Log Files
        3. 4.2.3 Visualizing the Path
        4. 4.2.4 Exploring a "Look"
        5. 4.2.5 The Error Distribution for Range Values
      3. 4.3 Detecting a Circular Target
        1. 4.3.1 Connecting Segments Behind the Robot
        2. 4.3.2 Determining If a Segment Corresponds to a Circle
      4. 4.4 Detecting the Target with Streaming Data in Real Time
      5. Bibliography
        1. Figure 4.1
        2. Figure 4.2
        3. Figure 4.3
        4. Figure 4.4
        5. Figure 4.5
        6. Figure 4.6
        7. Figure 4.7
        8. Figure 4.8
        9. Figure 4.9
        10. Figure 4.10
        11. Figure 4.11
        12. Figure 4.12
        13. Figure 4.13
        14. Figure 4.14
        15. Figure 4.15
        16. Figure 4.16
    5. Chapter 5 Strategies for Analyzing a 12-Gigabyte Data Set: Airline Flight Delays
      1. 5.1 Introduction
        1. 5.1.1 Computational Topics
      2. 5.2 Acquiring the Airline Data Set
      3. 5.3 Computing with Massive Data: Getting Flight Delay Counts
        1. 5.3.1 The R Programming Environment
        2. 5.3.2 The UNIX Shell
        3. 5.3.3 An SQL Database with R
        4. 5.3.4 The bigmemory Package with R
      4. 5.4 Explorations Using Parallel Computing: The Distribution of Flight Delays
        1. 5.4.1 Writing a Parallelizable Loop with foreach
        2. 5.4.2 Using the Split-Apply-Combine Approach for Better Performance
        3. 5.4.3 Using Split-Apply-Combine to Find the Best Time to Fly
      5. 5.5 From Exploration to Model: Do Older Planes Suffer Greater Delays?
      6. Bibliography
        1. Figure 5.1
  9. Part II Simulation Studies
    1. Chapter 6 Pairs Trading
      1. 6.1 The Problem
        1. 6.1.1 Computational Topics
      2. 6.2 The Data Format
      3. 6.3 Reading the Financial Data
      4. 6.4 Visualizing the Time Series
      5. 6.5 Finding Opening and Closing Positions
        1. 6.5.1 Identifying a Position
        2. 6.5.2 Displaying Positions
        3. 6.5.3 Finding All Positions
        4. 6.5.4 Computing the Profit for a Position
        5. 6.5.5 Finding the Optimal Value for k
      6. 6.6 Simulation Study
        1. 6.6.1 Simulating the Stock Price Series
        2. 6.6.2 Making stockSim() Faster
        3. Bibliography
          1. Figure 6.1
          2. Figure 6.2
          3. Figure 6.3
          4. Figure 6.4
          5. Figure 6.5
          6. Figure 6.6
          7. Figure 6.7
          8. Figure 6.8
          9. Figure 6.9
    2. Chapter 7 Simulation Study of a Branching Process
      1. 7.1 Introduction
        1. 7.1.1 The Monte Carlo Method
        2. 7.1.2 Computational Topics
      2. 7.2 Exploring the Random Process
      3. 7.3 Generating Offspring
        1. 7.3.1 Checking the Results
        2. 7.3.2 Considering Alternative Implementations
      4. 7.4 Profiling and Improving Our Code
      5. 7.5 From One Job's Offspring to an Entire Generation
      6. 7.6 Unit Testing
      7. 7.7 A Structure for the Function's Return Value
      8. 7.8 The Family Tree: Simulating the Branching Process
      9. 7.9 Replicating the Simulation
        1. 7.9.1 Analyzing the Simulation Results
      10. 7.10 Exercises
      11. Bibliography
        1. Figure 7.1
        2. Figure 7.2
        3. Figure 7.3
        4. Figure 7.4
        5. Figure 7.5
        6. Figure 7.6
        7. Figure 7.7
        8. Figure 7.8
        9. Figure 7.9
    3. Chapter 8 A Self-Organizing Dynamic System with a Phase Transition
      1. 8.1 Introduction and Motivation
        1. 8.1.1 Computational Topics
      2. 8.2 The Model
        1. 8.2.1 The Order Cars Move
      3. 8.3 Implementing the BML Model
        1. 8.3.1 Creating the Initial Grid Configuration
        2. 8.3.2 Testing the Grid Creation Function
        3. 8.3.3 Displaying the Grid
        4. 8.3.4 Visualizing the Grid
        5. 8.3.5 Simple and Convenient Object-Oriented Programming
        6. 8.3.6 Moving the Cars
      4. 8.4 Evaluating the Performance of the Code
      5. 8.5 Implementing the BML Model in C
        1. 8.5.1 The Algorithm in C
        2. 8.5.2 Compiling, Loading, and Calling the C Code
      6. 8.6 Running the Simulations
        1. 8.6.1 Exploring Car Velocity
      7. 8.7 Experimental Compilation
      8. Bibliography
        1. Figure 8.1
        2. Figure 8.2
        3. Figure 8.3
        4. Figure 8.4
        5. Figure 8.5
        6. Figure 8.6
        7. Figure 8.7
        8. Figure 8.8
        9. Figure 8.9
        10. Figure 8.10
        11. Figure 8.11
        12. Figure 8.12
        13. Figure 8.13
    4. Chapter 9 Simulating Blackjack
      1. 9.1 Introduction
        1. 9.1.1 Computational Topics
      2. 9.2 Blackjack Basics
        1. 9.2.1 Testing Functions
      3. 9.3 Playing a Hand of Blackjack
        1. 9.3.1 Creating Functions for the Player's Actions
      4. 9.4 Strategies for Playing
        1. 9.4.1 Developing the Optimal Strategy
      5. 9.5 Playing Many Games
      6. 9.6 A More Accurate Card Dealer Shoe
      7. 9.7 Counting Cards
      8. 9.8 Putting It All Together
      9. 9.9 Exercises
      10. Bibliography
        1. Figure 9.1
        2. Figure 9.2
        3. Figure 9.3
        4. Figure 9.4
        1. Table 9.1
        2. Table 9.2
  10. Part III Data and Web Technologies
    1. Chapter 10 Baseball: Exploring Data in a Relational Database
      1. 10.1 Introduction
        1. 10.1.1 Computational Topics
      2. 10.2 Sean Lahman's Database
        1. 10.2.1 Connecting to the Baseball Database from within R
      3. 10.3 Aggregating Salaries into Payroll
      4. 10.4 Merging Payroll Data with Information in Other Tables
        1. 10.4.1 Adding Team Names to the Payroll Data
        2. 10.4.2 Adding World Series Records to the Payroll Data
      5. 10.5 Exploring the Extreme Salaries
      6. 10.6 Exercises
      7. Bibliography
        1. Figure 10.1
        2. Figure 10.2
        1. Table 10.1
    2. Chapter 11 CIA Factbook Mashup
      1. 11.1 Introduction
        1. 11.1.1 Computational Topics
      2. 11.2 Acquiring the Data
        1. 11.2.1 Extracting Latitude and Longitude from a CSV File
      3. 11.3 Integrating Data from Different Sources
      4. 11.4 Preparing the Data for Plotting
        1. 11.4.1 Redoing the Merge of the Factbook and Location Data
      5. 11.5 Plotting with Google Earth™
      6. 11.6 Extracting Demographic Information from the CIA XML File
      7. 11.7 Generating KML Directly
      8. 11.8 Additional Computational Tasks
        1. 11.8.1 Creating Plotting Symbols
        2. 11.8.2 Efficiency in Generating KML from Strings
        3. 11.8.3 Extracting Latitude and Longitude from an HTML File
      9. 11.9 Exercises
      10. Bibliography
        1. Figure 11.1
        2. Figure 11.2
        3. Figure 11.3
        4. Figure 11.4
        5. Figure 11.5
        6. Figure 11.6
        7. Figure 11.7
        8. Figure 11.8
        9. Figure 11.9
        10. Figure 11.10
    3. Chapter 12 Exploring Data Science Jobs with Web Scraping and Text Mining
      1. 12.1 Introduction and Motivation
        1. 12.1.1 Computational Topics
      2. 12.2 Exploring Different Web Sites
      3. 12.3 Preliminary/Exploratory Scraping: The Kaggle Job List
        1. 12.3.1 Processing the Text
        2. 12.3.2 Generalizing to Other Posts
        3. 12.3.3 Scraping the Kaggle Post List
      4. 12.4 Scraping CyberCoders.com
        1. 12.4.1 Getting the Skill List from a Job Post
        2. 12.4.2 Finding the Links to Job Postings in the Search Results
        3. 12.4.3 Finding the Next Page of Job Post Search Results
        4. 12.4.4 Putting It All Together
      5. 12.5 A Reusable Generic Framework for Arbitrary Sites
      6. 12.6 Scraping Career Builder
      7. 12.7 Scraping Monster.com
      8. 12.8 Analyzing the Results: The Important Skills
      9. 12.9 Note on Web Scraping
      10. 12.10 Exercises
      11. Bibliography
        1. Figure 12.1
        2. Figure 12.2
        3. Figure 12.3
        4. Figure 12.4
        5. Figure 12.5
        6. Figure 12.6
        7. Figure 12.7
        8. Figure 12.8
        9. Figure 12.9
        10. Figure 12.10
        11. Figure 12.11
  11. Colophon