Mastering Data Analysis with R

Book Description

Gain sharp insights into your data and solve real-world data science problems with R, from data munging to modeling and visualization.

About This Book

  • Handle your data with precision and care for optimal business intelligence

  • Restructure and transform your data to inform decision-making

  • Packed with practical advice and tips to help you get to grips with data mining

Who This Book Is For

If you are a data scientist or R developer who wants to explore and optimize your use of R's advanced features and tools, this is the book for you. A basic knowledge of R is required, along with an understanding of database logic.

What You Will Learn

  • Connect to and load data into R from a range of powerful databases

  • Successfully fetch and parse structured and unstructured data

  • Transform and restructure your data with efficient R packages

  • Define and build complex statistical models with glm

  • Develop and train machine learning algorithms

  • Visualize social networks and graph data

  • Deploy supervised and unsupervised classification algorithms

  • Discover how to visualize spatial data with R

In Detail

R is an essential language for sharp and successful data analysis. Its numerous features and ease of use make it a powerful way of mining, managing, and interpreting large sets of data. In a world where understanding big data has become key, mastering R will let you deal with your data effectively and efficiently.

This book will give you the guidance you need to build and develop your knowledge and expertise. Bridging the gap between theory and practice, it will help you to understand and use data for a competitive advantage.

The book begins with essential data mining and management tasks such as munging, fetching, cleaning, and restructuring, then explores different model designs and the core components of effective analysis. You will then discover how to optimize your use of machine learning algorithms for classification and recommendation systems, alongside traditional and more recent statistical methods.

Style and approach

Covering the essential tasks and skills within data science, Mastering Data Analysis with R provides you with solutions to the challenges of data science. Each section gives you a theoretical overview before demonstrating how to put the theory to work with real-world use cases and hands-on examples.


Table of Contents

    1. Mastering Data Analysis with R
      1. Table of Contents
      2. Mastering Data Analysis with R
      3. Credits
      4. About the Author
      5. About the Reviewers
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      8. 1. Hello, Data!
        1. Loading text files of a reasonable size
          1. Data files larger than the physical memory
        2. Benchmarking text file parsers
        3. Loading a subset of text files
          1. Filtering flat files before loading to R
        4. Loading data from databases
          1. Setting up the test environment
          2. MySQL and MariaDB
          3. PostgreSQL
          4. Oracle database
          5. ODBC database access
          6. Using a graphical user interface to connect to databases
          7. Other database backends
        5. Importing data from other statistical systems
        6. Loading Excel spreadsheets
        7. Summary
      9. 2. Getting Data from the Web
        1. Loading datasets from the Internet
        2. Other popular online data formats
        3. Reading data from HTML tables
          1. Reading tabular data from static Web pages
        4. Scraping data from other online sources
        5. R packages to interact with data source APIs
          1. Socrata Open Data API
          2. Finance APIs
          3. Fetching time series with Quandl
          4. Google documents and analytics
          5. Online search trends
          6. Historical weather data
          7. Other online data sources
        6. Summary
      10. 3. Filtering and Summarizing Data
        1. Drop needless data
          1. Drop needless data in an efficient way
          2. Drop needless data in another efficient way
        2. Aggregation
          1. Quicker aggregation with base R commands
          2. Convenient helper functions
          3. High-performance helper functions
          4. Aggregate with data.table
        3. Running benchmarks
        4. Summary functions
          1. Adding up the number of cases in subgroups
        5. Summary
      11. 4. Restructuring Data
        1. Transposing matrices
        2. Filtering data by string matching
        3. Rearranging data
        4. dplyr versus data.table
        5. Computing new variables
          1. Memory profiling
          2. Creating multiple variables at a time
          3. Computing new variables with dplyr
        6. Merging datasets
        7. Reshaping data in a flexible way
          1. Converting wide tables to the long table format
          2. Converting long tables to the wide table format
          3. Tweaking performance
        8. The evolution of the reshape packages
        9. Summary
      12. 5. Building Models (authored by Renata Nemeth and Gergely Toth)
        1. The motivation behind multivariate models
        2. Linear regression with continuous predictors
          1. Model interpretation
          2. Multiple predictors
        3. Model assumptions
        4. How well does the line fit in the data?
        5. Discrete predictors
        6. Summary
      13. 6. Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth)
        1. The modeling workflow
        2. Logistic regression
          1. Data considerations
          2. Goodness of model fit
          3. Model comparison
        3. Models for count data
          1. Poisson regression
          2. Negative binomial regression
          3. Multivariate non-linear models
        4. Summary
      14. 7. Unstructured Data
        1. Importing the corpus
        2. Cleaning the corpus
        3. Visualizing the most frequent words in the corpus
        4. Further cleanup
          1. Stemming words
          2. Lemmatisation
        5. Analyzing the associations among terms
        6. Some other metrics
        7. The segmentation of documents
        8. Summary
      15. 8. Polishing Data
        1. The types and origins of missing data
        2. Identifying missing data
        3. By-passing missing values
          1. Overriding the default arguments of a function
        4. Getting rid of missing data
        5. Filtering missing data before or during the actual analysis
        6. Data imputation
          1. Modeling missing values
          2. Comparing different imputation methods
          3. Not imputing missing values
          4. Multiple imputation
        7. Extreme values and outliers
          1. Testing extreme values
        8. Using robust methods
        9. Summary
      16. 9. From Big to Small Data
        1. Adequacy tests
          1. Normality
          2. Multivariate normality
          3. Dependence of variables
          4. KMO and Bartlett's test
        2. Principal Component Analysis
          1. PCA algorithms
          2. Determining the number of components
          3. Interpreting components
          4. Rotation methods
          5. Outlier-detection with PCA
        3. Factor analysis
        4. Principal Component Analysis versus Factor Analysis
        5. Multidimensional Scaling
        6. Summary
      17. 10. Classification and Clustering
        1. Cluster analysis
          1. Hierarchical clustering
          2. Determining the ideal number of clusters
          3. K-means clustering
          4. Visualizing clusters
        2. Latent class models
          1. Latent Class Analysis
          2. LCR models
        3. Discriminant analysis
        4. Logistic regression
        5. Machine learning algorithms
          1. The K-Nearest Neighbors algorithm
          2. Classification trees
          3. Random forest
          4. Other algorithms
        6. Summary
      18. 11. Social Network Analysis of the R Ecosystem
        1. Loading network data
        2. Centrality measures of networks
        3. Visualizing network data
          1. Interactive network plots
          2. Custom plot layouts
          3. Analyzing R package dependencies with an R package
        4. Further network analysis resources
        5. Summary
      19. 12. Analyzing Time-series
        1. Creating time-series objects
        2. Visualizing time-series
        3. Seasonal decomposition
        4. Holt-Winters filtering
        5. Autoregressive Integrated Moving Average models
        6. Outlier detection
        7. More complex time-series objects
        8. Advanced time-series analysis
        9. Summary
      20. 13. Data Around Us
        1. Geocoding
        2. Visualizing point data in space
        3. Finding polygon overlays of point data
        4. Plotting thematic maps
        5. Rendering polygons around points
          1. Contour lines
          2. Voronoi diagrams
        6. Satellite maps
        7. Interactive maps
          1. Querying Google Maps
          2. JavaScript mapping libraries
        8. Alternative map designs
        9. Spatial statistics
        10. Summary
      21. 14. Analyzing the R Community
        1. R Foundation members
          1. Visualizing supporting members around the world
        2. R package maintainers
          1. The number of packages per maintainer
        3. The R-help mailing list
          1. Volume of the R-help mailing list
          2. Forecasting the e-mail volume in the future
        4. Analyzing overlaps between our lists of R users
          1. Further ideas on extending the capture-recapture models
        5. The number of R users in social media
        6. R-related posts in social media
        7. Summary
      22. A. References
        1. General good readings on R
        2. Chapter 1 – Hello, Data!
        3. Chapter 2 – Getting Data from the Web
        4. Chapter 3 – Filtering and Summarizing Data
        5. Chapter 4 – Restructuring Data
        6. Chapter 5 – Building Models (authored by Renata Nemeth and Gergely Toth)
        7. Chapter 6 – Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth)
        8. Chapter 7 – Unstructured Data
        9. Chapter 8 – Polishing Data
        10. Chapter 9 – From Big to Small Data
        11. Chapter 10 – Classification and Clustering
        12. Chapter 11 – Social Network Analysis of the R Ecosystem
        13. Chapter 12 – Analyzing Time-series
        14. Chapter 13 – Data Around Us
        15. Chapter 14 – Analyzing the R Community
      23. Index