Big Data Analytics with R

Book Description

Utilize R to uncover hidden patterns in your Big Data

About This Book

  • Perform computational analyses on Big Data to generate meaningful results

  • Gain practical knowledge of the R programming language while working on Big Data platforms such as Hadoop, Spark, H2O, and SQL/NoSQL databases

  • Explore fast, streaming, and scalable data analysis with the most cutting-edge technologies in the market

Who This Book Is For

This book is intended for data analysts, scientists, data engineers, statisticians, and researchers who want to integrate R with their current or future Big Data workflows.

It is assumed that readers have some experience in data analysis and an understanding of data management and algorithmic processing of large quantities of data, although they may lack specific skills related to R.

What You Will Learn

  • Learn about the current state of Big Data processing using the R programming language and its powerful statistical capabilities

  • Deploy Big Data analytics platforms with selected Big Data tools supported by R in a cost-effective and time-saving manner

  • Apply the R language to real-world Big Data problems on a multi-node Hadoop cluster, e.g. electricity consumption across various socio-demographic indicators and bike share scheme usage

  • Explore the compatibility of R with Hadoop, Spark, SQL and NoSQL databases, and the H2O platform

In Detail

Big Data analytics is the process of examining large and complex data sets that often exceed available computational capabilities. R is a leading programming language of data science, with powerful functions for tackling problems related to Big Data processing.

The book begins with a brief introduction to the Big Data world and its current industry standards, followed by an introduction to the R language that presents its development, structure, real-world applications, and shortcomings. It then progresses to a revision of the major R functions for data management and transformations. Readers are introduced to cloud-based Big Data solutions (e.g. Amazon EC2 instances and Amazon RDS, Microsoft Azure and its HDInsight clusters) and given guidance on R connectivity with relational and non-relational databases such as MongoDB and HBase. The book then expands to Big Data tools such as the Apache Hadoop ecosystem, HDFS, and the MapReduce framework, as well as other R-compatible tools such as Apache Spark, its machine learning library Spark MLlib, and H2O.

Style and approach

This book will serve as a practical guide to tackling Big Data problems using the R programming language and its statistical environment. Each section of the book will present you with concise and easy-to-follow steps on how to process, transform, and analyse large data sets.

Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at If you purchased this book elsewhere, you can visit and register to have the code file.

Table of Contents

    1. Big Data Analytics with R
      1. Big Data Analytics with R
      2. Credits
      3. About the Author
      4. Acknowledgement
      5. About the Reviewers
        1. eBooks, discount offers, and more
          1. Why subscribe?
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. The Era of Big Data
        1. Big Data – The monster re-defined
        2. Big Data toolbox - dealing with the giant
          1. Hadoop - the elephant in the room
          2. Databases
          3. Hadoop Spark-ed up
        3. R – The unsung Big Data hero
        4. Summary
      9. 2. Introduction to R Programming Language and Statistical Environment
        1. Learning R
        2. Revisiting R basics
          1. Getting R and RStudio ready
            1. Setting the URLs to R repositories
          2. R data structures
            1. Vectors
            2. Scalars
            3. Matrices
            4. Arrays
            5. Data frames
            6. Lists
          3. Exporting R data objects
        3. Applied data science with R
          1. Importing data from different formats
          2. Exploratory Data Analysis
          3. Data aggregations and contingency tables
          4. Hypothesis testing and statistical inference
            1. Tests of differences
              1. Independent t-test example (with power and effect size estimates)
              2. ANOVA example
            2. Tests of relationships
              1. An example of Pearson's r correlations
              2. Multiple regression example
          5. Data visualization packages
        4. Summary
      10. 3. Unleashing the Power of R from Within
        1. Traditional limitations of R
          1. Out-of-memory data
          2. Processing speed
        2. To the memory limits and beyond
          1. Data transformations and aggregations with the ff and ffbase packages
          2. Generalized linear models with the ff and ffbase packages
            1. Logistic regression example with ffbase and biglm
          3. Expanding memory with the bigmemory package
        3. Parallel R
          1. From bigmemory to faster computations
            1. An apply() example with the big.matrix object
            2. A for() loop example with the ffdf object
            3. Using apply() and for() loop examples on a data.frame
            4. A parallel package example
            5. A foreach package example
          2. The future of parallel processing in R
            1. Utilizing Graphics Processing Units with R
            2. Multi-threading with Microsoft R Open distribution
            3. Parallel machine learning with H2O and R
        4. Boosting R performance with the data.table package and other tools
          1. Fast data import and manipulation with the data.table package
            1. Data import with data.table
            2. Lightning-fast subsets and aggregations on data.table
            3. Chaining, more complex aggregations, and pivot tables with data.table
          2. Writing better R code
        5. Summary
      11. 4. Hadoop and MapReduce Framework for R
        1. Hadoop architecture
          1. Hadoop Distributed File System
          2. MapReduce framework
            1. A simple MapReduce word count example
          3. Other Hadoop native tools
          4. Learning Hadoop
        2. A single-node Hadoop in Cloud
          1. Deploying Hortonworks Sandbox on Azure
          2. A word count example in Hadoop using Java
          3. A word count example in Hadoop using the R language
            1. RStudio Server on a Linux RedHat/CentOS virtual machine
            2. Installing and configuring RHadoop packages
            3. HDFS management and MapReduce in R - a word count example
        3. HDInsight - a multi-node Hadoop cluster on Azure
          1. Creating your first HDInsight cluster
            1. Creating a new Resource Group
            2. Deploying a Virtual Network
            3. Creating a Network Security Group
            4. Setting up and configuring an HDInsight cluster
            5. Starting the cluster and exploring Ambari
            6. Connecting to the HDInsight cluster and installing RStudio Server
            7. Adding a new inbound security rule for port 8787
            8. Editing the Virtual Network's public IP address for the head node
          2. Smart energy meter readings analysis example – using R on HDInsight cluster
        4. Summary
      12. 5. R with Relational Database Management Systems (RDBMSs)
        1. Relational Database Management Systems (RDBMSs)
          1. A short overview of used RDBMSs
          2. Structured Query Language (SQL)
        2. SQLite with R
          1. Preparing and importing data into a local SQLite database
          2. Connecting to SQLite from RStudio
        3. MariaDB with R on an Amazon EC2 instance
          1. Preparing the EC2 instance and RStudio Server for use
          2. Preparing MariaDB and data for use
          3. Working with MariaDB from RStudio
        4. PostgreSQL with R on Amazon RDS
          1. Launching an Amazon RDS database instance
          2. Preparing and uploading data to Amazon RDS
          3. Remotely querying PostgreSQL on Amazon RDS from RStudio
        5. Summary
      13. 6. R with Non-Relational (NoSQL) Databases
        1. Introduction to NoSQL databases
          1. Review of leading non-relational databases
        2. MongoDB with R
          1. Introduction to MongoDB
            1. MongoDB data models
          2. Installing MongoDB with R on Amazon EC2
          3. Processing Big Data using MongoDB with R
            1. Importing data into MongoDB and basic MongoDB commands
            2. MongoDB with R using the rmongodb package
            3. MongoDB with R using the RMongo package
            4. MongoDB with R using the mongolite package
        3. HBase with R
          1. Azure HDInsight with HBase and RStudio Server
          2. Importing the data to HDFS and HBase
          3. Reading and querying HBase using the rhbase package
        4. Summary
      14. 7. Faster than Hadoop - Spark with R
        1. Spark for Big Data analytics
        2. Spark with R on a multi-node HDInsight cluster
          1. Launching HDInsight with Spark and R/RStudio
          2. Reading the data into HDFS and Hive
            1. Getting the data into HDFS
            2. Importing data from HDFS to Hive
          3. Bay Area Bike Share analysis using SparkR
        3. Summary
      15. 8. Machine Learning Methods for Big Data in R
        1. What is machine learning?
          1. Machine learning algorithms
          2. Supervised and unsupervised machine learning methods
          3. Classification and clustering algorithms
          4. Machine learning methods with R
          5. Big Data machine learning tools
        2. GLM example with Spark and R on the HDInsight cluster
          1. Preparing the Spark cluster and reading the data from HDFS
          2. Logistic regression in Spark with R
        3. Naive Bayes with H2O on Hadoop with R
          1. Running an H2O instance on Hadoop with R
          2. Reading and exploring the data in H2O
          3. Naive Bayes on H2O with R
        4. Neural Networks with H2O on Hadoop with R
          1. How do Neural Networks work?
          2. Running Deep Learning models on H2O
        5. Summary
      16. 9. The Future of R - Big, Fast, and Smart Data
        1. The current state of Big Data analytics with R
          1. Out-of-memory data on a single machine
          2. Faster data processing with R
          3. Hadoop with R
          4. Spark with R
          5. R with databases
          6. Machine learning with R
        2. The future of R
          1. Big Data
          2. Fast data
          3. Smart data
        3. Where to go next
        4. Summary