Big Data Analytics with R and Hadoop

Book Description

If you’re an R developer looking to harness the power of big data analytics with Hadoop, then this book tells you everything you need to integrate the two. You’ll end up capable of building a data analytics engine with huge potential.

  • Write Hadoop MapReduce within R

  • Learn data analytics with R and the Hadoop platform

  • Handle HDFS data within R

  • Understand Hadoop streaming with R

  • Encode and enrich datasets into R

In Detail

    Big data analytics is the process of examining large volumes of varied data to uncover hidden patterns, unknown correlations, and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. New methods of working with big data, such as Hadoop and MapReduce, offer alternatives to traditional data warehousing.

    Big Data Analytics with R and Hadoop focuses on techniques for integrating R and Hadoop using tools such as RHIPE and RHadoop. With them, you can build a powerful data analytics engine that runs analytics algorithms over large datasets in a scalable manner, combining the data analytics operations of R with Hadoop's MapReduce and HDFS.

    You will start with the installation and configuration of R and Hadoop. Next, you will work through various practical data analytics examples with R and Hadoop. Finally, you will learn how to import data into R from various data sources and export it back to them. Big Data Analytics with R and Hadoop also gives you a straightforward understanding of the R and Hadoop connectors RHIPE, RHadoop, and Hadoop streaming.

    Table of Contents

    1. Big Data Analytics with R and Hadoop
      1. Table of Contents
      2. Big Data Analytics with R and Hadoop
      3. Credits
      4. About the Author
      5. Acknowledgment
      6. About the Reviewers
      7. www.PacktPub.com
        1. Support files, eBooks, discount offers and more
          1. Why Subscribe?
          2. Free Access for Packt account holders
      8. Preface
        1. Introducing R
        2. Understanding features of R
        3. Studying the popularity of R
        4. Introducing Big Data
        5. Getting information about popular organizations that hold Big Data
        6. Introducing Hadoop
        7. Exploring Hadoop features
          1. Studying Hadoop components
          2. Understanding the reason for using R and Hadoop together
        8. What this book covers
        9. What you need for this book
        10. Who this book is for
        11. Conventions
        12. Reader feedback
        13. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      9. 1. Getting Ready to Use R and Hadoop
        1. Installing R
        2. Installing RStudio
        3. Understanding the features of R language
          1. Using R packages
          2. Performing data operations
          3. Increasing community support
          4. Performing data modeling in R
        4. Installing Hadoop
          1. Understanding different Hadoop modes
          2. Understanding Hadoop installation steps
            1. Installing Hadoop on Linux, Ubuntu flavor (single node cluster)
            2. Installing Hadoop on Linux, Ubuntu flavor (multinode cluster)
            3. Installing Cloudera Hadoop on Ubuntu
        5. Understanding Hadoop features
          1. Understanding HDFS
            1. Understanding the characteristics of HDFS
          2. Understanding MapReduce
        6. Learning the HDFS and MapReduce architecture
          1. Understanding the HDFS architecture
            1. Understanding HDFS components
          2. Understanding the MapReduce architecture
            1. Understanding MapReduce components
          3. Understanding the HDFS and MapReduce architecture by plot
        7. Understanding Hadoop subprojects
        8. Summary
      10. 2. Writing Hadoop MapReduce Programs
        1. Understanding the basics of MapReduce
        2. Introducing Hadoop MapReduce
          1. Listing Hadoop MapReduce entities
          2. Understanding the Hadoop MapReduce scenario
            1. Loading data into HDFS
            2. Executing the Map phase
            3. Shuffling and sorting
            4. Reducing phase execution
          3. Understanding the limitations of MapReduce
          4. Understanding Hadoop's ability to solve problems
          5. Understanding the different Java concepts used in Hadoop programming
        3. Understanding the Hadoop MapReduce fundamentals
          1. Understanding MapReduce objects
          2. Deciding the number of Maps in MapReduce
          3. Deciding the number of Reducers in MapReduce
          4. Understanding MapReduce dataflow
          5. Taking a closer look at Hadoop MapReduce terminologies
        4. Writing a Hadoop MapReduce example
          1. Understanding the steps to run a MapReduce job
            1. Learning to monitor and debug a Hadoop MapReduce job
            2. Exploring HDFS data
          2. Understanding several possible MapReduce definitions to solve business problems
        5. Learning the different ways to write Hadoop MapReduce in R
          1. Learning RHadoop
          2. Learning RHIPE
          3. Learning Hadoop streaming
        6. Summary
      11. 3. Integrating R and Hadoop
        1. Introducing RHIPE
          1. Installing RHIPE
            1. Installing Hadoop
            2. Installing R
            3. Installing protocol buffers
            4. Environment variables
            5. The rJava package installation
            6. Installing RHIPE
          2. Understanding the architecture of RHIPE
          3. Understanding RHIPE samples
            1. RHIPE sample program (Map only)
            2. Word count
          4. Understanding the RHIPE function reference
            1. Initialization
            2. HDFS
            3. MapReduce
        2. Introducing RHadoop
          1. Understanding the architecture of RHadoop
          2. Installing RHadoop
            1. Understanding RHadoop examples
              1. Word count
            2. Understanding the RHadoop function reference
              1. The hdfs package
              2. The rmr package
        3. Summary
      12. 4. Using Hadoop Streaming with R
        1. Understanding the basics of Hadoop streaming
        2. Understanding how to run Hadoop streaming with R
          1. Understanding a MapReduce application
          2. Understanding how to code a MapReduce application
          3. Understanding how to run a MapReduce application
            1. Executing a Hadoop streaming job from the command prompt
            2. Executing the Hadoop streaming job from R or an RStudio console
          4. Understanding how to explore the output of MapReduce application
            1. Exploring an output from the command prompt
            2. Exploring an output from R or an RStudio console
          5. Understanding basic R functions used in Hadoop MapReduce scripts
          6. Monitoring the Hadoop MapReduce job
        3. Exploring the HadoopStreaming R package
          1. Understanding the hsTableReader function
          2. Understanding the hsKeyValReader function
          3. Understanding the hsLineReader function
          4. Running a Hadoop streaming job
            1. Executing the Hadoop streaming job
        4. Summary
      13. 5. Learning Data Analytics with R and Hadoop
        1. Understanding the data analytics project life cycle
          1. Identifying the problem
          2. Designing data requirement
          3. Preprocessing data
          4. Performing analytics over data
          5. Visualizing data
        2. Understanding data analytics problems
          1. Exploring web pages categorization
            1. Identifying the problem
            2. Designing data requirement
              1. Understanding the required Google Analytics data attributes
                1. Collecting data
            3. Preprocessing data
            4. Performing analytics over data
            5. Visualizing data
          2. Computing the frequency of stock market change
            1. Identifying the problem
            2. Designing data requirement
            3. Preprocessing data
            4. Performing analytics over data
            5. Visualizing data
          3. Predicting the sale price of blue book for bulldozers – case study
            1. Identifying the problem
            2. Designing data requirement
            3. Preprocessing data
            4. Performing analytics over data
            5. Understanding Poisson-approximation resampling
              1. Fitting random forests with RHadoop
        3. Summary
      14. 6. Understanding Big Data Analysis with Machine Learning
        1. Introduction to machine learning
          1. Types of machine-learning algorithms
        2. Supervised machine-learning algorithms
          1. Linear regression
            1. Linear regression with R
            2. Linear regression with R and Hadoop
          2. Logistic regression
            1. Logistic regression with R
            2. Logistic regression with R and Hadoop
        3. Unsupervised machine learning algorithm
          1. Clustering
            1. Clustering with R
            2. Performing clustering with R and Hadoop
        4. Recommendation algorithms
          1. Steps to generate recommendations in R
          2. Generating recommendations with R and Hadoop
        5. Summary
      15. 7. Importing and Exporting Data from Various DBs
        1. Learning about data files as database
          1. Understanding different types of files
          2. Installing R packages
          3. Importing the data into R
          4. Exporting the data from R
        2. Understanding MySQL
          1. Installing MySQL
          2. Installing RMySQL
          3. Learning to list the tables and their structure
          4. Importing the data into R
          5. Understanding data manipulation
        3. Understanding Excel
          1. Installing Excel
          2. Importing data into R
          3. Understanding data manipulation with R and Excel
          4. Exporting the data to Excel
        4. Understanding MongoDB
          1. Installing MongoDB
            1. Mapping SQL to MongoDB
            2. Mapping SQL to MongoQL
          2. Installing rmongodb
          3. Importing the data into R
          4. Understanding data manipulation
        5. Understanding SQLite
          1. Understanding features of SQLite
          2. Installing SQLite
          3. Installing RSQLite
          4. Importing the data into R
          5. Understanding data manipulation
        6. Understanding PostgreSQL
          1. Understanding features of PostgreSQL
          2. Installing PostgreSQL
          3. Installing RPostgreSQL
          4. Exporting the data from R
        7. Understanding Hive
          1. Understanding features of Hive
          2. Installing Hive
            1. Setting up Hive configurations
          3. Installing RHive
          4. Understanding RHive operations
        8. Understanding HBase
          1. Understanding HBase features
          2. Installing HBase
          3. Installing thrift
          4. Installing RHBase
          5. Importing the data into R
          6. Understanding data manipulation
        9. Summary
      16. A. References
        1. R + Hadoop help materials
        2. R groups
        3. Hadoop groups
        4. R + Hadoop groups
        5. Popular R contributors
        6. Popular Hadoop contributors
      17. Index