Learning Data Mining with R

Book Description

Develop key skills and techniques with R to create and customize data mining algorithms

In Detail

Dealing with the array of problems that you may encounter during complex statistical projects can be difficult. If you have only a basic knowledge of R, this book will give you the skills and knowledge to create and customize the most popular data mining algorithms and overcome these difficulties.

You will learn how to manipulate data with R using code snippets and be introduced to mining frequent patterns, associations, and correlations while working with R programs. Discover how to write code for various prediction models, stream data, and time-series data. You will also be introduced to solutions written in R based on RHadoop projects. You will finish this book confident in your ability to choose which data mining algorithm to apply in a given situation.

What You Will Learn

  • Discover how you can manipulate data with R using code snippets

  • Get to know the top classification algorithms written in R

  • Develop best practices in the fields of graph mining and network analysis

  • Find out how to mine text and web data with appropriate support from R

  • Familiarize yourself with algorithms written in R for spatial data mining, text mining, and web data mining

  • Explore solutions written in R based on RHadoop projects

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

    Table of Contents

    1. Learning Data Mining with R
      1. Table of Contents
      2. Learning Data Mining with R
      3. Credits
      4. About the Author
      5. Acknowledgments
      6. About the Reviewers
      7. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      8. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      9. 1. Warming Up
        1. Big data
          1. Scalability and efficiency
        2. Data source
        3. Data mining
          1. Feature extraction
          2. Summarization
          3. The data mining process
            1. CRISP-DM
            2. SEMMA
        4. Social network mining
          1. Social network
        5. Text mining
          1. Information retrieval and text mining
          2. Mining text for prediction
        6. Web data mining
        7. Why R?
          1. What are the disadvantages of R?
        8. Statistics
          1. Statistics and data mining
          2. Statistics and machine learning
          3. Statistics and R
          4. The limitations of statistics on data mining
        9. Machine learning
          1. Approaches to machine learning
          2. Machine learning architecture
        10. Data attributes and description
          1. Numeric attributes
          2. Categorical attributes
          3. Data description
          4. Data measuring
        11. Data cleaning
          1. Missing values
          2. Junk, noisy data, or outlier
        12. Data integration
        13. Data dimension reduction
          1. Eigenvalues and Eigenvectors
          2. Principal-Component Analysis
          3. Singular-value decomposition
          4. CUR decomposition
        14. Data transformation and discretization
          1. Data transformation
          2. Normalization data transformation methods
          3. Data discretization
        15. Visualization of results
          1. Visualization with R
        16. Time for action
        17. Summary
      10. 2. Mining Frequent Patterns, Associations, and Correlations
        1. An overview of associations and patterns
          1. Patterns and pattern discovery
            1. The frequent itemset
            2. The frequent subsequence
            3. The frequent substructures
          2. Relationship or rules discovery
            1. Association rules
            2. Correlation rules
        2. Market basket analysis
          1. The market basket model
          2. A-Priori algorithms
            1. Input data characteristics and data structure
            2. The A-Priori algorithm
            3. The R implementation
            4. A-Priori algorithm variants
          3. The Eclat algorithm
            1. The R implementation
          4. The FP-growth algorithm
            1. Input data characteristics and data structure
            2. The FP-growth algorithm
            3. The R implementation
          5. The GenMax algorithm with maximal frequent itemsets
            1. The R implementation
          6. The Charm algorithm with closed frequent itemsets
            1. The R implementation
          7. The algorithm to generate association rules
            1. The R implementation
        3. Hybrid association rules mining
          1. Mining multilevel and multidimensional association rules
          2. Constraint-based frequent pattern mining
        4. Mining sequence dataset
          1. Sequence dataset
          2. The GSP algorithm
        5. The R implementation
          1. The SPADE algorithm
            1. The R implementation
          2. Rule generation from sequential patterns
        6. High-performance algorithms
        7. Time for action
        8. Summary
      11. 3. Classification
        1. Classification
        2. Generic decision tree induction
          1. Attribute selection measures
          2. Tree pruning
          3. General algorithm for the decision tree generation
          4. The R implementation
        3. High-value credit card customers classification using ID3
          1. The ID3 algorithm
          2. The R implementation
          3. Web attack detection
          4. High-value credit card customers classification
        4. Web spam detection using C4.5
          1. The C4.5 algorithm
          2. The R implementation
          3. A parallel version with MapReduce
          4. Web spam detection
        5. Web key resource page judgment using CART
          1. The CART algorithm
          2. The R implementation
          3. Web key resource page judgment
        6. Trojan traffic identification method and Bayes classification
          1. Estimating
            1. Prior probability estimation
            2. Likelihood estimation
          2. The Bayes classification
          3. The R implementation
          4. Trojan traffic identification method
        7. Identify spam e-mail and Naïve Bayes classification
          1. The Naïve Bayes classification
          2. The R implementation
          3. Identify spam e-mail
        8. Rule-based classification of player types in computer games and rule-based classification
          1. Transformation from decision tree to decision rules
          2. Rule-based classification
          3. Sequential covering algorithm
          4. The RIPPER algorithm
            1. The R implementation
          5. Rule-based classification of player types in computer games
        9. Time for action
        10. Summary
      12. 4. Advanced Classification
        1. Ensemble (EM) methods
          1. The bagging algorithm
          2. The boosting and AdaBoost algorithms
          3. The Random forests algorithm
          4. The R implementation
          5. Parallel version with MapReduce
        2. Biological traits and the Bayesian belief network
          1. The Bayesian belief network (BBN) algorithm
          2. The R implementation
          3. Biological traits
        3. Protein classification and the k-Nearest Neighbors algorithm
          1. The kNN algorithm
          2. The R implementation
        4. Document retrieval and Support Vector Machine
          1. The SVM algorithm
          2. The R implementation
          3. Parallel version with MapReduce
          4. Document retrieval
        5. Classification using frequent patterns
          1. The associative classification
            1. CBA
          2. Discriminative frequent pattern-based classification
          3. The R implementation
          4. Text classification using sentential frequent itemsets
        6. Classification using the backpropagation algorithm
          1. The BP algorithm
          2. The R implementation
          3. Parallel version with MapReduce
        7. Time for action
        8. Summary
      13. 5. Cluster Analysis
        1. Search engines and the k-means algorithm
          1. The k-means clustering algorithm
          2. The kernel k-means algorithm
          3. The k-modes algorithm
          4. The R implementation
          5. Parallel version with MapReduce
          6. Search engine and web page clustering
        2. Automatic abstraction of document texts and the k-medoids algorithm
          1. The PAM algorithm
          2. The R implementation
          3. Automatic abstraction and summarization of document text
        3. The CLARA algorithm
          1. The CLARA algorithm
          2. The R implementation
        4. CLARANS
          1. The CLARANS algorithm
          2. The R implementation
        5. Unsupervised image categorization and affinity propagation clustering
          1. Affinity propagation clustering
          2. The R implementation
          3. Unsupervised image categorization
          4. The spectral clustering algorithm
          5. The R implementation
        6. News categorization and hierarchical clustering
          1. Agglomerative hierarchical clustering
          2. The BIRCH algorithm
          3. The chameleon algorithm
          4. The Bayesian hierarchical clustering algorithm
          5. The probabilistic hierarchical clustering algorithm
          6. The R implementation
          7. News categorization
        7. Time for action
        8. Summary
      14. 6. Advanced Cluster Analysis
        1. Customer categorization analysis of e-commerce and DBSCAN
          1. The DBSCAN algorithm
          2. Customer categorization analysis of e-commerce
        2. Clustering web pages and OPTICS
          1. The OPTICS algorithm
          2. The R implementation
          3. Clustering web pages
        3. Visitor analysis in the browser cache and DENCLUE
          1. The DENCLUE algorithm
          2. The R implementation
          3. Visitor analysis in the browser cache
        4. Recommendation system and STING
          1. The STING algorithm
          2. The R implementation
          3. Recommendation systems
        5. Web sentiment analysis and CLIQUE
          1. The CLIQUE algorithm
          2. The R implementation
          3. Web sentiment analysis
        6. Opinion mining and WAVE clustering
          1. The WAVE cluster algorithm
          2. The R implementation
          3. Opinion mining
        7. User search intent and the EM algorithm
          1. The EM algorithm
          2. The R implementation
          3. The user search intent
        8. Customer purchase data analysis and clustering high-dimensional data
          1. The MAFIA algorithm
          2. The SURFING algorithm
          3. The R implementation
          4. Customer purchase data analysis
        9. SNS and clustering graph and network data
          1. The SCAN algorithm
          2. The R implementation
          3. Social networking service (SNS)
        10. Time for action
        11. Summary
      15. 7. Outlier Detection
        1. Credit card fraud detection and statistical methods
          1. The likelihood-based outlier detection algorithm
          2. The R implementation
          3. Credit card fraud detection
        2. Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods
          1. The NL algorithm
          2. The FindAllOutsM algorithm
          3. The FindAllOutsD algorithm
          4. The distance-based algorithm
          5. The Dolphin algorithm
          6. The R implementation
          7. Activity monitoring and the detection of mobile fraud
        3. Intrusion detection and density-based methods
          1. The OPTICS-OF algorithm
          2. The High Contrast Subspace algorithm
          3. The R implementation
          4. Intrusion detection
        4. Intrusion detection and clustering-based methods
          1. Hierarchical clustering to detect outliers
          2. The k-means-based algorithm
          3. The ODIN algorithm
          4. The R implementation
        5. Monitoring the performance of the web server and classification-based methods
          1. The OCSVM algorithm
          2. The one-class nearest neighbor algorithm
          3. The R implementation
          4. Monitoring the performance of the web server
        6. Detecting novelty in text, topic detection, and mining contextual outliers
          1. The conditional anomaly detection (CAD) algorithm
          2. The R implementation
          3. Detecting novelty in text and topic detection
        7. Collective outliers on spatial data
          1. The route outlier detection (ROD) algorithm
          2. The R implementation
          3. Characteristics of collective outliers
        8. Outlier detection in high-dimensional data
          1. The brute-force algorithm
          2. The HilOut algorithm
          3. The R implementation
        9. Time for action
        10. Summary
      16. 8. Mining Stream, Time-series, and Sequence Data
        1. The credit card transaction flow and STREAM algorithm
          1. The STREAM algorithm
          2. The single-pass-any-time clustering algorithm
          3. The R implementation
          4. The credit card transaction flow
        2. Predicting future prices and time-series analysis
          1. The ARIMA algorithm
          2. Predicting future prices
        3. Stock market data and time-series clustering and classification
          1. The hError algorithm
          2. Time-series classification with the 1NN classifier
          3. The R implementation
          4. Stock market data
        4. Web click streams and mining symbolic sequences
          1. The TECNO-STREAMS algorithm
          2. The R implementation
          3. Web click streams
        5. Mining sequence patterns in transactional databases
          1. The PrefixSpan algorithm
          2. The R implementation
        6. Time for action
        7. Summary
      17. 9. Graph Mining and Network Analysis
        1. Graph mining
          1. Graph
          2. Graph mining algorithms
        2. Mining frequent subgraph patterns
          1. The gPLS algorithm
          2. The GraphSig algorithm
          3. The gSpan algorithm
          4. Rightmost path extensions and their supports
          5. The subgraph isomorphism enumeration algorithm
          6. The canonical checking algorithm
          7. The R implementation
        3. Social network mining
          1. Community detection and the shingling algorithm
          2. The node classification and iterative classification algorithms
          3. The R implementation
        4. Time for action
        5. Summary
      18. 10. Mining Text and Web Data
        1. Text mining and TM packages
        2. Text summarization
          1. Topic representation
          2. The multidocument summarization algorithm
          3. The Maximal Marginal Relevance algorithm
          4. The R implementation
        3. The question answering system
        4. Genre categorization of web pages
        5. Categorizing newspaper articles and newswires into topics
          1. The N-gram-based text categorization
          2. The R implementation
        6. Web usage mining with web logs
          1. The FCA-based association rule mining algorithm
          2. The R implementation
        7. Time for action
        8. Summary
      19. A. Algorithms and Data Structures
      20. Index