Mastering Data Mining with Python – Find patterns hidden in your data

Book Description

Learn how to create more powerful data mining applications with this comprehensive Python guide to advanced data analytics techniques.

About This Book

  • Dive deeper into data mining with Python – don’t be complacent, sharpen your skills!

  • From the most common elements of data mining to cutting-edge techniques, we’ve got you covered for any data-related challenge

  • Become a more fluent and confident Python data analyst, in full control of the language's extensive range of libraries

Who This Book Is For

This book is for data scientists who are already familiar with some basic data mining techniques such as SQL and machine learning, and who are comfortable with Python. If you are ready to learn some more advanced techniques in data mining in order to become a data mining expert, this is the book for you!

What You Will Learn

  • Explore techniques for finding frequent itemsets and association rules in large data sets

  • Learn identification methods for entity matches across many different types of data

  • Identify the basics of network mining and how to apply it to real-world data sets

  • Discover methods for detecting the sentiment of text and for locating named entities in text

  • Observe multiple techniques for automatically extracting summaries and generating topic models for text

  • See how to use data mining to fix data anomalies and how to use machine learning to identify outliers in a data set

In Detail

Data mining is an integral part of the data science pipeline. It is the foundation of any successful data-driven strategy – without it, you'll never be able to uncover truly transformative insights. Since data is vital to just about every modern organization, it is worth taking the next step to unlock even greater value and more meaningful understanding.

If you already know the fundamentals of data mining with Python, you are now ready to experiment with more interesting, advanced data analytics techniques using Python's easy-to-use interface and extensive range of libraries.

In this book, you'll go deeper into many often overlooked areas of data mining, including association rule mining, entity matching, network mining, sentiment analysis, named entity recognition, text summarization, topic modeling, and anomaly detection. For each data mining technique, we'll review the state of the art and current best practices before comparing a wide variety of strategies for solving each problem. We will then implement example solutions using real-world data from the domain of software engineering, and we will spend time learning how to understand and interpret the results we get.

By the end of this book, you will have solid experience implementing some of the most interesting and relevant data mining techniques available today, and you will have achieved a greater fluency in the important field of Python data analytics.

Style and approach

This book teaches the intricacies of applying data mining through real-world scenarios and serves as a practical guide for your data mining needs.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

    1. Mastering Data Mining with Python – Find patterns hidden in your data
      1. Table of Contents
      2. Mastering Data Mining with Python – Find patterns hidden in your data
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. eBooks, discount offers, and more
          1. Why subscribe?
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Expanding Your Data Mining Toolbox
        1. What is data mining?
        2. How do we do data mining?
          1. The Fayyad et al. KDD process
          2. The Han et al. KDD process
          3. The CRISP-DM process
          4. The Six Steps process
          5. Which data mining methodology is the best?
        3. What are the techniques used in data mining?
          1. What techniques are we going to use in this book?
        4. How do we set up our data mining work environment?
        5. Summary
      9. 2. Association Rule Mining
        1. What are frequent itemsets?
          1. The diapers and beer urban legend
          2. Frequent itemset mining basics
        2. Towards association rules
          1. Support
          2. Confidence
          3. Association rules
          4. An example with data
          5. Added value – fixing a flaw in the plan
          6. Methods for finding frequent itemsets
        3. A project – discovering association rules in software project tags
        4. Summary
      10. 3. Entity Matching
        1. What is entity matching?
          1. Merging data
            1. Merging datasets vertically
            2. Merging datasets horizontally
          2. Techniques for matching
          3. Attribute-based similarity matching
            1. Be careful of pairwise comparisons
            2. Leverage rare values
          4. Methods for matching attributes
            1. Range-based or distance from target
            2. String edit distance
            3. Hamming distance
            4. Levenshtein distance
            5. Soundex
          5. Leveraging disjoint sets
          6. Context-based similarity matching
          7. Machine learning-based entity matching
          8. Evaluation of entity matching techniques
            1. Efficiency – how long does it take to do the matching?
            2. Effectiveness – how accurate are the matches that we generate?
            3. Usefulness – how practical is the matching procedure to use?
        2. Entity matching project
          1. Difficulties with matching software projects
          2. Two examples
          3. Matching on project names
          4. Matching on people names
          5. Matching on URLs
          6. Matching on topics and description keywords
          7. The dataset
          8. The code
          9. The results
            1. How many entity matches did we find?
            2. How good are the pairs we found?
        3. Summary
      11. 4. Network Analysis
        1. What is a network?
        2. Measuring a network
          1. Degree of a network
          2. Diameter of a network
          3. Walks, paths, and trails in a network
          4. Components of a network
          5. Centrality of a network
            1. Closeness centrality
            2. Degree centrality
            3. Betweenness centrality
            4. Other measures of centrality
        3. Representing graph data
          1. Adjacency matrix
          2. Edge lists and adjacency lists
          3. Differences between graph data structures
          4. Importing data into a graph structure
            1. Adjacency list format
            2. Edge list format
            3. GEXF and GraphML
            4. GDF
            5. Python pickle
            6. JSON
            7. JSON node and link series
            8. JSON trees
            9. Pajek format
        4. A real project
          1. Exploring the data
          2. Generating the network files
          3. Understanding our data as a network
            1. Generating simple network metrics
            2. Playing with the parameters of a network
            3. Analyzing subgraphs
            4. Analyzing cliques and centrality in the subgraphs
            5. Looking for change over time
        5. Summary
      12. 5. Sentiment Analysis in Text
        1. What is sentiment analysis?
        2. The basics of sentiment analysis
          1. The structure of an opinion
          2. Document-level and sentence-level analysis
          3. Important features of opinions
        3. Sentiment analysis algorithms
          1. General-purpose data collections
            1. Hu and Liu's sentiment analysis lexicon
            2. SentiWordNet
            3. Vader sentiment
        4. Sentiment mining application
          1. Motivating the project
          2. Data preparation
          3. Data analysis of chat messages
          4. Data analysis of e-mail messages
        5. Summary
      13. 6. Named Entity Recognition in Text
        1. Why look for named entities?
        2. Techniques for named entity recognition
          1. Tagging parts of speech
            1. Classes of named entities
        3. Building and evaluating NER systems
          1. NER and partial matches
          2. Handling partial matches
        4. Named entity recognition project
          1. A simple NER tool
            1. Apache Board meeting minutes
            2. Django IRC chat
            3. GnuIRC summaries
            4. LKML e-mails
        5. Summary
      14. 7. Automatic Text Summarization
        1. What is automatic text summarization?
        2. Tools for text summarization
          1. Naive text summarization using NLTK
          2. Text summarization using Gensim
          3. Text summarization using Sumy
            1. Sumy's Luhn summarizer
            2. Sumy's TextRank summarizer
            3. Sumy's LSA summarizer
            4. Sumy's Edmundson summarizer
        3. Summary
      15. 8. Topic Modeling in Text
        1. What is topic modeling?
        2. Latent Dirichlet Allocation
        3. Gensim for topic modeling
          1. Understanding Gensim LDA topics
          2. Understanding Gensim LDA passes
          3. Applying a Gensim LDA model to new documents
          4. Serializing Gensim LDA objects
            1. Serializing a dictionary
            2. Serializing a corpus
            3. Serializing a model
        4. Gensim LDA for a larger project
        5. Summary
      16. 9. Mining for Data Anomalies
        1. What are data anomalies?
          1. Missing data
            1. Locating missing data
            2. Zero values
          2. Fixing missing data
            1. Ignore the problem rows
            2. Fix the problem manually
            3. Use a fabricated value
            4. Use a central measure
            5. Use Last Observation Carried Forward
            6. Use a similar value
            7. Use the most likely value
          3. Data errors
            1. Truncated fields
            2. Data type and character set errors
            3. Logic or semantic errors
          4. Outliers
            1. Visual mining for outliers
            2. Statistical detection of outliers
              1. Detecting outliers with modified z-scores
              2. Detecting outliers by combining statistics and visual mining
              3. Detecting outliers with machine learning
        2. Summary
      17. Index