Learning Data Mining with Python

Book Description

Harness the power of Python to analyze data and create insightful predictive models

In Detail

The next step in the information age is to gain insights from the deluge of data coming our way. Data mining provides a way of finding this insight, and Python is one of the most popular languages for data mining, providing both power and flexibility in analysis.

This book teaches you to design and develop data mining applications using a variety of datasets, starting with basic classification and affinity analysis. Next, we move on to more complex data types including text, images, and graphs. In every chapter, we create models that solve real-world problems.

There is a rich and varied set of libraries available in Python for data mining. This book covers many of them, including the IPython Notebook, pandas, scikit-learn, and NLTK.

Each chapter of this book introduces you to new algorithms and techniques. By the end of the book, you will have gained considerable insight into using Python for data mining, along with a good understanding of the algorithms and their implementations.

What You Will Learn

  • Apply data mining concepts to real-world problems

  • Determine the author of a document based on their writing style

  • Use APIs to download datasets from social media and other online services

  • Find and extract good features from difficult datasets

  • Create models that solve real-world problems

  • Design and develop data mining applications using a variety of datasets

  • Set up reproducible experiments and generate robust results

  • Recommend movies, online celebrities, and news articles based on personal preferences

  • Compute on big data, including real-time data from the Internet

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

    Table of Contents

    1. Learning Data Mining with Python
      1. Table of Contents
      2. Learning Data Mining with Python
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      8. 1. Getting Started with Data Mining
        1. Introducing data mining
        2. Using Python and the IPython Notebook
          1. Installing Python
          2. Installing IPython
          3. Installing scikit-learn
        3. A simple affinity analysis example
          1. What is affinity analysis?
          2. Product recommendations
          3. Loading the dataset with NumPy
          4. Implementing a simple ranking of rules
          5. Ranking to find the best rules
        4. A simple classification example
        5. What is classification?
          1. Loading and preparing the dataset
          2. Implementing the OneR algorithm
          3. Testing the algorithm
        6. Summary
      9. 2. Classifying with scikit-learn Estimators
        1. scikit-learn estimators
          1. Nearest neighbors
          2. Distance metrics
          3. Loading the dataset
          4. Moving towards a standard workflow
          5. Running the algorithm
          6. Setting parameters
        2. Preprocessing using pipelines
          1. An example
          2. Standard preprocessing
          3. Putting it all together
        3. Pipelines
        4. Summary
      10. 3. Predicting Sports Winners with Decision Trees
        1. Loading the dataset
          1. Collecting the data
          2. Using pandas to load the dataset
          3. Cleaning up the dataset
          4. Extracting new features
        2. Decision trees
          1. Parameters in decision trees
          2. Using decision trees
        3. Sports outcome prediction
          1. Putting it all together
        4. Random forests
          1. How do ensembles work?
          2. Parameters in Random forests
          3. Applying Random forests
          4. Engineering new features
        5. Summary
      11. 4. Recommending Movies Using Affinity Analysis
        1. Affinity analysis
          1. Algorithms for affinity analysis
          2. Choosing parameters
        2. The movie recommendation problem
          1. Obtaining the dataset
          2. Loading with pandas
          3. Sparse data formats
        3. The Apriori implementation
          1. The Apriori algorithm
          2. Implementation
        4. Extracting association rules
          1. Evaluation
        5. Summary
      12. 5. Extracting Features with Transformers
        1. Feature extraction
          1. Representing reality in models
          2. Common feature patterns
          3. Creating good features
        2. Feature selection
          1. Selecting the best individual features
        3. Feature creation
          1. Principal Component Analysis
        4. Creating your own transformer
          1. The transformer API
          2. Implementation details
          3. Unit testing
          4. Putting it all together
        5. Summary
      13. 6. Social Media Insight Using Naive Bayes
        1. Disambiguation
          1. Downloading data from a social network
          2. Loading and classifying the dataset
          3. Creating a replicable dataset from Twitter
        2. Text transformers
          1. Bag-of-words
          2. N-grams
          3. Other features
        3. Naive Bayes
          1. Bayes' theorem
          2. Naive Bayes algorithm
          3. How it works
        4. Application
          1. Extracting word counts
          2. Converting dictionaries to a matrix
          3. Training the Naive Bayes classifier
          4. Putting it all together
          5. Evaluation using the F1-score
          6. Getting useful features from models
        5. Summary
      14. 7. Discovering Accounts to Follow Using Graph Mining
        1. Loading the dataset
          1. Classifying with an existing model
          2. Getting follower information from Twitter
          3. Building the network
          4. Creating a graph
          5. Creating a similarity graph
        2. Finding subgraphs
          1. Connected components
          2. Optimizing criteria
        3. Summary
      15. 8. Beating CAPTCHAs with Neural Networks
        1. Artificial neural networks
          1. An introduction to neural networks
        2. Creating the dataset
          1. Drawing basic CAPTCHAs
          2. Splitting the image into individual letters
          3. Creating a training dataset
          4. Adjusting our training dataset to our methodology
        3. Training and classifying
          1. Back propagation
          2. Predicting words
        4. Improving accuracy using a dictionary
          1. Ranking mechanisms for words
          2. Putting it all together
        5. Summary
      16. 9. Authorship Attribution
        1. Attributing documents to authors
          1. Applications and use cases
          2. Attributing authorship
          3. Getting the data
        2. Function words
          1. Counting function words
          2. Classifying with function words
        3. Support vector machines
          1. Classifying with SVMs
          2. Kernels
        4. Character n-grams
          1. Extracting character n-grams
        5. Using the Enron dataset
          1. Accessing the Enron dataset
          2. Creating a dataset loader
          3. Putting it all together
          4. Evaluation
        6. Summary
      17. 10. Clustering News Articles
        1. Obtaining news articles
          1. Using a Web API to get data
          2. Reddit as a data source
          3. Getting the data
        2. Extracting text from arbitrary websites
          1. Finding the stories in arbitrary websites
          2. Putting it all together
        3. Grouping news articles
          1. The k-means algorithm
          2. Evaluating the results
          3. Extracting topic information from clusters
          4. Using clustering algorithms as transformers
        4. Clustering ensembles
          1. Evidence accumulation
          2. How it works
          3. Implementation
        5. Online learning
          1. An introduction to online learning
          2. Implementation
        6. Summary
      18. 11. Classifying Objects in Images Using Deep Learning
        1. Object classification
        2. Application scenario and goals
          1. Use cases
        3. Deep neural networks
          1. Intuition
          2. Implementation
          3. An introduction to Theano
          4. An introduction to Lasagne
          5. Implementing neural networks with nolearn
        4. GPU optimization
          1. When to use GPUs for computation
          2. Running our code on a GPU
        5. Setting up the environment
        6. Application
          1. Getting the data
          2. Creating the neural network
          3. Putting it all together
        7. Summary
      19. 12. Working with Big Data
        1. Big data
        2. Application scenario and goals
        3. MapReduce
          1. Intuition
          2. A word count example
          3. Hadoop MapReduce
        4. Application
          1. Getting the data
          2. Naive Bayes prediction
            1. The mrjob package
            2. Extracting the blog posts
            3. Training Naive Bayes
            4. Putting it all together
            5. Training on Amazon's EMR infrastructure
        5. Summary
      20. A. Next Steps…
        1. Chapter 1 – Getting Started with Data Mining
          1. Scikit-learn tutorials
          2. Extending the IPython Notebook
        2. Chapter 2 – Classifying with scikit-learn Estimators
          1. Scalability with the nearest neighbor
          2. More complex pipelines
          3. Comparing classifiers
        3. Chapter 3 – Predicting Sports Winners with Decision Trees
          1. More on pandas
          2. More complex features
        4. Chapter 4 – Recommending Movies Using Affinity Analysis
          1. New datasets
          2. The Eclat algorithm
        5. Chapter 5 – Extracting Features with Transformers
          1. Adding noise
          2. Vowpal Wabbit
        6. Chapter 6 – Social Media Insight Using Naive Bayes
          1. Spam detection
          2. Natural language processing and part-of-speech tagging
        7. Chapter 7 – Discovering Accounts to Follow Using Graph Mining
          1. More complex algorithms
          2. NetworkX
        8. Chapter 8 – Beating CAPTCHAs with Neural Networks
          1. Better (worse?) CAPTCHAs
          2. Deeper networks
          3. Reinforcement learning
        9. Chapter 9 – Authorship Attribution
          1. Increasing the sample size
          2. Blogs dataset
          3. Local n-grams
        10. Chapter 10 – Clustering News Articles
          1. Evaluation
          2. Temporal analysis
          3. Real-time clusterings
        11. Chapter 11 – Classifying Objects in Images Using Deep Learning
          1. Keras and Pylearn2
          2. Mahotas
        12. Chapter 12 – Working with Big Data
          1. Courses on Hadoop
          2. Pydoop
          3. Recommendation engine
        13. More resources
      21. Index