Building Machine Learning Systems with Python - Second Edition

Book Description

Get more from your data by creating practical machine learning systems with Python

In Detail

Using machine learning to gain deeper insights from data is a key skill required by modern application developers and analysts alike. Python is a wonderful language in which to develop machine learning applications. As a dynamic language, it allows for fast exploration and experimentation. With its excellent collection of open source machine learning libraries, you can focus on the task at hand while quickly trying out many ideas.

This book shows you exactly how to find patterns in your raw data. You will start by brushing up on your Python machine learning knowledge and getting to know the core libraries. You'll then work through serious, real-world projects on real datasets, covering modeling, classification, and recommendation systems. Later on, the book covers advanced topics such as topic modeling, basket analysis, and cloud computing. These will extend your abilities and enable you to build large, complex systems.

With this book, you gain the tools and understanding required to build your own systems, tailored to solve your real-world data analysis problems.

What You Will Learn

  • Build a classification system that can be applied to text, images, or sounds

  • Use NumPy, SciPy, and scikit-learn – open source Python libraries for scientific computing and machine learning

  • Explore the mahotas library for image processing and computer vision

  • Build a topic model for the whole of Wikipedia

  • Employ Amazon Web Services to run analysis on the cloud

  • Debug machine learning problems

  • Get to grips with recommendations using basket analysis

  • Recommend products to users based on past purchases

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

    1. Building Machine Learning Systems with Python Second Edition
      1. Table of Contents
      2. Building Machine Learning Systems with Python Second Edition
      3. Credits
      4. About the Authors
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Getting Started with Python Machine Learning
        1. Machine learning and Python – a dream team
        2. What the book will teach you (and what it will not)
        3. What to do when you are stuck
        4. Getting started
          1. Introduction to NumPy, SciPy, and matplotlib
          2. Installing Python
          3. Chewing data efficiently with NumPy and intelligently with SciPy
          4. Learning NumPy
            1. Indexing
            2. Handling nonexisting values
            3. Comparing the runtime
          5. Learning SciPy
        5. Our first (tiny) application of machine learning
          1. Reading in the data
          2. Preprocessing and cleaning the data
          3. Choosing the right model and learning algorithm
            1. Before building our first model…
            2. Starting with a simple straight line
            3. Towards some advanced stuff
            4. Stepping back to go forward – another look at our data
            5. Training and testing
            6. Answering our initial question
        6. Summary
      9. 2. Classifying with Real-world Examples
        1. The Iris dataset
          1. Visualization is a good first step
          2. Building our first classification model
            1. Evaluation – holding out data and cross-validation
        2. Building more complex classifiers
        3. A more complex dataset and a more complex classifier
          1. Learning about the Seeds dataset
          2. Features and feature engineering
          3. Nearest neighbor classification
        4. Classifying with scikit-learn
          1. Looking at the decision boundaries
        5. Binary and multiclass classification
        6. Summary
      10. 3. Clustering – Finding Related Posts
        1. Measuring the relatedness of posts
          1. How not to do it
          2. How to do it
        2. Preprocessing – similarity measured as a similar number of common words
          1. Converting raw text into a bag of words
            1. Counting words
            2. Normalizing word count vectors
            3. Removing less important words
            4. Stemming
              1. Installing and using NLTK
              2. Extending the vectorizer with NLTK's stemmer
            5. Stop words on steroids
          2. Our achievements and goals
        3. Clustering
          1. K-means
          2. Getting test data to evaluate our ideas on
          3. Clustering posts
        4. Solving our initial challenge
          1. Another look at noise
        5. Tweaking the parameters
        6. Summary
      11. 4. Topic Modeling
        1. Latent Dirichlet allocation
          1. Building a topic model
        2. Comparing documents by topics
          1. Modeling the whole of Wikipedia
        3. Choosing the number of topics
        4. Summary
      12. 5. Classification – Detecting Poor Answers
        1. Sketching our roadmap
        2. Learning to classify classy answers
          1. Tuning the instance
          2. Tuning the classifier
        3. Fetching the data
          1. Slimming the data down to chewable chunks
          2. Preselection and processing of attributes
          3. Defining what is a good answer
        4. Creating our first classifier
          1. Starting with kNN
          2. Engineering the features
          3. Training the classifier
          4. Measuring the classifier's performance
          5. Designing more features
        5. Deciding how to improve
          1. Bias-variance and their tradeoff
          2. Fixing high bias
          3. Fixing high variance
          4. High bias or low bias
        6. Using logistic regression
          1. A bit of math with a small example
          2. Applying logistic regression to our post classification problem
        7. Looking behind accuracy – precision and recall
        8. Slimming the classifier
        9. Ship it!
        10. Summary
      13. 6. Classification II – Sentiment Analysis
        1. Sketching our roadmap
        2. Fetching the Twitter data
        3. Introducing the Naïve Bayes classifier
          1. Getting to know Bayes' theorem
          2. Being naïve
          3. Using Naïve Bayes to classify
          4. Accounting for unseen words and other oddities
          5. Accounting for arithmetic underflows
        4. Creating our first classifier and tuning it
          1. Solving an easy problem first
          2. Using all classes
          3. Tuning the classifier's parameters
        5. Cleaning tweets
        6. Taking the word types into account
          1. Determining the word types
          2. Successfully cheating using SentiWordNet
          3. Our first estimator
          4. Putting everything together
        7. Summary
      14. 7. Regression
        1. Predicting house prices with regression
          1. Multidimensional regression
          2. Cross-validation for regression
        2. Penalized or regularized regression
          1. L1 and L2 penalties
          2. Using Lasso or ElasticNet in scikit-learn
          3. Visualizing the Lasso path
          4. P-greater-than-N scenarios
          5. An example based on text documents
          6. Setting hyperparameters in a principled way
        3. Summary
      15. 8. Recommendations
        1. Rating predictions and recommendations
          1. Splitting into training and testing
          2. Normalizing the training data
          3. A neighborhood approach to recommendations
          4. A regression approach to recommendations
          5. Combining multiple methods
        2. Basket analysis
          1. Obtaining useful predictions
          2. Analyzing supermarket shopping baskets
          3. Association rule mining
          4. More advanced basket analysis
        3. Summary
      16. 9. Classification – Music Genre Classification
        1. Sketching our roadmap
        2. Fetching the music data
          1. Converting into a WAV format
        3. Looking at music
          1. Decomposing music into sine wave components
        4. Using FFT to build our first classifier
          1. Increasing experimentation agility
          2. Training the classifier
          3. Using a confusion matrix to measure accuracy in multiclass problems
          4. An alternative way to measure classifier performance using receiver operating characteristics
        5. Improving classification performance with Mel Frequency Cepstral Coefficients
        6. Summary
      17. 10. Computer Vision
        1. Introducing image processing
          1. Loading and displaying images
          2. Thresholding
          3. Gaussian blurring
          4. Putting the center in focus
          5. Basic image classification
          6. Computing features from images
          7. Writing your own features
          8. Using features to find similar images
          9. Classifying a harder dataset
        2. Local feature representations
        3. Summary
      18. 11. Dimensionality Reduction
        1. Sketching our roadmap
        2. Selecting features
          1. Detecting redundant features using filters
            1. Correlation
            2. Mutual information
          2. Asking the model about the features using wrappers
          3. Other feature selection methods
        3. Feature extraction
          1. About principal component analysis
            1. Sketching PCA
            2. Applying PCA
          2. Limitations of PCA and how LDA can help
        4. Multidimensional scaling
        5. Summary
      19. 12. Bigger Data
        1. Learning about big data
          1. Using jug to break up your pipeline into tasks
          2. An introduction to tasks in jug
          3. Looking under the hood
          4. Using jug for data analysis
          5. Reusing partial results
        2. Using Amazon Web Services
          1. Creating your first virtual machines
            1. Installing Python packages on Amazon Linux
            2. Running jug on our cloud machine
          2. Automating the generation of clusters with StarCluster
        3. Summary
      20. A. Where to Learn More Machine Learning
        1. Online courses
        2. Books
          1. Question and answer sites
          2. Blogs
          3. Data sources
          4. Getting competitive
        3. All that was left out
        4. Summary
      21. Index