scikit-learn Cookbook

Book Description

Over 50 recipes to incorporate scikit-learn into every step of the data science pipeline, from feature extraction to model building and model evaluation

In Detail

Python is quickly becoming the go-to language for analysts and data scientists due to its simplicity and flexibility, and within the Python data space, scikit-learn is the unequivocal choice for machine learning. Its consistent API and wide range of features help you solve the machine learning problems you come across.

The book starts by walking through different methods to prepare your data—be it a dataset with missing values or text columns that require the categories to be turned into indicator variables. After the data is ready, you'll learn different techniques aligned with different objectives—be it a dataset with known outcomes such as sales by state, or more complicated problems such as clustering similar customers. Finally, you'll learn how to polish your algorithm to ensure that it's both accurate and resilient to new datasets.

What You Will Learn

  • Address algorithms of various levels of complexity and learn how to analyze data at the same time
  • Handle common data problems such as feature extraction and missing data
  • Understand how to evaluate your models against themselves and any other model
  • Discover just enough math to think about the connections between various algorithms
  • Customize the machine learning algorithm to fit your problem, and learn how to modify it when the situation calls for it
  • Incorporate other packages from the Python ecosystem to munge and visualize your dataset

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

    Table of Contents

    1. scikit-learn Cookbook
      1. Table of Contents
      2. scikit-learn Cookbook
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Sections
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
          5. See also
        5. Conventions
        6. Reader feedback
        7. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      8. 1. Premodel Workflow
        1. Introduction
        2. Getting sample data from external sources
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
          5. See also
        3. Creating sample data for toy analysis
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Scaling data to the standard normal
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Creating idempotent scalar objects
            2. Handling sparse imputations
        5. Creating binary features through thresholding
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Sparse matrices
            2. The fit method
        6. Working with categorical variables
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. DictVectorizer
            2. Patsy
        7. Binarizing label features
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        8. Imputing missing values through various strategies
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        9. Using Pipelines for multiple preprocessing steps
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Reducing dimensionality with PCA
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        11. Using factor analysis for decomposition
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. Kernel PCA for nonlinear dimensionality reduction
          1. Getting ready
          2. How to do it...
          3. How it works...
        13. Using truncated SVD to reduce dimensionality
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Sign flipping
            2. Sparse matrices
        14. Decomposition to classify with DictionaryLearning
          1. Getting ready
          2. How to do it...
          3. How it works...
        15. Putting it all together with Pipelines
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        16. Using Gaussian processes for regression
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
        17. Defining the Gaussian process object directly
          1. Getting ready
          2. How to do it…
          3. How it works…
        18. Using stochastic gradient descent for regression
          1. Getting ready
          2. How to do it…
          3. How it works…
      9. 2. Working with Linear Models
        1. Introduction
        2. Fitting a line through data
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        3. Evaluating the linear regression model
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        4. Using ridge regression to overcome linear regression's shortfalls
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Optimizing the ridge regression parameter
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        6. Using sparsity to regularize models
          1. Getting ready
          2. How to do it...
          3. How it works...
            1. Lasso cross-validation
            2. Lasso for feature selection
        7. Taking a more fundamental approach to regularization with LARS
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        8. Using linear methods for classification – logistic regression
          1. Getting ready
          2. How to do it...
          3. There's more...
        9. Directly applying Bayesian ridge regression
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        10. Using boosting to learn from errors
          1. Getting ready
          2. How to do it...
          3. How it works...
      10. 3. Building Models with Distance Metrics
        1. Introduction
        2. Using KMeans to cluster data
          1. Getting ready
          2. How to do it…
          3. How it works...
        3. Optimizing the number of centroids
          1. Getting ready
          2. How to do it…
          3. How it works…
        4. Assessing cluster correctness
          1. Getting ready
          2. How to do it...
          3. There's more...
        5. Using MiniBatch KMeans to handle more data
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Quantizing an image with KMeans clustering
          1. Getting ready
          2. How to do it…
          3. How it works…
        7. Finding the closest objects in the feature space
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        8. Probabilistic clustering with Gaussian Mixture Models
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Using KMeans for outlier detection
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Using k-NN for regression
          1. Getting ready
          2. How to do it…
          3. How it works...
      11. 4. Classifying Data with scikit-learn
        1. Introduction
        2. Doing basic classifications with Decision Trees
          1. Getting ready
          2. How to do it…
          3. How it works…
        3. Tuning a Decision Tree model
          1. Getting ready
          2. How to do it…
          3. How it works…
        4. Using many Decision Trees – random forests
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
        5. Tuning a random forest model
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
        6. Classifying data with support vector machines
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
        7. Generalizing with multiclass classification
          1. Getting ready
          2. How to do it…
          3. How it works…
        8. Using LDA for classification
          1. Getting ready
          2. How to do it…
          3. How it works…
        9. Working with QDA – a nonlinear LDA
          1. Getting ready
          2. How to do it…
          3. How it works…
        10. Using Stochastic Gradient Descent for classification
          1. Getting ready
          2. How to do it…
        11. Classifying documents with Naïve Bayes
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
        12. Label propagation with semi-supervised learning
          1. Getting ready
          2. How to do it…
          3. How it works…
      12. 5. Postmodel Workflow
        1. Introduction
        2. K-fold cross validation
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Automatic cross validation
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Cross validation with ShuffleSplit
          1. Getting ready
          2. How to do it...
        5. Stratified k-fold
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Poor man's grid search
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Brute force grid search
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Using dummy estimators to compare results
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Regression model evaluation
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Feature selection
          1. Getting ready
          2. How to do it...
          3. How it works...
        11. Feature selection on L1 norms
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. Persisting models with joblib
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
      13. Index