Python: Real-World Data Science

Book Description

Unleash the power of Python and its robust data science capabilities

About This Book

  • Unleash the power of Python 3 objects

  • Learn to use powerful Python libraries for effective data processing and analysis

  • Harness the power of Python to analyze data and create insightful predictive models

  • Unlock deeper insights into machine learning with this vital guide to cutting-edge predictive analytics

Who This Book Is For

Entry-level analysts who want to enter the data science world will find this course useful for getting acquainted with Python’s data science capabilities for real-world data analysis.

What You Will Learn

  • Install and set up Python

  • Implement objects in Python by creating classes and defining methods

  • Get acquainted with NumPy to use it with arrays and array-oriented computing in data analysis

  • Create effective visualizations for presenting your data using Matplotlib

  • Process and analyze data using the time series capabilities of pandas

  • Interact with different kinds of data stores, such as text files, binary disk formats, MongoDB, and Redis

  • Apply data mining concepts to real-world problems

  • Compute on big data, including real-time data from the Internet

  • Explore how to use different machine learning models to ask different questions of your data

In Detail

The Python: Real-World Data Science course will take you on a journey to become an efficient data science practitioner by thoroughly understanding the key concepts of Python. This learning path is divided into four modules, and each module is a mini-course in its own right. As you complete each one, you’ll gain key skills and be ready for the material in the next module.

The course begins with getting your Python fundamentals nailed down. Once you are familiar with Python's core concepts, it’s time to dive into the field of data science. In the second module, you'll learn how to perform data analysis using Python in a practical and example-driven way. The third module will teach you how to design and develop data mining applications using a variety of datasets, ranging from basic classification and affinity analysis to more complex data types including text, images, and graphs. Machine learning and predictive analytics have become some of the most important approaches for uncovering data gold mines. In the final module, we'll discuss the necessary details of machine learning concepts, offering intuitive yet informative explanations of how machine learning algorithms work, how to use them, and, most importantly, how to avoid the common pitfalls.

Style and Approach

This course includes all the resources that will help you jump into the data science field with Python and learn how to make sense of data. The aim is to create a smooth learning path that will teach you how to get started with powerful Python libraries and perform various data science techniques in depth.

Downloading the Example Code

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

    1. Data Science with Python
      1. Table of Contents
      2. Data Science with Python
      3. Meet Your Course Guide
      4. What's so cool about Data Science?
      5. Course Structure
      6. Course Journey
      7. The Course Roadmap and Timeline
      8. 1. Course Module 1: Python Fundamentals
        1. 1. Introduction and First Steps – Take a Deep Breath
          1. A proper introduction
          2. Enter the Python
          3. About Python
            1. Portability
            2. Coherence
            3. Developer productivity
            4. An extensive library
            5. Software quality
            6. Software integration
            7. Satisfaction and enjoyment
          4. What are the drawbacks?
          5. Who is using Python today?
          6. Setting up the environment
            1. Python 2 versus Python 3 – the great debate
          7. What you need for this course
            1. Installing Python
            2. Installing IPython
            3. Installing additional packages
          8. How you can run a Python program
            1. Running Python scripts
            2. Running the Python interactive shell
            3. Running Python as a service
            4. Running Python as a GUI application
          9. How is Python code organized
            1. How do we use modules and packages
          10. Python's execution model
            1. Names and namespaces
            2. Scopes
          11. Guidelines on how to write good code
          12. The Python culture
          13. A note on the IDEs
        2. 2. Object-oriented Design
          1. Introducing object-oriented
          2. Objects and classes
          3. Specifying attributes and behaviors
            1. Data describes objects
            2. Behaviors are actions
          4. Hiding details and creating the public interface
          5. Composition
          6. Inheritance
            1. Inheritance provides abstraction
            2. Multiple inheritance
          7. Case study
        3. 3. Objects in Python
          1. Creating Python classes
            1. Adding attributes
            2. Making it do something
              1. Talking to yourself
              2. More arguments
            3. Initializing the object
            4. Explaining yourself
          2. Modules and packages
            1. Organizing the modules
              1. Absolute imports
              2. Relative imports
          3. Organizing module contents
          4. Who can access my data?
          5. Third-party libraries
          6. Case study
        4. 4. When Objects Are Alike
          1. Basic inheritance
            1. Extending built-ins
            2. Overriding and super
          2. Multiple inheritance
            1. The diamond problem
            2. Different sets of arguments
          3. Polymorphism
          4. Abstract base classes
            1. Using an abstract base class
            2. Creating an abstract base class
            3. Demystifying the magic
          5. Case study
        5. 5. Expecting the Unexpected
          1. Raising exceptions
            1. Raising an exception
            2. The effects of an exception
            3. Handling exceptions
            4. The exception hierarchy
            5. Defining our own exceptions
          2. Case study
        6. 6. When to Use Object-oriented Programming
          1. Treat objects as objects
          2. Adding behavior to class data with properties
            1. Properties in detail
            2. Decorators – another way to create properties
            3. Deciding when to use properties
          3. Manager objects
            1. Removing duplicate code
            2. In practice
          4. Case study
        7. 7. Python Data Structures
          1. Empty objects
          2. Tuples and named tuples
            1. Named tuples
          3. Dictionaries
            1. Dictionary use cases
            2. Using defaultdict
              1. Counter
          4. Lists
            1. Sorting lists
          5. Sets
          6. Extending built-ins
          7. Queues
            1. FIFO queues
            2. LIFO queues
            3. Priority queues
          8. Case study
        8. 8. Python Object-oriented Shortcuts
          1. Python built-in functions
            1. The len() function
            2. Reversed
            3. Enumerate
            4. File I/O
            5. Placing it in context
          2. An alternative to method overloading
            1. Default arguments
            2. Variable argument lists
            3. Unpacking arguments
          3. Functions are objects too
            1. Using functions as attributes
            2. Callable objects
          4. Case study
        9. 9. Strings and Serialization
          1. Strings
            1. String manipulation
            2. String formatting
              1. Escaping braces
              2. Keyword arguments
              3. Container lookups
              4. Object lookups
              5. Making it look right
            3. Strings are Unicode
              1. Converting bytes to text
              2. Converting text to bytes
            4. Mutable byte strings
          2. Regular expressions
            1. Matching patterns
              1. Matching a selection of characters
              2. Escaping characters
              3. Matching multiple characters
              4. Grouping patterns together
            2. Getting information from regular expressions
              1. Making repeated regular expressions efficient
          3. Serializing objects
            1. Customizing pickles
            2. Serializing web objects
          4. Case study
        10. 10. The Iterator Pattern
          1. Design patterns in brief
          2. Iterators
            1. The iterator protocol
          3. Comprehensions
            1. List comprehensions
            2. Set and dictionary comprehensions
            3. Generator expressions
          4. Generators
            1. Yield items from another iterable
          5. Coroutines
            1. Back to log parsing
            2. Closing coroutines and throwing exceptions
            3. The relationship between coroutines, generators, and functions
          6. Case study
        11. 11. Python Design Patterns I
          1. The decorator pattern
            1. A decorator example
            2. Decorators in Python
          2. The observer pattern
            1. An observer example
          3. The strategy pattern
            1. A strategy example
            2. Strategy in Python
          4. The state pattern
            1. A state example
            2. State versus strategy
            3. State transition as coroutines
          5. The singleton pattern
            1. Singleton implementation
          6. The template pattern
            1. A template example
        12. 12. Python Design Patterns II
          1. The adapter pattern
          2. The facade pattern
          3. The flyweight pattern
          4. The command pattern
          5. The abstract factory pattern
          6. The composite pattern
        13. 13. Testing Object-oriented Programs
          1. Why test?
            1. Test-driven development
          2. Unit testing
            1. Assertion methods
            2. Reducing boilerplate and cleaning up
            3. Organizing and running tests
            4. Ignoring broken tests
          3. Testing with py.test
            1. One way to do setup and cleanup
            2. A completely different way to set up variables
            3. Skipping tests with py.test
          4. Imitating expensive objects
          5. How much testing is enough?
          6. Case study
            1. Implementing it
        14. 14. Concurrency
          1. Threads
            1. The many problems with threads
              1. Shared memory
              2. The global interpreter lock
            2. Thread overhead
          2. Multiprocessing
            1. Multiprocessing pools
            2. Queues
            3. The problems with multiprocessing
          3. Futures
          4. AsyncIO
            1. AsyncIO in action
            2. Reading an AsyncIO future
            3. AsyncIO for networking
            4. Using executors to wrap blocking code
            5. Streams
              1. Executors
          5. Case study
      9. 2. Course Module 2: Data Analysis
        1. 1. Introducing Data Analysis and Libraries
          1. Data analysis and processing
          2. An overview of the libraries in data analysis
          3. Python libraries in data analysis
            1. NumPy
            2. pandas
            3. Matplotlib
            4. PyMongo
            5. The scikit-learn library
        2. 2. NumPy Arrays and Vectorized Computation
          1. NumPy arrays
            1. Data types
            2. Array creation
            3. Indexing and slicing
            4. Fancy indexing
            5. Numerical operations on arrays
          2. Array functions
          3. Data processing using arrays
            1. Loading and saving data
            2. Saving an array
            3. Loading an array
          4. Linear algebra with NumPy
          5. NumPy random numbers
        3. 3. Data Analysis with pandas
          1. An overview of the pandas package
          2. The pandas data structure
            1. Series
            2. The DataFrame
          3. The essential basic functionality
            1. Reindexing and altering labels
            2. Head and tail
            3. Binary operations
            4. Functional statistics
            5. Function application
            6. Sorting
          4. Indexing and selecting data
          5. Computational tools
          6. Working with missing data
          7. Advanced uses of pandas for data analysis
            1. Hierarchical indexing
            2. The Panel data
        4. 4. Data Visualization
          1. The matplotlib API primer
            1. Line properties
            2. Figures and subplots
          2. Exploring plot types
            1. Scatter plots
            2. Bar plots
            3. Contour plots
            4. Histogram plots
          3. Legends and annotations
          4. Plotting functions with pandas
          5. Additional Python data visualization tools
            1. Bokeh
            2. MayaVi
        5. 5. Time Series
          1. Time series primer
          2. Working with date and time objects
          3. Resampling time series
          4. Downsampling time series data
          5. Upsampling time series data
          6. Timedeltas
          7. Time series plotting
        6. 6. Interacting with Databases
          1. Interacting with data in text format
            1. Reading data from text format
            2. Writing data to text format
          2. Interacting with data in binary format
            1. HDF5
          3. Interacting with data in MongoDB
          4. Interacting with data in Redis
            1. The simple value
            2. List
            3. Set
            4. Ordered set
        7. 7. Data Analysis Application Examples
          1. Data munging
            1. Cleaning data
            2. Filtering
            3. Merging data
            4. Reshaping data
          2. Data aggregation
          3. Grouping data
      10. 3. Course Module 3: Data Mining
        1. 1. Getting Started with Data Mining
          1. Introducing data mining
          2. A simple affinity analysis example
            1. What is affinity analysis?
            2. Product recommendations
            3. Loading the dataset with NumPy
            4. Implementing a simple ranking of rules
            5. Ranking to find the best rules
          3. A simple classification example
          4. What is classification?
            1. Loading and preparing the dataset
            2. Implementing the OneR algorithm
            3. Testing the algorithm
        2. 2. Classifying with scikit-learn Estimators
          1. scikit-learn estimators
            1. Nearest neighbors
            2. Distance metrics
            3. Loading the dataset
            4. Moving towards a standard workflow
            5. Running the algorithm
            6. Setting parameters
          2. Preprocessing using pipelines
            1. An example
            2. Standard preprocessing
            3. Putting it all together
          3. Pipelines
        3. 3. Predicting Sports Winners with Decision Trees
          1. Loading the dataset
            1. Collecting the data
            2. Using pandas to load the dataset
            3. Cleaning up the dataset
            4. Extracting new features
          2. Decision trees
            1. Parameters in decision trees
            2. Using decision trees
          3. Sports outcome prediction
            1. Putting it all together
          4. Random forests
            1. How do ensembles work?
            2. Parameters in Random forests
            3. Applying Random forests
            4. Engineering new features
        4. 4. Recommending Movies Using Affinity Analysis
          1. Affinity analysis
            1. Algorithms for affinity analysis
            2. Choosing parameters
          2. The movie recommendation problem
            1. Obtaining the dataset
            2. Loading with pandas
            3. Sparse data formats
          3. The Apriori implementation
            1. The Apriori algorithm
            2. Implementation
          4. Extracting association rules
            1. Evaluation
        5. 5. Extracting Features with Transformers
          1. Feature extraction
            1. Representing reality in models
            2. Common feature patterns
            3. Creating good features
          2. Feature selection
            1. Selecting the best individual features
          3. Feature creation
          4. Creating your own transformer
            1. The transformer API
            2. Implementation details
            3. Unit testing
            4. Putting it all together
        6. 6. Social Media Insight Using Naive Bayes
          1. Disambiguation
            1. Downloading data from a social network
            2. Loading and classifying the dataset
            3. Creating a replicable dataset from Twitter
          2. Text transformers
            1. Bag-of-words
            2. N-grams
            3. Other features
          3. Naive Bayes
            1. Bayes' theorem
            2. Naive Bayes algorithm
            3. How it works
          4. Application
            1. Extracting word counts
            2. Converting dictionaries to a matrix
            3. Training the Naive Bayes classifier
            4. Putting it all together
            5. Evaluation using the F1-score
            6. Getting useful features from models
        7. 7. Discovering Accounts to Follow Using Graph Mining
          1. Loading the dataset
            1. Classifying with an existing model
            2. Getting follower information from Twitter
            3. Building the network
            4. Creating a graph
            5. Creating a similarity graph
          2. Finding subgraphs
            1. Connected components
            2. Optimizing criteria
        8. 8. Beating CAPTCHAs with Neural Networks
          1. Artificial neural networks
            1. An introduction to neural networks
          2. Creating the dataset
            1. Drawing basic CAPTCHAs
            2. Splitting the image into individual letters
            3. Creating a training dataset
            4. Adjusting our training dataset to our methodology
          3. Training and classifying
            1. Back propagation
            2. Predicting words
          4. Improving accuracy using a dictionary
            1. Ranking mechanisms for words
            2. Putting it all together
        9. 9. Authorship Attribution
          1. Attributing documents to authors
            1. Applications and use cases
            2. Attributing authorship
            3. Getting the data
          2. Function words
            1. Counting function words
            2. Classifying with function words
          3. Support vector machines
            1. Classifying with SVMs
            2. Kernels
          4. Character n-grams
            1. Extracting character n-grams
          5. Using the Enron dataset
            1. Accessing the Enron dataset
            2. Creating a dataset loader
            3. Putting it all together
            4. Evaluation
        10. 10. Clustering News Articles
          1. Obtaining news articles
            1. Using a Web API to get data
            2. Reddit as a data source
            3. Getting the data
          2. Extracting text from arbitrary websites
            1. Finding the stories in arbitrary websites
            2. Putting it all together
          3. Grouping news articles
            1. The k-means algorithm
            2. Evaluating the results
            3. Extracting topic information from clusters
            4. Using clustering algorithms as transformers
          4. Clustering ensembles
            1. Evidence accumulation
            2. How it works
            3. Implementation
          5. Online learning
            1. An introduction to online learning
            2. Implementation
        11. 11. Classifying Objects in Images Using Deep Learning
          1. Object classification
          2. Application scenario and goals
            1. Use cases
          3. Deep neural networks
            1. Intuition
            2. Implementation
            3. An introduction to Theano
            4. An introduction to Lasagne
            5. Implementing neural networks with nolearn
          4. GPU optimization
            1. When to use GPUs for computation
            2. Running our code on a GPU
          5. Setting up the environment
          6. Application
            1. Getting the data
            2. Creating the neural network
            3. Putting it all together
        12. 12. Working with Big Data
          1. Big data
          2. Application scenario and goals
          3. MapReduce
            1. Intuition
            2. A word count example
            3. Hadoop MapReduce
          4. Application
            1. Getting the data
            2. Naive Bayes prediction
              1. The mrjob package
              2. Extracting the blog posts
              3. Training Naive Bayes
              4. Putting it all together
              5. Training on Amazon's EMR infrastructure
        13. 13. Next Steps…
          1. Chapter 1 – Getting Started with Data Mining
            1. Scikit-learn tutorials
            2. Extending the IPython Notebook
          2. Chapter 2 – Classifying with scikit-learn Estimators
            1. More complex pipelines
            2. Comparing classifiers
          3. Chapter 3 – Predicting Sports Winners with Decision Trees
            1. More on pandas
          4. Chapter 4 – Recommending Movies Using Affinity Analysis
            1. The Eclat algorithm
          5. Chapter 5 – Extracting Features with Transformers
            1. Vowpal Wabbit
          6. Chapter 6 – Social Media Insight Using Naive Bayes
            1. Natural language processing and part-of-speech tagging
          7. Chapter 7 – Discovering Accounts to Follow Using Graph Mining
            1. More complex algorithms
          8. Chapter 8 – Beating CAPTCHAs with Neural Networks
            1. Deeper networks
            2. Reinforcement learning
          9. Chapter 9 – Authorship Attribution
            1. Local n-grams
          10. Chapter 10 – Clustering News Articles
            1. Real-time clusterings
          11. Chapter 11 – Classifying Objects in Images Using Deep Learning
            1. Keras and Pylearn2
            2. Mahotas
          12. Chapter 12 – Working with Big Data
            1. Courses on Hadoop
            2. Pydoop
            3. Recommendation engine
          13. More resources
      11. 4. Course Module 4: Machine Learning
        1. 1. Giving Computers the Ability to Learn from Data
          1. How to transform data into knowledge
          2. The three different types of machine learning
            1. Making predictions about the future with supervised learning
              1. Classification for predicting class labels
              2. Regression for predicting continuous outcomes
            2. Solving interactive problems with reinforcement learning
            3. Discovering hidden structures with unsupervised learning
              1. Finding subgroups with clustering
              2. Dimensionality reduction for data compression
          3. An introduction to the basic terminology and notations
          4. A roadmap for building machine learning systems
            1. Preprocessing – getting data into shape
            2. Training and selecting a predictive model
            3. Evaluating models and predicting unseen data instances
          5. Using Python for machine learning
        2. 2. Training Machine Learning Algorithms for Classification
          1. Artificial neurons – a brief glimpse into the early history of machine learning
          2. Implementing a perceptron learning algorithm in Python
            1. Training a perceptron model on the Iris dataset
          3. Adaptive linear neurons and the convergence of learning
            1. Minimizing cost functions with gradient descent
            2. Implementing an Adaptive Linear Neuron in Python
            3. Large scale machine learning and stochastic gradient descent
        3. 3. A Tour of Machine Learning Classifiers Using scikit-learn
          1. Choosing a classification algorithm
          2. First steps with scikit-learn
            1. Training a perceptron via scikit-learn
          3. Modeling class probabilities via logistic regression
            1. Logistic regression intuition and conditional probabilities
            2. Learning the weights of the logistic cost function
            3. Training a logistic regression model with scikit-learn
            4. Tackling overfitting via regularization
          4. Maximum margin classification with support vector machines
            1. Maximum margin intuition
            2. Dealing with the nonlinearly separable case using slack variables
            3. Alternative implementations in scikit-learn
          5. Solving nonlinear problems using a kernel SVM
            1. Using the kernel trick to find separating hyperplanes in higher dimensional space
          6. Decision tree learning
            1. Maximizing information gain – getting the most bang for the buck
            2. Building a decision tree
            3. Combining weak to strong learners via random forests
          7. K-nearest neighbors – a lazy learning algorithm
        4. 4. Building Good Training Sets – Data Preprocessing
          1. Dealing with missing data
            1. Eliminating samples or features with missing values
            2. Imputing missing values
            3. Understanding the scikit-learn estimator API
          2. Handling categorical data
            1. Mapping ordinal features
            2. Encoding class labels
            3. Performing one-hot encoding on nominal features
          3. Partitioning a dataset in training and test sets
          4. Bringing features onto the same scale
          5. Selecting meaningful features
            1. Sparse solutions with L1 regularization
            2. Sequential feature selection algorithms
          6. Assessing feature importance with random forests
        5. 5. Compressing Data via Dimensionality Reduction
          1. Unsupervised dimensionality reduction via principal component analysis
            1. Total and explained variance
            2. Feature transformation
            3. Principal component analysis in scikit-learn
          2. Supervised data compression via linear discriminant analysis
            1. Computing the scatter matrices
            2. Selecting linear discriminants for the new feature subspace
            3. Projecting samples onto the new feature space
            4. LDA via scikit-learn
          3. Using kernel principal component analysis for nonlinear mappings
            1. Kernel functions and the kernel trick
            2. Implementing a kernel principal component analysis in Python
              1. Example 1 – separating half-moon shapes
              2. Example 2 – separating concentric circles
            3. Projecting new data points
            4. Kernel principal component analysis in scikit-learn
        6. 6. Learning Best Practices for Model Evaluation and Hyperparameter Tuning
          1. Streamlining workflows with pipelines
            1. Loading the Breast Cancer Wisconsin dataset
            2. Combining transformers and estimators in a pipeline
          2. Using k-fold cross-validation to assess model performance
            1. The holdout method
            2. K-fold cross-validation
          3. Debugging algorithms with learning and validation curves
            1. Diagnosing bias and variance problems with learning curves
            2. Addressing overfitting and underfitting with validation curves
          4. Fine-tuning machine learning models via grid search
            1. Tuning hyperparameters via grid search
            2. Algorithm selection with nested cross-validation
          5. Looking at different performance evaluation metrics
            1. Reading a confusion matrix
            2. Optimizing the precision and recall of a classification model
            3. Plotting a receiver operating characteristic
            4. The scoring metrics for multiclass classification
        7. 7. Combining Different Models for Ensemble Learning
          1. Learning with ensembles
          2. Implementing a simple majority vote classifier
            1. Combining different algorithms for classification with majority vote
          3. Evaluating and tuning the ensemble classifier
          4. Bagging – building an ensemble of classifiers from bootstrap samples
          5. Leveraging weak learners via adaptive boosting
        8. 8. Predicting Continuous Target Variables with Regression Analysis
          1. Introducing a simple linear regression model
          2. Exploring the Housing Dataset
            1. Visualizing the important characteristics of a dataset
          3. Implementing an ordinary least squares linear regression model
            1. Solving regression for regression parameters with gradient descent
            2. Estimating the coefficient of a regression model via scikit-learn
          4. Fitting a robust regression model using RANSAC
          5. Evaluating the performance of linear regression models
          6. Using regularized methods for regression
          7. Turning a linear regression model into a curve – polynomial regression
            1. Modeling nonlinear relationships in the Housing Dataset
            2. Dealing with nonlinear relationships using random forests
              1. Decision tree regression
              2. Random forest regression
        9. A. Reflect and Test Yourself! Answers
          1. Module 2: Data Analysis
            1. Chapter 1: Introducing Data Analysis and Libraries
            2. Chapter 2: NumPy Arrays and Vectorized Computation
            3. Chapter 3: Data Analysis with pandas
            4. Chapter 4: Data Visualization
            5. Chapter 5: Time Series
            6. Chapter 6: Interacting with Databases
            7. Chapter 7: Data Analysis Application Examples
          2. Module 3: Data Mining
            1. Chapter 1: Getting Started with Data Mining
            2. Chapter 2: Classifying with scikit-learn Estimators
            3. Chapter 3: Predicting Sports Winners with Decision Trees
            4. Chapter 4: Recommending Movies Using Affinity Analysis
            5. Chapter 5: Extracting Features with Transformers
            6. Chapter 6: Social Media Insight Using Naive Bayes
            7. Chapter 7: Discovering Accounts to Follow Using Graph Mining
            8. Chapter 8: Beating CAPTCHAs with Neural Networks
            9. Chapter 9: Authorship Attribution
            10. Chapter 10: Clustering News Articles
            11. Chapter 11: Classifying Objects in Images Using Deep Learning
            12. Chapter 12: Working with Big Data
          3. Module 4: Machine Learning
            1. Chapter 1: Giving Computers the Ability to Learn from Data
            2. Chapter 2: Training Machine Learning Algorithms for Classification
            3. Chapter 3: A Tour of Machine Learning Classifiers Using scikit-learn
            4. Chapter 4: Building Good Training Sets – Data Preprocessing
            5. Chapter 5: Compressing Data via Dimensionality Reduction
            6. Chapter 6: Learning Best Practices for Model Evaluation and Hyperparameter Tuning
            7. Chapter 7: Combining Different Models for Ensemble Learning
            8. Chapter 8: Predicting Continuous Target Variables with Regression Analysis
        10. B. Bibliography
      12. Index