Python: Data Analytics and Visualization

Book Description

Understand, evaluate, and visualize data

About This Book

  • Learn the basic steps of data analysis and how to use Python and its packages
  • Follow a step-by-step guide to predictive modeling, including tips, tricks, and best practices
  • Visualize a broad range of analyzed data and present the results effectively

Who This Book Is For

This book is for Python developers who are keen to get into data analysis and want to visualize their analyzed data in a more efficient and insightful manner.

What You Will Learn

  • Get acquainted with NumPy and use arrays and array-oriented computing in data analysis
  • Process and analyze data using the time-series capabilities of Pandas
  • Understand the statistical and mathematical concepts behind predictive analytics algorithms
  • Data visualization with Matplotlib
  • Interactive plotting with NumPy, SciPy, and MKL functions
  • Build financial models using Monte Carlo simulations
  • Create directed graphs and multi-graphs
  • Advanced visualization with D3

In Detail

You will start the course with an introduction to the principles of data analysis and its supporting libraries, along with NumPy basics for statistics and data processing. Next, you will survey the Pandas package and use its features to solve data-processing problems. Moving on, you will get a brief overview of the Matplotlib API. You will then learn to manipulate time and data structures, and to load and store data in a file or database using Python packages. Through examples, you will learn how to apply Python packages to turn raw data into clean, useful data. You will also get a brief overview of machine learning algorithms, that is, applying the results of data analysis to make decisions or to build useful products such as recommendations and predictions, using scikit-learn.

After this, you will move on to a data analytics specialization: predictive analytics. Social media and the IoT have resulted in an avalanche of data. You will get started with predictive analytics using Python and see how to create predictive models from data. You will get balanced coverage of statistical and mathematical concepts, and implement them in Python using libraries such as Pandas, scikit-learn, and NumPy. You’ll learn more about widely used predictive modeling algorithms such as Linear Regression, Decision Tree, and Logistic Regression. Finally, you will learn best practices in predictive modeling.

After this, you will get practical guidance to help you on the journey to effective data visualization. Starting with a chapter on data frameworks, which explains the transformation of data into information and eventually knowledge, this path subsequently covers the complete visualization process using the most popular Python libraries, with working examples.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

  • Getting Started with Python Data Analysis, Phuong Vo.T.H and Martin Czygan
  • Learning Predictive Analytics with Python, Ashish Kumar
  • Mastering Python Data Visualization, Kirthi Raman

Style and approach

The course acts as a step-by-step guide that familiarizes you with data analysis and the libraries supported by Python, with the help of real-world examples and datasets. It also helps you gain practical insights into predictive modeling by implementing predictive-analytics algorithms on public datasets with Python. The course offers a wealth of practical guidance to help you on this journey to data visualization.

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Python: Data Analytics and Visualization
    1. Table of Contents
    2. Python: Data Analytics and Visualization
    3. Credits
    4. Preface
      1. What this learning path covers
      2. What you need for this learning path
      3. Who this learning path is for
      4. Reader feedback
      5. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    5. 1. Module 1
      1. 1. Introducing Data Analysis and Libraries
        1. Data analysis and processing
        2. An overview of the libraries in data analysis
        3. Python libraries in data analysis
          1. NumPy
          2. Pandas
          3. Matplotlib
          4. PyMongo
          5. The scikit-learn library
        4. Summary
      2. 2. NumPy Arrays and Vectorized Computation
        1. NumPy arrays
          1. Data types
          2. Array creation
          3. Indexing and slicing
          4. Fancy indexing
          5. Numerical operations on arrays
        2. Array functions
        3. Data processing using arrays
          1. Loading and saving data
          2. Saving an array
          3. Loading an array
        4. Linear algebra with NumPy
        5. NumPy random numbers
        6. Summary
      3. 3. Data Analysis with Pandas
        1. An overview of the Pandas package
        2. The Pandas data structure
          1. Series
          2. The DataFrame
        3. The essential basic functionality
          1. Reindexing and altering labels
          2. Head and tail
          3. Binary operations
          4. Functional statistics
          5. Function application
          6. Sorting
        4. Indexing and selecting data
        5. Computational tools
        6. Working with missing data
        7. Advanced uses of Pandas for data analysis
          1. Hierarchical indexing
          2. The Panel data
        8. Summary
      4. 4. Data Visualization
        1. The matplotlib API primer
          1. Line properties
          2. Figures and subplots
        2. Exploring plot types
          1. Scatter plots
          2. Bar plots
          3. Contour plots
          4. Histogram plots
        3. Legends and annotations
        4. Plotting functions with Pandas
        5. Additional Python data visualization tools
          1. Bokeh
          2. MayaVi
        6. Summary
      5. 5. Time Series
        1. Time series primer
        2. Working with date and time objects
        3. Resampling time series
        4. Downsampling time series data
        5. Upsampling time series data
        6. Time zone handling
        7. Timedeltas
        8. Time series plotting
        9. Summary
      6. 6. Interacting with Databases
        1. Interacting with data in text format
          1. Reading data from text format
          2. Writing data to text format
        2. Interacting with data in binary format
          1. HDF5
        3. Interacting with data in MongoDB
        4. Interacting with data in Redis
          1. The simple value
          2. List
          3. Set
          4. Ordered set
        5. Summary
      7. 7. Data Analysis Application Examples
        1. Data munging
          1. Cleaning data
          2. Filtering
          3. Merging data
          4. Reshaping data
        2. Data aggregation
        3. Grouping data
        4. Summary
      8. 8. Machine Learning Models with scikit-learn
        1. An overview of machine learning models
        2. The scikit-learn modules for different models
        3. Data representation in scikit-learn
        4. Supervised learning – classification and regression
        5. Unsupervised learning – clustering and dimensionality reduction
        6. Measuring prediction performance
        7. Summary
    6. 2. Module 2
      1. 1. Getting Started with Predictive Modelling
        1. Introducing predictive modelling
          1. Scope of predictive modelling
            1. Ensemble of statistical algorithms
            2. Statistical tools
            3. Historical data
            4. Mathematical function
            5. Business context
          2. Knowledge matrix for predictive modelling
          3. Task matrix for predictive modelling
        2. Applications and examples of predictive modelling
          1. LinkedIn's "People also viewed" feature
            1. What does it do?
            2. How is it done?
          2. Correct targeting of online ads
            1. How is it done?
          3. Santa Cruz predictive policing
            1. How is it done?
          4. Determining the activity of a smartphone user using accelerometer data
            1. How is it done?
          5. Sport and fantasy leagues
            1. How was it done?
        3. Python and its packages – download and installation
          1. Anaconda
          2. Standalone Python
          3. Installing a Python package
            1. Installing pip
            2. Installing Python packages with pip
        4. Python and its packages for predictive modelling
        5. IDEs for Python
        6. Summary
      2. 2. Data Cleaning
        1. Reading the data – variations and examples
          1. Data frames
          2. Delimiters
        2. Various methods of importing data in Python
          1. Case 1 – reading a dataset using the read_csv method
        3. The read_csv method
        4. Use cases of the read_csv method
          1. Passing the directory address and filename as variables
          2. Reading a .txt dataset with a comma delimiter
          3. Specifying the column names of a dataset from a list
        5. Case 2 – reading a dataset using the open method of Python
          1. Reading a dataset line by line
          2. Changing the delimiter of a dataset
        6. Case 3 – reading data from a URL
        7. Case 4 – miscellaneous cases
          1. Reading from an .xls or .xlsx file
          2. Writing to a CSV or Excel file
        8. Basics – summary, dimensions, and structure
        9. Handling missing values
          1. Checking for missing values
          2. What constitutes missing data?
            1. How missing values are generated and propagated
          3. Treating missing values
            1. Deletion
            2. Imputation
        10. Creating dummy variables
        11. Visualizing a dataset by basic plotting
          1. Scatter plots
          2. Histograms
          3. Boxplots
        12. Summary
      3. 3. Data Wrangling
        1. Subsetting a dataset
          1. Selecting columns
          2. Selecting rows
          3. Selecting a combination of rows and columns
          4. Creating new columns
        2. Generating random numbers and their usage
          1. Various methods for generating random numbers
          2. Seeding a random number
          3. Generating random numbers following probability distributions
            1. Probability density function
            2. Cumulative density function
            3. Uniform distribution
            4. Normal distribution
          4. Using the Monte-Carlo simulation to find the value of pi
            1. Geometry and mathematics behind the calculation of pi
          5. Generating a dummy data frame
        3. Grouping the data – aggregation, filtering, and transformation
          1. Aggregation
          2. Filtering
          3. Transformation
          4. Miscellaneous operations
        4. Random sampling – splitting a dataset in training and testing datasets
          1. Method 1 – using the Customer Churn Model
          2. Method 2 – using sklearn
          3. Method 3 – using the shuffle function
        5. Concatenating and appending data
        6. Merging/joining datasets
          1. Inner Join
          2. Left Join
          3. Right Join
          4. An example of the Inner Join
          5. An example of the Left Join
          6. An example of the Right Join
          7. Summary of Joins in terms of their length
        7. Summary
      4. 4. Statistical Concepts for Predictive Modelling
        1. Random sampling and the central limit theorem
        2. Hypothesis testing
          1. Null versus alternate hypothesis
          2. Z-statistic and t-statistic
          3. Confidence intervals, significance levels, and p-values
          4. Different kinds of hypothesis test
          5. A step-by-step guide to do a hypothesis test
          6. An example of a hypothesis test
        3. Chi-square tests
        4. Correlation
        5. Summary
      5. 5. Linear Regression with Python
        1. Understanding the maths behind linear regression
          1. Linear regression using simulated data
            1. Fitting a linear regression model and checking its efficacy
            2. Finding the optimum value of variable coefficients
        2. Making sense of result parameters
          1. p-values
          2. F-statistics
          3. Residual Standard Error
        3. Implementing linear regression with Python
          1. Linear regression using the statsmodels library
          2. Multiple linear regression
          3. Multi-collinearity
            1. Variance Inflation Factor
        4. Model validation
          1. Training and testing data split
          2. Summary of models
          3. Linear regression with scikit-learn
          4. Feature selection with scikit-learn
        5. Handling other issues in linear regression
          1. Handling categorical variables
          2. Transforming a variable to fit non-linear relations
          3. Handling outliers
          4. Other considerations and assumptions for linear regression
        6. Summary
      6. 6. Logistic Regression with Python
        1. Linear regression versus logistic regression
        2. Understanding the math behind logistic regression
          1. Contingency tables
          2. Conditional probability
          3. Odds ratio
          4. Moving on to logistic regression from linear regression
          5. Estimation using the Maximum Likelihood Method
            1. Likelihood function:
            2. Log likelihood function:
            3. Building the logistic regression model from scratch
          6. Making sense of logistic regression parameters
            1. Wald test
            2. Likelihood Ratio Test statistic
            3. Chi-square test
        3. Implementing logistic regression with Python
          1. Processing the data
          2. Data exploration
          3. Data visualization
          4. Creating dummy variables for categorical variables
          5. Feature selection
          6. Implementing the model
        4. Model validation and evaluation
          1. Cross validation
        5. Model validation
          1. The ROC curve
            1. Confusion matrix
        6. Summary
      7. 7. Clustering with Python
        1. Introduction to clustering – what, why, and how?
          1. What is clustering?
          2. How is clustering used?
          3. Why do we do clustering?
        2. Mathematics behind clustering
          1. Distances between two observations
            1. Euclidean distance
            2. Manhattan distance
            3. Minkowski distance
            4. The distance matrix
          2. Normalizing the distances
          3. Linkage methods
            1. Single linkage
            2. Complete linkage
            3. Average linkage
            4. Centroid linkage
            5. Ward's method
          4. Hierarchical clustering
          5. K-means clustering
        3. Implementing clustering using Python
          1. Importing and exploring the dataset
          2. Normalizing the values in the dataset
          3. Hierarchical clustering using scikit-learn
          4. K-Means clustering using scikit-learn
            1. Interpreting the cluster
        4. Fine-tuning the clustering
          1. The elbow method
          2. Silhouette Coefficient
        5. Summary
      8. 8. Trees and Random Forests with Python
        1. Introducing decision trees
          1. A decision tree
        2. Understanding the mathematics behind decision trees
          1. Homogeneity
          2. Entropy
          3. Information gain
          4. ID3 algorithm to create a decision tree
          5. Gini index
          6. Reduction in Variance
          7. Pruning a tree
          8. Handling a continuous numerical variable
          9. Handling a missing value of an attribute
        3. Implementing a decision tree with scikit-learn
          1. Visualizing the tree
          2. Cross-validating and pruning the decision tree
        4. Understanding and implementing regression trees
          1. Regression tree algorithm
          2. Implementing a regression tree using Python
        5. Understanding and implementing random forests
          1. The random forest algorithm
          2. Implementing a random forest using Python
          3. Why do random forests work?
          4. Important parameters for random forests
        6. Summary
      9. 9. Best Practices for Predictive Modelling
        1. Best practices for coding
          1. Commenting the codes
          2. Defining functions for substantial individual tasks
            1. Example 1
            2. Example 2
            3. Example 3
          3. Avoid hard-coding of variables as much as possible
          4. Version control
          5. Using standard libraries, methods, and formulas
        2. Best practices for data handling
        3. Best practices for algorithms
        4. Best practices for statistics
        5. Best practices for business contexts
        6. Summary
      10. A. A List of Links
    7. 3. Module 3
      1. 1. A Conceptual Framework for Data Visualization
        1. Data, information, knowledge, and insight
          1. Data
          2. Information
          3. Knowledge
          4. Data analysis and insight
        2. The transformation of data
          1. Transforming data into information
            1. Data collection
            2. Data preprocessing
            3. Data processing
            4. Organizing data
            5. Getting datasets
          2. Transforming information into knowledge
          3. Transforming knowledge into insight
        3. Data visualization history
          1. Visualization before computers
            1. Minard's Russian campaign (1812)
            2. The Cholera epidemics in London (1831-1855)
            3. Statistical graphics (1850-1915)
            4. Later developments in data visualization
        4. How does visualization help decision-making?
          1. Where does visualization fit in?
          2. Data visualization today
            1. What is a good visualization?
        5. Visualization plots
          1. Bar graphs and pie charts
            1. Bar graphs
            2. Pie charts
          2. Box plots
          3. Scatter plots and bubble charts
            1. Scatter plots
            2. Bubble charts
          4. KDE plots
        6. Summary
      2. 2. Data Analysis and Visualization
        1. Why does visualization require planning?
        2. The Ebola example
        3. A sports example
          1. Visually representing the results
        4. Creating interesting stories with data
          1. Why are stories so important?
          2. Reader-driven narratives
            1. Gapminder
            2. The State of the Union address
            3. Mortality rate in the USA
            4. A few other example narratives
          3. Author-driven narratives
        5. Perception and presentation methods
          1. The Gestalt principles of perception
        6. Some best practices for visualization
          1. Comparison and ranking
          2. Correlation
          3. Distribution
          4. Location-specific or geodata
          5. Part-to-whole relationships
          6. Trends over time
        7. Visualization tools in Python
          1. Development tools
            1. Canopy from Enthought
            2. Anaconda from Continuum Analytics
        8. Interactive visualization
          1. Event listeners
          2. Layouts
            1. Circular layout
            2. Radial layout
            3. Balloon layout
        9. Summary
      3. 3. Getting Started with the Python IDE
        1. The IDE tools in Python
          1. Python 3.x versus Python 2.7
          2. Types of interactive tools
            1. IPython
            2. Plotly
          3. Types of Python IDE
            1. PyCharm
            2. PyDev
            3. Interactive Editor for Python (IEP)
            4. Canopy from Enthought
            5. Anaconda from Continuum Analytics
              1. An overview of Spyder
              2. An overview of conda
        2. Visualization plots with Anaconda
          1. The surface-3D plot
          2. The square map plot
        3. Interactive visualization packages
          1. Bokeh
          2. VisPy
        4. Summary
      4. 4. Numerical Computing and Interactive Plotting
        1. NumPy, SciPy, and MKL functions
          1. NumPy
            1. NumPy universal functions
            2. Shape and reshape manipulation
            3. An example of interpolation
            4. Vectorizing functions
            5. Summary of NumPy linear algebra
          2. SciPy
            1. An example of linear equations
            2. The vectorized numerical derivative
          3. MKL functions
          4. The performance of Python
        2. Scalar selection
        3. Slicing
          1. Slice using flat
        4. Array indexing
          1. Numerical indexing
          2. Logical indexing
        5. Other data structures
          1. Stacks
          2. Tuples
          3. Sets
          4. Queues
          5. Dictionaries
          6. Dictionaries for matrix representation
            1. Sparse matrices
              1. Visualizing sparseness
            2. Dictionaries for memoization
          7. Tries
        6. Visualization using matplotlib
          1. Word clouds
          2. Installing word clouds
          3. Input for word clouds
            1. Web feeds
            2. The Twitter text
          4. Plotting the stock price chart
            1. Obtaining data
        7. The visualization example in sports
        8. Summary
      5. 5. Financial and Statistical Models
        1. The deterministic model
          1. Gross returns
        2. The stochastic model
          1. Monte Carlo simulation
            1. What exactly is Monte Carlo simulation?
            2. An inventory problem in Monte Carlo simulation
            3. Monte Carlo simulation in basketball
            4. The volatility plot
            5. Implied volatilities
          2. The portfolio valuation
          3. The simulation model
          4. Geometric Brownian simulation
          5. The diffusion-based simulation
        3. The threshold model
          1. Schelling's Segregation Model
        4. An overview of statistical and machine learning
          1. K-nearest neighbors
          2. Generalized linear models
            1. Bayesian linear regression
        5. Creating animated and interactive plots
        6. Summary
      6. 6. Statistical and Machine Learning
        1. Classification methods
        2. Understanding linear regression
        3. Linear regression
        4. Decision tree
          1. An example
        5. The Bayes theorem
        6. The Naïve Bayes classifier
        7. The Naïve Bayes classifier using TextBlob
          1. Installing TextBlob
          2. Downloading corpora
          3. The Naïve Bayes classifier using TextBlob
        8. Viewing positive sentiments using word clouds
        9. k-nearest neighbors
        10. Logistic regression
        11. Support vector machines
        12. Principal component analysis
          1. Installing scikit-learn
        13. k-means clustering
        14. Summary
      7. 7. Bioinformatics, Genetics, and Network Models
        1. Directed graphs and multigraphs
          1. Storing graph data
          2. Displaying graphs
            1. igraph
            2. NetworkX
            3. Graph-tool
              1. PageRank
        2. The clustering coefficient of graphs
        3. Analysis of social networks
        4. The planar graph test
        5. The directed acyclic graph test
        6. Maximum flow and minimum cut
        7. A genetic programming example
        8. Stochastic block models
        9. Summary
      8. 8. Advanced Visualization
        1. Computer simulation
          1. Python's random package
          2. SciPy's random functions
          3. Simulation examples
          4. Signal processing
          5. Animation
          6. Visualization methods using HTML5
          7. How is Julia different from Python?
          8. D3.js for visualization
          9. Dashboards
        2. Summary
      9. B. Go Forth and Explore Visualization
        1. An overview of conda
        2. Packages installed with Anaconda
        3. Packages websites
        4. About matplotlib
    8. Bibliography
    9. Index