Mastering Python for Data Science

Book Description

Explore the world of data science through Python and learn how to make sense of data

About This Book

  • Master data science methods using Python and its libraries

  • Create data visualizations and mine for patterns

  • Advanced techniques for the four fundamentals of Data Science with Python - data mining, data analysis, data visualization, and machine learning

Who This Book Is For

If you are a Python developer who wants to master the world of data science, then this book is for you. Some knowledge of data science is assumed.

What You Will Learn

  • Manage data and perform linear algebra in Python

  • Derive inferences from the analysis by performing inferential statistics

  • Solve data science problems in Python

  • Create high-end visualizations using Python

  • Evaluate and apply the linear regression technique to estimate the relationships among variables

  • Build recommendation engines with the various collaborative filtering algorithms

  • Apply the ensemble methods to improve your predictions

  • Work with big data technologies to handle data at scale

In Detail

Data science is a relatively new knowledge domain that organizations use to make data-driven decisions. Data scientists have to wear various hats to work with data and derive value from it. The Python programming language, having conquered the scientific community in the last decade, is now an indispensable tool for data science practitioners and a must-know for every aspiring data scientist. Python offers a fast, reliable, cross-platform, and mature environment for data analysis, machine learning, and algorithmic problem solving.

This comprehensive guide helps you move beyond the hype and transcend the theory by providing you with a hands-on, advanced study of data science.

Beginning with the essentials of Python in data science, you will learn to manage data and perform linear algebra in Python. You will move on to deriving inferences from the analysis by performing inferential statistics, and mining data to reveal hidden patterns and trends. You will use the matplotlib library to create high-end visualizations in Python and uncover the fundamentals of machine learning. Next, you will apply the linear and logistic regression techniques to your applications, before creating recommendation engines with various collaborative filtering algorithms and improving your predictions by applying ensemble methods.

Finally, you will perform k-means clustering, analyze unstructured data with different text mining techniques, and leverage the power of Python in big data analytics.

Style and approach

This book is an easy-to-follow, comprehensive guide to data science using Python. The topics covered in the book can all be used in real-world scenarios.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account. If you purchased this book elsewhere, you can register to have the files e-mailed directly to you.

Table of Contents

    1. Mastering Python for Data Science
      1. Table of Contents
      2. Mastering Python for Data Science
      3. Credits
      4. About the Author
      5. About the Reviewers
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      8. 1. Getting Started with Raw Data
        1. The world of arrays with NumPy
          1. Creating an array
          2. Mathematical operations
            1. Array subtraction
          3. Squaring an array
            1. A trigonometric function performed on the array
            2. Conditional operations
            3. Matrix multiplication
          4. Indexing and slicing
          5. Shape manipulation
        2. Empowering data analysis with pandas
          1. The data structure of pandas
            1. Series
            2. DataFrame
            3. Panel
          2. Inserting and exporting data
            1. CSV
            2. XLS
            3. JSON
            4. Database
        3. Data cleansing
          1. Checking the missing data
          2. Filling the missing data
          3. String operations
          4. Merging data
        4. Data operations
          1. Aggregation operations
          2. Joins
            1. The inner join
            2. The left outer join
            3. The full outer join
            4. The groupby function
        5. Summary
      9. 2. Inferential Statistics
        1. Various forms of distribution
          1. A normal distribution
            1. A normal distribution from a binomial distribution
          2. A Poisson distribution
          3. A Bernoulli distribution
        2. A z-score
        3. A p-value
        4. One-tailed and two-tailed tests
        5. Type 1 and Type 2 errors
        6. A confidence interval
        7. Correlation
        8. Z-test vs T-test
        9. The F distribution
        10. The chi-square distribution
          1. Chi-square for the goodness of fit
        11. The chi-square test of independence
        12. ANOVA
        13. Summary
      10. 3. Finding a Needle in a Haystack
        1. What is data mining?
        2. Presenting an analysis
        3. Studying the Titanic
          1. Which passenger class has the maximum number of survivors?
          2. What is the distribution of survivors based on gender among the various classes?
          3. What is the distribution of nonsurvivors among the various classes who have family aboard the ship?
          4. What was the survival percentage among different age groups?
        4. Summary
      11. 4. Making Sense of Data through Advanced Visualization
        1. Controlling the line properties of a chart
          1. Using keyword arguments
          2. Using the setter methods
          3. Using the setp() command
        2. Creating multiple plots
        3. Playing with text
        4. Styling your plots
        5. Box plots
        6. Heatmaps
        7. Scatter plots with histograms
        8. A scatter plot matrix
        9. Area plots
        10. Bubble charts
        11. Hexagon bin plots
        12. Trellis plots
        13. A 3D plot of a surface
        14. Summary
      12. 5. Uncovering Machine Learning
        1. Different types of machine learning
          1. Supervised learning
          2. Unsupervised learning
          3. Reinforcement learning
        2. Decision trees
        3. Linear regression
        4. Logistic regression
        5. The naive Bayes classifier
        6. The k-means clustering
        7. Hierarchical clustering
        8. Summary
      13. 6. Performing Predictions with a Linear Regression
        1. Simple linear regression
        2. Multiple regression
        3. Training and testing a model
        4. Summary
      14. 7. Estimating the Likelihood of Events
        1. Logistic regression
          1. Data preparation
          2. Creating training and testing sets
          3. Building a model
          4. Model evaluation
          5. Evaluating a model based on test data
          6. Model building and evaluation with SciKit
        2. Summary
      15. 8. Generating Recommendations with Collaborative Filtering
        1. Recommendation data
        2. User-based collaborative filtering
          1. Finding similar users
          2. The Euclidean distance score
          3. The Pearson correlation score
          4. Ranking the users
          5. Recommending items
        3. Item-based collaborative filtering
        4. Summary
      16. 9. Pushing Boundaries with Ensemble Models
        1. The census income dataset
          1. Exploring the census data
            1. Hypothesis 1: People who are older earn more
            2. Hypothesis 2: Income bias based on working class
            3. Hypothesis 3: People with more education earn more
            4. Hypothesis 4: Married people tend to earn more
            5. Hypothesis 5: There is a bias in income based on race
            6. Hypothesis 6: There is a bias in the income based on occupation
            7. Hypothesis 7: Men earn more
            8. Hypothesis 8: People who clock in more hours earn more
            9. Hypothesis 9: There is a bias in income based on the country of origin
        2. Decision trees
        3. Random forests
        4. Summary
      17. 10. Applying Segmentation with k-means Clustering
        1. The k-means algorithm and its working
          1. A simple example
        2. The k-means clustering with countries
          1. Determining the number of clusters
        3. Clustering the countries
        4. Summary
      18. 11. Analyzing Unstructured Data with Text Mining
        1. Preprocessing data
        2. Creating a wordcloud
        3. Word and sentence tokenization
        4. Parts of speech tagging
        5. Stemming and lemmatization
          1. Stemming
          2. Lemmatization
        6. The Stanford Named Entity Recognizer
        7. Performing sentiment analysis on world leaders using Twitter
        8. Summary
      19. 12. Leveraging Python in the World of Big Data
        1. What is Hadoop?
          1. The programming model
          2. The MapReduce architecture
          3. The Hadoop DFS
          4. Hadoop's DFS architecture
        2. Python MapReduce
          1. The basic word count
          2. A sentiment score for each review
          3. The overall sentiment score
          4. Deploying the MapReduce code on Hadoop
        3. File handling with Hadoopy
        4. Pig
        5. Python with Apache Spark
          1. Scoring the sentiment
          2. The overall sentiment
        6. Summary
      20. Index