Practical Data Analysis - Second Edition

Book Description

A practical guide to obtaining, transforming, exploring, and analyzing data using Python, MongoDB, and Apache Spark

About This Book

  • Learn to use various data analysis tools and algorithms to classify, cluster, visualize, simulate, and forecast your data

  • Apply Machine Learning algorithms to different kinds of data such as social networks, time series, and images

  • A hands-on guide to understanding the nature of data and how to turn it into insight

Who This Book Is For

This book is for developers who want to implement data analysis and data-driven algorithms in a practical way. It is also suitable for those without a background in data analysis or data processing. Basic knowledge of Python programming, statistics, and linear algebra is assumed.

What You Will Learn

  • Acquire, format, and visualize your data

  • Build an image-similarity search engine

  • Generate meaningful visualizations anyone can understand

  • Get started with analyzing social network graphs

  • Find out how to implement sentiment text analysis

  • Install data analysis tools such as Pandas, MongoDB, and Apache Spark

  • Get to grips with Apache Spark

  • Implement machine learning algorithms such as classification or forecasting

In Detail

Beyond buzzwords like Big Data or Data Science, there are great opportunities to innovate in many businesses by using data analysis to build data-driven products. Data analysis involves asking many questions about data in order to discover insights and generate value for a product or a service.

This book explains the basic data algorithms without the theoretical jargon, and you’ll get hands-on experience turning data into insights using machine learning techniques. We will work with several types of data, such as text, images, social network graphs, documents, and time series, and show you how to implement large-scale data processing with MongoDB and Apache Spark.

Style and approach

This is a hands-on guide to data analysis and data processing. The concrete examples are explained with simple code and accessible data.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

    1. Practical Data Analysis - Second Edition
      1. Practical Data Analysis - Second Edition
      2. Credits
      3. About the Authors
      4. About the Reviewers
      5. www.PacktPub.com
        1. eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      6. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      7. 1. Getting Started
        1. Computer science
        2. Artificial intelligence
        3. Machine learning
        4. Statistics
        5. Mathematics
        6. Knowledge domain
        7. Data, information, and knowledge
          1. Inter-relationship between data, information, and knowledge
          2. The nature of data
        8. The data analysis process
          1. The problem
          2. Data preparation
          3. Data exploration
          4. Predictive modeling
          5. Visualization of results
        9. Quantitative versus qualitative data analysis
        10. Importance of data visualization
        11. What about big data?
        12. Quantified self
          1. Sensors and cameras
          2. Social network analysis
        13. Tools and toys for this book
          1. Why Python?
          2. Why mlpy?
          3. Why D3.js?
          4. Why MongoDB?
        14. Summary
      8. 2. Preprocessing Data
        1. Data sources
          1. Open data
          2. Text files
          3. Excel files
          4. SQL databases
          5. NoSQL databases
          6. Multimedia
          7. Web scraping
        2. Data scrubbing
          1. Statistical methods
          2. Text parsing
          3. Data transformation
        3. Data formats
          1. Parsing a CSV file with the CSV module
            1. Parsing CSV file using NumPy
          2. JSON
            1. Parsing JSON file using the JSON module
          3. XML
            1. Parsing XML in Python using the XML module
          4. YAML
        4. Data reduction methods
          1. Filtering and sampling
          2. Binned algorithm
          3. Dimensionality reduction
        5. Getting started with OpenRefine
          1. Text facet
          2. Clustering
          3. Text filters
          4. Numeric facets
          5. Transforming data
          6. Exporting data
          7. Operation history
        6. Summary
      9. 3. Getting to Grips with Visualization
        1. What is visualization?
        2. Working with web-based visualization
        3. Exploring scientific visualization
        4. Visualization in art
        5. The visualization life cycle
        6. Visualizing different types of data
          1. HTML
          2. DOM
          3. CSS
          4. JavaScript
          5. SVG
        7. Getting started with D3.js
          1. Bar chart
          2. Pie chart
          3. Scatter plots
          4. Single line chart
          5. Multiple line chart
        8. Interaction and animation
        9. Data from social networks
        10. An overview of visual analytics
        11. Summary
      10. 4. Text Classification
        1. Learning and classification
        2. Bayesian classification
          1. Naïve Bayes
        3. E-mail subject line tester
        4. The data
        5. The algorithm
        6. Classifier accuracy
        7. Summary
      11. 5. Similarity-Based Image Retrieval
        1. Image similarity search
        2. Dynamic time warping
        3. Processing the image dataset
        4. Implementing DTW
        5. Analyzing the results
        6. Summary
      12. 6. Simulation of Stock Prices
        1. Financial time series
        2. Random Walk simulation
        3. Monte Carlo methods
        4. Generating random numbers
        5. Implementation in D3.js
        6. Quantitative analyst
        7. Summary
      13. 7. Predicting Gold Prices
        1. Working with time series data
          1. Components of a time series
        2. Smoothing time series
        3. Linear regression
        4. The data - historical gold prices
        5. Nonlinear regressions
          1. Kernel Ridge Regressions
          2. Smoothing the gold prices time series
          3. Predicting in the smoothed time series
          4. Contrasting the predicted value
        6. Summary
      14. 8. Working with Support Vector Machines
        1. Understanding the multivariate dataset
        2. Dimensionality reduction
          1. Linear Discriminant Analysis (LDA)
          2. Principal Component Analysis (PCA)
        3. Getting started with SVM
          1. Kernel functions
          2. The double spiral problem
          3. SVM implemented on mlpy
        4. Summary
      15. 9. Modeling Infectious Diseases with Cellular Automata
        1. Introduction to epidemiology
          1. The epidemiology triangle
        2. The epidemic models
          1. The SIR model
          2. Solving the ordinary differential equation for the SIR model with SciPy
          3. The SIRS model
        3. Modeling with Cellular Automaton
          1. Cell, state, grid, neighborhood
          2. Global stochastic contact model
        4. Simulation of the SIRS model in CA with D3.js
        5. Summary
      16. 10. Working with Social Graphs
        1. Structure of a graph
          1. Undirected graph
          2. Directed graph
        2. Social networks analysis
        3. Acquiring the Facebook graph
        4. Working with graphs using Gephi
        5. Statistical analysis
          1. Male to female ratio
        6. Degree distribution
          1. Histogram of a graph
          2. Centrality
        7. Transforming GDF to JSON
        8. Graph visualization with D3.js
        9. Summary
      17. 11. Working with Twitter Data
        1. The anatomy of Twitter data
          1. Tweet
          2. Followers
          3. Trending topics
        2. Using OAuth to access Twitter API
        3. Getting started with Twython
          1. Simple search using Twython
          2. Working with timelines
          3. Working with followers
          4. Working with places and trends
          5. Working with user data
          6. Streaming API
        4. Summary
      18. 12. Data Processing and Aggregation with MongoDB
        1. Getting started with MongoDB
          1. Database
          2. Collection
          3. Document
          4. Mongo shell
          5. Insert/Update/Delete
          6. Queries
        2. Data preparation
          1. Data transformation with OpenRefine
          2. Inserting documents with PyMongo
        3. Group
        4. Aggregation framework
          1. Pipelines
          2. Expressions
        5. Summary
      19. 13. Working with MapReduce
        1. An overview of MapReduce
        2. Programming model
        3. Using MapReduce with MongoDB
          1. Map function
          2. Reduce function
          3. Using mongo shell
          4. Using Jupyter
          5. Using PyMongo
        4. Filtering the input collection
        5. Grouping and aggregation
        6. Counting the most common words in tweets
        7. Summary
      20. 14. Online Data Analysis with Jupyter and Wakari
        1. Getting started with Wakari
          1. Creating an account in Wakari
        2. Getting started with IPython notebook
          1. Data visualization
        3. Introduction to image processing with PIL
          1. Opening an image
          2. Working with an image histogram
          3. Filtering
          4. Operations
          5. Transformations
        4. Getting started with pandas
          1. Working with Time Series
          2. Working with multivariate datasets with DataFrame
          3. Grouping, Aggregation, and Correlation
        5. Sharing your Notebook
          1. The data
        6. Summary
      21. 15. Understanding Data Processing using Apache Spark
        1. Platform for data processing
          1. The Cloudera platform
          2. Installing Cloudera VM
        2. An introduction to the distributed file system
          1. First steps with Hadoop Distributed File System - HDFS
          2. File management with HUE - web interface
        3. An introduction to Apache Spark
          1. The Spark ecosystem
          2. The Spark programming model
          3. An introductory working example of Apache Spark
        4. Summary