Julia for Data Science

Book description

Explore the world of data science from scratch with Julia by your side

About This Book

  • An in-depth exploration of Julia's growing ecosystem of packages

  • Work with the most powerful open-source libraries for deep learning, data wrangling, and data visualization

  • Learn about deep learning using Mocha.jl and bring speed and high performance to the analysis of large data sets

    Who This Book Is For

    This book is aimed at data analysts and aspiring data scientists who have a basic knowledge of Julia or are completely new to it. It also appeals to readers who are competent in R and Python and wish to adopt Julia to improve their data science skill set. A good background in statistics and computational mathematics will be beneficial.

    What You Will Learn

  • Apply statistical models in Julia for data-driven decisions

  • Understand the process of data munging and data preparation using Julia

  • Explore techniques to visualize data using Julia and D3-based packages

  • Use Julia to create self-learning systems with cutting-edge machine learning algorithms

  • Create supervised and unsupervised machine learning systems using Julia, and explore ensemble models

  • Build a recommendation engine in Julia

  • Dive into Julia’s deep learning framework and build a system using Mocha.jl

    In Detail

    Julia is a fast, high-performance language that is well suited to data science, with a mature package ecosystem, and is now feature complete. It is a good tool for the data science practitioner. A well-known Harvard Business Review post declared that data scientist is the sexiest job of the 21st century (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century).

    This book will help you become familiar with Julia's rich ecosystem, which is continuously evolving, allowing you to stay on top of your game.

    This book covers the essentials of data science and gives a high-level overview of advanced statistics and techniques. You will dive in and work on generating insights by performing inferential statistics, and reveal hidden patterns and trends using data mining. It offers practical coverage of statistics and machine learning, and you will develop the knowledge to build statistical models and machine learning systems in Julia with attractive visualizations.

    You will then delve into the world of deep learning in Julia and get to know Mocha.jl, the framework with which you can create artificial neural networks and implement deep learning.

    This book addresses the challenges of real-world data science problems, including data cleaning, data preparation, inferential statistics, statistical modeling, building high-performance machine learning systems and creating effective visualizations using Julia.

    Style and approach

    This practical and easy-to-follow yet comprehensive guide will get you up and running with Julia for data science. Each topic is explained thoroughly and placed in context, and for the more inquisitive we dive deeper into the language and its use cases. This is the one true guide to working with Julia in data science.

    Table of contents

    1. Julia for Data Science
      1. Julia for Data Science
      2. Credits
      3. About the Author
      4. About the Reviewer
      5. www.PacktPub.com
        1. Why subscribe?
      6. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      7. 1. The Groundwork – Julia's Environment
        1. Julia is different
        2. Setting up the environment
          1. Installing Julia (Linux)
          2. Installing Julia (Mac)
          3. Installing Julia (Windows)
          4. Exploring the source code
        3. Using REPL
        4. Using Jupyter Notebook
        5. Package management
          1. Pkg.status() – package status
          2. Pkg.add() – adding packages
          3. Working with unregistered packages
            1. Pkg.update() – package update
          4. METADATA repository
          5. Developing packages
          6. Creating a new package
        6. Parallel computation using Julia
        7. Julia's key feature – multiple dispatch
          1. Methods in multiple dispatch
          2. Ambiguities – method definitions
        8. Facilitating language interoperability
          1. Calling Python code in Julia
        9. Summary
        10. References
      8. 2. Data Munging
        1. What is data munging?
          1. The data munging process
        2. What is a DataFrame?
          1. The NA data type and its importance
          2. DataArray – a series-like data structure
          3. DataFrames – tabular data structures
          4. Installation and using DataFrames.jl
            1. Writing the data to a file
          5. Working with DataFrames
            1. Understanding DataFrames joins
          6. The Split-Apply-Combine strategy
          7. Reshaping the data
          8. Sorting a dataset
          9. Formula – a special data type for mathematical expressions
          10. Pooling data
          11. Web scraping
        3. Summary
        4. References
      9. 3. Data Exploration
        1. Sampling
          1. Population
          2. Weight vectors
        2. Inferring column types
        3. Basic statistical summaries
          1. Calculating the mean of the array or dataframe
        4. Scalar statistics
          1. Standard deviations and variances
        5. Measures of variation
          1. Z-scores
          2. Entropy
          3. Quantiles
          4. Modes
          5. Summary of datasets
        6. Scatter matrix and covariance
        7. Computing deviations
        8. Rankings
        9. Counting functions
        10. Histograms
        11. Correlation analysis
        12. Summary
        13. References
      10. 4. Deep Dive into Inferential Statistics
        1. Installation
        2. Understanding the sampling distribution
        3. Understanding the normal distribution
          1. Parameter estimation
        4. Type hierarchy in Distributions.jl
          1. Understanding Sampleable
            1. Representing probabilistic distributions
        5. Univariate distributions
          1. Retrieving parameters
          2. Statistical functions
          3. Evaluation of probability
          4. Sampling in Univariate distributions
          5. Understanding Discrete Univariate distributions and types
            1. Bernoulli distribution
            2. Binomial distribution
          6. Continuous distributions
            1. Cauchy distribution
            2. Chi distribution
            3. Chi-square distribution
        6. Truncated distributions
          1. Truncated normal distributions
        7. Understanding multivariate distributions
          1. Multinomial distribution
          2. Multivariate normal distribution
          3. Dirichlet distribution
        8. Understanding matrixvariate distributions
          1. Wishart distribution
          2. Inverse-Wishart distribution
        9. Distribution fitting
          1. Distribution selection
            1. Symmetrical distributions
            2. Skew distributions to the right
            3. Skew distributions to the left
          2. Maximum Likelihood Estimation
          3. Sufficient statistics
          4. Maximum-a-Posteriori estimation
        10. Confidence interval
          1. Interpreting the confidence intervals
            1. Usage
        11. Understanding z-score
          1. Interpreting z-scores
        12. Understanding the significance of the P-value
          1. One-tailed and two-tailed test
        13. Summary
        14. References
      11. 5. Making Sense of Data Using Visualization
        1. Difference between using and importall
        2. Pyplot for Julia
          1. Multimedia I/O
          2. Installation
          3. Basic plotting
            1. Plot using sine and cosine
        3. Unicode plots
          1. Installation
          2. Examples
            1. Generating Unicode scatterplots
            2. Generating Unicode line plots
        4. Visualizing using Vega
          1. Installation
          2. Examples
            1. Scatterplot
          3. Heatmaps in Vega
        5. Data visualization using Gadfly
          1. Installing Gadfly
          2. Interacting with Gadfly using plot function
            1. Example
          3. Using Gadfly to plot DataFrames
          4. Using Gadfly to visualize functions and expressions
          5. Generating an image with multiple layers
          6. Generating plots with different aesthetics using statistics
            1. The step function
            2. The quantile-quantile function
            3. Ticks in Gadfly
          7. Generating plots with different aesthetics using Geometry
            1. Boxplots
            2. Using Geometry to create density plots
            3. Using Geometry to create histograms
            4. Bar plots
            5. Histogram2d – the two-dimensional histogram
            6. Smooth line plot
            7. Subplot grid
            8. Horizontal and vertical lines
            9. Plotting a ribbon
            10. Violin plots
            11. Beeswarm plots
          8. Elements – scale
            1. x_continuous and y_continuous
            2. x_discrete and y_discrete
            3. Continuous color scale
          9. Elements – guide
          10. Understanding how Gadfly works
        6. Summary
        7. References
      12. 6. Supervised Machine Learning
        1. What is machine learning?
          1. Uses of machine learning
          2. Machine learning and ethics
        2. Machine learning – the process
          1. Different types of machine learning
          2. What is bias-variance trade-off?
          3. Effects of overfitting and underfitting on a model
        3. Understanding decision trees
          1. Building decision trees – divide and conquer
          2. Where should we use decision tree learning?
          3. Advantages of decision trees
          4. Disadvantages of decision trees
          5. Decision tree learning algorithms
            1. How a decision tree algorithm works
            2. Understanding and measuring purity of node
          6. An example
        4. Supervised learning using Naïve Bayes
          1. Advantages of Naïve Bayes
          2. Disadvantages of Naïve Bayes
          3. Uses of Naïve Bayes classification
          4. How Bayesian methods work
            1. Posterior probabilities
            2. Class-conditional probabilities
            3. Prior probabilities
            4. Evidence
          5. The bag of words
            1. Advantages of using Naïve Bayes as a spam filter
            2. Disadvantages of Naïve Bayes filters
          6. Examples of Naïve Bayes
        5. Summary
        6. References
      13. 7. Unsupervised Machine Learning
        1. Understanding clustering
          1. How are clusters formed?
          2. Types of clustering
            1. Hierarchical clustering
            2. Overlapping, exclusive, and fuzzy clustering
            3. Differences between partial versus complete clustering
        2. K-means clustering
          1. K-means algorithm
            1. Algorithm of K-means
            2. Associating the data points with the closest centroid
            3. How to choose the initial centroids?
            4. Time-space complexity of K-means algorithms
          2. Issues with K-means
            1. Empty clusters in K-means
            2. Outliers in the dataset
          3. Different types of cluster
            1. K-means – strengths and weaknesses
          4. Bisecting K-means algorithm
          5. Getting deep into hierarchical clustering
          6. Agglomerative hierarchical clustering
            1. How proximity is computed
            2. Strengths and weaknesses of hierarchical clustering
          7. Understanding the DBSCAN technique
            1. So, what is density?
            2. How are points classified using center-based density?
            3. DBSCAN algorithm
            4. Strengths and weaknesses of the DBSCAN algorithm
          8. Cluster validation
          9. Example
        3. Summary
        4. References
      14. 8. Creating Ensemble Models
        1. What is ensemble learning?
          1. Understanding ensemble learning
          2. How to construct an ensemble
            1. Combination strategies
          3. Subsampling training dataset
            1. Bagging
              1. When does bagging work?
            2. Boosting
              1. Boosting approach
              2. Boosting algorithm
            3. AdaBoost – boosting by sampling
              1. What is boosting doing?
              2. The bias and variance decomposition
          4. Manipulating the input features
          5. Injecting randomness
        2. Random forests
          1. Features of random forests
          2. How do random forests work?
          3. The out-of-bag (oob) error estimate
            1. Gini importance
            2. Proximities
        3. Implementation in Julia
          1. Learning and prediction
        4. Why is ensemble learning superior?
          1. Applications of ensemble learning
        5. Summary
        6. References
      15. 9. Time Series
        1. What is forecasting?
          1. Decision-making process
            1. The dynamics of a system
        2. What is TimeSeries?
          1. Trends, seasonality, cycles, and residuals
            1. Difference from standard linear regression
            2. Basic objectives of the analysis
            3. Types of models
            4. Important characteristics to consider first
            5. Systematic pattern and random noise
            6. Two general aspects of time series patterns
          2. Trend analysis
            1. Smoothing
            2. Fitting a function
          3. Analysis of seasonality
            1. Autocorrelation correlogram
              1. Examining correlograms
            2. Partial autocorrelations
            3. Removing serial dependency
          4. ARIMA
            1. Common processes
            2. ARIMA methodology
              1. Identification
              2. Estimation and forecasting
              3. The constant in ARIMA models
              4. Identification phase
              5. Seasonal models
            3. Parameter estimation
            4. Evaluation of the model
            5. Interrupted time series ARIMA
          5. Exponential smoothing
            1. Simple exponential smoothing
            2. Indices of lack of fit (error)
        3. Implementation in Julia
          1. The TimeArray time series type
          2. Using time constraints
            1. when
            2. from
            3. to
            4. findwhen
            5. find
            6. Mathematical, comparison, and logical operators
            7. Applying methods to TimeSeries
              1. Lag
              2. Lead
              3. Percentage
            8. Combining methods in TimeSeries
              1. Merge
              2. Collapse
              3. Map
        4. Summary
        5. References
      16. 10. Collaborative Filtering and Recommendation System
        1. What is a recommendation system?
          1. The utility matrix
        2. Association rule mining
          1. Measures of association rules
          2. How to generate the item sets
          3. How to generate the rules
        3. Content-based filtering
          1. Steps involved in content-based filtering
          2. Advantages of content-based filtering
          3. Limitations of content-based filtering
        4. Collaborative filtering
          1. Baseline prediction methods
          2. User-based collaborative filtering
          3. Item-item collaborative filtering
            1. Algorithm of item-based collaborative filtering
        5. Building a movie recommender system
        6. Summary
      17. 11. Introduction to Deep Learning
        1. Revisiting linear algebra
          1. A gist of scalars
          2. A brief outline of vectors
          3. The importance of matrices
          4. What are tensors?
        2. Probability and information theory
          1. Why probability?
        3. Differences between machine learning and deep learning
          1. What is deep learning?
          2. Deep feedforward networks
            1. Understanding the hidden layers in a neural network
            2. The motivation of neural networks
          3. Understanding regularization
          4. Optimizing deep learning models
            1. The case of optimization
        4. Implementation in Julia
          1. Network architecture
          2. Types of layers
          3. Neurons (activation functions)
          4. Understanding regularizers for ANN
          5. Norm constraints
          6. Using solvers in deep neural networks
          7. Coffee breaks
          8. Image classification with pre-trained Imagenet CNN
        5. Summary
        6. References

    Product information

    • Title: Julia for Data Science
    • Author(s): Anshul Joshi
    • Release date: September 2016
    • Publisher(s): Packt Publishing
    • ISBN: 9781785289699