Julia for Data Science

Book Description

Master the Julia language to solve business-critical data science challenges. After covering the importance of Julia to the data science community and several essential data science principles, we start with the basics, including how to install Julia and its powerful libraries. Many examples are provided as we illustrate how to leverage each Julia command, dataset, and function.

Specialized script packages are introduced and described. Hands-on problems representative of those commonly encountered throughout the data science pipeline are provided, and we guide you through solving them in Julia using published datasets. Many of these scenarios make use of existing packages and built-in functions, as we cover:
  1. An overview of the data science pipeline along with an example illustrating the key points, implemented in Julia
  2. Options for Julia IDEs
  3. Programming structures and functions
  4. Engineering tasks, such as importing, cleaning, formatting and storing data, as well as performing data preprocessing
  5. Data visualization and some simple yet powerful statistics for data exploration purposes
  6. Dimensionality reduction and feature evaluation
  7. Machine learning methods, ranging from unsupervised (different types of clustering) to supervised ones (decision trees, random forests, basic neural networks, regression trees, and Extreme Learning Machines)
  8. Graph analysis, including pinpointing the connections among the various entities and how they can be mined for useful insights
Each chapter concludes with a series of questions and exercises to reinforce what you have learned. The last chapter of the book guides you in creating a data science application from scratch using Julia.

Table of Contents

  1. Introduction
  2. CHAPTER 1: Introducing Julia
    1. How Julia Improves Data Science
      1. Data science workflow
      2. Julia’s adoption by the data science community
    2. Julia Extensions
      1. Package quality
      2. Finding new packages
    3. About the Book
  3. CHAPTER 2: Setting Up the Data Science Lab
    1. Julia IDEs
      1. Juno
      2. IJulia
      3. Additional IDEs
    2. Julia Packages
      1. Finding and selecting packages
      2. Installing packages
      3. Using packages
      4. Hacking packages
    3. IJulia Basics
      1. Handling files
        1. Creating a notebook
        2. Saving a notebook
        3. Renaming a notebook
        4. Loading a notebook
        5. Exporting a notebook
      2. Organizing code in .jl files
      3. Referencing code
      4. Working directory
    4. Datasets We Will Use
      1. Dataset descriptions
        1. Magic dataset
        2. OnlineNewsPopularity dataset
        3. Spam Assassin dataset
      2. Downloading datasets
      3. Loading datasets
        1. CSV files
        2. Text files
    5. Coding and Testing a Simple Machine Learning Algorithm in Julia
      1. Algorithm description
      2. Algorithm implementation
      3. Algorithm testing
    6. Saving Your Workspace into a Data File
      1. Saving data into delimited files
      2. Saving data into native Julia format
      3. Saving data into text files
    7. Help!
    8. Summary
    9. Chapter Challenge
  4. CHAPTER 3: Learning the Ropes of Julia
    1. Data Types
    2. Arrays
      1. Array basics
      2. Accessing multiple elements in an array
      3. Multidimensional arrays
    3. Dictionaries
    4. Basic Commands and Functions
      1. print(), println()
      2. typemax(), typemin()
      3. collect()
      4. show()
      5. linspace()
    5. Mathematical Functions
      1. round()
      2. rand(), randn()
      3. sum()
      4. mean()
    6. Array and Dictionary Functions
      1. in
      2. append!()
      3. pop!()
      4. push!()
      5. splice!()
      6. insert!()
      7. sort(), sort!()
      8. get()
      9. keys(), values()
      10. length(), size()
    7. Miscellaneous Functions
      1. time()
      2. Conditionals
        1. if-else statements
      3. string()
      4. map()
      5. VERSION()
    8. Operators, Loops and Conditionals
      1. Operators
        1. Alphanumeric operators (<, >, ==, <=, >=, !=)
        2. Logical operators (&&, ||)
      2. Loops
        1. for-loops
        2. while-loops
      3. break command
    9. Summary
    10. Chapter Challenge
  5. CHAPTER 4: Going Beyond the Basics in Julia
    1. String Manipulation
      1. split()
      2. join()
      3. Regex functions
        1. ismatch()
        2. match()
        3. matchall()
        4. eachmatch()
    2. Custom Functions
      1. Function structure
      2. Anonymous functions
      3. Multiple dispatch
      4. Function example
    3. Implementing a Simple Algorithm
    4. Creating a Complete Solution
    5. Summary
    6. Chapter Challenge
  6. CHAPTER 5: Julia Goes All Data Science-y
    1. Data Science Pipeline
    2. Data Engineering
      1. Data preparation
      2. Data exploration
      3. Data representation
    3. Data Modeling
      1. Data discovery
      2. Data learning
    4. Information Distillation
      1. Data product creation
      2. Insight, deliverance, and visualization
    5. Keep an Open Mind
    6. Applying the Data Science Pipeline to a Real-World Problem
      1. Data preparation
      2. Data exploration
      3. Data representation
      4. Data discovery
      5. Data learning
      6. Data product creation
      7. Insight, deliverance, and visualization
    7. Summary
    8. Chapter Challenge
  7. CHAPTER 6: Julia the Data Engineer
    1. Data Frames
      1. Creating and populating a data frame
      2. Data frames basics
        1. Variable names in a data frame
      3. Accessing particular variables in a data frame
      4. Exploring a data frame
      5. Filtering sections of a data frame
      6. Applying functions to a data frame’s variables
      7. Working with data frames
      8. Altering data frames
      9. Sorting the contents of a data frame
      10. Data frame tips
    2. Importing and Exporting Data
      1. Accessing .json data files
      2. Storing data in .json files
      3. Loading data files into data frames
      4. Saving data frames into data files
    3. Cleaning Up Data
      1. Cleaning up numeric data
      2. Cleaning up text data
    4. Formatting and Transforming Data
      1. Formatting numeric data
      2. Formatting text data
      3. Importance of data types
    5. Applying Data Transformations to Numeric Data
      1. Normalization
      2. Discretization (binning) and binarization
      3. Binary to continuous (binary classification only)
      4. Applying data transformations to text data
      5. Case normalization
      6. Vectorization
    6. Preliminary Evaluation of Features
      1. Regression
      2. Classification
      3. Feature evaluation tips
    7. Summary
    8. Chapter Challenge
  8. CHAPTER 7: Exploring Datasets
    1. Listening to the Data
      1. Packages used in this chapter
    2. Computing Basic Statistics and Correlations
      1. Variable summary
      2. Correlations among variables
      3. Comparability between two variables
    3. Plots
      1. Grammar of graphics
      2. Preparing data for visualization
      3. Box plots
      4. Bar plots
      5. Line plots
      6. Scatter plots
        1. Basic scatter plots
        2. Scatter plots using the output of t-SNE algorithm
      7. Histograms
      8. Exporting a plot to a file
    4. Hypothesis Testing
      1. Testing basics
      2. Types of errors
      3. Sensitivity and specificity
      4. Significance and power of a test
      5. Kruskal-Wallis tests
      6. T-tests
      7. Chi-square tests
    5. Other Tests
    6. Statistical Testing Tips
    7. Case Study: Exploring the OnlineNewsPopularity Dataset
      1. Variable stats
      2. Visualization
      3. Hypotheses
      4. T-SNE magic
      5. Conclusions
    8. Summary
    9. Chapter Challenge
  9. CHAPTER 8: Manipulating the Fabric of the Data Space
    1. Principal Components Analysis (PCA)
      1. Applying PCA in Julia
      2. Independent Components Analysis (ICA): most popular alternative of PCA
    2. Feature Evaluation and Selection
      1. Overview of the methodology
      2. Using Julia for feature evaluation and selection using cosine similarity
      3. Using Julia for feature evaluation and selection using DID
      4. Pros and cons of the feature evaluation and selection approach
    3. Other Dimensionality Reduction Techniques
      1. Overview of the alternative dimensionality reduction methods
        1. Genetic algorithms
        2. Discernibility-based approach
      2. When to use a sophisticated dimensionality reduction method
    4. Summary
    5. Chapter Challenge
  10. CHAPTER 9: Sampling Data and Evaluating Results
    1. Sampling Techniques
      1. Basic sampling
      2. Stratified sampling
    2. Performance Metrics for Classification
      1. Confusion matrix
      2. Accuracy metrics
        1. Basic accuracy
        2. Weighted accuracy
      3. Precision and recall metrics
      4. F1 metric
      5. Misclassification cost
        1. Defining the cost matrix
        2. Calculating the total misclassification cost
      6. Receiver Operating Characteristic (ROC) Curve and related metrics
        1. ROC Curve
        2. AUC Metric
        3. Gini Coefficient
    3. Performance Metrics for Regression
      1. MSE Metric and its variant, RMSE
      2. SSE Metric
      3. Other metrics
    4. K-fold Cross Validation (KFCV)
      1. Applying KFCV in Julia
      2. KFCV tips
    5. Summary
    6. Chapter Challenge
  11. CHAPTER 10: Unsupervised Machine Learning
    1. Unsupervised Learning Basics
      1. Clustering types
      2. Distance metrics
    2. Grouping Data with K-means
      1. K-means using Julia
      2. K-means tips
    3. Density and the DBSCAN Approach
      1. DBSCAN algorithm
      2. Applying DBSCAN in Julia
    4. Hierarchical Clustering
      1. Applying hierarchical clustering in Julia
      2. When to use hierarchical clustering
    5. Validation Metrics for Clustering
      1. Silhouettes
      2. Clustering validation metrics tips
    6. Effective Clustering Tips
      1. Dealing with high dimensionality
      2. Normalization
      3. Visualization tips
    7. Summary
    8. Chapter Challenge
  12. CHAPTER 11: Supervised Machine Learning
    1. Decision Trees
      1. Implementing decision trees in Julia
      2. Decision tree tips
    2. Regression Trees
      1. Implementing regression trees in Julia
      2. Regression tree tips
    3. Random Forests
      1. Implementing random forests in Julia for classification
      2. Implementing random forests in Julia for regression
      3. Random forest tips
    4. Basic Neural Networks
      1. Implementing neural networks in Julia
      2. Neural network tips
    5. Extreme Learning Machines
      1. Implementing ELMs in Julia
      2. ELM tips
    6. Statistical Models for Regression Analysis
      1. Implementing statistical regression in Julia
      2. Statistical regression tips
    7. Other Supervised Learning Systems
      1. Boosted trees
      2. Support vector machines
      3. Transductive systems
      4. Deep learning systems
      5. Bayesian networks
    8. Summary
    9. Chapter Challenge
  13. CHAPTER 12: Graph Analysis
    1. Importance of Graphs
    2. Custom Dataset
    3. Statistics of a Graph
    4. Cycle Detection
      1. Julia the cycle detective
    5. Connected Components
    6. Cliques
    7. Shortest Path in a Graph
    8. Minimum Spanning Trees
      1. Julia the MST botanist
      2. Saving and loading graphs from a file
    9. Graph Analysis and Julia’s Role in it
    10. Summary
    11. Chapter Challenge
  14. CHAPTER 13: Reaching the Next Level
    1. Julia Community
      1. Sites to interact with other Julians
      2. Code repositories
      3. Videos
      4. News
    2. Practice What You’ve Learned
      1. Some features to get you started
      2. Some thoughts on this project
    3. Final Thoughts about Your Experience with Julia in Data Science
      1. Refining your Julia programming skills
      2. Contributing to the Julia project
      3. Future of Julia in data science
  15. APPENDIX A: Downloading and Installing Julia and IJulia
  16. APPENDIX B: Useful Websites Related to Julia
  17. APPENDIX C: Packages Used in This Book
  18. APPENDIX D: Bridging Julia with Other Platforms
    1. Bridging Julia with R
      1. Running a Julia script in R
      2. Running an R script in Julia
    2. Bridging Julia with Python
      1. Running a Julia script in Python
      2. Running a Python script in Julia
  19. APPENDIX E: Parallelization in Julia
  20. APPENDIX F: Answers to Chapter Challenges
    1. Chapter 2
    2. Chapter 3
    3. Chapter 4
    4. Chapter 5
    5. Chapter 6
    6. Chapter 7
    7. Chapter 8
    8. Chapter 9
    9. Chapter 10
    10. Chapter 11
    11. Chapter 12
    12. Chapter 13
  21. Index