Clojure for Data Science

Book Description

Statistics, big data, and machine learning for Clojure programmers

About This Book

  • Write code using Clojure to harness the power of your data

  • Discover the libraries and frameworks that will help you succeed

  • A practical guide to understanding how the Clojure programming language can be used to derive insights from data

Who This Book Is For

    This book is aimed at developers who are already productive in Clojure but who are overwhelmed by the breadth and depth of understanding required to be effective in the field of data science. Whether you’re tasked with delivering a specific analytics project or simply suspect that you could be deriving more value from your data, this book will inspire you with the opportunities, and inform you of the risks, that exist in data of all shapes and sizes.

What You Will Learn

  • Perform hypothesis testing and understand feature selection and statistical significance to interpret your results with confidence

  • Implement the core machine learning techniques of regression, classification, clustering and recommendation

  • Understand the value of simple statistics and distributions in exploratory data analysis

  • Scale algorithms to web-sized datasets efficiently using distributed programming models on Hadoop and Spark

  • Apply suitable analytic approaches for text, graph, and time series data

  • Interpret the terminology that you will encounter in technical papers

  • Import libraries from other JVM languages such as Java and Scala

  • Communicate your findings clearly and convincingly to nontechnical colleagues

In Detail

    The term “data science” is widely used to describe a new profession that is expected to interpret vast datasets and translate them into improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language. With its rich ecosystem of native libraries and a simple, consistent functional approach to data manipulation that maps closely to mathematical formulae, it is a practical and flexible language for a data scientist’s diverse needs.

    Taking you on a journey from simple summary statistics to sophisticated machine learning algorithms, this book shows how the Clojure programming language can be used to derive insights from data. Data scientists often forge a novel path, and you’ll see how to make use of Clojure’s Java interoperability capabilities to access libraries such as Mahout and MLlib for which Clojure wrappers don’t yet exist. Even seasoned Clojure developers will develop a deeper appreciation for their language’s flexibility!

    You’ll learn how to apply statistical thinking to your own data and use Clojure to explore, analyze, and visualize it in a technically and statistically robust way. You’ll use Incanter for local data processing and ClojureScript to present interactive visualisations. You’ll also understand how distributed platforms such as Hadoop and Spark’s MapReduce and GraphX’s BSP solve the challenges of data analysis at scale, and how to explain algorithms using those programming models.

    Above all, by following the explanations in this book, you’ll learn not just how to be effective using the current state-of-the-art methods in data science, but why such methods work, so that you can continue to be productive as the field evolves.

    Style and approach

    This is a practical guide to data science that teaches theory by example through the libraries and frameworks accessible from the Clojure programming language.

    Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

    Table of Contents

    1. Clojure for Data Science
      1. Table of Contents
      2. Clojure for Data Science
      3. Credits
      4. About the Author
      5. Acknowledgments
      6. About the Reviewer
      7. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      8. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      9. 1. Statistics
        1. Downloading the sample code
        2. Running the examples
        3. Downloading the data
        4. Inspecting the data
        5. Data scrubbing
        6. Descriptive statistics
          1. The mean
          2. Interpreting mathematical notation
          3. The median
        7. Variance
        8. Quantiles
        9. Binning data
        10. Histograms
        11. The normal distribution
          1. The central limit theorem
        12. Poincaré's baker
          1. Generating distributions
        13. Skewness
          1. Quantile-quantile plots
        14. Comparative visualizations
          1. Box plots
          2. Cumulative distribution functions
        15. The importance of visualizations
          1. Visualizing electorate data
        16. Adding columns
          1. Adding derived columns
        17. Comparative visualizations of electorate data
        18. Visualizing the Russian election data
        19. Comparative visualizations
          1. Probability mass functions
          2. Scatter plots
          3. Scatter transparency
        20. Summary
      10. 2. Inference
        1. Introducing AcmeContent
        2. Download the sample code
        3. Load and inspect the data
        4. Visualizing the dwell times
        5. The exponential distribution
          1. The distribution of daily means
        6. The central limit theorem
        7. Standard error
        8. Samples and populations
        9. Confidence intervals
          1. Sample comparisons
          2. Bias
        10. Visualizing different populations
        11. Hypothesis testing
          1. Significance
        12. Testing a new site design
          1. Performing a z-test
          2. Student's t-distribution
          3. Degrees of freedom
        13. The t-statistic
        14. Performing the t-test
          1. Two-tailed tests
        15. One-sample t-test
        16. Resampling
        17. Testing multiple designs
          1. Calculating sample means
        18. Multiple comparisons
          1. Introducing the simulation
          2. Compile the simulation
        19. The browser simulation
        20. jStat
        21. B1
          1. Scalable Vector Graphics
        22. Plotting probability densities
        23. State and Reagent
          1. Updating state
          2. Binding the interface
        24. Simulating multiple tests
        25. The Bonferroni correction
        26. Analysis of variance
        27. The F-distribution
        28. The F-statistic
        29. The F-test
        30. Effect size
          1. Cohen's d
        31. Summary
      11. 3. Correlation
        1. About the data
        2. Inspecting the data
        3. Visualizing the data
        4. The log-normal distribution
          1. Visualizing correlation
          2. Jittering
        5. Covariance
        6. Pearson's correlation
          1. Sample r and population rho
        7. Hypothesis testing
        8. Confidence intervals
        9. Regression
          1. Linear equations
          2. Residuals
        10. Ordinary least squares
          1. Slope and intercept
          2. Interpretation
          3. Visualization
          4. Assumptions
        11. Goodness-of-fit and R-square
        12. Multiple linear regression
        13. Matrices
          1. Dimensions
          2. Vectors
          3. Construction
          4. Addition and scalar multiplication
          5. Matrix-vector multiplication
          6. Matrix-matrix multiplication
          7. Transposition
          8. The identity matrix
          9. Inversion
        14. The normal equation
          1. More features
        15. Multiple R-squared
        16. Adjusted R-squared
          1. Incanter's linear model
            1. The F-test of model significance
          2. Categorical and dummy variables
          3. Relative power
        17. Collinearity
          1. Multicollinearity
        18. Prediction
          1. The confidence interval of a prediction
          2. Model scope
          3. The final model
        19. Summary
      12. 4. Classification
        1. About the data
        2. Inspecting the data
        3. Comparisons with relative risk and odds
        4. The standard error of a proportion
          1. Estimation using bootstrapping
        5. The binomial distribution
          1. The standard error of a proportion formula
        6. Significance testing proportions
          1. Adjusting standard errors for large samples
        7. Chi-squared multiple significance testing
          1. Visualizing the categories
          2. The chi-squared test
          3. The chi-squared statistic
          4. The chi-squared test
        8. Classification with logistic regression
          1. The sigmoid function
          2. The logistic regression cost function
          3. Parameter optimization with gradient descent
          4. Gradient descent with Incanter
          5. Convexity
        9. Implementing logistic regression with Incanter
          1. Creating a feature matrix
          2. Evaluating the logistic regression classifier
          3. The confusion matrix
          4. The kappa statistic
        10. Probability
          1. Bayes theorem
          2. Bayes theorem with multiple predictors
        11. Naive Bayes classification
          1. Implementing a naive Bayes classifier
          2. Evaluating the naive Bayes classifier
            1. Comparing the logistic regression and naive Bayes approaches
        12. Decision trees
          1. Information
          2. Entropy
          3. Information gain
          4. Using information gain to identify the best predictor
          5. Recursively building a decision tree
          6. Using the decision tree for classification
          7. Evaluating the decision tree classifier
        13. Classification with clj-ml
          1. Loading data with clj-ml
          2. Building a decision tree in clj-ml
        14. Bias and variance
          1. Overfitting
          2. Cross-validation
          3. Addressing high bias
        15. Ensemble learning and random forests
          1. Bagging and boosting
        16. Saving the classifier to a file
        17. Summary
      13. 5. Big Data
        1. Downloading the code and data
          1. Inspecting the data
          2. Counting the records
        2. The reducers library
          1. Parallel folds with reducers
          2. Loading large files with iota
          3. Creating a reducers processing pipeline
          4. Curried reductions with reducers
          5. Statistical folds with reducers
          6. Associativity
          7. Calculating the mean using fold
          8. Calculating the variance using fold
        3. Mathematical folds with Tesser
          1. Calculating covariance with Tesser
          2. Commutativity
          3. Simple linear regression with Tesser
          4. Calculating a correlation matrix
        4. Multiple regression with gradient descent
          1. The gradient descent update rule
          2. The gradient descent learning rate
          3. Feature scaling
          4. Feature extraction
          5. Creating a custom Tesser fold
            1. Creating a matrix-sum fold
          6. Calculating the total model error
            1. Creating a matrix-mean fold
          7. Applying a single step of gradient descent
          8. Running iterative gradient descent
        5. Scaling gradient descent with Hadoop
          1. Gradient descent on Hadoop with Tesser and Parkour
            1. Parkour distributed sources and sinks
            2. Running a feature scale fold with Hadoop
            3. Running gradient descent with Hadoop
            4. Preparing our code for a Hadoop cluster
            5. Building an uberjar
            6. Submitting the uberjar to Hadoop
        6. Stochastic gradient descent
          1. Stochastic gradient descent with Parkour
            1. Defining a mapper
            2. Parkour shaping functions
            3. Defining a reducer
            4. Specifying Hadoop jobs with Parkour graph
            5. Chaining mappers and reducers with Parkour graph
        7. Summary
      14. 6. Clustering
        1. Downloading the data
        2. Extracting the data
        3. Inspecting the data
        4. Clustering text
          1. Set-of-words and the Jaccard index
          2. Tokenizing the Reuters files
            1. Applying the Jaccard index to documents
            2. The bag-of-words and Euclidean distance
          3. Representing text as vectors
          4. Creating a dictionary
        5. Creating term frequency vectors
          1. The vector space model and cosine distance
          2. Removing stop words
          3. Stemming
        6. Clustering with k-means and Incanter
          1. Clustering the Reuters documents
        7. Better clustering with TF-IDF
          1. Zipf's law
          2. Calculating the TF-IDF weight
          3. k-means clustering with TF-IDF
          4. Better clustering with n-grams
        8. Large-scale clustering with Mahout
          1. Converting text documents to a sequence file
          2. Using Parkour to create Mahout vectors
          3. Creating distributed unique IDs
          4. Distributed unique IDs with Hadoop
          5. Sharing data with the distributed cache
          6. Building Mahout vectors from input documents
        9. Running k-means clustering with Mahout
          1. Viewing k-means clustering results
          2. Interpreting the clustered output
        10. Cluster evaluation measures
          1. Inter-cluster density
          2. Intra-cluster density
          3. Calculating the root mean square error with Parkour
            1. Loading clustered points and centroids
          4. Calculating the cluster RMSE
          5. Determining optimal k with the elbow method
          6. Determining optimal k with the Dunn index
          7. Determining optimal k with the Davies-Bouldin index
        11. The drawbacks of k-means
          1. The Mahalanobis distance measure
        12. The curse of dimensionality
        13. Summary
      15. 7. Recommender Systems
        1. Download the code and data
        2. Inspect the data
        3. Parse the data
        4. Types of recommender systems
          1. Collaborative filtering
        5. Item-based and user-based recommenders
        6. Slope One recommenders
          1. Calculating the item differences
          2. Making recommendations
          3. Practical considerations for user and item recommenders
        7. Building a user-based recommender with Mahout
        8. k-nearest neighbors
        9. Recommender evaluation with Mahout
          1. Evaluating distance measures
            1. The Pearson correlation similarity
            2. Spearman's rank similarity
          2. Determining optimum neighborhood size
          3. Information retrieval statistics
            1. Precision
            2. Recall
          4. Mahout's information retrieval evaluator
            1. F-measure and the harmonic mean
            2. Fall-out
            3. Normalized discounted cumulative gain
            4. Plotting the information retrieval results
          5. Recommendation with Boolean preferences
            1. Implicit versus explicit feedback
        10. Probabilistic methods for large sets
          1. Testing set membership with Bloom filters
        11. Jaccard similarity for large sets with MinHash
          1. Reducing pair comparisons with locality-sensitive hashing
            1. Bucketing signatures
        12. Dimensionality reduction
          1. Plotting the Iris dataset
          2. Principal component analysis
          3. Singular value decomposition
        13. Large-scale machine learning with Apache Spark and MLlib
          1. Loading data with Sparkling
          2. Mapping data
          3. Distributed datasets and tuples
          4. Filtering data
          5. Persistence and caching
        14. Machine learning on Spark with MLlib
          1. Movie recommendations with alternating least squares
          2. ALS with Spark and MLlib
          3. Making predictions with ALS
          4. Evaluating ALS
          5. Calculating the sum of squared errors
        15. Summary
      16. 8. Network Analysis
        1. Download the data
          1. Inspecting the data
          2. Visualizing graphs with Loom
        2. Graph traversal with Loom
          1. The seven bridges of Königsberg
        3. Breadth-first and depth-first search
        4. Finding the shortest path
          1. Minimum spanning trees
          2. Subgraphs and connected components
          3. SCC and the bow-tie structure of the web
        5. Whole-graph analysis
        6. Scale-free networks
        7. Distributed graph computation with GraphX
          1. Creating RDGs with Glittering
          2. Measuring graph density with triangle counting
            1. GraphX partitioning strategies
          3. Running the built-in triangle counting algorithm
          4. Implement triangle counting with Glittering
            1. Step one – collecting neighbor IDs
            2. Steps two, three, and four – aggregate messages
            3. Step five – dividing the counts
          5. Running the custom triangle counting algorithm
          6. The Pregel API
          7. Connected components with the Pregel API
            1. Step one – map vertices
            2. Steps two and three – the message function
            3. Step four – update the attributes
            4. Step five – iterate to convergence
          8. Running connected components
          9. Calculating the size of the largest connected component
          10. Detecting communities with label propagation
            1. Step one – map vertices
            2. Step two – send the vertex attribute
            3. Step three – aggregate value
            4. Step four – vertex function
            5. Step five – set the maximum iterations count
          11. Running label propagation
          12. Measuring community influence using PageRank
          13. The flow formulation
            1. Implementing PageRank with Glittering
            2. Sort by highest influence
          14. Running PageRank to determine community influencers
        8. Summary
      17. 9. Time Series
        1. About the data
          1. Loading the Longley data
        2. Fitting curves with a linear model
        3. Time series decomposition
          1. Inspecting the airline data
            1. Visualizing the airline data
          2. Stationarity
          3. De-trending and differencing
        4. Discrete time models
          1. Random walks
          2. Autoregressive models
          3. Determining autocorrelation in AR models
          4. Moving-average models
          5. Determining autocorrelation in MA models
          6. Combining the AR and MA models
          7. Calculating partial autocorrelation
            1. Autocovariance
            2. PACF with Durbin-Levinson recursion
            3. Plotting partial autocorrelation
            4. Determining ARMA model order with ACF and PACF
          8. ACF and PACF of airline data
          9. Removing seasonality with differencing
        5. Maximum likelihood estimation
          1. Calculating the likelihood
          2. Estimating the maximum likelihood
            1. Nelder-Mead optimization with Apache Commons Math
          3. Identifying better models with Akaike Information Criterion
        6. Time series forecasting
          1. Forecasting with Monte Carlo simulation
        7. Summary
      18. 10. Visualization
        1. Download the code and data
        2. Exploratory data visualization
          1. Representing a two-dimensional histogram
        3. Using Quil for visualization
          1. Drawing to the sketch window
          2. Quil's coordinate system
          3. Plotting the grid
          4. Specifying the fill color
          5. Color and fill
          6. Outputting an image file
        4. Visualization for communication
          1. Visualizing wealth distribution
          2. Bringing data to life with Quil
          3. Drawing bars of differing widths
          4. Adding a title and axis labels
          5. Improving the clarity with illustrations
          6. Adding text to the bars
          7. Incorporating additional data
          8. Drawing complex shapes
          9. Drawing curves
          10. Plotting compound charts
          11. Output to PDF
        5. Summary
      19. Index