Mastering Clojure Data Analysis

Book Description

If you'd like to apply your Clojure skills to data analysis, this is the book for you. The example-based approach aids fast learning and covers basic to advanced topics. Get deeper into your data.

In Detail

Clojure is a Lisp dialect built on top of the Java Virtual Machine. As data increasingly invades more and more parts of our lives, we continually need more tools to deal with it effectively. Data can be organized effectively using Clojure data tools.

Mastering Clojure Data Analysis teaches you how to analyze and visualize complex datasets. With this book, you'll learn how to perform data analysis using established scientific methods and the modern, powerful Clojure programming language, with the help of examples drawn from real-world data. Along the way, you'll get to grips with advanced topics such as network analysis, the characteristics of social networks, topic modeling for getting a handle on unstructured textual data, and GIS analysis for applying geospatial techniques to your data analysis problems.

With this guide, you'll learn how to leverage the power and flexibility of Clojure to dig into your data and access the insights it hides.

What You Will Learn

  • Use geospatial data to learn about geographical patterns in data
  • Use sentiment analysis to determine people's opinions from online reviews
  • Frame and implement statistical experiments
  • Use A/B testing to determine the best UI to keep users engaged
  • Work with time series data
  • Learn how to use parallelization and concurrency to work with large datasets
  • Use topic modeling to find the subjects discussed in a group of documents
  • Use network analysis to learn about online social networks

Table of Contents

    1. Mastering Clojure Data Analysis
      1. Table of Contents
      2. Mastering Clojure Data Analysis
      3. Credits
      4. About the Author
      5. About the Reviewers
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      8. 1. Network Analysis – The Six Degrees of Kevin Bacon
        1. Analyzing social networks
        2. Getting the data
        3. Understanding graphs
        4. Implementing the graphs
          1. Loading the data
        5. Measuring social network graphs
          1. Density
          2. Degrees
          3. Paths
          4. Average path length
          5. Network diameter
          6. Clustering coefficient
          7. Centrality
          8. Degrees of separation
        6. Visualizing the graph
          1. Setting up ClojureScript
          2. A force-directed layout
          3. A hive plot
          4. A pie chart
        7. Summary
      9. 2. GIS Analysis – Mapping Climate Change
        1. Understanding GIS
        2. Mapping the climate change
          1. Downloading and extracting the data
            1. Downloading the files
            2. Extracting the files
          2. Transforming the data – filtering
          3. Rolling averages
            1. Reading the data
          4. Interpolating sample points and generating heat maps using inverse distance weighting (IDW)
        3. Working with map projections
          1. Finding a base map
        4. Working with ArcGIS
        5. Summary
      10. 3. Topic Modeling – Changing Concerns in the State of the Union Addresses
        1. Understanding data in the State of the Union addresses
        2. Understanding topic modeling
        3. Preparing for visualizations
        4. Setting up the project
        5. Getting the data
          1. Loading the data into MALLET
          2. Visualizing with D3 and ClojureScript
          3. Exploring the topics
            1. Exploring topic 43
            2. Exploring topic 26
            3. Exploring topic 42
        6. Summary
      11. 4. Classifying UFO Sightings
        1. Getting the data
        2. Extracting the data
        3. Dealing with messy data
        4. Visualizing UFO data
        5. Description
        6. Topic modeling descriptions
        7. Hoaxes
          1. Preparing the data
            1. Reading the data into a sequence of data records
            2. Splitting the NUFORC comments
            3. Categorizing the documents based on the comments
            4. Partitioning the documents into directories based on the categories
            5. Dividing them into training and test sets
          2. Classifying the data
            1. Coding the classifier interface
              1. Setting up the Pipe and InstanceList
              2. Training
              3. Classifying
              4. Validating
              5. Tying it all together
            2. Running the classifier and examining the results
        8. Summary
      12. 5. Benford's Law – Detecting Natural Progressions of Numbers
        1. Learning about Benford's Law
          1. Applying Benford's law to compound interest
          2. Looking at the world population data
        2. Failing Benford's Law
        3. Case studies
        4. Summary
      13. 6. Sentiment Analysis – Categorizing Hotel Reviews
        1. Understanding sentiment analysis
        2. Getting hotel review data
        3. Exploring the data
        4. Preparing the data
          1. Tokenizing
          2. Creating feature vectors
          3. Creating feature vector functions and POS tagging
        5. Cross-validating the results
        6. Calculating error rates
        7. Using the Weka machine learning library
          1. Connecting Weka and cross-validation
          2. Understanding maximum entropy classifiers
          3. Understanding naive Bayesian classifiers
        8. Running the experiment
        9. Examining the results
          1. Combining the error rates
        10. Improving the results
        11. Summary
      14. 7. Null Hypothesis Tests – Analyzing Crime Data
        1. Introducing confirmatory data analysis
        2. Understanding null hypothesis testing
          1. Understanding the process
            1. Formulating an initial hypothesis
            2. Stating the null and alternative hypotheses
            3. Determining appropriate tests
            4. Selecting the significance level
            5. Determining the critical region
            6. Calculating the test statistic and its probability
            7. Deciding whether to reject the null hypothesis or not
          2. Flipping coins
            1. Formulating an initial hypothesis
            2. Stating the null and alternative hypotheses
            3. Identifying the statistical assumptions in the sample
            4. Determining appropriate tests
              1. Selecting the significance level
              2. Determining the critical region
              3. Calculating the test statistic and its probability
              4. Deciding whether to reject the null hypothesis or not
        3. Understanding burglary rates
          1. Getting the data
          2. Parsing the Excel files
          3. Pulling out raw data
            1. Growing a data tree
            2. Cutting down the data tree
            3. Putting it all together
            4. Transforming the data
            5. Joining the data sources
            6. Pivoting the data
            7. Filtering the missing data
            8. Putting it all together
        4. Exploring the data
          1. Generating summary statistics
            1. Summarizing UNODC crime data
            2. Summarizing World Bank land area and GNI data
          2. Generating more charts and graphs
        5. Conducting the experiment
          1. Formulating an initial hypothesis
          2. Stating the null and alternative hypotheses
          3. Identifying the statistical assumptions in the sample
          4. Determining which tests are appropriate
            1. Understanding Spearman's rank correlation coefficient
          5. Selecting the significance level
          6. Determining the critical region
          7. Calculating the test statistic and its probability
          8. Deciding whether to reject the null hypothesis or not
        6. Interpreting the results
        7. Summary
      15. 8. A/B Testing – Statistical Experiments for the Web
        1. Defining A/B testing
        2. Conducting an A/B test
          1. Planning the experiment
          2. Framing the statistics
          3. Building the experiment
            1. Looking at options to build the site
          4. Implementing A/B testing on the server
            1. Understanding the scaffolded site
          5. Building the test site
          6. Implementing A/B testing
          7. Viewing the results
            1. Looking at A/B testing as a user
          8. Analyzing the results
            1. Understanding the t-test
              1. Testing coin tosses
          9. Testing the results
        3. Summary
      16. 9. Analyzing Social Data Participation
        1. Setting up the project
          1. Understanding the analyses
          2. Understanding social network data
          3. Understanding knowledge-based social networks
          4. Introducing the 80/20 rule
            1. Getting the data
            2. Looking at the amount of data
              1. Looking at the data format
            3. Defining and loading the data
            4. Counting frequencies
            5. Sorting and ranking
            6. Finding the patterns of participation
          5. Matching the 80/20 rule
          6. Looking for the 20 percent of questioners
          7. Looking for the 20 percent of respondents
          8. Combining ranks
            1. Looking at those who only post questions
            2. Looking at those who only post answers
            3. Looking at those who post both questions and answers
          9. Finding the up-voted answers
          10. Processing the answers
            1. Predicting the accepted answer
          11. Setting up
            1. Creating the InstanceList object
          12. Training sets and Test sets
            1. Training
            2. Testing
          13. Evaluating the outcome
        2. Summary
      17. 10. Modeling Stock Data
        1. Learning about financial data analysis
        2. Setting up the basics
          1. Setting up the library
          2. Getting the data
        3. Getting prepared with data
          1. Working with news articles
          2. Working with stock data
        4. Analyzing the text
          1. Analyzing vocabulary
          2. Stop lists
          3. Hapax and Dis Legomena
          4. TF-IDF
        5. Inspecting the stock prices
        6. Merging text and stock features
        7. Analyzing both text and stock features together with neural nets
          1. Understanding neural nets
          2. Setting up the neural net
          3. Training the neural net
          4. Running the neural net
          5. Validating the neural net
          6. Finding the best parameters
        8. Predicting the future
          1. Loading stock prices
          2. Loading news articles
          3. Creating training and test sets
          4. Finding the best parameters for the neural network
          5. Training and validating the neural network
          6. Running the network on new data
        9. Taking it with a grain of salt
          1. Related to this project
          2. Related to machine learning and market modeling in general
        10. Summary
      18. Index