
In the last decade, the term Big Data has emerged to describe extremely large and complex datasets. With the popularity of sensors and social media, data is being gathered at faster and faster rates. Big Data is focused on reliably processing this massive amount of data, using commodity hardware and open source tools. Follow along in this Big Data roundup for a tour through some of our most popular blog posts on the topic, as well as popular resources in Safari Books Online.

The Story of Hadoop

Apache Hadoop, the center of the big data universe, is an open source software framework for writing and running distributed applications that process large amounts of data. The post on running and configuring a Hadoop cluster outlines the steps to take when installing Hadoop. Another big data technology, Storm, works in a similar fashion to a Hadoop cluster. One key difference is that a MapReduce job eventually finishes, while a Storm job processes messages forever, or at least until the user kills it.
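
To make the difference between a job that finishes and a job that runs until it is killed more concrete, here is a minimal sketch in plain Python. It does not use the Hadoop or Storm APIs; the function names (map_phase, reduce_phase, batch_job, streaming_job) are hypothetical and exist only for illustration, and the unbounded message source is simulated with itertools.cycle.

```python
from collections import defaultdict
from itertools import cycle, islice


def map_phase(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_phase(pairs):
    """Group the pairs by word and sum their counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def batch_job(lines):
    """MapReduce style: the input is finite, so the job eventually finishes."""
    return reduce_phase(map_phase(lines))


def streaming_job(messages):
    """Storm style: update running counts for each message as it arrives.

    Yields a snapshot after every message and only stops when the
    message source stops (or the caller stops consuming it).
    """
    counts = defaultdict(int)
    for message in messages:
        for word in message.split():
            counts[word.lower()] += 1
        yield dict(counts)


lines = ["big data", "big clusters"]

# The batch job reads all of its input, returns one result, and ends.
batch_result = batch_job(lines)

# cycle() simulates a source that never runs dry; islice() plays the
# role of the user killing the job after four messages.
snapshots = list(islice(streaming_job(cycle(lines)), 4))
```

The batch function returns a single final answer, while the streaming function produces an ever-growing series of intermediate answers, which is roughly the contrast the paragraph above draws.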

Enter Apache Hive

Analyzing data in Hadoop requires developers to write MapReduce jobs. This requirement has the side effect of shutting out the community of SQL-savvy analysts from the data stored in Hadoop clusters. To address this, Apache Hive was created as a SQL-like interface for querying and analyzing data stored in Hadoop clusters. To speed up queries, Hive supports data partitioning, which splits the data within a table across multiple partitions so that a query filtering on the partition column reads only the relevant partitions.
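
The idea behind partitioning can be sketched in a few lines of plain Python. This is not Hive or HiveQL; the PartitionedTable class and its methods are hypothetical, and the sketch only assumes that rows are grouped by the value of one partition column so that a filtered query can skip every other group.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy table that stores rows grouped by one partition column."""

    def __init__(self, partition_column):
        self.partition_column = partition_column
        self.partitions = defaultdict(list)

    def insert(self, row):
        # Route each row to the partition named by its partition value.
        self.partitions[row[self.partition_column]].append(row)

    def query(self, partition_value, predicate=lambda row: True):
        """Return (matching_rows, rows_scanned) for a single partition."""
        candidates = self.partitions.get(partition_value, [])
        matches = [row for row in candidates if predicate(row)]
        return matches, len(candidates)


table = PartitionedTable("country")
for row in [
    {"country": "US", "amount": 5},
    {"country": "US", "amount": 20},
    {"country": "DE", "amount": 30},
    {"country": "FR", "amount": 40},
]:
    table.insert(row)

# Only the "US" partition is scanned; the DE and FR rows are never read.
matches, rows_scanned = table.query("US", lambda row: row["amount"] > 10)
```

Here the query touches two of the four rows. On a real cluster the saving comes from skipping whole partitions of stored data rather than in-memory lists.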


In this big data realm, new database technologies have emerged under the NoSQL label. They offer high levels of scalability by shedding the rigid structure of traditional SQL databases and adopting a more flexible data model. HBase is one such NoSQL database; it allows a large amount of data to be stored and processed in very large tables comprising billions of rows. If you are running HBase, be sure you learn how to test and iterate rapidly on new features.

NoSQL Apache Cassandra

Another NoSQL database, inspired by Google’s Bigtable, is Apache Cassandra. Cassandra has a unique data model composed of rows, columns, and column families all stored in a single table. This model requires different data modeling techniques from those used with traditional relational databases.
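
One rough way to picture rows, columns, and column families is as nested maps. The sketch below is plain Python, not the Cassandra API; the ColumnFamilyStore class and its put and get_row methods are hypothetical, and it assumes only that each column family maps row keys to a set of named columns that can differ from row to row.

```python
from collections import defaultdict


class ColumnFamilyStore:
    """Toy model: column family -> row key -> {column name: value}."""

    def __init__(self):
        self.families = defaultdict(dict)

    def put(self, family, row_key, column, value):
        # Rows are created on first write and hold only the columns written.
        self.families[family].setdefault(row_key, {})[column] = value

    def get_row(self, family, row_key):
        # Return a copy so callers cannot mutate the stored row.
        return dict(self.families.get(family, {}).get(row_key, {}))


store = ColumnFamilyStore()
store.put("users", "alice", "email", "alice@example.com")
store.put("users", "alice", "city", "Boston")
store.put("users", "bob", "email", "bob@example.com")

alice = store.get_row("users", "alice")  # has two columns
bob = store.get_row("users", "bob")      # has one column, same family
```

Two rows in the same column family carry different columns here, which is one reason data modeling for this kind of store tends to start from the queries you need rather than from a fixed relational schema.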

The Big Data Resources in Safari Books Online

In this list you will find Big Data resources, starting from the high-level concepts of business intelligence, data analysis, and data mining. The list then works its way down to the tools needed for number crunching: mathematical toolkits, machine learning, and natural language processing. In order to implement these concepts beyond the “toy” stage, infrastructure tools are covered in the Cloud Services section.

Data Analysis

Python for Data Analysis is packed with practical case studies that show you how to effectively solve a broad set of data analysis problems using several Python libraries. Instructions include manipulating, processing, cleaning, and crunching structured data in Python.
Doing Bayesian Data Analysis provides an accessible approach to Bayesian data analysis, since material is explained clearly with concrete examples. The book begins with the basics, including essential concepts of probability and random sampling, and gradually progresses to advanced hierarchical modeling methods for realistic data.
Read Head First Data Analysis and you’ll quickly learn how to collect and organize data, sort the distractions from the truth, find meaningful patterns, draw conclusions, predict the future, and present your findings to others.

Data Mining

Data Mining: Concepts and Techniques, 3rd Edition equips you with an understanding of the theory and practice of discovering patterns hidden in large data sets. It also focuses on new, important topics in the field: data warehouses and data cube technology, mining streams, mining social networks, and mining spatial, multimedia, and other complex data.
Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. Inside, you’ll learn all you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining, including both tried-and-true techniques of today as well as methods at the leading edge of contemporary research.

Business Intelligence

Knight’s Microsoft Business Intelligence 24-Hour Trainer provides you with just the right amount of information to perform basic business analysis and reporting. You’ll explore the components and related tools that comprise the Microsoft BI toolset as well as the new BI features of Office 2010.
Data Mining For Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel® with XLMiner®, Second Edition covers data mining concepts, techniques, and applications for business intelligence, worked through in Microsoft Office Excel with XLMiner.

Mathematical Toolkits: R Programming Language

Parallel R describes how to give R parallel muscle. Coverage includes stalwarts such as snow and multicore, and also newer techniques such as Hadoop and Amazon’s cloud computing platform.
Read the R Cookbook and learn how to perform data analysis with R quickly and efficiently using the task-oriented recipes in this cookbook.
R in a Nutshell, 2nd Edition provides a quick and practical guide to just about everything you can do with the open source R language and software environment. You’ll learn how to write R functions and use R packages to help you prepare, visualize, and analyze data.

Machine Learning

Machine Learning in Action blends the foundational theories of machine learning with the practical realities of building tools for everyday data analysis. You’ll use the flexible Python programming language to build programs that implement algorithms for data classification, forecasting, recommendations, and higher-level features like summarization and simplification.
Machine Learning for Hackers is a great book if you’re an experienced programmer interested in crunching data. This book will get you started with machine learning—a toolkit of algorithms that enables computers to train themselves to automate useful tasks.

Analytics

Big Data Analytics: Turning Big Data into Big Money demonstrates the importance of analytics, defines the processes, highlights the tangible and intangible values and discusses how you can turn a business liability into actionable material that can be used to redefine markets, improve profits and identify new business opportunities.
The Analytical Puzzle: Profitable Data Warehousing, Business Intelligence and Analytics describes an unbiased, practical, and comprehensive approach to building a data warehouse which will lead to an increased level of business intelligence within your organization. New technologies continuously impact this approach and therefore this book explains how to leverage big data, cloud computing, data warehouse appliances, data mining, predictive analytics, data visualization and mobile devices.

Statistics

Head First Statistics teaches you everything you want and need to know about statistics through engaging, interactive, and thought-provoking material, full of puzzles, stories, quizzes, visual aids, and real-world examples.
Statistics in a Nutshell, 2nd Edition is a clear and concise introduction and reference for anyone new to the subject. Thoroughly revised and expanded, this edition helps you gain a solid understanding of statistics without the numbing complexity of many college texts.

Cloud Services

Because cloud computing involves various technologies, protocols, platforms, and infrastructure elements, Cloud Computing Bible is just what you need if you’ll be using or implementing cloud computing.
By reading Cloud Security and Privacy you’ll learn what’s at stake when you trust your data to the cloud, and what you can do to keep your virtual infrastructure and web applications secure.


This Big Data roundup has walked you through the many resources that you can find on the Safari Books Online Blog, as well as the many books on this topic that you can access in the Safari Books Online library, including the Big Data Bibliography. Leave comments about anything covered in this post, and also what Big Data topics or technologies you’d like to see us cover in this blog.
