Apache Pig

HDFS and MapReduce (MR) are the storage and compute engines at the core of Hadoop. Implementing parallel processing applications directly against these engines is complex and error prone. Apache Pig provides an abstraction over parallel processing jobs on Hadoop: it makes large datasets easy to process through a simple programming interface, and tasks written with Pig are inherently parallelized on the underlying Hadoop cluster. In the context of cyber security, Pig can be used to implement complex parallel data aggregation and anomaly detection tasks, as well as to prepare training data for supervised learning when the critical infrastructure (CI) protection application leverages machine learning algorithms.
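As a sketch of how such an aggregation might look, the Pig Latin script below flags source IPs with unusually high outbound traffic from connection logs. The input path, field names, and byte threshold are hypothetical; a real anomaly detection job would tune these to the monitored environment.

```pig
-- Load raw connection logs (hypothetical path and schema)
logs = LOAD 'hdfs:///security/netflow.tsv'
       USING PigStorage('\t')
       AS (ts:long, src_ip:chararray, dst_ip:chararray, bytes:long);

-- Aggregate total bytes transferred per source IP
by_src = GROUP logs BY src_ip;
totals = FOREACH by_src GENERATE group AS src_ip,
                                 SUM(logs.bytes) AS total_bytes;

-- Flag sources exceeding an illustrative volume threshold
suspects = FILTER totals BY total_bytes > 1000000000L;

STORE suspects INTO 'hdfs:///security/suspect_ips';
```

Each relational statement (LOAD, GROUP, FOREACH, FILTER, STORE) is compiled by Pig into one or more MapReduce stages, so the script runs in parallel across the cluster without any explicit MR code.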
