A guest post by Kristen Hardwick, who has worked with several different parallel paradigms – including Grid, Cluster, and Cloud. She currently works at Spry where her focus is on designing and developing Big Data analytics for the Hadoop ecosystem. Kristen holds both a Bachelor of Science degree and a Master’s degree in Computer Science from Clemson University, with an emphasis on Parallel Computing.
This blog provides you with an introduction to Apache Giraph, covering details about graphs in general and the benefits of using Giraph.
Graph Vertices and Edges
Discovering and understanding connections is an extremely important task for any analyst, and it typically takes the form of graph processing. A graph can be defined as a collection of vertices and edges, where the vertices can have values and the edges can have directions and weights. Two of the most common ways of interacting with graphs are traversal (following edges between a set of connected vertices) and search (finding vertices in the graph that meet a set of conditions).
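As a concrete illustration, a directed, weighted graph with vertex values can be represented as an adjacency list, with traversal and search implemented directly on top of it. This is a minimal sketch; the class and method names are illustrative, not from any particular library:

```java
import java.util.*;

public class TinyGraph {
    // Directed, weighted graph stored as an adjacency list:
    // each vertex id maps to its outgoing edges (target id -> weight).
    private final Map<String, Map<String, Double>> adj = new HashMap<>();
    private final Map<String, Integer> values = new HashMap<>();

    void addVertex(String id, int value) {
        values.put(id, value);
        adj.putIfAbsent(id, new HashMap<>());
    }

    void addEdge(String from, String to, double weight) {
        adj.get(from).put(to, weight);
    }

    // Traversal: a breadth-first walk following edges from a start vertex.
    List<String> bfs(String start) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>(Set.of(start));
        Deque<String> queue = new ArrayDeque<>(List.of(start));
        while (!queue.isEmpty()) {
            String v = queue.poll();
            order.add(v);
            for (String next : adj.get(v).keySet()) {
                if (seen.add(next)) queue.add(next);
            }
        }
        return order;
    }

    // Search: find all vertices whose value meets a condition.
    List<String> search(java.util.function.IntPredicate condition) {
        List<String> hits = new ArrayList<>();
        for (var e : values.entrySet()) {
            if (condition.test(e.getValue())) hits.add(e.getKey());
        }
        return hits;
    }
}
```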
This graph structure of vertices and edges is extremely useful in a wide variety of real-world applications. For example, imagine the problem of finding driving directions between two addresses. In this scenario, a graph of the roadway system can be built by considering roads as the edges and intersections as the vertices. A related, larger problem over that same graph might be the process of optimizing the order in which a business makes its deliveries.
Even a natural system like the brain can be treated as a graph – where the billions of neurons are the vertices and the trillions of synapses connecting them are the edges. Once the brain is represented in this manner, research can be conducted to help us understand general brain function in addition to diseases and conditions that affect the passageways – like Alzheimer’s.
Graphs and MapReduce
In today’s analytic world, where data volume and velocity are growing faster than ever before, the benefits of using a parallel computing platform like Hadoop to process the information are clear. The appeal of Hadoop’s main building block – MapReduce – is that by transforming the input data into key/value pairs and splitting the pairs among the workers, the parts can be processed independently and then merged together to form the final result set. By design, there is no communication between tasks, ensuring that no synchronization overhead affects task completion. Unfortunately, the traditional MapReduce style of execution does not lend itself to graph applications. A graph algorithm usually requires sending messages between vertices or performing “hops” to travel across an edge from one vertex to another as the bulk of the processing. Executing this in typical MapReduce fashion requires each hop or message to be processed in its own job. This means for each iteration of the algorithm:
- A MapReduce job will be launched.
- The key/value pairs will be read from disk and divided among the mappers.
- The appropriate portion of the graph will be built inside each map task.
- The actual processing will be performed on the input data in each map task.
- Intermediate values output from the map tasks will be written to disk.
- The intermediate data will be read from disk, sorted, and sent to the reducers.
- The reduce tasks will rebuild the appropriate portion of the graph.
- The actual processing will be performed on the intermediate data in each reduce task.
- The output from the reducers will be written out such that the graph is altered and ready for the next iteration.
- The MapReduce job will perform its cleanup step, placing its output in the directory the next job should read from.
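A driver for the chain above typically launches one job per iteration, pointing each job’s input at the previous job’s output directory. The sketch below captures just that path-chaining logic; the actual Hadoop job-submission calls (`Job.getInstance`, `FileInputFormat.addInputPath`, and so on) are left as comments so the example stays self-contained, and all class names and paths are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class IterationDriver {
    // Builds the input -> output directory chain for n iterations, so that
    // each MapReduce job reads the graph state the previous job wrote out.
    static List<String[]> chainPaths(String base, int iterations) {
        List<String[]> chain = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            String in = base + "/iter-" + i;
            String out = base + "/iter-" + (i + 1);
            chain.add(new String[] { in, out });
            // A real driver would configure and launch a full job here,
            // roughly: Job job = Job.getInstance(conf, "iteration-" + i);
            //          FileInputFormat.addInputPath(job, new Path(in));
            //          FileOutputFormat.setOutputPath(job, new Path(out));
            //          job.waitForCompletion(true);
        }
        return chain;
    }
}
```

Every pass through this loop pays the full job-launch, sort, and disk I/O cost listed above, which is exactly the overhead Giraph sets out to remove.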
Apache Giraph aims to eliminate this unnecessary overhead by providing a MapReduce-compatible solution that is optimized for graph processing applications. It is written in Java and is intended to be an open-source alternative to Google’s Pregel. Giraph allows the user to write vertex-oriented applications, with processing performed in memory to eliminate unnecessary job launches and disk interactions. Its Bulk Synchronous Parallel (BSP) based programming model replaces the chain of MapReduce jobs with a single mapper-only job that executes a set of computation actions called “supersteps.”
At the beginning of the Giraph job, each worker is initialized with the set of key/value pairs containing the data for its set of vertices, and the appropriate collection of edges for each one. This graph initialization happens once and lasts for the life of the application. Once the graph is in place, the superstep actions can begin. In each superstep, the following actions will take place in each vertex:
- Receive unordered messages from the previous superstep and perform processing steps as determined by the vertex data and the particular message.
- Send appropriate messages to any connected vertices.
These actions are repeated in each vertex, once per iteration. Any hop or message passing that takes place between vertices causes a new superstep, not a new iteration of the entire MapReduce job. The job will halt when all vertices indicate that there are no more messages to process and that all work is complete. Since Giraph uses a single mapper-only job for its execution and performs the processing in memory, there is no unnecessary sorting of data, and the only times the disk is accessed are for the initial creation of the graph and to write out checkpointing information and final results. These improvements ensure that the benefits of distributing the processing to multiple nodes are not eclipsed by unnecessary overhead.
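To make the superstep model concrete, the toy simulation below propagates the maximum vertex value across a graph one superstep at a time, halting when no messages remain. This is a plain-Java sketch of the BSP model described above – not Giraph’s actual API – and the names are our own:

```java
import java.util.*;

public class BspMaxValue {
    // Toy bulk-synchronous-parallel run: every vertex repeatedly adopts the
    // largest value it has heard about, until no messages are in flight.
    static int propagateMax(Map<Integer, Integer> value,
                            Map<Integer, List<Integer>> edges) {
        // Superstep 0: every vertex sends its value to its neighbors.
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        for (var v : value.keySet()) {
            for (int n : edges.getOrDefault(v, List.of())) {
                inbox.computeIfAbsent(n, k -> new ArrayList<>()).add(value.get(v));
            }
        }
        int supersteps = 1;
        // Each later superstep: receive messages, update, send only if changed.
        while (!inbox.isEmpty()) {
            Map<Integer, List<Integer>> next = new HashMap<>();
            for (var e : inbox.entrySet()) {
                int v = e.getKey();
                int max = Collections.max(e.getValue());
                if (max > value.get(v)) {
                    value.put(v, max); // value changed: vertex stays active
                    for (int n : edges.getOrDefault(v, List.of())) {
                        next.computeIfAbsent(n, k -> new ArrayList<>()).add(max);
                    }
                } // otherwise the vertex effectively "votes to halt"
            }
            inbox = next;
            supersteps++;
        }
        return supersteps; // the job halts when no messages remain
    }
}
```

The graph state stays in memory across supersteps, which is the key difference from the job-per-iteration MapReduce chain.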
Giraph entered into the Apache Incubator process in summer 2011, and had its first release approved in early 2012. Its most recent official version (1.0.0) was released in May 2013. The trunk version (1.1.0) includes support for YARN and is being actively worked on by the community. Giraph is filling a significant gap in the Hadoop-based analytic world, and it will be extremely interesting to follow this technology as it continues to evolve.
Look below for some great Big Data books from Safari Books Online.
Safari Books Online has the content you need
- Hadoop Real-World Solutions Cookbook provides in-depth explanations and code examples. The book covers (un)loading to and from HDFS, graph analytics with Giraph, batch data analysis using Hive, Pig, and MapReduce, machine learning approaches with Mahout, debugging and troubleshooting MapReduce, and columnar storage and retrieval of structured data using Apache Accumulo.
- Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 is written by YARN project founder Arun Murthy and project lead Vinod Kumar Vavilapalli and demonstrates how YARN increases scalability and cluster utilization, enables new programming models and services, and opens new options beyond Java and batch processing. They walk you through the entire YARN project lifecycle, from installation through deployment.
- Professional Hadoop Solutions is a practical, detailed guide to building and implementing those solutions, with code-level instruction in the popular Wrox tradition. It covers storing data with HDFS and HBase, processing data with MapReduce, and automating data processing with Oozie. Hadoop security, running Hadoop with Amazon Web Services, best practices, and automating Hadoop processes in real time are also covered in depth.