
Originally, HBase was built using Ant, and that worked pretty well. Then HBase got bigger – it had more code, more dependencies and a more complicated end product.

HBase switched to using Maven; there was some pain, but generally Maven is the ‘right’ way to build a Java project. And on the whole, Maven was great – adding dependencies was easy, we got a website for next to nothing and pretty easily added custom builds for different flavors of HBase (a security build, and also multiple versions built against different Hadoop distributions).

NOTE: Maven is a bit odd if you have never dealt with it before – goals, phases, modules and an explicit inheritance design between modules – but it can provide a lot of things for free. The Maven guide is a great place to get started, and is particularly helpful for understanding how all the pieces fit together.

Then HBase got bigger. It became untenable to run all the tests locally, build time was steadily creeping up, dependency entanglement was quickly becoming overwhelming, and there was a desire to use drop-in replacements for certain features.

Enter modules.

The end goal is to have a logical set of modules that emulates the services being run. hbase-commons would have all the common utilities, hbase-client would have all the necessary classes for the client, hbase-server would have server-side classes, hbase-regionserver would have all the regionserver code, etc. Not all of these modules have been created (in fact most haven’t), but now that a multi-module build process is in place, we can add modules as we have time to pull out components or as new components are needed.

Lessons Learned #1 – Start Small

HBase has hundreds of files, with thousands of lines of code, that have been interwoven over the last six years. When attempting HBASE-4336, I initially tried to (1) move to a multi-module build and (2) detangle all the dependencies in one go.

Don’t do this – it’s a world of hurt.

Given the size of the project and the tangled web of dependencies, it quickly became untenable to detangle all the cross-package references and manage the incoming commits – the code was changing too fast to keep up.

Instead, I ended up just starting with a couple of simple modules – a parent module (hbase), the server (hbase-server) which held all the code, a module for building the website (hbase-site), and a final module for putting it all together into a final package (hbase-assembly). The majority of this change was moving all the code down one level in the hierarchy to the right child module and futzing with the poms to get HBase building correctly. Even this small change required a moratorium on commits for nearly four days and lots of frantic hacking to let developers get back to committing code.
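To make that first step concrete, here is a minimal sketch of what a parent pom and one child pom can look like for the four modules named above. The version number is a placeholder and the fragments are heavily trimmed, so treat this as an illustration of the layout rather than the actual HBase poms.

```xml
<!-- hbase/pom.xml: the parent. Packaging must be 'pom' for a module aggregator. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase</artifactId>
  <version>0.0.1-SNAPSHOT</version> <!-- placeholder version (assumption) -->
  <packaging>pom</packaging>

  <!-- Each entry is a sub-directory that contains its own pom.xml -->
  <modules>
    <module>hbase-server</module>
    <module>hbase-site</module>
    <module>hbase-assembly</module>
  </modules>
</project>

<!-- hbase/hbase-server/pom.xml: a child that inherits from the parent -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <!-- '..' is already the default relativePath; spelled out here for clarity -->
    <relativePath>..</relativePath>
  </parent>
  <artifactId>hbase-server</artifactId>
</project>
```

The two `<project>` elements live in separate files; they are shown together only to make the parent/child relationship easy to see.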

Once you have your code building the ‘multi-module’ way, you can then slowly, over time, break out subsections into their own modules. It’s been about 5 months (at time of writing) since I finished the initial modularization, and we have gone from an hbase-server module holding all the code to six modules, with an hbase-examples module coming soon and an hbase-client module planned (whenever someone gets around to rewriting the client). Below is the current state of HBase 0.96 (currently trunk).

Lessons Learned #2 – Using the maven-assembly-plugin

Once we have multiple modules, we need to combine them into a single tarball that can be distributed. This is where the maven-assembly-plugin comes into play. The initial modularization of HBase used the assembly:single goal, which advises you to make a special ‘assembly’ module that depends on all other modules. This ensures that all the other modules get built before the assembly module.

Generally, this works pretty well. The assembly descriptors are a bit finicky, but after looking at your assembly descriptors and the documentation for a bit, it starts to make sense. Also, you should look at the examples to see some more advanced usage.

2.1 – Using the dependencySet property

The dependencySet property in the assembly descriptor can be used to easily pull the other modules into our final tarball. However, some of the modules may also build a tests jar via the maven-jar-plugin’s jar:test-jar goal, which just includes all the compiled test classes. For example, the hbase-server module produces hbase-server.jar and hbase-server-tests.jar.
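As a rough sketch of the idea, a descriptor using a dependencySet might look something like the following. The descriptor id, the output directory and the include pattern are assumptions chosen for illustration, not a copy of the real HBase descriptor.

```xml
<!-- hbase-assembly/src/assembly/all.xml (hypothetical path and id) -->
<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2">
  <id>all</id>
  <formats>
    <format>tar.gz</format>
  </formats>
  <dependencySets>
    <dependencySet>
      <!-- Put the matched jars under lib/ inside the tarball -->
      <outputDirectory>lib</outputDirectory>
      <!-- The assembly module itself produces no jar worth shipping -->
      <useProjectArtifact>false</useProjectArtifact>
      <includes>
        <!-- Short groupId:artifactId form. Only artifacts that the assembly
             module declares as dependencies can be matched, so the tests jar
             needs its own dependency entry with <type>test-jar</type>. -->
        <include>org.apache.hbase:hbase-server</include>
      </includes>
    </dependencySet>
  </dependencySets>
</assembly>
```

The important design point is that a dependencySet only sees what the assembly module depends on, which is why the assembly module ends up depending on every other module.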

The problem here lies in the fact that the maven-jar-plugin’s jar:test-jar goal only runs after the tests are compiled. This means if you bind the assembly to an early phase, you won’t be able to find the *-tests.jar, and your assembly won’t be complete (though interestingly, Maven probably won’t fail the build). The trick here is to make sure you bind your assembly to a late phase, like “package”, to ensure the tests are built.
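A minimal sketch of that binding is below, assuming a descriptor at the hypothetical path src/assembly/all.xml and an execution id picked for the example. The first fragment would go in the module that owns the tests (hbase-server), the second in the assembly module.

```xml
<!-- hbase-server/pom.xml: also build hbase-server-<version>-tests.jar -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <!-- test-jar binds to the 'package' phase by default -->
        <goal>test-jar</goal>
      </goals>
    </execution>
  </executions>
</plugin>

<!-- hbase-assembly/pom.xml: build the tarball late in the lifecycle -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptors>
      <!-- hypothetical descriptor location -->
      <descriptor>src/assembly/all.xml</descriptor>
    </descriptors>
  </configuration>
  <executions>
    <execution>
      <id>tarball</id> <!-- illustrative id -->
      <!-- 'package' runs after test-compile, so the tests jar exists by now -->
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```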

One of the side effects from doing this on your project (like HBase) will be the need to run

from a fresh source install, rather than a simple

to ensure that Maven builds all the dependencies and can find them for future builds. This is even more important as you move beyond a single module holding all the source code. And don’t worry – people will get confused, even if it is a well-documented part of the ‘build from source’ process.

2.2 – Dropping the hbase-assembly module

Though using an assembly module is technically the ‘correct’ thing to do, it’s still a bit of a pain when we already have a pom that keeps track of all the children – the parent pom! To use the assembly-plugin within the parent pom you need to switch from using the assembly:single goal to the assembly:assembly goal*. The other major change you will need to make is to switch from using dependencySets to moduleSets and the useAllReactorProjects flag.


Otherwise, your build can remain almost exactly the same (minus the -assembly module)!
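For reference, a moduleSet-based descriptor could be sketched roughly like this. The id, the include pattern and the output directory are assumptions for illustration; the real HBase descriptor pulls in more modules and more file sets.

```xml
<!-- src/assembly/all.xml, referenced from the parent pom (hypothetical path) -->
<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2">
  <id>all</id>
  <formats>
    <format>tar.gz</format>
  </formats>
  <moduleSets>
    <moduleSet>
      <!-- Consider every project in the current reactor build,
           not only the children of the pom running the assembly -->
      <useAllReactorProjects>true</useAllReactorProjects>
      <includes>
        <include>org.apache.hbase:hbase-server</include>
      </includes>
      <!-- Ship each matched module's built jar, unexploded, under lib/ -->
      <binaries>
        <outputDirectory>lib</outputDirectory>
        <unpack>false</unpack>
      </binaries>
    </moduleSet>
  </moduleSets>
</assembly>
```

Where a dependencySet selects from the declared dependencies of one module, a moduleSet selects from the modules in the build itself, which is what lets the parent pom do the packaging.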

*The assembly:assembly goal is deprecated in the current maven-assembly-plugin, and yes, the HBase project knows. It’s usually bad form to adopt a deprecated function, but we have found it to be non-problematic (though we can never upgrade).

Lessons Learned #3 – Handling multiple dependency versions

HBase is in a frustrating and unique position, in that it must support multiple versions of Hadoop. We manage this via runtime reflection checks for expected methods and then picking the right methods based on the included jars – hacky, but it works.

From a build perspective, we used a handful of different build profiles and properties (e.g. -Dhadoop.version=22, -Dhadoop.version=23, -Psecurity). However, this gets to be a jumbled mess once you have code across multiple modules, especially because profiles are not inherited. This means if the parent pom has a profile, the child won’t necessarily have that profile, though the properties set on the parent will be set in the child. HBase now uses profiles to manage the correct dependency versions, which are then picked up in the child poms.
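The pattern can be sketched as follows: the parent’s profiles only set a property, and the children refer to that property. The profile ids, the property name (hadoop.dep.version), the version numbers and the dependency shown are all placeholders invented for this example rather than the values HBase actually uses.

```xml
<!-- parent pom.xml: each profile only sets a version property -->
<profiles>
  <profile>
    <id>hadoop-one</id> <!-- illustrative id, selected with -Phadoop-one -->
    <properties>
      <hadoop.dep.version>1.0.0</hadoop.dep.version> <!-- placeholder -->
    </properties>
  </profile>
  <profile>
    <id>hadoop-two</id> <!-- illustrative id, selected with -Phadoop-two -->
    <properties>
      <hadoop.dep.version>2.0.0</hadoop.dep.version> <!-- placeholder -->
    </properties>
  </profile>
</profiles>

<!-- child pom.xml (e.g. hbase-server): no profile needed here, because the
     property resolved in the parent is inherited by the child -->
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId> <!-- illustrative artifact -->
    <version>${hadoop.dep.version}</version>
  </dependency>
</dependencies>
```

Keeping the profile logic in one place means a child module never has to know which Hadoop line is being built against; it only sees the resolved property.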

However, to further complicate things, HBase also added a Hadoop ‘shim-layer’ (hbase-hadoop-compat) that abstracts away the differences between Hadoop 1.X and Hadoop 2.X. The assembly descriptor in hbase/src/assembly/components.xml defines the simple copying of elements (configs, bin files, etc.), but then we pick the right versions of dependencies in custom assembly descriptors for the right versions of Hadoop. So, we have a hadoop-one-compat.xml and a hadoop-two-compat.xml, which define the correct modules to use and are specified in the corresponding profile. Here is what the meat of the hadoop-one-compat.xml looks like:


As HBase is the ‘Hadoop Database’, we have a need for such a heavily engineered shim layer; it’s generally not something you will need. In fact, we got along without an explicit layer until 0.96, handling the entire mapping at runtime. Even now, we are still primarily using the shim to do mapping for metrics classes.

The HBase build has grown to be a complicated beast to build, with lots of complicated end goals and implicit features that are not always obvious even after lots of study. If you’re interested, the source code is worth investigating, if only to see some cool stuff that Maven can do for your own project.

Hopefully, this gives you some idea of the history and the trickier things that showed up when moving HBase from a single-module Maven project to a full-fledged multi-module project. While it was a lot of pain in the beginning, it has made working with the HBase source code easier and has led to cleaner, better engineering.

What Maven bugs (er, features) have you found? What other tricks have you used in your own multi-module project?

About this author

Jesse Yates has been living and breathing distributed systems since college. He’s worked with Hadoop, HBase, Storm, and almost all the other Big Data buzz words too. In his free time he writes for his blog, rock climbs and runs marathons. He currently works as a software developer at Salesforce.com and is a committer on HBase.
