
This post assumes that you have some HBase expertise. If you don’t have a lot of experience with HBase, be sure to look at the HBase books available from Safari Books Online.

HBase stores data on disk in two places: write-ahead logs (WALs) and HFiles. The WAL lets HBase record updates atomically and durably without requiring the data to be in any particular format, so writes can be persisted quickly and survive a crash of the regionserver serving that data. HFiles, in contrast, must be sorted so that HBase can efficiently merge them for reads; the more files that need merging, the slower the reads. HBase therefore only periodically flushes its in-memory writes (already made durable in the WAL) into an HFile. Over time, each region accumulates more and more HFiles, leading to slower and slower reads. To keep the number of HFiles bounded, HBase periodically compacts them down into a single (still sorted) HFile. The WAL ensures you don't lose any data, and the asynchronous flush means you also (1) serve recent writes from memory and (2) get large, sorted files on disk.

In HBase, write-ahead logs (WALs, also called HLogs) are always retained in the /hbase/.oldlogs directory for a short while before they are deleted. A similar mechanism was added in HBase 0.96 (and backported to 0.94) that moves HFiles to the /hbase/.archive directory when they are no longer being served. Custom retention policies for HFiles and HLogs can be used to build comprehensive backup and testing solutions. In fact, I'm going to show you how you can write your own HFile retention policy based around ZooKeeper.

File Cleaning

Both HFiles and WALs are retained for a configurable amount of time in their archive directories. By default, archived files are retained for 60 seconds, though you can specify custom times via hbase.master.hfilecleaner.ttl (HFiles) or hbase.master.logcleaner.ttl (WALs) in your hbase-site.xml.
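For example, bumping both TTLs might look like this in hbase-site.xml (the values here are purely illustrative):

```xml
<!-- hbase-site.xml: illustrative retention settings -->
<property>
  <name>hbase.master.hfilecleaner.ttl</name>
  <value>300000</value> <!-- keep archived HFiles for 5 minutes -->
</property>
<property>
  <name>hbase.master.logcleaner.ttl</name>
  <value>600000</value> <!-- keep archived WALs for 10 minutes -->
</property>
```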

When a WAL is no longer needed, it is archived from its log directory to the /hbase/.oldlogs directory.

WAL names include their source server, giving you some provenance for the archived HLogs. HFile names, however, are just random UUIDs, so some provenance must be preserved when they are archived. Therefore, when an HFile is moved, its directory structure is retained under /hbase/.archive; this makes it easy to associate a given HFile back to its source table.

When you write your own retention policies, you only have to concern yourself with whether or not an individual HFile or WAL should be deleted; directories are automatically removed once there are no files under them.

The cleaner chore, which checks whether files should be deleted, runs every hbase.master.cleaner.interval milliseconds (60000 ms by default). Each cleaner has a list of cleaner delegates (this is what you implement to get a custom retention policy), loaded from the configuration. On each iteration, every file found in the archive directory is submitted to the delegates in the order they are specified in the configuration. If any policy says the file isn't deletable (returns false from isFileDeletable(Path)), the remaining cleaners in the chain are skipped and the file is retained. Note that the order in which you specify file cleaners matters: put cheap checks first, or you will waste cycles running expensive checks on files that would have been retained anyway.
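That short-circuiting chain can be sketched in plain Java. Note the delegate type here is a simplified stand-in (a `Predicate<String>`), not HBase's real BaseHFileCleanerDelegate API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class CleanerChainDemo {
    // A file may be deleted only if every delegate in the chain agrees.
    static boolean isDeletable(List<Predicate<String>> cleaners, String file) {
        for (Predicate<String> cleaner : cleaners) {
            // The first delegate that says "keep" short-circuits the rest.
            if (!cleaner.test(file)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Order matters: cheap checks first, so expensive ones are often skipped.
        Predicate<String> ttlExpired = f -> !f.endsWith(".recent");  // cheap TTL-style check
        Predicate<String> notInBackup = f -> !f.contains("backup");  // pretend this is expensive
        List<Predicate<String>> chain = Arrays.asList(ttlExpired, notInBackup);

        System.out.println(isDeletable(chain, "old-hfile"));     // true: both agree
        System.out.println(isDeletable(chain, "hfile.recent"));  // false: TTL check short-circuits
        System.out.println(isDeletable(chain, "backup-hfile"));  // false: backup check retains it
    }
}
```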

Saving HFiles with a Custom Policy

Let’s say you have a backup job that just copies all the existing HFiles for a table. If you don’t want to disable the table, you need to worry about a compaction removing an HFile while you are copying it. You could implement something like HDFS hardlinks yourself (a monumental task, though there is progress in the HDFS community). Alternatively, you can just list all the HFiles for the table and then attempt to dist-cp them from their current locations; if a file has disappeared, you look for it in the /hbase/.archive directory instead. To ensure that an archived file isn’t removed while you work, you can post the file list to ZooKeeper and watch that list with a custom HFile cleaner.
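The copy-with-fallback step looks roughly like this. This is a local-filesystem sketch using java.nio (the real job would use dist-cp against HDFS paths; the directory layout and names here are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class BackupWithFallback {
    /**
     * Copy an HFile from its live location; if it was archived in the
     * meantime (e.g. by a compaction), fall back to the archive directory.
     * Returns the path the file was actually copied from, or null if the
     * file is gone entirely (already cleaned up).
     */
    static Path copyHFile(Path liveDir, Path archiveDir, String name, Path destDir)
            throws IOException {
        Path live = liveDir.resolve(name);
        Path archived = archiveDir.resolve(name);
        Path source = Files.exists(live) ? live
                    : Files.exists(archived) ? archived
                    : null;
        if (source == null) {
            return null; // nothing left to back up
        }
        Files.copy(source, destDir.resolve(name), StandardCopyOption.REPLACE_EXISTING);
        return source;
    }
}
```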

Let’s start with a simple retention policy – never delete any HFiles. As of 0.96, you extend the BaseHFileCleanerDelegate class and then register your class in hbase-site.xml. This simple policy would look like this:
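Something along these lines. To keep the snippet self-contained, the base class is stubbed out here with a String path (the real delegate lives in org.apache.hadoop.hbase.master.cleaner and takes a Hadoop Path); the class name is my own:

```java
// Stub so the example compiles standalone; in a real deployment you extend
// org.apache.hadoop.hbase.master.cleaner.BaseHFileCleanerDelegate instead.
abstract class BaseHFileCleanerDelegate {
    public abstract boolean isFileDeletable(String path);
}

public class NeverDeleteHFileCleaner extends BaseHFileCleanerDelegate {
    @Override
    public boolean isFileDeletable(String path) {
        return false; // retain every archived HFile, no matter what
    }
}
```

You would then register the class via the hbase.master.hfilecleaner.plugins property in hbase-site.xml so the master picks it up.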

Okay, that’s a start, but over time you will eventually fill up your filesystem and crash the cluster – definitely a bad idea! Instead, hook the cleaner up to ZooKeeper to keep track of the files you should retain.

When you want to save HFiles, just write their names to a ‘retention’ node in ZooKeeper. Then, whenever the cleaner chore runs, your custom policy checks that retention node to decide whether each HFile should be saved (see the MyCustomHFileCleaner example at the end of this post).

It’s important to note that BaseHFileCleanerDelegate::setConf(Configuration) will always be called exactly once before the cleaner is run, giving you a chance to set up a connection to ZooKeeper.
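The shape of such a cleaner is sketched below. To keep it self-contained and runnable, the set of retained names is held in memory here; in a real implementation it would be populated from the ZooKeeper retention node (connected in setConf and refreshed via a watch), and the stub base class would be the real BaseHFileCleanerDelegate:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Stub standing in for org.apache.hadoop.hbase.master.cleaner.BaseHFileCleanerDelegate.
abstract class BaseHFileCleanerDelegateStub {
    public abstract boolean isFileDeletable(String fileName);
}

public class RetentionSetHFileCleaner extends BaseHFileCleanerDelegateStub {
    // In a real implementation this set mirrors the children of the
    // 'retention' znode in ZooKeeper.
    private final Set<String> retained = ConcurrentHashMap.newKeySet();

    // Stands in for the backup job writing a file name to ZooKeeper.
    public void retain(String fileName) {
        retained.add(fileName);
    }

    @Override
    public boolean isFileDeletable(String fileName) {
        return !retained.contains(fileName); // keep anything the backup job pinned
    }
}
```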

This is a really simple example that doesn’t do anything smart about confirming the write propagated to ZooKeeper, caching results, etc. For a more in-depth example of how you could leverage ZooKeeper in a more sophisticated fashion to get cleaner results (no pun intended) with a built-in solution, see the LongTermArchivingHFileCleaner.

This should be enough to give you a good idea of how file cleaning works in HBase and how you can leverage it to build your own custom retention policies. How are you using the new HFile archiving? What kind of file cleaners are you writing?

Safari Books Online has the content you need

Below are some HBase books to help you develop applications, or you can check out all of the HBase books and training videos available from Safari Books Online. You can browse the content in preview mode or you can gain access to more information with a free trial or subscription to Safari Books Online.

If your organization is looking for a storage solution to accommodate a virtually endless amount of data, this book will show you how Apache HBase can fulfill your needs. As the open source implementation of Google’s BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. HBase: The Definitive Guide provides the details you require to evaluate this high-performance, non-relational database, or put it into practice right away.
HBase Administration Cookbook provides practical examples and simple step-by-step instructions for you to administrate HBase with ease. The recipes cover a wide range of processes for managing a fully distributed, highly available HBase cluster on the cloud. Working with such a huge amount of data means that an organized and manageable process is key and this book will help you to achieve that.
Ready to unlock the power of your data? With Hadoop: The Definitive Guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. You will also find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Start your FREE 10-day trial to Safari Books Online

About this author

Jesse Yates has been living and breathing distributed systems since college. He’s worked with Hadoop, HBase, Storm, and almost all the other Big Data buzz words too. In his free time he writes for his blog, rock climbs and runs marathons. He currently works as a software developer at and is a committer on HBase.

Tags: Big Data, file retention, Hadoop, HBase, scalable databases, testing
