Filed under Content - Highlights and Reviews, Programming & Development.

A guest post by Paul Lathrop, a software engineer for Krux Digital, Inc. In his prior life as an operations engineer, Paul specialized in building and/or breaking complex systems using Puppet for several companies including Digg, SimpleGeo, and Krux.

At Krux, we make extensive use of piped to stream data from our front-end servers into our real-time pipeline via several “collectors.” These are nodes configured to consume the data stream(s) from piped, divide them into “messages,” and enqueue those messages for processing by Storm. For reliability, we run multiple collector nodes, and we can configure piped to fail over from one collector to another in the case of a collector failure.

Failover alone isn’t enough to give us a reliable and scalable pipeline; we also need to distribute the load among the collectors. Since the Krux infrastructure is built on AWS, Amazon’s Elastic Load Balancing (ELB) would seem like an ideal solution. If, however, you aren’t yet deployed in a VPC environment (and Krux was built before VPC was viable, so we are not), you can’t use ELB internally: all of your ELB instances are public.

With a little ingenuity and some Ruby code, we can leverage Puppet to help us distribute the load among our collector instances. We wrote a custom Puppet function called fqdn_rotate to help us generate the correct piped configs. fqdn_rotate is pretty straightforward Ruby:
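The original listing isn't reproduced here, but the idea is a minimal sketch along these lines: rotate the array by an offset derived from the node's fully-qualified domain name, so each host gets a stable, host-specific ordering of the same list (Ruby 1.9+, using Array#rotate; inside Puppet the fqdn comes from the fact rather than an argument):

```ruby
require 'digest/md5'

# Sketch of the rotation at the heart of fqdn_rotate: shift the array
# by an offset derived from the node's fully-qualified domain name.
# Every host sees the same list, in a stable but host-specific order.
def fqdn_rotate(hosts, fqdn)
  offset = Digest::MD5.hexdigest(fqdn).hex % hosts.length
  hosts.rotate(offset)
end

# As a Puppet parser function (lib/puppet/parser/functions/fqdn_rotate.rb),
# the same logic reads the fqdn fact via lookupvar instead of an argument:
#
#   module Puppet::Parser::Functions
#     newfunction(:fqdn_rotate, :type => :rvalue) do |args|
#       hosts  = args[0]
#       offset = Digest::MD5.hexdigest(lookupvar('::fqdn')).hex % hosts.length
#       hosts.rotate(offset)
#     end
#   end
```

Because the offset is a hash of the fqdn, the result is deterministic per host: a given front-end always produces the same rotation, but different front-ends land on different rotations.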

Drop this handy function into the lib/puppet/parser/functions subdirectory of an appropriate module in your Puppet manifests to make use of it. To configure piped, we first define an array of collector hosts:
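For example (the hostnames here are illustrative):

```puppet
$collectors = [
  'collector1.example.com',
  'collector2.example.com',
  'collector3.example.com',
  'collector4.example.com',
]
```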

Ideally, given any number n of front-end servers, each of our four collectors will receive streams from n/4 of the front-end servers, and piped will be configured to send to the other collectors in case of failure. For example, with 16 front-end servers, the piped configurations we want will use these server settings (assuming the collectors listen on port 44444):
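Concretely, the four possible rotations of the collector list yield orderings like the following (piped fails over left to right; exactly which front-ends land on which ordering depends on the hash of each front-end's fqdn, so with 16 hosts roughly four take each one):

```
collector1:44444, collector2:44444, collector3:44444, collector4:44444
collector2:44444, collector3:44444, collector4:44444, collector1:44444
collector3:44444, collector4:44444, collector1:44444, collector2:44444
collector4:44444, collector1:44444, collector2:44444, collector3:44444
```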

Our template call looks something like this (somewhat simplified for clarity):
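A sketch of that call, assuming a piped module with an ERB template (the file path, module name, and service name are assumptions, not the exact Krux code):

```puppet
# Rotate the collector list once per node, then render the config.
$_hosts = fqdn_rotate($collectors)

file { '/etc/piped/piped.conf':
  ensure  => file,
  content => template('piped/piped.conf.erb'),
  notify  => Service['piped'],
}
```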

And the piped template just uses the $_hosts variable, which now contains the rotated array of collectors:
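Inside the template, that might look like the following (the server directive name and port are assumptions; piped's actual config syntax may differ):

```erb
server = <%= @_hosts.map { |h| "#{h}:44444" }.join(', ') %>
```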

With a few simple lines of Ruby code, we’ve been able to simply and predictably produce a configuration that will evenly distribute our piped streams across our collector cluster. Even better, when we update the list of collectors in our manifests, the load will automatically re-balance according to the number of front-end hosts and collectors.

Many thanks to my employer, Krux, for giving me permission to share this code. Feel free to contact me if you have any questions!

See below for some Puppet resources from Safari Books Online.

Read these titles on Safari Books Online

Not a subscriber? Sign up for a free trial.

Pro Puppet is an in-depth guide to installing, using, and developing the popular configuration management tool Puppet. The book is a comprehensive follow-up to the previous title Pulling Strings with Puppet. Puppet provides a way to automate everything from user management to server configuration. You’ll learn how to create Puppet recipes, extend Puppet, and use Facter to gather configuration data from your servers.
Instant Puppet 3 Starter provides you with all the information that you need, from startup to complete confidence in its use. This book will explore and teach the core components of Puppet, consisting of setting up a working client and server and building your first custom module.
Puppet 3 Beginner’s Guide gets you up and running with Puppet straight away, with complete real world examples. Each chapter builds your skills, adding new Puppet features, always with a practical focus. You’ll learn everything you need to manage your whole infrastructure with Puppet.

About the author

Paul Lathrop is a software engineer for Krux Digital, Inc., building infrastructure and back-end services while working to bring down the walls between software and operations engineers. In his prior life as an operations engineer, Paul specialized in building and/or breaking complex systems using Puppet for several companies including Digg, SimpleGeo, and Krux.

Tags: AWS, collectors, ELB, Krux, nodes, piped, Puppet, Puppet functions, Ruby, Storm
