
A guest post by Kasper Grud Skat Madsen, who currently works as a Ph.D. student at the University of Southern Denmark. His research interests are data stream management and cloud computing. He is a contributor to Storm and storm-deploy. Furthermore, he is involved in a project trying to extend Storm, called Enorm.

In the last post, we developed a Storm topology to maintain the set of unique users from different geographical regions. In this post, we will show how to deploy this topology to Amazon Elastic Compute Cloud (Amazon EC2) using the storm-deploy project. Storm-deploy is a Clojure project, based on Pallet.

To deploy to Amazon EC2, you will need an Amazon Web Services account with access to Amazon EC2 at a minimum. Go to aws.amazon.com to sign up. It is also assumed that you are running Linux and have a Java JDK (version 6 or later) installed. Storm-deploy expects to find a password-less RSA key pair in ~/.ssh/. If you are missing it, generate a new pair: ssh-keygen -t rsa.
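A quick way to check for the key pair before deploying is sketched below. The file names id_rsa and id_rsa.pub are assumptions taken from the example paths used later in this post, and has_keypair is a hypothetical helper name, not part of storm-deploy.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if both halves of an RSA key pair exist.
# The file names id_rsa / id_rsa.pub are assumed, not mandated by storm-deploy.
has_keypair() {
    dir="${1:-$HOME/.ssh}"
    [ -f "$dir/id_rsa" ] && [ -f "$dir/id_rsa.pub" ]
}

if has_keypair; then
    echo "key pair found"
else
    echo "key pair missing; generate one with: ssh-keygen -t rsa"
fi
```

The check only looks at file presence; it does not verify that the private key is password-less.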

The storm-deploy project works by contacting Amazon EC2 and requesting new instances. It then installs the needed software, such as Storm, Zookeeper and the required libraries. After configuring the new instances, the deployment system adds your local public SSH key to authorized_keys on the new instances, so that only you can connect to them. Finally, your local machine is “attached” to the newly deployed cluster, so that the storm/bin/storm script knows which cluster to execute commands on.

Install storm-deploy

Install Leiningen 2 (a tool to handle Clojure projects):

  1. Download the script: wget https://raw.github.com/technomancy/leiningen/stable/bin/lein
  2. Move the script to a directory on your $PATH, e.g.: mv lein /usr/local/bin/
  3. Make the script executable, e.g.: chmod +x /usr/local/bin/lein
  4. Execute lein (leiningen will install automatically)

Download storm-deploy

Now download storm-deploy:

  1. Clone storm-deploy: git clone https://github.com/nathanmarz/storm-deploy.git
  2. cd storm-deploy
  3. Download all dependencies: lein deps

Configure storm-deploy

You must now configure which cloud provider the topology should be deployed on. Create the file ~/.pallet/config.clj and set the following values in it:

  • PRIVATE_KEY_PATH: Path to private rsa key, e.g. ~/.ssh/id_rsa.
  • PUBLIC_KEY_PATH: Path to public rsa key, e.g. ~/.ssh/id_rsa.pub.
  • AWS_ACCOUNT_ID: Can be found on the AWS site, under Security Credentials and Account Identifiers. The id is in the format XXXX-XXXX-XXXX.
  • AWS_ACCESS_KEY: Can be found on the AWS site, under Security Credentials and Access Keys.
  • AWS_ACCESS_KEY_SECRET: Can only be read once, when the Access Keys are created. If you have not saved it, you must create a new set of Access Keys. This can be done on the AWS site, under Security Credentials and Access Keys, by clicking Create New Access Key.
  • AWS_REGION: Should be the same region as the new instances will be started in. For this project, choose us-east-1.
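The placeholders above could be arranged in config.clj roughly as in the sketch below. The Clojure keys (:provider, :identity, :credential, :jclouds.regions and so on) are an assumption based on the storm-deploy documentation of that period, so compare them with the README in your checkout. write_pallet_config is a hypothetical helper that writes the template and refuses to overwrite an existing file.

```shell
#!/bin/sh
# Hypothetical helper: writes a config.clj template into the given directory.
# The Clojure keys are assumed from the storm-deploy docs; verify before use.
write_pallet_config() {
    dir="${1:-$HOME/.pallet}"
    if [ -e "$dir/config.clj" ]; then
        echo "refusing to overwrite $dir/config.clj" >&2
        return 1
    fi
    mkdir -p "$dir"
    # Quoted 'EOF' keeps the template literal (no shell expansion).
    cat > "$dir/config.clj" <<'EOF'
(defpallet
  :services
  {:default {:blobstore-provider "aws-s3"
             :provider "aws-ec2"
             :environment {:user {:username "storm"
                                  :private-key-path "PRIVATE_KEY_PATH"
                                  :public-key-path "PUBLIC_KEY_PATH"}
                           :aws-user-id "AWS_ACCOUNT_ID"}
             :identity "AWS_ACCESS_KEY"
             :credential "AWS_ACCESS_KEY_SECRET"
             :jclouds.regions "AWS_REGION"}})
EOF
}
```

After running write_pallet_config, replace each upper-case placeholder in the generated file with your own value.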

The actual cluster configuration is defined in the file storm-deploy/conf/clusters.yaml. I recommend you keep the default values for this project. Note that the region in this file must match the AWS_REGION set in ~/.pallet/config.clj.

Launching a cluster

Launching a cluster is easy. Just go to the storm-deploy folder and run the deploy command with the following parameters:

  • CLUSTER_NAME: Give the cluster a name (must be lowercase).
  • BRANCH_ID: The branch of Storm to deploy on the cluster. If omitted, the master branch is used. Make sure you deploy the same version that the topology was written against. Since our job is programmed on 0.8.2, BRANCH_ID should be 0.8.2.
  • COMMIT_ID: If commit is omitted, the newest commit from the branch will be used.

Let’s deploy the cluster right now:
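As a sketch of what the launch invocation looks like with the parameters described above: the flags --start, --name, --branch and --commit are assumed from the storm-deploy README of the 0.8.x era, and launch_cmd is a hypothetical helper that only prints the command so it can be reviewed before anything is started on EC2. The cluster name mycluster is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: prints the storm-deploy launch command, does not run it.
# Flags --start/--name/--branch/--commit are assumed; check your checkout's README.
launch_cmd() {
    name="$1"; branch="$2"; commit="$3"
    # Cluster names must be lowercase, so reject empty or upper-case names early.
    case "$name" in
        ""|*[[:upper:]]*) echo "invalid cluster name: '$name'" >&2; return 1 ;;
    esac
    cmd="lein deploy-storm --start --name $name"
    # Branch and commit are optional; omit the flag when no value is given.
    [ -n "$branch" ] && cmd="$cmd --branch $branch"
    [ -n "$commit" ] && cmd="$cmd --commit $commit"
    echo "$cmd"
}

# "mycluster" is a placeholder; 0.8.2 matches the version the job was written for.
launch_cmd mycluster 0.8.2
```

Run the printed command from inside the storm-deploy folder once it looks right.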

In case of errors, try deleting the security groups and key pairs from the Amazon EC2 interface and then launch again. If you have followed the instructions so far, it should work without problems.

Testing if all is good

Before submitting the job, we should test if Storm is deployed correctly by examining the Storm UI. Open NIMBUS_PUBLIC_IP:8080 in a browser, and check that all of the daemons and instances are running as expected. In case of errors, log into the instances by issuing ssh storm@IP, and check the logs in ~/storm/logs.

Packing the job

The full source of the job we are going to deploy can be obtained from GitHub.

Load the job into your favorite IDE, and modify line 11 of the job's source file to read runLocal = false. Now compile the project with Java 1.6 compliance, since the default image launched by the deployment process has Java 1.6 installed. Finally, the job must be packaged into a JAR file. Go to the folder that contains all of the class files (and the input.txt), and create the JAR there.

This creates a JAR file called job.jar, which contains all of the class files of our project and has job.class set as the main class.
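The packaging step might look like the following sketch. It assumes the compiled classes sit in the default package directly in one folder next to input.txt; package_job and the JAR override variable are hypothetical names introduced here so the command can be previewed.

```shell
#!/bin/sh
# Hypothetical helper: packages the compiled job into job.jar.
# Assumes all .class files (default package) and input.txt are in directory $1.
package_job() {
    [ -d "$1" ] || { echo "usage: package_job CLASS_DIR" >&2; return 1; }
    # c = create, f = write to the named file, e = set the entry point (main class).
    # Setting JAR=echo previews the arguments instead of building the archive.
    ( cd "$1" && ${JAR:-jar} cfe job.jar job *.class input.txt )
}
```

For example, package_job ./bin would create ./bin/job.jar with job as the main class.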

Submitting the job

Now we are ready to submit the job. Go to the storm-0.8.2/bin folder that we downloaded in the first post. A job is submitted as follows:

To submit our job, simply execute:
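Storm's command-line client submits a topology with storm jar, followed by the JAR path, the main class and any arguments for it. The sketch below wraps that call; submit_job and the STORM override variable are hypothetical names, and the location of job.jar is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: submits a topology via "storm jar JAR MAIN_CLASS [ARGS...]".
# STORM defaults to ./storm, i.e. it expects to be run from storm-0.8.2/bin.
submit_job() {
    [ $# -ge 2 ] || { echo "usage: submit_job JAR_PATH MAIN_CLASS [ARGS...]" >&2; return 1; }
    jar_path="$1"; main_class="$2"
    shift 2
    # Any remaining arguments are passed through to the topology's main method.
    "${STORM:-./storm}" jar "$jar_path" "$main_class" "$@"
}
```

With the JAR built earlier, submit_job /path/to/job.jar job would submit our job, where /path/to is a placeholder for wherever you created job.jar.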

Check out the Storm UI again, and see that the job is running. You can also log in to the instances (ssh storm@IP) and check the logfiles in ~/storm/logs/.

Killing the job

Storm jobs run until they are killed. The command to kill a Storm job is the following:

To kill our job:
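A topology is killed by name with storm kill, optionally followed by -w and a number of seconds to wait before shutdown. The sketch below wraps that call; kill_topology and the STORM override variable are hypothetical names, and TOPOLOGY_NAME stands for the name the job registered under when it was submitted.

```shell
#!/bin/sh
# Hypothetical helper: kills a running topology via "storm kill NAME [-w SECS]".
# STORM defaults to ./storm, i.e. it expects to be run from storm-0.8.2/bin.
kill_topology() {
    name="$1"; wait_secs="$2"
    [ -n "$name" ] || { echo "usage: kill_topology TOPOLOGY_NAME [WAIT_SECS]" >&2; return 1; }
    if [ -n "$wait_secs" ]; then
        # -w sets how long Storm waits between deactivating and shutting down.
        "${STORM:-./storm}" kill "$name" -w "$wait_secs"
    else
        "${STORM:-./storm}" kill "$name"
    fi
}
```

The topology name is shown in the Storm UI if you are unsure what to pass.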

Killing the cluster

To kill the cluster on Amazon EC2, the command is:

To kill our cluster, go back to the storm-deploy folder and execute the following:
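Stopping the cluster mirrors the launch step. The --stop and --name flags below are assumed from the storm-deploy README of that period; stop_cluster and the LEIN override variable are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: shuts down a cluster started by storm-deploy.
# Flags --stop/--name are assumed; run it from inside the storm-deploy folder.
stop_cluster() {
    [ -n "$1" ] || { echo "usage: stop_cluster CLUSTER_NAME" >&2; return 1; }
    # Setting LEIN=echo previews the arguments without contacting EC2.
    "${LEIN:-lein}" deploy-storm --stop --name "$1"
}
```

Pass the same cluster name that was used when launching, for example stop_cluster mycluster if the placeholder name was kept.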

After killing the cluster, check the Amazon EC2 interface (aws.amazon.com) to confirm that all instances have actually been terminated. Otherwise, you could be facing a large bill in the near future.

Conclusion

If you have followed the previous post about Storm and this post on how to deploy Storm, you now know the basics of how to write jobs and deploy them. Before deploying Storm in a production setup, please consult the Storm wiki to learn about topics we have not covered in these two short posts, such as fault tolerance, Ganglia and Trident.
