Chapter 4. Phase 3: Enhancing Your Cloud Solution

Once you have started your migration to the cloud, there is a set of considerations that will enable you to take your cloud-based systems to the next level.

Design for Failure at the Network as well as Application Layers

It is a mantra that is as old as the cloud itself: the cloud doesn’t guarantee success, but it does give you the tools to deal with failure. The ability to dynamically create infrastructure on demand removes the dependency on hardware that is characteristic of data centers.

Everything you do in the cloud should assume failure will happen.

This is now common practice for server infrastructure. The most famous example is Netflix’s Chaos Monkey: a tool that intentionally disables elements of the production infrastructure to ensure that the resiliency systems around it can cope. This is rarely done at the network level, but the same rules can apply.

If your system suffers Internet performance problems, such as routing issues that add overhead to every request, then your system should be aware of this and be able to switch easily to another location. For example, if you normally serve content from Virginia because the majority of your requests come from users in Chicago, then in the event of Internet performance issues it should be easy to switch to serving content from another relevant location, such as San Francisco. This allows you to respond not only to Internet performance issues but also to outages at specific data centers. Alternatively, because switching location is not always possible for practical reasons (e.g., data), the system could instead be configured to move into a lower-bandwidth, minimized-service state to reduce the impact.
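To make this concrete, the following is a minimal sketch, in Python, of the kind of decision logic involved. The region names, health-check URLs, and latency thresholds are illustrative assumptions rather than recommendations, and in practice the actual switch would usually be made in your DNS or traffic-management layer rather than in application code.

    # A minimal sketch of latency-aware failover between regions, with a
    # degraded-service fallback when no region is healthy enough. All names,
    # URLs, and thresholds below are hypothetical.
    import time
    import urllib.request

    REGIONS = {
        "us-east-virginia": "https://virginia.example.com/health",
        "us-west-sanfrancisco": "https://sanfrancisco.example.com/health",
    }
    LATENCY_BUDGET_SECONDS = 0.5  # beyond this, users start to notice

    def measure_latency(url):
        """Time one health-check request; treat failures as infinite latency."""
        start = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=5).read()
        except OSError:
            return float("inf")
        return time.monotonic() - start

    def choose_serving_mode(active_region):
        """Return (region_to_serve_from, degraded_mode_flag)."""
        latencies = {name: measure_latency(url) for name, url in REGIONS.items()}
        if latencies[active_region] <= LATENCY_BUDGET_SECONDS:
            return active_region, False  # all is well, stay put
        best = min(latencies, key=latencies.get)
        if latencies[best] <= LATENCY_BUDGET_SECONDS:
            return best, False           # switch to a healthier region
        return active_region, True       # nowhere is healthy: minimize service

Whether such a check runs continuously or is triggered by hand matters less than having the decision path defined, documented, and tested in advance.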

Understand the Cost of Performance and Monitoring as a Core Part of Capacity Planning

The cloud allows you to provision all the components needed to deliver a scalable system, but those components do not come for free. Anyone who has used cloud-based services will tell you that it is very easy to run up much higher bills than expected. However, this can be mitigated by intelligent system design.

Adding complexity will add cost—not only in terms of cloud costs, but also in terms of development and maintenance overhead. It is essential that you consider the following:

Level of usage

Scale systems only to the level of usage that you anticipate. There is no need to future-proof systems: build systems that can scale, not systems that are already provisioned to meet any anticipated future demand. Good system architecture is essential here and, like many other things, cloud-based system architecture is different from on-premise system architecture. As a general rule, the aim should be to use cloud-based services where possible, as they are prebuilt to be scalable with no input from you, and some are also built to be region-independent. Where you are building on virtual machines, the aim should be for them to be horizontally scalable, meaning you can add and remove servers on demand with no impact on users.

Where your users are coming from

Only scale systems to meet demand in areas where you have a user base that warrants the additional cost and effort. Building and maintaining a multiregion system is a complex task, particularly when it comes to data management, so it is not something to be entered into lightly. Before committing, use your monitoring to determine whether there is sufficient demand from the region and, more importantly, what impact your current configuration has on users there.

When your users are coming

The nature of cloud systems, with their “pay as you use” charging model and on-demand creation and destruction of resources, means that you can scale your system up and down as needed. It is therefore best practice to analyze when your systems are busy, scale up to meet demand, and scale back down again afterwards. This can be on a daily, hourly, or even minute-by-minute basis (a minimal scheduling sketch follows this list).

How tolerant your users are

With an intelligent set of monitoring tools, you can determine how tolerant your users are of performance issues. For example, you may determine that users in Australia see notably worse performance than users in other areas of the world, which could trigger a need to invest in expanding to cloud providers with better Internet performance for Australian users. Before making such an investment, however, it is a good idea to understand the impact that poor performance is having on those users. There are a couple of ways to investigate this: you could analyze the performance of your competitors to see how well you compare in that area, or you could change performance and assess the impact. Improving performance is typically a complex task, so one option is to consciously reduce performance on your system and observe the business impact (a sketch of such an experiment also follows this list). This may seem like an unusual suggestion, and it may be hard to sell within your business, but, while obviously not foolproof, it can be a quick and effective way of determining the value of investing significant time and effort in performance improvements.
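The “when your users are coming” consideration reduces to a simple schedule once your monitoring has shown you the traffic profile. The sketch below assumes a hypothetical fleet object exposing a set_desired_capacity() call and an invented hourly profile; in practice the numbers would come from your own monitoring data, and most providers offer native scheduled-scaling features that do this for you.

    # A minimal sketch of time-based capacity scheduling. The hourly profile
    # and the fleet.set_desired_capacity() call are hypothetical placeholders.
    from datetime import datetime, timezone

    CAPACITY_BY_HOUR = {
        range(0, 7): 2,    # overnight: minimal footprint
        range(7, 18): 10,  # business hours: full capacity
        range(18, 24): 4,  # evening: partial capacity
    }

    def desired_capacity(now=None):
        """Look up the target server count for the current UTC hour."""
        hour = (now or datetime.now(timezone.utc)).hour
        for hours, count in CAPACITY_BY_HOUR.items():
            if hour in hours:
                return count
        return 2  # safe default

    # Called from a scheduler every few minutes, for example:
    # fleet.set_desired_capacity(desired_capacity())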
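If you do run the “consciously reduce performance” experiment described under “How tolerant your users are,” it can be as small as a piece of middleware. The WSGI sketch below uses invented values for the delay, the sample rate, and the response header that separates the experiment group from the control group; treat it as an illustration of the shape of the experiment, and agree on it with the business before switching it on.

    # A minimal sketch of a latency-injection experiment as WSGI middleware.
    # Delay value, sample rate, and header name are illustrative assumptions.
    import random
    import time

    class LatencyInjectionMiddleware:
        """Delay a small random sample of requests and tag them for analytics."""

        def __init__(self, app, delay_seconds=0.5, sample_rate=0.05):
            self.app = app
            self.delay_seconds = delay_seconds  # artificial slowdown to apply
            self.sample_rate = sample_rate      # fraction of requests affected

        def __call__(self, environ, start_response):
            delayed = random.random() < self.sample_rate
            if delayed:
                time.sleep(self.delay_seconds)

            def tagged_start_response(status, headers, exc_info=None):
                # Tag responses so conversion and abandonment rates of the
                # delayed group can be compared against the control group.
                headers = headers + [("X-Latency-Experiment", "1" if delayed else "0")]
                return start_response(status, headers, exc_info)

            return self.app(environ, tagged_start_response)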

Combining all these factors, you can construct a system that is scaled to deliver optimally to users while minimizing cost and complexity. However, like everything else, the cost of building and maintaining this system must be included in the cost optimization. In other words, don’t spend six months building a system that will save the equivalent of one month of time in reduced cloud costs.

However, as you look to optimize your cloud service, there are a few common mistakes that can undermine your efforts.

Flawed Thinking: Moving to the Cloud Means You Don’t Need an Ops Team

A common misconception is that a move to the cloud can be accompanied by a reduction in the level of ops support needed for production systems. This is untrue: cloud-based systems need as much in-house expertise as any other hosted systems; it is the nature of the job that changes. Managing a cloud-based system is as complex as managing an on-premise system. It requires a high level of specialized knowledge and understanding of the implementation, as well as industry knowledge of both traditional networking and cloud systems. Cloud systems do not look after themselves, but they do provide a new paradigm for how such systems are built, monitored, and maintained.

Because little pre-emptive optimization of the core infrastructure can be done, the focus shifts to building fault tolerance into systems, building comprehensive monitoring solutions, and reacting quickly to situations in order to take advantage of the scaling and geographical options offered by cloud-based services. The skillsets of your ops team will need to evolve to meet these new challenges.

Flawed Thinking: Third Parties Are Optimized for You

Modern systems are not only dealing with incoming requests; they are also routing requests out to other remote systems, often over the public Internet. These could be third-party services or other services within your organization. When assessing the Internet performance of systems after migration to the cloud, it is essential that you also consider the performance of communications with these dependencies.
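One way to keep this visible is to measure outbound calls with the same rigor as inbound ones. The sketch below probes a couple of placeholder dependencies and emits their latency; the dependency names, URLs, and the record() sink are assumptions standing in for your own third-party services, internal services, and metrics pipeline.

    # A minimal sketch of monitoring outbound dependency performance.
    # Dependency names, URLs, and the record() sink are hypothetical.
    import time
    import urllib.request

    DEPENDENCIES = {
        "payments-provider": "https://api.payments.example.com/ping",
        "internal-search": "https://search.internal.example.com/health",
    }

    def record(metric_name, value_seconds):
        # Placeholder: forward to whatever metrics pipeline you already use.
        print(f"{metric_name}: {value_seconds * 1000:.0f} ms")

    def probe_dependencies():
        """Time a lightweight request to each dependency and emit the latency."""
        for name, url in DEPENDENCIES.items():
            start = time.monotonic()
            try:
                urllib.request.urlopen(url, timeout=5).read()
                record(f"dependency.{name}.latency", time.monotonic() - start)
            except OSError:
                record(f"dependency.{name}.error", time.monotonic() - start)

    # Run from a scheduler alongside your existing inbound monitoring.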

Phase 3: Dos and Don’ts

Do

  • Assume that components could fail at any point—this includes network connectivity

  • Have contingency plans in place to deal with networking issues

  • Consider costs when building any solution

  • Build a system that can scale, not one that is already scaled

  • Understand your users when planning where and how to scale your system

  • Aim to build a system that is always operating close to capacity

  • Realize that cloud systems require specialist knowledge to manage them

  • Realize that the nature of the work will change

  • Ensure that you understand the impact of poor performance of third-party systems

  • Remember to assess the performance of dependent applications

  • Consider installing dedicated connections to external systems

Don’t

  • Feel that any failover process has to be a complex automated process—a tested and documented manual process can be equally valid

  • Assume that after moving to the cloud the ops overhead will be reduced

  • Try to build a system that is sized to be future proof
