High Availability is Critical (Rule of Three, Fail in Silos)
- Rule of "there must be at least three of everything" (likely more than three through clustering and partitioning)
- One is a single point of failure. Two likely means active/standby, which likely means 0% availability for 100% of your users during recovery. Three is where you can start to guarantee greater than 0% availability during failures: some users might see failure, but not all should, and eventually none should (a back-of-the-envelope sketch of this math follows the list).
- Do not let a failure in one part of your application bring down the entire application.
- Apply circuit breaker and bulkhead approaches to contain failures and to ensure that your critical paths are isolated from their dependencies and from each other (a minimal circuit-breaker sketch follows the list).
- These rules apply not only to your core service instances, but to all of the underlying cloud configuration and operational systems around them. You are only as strong as your weakest link.
- If a failure occurs once, study it and learn what it takes to automate the recovery.
- No failure (other than the first occurrence) should result in a pager call to an operator. Automate the recovery of the failing component through APIs to your infrastructure, servers, and procedures.
- Test that this is true before the system tests it for you. Run Chaos Monkey-like procedures that simulate failures you've seen in the past, and ensure your automation recovers from them (a failure-injection sketch follows the list).
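To make the rule of three concrete, here is a back-of-the-envelope sketch (my illustration, assuming independent node failures and a uniform per-node availability, which real fleets only approximate):

```java
// Hypothetical availability math for the "rule of three".
// Assumes independent failures and the same availability per node.
public class RuleOfThree {
    public static void main(String[] args) {
        double nodeAvailability = 0.99; // assumed per-node availability
        for (int n = 1; n <= 3; n++) {
            // Probability that all n nodes are down at once (a total outage).
            double totalOutage = Math.pow(1 - nodeAvailability, n);
            // Share of users hit while a single node recovers: with one node
            // or an active/standby pair, a failure can stall everyone; with
            // n >= 3 load-balanced nodes, roughly 1/n of traffic is affected.
            double affected = (n >= 3) ? 1.0 / n : 1.0;
            System.out.printf("n=%d  P(total outage)=%.6f  users hit by one failure=%.0f%%%n",
                    n, totalOutage, affected * 100);
        }
    }
}
```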
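On circuit breakers and bulkheads: the sketch below is a minimal circuit breaker of my own, not Hystrix itself (NetflixOSS Hystrix combines both patterns). After enough consecutive failures it fails fast to a fallback instead of piling more work onto a sick dependency:

```java
import java.util.function.Supplier;

// A minimal circuit breaker sketch (illustrative, not Hystrix itself):
// after enough consecutive failures the breaker "opens" and fails fast
// to a fallback instead of piling more work onto a sick dependency.
public class CircuitBreaker<T> {
    private final int failureThreshold;   // failures in a row before opening
    private final long openMillis;        // how long to stay open before retrying
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized T call(Supplier<T> primary, Supplier<T> fallback) {
        boolean open = consecutiveFailures >= failureThreshold
                && System.currentTimeMillis() - openedAt < openMillis;
        if (open) {
            return fallback.get(); // fail fast while the dependency recovers
        }
        try {
            T result = primary.get();
            consecutiveFailures = 0; // a success closes the breaker
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis(); // trip (or re-trip) open
            }
            return fallback.get();
        }
    }
}
```

A bulkhead complements this by giving each dependency its own bounded thread pool or connection pool, so one slow dependency cannot exhaust the resources that serve your other critical paths.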
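And to test recovery before the system tests it for you, a Chaos Monkey-style drill can terminate an instance and assert that automation heals the cluster within a deadline. The InfrastructureApi interface and its methods below are illustrative stand-ins for whatever cloud APIs you actually automate against:

```java
import java.util.List;
import java.util.Random;

// A hypothetical Chaos Monkey-style drill: kill an instance, then assert
// that recovery automation heals the cluster before a deadline.
public class ChaosDrill {
    interface InfrastructureApi {
        List<String> listInstances(String cluster);
        void terminate(String instanceId);
        boolean clusterIsHealthy(String cluster);
    }

    static void run(InfrastructureApi api, String cluster, long recoveryDeadlineMillis)
            throws InterruptedException {
        List<String> instances = api.listInstances(cluster);
        String victim = instances.get(new Random().nextInt(instances.size()));
        api.terminate(victim); // simulate a failure you've seen in production

        long deadline = System.currentTimeMillis() + recoveryDeadlineMillis;
        while (System.currentTimeMillis() < deadline) {
            if (api.clusterIsHealthy(cluster)) {
                System.out.println("Automation recovered without a pager call.");
                return;
            }
            Thread.sleep(5_000); // poll until automation replaces the instance
        }
        throw new AssertionError("Recovery automation did not heal " + cluster);
    }
}
```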
Practice DevOps
- Ensure that you can do live traffic shifting/shaping that enables key DevOps practices such as live debugging, canary testing, red/black deployments ("lights-on" code changes across clusters), and fast rollback in times of trouble (a traffic-weighting sketch follows the list).
- Ensure all code is A/B tested against key performance indicators (KPIs) that relate back to your key business objectives. A business owner should be able to judge the "goodness" of your code through these KPIs (an A/B bucketing sketch follows the list).
- The system must be observable to the developers as well. Design in monitoring so the developers can understand how the code is behaving in production. As you merge your operational and application-specific monitoring, leverage "ops" experts: they have already learned what to watch for, the hard way (an instrumentation sketch follows the list).
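Here is a hypothetical sketch of the traffic weighting behind canary testing and fast rollback; the cluster names and routing mechanism are illustrative assumptions, not a specific NetflixOSS API:

```java
import java.util.Random;

// A hypothetical weighted router: shift a small share of live traffic to a
// canary cluster, and roll back instantly by setting the weight to zero.
public class TrafficShifter {
    private volatile double canaryWeight; // fraction of traffic sent to the canary
    private final Random random = new Random();

    public TrafficShifter(double initialCanaryWeight) {
        this.canaryWeight = initialCanaryWeight;
    }

    public String chooseCluster() {
        return random.nextDouble() < canaryWeight ? "canary" : "baseline";
    }

    public void rampCanary(double weight) { canaryWeight = weight; } // e.g. 0.01 -> 0.5 -> 1.0
    public void rollback()               { canaryWeight = 0.0; }     // fast rollback in trouble
}
```

A red/black deployment is the end state of this ramp: bring the new cluster to 100% of traffic while keeping the old cluster running lights-on, so rollback is a weight change rather than a redeploy.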
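For A/B testing, deterministic bucketing plus KPI tagging is the core mechanism. A minimal sketch (the KPI name and the metrics sink are hypothetical stand-ins):

```java
// Deterministic A/B bucketing: hash the user id so each user consistently
// sees one variant, then tag every KPI event with the bucket so a business
// owner can compare variants against business objectives.
public class AbTest {
    public static String bucketFor(String userId) {
        // Stable assignment: the same user always lands in the same bucket.
        return (Math.floorMod(userId.hashCode(), 100) < 50) ? "A" : "B";
    }

    public static void recordKpi(String userId, String kpiName, double value) {
        // Stand-in for your metrics pipeline: every KPI event carries the bucket.
        System.out.printf("kpi=%s bucket=%s value=%.2f%n",
                kpiName, bucketFor(userId), value);
    }

    public static void main(String[] args) {
        recordKpi("user-42", "checkout_conversion", 1.0);
    }
}
```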
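And for developer-facing observability, even a small amount of designed-in instrumentation goes a long way. A minimal sketch, assuming a simple in-process metrics holder rather than any particular monitoring product:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal designed-in instrumentation: counters and latency that developers
// can read to understand production behavior, alongside the operational
// monitoring the "ops" experts already rely on.
public class RequestMetrics {
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong failures = new AtomicLong();
    private final AtomicLong totalLatencyMillis = new AtomicLong();

    public void record(long latencyMillis, boolean failed) {
        requests.incrementAndGet();
        totalLatencyMillis.addAndGet(latencyMillis);
        if (failed) failures.incrementAndGet();
    }

    // What a developer (or a dashboard) would read in production.
    public double errorRate() {
        long n = requests.get();
        return n == 0 ? 0.0 : (double) failures.get() / n;
    }

    public double meanLatencyMillis() {
        long n = requests.get();
        return n == 0 ? 0.0 : (double) totalLatencyMillis.get() / n;
    }
}
```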
Prepare for Elastic Scale
- Elastic scale is really an extension of a good HA architecture, as much of the same application architecture enables both.
- Design your system to support arbitrary scale at every tier of its architecture. Consider the front-end edge tier carefully, as it will be the single entry point for all users and can be a great place to offload work from all other tiers.
- For most people, manual scaling will precede "follow the workload" automatic scaling. Take steps to learn the system under manual scaling, and transition to automatic scaling when ready (a scaling-policy sketch follows the list).
- Similarly, HA levels of scale will precede web levels of scale. You might start small, but if you have architected correctly, scaling up should be less painful. If you hit walls in scalability, step back, look for the non-clusterable part of your architecture, and fix it.
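As a sketch of the manual-to-automatic transition, the policy below turns the decision you made by hand ("add a node when CPU runs hot") into a small rule. The ScalingApi interface, the metric, and the thresholds are all illustrative assumptions:

```java
// A hypothetical "follow the workload" scaling policy: once you trust the
// decision you have been making manually, run it on a timer instead.
public class AutoScaler {
    interface ScalingApi {
        int currentInstanceCount(String cluster);
        void setDesiredInstanceCount(String cluster, int count);
        double averageCpuUtilization(String cluster); // 0.0 .. 1.0
    }

    static void evaluate(ScalingApi api, String cluster, int minInstances, int maxInstances) {
        double cpu = api.averageCpuUtilization(cluster);
        int count = api.currentInstanceCount(cluster);
        if (cpu > 0.70 && count < maxInstances) {
            api.setDesiredInstanceCount(cluster, count + 1); // scale out under load
        } else if (cpu < 0.30 && count > minInstances) {
            // Scale in, but never below the HA minimum (three, per the rule above).
            api.setDesiredInstanceCount(cluster, count - 1);
        }
    }
}
```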
There are many technologies that can help you achieve the abilities above. While I didn't want to focus on technology in this article, we have found some of the NetflixOSS-based approaches to be very helpful starting points in achieving these abilities. Over time we expect the enablement team we've been building at IBM to continue to document the technologies we are using. Our goal is to be as open as possible about how we achieve our goals for our public cloud services, so that we can not only teach but also learn from others doing similar things. Stay tuned for more information.