Tuesday, November 26, 2013

Books from the trenches/bowels that focus on HA and devops

I recently finished both "Release It!" and "The Phoenix Project".

Up until recently I have been mostly a "shrink wrapped" software engineer.  What I mean by this is that I have worked on multiple WebSphere related software releases that were shipped to consumers in binary form.  These releases had fix packs, and on the application server alone I worked through at least five major versions over 10 years.  Shipping such general purpose products is in some ways harder than using them for a specific application, as we have to consider release-to-release compatibility and balance new features across a wide variety of use cases.

However, consumers of the releases had it far harder than us "shrink wrap" engineers in another important way.  The consumers actually have to operate that software, and the applications on top of it, in 24/7 environments.  I have had my share of helping customers when things go wrong, including many onsite visits and phone triages of major outages.  While I have helped in critical performance related situations and system outages with many customers worldwide, I never carried a pager in my career (something my stepfather taught me to avoid based on his career in IT).  In only ever being involved in major outages and never maintaining systems 24/7, I have missed out on experiences that are very common to the majority of IT users: working on making systems run better, working to make sure outages don't occur in the first place, and maintaining those systems daily.  If I had carried a pager, I think I would have learned what these books teach and would have been interested in reading them to help me be better at maintaining my systems.  In my new role, working with IBM public cloud services, the experience documented in these books is invaluable (certainly worth the price of the books many times over).

These two books deal with these topics based on "in the trenches" experience.  They are different than typical technical manuals in that they teach through live examples from the real world (or fictionalized real world experiences in the case of The Phoenix Project).  As such, they aren't as boring as "textbook" coverage of similar material.

The first book "Release It!" was recommended to me by folks at Netflix.  The book covers the topics of stability and capacity first.  Both of these sections start with a real world example of where people failed to provide for a stable system or built one that couldn't handle capacity spikes.  The book then covers common anti-patterns that will cause your system to fail.  Finally the book wraps up each chapter with patterns that can be applied to your architecture to give your system far better stability and handling of capacity.  The next section covers lessons learned about networking, security, availability and administration, allowing you to think about important concepts when designing and implementing any large scale distributed system.  The final section covers operations.  It gives a great example of how, when done right, operations can be very smooth even under pretty bad scenarios, which helps explain the follow-on discussion of why transparency (logging, monitoring, configuration management) and automation (devops) should be applied.  I can see clearly how Netflix and NetflixOSS have built high quality implementations of many of the concepts discussed in this book (and beyond):
  • Stability pattern of handshaking
    • Eureka and ELB healthchecks
  • Stability pattern of circuit breakers/fail fast/timeouts
    • Hystrix
  • Stability pattern of decoupling middleware
    • Microservices architecture (Karyon/Ribbon)
  • Testing for stability
    • Simian Army
  • Capacity patterns of leveraging a shared nothing middle tier and partitioned data tier
    • Cloud native and Cassandra
  • Capacity patterns of leveraging caching
    • EVCache
  • Making releases easy through consistent devops
    • Asgard
  • Operational transparency
    • Edda and Servo and Turbine (and Hystrix again)
  • Configuration management
    • Archaius
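
To make the circuit breaker / fail fast pattern from the list above a little more concrete, here is a minimal sketch in Python.  This is my own illustration and not how Hystrix is implemented; the class name, the parameters (max_failures, reset_after) and the injected clock are all assumptions made up for the example.  The idea is simply that after enough consecutive failures the breaker "opens" and rejects calls immediately instead of letting callers pile up waiting on a dependency that is already in trouble.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    # Hypothetical helper for illustration only; names and defaults are assumptions.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # seconds to stay open before a trial call
        self.clock = clock                # injectable so the sketch is easy to test
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not touch the struggling dependency at all.
                raise CircuitOpenError("failing fast; dependency presumed down")
            # Cool-down has elapsed: "half-open", let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example breaker.call(fetch_profile, user_id) where fetch_profile is whatever function talks to the dependency, and treat CircuitOpenError as the signal to return a fallback.  A real implementation would also need thread safety and metrics, which this sketch leaves out.
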
My only complaint with "Release It!" is that it could benefit from a refresh that considers more recent advances in cloud technology and devops.  However, the examples are still relevant.  For example, there are samples that show how relational databases, and Java code accessing them through EJBs, could result in a completely blocked thread pool many tiers away from the database that is the source of the problem.  While these technologies are less the "norm" these days, the underlying problem and diagnosis are still applicable today.  However, there are now more solutions in this space (as shown by NetflixOSS and the rise of cloud native application architectures) than there were in the past.
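
The blocked thread pool problem described above is not specific to EJBs, so here is a small sketch of the timeout idea in Python rather than Java.  Everything in it is hypothetical: slow_query stands in for a database call that hangs, and handle_request stands in for a request handler in an upstream tier.  The point is only that the caller bounds how long it will wait and falls back, instead of blocking forever behind a stuck dependency.

```python
import concurrent.futures
import threading


def slow_query(release):
    # Stand-in for a database call that hangs until the database recovers.
    release.wait()
    return "rows"


def handle_request(pool, release, timeout_s):
    # Submit the call to a bounded worker pool and wait only a limited time.
    future = pool.submit(slow_query, release)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The caller gives up and degrades gracefully instead of hanging.
        return "fallback"
```

Note that the timeout here only bounds the caller's wait.  The worker thread stays occupied until the underlying call returns, so in practice you would also want a timeout on the database call itself, or the pool can still fill up with stuck workers.
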

Given that "The Phoenix Project" is a story, it is worth noting that there are spoilers below, so stop reading if you don't want to learn what happens in the end.

The second book "The Phoenix Project" is the story of a mid-level manager being promoted to VP of IT operations, reporting to the CEO, a day after the CIO and the overall manager of IT operations (the hero's previous boss) were canned.  The story then lays out the political landscape of the company, which (all the way up to the CEO) is set up at the time in a way that forces IT to fail, taking the company down in flames.  Similarly, you learn of the state of IT within the company, which pits development against IT and parts of IT against itself.  Over the course of the book, little by little, the hero learns how to "fix" IT, integrating it across the company in a supportive way instead of leaving it as the failure prone cost center it started as.  A prospective board member who is quite familiar with lean methods in manufacturing teaches the hero to start running IT like an efficient manufacturing plant.  In doing so, the hero learns not only to optimize the IT organization for efficiency and quicker turnaround on standard IT needs, but also how to tie IT to improving the top line business goals.  The hero learns why optimizing the whole of IT across the business, development and ops as a single team (vs. dealing with "code" pigs thrown half alive over a wall to ops) can accomplish amazing, industry leading results.

I doubt I would have read a book that dryly covers ITIL, Kanban, lean and even devops.  The story, and the soap opera behind the story, of the resurrection of the phoenix from the ashes of a broken IT organization "tricked" me into not only finishing the book but finding time to finish it in a few nights of reading.  My favorite part of the story was the drunken night with John.  I'm not too proud to admit that I have been there and done that when things looked bleak on projects.  It's nice to see the human side of how living in the IT business affects your life.  I can also relate to the hero's tension between passion for his job and company and wanting to get home to the family at night.

I highly recommend both books.  Both have motivated me to read a few related "textbooks" that cover similar topics.  Well done Michael Nygard, Gene Kim, Kevin Behr, and George Spafford.  I hope everyone who reads this review puts both books onto their tablet for their next plane ride or vacation.