Monday, February 10, 2014

Acme Air Cloud Native Code Details - Part 1

Previously, I blogged at a high level about the changes required to port Acme Air to be cloud native using IBM middleware, the IBM cloud, and NetflixOSS technology.  Since then I have gotten questions about how to see these code changes in more detail.  In my next few blog posts I'll try to give you code pointers to help you copy the approach. Note that you could always just go to the two source trees and do a more detailed compare. The github project for the original is here. The github project for the NetflixOSS enabled version is here.

First, even before we started NetflixOSS-enabling Acme Air, we worked to show that it could be run at web scale.  Compared to typical desktop and mobile applications, there were two basic things we did to make this possible.

Stateless Design

We focused on making sure every request was independent from the server's perspective.  That would allow us to handle any request on any server.  As classic JEE developers, this was pretty easy except for two commonly stateful aspects of JEE: application specific transient state and authentication state, both typically stored in session objects.

As a performance guy, the first was pretty easy to avoid.  I've spent years advising clients to avoid storing large amounts of application specific transient state in JEE sessions.  While app servers have spent considerable time optimizing memory to memory session state technologies, nothing can help performance when developers start overusing the session in their applications.  App server memory to memory replication is problematic when you want to move to a web scale implementation, as it is usually server specific rather than user specific and requires semi-static configuration across a small cluster.  Database backed replication is problematic at web scale as it requires a database that can itself scale to web levels, which wasn't the case when session replication was designed into most application servers.

Web 2.0 and AJAX have helped with application specific transient state.  It is now rather straightforward to offload such state to the web browser or mobile application.  All you need to do is send the client what it needs to store and let the client send back what is pertinent to each specific request.  Of course, like server side session data, be careful to keep the size of such transient state to a bare minimum to keep request latency in check.  For state that doesn't belong on the client, consider a distributed data cache keyed on the user with time to live expiration.  Unlike server side session data, be sure to validate all data the client sends back to the server.
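As a concrete illustration of that last point, here is a minimal sketch of a user-keyed store with time to live expiration.  The DistributedCache interface below is purely hypothetical; a real implementation would use whatever API your data grid (for example, WebSphere eXtreme Scale) provides for the same pattern.

// Illustrative only: DistributedCache is a hypothetical client interface for a
// user-keyed data grid with per-entry expiration; a real grid exposes its own API.
interface DistributedCache {
  void put(String key, String value, int ttlSeconds);
  String get(String key);
}

public class TransientStateStore {
  private final DistributedCache cache;
  private static final int TTL_SECONDS = 30 * 60; // drop idle state after 30 minutes

  public TransientStateStore(DistributedCache cache) {
    this.cache = cache;
  }

  // Key the state by user so any server instance can read it on the next request
  public void save(String customerId, String stateJson) {
    cache.put("transient:" + customerId, stateJson, TTL_SECONDS);
  }

  public String load(String customerId) {
    return cache.get("transient:" + customerId);
  }
}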

The second was a bit harder to address.  The state of authentication to the services being offered is certainly something we cannot trust the client to supply. We implemented an approach similar to what we've seen from other online web scale properties.  When a client logs into the application, we generate a token and send it back as a cookie to the client while storing the token in a web scale data tier designed for low latency - specifically WebSphere eXtreme Scale.  Later, on all REST calls to that service, we expect the client to send back the token as a cookie, and before we call any business logic we check that the token is valid (that we created it and that it hasn't timed out).  In JEE we implemented this check as a ServletFilter to ensure it runs before any business logic in the called REST service.  Our implementation is security naive, but recommended authentication protocols would follow a similar approach.

public class RESTCookieSessionFilter implements Filter {
  // customerService, LOGIN_USER, init() and destroy() are omitted here for brevity;
  // see the full class in the project source.

  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain) throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    HttpServletResponse response = (HttpServletResponse) resp;

    // Look for the session token cookie we handed back at login time
    Cookie cookies[] = request.getCookies();
    Cookie sessionCookie = null;
    if (cookies != null) {
      for (Cookie c : cookies) {
        if (c.getName().equals(LoginREST.SESSIONID_COOKIE_NAME)) {
          sessionCookie = c;
          break;
        }
      }
    }

    String sessionId = "";
    if (sessionCookie != null)
      sessionId = sessionCookie.getValue().trim();

    // Check the token against the session store: did we create it and is it still live?
    CustomerSession cs = customerService.validateSession(sessionId);
    if (cs != null) {
      request.setAttribute(LOGIN_USER, cs.getCustomerid());
      // session is valid, let the request continue to flow
      chain.doFilter(req, resp);
      return;
    }
    else {
      // no valid token, reject the request before any business logic runs
      response.sendError(HttpServletResponse.SC_FORBIDDEN);
      return;
    }
  }
}


You can find this code in RESTCookieSessionFilter.java and see it added to the REST chain in web.xml.
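For completeness, the login side of this flow looks roughly like the sketch below.  This is not the actual LoginREST code; the cookie name value and the SessionStore interface are illustrative stand-ins for generating the token, writing it to the low latency data tier, and handing it back to the client as a cookie.

// A simplified sketch of the login side (not the actual LoginREST implementation):
// after the password check succeeds, mint an opaque token, store it in the
// low latency session store, and hand it back to the client as a cookie.
import java.util.UUID;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletResponse;

public class LoginSketch {
  public static final String SESSIONID_COOKIE_NAME = "sessionid"; // assumed name

  // SessionStore.storeSession is a hypothetical stand-in for writing the token
  // (keyed to the customer, with a timeout) into the data tier
  interface SessionStore {
    void storeSession(String token, String customerId);
  }

  public void onSuccessfulLogin(String customerId, HttpServletResponse response,
                                SessionStore sessionStore) {
    String token = UUID.randomUUID().toString();
    sessionStore.storeSession(token, customerId);

    Cookie sessionCookie = new Cookie(SESSIONID_COOKIE_NAME, token);
    sessionCookie.setPath("/");
    response.addCookie(sessionCookie);
  }
}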

As we'll see in follow-on articles, the change to a stateless service tier was important in enabling elastic scaling with auto recovery. Once you free your service from having to remember state, any server instance that exists (or even ones that didn't exist when the first request was handled) can handle the next request.

Partitioned and Scale Out Data Tier Design

In an application we used before Acme Air for web scale experiments, we learned how to design our data storage to be well partitioned.  The previous application worked against a monolithic relational database, and we found that the definition of its data access services wasn't ready for a partitioned NoSQL storage engine.  We learned that to fix this you need to consider your data model more carefully and understand the impact it has on the data access service interfaces in your application.

The previous application had three aspects, and we learned that it was easy to convert two of the three to singly keyed sets of related data tables.  In these two aspects we further found cases where secondary indexes were used but really were not needed.  Specifically, we found places where we relied on a secondary index when the primary key was known but not passed down to the data service, because the data service tier had been designed assuming index scans and joins were possible and efficient on a single database system.  We remedied this by redesigning our data access services to always require that callers provide the primary key.  You can see this design reflected in Acme Air's data service interfaces.  Note that you will always see the partition key for each data type passed on all data service calls.
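To make that concrete, here is an illustrative shape for such an interface (not the literal Acme Air code): every method carries the partition key for the data it touches, so the caller can never force a cross-partition scan.

// Illustrative only: the point is that every call names the partition key
// (the customer id here), so lookups never need a secondary index or join.
public interface CustomerDataService {

  // Minimal placeholder for whatever the real data type carries
  class Customer {
    public String customerId;
    public String profileJson;
  }

  Customer getCustomer(String customerId);                   // keyed by customer id
  void saveCustomer(String customerId, Customer customer);   // caller must supply the key
}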

For the third aspect of the previous application, there was data that needed to be queried by two different keys (consider bookings that need to be queried by passenger id as well as flight instance id).  In partitioned data storage there are two ways to deal with this.  You can store the data once under one key and implement a user maintained index for the second key.  Alternatively, you can duplicate the data (denormalizing based on which type of access is required).  Compared to a user maintained index, duplication trades extra storage for a guaranteed single network access for the data regardless of which key is being queried.
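Here is a hedged sketch of the duplication option (the KeyValueStore interface and names are illustrative, not the Acme Air booking service): the same booking record is written under both keys so that either access path is a single, partition-local read.

// Illustrative only: write the booking twice, once keyed by passenger and once
// keyed by flight instance, trading storage for a single network hop on either
// access path. KeyValueStore is a hypothetical partitioned store client.
interface KeyValueStore {
  void put(String partitionKey, String rowKey, String value);
  String get(String partitionKey, String rowKey);
}

public class BookingWriter {
  private final KeyValueStore bookingsByPassenger;
  private final KeyValueStore bookingsByFlight;

  public BookingWriter(KeyValueStore byPassenger, KeyValueStore byFlight) {
    this.bookingsByPassenger = byPassenger;
    this.bookingsByFlight = byFlight;
  }

  public void addBooking(String passengerId, String flightInstanceId,
                         String bookingId, String bookingJson) {
    // Copy 1: passenger-facing queries by passenger id
    bookingsByPassenger.put(passengerId, bookingId, bookingJson);
    // Copy 2: gate-agent style queries by flight instance id
    bookingsByFlight.put(flightInstanceId, bookingId, bookingJson);
  }
}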

In Acme Air today, we have implemented carefully keyed, partitioned data storage.  We haven't yet added operations that would key the same data differently.  If we added a view of the booking data per flight as required by gate agents (so far bookings are accessed purely by passenger), it would require us to implement user maintained indexes or duplicate the data.

Much like there were side benefits to a stateless mid tier design, there is an elastic scaling and auto recovery benefit to using a purely partitioned, non-monolithic data tier.  Once the data service holding your data is partitioned, NoSQL technologies can easily replicate and move that data around within a data cluster as required when instances fail or need to scale out.

In the next post, I'll move on from this foundation of stateless mid tier services and a partitioned, scale out data tier to cover topics such as breaking the application into micro-services and the containers that make highly available instance registration and cross instance routing easier.