Recently, I have had many interesting conversations with colleagues, clients and partners about the uptime and availability of Cloud services offered by giants like Microsoft and Google, and by smaller (or specialist) players such as FpWeb.

The ever-increasing occurrence of these conversations has got me pondering availability on a philosophical level, rather than solely in the context of how we can architect systems that are 100% available.  Can we really build a Cloud that is 100% available?  If we can, can it be done at a reasonable price?  Even if it can be done, should we do it?

Much of my thinking has been in and around the concepts of maintenance and resilience.  To my mind, maintenance and resilience have the greatest individual impact on how “available” Cloud services can be.

Let’s take a look at these two concepts individually.

Given the base assumption that we are working in a Microsoft-powered world (the services I am involved in are solely for Microsoft systems at this time, hence the slant), maintenance is, unfortunately, frequently required to keep systems healthy and optimal.  Patches, updates and Service Packs are commonplace and frequent, and the architecture of Microsoft operating systems and applications is not really well suited to live updates on the fly.  We can mitigate downtime (such as reboots) through the effective use of techniques available to us in our virtualisation platform (such as live migration), or through techniques made available to us at the platform/application layer by making use of HA/cluster configurations (such as multi-server SharePoint farms, multi-node SQL Server clusters or AlwaysOn Availability Groups), but all of this is generally moot if the lines in and out of our datacentres fail.  Maintenance is not really the problem to solve.
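
To illustrate the point, here is a rough sketch of the rolling-maintenance pattern that farm and cluster configurations make possible.  It is Python with entirely hypothetical drain/patch/rejoin helpers standing in for whatever your virtualisation or platform layer actually provides; the point is simply that only one node is ever out of service at a time, so the farm as a whole keeps serving throughout.

```python
# A minimal sketch of a rolling-maintenance loop across a farm of servers.
# The drain(), apply_patches() and rejoin() helpers are hypothetical stand-ins
# for whatever your virtualisation or platform layer actually provides
# (live migration, removing a node from the load balancer, and so on).

FARM = ["web-01", "web-02", "web-03", "web-04"]


def drain(node: str) -> None:
    """Stop sending new requests to the node and let in-flight work finish."""
    print(f"draining {node}")


def apply_patches(node: str) -> None:
    """Install updates and reboot the node (the part that would otherwise mean downtime)."""
    print(f"patching and rebooting {node}")


def rejoin(node: str) -> None:
    """Put the node back into rotation once it reports healthy."""
    print(f"{node} back in service")


def rolling_maintenance(farm: list[str]) -> None:
    # Only one node is ever out of service at a time, so the farm keeps
    # serving requests, provided the remaining nodes can carry the load.
    for node in farm:
        drain(node)
        apply_patches(node)
        rejoin(node)


if __name__ == "__main__":
    rolling_maintenance(FARM)
```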

If we consider resilience, we open up a much more complex problem space.  Resilience needs to exist in every tier of a complex system for the system as a whole to be resilient.  Physical server hardware (the lights-and-clockwork bit) needs to be resilient, with multi-processor, multi-power-supply, multi-hard-disk and similar techniques being employed.  Farms and clusters need to be implemented at a platform level to provide resilience of applications, and higher levels of resilience (such as replication of data between datacentres) need to be employed to provide resilience in Cloud contexts.
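
As a rough illustration of why redundancy within each tier matters, here is a tiny back-of-the-envelope calculation.  The figures are invented and the model naively assumes failures are independent, but it shows how quickly adding redundant components shrinks the theoretical downtime of a single tier.

```python
# Back-of-the-envelope availability arithmetic. This assumes component failures
# are independent, which real systems rarely honour, so treat the numbers as
# illustrative rather than predictive.

def redundant(component_availability: float, copies: int) -> float:
    """Availability of a tier built from `copies` redundant components:
    the tier is only down when every copy is down at the same time."""
    return 1 - (1 - component_availability) ** copies


if __name__ == "__main__":
    single_server = 0.99  # roughly 3.7 days of downtime a year
    print(f"one server:        {single_server:.4%}")
    print(f"two-server farm:   {redundant(single_server, 2):.4%}")  # ~53 minutes a year
    print(f"three-server farm: {redundant(single_server, 3):.4%}")  # ~30 seconds a year
```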

This all becomes very complex, expensive and hard to implement and manage at every level.

It is worth noting that neither resilience nor availability is an essential characteristic of Cloud Computing (as defined by NIST here); they have simply become associated with Cloud computing because, I believe, Cloud vendors have pushed them as a benefit of their services.

Is this kind of dumb?  Maybe.

If you consider Cloud Computing on a global scale (after all, datacentres and end clients could literally be on opposite points of the planet), offering a high level of availability as an intrinsic part of your service is potentially quite risky, right?  You have control over your datacentres (we hope), but once things get out onto the information superhighway, it’s anybody’s guess as to whether connections can be maintained, and thus whether your customer can consume your service 100% of the time.  Content Delivery Network (CDN) technologies (such as Akamai) can help with these types of risks, but even then, the pipe between datacentre and client is largely in the hands of nobody in particular.

Thinking about recent high-profile issues with Public Cloud services, Microsoft (Office 365), Amazon (EC2, AWS) and Google (GApps) have all suffered major outages in recent times that were as likely to have been caused by non-datacentre issues as datacentre issues.

By way of example, the recent outage of Office 365 caused by a “major power blackout in Southern California” was unlikely to be a datacentre issue per se.  It’s a nailed-on certainty that every Microsoft datacentre has enough power-generating capability of its own to operate for a considerable length of time without mains power from the grid.  But even with on-site power generation, all the on-site datacentre resilience in the world won’t keep your service available if the infrastructure that your datacentre connects to outside of your facility is unavailable for some reason.

I struggle with this, I really do.  Apparently, the non-availability of a single datacentre (of the many that Microsoft operates) resulted in a total outage of Office 365 (and other services, including SkyDrive and Hotmail) for many hours.  How can this be?  The coupling of two established and relatively simple technologies (datacentre replication and global load-balancing) would surely prevent such an occurrence?
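
The coupling I have in mind is conceptually straightforward.  The toy sketch below (with hard-coded health results and made-up endpoints, and no relation to how Microsoft actually operates Office 365) shows the idea: replicate the service to several datacentres, health-check them, and simply stop directing clients at a site that has gone dark.

```python
# Toy illustration of global load-balancing across replicated datacentres.
# The health results below are hard-coded; a real global load balancer would
# probe each site continuously and answer DNS queries (or route traffic)
# based on which sites are currently reachable.

DATACENTRES = {
    "us-west": {"healthy": False, "endpoint": "usw.example.com"},  # the "blacked out" site
    "us-east": {"healthy": True,  "endpoint": "use.example.com"},
    "europe":  {"healthy": True,  "endpoint": "eu.example.com"},
}


def resolve(datacentres: dict) -> str:
    """Return the endpoint of the first healthy replica, or raise if none remain."""
    for name, site in datacentres.items():
        if site["healthy"]:
            return site["endpoint"]
    raise RuntimeError("total outage: no healthy datacentre left")


if __name__ == "__main__":
    # With replication in place, losing one site should simply steer
    # clients to another replica rather than taking the service down.
    print("clients directed to:", resolve(DATACENTRES))
```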

This is a good example of the vast and complex problem space associated with the delivery of mass-market Cloud services.

The cause of this problem, as I see it, is that “the Cloud” is not a single entity or product; it’s a collection of entities, many of which are outside the control of Cloud vendors.  If we consider that three of these entities make up the delivery of a Cloud service: the vendor datacentre, internet delivery and the client environment, then we already have 66% of the service delivery outside the control of the Cloud service vendor.
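
To put some illustrative numbers on that (these are invented figures, not measurements of any real provider), the availability the client actually experiences is roughly the product of the availability of every link in that chain, so even a notionally perfect datacentre cannot deliver a perfect end-to-end service.

```python
# Illustrative only: end-to-end availability is the product of each link in the
# delivery chain, so a flawless datacentre still cannot guarantee 100% to the client.

vendor_datacentre = 1.000    # the vendor's "100% available" promise
internet_delivery = 0.998    # carriers, peering, DNS, CDN ... nobody specific in charge
client_environment = 0.995   # office connectivity, proxies, local kit

end_to_end = vendor_datacentre * internet_delivery * client_environment
print(f"availability as the client sees it: {end_to_end:.3%}")  # roughly 99.3%
```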

How can a 100% availability guarantee be made given these circumstances?  Vendors will argue that their service is guaranteed to be 100% available from their datacentre, but this is not necessarily a guarantee of availability of the service from the perspective of the client.

I don’t like posts to get too long, so I’m calling time on this one, but in the coming weeks I’ll philosophise further on additional questions such as:

  • Does the overall, wider set of benefits associated with the Cloud outweigh the possibility of failure, error or outage?
  • Do SLAs really provide anything other than a “better to ask for forgiveness rather than for permission” false safety net?

more to follow…