There’s little argument against the proposition that Google should be the biggest and most effective public cloud vendor. They have the scale, the networks, the operational experience and are comfortable with commodity-like economics – everything points to them being the one to beat. But then, time and again, they make serious missteps – previously it was a Google exec’s ill-advised comments about the long-term prospects for their cloud platform.
Today it is the news, first reported by the sleuths at The Register, that Google discovered a potentially catastrophic bug, one that could delete users’ virtual disks. It seems that a flaw in Google’s code meant that one of its server migration commands could have “erroneously and permanently” deleted disks that were attached to virtual instances. All credit to the Google Compute Engine team for proactively warning users of the bug and advising that it had been fixed. But it will undoubtedly cause jitters among customers. According to the email:
We discovered a serious issue with the gcutil moveinstances command: with the release of the new auto-delete feature for persistent disks, there is an unintended interaction that can result in accidental deletion of the disk(s) of the instance being moved.
Essentially what it means is that if a GCE customer used the command line to move an instance to another zone, the persistent disk attached to that instance could have been obliterated before the transfer happened. The result being that all that data would have been permanently and irrevocably lost. Ouch indeed.
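For context, the dangerous operation was a single routine command. The sketch below shows roughly what such an invocation looked like, with a snapshot taken first as a defensive guard – note that the flag names and zone values here are approximations of the old gcutil CLI (since superseded by gcloud) rather than verified syntax from the affected release:

```shell
# Defensive step: snapshot the persistent disk before any move,
# so the data survives even if the disk itself is destroyed.
# (Flag names approximate; treat this as a sketch, not a reference.)
gcutil addsnapshot my-disk-backup --source_disk=my-disk

# The affected command: moving an instance between zones.
# Under the buggy interaction with the new auto-delete feature,
# the attached persistent disk could be deleted before the
# transfer completed, permanently losing its contents.
gcutil moveinstances my-instance \
    --source_zone=us-central1-a \
    --destination_zone=us-central1-b
```

The point is how ordinary the command is: nothing in the invocation hints that data destruction is a possible outcome, which is exactly why the bug is so unnerving.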
Now of course Netflix taught us that cloud instances go down, and that customers need to be using tools that best help them “architect for failure”. But a bug like this goes beyond failure: it’s an example of an erroneous command that users would previously have had no visibility into. It’s the ultimate nightmare and the reason many IT practitioners eschew the cloud – because you can’t see it, you don’t know whether someone has inadvertently or intentionally done something dumb deep down in the code.
As I said, all credit to Google for fixing the bug and going public on this. But if this bug got through testing (for a time at least) how many more bugs lie undiscovered and waiting to pounce?
…a significant gut-wrenching price reduction to its main cloud storage services within the coming month.
But really, we’re not talking about price savings here. Sure, AWS has introduced the world to infrastructure that can be acquired easily and cheaply, but it hasn’t done so at the cost of robustness. While AWS customers need to plan for failure, and unplanned outages do occur from time to time, the AWS services work largely as advertised. A rogue bug making its way into a product is a shocking look, and one which is even less palatable as enterprises start to look to public cloud vendors for services.
I’ve long said that Google can set the cloud world on fire, but in order to do so it needs to move away from the Google modus operandi and think more like the legacy vendors it is dismissive of. That old saying about “no one ever being fired for buying IBM” exists for a reason. IBM and its ilk have shown over the years that what they might lack in innovation they more than make up for in robustness. It’s time for Google to look at its approach and accept that its customers won’t take second best: they need a truly enterprise-grade product that they can rely upon.