How will enterprise data architecture evolve over the next five years?
In 2013, I spent a lot of time talking about Hadoop’s development towards being a central destination for data. Hadoop may enter an organization for a specific use case, but data attracts data. Once in the door, Hadoop tends to become a center of gravity. This effect is amplified because the appeal of big data lies not just in the size of the data, but in the agility it brings to an organization.
However, to exist feasibly in this way, Hadoop needs more than just a data-crunching engine and a small army of willing Java programmers. It must become an enterprise platform that supports application development. By the end of 2013, the major Hadoop vendors had all formulated a platform strategy, be it the Cloudera Enterprise Data Hub or the Hortonworks Data Platform.
But what do the big data vendors mean by this?
The data lake dream is of a data-centered architecture in which silos are minimized and processing happens with little friction in a scalable, distributed environment. Applications are no longer islands: they exist within the data cloud, taking advantage of high-bandwidth access to data and scalable computing resources. Data itself is no longer constrained by initial schema decisions, and can be exploited more freely by the enterprise.
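That freedom from initial schema decisions is often described as "schema-on-read": store raw records as they arrive, and impose structure only when an application reads them. A minimal sketch of the idea in plain Python — the records and field names here are invented for illustration, not drawn from any particular Hadoop tool:

```python
import json

# Raw events stored as-is in the lake; no schema is enforced at write time.
raw_store = [
    '{"user": "ana", "action": "login", "ts": 1}',
    '{"user": "ben", "action": "purchase", "amount": 9.99, "ts": 2}',
]

def read_with_schema(store, fields):
    """Apply a schema at read time: each consumer picks the fields it
    needs, receiving None where a record lacks a field."""
    for line in store:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two applications read the same raw data through different schemas.
audit_view = list(read_with_schema(raw_store, ["user", "action"]))
billing_view = list(read_with_schema(raw_store, ["user", "amount"]))
```

The point is that neither consumer forced its schema on the other at write time; each projection is decided at read time.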
I call it a dream, because we’ve a way to go to make the vision come true. It is, however, an accessible dream.
I’ve set out to describe the four levels of Hadoop maturity that lead us to the dream of the data lake. From these levels we can see where today’s Hadoop vendors are, and understand where our own organizations sit.
Four Levels Of Data Lake Maturity
(1) Life Before Hadoop
- Applications stand alone with their databases
- Some applications contribute data to a data warehouse
- Analysts run reporting and analytics in data warehouse
(2) Hadoop Is Introduced
- Applications contribute data to Hadoop
- Hadoop runs batch MapReduce jobs
- Hadoop used for ETL into warehouse or analytic databases
- Hadoop data reintroduced into applications
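The batch MapReduce and ETL workloads at this level follow a simple processing model: map over raw records to emit key–value pairs, shuffle by key, then reduce each group. A framework-free sketch of that model in plain Python — the log lines and field layout are invented for illustration:

```python
from collections import defaultdict

# Raw application log lines landed in Hadoop (invented sample data).
raw_events = [
    "2014-01-03,checkout,42.50",
    "2014-01-03,checkout,10.00",
    "2014-01-04,signup,0.00",
]

def map_phase(line):
    """Map: parse a raw line into a (key, value) pair."""
    date, event, amount = line.split(",")
    yield (date, event), float(amount)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate each key's values -- here, a daily total."""
    return key, sum(values)

mapped = (pair for line in raw_events for pair in map_phase(line))
results = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# results holds per-(date, event) totals, ready to load into a warehouse.
```

In a real deployment the map and reduce functions run in parallel across the cluster and the shuffle happens over the network, but the shape of the computation is the same.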
(3) Growing The Data Lake
- Newly built systems center around Hadoop by default
- Applications use each other’s data via Hadoop
- Interactive use of Hadoop grows as in-Hadoop databases are deployed (e.g. Impala, Greenplum, Spark)
- Hadoop becomes a default data destination, governance and metadata become important
- Data warehouse use becomes the exception, where legacy or special requirements dictate
- External data sources integrated via Hadoop
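The shift at this level is from batch jobs to interactive SQL over shared tables: one application’s data becomes directly queryable by another. As a stand-in sketch — using Python’s built-in sqlite3 in place of an in-Hadoop engine like Impala, with invented table and column names — the experience looks like ordinary ad-hoc SQL:

```python
import sqlite3

# sqlite3 stands in here for an in-Hadoop SQL engine: the point is
# interactive queries over shared tables rather than batch jobs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, page TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("ana", "/home"), ("ana", "/pricing"), ("ben", "/home")])

# Data written by one application, queried interactively by another.
rows = conn.execute(
    "SELECT page, COUNT(*) AS views FROM page_views "
    "GROUP BY page ORDER BY views DESC"
).fetchall()
# rows -> [("/home", 2), ("/pricing", 1)]
```

The governance and metadata concerns above follow directly from this: once many applications query each other’s tables, someone has to track what those tables mean and who may read them.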
(4) Data Lake And Application Cloud
- New applications are built on a Hadoop application platform around the data lake
- Hadoop matures as an elastic distributed data computing platform, for both operational and analytical functions
- Data lake adds security and governance layers
- Data availability increases, application deployment time decreases
- Some apps still have special or legacy needs and execute independently
A vision is necessarily forward-looking. In reality, many organizations are only just starting to kick the tires of Hadoop. Of those enterprises using Hadoop, most are in the early stages of this process at level (2), with a few front-runners living at level (3). Those front-runners are big enough to face, and invest in solutions to, challenges that the vendors haven’t yet stepped up to: managing provenance, data discovery, and fine-grained security.
Does anybody live the dream fully yet? Arguably, yes: the internal infrastructures developed at Google and Facebook certainly provide their developers with the advantages and agility of the data lake dream.
Big data software vendors themselves are ushering in the early stages of level (3), with the focus for 2014 being on application development. We see new companies such as Continuuity and Pivotal addressing the developer experience for big data.
Regardless of where you are now, take some time to look to the future. We’re on a journey towards connecting enterprise data together. As business becomes increasingly digital, access to data will become a critical priority, as will speed of development and deployment. The data lake is a dream that can match those demands.