Bradford Cross recently wrote a great article on TechCrunch about the Big Data Renaissance. If you haven’t read it yet, you should check it out.
Here at IA Ventures, much of what he touches upon is part of our core investing thesis: that the Big Data Revolution is all about democratization. What was once the sole domain of large organizations is increasingly available to everyone, even the smallest startup. This revolution is being fueled simultaneously by technology and market trends: low-cost commodity hardware (multicore and GPGPU), cloud computing, and advanced algorithms that parallelize well, combined with exponentially increasing amounts of actionable data. The result is an ideal situation: technology push and market pull.
My gripe with his article is the assertion that “Distributed systems are about making trade offs and a move toward problem-specific solutions rather than one-size-fits-all stacks.” I contend that one-off deployments are the norm today not because the problems inherently require them or because of some fundamental property of distributed systems, but because of short-term necessity. These systems are so new that common foundational components have not yet emerged or matured, and they are so large that they push the envelope of what is possible with today’s technology. The result is that building distributed systems currently requires a lot of duct tape, baling wire, hand-tuning and careful optimization. For Big Data systems today there is no room for inefficiencies.
This custom development is not special to distributed systems. It is true of the early days of any technology. It starts when engineering or business needs are not met by traditional approaches. Visionary developers create new approaches and solutions that push the bounds of what is technologically possible. It takes tremendous effort to make these systems work at all, so they have to be problem-specific and hand-tuned. But things don’t remain custom forever. Over time abstractions are developed, optimizations are formalized, and simultaneously Moore’s Law makes it possible to accept a less efficient implementation in exchange for easier, more rapid development.
That’s where the computer industry was when people were writing assembly code, but modern languages, compilers, and Moore’s Law abstracted away much of the low-level stuff. That’s where storage and analytics were until the invention of the RDBMS. That’s the story of computer networking. The history of computing is all about evolving from hand-tuned systems that require visionary rock-star developers building problem-specific solutions to stacks of reusable components that take care of the standard parts, allowing organizations to add value and customization on top. This will be the story of Big Data distributed systems.
Today Pig and Hive are examples of components in an emergent Big Data Stack. Of course there is not going to be a single take-it-or-leave-it stack. Rather (just like with all tech stacks) it will consist of standard reusable components that can be mixed, matched and customized to fit one’s needs and problem requirements. The value-added customizations will generally sit on top. The stacks themselves will grow over time, commoditizing what was once hand-coded. There will be different stacks for different workload types. We are moving *from* problem-specific solutions *to* reusable stacks. These stacks will lead to the democratization of Big Data, making the insights and actionable information contained in the Big Data potentially available to companies of all sizes and skill levels. At least that’s what we’re betting on.