What Will This Bubble’s Legacy Be? Open Source Big Data and Analytics Tools
April 22, 2011
First of all, let’s leave aside the issue of whether we’re in a bubble or not, and just assume that we are. Ashlee Vance has an excellent piece in Business Week looking at one tragic aspect of this bubble: too many mathematicians are flocking to Silicon Valley to build advertising platforms at companies like Google, Facebook and Zynga. Former Facebook employee and Cloudera co-founder Jeff Hammerbacher is quoted saying, “The best minds of my generation are thinking about how to make people click ads. That sucks.”
I couldn’t agree more. But I disagree with the subheading of the piece: “Tech bubbles happen, but we usually gain from the innovation left behind. This one–driven by social networking–could leave us empty-handed.” Thanks to this bubble, we’ve already got Apache Hadoop, Apache Cassandra, Membase and many other free, open source tools for working with big data. If the bubble popped tomorrow, researchers in many fields would still have all of these tools.
Strangely, Vance doesn’t mention Hadoop once in the article – even though an enterprise distribution of Hadoop is the main product Cloudera sells. Hadoop is based on research papers published by Google, and Yahoo funded and continues to fund much of its development.
Facebook open-sourced its database system Cassandra in 2008. Zynga was one of the original contributors to Membase, which was released in 2010. Twitter open-sourced FlockDB. LinkedIn open-sourced Project Voldemort. VMware sponsors Redis.
Venture-backed startups such as Cloudera, DataStax and Couchbase have emerged to commercialize some of these projects, and these companies contribute back to the original projects.
And it’s not just run-off from Internet giants fueling this movement. Revolution Analytics, another venture-funded startup, develops a free version of its distribution of the R statistical programming language. It recently announced that it would provide its enterprise edition free to those competing in Kaggle’s data science competitions. Neo Technology open-sourced its graph database Neo4j, Basho open-sourced Riak and 10gen open-sourced MongoDB.