January 28, 2014

Do I need a Hadoop cluster?

Hadoop is a distributed MapReduce architecture inspired by Google's early-2000s work on distributed systems running on commodity hardware. It has since been adopted by corporations large and small and has become the preferred technology stack for big data applications.

Cloudera is the most notable example of a company that has built a successful business around the design, installation, and servicing of Hadoop clusters.

Hadoop is most suitable for extremely large datasets (terabytes of data) and non-interactive analytics. Several cloud-hosted Hadoop offerings now exist and have drastically reduced the upfront investment in technology. However, data scientists and IT professionals are still needed to properly configure and operate a Hadoop cluster.

Even more importantly, developing analyses requires familiarity with the "map and reduce" parallel programming paradigm, which is very different from the SQL that is far more widespread among business analysts. Technologies like Hive (open source, part of the Apache project), Impala (by Cloudera), Presto (open-sourced by Facebook) and Splice Machine let you run SQL on Hadoop clusters, but they still do not deliver the ease of use and interactivity that business analytics calls for.
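To get a feel for the gap, here is a minimal sketch of the classic word count written in the "map and reduce" style in Python. The local driver and plain-text input are assumptions for illustration; on a real cluster, Hadoop would run the two phases on different machines and sort the intermediate pairs between them. The Hive equivalent is a one-line query, shown in the comment.

```python
# wordcount.py -- word count in the "map and reduce" style (a sketch).
# The equivalent in a SQL-on-Hadoop engine such as Hive is one line:
#   SELECT word, COUNT(*) FROM words GROUP BY word;
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every token seen.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: sum the counts for each word. Input must arrive
    # sorted by word (on a cluster, Hadoop's shuffle guarantees this).
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local stand-in for the cluster: sorted() plays the role of the shuffle.
    for word, total in reducer(sorted(mapper(sys.stdin))):
        print("%s\t%d" % (word, total))
```

Even in this toy case, a single GROUP BY has become two programs plus a shuffle step; that mental overhead is exactly what the SQL-on-Hadoop layers above try to hide.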

We have built MegaPivot to make it easier to build large datasets and run interactive queries on them. Creating a dataset is as simple as importing CSV files from a Dropbox folder, and interactive queries can be built with a drag-and-drop interface that borrows well-known concepts from Microsoft Excel pivot tables.
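For readers who have not used pivot tables, the sketch below shows the underlying idea in Python with pandas; the columns and figures are made up for illustration. MegaPivot exposes the same rows/columns/values choices through drag and drop instead of code.

```python
# A toy illustration of what a pivot table computes.
# The columns (region, product, revenue) are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US", "US"],
    "product": ["A",  "B",  "A",  "A",  "B"],
    "revenue": [100,  250,  300,  120,   80],
})

# Rows = region, columns = product, cells = summed revenue --
# the same three choices you drag into place in a pivot-table UI.
pivot = sales.pivot_table(index="region", columns="product",
                          values="revenue", aggfunc="sum")
print(pivot)
```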

MegaPivot is a web-based big data analytics application built on Google BigQuery infrastructure.

Build pivot tables on billions of rows in seconds.
Easily import CSV files of any size directly from your Dropbox folder.
Share your reports with your colleagues.

Sign Up with Google