For Cassandra, Newer May Not Be Better
March 8, 2016
Featured article by Moshe Kranc, Chief Technology Officer at Ness
One of the leaders in the NoSQL database space is Cassandra. Originally developed at Facebook and released as open source in 2008, Cassandra combined the schema-less flexibility of Google’s BigTable with the reliability of Amazon’s Dynamo. The result was a database that supported schema-less data variety, tunable consistency, seamless failover and full master-master datacenter replication.
The knock on Cassandra in those early days was that it was hard for beginners to use. Unlike MongoDB, which worked immediately in a single node configuration out of the box, Cassandra required some configuration before it would run at all. Once that hurdle was passed, there was a myriad of parameters to tune, many of which required a deep understanding of how Cassandra worked internally.
Administrators began to appreciate the control these parameters enabled only as their databases grew. At 5 TB, MongoDB seemed to run out of steam, while Cassandra could be tuned to support hundreds of terabytes of data. As NoSQL programmers from that era would say: “Your first month with MongoDB is by far your best month, and your first month with Cassandra is by far your worst month.”
Cassandra’s data model also provided great flexibility along with great complexity. Since columns in a row are sorted by column name, Cassandra data modelers became adept at packing many concatenated field values into a single column name, and concatenating many column values into a single value. The result was highly compact and efficient, but often difficult to understand or explain.
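To make the packing technique concrete, here is a minimal sketch in plain Python. It does not use any Cassandra API; it only models a wide row as a set of columns kept sorted by name. The `WideRow` class, the `pack` helper, the ':' delimiter and the sensor data are all hypothetical, chosen for illustration.

```python
import bisect


class WideRow:
    """Toy model of a pre-CQL wide row: columns are kept sorted by name."""

    def __init__(self):
        self._names = []   # column names, always in sorted order
        self._values = {}  # column name -> column value

    def insert(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end):
        """Return (name, value) pairs with start <= name < end, in name order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_left(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


def pack(*fields):
    """Concatenate field values into one string (hypothetical ':' delimiter)."""
    return ":".join(fields)


row = WideRow()
# Column name packs (day, sensor); column value packs (temperature, status).
row.insert(pack("2016-03-08", "sensor42"), pack("21.5", "ok"))
row.insert(pack("2016-03-07", "sensor42"), pack("20.9", "ok"))
row.insert(pack("2016-03-08", "sensor07"), pack("19.2", "low-battery"))

# All readings for one day become a single range scan over the sorted names.
# ';' is the character immediately after ':' in ASCII, so it closes the prefix range.
march_8 = row.slice("2016-03-08:", "2016-03-08;")
```

Because the sort order is determined entirely by how the modeler builds the name, each row can be laid out for the query it serves. The cost is visible here too: nothing in the stored data says what the packed fields mean.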
To lower the barrier to entry for Cassandra users, DataStax, the commercial Cassandra vendor, introduced Cassandra Query Language (CQL), which made using Cassandra as easy as writing familiar relational SQL. But this simplicity came at a cost. First, CQL requires a schema, i.e., a column must be declared before it can be assigned a value in any row. DataStax decided that a schema was essential in order to enable users to communicate and understand a table’s structure. There is some merit to this argument, but it negates one of the fundamental pillars of Big Data: Variety, the ability to seamlessly support unstructured data.
The second big change in CQL was to limit the flexibility of column names and values. Instead of allowing arbitrary column names and values to support arbitrary grouping and sorting, CQL introduced fixed rules: all columns within a column family must be sorted by the same “group by” columns. (DataStax claims that more complex sorting strategies can be supported via the MAP type, but MAP only supports <key, value> pairs, and is not very efficient for querying.)
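The two restrictions can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not CQL or any driver API: the `Table` class and its `columns` and `clustering` parameters are hypothetical, partitioning is ignored, and a real CQL schema can still be changed later with ALTER TABLE.

```python
class Table:
    """Toy model of a CQL-style table: declared columns, one fixed sort order."""

    def __init__(self, columns, clustering):
        self.columns = set(columns)
        self.clustering = tuple(clustering)  # sort key shared by every row
        self.rows = []

    def insert(self, **row):
        # Restriction 1: a column must be declared before it can hold a value.
        undeclared = set(row) - self.columns
        if undeclared:
            raise ValueError(f"undeclared column(s): {sorted(undeclared)}")
        missing = [c for c in self.clustering if c not in row]
        if missing:
            raise ValueError(f"missing sort column(s): {missing}")
        # Restriction 2: every row is ordered by the same declared columns.
        self.rows.append(row)
        self.rows.sort(key=lambda r: tuple(r[c] for c in self.clustering))


readings = Table(columns=["day", "sensor", "temp"], clustering=["day", "sensor"])
readings.insert(day="2016-03-08", sensor="s42", temp=21.5)
readings.insert(day="2016-03-07", sensor="s42", temp=20.9)
readings.insert(day="2016-03-08", sensor="s07", temp=19.2)

try:
    readings.insert(day="2016-03-08", sensor="s42", humidity=40)
    rejected = False
except ValueError:
    rejected = True  # 'humidity' was never declared

order = [(r["day"], r["sensor"]) for r in readings.rows]
```

The table is easy to describe to a colleague, which is the argument in favor of a schema. What it cannot do is give one row a different layout or sort order from its neighbors.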
The result is a bloated data format that performs far worse than pre-CQL Cassandra. How much worse? Evan Chan, in https://www.oreilly.com/ideas/apache-cassandra-for-analytics-a-performance-and-storage-analysis, has determined that CQL tables are an order of magnitude larger and an order of magnitude slower than their pre-CQL equivalents. This confirms quantitatively what many have observed anecdotally: Cassandra seems to be getting worse and worse with each new version.
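One plausible contributor to the size difference can be shown with simple arithmetic. The sketch below assumes a storage layout in which each non-key column of a CQL row is written as its own cell, and each cell's name repeats the sort-column values plus the column name. The byte counts, the 8-byte timestamp, and the sample data are illustrative assumptions, not measurements, and they do not reproduce the cited analysis.

```python
TIMESTAMP_BYTES = 8  # assumed per-cell timestamp size


def cell_bytes(name, value):
    """Approximate size of one cell: its name, its value and a timestamp."""
    return len(name.encode()) + len(value.encode()) + TIMESTAMP_BYTES


sort_key = "2016-03-08:sensor42"  # hypothetical sort-column values
fields = {"temp": "21.5", "status": "ok", "battery": "87"}

# Hand-packed layout: one cell, with all field values concatenated.
packed = cell_bytes(sort_key, ":".join(fields.values()))

# One-cell-per-column layout: the sort key and column name repeat in every cell.
per_column = sum(
    cell_bytes(f"{sort_key}:{col}", val) for col, val in fields.items()
)

overhead = per_column / packed
```

In this toy case the repeated names dominate the stored bytes, and the gap widens as rows gain more columns with short values. Real on-disk sizes also depend on compression and other metadata, which this sketch ignores.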
You might argue that you can always program in CQL, but with an eye on optimizing the underlying column format via cassandra-cli. Unfortunately, DataStax discourages this: cassandra-cli warns that it cannot be used to view tables created via CQL. Furthermore, in the upcoming data storage engine, cassandra-cli will cease to work, and there will be no way to view the underlying column format.
DataStax deserves praise for advancing Cassandra and making it more accessible. But the price may have been too steep: the more accessible Cassandra becomes, the less efficient and flexible it becomes. Perhaps it is time for the Cassandra community to “take back” Cassandra, i.e., branch off an open source version that is independent of DataStax, a version for experienced Cassandra programmers that still has the flexibility and efficiency that made us love Cassandra in the first place.