I am a field engineer and evangelist for Imply (the company behind Druid), and I was a field engineer and evangelist for Datastax (the company behind Cassandra). As a result, I’ve seen things. A lot of them are cool: designing elegant, high-performance, scalable data architectures that work flawlessly in production. Some of them are not so pretty: selecting the wrong system for a critical use case and spending countless hours trying to make it fit.
If you are reading this because you are considering whether to use Apache Cassandra/DSE/ScyllaDB or Apache Druid/Imply, then you can just stop right now. You are already headed towards not pretty. If you, like many people I talk to, are evaluating Cassandra (or HBase), but starting to think that Druid is a better fit, then let me save you from the black hole of white papers known as db-engines.
Everything you think you know about Cassandra and Druid is probably wrong. The two databases fundamentally do not address the same set of use cases.
Cassandra and Druid are both fast and highly scalable. They can both deal with time-related data. They also both have cool mascots, but that’s where the similarity ends. The major difference between Cassandra and Druid is inextricably tied to the query patterns dictated by your use case.
Query Patterns are Key
It’s critical to understand anticipated query patterns when selecting any database for your applications. Unlike in the olden days (the early 2000s), RDBMSs are no longer the only answer to the “which database should I use?” question.
The simplest question to first ask yourself is, do you have an OLTP or OLAP use case? Cassandra works great for OLTP use cases and Druid works great for OLAP ones. If you already know your use case, then great, you know what system to use. Not sure? More details await!
If your queries ALWAYS constrain on a single column in the WHERE clause, for example on a field such as deviceID or customerID, and you are looking to quickly (sub-second response time) scoop up any and all data related to that ID field reliably, and you are doing nothing else, then Cassandra is your mythological creature of choice. Seriously. You are welcome to argue with me, I don’t mind. I welcome distributed database discourse.
If your use case is such that you honestly have no idea what your WHERE clause will look like, but you know that multiple ID columns will probably need to be queried reliably in less than a few seconds, then Druid is your best bet. Queries matter, people! Know thy query, know thy database. I’m sure someone famous said that.
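To make the two access patterns concrete, here is a toy sketch in plain Python. The data and names (`events_by_device`, `deviceID` values, temperatures) are hypothetical, and neither function is real Cassandra or Druid code; the point is the shape of the query: one known key versus an ad-hoc scan-and-aggregate.

```python
# Hypothetical event data keyed by deviceID -- illustrative only.
events_by_device = {
    "dev-1": [{"ts": 1, "temp": 20.1}, {"ts": 2, "temp": 20.7}],
    "dev-2": [{"ts": 1, "temp": 19.4}],
}

def point_lookup(device_id):
    """Cassandra-style access: one known key, scoop up everything under it."""
    return events_by_device.get(device_id, [])

def avg_temp_per_device(min_ts=0):
    """Druid-style access: no single key known up front; scan and aggregate
    (a GROUP BY over whatever the WHERE clause turns out to be)."""
    out = {}
    for dev, rows in events_by_device.items():
        temps = [e["temp"] for e in rows if e["ts"] >= min_ts]
        if temps:
            out[dev] = sum(temps) / len(temps)
    return out
```

If your application only ever runs the first kind of query, Cassandra's layout rewards you; if it mostly runs the second kind, Druid's does.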
Understanding what queries you are trying to optimize for is the foundation for our discussion. Of course, there are a lot more details to cover, which can be broken down into two major topics: architecture and data.
Druid and Cassandra Architectures Compared
Druid and Cassandra are both distributed database systems that were designed from inception to withstand outages, and are therefore well-suited for modern infrastructure and application requirements. Both systems allow you to flexibly configure data replication, and distribute data in tiers to take advantage of faster, more expensive hardware. Both are transparently sharded, so you don’t really need to care what data lives on which machine, and both can be easily scaled up and down, with no downtime, to meet rising capacity demands.
Of course, there is more than one way to skin a distributed database, and many of the implementation details of the two systems are very different. I will touch on some highlights, but this is by no means an exhaustive list. First, Druid relies on Apache ZooKeeper for coordination, while Cassandra has its own system based on a gossip protocol. ZooKeeper was much more battle-tested by the time Druid was conceived (2011), and the developers felt that since most Druid deployments at the time sat alongside Apache Hadoop, people would already have ZooKeeper in their stack. Cassandra, on the other hand, rarely sits in the same space as Hadoop, so reliance on anything from that stack was quickly removed for simplicity. Also, Cassandra has built-in multi-datacenter active/active replication that allows for always-on applications with very little outside management required. In contrast, Druid doesn’t rely on replication alone to guarantee availability within a DC; instead, it always maintains a copy of all data segments from all time in deep storage (a distributed file system). These deep-storage copies are not used in the read path, but are available to bring a cluster back to operation from near- or complete-death scenarios. You can totally do multi-datacenter with Druid, but it is up to you to manage and maintain it.
Druid was designed in the post-cloud-eating-the-world era, and can therefore separate its processes from each other so that each runs on the most appropriate hardware. For example, master servers do not require the same CPU, memory, or storage horsepower that data servers do. With Cassandra, you run all processes on a single machine using a single hardware profile, so you need to be a lot more careful when you choose your hardware, and this can end up being pretty costly in clouds like AWS. This is somewhat ironic, since Cassandra draws its inspiration from the Dynamo white paper, while Druid draws its inspiration from search systems, timeseries databases, and traditional analytics databases.
Druid and Cassandra Data Structures Compared
In terms of how data is stored, the two systems are much more different than they are alike, which sheds some light on why query patterns are the bottom line in choosing between them. Both are well suited to time-based data, and Druid has many special optimizations when a timestamp is present in the data. Both are also fairly flexible in terms of schema. Neither supports JOINs or FOREIGN KEY constraints. Both support evolving schemas and nested data sets, so they are much more forgiving than the relational systems of old. Most importantly, both systems share the requirement that, to get the most out of them, you should understand what your query patterns will look like BEFORE designing your schema. This is the price you pay for flexibility and speed over an RDBMS, and most modern application design philosophies support this newer way of thinking.
I won’t go into exhaustive detail about how data is laid out on disk for each database, but in summary, Cassandra is fundamentally a key-value store: it distributes data around the cluster by a PARTITION KEY, then sorts the data within that partition (or row) by the CLUSTERING key. Adding new data to a partition is almost free, and an update simply writes a new version of the cell with a later timestamp (a delete writes a tombstone); older versions are reconciled away at read time and discarded during compaction. Eventually, you will need to compact these partitions as data becomes fragmented over multiple files, but remember that you are amortizing your INSERTs and UPDATEs over time with almost instantaneous commits. This makes scanning a single partition or row very fast, as the disk only needs a single seek. However, if you want more than a single Cassandra partition, performance goes south fairly quickly: scatter/gather queries are an anti-pattern, and secondary indexes are only useful in extremely rare and specific occasions. Therefore, when you know what partition you want to scan, and you don’t want to do any aggregations, GROUP BYs, or other analytical operations, you are in good shape. The result is that Cassandra is great for small, tightly constrained, well-known queries and high-volume inserts and updates.
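The partition-key/clustering-key layout can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not real Cassandra code: a partition maps to a list of cells kept sorted by clustering key, writes append new versions cheaply, and reading a whole partition returns the latest version of each cell in order.

```python
import bisect

class TinyPartitionStore:
    """Toy model of Cassandra's on-disk layout: partition key -> cells
    kept sorted by clustering key. Hypothetical, for illustration only."""

    def __init__(self):
        # partition_key -> sorted list of (clustering_key, write_ts, value)
        self.partitions = {}

    def upsert(self, pk, ck, value, ts):
        # An update just appends a newer version of the cell; compaction
        # would later discard shadowed versions (a delete would write
        # a tombstone instead of a value).
        cells = self.partitions.setdefault(pk, [])
        bisect.insort(cells, (ck, ts, value))

    def read_partition(self, pk):
        # One cheap sequential read: latest value per clustering key, in order.
        latest = {}
        for ck, ts, value in self.partitions.get(pk, []):
            if ck not in latest or ts > latest[ck][0]:
                latest[ck] = (ts, value)
        return [(ck, v) for ck, (ts, v) in sorted(latest.items())]
```

Note what is missing: there is no way to query across partitions without visiting every one of them, which is exactly the scatter/gather anti-pattern described above.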
In contrast, Druid is fundamentally a column store, and is designed for analytical queries (GROUP BYs with complex WHERE clauses) that need to scan across multiple partitions. Druid stores its index in segment files, which are partitioned by time. Segment files are columnar, with the data for each column laid out in separate data structures. By storing each column separately, Druid can decrease query latency by scanning only those columns that are required for a query. There are different column types for different data types (strings, numbers, etc.), and different columns can have different encoding and compression algorithms applied. For example, string columns will be dictionary encoded, LZF compressed, and have search indexes created for faster filtering, while numeric columns will have completely different compression and encoding algorithms applied. Druid segments are immutable once finalized, so updates in Druid have limitations. Although more recent versions of Druid have added “lookups”, or the ability to join a mutable table external to Druid with an immutable one in Druid, I would not recommend Druid for any workflows where the same underlying data is frequently updated and those updates need to complete in less than a second (say, powering a social media profile page). Druid supports bulk updates, which are more commonly seen with analytic workloads.
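The dictionary-encoding-plus-index idea for string columns can be illustrated with a minimal sketch. This is hypothetical Python, not Druid's actual segment format (which adds compression and bitmap indexes on top): each distinct string gets an integer id, the column stores only the ids, and an inverted index maps each id to its row offsets so a filter never has to scan the raw strings.

```python
# A toy string column -- hypothetical data, illustrative only.
raw_column = ["US", "FR", "US", "DE", "FR", "US"]

# Dictionary encoding: each distinct value gets a small integer id.
dictionary = {v: i for i, v in enumerate(sorted(set(raw_column)))}
encoded = [dictionary[v] for v in raw_column]  # ints instead of strings

# Inverted index: value id -> row offsets, enabling fast filtering.
index = {}
for row, vid in enumerate(encoded):
    index.setdefault(vid, []).append(row)

def rows_matching(value):
    """Answer a WHERE filter on this column without touching raw strings."""
    return index.get(dictionary.get(value, -1), [])
```

Because each column is self-contained like this, a query that filters on one column and aggregates another never pays to read the columns it doesn't mention.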
If you have made it this far, congratulations! Hopefully, by now you understand the way I laid out my argument and my statement that the choice between Cassandra and Druid is all about the use case and how it relates to the way queries run. Cassandra is best for use cases that are write heavy with small, highly constrained queries (OLTP). Druid is best for use cases that are read heavy, and require full analytical query capacity (OLAP). Both are great systems and can be awesome tools when applied correctly, but choose wisely, gentle reader, for the consequences of building your application on the wrong one could bring your business to a grinding halt. No system is a unicorn that can do everything you need. The key is to understand your query patterns and your workload.
A great way to get hands-on with Druid is through a Free Imply Download or Imply Cloud Trial.