Cassandra Introduction

Today’s world is impatient, and at the same time it generates terabytes of data every hour of every day. The need of the hour is massive scalability, high availability, and maximum I/O throughput. If you work in backend engineering, are diving into system design, or are simply curious about new technologies, you have probably heard about Cassandra and how popular it is becoming. Let’s get to know Cassandra as a database, see what makes it so powerful, and understand where it can be a good fit.

Brief History

Cassandra is inspired by Google’s Bigtable (described in a paper published in 2006) and Amazon’s Dynamo (described in a paper published in 2007). Cassandra was developed at Facebook by Avinash Lakshman, one of the authors of the Dynamo paper, and Prashant Malik to power the Inbox Search feature.

Facebook released it as an open-source project on Google Code in 2008. Cassandra became an Apache Incubator project in 2009 and a top-level project of the Apache Software Foundation in 2010. Today, Cassandra is freely available under the Apache License 2.0.

Some say that Facebook engineers chose the name because, in Greek mythology, Cassandra was a cursed oracle.

Eventually, Facebook replaced Cassandra with HBase, another NoSQL database, for their Inbox Search project, but they continue to use Cassandra in their Instagram division, which supports over 1 billion monthly active users.

Present Day

Cassandra has gone through more than 10 releases so far and is used by 400+ companies across 40 countries. It is one of the most widely used NoSQL databases, and it has given the NoSQL scene a significant boost in recent years. 90% of Fortune 100 companies use Cassandra, even though a plethora of NoSQL options is available.

What is Cassandra?

The technical definition of Cassandra uses several core engineering terms, each of which is a subject of research in itself. Here is the definition of Cassandra as a database:

Cassandra is a distributed, decentralized, column-oriented, open-source NoSQL database that offers high availability, high scalability, and tunable consistency.

Now let’s break down the definition and understand the terminologies.

Distributed Decentralized System

A distributed system is made up of a collection of nodes. A node is a computing system that can communicate with other nodes and can store data. This way a distributed system can hold a huge volume of data and can be scaled horizontally by simply adding new nodes to the system.

Traditionally, we have built centralized systems in which a single node is responsible for data management and data ownership. It carries out the desired computation by commanding the other nodes in the system. Such a system depends on the master node and cannot be scaled horizontally. Even with vertical scaling, the master node can easily reach its limit. The most severe drawback of a centralized system is that if the central node fails, the whole system goes down. The central (master) node is therefore called the single point of failure (SPoF).


Figure: A centralized system, in which the master node with authority commands the other nodes


To avoid the limitations of a centralized system and its single point of failure, the distributed decentralized system was introduced. This system is a collection of nodes where the computation is spread across the nodes instead of a single node doing all of it. It is essentially a peer-to-peer network of nodes in which no single node has complete control over the system. The nodes communicate with each other to achieve a common goal.


Figure: A distributed decentralized system, a collection of nodes communicating with one another without a central authority


Column Oriented

We are all familiar with relational databases such as MySQL and PostgreSQL, and many of us have used one at least once in a software project. Relational databases are time-proven and have come a long way since the 1980s. They store data in a tabular format made up of rows and columns, and each piece of data is stored in a cell.

Here’s an example of a sample product table:

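| id  | brand | color | publisher | author          | model_id | title                  | pages |
|-----|-------|-------|-----------|-----------------|----------|------------------------|-------|
| 101 | Sony  | black | NULL      | NULL            | Bravia   | NULL                   | NULL  |
| 201 | NULL  | NULL  | Penguin   | Daniel Kahneman | NULL     | Thinking Fast and Slow | 450   |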

As we can see, the schema of a relational table is not very flexible. If we want to store a variety of products whose columns mostly do not overlap, we have to store NULL in every cell that has no value. Keeping NULL may seem like a way out, but it is problematic. Each empty cell still takes some storage (say 1 byte), so if the table above had 10 billion empty cells, you would be wasting about 10 GB of storage.
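The estimate above is simple arithmetic, and a quick sketch makes it easy to try other numbers. The 1 byte per empty cell is the assumption taken from the text, not a measured figure for any particular database, and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate of the space wasted by empty cells.
# Assumption (from the text): every empty cell costs 1 byte.
BYTES_PER_EMPTY_CELL = 1
EMPTY_CELLS = 10_000_000_000  # 10 billion


def wasted_bytes(empty_cells: int, bytes_per_cell: int = BYTES_PER_EMPTY_CELL) -> int:
    """Total bytes spent on cells that hold no value."""
    return empty_cells * bytes_per_cell


wasted = wasted_bytes(EMPTY_CELLS)
wasted_gb = wasted / 10**9    # decimal gigabytes
wasted_gib = wasted / 2**30   # binary gibibytes

print(f"{wasted_gb:.2f} GB ({wasted_gib:.2f} GiB) wasted")
```

Depending on whether you count in decimal gigabytes or binary gibibytes, the waste comes out at 10 GB or a little over 9 GiB.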

Column-Value Pairs to the Rescue!

The data above can be simplified by storing it as column-value pairs, as follows:

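| id  | attribute_name | attribute_value        |
|-----|----------------|------------------------|
| 101 | brand          | Sony                   |
| 101 | color          | black                  |
| 101 | model_id       | Bravia                 |
| 201 | publisher      | Penguin                |
| 201 | author         | Daniel Kahneman        |
| 201 | title          | Thinking Fast and Slow |
| 201 | pages          | 450                    |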

This structure has no NULL or empty values, so no storage is wasted. We created a new row for every attribute of a given id. Let’s see what we have gained:

  • Disk space is saved by not storing empty cells
  • The data is not tightly bound to a table schema, so new columns can be added in the future
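To make the transformation concrete, here is a minimal Python sketch that turns the sparse product rows into column-value pairs by skipping empty cells. The names `wide_rows` and `to_column_value_pairs` are made up for this example; this illustrates the idea only and is not how Cassandra stores data on disk.

```python
from typing import Any, Dict, List, Optional, Tuple

# The sample product table, with None standing in for NULL cells.
wide_rows: Dict[int, Dict[str, Optional[Any]]] = {
    101: {"brand": "Sony", "color": "black", "publisher": None, "author": None,
          "model_id": "Bravia", "title": None, "pages": None},
    201: {"brand": None, "color": None, "publisher": "Penguin",
          "author": "Daniel Kahneman", "model_id": None,
          "title": "Thinking Fast and Slow", "pages": 450},
}


def to_column_value_pairs(
    rows: Dict[int, Dict[str, Optional[Any]]]
) -> List[Tuple[int, str, Any]]:
    """Emit one (id, attribute_name, attribute_value) tuple per non-empty cell."""
    pairs = []
    for row_id, cells in rows.items():
        for name, value in cells.items():
            if value is not None:  # empty cells are simply not stored
                pairs.append((row_id, name, value))
    return pairs


pairs = to_column_value_pairs(wide_rows)
null_cells = sum(v is None for cells in wide_rows.values() for v in cells.values())

for pair in pairs:
    print(pair)
print(f"stored {len(pairs)} pairs, skipped {null_cells} empty cells")
```

Half of the 14 cells in the wide table are empty, so the pair representation stores only the 7 that carry a value.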

The figure below gives a visual representation of the column-oriented data storage model.


Figure: Visualization of the column storage model


High Availability

In the modern world, we expect systems to be always available to serve us, but 24/7 availability is never guaranteed by default. In software engineering, service providers offer Service Level Agreements (SLAs) that guarantee a minimum level of availability. The most popular target of all is 99.999%, known as five nines of availability. The availability percentage is calculated as:

Percentage of availability = ((total elapsed time – sum of downtime) / total elapsed time) × 100
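As a quick worked example of the formula, the sketch below computes an availability percentage and, going the other way, the downtime budget that a five-nines target leaves over a year. The function names are made up for illustration, and a 365-day year is assumed.

```python
def availability_percent(total_time: float, downtime: float) -> float:
    """Availability as a percentage; both arguments use the same time unit."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return (total_time - downtime) / total_time * 100


def allowed_downtime(target_percent: float, period: float) -> float:
    """Maximum downtime (in the unit of `period`) that still meets the target."""
    return period * (1 - target_percent / 100)


MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year

five_nines_budget = allowed_downtime(99.999, MINUTES_PER_YEAR)
print(f"Five nines allows about {five_nines_budget:.3f} minutes of downtime per year")
print(f"One hour down in a year gives {availability_percent(MINUTES_PER_YEAR, 60):.4f}%")
```

In other words, five nines leaves a little over five minutes of downtime for the whole year.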

Cassandra is designed as a peer-to-peer distributed system in which all nodes are equal and the data is distributed among the nodes in the cluster. Since there is no single point of failure, Cassandra provides high availability of data through a fault-tolerant storage system equipped with failure detection. Cassandra achieves fault tolerance via replication.
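To see why replication gives fault tolerance, here is a deliberately simplified sketch: each key is written to several nodes, so a read can still succeed after one of them goes down. `TinyCluster`, its methods, and the node names are hypothetical, and the placement rule (hash the key, then take the next nodes in order) is an assumption for illustration rather than Cassandra's actual replica placement.

```python
import hashlib
from typing import Dict, List, Set


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() varies between runs)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class TinyCluster:
    """Toy peer-to-peer cluster: every node is equal and holds a share of the data."""

    def __init__(self, node_names: List[str], replication_factor: int = 3) -> None:
        if replication_factor > len(node_names):
            raise ValueError("replication factor cannot exceed the number of nodes")
        self.nodes = list(node_names)
        self.rf = replication_factor
        self.storage: Dict[str, Dict[str, str]] = {name: {} for name in self.nodes}
        self.down: Set[str] = set()

    def replicas_for(self, key: str) -> List[str]:
        """Pick `rf` consecutive nodes, starting at the one the key hashes to."""
        start = stable_hash(key) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.rf)]

    def write(self, key: str, value: str) -> None:
        for node in self.replicas_for(key):
            if node not in self.down:
                self.storage[node][key] = value

    def read(self, key: str) -> str:
        for node in self.replicas_for(key):
            if node not in self.down and key in self.storage[node]:
                return self.storage[node][key]
        raise LookupError(f"no live replica holds {key!r}")


cluster = TinyCluster(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
cluster.write("user:42", "Ada")

# Take down the first replica; the remaining copies still serve the read.
failed_node = cluster.replicas_for("user:42")[0]
cluster.down.add(failed_node)
value_after_failure = cluster.read("user:42")
print(f"{failed_node} is down, read still returned {value_after_failure!r}")
```

With a replication factor of 3, the data stays readable here even if two of the three replica nodes fail.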

Highly Scalable

As already discussed, Cassandra is a distributed system under the hood, so it can scale out close to linearly just by increasing the number of nodes. It is very common in Cassandra to add a node to the cluster or remove one during regular working hours, without waiting for the system load to drop at night.
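One reason nodes can be added while the cluster is running is that data is placed with consistent hashing, so a new node takes over only a share of the keys instead of forcing a full reshuffle. The sketch below is a simplified hash ring with virtual nodes. `HashRing`, the node names, and the key format are hypothetical, and real Cassandra token assignment is more involved.

```python
import bisect
import hashlib
from typing import Dict, List


def token(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Simplified consistent-hash ring with several virtual nodes per node."""

    def __init__(self, nodes: List[str], vnodes: int = 64) -> None:
        self.vnodes = vnodes
        self._tokens: List[int] = []      # sorted ring positions
        self._owner: Dict[int, str] = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            self._owner[t] = node
            bisect.insort(self._tokens, t)

    def owner_of(self, key: str) -> str:
        """A key belongs to the first ring position after its own token."""
        idx = bisect.bisect_right(self._tokens, token(key)) % len(self._tokens)
        return self._owner[self._tokens[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"order:{i}" for i in range(1000)]

before = {k: ring.owner_of(k) for k in keys}
ring.add_node("node-d")  # scale out while "live"
after = {k: ring.owner_of(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node-d")
```

Only the keys that now belong to the new node move. No key is shuffled between the existing nodes, which keeps the cost of adding a node proportional to the data it takes over.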

Another key feature of Cassandra is that it works on commodity hardware and does not require any specialized hardware components.

Tunable Consistency

Now, this is a really interesting one. In a typical single-node relational database, as soon as you write a row to a table, you are guaranteed to get that data back if you read it immediately afterwards.

This is not always the case with Cassandra. Cassandra is a distributed network of nodes, and when you write data to one node, the system replicates it to the other replica nodes. If you read the data right after writing it, the system does not guarantee that the node serving your read request already has the latest written data.

You can tune the consistency level of Cassandra, but stronger consistency comes at the cost of availability and latency.
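A common way to reason about this trade-off is the overlap rule: with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to share at least one replica when R + W > N. Cassandra exposes this choice through per-request consistency levels such as ONE, QUORUM, and ALL. The sketch below checks the rule by brute force for a small cluster. The function names are made up, and the model assumes the worst case in which a write has reached only the W replicas that acknowledged it.

```python
from itertools import combinations


def can_read_stale(n: int, w: int, r: int) -> bool:
    """True if some set of W written replicas and some set of R read replicas
    share no replica, meaning a read could miss the latest write."""
    replicas = range(n)
    for written in combinations(replicas, w):
        for read in combinations(replicas, r):
            if not set(written) & set(read):
                return True
    return False


def overlap_guaranteed(n: int, w: int, r: int) -> bool:
    """The R + W > N rule of thumb."""
    return r + w > n


N = 3  # replicas per piece of data
for w in range(1, N + 1):
    for r in range(1, N + 1):
        verdict = "may read stale data" if can_read_stale(N, w, r) else "always sees latest write"
        print(f"W={w} R={r}: {verdict}")
```

With three replicas, writing and reading at quorum (W=2, R=2) always overlaps, while W=1 with R=1 is faster and tolerates more failures but may return stale data.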

Where Does Cassandra Fit?

To summarize, Cassandra offers the following benefits:

  • Highly available
  • Highly scalable
  • High throughput
  • Eventually consistent

Cassandra is therefore a great fit for areas where we need to write a high volume of data and can accept weaker consistency in return. A few example systems where Cassandra can be used are:

  • Product catalog management
  • Analytics data storage
  • Logging and auditing systems
  • Logistics systems
  • Data tracking systems for healthcare, weather, and telemetry

Wrap Up

That was an introduction to the Cassandra database and an explanation of its core concepts. The next write-up will cover its architecture and detailed use cases, including where not to use Cassandra.

Stay healthy, stay blessed!