Introducing NOSQL

Jeff Davies
5 min readDec 20, 2021

Watch the video

Have you ever wondered what a NoSQL database is and why you should care about them?

One of the best ways to understand NoSQL is by comparing NoSQL to a traditional relational SQL database. In relational databases you model data with each table representing an entity, like a customer, a product, an order and the line items in the order. You then model the relationships between those tables. For example, one customer may have any number of orders, but each order belongs to a single customer.

At scale, it is not uncommon to have a hundred or more tables with each table having on average 3–5 relationships to other tables. When you read the data, you often join the tables to get the information you need. For example, select all of the orders for customer ID = 1234 in the last 3 months that contained the product with the ID of 5678.

The problems with SQL databases start to appear when you have complex queries with many joins; the queries become quite slow. Also, writing to a SQL database is much slower than reading from it and this is a serious problem at scale. Finally, traditional SQL databases struggle with large amounts of data and large numbers of transactions per second. They are prohibitively costly to scale. That’s why NoSQL was invented.

NoSQL does away with table joins and instead models information differently. It is this different approach to data modeling that gives NoSQL its tremendous speed and scalability. Furthermore, NoSQL is a distributed database. This means that a significant disaster can occur (data centers going down, etc) and you still have access to your data. I call this “disaster resilience”. It is built into NoSQL, unlike monolithic SQL databases that require costly and complicated disaster recovery schemes.

Horizontal scaling with commodity hardware

NoSQL scales horizontally using commodity hardware. This provides a huge savings over scaling monolithic relational databases. Furthermore, NoSQL provides fast reads and writes.

The technical implementation details vary from one NoSQL database product to the next, so I’m going to focus on the Apache Cassandra database. Whenever you read or write to the database, your request will be directed to one of the known nodes in the Cassandra database. This node becomes the “coordinator node”.

Every node can be a coordinator node!

Apache Cassandra does not designate specialized coordinator nodes because that leads to single points of failure. Instead, with Apache Cassandra, all nodes can act as a coordinator node for each read and write, thereby eliminating single points of failure.

When writing data, the concept of replication factor comes into play. The replication factor is how many copies of the data do you want to store for redundancy and disaster resilience. In this example we have 6 nodes. Commonly, a replication factor of 3 is used to ensure that data is never lost. So for each write command that is received by the coordinator node, that node will ensure that the data is written to 3 different nodes. Of course, sometimes the coordinator node itself is also one of the 3 nodes that will store that particular piece of data.

Similarly, when a read request comes in, whichever node receives the read request becomes the coordinator node. The coordinator node then requests the data from a “quorum” of nodes. The quorum value can be set on a per request basis.

What’s a quorum? It’s the number of necessary read responses before Cassandra can determine if the read was successful. It is based on the replication factor for your database. I won’t bore you with the math but a simple example should help clarify things. Let’s assume your replication factor is 3. Your quorum value will be 2. But Cassandra doesn’t waste time or network bandwidth by fetching the same data twice! [Show read request, data read and the digest] Instead it does a full read of the query results from the fastest known node in the ring and calculates the digest, or checksum of that read. From the second node it only requests the digest. If the 2 digests match, then the data is returned. If the 2 digests don’t match, then Cassandra will check to see which digest has the most recent data, and then return the results from the node with the most recent data. If the digests don’t match it’s because one of the nodes simply hasn’t completed its write yet. Usually this is corrected in a matter of milliseconds.

If a node goes down for whatever reason, Cassandra is smart enough not to ask it for data during a read. If that node was supposed to participate in the write event, then Cassandra stores hints about that operation until the node comes back up, at which time all hints are updated on the node.
Cassandra doesn’t reactively have to wait for a node to timeout. It runs internal protocols so that if a node goes down for whatever reason, all of the other nodes are informed proactively. Similarly, when the node returns to active service, all other nodes are made aware of that change in node availability and automatically resume normal read/write operations with it.

SQL and NoSQL are not competing technologies. They are different tools that solve different problems. SQL is a great tool for running smaller workloads that require ad-hoc queries where execution speed is not a primary concern. NoSQL is designed for large-scale transactions where all reads and writes need to execute in milliseconds. Furthermore, NoSQL scales using commodity hardware and is designed for both Big Data and Fast Data which address today’s data needs. As always, you want to choose the right tool for the right job.

Interested in learning more? Head to datastax.com to see how you can get started with a free NoSQL database in the cloud today.

--

--

Jeff Davies

Long time software engineer, software architect, technical evangelist and motorcycle enthusiast.