
Cassandra Deep Dive: The Read Before Write Evil

In Cassandra, read before write is considered an anti-pattern. The Storage Engine section of the Cassandra documentation describes the problems associated with it. Let’s take a deep dive to see why this pattern is described as an anti-pattern, whether there are any tricks to overcome its disadvantages, and what the alternative technologies are if your use case really needs this pattern.

Use case 1

Let’s take an example. In an order management system, an order goes through different statuses. Let’s say the sequence of order statuses is: “created”, “payment complete”, “shipment scheduled” and “shipped”. Correspondingly, in our Cassandra table, we have a column named order_status. Let’s also assume that different components of the system can update the order_status simultaneously. In Cassandra, when multiple clients write to the same row, the write with the latest timestamp wins. So it is very much possible that an older order_status arrives late and overwrites the latest status. How do you prevent this kind of situation so that the system always has the latest status for any order? One way is to introduce a status_timestamp column. Each component reads the status_timestamp from the table and updates the table only when its own timestamp is newer than the existing value. This is how we run into the read-before-write pattern.
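
To make the pattern concrete, here is a minimal sketch of what the table and the read-then-write flow could look like; the table name orders, the bigint timestamp type and the literal values are assumptions for illustration.

/** Assumed table for use case 1 **/
CREATE TABLE orders (
order_id text,
order_status text,
status_timestamp bigint,
PRIMARY KEY (order_id)
);

/** Step 1: the extra read **/
SELECT status_timestamp FROM orders WHERE order_id = '1234';

/** Step 2: write only if our timestamp is newer; the comparison
happens in application code, which is where the race lives **/
UPDATE orders
SET order_status = 'shipped', status_timestamp = 524232424233
WHERE order_id = '1234';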

Problems with read before write in Cassandra

Following are a few problems with the read-before-write pattern.

Performance Impacts

In a read-before-write scenario, there is always an additional read operation. If you want that read to be consistent, you will have to set the read consistency level to QUORUM (assuming the write consistency level is also QUORUM). This adds more round trips between replica nodes.
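
As a rough illustration, in cqlsh the preceding read could be issued at QUORUM like this; the table and values are the assumed ones from above.

/** cqlsh: the extra read, done at QUORUM **/
CONSISTENCY QUORUM;
SELECT status_timestamp FROM orders WHERE order_id = '1234';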

Data Inconsistency

One problem with using a timestamp field is that it cannot guarantee consistency. If multiple components read the status_timestamp at the same time, there is still a possibility that an older order_status overwrites a newer value. One way to ensure data consistency is to use Lightweight Transactions (LWT). You can use an LWT to update the order_status as follows:

UPDATE orders
SET order_status = 'shipped', status_timestamp = 524232424233
WHERE order_id = '1234'
IF status_timestamp = 12222223344;

Even though these are called lightweight transactions, they are heavyweight operations in the Cassandra world. The Paxos protocol has to be employed to make the update transactional, which involves several extra round trips between the replica nodes. So while LWT solves the inconsistency problem, it introduces a performance problem. Combining the cost of LWT with the additional read operation, the overall performance will be severely impacted.

Solution 1: Solve the problem by avoiding the read-before-write pattern

By slightly changing the above design, we can avoid the read-before-write pattern. Let’s add the column status_timestamp as a clustering key, so the table will look like:

/** Table Definition : orders **/
CREATE TABLE orders (
order_id text,
status_timestamp timeuuid,
order_status text,
PRIMARY KEY (order_id, status_timestamp)
);

Now, instead of doing the additional read, we can always insert the status_timestamp and order_status into this table. To find the latest status of an order, we select the rows for that order_id and take the row with the latest status_timestamp. Within a partition, rows are kept sorted by the clustering column, so it is very easy to find the latest value.
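
For example, a query along the following lines returns just the newest row; the ORDER BY/LIMIT combination is one assumed way to pick the latest status.

/** Fetch the latest status for an order **/
SELECT order_status, status_timestamp
FROM orders
WHERE order_id = '1234'
ORDER BY status_timestamp DESC
LIMIT 1;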

Solution 2: Use the “USING TIMESTAMP” clause

In CQL, we can specify a USING TIMESTAMP value in INSERT and UPDATE statements. Cassandra stores this timestamp internally as the write timestamp of each column that gets updated. When the same column is written multiple times, the value with the newest timestamp wins, so a write carrying an older timestamp will not override a newer one. By making use of this feature, we can avoid the read before write in this use case.

/** Table Definition : orders **/
CREATE TABLE orders (
order_id text,
order_status text,
PRIMARY KEY (order_id)
);

/** Update using timestamp **/
UPDATE orders USING TIMESTAMP 162524132422
SET order_status = 'shipped'
WHERE order_id = '1234';

While using this method, it is important to make sure that the clocks of the different application servers are synchronized, because it is the application servers that generate the timestamp. Note that the USING TIMESTAMP value is conventionally expressed in microseconds since the epoch, matching the timestamps Cassandra generates by default.
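
The following sketch shows the behaviour we rely on; the timestamp values are made-up microsecond counts.

/** A late write carrying an older timestamp... **/
UPDATE orders USING TIMESTAMP 162524132400
SET order_status = 'payment complete'
WHERE order_id = '1234';

/** ...does not override the newer value: this still returns 'shipped' **/
SELECT order_status FROM orders WHERE order_id = '1234';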

Use case 2

Let’s say we need to design a simple inventory management system which stores the absolute inventory of each SKU. Suppose there are two operations which can change the inventory of a SKU: (1) Purchase (the company purchases more products from vendors, which increases the inventory) and (2) Sale (the company sells products to customers, which decreases the inventory). Each purchase/sale operation changes the inventory by a delta value +x or -x. To always keep the absolute inventory in the Cassandra table, we would need to read the existing value, add or subtract x, and write back the new value. You can see that this is a potential use case that introduces the read-before-write problems. How do we solve it without the read-before-write?
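
For reference, the naive read-modify-write flow we want to avoid could look like this; the table name inventory_naive and the quantities are made up for illustration.

/** Naive design: a single absolute quantity per SKU **/
CREATE TABLE inventory_naive (
sku text,
quantity int,
PRIMARY KEY (sku)
);

/** Read the current quantity, apply the delta in application code,
then write it back; two racing clients can lose an update **/
SELECT quantity FROM inventory_naive WHERE sku = '123456';
UPDATE inventory_naive SET quantity = 110 WHERE sku = '123456';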

Solution 1: Using Cassandra collection types to store delta values

/** Table Definition: inventory **/
CREATE TABLE inventory (
sku text,
delta_value list<int>,
PRIMARY KEY (sku)
);

/** Inserting a delta value **/
UPDATE inventory
SET delta_value = delta_value + [10]
WHERE sku = '123456';

/** To find the absolute inventory for a product,
read the delta_value column and take the sum of all values **/
SELECT delta_value
FROM inventory
WHERE sku = '123456';

By defining delta_value as a collection column (a list rather than a set, so that two identical deltas are not collapsed into a single element), we can always append values to it, effectively avoiding the need to read before write. But this method introduces another challenge: the collection keeps growing and makes the column very wide. In addition, the periodic cleanup described below deletes elements from the collection, which generates tombstones.

One way to keep the size of the collection column down is to do a periodic consolidation: sum the deltas, keep the sum in a separate column, and then delete those delta values from the collection. After gc_grace_seconds, compaction will kick in and clean up the resulting tombstones.
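
A minimal sketch of that consolidation, assuming we add a base_quantity column to hold the rolled-up sum (the column name and the summed value are made up); appends arriving between the read and the cleanup would need separate handling.

/** Add a column to hold the consolidated sum (assumed name) **/
ALTER TABLE inventory ADD base_quantity int;

/** Read the deltas and sum them in application code **/
SELECT base_quantity, delta_value FROM inventory WHERE sku = '123456';

/** Store the new sum and clear the collection in one single-partition batch **/
BEGIN BATCH
UPDATE inventory SET base_quantity = 142 WHERE sku = '123456';
DELETE delta_value FROM inventory WHERE sku = '123456';
APPLY BATCH;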

Solution 2: Using Cassandra columns to store delta values

/** Table Definition: inventory **/
CREATE TABLE inventory (
sku text,
delta_time timeuuid,
delta_value int,
PRIMARY KEY (sku, delta_time)
);

/** Inserting a delta value **/
INSERT INTO inventory
(sku, delta_time, delta_value) VALUES ('123456', now(), 10);

/** To find the absolute inventory for a product,
read all the delta values and take their sum **/
SELECT sum(delta_value)
FROM inventory
WHERE sku = '123456';

The benefit of this design is that there are no collection columns. Instead of making a single column wider, this design makes the partition wider (more rows per partition), which Cassandra handles much better. Like the previous solution, here too we need to sum up the delta values periodically to keep their number within limits. To store the sum of the delta values, we can introduce a static column as below; a consolidation sketch follows the table definition.

/** Table Definition: inventory **/
CREATE TABLE inventory (
sku text,
delta_time timeuuid,
delta_value int,
delta_sum int STATIC,
PRIMARY KEY (sku, delta_time)
);
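
A rough sketch of how that consolidation might work with the static column, written in prepared-statement style: the '?' bind markers stand for the sum computed by the application and the newest delta_time it consumed; range deletes on a clustering column require Cassandra 3.0 or later.

/** Fold the consumed deltas into the static sum, then range-delete them **/
BEGIN BATCH
UPDATE inventory SET delta_sum = ? WHERE sku = '123456';
DELETE FROM inventory WHERE sku = '123456' AND delta_time <= ?;
APPLY BATCH;

/** Absolute inventory = delta_sum plus the sum of the remaining deltas **/
SELECT delta_sum, sum(delta_value) FROM inventory WHERE sku = '123456';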

Solution 3: Choosing another technology which is better suited to this use case

Sometimes, Cassandra may simply not be suited for these needs, and other systems might fit them better. For example, Hazelcast IMDG can be a better choice in the above scenario because Hazelcast provides single-threaded, sequential access to each partition, which makes read-modify-write style updates safe without extra coordination.