Sunday, August 31, 2008

Read-modify-write

Quite a bit of software development revolves around getting a piece of structured data from a store, modifying some part of it, and writing it back. If the store is local, and the setting is not very contentious - that is, the probability that two threads are going to be modifying the same structure at the same time is rather low - a mutex or a critical section provides an easy solution.

However, what if the object is managed by a remote web service? What if two threads reading and writing the same object is a common occurrence? Locking an object remotely is not easy - one has to deal with the situation where a client goes away and leaves the lock hanging, among other nasty corner cases.

Consider the following example. Let's say we have a structure that has title, description, and a checksum:

#define MAX_TITLE 128
#define MAX_DESCR 512

struct Record {
    char title[MAX_TITLE];
    char descr[MAX_DESCR];
    unsigned long crc;
};

The record lives inside a blob store. Depending on the implementation, the blob store may or may not know about the internal structure of the record. To make the exercise interesting, let's assume that it does not own the crc computation algorithm - this happens on the client.

Depending on the API, the client can either write parts of the structure or the whole thing at once.

The catch with updating parts of the structure is that there is a field (crc) that depends on the rest of the structure.

Consider the following scenario: thread A reads the structure, modifies the title, and computes the new crc. Thread B reads the structure, modifies the description, and computes the new crc. Then both of them write the result of their work as (title, crc) and (description, crc). The new data structure ends up with the new title and the new description, but the crc is wrong - depending on who wrote last, it would be computed based on either the new title and the old description, or the old title and the new description.

The consistency problem can be rectified by only allowing the whole record - title, description, and crc - to be written at once. This comes with its own set of problems: a thread that reads early but writes late can wipe out the changes of another thread that writes in between. In this scenario, thread A reads the structure and updates the title and crc. Thread B reads the structure and updates the description and crc. Thread B writes its data first; thread A then writes its changes out, wiping out thread B's modifications.

Short of a distributed lock service (or a DTC), which makes the job very complicated and potentially rather slow, there is no direct solution to this problem. There is, however, a way to redefine the problem that makes it much easier: allow the store to fail conflicting writes.

The question then becomes much simpler - how do we detect the conflict?

One easy (and quite standard) way of doing it is to maintain a version of the structure.

#define MAX_TITLE 128
#define MAX_DESCR 512

struct Record {
    char title[MAX_TITLE];
    char descr[MAX_DESCR];
    unsigned long crc;
    unsigned long version;
};

The version is read by the client and handed back as part of the update request. It is incremented by the server every time the structure is modified. If the client supplies a version number other than the one currently in the store, the server assumes that the structure has already been modified by someone else and fails the write. The client can then re-read the structure, reapply its changes, and attempt the write again.
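
As a sketch, the server-side part of this check could look as follows (UpdateRecord is a made-up entry point; a real store would key it by record ID and run it under its own synchronization):

// Hypothetical server-side check: 'stored' is the record in the store,
// 'update' is the full record sent by the client, still carrying the
// version the client originally read.
bool UpdateRecord(Record &stored, const Record &update) {
    if (update.version != stored.version)
        return false;       // modified by someone else - the client must re-read and retry
    stored = update;
    stored.version += 1;    // the server, not the client, advances the version
    return true;
}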

Some systems use a time stamp instead, which makes it somewhat harder for a malicious client to game the system. The disadvantage, of course, is that one needs a counter with a resolution high enough to guarantee that a read and a write cannot complete within one "tick". Ever-growing system performance, coupled with the desire of OS vendors to preserve backwards compatibility (in this case, the resolution of the counter defined by a given API set), makes this approach brittle.

This approach is easy to understand, but it has a disadvantage - it exposes the internals of the concurrency management to the client and makes the client responsible for behaving in good faith.

How can we transfer the responsibility for the atomicity and correctness of the updates to the server? It turns out to be rather straightforward: for every read that results in a later update, we will let the CLIENT supply us with a hard-to-guess identifier (say, a GUID) to serve as a transaction ID. We will accumulate these IDs in the internal implementation of the object (never to be returned to a client). We will only allow writes that are accompanied by a GUID that is known to us, and we will clear the known GUIDs the moment a write succeeds, immediately making all other updates stale.

#include <set>

// Synchronized externally. Assumes GUID has an operator< (or a supplied
// comparator) so it can be stored in a std::set.
class RecordInternal {
private:
    Record record;
    std::set<GUID> reads;
public:
    void Read(GUID *id, Record *rec) {
        *rec = record;
        if (id)
            reads.insert(*id);
    }
    bool Write(const GUID &id, const Record &rec) {
        if (reads.count(id) == 0)
            return false;       // unknown or stale transaction ID - fail the write
        record = rec;
        reads.clear();          // every other outstanding read is now stale
        return true;
    }
};

A potential improvement would be to keep a map from the GUID to the timestamp of the read request, instead of a set of all GUIDs - this would allow us to expire very old outstanding updates that will probably never complete because of a client failure.
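
A sketch of that variant (the Expire method, its maxAge parameter, and the use of time() are illustrative, not part of the original interface):

#include <map>
#include <time.h>

// Same idea, but each read is stamped with the time it happened, so reads
// abandoned by failed clients can be dropped later.
class RecordInternal {
private:
    Record record;
    std::map<GUID, time_t> reads;   // transaction ID -> time of the read
public:
    void Read(GUID *id, Record *rec) {
        *rec = record;
        if (id)
            reads[*id] = time(NULL);
    }
    bool Write(const GUID &id, const Record &rec) {
        if (reads.count(id) == 0)
            return false;
        record = rec;
        reads.clear();
        return true;
    }
    // Called periodically to drop reads older than maxAge seconds.
    void Expire(time_t maxAge) {
        time_t now = time(NULL);
        for (std::map<GUID, time_t>::iterator it = reads.begin(); it != reads.end(); ) {
            if (now - it->second > maxAge)
                reads.erase(it++);
            else
                ++it;
        }
    }
};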

On the client side, it would be convenient to wrap the messy business of generating the GUID in an "update" class, making the use of the API look as follows:

Update update;
Record record;
service.Read(update.id(), &record);
...update the record...
service.Write(update.id(), record);
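
A minimal sketch of such a wrapper, assuming the Windows CoCreateGuid generator (substitute your platform's GUID/UUID generator) and a service stub whose Read and Write both take a GUID pointer, as in the snippet above:

#include <objbase.h>    // CoCreateGuid on Windows

// Generates one GUID per read-modify-write attempt and hands it out for
// both the Read and the Write call.
class Update {
private:
    GUID guid;
public:
    Update() {
        CoCreateGuid(&guid);    // a fresh, hard-to-guess transaction ID
    }
    GUID *id() {
        return &guid;
    }
};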

2 comments:

Илья Казначеев said...

There is also this approach.
Though I'm afraid it isn't really used anywhere, for unfortunate reasons.

Namely, the client sends the server not a list of changed fields, but a lambda - a function that takes the old record and returns a new record, that is, Record -> Record.

Then the first client would send the server a lambda like this:

l1 record = checksum (record {title = 'New title'})

And the second one - a lambda like this:
l2 record = checksum (record {descr = 'New description'})

The server then goes and does:
intermediateRecord = l1 oldRecord;
newRecord = l2 intermediateRecord;
and saves newRecord to the database.

The lambdas are applied in the order they were received, and each one updates only the data it wants to.

In principle, this is how stored procedures work, yes :)

alexandroid said...

I think this example is overcomplicated because of the initial design.

I would like to suggest a metaphor/example: think of electronic trading.

Clients (trading applications) send requests to the exchange (server) to add an order (a new structure), modify an existing order (update the structure), or remove the order. An order may be simultaneously managed (modified, canceled) by different traders authorized on the same account. Also, an order can be filled at any time by the exchange, so the client's next request to modify/cancel it will be rejected.

So, such protocols are usually implemented by assigning IDs to different requests and object states over time. For example, the original structure (order) has ID 1. When a client sends a request to update it, it says "please replace the order with ID=1 with this info". Updating the structure on the server automatically gives it a new ID, which is sent back to all clients.

After the modification is done, all clients have to use the new ID (say, 2) for any further actions on this order. If there are several simultaneous requests, only one of them will be accepted and all the others will be rejected (since the order ID changes as soon as the first modification request is accepted).

So, here we have several key points:

1) Updates are atomic (i.e. clients update the entire structure at once).

2) The structure has an ID which works as a "key" for accessing it in any request. If the object changes, its key changes too and is broadcast back to all clients.

3) There is no way requests from clients carrying the same (now stale) key can be applied sequentially to the same object. To me it seems very natural -- after the 1st client has updated the structure, there is no sense in overwriting it with the information from the 2nd client, since the 2nd client most probably did its work without knowledge of the change from the 1st.

Hope it helps.