The unexpected side effects of converting a single-threaded service into a multi-threaded, multi-instance service.

We’re in the middle of one of our most critical migrations: moving to the cloud. One of the terms that comes up most often in this shift is scale: the ability to run multiple instances of something without worrying about the operational overhead.

During this migration, we are looking at ways to parallelize pretty much every background service. One such service is our External Clicks worker. Since we were in a hurry and needed to migrate ~500GB of data to the new servers, we decided to run multiple instances of this worker.

All was well. Well, almost.

Somehow, this worker ended up duplicating data. Digging through the app logic, we saw that this service had “Get Or Create” logic; think of it as an Upsert. With multiple workers running in parallel, several of them checked for the same row at almost the same time, each found it missing, and each went on to create it.

This is what you would call a Race Condition.

Enough jibber-jabber. Let’s see some code.

/* Code snippet from "Problem.cs"
 * Here, we try to "GetOrCreateRow" with Id 3 (1 and 2 are already taken).
 * We invoke 3 such operations in parallel.
 */

var id = 3;
var op = new NoLockOperation();

Parallel.Invoke(
  () => {
    op.GetOrCreateRow(id);
  },
  () => {
    op.GetOrCreateRow(id);
  },
  () => {
    op.GetOrCreateRow(id);
  });

The method definition for GetOrCreateRow looks like this:

public Row GetOrCreateRow(int id) 
{
  var currentThread = Thread.CurrentThread;
  Console.WriteLine("In GetOrCreateRow(" + id + ") | Thread Id : " + currentThread.ManagedThreadId);
  // Check-then-act without a lock: another thread can run between the check and the Add.
  var exists = rows.Exists(e => e.Id == id);
  if (!exists) {
    rows.Add(new Row(id, "Row-" + id));
  }
  var firstMatch = rows.Find(e => e.Id == id);
  return firstMatch;
}

Since there is no lock on this shared resource, each thread can pass the existence check before any other thread has added the row. Each thread then creates the Row with Id 3, and the data is duplicated (well, triplicated in this case).

Not to worry though; this is not an uncommon situation.

Let’s take a cue from this SO answer and change the definition of GetOrCreateRow:

/*
 * From "LockedOperation.cs"
 */
...

private static readonly Object locker = new Object();

...

public Row GetOrCreateRow(int id) 
{
  var currentThread = Thread.CurrentThread;
  Console.WriteLine("In GetOrCreateRow(" + id + ") | Thread Id : " + currentThread.ManagedThreadId);
  lock(locker) {
    var exists = rows.Exists(e => e.Id == id);
    if (!exists) {
      rows.Add(new Row(id, "Row-" + id));
    }
    var firstMatch = rows.Find(e => e.Id == id);
    return firstMatch;
  }
}

Well yeah, it’s that simple!

The output of the program is shown below:

bash-3.2$ dotnet run
Project RaceConditionSample (.NETCoreApp,Version=v1.0) was previously compiled. Skipping compilation.
The Problem :
----------
In GetOrCreateRow(3) | Thread Id : 1
In GetOrCreateRow(3) | Thread Id : 3
In GetOrCreateRow(3) | Thread Id : 4
Sleeping for 5 seconds..
Current state of rows : 1,2,3,3,3

The Solution :
----------
In GetOrCreateRow(3) | Thread Id : 5
In GetOrCreateRow(3) | Thread Id : 1
In GetOrCreateRow(3) | Thread Id : 4
Sleeping for 5 seconds..
Current state of rows : 1,2,3

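There are more ways to synchronize a multithreaded application and avoid these race conditions. Take a look at MutexOperation.cs to see the Mutex variation of this demo. Keep in mind that lock only coordinates threads inside a single process; when the workers run as separate processes or on separate machines, the check has to be guarded by something they all share.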
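For readers who don't have the sample project at hand, here is a rough sketch of what a Mutex-based GetOrCreateRow could look like. This is not the contents of MutexOperation.cs; the MutexOperationSketch class, the RowCount helper, and the minimal Row type are assumptions made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for the demo's Row type.
public class Row
{
    public int Id { get; }
    public string Name { get; }

    public Row(int id, string name)
    {
        Id = id;
        Name = name;
    }
}

// Illustrative only: a guess at a Mutex-guarded "Get Or Create".
public class MutexOperationSketch
{
    private readonly Mutex mutex = new Mutex();

    // Ids 1 and 2 are already taken, as in the demo.
    private readonly List<Row> rows = new List<Row>
    {
        new Row(1, "Row-1"),
        new Row(2, "Row-2")
    };

    public Row GetOrCreateRow(int id)
    {
        // Block until this thread owns the mutex.
        mutex.WaitOne();
        try
        {
            // The check and the insert now happen as one unit.
            var match = rows.Find(e => e.Id == id);
            if (match == null)
            {
                match = new Row(id, "Row-" + id);
                rows.Add(match);
            }
            return match;
        }
        finally
        {
            // Release on the same thread, even if something above throws.
            mutex.ReleaseMutex();
        }
    }

    // Hypothetical helper so callers can inspect the result safely.
    public int RowCount()
    {
        mutex.WaitOne();
        try
        {
            return rows.Count;
        }
        finally
        {
            mutex.ReleaseMutex();
        }
    }
}
```

The try/finally matters more here than with lock: lock releases automatically when the block exits, while a Mutex stays held until ReleaseMutex is called by the thread that acquired it.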

I also wanted to shed some light on how we use multithreading and mutex synchronization in our background services. We have the classic increment-decrement counter situation, and we use this counter to scale out our workers.

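public void UpdateWorkerCount(bool increment) 
{
  // Only one thread at a time may touch the counter.
  mutex.WaitOne();
  try {
    if (increment)
      ++counter;
    else
      --counter;
  }
  finally {
    // Release the mutex even if something above throws.
    mutex.ReleaseMutex();
  }
}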
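If the counter only needs to be shared between threads of a single process, the Interlocked class is a lighter alternative to a Mutex for this kind of update. The sketch below is an assumption about how that could look, not what our services actually run; the WorkerCounterSketch class and its Current property are hypothetical names.

```csharp
using System.Threading;

// Illustrative only: an in-process worker counter without a Mutex.
public class WorkerCounterSketch
{
    private int counter;

    // Read the latest value written by any thread.
    public int Current
    {
        get { return Volatile.Read(ref counter); }
    }

    public void UpdateWorkerCount(bool increment)
    {
        // Each call is a single atomic read-modify-write on the counter.
        if (increment)
            Interlocked.Increment(ref counter);
        else
            Interlocked.Decrement(ref counter);
    }
}
```

This only covers a single integer inside one process. Once an update has to touch more than one piece of state, a lock or a Mutex is still the simpler tool.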