17 December 2012
Investigating How MongoDB's "Majority" Write Concern Works


I've recently been studying an online MongoDB course from 10gen, the developers of MongoDB. MongoDB is a NoSQL database which offers a ton of features, including replica sets with automatic failover.

This isn't a beginner's HOWTO on MongoDB; for that, head over to the MongoDB site and hit the "Try It Out" button. Instead, this article addresses a question posed by @Kajten on the 10gen course forums, in a thread entitled w = 'majority'...

In documentation we have - This ensures that write operation will never be subject to a rollback in the course of normal operation.

If happens some kind network error or server down - this flag 100% guaranteed that rollback not will be or if data dont write maybe rollback ?

For a little bit of background, one server in a MongoDB replica set is classed as the primary, and all writes must go to it. The primary then uses asynchronous replication to send the writes to the secondary servers in the set. We can set what is known as a write concern to ensure that our writes are received by a minimum number of servers, throwing an error if that doesn't occur.
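
The acknowledgement rule can be sketched in a few lines of JavaScript (a toy model of the rule, not MongoDB code; the function and variable names are my own):

```javascript
// Toy model of write-concern acknowledgement (not MongoDB internals).
// "acks" is the list of servers that reported receiving the write.
function checkWriteConcern(acks, w, totalServers) {
  // w may be a number, or the string "majority"
  var needed = (w === "majority")
    ? Math.floor(totalServers / 2) + 1
    : w;
  return acks.length >= needed; // true = acknowledged, false = would time out
}

// Three-server set: primary plus one secondary acked, one secondary down.
console.log(checkWriteConcern(["A", "C"], "majority", 3)); // true: 2 of 3 is a majority
console.log(checkWriteConcern(["A"], "majority", 3));      // false: 1 of 3 is not
```

The real driver raises an error (or getLastError reports one) when the acknowledgement count isn't reached within the timeout; the toy just returns false.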

The MongoDB documentation claims that if we set the write concern (w) to "majority", then even after network issues or servers going down, the write will persist through the problems and is guaranteed not to be rolled back.

When Do Rollbacks Normally Occur?

A rollback would occur if a server comes back online after an outage holding data which the rest of the replica set is not aware of. Because many other write operations may have been carried out on the replica set while that server was offline, its unreplicated writes can't simply be applied to the other servers, and so they are rolled back.

The rolled-back write is not lost completely ... it's recorded in a rollback log on the server, but the complexity of re-applying such "lost" writes is so great that it's generally not done.

Instead, for important writes, it's typical to use a write concern of 'majority' so that we know they won't be subject to rollback.

Simple Case

In simple terms, it's quite easy to see why this would be the case. If we know for certain that the write was received by a majority of servers (a successful write with concern "majority" proves this happened) and the primary then fails, the remaining servers will "elect" a new primary to take over. However, they can only carry out such an election if a majority of servers are up and communicating with each other.

So, when a new primary is elected, we know that a majority of servers received the write, and that a majority of servers are up. Simple maths ensures that at least one of the servers still up received the write. MongoDB (under normal circumstances) will choose as the new primary a server which has the latest version of the data, hence a server which received the write. Put simply, the write persists because the new primary has it.
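
The counting argument can be checked mechanically (a quick sketch of the arithmetic, not MongoDB code):

```javascript
// Pigeonhole check: with N servers, if ackedCount acknowledged the write
// and upCount are still up, the two groups must share at least
// ackedCount + upCount - N servers.
function guaranteedOverlap(ackedCount, upCount, total) {
  return ackedCount + upCount - total; // minimum possible size of the intersection
}

var n = 5;
var majority = Math.floor(n / 2) + 1; // 3 of 5
console.log(guaranteedOverlap(majority, majority, n)); // 1: at least one up server has the write
```

Two majorities of the same set always overlap, which is exactly why "majority" is the magic value here: any smaller write concern can leave the overlap at zero.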

A More Complex Example

I then started to think of some edge cases where the normal circumstances wouldn't be met. It's possible to configure a MongoDB server such that it can never be elected as a primary. I then envisaged a setup where I didn't quite know how the new primary would end up with the write. Consider a replica set with three nodes, as follows:

  - Server A: the initial primary.
  - Server B: a secondary which is eligible to become primary.
  - Server C: a secondary configured so that it can never become primary (priority 0).

So, let's consider the following steps.

  1. Server B goes down.
  2. Data is written to the primary server with write concern 'majority'.
  3. The fact that this write succeeds means we know it was received by server C, but not B (as B is down).
  4. Server A goes down.
  5. No election takes place, as we don't have the majority of servers up.
  6. Server B comes back up.
  7. This triggers an election (we now have the majority of servers up) and server B is elected primary (C is configured so it cannot become primary).
  8. The new primary server does not have the write when it is elected.
  9. Server A (the old primary) comes back up. Is the write which A and C are aware of rolled back at any point?
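
The steps above can be walked through in a toy simulation (my own sketch: the election rules here are deliberately simplified to "a majority of servers up" and "priority > 0 required", which is all this scenario depends on):

```javascript
// Toy model of the three-server scenario (not MongoDB internals).
var servers = {
  A: { up: true, priority: 1, hasWrite: false },
  B: { up: true, priority: 1, hasWrite: false },
  C: { up: true, priority: 0, hasWrite: false } // priority 0: never primary
};

function electPrimary(servers) {
  var names = Object.keys(servers);
  var up = names.filter(function (n) { return servers[n].up; });
  if (up.length <= names.length / 2) return null;      // no majority, no election
  var eligible = up.filter(function (n) { return servers[n].priority > 0; });
  return eligible[0] || null;                          // simplified: first eligible wins
}

servers.B.up = false;                // 1. B goes down
servers.A.hasWrite = true;           // 2-3. majority write lands on A and C
servers.C.hasWrite = true;
servers.A.up = false;                // 4. A goes down
console.log(electPrimary(servers));  // null: only C is up, no majority (step 5)
servers.B.up = true;                 // 6. B comes back
console.log(electPrimary(servers));  // "B": elected (steps 6-7)...
console.log(servers.B.hasWrite);     // false: ...but B lacks the write (step 8)
```

The simulation ends at step 8: a freshly elected primary that doesn't hold the majority-acknowledged write, which is exactly the situation investigated below.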

Running MongoDB to Investigate

The following steps work on Windows. To run on Mac OS X or Linux, remove the "start" from the beginning of the commands that run mongod (the MongoDB server process) and add the --fork switch, which achieves the same thing but is not available on Windows.

First, we'll fire up our three servers. We'll run them all on localhost and use different port numbers.

mkdir a
mkdir b
mkdir c
start mongod --smallfiles --oplogSize 50 --replSet rs --dbpath a --port 27017
start mongod --smallfiles --oplogSize 50 --replSet rs --dbpath b --port 27018
start mongod --smallfiles --oplogSize 50 --replSet rs --dbpath c --port 27019

Now, we'll connect to our primary (which is running on the default port) by running mongo. At the Mongo shell prompt, we'll configure replication like this.

config = { _id: "rs", members: [
    {_id: 0, host: "localhost:27017"},
    {_id: 1, host: "localhost:27018"},
    {_id: 2, host: "localhost:27019", priority: 0}
]}
rs.initiate(config)

Running rs.status() periodically will show you the state of the set. Keep looking at the stateStr property of each entry in the members array of the status output. After around a minute, the replica set is configured and working properly, indicated by one server having the state PRIMARY and the other two being SECONDARY. The prompt will eventually change to rs:PRIMARY> to indicate we're now connected to the primary server.
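
If you'd rather not scan the whole rs.status() document each time, a one-liner can print just the names and states. The mongo shell is JavaScript, so something like the following works there; below it runs against a hand-written sample shaped like the real members array (the host names and states are illustrative):

```javascript
// Summarise replica-set member states. "members" is shaped like the
// members array returned by rs.status() (a hand-written sample here).
function summarise(members) {
  return members.map(function (m) {
    return m.name + ":" + m.stateStr;
  }).join(" ");
}

var members = [
  { name: "localhost:27017", stateStr: "PRIMARY" },
  { name: "localhost:27018", stateStr: "SECONDARY" },
  { name: "localhost:27019", stateStr: "SECONDARY" }
];
console.log(summarise(members));
// localhost:27017:PRIMARY localhost:27018:SECONDARY localhost:27019:SECONDARY
```

In the shell itself you'd feed it the live data: summarise(rs.status().members).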

Kill server B's process (the one started with --dbpath b) by choosing its window and pressing Ctrl-C in it.

Now, back in the Mongo shell, write some data and check the write concern like this (the collection name foo is arbitrary):

db.foo.insert({x:1})
db.getLastError("majority", 5000)

The getLastError call should return null if the insert was received by the majority of servers; the 5000 means wait up to 5000 milliseconds before reporting a timeout.

Kill server A, and then restart server B by typing the second start mongod command again at a command prompt.

So, what happens?

Server B is elected primary (as it's the only running server capable of becoming primary, and the majority of servers are up). What's really interesting, though, is that server B checks its peers (only server C at this point) and determines that C has more recent data than it does. Server B then pulls the write from the secondary server and applies it, bringing itself up to date.

This is unusual, as we normally think of writes as being propagated from the primary server to the secondaries, yet in an election like this, it seems that primaries can actually pull data from secondary servers.
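
The catch-up behaviour can be sketched as comparing oplogs and copying the missing operations (a toy model of the idea; MongoDB's real sync protocol compares oplog timestamps and is considerably more involved):

```javascript
// Toy oplog catch-up: the new primary syncs from whichever reachable
// peer has the most operations. Assumes the shorter oplog is a prefix
// of the longer one, which holds in this scenario.
function catchUp(newPrimary, peers) {
  var best = peers.reduce(function (a, b) {
    return b.oplog.length > a.oplog.length ? b : a;
  });
  // Apply every operation the new primary has not seen yet.
  var missing = best.oplog.slice(newPrimary.oplog.length);
  newPrimary.oplog = newPrimary.oplog.concat(missing);
  return missing.length; // number of operations pulled from the peer
}

var b = { name: "B", oplog: [] };                                // elected without the write
var c = { name: "C", oplog: [{ op: "insert", doc: { x: 1 } }] }; // secondary that has it
console.log(catchUp(b, [c])); // 1: B pulled the insert from C
console.log(b.oplog.length);  // 1: the write survives on the new primary
```

So the write concern guarantee holds: the majority-acknowledged operation ends up on the new primary even though that primary was down when the write happened.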

Once the new primary becomes available, we can connect to it by running mongo --port 27018. Then, at the shell, query the collection used for the insert (foo here):

db.foo.find()

and you will see that the record with x=1 inserted above is returned. In other words, the data was persisted despite the outages.

I thought this technique of replicating recent data from secondaries to the primary was fascinating. Job well done, MongoDB!