Mongo - Network Failures

MongoDB is designed to support distributed operations and as such can tolerate network failures. In this notebook, we'll see how a MongoDB replica set fares in the face of simulated network failures.

(More technical details about how the MongoDB replica set is configured can be found in the Dockering notebook.)

In [1]:
from dockeringMongo import *
In [3]:
c = docker.Client(base_url='unix://var/run/docker.sock',
                  version='1.10',
                  timeout=10)
In [90]:
#Fire up three MongoDB containers that we'll use in a replica set

#STUB is an identifier used to label the nodes
#Superstition - short stub required?
STUB='rs3'
rsc=rs_config(c,STUB,num=3)
rsc
Out[90]:
{'members': [{'host': '172.17.0.2:27017', '_id': 0},
  {'host': '172.17.0.3:27017', '_id': 1},
  {'host': '172.17.0.4:27017', '_id': 2}],
 '_id': 'rs3'}
In [91]:
showContainers(c)
['/rs3_srv2'] Up Less than a second
['/rs3_srv1'] Up 1 seconds
['/rs3_srv0'] Up 1 seconds
In [92]:
#tidyAwayContainers(c,['rs3_srv0','rs3_srv1','rs3_srv2'])
In [93]:
docker_ps(c)
Out[93]:
[{'Created': 1415618951,
  'Names': ['/rs3_srv2'],
  'Ports': [{'PublicPort': 49185,
    'IP': '0.0.0.0',
    'Type': 'tcp',
    'PrivatePort': 27017}],
  'Command': 'usr/bin/mongod --replSet rs3 --noprealloc --smallfiles',
  'Image': 'dev24/mongodb:latest',
  'Id': '7b694cdab9f8e52f730b20e92470a7e8aae4d11b5434dce98a2ada7659fac921',
  'Status': 'Up 3 seconds'},
 {'Created': 1415618951,
  'Names': ['/rs3_srv1'],
  'Ports': [{'PublicPort': 49184,
    'IP': '0.0.0.0',
    'Type': 'tcp',
    'PrivatePort': 27017}],
  'Command': 'usr/bin/mongod --replSet rs3 --noprealloc --smallfiles',
  'Image': 'dev24/mongodb:latest',
  'Id': 'bae7907f61baa7fb16ccb481f06dcf1f9269ead7c4047f8241ebedb5e11a0b3b',
  'Status': 'Up 3 seconds'},
 {'Created': 1415618951,
  'Names': ['/rs3_srv0'],
  'Ports': [{'PublicPort': 49183,
    'IP': '0.0.0.0',
    'Type': 'tcp',
    'PrivatePort': 27017}],
  'Command': 'usr/bin/mongod --replSet rs3 --noprealloc --smallfiles',
  'Image': 'dev24/mongodb:latest',
  'Id': '256dc64e421a825766deceacc55869679633dd2c244267dbd909ec6628a59727',
  'Status': 'Up 3 seconds'}]
In [94]:
#Initialise the replica set
from pymongo import MongoClient

#We'll use the 0th server in the set as the node we connect to
mc = MongoClient('localhost', get27017tcp_port(c,STUB+'_srv0'),replicaset=STUB)
In [95]:
#Set up connections to the other members of the replica set
client1=MongoClient('localhost', get27017tcp_port(c,STUB+'_srv1'),replicaset=STUB)
client2=MongoClient('localhost', get27017tcp_port(c,STUB+'_srv2'),replicaset=STUB)
In [96]:
#In the mongo shell, we would typically use rs.initiate(CONFIG) to initialise the replica set
#Here, we use the replSetInitiate admin command, applying it with the desired configuration
mc.admin.command( "replSetInitiate",rsc);
In [98]:
#We may need to wait a minute or two for the configuration to come up
#If you get an error message that suggests the configuration isn't up yet, wait a few seconds then rerun the cell
mc.admin.command('replSetGetStatus')
Out[98]:
{'date': datetime.datetime(2014, 11, 10, 11, 29, 35),
 'members': [{'self': True,
   '_id': 0,
   'stateStr': 'STARTUP2',
   'optime': Timestamp(1415618964, 1),
   'name': '172.17.0.2:27017',
   'uptime': 24,
   'health': 1.0,
   'state': 5,
   'optimeDate': datetime.datetime(2014, 11, 10, 11, 29, 24)},
  {'health': 1.0,
   'optime': Timestamp(0, 0),
   'optimeDate': datetime.datetime(1970, 1, 1, 0, 0),
   'lastHeartbeat': datetime.datetime(2014, 11, 10, 11, 29, 34),
   'pingMs': 1,
   '_id': 1,
   'lastHeartbeatMessage': 'still initializing',
   'stateStr': 'UNKNOWN',
   'name': '172.17.0.3:27017',
   'uptime': 1,
   'lastHeartbeatRecv': datetime.datetime(1970, 1, 1, 0, 0),
   'state': 6},
  {'health': 1.0,
   'optime': Timestamp(0, 0),
   'optimeDate': datetime.datetime(1970, 1, 1, 0, 0),
   'lastHeartbeat': datetime.datetime(2014, 11, 10, 11, 29, 34),
   'pingMs': 0,
   '_id': 2,
   'lastHeartbeatMessage': 'still initializing',
   'stateStr': 'UNKNOWN',
   'name': '172.17.0.4:27017',
   'uptime': 1,
   'lastHeartbeatRecv': datetime.datetime(1970, 1, 1, 0, 0),
   'state': 6}],
 'ok': 1.0,
 'myState': 5,
 'set': 'rs3'}
In [41]:
#The statusReport() function lets you ask a MongoDB node what it thinks the state of the world is
statusReport(mc)
I am machine id 0 as PRIMARY
Machine id 1 (SECONDARY) last heard from at 2014-11-10 10:31:56
Machine id 2 (SECONDARY) last heard from at 2014-11-10 10:31:56
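statusReport() comes from the notebook's own dockeringMongo helpers. As a guess at how it might work, the sketch below simply reformats the document returned by the replSetGetStatus admin command into the style seen in the output above (the formatStatus helper name is made up here for testability):

```python
def formatStatus(status):
    """Render a replSetGetStatus document as one line per member."""
    lines = []
    for m in status['members']:
        if m.get('self'):
            # This is the node we asked - report its own state
            lines.append("I am machine id %s as %s" % (m['_id'], m['stateStr']))
        else:
            # A remote member - report its state and last heartbeat time
            lines.append("Machine id %s (%s) last heard from at %s"
                         % (m['_id'], m['stateStr'], m['lastHeartbeat']))
    return '\n'.join(lines)

def statusReport(client):
    """Print a node's view of the replica set (a sketch, not the notebook's actual code)."""
    print(formatStatus(client.admin.command('replSetGetStatus')))
```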
In [99]:
#If we're quick, we can watch the machines come up into the fold
statusReport(client1)
Machine id 0 (UNKNOWN) last heard from at 1970-01-01 00:00:00
I am machine id 1 as STARTUP2
Machine id 2 (UNKNOWN) last heard from at 1970-01-01 00:00:00
In [100]:
statusReport(client1)
Machine id 0 (PRIMARY) last heard from at 2014-11-10 11:29:58
I am machine id 1 as STARTUP2
Machine id 2 (STARTUP2) last heard from at 2014-11-10 11:29:58
In [101]:
statusReport(client1)
Machine id 0 (PRIMARY) last heard from at 2014-11-10 11:30:24
I am machine id 1 as SECONDARY
Machine id 2 (SECONDARY) last heard from at 2014-11-10 11:30:24

Modelling Network Errors

In this section we'll use firewall rules between containers to model network partitions.

In [103]:
#The blockade library contains functions for setting up and tearing down iptables firewall rulesets
from blockade import *
In [104]:
#As we define blockade rules by IP address, we need to know those addresses
rsc
Out[104]:
{'members': [{'host': '172.17.0.2:27017', '_id': 0},
  {'host': '172.17.0.3:27017', '_id': 1},
  {'host': '172.17.0.4:27017', '_id': 2}],
 '_id': 'rs3'}
In [105]:
#Maybe worth creating an object that lists stuff like this by instance name?
memberIPaddress(rsc,'rs3_srv2')
Out[105]:
'172.17.0.4:27017'
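memberIPaddress() is another of this notebook's local helpers; a plausible implementation, assuming the STUB_srvN container naming convention maps N onto the replica set member's _id, might look like this:

```python
def memberIPaddress(rsc, container_name):
    """Look up a member's host:port in a replica set config by container name.

    Assumes the convention used in this notebook: container STUB_srvN
    holds the replica set member with _id == N.
    """
    # 'rs3_srv2' -> member _id 2
    member_id = int(container_name.rsplit('_srv', 1)[1])
    for member in rsc['members']:
        if member['_id'] == member_id:
            return member['host']
    raise KeyError(container_name)
```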

The partition_containers() function creates a named ruleset that puts listed IP addresses into separate partitions.
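The general idea behind such a ruleset can be sketched: DROP packets between every cross-partition pair of addresses, in a dedicated chain named after the ruleset so the rules can be torn down as a group later. The command strings below are illustrative only - the real partition_containers() in the local blockade library may build its rules differently:

```python
from itertools import product

def partition_rules(ruleset, partitions):
    """Generate iptables commands that DROP traffic between every pair
    of IP addresses sitting in different partitions."""
    cmds = ['iptables -N %s' % ruleset,            # create a chain named after the ruleset
            'iptables -I FORWARD -j %s' % ruleset]  # route forwarded traffic through it
    for i, part_a in enumerate(partitions):
        for part_b in partitions[i + 1:]:
            for src, dst in product(part_a, part_b):
                # drop traffic in both directions across the partition boundary
                cmds.append('iptables -A %s -s %s -d %s -j DROP' % (ruleset, src, dst))
                cmds.append('iptables -A %s -s %s -d %s -j DROP' % (ruleset, dst, src))
    return cmds
```

Traffic within a partition is untouched; only cross-partition pairs get DROP rules, which is what lets each group keep talking among itself.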

Let's start by splitting off one of the secondary machines - rs3_srv2, at IP address 172.17.0.4 - into a partition of its own:

In [106]:
#The machines in each partition are put into their own lists
#partition_containers(RULESET, [ PARTITION_1_LIST, PARTITION_2_LIST ])
partition_containers('test1w2s2', [ [netobj('172.17.0.4')],[netobj('172.17.0.2'),netobj('172.17.0.3')]])

Wait a few seconds for the machines to realise all is not well before proceeding...
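Rather than guessing how long to wait, we could poll until a condition holds; a small helper along these lines (not part of the notebook's libraries) would do:

```python
import time

def wait_until(predicate, timeout=30, interval=1):
    """Poll predicate() until it returns True or timeout seconds elapse.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

For example, `wait_until(lambda: 'not reachable' in mc.admin.command('replSetGetStatus')['members'][2]['stateStr'])` would block until the node behind `mc` reports member 2 as unreachable.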

...then check their statuses:

In [109]:
statusReport(mc)
I am machine id 0 as PRIMARY
Machine id 1 (SECONDARY) last heard from at 2014-11-10 11:32:42
Machine id 2 ((not reachable/healthy)) last heard from at 2014-11-10 11:32:38
In [110]:
statusReport(client1)
Machine id 0 (PRIMARY) last heard from at 2014-11-10 11:32:56
I am machine id 1 as SECONDARY
Machine id 2 ((not reachable/healthy)) last heard from at 2014-11-10 11:32:50
In [111]:
statusReport(client2)
Machine id 0 ((not reachable/healthy)) last heard from at 2014-11-10 11:32:46
Machine id 1 ((not reachable/healthy)) last heard from at 2014-11-10 11:32:51
I am machine id 2 as SECONDARY

Let's fix the network...

In [112]:
clear_iptables('test1w2s2')
In [113]:
#Wait a second or two, then see how things are...
statusReport(mc)
I am machine id 0 as PRIMARY
Machine id 1 (SECONDARY) last heard from at 2014-11-10 11:34:16
Machine id 2 (SECONDARY) last heard from at 2014-11-10 11:34:17

What happens if we knock out the connection between the primary server and the secondary servers?

In [114]:
#Put the primary server into a partition on its own
partition_containers('test1w2s2', [ [netobj('172.17.0.2')],[netobj('172.17.0.3'),netobj('172.17.0.4')]])

Again, wait a few seconds for things to happen... then check the statuses:

In [115]:
statusReport(mc)
I am machine id 0 as SECONDARY
Machine id 1 ((not reachable/healthy)) last heard from at 2014-11-10 11:34:48
Machine id 2 ((not reachable/healthy)) last heard from at 2014-11-10 11:34:49
In [116]:
statusReport(client1)
Machine id 0 ((not reachable/healthy)) last heard from at 2014-11-10 11:34:52
I am machine id 1 as SECONDARY
Machine id 2 (PRIMARY) last heard from at 2014-11-10 11:35:02
In [117]:
statusReport(client2)
Machine id 0 ((not reachable/healthy)) last heard from at 2014-11-10 11:34:53
Machine id 1 (SECONDARY) last heard from at 2014-11-10 11:35:03
I am machine id 2 as PRIMARY
In [118]:
clear_iptables('test1w2s2')
In [119]:
statusReport(mc)
I am machine id 0 as SECONDARY
Machine id 1 (SECONDARY) last heard from at 2014-11-10 11:35:55
Machine id 2 (PRIMARY) last heard from at 2014-11-10 11:35:54
In [120]:
statusReport(client1)
Machine id 0 (SECONDARY) last heard from at 2014-11-10 11:35:57
I am machine id 1 as SECONDARY
Machine id 2 (PRIMARY) last heard from at 2014-11-10 11:35:58
In [121]:
statusReport(client2)
Machine id 0 (SECONDARY) last heard from at 2014-11-10 11:36:14
Machine id 1 (SECONDARY) last heard from at 2014-11-10 11:36:13
I am machine id 2 as PRIMARY

Grab a copy of the logs from each of the servers to see what happened...

In [122]:
#In an ssh shell, the following sort of command displays a real-time stream of stdout log messages from the container
#!docker.io logs --follow=true {STUB}_srv1
!docker.io logs {STUB}_srv0 > {STUB}_srv0_log.txt
!docker.io logs {STUB}_srv1 > {STUB}_srv1_log.txt
!docker.io logs {STUB}_srv2 > {STUB}_srv2_log.txt
#?we may also be able to use mongostat - http://docs.mongodb.org/manual/reference/program/mongostat/

Maybe we need to use a log parser or highlighter to pull out interesting messages? Maybe we need to visualise the communication between the servers, eg with a combined timeline?
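As a first step towards that, a simple parser for the 2.4-era mongod log format shown below (Day Mon DD HH:MM:SS.mmm [component] message) can pull out the replica set state changes; the keyword list here is just a starting point:

```python
import re

# Matches e.g. 'Mon Nov 10 11:34:49.634 [rsMgr] replSet SECONDARY'
LOG_LINE = re.compile(
    r'^(?P<dow>\w{3}) (?P<month>\w{3}) (?P<day>\d+) '
    r'(?P<time>\d\d:\d\d:\d\d\.\d+) \[(?P<component>[^\]]+)\] (?P<message>.*)$')

def parse_log_line(line):
    """Split a mongod log line into (time, component, message).

    Returns None for lines that don't match (e.g. the startup 'note:' line).
    """
    m = LOG_LINE.match(line)
    if m is None:
        return None
    return m.group('time'), m.group('component'), m.group('message')

def interesting(lines, keywords=('PRIMARY', 'SECONDARY', 'DOWN', 'elect')):
    """Yield parsed lines whose message mentions replica set state changes."""
    for line in lines:
        parsed = parse_log_line(line)
        if parsed and any(k in parsed[2] for k in keywords):
            yield parsed
```

Running each server's log file through interesting() and merging the results on the time field would give the combined timeline mooted above.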

See below for the log for rs3_srv0, the original primary. There are several things to note:

  • around 11:29:42, it mutters to itself about the state of the network and whether it's electable;
  • between 11:29:48 and 11:29:49, it elects itself PRIMARY;
  • through 11:30:03, multiple connections are opened with other nodes in the replica set;
  • by 11:30:06, the replica set is up and running;
  • at 11:32:38, sight of rs3_srv2 is lost and it is reported as DOWN as we partition it;
  • through 11:33, more connections are opened to rs3_srv1 - is this a defensive measure?
  • at 11:34:07, rs3_srv2 comes back as the partition is removed;
  • at 11:34:37, the connection to rs3_srv2 is lost again, and through 11:34:4x rs3_srv0 sees the other nodes go down as a result of the partition we placed it into;
  • at 11:34:49.634, rs3_srv0 relinquishes PRIMARY;
  • at 11:35:39, the partition is removed and the connections to the other machines in the replica set are restored; rs3_srv0 doesn't elect itself back as PRIMARY - rs3_srv2 has that job now.
In [125]:
!tail -500 {STUB}_srv0_log.txt
note: noprealloc may hurt performance in many applications
Mon Nov 10 11:29:11.786 [initandlisten] MongoDB starting : pid=1 port=27017 dbpath=/data/db/ 64-bit host=256dc64e421a
Mon Nov 10 11:29:11.787 [initandlisten] db version v2.4.11
Mon Nov 10 11:29:11.787 [initandlisten] git version: fa13d1ee8da0f112f588570b4070f73d7af2f7fd
Mon Nov 10 11:29:11.787 [initandlisten] build info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Mon Nov 10 11:29:11.787 [initandlisten] allocator: tcmalloc
Mon Nov 10 11:29:11.787 [initandlisten] options: { noprealloc: true, replSet: "rs3", smallfiles: true }
Mon Nov 10 11:29:11.789 [initandlisten] journal dir=/data/db/journal
Mon Nov 10 11:29:11.790 [initandlisten] recover : no journal files present, no recovery needed
Mon Nov 10 11:29:11.953 [FileAllocator] allocating new datafile /data/db/local.ns, filling with zeroes...
Mon Nov 10 11:29:11.953 [FileAllocator] creating directory /data/db/_tmp
Mon Nov 10 11:29:12.073 [FileAllocator] done allocating datafile /data/db/local.ns, size: 16MB,  took 0.117 secs
Mon Nov 10 11:29:12.136 [FileAllocator] allocating new datafile /data/db/local.0, filling with zeroes...
Mon Nov 10 11:29:12.213 [FileAllocator] done allocating datafile /data/db/local.0, size: 16MB,  took 0.076 secs
Mon Nov 10 11:29:12.213 [initandlisten] command local.$cmd command: { create: "startup_log", size: 10485760, capped: true } ntoreturn:1 keyUpdates:0  reslen:37 260ms
Mon Nov 10 11:29:12.214 [initandlisten] waiting for connections on port 27017
Mon Nov 10 11:29:12.217 [websvr] admin web console waiting for connections on port 28017
Mon Nov 10 11:29:12.224 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Mon Nov 10 11:29:12.224 [rsStart] replSet info you may need to run replSetInitiate -- rs.initiate() in the shell -- if that is not already done
Mon Nov 10 11:29:17.687 [initandlisten] connection accepted from 172.17.42.1:36164 #1 (1 connection now open)
Mon Nov 10 11:29:21.324 [conn1] replSet replSetInitiate admin command received from client
Mon Nov 10 11:29:21.325 [conn1] replSet replSetInitiate config object parses ok, 3 members specified
Mon Nov 10 11:29:21.334 [conn1] replSet replSetInitiate all members seem up
Mon Nov 10 11:29:21.334 [conn1] ******
Mon Nov 10 11:29:21.335 [conn1] creating replication oplog of size: 1333MB...
Mon Nov 10 11:29:21.336 [FileAllocator] allocating new datafile /data/db/local.1, filling with zeroes...
Mon Nov 10 11:29:22.385 [FileAllocator] done allocating datafile /data/db/local.1, size: 511MB,  took 1.048 secs
Mon Nov 10 11:29:22.385 [FileAllocator] allocating new datafile /data/db/local.2, filling with zeroes...
Mon Nov 10 11:29:23.367 [FileAllocator] done allocating datafile /data/db/local.2, size: 511MB,  took 0.98 secs
Mon Nov 10 11:29:23.368 [FileAllocator] allocating new datafile /data/db/local.3, filling with zeroes...
Mon Nov 10 11:29:24.429 [FileAllocator] done allocating datafile /data/db/local.3, size: 511MB,  took 1.06 secs
Mon Nov 10 11:29:24.438 [conn1] ******
Mon Nov 10 11:29:24.440 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Mon Nov 10 11:29:24.440 [conn1] replSet info saving a newer config version to local.system.replset
Mon Nov 10 11:29:24.443 [conn1] replSet saveConfigLocally done
Mon Nov 10 11:29:24.443 [conn1] replSet replSetInitiate config now saved locally.  Should come online in about a minute.
Mon Nov 10 11:29:24.443 [conn1] command admin.$cmd command: { replSetInitiate: { members: [ { host: "172.17.0.2:27017", _id: 0 }, { host: "172.17.0.3:27017", _id: 1 }, { host: "172.17.0.4:27017", _id: 2 } ], _id: "rs3" } } ntoreturn:1 keyUpdates:0 locks(micros) W:3108180 reslen:112 3119ms
Mon Nov 10 11:29:34.445 [rsStart] replSet I am 172.17.0.2:27017
Mon Nov 10 11:29:34.446 [rsStart] replSet STARTUP2
Mon Nov 10 11:29:34.448 [rsHealthPoll] replSet member 172.17.0.4:27017 is up
Mon Nov 10 11:29:34.448 [rsHealthPoll] replSet member 172.17.0.3:27017 is up
Mon Nov 10 11:29:35.450 [rsSync] replSet SECONDARY
Mon Nov 10 11:29:40.453 [rsMgr] replSet not electing self, not all members up and we have been up less than 5 minutes
Mon Nov 10 11:29:42.204 [initandlisten] connection accepted from 172.17.0.3:53312 #2 (2 connections now open)
Mon Nov 10 11:29:42.205 [conn2] end connection 172.17.0.3:53312 (1 connection now open)
Mon Nov 10 11:29:42.205 [initandlisten] connection accepted from 172.17.0.3:53313 #3 (2 connections now open)
Mon Nov 10 11:29:42.296 [initandlisten] connection accepted from 172.17.0.4:36754 #4 (3 connections now open)
Mon Nov 10 11:29:42.299 [conn4] end connection 172.17.0.4:36754 (2 connections now open)
Mon Nov 10 11:29:42.300 [initandlisten] connection accepted from 172.17.0.4:36755 #5 (3 connections now open)
Mon Nov 10 11:29:42.461 [rsHealthPoll] replset info 172.17.0.4:27017 thinks that we are down
Mon Nov 10 11:29:42.462 [rsHealthPoll] replSet member 172.17.0.4:27017 is now in state STARTUP2
Mon Nov 10 11:29:42.464 [rsHealthPoll] replset info 172.17.0.3:27017 thinks that we are down
Mon Nov 10 11:29:42.467 [rsHealthPoll] replSet member 172.17.0.3:27017 is now in state STARTUP2
Mon Nov 10 11:29:42.468 [rsMgr] not electing self, 172.17.0.4:27017 would veto with 'I don't think 172.17.0.2:27017 is electable'
Mon Nov 10 11:29:42.469 [rsMgr] not electing self, 172.17.0.4:27017 would veto with 'I don't think 172.17.0.2:27017 is electable'
Mon Nov 10 11:29:48.469 [rsMgr] replSet info electSelf 0
Mon Nov 10 11:29:49.456 [rsMgr] replSet PRIMARY
Mon Nov 10 11:29:58.228 [conn3] end connection 172.17.0.3:53313 (2 connections now open)
Mon Nov 10 11:29:58.229 [initandlisten] connection accepted from 172.17.0.3:53321 #6 (3 connections now open)
Mon Nov 10 11:29:58.328 [conn5] end connection 172.17.0.4:36755 (2 connections now open)
Mon Nov 10 11:29:58.329 [initandlisten] connection accepted from 172.17.0.4:36761 #7 (3 connections now open)
Mon Nov 10 11:30:02.871 [initandlisten] connection accepted from 172.17.0.4:36763 #8 (4 connections now open)
Mon Nov 10 11:30:02.892 [conn8] end connection 172.17.0.4:36763 (3 connections now open)
Mon Nov 10 11:30:03.073 [initandlisten] connection accepted from 172.17.0.3:53325 #9 (4 connections now open)
Mon Nov 10 11:30:03.090 [conn9] end connection 172.17.0.3:53325 (3 connections now open)
Mon Nov 10 11:30:03.234 [initandlisten] connection accepted from 172.17.0.3:53326 #10 (4 connections now open)
Mon Nov 10 11:30:03.320 [initandlisten] connection accepted from 172.17.0.4:36766 #11 (5 connections now open)
Mon Nov 10 11:30:03.894 [initandlisten] connection accepted from 172.17.0.4:36767 #12 (6 connections now open)
Mon Nov 10 11:30:04.094 [initandlisten] connection accepted from 172.17.0.3:53329 #13 (7 connections now open)
Mon Nov 10 11:30:04.482 [rsHealthPoll] replSet member 172.17.0.4:27017 is now in state RECOVERING
Mon Nov 10 11:30:04.490 [rsHealthPoll] replSet member 172.17.0.3:27017 is now in state RECOVERING
Mon Nov 10 11:30:04.902 [slaveTracking] build index local.slaves { _id: 1 }
Mon Nov 10 11:30:04.904 [slaveTracking] build index done.  scanned 0 total records. 0.001 secs
Mon Nov 10 11:30:06.484 [rsHealthPoll] replSet member 172.17.0.4:27017 is now in state SECONDARY
Mon Nov 10 11:30:06.491 [rsHealthPoll] replSet member 172.17.0.3:27017 is now in state SECONDARY
Mon Nov 10 11:30:28.249 [conn6] end connection 172.17.0.3:53321 (6 connections now open)
Mon Nov 10 11:30:28.249 [initandlisten] connection accepted from 172.17.0.3:53333 #14 (7 connections now open)
Mon Nov 10 11:30:28.347 [conn7] end connection 172.17.0.4:36761 (6 connections now open)
Mon Nov 10 11:30:28.347 [initandlisten] connection accepted from 172.17.0.4:36773 #15 (8 connections now open)
Mon Nov 10 11:30:58.272 [conn14] end connection 172.17.0.3:53333 (6 connections now open)
Mon Nov 10 11:30:58.273 [initandlisten] connection accepted from 172.17.0.3:53339 #16 (7 connections now open)
Mon Nov 10 11:30:58.372 [conn15] end connection 172.17.0.4:36773 (6 connections now open)
Mon Nov 10 11:30:58.372 [initandlisten] connection accepted from 172.17.0.4:36779 #17 (7 connections now open)
Mon Nov 10 11:31:28.299 [conn16] end connection 172.17.0.3:53339 (6 connections now open)
Mon Nov 10 11:31:28.299 [initandlisten] connection accepted from 172.17.0.3:53345 #18 (7 connections now open)
Mon Nov 10 11:31:28.397 [conn17] end connection 172.17.0.4:36779 (6 connections now open)
Mon Nov 10 11:31:28.397 [initandlisten] connection accepted from 172.17.0.4:36785 #19 (8 connections now open)
Mon Nov 10 11:31:58.324 [conn18] end connection 172.17.0.3:53345 (6 connections now open)
Mon Nov 10 11:31:58.324 [initandlisten] connection accepted from 172.17.0.3:53351 #20 (7 connections now open)
Mon Nov 10 11:31:58.420 [conn19] end connection 172.17.0.4:36785 (6 connections now open)
Mon Nov 10 11:31:58.421 [initandlisten] connection accepted from 172.17.0.4:36791 #21 (7 connections now open)
Mon Nov 10 11:32:28.347 [conn20] end connection 172.17.0.3:53351 (6 connections now open)
Mon Nov 10 11:32:28.348 [initandlisten] connection accepted from 172.17.0.3:53357 #22 (7 connections now open)
Mon Nov 10 11:32:38.591 [rsHealthPoll] DBClientCursor::init call() failed
Mon Nov 10 11:32:38.591 [rsHealthPoll] replSet info 172.17.0.4:27017 is down (or slow to respond): 
Mon Nov 10 11:32:38.591 [rsHealthPoll] replSet member 172.17.0.4:27017 is now in state DOWN
Mon Nov 10 11:32:46.593 [rsHealthPoll] replset info 172.17.0.4:27017 heartbeat failed, retrying
Mon Nov 10 11:32:58.373 [conn22] end connection 172.17.0.3:53357 (6 connections now open)
Mon Nov 10 11:32:58.374 [initandlisten] connection accepted from 172.17.0.3:53375 #23 (7 connections now open)
Mon Nov 10 11:32:58.593 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:33:04.594 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:33:18.597 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:33:24.598 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:33:28.404 [conn23] end connection 172.17.0.3:53375 (6 connections now open)
Mon Nov 10 11:33:28.405 [initandlisten] connection accepted from 172.17.0.3:53396 #24 (7 connections now open)
Mon Nov 10 11:33:38.602 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:33:44.602 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:33:58.427 [conn24] end connection 172.17.0.3:53396 (6 connections now open)
Mon Nov 10 11:33:58.427 [initandlisten] connection accepted from 172.17.0.3:53417 #25 (8 connections now open)
Mon Nov 10 11:33:58.606 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:34:04.608 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:34:07.612 [rsHealthPoll] replset info 172.17.0.4:27017 thinks that we are down
Mon Nov 10 11:34:07.613 [rsHealthPoll] replSet member 172.17.0.4:27017 is up
Mon Nov 10 11:34:07.613 [rsHealthPoll] replSet member 172.17.0.4:27017 is now in state SECONDARY
Mon Nov 10 11:34:08.474 [initandlisten] connection accepted from 172.17.0.4:36866 #26 (8 connections now open)
Mon Nov 10 11:34:08.474 [conn26] end connection 172.17.0.4:36866 (7 connections now open)
Mon Nov 10 11:34:08.475 [initandlisten] connection accepted from 172.17.0.4:36867 #27 (9 connections now open)
Mon Nov 10 11:34:08.918 [initandlisten] connection accepted from 172.17.0.4:36868 #28 (9 connections now open)
Mon Nov 10 11:34:13.610 [conn11] end connection 172.17.0.4:36766 (8 connections now open)
Mon Nov 10 11:34:28.455 [conn25] end connection 172.17.0.3:53417 (7 connections now open)
Mon Nov 10 11:34:28.455 [initandlisten] connection accepted from 172.17.0.3:53432 #29 (9 connections now open)
Mon Nov 10 11:34:28.491 [conn27] end connection 172.17.0.4:36867 (7 connections now open)
Mon Nov 10 11:34:28.492 [initandlisten] connection accepted from 172.17.0.4:36872 #30 (8 connections now open)
Mon Nov 10 11:34:37.630 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:34:40.681 [rsHealthPoll] DBClientCursor::init call() failed
Mon Nov 10 11:34:40.681 [rsHealthPoll] replSet info 172.17.0.3:27017 is down (or slow to respond): 
Mon Nov 10 11:34:40.681 [rsHealthPoll] replSet member 172.17.0.3:27017 is now in state DOWN
Mon Nov 10 11:34:43.633 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:34:48.687 [rsHealthPoll] replset info 172.17.0.3:27017 heartbeat failed, retrying
Mon Nov 10 11:34:49.634 [rsHealthPoll] replSet info 172.17.0.4:27017 is down (or slow to respond): 
Mon Nov 10 11:34:49.634 [rsHealthPoll] replSet member 172.17.0.4:27017 is now in state DOWN
Mon Nov 10 11:34:49.634 [rsMgr] can't see a majority of the set, relinquishing primary
Mon Nov 10 11:34:49.634 [rsMgr] replSet relinquishing primary state
Mon Nov 10 11:34:49.634 [rsMgr] replSet SECONDARY
Mon Nov 10 11:34:49.634 [rsMgr] replSet closing client sockets after relinquishing primary
Mon Nov 10 11:34:49.635 [conn13] end connection 172.17.0.3:53329 (7 connections now open)
Mon Nov 10 11:34:49.635 [conn28] end connection 172.17.0.4:36868 (7 connections now open)
Mon Nov 10 11:34:49.635 [conn10] end connection 172.17.0.3:53326 (6 connections now open)
Mon Nov 10 11:34:49.635 [conn1] end connection 172.17.42.1:36164 (7 connections now open)
Mon Nov 10 11:34:49.635 [conn12] end connection 172.17.0.4:36767 (4 connections now open)
Mon Nov 10 11:34:57.637 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:34:57.637 [rsHealthPoll] replset info 172.17.0.3:27017 heartbeat failed, retrying
Mon Nov 10 11:35:00.891 [initandlisten] connection accepted from 172.17.42.1:36315 #31 (4 connections now open)
Mon Nov 10 11:35:01.635 [rsMgr] replSet can't see a majority, will not try to elect self
Mon Nov 10 11:35:03.637 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:35:09.638 [rsHealthPoll] replset info 172.17.0.3:27017 heartbeat failed, retrying
Mon Nov 10 11:35:17.642 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:35:21.642 [rsHealthPoll] replset info 172.17.0.3:27017 heartbeat failed, retrying
Mon Nov 10 11:35:23.643 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:35:33.647 [rsHealthPoll] replset info 172.17.0.3:27017 heartbeat failed, retrying
Mon Nov 10 11:35:37.649 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:35:39.474 [initandlisten] connection accepted from 172.17.0.3:53482 #32 (5 connections now open)
Mon Nov 10 11:35:39.651 [rsHealthPoll] replSet member 172.17.0.3:27017 is up
Mon Nov 10 11:35:39.651 [rsHealthPoll] replSet member 172.17.0.3:27017 is now in state SECONDARY
Mon Nov 10 11:35:39.652 [rsMgr] not electing self, 172.17.0.3:27017 would veto with '172.17.0.2:27017 is trying to elect itself but 172.17.0.4:27017 is already primary and more up-to-date'
Mon Nov 10 11:35:40.498 [initandlisten] connection accepted from 172.17.0.4:36922 #33 (6 connections now open)
Mon Nov 10 11:35:40.657 [rsHealthPoll] replSet member 172.17.0.4:27017 is up
Mon Nov 10 11:35:40.657 [rsHealthPoll] replSet member 172.17.0.4:27017 is now in state PRIMARY
Mon Nov 10 11:35:40.693 [rsBackgroundSync] replSet syncing to: 172.17.0.4:27017
Mon Nov 10 11:35:41.537 [rsSyncNotifier] replset setting oplog notifier to 172.17.0.4:27017
Mon Nov 10 11:35:41.541 [rsSyncNotifier] build index local.me { _id: 1 }
Mon Nov 10 11:35:41.543 [rsSyncNotifier] build index done.  scanned 0 total records. 0.002 secs
Mon Nov 10 11:35:57.487 [conn32] end connection 172.17.0.3:53482 (5 connections now open)
Mon Nov 10 11:35:57.487 [initandlisten] connection accepted from 172.17.0.3:53490 #34 (7 connections now open)
Mon Nov 10 11:35:58.513 [conn33] end connection 172.17.0.4:36922 (5 connections now open)
Mon Nov 10 11:35:58.513 [initandlisten] connection accepted from 172.17.0.4:36930 #35 (6 connections now open)

How about rs3_srv1's story?

  • through 11:29:4x, it starts to get into the replica set, agreeing to rs3_srv0 as PRIMARY at 11:29:48.472, and rebuilding itself through 11:30:03;
  • at 11:32:38.346 it loses sight of rs3_srv2, resighting it at 11:34:06;
  • at 11:34:40, it loses sight of the current PRIMARY, rs3_srv0;
  • at 11:34:40.460, rs3_srv1 chooses not to stand for PRIMARY because it thinks rs3_srv2 will veto it, even though the server rs3_srv2 thinks is PRIMARY is the one that rs3_srv2 can't see. (??Do we also need an example situation where secondaries can see each other but one of the secondaries can't see the PRIMARY?)
  • at 11:34:42 it votes for rs3_srv2 as PRIMARY; at 11:34:47, 11:34:47, and 11:34:58 it holds back from nominating itself because of its recent vote for someone else;
  • at 11:35:00, rs3_srv2 becomes PRIMARY;
  • at 11:35:39, rs3_srv0 rejoins as a SECONDARY.
In [126]:
!tail -500 {STUB}_srv1_log.txt
note: noprealloc may hurt performance in many applications
Mon Nov 10 11:29:11.889 [initandlisten] MongoDB starting : pid=1 port=27017 dbpath=/data/db/ 64-bit host=bae7907f61ba
Mon Nov 10 11:29:11.889 [initandlisten] db version v2.4.11
Mon Nov 10 11:29:11.889 [initandlisten] git version: fa13d1ee8da0f112f588570b4070f73d7af2f7fd
Mon Nov 10 11:29:11.889 [initandlisten] build info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Mon Nov 10 11:29:11.889 [initandlisten] allocator: tcmalloc
Mon Nov 10 11:29:11.889 [initandlisten] options: { noprealloc: true, replSet: "rs3", smallfiles: true }
Mon Nov 10 11:29:11.897 [initandlisten] journal dir=/data/db/journal
Mon Nov 10 11:29:11.897 [initandlisten] recover : no journal files present, no recovery needed
Mon Nov 10 11:29:12.042 [FileAllocator] allocating new datafile /data/db/local.ns, filling with zeroes...
Mon Nov 10 11:29:12.042 [FileAllocator] creating directory /data/db/_tmp
Mon Nov 10 11:29:12.101 [FileAllocator] done allocating datafile /data/db/local.ns, size: 16MB,  took 0.058 secs
Mon Nov 10 11:29:12.136 [FileAllocator] allocating new datafile /data/db/local.0, filling with zeroes...
Mon Nov 10 11:29:12.180 [FileAllocator] done allocating datafile /data/db/local.0, size: 16MB,  took 0.041 secs
Mon Nov 10 11:29:12.181 [initandlisten] command local.$cmd command: { create: "startup_log", size: 10485760, capped: true } ntoreturn:1 keyUpdates:0  reslen:37 139ms
Mon Nov 10 11:29:12.182 [initandlisten] waiting for connections on port 27017
Mon Nov 10 11:29:12.183 [websvr] admin web console waiting for connections on port 28017
Mon Nov 10 11:29:12.199 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Mon Nov 10 11:29:12.199 [rsStart] replSet info you may need to run replSetInitiate -- rs.initiate() in the shell -- if that is not already done
Mon Nov 10 11:29:19.446 [initandlisten] connection accepted from 172.17.42.1:38722 #1 (1 connection now open)
Mon Nov 10 11:29:21.327 [initandlisten] connection accepted from 172.17.0.2:34178 #2 (2 connections now open)
Mon Nov 10 11:29:21.328 [conn2] end connection 172.17.0.2:34178 (1 connection now open)
Mon Nov 10 11:29:21.330 [initandlisten] connection accepted from 172.17.0.2:34180 #3 (2 connections now open)
Mon Nov 10 11:29:22.201 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Mon Nov 10 11:29:32.202 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Mon Nov 10 11:29:42.205 [rsStart] trying to contact 172.17.0.2:27017
Mon Nov 10 11:29:42.208 [rsStart] replSet I am 172.17.0.3:27017
Mon Nov 10 11:29:42.208 [rsStart] replSet got config version 1 from a remote, saving locally
Mon Nov 10 11:29:42.208 [rsStart] replSet info saving a newer config version to local.system.replset
Mon Nov 10 11:29:42.211 [rsStart] replSet saveConfigLocally done
Mon Nov 10 11:29:42.211 [rsStart] replSet STARTUP2
Mon Nov 10 11:29:42.212 [rsSync] ******
Mon Nov 10 11:29:42.212 [rsSync] creating replication oplog of size: 1256MB...
Mon Nov 10 11:29:42.212 [FileAllocator] allocating new datafile /data/db/local.1, filling with zeroes...
Mon Nov 10 11:29:42.302 [initandlisten] connection accepted from 172.17.0.4:41232 #4 (3 connections now open)
Mon Nov 10 11:29:42.308 [initandlisten] connection accepted from 172.17.0.4:41233 #5 (4 connections now open)
Mon Nov 10 11:29:42.308 [conn4] end connection 172.17.0.4:41232 (3 connections now open)
Mon Nov 10 11:29:43.719 [FileAllocator] done allocating datafile /data/db/local.1, size: 511MB,  took 1.506 secs
Mon Nov 10 11:29:43.720 [FileAllocator] allocating new datafile /data/db/local.2, filling with zeroes...
Mon Nov 10 11:29:44.214 [rsHealthPoll] replSet member 172.17.0.2:27017 is up
Mon Nov 10 11:29:44.215 [rsHealthPoll] replSet member 172.17.0.2:27017 is now in state SECONDARY
Mon Nov 10 11:29:44.221 [rsHealthPoll] replSet member 172.17.0.4:27017 is up
Mon Nov 10 11:29:44.222 [rsHealthPoll] replSet member 172.17.0.4:27017 is now in state STARTUP2
Mon Nov 10 11:29:45.377 [FileAllocator] done allocating datafile /data/db/local.2, size: 511MB,  took 1.656 secs
Mon Nov 10 11:29:45.379 [FileAllocator] allocating new datafile /data/db/local.3, filling with zeroes...
Mon Nov 10 11:29:47.059 [FileAllocator] done allocating datafile /data/db/local.3, size: 511MB,  took 1.679 secs
Mon Nov 10 11:29:47.070 [rsSync] ******
Mon Nov 10 11:29:47.071 [rsSync] replSet initial sync pending
Mon Nov 10 11:29:47.072 [rsSync] replSet initial sync need a member to be primary or secondary to do our initial sync
Mon Nov 10 11:29:48.469 [conn3] end connection 172.17.0.2:34180 (2 connections now open)
Mon Nov 10 11:29:48.471 [initandlisten] connection accepted from 172.17.0.2:34190 #6 (3 connections now open)
Mon Nov 10 11:29:48.472 [conn6] replSet info voting yea for 172.17.0.2:27017 (0)
Mon Nov 10 11:29:50.222 [rsHealthPoll] replSet member 172.17.0.2:27017 is now in state PRIMARY
Mon Nov 10 11:30:03.072 [rsSync] replSet initial sync pending
Mon Nov 10 11:30:03.073 [rsSync] replSet syncing to: 172.17.0.2:27017
Mon Nov 10 11:30:03.079 [rsSync] build index local.me { _id: 1 }
Mon Nov 10 11:30:03.080 [rsSync] build index done.  scanned 0 total records. 0.001 secs
Mon Nov 10 11:30:03.083 [rsSync] build index local.replset.minvalid { _id: 1 }
Mon Nov 10 11:30:03.084 [rsSync] build index done.  scanned 0 total records. 0 secs
Mon Nov 10 11:30:03.084 [rsSync] replSet initial sync drop all databases
Mon Nov 10 11:30:03.084 [rsSync] dropAllDatabasesExceptLocal 1
Mon Nov 10 11:30:03.085 [rsSync] replSet initial sync clone all databases
Mon Nov 10 11:30:03.085 [rsSync] replSet initial sync data copy, starting syncup
Mon Nov 10 11:30:03.085 [rsSync] oplog sync 1 of 3
Mon Nov 10 11:30:03.086 [rsSync] oplog sync 2 of 3
Mon Nov 10 11:30:03.086 [rsSync] replSet initial sync building indexes
Mon Nov 10 11:30:03.086 [rsSync] oplog sync 3 of 3
Mon Nov 10 11:30:03.087 [rsSync] replSet initial sync finishing up
Mon Nov 10 11:30:03.088 [rsSync] replSet set minValid=5460a194:1
Mon Nov 10 11:30:03.090 [rsSync] replSet RECOVERING
Mon Nov 10 11:30:03.090 [rsSync] replSet initial sync done
Mon Nov 10 11:30:03.233 [rsBackgroundSync] replSet syncing to: 172.17.0.2:27017
Mon Nov 10 11:30:04.092 [rsSyncNotifier] replset setting oplog notifier to 172.17.0.2:27017
Mon Nov 10 11:30:04.242 [rsHealthPoll] replSet member 172.17.0.4:27017 is now in state RECOVERING
Mon Nov 10 11:30:05.097 [rsSync] replSet SECONDARY
Mon Nov 10 11:30:06.244 [rsHealthPoll] replSet member 172.17.0.4:27017 is now in state SECONDARY
Mon Nov 10 11:30:10.348 [conn5] end connection 172.17.0.4:41233 (2 connections now open)
Mon Nov 10 11:30:10.349 [initandlisten] connection accepted from 172.17.0.4:41245 #7 (3 connections now open)
Mon Nov 10 11:30:18.500 [conn6] end connection 172.17.0.2:34190 (2 connections now open)
Mon Nov 10 11:30:18.500 [initandlisten] connection accepted from 172.17.0.2:34202 #8 (4 connections now open)
Mon Nov 10 11:30:40.370 [conn7] end connection 172.17.0.4:41245 (2 connections now open)
Mon Nov 10 11:30:40.370 [initandlisten] connection accepted from 172.17.0.4:41251 #9 (3 connections now open)
Mon Nov 10 11:30:48.520 [conn8] end connection 172.17.0.2:34202 (2 connections now open)
Mon Nov 10 11:30:48.521 [initandlisten] connection accepted from 172.17.0.2:34208 #10 (4 connections now open)
Mon Nov 10 11:31:10.391 [conn9] end connection 172.17.0.4:41251 (2 connections now open)
Mon Nov 10 11:31:10.392 [initandlisten] connection accepted from 172.17.0.4:41257 #11 (4 connections now open)
Mon Nov 10 11:31:18.545 [conn10] end connection 172.17.0.2:34208 (2 connections now open)
Mon Nov 10 11:31:18.546 [initandlisten] connection accepted from 172.17.0.2:34214 #12 (4 connections now open)
Mon Nov 10 11:31:40.411 [conn11] end connection 172.17.0.4:41257 (2 connections now open)
Mon Nov 10 11:31:40.412 [initandlisten] connection accepted from 172.17.0.4:41263 #13 (3 connections now open)
Mon Nov 10 11:31:48.566 [conn12] end connection 172.17.0.2:34214 (2 connections now open)
Mon Nov 10 11:31:48.567 [initandlisten] connection accepted from 172.17.0.2:34220 #14 (3 connections now open)
Mon Nov 10 11:32:10.430 [conn13] end connection 172.17.0.4:41263 (2 connections now open)
Mon Nov 10 11:32:10.431 [initandlisten] connection accepted from 172.17.0.4:41269 #15 (3 connections now open)
Mon Nov 10 11:32:18.583 [conn14] end connection 172.17.0.2:34220 (2 connections now open)
Mon Nov 10 11:32:18.584 [initandlisten] connection accepted from 172.17.0.2:34226 #16 (3 connections now open)
Mon Nov 10 11:32:38.346 [rsHealthPoll] DBClientCursor::init call() failed
Mon Nov 10 11:32:38.346 [rsHealthPoll] replset info 172.17.0.4:27017 heartbeat failed, retrying
Mon Nov 10 11:32:39.346 [rsHealthPoll] replSet info 172.17.0.4:27017 is down (or slow to respond): 
Mon Nov 10 11:32:39.346 [rsHealthPoll] replSet member 172.17.0.4:27017 is now in state DOWN
Mon Nov 10 11:32:47.349 [rsHealthPoll] replset info 172.17.0.4:27017 heartbeat failed, retrying
Mon Nov 10 11:32:48.603 [conn16] end connection 172.17.0.2:34226 (2 connections now open)
Mon Nov 10 11:32:48.603 [initandlisten] connection accepted from 172.17.0.2:34240 #17 (4 connections now open)
Mon Nov 10 11:32:59.349 [rsHealthPoll] replset info 172.17.0.4:27017 heartbeat failed, retrying
Mon Nov 10 11:33:11.347 [rsHealthPoll] replset info 172.17.0.4:27017 heartbeat failed, retrying
Mon Nov 10 11:33:18.621 [conn17] end connection 172.17.0.2:34240 (2 connections now open)
Mon Nov 10 11:33:18.622 [initandlisten] connection accepted from 172.17.0.2:34261 #18 (3 connections now open)
Mon Nov 10 11:33:23.354 [rsHealthPoll] replset info 172.17.0.4:27017 heartbeat failed, retrying
Mon Nov 10 11:33:35.357 [rsHealthPoll] replset info 172.17.0.4:27017 heartbeat failed, retrying
Mon Nov 10 11:33:47.355 [rsHealthPoll] replset info 172.17.0.4:27017 heartbeat failed, retrying
Mon Nov 10 11:33:48.648 [conn18] end connection 172.17.0.2:34261 (2 connections now open)
Mon Nov 10 11:33:48.648 [initandlisten] connection accepted from 172.17.0.2:34282 #19 (3 connections now open)
Mon Nov 10 11:33:59.357 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:34:05.357 [rsHealthPoll] couldn't connect to 172.17.0.4:27017: couldn't connect to server 172.17.0.4:27017
Mon Nov 10 11:34:06.360 [rsHealthPoll] replset info 172.17.0.4:27017 thinks that we are down
Mon Nov 10 11:34:06.360 [rsHealthPoll] replSet member 172.17.0.4:27017 is up
Mon Nov 10 11:34:06.360 [rsHealthPoll] replSet member 172.17.0.4:27017 is now in state SECONDARY
Mon Nov 10 11:34:07.484 [initandlisten] connection accepted from 172.17.0.4:41339 #20 (4 connections now open)
Mon Nov 10 11:34:07.485 [initandlisten] connection accepted from 172.17.0.4:41340 #21 (5 connections now open)
Mon Nov 10 11:34:07.485 [conn20] end connection 172.17.0.4:41339 (4 connections now open)
Mon Nov 10 11:34:18.669 [conn19] end connection 172.17.0.2:34282 (3 connections now open)
Mon Nov 10 11:34:18.669 [initandlisten] connection accepted from 172.17.0.2:34301 #22 (4 connections now open)
Mon Nov 10 11:34:35.508 [conn21] end connection 172.17.0.4:41340 (3 connections now open)
Mon Nov 10 11:34:35.509 [initandlisten] connection accepted from 172.17.0.4:41350 #23 (4 connections now open)
Mon Nov 10 11:34:40.458 [rsHealthPoll] DBClientCursor::init call() failed
Mon Nov 10 11:34:40.458 [rsHealthPoll] replSet info 172.17.0.2:27017 is down (or slow to respond): 
Mon Nov 10 11:34:40.458 [rsHealthPoll] replSet member 172.17.0.2:27017 is now in state DOWN
Mon Nov 10 11:34:40.460 [rsMgr] not electing self, 172.17.0.4:27017 would veto with '172.17.0.3:27017 is trying to elect itself but 172.17.0.2:27017 is already primary and more up-to-date'
Mon Nov 10 11:34:42.111 [conn23] replSet info voting yea for 172.17.0.4:27017 (2)
Mon Nov 10 11:34:47.189 [rsMgr] replSet not trying to elect self as responded yea to someone else recently
Mon Nov 10 11:34:48.461 [rsHealthPoll] replset info 172.17.0.2:27017 heartbeat failed, retrying
Mon Nov 10 11:34:52.952 [rsMgr] replSet not trying to elect self as responded yea to someone else recently
Mon Nov 10 11:34:58.978 [rsMgr] replSet not trying to elect self as responded yea to someone else recently
Mon Nov 10 11:34:59.283 [rsBackgroundSync] Socket recv() timeout  172.17.0.2:27017
Mon Nov 10 11:34:59.283 [rsBackgroundSync] SocketException: remote: 172.17.0.2:27017 error: 9001 socket exception [RECV_TIMEOUT] server [172.17.0.2:27017] 
Mon Nov 10 11:34:59.284 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: 172.17.0.2:27017
Mon Nov 10 11:35:00.406 [rsHealthPoll] replSet member 172.17.0.4:27017 is now in state PRIMARY
Mon Nov 10 11:35:00.464 [rsHealthPoll] replset info 172.17.0.2:27017 heartbeat failed, retrying
Mon Nov 10 11:35:01.284 [rsBackgroundSync] replSet syncing to: 172.17.0.4:27017
Mon Nov 10 11:35:05.530 [conn23] end connection 172.17.0.4:41350 (3 connections now open)
Mon Nov 10 11:35:05.531 [initandlisten] connection accepted from 172.17.0.4:41374 #24 (4 connections now open)
Mon Nov 10 11:35:12.464 [rsHealthPoll] replset info 172.17.0.2:27017 heartbeat failed, retrying
Mon Nov 10 11:35:24.467 [rsHealthPoll] replset info 172.17.0.2:27017 heartbeat failed, retrying
Mon Nov 10 11:35:35.553 [conn24] end connection 172.17.0.4:41374 (3 connections now open)
Mon Nov 10 11:35:35.553 [initandlisten] connection accepted from 172.17.0.4:41395 #25 (5 connections now open)
Mon Nov 10 11:35:36.470 [rsHealthPoll] replset info 172.17.0.2:27017 heartbeat failed, retrying
Mon Nov 10 11:35:39.475 [rsHealthPoll] replset info 172.17.0.2:27017 thinks that we are down
Mon Nov 10 11:35:39.475 [rsHealthPoll] replSet member 172.17.0.2:27017 is up
Mon Nov 10 11:35:39.476 [rsHealthPoll] replSet member 172.17.0.2:27017 is now in state SECONDARY
Mon Nov 10 11:35:39.650 [initandlisten] connection accepted from 172.17.0.2:34355 #26 (5 connections now open)
Mon Nov 10 11:35:45.656 [conn26] end connection 172.17.0.2:34355 (4 connections now open)
Mon Nov 10 11:35:45.656 [initandlisten] connection accepted from 172.17.0.2:34359 #27 (5 connections now open)
Mon Nov 10 11:36:05.581 [initandlisten] connection accepted from 172.17.0.4:41408 #28 (6 connections now open)
Mon Nov 10 11:36:05.581 [conn25] end connection 172.17.0.4:41395 (5 connections now open)
Mon Nov 10 11:36:15.680 [conn27] end connection 172.17.0.2:34359 (4 connections now open)
Mon Nov 10 11:36:15.680 [initandlisten] connection accepted from 172.17.0.2:34365 #29 (5 connections now open)

How did things look from the perspective of rs3_srv2?

  • as with rs3_srv1, we have initialisation up to about 11:30:04;
  • at 11:32:34.449, spot the first connection error; this server is lost in the wilderness...
  • at 11:34:07, come back into contact with rs3_srv1; an attempt to take on the mantle of PRIMARY is vetoed, because rs3_srv0 is still PRIMARY and more up-to-date;
  • at 11:34:40, lose sight of rs3_srv0, the current PRIMARY;
  • at 11:34:42.110, make a pitch to become PRIMARY (replSet info electSelf);
  • at 11:34:58.999, become PRIMARY;
  • at 11:35:40.499, rs3_srv0 comes back, letting us know it thought we were down, and is accepted back in as SECONDARY.
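The state transitions we have just traced by eye can also be read programmatically from rs.status() (the replSetGetStatus command). As a minimal sketch, here is a helper that summarises member states from a document of that shape; the sample status below is invented for illustration, not taken from this run.

```python
def summarise_members(status):
    """Map each member's host to its reported state string.

    Expects a document shaped like the output of MongoDB's
    replSetGetStatus command (rs.status() in the shell), i.e. a
    'members' list whose entries carry 'name' and 'stateStr'.
    """
    return {m['name']: m['stateStr'] for m in status.get('members', [])}

# Illustrative status document (made up, not captured from this run)
status = {
    'set': 'rs3',
    'members': [
        {'name': '172.17.0.2:27017', 'stateStr': 'PRIMARY'},
        {'name': '172.17.0.3:27017', 'stateStr': 'SECONDARY'},
        {'name': '172.17.0.4:27017', 'stateStr': 'SECONDARY'},
    ],
}

summarise_members(status)
```

Against a live set you would obtain the document with something like client.admin.command('replSetGetStatus') and poll it while the partition is in force.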
In [127]:
!tail -500 {STUB}_srv2_log.txt
note: noprealloc may hurt performance in many applications
Mon Nov 10 11:29:12.064 [initandlisten] MongoDB starting : pid=1 port=27017 dbpath=/data/db/ 64-bit host=7b694cdab9f8
Mon Nov 10 11:29:12.064 [initandlisten] db version v2.4.11
Mon Nov 10 11:29:12.064 [initandlisten] git version: fa13d1ee8da0f112f588570b4070f73d7af2f7fd
Mon Nov 10 11:29:12.064 [initandlisten] build info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Mon Nov 10 11:29:12.064 [initandlisten] allocator: tcmalloc
Mon Nov 10 11:29:12.064 [initandlisten] options: { noprealloc: true, replSet: "rs3", smallfiles: true }
Mon Nov 10 11:29:12.070 [initandlisten] journal dir=/data/db/journal
Mon Nov 10 11:29:12.071 [initandlisten] recover : no journal files present, no recovery needed
Mon Nov 10 11:29:12.172 [FileAllocator] allocating new datafile /data/db/local.ns, filling with zeroes...
Mon Nov 10 11:29:12.172 [FileAllocator] creating directory /data/db/_tmp
Mon Nov 10 11:29:12.241 [FileAllocator] done allocating datafile /data/db/local.ns, size: 16MB,  took 0.067 secs
Mon Nov 10 11:29:12.260 [FileAllocator] allocating new datafile /data/db/local.0, filling with zeroes...
Mon Nov 10 11:29:12.288 [FileAllocator] done allocating datafile /data/db/local.0, size: 16MB,  took 0.027 secs
Mon Nov 10 11:29:12.288 [initandlisten] command local.$cmd command: { create: "startup_log", size: 10485760, capped: true } ntoreturn:1 keyUpdates:0  reslen:37 117ms
Mon Nov 10 11:29:12.289 [initandlisten] waiting for connections on port 27017
Mon Nov 10 11:29:12.290 [websvr] admin web console waiting for connections on port 28017
Mon Nov 10 11:29:12.293 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Mon Nov 10 11:29:12.293 [rsStart] replSet info you may need to run replSetInitiate -- rs.initiate() in the shell -- if that is not already done
Mon Nov 10 11:29:19.451 [initandlisten] connection accepted from 172.17.42.1:43299 #1 (1 connection now open)
Mon Nov 10 11:29:21.329 [initandlisten] connection accepted from 172.17.0.2:58014 #2 (2 connections now open)
Mon Nov 10 11:29:21.330 [conn2] end connection 172.17.0.2:58014 (1 connection now open)
Mon Nov 10 11:29:21.333 [initandlisten] connection accepted from 172.17.0.2:58016 #3 (2 connections now open)
Mon Nov 10 11:29:22.295 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Mon Nov 10 11:29:32.295 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Mon Nov 10 11:29:42.207 [initandlisten] connection accepted from 172.17.0.3:60156 #4 (3 connections now open)
Mon Nov 10 11:29:42.208 [conn4] end connection 172.17.0.3:60156 (2 connections now open)
Mon Nov 10 11:29:42.299 [rsStart] trying to contact 172.17.0.2:27017
Mon Nov 10 11:29:42.303 [rsStart] replSet I am 172.17.0.4:27017
Mon Nov 10 11:29:42.303 [rsStart] replSet got config version 1 from a remote, saving locally
Mon Nov 10 11:29:42.303 [rsStart] replSet info saving a newer config version to local.system.replset
Mon Nov 10 11:29:42.306 [rsStart] replSet saveConfigLocally done
Mon Nov 10 11:29:42.306 [rsStart] replSet STARTUP2
Mon Nov 10 11:29:42.307 [rsSync] ******
Mon Nov 10 11:29:42.307 [rsSync] creating replication oplog of size: 1253MB...
Mon Nov 10 11:29:42.308 [FileAllocator] allocating new datafile /data/db/local.1, filling with zeroes...
Mon Nov 10 11:29:42.309 [rsHealthPoll] replset info 172.17.0.3:27017 thinks that we are down
Mon Nov 10 11:29:42.309 [rsHealthPoll] replSet member 172.17.0.3:27017 is up
Mon Nov 10 11:29:42.309 [rsHealthPoll] replSet member 172.17.0.3:27017 is now in state STARTUP2
Mon Nov 10 11:29:43.945 [FileAllocator] done allocating datafile /data/db/local.1, size: 511MB,  took 1.629 secs
Mon Nov 10 11:29:43.968 [FileAllocator] allocating new datafile /data/db/local.2, filling with zeroes...
Mon Nov 10 11:29:44.209 [initandlisten] connection accepted from 172.17.0.3:60161 #5 (3 connections now open)
Mon Nov 10 11:29:44.307 [rsHealthPoll] replSet member 172.17.0.2:27017 is up
Mon Nov 10 11:29:44.307 [rsHealthPoll] replSet member 172.17.0.2:27017 is now in state SECONDARY
Mon Nov 10 11:29:45.507 [FileAllocator] done allocating datafile /data/db/local.2, size: 511MB,  took 1.537 secs
Mon Nov 10 11:29:45.528 [FileAllocator] allocating new datafile /data/db/local.3, filling with zeroes...
Mon Nov 10 11:29:46.854 [FileAllocator] done allocating datafile /data/db/local.3, size: 511MB,  took 1.325 secs
Mon Nov 10 11:29:46.866 [rsSync] ******
Mon Nov 10 11:29:46.866 [rsSync] replSet initial sync pending
Mon Nov 10 11:29:46.867 [rsSync] replSet initial sync need a member to be primary or secondary to do our initial sync
Mon Nov 10 11:29:48.471 [conn3] replSet info voting yea for 172.17.0.2:27017 (0)
Mon Nov 10 11:29:50.324 [rsHealthPoll] replSet member 172.17.0.2:27017 is now in state PRIMARY
Mon Nov 10 11:30:02.478 [conn3] end connection 172.17.0.2:58016 (2 connections now open)
Mon Nov 10 11:30:02.479 [initandlisten] connection accepted from 172.17.0.2:58028 #6 (3 connections now open)
Mon Nov 10 11:30:02.870 [rsSync] replSet initial sync pending
Mon Nov 10 11:30:02.870 [rsSync] replSet syncing to: 172.17.0.2:27017
Mon Nov 10 11:30:02.875 [rsSync] build index local.me { _id: 1 }
Mon Nov 10 11:30:02.881 [rsSync] build index done.  scanned 0 total records. 0.005 secs
Mon Nov 10 11:30:02.883 [rsSync] build index local.replset.minvalid { _id: 1 }
Mon Nov 10 11:30:02.884 [rsSync] build index done.  scanned 0 total records. 0 secs
Mon Nov 10 11:30:02.884 [rsSync] replSet initial sync drop all databases
Mon Nov 10 11:30:02.884 [rsSync] dropAllDatabasesExceptLocal 1
Mon Nov 10 11:30:02.884 [rsSync] replSet initial sync clone all databases
Mon Nov 10 11:30:02.887 [rsSync] replSet initial sync data copy, starting syncup
Mon Nov 10 11:30:02.887 [rsSync] oplog sync 1 of 3
Mon Nov 10 11:30:02.888 [rsSync] oplog sync 2 of 3
Mon Nov 10 11:30:02.888 [rsSync] replSet initial sync building indexes
Mon Nov 10 11:30:02.888 [rsSync] oplog sync 3 of 3
Mon Nov 10 11:30:02.888 [rsSync] replSet initial sync finishing up
Mon Nov 10 11:30:02.891 [rsSync] replSet set minValid=5460a194:1
Mon Nov 10 11:30:02.892 [rsSync] replSet RECOVERING
Mon Nov 10 11:30:02.892 [rsSync] replSet initial sync done
Mon Nov 10 11:30:03.319 [rsBackgroundSync] replSet syncing to: 172.17.0.2:27017
Mon Nov 10 11:30:03.893 [rsSyncNotifier] replset setting oplog notifier to 172.17.0.2:27017
Mon Nov 10 11:30:04.342 [rsHealthPoll] replSet member 172.17.0.3:27017 is now in state RECOVERING
Mon Nov 10 11:30:04.894 [rsSync] replSet SECONDARY
Mon Nov 10 11:30:06.344 [rsHealthPoll] replSet member 172.17.0.3:27017 is now in state SECONDARY
Mon Nov 10 11:30:12.248 [conn5] end connection 172.17.0.3:60161 (2 connections now open)
Mon Nov 10 11:30:12.249 [initandlisten] connection accepted from 172.17.0.3:60173 #7 (3 connections now open)
Mon Nov 10 11:30:32.505 [conn6] end connection 172.17.0.2:58028 (2 connections now open)
Mon Nov 10 11:30:32.506 [initandlisten] connection accepted from 172.17.0.2:58040 #8 (3 connections now open)
Mon Nov 10 11:30:42.271 [conn7] end connection 172.17.0.3:60173 (2 connections now open)
Mon Nov 10 11:30:42.272 [initandlisten] connection accepted from 172.17.0.3:60179 #9 (3 connections now open)
Mon Nov 10 11:31:02.526 [conn8] end connection 172.17.0.2:58040 (2 connections now open)
Mon Nov 10 11:31:02.527 [initandlisten] connection accepted from 172.17.0.2:58046 #10 (4 connections now open)
Mon Nov 10 11:31:12.290 [conn9] end connection 172.17.0.3:60179 (2 connections now open)
Mon Nov 10 11:31:12.291 [initandlisten] connection accepted from 172.17.0.3:60185 #11 (3 connections now open)
Mon Nov 10 11:31:32.555 [conn10] end connection 172.17.0.2:58046 (2 connections now open)
Mon Nov 10 11:31:32.556 [initandlisten] connection accepted from 172.17.0.2:58052 #12 (3 connections now open)
Mon Nov 10 11:31:42.311 [conn11] end connection 172.17.0.3:60185 (2 connections now open)
Mon Nov 10 11:31:42.312 [initandlisten] connection accepted from 172.17.0.3:60191 #13 (3 connections now open)
Mon Nov 10 11:32:02.574 [conn12] end connection 172.17.0.2:58052 (2 connections now open)
Mon Nov 10 11:32:02.575 [initandlisten] connection accepted from 172.17.0.2:58058 #14 (3 connections now open)
Mon Nov 10 11:32:12.333 [conn13] end connection 172.17.0.3:60191 (2 connections now open)
Mon Nov 10 11:32:12.334 [initandlisten] connection accepted from 172.17.0.3:60197 #15 (3 connections now open)
Mon Nov 10 11:32:34.449 [rsHealthPoll] couldn't connect to 172.17.0.2:27017: couldn't connect to server 172.17.0.2:27017
Mon Nov 10 11:32:38.446 [rsHealthPoll] DBClientCursor::init call() failed
Mon Nov 10 11:32:38.446 [rsHealthPoll] replset info 172.17.0.3:27017 heartbeat failed, retrying
Mon Nov 10 11:32:39.446 [rsHealthPoll] replSet info 172.17.0.3:27017 is down (or slow to respond): 
Mon Nov 10 11:32:39.447 [rsHealthPoll] replSet member 172.17.0.3:27017 is now in state DOWN
Mon Nov 10 11:32:40.450 [rsHealthPoll] couldn't connect to 172.17.0.2:27017: couldn't connect to server 172.17.0.2:27017
Mon Nov 10 11:32:46.452 [rsHealthPoll] replSet info 172.17.0.2:27017 is down (or slow to respond): 
Mon Nov 10 11:32:46.452 [rsHealthPoll] replSet member 172.17.0.2:27017 is now in state DOWN
Mon Nov 10 11:32:46.452 [rsMgr] replSet can't see a majority, will not try to elect self
Mon Nov 10 11:32:47.450 [rsHealthPoll] replset info 172.17.0.3:27017 heartbeat failed, retrying
Mon Nov 10 11:32:53.857 [rsBackgroundSync] Socket recv() timeout  172.17.0.2:27017
Mon Nov 10 11:32:53.857 [rsBackgroundSync] SocketException: remote: 172.17.0.2:27017 error: 9001 socket exception [RECV_TIMEOUT] server [172.17.0.2:27017] 
Mon Nov 10 11:32:53.858 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: 172.17.0.2:27017
Mon Nov 10 11:32:54.453 [rsHealthPoll] couldn't connect to 172.17.0.2:27017: couldn't connect to server 172.17.0.2:27017
Mon Nov 10 11:32:59.459 [rsHealthPoll] replset info 172.17.0.3:27017 heartbeat failed, retrying
Mon Nov 10 11:33:00.454 [rsHealthPoll] couldn't connect to 172.17.0.2:27017: couldn't connect to server 172.17.0.2:27017
Mon Nov 10 11:33:11.466 [rsHealthPoll] replset info 172.17.0.3:27017 heartbeat failed, retrying
Mon Nov 10 11:33:14.459 [rsHealthPoll] couldn't connect to 172.17.0.2:27017: couldn't connect to server 172.17.0.2:27017
Mon Nov 10 11:33:20.459 [rsHealthPoll] couldn't connect to 172.17.0.2:27017: couldn't connect to server 172.17.0.2:27017
Mon Nov 10 11:33:23.468 [rsHealthPoll] replset info 172.17.0.3:27017 heartbeat failed, retrying
Mon Nov 10 11:33:34.463 [rsHealthPoll] couldn't connect to 172.17.0.2:27017: couldn't connect to server 172.17.0.2:27017
Mon Nov 10 11:33:35.470 [rsHealthPoll] replset info 172.17.0.3:27017 heartbeat failed, retrying
Mon Nov 10 11:33:40.464 [rsHealthPoll] couldn't connect to 172.17.0.2:27017: couldn't connect to server 172.17.0.2:27017
Mon Nov 10 11:33:47.478 [rsHealthPoll] couldn't connect to 172.17.0.3:27017: couldn't connect to server 172.17.0.3:27017
Mon Nov 10 11:33:53.479 [rsHealthPoll] couldn't connect to 172.17.0.3:27017: couldn't connect to server 172.17.0.3:27017
Mon Nov 10 11:33:54.467 [rsHealthPoll] couldn't connect to 172.17.0.2:27017: couldn't connect to server 172.17.0.2:27017
Mon Nov 10 11:34:00.470 [rsHealthPoll] couldn't connect to 172.17.0.2:27017: couldn't connect to server 172.17.0.2:27017
Mon Nov 10 11:34:06.359 [initandlisten] connection accepted from 172.17.0.3:60265 #16 (4 connections now open)
Mon Nov 10 11:34:07.483 [rsHealthPoll] couldn't connect to 172.17.0.3:27017: couldn't connect to server 172.17.0.3:27017
Mon Nov 10 11:34:07.486 [rsHealthPoll] replSet member 172.17.0.3:27017 is up
Mon Nov 10 11:34:07.486 [rsHealthPoll] replSet member 172.17.0.3:27017 is now in state SECONDARY
Mon Nov 10 11:34:07.487 [rsMgr] not electing self, 172.17.0.3:27017 would veto with '172.17.0.4:27017 is trying to elect itself but 172.17.0.2:27017 is already primary and more up-to-date'
Mon Nov 10 11:34:07.612 [initandlisten] connection accepted from 172.17.0.2:58127 #17 (5 connections now open)
Mon Nov 10 11:34:08.361 [conn16] end connection 172.17.0.3:60265 (4 connections now open)
Mon Nov 10 11:34:08.361 [initandlisten] connection accepted from 172.17.0.3:60268 #18 (6 connections now open)
Mon Nov 10 11:34:08.475 [rsHealthPoll] replSet member 172.17.0.2:27017 is up
Mon Nov 10 11:34:08.475 [rsHealthPoll] replSet member 172.17.0.2:27017 is now in state PRIMARY
Mon Nov 10 11:34:08.917 [rsBackgroundSync] replSet syncing to: 172.17.0.2:27017
Mon Nov 10 11:34:09.613 [conn17] end connection 172.17.0.2:58127 (4 connections now open)
Mon Nov 10 11:34:09.614 [initandlisten] connection accepted from 172.17.0.2:58135 #19 (5 connections now open)
Mon Nov 10 11:34:36.385 [conn18] end connection 172.17.0.3:60268 (4 connections now open)
Mon Nov 10 11:34:36.386 [initandlisten] connection accepted from 172.17.0.3:60278 #20 (5 connections now open)
Mon Nov 10 11:34:40.490 [rsHealthPoll] DBClientCursor::init call() failed
Mon Nov 10 11:34:40.490 [rsHealthPoll] replset info 172.17.0.2:27017 heartbeat failed, retrying
Mon Nov 10 11:34:41.491 [rsHealthPoll] replSet info 172.17.0.2:27017 is down (or slow to respond): 
Mon Nov 10 11:34:41.492 [rsHealthPoll] replSet member 172.17.0.2:27017 is now in state DOWN
Mon Nov 10 11:34:42.110 [rsMgr] replSet info electSelf 2
Mon Nov 10 11:34:49.494 [rsHealthPoll] replset info 172.17.0.2:27017 heartbeat failed, retrying
Mon Nov 10 11:34:58.998 [rsBackgroundSync] Socket recv() timeout  172.17.0.2:27017
Mon Nov 10 11:34:58.998 [rsBackgroundSync] SocketException: remote: 172.17.0.2:27017 error: 9001 socket exception [RECV_TIMEOUT] server [172.17.0.2:27017] 
Mon Nov 10 11:34:58.998 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: 172.17.0.2:27017
Mon Nov 10 11:34:58.999 [rsMgr] replSet PRIMARY
Mon Nov 10 11:35:01.285 [initandlisten] connection accepted from 172.17.0.3:60297 #21 (6 connections now open)
Mon Nov 10 11:35:01.496 [rsHealthPoll] replset info 172.17.0.2:27017 heartbeat failed, retrying
Mon Nov 10 11:35:06.410 [conn20] end connection 172.17.0.3:60278 (5 connections now open)
Mon Nov 10 11:35:06.411 [initandlisten] connection accepted from 172.17.0.3:60302 #22 (6 connections now open)
Mon Nov 10 11:35:13.495 [rsHealthPoll] replset info 172.17.0.2:27017 heartbeat failed, retrying
Mon Nov 10 11:35:25.496 [rsHealthPoll] replset info 172.17.0.2:27017 heartbeat failed, retrying
Mon Nov 10 11:35:36.434 [conn22] end connection 172.17.0.3:60302 (5 connections now open)
Mon Nov 10 11:35:36.435 [initandlisten] connection accepted from 172.17.0.3:60323 #23 (6 connections now open)
Mon Nov 10 11:35:37.496 [rsHealthPoll] replset info 172.17.0.2:27017 heartbeat failed, retrying
Mon Nov 10 11:35:40.499 [rsHealthPoll] replset info 172.17.0.2:27017 thinks that we are down
Mon Nov 10 11:35:40.499 [rsHealthPoll] replSet member 172.17.0.2:27017 is up
Mon Nov 10 11:35:40.499 [rsHealthPoll] replSet member 172.17.0.2:27017 is now in state SECONDARY
Mon Nov 10 11:35:40.656 [initandlisten] connection accepted from 172.17.0.2:58189 #24 (7 connections now open)
Mon Nov 10 11:35:40.656 [initandlisten] connection accepted from 172.17.0.2:58191 #25 (8 connections now open)
Mon Nov 10 11:35:40.656 [conn24] end connection 172.17.0.2:58189 (6 connections now open)
Mon Nov 10 11:35:40.693 [initandlisten] connection accepted from 172.17.0.2:58192 #26 (8 connections now open)
Mon Nov 10 11:35:41.538 [initandlisten] connection accepted from 172.17.0.2:58193 #27 (9 connections now open)
Mon Nov 10 11:35:42.549 [slaveTracking] build index local.slaves { _id: 1 }
Mon Nov 10 11:35:42.552 [slaveTracking] build index done.  scanned 0 total records. 0.002 secs
Mon Nov 10 11:36:04.674 [conn25] end connection 172.17.0.2:58191 (8 connections now open)
Mon Nov 10 11:36:04.675 [initandlisten] connection accepted from 172.17.0.2:58197 #28 (9 connections now open)
Mon Nov 10 11:36:06.462 [conn23] end connection 172.17.0.3:60323 (8 connections now open)
Mon Nov 10 11:36:06.463 [initandlisten] connection accepted from 172.17.0.3:60336 #29 (9 connections now open)
In [128]:
#Tidy up
#Remove the blockade rules
clear_iptables('test1w2s2')
#Shut down the containers
tidyAwayContainers(c,['rs3_srv0','rs3_srv1','rs3_srv2'])

Writing to Replica Sets

Need examples of writing to different servers in the replica set under different partition conditions, to show what happens to the replication in each case? Eg in matters of eventual consistency, etc?

PS

I think there's a lot to be said for giving students a transcript such as this showing a worked example. Having got their eye in, we can perhaps get them to run slightly different scenarios, eg a disconnect between rs3_srv0 and rs3_srv2, with connections still up between rs3_srv0 and rs3_srv1, and between rs3_srv1 and rs3_srv2. I'm not sure if we can do one-way breaks? (eg so rs3_srv1 can see rs3_srv0 but rs3_srv0 can't see rs3_srv1?)
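On the one-way question: at the IP level you can drop traffic in a single direction, sketched below via a hypothetical helper that just builds the iptables commands (to be run on the Docker host). The catch is that replica set members talk over TCP, which needs packets flowing both ways, so a one-direction DROP still severs the connections in both directions; a true "A sees B but B can't see A" break probably isn't achievable this way.

```python
def one_way_block(src_ip, dst_ip):
    """Build iptables commands that drop packets travelling from
    src_ip to dst_ip only; the reverse direction is left alone.
    Hypothetical helper for discussion, not part of dockeringMongo.
    """
    return [
        'iptables -A FORWARD -s {0} -d {1} -j DROP'.format(src_ip, dst_ip),
    ]

one_way_block('172.17.0.2', '172.17.0.4')
```

The matching cleanup would replace -A with -D to delete the rule again.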

Also, students could run scenarios of inserting data into different machines under different network conditions?

testclient = MongoReplicaSetClient('{0}:{1}'.format(getContainIPaddress(c, STUB+'_srv0'), 27017), replicaSet=STUB)
testdb = testclient.testdb
testcollection = testdb.testcollection
testcollection.insert({'name': 'test1'})
#find() returns a cursor - wrap it in list() to actually see the documents
list(testcollection.find())
In [ ]: