Disaster recovery
The minimal fault-tolerant Tarantool configuration is a replica set that consists of a master and a replica, or of two masters. The basic recommendation is to configure all Tarantool instances in a replica set to create snapshot files on a regular basis.
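For example, regular checkpointing can be tuned through the box-level options shown below. This is a minimal sketch; the interval and count values are illustrative, not recommendations:

box.cfg{
    checkpoint_interval = 3600, -- write a snapshot file every hour
    checkpoint_count = 2,       -- keep only the two most recent snapshots
}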
Here are action plans for typical crash scenarios.
Configuration: master-replica (manual failover).
Problem: The master has crashed.
Actions:
- Ensure the master is stopped. For example, log in to the master machine and use tt stop.
- Configure a new replica set leader using the <replicaset_name>.leader option.
- Reload configuration on all instances using config:reload().
- Make sure that the new replica set leader is a master using box.info.ro.
- On the new master, remove the crashed instance from the _cluster space.
- Set up a replacement for the crashed master on a spare host.
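A sketch of this sequence, assuming a replica set named replicaset001 whose surviving instance is instance002; the names and the UUID below are placeholders. First, point the leader at the survivor in the cluster configuration:

replicaset001:
  leader: instance002

Then, in the Tarantool console:

-- On every remaining instance: apply the updated configuration
local config = require('config')
config:reload()

-- On instance002: false here means the instance is now a master
box.info.ro

-- Still on instance002: drop the crashed instance's registration,
-- addressing it through the unique uuid index of the _cluster space
box.space._cluster.index.uuid:delete('9bb111c2-3ff5-36a7-00f4-2b9a573ea660')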
See also: Performing manual failover.
Configuration: master-replica (automated failover).
Problem: The master has crashed.
Actions:
- Use box.info.election to make sure a new master is elected automatically.
- On the new master, remove the crashed instance from the _cluster space.
- Set up a replacement for the crashed master on a spare host.
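A minimal check-and-cleanup sketch (the UUID is a placeholder):

-- On any live instance: the elected leader reports state: leader,
-- and box.info.ro is false on it
box.info.election

-- On the new master: remove the crashed instance's registration
box.space._cluster.index.uuid:delete('9bb111c2-3ff5-36a7-00f4-2b9a573ea660')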
See also: Testing automated failover.
Configuration: master-replica.
Problem: Some transactions are missing on a replica after the master has crashed.
Actions:
You could have lost a few transactions that were in the master's write-ahead log but had not yet been transferred to the replica before the crash. If you were able to salvage the master's .xlog file, you may be able to recover them.
Find out the instance UUID from the crashed master's xlog:
$ head -5 var/lib/instance001/*.xlog | grep Instance
Instance: 9bb111c2-3ff5-36a7-00f4-2b9a573ea660
On the new master, use the UUID to find the position:
app:instance002> box.info.vclock[box.space._cluster.index.uuid:select{'9bb111c2-3ff5-36a7-00f4-2b9a573ea660'}[1][1]]
---
- 999
...
Play the records from the crashed master's .xlog to the new master, starting right after the position found in the previous step (999, so the replay starts at LSN 1000):

$ tt play 127.0.0.1:3302 var/lib/instance001/00000000000000000000.xlog \
      --from 1000 \
      --replica 1 \
      --username admin --password secret
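One way to sanity-check the replay, assuming nothing beyond the commands above: tt play re-applies the salvaged records as regular client requests, so the target logs them under its own replica id rather than the crashed master's. Comparing the new master's own LSN before and after the replay should therefore account for every replayed record:

-- On the new master, before and after running tt play:
box.info.lsn
-- the difference should equal the number of records replayed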
Configuration: master-master.
Problem: One master has crashed.
Actions:
- Let the load be handled by the other master alone.
- Remove the crashed master from the replica set.
- Set up a replacement for the crashed master on a spare host. Learn more from Adding and removing instances.
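A hedged sketch of the removal, with instance001 as the crashed master and instance002 as the survivor (placeholder names, placeholder UUID). After deleting instance001 from the replica set definition in the cluster configuration:

-- On instance002: apply the updated configuration
local config = require('config')
config:reload()

-- Drop the crashed master's registration; look up its UUID
-- in box.space._cluster:select{} first if you don't have it
box.space._cluster.index.uuid:delete('9bb111c2-3ff5-36a7-00f4-2b9a573ea660')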
Configuration: master-replica or master-master.
Problem: Data was deleted at one master and this data loss was propagated to the other node (master or replica).
Actions:
- Put all nodes in read-only mode. Depending on the replication.failover mode, this can be done as follows:
  - manual: change a replica set leader to null.
  - election: set replication.election_mode to voter or off at the replica set level.
  - off: set database.mode to ro.
- Reload configurations on all instances using the reload() function provided by the config module.
- Turn off deletion of expired checkpoints with box.backup.start(). This prevents the Tarantool garbage collector from removing files made with older checkpoints until box.backup.stop() is called.
- Get the latest valid .snap file and use the tt cat command to calculate at which LSN the data loss occurred.
- Start a new instance and use the tt play command to play to it the contents of the .snap and .xlog files up to the calculated LSN; see the sketch after this list.
- Bootstrap a new replica from the recovered master.
Note
The steps above are applicable only to data in the memtx storage engine.