Disaster recovery
The minimal fault-tolerant Tarantool configuration is a replica set that consists of a master and a replica, or of two masters. The basic recommendation is to configure all Tarantool instances in a replica set to create snapshot files on a regular basis.
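For example, regular checkpointing can be tuned through the box-level options shown below. This is a minimal sketch; the interval and count values are illustrative, not recommendations:

box.cfg{
    checkpoint_interval = 3600, -- write a snapshot file every hour
    checkpoint_count = 2,       -- keep only the two most recent snapshots
}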
Here are action plans for typical crash scenarios.
Configuration: master-replica (manual failover).
Problem: The master has crashed.
Actions:
- Ensure the master is stopped. For example, log in to the master machine and use tt stop.
- Configure a new replica set leader using the <replicaset_name>.leader option.
- Reload configuration on all instances using config:reload().
- Make sure that the new replica set leader is a master using box.info.ro.
- On the new master, remove the crashed instance from the _cluster space.
- Set up a replacement for the crashed master on a spare host.
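A sketch of this sequence, assuming a replica set named replicaset001 whose surviving instance is instance002; the names and the UUID below are placeholders. First, point the leader at the survivor in the cluster configuration:

replicaset001:
  leader: instance002

Then, in the Tarantool console:

-- On every remaining instance: apply the updated configuration
local config = require('config')
config:reload()

-- On instance002: false here means the instance is now a master
box.info.ro

-- Still on instance002: drop the crashed instance's registration,
-- addressing it through the unique uuid index of the _cluster space
box.space._cluster.index.uuid:delete('9bb111c2-3ff5-36a7-00f4-2b9a573ea660')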
See also: Performing manual failover.
Configuration: master-replica (automated failover).
Problem: The master has crashed.
Actions:
- Use box.info.election to make sure a new master is elected automatically.
- On the new master, remove the crashed instance from the _cluster space.
- Set up a replacement for the crashed master on a spare host.
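A minimal check-and-cleanup sketch (the UUID is a placeholder):

-- On any live instance: the elected leader reports state: leader,
-- and box.info.ro is false on it
box.info.election

-- On the new master: remove the crashed instance's registration
box.space._cluster.index.uuid:delete('9bb111c2-3ff5-36a7-00f4-2b9a573ea660')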
See also: Testing automated failover.
Configuration: master-replica.
Problem: Some transactions are missing on a replica after the master has crashed.
Actions:
You could have lost a few transactions that were in the master's write-ahead log but had not yet been transferred to the replica before the crash. If you were able to salvage the master's .xlog file, you may be able to recover them.
Find out the instance UUID from the crashed master's xlog:
$ head -5 var/lib/instance001/*.xlog | grep Instance
Instance: 9bb111c2-3ff5-36a7-00f4-2b9a573ea660
On the new master, use the UUID to find the position:
app:instance002> box.info.vclock[box.space._cluster.index.uuid:select{'9bb111c2-3ff5-36a7-00f4-2b9a573ea660'}[1][1]]
---
- 999
...
Play the records from the crashed master's .xlog to the new master, starting right after the position found in the previous step (999, so the replay starts at LSN 1000):

$ tt play 127.0.0.1:3302 var/lib/instance001/00000000000000000000.xlog \
      --from 1000 \
      --replica 1 \
      --username admin --password secret
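One way to sanity-check the replay, assuming nothing beyond the commands above: tt play re-applies the salvaged records as regular client requests, so the target logs them under its own replica id rather than the crashed master's. Comparing the new master's own LSN before and after the replay should therefore account for every replayed record:

-- On the new master, before and after running tt play:
box.info.lsn
-- the difference should equal the number of records replayed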
Configuration: master-master.
Problem: One master has crashed.
Actions:
- Let the load be handled by the other master alone.
- Remove the crashed master from the replica set.
- Set up a replacement for the crashed master on a spare host. Learn more from Adding and removing instances.
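A hedged sketch of the removal, with instance001 as the crashed master and instance002 as the survivor (placeholder names, placeholder UUID). After deleting instance001 from the replica set definition in the cluster configuration:

-- On instance002: apply the updated configuration
local config = require('config')
config:reload()

-- Drop the crashed master's registration; look up its UUID
-- in box.space._cluster:select{} first if you don't have it
box.space._cluster.index.uuid:delete('9bb111c2-3ff5-36a7-00f4-2b9a573ea660')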
Configuration: master-replica or master-master.
Problem: Data was deleted at one master and this data loss was propagated to the other node (master or replica).
Actions:
- Put all nodes in read-only mode. Depending on the replication.failover mode, this can be done as follows:
  - manual: change a replica set leader to null.
  - election: set replication.election_mode to voter or off at the replica set level.
  - off: set database.mode to ro.
- Reload configurations on all instances using the reload() function provided by the config module.
- Turn off deletion of expired checkpoints with box.backup.start(). This prevents the Tarantool garbage collector from removing files made with older checkpoints until box.backup.stop() is called.
- Get the latest valid .snap file and use the tt cat command to calculate at which LSN the data loss occurred.
- Start a new instance and use the tt play command to play to it the contents of the .snap and .xlog files up to the calculated LSN; see the sketch after this list.
- Bootstrap a new replica from the recovered master.
Note
The steps above are applicable only to data in the memtx storage engine.