
Disaster recovery

A minimal fault-tolerant Tarantool configuration is a replica set that includes a master and a replica, or two masters. The basic recommendation is to configure all Tarantool instances in a replica set to create snapshot files regularly.
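
For instances configured directly through box.cfg (in the declarative YAML configuration, the corresponding settings are grouped under the snapshot section), regular snapshots are controlled by the checkpoint options. A minimal sketch; the interval and retention values below are illustrative only:

    -- Take an automatic snapshot every hour and keep the two latest ones on disk.
    box.cfg{
        checkpoint_interval = 3600, -- seconds between automatic snapshots
        checkpoint_count = 2,       -- number of snapshots to keep
    }

    -- A snapshot can also be taken manually at any time:
    box.snapshot()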

Here are action plans for typical crash scenarios.

Configuration: master-replica (manual failover).

Problem: The master has crashed.

Actions:

  1. Ensure the master is stopped. For example, log in to the master machine and use tt stop.
  2. Configure a new replica set leader using the <replicaset_name>.leader option.
  3. Reload configuration on all instances using config:reload().
  4. Make sure that the new replica set leader is writable (that is, a master) by checking box.info.ro.
  5. On the new master, remove the crashed instance from the ‘_cluster’ space (see the sketch after this list).
  6. Set up a replacement for the crashed master on a spare host.
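
For steps 3-5, a possible console sequence on the new leader is sketched below. The UUID is a placeholder for the crashed instance's UUID:

    -- Step 3: re-read the declarative configuration (run this on every instance).
    require('config'):reload()

    -- Step 4: the new leader must be writable.
    assert(box.info.ro == false)

    -- Step 5: on the new master, delete the crashed instance from the _cluster
    -- space by its UUID (substitute the UUID of your crashed instance).
    local crashed = box.space._cluster.index.uuid:select{'9bb111c2-3ff5-36a7-00f4-2b9a573ea660'}[1]
    if crashed ~= nil then
        box.space._cluster:delete(crashed[1])
    end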

See also: Performing manual failover.

Configuration: master-replica (automated failover).

Problem: The master has crashed.

Actions:

  1. Use box.info.election to make sure a new master is elected automatically (see the check after this list).
  2. On the new master, remove the crashed instance from the ‘_cluster’ space.
  3. Set up a replacement for the crashed master on a spare host.
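
To double-check step 1, run the following from the console of the instance you expect to become the new leader; the _cluster cleanup in step 2 is the same as in the manual failover sketch above:

    -- The elected leader reports state 'leader' and becomes writable;
    -- the remaining instances report state 'follower'.
    assert(box.info.election.state == 'leader')
    assert(box.info.ro == false)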

See also: Testing automated failover.

Configuration: master-replica.

Problem: Some transactions are missing on a replica after the master has crashed.

Actions:

You may have lost some transactions from the master's write-ahead log (.xlog) that had not yet been transferred to the replica before the crash. If you can salvage the master's .xlog file, you may be able to recover these.

  1. Find out the instance UUID from the crashed master's .xlog file:

    $ head -5 var/lib/instance001/*.xlog | grep Instance
    Instance: 9bb111c2-3ff5-36a7-00f4-2b9a573ea660
    
  2. On the new master, use this UUID to find the last position that the new master received from the crashed one:

    app:instance002> box.info.vclock[box.space._cluster.index.uuid:select{'9bb111c2-3ff5-36a7-00f4-2b9a573ea660'}[1][1]]
    ---
    - 999
    ...
    
  3. Play the records from the crashed master's .xlog file to the new master, starting right after the new master's position:

    $ tt play 127.0.0.1:3302 var/lib/instance001/00000000000000000000.xlog \
              --from 1000 \
              --replica 1 \
              --username admin --password secret
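
After the replay, you can check on the new master that the vclock component tracking the crashed master has advanced. A small check, assuming the crashed master's replica id is 1 and the position found in step 2 was 999, as in the example above:

    -- The vclock component for replica id 1 should now be greater than
    -- the position recorded before the replay.
    assert(box.info.vclock[1] > 999)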
    

Configuration: master-master.

Problem: One of the masters has crashed.

Actions:

  1. Let the remaining master handle the load alone.
  2. Remove the crashed master from the replica set.
  3. Set up a replacement for the crashed master on a spare host. Learn more from Adding and removing instances.

Configuration: master-replica or master-master.

Problem: Data was deleted on one master, and the deletion was replicated to the other node (master or replica).

Actions:

  1. Put all nodes in read-only mode. Depending on the replication.failover mode, this can be done as follows:

    • manual: change a replica set leader to null.
    • election: set replication.election_mode to voter or off at the replica set level.
    • off: set database.mode to ro.

    Reload configurations on all instances using the reload() function provided by the config module.

  2. Turn off deletion of expired checkpoints with box.backup.start(). This prevents the Tarantool garbage collector from removing files made with older checkpoints until box.backup.stop() is called (see the sketch after this list).

  3. Get the latest valid .snap file and use the tt cat command to calculate the LSN at which the data loss occurred.

  4. Start a new instance and use the tt play command to play to it the contents of the .snap and .xlog files up to the calculated LSN.

  5. Bootstrap a new replica from the recovered master.
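
A sketch of steps 1 and 2: step 1 is verified on every instance after the configuration has been edited as described above, and step 2 is run on the instance whose checkpoint files you are going to inspect:

    -- Step 1 (verification): after reloading the edited configuration,
    -- every instance must report itself as read-only.
    require('config'):reload()
    assert(box.info.ro == true)

    -- Step 2: freeze checkpoint garbage collection so that the .snap and .xlog
    -- files of older checkpoints stay on disk while you recover the data.
    -- box.backup.start() returns the list of files of the latest checkpoint.
    local files = box.backup.start()

    -- ... perform steps 3-5 ...

    -- When recovery is finished, let the garbage collector resume:
    box.backup.stop()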

Note

The steps above are applicable only to data in the memtx storage engine.
