Replication cluster upgrade
Below are the general instructions for upgrading a Tarantool cluster with replication. Upgrading from some versions involves additional specifics; to find out whether this applies to your case, check the version-specific topics of the Upgrades section.
A replication cluster can be upgraded without downtime due to its redundancy. When you disconnect a single instance for an upgrade, there is always another instance that takes over its functionality: being a master storage for the same data buckets or working as a router. This way, you can upgrade all the instances one by one.
The high-level steps of cluster upgrade are the following:
- Ensure the application compatibility with the target Tarantool version.
- Check the cluster health.
- Install the target Tarantool version on the cluster nodes.
- Upgrade router nodes one by one.
- Upgrade storage replica sets one by one.
Important
The only way to upgrade Tarantool from version 1.6, 1.7, or 1.9 to 2.x without downtime is to take an intermediate step by upgrading to 1.10 and then to 2.x.
Before upgrading Tarantool from 1.6 to 2.x, please read about the associated caveats.
Note
Some upgrade steps are moved to the separate section Procedures and checks to avoid overloading the general instruction with details. Typically, these are checks you should repeat during the upgrade to ensure it goes well.
If you experience issues during upgrade, you can roll back to the original version. The rollback instructions are provided in the Rollback section.
Before upgrading, make sure your application is compatible with the target Tarantool version:
- Set up a development environment with the target Tarantool version installed. See the installation instructions at the Tarantool download page and in the tt install reference.
- Deploy the application in this environment and check how it works. In case of any issues, adjust the application code to ensure compatibility with the target version.
When your application is ready to run on the target Tarantool version, you can start upgrading the production environment.
Perform these steps before the upgrade to ensure that your cluster is working correctly:
- On each router instance, perform the vshard.router check:

  tarantool> vshard.router.info()
  -- no issues in the output
  -- sum of 'bucket.available_rw' == total number of buckets

- On each storage instance, perform the replication check:

  tarantool> box.info
  -- box.info.status == 'running'
  -- box.info.ro == 'false' on one instance in each replica set
  -- box.info.replication[*].upstream.status == 'follow'
  -- box.info.replication[*].downstream.status == 'follow'
  -- box.info.replication[*].upstream.lag <= box.cfg.replication_timeout
  --   (can also be moderately larger under a write load)

- On each storage instance, perform the vshard.storage check:

  tarantool> vshard.storage.info()
  -- no issues in the output
  -- replication.status == 'follow'

- Check all instances’ logs for application errors.
Note
If you’re running Cartridge, you can check the health of the cluster instances on the Cluster tab of its web interface.
In case of any issues, make sure to fix them before starting the upgrade procedure.
Install the target Tarantool version on all hosts of the cluster. You can do this using a package manager or the tt utility. See the installation instructions at the Tarantool download page and in the tt install reference.
Check that the target Tarantool version is installed by running tarantool -v on all hosts.
Upgrade router instances one by one:
- Stop one router instance.
- Start this instance on the target Tarantool version.
- Repeat the previous steps for each router instance.
After completing the router instances upgrade, perform the vshard.router check on each of them.
Before upgrading storage instances:
- Disable Cartridge failover: run tt cartridge failover disable or use the Cartridge web interface (Cluster tab, Failover: <Mode> button).
- Disable the rebalancer:

  tarantool> vshard.storage.rebalancer_disable()

- Make sure that the Cartridge upgrade_schema option is false (see the sketch after this list).
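For Cartridge-based applications, upgrade_schema is an option of cartridge.cfg() in the application entry point (usually init.lua). The fragment below is only a sketch of what to look for; the role list is illustrative and your file layout may differ:

  -- init.lua (fragment): keep the schema upgrade manual during the cluster upgrade
  local cartridge = require('cartridge')

  cartridge.cfg({
      roles = {                                 -- illustrative role list
          'cartridge.roles.vshard-storage',
          'cartridge.roles.vshard-router',
      },
      upgrade_schema = false,                   -- do not upgrade system spaces automatically
  })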
Upgrade storage instances by performing the following steps for each replica set:
Note
To detect possible upgrade issues early, we recommend that you perform a replication check on all instances of the replica set after each step.
- Pick a replica (a read-only instance) from the replica set. Stop this replica and start it again on the target Tarantool version. Wait until it reaches the running status (box.info.status == 'running'); a polling sketch is shown after this list.
- Restart all other read-only instances of the replica set on the target version one by one.
- Make one of the updated replicas the new master using the applicable instruction from Switching the master.
- Restart the last instance of the replica set (the former master, now a replica) on the target version.
- Run box.schema.upgrade() on the new master. This will update the Tarantool system spaces to match the currently installed version of Tarantool. The changes will be propagated to other nodes via the replication mechanism later.
Warning
This is the point of no return for upgrading from versions earlier than 2.8.2: once you complete it, the schema is no longer compatible with the initial version.
When upgrading from version 2.8.2 or newer, you can undo the schema upgrade using box.schema.downgrade().
- Run box.snapshot() on every node in the replica set to make sure that the replicas immediately see the upgraded database state in case of restart.
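A minimal sketch of the waiting step from the first item, run in the console of the restarted replica; the 0.1-second polling interval is an arbitrary choice:

  -- wait until recovery completes and the instance reaches the running status
  fiber = require('fiber')
  while box.info.status ~= 'running' do
      fiber.sleep(0.1)
  end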
Once you complete the steps, enable the failover and the rebalancer again:
- Enable Cartridge failover: run tt cartridge failover set [mode] or use the Cartridge web interface (Cluster tab, Failover: Disabled button).
- Enable the rebalancer:

  tarantool> vshard.storage.rebalancer_enable()
Perform these steps after the upgrade to ensure that your cluster is working correctly:
- On each router instance, perform the vshard.router check:

  tarantool> vshard.router.info()
  -- no issues in the output
  -- sum of 'bucket.available_rw' == total number of buckets

- On each storage instance, perform the replication check:

  tarantool> box.info
  -- box.info.status == 'running'
  -- box.info.ro == 'false' on one instance in each replica set
  -- box.info.replication[*].upstream.status == 'follow'
  -- box.info.replication[*].downstream.status == 'follow'
  -- box.info.replication[*].upstream.lag <= box.cfg.replication_timeout
  --   (can also be moderately larger under a write load)

- On each storage instance, perform the vshard.storage check:

  tarantool> vshard.storage.info()
  -- no issues in the output
  -- replication.status == 'follow'

- Check all instances’ logs for application errors.
Note
If you’re running Cartridge, you can check the health of the cluster instances on the Cluster tab of its web interface.
If you decide to roll back before reaching the point of no return, your data is fully compatible with the version you had before the upgrade. In this case, you can roll back the same way: restart the nodes you’ve already upgraded on the original version.
If you’ve passed the point of no return (that is, executed box.schema.upgrade()) during the upgrade, then a rollback requires downgrading the schema to the original version.

To check if an automatic downgrade is available for your original version, use box.schema.downgrade_versions(). If the version you need is on the list, execute the following steps on each upgraded replica set to roll back:
- Run box.schema.downgrade(<version>) on the master, specifying the original version (see the console sketch after this list).
- Run box.snapshot() on every instance in the replica set to make sure that the replicas immediately see the downgraded database state after restart.
- Restart all read-only instances of the replica set on the original version one by one.
- Make one of the updated replicas the new master using the applicable instruction from Switching the master.
- Restart the last instance of the replica set (the former master, now a replica) on the original version.
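A console sketch of the first two steps; 2.8.2 is only an example standing in for your original version:

  -- on the master of the replica set
  tarantool> box.schema.downgrade_versions()   -- the original version must appear in this list
  tarantool> box.schema.downgrade('2.8.2')     -- '2.8.2' is a placeholder for your original version
  -- then, on every instance of the replica set
  tarantool> box.snapshot()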
Then enable the failover and the rebalancer again, as described in Upgrading storages.
Warning
This section applies to cases when the upgrade procedure has failed and the cluster is not functioning properly anymore. Thus, it implies downtime and a full cluster restart.
In case of an upgrade failure after passing the point of no return, follow these steps to roll back to the original version:
Stop all cluster instances.
Save the snapshot and xlog files from all instances whose data was modified after the last backup procedure. These files will help apply these modifications later.

Save the latest backups from all instances.
Restore the original Tarantool version on all hosts of the cluster.
Launch the cluster on the original Tarantool version.
Note
At this point, the application becomes fully functional and contains data from the backups. However, the data modifications made after the backups were taken must be restored manually.
Manually apply the latest data modifications from the xlog files you saved in step 2 using the xlog module. On instances where such changes happened, do the following (a sketch follows the note below):

- Find out the vclock value of the latest operation in the original WAL.
- Play the operations from the newer xlog starting from this vclock on the instance.
Important
If the upgrade has failed after calling box.schema.upgrade(), don’t apply the modifications of system spaces done by this call. This can make the schema incompatible with the original Tarantool version.
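Below is a minimal, hedged sketch of such a replay, meant to be saved as a script and run on the affected instance. The file name and vclock values are placeholders, system spaces (IDs below 512) are skipped in line with the note above, and only plain INSERT, REPLACE, and DELETE rows are applied; anything else is reported for manual handling.

  -- replay_xlog.lua: a sketch, not a drop-in tool
  local xlog = require('xlog')

  -- placeholder: the vclock of the latest operation in the original WAL,
  -- as {replica_id = lsn} pairs
  local last_vclock = { [1] = 42 }

  -- placeholder: the newer xlog file saved before the rollback
  local file = '00000000000000000123.xlog'

  for _, row in xlog.pairs(file) do
      local h, b = row.HEADER, row.BODY
      local seen = last_vclock[h.replica_id] or 0
      -- apply only rows newer than the saved vclock and skip system spaces (IDs below 512)
      if h.lsn > seen and b ~= nil and b.space_id ~= nil and b.space_id >= 512 then
          local space = box.space[b.space_id]
          if h.type == 'INSERT' then
              space:insert(b.tuple)
          elseif h.type == 'REPLACE' then
              space:replace(b.tuple)
          elseif h.type == 'DELETE' then
              space:delete(b.key)
          else
              print(('lsn=%d type=%s: apply manually'):format(h.lsn, h.type))
          end
      end
  end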
Find more information about the Tarantool recovery in Disaster recovery.
Run box.info:

tarantool> box.info

Check that the following conditions are satisfied:

- box.info.status is running
- box.info.replication[*].upstream.status and box.info.replication[*].downstream.status are follow
- box.info.replication[*].upstream.lag is less than or equal to box.cfg.replication_timeout, but it can also be moderately larger under a write load
- box.info.ro is false on at least one instance in each replica set. If all instances have box.info.ro = true, this means there are no writable nodes. On Tarantool 2.10.0 or later, you can find out why this happened by running box.info.ro_reason. If box.info.ro_reason or box.info.status has the value orphan, the instance doesn’t see the rest of the replica set.

Then run box.info once more and check that the box.info.replication[*].upstream.lag values are updated.
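These checks can be scripted. The following is a minimal sketch that only reads box.info and box.cfg on a single instance; the messages and the choice to treat high lag as a warning rather than an error are illustrative:

  -- run in the instance console: fail fast on broken replication, warn on high lag
  assert(box.info.status == 'running', 'instance status is ' .. box.info.status)
  for id, r in pairs(box.info.replication) do
      -- the entry describing the instance itself has no upstream/downstream
      if r.upstream ~= nil and r.upstream.status ~= 'follow' then
          error(('upstream from instance %d is in the %s state'):format(id, r.upstream.status))
      end
      if r.downstream ~= nil and r.downstream.status ~= 'follow' then
          error(('downstream to instance %d is in the %s state'):format(id, r.downstream.status))
      end
      if r.upstream ~= nil and r.upstream.lag > box.cfg.replication_timeout then
          print(('warning: upstream lag from instance %d exceeds replication_timeout'):format(id))
      end
  end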
Run vshard.storage.info():

tarantool> vshard.storage.info()

Check that the following conditions are satisfied:

- there are no issues or alerts
- replication.status is follow
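A one-chunk sketch of the same check, assuming that the alerts field of vshard.storage.info() is an empty table when the storage is healthy:

  -- run in the storage instance console
  info = vshard.storage.info()
  assert(next(info.alerts) == nil, 'vshard.storage.info() reports alerts')
  assert(info.replication.status == 'follow', 'replication status is ' .. tostring(info.replication.status))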
Run vshard.router.info():

tarantool> vshard.router.info()

Check that the following conditions are satisfied:

- there are no issues or alerts
- all buckets are available (the sum of bucket.available_rw on all replica sets equals the total number of buckets)
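The bucket condition doesn’t have to be summed by hand: the bucket field of vshard.router.info() aggregates the per-replica-set counters, and vshard.router.bucket_count() returns the configured total. A minimal sketch:

  -- run in the router instance console
  info = vshard.router.info()
  assert(next(info.alerts) == nil, 'vshard.router.info() reports alerts')
  assert(info.bucket.available_rw == vshard.router.bucket_count(),
         'some buckets are not available for writes')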
Cartridge. If your cluster runs on Cartridge, you can switch the master in the web interface. To do this, go to the Cluster tab, click Edit replica set, and drag an instance to the top of Failover priority list to make it the master.
Raft. If your cluster uses automated leader election, switch the master by following these steps:
- Pick a candidate – a read-only instance to become the new master.
- Run box.ctl.promote() on the candidate. The operation will start and wait for the election to happen.
- Run box.cfg{ election_mode = "voter" } on the current master.
- Check that the candidate became the new master: its box.info.ro must be false.
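To double-check the election result, you can also inspect box.info.election (available since Tarantool 2.6.1):

  -- on the candidate, after box.ctl.promote() has returned
  tarantool> box.info.election.state   -- 'leader' on the new master
  tarantool> box.info.ro               -- false on the new master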
Legacy. If your cluster neither works on Cartridge nor has automated leader election, switch the master by following these steps:
- Pick a candidate – a read-only instance to become the new master.
- Run box.cfg{ read_only = true } on the current master.
- Check that the candidate’s vclock value matches the master’s: the value of box.info.vclock[<master_id>] on the candidate must be equal to box.info.lsn on the master. <master_id> here is the value of box.info.id on the master. A console sketch of this check follows these steps.
- If the values match, make the candidate writable by running box.cfg{ read_only = false } on it. The candidate becomes the new master.

If the vclock values don’t match, stop the switch procedure and restore the replica set state by calling box.cfg{ read_only = false } on the master. Then pick another candidate and restart the procedure.
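A console sketch of the vclock comparison and the final switch; <master_id> is the same placeholder as above and stands for the numeric box.info.id of the master:

  -- on the current master
  tarantool> box.cfg{ read_only = true }
  tarantool> box.info.id    -- this is <master_id>
  tarantool> box.info.lsn   -- remember this value

  -- on the candidate
  tarantool> box.info.vclock[<master_id>]   -- must equal the master's box.info.lsn
  tarantool> box.cfg{ read_only = false }   -- run only if the values match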
After switching the master, perform the replication check on each instance of the replica set.