Replication cluster upgrade
Below are the general instructions for upgrading a Tarantool cluster with replication. Upgrading from some versions involves additional specifics; to find out whether this applies to your case, check the version-specific topics of the Upgrades section.
A replication cluster can be upgraded without downtime due to its redundancy. When you disconnect a single instance for an upgrade, there is always another instance that takes over its functionality: being a master storage for the same data buckets or working as a router. This way, you can upgrade all the instances one by one.
The high-level steps of cluster upgrade are the following:
- Ensure the application compatibility with the target Tarantool version.
- Check the cluster health.
- Install the target Tarantool version on the cluster nodes.
- Upgrade router nodes one by one.
- Upgrade storage replica sets one by one.
Important
The only way to upgrade Tarantool from version 1.6, 1.7, or 1.9 to 2.x without downtime is to take an intermediate step by upgrading to 1.10 and then to 2.x.
Before upgrading Tarantool from 1.6 to 2.x, please read about the associated caveats.
Note
Some upgrade steps are moved to the separate section Procedures and checks to avoid overloading the general instruction with details. Typically, these are checks you should repeat during the upgrade to ensure it goes well.
If you experience issues during upgrade, you can roll back to the original version. The rollback instructions are provided in the Rollback section.
Before upgrading, make sure your application is compatible with the target Tarantool version:
- Set up a development environment with the target Tarantool version installed. See the installation instructions at the Tarantool download page and in the tt install reference.
- Deploy the application in this environment and check how it works. In case of any issues, adjust the application code to ensure compatibility with the target version.
When your application is ready to run on the target Tarantool version, you can start upgrading the production environment.
Perform these steps before the upgrade to ensure that your cluster is working correctly:
- On each router instance, perform the vshard.router check:

  tarantool> vshard.router.info()
  -- no issues in the output
  -- sum of 'bucket.available_rw' == total number of buckets

- On each storage instance, perform the replication check:

  tarantool> box.info
  -- box.info.status == 'running'
  -- box.info.ro == 'false' on one instance in each replica set
  -- box.info.replication[*].upstream.status == 'follow'
  -- box.info.replication[*].downstream.status == 'follow'
  -- box.info.replication[*].upstream.lag <= box.cfg.replication_timeout
  --   (can also be moderately larger under a write load)

- On each storage instance, perform the vshard.storage check:

  tarantool> vshard.storage.info()
  -- no issues in the output
  -- replication.status == 'follow'

- Check all instances’ logs for application errors.
Note
If you’re running Cartridge, you can check the health of the cluster instances on the Cluster tab of its web interface.
In case of any issues, make sure to fix them before starting the upgrade procedure.
Install the target Tarantool version on all hosts of the cluster. You can do this using a package manager or the tt utility. See the installation instructions at the Tarantool download page and in the tt install reference.
Check that the target Tarantool version is installed by running tarantool -v on all hosts.
Upgrade router instances one by one:
- Stop one router instance.
- Start this instance on the target Tarantool version.
- Repeat the previous steps for each router instance.
After completing the router instances upgrade, perform the vshard.router check on each of them.
Before upgrading storage instances:
- Disable Cartridge failover: run tt cartridge failover disable or use the Cartridge web interface (Cluster tab, Failover: <Mode> button).
- Disable the rebalancer:

  tarantool> vshard.storage.rebalancer_disable()

- Make sure that the Cartridge upgrade_schema option is false (see the sketch after this list).
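For Cartridge-based applications, upgrade_schema is an option of cartridge.cfg() in the application entry point (usually init.lua). The fragment below is only a sketch of what to look for; the role list is illustrative and your file layout may differ:

  -- init.lua (fragment): keep the schema upgrade manual during the cluster upgrade
  local cartridge = require('cartridge')

  cartridge.cfg({
      roles = {                                 -- illustrative role list
          'cartridge.roles.vshard-storage',
          'cartridge.roles.vshard-router',
      },
      upgrade_schema = false,                   -- do not upgrade system spaces automatically
  })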
Upgrade storage instances by performing the following steps for each replica set:
Note
To detect possible upgrade issues early, we recommend that you perform a replication check on all instances of the replica set after each step.
- Pick a replica (a read-only instance) from the replica set. Stop this replica and start it again on the target Tarantool version. Wait until it reaches the running status (box.info.status == 'running'); a polling sketch is shown after this list.
- Restart all other read-only instances of the replica set on the target version one by one.
- Make one of the updated replicas the new master using the applicable instruction from Switching the master.
- Restart the last instance of the replica set (the former master, now a replica) on the target version.
- Run box.schema.upgrade() on the new master. This will update the Tarantool system spaces to match the currently installed version of Tarantool. The changes will be propagated to other nodes via the replication mechanism later.
Warning
This is the point of no return for upgrading from versions earlier than 2.8.2: once you complete it, the schema is no longer compatible with the initial version.
When upgrading from version 2.8.2 or newer, you can undo the schema upgrade using box.schema.downgrade().
- Run box.snapshot() on every node in the replica set to make sure that the replicas immediately see the upgraded database state in case of restart.
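A minimal sketch of the waiting step from the first item, run in the console of the restarted replica; the 0.1-second polling interval is an arbitrary choice:

  -- wait until recovery completes and the instance reaches the running status
  fiber = require('fiber')
  while box.info.status ~= 'running' do
      fiber.sleep(0.1)
  end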
Once you complete the steps, enable the failover and the rebalancer again:
- Enable Cartridge failover: run tt cartridge failover set [mode] or use the Cartridge web interface (Cluster tab, Failover: Disabled button).
- Enable the rebalancer:

  tarantool> vshard.storage.rebalancer_enable()
Perform these steps after the upgrade to ensure that your cluster is working correctly:
- On each router instance, perform the vshard.router check:

  tarantool> vshard.router.info()
  -- no issues in the output
  -- sum of 'bucket.available_rw' == total number of buckets

- On each storage instance, perform the replication check:

  tarantool> box.info
  -- box.info.status == 'running'
  -- box.info.ro == 'false' on one instance in each replica set
  -- box.info.replication[*].upstream.status == 'follow'
  -- box.info.replication[*].downstream.status == 'follow'
  -- box.info.replication[*].upstream.lag <= box.cfg.replication_timeout
  --   (can also be moderately larger under a write load)

- On each storage instance, perform the vshard.storage check:

  tarantool> vshard.storage.info()
  -- no issues in the output
  -- replication.status == 'follow'

- Check all instances’ logs for application errors.
Note
If you’re running Cartridge, you can check the health of the cluster instances on the Cluster tab of its web interface.
If you decide to roll back before reaching the point of no return, your data is fully compatible with the version you had before the upgrade. In this case, you can roll back the same way: restart the nodes you’ve already upgraded on the original version.
If you’ve passed the point of no return (that is, executed box.schema.upgrade()) during the upgrade, then a rollback requires downgrading the schema to the original version.

To check if an automatic downgrade is available for your original version, use box.schema.downgrade_versions(). If the version you need is on the list, execute the following steps on each upgraded replica set to roll back:
- Run box.schema.downgrade(<version>) on the master, specifying the original version (see the console sketch after this list).
- Run box.snapshot() on every instance in the replica set to make sure that the replicas immediately see the downgraded database state after restart.
- Restart all read-only instances of the replica set on the original version one by one.
- Make one of the updated replicas the new master using the applicable instruction from Switching the master.
- Restart the last instance of the replica set (the former master, now a replica) on the original version.
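A console sketch of the first two steps; 2.8.2 is only an example standing in for your original version:

  -- on the master of the replica set
  tarantool> box.schema.downgrade_versions()   -- the original version must appear in this list
  tarantool> box.schema.downgrade('2.8.2')     -- '2.8.2' is a placeholder for your original version
  -- then, on every instance of the replica set
  tarantool> box.snapshot()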
Then enable the failover and the rebalancer again, as described in Upgrading storages.
Warning
This section applies to cases when the upgrade procedure has failed and the cluster is not functioning properly anymore. Thus, it implies downtime and a full cluster restart.
In case of an upgrade failure after passing the point of no return, follow these steps to roll back to the original version:
Stop all cluster instances.
Save the snapshot and xlog files from all instances whose data was modified after the last backup procedure. These files will help apply these modifications later.

Save the latest backups from all instances.
Restore the original Tarantool version on all hosts of the cluster.
Launch the cluster on the original Tarantool version.
Note
At this point, the application becomes fully functional and contains data from the backups. However, the data modifications made after the backups were taken must be restored manually.
Manually apply the latest data modifications from the xlog files you saved in step 2 using the xlog module. On instances where such changes happened, do the following (a sketch follows the note below):

- Find out the vclock value of the latest operation in the original WAL.
- Play the operations from the newer xlog starting from this vclock on the instance.
Important
If the upgrade has failed after calling box.schema.upgrade(), don’t apply the modifications of system spaces done by this call. This can make the schema incompatible with the original Tarantool version.
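Below is a minimal, hedged sketch of such a replay, meant to be saved as a script and run on the affected instance. The file name and vclock values are placeholders, system spaces (IDs below 512) are skipped in line with the note above, and only plain INSERT, REPLACE, and DELETE rows are applied; anything else is reported for manual handling.

  -- replay_xlog.lua: a sketch, not a drop-in tool
  local xlog = require('xlog')

  -- placeholder: the vclock of the latest operation in the original WAL,
  -- as {replica_id = lsn} pairs
  local last_vclock = { [1] = 42 }

  -- placeholder: the newer xlog file saved before the rollback
  local file = '00000000000000000123.xlog'

  for _, row in xlog.pairs(file) do
      local h, b = row.HEADER, row.BODY
      local seen = last_vclock[h.replica_id] or 0
      -- apply only rows newer than the saved vclock and skip system spaces (IDs below 512)
      if h.lsn > seen and b ~= nil and b.space_id ~= nil and b.space_id >= 512 then
          local space = box.space[b.space_id]
          if h.type == 'INSERT' then
              space:insert(b.tuple)
          elseif h.type == 'REPLACE' then
              space:replace(b.tuple)
          elseif h.type == 'DELETE' then
              space:delete(b.key)
          else
              print(('lsn=%d type=%s: apply manually'):format(h.lsn, h.type))
          end
      end
  end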
Find more information about the Tarantool recovery in Disaster recovery.
Run box.info:

tarantool> box.info

Check that the following conditions are satisfied:

- box.info.status is running
- box.info.replication[*].upstream.status and box.info.replication[*].downstream.status are follow
- box.info.replication[*].upstream.lag is less than or equal to box.cfg.replication_timeout, but it can also be moderately larger under a write load
- box.info.ro is false on at least one instance in each replica set. If all instances have box.info.ro = true, this means there are no writable nodes. On Tarantool 2.10.0 or later, you can find out why this happened by running box.info.ro_reason. If box.info.ro_reason or box.info.status has the value orphan, the instance doesn’t see the rest of the replica set.

Then run box.info once more and check that the box.info.replication[*].upstream.lag values are updated.
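These checks can be scripted. The following is a minimal sketch that only reads box.info and box.cfg on a single instance; the messages and the choice to treat high lag as a warning rather than an error are illustrative:

  -- run in the instance console: fail fast on broken replication, warn on high lag
  assert(box.info.status == 'running', 'instance status is ' .. box.info.status)
  for id, r in pairs(box.info.replication) do
      -- the entry describing the instance itself has no upstream/downstream
      if r.upstream ~= nil and r.upstream.status ~= 'follow' then
          error(('upstream from instance %d is in the %s state'):format(id, r.upstream.status))
      end
      if r.downstream ~= nil and r.downstream.status ~= 'follow' then
          error(('downstream to instance %d is in the %s state'):format(id, r.downstream.status))
      end
      if r.upstream ~= nil and r.upstream.lag > box.cfg.replication_timeout then
          print(('warning: upstream lag from instance %d exceeds replication_timeout'):format(id))
      end
  end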
Run vshard.storage.info():

tarantool> vshard.storage.info()

Check that the following conditions are satisfied:

- there are no issues or alerts
- replication.status is follow
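A one-chunk sketch of the same check, assuming that the alerts field of vshard.storage.info() is an empty table when the storage is healthy:

  -- run in the storage instance console
  info = vshard.storage.info()
  assert(next(info.alerts) == nil, 'vshard.storage.info() reports alerts')
  assert(info.replication.status == 'follow', 'replication status is ' .. tostring(info.replication.status))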
Run vshard.router.info():

tarantool> vshard.router.info()

Check that the following conditions are satisfied:

- there are no issues or alerts
- all buckets are available (the sum of bucket.available_rw on all replica sets equals the total number of buckets)
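The bucket condition doesn’t have to be summed by hand: the bucket field of vshard.router.info() aggregates the per-replica-set counters, and vshard.router.bucket_count() returns the configured total. A minimal sketch:

  -- run in the router instance console
  info = vshard.router.info()
  assert(next(info.alerts) == nil, 'vshard.router.info() reports alerts')
  assert(info.bucket.available_rw == vshard.router.bucket_count(),
         'some buckets are not available for writes')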
Cartridge. If your cluster runs on Cartridge, you can switch the master in the web interface. To do this, go to the Cluster tab, click Edit replica set, and drag an instance to the top of Failover priority list to make it the master.
Raft. If your cluster uses automated leader election, switch the master by following these steps:
- Pick a candidate – a read-only instance to become the new master.
- Run box.ctl.promote() on the candidate. The operation will start and wait for the election to happen.
- Run box.cfg{ election_mode = "voter" } on the current master.
- Check that the candidate became the new master: its box.info.ro must be false.
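To double-check the election result, you can also inspect box.info.election (available since Tarantool 2.6.1):

  -- on the candidate, after box.ctl.promote() has returned
  tarantool> box.info.election.state   -- 'leader' on the new master
  tarantool> box.info.ro               -- false on the new master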
Legacy. If your cluster neither works on Cartridge nor has automated leader election, switch the master by following these steps:
- Pick a candidate – a read-only instance to become the new master.
- Run box.cfg{ read_only = true } on the current master.
- Check that the candidate’s vclock value matches the master’s: the value of box.info.vclock[<master_id>] on the candidate must be equal to box.info.lsn on the master. <master_id> here is the value of box.info.id on the master. A console sketch of this check follows these steps.
- If the values match, make the candidate writable by running box.cfg{ read_only = false } on it. The candidate becomes the new master.

If the vclock values don’t match, stop the switch procedure and restore the replica set state by calling box.cfg{ read_only = false } on the master. Then pick another candidate and restart the procedure.
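A console sketch of the vclock comparison and the final switch; <master_id> is the same placeholder as above and stands for the numeric box.info.id of the master:

  -- on the current master
  tarantool> box.cfg{ read_only = true }
  tarantool> box.info.id    -- this is <master_id>
  tarantool> box.info.lsn   -- remember this value

  -- on the candidate
  tarantool> box.info.vclock[<master_id>]   -- must equal the master's box.info.lsn
  tarantool> box.cfg{ read_only = false }   -- run only if the values match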
After switching the master, perform the replication check on each instance of the replica set.