Overview
You noticed that Lithium SSI crash files were written on both SMS Router nodes and that all 'tp_-processes' have been shut down on the SPF nodes.
This article describes the root cause of this issue and how to resolve it.
Solution
This crash happens when SSI is busy inside an NDB API I/O function and cannot respond to its watchdog in time, which triggers a 'watchdog kill'. It is usually a side effect of other issues that slow down the system. The default NDB transaction timeout is 10 seconds and the SSI watchdog timer is 5 seconds, so a 'watchdog kill' may happen whenever NDB Cluster does not respond in time.
To prevent this issue, we suggest reducing the Transaction Deadlock Detection Timeout from 10 seconds to 3 seconds, which is below the 5-second SSI watchdog timer. Please perform the following steps:
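The relationship between the two timers can be checked with a short script. The following is a minimal sketch, not part of the product tooling: the function names get_deadlock_timeout_ms and timeout_below_watchdog are hypothetical, and the 5000 ms watchdog value is simply the 5 seconds quoted in this article.

```shell
# Hypothetical helpers (not shipped with the product).

# Print the TransactionDeadlockDetectionTimeout value (milliseconds) found in
# an NDB config.ini. Prints nothing if the parameter is not set in the file.
# $1: path to config.ini
get_deadlock_timeout_ms() {
    sed -n 's/^[[:space:]]*TransactionDeadlockDetectionTimeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1
}

# Succeed (exit 0) only when the NDB timeout is shorter than the watchdog timer.
# $1: NDB timeout in ms, $2: watchdog timer in ms (5000 according to this article)
timeout_below_watchdog() {
    [ -n "$1" ] && [ "$1" -lt "$2" ]
}

# Example usage on the MGR node:
#   ms=$(get_deadlock_timeout_ms /var/lib/mysql-cluster/config.ini)
#   timeout_below_watchdog "$ms" 5000 || echo "NDB timeout ($ms ms) is not below the SSI watchdog"
```

With the original value of 10000 the check fails, and with the suggested value of 3000 it passes.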
1. Log in as the root user on all SPF nodes of the Master site and stop the provisioning:
systemctl stop spf_soap
2. From the NDB Cluster Management Node (MGR) or any of the SPF nodes of the Master site, take a backup of the SPF DB:
ndb_mgm -e "start backup"
3. Check the backup files on all SPFs. The highest BACKUP-XX folder number contains the backup just taken:
ls -lth /dbspf/mysqlcluster/BACKUP/
4. On the MGR node of the Master SPF cluster, save a copy of the config.ini file and then edit it:
cp /var/lib/mysql-cluster/config.ini /var/lib/mysql-cluster/config.ini_BKP
vi /var/lib/mysql-cluster/config.ini
Change the Transaction Deadlock Detection Timeout from:
TransactionDeadlockDetectionTimeout=10000
to:
TransactionDeadlockDetectionTimeout=3000
5. As the root user on the NDB Cluster Management Node (MGR), stop the management server from the management client and then start it again so that it reads in the new configuration file:
ndb_mgm
1 stop
quit
ndb_mgmd --config-file=/var/lib/mysql-cluster/config.ini --reload
6. Restart all SPF nodes, one at a time, by executing the following commands:
systemctl stop mysql
systemctl stop ndbmtd
systemctl start ndbmtd
systemctl start mysql
7. Check the NDB Cluster status by running the following command from any of the nodes:
ndb_mgm -e "show"
8. As the 'textpass' user, check the SPF processes:
tp_status
9. Start the provisioning on all SPF nodes as the root user:
systemctl start spf_soap
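For sites that prefer to script the config.ini change in step 4 rather than edit the file by hand, the following is a minimal sketch under stated assumptions: the function name set_deadlock_timeout_ms is hypothetical, the parameter is assumed to already be present in the file, and the backup name follows the _BKP suffix used in step 4.

```shell
# Hypothetical helper (not shipped with the product).
# Set TransactionDeadlockDetectionTimeout in an NDB config.ini, keeping a backup.
# $1: path to config.ini, $2: new value in milliseconds
set_deadlock_timeout_ms() {
    # Stop if the parameter is absent, so that nothing is silently skipped.
    grep -q '^[[:space:]]*TransactionDeadlockDetectionTimeout[[:space:]]*=' "$1" || return 1
    # Keep a copy of the current file, as in step 4.
    cp -p "$1" "${1}_BKP" || return 1
    # Rewrite from the backup into the original file; this avoids 'sed -i',
    # whose syntax differs between sed variants, and keeps the file's permissions.
    sed "s/^\([[:space:]]*TransactionDeadlockDetectionTimeout[[:space:]]*=[[:space:]]*\)[0-9][0-9]*/\1$2/" "${1}_BKP" > "$1"
}

# Example usage on the MGR node:
#   set_deadlock_timeout_ms /var/lib/mysql-cluster/config.ini 3000
#   grep TransactionDeadlockDetectionTimeout /var/lib/mysql-cluster/config.ini
```

The management server and the SPF nodes still need to be restarted as described in steps 5 and 6 before the new value takes effect.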
Testing
Once the change is implemented, please monitor the nodes for some time to ensure that Lithium SSI is functioning normally. If the issue persists, please open a support ticket and attach the tp_walkall output and the syslogs from the SPF and SMR nodes covering the time of the SSI crash.
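While monitoring, the cluster status check from step 7 can be reduced to a single number. The following is a minimal sketch: the function name count_disconnected_nodes is hypothetical, and it assumes that the output of ndb_mgm -e "show" marks an unavailable node with the text "not connected".

```shell
# Hypothetical helper (not shipped with the product).
# Count the lines reporting a node as "not connected" in the output of
# 'ndb_mgm -e "show"'. Reads that output on stdin and prints the count.
count_disconnected_nodes() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that status.
    grep -c 'not connected' || true
}

# Example usage from any of the nodes:
#   ndb_mgm -e "show" | count_disconnected_nodes
```

A result of 0 suggests that every configured node is connected. Any other value is a reason to review the full output of the show command.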