Overview
This article explains how the software watchdog service can cause network traffic to drop by killing the pang process, and outlines the steps to resolve the issue.
Root Cause
The software watchdog service sends a signal to kill the pang process after repeated failed health checks, as shown in the log excerpt below:
11-06-19 17:15:09.118, 1477 int state_machine_c::poll_health_status(health_socket&) poll_health_status - get_health failed or unsifficient health detected 25 times , pang will be restarted...
11-06-19 17:15:09.132, 1477 void persistency_data_c::set_watchdog_state(watchdog_states) new value =3 old value =2
11-06-19 17:15:09.132, 1477 void persistency_data_c::store_pang_op_mode_switch() Switch state to FAILURE(2) at 11-06-19 11:28:29 (1560252509)
11-06-19 17:15:09.166, 1477 bool state_machine_c::send_signal_to_pang(int) go_kill_command - executed kill(7278 , 4 )
Killing the pang process causes a drop in the traffic handled by the system.
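To check how often this has happened, the kill entries can be counted in the watchdog log. The following is a minimal sketch: the function name count_watchdog_kills is hypothetical, and because this article does not state where the watchdog log is stored, the log path must be supplied by the caller. It matches the "go_kill_command - executed kill" text seen in the excerpt above.

```shell
#!/bin/sh
# Count watchdog-initiated pang kills recorded in a log file.
# The log file location is not given in this article, so it is passed
# in as an argument (hypothetical helper, not part of the product).
count_watchdog_kills() {
    log_file="$1"
    # grep -c prints the number of matching lines. It exits non-zero
    # when nothing matches, so "|| true" keeps the function from failing.
    grep -c 'go_kill_command - executed kill' "$log_file" || true
}
```

For example, count_watchdog_kills followed by the path to the watchdog log prints the number of kill events found in that file.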
Process
Increasing the value of the max_application_packet_q parameter gives the system more time to consume the inspector queues, which in turn delays the point at which the watchdog kills the pang process. This parameter has no correlation with packet delays; it is used only for health detection of the inspector threads, so increasing its value is safe. The default value is 5000.
The optimal value for this parameter depends on the load on the system. A value of 10000 can work without any delay or overhead for 6 Gbps of traffic.
UltraBand 5.7.x
- Edit the file:
/opt/pang/conf/pang.conf
- Add the following line:
max_application_packet_q 10000
- Reload the configuration by sending signal 1 (SIGHUP) to the pang process:
pkill -1 pang
UltraBand 6.x
- Edit the file:
/opt/pang/conf/pang.conf
- Add the following line:
max_application_packet_q 10000
- Reload the configuration by sending signal 1 (SIGHUP) to the fp-rte process:
pkill -1 fp-rte
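The edit step can also be scripted so that it is repeatable. The following is a minimal sketch, not a vendor-supplied tool: the function name set_max_application_packet_q is hypothetical, and it assumes the file uses the one-setting-per-line "name value" format shown above. It keeps a backup copy with a .bak suffix, replaces the setting if a line for it already exists, and appends it otherwise.

```shell
#!/bin/sh
# Set max_application_packet_q in a pang.conf-style file.
# Assumption: one "name value" setting per line, as shown in this article.
set_max_application_packet_q() {
    conf_file="$1"
    new_value="$2"
    # Keep a backup copy before changing anything.
    cp "$conf_file" "$conf_file.bak" || return 1
    if grep -q '^[[:space:]]*max_application_packet_q[[:space:]]' "$conf_file"; then
        # The setting already exists: rewrite that line with the new value.
        sed "s/^[[:space:]]*max_application_packet_q[[:space:]].*/max_application_packet_q $new_value/" \
            "$conf_file.bak" > "$conf_file"
    else
        # Make sure the file ends with a newline before appending.
        if [ -n "$(tail -c 1 "$conf_file")" ]; then
            printf '\n' >> "$conf_file"
        fi
        printf 'max_application_packet_q %s\n' "$new_value" >> "$conf_file"
    fi
}
```

Calling set_max_application_packet_q with /opt/pang/conf/pang.conf and 10000 as arguments applies the value used in this article. The configuration still has to be reloaded afterwards with the pkill command for the installed version.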