When sending SMS messages to an SMSC, you notice MO timeouts while routing from two sites (Sites 1 and 2) to a third site (Site 3).
Check logs from affected RTRs
Are the MO timeouts always happening on one RTR, or on multiple RTRs at the same site? From the affected RTR(s), obtain the following for analysis:
- tp_walkall output;
- syslog (/var/log/messages);
- Wireshark trace.
Check routing rules and application names
What is the routing flow (MO-MO/MO-AT) on the third site? Check the names of the routing rules involved in this flow. With the rule and application names confirmed, you can search for timeouts or other issues affecting the outgoing traffic to the application.
For an application assigned like the following, identify any MO timeouts related to that application in the tp_walkall output:
moRtgRuleApplication.23 = INTEGER: 55
applicationName.55 = STRING: <APPLICATION_NAME>
Suppose the default MO timer is 5s (Sites 1 and 2) and the default AT timer is also 5s (Site 3). If the recipient application does not respond fast enough, or a temporary error occurs, the MO timer (T1) can time out before the AT timer (T2) for this message flow, i.e. MO-MO in Site 1/Site 2 (T1) <=> MO-AT in Site 3 (T2).
applicationCntMoAtFromAmsSuccess.55 = Counter32: 23886
applicationCntMoAtFromAmsTemporaryError.55 = Counter32: 218
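As a rough aid, the lookup from routing rule to application and the share of temporary errors can be sketched in Python. This is a minimal sketch that assumes the tp_walkall output has been saved as text in the `name.index = TYPE: value` form shown above. The helper names `parse_walk` and `temporary_error_ratio` are hypothetical, and the ratio only considers the two counters quoted here, not any other outcome counters that may exist.

```python
import re

# Matches lines such as "applicationCntMoAtFromAmsSuccess.55 = Counter32: 23886".
LINE_RE = re.compile(r"^(\w+)\.(\d+)\s*=\s*(\w+):\s*(.*)$")


def parse_walk(text):
    """Return {(object_name, index): value} for every parsable line."""
    values = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, index, _type, value = match.groups()
            values[(name, int(index))] = value.strip()
    return values


def temporary_error_ratio(values, app_id):
    """Share of temporary errors among success + temporary error counts."""
    ok = int(values[("applicationCntMoAtFromAmsSuccess", app_id)])
    temp = int(values[("applicationCntMoAtFromAmsTemporaryError", app_id)])
    total = ok + temp
    return temp / total if total else 0.0


# Sample lines copied from this article.
sample = """\
moRtgRuleApplication.23 = INTEGER: 55
applicationName.55 = STRING: <APPLICATION_NAME>
applicationCntMoAtFromAmsSuccess.55 = Counter32: 23886
applicationCntMoAtFromAmsTemporaryError.55 = Counter32: 218
"""

values = parse_walk(sample)
app_id = int(values[("moRtgRuleApplication", 23)])  # rule 23 -> application ID
print(app_id, values[("applicationName", app_id)])
print(f"{temporary_error_ratio(values, app_id):.2%}")
```

Because these are cumulative counters, comparing two walks taken some time apart gives a better picture than a single snapshot.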
In the syslogs, are there any messages indicating that the AO throughput limit has been hit for the application? For example:
AO throughput limit of application '<APPLICATION_NAME>' (ID 55) hit
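To count such messages rather than eyeball them, a small Python filter can help. This is a sketch only: the regular expression is built from the single message shown above, and the helper name `count_limit_hits` is hypothetical.

```python
import re
from collections import Counter

# Pattern derived from the one sample message quoted in this article.
LIMIT_RE = re.compile(
    r"AO throughput limit of application '(?P<name>[^']*)' \(ID (?P<id>\d+)\) hit"
)


def count_limit_hits(lines):
    """Count throughput-limit messages per (application name, application ID)."""
    hits = Counter()
    for line in lines:
        match = LIMIT_RE.search(line)
        if match:
            hits[(match.group("name"), int(match.group("id")))] += 1
    return hits


# Illustrative input; the middle line stands in for unrelated syslog traffic.
sample_lines = [
    "AO throughput limit of application '<APPLICATION_NAME>' (ID 55) hit",
    "some unrelated syslog line",
    "AO throughput limit of application '<APPLICATION_NAME>' (ID 55) hit",
]
hits = count_limit_hits(sample_lines)
print(hits)
```

On a node, the same function can be fed an open file object for /var/log/messages instead of the sample list.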
You can also search the syslog of the Site 3 nodes for issues involving the application that coincide with the MO timeouts observed on the Site 1 and 2 nodes.
As the default MO timer on the Site 1 and 2 nodes is the same as the application AT timer on Site 3 (5s), you can decrease the maximum AT response time so that the MO timer does not expire before the message is delivered.
applicationOutsideMaxResponseTime.55 = Gauge32: 5
For example, set it to 4s for the application and monitor the result.
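The reasoning behind lowering the AT response time can be illustrated with a simple timing model. This is a sketch under a stated assumption, not the product's documented timer logic: it assumes the MO-ack reaches Site 1/2 no later than the maximum AT response time plus an inter-site delay. The function names and the `inter_site_delay_s` parameter are hypothetical.

```python
def mo_timer_margin(mo_timer_s, at_max_response_s, inter_site_delay_s=0.0):
    """Seconds left on T1 once the AT side has used its full allowance.

    Simplified model (an assumption): the worst-case MO-ack arrives
    at_max_response_s + inter_site_delay_s after the MO was sent.
    """
    return mo_timer_s - (at_max_response_s + inter_site_delay_s)


def ack_can_arrive_in_time(mo_timer_s, at_max_response_s, inter_site_delay_s=0.0):
    """True when the worst-case AT response still fits inside the MO timer."""
    return mo_timer_margin(mo_timer_s, at_max_response_s, inter_site_delay_s) > 0


# Values from the text: T1 = 5s on Sites 1/2; T2 = 5s on Site 3, then lowered to 4s.
for at_timer in (5, 4):
    print(at_timer, mo_timer_margin(5, at_timer), ack_can_arrive_in_time(5, at_timer))
```

With equal timers the margin is zero, so any extra delay lets T1 expire first. Lowering the AT value to 4s leaves one second of headroom in this model, and real inter-site latency reduces that headroom further.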
Check for Outside 'deliver_sm' response timeout error
If you find the error tp_hub: App: 'APPLICATION_NAME' - Outside 'deliver_sm' response timeout error in any Site 3 RTR logs, this can be related to the MO timeouts observed in Sites 1/2.
See if you can find a corresponding MO timeout error in Site 1/2 logs. Conversely, for the MO timeouts that are appearing in Site 1/2 when routing to a Site 3 RTR, are there any Outside 'deliver_sm' response timeout errors in /var/log/messages of the corresponding Site 3 RTR? If not, move to the next subheading.
Since the complete flow is MO-MO (Site 1/2) and MO-(ST)-AT (Site 3), the Outside 'deliver_sm' response timeout can impact the originating MO-MO path. In this case, you must investigate the reason for these timeouts on the AT side with the owner of the application concerned.
If, in the tp_walkall output from some RTRs, there are no counters for the moRoutingRule = 'APPLICATION_NAME' with action StoreForDeliveryToApplication, for example:
moRtgRuleName.23 = STRING: APPLICATION_NAME
moRtgRulePriority.23 = Gauge32: 60
moRtgRuleAppliedCounter.23 = Counter32: 0
check whether this rule has been applied on other RTRs.
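One way to compare the applied counter across RTRs is sketched below in Python. It assumes the tp_walkall output of each RTR is available as text in the form shown above. The RTR labels `rtr-a` and `rtr-b` and the non-zero counter value are made-up sample data, and `applied_counter` is a hypothetical helper.

```python
import re

NAME_RE = re.compile(r"^moRtgRuleName\.(\d+)\s*=\s*STRING:\s*(.*)$")
COUNT_RE = re.compile(r"^moRtgRuleAppliedCounter\.(\d+)\s*=\s*Counter32:\s*(\d+)$")


def applied_counter(walk_text, rule_name):
    """Return the applied counter for rule_name, or None if it is absent."""
    index = None
    counters = {}
    for raw in walk_text.splitlines():
        line = raw.strip()
        name_match = NAME_RE.match(line)
        if name_match and name_match.group(2).strip() == rule_name:
            index = name_match.group(1)
        count_match = COUNT_RE.match(line)
        if count_match:
            counters[count_match.group(1)] = int(count_match.group(2))
    return counters.get(index)


# Made-up sample: one RTR where the rule never fired, one where it did.
walks = {
    "rtr-a": "moRtgRuleName.23 = STRING: APPLICATION_NAME\n"
             "moRtgRuleAppliedCounter.23 = Counter32: 0\n",
    "rtr-b": "moRtgRuleName.23 = STRING: APPLICATION_NAME\n"
             "moRtgRuleAppliedCounter.23 = Counter32: 1520\n",
}
result = {rtr: applied_counter(text, "APPLICATION_NAME") for rtr, text in walks.items()}
print(result)
```

An RTR that reports 0 or None while its peers show a growing counter is a candidate for a closer look at its routing configuration.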
Alternatively, you can change the routing action from "Store for delivery to application" to "Route to application fallback to storage" and select the "Always Respond With Ack" checkbox. This ensures that an MO-ack is always sent to Sites 1 and 2.
If the "Route to application fallback to storage" routing action for the MOR rule is not licensed, contact the ZephyrTel Sales Team.
Monitor network and node performance
Observe whether these timeouts happen more frequently at a certain time of day, during high traffic, and so on. Statistics collected over a few days should help. If the MO timeouts increase slightly during busy hours, this suggests performance limitations on Site 3.
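A per-hour count of timeout messages makes busy-hour patterns easier to spot. The Python sketch below assumes the traditional syslog timestamp prefix (`Mon DD HH:MM:SS`) in /var/log/messages and uses the Site 3 error text quoted earlier as the default search string. The host name and timestamps in the sample are invented, and `timeouts_per_hour` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumes the traditional syslog prefix, e.g. "Mar  3 09:15:02 host ...".
TS_RE = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+(\d{2}):\d{2}:\d{2}\s")
TIMEOUT_TEXT = "Outside 'deliver_sm' response timeout error"


def timeouts_per_hour(lines, needle=TIMEOUT_TEXT):
    """Count lines containing needle, grouped by hour of day (0-23)."""
    per_hour = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = TS_RE.match(line)
        if match:
            per_hour[int(match.group(1))] += 1
    return per_hour


# Invented sample lines for illustration only.
sample_lines = [
    "Mar  3 09:15:02 rtr3a tp_hub: App: 'APPLICATION_NAME' - Outside 'deliver_sm' response timeout error",
    "Mar  3 09:47:31 rtr3a tp_hub: App: 'APPLICATION_NAME' - Outside 'deliver_sm' response timeout error",
    "Mar  3 14:02:10 rtr3a tp_hub: App: 'APPLICATION_NAME' - Outside 'deliver_sm' response timeout error",
    "Mar  3 14:05:00 rtr3a some unrelated message",
]
per_hour = timeouts_per_hour(sample_lines)
for hour in sorted(per_hour):
    print(f"{hour:02d}:00 {per_hour[hour]}")
```

Passing a different `needle` lets the same function be reused for the MO timeout messages on the Site 1 and 2 nodes, whose exact wording is not shown in this article.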
Is there any network latency between Sites 1 and 2 and Site 3 that could delay receipt of the MO-ack?
Another way forward is to monitor the performance of the Site 3 nodes, as the received MO is first stored before delivery to the application. If the Site 3 Traffic Elements are not performing well, this can result in MO timeouts.
For alternatives to further investigate this performance issue, see this article.