We have an Asterisk frontend that all our SIP phones terminate to, and an Asterisk backend with a Wildcard PRI card in it connecting to the PSTN. The frontend handles 99% of the dialplan logic and just hands off anything outgoing to the backend via IAX2, which dials out on one of the open channels.
Lately we’ve been getting disconnected calls. Watching the consoles, it doesn’t seem to be the PRI initiating the hangups, as this is what I’d expect to see when a hangup is initiated on the backend / PRI side:
-- Span 2: Channel 0/21 got hangup request, cause 16
Instead, I’m seeing
== Spawn extension (outbound, (dialed #), 3) exited non-zero on 'IAX2/asterisk-frontend2-603'
-- Hungup 'IAX2/asterisk-frontend2-603'
That indicates the frontend initiated the hangup. But on the frontend I’m seeing auto fallthroughs to the h extension, which only happens if the hangup is initiated from the backend:
-- Auto fallthrough, channel 'SIP/phone1-00000167' status is 'ANSWER'
(h extension stuff follows)
If the frontend were initiating the hangup, I’d just see a jump to the h extension, with no auto fallthrough. So it looks like there may be a communication interruption between the frontend and the backend.
The problem is that this happens intermittently, so I can’t reproduce it reliably. I’ve held a call open for 30+ minutes without running into the problem, while someone else was only 7 minutes into a call when it happened. It doesn’t seem feasible to constantly run IAX2 debugs from the console on every open call – does anyone have suggestions on how to troubleshoot this? Weirdly enough, this only seems to happen when users dial into external (not local) conference bridges such as WebEx and GoToMeeting, but that might just be because of the length of those calls.
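Since I can’t watch every call live, one option I’m considering is saving the console output from both boxes and tallying the hangup signatures afterwards. Below is a minimal sketch, assuming the verbose lines look like the ones quoted above; the event names and the log file name are my own placeholders, not anything Asterisk defines, and the patterns may need adjusting for other Asterisk versions.

```python
import re
import sys
from collections import Counter

# Patterns based on the console lines quoted in this post. The exact wording
# may differ between Asterisk versions, so treat these as assumptions.
PATTERNS = {
    # Backend: the PRI side asked for the hangup.
    "pri_hangup": re.compile(
        r"Span \d+: Channel \d+/\d+ got hangup request, cause (\d+)"),
    # Backend: the dialplan exited non-zero on the IAX2 leg.
    "iax_exit_nonzero": re.compile(
        r"Spawn extension \(.*\) exited non-zero on '(IAX2/[^']+)'"),
    # Frontend: the call fell through to the h extension.
    "auto_fallthrough": re.compile(
        r"Auto fallthrough, channel '([^']+)' status is '([^']+)'"),
}


def classify(line):
    """Return (event_name, captured_groups) for a known line, else None."""
    # Pasted console output sometimes carries typographic quotes.
    line = line.replace("\u2018", "'").replace("\u2019", "'")
    for name, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return name, match.groups()
    return None


def tally(lines):
    """Count how often each hangup signature appears in the given lines."""
    counts = Counter()
    for line in lines:
        result = classify(line)
        if result:
            counts[result[0]] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python hangup_tally.py backend-console.log
    with open(sys.argv[1], errors="replace") as fh:
        for event, count in tally(fh).items():
            print(f"{event}: {count}")
```

Running it against a capture from each server would at least show whether the "exited non-zero" events on the backend line up in number with the auto fallthroughs on the frontend, without having to keep a debug session open.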
Will tweaking things like the IAX2 jitter buffer help? The two systems are barely four hops apart with an average ping time of 0.2 ms between them on a very resilient network (two of those hops are through core transports). I’ve never seen ping loss between them, even when running ping tests for hours during heavy call volume periods. The loads on the machines are minimal – I’ve never seen the load go above 0.10 during normal operation. But it does seem like something between them is making them drop calls.