If I use either the Bridge() app or the manager Action: Bridge in a certain scenario (basically bridging 2 SIP channels, as in an attended transfer, resulting in 2 other SIP channels being discarded), then the whole server locks solid. The console stops, the network stops, something is hammering the box, and nothing (including debug tools) seems to be able to do anything about it.
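For reference, the manager action involved looks something like this (the channel names here are made up for illustration):

    Action: Bridge
    Channel1: SIP/alice-00000001
    Channel2: SIP/bob-00000002
    Tone: yes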
If I ‘nice’ asterisk to the lowest priority and ‘nice’ a copy of ‘top’ to the highest priority, everything still locks. After a short period the box recovers, seemingly due to the 60-second RTP timer. Anything that was being logged is lost.
My theory is that I am somehow causing a frame loop internal to Asterisk by setting up some kind of illegal bridge, but I used the same code on 1.2 (I backported Bridge()) and it works just fine.
I need suggestions please on how to determine where it is locking, and why.
I found that doing a build with debug symbols included and running under gdb slowed Asterisk down enough for me to get debug output. It turns out that a newer release fixes the symptom, and the cause is in my own “hack” to the code. Holding locks too long == very bad.
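In case it helps anyone else, the gdb session was roughly this (the binary path is an assumption; turning on DONT_OPTIMIZE and DEBUG_THREADS in menuselect makes the backtraces far more useful):

    $ gdb --args /usr/sbin/asterisk -f -vvv
    (gdb) run
    ... wait for the lockup, then hit Ctrl-C ...
    (gdb) thread apply all bt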
I was running in high priority mode – I thought I was turning it off for testing, but it looks like I left the setting in asterisk.conf, so leaving ‘-p’ off the command line was making no difference *sigh*
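The setting in question lives in asterisk.conf; from memory it is something like:

    [options]
    highpriority = yes    ; same effect as passing -p on the command line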
I believe the issue was actually with the devicestate thread. It was trying to update state on a locked channel, and was retrying the lock so frequently that Asterisk grabbed lots of CPU cycles (because of -p mode). The lock was not released because its holder was waiting on a database write, which was being done by a lower-priority external process that was getting no time scheduled to it.
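The retry behaviour I am describing is, as far as I can tell, a bounded trylock loop of this general shape (a sketch in plain pthreads, not the actual devicestate code):

    #include <pthread.h>
    #include <unistd.h>

    /* Try to take the channel lock, giving up after a bounded number
       of attempts instead of blocking forever. */
    static int bounded_trylock(pthread_mutex_t *lock)
    {
        int attempt;

        for (attempt = 0; attempt < 200; attempt++) {
            if (pthread_mutex_trylock(lock) == 0) {
                return 0;        /* got the lock */
            }
            usleep(1);           /* brief pause before retrying */
        }
        return -1;               /* gave up; caller tries again later */
    }

Under -p those brief pauses between attempts do not help, because the low-priority external process that everything is waiting on never gets scheduled at all.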
The database write is a local hack to record some extra call data – I changed it to occur after the locks are released, as I should have done in the first place. The newer release does not seem to have quite the same issue – it recovers after the usual 200 lock attempts and gets on with life much more happily. I cannot see any changes between the two releases that would have improved this behavior.
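Stripped down to a sketch, the fix looks like this (the struct and helper names are invented for illustration, not my actual patch):

    #include <pthread.h>

    struct call_record { long duration; };    /* invented example data */

    static pthread_mutex_t chan_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct call_record current_call;

    /* Stand-in for the real database write: slow, and it can block
       on an external process, so it must never run with locks held. */
    static void write_call_record_to_db(const struct call_record *rec)
    {
        (void)rec;
    }

    static void finish_call(void)
    {
        struct call_record snapshot;

        pthread_mutex_lock(&chan_lock);
        snapshot = current_call;          /* copy what we need under the lock */
        pthread_mutex_unlock(&chan_lock);

        write_call_record_to_db(&snapshot);   /* write with no locks held */
    }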
So in this case, the canary actually saved me from needing to reboot the machine in order to recover from the lockup. The thread monitoring the canary noticed within 60 seconds that the canary stopped updating the file and deprioritized Asterisk, allowing the other processes to proceed.
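For anyone curious, the monitoring side of a canary boils down to something like this sketch (the file path and thresholds are guesses, not the real implementation):

    #include <time.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int main(void)
    {
        for (;;) {
            struct stat st;

            /* The canary is a separate low-priority process that keeps
               touching this file.  If the file goes stale, everything
               at low priority is being starved. */
            if (stat("/var/run/asterisk/canary", &st) == 0 &&
                time(NULL) - st.st_mtime > 60) {
                /* Drop back to normal priority so other processes
                   can run again. */
                setpriority(PRIO_PROCESS, 0, 0);
            }
            sleep(5);
        }
    }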