We use Asterisk extensively for conferencing. For the last 8 years or so this has meant the 1.4/1.6/1.8 releases running chan_sip and meetme for up to around 350 concurrent users. Right around that number DAHDI hits a hard-coded memory limit and logs allocation errors:
[Jun 22 10:04:13] WARNING app_meetme.c: Unable to open DAHDI pseudo channel: Cannot allocate memory
In order to support our growing user count we recently upgraded to 13.1-cert6 with pjsip and replaced meetme with confbridge. During all of our UAT and load testing everything seemed fine: there were no perceived audio quality issues and no log entries that would indicate a problem. Unfortunately, now that we’re in production, I’m getting consistent complaints that the audio from participants is cutting in and out. It only seems to occur under load with more than 350 users, but that is anecdotal at best. This is not a simple networking issue; we’ve pretty much ruled that out with various performance testing. Initially we did have incrementing UDP packet receive errors, but we’ve eliminated those with a bit of tuning.
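For anyone wanting to repeat the UDP check, here is a minimal sketch of how the receive-error counters can be watched on Linux. It assumes the relevant counters are the `InErrors` and `RcvbufErrors` fields of the `Udp:` rows in /proc/net/snmp (the same numbers `netstat -su` reports); the function names are hypothetical and chosen only for illustration.

```python
import time


def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp text as a dict of ints."""
    # The file holds a header row and a value row, both prefixed with "Udp:".
    udp_rows = [line.split() for line in snmp_text.splitlines()
                if line.startswith("Udp:")]
    if len(udp_rows) < 2:
        raise ValueError("no Udp header/value pair found")
    names, values = udp_rows[0][1:], udp_rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def counter_deltas(before, after, keys=("InErrors", "RcvbufErrors")):
    """Difference of the selected counters between two samples."""
    return {key: after[key] - before[key]
            for key in keys if key in before and key in after}


def watch_udp_errors(interval_s=5.0, path="/proc/net/snmp"):
    """Print how many UDP receive errors accumulated over each interval."""
    with open(path) as f:
        before = parse_udp_counters(f.read())
    while True:
        time.sleep(interval_s)
        with open(path) as f:
            after = parse_udp_counters(f.read())
        print(counter_deltas(before, after))
        before = after
```

Calling `watch_udp_errors()` during a busy conference should print zeros for both counters if the tuning is holding; a steadily growing `RcvbufErrors` would suggest the socket receive buffers are still overflowing.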
There are numerous architectural differences between the two installations, and so far I have not been successful in determining the root cause. I’m reaching out to the community and the developers for insight and feedback, hoping someone has prior experience with this issue and how to resolve it. As you can see below, the most significant difference is probably the use of VMware on the new install. I’ve tuned the ESXi host and guest per VMware’s recommendations for latency and jitter (full cpu/mem reservations) with no improvement. With all of the reading I’ve done, I suspect my issue may come down to the timing source and VMware not providing a reliable clock. It seems VMware allows a backlog of interrupts to build up, and if the guest hasn’t caught up within 60s the outstanding interrupts are simply dropped.
Before I rip apart the environment and rebuild on physical hardware, I’d like to try to confirm that hypothesis. In the past this was a simple matter of running dahdi_test, which would report the timing accuracy. I’m not sure how to interpret the results of “timing test” in the Asterisk CLI. If I increase the number of ticks per second, the results are erratic under load. I’m using the timerfd module in Asterisk with a 1000 Hz tick kernel and high-resolution timers enabled. I’ve tried both hpet and tsc as system clock sources; both exhibit the same breaks in audio. It sounds like someone presses the mute button in the middle of a sentence.
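To make the hypothesis testable outside Asterisk, something along these lines could be run on the guest and on a physical box for comparison. This is only a rough sketch, not a replacement for dahdi_test or “timing test”: it does not use timerfd, it simply sleeps to absolute deadlines on the monotonic clock and records how late each wakeup is. The function names and the default of 50 ticks per second (a 20 ms interval) are my own assumptions, and the Python interpreter adds some jitter of its own, so only the relative difference between hosts is meaningful.

```python
import statistics
import time


def measure_tick_jitter(ticks_per_second=50, duration_s=5.0):
    """Sleep to absolute deadlines; return each wakeup's lateness in microseconds."""
    period_ns = 1_000_000_000 // ticks_per_second
    total_ticks = round(duration_s * ticks_per_second)
    start_ns = time.monotonic_ns()
    lateness_us = []
    for tick in range(1, total_ticks + 1):
        # Deadlines are computed from the start time so errors do not accumulate.
        deadline_ns = start_ns + tick * period_ns
        remaining_ns = deadline_ns - time.monotonic_ns()
        if remaining_ns > 0:
            time.sleep(remaining_ns / 1e9)
        lateness_us.append((time.monotonic_ns() - deadline_ns) / 1000.0)
    return lateness_us


def summarize(lateness_us, period_us):
    """Summarize lateness; 'overruns' counts wakeups at least one full period late."""
    return {
        "ticks": len(lateness_us),
        "mean_us": statistics.fmean(lateness_us),
        "max_us": max(lateness_us),
        "overruns": sum(1 for value in lateness_us if value >= period_us),
    }


if __name__ == "__main__":
    rate = 50
    samples = measure_tick_jitter(ticks_per_second=rate, duration_s=5.0)
    print(summarize(samples, period_us=1_000_000 / rate))
```

If the virtual clock is the culprit, I would expect `max_us` and `overruns` to climb sharply on the VM under load while staying small on physical hardware.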
Any insight is appreciated!
Here are the specs on the new install:
Physical HW Cisco UCS Blade (UCSB-B200-M3)
VMware ESXi 5.5
VM Guest 4 vCPU w/ 32G of RAM tuned for latency/jitter (sensitivity=high) and full cpu/memory reservations.
VM OS Red Hat EL7 kernel 3.10.0-327.13.1.el7.x86_64 with tickless disabled (nohz=off) and a 1000 Hz tick.
Asterisk 13.1-cert6 using the timerfd module.
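In case it helps anyone compare against their own guest, the kernel settings above can be read back with a short script like the one below. It is a sketch that assumes a Linux guest exposing /proc/cmdline and the standard clocksource entries under /sys/devices/system/clocksource/clocksource0; the helper names are hypothetical, and the command-line parser is deliberately simple (it does not handle quoted values containing spaces).

```python
from pathlib import Path

CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; bare flags map to True."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def timing_report(cmdline_path="/proc/cmdline", clocksource_dir=CLOCKSOURCE_DIR):
    """Collect the boot parameters and clock source relevant to timer behaviour."""
    params = parse_cmdline(Path(cmdline_path).read_text())
    source_dir = Path(clocksource_dir)
    return {
        "nohz": params.get("nohz", "(not set)"),
        "clocksource_param": params.get("clocksource", "(not set)"),
        "current_clocksource": (source_dir / "current_clocksource").read_text().strip(),
        "available_clocksource": (source_dir / "available_clocksource").read_text().split(),
    }
```

Printing `timing_report()` after each reboot confirms whether the hpet/tsc switch and nohz=off actually took effect before a test run.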
Regards,
Robert McGilvray
SS&C GlobeOp
Associate Director, IT Network Security