Amazon AWS Question
We are running load capacity tests using Amazon AWS configurations.
For the tests, we are basically scaling up calls to a second Asterisk box. The first box calls the second box, plays music on hold for 60 seconds, then hangs up the call. My initial thought was jitter problems, but that doesn’t seem to be the case.
I believe I found the cause while looking at the Asterisk logs. I am finding hundreds of entries like this whenever the testers report audio problems.
[08/19 15:04:17.176] DEBUG[4062][C-00000076] res_timing_timerfd.c: Expected to acknowledge 1 ticks but got 4 instead
[08/19 15:04:17.173] DEBUG[4008][C-0000005b] res_timing_timerfd.c: Expected to acknowledge 1 ticks but got 3 instead
[08/19 15:04:17.180] DEBUG[4008][C-0000005b] res_timing_timerfd.c: Expected to acknowledge 1 ticks but got 2 instead
[08/19 15:04:17.174] DEBUG[3933][C-00000038] res_timing_timerfd.c: Expected to acknowledge 1 ticks but got 2 instead
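Tallying these entries per call gives a rough picture of which channels fall behind and by how much. The following is a minimal sketch in Python, assuming every entry follows the format of the four lines above. The function name `summarize_missed_ticks` and the regular expression are illustrative and not part of Asterisk. It reads "expected 1 but got N" as N - 1 ticks that were not acknowledged on time.

```python
import re
from collections import Counter

# Pattern inferred from the four sample lines above (an assumption, not a
# documented Asterisk log grammar).
TICK_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+DEBUG\[(?P<tid>\d+)\]\[(?P<call>C-[0-9a-f]+)\]\s+"
    r"res_timing_timerfd\.c: Expected to acknowledge (?P<expected>\d+) ticks "
    r"but got (?P<got>\d+) instead"
)

SAMPLE = """\
[08/19 15:04:17.176] DEBUG[4062][C-00000076] res_timing_timerfd.c: Expected to acknowledge 1 ticks but got 4 instead
[08/19 15:04:17.173] DEBUG[4008][C-0000005b] res_timing_timerfd.c: Expected to acknowledge 1 ticks but got 3 instead
[08/19 15:04:17.180] DEBUG[4008][C-0000005b] res_timing_timerfd.c: Expected to acknowledge 1 ticks but got 2 instead
[08/19 15:04:17.174] DEBUG[3933][C-00000038] res_timing_timerfd.c: Expected to acknowledge 1 ticks but got 2 instead
"""


def summarize_missed_ticks(lines):
    """Return (late ticks per call id, histogram of the 'got' values)."""
    per_call = Counter()
    histogram = Counter()
    for line in lines:
        match = TICK_RE.search(line)
        if match is None:
            continue  # not a timerfd tick message
        got = int(match.group("got"))
        expected = int(match.group("expected"))
        per_call[match.group("call")] += got - expected
        histogram[got] += 1
    return per_call, histogram


per_call, histogram = summarize_missed_ticks(SAMPLE.splitlines())
for call_id, late in per_call.most_common():
    print(call_id, "late ticks:", late)
```

Running the same function over a full log file (for example `summarize_missed_ticks(open(path))`, with `path` pointing at your own log) shows whether the late ticks cluster on a few calls or are spread across all of them.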
I see references where people say to make sure to set VMware ESXi Latency Sensitivity to high for issues like this.
However, I am told there is no such setting with Amazon AWS. It sounds like the settings are all based on the type of configuration you buy.
Does anyone have any experience running Asterisk on an Amazon AWS system?
If so, what type of Amazon AWS configuration are you using?
If you have the potential to be playing prompts and/or music on hold for all channels at once, what is your channel capacity?
Is it recommended to run on dedicated hardware when encountering issues like this?
Have a great day!
Dan
9 thoughts on - Amazon AWS Question
Dan,
I don’t run Asterisk on AWS, but I do on ESXi. Are you running a version of Asterisk before 13? Newer versions of Asterisk handle timing better and don’t require a hardware timing source.
I’m running Asterisk 13 on a small 60-phone system without issues under ESXi 6.0.
Doug
—
Ultimately, Asterisk still relies on timing, and depending on the conditions that can have an impact on things. If timing is poor and the jitterbuffer on an endpoint can’t cope, then you will have audio problems. AWS doesn’t guarantee consistent timing, and depending on the host node, problems can occur. It will come up when playing back audio files, using ConfBridge, and playing music on hold.
—
Joshua C. Colp Digium – A Sangoma Company | Senior Software Developer
445 Jan Davis Drive NW – Huntsville, AL 35806 – US
Check us out at: http://www.digium.com & http://www.asterisk.org
—
Thanks Doug.
We are running Asterisk 16.3.0 so I think we’re on a pretty good version for the timing.
We have Asterisk running on ESXi here and it’s running at several customer sites in various VM environments. Ironically, none of them have the latency sensitivity set to high. This is something I didn’t realize may be needed until encountering the issue during load testing this week. I would guess our largest at present has around 100 simultaneous calls and we haven’t run into problems.
At 200 calls, the Amazon AWS VM had no issues with audio. I believe it was even good at 250. It was when the tester went to 300 calls that I am told the audio was often choppy.
I tried a noload of res_timing_timerfd.so, and the person testing indicated it did not help.
I forgot to mention, this is on an Ubuntu 16 OS with the latest packages.
Our second Asterisk box receives the calls and bridges them with another endpoint (allowing us to hear the audio on various phones).
The debugging on this box isn’t indicating any issues (at least from what I can tell). Obviously a lot is going on (basically 600 calls).
Neither box is indicating jitter issues.
Dan
—
Thank you Joshua.
Out of curiosity, what is the maximum capacity you have heard of for simultaneous ConfBridges on a single box? We are looking at 3-4 channels per ConfBridge, with recording.
Dan
—
I don’t really remember any specific values. 100? 200?
—
Joshua C. Colp Digium – A Sangoma Company | Senior Software Developer
—
Thank you Joshua.
Out of curiosity, when the capacity is reached is it a CPU issue?
Or some other resource (memory) issue?
Also, would the ConfBridge Bridge and/or ConfBridge User jitterbuffer setting help/hurt the capacity?
From what I understand jitter buffer support can result in higher CPU usage handling the jitter.
Is this correct?
Have a great day!
Dan
—
It wouldn’t be memory, it’d be CPU most likely.
The way ConfBridge works is that each conference bridge has a mixing thread that wakes up at a specific interval (20ms generally). That thread then pulls 20ms of audio from each participant, mixes it together, and provides a unique mix to each participant (with that participant’s own audio removed). Audio from each participant is fed into a buffer as it is received, ready for the mixing thread.
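The "unique mix with your own audio removed" step is commonly called mix-minus. The following is a minimal sketch of that idea in Python, assuming one equal-length frame of signed 16-bit samples per participant. It only illustrates the arithmetic; Asterisk’s real mixer is written in C and is not reproduced here, and the names `mix_minus` and `clamp16` are made up for the example.

```python
def clamp16(value):
    """Clip a summed sample to the signed 16-bit range."""
    return max(-32768, min(32767, value))


def mix_minus(frames):
    """Given {participant: [samples]} for one interval, return the frame
    each participant should hear: everyone else's audio, summed."""
    if not frames:
        return {}
    length = len(next(iter(frames.values())))
    # Sum all participants once ...
    total = [0] * length
    for samples in frames.values():
        for i, sample in enumerate(samples):
            total[i] += sample
    # ... then subtract each participant's own contribution from the total.
    return {
        name: [clamp16(t - s) for t, s in zip(total, samples)]
        for name, samples in frames.items()
    }


frames = {"alice": [100, 200], "bob": [10, 20], "carol": [1, 2]}
print(mix_minus(frames))
```

Summing once and subtracting per participant keeps the work proportional to the number of participants, but it still has to finish inside every interval for every bridge on the box.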
If the time it takes the mixing thread to do things increases, then the stream of packets will go out of whack, and if it gets bad enough then the endpoints can’t compensate (if it takes 30ms to produce a 20ms chunk of audio, you’re going to have a problem).
What this threshold is depends on the system.
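To see why 30ms of work per 20ms chunk cannot be absorbed by an endpoint, it helps to track how late each chunk leaves. The following is a toy model in Python, not a measurement of Asterisk: it assumes a single thread that wakes on each tick (or as soon as it is free, if it is already behind) and spends a fixed `work_ms` per chunk. The function name and parameters are hypothetical.

```python
def lateness_ms(work_ms, interval_ms=20, chunks=5):
    """Return how many ms past its deadline each chunk is finished."""
    clock = 0  # time at which the thread becomes free
    late = []
    for n in range(chunks):
        tick = n * interval_ms            # when this chunk's work may start
        deadline = tick + interval_ms     # when the chunk should be sent
        start = max(clock, tick)          # cannot start before being free
        clock = start + work_ms
        late.append(max(0, clock - deadline))
    return late


print("10 ms of work:", lateness_ms(10))
print("30 ms of work:", lateness_ms(30))
```

In this model, any per-chunk cost at or below the interval keeps every chunk on time, while a cost above the interval makes the delay grow with every chunk, so a fixed-size jitterbuffer on the endpoint eventually runs dry.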
A jitterbuffer inherently introduces a buffer, so it will increase memory usage and also increase CPU usage somewhat, as the jitterbuffer becomes the one feeding media.
—
Joshua C. Colp Digium – A Sangoma Company | Senior Software Developer
—
Thank you Joshua.
—
Having the jitter buffer enabled on the ConfBridge user profile was a big issue. Once we turned that off, things were noticeably better. That allowed us to reach 300 ConfBridges (600 calls) without a problem.
We reached 400 ConfBridges with 800 calls and only periodic hiccups in the audio for a second or two before it recovers.
We have also been able to increase our capacity some by increasing the AWS instance CPUs, and that was a big help.
Looking further into the resources, I noticed something that I think indicates where my audio problems are coming from. On the Asterisk machine handling the 400 ConfBridges (combining audio), htop shows a single CPU core periodically spiking to 90%. It’s not the same CPU core each time, but it is always in the upper half of the CPU cores. All the other CPU cores seem to stay around 10%, maybe 20%. Far less frequently, we may see a second CPU core jump to 80%.
Anyone have an idea why a single CPU core would spike while the others remain low?
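For anyone wanting to capture these spikes rather than watch htop, per-core usage can be computed from two snapshots of `/proc/stat` on Linux. The following is a minimal sketch in Python, assuming the usual field order on each `cpuN` line (user, nice, system, idle, iowait, irq, softirq, steal). The function names are illustrative. It also reports steal time separately, since on a virtual machine that is time the hypervisor gave to something else.

```python
def parse_proc_stat(text):
    """Map 'cpuN' -> first eight counters (user .. steal) from /proc/stat text."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate 'cpu' line and every non-CPU line.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        fields = [int(value) for value in parts[1:9]]
        fields += [0] * (8 - len(fields))  # very old kernels omit later fields
        cores[parts[0]] = fields
    return cores


def per_core_usage(before, after):
    """Return {core: {'non_idle_pct', 'steal_pct'}} between two snapshots."""
    old, new = parse_proc_stat(before), parse_proc_stat(after)
    usage = {}
    for core, counters in new.items():
        if core not in old:
            continue
        delta = [n - o for n, o in zip(counters, old[core])]
        total = sum(delta)
        if total <= 0:
            continue  # no time elapsed for this core
        idle = delta[3] + delta[4]  # idle + iowait
        usage[core] = {
            "non_idle_pct": 100.0 * (total - idle) / total,
            "steal_pct": 100.0 * delta[7] / total,
        }
    return usage


BEFORE = "cpu  110 0 60 1700 50 0 0 0\ncpu0 100 0 50 800 50 0 0 0\ncpu1 10 0 10 900 0 0 0 0\n"
AFTER = "cpu  210 0 60 1800 50 0 0 0\ncpu0 190 0 50 810 50 0 0 0\ncpu1 20 0 10 990 0 0 0 0\n"
print(per_core_usage(BEFORE, AFTER))
```

On a live system, the two snapshots would come from reading `/proc/stat` twice, a second or so apart, and logging any core whose non-idle percentage crosses a threshold together with a timestamp that can be matched against the audio hiccups.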
Have a great day!
Dan