AMI Timeouts

Home » Asterisk Users » AMI Timeouts
Asterisk Users 13 Comments

Hi,

We’re using Asterisk 1.8.0 to run a call centre. There is a Java process which talks to Asterisk through AMI, which is part of the software stack that presents a user interface to the call centre agents.

We’re seeing a strange issue with AMI. Most of the time, it doesn’t cause problems, because the Java code is written to cope with it, but occasionally it does, and, in rare cases, may even result in calls being dropped.

The main problem with the issue is that we haven’t been able to reproduce it in a test environment, and that it’s not that easy to describe.

The fact that we can’t reproduce it in test leads me to think that it’s somehow related to the load on our production system.

I’m hoping that someone else on the list may have seen something similar before, or that an Asterisk developer might read this and think “ah, it’s this bit of code that’s doing it!”. 🙂

Anyway, let me try to describe what’s happening.

The Java process has an AMI connection to Asterisk which it keeps open continuously. When sending an AMI request, the process will wait a certain amount of time for a response (I think the timeout is 5 seconds)
and if it hasn’t received it, will retry the request up to 5 times.

Occasionally, the Java process logs show AMI requests timing out (after
5 tries). What I see in packet captures of the AMI traffic is:

1. Java process sends a request (e.g., add member to queue)
2. Retries 5 seconds later
3. Retries again 5 seconds later
4. Retries again 5 seconds later
5. Retries again 5 seconds later
6. Logs a timeout
7. After a few more seconds, Asterisk replies to all five requests
in one go (in a single packet; so, e.g., for “add member to queue”
it would reply “success”, then four failures because the member is
already added); however at this stage, the Java process has given
up

It feels like Asterisk queues up the AMI responses and then periodically sends out all the responses in the queue in one go. Is something like this going on? Does the frequency at which Asterisk flushes the queue depend on load? Are there any tunable in the config for this?

If anyone has seen this before, or understands what’s happening here, I
would be very grateful for any info!

If you need more details, please do ask and I’ll do my best.

Thanks in advance,

Alex

13 thoughts on - AMI Timeouts

  • Maybe the Java code is not robust enough. I am using AMI for years and never had a communication issue (except for my programming errors). Some time ago I wrote a little tool to study the AMI
    interface (http://www.jgoettgens.de/Meine_Bilder_und_Dateien/AMOA.7z, http://www.jgoettgens.de/d9e5cb9b-c5fa-429b-9974-a7b4f4bf58c3.html). The tool is now outdated
    (partial DAHDI support, Woomera obsolete), but the SIP channels are usable. If you have a few SIP channels you could compare the behavior of both packages under load.

    jg

  • I didn’t write the Java code, but I think we do use the asterisk-java library.

    Alex

  • Timeout based retries on AMI is probably going to be a problem.

    AMI uses TCP, which has its own retry/retransmission logic. While the TCP connection is alive, you can safely assume that the message is either on its way to Asterisk, Asterisk is processing the message, or the response is heading back to you.

    Now if the TCP connection fails after you’ve sent a request, but before you receive a response, then you really have no way of knowing if the request was received/processed/rejected/etc. At that point, the best you can do is reestablish the AMI connection and retry the action.

    Do you see the TCP ACK coming back from Asterisk?

    During the quiet period while you’re waiting for the response, do you receive events over that AMI connection? Are there other actions that you’re attempting to execute? Is there any consistency as to which commands are getting delayed?

    No, there’s no response queue in Asterisk. For the action’s I’ve looked at, it pretty much immediately processes the request and sends the response.

    There are any number of reasons why the response would be delayed, but the >25 seconds delay you’re seeing is excessive for any of the reasons I can think of. The resource you’re using AMI to access may be busy doing something else. Or the request is simply taking that long to process. Packet loss could cause delays in getting responses, but usually not for the lengths of times you’re talking about.

    I know it’s not a lot of info, but hopefully you can turn up some logging or packet captures to narrow down what’s going on.

  • Hi,

    Yes, I do.

    Yes.

    In the particular case I’m looking at right now, yes (there’s a QueuePause action followed closely by an Originate action).

    Here’s a breakdown from the last two weeks:

    Timeout Total % Command
    ~~~~~~~ ~~~~~ ~ ~~~~~~~
    178 7478 2.38 Command
    5 1870 0.26 DBDel
    804 13549 5.93 QueueAdd
    2894 55621 5.20 QueuePause
    660 13856 4.76 QueueRemove

    So it appears that most of the delays are from the queue module, which is understandable, because that’s doing most of the work in our set-up.

    It turns out the timeout in the Java app is set to only 3 seconds, not
    5, like I said in my previous email.

    What would be a reasonable delay time? In the case I’m looking at right now, the longest I can see is 7.2s.

    Looking in the Java app logs, I can see it occasionally (166 times over the last two weeks) timing out after five retries, which means it failed to get a response to any of the retries within three seconds.

    There’s nothing in the packet capture to indicate packet loss.

    Perhaps I should mention another issue that we’ve seen (and worked around) previously. Our Asterisk uses ODBC to talk to an Oracle database for realtime peers, for func_odbc and, back then, for CEL. The issue was that when there was a job running against the database which caused it to slow down, Asterisk dropped calls with the message
    “no reply to our critical packet”. As soon as we changed the database job to run at night (when the call centre is closed), this problem went away. It feels like Asterisk was stuck waiting for the database and missed the critical packets when they were, in fact, there.

    Thanks for your help!

    Alex

  • The view of the person who wrote the Java app is that “three seconds is a long time”: if an agent presses the “pause” button in their GUI, during those three seconds, they may receive a call.

    He says that when this system first went live, his app received responses within milliseconds. I’m assuming this is to do with increasing call volume, though it could be something else.

    Alex

  • When you have many calls, there are usually (read/write=all) a lot of RTP, RTCP, and VarSet events. This might slow down things, but whether they occur or not depends on your configuration.

    This might be another thing to look at.

    jg

  • When you execute an Originate action, are you doing so synchronously or asynchronously?

    A synchronous Originate performs the full outbound dial operation on the thread servicing the AMI request. Since each session in AMI gets its own thread that services both actions and events, a synchronous Originate can block that session from receiving events until it completes.

    Matt

  • I guess this was a question for Alexander. As far as I am concerned, I never had such a load that slowed down AMI event processing (responses within at most 1/10 of a second), but for future tests I should probably set up a real torture test.

    For a robust PBX application, it would make sense to have an “event” thread and separate
    “action” threads that handle actions and associated events. I think I see your point, Matt.

    Thanks for all the fish.

  • Hi,

    I may be misunderstanding, but would your suggestion be to open a new AMI connection for each action? (Or at least each action that can block.)

    Alex