Speech recognition in Asterisk using Google Voice API


I’m excited to announce that Lefteris Zafiris has written an AGI script that uses the Google Voice API for speech recognition.

As the author says, “the script records from the current channel until the pound key (#) is pressed or the timeout (15 seconds) is reached. The recording is sent over to google speech recognition service and the returned text string is assigned to a channel variable.”
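
The flow the author describes (record until # or a timeout, then store the recognized text in a channel variable) maps onto two standard AGI commands, RECORD FILE and SET VARIABLE. The following is only a minimal Python sketch of how such command lines could be built and sent; the actual script is not reproduced here, and the helper names, the recording path and the variable name `utterance` are assumptions made for illustration.

```python
import sys


def record_command(path, fmt="wav", escape_digits="#", timeout_ms=15000, beep=True):
    """Build an AGI RECORD FILE line (path is given without a file extension)."""
    cmd = f'RECORD FILE {path} {fmt} "{escape_digits}" {timeout_ms}'
    if beep:
        cmd += " BEEP"  # play a beep before recording starts
    return cmd


def set_variable_command(name, value):
    """Build an AGI SET VARIABLE line, escaping backslashes and double quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'SET VARIABLE {name} "{escaped}"'


def agi_send(command, out=sys.stdout, inp=sys.stdin):
    """Write one command to Asterisk and return its one-line reply."""
    out.write(command + "\n")
    out.flush()
    return inp.readline().strip()


if __name__ == "__main__":
    # Hypothetical path and variable name, for illustration only.
    print(record_command("/tmp/speech_rec"))
    print(set_variable_command("utterance", "hello world"))
```

Keeping the string building separate from the I/O makes the command format easy to check without a running Asterisk instance.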

More info and dialplan examples can be found in the README file:

The script is available here:

The author reports that the code is still young and not thoroughly tested, so comments, suggestions and bug reports are more than welcome.

Enjoy this code jewel and please provide feedback/comments to the author.

30 thoughts on - Speech recognition in Asterisk using Google Voice API

  • Very interesting. I just tried to get it to work but it complains about
    sox. Probably you used a different version of sox?

    PBX-*CLI> /usr/bin/sox: invalid option -- -
    /usr/bin/sox: invalid option -- n
    /usr/bin/sox: invalid option -- o
    /usr/bin/sox: -r must be given a positive integer
     -- speech-recog.agi: /usr/bin/sox failed: 512

    I am using: Package sox-12.18.1-1.el5_5.1.i386


  • And with recent version 14.3.2 I get:

    /usr/local/bin/sox FAIL formats: no handler for file extension `flac'

  • Hi there,

    I developed an AGI script a while ago to use Google speech
    recognition, and back then I used
    http://legroom.net/files/software/convtoflac.sh to convert files from
    wav to flac.
    You can then use the command:

    /usr/local/bin/convtoflac.sh -o /var/lib/asterisk/sounds/myfile.wav
    It will then create a flac file in the same directory as the
    source file.

    I hope it helps.

  • this looks great – is there any chance of converting the googletts.agi
    to use flac as well?


  • In googletts.agi we get the voice data from google in mp3 and we convert
    it to a format that asterisk can read and play back (slin). If we stored it
    in flac, asterisk wouldn’t be able to read it natively and we would have to
    convert it each time we want to play it back to the user.

    In the speech recognition script we have to convert the voice data to
    flac before sending it to google because that’s the accepted format.

    Is there some particular reason you want the googletts.agi data in flac?

  • the only reason is that I didn’t want to have to install sox. Lazy.
    that’s all 😉 Just another piece of software to find and install

    running on amazon ec2, is the best thing to download the source and
    compile sox ?



  • Works beautifully. Amazing job Lefteris. Thanks.

    The best result I got in probability was 0.9725632 by saying, “hello”. I
    think there is some non-phonetic logic built in as well. I tried “1, 2”
    and got “0.86534226” in accuracy. When I tried “1, 2, 3, 4, 5” I got
    “0.97256315”. Probably Google sees the pattern?!

    What are some other tricks (if any) or considerations that one should
    keep in mind while creating a strong speech-recognition-enabled IVR?


  • Wow – nice! A few quick questions:

    1. How long can the recording be for translation?
    2. Any limitation on how much text the return (transcribed) variable can hold?
    3. Any commercial / terms of use limitations?

  • Note to self: “Never release anything asterisk related without testing
    on RHEL/Centos 5”

    Thank you for reporting this. I have replaced sox with flac and it seems
    to work now on older platforms too (tested on Centos 5 with asterisk 1.4).
    You can get the updated code here:

  • At the moment the recording timeout is set at 15 sec. I haven’t yet
    tested the max length of voice data that google accepts (all this voice
    recognition stuff is undocumented). I have read that it is between
    10 and 20 seconds but haven’t actually tested this yet. On my todo list
    is to add the option to cut the sound data into smaller chunks before
    sending them to google, to get rid of the recording length limitations.

    This would be better answered by the asterisk devs, but empirically
    speaking I have loaded really big chunks of text into dialplan variables
    (like the complete GPL license) without any problems.

    This is a gray area at the moment. Voice recognition is undocumented
    in google’s API and I guess not officially supported yet. I hope it
    gets covered by the general TOS of google services:

  • For sure English and Spanish. Since it’s undocumented I don’t have a
    complete list.

  • Google accepts sound files at any sampling rate (up to 44.1kHz), so if
    you can use a wideband codec (e.g. g722) it can greatly improve the
    sound quality and the detection rates. For now the script supports 8kHz
    and 16kHz sampling rates for recording, and this can be set by editing
    the script’s user-defined parameters (the variable $samplerate).
    Anything that improves the clarity of the recording will help: a good
    phone, low background noise, etc.
    I have also read that normalizing the recording and setting the gain
    to -5 dB improves detection rates. I’m experimenting with this at the
    moment and there will be some new code soon (as soon as I get sox
    working in RHEL/Centos 5 😛 ).
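
The normalization idea mentioned above (level the recording to -5 dB at an 8kHz or 16kHz sampling rate) could be tried with a sox invocation along these lines. This is a rough Python sketch, not the script's own code: it assumes a recent sox whose `gain` effect accepts `-n`, a sox build with flac support, and the function name and file names are made up for illustration.

```python
import subprocess


def sox_command(wav_path, flac_path, samplerate=16000, gain_db=-5.0, sox_bin="sox"):
    """Build a sox command that resamples and normalizes a recording into flac."""
    if samplerate not in (8000, 16000):
        # The comment above says only these two rates are supported for recording.
        raise ValueError("samplerate must be 8000 or 16000")
    return [
        sox_bin, str(wav_path),
        "-r", str(samplerate), str(flac_path),  # output sampling rate and file
        "gain", "-n", f"{gain_db:g}",           # normalize to the given dB level
    ]


def normalize(wav_path, flac_path, **kwargs):
    """Run the conversion; raises CalledProcessError if sox exits with an error."""
    subprocess.run(sox_command(wav_path, flac_path, **kwargs), check=True)


if __name__ == "__main__":
    print(" ".join(sox_command("recording.wav", "recording.flac")))
```

Whether this actually improves detection rates is an open question in the thread, so treat the -5 dB default as something to experiment with.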

  • This is really spectacular. Thanks.

    I’m running Fedora 15, so I can use flac or sox. Any reason to prefer
    one over the other?


  • On Wed, 04 Jan 2012 14:48:22 -0500
    sean darcy wrote:

    We have to convert the voice data to flac format before sending it to
    google; this can be done with either sox or the flac encoder. For now the
    script uses the flac encoder for compatibility with older distros (mainly
    RHEL 5). Sox is a bit more flexible and also gives you the option to
    edit the sound data (normalizing, changing levels etc).
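
For the flac encoder route described above, the wav-to-flac step could look roughly like the sketch below. It is an illustration under assumptions rather than the script's actual code: it assumes the `flac` command-line encoder is on the PATH, and the function names and the example file name are hypothetical.

```python
import subprocess
from pathlib import Path


def flac_command(wav_path, flac_bin="flac"):
    """Build the encoder command; the output sits next to the input file."""
    wav = Path(wav_path)
    flac_path = wav.with_suffix(".flac")
    cmd = [
        flac_bin,
        "--totally-silent",      # no progress or warning output
        "-f",                    # overwrite an existing output file
        "-8",                    # highest compression level
        "-o", str(flac_path),
        str(wav),
    ]
    return cmd, flac_path


def convert_to_flac(wav_path):
    """Encode the recording and return the path of the new flac file."""
    cmd, flac_path = flac_command(wav_path)
    subprocess.run(cmd, check=True)
    return flac_path


if __name__ == "__main__":
    cmd, out = flac_command("recording.wav")
    print(" ".join(cmd))
```

Because the encoder does no editing of the audio, this path avoids the sox version differences reported on RHEL/Centos 5.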

  • Fresh code is out! The use of sox can now be optionally enabled by the
    user if the system has a recent version of the program (won’t work in
    RHEL/Centos 5).
    This is done by editing the script and setting the variable ‘use_sox’.
    When sox is used the audio gets normalized, low frequency noise (<100Hz)
    is removed and also possible DC offset is corrected. Those are supposed
    to improve the recognition results(?). The settings are still a bit
    experimental, feel free to play with them and report what settings
    improved your results.

    get the new version here:

  • Does sox have more features on a Debian system than RHEL? Is that why it
    won’t work on RHEL?


  • On Fri, 6 Jan 2012 20:46:14 -0500
    Bruce B wrote:

    RHEL 5’s version of sox is really old and outdated. The command syntax
    and the switches are totally different compared to recent versions.
    Anyway, I’m not sure audio normalization and the rest we use sox for is
    really needed. My tests so far didn’t show any improvements in
    detection rates. Keep in mind that all this is still WIP and the
    option to use sox is more for testing than for serious use.

  • Thanks.

    I have been testing Aastra phones with SIP and had great results. I am
    testing my cell phone now and sometimes get “-1” for id, status, utterance,
    and confidence. What does that mean?


  • NVM. I explored the code and see the logic. I had sox = 1 so it was failing
    on RHEL.

    To report, my cell phone from a PRI gets the same confidence level as
    SIP. Building my control app now. Should make my life much easier while
    driving. Thanks again 🙂
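
The -1 values discussed in the two comments above behave like sentinels: every field starts at -1 and is only overwritten when a usable answer comes back, so a failed conversion or an empty reply leaves all four at -1. The sketch below illustrates that pattern in Python. The JSON layout (a top-level `id` and `status` plus a `hypotheses` list with `utterance` and `confidence`) is an assumption about the undocumented service, not something the post confirms, and `parse_result` is a hypothetical helper.

```python
import json


def parse_result(body):
    """Return id, status, utterance and confidence, with -1 meaning 'no result'."""
    result = {"id": -1, "status": -1, "utterance": -1, "confidence": -1}
    try:
        data = json.loads(body)
    except (TypeError, ValueError):
        return result  # empty or malformed reply: keep every sentinel
    if not isinstance(data, dict):
        return result
    result["id"] = data.get("id", -1)
    result["status"] = data.get("status", -1)
    hypotheses = data.get("hypotheses") or []
    if hypotheses and isinstance(hypotheses[0], dict):
        # Only the first (best) hypothesis is used in this sketch.
        result["utterance"] = hypotheses[0].get("utterance", -1)
        result["confidence"] = hypotheses[0].get("confidence", -1)
    return result


if __name__ == "__main__":
    print(parse_result(""))  # every field stays at -1
```

A dialplan can then test for -1 to decide whether to prompt the caller again.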


  • Added two new features to the script: Timeout value and speechdata type.

    exten => s,n,agi(speech-recog.agi,en-US,3000,phoneNumb)
    – Will listen for 3 seconds and sanitize the return as a single number without
    any spaces in between. This helps when one reads a phone number in the format
    415-554-2323 and google returns “415 554 2323” as the result, which is not very

    exten => s,n,agi(speech-recog.agi,en-US,20000,string)
    – Will listen for 20 seconds and return the result as provided by Google

    It would be great to see them in future versions as I seem to need them
    dearly in a real life scenario.

    Updated script attached.



  • Thank you Bruce for the testing and the suggestions.
    Both features have been added to the script. The timeout can now be set by
    the user; -1 means no timeout and the recording keeps going till # is pressed.
    Spaces between digits get stripped; this is now the default behavior and
    there’s no need to specify the ‘speechdata’ type.
    The updated code can be found here:

    Next on my TODO list is to make use of the asterisk speech recognition
    API (https://wiki.asterisk.org/wiki/display/AST/Speech+Recognition+API)
    This will make the application actually usable for real-world scenarios
    rather than just the proof of concept it is now.
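
The digit handling described above (turning “415 554 2323” into a single number while leaving ordinary words alone) can be expressed as one regular expression. This is a small Python sketch of the idea, not the script's own implementation, and the function name is hypothetical.

```python
import re

# Whitespace that has a digit immediately before it and immediately after it.
_DIGIT_GAP = re.compile(r"(?<=\d)\s+(?=\d)")


def strip_digit_spaces(text):
    """Remove spaces between digits; spaces between words are kept."""
    return _DIGIT_GAP.sub("", text)


if __name__ == "__main__":
    print(strip_digit_spaces("415 554 2323"))        # 4155542323
    print(strip_digit_spaces("call 415 554 now"))    # call 415554 now
```

Using lookarounds means the digits themselves are never consumed by the match, so runs such as “1 2 3” collapse fully in a single pass.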

  • Two more “offerings” – #1 – add DTMF parameter so function can be stopped by
    pressing a digit or digits other than * or # – #2 – add an option to
    “silence” the beep. If you were using this in an IVR and wanted to say
    “press 1 or say help for help”, silencing the beep before recording would
    (IMO) make the rendering sound more “professional”/less “mechanical”.

  • Hey Zaf,

    Just checking the Google Speech Recognition package again and I can’t see the
    WolframAlpha.agi file. I checked all of your projects on GitHub but can’t find wolframalpha.agi. Please let us know what the URL is.


  • Currently, the most trusted speech system I have seen for asterisk is Lumenvox, but since it is not free I am hoping the Google speech engine is a good way of doing speech recognition. I have tried this too, but I hear only silence when the agi is launched, and it looks like it is stuck. Nothing happens if I press # or any other button until I hang up and the agi script exits. I am working to resolve this and will update accordingly. Does anyone have any idea about this?