Speech recognition in Asterisk using Google Voice API

January 3, 2012 Pavel Espinal VoIP News 30 Comments

I’m excited to announce that Lefteris Zafiris has written an AGI script that uses the Google Voice API for voice recognition.

As the author says, “the script records from the current channel until the pound key (#) is pressed or the timeout (15 seconds) is reached. The recording is sent over to google speech recognition service and the returned text string is assigned to a channel variable.”
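
To make that flow concrete, here is a minimal Python sketch (not the author's script) of the two AGI commands such a script would send to Asterisk: one to record until # or the timeout, and one to store the recognized text in a channel variable. The file path and helper names are hypothetical, and the variable name utterance is borrowed from the comments below rather than confirmed from the code.

```python
def record_command(path, fmt="wav", escape_digits="#", timeout_ms=15000, beep=True):
    """Build an AGI RECORD FILE command (timeout is in milliseconds)."""
    parts = ["RECORD FILE", path, fmt, f'"{escape_digits}"', str(timeout_ms)]
    if beep:
        parts.append("BEEP")  # play a beep before recording starts
    return " ".join(parts)


def set_variable_command(name, value):
    """Build an AGI SET VARIABLE command, escaping quotes in the value."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'SET VARIABLE {name} "{escaped}"'


# Hypothetical path and variable name, for illustration only.
print(record_command("/tmp/speech_rec"))
print(set_variable_command("utterance", "hello world"))
```

An AGI script writes lines like these to standard output and reads Asterisk's reply from standard input; the upload to the recognition service would happen between the two commands.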

More info and dialplan examples can be found in the README file:
https://raw.github.com/zaf/asterisk-speech-recog/master/README

The script is available here:
https://github.com/zaf/asterisk-speech-recog

The author reports that the code is still young and not thoroughly tested, so comments, suggestions and bug reports are more than welcome.

Enjoy this code jewel and please provide feedback/comments to the author.

30 thoughts on - Speech recognition in Asterisk using Google Voice API

Bruce B says:

January 4, 2012 at 1:43 am

Very interesting. I just tried to get it to work but it complains about
sox. Probably you used a different version of sox?

PBX-*CLI> /usr/bin/sox: invalid option -- -
/usr/bin/sox: invalid option -- n
/usr/bin/sox: invalid option -- o
/usr/bin/sox: -r must be given a positive integer
 -- speech-recog.agi: /usr/bin/sox failed: 512

I am using: Package sox-12.18.1-1.el5_5.1.i386

Thanks,
Bruce B says:

January 4, 2012 at 1:51 am

And with recent version 14.3.2 I get:

/usr/local/bin/sox FAIL formats: no handler for file extension `flac'
LL says:

January 4, 2012 at 2:05 am

Hi there,

I developed an AGI script a while ago to use Google speech
recognition, and back then I used
http://legroom.net/files/software/convtoflac.sh to convert files from
wav to flac.
You can then use the command:

/usr/local/bin/convtoflac.sh -o /var/lib/asterisk/sounds/myfile.wav
It will then create a flac file in the same directory as the
source file.

I hope it helps.
Lefteris Zafiris says:

January 4, 2012 at 5:06 am

Note to self: “Never release anything asterisk related without testing
on RHEL/Centos 5”

Thank you for reporting this. I have replaced sox with flac and it seems
to work now on older platforms too (tested on Centos 5 with asterisk 1.4).
You can get the updated code here:
https://github.com/zaf/asterisk-speech-recog/tarball/master
Julian Lyndon-Smith says:

January 4, 2012 at 10:07 am

This looks great – is there any chance of converting the googletts.agi
to use flac as well?

Julian
Lefteris Zafiris says:

January 4, 2012 at 10:18 am

In googletts.agi we get the voice data from Google as mp3 and convert
it to a format that Asterisk can read and play back (slin). If we stored it
as flac, Asterisk wouldn’t be able to read it natively and we would have to
convert it each time we want to play it back to the user.

In the speech recognition script we have to convert the voice data to
flac before sending it to Google because that’s the accepted format.

Is there some particular reason you want the googletts.agi data in flac?
Julian Lyndon-Smith says:

January 4, 2012 at 10:24 am

the only reason is that I didn’t want to have to install sox. Lazy.
that’s all 😉 Just another piece of software to find and install

Running on Amazon EC2, is the best option to download the source and
compile sox?

Thanks

Julian
Lefteris Zafiris says:

January 4, 2012 at 10:29 am

It should be on your distro repos already.
Julian Lyndon-Smith says:

January 4, 2012 at 10:34 am

nope 🙁
Bruce B says:

January 4, 2012 at 2:25 pm

Works beautifully. Amazing job Lefteris. Thanks.

The best result I got in probability was 0.9725632 by saying, “hello”. I
think there is some non-phonetic logic built-in as well. I tried, “1, 2”
and I got “0.86534226” in accuracy. While I tried “1, 2, 3, 4, 5” I got,
“0.97256315”. Probably Google sees the pattern?!

What are some other tricks or considerations (if any) to keep in mind
when building a robust speech-recognition-enabled IVR?

Best,
Anonymous says:

January 4, 2012 at 2:27 pm

Does anyone know what languages are supported?
Michelle Dupuis says:

January 4, 2012 at 2:47 pm

Wow – nice! A few quick questions:

1. How long can the recording be for translation?
2. Any limitation on how much text the return (transcribed) variable can hold?
3. Any commercial / terms of use limitations?
Lefteris Zafiris says:

January 4, 2012 at 3:14 pm

At the moment the recording timeout is set at 15 seconds. I haven’t yet
tested the maximum length of voice data that Google accepts (all this
voice recognition stuff is undocumented). I have read that it is between
10 and 20 seconds but haven’t actually tested this yet. On my todo list is
to add the option to cut the sound data into smaller chunks before
sending them to Google, to get rid of the recording length limitations.
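
As an illustration of the chunking idea only (this is not in the script), the sketch below splits a list of audio samples into pieces no longer than a given duration. The 10-second default is an assumption based on the lower end of the range mentioned above, and the function name is hypothetical.

```python
def split_into_chunks(samples, samplerate, max_seconds=10):
    """Split a sequence of samples into chunks of at most max_seconds each."""
    chunk_len = samplerate * max_seconds
    if chunk_len <= 0:
        raise ValueError("samplerate and max_seconds must be positive")
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# 25 seconds of silent 8 kHz audio becomes chunks of 10 s, 10 s and 5 s.
audio = [0] * (8000 * 25)
chunks = split_into_chunks(audio, samplerate=8000)
print([len(c) / 8000 for c in chunks])  # [10.0, 10.0, 5.0]
```

A fixed-length split like this can cut a word in half, so a real implementation would probably look for a pause near each boundary.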

This would be better answered by the Asterisk devs, but empirically
speaking I have loaded really big chunks of text into dialplan variables
(like the complete GPL license) without having any problems.

This is a gray area at the moment. Voice recognition is undocumented
in Google’s API and I guess not officially supported yet. I hope it gets
covered by the general TOS of Google services:
http://www.google.com/accounts/TOS
Lefteris Zafiris says:

January 4, 2012 at 3:16 pm

English and Spanish for sure. Since it’s undocumented, I don’t have a
complete list yet.
Lefteris Zafiris says:

January 4, 2012 at 3:26 pm

Google accepts sound files at any sampling rate (up to 44.1kHz), so if
you can use a wideband codec (e.g. g722), it can greatly improve the
sound quality and the detection rates. For now the script supports 8kHz
and 16kHz sampling rates for recording, which can be set by editing the
script’s user-defined parameters (the variable $samplerate).
Anything that improves the clarity of the recording will help: a good
phone, low background noise, etc.
I have also read that normalizing the recording and setting the gain
to -5 dB improves detection rates. I’m experimenting with this at the
moment and there will be some new code soon (as soon as I get sox
working in RHEL/Centos 5 😛 ).
sean darcy says:

January 4, 2012 at 3:48 pm

This is really spectacular. Thanks.

I’m running Fedora 15, so I can use flac or sox. Any reason to prefer
one over the other?

sean
Israel Gottlieb says:

January 4, 2012 at 3:50 pm

wow i just tried in hebrew and i’ll say just 1 word “WOW”
Lefteris Zafiris says:

January 4, 2012 at 3:59 pm

We have to convert the voice data to flac format before sending it to
Google; this can be done with either sox or the flac encoder. For now the
script uses the flac encoder for compatibility with older distros (mainly
RHEL 5). Sox is a bit more flexible and also gives you the option to
edit the sound data (normalizing, changing levels etc).
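
For readers who want to try the flac route by hand, here is a small Python sketch that builds a command line for the reference flac encoder. The file names are placeholders, and the chosen flags (maximum compression, overwrite, quiet) are an assumption, not necessarily what the script passes.

```python
import shlex


def flac_encode_command(wav_path, flac_path, flac_bin="flac"):
    """Build an argument list that encodes a wav file to flac.

    -8 selects the highest compression level, -f overwrites an existing
    output file, and --totally-silent suppresses all console output.
    """
    return [flac_bin, "-8", "-f", "--totally-silent", "-o", flac_path, wav_path]


cmd = flac_encode_command("/tmp/speech_rec.wav", "/tmp/speech_rec.flac")
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) on a machine where the flac binary is installed.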
Lefteris Zafiris says:

January 4, 2012 at 7:42 pm

Fresh code is out! The use of sox can now be optionally enabled by the
user if the system has a recent version of the program (it won’t work on
RHEL/Centos 5).
This is done by editing the script and setting the variable ‘use_sox’.
When sox is used, the audio gets normalized, low frequency noise (<100Hz)
is removed and any DC offset is corrected. Those are supposed
to improve the recognition results(?). The settings are still a bit
experimental, so feel free to play with them and report what settings
improved your results.
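
As a rough guess at what such a cleanup chain could look like with a recent sox (these are not the script's actual settings), the Python sketch below builds a command that applies a 100 Hz high-pass filter, which also removes DC offset, and then normalizes the level to -5 dB. The function name and file names are hypothetical.

```python
def sox_cleanup_command(in_path, out_path, sox_bin="sox"):
    """Build a sox argument list: high-pass at 100 Hz, then normalize to -5 dB."""
    return [
        sox_bin, in_path, out_path,
        "highpass", "100",    # drop low-frequency noise and any DC offset
        "gain", "-n", "-5",   # normalize, then leave 5 dB of headroom
    ]


print(" ".join(sox_cleanup_command("/tmp/speech_rec.wav", "/tmp/speech_rec.flac")))
```

Writing straight to a .flac file only works if sox was built with flac support.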

get the new version here:
https://github.com/downloads/zaf/asterisk-speech-recog/asterisk-speech-recog-0.3.tar.gz
Bruce B says:

January 6, 2012 at 9:46 pm

Does sox have more features on a Debian system than RHEL? Is that why it
won’t work on RHEL?

Cheers,
Lefteris Zafiris says:

January 6, 2012 at 10:40 pm

RHEL 5’s version of sox is really old and outdated. The command syntax
and the switches are totally different compared to recent versions of
sox.
Anyway, I’m not sure that audio normalization and the rest of what we use
sox for is really needed. My tests so far didn’t show any improvement in
detection rates. Keep in mind that all this is still WIP and the
option to use sox is more for testing than for serious use.
Bruce B says:

January 6, 2012 at 11:50 pm

Thanks.

I have been testing Aastra phones with SIP and had great results. I am
testing my cell phone now and sometimes get “-1” for id, status, utterance,
and confidence. What does that mean?

Cheers
Bruce B says:

January 7, 2012 at 12:03 am

NVM. I explored the code and see the logic. I had sox = 1 so it was failing
on RHEL.
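
For anyone building dialplan or application logic on top of this, here is a hedged Python sketch of how the returned values might be checked. It assumes, based only on the report above, that "-1" in utterance or confidence signals a failed recognition; the 0.8 threshold and the function name are made up for illustration.

```python
def classify_result(utterance, confidence, min_confidence=0.8):
    """Return "failed", "retry" or "ok" for a recognition result.

    Both arguments are strings, as they would be when read from
    channel variables.
    """
    if utterance == "-1" or confidence == "-1":
        return "failed"   # recognition did not produce a result
    try:
        score = float(confidence)
    except ValueError:
        return "failed"   # confidence was not a number
    if score < min_confidence:
        return "retry"    # ask the caller to repeat
    return "ok"


print(classify_result("hello", "0.9725632"))
print(classify_result("-1", "-1"))
```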

To report, my cell phone from a PRI gets same confidence level just like
SIP. Building my control app now. Should make my life much easier while
driving. Thanks again 🙂

-Bruce
Bruce B says:

January 7, 2012 at 3:34 am

Added two new features to the script: Timeout value and speechdata type.

exten => s,n,agi(speech-recog.agi,en-US,3000,phoneNumb)
– Will listen for 3 seconds and sanitize the return value as a single
number without any spaces in between. This helps when one reads a phone
number in the format 415-554-2323 and Google returns “415 554 2323” as the
result, which is not very usable.

exten => s,n,agi(speech-recog.agi,en-US,20000,string)
– Will listen for 20 seconds and return the result as provided by Google,
untouched.
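
The sanitizing step can be sketched in a few lines of Python. This is an illustration of the idea, not the code from the attached script; the function name is hypothetical, and the mode names are taken from the two dialplan lines above.

```python
import re


def sanitize(text, speechdata="string"):
    """Strip whitespace between digits when speechdata is "phoneNumb"."""
    if speechdata == "phoneNumb":
        # Remove whitespace only where it sits between two digits.
        return re.sub(r"(?<=\d)\s+(?=\d)", "", text)
    return text  # "string": return the text untouched


print(sanitize("415 554 2323", "phoneNumb"))  # 4155542323
print(sanitize("415 554 2323", "string"))     # 415 554 2323
```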

It would be great to see them in future versions as I seem to need them
dearly in a real life scenario.

Updated script attached.

-Bruce

speech-recog.agi
Lefteris Zafiris says:

January 7, 2012 at 8:22 am

Thank you Bruce for the testing and the suggestions.
Both features added in the script. Timeout can now be set by the user,
also -1 means no timeout and the recording keeps going till # is pressed.
Space gets stripped between digits, this is now the default behavior and
there’s no need to determine the ‘speechdata’ type.
The updated code can be found here:
https://github.com/zaf/asterisk-speech-recog/tarball/master

Next on my TODO list is to make use of the asterisk speech recognition
API (https://wiki.asterisk.org/wiki/display/AST/Speech+Recognition+API)
This will make the application actually usable in real-world scenarios
rather than just a proof of concept, as it is now.
Danny Nicholas says:

January 12, 2012 at 11:50 am

Two more “offerings” – #1 – add DTMF parameter so function can be stopped by
pressing a digit or digits other than * or # – #2 – add an option to
“silence” the beep. If you were using this in an IVR and wanted to say
“press 1 or say help for help”, silencing the beep before recording would
(IMO) make the rendering sound more “professional”/less “mechanical”.
Lefteris Zafiris says:

January 12, 2012 at 3:49 pm

Both features added:
Bruce B says:

July 4, 2012 at 12:45 pm

Hey Zaf,

I was just checking the Google Speech Recognition package again and I can’t see
the WolframAlpha.agi file. I checked all of your projects on GitHub but can’t find wolframalpha.agi. Please let us know what the URL is.

Thanks,
Bruce
Lefteris Zafiris says:

July 4, 2012 at 2:08 pm

It is under the folder samples/wolfram/

https://github.com/zaf/asterisk-speech-recog/tree/master/samples/wolfram

Brian Knep says:

January 4, 2012 at 12:21 pm

So far, the most trusted speech system I have seen used with Asterisk is LumenVox, but since it is not free, I am hoping the Google speech engine turns out to be a good way of doing speech recognition. I have tried this script, but I hear only silence when the AGI is launched, and it appears to be stuck. Nothing happens if I press # or any other button until I hang up and the AGI script exits. I am working to resolve this and will update accordingly. Does anyone have any idea about this?