Monday, October 17, 2011

Kinect speech recognition in linux

Audio support is now part of libfreenect. Additionally, it is now possible to load the Microsoft SDK version of the audio firmware from Linux, courtesy of a utility called kinect_upload_fw written by Antonio Ospite. This version of the firmware makes the Kinect appear to your computer as a standard USB microphone.
This means you can now record audio using your Kinect, but that's not all that interesting in and of itself. Linux support for speech recognition at this point is not all that great. It is possible to run Dragon NaturallySpeaking via Wine or to use the Sphinx project (after much training), but neither of those approaches really appealed to me for simple voice commands (as opposed to dictation). The Google Android project happens to include a speech recognizer from Nuance which by default is built for an ARM target, like your phone. After extensive hacking around the build system I was able to build it for an x86 target, like your desktop, instead. Now you can combine these two things, the Kinect's microphone array and the Android voice recognizer, to do some more interesting things, e.g. toggle hand tracking on and off via voice.



How to get started:

1) Check whether you have the "unbuffer" application, which ships with the Expect scripting tool:

which unbuffer

If the above command comes up empty you should download a copy of unbuffer from the link here:
http://dl.dropbox.com/u/11217419/unbuffer

Copy unbuffer to a directory that is in your PATH, like /usr/local/bin or ~/bin.

2) Download my precompiled version of the srec subproject from here:
http://dl.dropbox.com/u/11217419/srec_kinect.tgz

3) Save the tarball from step 2 in a convenient directory, then unpack it with this command:
tar xfz srec_kinect.tgz

4) Switch into the subdirectory where I've placed some convenience scripts:
cd srec/config/en.us

5) Open a second terminal and in that second terminal also switch into srec/config/en.us

6) In the first terminal execute
./run_SRecTestAudio.sh
and in the other terminal execute
cat speech_fifo

7) Try speaking into your microphone and wait for recognition results to appear in both terminals. Note that the vocabulary as configured at this point is very small: words like up, down, left, right and the numbers from 1 to 9 should be recognized properly.

Integrating the kinect:
1) Acquire Antonio Ospite's firmware tools like so:
git clone http://git.ao2.it/kinect-audio-setup.git/

2) Move into the kinect-audio-setup subdirectory:
cd kinect-audio-setup

3) Build and install kinect_upload_fw as root:
make install

4) Fetch and extract the Microsoft Kinect SDK audio firmware (depending on your directory permissions, this may also need to be run as root):
./kinect_fetch_fw /lib/firmware/kinect

This will extract the firmware to this location by default:
/lib/firmware/kinect/UACFirmware.C9C6E852_35A3_41DC_A57D_BDDEB43DFD04

5) Upload the newly extracted firmware to the Kinect:
kinect_upload_fw /lib/firmware/kinect/UACFirmware.C9C6E852_35A3_41DC_A57D_BDDEB43DFD04

6) Check for a new USB audio device in your dmesg output.

7) Configure the Kinect USB audio device as your primary microphone input and try out run_SRecTestAudio.sh again as described earlier.


Additional Notes:

I unfortunately no longer remember all the changes I had to make to get the srec project within Android to build for x86. Perhaps someone with better knowledge of the Android build system can chime in in the comments below. In the interim, use the precompiled copy linked above; just be aware that it is old. I think it dates back to the Froyo branch of Android or earlier (I compiled it a long time ago). If you want to take a shot at building the latest srec yourself, check out the Android source code and look under external/srec/.

The run_SRecTestAudio.sh script sets up the speech recognizer to run on live audio and pipes the recognition results to a fifo in the same directory called speech_fifo. Running cat in the second terminal lets you read out the recognition results as they arrive. Instead of cat, you could have whatever program needs recognition results read from the fifo and act accordingly. Unbuffer is used to make sure you see recognition results right away rather than waiting for the pipe's buffer to fill up.
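As a sketch of that idea, here's a minimal Python consumer for the fifo. The command-to-action mapping is a hypothetical placeholder, and I'm assuming one recognition result per line:

```python
# Minimal sketch: read srec recognition results from the speech_fifo
# named pipe and dispatch simple voice commands. The command-to-action
# mapping below is a hypothetical placeholder.

FIFO_PATH = "speech_fifo"  # created by run_SRecTestAudio.sh in srec/config/en.us

def handle_command(word):
    """Map a recognized word to an action string (placeholder logic)."""
    actions = {"up": "pan-up", "down": "pan-down",
               "left": "pan-left", "right": "pan-right"}
    return actions.get(word.strip().lower(), "ignored")

def main():
    # Opening the fifo for reading blocks until the recognizer opens it
    # for writing, so start run_SRecTestAudio.sh first.
    with open(FIFO_PATH, "r") as fifo:
        for line in fifo:  # assumed: one recognition result per line
            print(handle_command(line))

# Demo of the mapping itself (main() is only useful with the fifo present):
print(handle_command("Left"))  # -> pan-left
```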

The srec recognizer does not require any training but has certain limitations. The most significant limitation is the vocabulary it can recognize: the larger the vocabulary you specify, the less accurate the recognition results will likely be. As a result this recognizer is best used for a small set of frequently used voice commands. Under srec/config/en.us/grammars/ there are a number of .grxml files which define what words the recognizer can understand. You can define your own simple grammar (.grxml) here which, for example, only recognizes the digits on a phone keypad. To do this you can follow the syntax of any of the other .grxml files in the directory and then execute run_compile_grammars.sh, which will produce a .g2g file from the .grxml file. There is also a voicetag/texttag file with extension .tcp which needs to point to the .g2g file of your choice. You can find the .tcp files under the srec/config/en.us/tcp directory. run_SRecTestAudio.sh points to a .tcp file, which you can change.
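For example, a minimal digits-only grammar might look something like the sketch below. This follows the general W3C SRGS grxml syntax, but I haven't verified exactly which attributes srec expects, so treat it as a starting point and copy the real header from one of the existing .grxml files in the directory:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical minimal grammar: recognizes a single phone-keypad digit. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" version="1.0" root="digit">
  <rule id="digit" scope="public">
    <one-of>
      <item>zero</item>
      <item>one</item>
      <item>two</item>
      <item>three</item>
      <item>four</item>
      <item>five</item>
      <item>six</item>
      <item>seven</item>
      <item>eight</item>
      <item>nine</item>
    </one-of>
  </rule>
</grammar>
```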

Thursday, February 24, 2011

Kinect audio reverse engineering

I did some work a while back on getting the Kinect audio hardware to work as part of OpenKinect/libfreenect. Here are some quick notes on what I've figured out and how:
Once the audio firmware has been loaded, the Kinect sends 524 bytes to the Xbox every 1 ms. Every tenth packet is short (60 bytes) but potentially preceded by an empty packet. The short packets appear to be non-audio data (maybe signaling of some sort), because if you exclude them the resulting data doesn't appear to have any gaps.
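Filtering a capture down to the presumed audio packets under this interpretation can be sketched as follows; the packet sizes come from the observations above, and everything else is illustrative:

```python
# Keep only the 524-byte packets (presumed audio), dropping the short
# 60-byte signaling packets and the empty packets, per the layout above.

AUDIO_PACKET_LEN = 524

def audio_payloads(packets):
    """Yield the payloads of presumed-audio packets from a capture."""
    for pkt in packets:
        if len(pkt) == AUDIO_PACKET_LEN:
            yield pkt

# Illustrative 10-packet cycle: nine audio packets, one empty, one short,
# then audio resumes.
capture = [b"\x00" * 524] * 9 + [b"", b"\x00" * 60] + [b"\x00" * 524]
print(sum(1 for _ in audio_payloads(capture)))  # -> 10
```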

The audio samples appear to be 32-bit signed each, at 16 kHz (if you assume that sample rate, then the FFT of the recorded data shows the correct frequency).
The 4 channels seem to be transmitted in order left to right from the perspective of someone looking at the front of the Kinect; the leftmost channel is transmitted first. 256 samples of each channel are transmitted before switching to the next channel.
If you stitch together the disparate 256-sample blocks to reconstruct a given channel, the data appears to be continuous. The plot below shows the captured 4-channel audio stream with the channels labeled from left to right as 1, 2, 3, 4. You can see that the leftmost channel has the greatest amplitude, corresponding to the fact that the speaker was placed closest to the leftmost microphone. I repeated the test with a speaker near the rightmost microphone and, as expected, channel 4 became the strongest.
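Under that interpretation (32-bit signed samples at 16 kHz, 256-sample blocks per channel, leftmost channel first), deinterleaving the stream can be sketched in Python; the little-endian byte order is my assumption:

```python
# Sketch: deinterleave the Kinect's 4-channel audio stream under the
# layout described above. Little-endian byte order is an assumption.
import struct

CHANNELS = 4
BLOCK = 256  # samples per channel before switching to the next channel

def deinterleave(raw):
    """Split raw bytes of 32-bit signed samples into per-channel lists."""
    samples = struct.unpack("<%di" % (len(raw) // 4), raw)
    chans = [[] for _ in range(CHANNELS)]
    for i in range(0, len(samples), BLOCK):
        chan = (i // BLOCK) % CHANNELS
        chans[chan].extend(samples[i:i + BLOCK])
    return chans

# Synthetic example: two full frames, each block filled with its channel index
frames = b"".join(struct.pack("<%di" % BLOCK, *([c] * BLOCK))
                  for _ in range(2) for c in range(CHANNELS))
chans = deinterleave(frames)
print([ch[0] for ch in chans], [len(ch) for ch in chans])
# -> [0, 1, 2, 3] [512, 512, 512, 512]
```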



I was able to determine the audio format above by synthesizing a 700 Hz sine wave in MATLAB and playing it back at the Kinect with a speaker nearest the leftmost microphone (as seen from the front of the Kinect). I then captured the data stream coming back from the Kinect while playing the sine wave, using a Beagle USB sniffer. I extracted the 524-byte blocks I suspected to be audio from the Beagle dumps, post-processed them with a series of shell scripts, and read them into MATLAB to plot the FFT of the audio, as seen below:



The frequency shown by the FFT is, correctly, approximately 700 Hz. This suggests that my interpretation of the audio format is correct.
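The same sanity check can be sketched in pure Python: synthesize a 700 Hz tone at the presumed format and confirm that a coarse DFT scan peaks at 700 Hz. This is a stand-in for the MATLAB FFT; numpy would be faster for real captures:

```python
# Verify the presumed format the same way: a 16 kHz stream of 32-bit
# samples containing a 700 Hz tone should peak at 700 Hz in a DFT scan.
import cmath, math

RATE = 16000  # presumed sample rate

def dominant_freq(samples, max_hz=1000, step=50):
    """Return the candidate frequency with the largest DFT magnitude."""
    best_mag, best_f = 0.0, 0
    for f in range(step, max_hz + 1, step):
        mag = abs(sum(s * cmath.exp(-2j * math.pi * f * k / RATE)
                      for k, s in enumerate(samples)))
        if mag > best_mag:
            best_mag, best_f = mag, f
    return best_f

# 0.1 s of a 700 Hz tone at roughly 32-bit signed amplitude
tone = [int((2 ** 30) * math.sin(2 * math.pi * 700 * k / RATE))
        for k in range(1600)]
print(dominant_freq(tone))  # -> 700
```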

Firmware loading process

I've managed to duplicate what I think is most of the init sequence so far: I send all the same control transfers and bulk transfers as the Xbox, as far as I can tell, and my Beagle 480 confirms that I mirror the Xbox behavior for the most part. After completing a series of 512-byte bulk-out transfers, which I assume is some sort of bootstrapping firmware upload, the audio device re-enumerates; I wait for that to happen, open the new audio device, then send some more control and bulk transfers. So far, so good: this all follows what I see in the Xbox logs. At this point the Xbox appears to send 12 cycles of (1 transfer: 0-byte iso IN, 8 transfers: 4 bytes OUT), which I also duplicate perfectly. Now, the final step is a very long stream of (1 transfer: 0-byte iso IN, 8 transfers: 76 bytes OUT) before eventually those 0-byte IN transfers become 524-byte transfers. Unfortunately it seems the content of those 76-byte OUT transfers must matter, because after trying all zeros I never get any data back in my IN transfers (even after >5000 IN transfers). I have some scripts I'll use to try to generate code for all those OUT transfers directly from the .tdc files.