My first guess for a starting point would be the speech recognition engine by itself. There are several open-source options for speech recognition. It all depends on what you're looking to "get out" of the service. By this I mean: after the words have been spoken and the computer has processed them, what do you want it to spit out next? Keeping it simple, in theory we're looking at a classic stdin/stdout situation with variables spaced over time.
Say you choose speech-to-text. After the speech has occurred and the SRE has determined the closest transcription, a sample output would be a string containing each word, delimited by ',' and terminated with a newline '\n', along with a calculated number of syllables per word (this is no doubt part of speech recognition) and the total time the original recording took (to try to preserve timing). This could be written to a text file, or better yet sent to a FIFO/named pipe for the "vocal processing" portion we are interested in. The result could be a rule, a macro, or an event containing OSC messages, MIDI, or any other data generated from the speech: the timing of the speech, the recognition of what was spoken, and its makeup.
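Just to make the hand-off concrete, here is a minimal sketch of the consumer side of that FIFO. Everything in it is an assumption for illustration (the pipe path, the exact field layout, and the parse_line helper are made up, not part of any existing recognizer):

```python
import os

# Hypothetical FIFO path; the SRE side would write one line per utterance here.
FIFO_PATH = "/tmp/sre_out"

def parse_line(line):
    """Parse one line in the assumed format:
       word1,word2,...,wordN,syllable_counts,elapsed_seconds\n
    where syllable_counts is the per-word syllable counts joined with '|'
    and elapsed_seconds is the length of the original recording.
    """
    fields = line.rstrip("\n").split(",")
    words = fields[:-2]
    syllables = [int(s) for s in fields[-2].split("|")]
    elapsed = float(fields[-1])
    return words, syllables, elapsed

def main():
    # Create the named pipe if the recognizer hasn't already.
    if not os.path.exists(FIFO_PATH):
        os.mkfifo(FIFO_PATH)

    # open() blocks until the SRE opens the other end for writing.
    with open(FIFO_PATH, "r") as fifo:
        for line in fifo:
            words, syllables, elapsed = parse_line(line)
            # The "vocal processing" stage would go here: map the utterance
            # to a rule, a macro, OSC messages, MIDI, etc.
            print(f"{len(words)} words in {elapsed:.2f}s: {words} "
                  f"(syllables per word: {syllables})")

if __name__ == "__main__":
    main()
```

You could test it without any recognizer at all by echoing a fake line into the pipe, e.g. `echo 'hello,world,2|1,1.30' > /tmp/sre_out`.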
No doubt this is a MONSTER idea and open-source SR is very complicated. But here are some speech recognition projects that will recognize speech and pump out a LOT of data about what they're doing and how:
Kaldi
http://kaldi-asr.org/doc/about.html
Julius
http://julius.osdn.jp/en_index.php
Voxforge (open-source speech corpus creation and integration)
http://www.voxforge.org/