I just stumbled across this wonderful video about Animal Crossing’s pseudo-speech that offers a nice survey of other methods for masquerading text+sound as speech.
The video makes an interesting point: when we see text appear chunk-by-chunk accompanied by sounds, most folks can follow along more easily than if presented with the text all at once.
It got me thinking – does anyone here have any experience with any of these techniques? Would people be willing to donate small snippets of speech so we can have a library to provide voices to text?
I’m a big fan of speech synthesis, and I love to make things babble! I also love to babble about things babbling. There is something I find so charming about getting a computer to chatter. One of my pipe dreams is to build an asemic speech engine that is built around prosody and inflection.
I once wrote a study called “computer on the phone with his mother”:
I wrote this in my Sporth language. You can find the Sporth code here.
The recipe for making speech sounds like this is reasonably straightforward to replicate in a modular synthesis environment (Csound, SC, PD, FAUST, etc). What this is doing is taking a narrow pulse wave and putting it through a series of bandpass filters tuned to formant frequencies. Interpolating between the formant values gives it the “talky” effect. Randomly interpolating between formant values makes it “babble”. I also added some pauses, which makes it feel more like words in a language. The breaks are what give it personality (similarly, in singing synthesis, choosing the right vibrato makes a world of difference).
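To make the recipe concrete, here is a minimal numpy-only sketch of the same idea. It is not the actual Sporth patch: the vowel formant tables are rough textbook estimates, and the pitch, bandwidths, and pause probability are numbers I picked for illustration.

```python
import numpy as np

SR = 44100
# Rough textbook F1/F2/F3 estimates for a few vowels (Hz) -- illustrative only.
VOWELS = [(730, 1090, 2440),   # "ah"
          (270, 2290, 3010),   # "ee"
          (300,  870, 2240)]   # "oo"

def pulse_train(freq, dur):
    """Narrow pulse source: one-sample clicks at the pitch period."""
    sig = np.zeros(int(SR * dur))
    sig[::int(SR / freq)] = 1.0
    return sig

def resonator(sig, freqs, bw=80.0):
    """Two-pole resonator whose center frequency varies per sample."""
    out = np.zeros_like(sig)
    y1 = y2 = 0.0
    r = np.exp(-np.pi * bw / SR)             # pole radius from bandwidth
    for i, x in enumerate(sig):
        theta = 2.0 * np.pi * freqs[i] / SR
        y = x + 2.0 * r * np.cos(theta) * y1 - r * r * y2
        out[i] = y
        y1, y2 = y, y1
    return out

def babble(n_syllables=8, syl_dur=0.15, pause_p=0.3, seed=0):
    rng = np.random.default_rng(seed)
    chunks, prev = [], VOWELS[0]
    for _ in range(n_syllables):
        if rng.random() < pause_p:           # pauses make it feel like words
            chunks.append(np.zeros(int(SR * syl_dur * 0.6)))
            continue
        tgt = VOWELS[rng.integers(len(VOWELS))]
        src = pulse_train(110.0, syl_dur)
        t = np.linspace(0.0, 1.0, len(src))  # glide between formant targets
        syl = sum(resonator(src, prev[k] + (tgt[k] - prev[k]) * t)
                  for k in range(3))
        chunks.append(syl * np.hanning(len(src)))  # soften onsets/offsets
        prev = tgt
    out = np.concatenate(chunks)
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out

audio = babble()
```

Write `audio` to a WAV file at 44.1 kHz and you should hear the “talky” glides; raising `pause_p` makes the word-like breaks more frequent.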
About a year or so later, I made another babble track. This one uses a Kelly-Lochbaum vocal tract to synthesize the voice. From a control standpoint, the general concept is very similar, only instead of moving a bunch of bandpass filters around, you’re pushing and pulling on a virtual vocal tract to shape the sounds. The implementation itself is based on pink trombone, but I hastily ported the JavaScript code to C to make it work with the rest of my music software ecosystem.
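For anyone curious what a Kelly-Lochbaum tract looks like in code, here is a toy sketch of the general scattering structure (not the actual C port): the tract is a chain of tube sections, reflection coefficients come from adjacent cross-sectional areas, and “pushing and pulling” on the area function is what shapes the sound. The boundary reflection values and the two-tube area function are my own illustrative guesses.

```python
import numpy as np

SR = 44100

def kl_tract(source, areas, glottal_refl=0.75, lip_refl=-0.85):
    """One-dimensional waveguide: tube sections plus scattering junctions."""
    n = len(areas)
    # reflection coefficient at each interior junction, from adjacent areas
    k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])
    R = np.zeros(n)                  # right-going waves (toward the lips)
    L = np.zeros(n)                  # left-going waves (toward the glottis)
    out = np.zeros(len(source))
    for t, x in enumerate(source):
        newR, newL = np.empty(n), np.empty(n)
        newR[0] = x + glottal_refl * L[0]      # glottis end
        newL[n - 1] = lip_refl * R[n - 1]      # lip end
        for i in range(1, n):                  # Kelly-Lochbaum scattering
            w = k[i - 1] * (R[i - 1] + L[i])
            newR[i] = R[i - 1] - w
            newL[i - 1] = L[i] + w
        R, L = newR, newL
        out[t] = (1.0 + lip_refl) * R[n - 1]   # wave transmitted at the lips
    return out

# crude "ah"-like area function: narrow near the glottis, open at the mouth
areas = np.concatenate([np.full(10, 0.6), np.full(10, 3.0)])
src = np.zeros(SR // 4)
src[::SR // 110] = 1.0               # 110 Hz pulse-train glottal source
audio = kl_tract(src, areas)
```

Animating `areas` over time (rather than holding one shape) is where the babbling comes from.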
There’s also stuff like espeak and festival which I’ve used in the past to generate electronic speech.
My partner and I built Martin Howse’s ‘Wormed Voice’; it seems promising, and the documentation is a treasure trove of speech synthesis history + info http://1010.co.uk/org/wormedvoice.html
Hers works, mine doesn’t properly :-/ gonna try and build another and see what went wrong
I also got a Micro:bit a while ago to experiment with its very limited speech synth possibilities; I did a couple of little tests but have yet to properly get into it.
Years ago I had a dream that I found a rackmount hardware speech synth machine at a car boot sale, recorded a 7" using it and John Peel played it…
Thanks for all of this. I’m excited to see what others have been using. I’ve been experimenting with synthetic speech too, and a lot of these notes are inspiring. One point of inspiration has been the group Visible Cloaks and their use of synthetic speech that generates MIDI data, which they refer to as ‘imprinting.’ Here’s a recent interview in which they discuss the process:
Ryan brought up the idea of using the “text-to-music” (TTM) function in the generative music program Wotja to create the starting point for a piece. It’s an extension of a creative process I have come to call “imprinting” — using text or musical fragments to generate a digital representation of it in MIDI, then using it as a building block in the construction of a more complex arrangement.
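As far as I know, Wotja’s actual text-to-music mapping isn’t public, so just to make the “imprinting” idea concrete, here is a toy mapping of my own invention: letters deterministically become notes on a scale, so the same text always yields the same MIDI fragment, which you could then arrange around.

```python
# Toy "imprinting": derive MIDI material deterministically from text.
# The mapping rules (letters onto a C minor scale, vowels held longer)
# are invented for illustration -- not Wotja's algorithm.
C_MINOR = [60, 62, 63, 65, 67, 68, 70]   # MIDI note numbers, one octave

def imprint(text):
    """Map each letter to a (midi_note, beats) pair."""
    notes = []
    for ch in text.lower():
        if not ch.isalpha():
            continue
        idx = ord(ch) - ord('a')
        degree = idx % len(C_MINOR)          # scale degree from the letter
        octave = 12 * (idx // len(C_MINOR))  # later letters sit higher up
        beats = 1.0 if ch in "aeiou" else 0.5  # vowels ring out longer
        notes.append((C_MINOR[degree] + octave, beats))
    return notes

phrase = imprint("visible cloaks")
```

The resulting `(note, duration)` pairs can be written out as a MIDI clip and looped or reharmonized as a building block.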
You can also get pretty far by repeating spoken word over and over again. Your brain will eventually turn it into music. Diana Deutsch has a really great example on her website.
Steve Reich employs speech-to-note transcription in Different Trains, which I’ve always enjoyed listening to (especially on trains or any moving vehicle really).
big favorite is pink trombone
I often sample this in my Digitakt to make weird wobbly synths.
I also love playing with it while listening to music; it’s like the uncanny valley of robot vox! I imagine a poorly rendered robot coming to town singing all your favorite medieval hits!
aaaaaaaw waaaw wawawa lalaaa llaa la!
There is a very cool device exploring those concepts: the Pocket Operator PO-35 speak. It’s maybe overly minimalistic for making entire tunes in, but as an immediate voice sampler with vocoding, vowel synthesis, and autotune it’s rather impressive.
Playing with it enables you to employ some of the creative strategies that @PaulBatchelor talked about above. If you’d like I can record a demo.
There’s a pretty good interview with the creator of the device that includes some technical detail:
I built a little filter around this in soundpipe that takes in a streaming input signal (assumed to be speech), and converts it to that computer sound via openlpc.
Here’s what it sounds like (original plays first, followed by the LPC-processed output):
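openlpc itself is a C library, so rather than guess at its API, here is a numpy-only sketch of what an LPC pass like this does to a voice: estimate an all-pole filter per frame with the Levinson-Durbin recursion, then re-excite it with a flat pulse train instead of the original residual, which is what produces that buzzy “computer” timbre. Frame size, order, and pitch are typical vocoder values, not openlpc’s actual settings.

```python
import numpy as np

SR, FRAME, ORDER = 8000, 160, 10   # 20 ms frames, classic vocoder settings

def lpc_coeffs(frame, order):
    """Levinson-Durbin recursion on the frame's autocorrelation."""
    r = np.correlate(frame, frame, "full")[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = np.clip(-acc / err, -0.999, 0.999)  # keep the filter stable
        a_prev = a.copy()
        a[i] = k
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        err *= (1.0 - k * k)
    return a, err

def allpole(exc, a, gain):
    """Run the excitation through the all-pole synthesis filter 1/A(z)."""
    y = np.zeros(len(exc))
    for t in range(len(exc)):
        acc = gain * exc[t]
        for j in range(1, len(a)):
            if t - j >= 0:
                acc -= a[j] * y[t - j]
        y[t] = acc
    return y

def lpc_robotize(speech, pitch_hz=100.0):
    """Per-frame LPC analysis, re-excited with a flat pulse train."""
    period = int(SR / pitch_hz)
    out = []
    for start in range(0, len(speech) - FRAME, FRAME):
        frame = speech[start:start + FRAME] * np.hamming(FRAME)
        a, err = lpc_coeffs(frame, ORDER)
        exc = np.zeros(FRAME)
        exc[::period] = 1.0                 # fixed-pitch buzz excitation
        out.append(allpole(exc, a, np.sqrt(max(err, 0.0))))
    return np.concatenate(out)

# demo: a synthetic vowel-ish input (sine plus noise) stands in for speech
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 120 * np.arange(SR) / SR) + 0.2 * rng.standard_normal(SR)
robot = lpc_robotize(speech)
```

Feeding real recorded speech into `lpc_robotize` (resampled to 8 kHz) gives the classic Speak & Spell-ish flattening.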
This is the one
Paul Lansky (of Radiohead fame) was our composer in residence, and he showed us this piece. He played the first .mp3 of the file to a group of grads and undergrads, then sent it to everyone in an email, and we were all floored… hahahaha
I just wanted to say thank you for the Diana Deutsch link - very interesting to me! I’ve been experimenting with singing poetry in the past year or so, and this experiment really resonates - it seems a very clear example of how deeply linked song & speech seem to be.
I experienced the effect of the non-repeated phrase seeming to “sing out” in the second example. Somehow, though, I can’t think of the effect as necessarily illusory - it feels to me more like an exaggeration of something that is already there. The initial repetition sensitises one to the musical/tonal quality already inherent in the speech, normally occluded by our habit of just focusing on the words, as it were… maybe it’s our usual way of hearing speech that’s the illusion.
(It reminded me also of an experiment/workshop-thing I have a recording of John C Lilly doing, which is a single word (“cogitate” in this instance) looped over & over - doesn’t take long before most people start hearing a whole range of different words/short phrases.)
(& random things about whistled/tonal languages, mythical stories about the secret wisdoms of birds, and contented gorillas humming happy songs to themselves as they eat, etc.)
In the speech synthesis world, sampling a voice is sometimes referred to as “voice banking”. I’ve never done it, but I’ve heard it can be rather labor intensive. This PDF on building a voice bank in UTAU could have some hints.
“Rare Phonemes” gave me a good chuckle. Looking up IPA phonemes on Wikipedia can be a bit of a rabbit hole. I particularly enjoy listening to the Voiced Epiglottal Trill, which I think sounds just like Homer Simpson drooling.