I’m a big fan of speech synthesis, and I love to make things babble! I also love to babble about things babbling. There is something I find so charming about getting a computer to chatter. One of my pipe dreams is to build an asemic speech engine centered around prosody and inflection.
I once wrote a study called “computer on the phone with his mother”:
I wrote this in my Sporth language. You can find the Sporth code here.
The recipe to make speech sounds like this is reasonably straightforward to replicate in a modular synthesis environment (Csound, SC, PD, FAUST, etc.). What this is doing is taking a narrow pulse wave and putting it through a series of bandpass filters tuned to formant frequencies. Interpolating between the formant values gives it the “talky” effect. Randomly interpolating between formant values makes it “babble”. I also added some pauses, which makes it feel more like words in a language. The breaks are what give it personality (similarly, in singing synthesis, choosing the right vibrato makes a world of difference).
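The recipe above can be sketched in plain Python. This is not the original Sporth patch, just a minimal illustration of the same idea under my own assumptions: the names (`Bandpass`, `babble`), the Q value, and the glide/pause rates are my choices, and the formant frequencies are Peterson & Barney-style averages. A single-sample pulse train excites three retunable bandpass filters; the filters glide toward randomly chosen vowel targets (“babble”), and an amplitude gate occasionally drops to zero for word-like pauses.

```python
import math
import random

SR = 44100  # sample rate in Hz

def pulse_train(freq, dur):
    """Narrow pulse wave: one single-sample impulse per glottal period."""
    n = int(SR * dur)
    period = SR / freq
    return [1.0 if (i % period) < 1.0 else 0.0 for i in range(n)]

class Bandpass:
    """Streaming RBJ-cookbook biquad bandpass whose center can be retuned."""
    def __init__(self, q=8.0):
        self.q = q
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0
        self.tune(1000.0)  # default center until retuned

    def tune(self, freq):
        w0 = 2 * math.pi * freq / SR
        alpha = math.sin(w0) / (2 * self.q)
        a0 = 1 + alpha
        self.b0 = alpha / a0            # b1 is 0 for this bandpass form
        self.b2 = -alpha / a0
        self.a1 = -2 * math.cos(w0) / a0
        self.a2 = (1 - alpha) / a0

    def step(self, x):
        y = self.b0 * x + self.b2 * self.x2 - self.a1 * self.y1 - self.a2 * self.y2
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y

# Rough male-average formants (F1, F2, F3) in Hz, after Peterson & Barney
VOWELS = [(730, 1090, 2440),   # "ah"
          (270, 2290, 3010),   # "ee"
          (300,  870, 2240)]   # "oo"

def babble(seconds=1.0, pitch=110):
    """Pulse train -> three gliding formant filters, with random pauses."""
    src = pulse_train(pitch, seconds)
    filters = [Bandpass() for _ in range(3)]
    cur = list(random.choice(VOWELS))
    tgt = list(random.choice(VOWELS))
    gate = 1.0
    out = []
    for i, x in enumerate(src):
        if i % (SR // 100) == 0:             # every 10 ms: update glide
            if i % (SR // 4) == 0:           # every 250 ms: new vowel target
                tgt = list(random.choice(VOWELS))
                gate = 0.0 if random.random() < 0.2 else 1.0  # occasional pause
            for k in range(3):               # interpolate -> "talky" effect
                cur[k] += 0.1 * (tgt[k] - cur[k])
                filters[k].tune(cur[k])
        out.append(gate * sum(f.step(x) for f in filters))
    return out
```

Summing the filter outputs in parallel (rather than chaining them in series) keeps each formant's level independent, which is the simpler choice for a sketch like this.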
There’s also stuff like espeak and festival, which I’ve used in the past to generate electronic speech.
Thanks for all of this. I’m excited to see what others have been using as well. I’ve been experimenting with synthetic speech too, and a lot of these notes are inspiring. One point of inspiration has been the group Visible Cloaks and their use of synthetic speech to generate MIDI data, which they refer to as ‘imprinting.’ Here’s a recent interview in which they discuss the process:
Ryan brought up the idea of using the “text-to-music” (TTM) function in the generative music program Wotja to create the starting point for a piece. It’s an extension of a creative process I have come to call “imprinting” — using text or musical fragments to generate a digital representation of it in MIDI, then using it as a building block in the construction of a more complex arrangement.
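Wotja’s actual text-to-music mapping isn’t publicly documented, so purely as a hypothetical illustration of the “imprinting” idea (the function name `imprint`, the pentatonic scale, and the character mapping are all my invention, not Wotja’s algorithm), here’s a toy that deterministically folds text onto MIDI note numbers:

```python
# Toy "imprinting" sketch: map text to MIDI notes by folding each letter
# onto a pentatonic scale. NOT Wotja's TTM algorithm; it only illustrates
# using text as a deterministic seed for musical building blocks.

PENTATONIC = [0, 2, 4, 7, 9]   # major pentatonic degrees, in semitones
BASE = 60                      # MIDI note 60 = middle C

def imprint(text, base=BASE):
    notes = []
    for ch in text.lower():
        if not ch.isalpha():
            continue                        # skip spaces/punctuation
        n = ord(ch) - ord('a')              # letter index 0..25
        octave, degree = divmod(n, len(PENTATONIC))
        notes.append(base + 12 * octave + PENTATONIC[degree])
    return notes

print(imprint("hello"))   # [76, 69, 86, 86, 93]
```

The same text always yields the same note sequence, which is the point: the fragment becomes a reusable building block you can then arrange around.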
A big favorite is Pink Trombone.
I often sample it in my Digitakt to make weird wobbly sounds. I also love playing with it while listening to music; it’s like the uncanny valley
of robot vox! I imagine a poorly rendered robot coming to town to sing
all your favorite medieval hits!
aaaaaaaw waaaw wawawa lalaaa llaa la!
There is a very cool device exploring those concepts: the Pocket Operator PO-35 speak. It’s maybe too minimalistic to make entire tunes in, but as an immediate voice sampler with vocoding, vowel synthesis, and autotune, it’s rather impressive.
Playing with it enables you to employ some of the creative strategies that @PaulBatchelor talked about above. If you’d like I can record a demo.
There’s a pretty good interview with the creator of the device that includes some technical detail:
This is the one
Paul Lansky (of Radiohead fame) was our Composer in Residence, and he showed us this piece. He played the first .mp3 of the file to a group of grads and undergrads, then sent it to everyone in an email, and we were all floored… hahahaha
I just wanted to say thank you for the Diana Deutsch link - very interesting to me! I’ve been experimenting with singing poetry in the past year or so, and this experiment really resonates - it seems a very clear example of how deeply linked song & speech seem to be.
I experienced the effect of the non-repeated phrase seeming to “sing out” in the second example, but somehow I can’t think of the effect as necessarily illusory. It feels to me more like an exaggeration of something that is already there: the initial repetition sensitises one to the musical/tonal quality already inherent in the speech, only occluded by our habit of just focusing on the words, as it were… Maybe it’s our usual way of hearing speech that’s the illusion.
(It also reminded me of an experiment/workshop recording I have of John C. Lilly, in which a single word (“cogitate” in this instance) is looped over and over; it doesn’t take long before most people start hearing a whole range of different words/short phrases.)
(& random things about whistled/tonal languages, mythical stories about the secret wisdom of birds, and contented gorillas humming happy songs to themselves as they eat, etc.)
In the speech synthesis world, sampling a voice is sometimes referred to as “voice banking”. I’ve never done it, but I’ve heard it can be rather labor-intensive. There’s a PDF on building a voice bank in UTAU that could have some hints.
“Rare Phonemes” gave me a good chuckle. Looking up IPA phonemes on Wikipedia can be a bit of a rabbit hole. I particularly enjoy listening to the voiced epiglottal trill, which I think sounds just like Homer Simpson drooling.