I would like to introduce OddVoices, a project to create quirky lo-fi singing synthesizers for General American English, inspired by retro TTS systems from the 80’s and 90’s. Here’s a sample:
Unlike many modern speech synthesizers created using machine learning, OddVoices is a 100% manual endeavor. I wrote a list of 628 English words that cover the most common diphones in GA, convinced a singer friend of mine to record all of them sung in monotone, and labeled all the diphones in Audacity. The analysis/resynthesis algorithm is MBR-PSOLA.
Currently, the interface is a command-line tool that accepts either a JSON file (see examples/music.json) or a MIDI file plus lyrics, and outputs a WAV file. Further down the line, I want to make this work in real time as a SuperCollider UGen, and I have already ported the engine to C++.
The project is still very early in development, but it’s capable of producing entertaining musical results. Thanks for checking it out, and let me know if you have questions or feedback.
Although I’m using the same algorithm as MBROLA, there’s no code in common with the MBROLA project and the voice bank is original, so you won’t run into any of MBROLA’s licensing issues. OddVoices is itself available under the Apache License.
Hmm… now that you mention it, I should probably give the voice banks an explicit CC0 license just to be safe.
I don’t have a Mac I can test on, but the non-realtime Python reference synthesizer, lightly documented in the README, should work cross-platform. The SuperCollider UGen might work if you follow the build instructions, but it’s still undocumented (I need to get on that).
@nathan I just tested this out (on arch linux fwiw) and it’s really fun. I’d like to try creating a new voice for your system at some point too!
I did have a number of issues with git-lfs, though. Running git lfs pull looked like it downloaded the files, but when I checked, the data on disk was still just placeholder files, and oddvoices died the first time it tried to open the stub WAV file in the quake voice, which was still just a text file. I manually downloaded that one file and was then able to compile the voice, but rendering failed in a similar way on an unknown file, resulting in a KeyError somewhere in the corpus.py routines that read the voice. (Sorry, I forgot to make a note of the exact line number… I think it was trying to index into an object called string, iirc. The compiled voice looks good, though: the magic string is there, and I was able to use it once I disabled git lfs, so it must have been failing somewhere else…)
After several months’ hiatus from this project, I have recently resumed working on OddVoices. I ported the Python prototype over to C++ and built a friendlier alternative to the command-line interface: a web page where you can upload a MIDI file and lyrics and render a WAV file right in your browser, thanks to the power of WebAssembly.
I’ve been posting OddVoices dev logs on my blog. I’m here to share a few choice updates.
First, I have registered oddvoices.org, the new home of the web frontend:
There is also a new voice, Air Navier, and the engine now incorporates real vocal pitch phenomena such as “preparation” (sliding away from the target pitch before sliding toward it) and “overshoot” (sliding past the target pitch, then changing direction). See the attached images, derived by running real vocal data through a pitch tracker.
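To make preparation and overshoot concrete, here is a small illustrative pitch-contour generator. The piecewise shapes, parameter names, and default depths are all invented for this sketch and are not the actual OddVoices model:

```python
import numpy as np

def smoothstep(t):
    # Cubic ease-in/ease-out on [0, 1].
    return t * t * (3.0 - 2.0 * t)

def glide(start, target, n, prep_depth=0.5, over_depth=0.3,
          prep_frac=0.2, settle_frac=0.3):
    """Hypothetical pitch glide (in semitones) over n samples.

    Preparation: briefly dip away from the target at the start.
    Overshoot: slide past the target, then settle back onto it.
    Knot positions and depths are illustrative only.
    """
    d = 1.0 if target >= start else -1.0
    knot_vals = [start, start - d * prep_depth, target + d * over_depth, target]
    knot_times = [0.0, prep_frac, 1.0 - settle_frac, 1.0]
    t = np.linspace(0.0, 1.0, n)
    y = np.empty(n)
    # Ease between successive knots: prep dip, main glide, settle.
    for a, b, va, vb in zip(knot_times[:-1], knot_times[1:],
                            knot_vals[:-1], knot_vals[1:]):
        mask = (t >= a) & (t <= b)
        local = (t[mask] - a) / (b - a)
        y[mask] = va + (vb - va) * smoothstep(local)
    return y
```

For an upward glide from MIDI note 60 to 64, this first dips below 60, then rises past 64 before settling on it, which is roughly the shape a pitch tracker shows for real singers.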
Thanks for sharing this and your blog; it’s interesting to understand how it all works. Looking forward to a SuperCollider version one day!
[P.s.: I don’t know if you’ve come across FOF synthesis in your research, but it might be interesting to check out. FOF is a French acronym meaning “formant wave function” synthesis. It was used by the IRCAM CHANT project back in the day and produced some really nice-sounding singing voices. I think it pre-dates PSOLA, and has some similarities as well. There’s an example clip from CHANT here]
Thanks for the links. In the textbook Text-to-Speech Synthesis, Paul Taylor describes how speech synthesizers evolved over time in “generations.” The first generation of speech synths was developed in the 70’s and 80’s and uses parametric control of formants in schemes like FOF, windowed sync, and LPC.
OddVoices is a second-gen singing synthesizer, modeled after late 80’s and 90’s speech synths that use sample playback with time-frequency modification. MBROLA also falls in this category.
Unit selection, like in your second link, is a third-generation method along with Hidden Markov Models (the recent Casio synth is apparently based on Sinsy, which is HMM-based). They sound great due to their heavy use of contextual information to select samples. In the past decade, deep learning approaches have sprung up that could be called “fourth generation.”
The downside of the later generations is that they require enormous amounts of training data to work their magic. When deciding which approach to take at the inception of OddVoices, I made a deliberate tradeoff, because I don’t have the resources to record long hours of singing. The benefit is that new voice banks can be created without too much work, enabling a wide diversity of voices in the project.