Machine Learning for Synthesizer Parameters

@gretchen and I had some conversations about FM synthesis and something she brought up was the term Gradient Descent as a methodology to discover a set of parameters for a synthesizer to reproduce the spectrum of some kind of input.

My education in maths is shamefully bad. So I started searching for some keywords and read a white paper titled SOUND DESIGN LEARNING FOR FREQUENCY MODULATION SYNTHESIS PARAMETERS.

It describes “A general heuristic to match synthesis parameters of a fixed sound engine to an arbitrary sound target. After the generation of training data from the synthesis engine, PCA is performed to extract relevant metrics and a combination of k-means and gradient descent is used to get an estimation.”

The paper is short and the example was not trained on the spectrum from a real acoustic sound recording. It references a few toolkits from the Matlab empire. A friend and colleague uses and recommends the Julia language for machine learning and AI, so I did a few searches and found MusicProcessing.jl and Julia Audio. They are new projects. MusicProcessing has 8 commits and 1 contributor. If anyone has an AES membership, there’s a white paper published last year on using Julia for audio applications.

So it looks like there are some basic building blocks to do ML with free software on Linux…which could include Norns! Taking a giant leap, I could imagine a subsystem on Norns devices that can take audio input and try and find a set of params for an engine that could reproduce that sound. I’ve been told that a primary feature of the Julia language is performance optimizations for small/embedded hardware.

Anyone have deeper references about work in this space?


I imagine he’ll find his way over here but @Rodrigo has been thinking in a similar world I believe. you also may very well know about google magenta but I’ll throw it down here anyway.

they’ve got some ML stuff running on open-source hardware as well


I’ve heard mixed reviews about how useful TensorFlow projects are for practical applications. I think the N-Synth is only a microcontroller with an X/Y touch surface. It cannot operate without a computer attached with CUDA cores and TensorFlow.


oh lol that’s very different

i would step back from sexy DNN hype stuff. tensorflow is great if you can do heavy crunching server side and get the results through a remote API. more specialized DNN applications can do training on a phat server and classification on a mobile device (esp. mobile GPUs or running Metal on apple’s dedicated compute modules.) but i’m not convinced this is the right approach for a self-contained creative hacking situation - it’s a fine approach if you’ve already established that DNN is a good fit for the problem, you know how to train it, and you have the engineering budget to make it work on a more constrained platform.

if you’re new to ML techniques, i’d look at PCA first. it’s very simple to understand and implement.
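to make that concrete, here’s about how little machinery PCA needs: center the data, eigendecompose the covariance matrix, project onto the top components. (a minimal numpy sketch — the toy data and all the names here are invented for the example.)

```python
import numpy as np

def pca(X, n_components):
    """Return (projected data, components, explained variances)."""
    X_centered = X - X.mean(axis=0)
    # covariance matrix of the features (D x D)
    cov = np.cov(X_centered, rowvar=False)
    # symmetric matrix -> eigh; eigenvalues come back in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]              # sort descending
    components = eigvecs[:, order[:n_components]]
    projected = X_centered @ components
    return projected, components, eigvals[order[:n_components]]

# toy example: 200 points in 3-D that mostly vary along one direction
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) @ np.array([[3.0, 1.0, 0.2]]) \
    + rng.normal(scale=0.05, size=(200, 3))
Z, comps, var = pca(X, n_components=1)
```

one principal component captures nearly all the variance of that cloud, which is the whole point: many correlated dimensions collapse to a few meaningful ones.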

lua is a fine high-level language and is used in applied math and ML stuff (e.g. luaTorch.) for low-level BLAS bits, look to the gsl-lua package that wraps the GNU Scientific Library and gives you all the fast matrix math.

i haven’t actually used julia. it looks good, but honestly the high-level language binding is not super important to me. my toolbox happens to be mostly octave/matlab and c++, with a dab of faust. other people like python. at the low-level, there is a small set of amazingly well optimized linear algebra libs (GSL, BLAS/LAPACK) that all the high-level tools use. (well ok, those are for CPUs, GPUs have their own things like cuBLAS that implement the same stuff.)

will write up some more details about potential PCA applications if it’s useful. (i do similar stuff for day job.)

interesting paper BTW. i think as far as finding interesting ways to create “meta parameters” for large parameter sets, it’s not necessary to jump straight to looking at the MFCC spectra of the sounds themselves. to me, maybe a useful first step would be starting with the large amount of preset data made by humans.

though, if you want:

a subsystem on Norns devices that can take audio input and try and find a set of params for an engine that could reproduce that sound

then yea, that broad outline seems about right.

  1. take MFCC of input (basically a spectrogram warped to Mel frequency scale)
  2. pick some starting parameters
  3. make a sound, find MFCC of that
  4. compute a cost function from the difference in spectra
  5. repeat 3-4, using gradient descent to adjust param set given last output of cost function.
  6. eventually you find a local minimum for the cost function and you save those parameter values.

this is very similar to how adaptive FIR filters are built, where the parameters are tap coefficients. (except you don’t need gradient descent there since it’s a solvable linear system.) the trick is typically in specifying the cost function, which can take many forms (in active noise/echo cancellation it’s typically a simple correlation with another signal.)
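for comparison, here’s the adaptive-FIR version of the same idea as a tiny LMS sketch: the parameters are the tap weights, the cost is the squared error against a desired signal, and the filter learns to imitate a known 4-tap system. (the taps and signals are made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(1)
true_taps = np.array([0.5, -0.3, 0.2, 0.1])   # unknown system to identify
n_taps = len(true_taps)
x = rng.normal(size=4000)                      # input signal
d = np.convolve(x, true_taps, mode="full")[:len(x)]  # desired output

w = np.zeros(n_taps)                           # adaptive tap weights
mu = 0.01                                      # step size
for n in range(n_taps, len(x)):
    frame = x[n - n_taps + 1:n + 1][::-1]      # most recent sample first
    y = w @ frame                              # filter output
    e = d[n] - y                               # error vs. desired signal
    w += mu * e * frame                        # LMS update
```

after a few thousand samples `w` sits right on top of `true_taps` — no explicit gradient descent loop needed, because the error surface is a quadratic bowl.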

(to me, this isn’t actually super fun, because sounds change over time and a snapshot of the spectrum doesn’t capture much about what makes a sound interesting. but YMMV. for matching a static timbre it could be fine, but i’d just sample the spectrum and use a phase vocoder in that case.)

(k-means in that article is kindof a red herring - they used it to verify that MFCC is a fine feature set for the application.)

theory / algos / high level:
Duda et al., Pattern Classification, 2nd ed.:

method / implementation / low level:
Numerical Recipes, 3rd ed.

two books i couldn’t live without these days.

for audio in particular: many DSP tomes (especially Proakis/Manolakis) will have at least a chapter about adaptive filtering with linear methods, which is probably worth reviewing as a foundation. for fancier methods you want to focus in on the area of nonlinear system identification with audio, which is itself a big topic.


Thank you for this great contribution to the topic!

I’ll have a proper read through all this tomorrow, but this is quite likely very relevant:


yes very relevant thanks!

i’ve glanced through the linked paper, it’s cool.

in a tiny nutshell, that system does a few things:

    1. takes the state space of synth control params I
    2. for each state i in I, generates a whole bunch of acoustic descriptors d, to build a state-space of descriptors D.
    3. reduces dimensionality of D using PCA, producing D*
    4. uses a neural network to generate an arbitrary mapping from a generic multidimensional control space C to D*.

so now you have a smaller set of meaningful parameters (the dimensions of C) to explore a large part of the whole timbre space of the synth. (though, i think never the whole space.)
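a toy end-to-end sketch of steps 1-3, with a nearest-neighbour lookup standing in for the neural-network mapping of step 4 (and the 2-parameter “synth” and its descriptors invented for the example):

```python
import numpy as np

freqs = np.linspace(0.0, 1.0, 64)

def toy_spectrum(center, width):
    return np.exp(-((freqs - center) ** 2) / (2.0 * width ** 2))

def descriptors(spec):
    """A few (deliberately redundant) acoustic descriptors of one spectrum."""
    p = spec / spec.sum()
    centroid = (freqs * p).sum()
    spread = np.sqrt((((freqs - centroid) ** 2) * p).sum())
    peakiness = spec.max() / spec.mean()
    return np.array([centroid, spread, peakiness, centroid * spread])

# step 1: a grid over the synth's control-parameter space I
I = np.array([(c, w) for c in np.linspace(0.1, 0.9, 20)
                     for w in np.linspace(0.05, 0.3, 20)])
# step 2: the descriptor state-space D
D = np.array([descriptors(toy_spectrum(c, w)) for c, w in I])
# step 3: normalize, then PCA-reduce D to a 2-D D*
Dc = D - D.mean(axis=0)
Dc /= Dc.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Dc, rowvar=False))
components = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
D_star = Dc @ components

# stand-in for step 4: navigate D* and recover synth params by
# nearest-neighbour lookup instead of a learned inverse mapping
def params_for(point):
    return I[np.argmin(np.linalg.norm(D_star - point, axis=1))]
```

with a real synth the lookup table would be huge and lumpy, which is exactly why the paper trains a network to smooth the mapping instead.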

[glossing over lots of stats stuff, most of which i don’t understand yet, to clean up and normalize the feature space before and after step 3.]

the synth is a VST plugin. the acoustic descriptors are from max. the statistical mapping stuff is matlab.

so yeah, you could do some of this stuff with norns. norns can be the synth and it could pretty easily perform the acoustic analysis (generate d) from its own output - the descriptors are the IRCAM feature set, and most of them have equivalent supercollider ugens.

no way could you do the mapping on norns, i think - even working with I and D would be kind of nuts - but that’s ok.

once you generate the mapping you could definitely execute it on norns.

this is really cool stuff, but it still feels limiting to me to only talk about mapping timbre. (which again is awesome and i see the utility for gesture -> tone performance.) i mean none of these features are meaningful over a time scale of more than 20ms or something. synth controls aren’t all about timbre.

that said, there’s something in the TSAM paper about envelopes but i can’t quite grok it without seeing the thing; i think it’s about ways of sweeping through C with an envelope. and it does seem straightforward to extend this, maybe by adding another layer to the mapping, and get the thing to match entire timbral gestures and phrases (once you have this nice cleaned-up D* space.)

it still seems fun to just generate a control mapping into a big set of presets, without even worrying about acoustic features (which can’t capture, say, some crazy LFO/envelope configuration that is more musical than acoustic.) this could be done with just PCA + inverse PCA, i think.
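a sketch of that PCA + inverse-PCA idea — fit PCA over a preset bank, then treat a few principal-component coordinates as “macro knobs” and inverse-transform back to a full parameter set. (the preset bank here is random stand-in data; a real one would be e.g. a DX7 patch library flattened to rows of numbers.)

```python
import numpy as np

rng = np.random.default_rng(2)
n_presets, n_params = 500, 32
# fake preset bank with some correlated structure between params
latent = rng.normal(size=(n_presets, 3))
presets = (latent @ rng.normal(size=(3, n_params))
           + rng.normal(scale=0.1, size=(n_presets, n_params)))

# PCA over the bank
mean = presets.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(presets - mean, rowvar=False))
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:3]]     # three "macro knobs"

def knob_to_patch(knobs):
    """Inverse PCA: macro-knob settings -> full parameter set."""
    return mean + np.asarray(knobs) @ components.T

patch = knob_to_patch([1.5, -0.5, 0.0])
```

turning one knob then moves all 32 parameters together along a direction that humans actually used when making presets, which is a pretty cheap way to get musically plausible variation.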


so forget all the exciting sounding algorithms, forget all the technologies.

The number one issue in ML is having a well-labelled training set - and it’s your problem here too. (Having given some thought to this very topic, for what I imagine are much the same reasons as @lazzarello.)

I think there are two approaches:

scavenge some data - the Native Instruments FM8 preset DB is tagged, and you could probably extract a like-for-like feature set from the presets. No idea if the preset library was written in the golden age of XML (i.e. when everyone and their dog thought XML would save the world and used it whether appropriate or not) or if it is binary/other and you would need to write a parser. The tagging on that set is pretty mundane, though, and would limit you.

or (& this is way more work but would be more exciting)

Write a web app that generates a sound from a random set of parameters and gets people to tag those sounds, either from a predetermined set of tags, their own tags, or some combination. Access to this community might be useful here - someone good at gaming/psychology might also suggest ways to increase engagement.

ONCE YOU HAVE THAT - then you can think about ML

I agree with @zebra - the goal is off-device application of ML. Once you have your data set you can try different models to see what produces the most useful results - these days you mostly use off-the-shelf implementations from others, and you are just interested in the outputs given the characteristics of your learning set.

this will give you a model which you can then apply - @zebra suggests one approach to that. Perhaps more ideally, and given enough processing power, you’d have a model that, given a set of parameters, could generate you a tag - and then you wander/search that space at patch generation time.

Interestingly Mutable Grids claims to have done something like this - I’ve not dug deep enough to see if that’s just ‘marketing’ or actually how it works - it is full of data but that might just be drum patterns


the ML part of grids is that olivier did a lot of work analyzing a whole ton of actual drum-machine-based music and came up with 25, like, “basis patterns” that can be combined and perturbed in interesting ways. all the factor and data analysis was in the design process and not in the actual functioning of the module (which in itself is not complicated at all.) but i think it’s cool and not just hype.

input data:
output process:

but yeah it’s too bad that AFAIK he hasn’t published anything about how he actually assembled the patterns.

and yea totally agree with your points re: importance of training sets, it’s worth emphasizing. which is why i think playing with something like a DX7 patch library would be a fun starting point, because people have put so many thousands of hours into making, collecting, and organizing these things.

[ed. good training sets, not necessarily “tagging” - as pointed out below, unsupervised learning like we’ve been discussing doesn’t require tagging per se, but you still need suitable input that covers the features you’re interested in.]

[i’ll pipe down now]


Ah cool. Wasn’t trying to cast aspersions, just couldn’t tell. Will have a deeper look.

There is some research on FM spectra with Genetic Algorithms. Take a look at this book:

  • Evolutionary Computer Music, E. R. Miranda, J.A. Biles, editors, London: Springer

and especially these chapters:

  • Horner, A. “Evolution in Digital Audio Technology,” pp. 52-78.
  • Dahlstedt, P. “Evolution in Creative Sound Design,” pp. 79-99.
    some of the work was carried out on a Nord Modular…

Also these papers by Andrew Horner:

  • Horner, A. 2003. “Auto-Programmable FM and Wavetable Synthesizers,” Contemporary Music Review, 22(3), 21-29.
  • Horner, A. 1997. “A Comparison of Wavetable and FM Parameter Spaces,” Computer Music Journal, 21(4), 55-85.
  • Horner, A. 1996. “Double Modulator FM Matching of Instrument Tones,” Computer Music Journal, 20(2), 57-71.
  • Horner, A., Beauchamp, J., and Haken, L. 1993. “Machine Tongues XVI: Genetic Algorithms and Their Application to FM Matching Synthesis,” Computer Music Journal, 17(4), 17-29.

Might be something of interest here: Automatic design of sound synthesizers as pure data patches using coevolutionary mixed-typed cartesian genetic programming

Not all ML requires labeled data, so that doesn’t need to be a limitation – it’s called unsupervised learning – and autoencoders are a popular example.
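A minimal example of that: a linear, single-bottleneck autoencoder in NumPy, trained only to reconstruct its own input - no labels anywhere. (The data here is synthetic, invented for the sketch; with linear layers the trained network ends up spanning the same subspace PCA would find.)

```python
import numpy as np

rng = np.random.default_rng(3)
# 500 samples of 6-D data that really only has 2 degrees of freedom
Z = rng.normal(size=(500, 2))
X = Z @ rng.normal(size=(2, 6))

W_enc = rng.normal(scale=0.1, size=(6, 2))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(2, 6))   # decoder weights
lr = 0.02
for _ in range(8000):
    H = X @ W_enc                # the 2-D code (bottleneck)
    X_hat = H @ W_dec            # reconstruction of the input
    err = X_hat - X
    # gradients of mean squared reconstruction error
    grad_dec = H.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_mse = np.mean((X @ W_enc @ W_dec - X) ** 2)
```

The reconstruction error goes to (nearly) zero because the 6-D data genuinely lives on a 2-D subspace - the network discovered that structure without any tags. Nonlinear autoencoders do the same trick on curved manifolds.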

Wavelet scattering is a very new and interesting technique that captures information at longer time scales (e.g., several seconds), which would seem to be important for some tasks.


Yeah. I struggled to imagine how you might use unsupervised training here, since I presume the goal would be to give some sort of semantic interface to parameter generation - i.e. make me something “percussive” or “zingy”, or even just have a “percussive” or “zing” knob you could turn up or down. However, that might just be my lack of imagination.


Based on the OP’s problem description (i.e., find a set of synthesizer parameters to reproduce an input spectrum), unsupervised learning is possible. I only skimmed the paper they linked to, but it seems to be using unsupervised learning too (k-means).


I guess a ‘make a sound like this’ example

Hmm interesting. That might be quite cool
