@gretchen and I had some conversations about FM synthesis and something she brought up was the term Gradient Descent as a methodology to discover a set of parameters for a synthesizer to reproduce the spectrum of some kind of input.
It describes “A general heuristic to match synthesis parameters of a fixed sound engine to an arbitrary sound target. After the generation of training data from the synthesis engine, PCA is performed to extract relevant metrics and a combination of k-means and gradient descent is used to get an estimation.”
The paper is short and the example was not trained on the spectrum from a real acoustic sound recording. It references a few toolkits from the Matlab empire. A friend and colleague uses and recommends the Julia language for machine learning and AI, so I did a few searches and found MusicProcessing.jl and Julia Audio. They are new projects. MusicProcessing has 8 commits and 1 contributor. If anyone has an AES membership, there’s a white paper published last year on using Julia for audio applications.
So it looks like there are some basic building blocks to do ML with free software on Linux…which could include Norns! Taking a giant leap, I could imagine a subsystem on Norns devices that can take audio input and try and find a set of params for an engine that could reproduce that sound. I’ve been told that a primary feature of the Julia language is performance optimizations for small/embedded hardware.
Anyone have deeper references about work in this space?
I’ve heard mixed reviews about how useful TensorFlow projects are for practical applications. I think the NSynth device itself is only a microcontroller with an X/Y touch surface; it can’t operate without an attached computer with CUDA cores and TensorFlow.
i would step back from sexy DNN hype stuff. tensorflow is great if you can do heavy crunching server side and get the results through a remote API. more specialized DNN applications can do training on a phat server and classification on a mobile device (esp. mobile GPUs or running Metal on apple’s dedicated compute modules.) but i’m not convinced this is the right approach for a self-contained creative hacking situation - it’s a fine approach if you’ve already established that DNN is a good fit for the problem, you know how to train it, and you have the engineering budget to make it work on a more constrained platform.
if you’re new to ML techniques, i’d look at PCA first. it’s very simple to understand and implement.
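to show how little machinery it needs, here’s a minimal sketch in pure python - toy 2-D data i made up, and power iteration for just the first component (a real implementation would call an eigensolver from LAPACK/GSL instead):

```python
# minimal PCA sketch: find the first principal component of a tiny
# 2-D dataset via power iteration. toy data, illustration only.

def mean_center(data):
    n = len(data)
    means = [sum(row[j] for row in data) / n for j in range(len(data[0]))]
    return [[x - m for x, m in zip(row, means)] for row in data]

def covariance(data):
    centered = mean_center(data)
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def first_component(cov, iters=100):
    # power iteration: repeatedly multiplying by the covariance matrix
    # converges to its dominant eigenvector (direction of max variance)
    v = [1.0] * len(cov)
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(len(v)))
             for i in range(len(v))]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# strongly correlated toy data: y roughly tracks x
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
pc1 = first_component(covariance(points))
# pc1 ends up pointing along the diagonal, near (0.707, 0.707)
```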
lua is a fine high-level language and is used in applied math and ML stuff (e.g. Torch.) for low-level BLAS bits, look to the gsl-lua package, which wraps the GNU Scientific Library and gives you all the fast matrix math.
i haven’t actually used julia. it looks good, but honestly the high-level language binding is not super important to me. my toolbox happens to be mostly octave/matlab and c++, with a dab of faust. other people like python. at the low-level, there is a small set of amazingly well optimized linear algebra libs (GSL, BLAS/LAPACK) that all the high-level tools use. (well ok, those are for CPUs, GPUs have their own things like cuBLAS that implement the same stuff.)
will write up some more details about potential PCA applications if it’s useful. (i do similar stuff for day job.)
interesting paper BTW. i think as far as finding interesting ways to create “meta parameters” for large parameter sets, it’s not necessary to jump straight to looking at the MFCC spectra of the sounds themselves. to me, maybe a useful first step would be starting with the large amount of preset data made by humans.
though, if you want:
a subsystem on Norns devices that can take audio input and try and find a set of params for an engine that could reproduce that sound
then yea, that broad outline seems about right.
1. take the MFCC of the input (basically a spectrogram warped to the Mel frequency scale)
2. pick some starting parameters
3. make a sound, find the MFCC of that
4. compute a cost function from the difference in spectra
5. repeat 3-4, using gradient descent to adjust the param set given the last output of the cost function

eventually you find a local minimum of the cost function and you save those parameter values.
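that loop can be sketched in a few lines of pure python - note the “synth” here is a made-up toy function standing in for engine + MFCC extraction, so only the finite-difference gradient-descent skeleton is the actual technique:

```python
def synth_spectrum(params):
    # stand-in for "make a sound and take its MFCC": just a fixed
    # nonlinear function of two parameters (invented for this sketch)
    a, b = params
    return [a * a, a * b, b]

def cost(params, target):
    # squared distance between the two "spectra"
    return sum((s - t) ** 2 for s, t in zip(synth_spectrum(params), target))

def numeric_gradient(params, target, eps=1e-5):
    # finite differences, so no analytic gradient of the engine is needed
    base = cost(params, target)
    grad = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grad.append((cost(bumped, target) - base) / eps)
    return grad

def fit(target, start, rate=0.05, steps=2000):
    params = list(start)
    for _ in range(steps):
        # repeat: render, measure, step downhill on the cost surface
        g = numeric_gradient(params, target)
        params = [p - rate * gi for p, gi in zip(params, g)]
    return params

target = synth_spectrum([1.5, 0.5])   # pretend this came from the audio input
found = fit(target, start=[1.0, 1.0])
# found converges to roughly [1.5, 0.5]
```

(with a real engine the cost surface is much lumpier, so the starting parameters matter a lot - you only get a local minimum.)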
this is very similar to how adaptive FIR filters are built, where the parameters are tap coefficients. (except you don’t need gradient descent there since it’s a solvable linear system.) the trick is typically in specifying the cost function, which can take many forms (in active noise/echo cancellation it’s typically a simple correlation with another signal.)
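to make the FIR case concrete: it really is just a linear solve. toy sketch in pure python, with an invented excitation signal and invented “unknown” taps:

```python
# identify the taps of an unknown 2-tap FIR filter by solving the
# normal equations directly (the "solvable linear system"), instead of
# iterating with gradient descent. all numbers invented for illustration.

def fir(x, taps):
    # y[n] = taps[0]*x[n] + taps[1]*x[n-1]
    return [taps[0] * x[n] + taps[1] * (x[n - 1] if n else 0.0)
            for n in range(len(x))]

x = [1.0, -0.5, 2.0, 0.3, -1.2, 0.8, 1.5, -0.7]   # excitation signal
d = fir(x, [0.6, -0.3])                           # "unknown" system output

# build the 2x2 normal equations  R h = p  from correlations of x and d
x1 = [0.0] + x[:-1]                               # x delayed by one sample
R = [[sum(a * b for a, b in zip(x, x)),  sum(a * b for a, b in zip(x, x1))],
     [sum(a * b for a, b in zip(x1, x)), sum(a * b for a, b in zip(x1, x1))]]
p = [sum(a * b for a, b in zip(x, d)), sum(a * b for a, b in zip(x1, d))]

# solve the 2x2 system by Cramer's rule
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
h = [(p[0] * R[1][1] - p[1] * R[0][1]) / det,
     (R[0][0] * p[1] - R[1][0] * p[0]) / det]
# h recovers the true taps [0.6, -0.3] exactly (up to float error)
```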
(to me, this isn’t actually super fun, because sounds change over time and a snapshot of the spectrum doesn’t capture much about what makes a sound interesting. but YMMV. for matching a static timbre it could be fine, but i’d just sample the spectrum and use a phase vocoder in that case.)
(k-means in that article is kind of a red herring - they used it to verify that MFCC is a fine feature set for the application.)
theory / algos / high level:
Duda et al., Pattern Classification, 2nd ed.
for audio in particular: many DSP tomes (especially Proakis/Manolakis) will have at least a chapter about adaptive filtering with linear methods, which is probably worth reviewing as a foundation. for fancier methods you want to focus in on the area of nonlinear system identification with audio, which is itself a big topic.
in a tiny nutshell, that system does a few things:
1. takes the state space of synth control params, I
2. for each state i in I, generates a whole bunch of acoustic descriptors d, to build a state space of descriptors D
3. reduces the dimensionality of D using PCA, producing D*
4. uses a neural network to generate an arbitrary mapping from a generic multidimensional control space C to D*
so now you have a smaller set of meaningful parameters (the dimensions of C) to explore a large part of the whole timbre space of the synth. (though, i think never the whole space.)
[glossing over lots of stats stuff, most of which i don’t understand yet, to clean up and normalize the feature space before and after step 3.]
the synth is a VST plugin. the acoustic descriptors are from max. the statistical mapping stuff is matlab.
so yeah, you could do some of this stuff with norns. norns can be the synth and it could pretty easily perform the acoustic analysis (generate d) from its own output - the descriptors are the IRCAM feature set, and most of them have equivalent supercollider ugens.
no way could you do the mapping on norns, i think - even working with I and D would be kind of nuts - but that’s ok.
once you generate the mapping you could definitely execute it on norns.
this is really cool stuff, but it still feels limiting to me to only talk about mapping timbre. (which again is awesome and i see the utility for gesture -> tone performance.) i mean none of these features are meaningful over a time scale of more than 20ms or something. synth controls aren’t all about timbre.
that said, there’s something in the TSAM paper about envelopes but i can’t quite grok it without seeing the thing; i think it’s about ways of sweeping through C with an envelope. and it does seem straightforward to extend this, maybe by adding another layer to the mapping, and get the thing to match entire timbral gestures and phrases (once you have this nice cleaned-up D* space.)
it still seems fun to just generate a control mapping into a big set of presets, without even worrying about acoustic features (which can’t capture, say, some crazy LFO/envelope configuration that is more musical than acoustic.) this could be done with just PCA + inverse PCA, i think.
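a toy sketch of that PCA + inverse-PCA idea in pure python, with an invented 3-parameter preset library: project the presets onto the first principal component (the “macro knob”), then run knob positions back through the inverse map to get full parameter sets.

```python
# compress invented presets to one macro dimension, then map knob
# positions back to full parameter sets. illustration only.

presets = [
    [0.1, 0.9, 0.2],
    [0.2, 0.8, 0.3],
    [0.8, 0.2, 0.7],
    [0.9, 0.1, 0.8],
]

d = len(presets[0])
mean = [sum(p[j] for p in presets) / len(presets) for j in range(d)]
centered = [[x - m for x, m in zip(p, mean)] for p in presets]

# first principal component via power iteration on the covariance
cov = [[sum(r[i] * r[j] for r in centered) / (len(centered) - 1)
        for j in range(d)] for i in range(d)]
v = [1.0, 0.0, 0.0]
for _ in range(200):
    w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    norm = sum(x * x for x in w) ** 0.5
    v = [x / norm for x in w]

# forward map: each preset's position on the macro knob
knob = [sum(c[j] * v[j] for j in range(d)) for c in centered]

# inverse map: a knob position t back to a full parameter set
def to_params(t):
    return [m + t * vj for m, vj in zip(mean, v)]

# sweeping t from min(knob) to max(knob) interpolates through the
# preset family along its main axis of variation
```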
so forget all the exciting sounding algorithms, forget all the technologies.
The number one issue in ML is having a well-labelled training set - and it’s your problem here too. (I’ve given some thought to this very topic, for what I imagine are much the same reasons as @lazzarello.)
I think there are two approaches:
scavenge some data - Native Instruments’ FM8 preset db is tagged, and you could probably extract a like-for-like feature set from the presets. No idea if the preset library was written in the golden age of XML (i.e. when everyone and their dog thought XML would save the world and used it whether appropriate or not) or if it is binary/other and you would need to write a parser. The tagging on that set is pretty mundane, though, and would limit you
or (& this is way more work but would be more exciting)
Write a web app that generates a sound from a random set of parameters and gets people to tag those sounds, either from a predetermined set of tags, their own tags, or some combination. Access to this community might be useful here - someone good at gaming/psychology might also suggest ways to increase engagement
ONCE YOU HAVE THAT - then you can think about ML
I agree with @zebra - the goal is off-device application of ML. Once you have your data set you can try different models to see what produces the most useful results - these days you mostly use off-the-shelf implementations from others, and you are just interested in the outputs given the characteristics of your learning set
this will give you a model which you can then apply - @zebra suggests one approach to that. Perhaps more ideally, given enough processing power, you’d have a model that, given a set of parameters, could generate a tag - and then you wander/search that space at patch generation time
Interestingly, Mutable Instruments’ Grids claims to have done something like this - I’ve not dug deep enough to see if that’s just ‘marketing’ or actually how it works - it is full of data, but that might just be drum patterns
the ML part of grids is that olivier did a lot of work analyzing a whole ton of actual drum-machine-based music and came up with 25, like, “basis patterns” that can be combined and perturbed in interesting ways. all the factor and data analysis was in the design process and not in the actual functioning of the module (which in itself is not complicated at all.) but i think it’s cool and not just hype.
but yeah it’s too bad that AFAIK he hasn’t published anything about how he actually assembled the patterns.
and yea totally agree with your points re: importance of training sets, it’s worth emphasizing. which is why i think playing with something like a DX7 patch library would be a fun starting point, because people have put so many thousands of hours into making, collecting, and organizing these things.
[ed. good training sets, not necessarily “tagging” - as pointed out below, unsupervised learning like we’ve been discussing doesn’t require tagging per se, but you still need suitable input that covers the features you’re interested in.]
Yeah. I struggled to imagine how you might use unsupervised training here, since I presume that the goal would be to give some sort of semantic interface to parameter generation - i.e. make me something “percussive” or “zingy”, or even just have a “percussive” or “zing” knob you could turn up or down. However, that might just be my lack of imagination
Based on the OP’s problem description (i.e., find a set of synthesizer parameters to reproduce an input spectrum), unsupervised learning is possible. I only skimmed the paper they linked to, but it seems to be using unsupervised learning too (k-means).
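For concreteness, here is a minimal k-means sketch in pure Python - 1-D invented data, whereas real features would be multidimensional MFCC vectors, but the two alternating steps are the same:

```python
# minimal k-means: alternate between assigning points to the nearest
# center and moving each center to the mean of its cluster.
# 1-D toy data, illustration only.

def kmeans_1d(points, centers, iters=50):
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# three obvious clumps around 0.15, 5.0, and 10.0
data = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9, 9.8, 10.1, 10.0]
centers = sorted(kmeans_1d(data, centers=[0.0, 5.0, 10.0]))
# centers converge to roughly [0.15, 5.03, 9.97]
```

No labels are required anywhere - which is what makes it unsupervised.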