i would step back from sexy DNN hype stuff. tensorflow is great if you can do heavy crunching server side and get the results through a remote API. more specialized DNN applications can do training on a phat server and classification on a mobile device (esp. mobile GPUs or running Metal on apple’s dedicated compute modules.) but i’m not convinced this is the right approach for a self-contained creative hacking situation - it’s a fine approach if you’ve already established that DNN is a good fit for the problem, you know how to train it, and you have the engineering budget to make it work on a more constrained platform.
if you’re new to ML techniques, i’d look at PCA first. it’s very simple to understand and implement.
lua is a fine high-level language and is used in applied math and ML stuff (e.g. luaTorch.) for low-level BLAS bits, look to the gsl-lua package that wraps the Gnu Scientific Library and gives you all the fast matrix math.
i haven’t actually used julia. it looks good, but honestly the high-level language binding is not super important to me. my toolbox happens to be mostly octave/matlab and c++, with a dab of faust. other people like python. at the low-level, there is a small set of amazingly well optimized linear algebra libs (GSL, BLAS/LAPACK) that all the high-level tools use. (well ok, those are for CPUs, GPUs have their own things like cuBLAS that implement the same stuff.)
will write up some more details about potential PCA applications if it’s useful. (i do similar stuff for day job.)
interesting paper BTW. i think as far as finding interesting ways to create “meta parameters” for large parameter sets, it’s not necessary to jump straight to looking at the MFCC spectra of the sounds themselves. to me maybe useful first step would be starting with the large amount of preset data made by humans.
though, if you want:
a subsystem on Norns devices that can take audio input and try and find a set of params for an engine that could reproduce that sound
then yea, that broad outline seems about right.
- take MFCC of input (basically a spectrogram warped to Mel frequency scale)
- pick some starting parameters
- make a sound, find MFCC of that
- compute a cost function from the difference in spectra
- repeat 3-4, using gradient descent to adjust param set given last output of cost function.
- eventually you find a local minimum for the cost function and you save those parameter values.
this is very similar to how adaptive FIR filters are built, where the parameters are tap coefficients. (except you don’t need gradient descent there since it’s a solvable linear system.) the trick is typically in specifying the cost function, which can take many forms (in active noise/echo cancellation it’s typically a simple correlation with another signal.)
(to me, this isn’t actually super fun, because sounds change over time and a snapshot of the spectrum doesn’t capture much about what makes a sound interesting. but YMMV. for matching a static timbre it could be fine, but i’d just sample the spectrum and use a phase vocoder in that case.)
(k-means in that article is kindof a red herring - they used it to verify that MFCC is a fine feature set for the application.)
theory / algos / high level:
Duad et al, Pattern Classification 2nd ed:
method / implementation / low level:
Numerical Recipes, 3rd ed.
http://numerical.recipes/
two books i couldn’t live without these days.
for audio in particular: many DSP tomes (especially Proakis/Manolakis) will have at least a chapter about adaptive filtering with linear methods, which is probably worth reviewing as a foundation. for fancier methods you want to start focus in on the area of nonlinear systems identificaton with audio, which itself is a big topic.