Norns Neural Sound Engine

The NSynth Super is pretty interesting: Making music with NSynth Super - YouTube

I have been wanting to do some SuperCollider and Norns development since I first received my Norns, so I think this might be a great project for me to work on.

I would like this to be something that others might want to use too. So before I start: can anyone interested tell me what they would like to see in a Norns/SuperCollider neural network sound engine?


i assume you’re already familiar with the architecture for NSynth (gah, for all i know you’re a Googler), but as a short recap (like, for others):

  • the “NSynth Super” is basically a “dumb client” that does multisampled playback. it is built on openFrameworks. you could certainly build an analogous instrument in SuperCollider with ease. (oF is probably not a great fit alongside the existing norns stack - it’s quite heavy if all you wanna do is a bit of sample playback.)

  • the “special sauce” in NSynth is not in the client, it’s in the audio creation pipeline that is implemented in TensorFlow and run on a phat GPU server. basically you give it some number of training sounds and it spits out a big matrix of interpolated sounds.

personally, i find it a crazy use of resources [*], an unimaginative application of ANNs [**], and not particularly interesting musically. (sorry to be the naysayer.) but if people are into it then i think it would be pretty straightforward to adapt the instrument interface glue on NSynth (GPIO, touchscreen, patch format) to OSC, and re-implement the fairly trivial sample playback engine in SC (it’s an ADSR volume envelope and a scanning wavetable oscillator with a hz control, and that’s about it.) oop, i take it back. it’s not just single-cycle, but xfaded and looped sample playback. similar deal. here’s the source for the synth guts.
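to make the “xfaded and looped” part concrete, here’s a rough sketch in Python/NumPy of the core playback idea (the loop points and fade length are invented for illustration; the linked source is the real thing):

```python
import numpy as np

def xfade_loop(sample, loop_start, loop_end, xfade, n_out):
    """Play the attack once, then repeat the loop region, with a
    linear crossfade baked into the loop's tail so the jump back
    to loop_start is seamless."""
    loop = sample[loop_start:loop_end].astype(float)  # astype copies
    ramp = np.linspace(0.0, 1.0, xfade, endpoint=False)
    pre = sample[loop_start - xfade:loop_start].astype(float)
    # fade the tail of the loop out while fading in the audio
    # that originally preceded loop_start
    loop[-xfade:] = loop[-xfade:] * (1.0 - ramp) + pre * ramp
    out = np.empty(n_out)
    n_att = min(loop_start, n_out)
    out[:n_att] = sample[:n_att]
    idx = np.arange(n_out - n_att) % len(loop)
    out[n_att:] = loop[idx]
    return out
```

on top of this you’d multiply by the ADSR envelope and resample for pitch, which is basically the whole engine.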

(i guess you would also want to make some kinda client for the audio pipeline, i dunno)

what i would like to see is maybe some interesting applications of all the stuff that SC already has to do arbitrary signal and sequence generation, including some with ANNs and other reinforcement-learning structures. e.g., brian heim made a simple client-side ANN class and there is a server-side demand-rate ANN UGen as well. meanwhile nick collins has been making SC interfaces to ML audio stuff for decades.

i have not played with the two ANN classes above, but i’ve certainly used SC to make music with markov chains and weird dynamical systems, it’s easy and fun.
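for anyone who hasn’t tried it, a first-order markov chain for sequence generation really is only a few lines. a sketch in Python (the transition table over scale degrees is made up):

```python
import random

# hypothetical first-order transition table over scale degrees;
# weights are implicit in repeated entries
TRANSITIONS = {
    0: [0, 2, 4],
    2: [0, 4, 7],
    4: [2, 5, 7],
    5: [4, 7],
    7: [0, 4, 5],
}

def markov_melody(start, length, seed=0):
    """Random-walk the transition table to produce a note list."""
    rng = random.Random(seed)
    note, seq = start, [start]
    for _ in range(length - 1):
        note = rng.choice(TRANSITIONS[note])
        seq.append(note)
    return seq
```

the fun, as noted, is all in how you build the table (training) and what you map the output to.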

of course it’s all about what you do with it - training and output mapping. and these are primitive and small ANNs, not like the many layers of embedding that Magenta uses.

[*] e.g., for the demo video it took 8 NVidia K80 GPUs working for 36 hours to produce the table of audio files between which the playback interface interpolates. that’s a lot of energy investment for something that sounds and plays kinda like a Korg Wavestation. 🙂

[**] why do i say unimaginative? well, cause it’s just playing back short looped samples. (not single-cycle but still.) there is already a tremendous palette of synthesis techniques for generating interesting timbres, in sample form or algorithmically. short-loop playback isn’t by itself interesting, surprising or expressive regardless of what’s in the loop (IMHO, obviously), music happens elsewhere…

er, on a more positive note, i would like to see a really nice multisampled scanning wavetable synth done in SC. i don’t think it would necessarily require making a plugin…

by this, i mean that it would interpolate in two dimensions, one of them being pitch. you could then convert single-cycle waveforms (captured or generated) into arrays of successively brickwall-lowpassed waveforms for antialiased playback. (i believe the Waldorf wave has a feature that can do this on the fly with captured audio.) i’d be happy to contribute the brickwalling part. with that building block, you could start doing some pretty interesting stuff with fewer resources that doesn’t “sound digital” in the way that non-antialiased WT scanning does.
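to illustrate the brickwalling part (just a sketch, not a real implementation): build one bandlimited copy of a single-cycle wave per octave by zeroing FFT bins above a per-octave cutoff, e.g. in Python/NumPy:

```python
import numpy as np

def bandlimited_tables(cycle, octaves=8):
    """One progressively brickwall-lowpassed copy of a single-cycle
    waveform per octave: table k keeps only the harmonics that stay
    below Nyquist when the fundamental is shifted up k octaves."""
    n = len(cycle)
    spectrum = np.fft.rfft(cycle)
    tables = []
    for k in range(octaves):
        max_harm = max(1, (n // 2) >> k)  # highest harmonic to keep
        s = spectrum.copy()
        s[max_harm + 1:] = 0.0            # brickwall above it
        tables.append(np.fft.irfft(s, n))
    return tables
```

the player then just picks (or crossfades between) the table matching the current playback octave.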


First of all. Ah! Hi! I’m a fan!

Thanks for the recap, that really helps with the dialog; I should have included something like that in my post.

Yeah the training time was slow for the demo in the video. They mitigate that with a “WaveNet checkpoint” you can use so that any new training doesn’t have to start from scratch.

I’m hoping to get my hands on anything from an RTX 2070 to an RTX 2080 Super or Ti, that is, if I can find something at or below MSRP when the 2080 Super comes out. The mixed-precision tensor cores in those models are really quick (89 TFLOPS for the 2080 Super), and the regular CUDA speeds are pretty quick too. I do want to stray from Google’s designs eventually, which will require lots of training power, but for now I might be able to get away with using their model.

As for the unimaginative point, yeah, I see what you are saying. I will have to start by playing more with what Google has already put together, which is naturally not very imaginative. I think it might be possible to get away from just playing short looped samples: I would like to have a network generate the audio signal based on note and control events/values. I’m not sure what size of trained network it would take to pull that off, though. If the norns can’t handle it, it sure would be fun to use a Coral.

Thanks for the really well-thought-out response! Starting with the wavetable synth is a great idea. That way, if my neural dreams don’t pan out, I might still have something nice to show for it.


(at risk of going on and on,)

thanks for the response and sorry to be so critical. i guess i just find the hype around DNNs to be very real and instinctively want to push back on it! (in my workplace we use some DNN models, but i also often find myself advocating for exponentially simpler models that are backed by analytical understanding of the system in question.)

anyways, i took a walk around the neighborhood and thought about it some more. i think if you are gonna produce nsynth data, there’s absolutely no reason not to make a norns client to play it. the core function really is quite simple and is all encapsulated in that MultiSampler class: it has to parse a packed binary format of audio samples and do something with it - in that case it’s a simple xfaded looper, but you wouldn’t have to stop there.

i see major advantages to implementing an nsynth client in supercollider. namely, the rich array of musically-oriented DSP blocks would let you build up a more fully-featured musical sampler, with all the usual good stuff that nsynth-super lacks: filters, saturation, pitch modulation, LFOs, &c. (viz., the excellent Timber; in fact maybe all that is called for is a tool to convert the audio matrix format, and an extension to Timber to xfade samples.)

and of course the fun part in norns is allowing people to manipulate/sequence(/&c) those parameters in whatever weird ways.

i’m also wondering about 2 things (not that i actually need answers personally, but i think they’d be good questions to ask):

  • how big are these audio matrices, typically? norns has both limited disk, and limited RAM. i’d expect you’d want to get the whole matrix in RAM if possible. CPU isn’t really the main concern though it does limit polyphony.

  • what about single-shot samples? one of the more interesting applications that comes to mind would be percussive sounds, which are less likely to be susceptible to harmonic analysis, freq-domain approximation, or other “traditional” methods of generating audio material by interpolating arbitrary sources. but my (relatively uninformed) impression is that the nsynth training heuristics place a big emphasis on picking good loop points; dunno if there’s an option to bypass that consideration.
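for a back-of-envelope answer to the first question (matrix size), assuming the NSynth dataset format (16 kHz, 16-bit, 4-second notes) and a purely hypothetical grid of pitches times interpolation positions:

```python
def matrix_bytes(notes, seconds=4.0, sr=16000, width=2):
    """Rough size in bytes of a matrix of pre-rendered notes,
    assuming 16 kHz / 16-bit / 4-second notes (NSynth dataset
    format; the real patch layout may differ)."""
    return int(notes * seconds * sr * width)

# hypothetical patch: 15 pitches x an 11x11 interpolation grid
print(matrix_bytes(15 * 11 * 11) / 1e6, "MB")  # 232.32 MB
```

so even a modest patch could run to hundreds of MB uncompressed, which makes “whole matrix in RAM” on norns look tight.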

anyway, GLHF as i believe those gamer kids say


Looking at this paper (different from Magenta), it says the “receptive field” in their network is 300 milliseconds; at CD quality that’s 13,230 elements per layer. At 16-bit, that’s 26.46 kB for the input layer.
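To double-check that arithmetic in Python (assuming 44.1 kHz and 16-bit samples):

```python
sr = 44_100              # CD-quality sample rate (Hz)
field_s = 0.3            # 300 ms receptive field
n = round(sr * field_s)  # elements covered by the field
kb = n * 2 / 1000        # 16-bit samples -> 2 bytes each
print(n, "elements,", kb, "kB")  # 13230 elements, 26.46 kB
```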

I’m currently playing around with different machine learning architectures for sound synthesis as part of an undergraduate research project. I’ve been mostly toying with RBMs and CRBMs (which are not quite as great as I wish they were, at least when treating data sample by sample). Interestingly, CRBMs behave very similarly to IIR filters.

Recently I stumbled upon Magenta’s GANSynth, which outperforms NSynth by orders of magnitude in terms of render times, and which also gets rid of some ugly FFT phase glitches. It sounds much better to my ears than NSynth. It’s a generative adversarial convolutional net that works on modified FFTs of 4-second audio fragments. They claim to generate 4 seconds of audio in around 20 milliseconds, which implies real-time synthesis is possible (still, you would render each note at the moment it’s registered and wouldn’t be able to alter it once it’s started). I’ve got it to run at around 4 seconds of output per 2 seconds of compute with their Python implementation on a maxed-out MacBook Pro, so it seems it’s not quite there yet. I doubt this would run on Norns, but it’s interesting nonetheless!