Neural Network Module Idea

I was watching YouTube recommended videos last night and stumbled into a rabbit hole of generative song composition implemented with neural networks.

After watching this, I had the idea that such an implementation would be an interesting concept for a synthesizer module! Imagine if you could feed it a sample and, through analysis, it could output its attempt at continuing the sample. I have no clue whether this is possible with current technology. OpenAI Jukebox’s website suggests 1 minute of music takes 9 hours to render, though I have seen other neural network projects, like Dadabots, that are able to stream in real time.

I don’t have anywhere near the technical capacity to make something like this, but I thought discussing the idea might spark some interest in the concept! Again, I’m a luddite in this field, but I wonder if FPGAs or FPAAs could help realize this idea. I also recently worked with a specialized AI platform developed by Google through my job, although its implementation was a bit lacking in my opinion.

Here’s a messy sketch of what I envision a eurorack module with this concept could look like:


So I don’t have a lot of knowledge (or even just a little, to be honest :stuck_out_tongue:) about the current state of the art in AI, but what might yield nice results in the context of “feed the module some set of samples and generate new ones based on that” would be some kind of module based on a Markov chain:

I think it should be possible to implement this even on something like Ornament and Crime, and it shouldn’t take a lot of time to do so.
Of course this gets a lot trickier when you want to work on audio-rate samples as in your examples, so I guess the Markov chain would be better suited to analyzing and outputting pitch CV or something similar.
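To make the idea concrete, here is a toy first-order Markov chain in Python: it learns note-to-note transition counts from a fed-in sequence and samples a continuation. This is purely an illustrative sketch, not anyone’s actual module code, and the note names are just stand-ins for quantized pitch CV values.

```python
import random
from collections import defaultdict

def train(notes):
    """Count observed successors for each note in the training sequence."""
    table = defaultdict(list)
    for a, b in zip(notes, notes[1:]):
        table[a].append(b)
    return table

def generate(table, start, length, seed=0):
    """Walk the chain, picking a random observed successor at each step."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = table.get(out[-1]) or [start]  # dead end: restart
        out.append(rng.choice(successors))
    return out

melody = ["C", "E", "G", "E", "C", "G", "C", "E"]  # the "fed" sample
continuation = generate(train(melody), "C", 8)
```

A second-order chain (keying on pairs of notes instead of single notes) usually sounds noticeably less random, at the cost of a bigger transition table.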
And just from my small experience: training a neural network model can take a lot of time. I once naively tried to teach a neural network meant for text to generate audio samples (this one: ), feeding it some vaporwave. It trained for a few days on CPU, if I recall correctly, and the end result sounded like harsh noise with a devil speaking obscenities very faintly in the background :smiley: What I am trying to say is that, as far as I am aware, most neural network training is done on GPUs, so such a module would probably have to use already-trained models.

Real-time audio generation from an ML model currently requires more GPU horsepower than is typically found in gaming computers, let alone eurorack modules. Most of this stuff happens in the cloud so you can use clusters of GPUs.


Are you sure that’s true? I’ve worked on neural models, though not sequence models. Certainly the training is ridiculously expensive, but I know inference can be performed fairly quickly these days on some models if model compression techniques are used.

TiNRS Tuesday has an HMM.

MI Marbles has a Markov Chain mode (not enabled on the UI - you have to mod the code, but the code is there).


I mentioned I worked with a Google product, the Coral TPU, a specialized processor designed for TensorFlow that runs on a Raspberry Pi. We used it specifically for facial recognition and voice recognition. If the technology isn’t there yet for this kind of generative synthesis, it surely will be in the near future.

Maybe a cheat, but if you had access to a cloud-hosted neural network for this application, a module with an ESP32 sending audio data back and forth could be a hack to achieve this as well.


Audio is a lot of detailed data. It would be far easier to achieve with a sparse data set such as CV or MIDI.

Here I quickly did a second mock-up; it might require a pretty deep eurorack case:

I hadn’t heard of this module, but it looks fascinating! It makes me wonder whether, now with dedicated TPUs, this concept applied to MIDI input could be possible.

if I recall correctly on CPU and the end result sounded like harsh noise with devil speaking obscenities very faintly in the background :smiley: what I am trying to say is that as far as I am aware most of neural networks training is done on GPU so such module would have to probably use already trained models.

@karol, this is exactly right, and I believe it’s how OpenAI Jukebox works: it uses pre-trained models for best-estimate results.


Yeah, I definitely agree; I just wasn’t sure how fast these algorithms have gotten. I know there are some Ableton plugins for generating MIDI, but I don’t think there’s any real-time processing yet.

In the Magenta plugins for Live, “Continue” is the one closest to @dianus’s idea, but it’s for MIDI rather than audio. It’s not “real-time” in the sense of continuous generation (you have to click a button) but it’s quick.

This web tool does generate MIDI accompaniment in real-time:

I’m working on something close to your concept, but different in application/goal, using a Raspberry Pi + ADC/DAC for templating. Style transfer is a lighter-weight computation and training process than what’s required for time-dependent inference/extrapolation. My goal is to style-transfer incoming audio into pre-trained embeddings as a first draft (I think Prince would make a nice starting embedding). This also means training can occur upstream of the actual audio processing, and the audio processing itself will involve a small number of matrix multiplications, where the latency scales as a function of the sample length.

In my mind’s eye, this kind of module would look like an effects module where you could switch through these pretrained embeddings, filtering your audio signal through, say, Prince, Morton Feldman, or GAS.


Yes! I bet this would work quite well! Exciting.

Very interesting, and it reminds me of a conference where we saw slime moulds doing a similar thing live with a piano.


It always depends on the size of the network and the relationship between network size and problem complexity: if you keep your networks small, training could be cheap. Imagine you could set the number of training iterations/minibatch size as a parameter. Imagine this as part of the performance - you’re playing, you let the network train for some number of bars, and then once it’s trained you switch to sampling the latent space, or doing inference, or what have you.
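As a toy illustration of that performance idea (every name and number here is invented for the sketch), here is a tiny model fit by minibatch gradient descent, where the iteration count and minibatch size are the imagined “panel controls”:

```python
import numpy as np

def live_train(x, y, n_iters=500, batch_size=16, lr=0.05, seed=0):
    """Fit a tiny linear model by minibatch gradient descent.
    n_iters and batch_size stand in for the imagined panel controls:
    more bars of training -> more iterations."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(x.shape[1]) * 0.1
    for _ in range(n_iters):
        idx = rng.integers(0, len(x), batch_size)        # draw a minibatch
        xb, yb = x[idx], y[idx]
        w -= lr * 2 * xb.T @ (xb @ w - yb) / batch_size  # MSE gradient step
    return w

# "Feed" it a known relationship and let it train for a while
rng = np.random.default_rng(1)
x = rng.standard_normal((256, 4))
true_w = np.array([1.0, -0.5, 0.25, 2.0])
w = live_train(x, x @ true_w)
```

With a model this small, a few hundred iterations finish in well under a second on a CPU, which is the regime where “train during the performance” starts to sound plausible.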

I am super interested in how all of these parameters relate to sonic/musical qualities - I don’t want digital harshness, personally. Could we imagine writing aesthetically pleasing activation functions, or applying certain kinds of smoothness constraints?

Dadabots doesn’t generate anything on the fly. If you look at their About page on YouTube, OpenAI generates the tracks, and they’re curated, mastered, and uploaded by Dadabots; the streams are essentially just selections.
There are some libraries for Pure Data that interface with neural networks and AI datasets, though I have no idea how these are supposed to work.
I feel like, for now, you’re basically going to be dealing with a glorified wavetable module: maybe you input a sample into OpenAI, it generates wavetables or a dataset based on your input sample, and then you download that and load it into a module. Although you could potentially have a module that does this for you - you just record a sample into it and let it cook for 9 hours (actually it could be a lot less with shorter samples and less need for quality).

The Google NSynth(?) DIY project incorporated some AI-generated vector synthesis stuff.

Also, Two Minute Papers is a really awesome channel on YouTube for keeping up with this stuff.

Also this
I’ve probably spent literal days of time on this site. One of the best out there.
It’s come a long way since I started going there. It seems like it’s mostly being used to make fursonas and anime avatars, but you can go way beyond that.
We need this for synth patches, and I don’t think it would even be that hard.


This is very cool - I haven’t heard about these units - they look amazing. What were they like to develop on?

Looks like Google has done some experiments with “tone transfer”.

h/t to @ntrier for the link


thank you for this tip - this looks super interesting and sort of what I was getting at with harmonically useful transformations. It looks like they’re taking a very forceful approach to representation learning: they learn a parameterization of (an additive synthesis generator + noise) + reverb to reconstruct their inputs. Their criticism of Fourier encodings and WaveNet architectures is useful. That was my intended starting point, but I’d still like to validate their observations - spectral leakage sounds beautiful to me. Their spectral loss is a good call. TL;DR: this is a big ol’ architecture!

Jumping off of this, could I ask for advice on how I’m proceeding with the implementation?
Currently I’m using a Raspberry Pi + Pisound (from Blokas - an awesome little board). I have a simple Python interface that I shuttle audio through on the Pi. Currently the steps are:

  • FFT(chunk)
  • linear transform(spectrum)
  • inverse FFT(spectrum)
  • return chunk

I am using pyaudio + numpy thus far. The primary consideration here: is this going to scale? How fast am I going to hit a prohibitive latency problem? I haven’t run experiments yet. Do I need to switch to something like Rust now, before I get too deep?
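For what it’s worth, the per-chunk pipeline above can be sketched in plain numpy. The transform W below is just a near-identity stand-in for whatever trained matrix comes out of the training process, and the pyaudio callback plumbing is omitted:

```python
import numpy as np

CHUNK = 1024                       # samples per block at audio rate
N_BINS = CHUNK // 2 + 1            # rfft bins for a real-valued chunk

# Stand-in for a learned transform: near-identity, so output ~ input
rng = np.random.default_rng(0)
W = np.eye(N_BINS) + 0.01 * rng.standard_normal((N_BINS, N_BINS))

def process_chunk(chunk):
    spectrum = np.fft.rfft(chunk)            # FFT(chunk)
    spectrum = W @ spectrum                  # linear transform(spectrum)
    return np.fft.irfft(spectrum, n=CHUNK)   # inverse FFT -> chunk

chunk = rng.standard_normal(CHUNK)
out = process_chunk(chunk)
```

At 44.1 kHz, a 1024-sample chunk gives roughly a 23 ms budget per block; timing `process_chunk` with `time.perf_counter` would answer the latency question empirically before reaching for Rust.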

I’ve been thinking a lot about the training algorithm. My first pass will be something along these lines:

  • build a dataset of myself reading the lyrics of Prince songs, matched with the songs themselves, and get this set up in a few queue streams on disk
  • randomly pull from the queues, shifting slightly in time, pitch, and input rate; I’ll probably try adding different noise types - these will be the minibatches
  • I’ll probably use something similar to the loss from that paper - L1 spectral difference - though it might be fun to compare L1/L2 vs. KL-divergence (or ELBO for the VAEs). There’s a lot here, and it can get real wacky real fast, so starting simple sounds good. The loss will measure the difference between the read-in lyrics and the Prince song chunk
  • for architecture I was thinking of comparing 2 methods:
    • WaveNet encoding of the incoming audio - VAE/AE to encode the representation
    • train a VAE/AE on the Fourier spectra

The main difference is that the first method operates on the raw audio (maybe faster? though WaveNet requires a lot of multiplications), vs. learning a transformation of the FFT of incoming chunks.
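For the loss, here is a minimal numpy sketch of the L1 spectral difference mentioned above. The framing parameters are invented for illustration, and note that the DDSP-style loss referenced in the thread is multi-scale, comparing spectrograms at several FFT sizes rather than just one:

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=1024, hop=256):
    """Hann-windowed magnitude spectrogram via a strided framing trick."""
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1))

def spectral_l1_loss(pred, target, n_fft=1024, hop=256):
    """Mean absolute difference between the two magnitude spectrograms."""
    return np.mean(np.abs(magnitude_spectrogram(pred, n_fft, hop)
                          - magnitude_spectrogram(target, n_fft, hop)))

rng = np.random.default_rng(0)
a = rng.standard_normal(4096)
b = rng.standard_normal(4096)
```

Comparing magnitudes rather than complex spectra makes the loss insensitive to phase, which is usually what you want when the target is “sounds like” rather than “is sample-identical to.”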

Let me know what you think! I would greatly appreciate criticism if folks have any thoughts.

I have little experience with pyaudio + numpy, but @hecanjog does a lot of DSP in Python and can likely speak to its real-time capabilities and constraints.

Some interesting thoughts about the challenges inherent in implementing “tone transfer”. I wonder how Google addressed these challenges in the Magenta demo above.


Adam Neely tried making some music using tone transfer.