thank you for this tip - this looks super interesting and sort of what I was getting at with harmonically useful transformations. It looks like they are taking a very forceful approach to representation learning, where they learn a parameterization of (an additive synthesis generator + noise) + reverb to reconstruct their inputs. Their criticism of Fourier encodings and WaveNet architectures is useful. That was my intended starting point - but I'd still like to validate their observations - spectral leakage sounds beautiful to me. Their spectral loss is a good call. TL;DR: this is a big ol' architecture!
Jumping off of this, could I ask for advice on how I'm proceeding with the implementation?
Currently I'm running on a Raspberry Pi + Pisound device (from Blokas - awesome little board), with a simple Python interface that I shuttle audio through. The steps per chunk are:
- FFT(chunk)
- linear transform(spectra)
- inverse FFT(spectra)
- return chunk
I'm using PyAudio + NumPy thus far. Primary consideration here: is this going to scale? How quickly am I going to hit a prohibitive latency problem? I haven't run experiments on this yet. Do I need to switch to something like Rust now, before I get too deep?
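Roughly, the processing loop looks like the sketch below (simplified; the identity matrix W is just a placeholder standing in for whatever learned transform ends up in the middle, and a callback-based stream is only one of a couple of ways to drive PyAudio):

```python
import numpy as np
import pyaudio

RATE = 44100   # sample rate (placeholder)
CHUNK = 1024   # frames per buffer - the main latency/throughput knob

# placeholder for the learned linear transform over rfft bins (identity for now)
W = np.eye(CHUNK // 2 + 1, dtype=np.complex64)

def callback(in_data, frame_count, time_info, status):
    x = np.frombuffer(in_data, dtype=np.float32)
    spectrum = np.fft.rfft(x)                    # FFT(chunk)
    spectrum = W @ spectrum                      # linear transform(spectra)
    y = np.fft.irfft(spectrum, n=frame_count)    # inverse FFT(spectra)
    return y.astype(np.float32).tobytes(), pyaudio.paContinue   # return chunk

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paFloat32, channels=1, rate=RATE,
                 input=True, output=True, frames_per_buffer=CHUNK,
                 stream_callback=callback)   # stream starts running on open
```

Back-of-envelope on latency: 1024 frames at 44.1 kHz is ~23 ms per buffer, and a full analysis-resynthesis round trip pays that at least twice (input buffer + output buffer), so that's the floor before any processing time is added.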
I’ve been thinking a lot about the training algorithm. My first pass will be something along the lines of -
- build a dataset of myself reading the lyrics of Prince songs matched with the songs themselves, and get this set up as a few queue streams on disk
- randomly pull from the queues, shifting slightly in time, pitch, and input rate. I'll probably try adding different noise types as well - these pulls become the minibatches (rough sketch below, after this list)
- I'll probably use something similar to the loss from that paper - an L1 spectral difference - though it might be fun to compare L1/L2 against KL-divergence (or the ELBO for the VAE variants). There's a lot here and it can get real wacky real fast, so starting simple sounds good. The loss will measure the difference between the read-in-lyrics chunk and the matched Prince song chunk (loss sketch below)
- for architecture I was thinking of comparing two methods:
- a WaveNet-style encoding of the incoming raw audio, with a VAE/AE to encode the representation
- train a VAE/AE directly on the Fourier spectra
The main difference here is that the first method operates on the raw audio (maybe faster? though WaveNet requires a lot of multiplications), versus learning a transformation of the FFT of the incoming chunks.
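For the minibatch jitter, something along these lines is what I have in mind - the shift/rate/noise ranges are placeholders to be tuned, and the resampling trick is only a crude stand-in for a proper pitch shift (doing that right probably means a phase vocoder or a library like librosa):

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(voice, song, sr=44100):
    """Randomly perturb a (read-lyrics, song) chunk pair before batching."""
    # small random time shift, up to ~50 ms either way
    voice = np.roll(voice, rng.integers(-sr // 20, sr // 20))
    # crude rate wobble via nearest-neighbour resampling (also nudges pitch)
    rate = rng.uniform(0.97, 1.03)
    idx = np.clip((np.arange(voice.size) * rate).astype(int), 0, voice.size - 1)
    voice = voice[idx]
    # additive gaussian noise at a random low level
    voice = voice + rng.normal(0.0, rng.uniform(0.0, 0.01), size=voice.size)
    return voice.astype(np.float32), song.astype(np.float32)

def minibatch(pairs, batch_size=16):
    """pairs: list of (voice_chunk, song_chunk) arrays pulled off the disk queues."""
    picks = rng.choice(len(pairs), size=batch_size)
    return [jitter(*pairs[i]) for i in picks]
```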
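For the loss term, a minimal sketch of an L1 spectral difference at a few FFT sizes in PyTorch - loosely in the spirit of that paper's loss (they also add a log-magnitude term, which would be easy to bolt on), not a faithful reimplementation:

```python
import torch

def spectral_l1(pred, target, fft_sizes=(2048, 1024, 512)):
    """L1 distance between magnitude spectrograms at several FFT sizes.
    pred, target: (batch, samples) float tensors."""
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=pred.device)
        S_pred = torch.stft(pred, n_fft, hop_length=n_fft // 4,
                            window=window, return_complex=True).abs()
        S_targ = torch.stft(target, n_fft, hop_length=n_fft // 4,
                            window=window, return_complex=True).abs()
        loss = loss + (S_pred - S_targ).abs().mean()
    return loss
```

Swapping in L2, or adding a KL term for the VAE runs, would just be extra terms on top of this.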
Let me know what you think! I would greatly appreciate criticism if folks have any thoughts.