There’s a new source separation Python library available called Spleeter and it actually works surprisingly well.
Source separation is taking a mix and separating it into individual stems (e.g. a vocal track and an instrumental track). Spleeter uses TensorFlow, a machine learning framework with a Python API, to train neural networks that can perform source separation.
I’ve been playing around with it for a day or so and I’m pretty impressed with the results, especially compared with more primitive methods. Some songs separate better than others, and I’m not sure what drives those differences yet. I made some pretty good acapellas from songs that previously had no isolated vocals available. The library comes with a few different pre-trained models, but you can train your own models if you have enough data.
Part of this may have to do with the time-frequency overlap between sources in a given song. Spleeter estimates a time-frequency mask which determines what percentage of each “bin” can be attributed to each source. If lots of the bins are primarily due to a single source, the separation should go pretty well. If there is a lot of overlap, it will likely be worse. Spleeter also synthesizes the output using the mixture phase, which is another source of artifacts (even if there were an oracle that told us the “perfect” mask, this would still be a source of artifacts).
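To make the masking idea a bit more concrete, here’s a toy sketch in plain Python. It is not Spleeter’s code: the three “bins”, the source values, and the helper names `ratio_masks` and `apply_mask` are all made up for illustration, and the mask is built from the true source magnitudes (the “oracle” case), which a real model would have to estimate.

```python
import cmath

def ratio_masks(source_mags):
    """Per-bin soft masks: each source's share of the summed magnitudes."""
    n_sources = len(source_mags)
    n_bins = len(source_mags[0])
    masks = [[0.0] * n_bins for _ in range(n_sources)]
    for b in range(n_bins):
        total = sum(mags[b] for mags in source_mags)
        for s in range(n_sources):
            # Split evenly if the bin is silent in every source.
            masks[s][b] = source_mags[s][b] / total if total > 0 else 1.0 / n_sources
    return masks

def apply_mask(mixture_bins, mask):
    """Scale each complex mixture bin by a real mask value.

    Only the magnitude changes, so the estimate keeps the mixture's phase.
    """
    return [m * x for m, x in zip(mask, mixture_bins)]

# Hypothetical complex time-frequency bins for two sources.
# Bin 0: vocals only. Bin 1: both sources overlap. Bin 2: backing only.
vocals = [cmath.rect(1.0, 0.0), cmath.rect(0.5, 1.0), cmath.rect(0.0, 0.0)]
backing = [cmath.rect(0.0, 0.0), cmath.rect(0.5, -1.0), cmath.rect(2.0, 0.3)]
mixture = [v + b for v, b in zip(vocals, backing)]

# "Oracle" masks computed from the true source magnitudes.
masks = ratio_masks([[abs(v) for v in vocals], [abs(b) for b in backing]])
vocal_estimate = apply_mask(mixture, masks[0])

# Per-bin error between the estimate and the true vocal bins.
errors = [abs(e - v) for e, v in zip(vocal_estimate, vocals)]
for b, err in enumerate(errors):
    print(f"bin {b}: vocal mask = {masks[0][b]:.2f}, error = {err:.3f}")
```

The two bins that belong to a single source come back essentially untouched, while the overlapping bin has a clear error even with the oracle mask, because the estimate inherits the mixture’s phase rather than the vocal’s own phase.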
I’m interested to hear what happens if you apply this to something that isn’t a musical mix, something like field recordings or complex synthesizer tones. Or… what if you take eight songs and mix them together and then try to separate individual elements from that? Or how about non-audio signals hacked to look like audio files…
Many of the worries people have about the social consequences of (overuse and poorly thought-out applications of) machine learning are down to the fact that these systems aren’t really capable of saying “I have no idea”. Maybe there’s something of artistic interest to be found in that.
I recently found that there is a tool quite similar to Spleeter from Facebook called Demucs (I think it was mentioned on lines, actually, but I can’t find the thread), and they claim that it took first place in a demixing competition.
I just tried it with a song I already had Spleeter stems of and it definitely sounds more natural: the separation is better and there are fewer artifacts.
About a year ago I tried Spleeter on some field recordings. It recognized tweeting and singing birds as voices and separated them from the noise of the city. Because of the artifacts it sounded a bit like cyborgy birds in an unnaturally silent environment, so it’s definitely worth a try if you’re not looking for naturalistic sounds!