Ah very cool.
From the video it looks/sounds quite accurate. Having given the paper a quick read-through, it’s doing something similar to what I’m doing above (with slightly different descriptors), and in this case using a classifier instead of a generic Euclidean distance. I do like the inclusion of zero crossing rate though, as that’s not something I’ve messed with very much.
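For anyone who hasn’t played with it either, here is a minimal sketch of zero crossing rate on a single analysis frame. This is just the textbook definition (fraction of adjacent sample pairs whose sign differs), not necessarily how the paper computes it, and the function name is made up for illustration.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign.

    `frame` is a sequence of audio samples (floats). Zero is treated
    as positive so a run of exact zeros does not count as crossings.
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


# A signal that flips sign on every sample crosses at every step.
alternating = [1.0, -1.0, 1.0, -1.0]
# A constant (DC) signal never crosses zero.
constant = [0.5, 0.5, 0.5, 0.5]
```

Noisy, bright sounds tend to give a high value and low, pitched sounds a low one, which is why it is cheap to compute on very short windows.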
The time frame, however, of 50ms is crazy latency for realtime use. The paper does say this:
Although a 50 millisecond analysis latency is noticeably late in musical contexts, the authors are developing a predictive music generation algorithm which will correct for this and other delays introduced by the robotic system.
So I’m guessing their workaround for that will be to predict when the next attack might land, based on pattern analysis, which is great for beat-based music, but not necessarily for other kinds.
Either way, 50ms is super luxurious! As a point of reference, I’m generally working on a 256 sample analysis window at the moment (half of what it was for Kaizo Snare) and that’s like 5.8ms(!!).
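The arithmetic behind those numbers, as a quick sketch. The 44.1kHz sample rate is an assumption here (it is the rate at which 256 samples comes out at about 5.8ms), and the function names are made up for illustration.

```python
SAMPLE_RATE = 44100  # Hz, assumed


def window_ms(num_samples, sample_rate=SAMPLE_RATE):
    """Duration of an analysis window in milliseconds."""
    return 1000.0 * num_samples / sample_rate


def samples_for_ms(duration_ms, sample_rate=SAMPLE_RATE):
    """Number of samples spanned by a given duration in milliseconds."""
    return round(duration_ms * sample_rate / 1000.0)


current_window = window_ms(256)     # roughly 5.8 ms
previous_window = window_ms(512)    # twice that, the older window size
paper_window = samples_for_ms(50)   # samples in a 50 ms window
```

So at this rate a 50ms window is a bit over eight times as long as a 256 sample one.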
Obviously that’s fucking tiny, and there are loads of other tradeoffs (e.g. pitch and low frequencies in general are kind of shit), but I’m getting quite promising results.
I’ve gone back to square one on my feature set, but I’m trying to come up with an aggregate descriptor space that will (hopefully) accurately capture the differences in sounds I’m interested in (mainly prepared snare drum):
On the right you see each individual subspace, and how it maps out on a 3d plot via UMAP reduction (just for visualization).
Each individual subspace is as follows:
Loudness (4D) - mean, std, min, max → robust scale
Timbre (4D) - loudness-weighted 20(19) MFCCs, mean, std, min, max → standardize → 4D UMAP → robust scale
Envelope (4D) - deriv of loudness mean, deriv of loudness std, deriv of loudness-weighted centroid mean, deriv of loudness-weighted rolloff mean → robust scale
Pitch (2D) - confidence-weighted median, raw confidence → robust scale
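To make the “robust scale” and aggregation steps concrete, here is a rough sketch in plain Python. It assumes robust scaling means centring each column on its median and dividing by its interquartile range, and that the four subspaces are simply concatenated per sound into one 14D point (4 + 4 + 4 + 2). The UMAP step for timbre is left out, and all function names are hypothetical.

```python
from statistics import median, quantiles


def robust_scale(column):
    """Centre a column on its median and divide by its interquartile range."""
    med = median(column)
    q1, _, q3 = quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # constant column: avoid dividing by zero
    return [(value - med) / iqr for value in column]


def scale_subspace(rows):
    """Robust-scale every column of one subspace (rows = one per sound)."""
    columns = list(zip(*rows))
    scaled_columns = [robust_scale(col) for col in columns]
    return [list(row) for row in zip(*scaled_columns)]


def aggregate(loudness, timbre, envelope, pitch):
    """Scale each subspace independently, then concatenate per sound."""
    scaled = [scale_subspace(s) for s in (loudness, timbre, envelope, pitch)]
    return [sum(parts, []) for parts in zip(*scaled)]
```

Scaling each subspace on its own before concatenating keeps one set of descriptors (say, loudness in dB) from dominating the distances just because of its units.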
This works alright, but the timbre one isn’t ideal, and I still need to refine which statistics I take. But it’s slowly getting there…
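As an example of one of those statistics, here is a minimal sketch of a confidence-weighted median like the one in the pitch subspace. It assumes the usual weighted-median definition (the first value, in sorted order, at which the running weight reaches half the total); the function name is made up, and this is not necessarily how any particular library computes it.

```python
from statistics import median


def weighted_median(values, weights):
    """Value at which the cumulative weight first reaches half the total."""
    if not values:
        raise ValueError("need at least one value")
    total = sum(weights)
    if total <= 0:
        return median(values)  # no usable confidence: plain median
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= total / 2:
            return value
    return max(values)  # only reachable through rounding error


# Per-frame pitch estimates (Hz) with per-frame confidences.
pitches = [100.0, 200.0, 300.0]
confidences = [0.1, 0.1, 0.8]
estimate = weighted_median(pitches, confidences)  # dominated by the confident frame
```

With so few frames in a 256 sample window, weighting by confidence means one or two junk pitch estimates are less likely to drag the result around.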