The concept is interesting, but I can see a few complications in making such a setup work robustly and intuitively.
First of all, to get a good view of your fingers, the camera would need to be well placed relative to them as they play on the screen. No on-device camera can do this, so you’d need external cameras (not very convenient).
An alternative is to use the front-facing camera, but that is designed to show your face while you look at the display. You would need to angle the device to point at your fingers instead, and even then, viewing them from above isn’t very good for estimating velocity (i.e. camera placement matters a lot).
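To make the camera-placement point concrete, here is a minimal sketch of the obvious approach: track a fingertip across frames and take finite differences of its position. All positions, frame rates, and scales below are illustrative assumptions, not measurements; a real system would need hand tracking and camera calibration on top of this.

```python
# Hedged sketch: estimating key-press velocity from tracked fingertip
# positions via finite differences. All numbers are illustrative.

def estimate_velocity(positions_mm, fps):
    """Per-interval finite-difference speed (mm/s) between consecutive positions."""
    dt = 1.0 / fps
    return [(b - a) / dt for a, b in zip(positions_mm, positions_mm[1:])]

# A fingertip descending ~10 mm over three frame intervals at 30 fps,
# as a side-on camera would see it (vertical travel fully visible):
side_view = [10.0, 6.0, 1.0, 0.0]
print(estimate_velocity(side_view, fps=30))  # clearly resolvable speeds

# Seen from directly above, the same press moves mostly along the optical
# axis, so the in-image displacement (and thus the estimate) collapses:
top_view = [0.5, 0.3, 0.1, 0.0]  # tiny apparent motion from foreshortening
print(estimate_velocity(top_view, fps=30))
```

The second print illustrates why the overhead viewpoint is a problem: the signal you want (downward speed) barely registers as image motion, so noise dominates the estimate.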
Next, if you are playing off the screen, you need something to play on, such as a projected or paper keyboard. Either option requires extra space and potentially additional material (a keyboard with QR codes on it?).
Finally, cameras are often not very good in low light (typical evening household lighting). They compensate by taking longer exposures, resulting in motion-blurred images. My guess is that this wouldn’t make for very usable velocity estimates, so you would also need artificial illumination for the setup to be viable. Visible-light illumination would probably be awkward, so I’d guess you’d need invisible light (such as IR); this would require more power and either a camera capable of selectively filtering visible/invisible light or a dedicated camera.
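A back-of-envelope calculation shows why exposure time matters so much here. The streak a moving fingertip smears across the image is roughly speed × exposure time × image scale; every figure below is an assumed ballpark, not a measured value.

```python
# Rough motion-blur estimate: streak length ≈ speed × exposure × image scale.
# All inputs are assumed ballpark figures for illustration only.

def blur_pixels(speed_mm_s, exposure_s, px_per_mm):
    """Approximate length (in pixels) of the blur streak left by a moving point."""
    return speed_mm_s * exposure_s * px_per_mm

# A moderately fast key press (~300 mm/s) under dim light forcing a
# ~1/30 s exposure, at an assumed image scale of 5 px/mm:
print(blur_pixels(300, 1 / 30, 5))    # tens of pixels of smear

# The same press in bright light allowing a 1/1000 s exposure:
print(blur_pixels(300, 1 / 1000, 5))  # roughly a pixel; sharp enough to track
```

Tens of pixels of smear would swamp the frame-to-frame displacements a velocity estimator relies on, which is why extra illumination (to permit short exposures) looks unavoidable.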
As for a different velocity per finger: if you solve the other problems, it’s no more complex, and playing one finger at a higher velocity than the others isn’t that uncommon. You can effectively hold one finger higher or lower than the others as you push down, so it comes into contact earlier or later than the other fingers while your hand accelerates, making it louder or quieter. It’s very common to use this method to make the top (or bottom) note in a chord louder than the rest (less common for inner voices).