Well, you also hit the digital vs. analogue argument. To enumerate sounds you have to assign a number to the sound pressure at a given moment. To do that you need a finite set of sound pressure values measured at finite subdivisions of time. Hence you need a corollary to that argument about the minimum variation that is discernible by any given listener. I believe most people consider 48 kHz, 16-bit audio acceptable for playback (recording has different requirements, such as headroom).
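As a rough illustration of what "numbering" a sound means here, this is a minimal Python sketch that quantizes a normalized pressure value to a 16-bit code and counts the bits needed to index every possible clip of a given length. The function names and the [-1.0, 1.0] normalization are my own assumptions for illustration, not any standard API.

```python
import math

def quantize(pressure, bits=16):
    """Map a normalized pressure in [-1.0, 1.0] to a signed integer code."""
    max_code = 2 ** (bits - 1) - 1          # 32767 for 16-bit
    clipped = max(-1.0, min(1.0, pressure)) # clamp out-of-range input
    return round(clipped * max_code)

def sample_tone(freq_hz, sample_rate_hz=48_000, n_samples=8, bits=16):
    """Sample a sine tone at discrete instants and quantize each sample."""
    return [
        quantize(math.sin(2 * math.pi * freq_hz * n / sample_rate_hz), bits)
        for n in range(n_samples)
    ]

def bits_to_enumerate_clip(seconds, sample_rate_hz=48_000, bits=16):
    """Bits needed to index every possible mono clip of this length."""
    return seconds * sample_rate_hz * bits

print(sample_tone(1_000))
print(bits_to_enumerate_clip(1))  # one second of mono 48 kHz, 16-bit audio
```

So under those assumptions a one-second mono clip is one of 2^768000 possible sequences, which is finite but not small.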
For context, I remember calculating the dynamic range of a 32-bit floating-point audio recording: it could encode (in a single file) sounds with a dynamic range spanning the ratio from the quietest place on earth up to the sound pressure 100 ft from an atomic bomb (I can’t remember which, and it may not have been exactly 100 ft), while maintaining the decibel granularity of a typical 24-bit recording.
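I don't have my original calculation, but the dynamic-range side of it can be roughly reproduced. This sketch assumes IEEE 754 single precision, treats dynamic range as 20·log10 of the largest-to-smallest representable amplitude, and ignores subnormals; the helper names are made up for the example. The 24-bit comparison comes from the float's 24-bit significand (23 stored bits plus an implicit leading one).

```python
import math

def db(amplitude_ratio):
    """Convert an amplitude ratio to decibels."""
    return 20 * math.log10(amplitude_ratio)

def int_dynamic_range_db(bits):
    """Dynamic range of linear integer PCM with the given bit depth."""
    return db(2 ** bits)

# IEEE 754 single-precision limits (normal numbers only; assumption:
# subnormals are ignored, which slightly understates the range).
FLOAT32_MAX = (2 - 2.0 ** -23) * 2.0 ** 127
FLOAT32_MIN_NORMAL = 2.0 ** -126

def float32_dynamic_range_db():
    """Ratio of largest to smallest normal float32 magnitude, in dB."""
    return db(FLOAT32_MAX / FLOAT32_MIN_NORMAL)

print(f"16-bit PCM:   {int_dynamic_range_db(16):.1f} dB")
print(f"24-bit PCM:   {int_dynamic_range_db(24):.1f} dB")
print(f"32-bit float: {float32_dynamic_range_db():.1f} dB")
```

That works out to roughly 96 dB, 144 dB and a little over 1500 dB respectively, which is why the float format has so much room to spare.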
As for temporal granularity, some undersea mammals have perception up to 150 kHz, so we’d want to place the Nyquist frequency above that; let’s say a 320 kHz sample rate, which puts it at 160 kHz. In other words, on earth a 32-bit, 320 kHz recording should suffice to record anything within the perception of any complex lifeform. Via a strange information argument you could say that’s 2^32 values / (1,000,000,000,000 ps / 320 kHz) ≈ 1374 possible sounds per picosecond.
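The arithmetic is easy to check. A small sketch, using the same numbers as above (the function names are just for this example):

```python
PS_PER_SECOND = 10 ** 12

def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at a given sample rate."""
    return sample_rate_hz / 2

def values_per_picosecond(bits, sample_rate_hz):
    """Distinct sample values divided by the picoseconds each sample spans."""
    ps_per_sample = PS_PER_SECOND / sample_rate_hz  # 3,125,000 ps at 320 kHz
    return 2 ** bits / ps_per_sample

print(nyquist_hz(320_000))                 # 160000.0, above the 150 kHz target
print(values_per_picosecond(32, 320_000))  # about 1374.39
```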
Anyway, this post was mostly me thinking about the question of how best to enumerate rather than a specific response: thanks for kicking off that train of thought.