Audio over Ethernet. Kernel or User-Space?

Audio over Ethernet is a technology that’s been attracting attention on-and-off for a while, but it looks like it’s about to hit the mainstream. Part of me is stoked, because pretty much everyone already has Ethernet cable lying around, but it’s a technology that is in no way ready for the greater open-source community.

Intel’s Open-AVB looks promising, but it currently supports only the Intel I210 Ethernet controller, and I’m not confident enough in it to commit the time to port the existing code to any other chipsets.

Then there are the proprietary solutions (ick).
Audinate’s Dante seems to be the big contender in this arena. Audinate offers plug-and-play solutions, the more featured one being a Spartan-6-based add-on board for your existing design (no idea how much they cost). The article below claims that Dante doesn’t offer the timing accuracy that Open-AVB possesses, and that such accuracy isn’t even necessary unless network bandwidth has been exceeded. I’m taking that with a grain of salt, but if that is the case, then why the proprietary hardware? Couldn’t I do the same with the spare capacity of the FPGA already in my project?

So I think it’s pretty clear: the disadvantages of Open-AVB are that it’s not finished, supports only specific chipsets (and must still be ported to all but a couple of them), and requires expensive switching hardware. Its advantages are that it’s free, open-source, and highly synchronised. The disadvantages of Dante are that it costs money (possibly a lot), adds another embedded system on top of what you already have, and is closed-source. Its advantages are potentially painless integration (and thus a lower cost of development) and the ability to use consumer network hardware.

The other fast(ish) option is to write a user-space application in the form of a JACK client, VST plugin, or similar program. I’ve written plenty of these network audio clients, and they were stable and offered low latency over a consumer LAN (the JACK server on my laptop never crashed waiting for samples, so it was synchronised well enough with my laptop’s audio card). However, this solution assumes there’s an audio interface that the JACK server is using for synchronisation in the first place (you’d have to create an interface that generates a clock), and it totally breaks the expectations of systems looking for audio devices.
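For the user-space route, here’s a minimal sketch of the kind of packetisation such a network audio client ends up doing. Everything here — the header layout, the packet size, the constant names — is my own invention for illustration, not JACK’s or any real protocol’s wire format:

```python
import struct

# Hypothetical wire format for a user-space audio-over-LAN client:
# a 32-bit sequence number followed by interleaved 16-bit PCM frames.
HEADER = struct.Struct("!I")   # sequence number, network byte order
CHANNELS = 2
FRAMES_PER_PACKET = 48         # 1 ms of audio at 48 kHz

def pack_packet(seq, frames):
    """frames: list of FRAMES_PER_PACKET tuples, one 16-bit sample per channel."""
    assert len(frames) == FRAMES_PER_PACKET
    payload = b"".join(struct.pack("!%dh" % CHANNELS, *f) for f in frames)
    return HEADER.pack(seq) + payload

def unpack_packet(packet):
    """Inverse of pack_packet: returns (seq, list of per-frame sample tuples)."""
    (seq,) = HEADER.unpack_from(packet)
    body = packet[HEADER.size:]
    frames = [struct.unpack_from("!%dh" % CHANNELS, body, i * 2 * CHANNELS)
              for i in range(FRAMES_PER_PACKET)]
    return seq, frames
```

In a real client you’d send these datagrams over UDP and use the sequence number to handle reordering and loss; at 48 frames per packet, each datagram carries exactly 1 ms of stereo audio.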

Anyone else thinking about this?

EDIT: My target use-case is a small number of external audio devices connected to a computer or computers over a 100/1000Mbps Ethernet LAN. I feel like the current market solutions are a bit over-engineered (or aren’t platform-agnostic), but I’m still not totally set… I think some formal, real-world tests and metric gathering are in order.

Wow this is long… sorry! It’s been something I’ve been thinking about in another context and have been meaning to explore. This was the right opportunity!

I’m wondering if all of these are solutions in search of a problem… or at the very least, a problem that our domain - musicians building music equipment for solo or small-ensemble performance - does not have.

Sure - if you are building a 96-channel studio and want to pipe 96 channels around between several endpoints - and want to do that over Ethernet with sample-level sync - then sure: 24 bits × 48kHz × 96 channels × 4 endpoints ≈ 500Mb/s. That’s asking a lot from both GigE and the software stacks.

BUT - I imagine that for the kinds of uses the people in this community might put audio over ethernet to, something much simpler is in order:

Audio over Ethernet as a point-to-point connection: I’m thinking of connecting a computer to an interface w/4 ch out, or say stereo in & out to an effects device. 16 bits × 48kHz × 4 channels ≈ 3Mb/s - or call it 9Mb/s if you want to double the sampling rate and use more bits. This seems to me well within the realm of commodity interfaces and the standard network stack, even striving for, say, 1ms latency.

Small studio of devices connected over Ethernet via a standard switch: Okay, so think three audio sources (synths) with stereo out, three effects units with stereo in/out, an analog modular interface with 4 in/4 out, and an audio interface with 4 in/4 out. That’s still only 34 channels total. So - say - 24 bits × 48kHz × 34 channels ≈ 40Mb/s. This is probably still not straining a standard consumer GigE switch - and the software stacks at each device can still keep up.
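To sanity-check the back-of-envelope numbers above (raw PCM payload only, no Ethernet/IP/UDP framing overhead), here’s a quick Python helper — the function name and the scenario labels are my own:

```python
def audio_bitrate_mbps(bits, rate_hz, channels):
    """Raw PCM payload rate in Mb/s, ignoring all network framing overhead."""
    return bits * rate_hz * channels / 1e6

big_studio     = audio_bitrate_mbps(24, 48_000, 96 * 4)  # 96 ch to 4 endpoints
point_to_point = audio_bitrate_mbps(16, 48_000, 4)       # computer to 4-ch box
doubled        = audio_bitrate_mbps(24, 96_000, 4)       # 2x rate, more bits
small_studio   = audio_bitrate_mbps(24, 48_000, 34)      # the 34-channel setup

print(big_studio, point_to_point, doubled, small_studio)
# → 442.368 3.072 9.216 39.168  (Mb/s)
```

So the “big studio” case lands around 442 Mb/s before overhead — close to the ≈500Mb/s figure — while the small-studio cases are tiny fractions of a GigE link.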

I’m assuming here that what we’d more likely want from audio over Ethernet is ease and flexibility of digital audio interconnection between our devices. I don’t think we need to keep all these things in sample sync - 1ms or better latency (in to out, separate from any processing latency the unit itself has) is probably fine.

Now - I haven’t done all the calcs of overhead and congestion here - but my day gig is architect of enterprise network systems (!) - so I don’t think I’m that far off.

The key thing to think about is “What is the use case?” AVB looks to me like it was developed for the use cases of the film and broadcast video studio - probably with a bit of over-engineering - because those users will happily spend up front for systems with more specs than they currently need (retrofitting studios is expensive).

I think we might have a different set of needs…

Right, consumer networks definitely have the bandwidth for most small-studio uses, so maybe the solution is a user-space network/audio client plus an audio driver for the master host to synchronise its clock to (it could talk to a h/w real-time clock).

Think of a server room in the other wing of the building, and the ability to route audio from any studio to any control room - e.g. a feed from an entirely different space comes up on a fader in your desk with the lowest possible latency. Throw in a master clock to attack the time-domain issues for sync-to-picture, where a 2048-sample buffer is huge and would likely fail QA. I think the issues with GigE are to do with clocking, jitter, and latency rather than throughput. It’s certainly convenient to have many audio channels folded into one Ethernet cable - cheaper, and it travels further than TOSLINK.
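To put that 2048-sample figure in perspective, a quick bit of arithmetic (mine, not from the thread): at 48 kHz, a 2048-sample buffer alone adds more delay than an entire film frame at 24 fps, which is why it’s hopeless for sync-to-picture.

```python
def buffer_latency_ms(samples, rate_hz):
    """Delay contributed by a buffer of `samples` frames at `rate_hz`."""
    return samples / rate_hz * 1000.0

lat_ms   = buffer_latency_ms(2048, 48_000)  # ≈ 42.7 ms
frame_ms = 1000.0 / 24                      # one film frame at 24 fps ≈ 41.7 ms
print(lat_ms > frame_ms)
# → True
```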