-
Notifications
You must be signed in to change notification settings - Fork 952
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
QEMU Audio support in VNC Viewer (currently Windows only) #1478
base: master
Are you sure you want to change the base?
Conversation
Added support for VNC protocol extension "QEMU audio" to the VNC Viewer (Windows only).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your PR! This is definitely something we are interested in.
How is the easiest way for us to go about testing this?
nAudioBytesLeft -= consumed; | ||
} | ||
if (nAudioBytesLeft != 0) | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This complexity is something that doesn't really fit well in CMsgReader
. How large are these audio chunks? Can't we let the input stream handle the buffering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Audio chunks may be large, there are no protocol limits besides UINT32_MAX. Bad-behaved VNC server may outgrow MAX_BUF_SIZE in common/rdr/BufferedInStream.cxx. That's why I prefer not to leave audio samples in the input buffer of BufferedInStream, and move them to circular playback buffer of Win32AudioOutput as soon as they arrive. Calls to memcpy happen in Win32AudioOutput::addSamples(), usually once, or twice if wraparound in the target circular buffer happens, and caller of that interface must provide a pointer and length of continuous chunk of source data. This continuous chunk of source data is owned by BufferedInStream.
When audio frame is fully received, Win32AudioOutput::submitSamples() makes one or two heap allocations of struct Win32AudioOutput::HdrInSlist, then calls waveOutPrepareHeader() WinAPI function (which boils down to allocation of struct OVERLAPPED on heap, creation of kernel event object and some pointer bookkeeping), then it calls waveOutWrite() WinAPI function (which boils down to submitting asyncronous I/O request to the NT kernel device driver, the audio mixer). There's no need to move bytes around in this context, asyncronous NT I/O request in flight refers to the audio samples in circular buffer owned by Win32AudioOutput which were added there earlier, by Win32AudioOutput::addSamples(). So while audio playback I/O request is still in flight, original socket buffer owned by BufferedInStream may contain different data, may be reallocated, may be empty etc.
When kernel completes asyncronous I/O request, it signals via kernel event object and awakes internal worker thread started by winmm.dll/wdmaud.drv. This thread invokes our Win32AudioOutput::waveOutCallback(). There we link struct Win32AudioOutput::HdrInSlist into slist to dispose it later, during the next call to Win32AudioOutput::submitSamples() or in the destructor of Win32AudioOutput. Before disposing Win32AudioOutput::HdrInSlist we call waveOutUnprepareHeader(), which boils down to deallocating struct OVERLAPPED on heap and destruction of kernel event object.
If a bad-behaved VNC server sends unreasonably large audio frame, then we clamp it at Win32AudioOutput::addSamples(): samples which do not fit into the circular buffer owned by Win32AudioOutput are de facto discarded, but reported as consumed to the caller, so the caller can release this space in BufferedInStream, and continue to do so until Win32AudioOutput::submitSamples() is called at the end of audio frame.
Summarizing all of the above, I think two buffers (socket I/O buffer and async playback buffer) are necessary.
So, what we need from BufferedInStream in this context of CMsgReader is an API to fetch some fresh data from socket and provide raw pointer to the continuous chunk of memory along with number of available bytes. There's an upper limit on the number of bytes we need here (nAudioBytesLeft), and there's no lower limit. We're happy to consume less than nAudioBytesLeft bytes here (especially if nAudioBytesLeft is large, i.e. this is a large audio frame). If there's partial audio sample/frame (e.g. less than 4 bytes for default audio format, 16-bit stereo), then up to 3 bytes will be left unconsumed in the BufferedInStream, for later use when more data arrives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The double buffering sounds fine, given the varying lifetime.
But I'd still like to get rid of the complexity. Since there is a limit anyway (which seems to be about 1 MiB if I read the code correctly), then this extra complexity doesn't seem worth it.
We don't need to support the theoretical extremes. Supporting what is actually used out there is quite sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this complexity can be reduced by adding a new method to rdr::InStream, which takes a number of bytes wanted by caller, similar to hasData(), but unlike hasData() does not require all of that bytes to be available. Instead it just returns a pointer and length of what is available. Length may be less than requested. Also this new method must call rdr::InStream::overrun() for actual I/O on socket to happen. Then code in CMsgReader will be shorter. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think I need to provide justification for MSGSTATE_AUDIO_DATA. It's true that entire 1 MiB of audio samples can fit in socket I/O buffer and be submitted for playback in one go. However consider ill-behaved VNC server which sends 1 GiB of audio data, so 1 MiB of these data is going to be submitted for playback and the rest (1023 MiB) is going to be discarded. How we can model this logic in the terms of rdr::InStream: "keep N bytes at the beginning of the buffer, but discard everything above N?" I think we cannot do this, rdr::InStream is designed for FIFO-style consumption of incoming data, we can't cut away samples/bytes/whatever in the middle of that buffer. Hence the compexity in CMsgReader, the new state of the FSM, and the double buffering. It's just a coincidence that double buffering fits nicely with async nature of audio I/O on Windows.
|
||
for (rdr::U8 bits_per_sample = 16; bits_per_sample != 0; bits_per_sample -= 8) { | ||
for (rdr::U8 n_channels = 2; n_channels != 0; n_channels--) { | ||
for (rdr::U8 freq_index = 0; freq_index < 4; freq_index++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't Windows resample for us? Do we really need to try all of these? It would make things a lot simpler if we just ha a fixed format we always use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can always use 44100 Hz, 16 bits, stereo, this should be sufficient for most cases. This is what is tried first, and usually succeeds. However if that fails, why not try other formats? In QEMU audio extensions of VNC protocol, it's client who determines an audio format to be used, and VNC server must perform resampling if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of this adds a lot of complexity to the code. So if it isn't needed in practice, then it's just a cost.
Have you actually seen it failing to give you the requested format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't test all audio drivers for the Windows, but I'm pretty sure it will fail to work on some poor's guy audio card of VM or whatever else if we just hardcode everything to 44100 Hz, 16-bit stereo.
This reverts commit 31aa3f8.
Added support for VNC protocol extension "QEMU audio" to the VNC Viewer (Windows only).
VNC audio can be tested with recent versions of qemu (e.g. 7.0.0) by adding the following command line options to qemu: for recent guests (e.g. Windows 8.1+):
for older guests (e.g. Windows 7, Windows XP):
Windows 7 has builtin driver for Intel HD audio, but its sound quality in QEMU is terrible (even for local playback, e.g. via ALSA or PulseAudio QEMU plugin). This is a well-known problem and usually resolved by switching to emulation of AC97 and installing compatible driver for Windows 7 x64, e.g. Realtek AC97 WDM driver alcwdm.inf version 6.0.1.6305 from 06/19/2009 (Windows hardware ID is PCI\VEN_8086&DEV_2415). |
I'm afraid I don't have easy access to any Windows guests. I'll try to spin up a Linux machine with those arguments and see how that goes. |
Just a quick status update; things are hectic here on my side at the moment, with upcoming vacations, so I'm not sure when I can have a look at this again. So it's not forgotten, but I'm afraid I don't know when I can have a closer look. |
Switched default output sampling rate to 48 kHz to avoid downsampling in QEMU for modern Windows guests. Also removed some code duplication in CMsgReader.
Here is a small update to the proposed pull request. My experiments has shown that modern Windows guests (Windows 10, Windows 11) prefer to output audio in 48 kHz, so I've changed default output rate in VNC Viewer to 48 kHz to avoid downsampling from 48 kHz to 44.1 kHz in the QEMU host code with subsequent upsampling from 44.1 kHz to 48 kHz in the kernel audio mixer driver (on the client host where VNC Viewer runs). Delay-wise and quality-wise it's better to leave resampling to the Windows guest and then just passthrough 48 kHz audio all the way down to kernel audio mixer driver on the VNC client host, I guess. Also I've reduced complexity in CMsgReader a bit, as you requested, by moving common code to a new method, CMsgReader::readAudioData(). |
Might be useful, there's a PoC with cross-platform audio support using miniaudio library based on original @mkupchik commit |
This pull request implements a new feature in the VNC Viewer: support for VNC protocol extension called "QEMU audio". My testing indicates that it works quite well, including anti-jitter measures (such as audio output device starvation detection and adjusting audio playback delay accordingly). I've implemented QEMU audio support on the Windows platform only, because this is only platform I'm familiar with (and the only platform I use TigerVNC on). This is a byproduct of another project I currently work on, so I'm just sharing my work in hope that it will be eventually upstreamed and I won't have to rebase it to the subsequent versions of TigerVNC and recompile manually.