[![Bountysource](https://api.bountysource.com/badge/issue?issue_id=60853960)](ht…tps://www.bountysource.com/issues/60853960-lagless-vsync-support-add-beam-racing-to-retroarch-now-1082-bountysource)
# Feature Request Description
A new lagless VSYNC technique has been developed that is already implemented in some emulators. This should be added to RetroArch too.
## Bounty available
There is currently a [BountySource of about $500](https://www.bountysource.com/issues/60853960-lagless-vsync-support-add-beam-racing-to-retroarch-now-850-bountysource) to add the beam racing API to RetroArch plus support for at least 2 emulator modules (scroll below for bounty trigger conditions). RetroArch is a C / C++ project.
## Synchronize emu raster with real world raster to reduce input lag
It is achieved **via synchronizing the emulator's raster to the real world's raster**. It is successfully implemented in some emulators, and uses less processing power than RunAhead, and is more forgiving than expected thanks to a "jitter margin" technique that has been invented by a group of us (myself and a few emulator authors).
*For lurkers/readers: Don't know what a "raster" or "beam racing" is? Read WIRED Magazine's [Racing the beam article](https://www.wired.com/2009/03/racing-the-beam/). Many 8-bit and 16-bit computers, consoles and arcade machines utilized similar techniques for many tricks, and emulators typically implement them.*
## Already Proven, Already Working
* WinUAE -- [GitHub Issue](https://github.com/tonioni/WinUAE/issues/133)
* WinUAE -- [Download Announcement](http://eab.abime.net/showthread.php?t=88777&page=8)
* GroovyMAME -- [Dropbox .7z file: Successful experiment via unsubmitted patch by Calamity](https://www.dropbox.com/s/2chq2l29wujuuh5/mame64_frame_slice.7z?dl=0) (and [thread](https://forums.blurbusters.com/viewtopic.php?f=22&t=3972#p31750))
There is currently discussion between other willing emulator authors behind the scenes for adding lagless VSYNC (real-world beam racing support).
## Preservationist Friendly. Preserves original input lag accurately.
Beam racing preserves all original latencies including mid-screen input reads.
## Less horsepower needed than RunAhead.
RunAhead is amazing! That said, there are other lag-reducing tools that we should make available too.
Even Android and Pi GPUs (too slow for RunAhead in many emulators) work with this lag-reducing technique.
Beam racing works on Pi/Android and allows slower cycle-exact emulators to have dramatic lag reductions.
We have found it scales in both directions, including down to Android and Pi. Powerful computers can gain ultra-tight beam racing margins (sync between emuraster and realraster can be sub-millisecond on GTX 1080 Ti). Slower computers can gain very forgiving beam racing margins. The beam racing margin is adjustable -- it can be up to 1 refresh cycle in size.
In other words, graphics are *essentially* raster-streamed to the display practically real-time (through a creative tearingless VSYNC OFF trick that works with standard Direct3D/OpenGL/Metal/etc), while the emulator is merrily executing at 1:1 original speed.
## Diagrammatic Concept
![Lagless VSYNC](https://www.blurbusters.com/wp-content/uploads/2018/03/EmulatorRasterFollowerAlgorithm.png)
![Lagless VSYNC jitter margin](https://www.blurbusters.com/wp-content/uploads/2018/03/BeamChasingJitterMargin-690x129.png)
Just like duplicate refresh cycles never have tearlines even in VSYNC OFF, duplicate frameslices never have tearlines either. We're simply subdividing frames into subframes, and then using VSYNC OFF instead.
We don't even need a raster register (it can help, but we've come up with a different method), since rasters can be a time-based offset from VSYNC, and that can still be accurate enough for flawless sub-millisecond latency difference between emulator and original machine.
Emulators can merrily run at original machine speed, essentially streaming pixels darn-near-raster-realtime (submillisecond difference). What many people don't realize is that 1080p and 4K signals still scan top-to-bottom like an old 60Hz CRT in default monitor orientation. We're simply synchronizing to cable scanout: the scanout method of serializing 2D images to a 1D cable is fundamentally unchanged. The result is real raster sync between the emulator raster and the real raster!
Many emulators already render 1 scanline at a time to an offscreen framebuffer. So 99% of the beam racing work is already done.
## Simple Pre-Requisites
Distilling down to minimum requirements makes rasters cross-platform:
* Platform supports a VSYNC OFF mode
* Platform is able to provide VSYNC timestamps
* Platform supports high-precision counters (sub-millisecond-accuracy counters)
*Such as RDTSC or QueryPerformanceCounter or std::chrono::high_resolution_clock*
* PC, Mac, Android, Pi, Radeon, GeForce, Intel all support the beamraced frame slice technique
We use beam racing to hide tearlines in the jitter margin, creating a tearingless VSYNC OFF (lagless VSYNC ON) with a very tight (but forgiving) synchronization between emulator raster and real raster.
## The simplified retro_set_raster_poll API Proposal
Proposing to add an API -- **retro_set_raster_poll** -- to allow this data to be relayed to an optional centralized beamracing module for RetroArch to implement realworld sync between emuraster and realraster via whatever means possible (including frameslice beam racing & front buffer beam racing, and/or other future beam racing sync techniques).
This API simply allows the centralized beamracing module to take an early peek at the incomplete emulator refresh cycle framebuffer every time a new emulator scan line has been plotted to it.
This minimizes modifications to emulators, allowing centralization of beam racing code.
The central code handles its own refresh cycle scanout synchronization (busylooping to pace correctly to the real world's raster scan line number, which can be extrapolated in a cross-platform manner as seen below!) without the emulator worrying about any other beam racing specifics.
## Further Detail
Basically it's a beam-raced VSYNC OFF mode that looks exactly like VSYNC ON (perfect tearingless VSYNC OFF). The emulator can merrily render at 1:1 speed while realtime streaming graphics to the display, without surge-execution needed. This requires far less horsepower on the CPU, works with "cycle-exact" emulators (unlike RunAhead) and allows ultra low lag on Raspberry Pi and Android processors. Frame-slice beam racing is [already used for Android Virtual Reality too](https://www.imgtec.com/blog/reducing-latency-in-vr-by-using-single-buffered-strip-rendering/), and it also works successfully for emulators.
## Which emulators does this benefit?
This lag reduction technique will benefit any emulator that already does internal beam racing (e.g. to support original raster interrupts). Nearly all retro platforms -- most 8-bit and 16-bit platforms -- can benefit.
This lag-reduction technique does not benefit high level emulation.
## Related Raster Work on GPUs
Doing "raster interrupts" style work on Radeon/GeForce/Intel GPUs is surprisingly easy: [tearlines are just rasters](https://www.pouet.net/topic.php?which=11422&page=1) -- see YouTube video.
This provides the groundwork for lagless VSYNC operation: synchronization of realraster and emuraster. With the emulator method, the tearlines are hidden via the jittermargin approach.
## Common Developer Misconceptions
First, to clear up common developer misconceptions of assumed "showstoppers"...
* Yes, it can work with 60Hz, 120Hz, 180Hz, 240Hz (simply beam racing cherrypicked refresh cycles -- requires surge execution for beam racing "fast" refresh cycles), works in WinUAE
* Yes, it's more forgiving than expected of computer performance fluctuations (jitter margin technique)
* Yes, it can work simultaneously with RunAhead (if need be, though not necessary). Simply beam race the final/visible frame.
* Yes, it works simultaneously with variable refresh rate ([see this post](https://forums.blurbusters.com/viewtopic.php?f=22&t=3972&start=20#p31926)), works in WinUAE
* Yes, you can easily enter/exit beamracing mode on the fly *(e.g. screen rotation to incompatible scan direction, switch to windowed operation)*
* Yes, it works with scaled and HLSL/shaders/fuzzylines, as it always works in WinUAE. It does slow things down, and requires optimizations to speed up again (but this can be solved as a separate optimization). Any distortions (e.g. curves, or line fuzz) can be hidden in the jitter margin height technique, to be 100% artifactless
* Yes, it can be used in conjunction with black frame insertion (including for the 31KHz 240p compatibility mode for MAME arcade machines; though that will require 2x surge-execute during a fast 1/120sec scanout of the visible refresh cycle).
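To make the "cherrypicked refresh cycles" idea from the list above concrete, here is a minimal C sketch. It assumes whole-number refresh rates and assumes the first refresh cycle of each group is the one that gets beam raced; both are simplifications for illustration, and the function names `surge_factor` and `is_beamraced_cycle` are hypothetical.

```c
#include <assert.h>

/* How many real refresh cycles fit into one emulated frame.
 * This is also the surge-execution speed needed while beam racing a
 * "fast" refresh cycle (e.g. 2x for 60Hz emulation on a 120Hz display).
 * Returns 0 when the real rate is not a whole multiple of the emu rate.
 * Assumption: integer Hz; real displays often run fractional rates. */
unsigned surge_factor(unsigned real_hz, unsigned emu_hz)
{
   assert(emu_hz > 0);
   if (real_hz < emu_hz || real_hz % emu_hz != 0)
      return 0;
   return real_hz / emu_hz;
}

/* Nonzero when this real refresh cycle is the one to beam race.
 * Assumption: the first cycle of every group is the cherrypicked one. */
int is_beamraced_cycle(unsigned long cycle_index, unsigned factor)
{
   assert(factor > 0);
   return cycle_index % factor == 0;
}
```

With a 120Hz display and 60Hz emulation, `surge_factor(120, 60)` gives 2, so every second refresh cycle is beam raced at 2x emulation speed and the remaining cycles can repeat the frame or show a black frame.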
---
# Proposal
**Recommended Hook**
1. Add the per-raster callback function called "**retro_set_raster_poll**"
2. The arguments are identical to "**retro_set_video_refresh**"
3. Do it to one emulator module at a time (begin with the easiest one).
It calls the raster poll every emulator scan line plotted. The incomplete contents of the emulator framebuffer (complete up to the most recently plotted emulator scanline) is provided. This allows centralization of frameslice beamracing in the quickest and simplest way.
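As a rough illustration of the hook described above, the following C sketch mirrors the shape of `retro_set_video_refresh`. The callback argument list matches libretro's existing `retro_video_refresh_t`; the names `retro_raster_poll_t`, `core_scanline_done` and the `demo_*` helpers are hypothetical and are not part of libretro today.

```c
#include <assert.h>
#include <stddef.h>

/* Same argument list as libretro's retro_video_refresh_t.
 * The type name retro_raster_poll_t is hypothetical. */
typedef void (*retro_raster_poll_t)(const void *data, unsigned width,
                                    unsigned height, size_t pitch);

static retro_raster_poll_t raster_poll_cb; /* installed by the frontend */

/* Proposed registration entry point, mirroring retro_set_video_refresh(). */
void retro_set_raster_poll(retro_raster_poll_t cb)
{
   raster_poll_cb = cb;
}

/* Core side (hypothetical helper): call after each emulated scanline has
 * been plotted into the still-incomplete framebuffer. */
static void core_scanline_done(const void *framebuffer, unsigned width,
                               unsigned height, size_t pitch)
{
   assert(framebuffer != NULL);
   if (raster_poll_cb)
      raster_poll_cb(framebuffer, width, height, pitch);
}

/* Frontend side: a stand-in handler that only counts polls. A real
 * beamracing module would count rows toward the next frameslice here. */
static unsigned demo_scanlines_seen;

static void demo_raster_poll(const void *data, unsigned width,
                             unsigned height, size_t pitch)
{
   (void)data; (void)width; (void)height; (void)pitch;
   demo_scanlines_seen++;
}

/* Emulates one 320x240 frame and returns how many polls were received. */
unsigned demo_run_frame(void)
{
   static unsigned short fb[240 * 320];
   unsigned line;

   demo_scanlines_seen = 0;
   retro_set_raster_poll(demo_raster_poll);
   for (line = 0; line < 240; line++)
      core_scanline_done(fb, 320, 240, 320 * sizeof(fb[0]));
   return demo_scanlines_seen;
}
```

Because the arguments carry no explicit scanline index, this sketch relies on the central code counting calls per frame, which is how the proposal describes it.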
**Cross-Platform Method: Getting VSYNC timestamps**
You don't need a raster register if you can do this! You can extrapolate approximate scan line numbers simply as a time offset from a VSYNC timestamp. You don't need line-exact accuracy for flawless emulator frameslice beamracing.
For the cross-platform route -- the register-less method -- you need to listen for VSYNC timestamps while in VSYNC OFF mode.
These ideally should become your **only** #ifdefs -- everything else about GPU beam racing is cross platform.
*PC Version*
1. Get your primary display adapter's device name, such as `\\.\DISPLAY1`. For me in C#, I use Screen.PrimaryScreen.DeviceName to get this, but in C/C++ you can use **EnumDisplayDevices()** ...
2. Next, call **D3DKMTOpenAdapterFromHdc()** with this info to open the hAdaptor handle
3. For listening to VSYNC timestamps, run a thread with **D3DKMTWaitForVerticalBlankEvent()** on this hAdaptor handle. Then immediately record the timestamp. This timestamp represents the end of a refresh cycle and beginning of VBI.
*Mac Version*
Other platforms have various methods of getting a VSYNC event hook (e.g. Mac CVDisplayLinkOutputCallback) which roughly corresponds to the Mac's blanking interval. If you are using the registerless method and generic precision clocks (e.g. RDTSC wrappers), these can potentially be your only #ifdefs in your cross-platform beam racing -- simply the various methods of getting VSYNC timestamps. The rest has nothing platform-specific.
*Linux Version*
See [GPU Driver Documentation](https://kernel.readthedocs.io/en/sphinx-samples/gpu.html). There is a *get_vblank_timestamp()* available, and sometimes a *get_scanout_position()* (raster register equivalent). Personally I'd focus only on obtaining VSYNC timestamps -- much simpler and more guaranteed on all platforms.
**Getting the current raster scan line number**
For raster calculation you can do one of the two:
(A) _Raster-register-less-method:_ Use RDTSC or **QueryPerformanceCounter** or std::chrono::high_resolution_clock to profile the times between refresh cycles. On Windows, you can use the known fractional refresh rate (from **QueryDisplayConfig**) to bootstrap this "best-estimate" refresh rate calculation, and refine this in realtime. Calculating raster position is simply a relative time between two VSYNC timestamps, allowing 5% for VBI (meaning 95% of 1/60sec for 60Hz would be a display scanning out). _NOTE: Optionally, to improve accuracy, you can dejitter. Use a trailing 1-second interval average to dejitter any inaccuracies (they calm to 1-scanline-or-less raster jitter), ignore all outliers (e.g. missed VSYNC timestamps caused by computer freezes). Alternatively, just use the jittermargin technique to hide VSYNC timestamp inaccuracies._
(B) _Raster-register-method:_ Use **D3DKMTGetScanLine** to get your GPU's current scanline on the graphics output. Wait at least 1 scanline between polls (e.g. sleep 10 microseconds between polls), since this is an expensive API call that can stress a GPU if busylooping on this register.
NOTE: If you need to retrieve the "hAdaptor" parameter for **D3DKMTGetScanLine**, get your adapter device name such as `\\.\DISPLAY1` via **EnumDisplayDevices()**, then call **D3DKMTOpenAdapterFromHdc()** with it to open the hAdaptor handle, which you can finally pass to **D3DKMTGetScanLine**. This works with Vulkan/OpenGL/D3D 9/10/11/12+. D3DKMT is simply a hook into the hAdaptor that is being used for your Windows desktop, which exists as a D3D surface regardless of what API your game is using, and all you need is to know the scanline number. So who gives a hoot about the "D3DKMT" prefix; it works fine for beamracing with OpenGL or Vulkan API calls. (KMT stands for Kernel Mode Thunk, but you don't need Admin privileges to do this specific API call from userspace.)
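The register-less method (A) can be sketched in a few lines of C. This is a minimal illustration, not a definitive implementation: it assumes the VSYNC timestamp marks the start of the VBI, that blanking occupies the first `vbi_fraction` of each cycle, and that scanout is linear after that. The function names are hypothetical, and the exponential moving average in `refine_refresh_period` is a stand-in for the trailing 1-second average mentioned above.

```c
#include <assert.h>

/* Estimate which visible scanline the display is scanning out right now.
 * All times are in seconds from one monotonic high-precision clock.
 *   last_vsync     : most recent VSYNC timestamp (assumed start of VBI)
 *   refresh_period : measured time between VSYNC timestamps
 *   vbi_fraction   : assumed share of the period spent blanking (~0.05)
 * Returns 0..visible_lines-1, or -1 while the display is still in VBI. */
int estimate_scanline(double now, double last_vsync, double refresh_period,
                      double vbi_fraction, int visible_lines)
{
   double elapsed = now - last_vsync;
   double phase;
   int line;

   assert(refresh_period > 0.0 && elapsed >= 0.0);
   assert(vbi_fraction >= 0.0 && vbi_fraction < 1.0);

   /* Wrap into the current cycle in case a VSYNC timestamp was missed. */
   elapsed -= (double)(long long)(elapsed / refresh_period) * refresh_period;
   phase = elapsed / refresh_period;   /* 0.0 up to (not including) 1.0 */

   if (phase < vbi_fraction)
      return -1;                       /* blanking comes first */

   line = (int)((phase - vbi_fraction) / (1.0 - vbi_fraction) * visible_lines);
   return line < visible_lines ? line : visible_lines - 1;
}

/* Dejitter the refresh-period estimate with a new VSYNC-to-VSYNC interval.
 * Intervals far from the estimate (e.g. a missed VSYNC after a freeze)
 * are ignored; the rest are blended in with weight 1/64. */
double refine_refresh_period(double estimate, double interval)
{
   if (interval < estimate * 0.5 || interval > estimate * 1.5)
      return estimate;                 /* outlier: keep the old estimate */
   return estimate + (interval - estimate) / 64.0;
}
```

A frontend would call `refine_refresh_period` once per VSYNC timestamp and `estimate_scanline` inside its busyloop. Any remaining error of a few scanlines is what the jitter margin is meant to absorb.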
**Improved VBI size monitoring**
You don't need raster-exact precision for basic frameslice beamracing, but knowing VBI size makes it more accurate to do frameslice beamracing since VBI size varies so much from platform to platform, resolution to resolution. Often it just varies a few percent, and most sub-millisecond inaccuracies are easily hidden within the jittermargin technique.
But, if you've programmed with retro platforms, you are probably familiar with the VBI (blanking interval) -- essentially the overscan space between refresh cycles. This can vary from 1% to 5% of a refresh cycle, though extreme timings tweaking can make VBI more than 300% the size of the active image (e.g. Quick Frame Transport tricks -- fast scan refresh cycles with long VBIs in between). For cross platform frameslice beamracing it's OK to assume ~5% being the VBI, but there are many tricks to know the VBI size.
1. **QueryDisplayConfig()** on Windows will tell you the Vertical Total. (easiest)
2. Or monitor the ratio of .InVerticalBlank = true versus .InVerticalBlank = false (via **D3DKMTGetScanLine**) by watching the flag changes (wait a few microseconds between polls, or 1 scanline delay -- D3DKMTGetScanLine is an 'expensive' API call)
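Either trick boils down to a simple ratio. The C sketch below shows both under stated assumptions: the first function takes a vertical total and an active line count (for example the common 1080p timing with 1125 total lines), and the second assumes the blanking flag was sampled at evenly spaced intervals. Both function names are hypothetical.

```c
#include <assert.h>

/* VBI share of a refresh cycle, from the signal's vertical total and
 * its active (visible) line count. */
double vbi_fraction_from_vtotal(unsigned vertical_total, unsigned active_lines)
{
   assert(vertical_total > 0 && active_lines <= vertical_total);
   return (double)(vertical_total - active_lines) / (double)vertical_total;
}

/* VBI share estimated from polling: the fraction of evenly spaced polls
 * that reported the display as being inside vertical blanking. */
double vbi_fraction_from_polls(unsigned long polls_in_vblank,
                               unsigned long polls_total)
{
   assert(polls_total > 0 && polls_in_vblank <= polls_total);
   return (double)polls_in_vblank / (double)polls_total;
}
```

For 1125 total lines and 1080 active lines this gives 45/1125, or 4%, which is close to the ~5% default assumed for the cross-platform route.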
**Turning The Above Data into Real Frameslice Beamracing**
For simplicity, begin with emu Hz = real Hz (e.g. 60Hz)
1. Have a configuration parameter of number of frameslices (e.g. 10 frameslices per refresh cycle)
2. Let's assume 10 frameslices for this exercise.
3. Actual screen 1080p means 108 real pixel rows per frameslice.
4. Emulator screen 240p means 24 emulator pixel rows per frameslice.
5. Your emulator module calls the centralized raster poll (retro_set_raster_poll) right after every emulator scan line. The centralized code (retro_set_raster_poll) counts the number of emulator pixel rows completed to fill a frameslice. The central code will do either (5a) or (5b):
(5a) Return immediately to the emulator module if a full new frameslice has not yet been appended to the existing offscreen emulator framebuffer (don't do anything to the partially completed framebuffer). Update a counter, do nothing else, return immediately.
(5b) However, once a full frameslice worth has built up since the last frameslice presented, it's time to present the next frameslice. Don't return right away. Instead, immediately do an intentional CPU busyloop until the realraster reaches roughly 2 frameslice-heights above your emulator raster (relative screen-height wise). So if your emulator framebuffer is filled up to the bottom edge of where frameslice 4 is, then busyloop until the realraster hits the top edge of frameslice 3. Then immediately Present() or glutSwapBuffers() upon completing the busyloop. Then Flush() right away.
_NOTE: The tearline (invisible if the graphics at the raster are unchanged) will sometimes be a few pixels below the scan line number (the amount of time for a memory blit, which is memory-bandwidth dependent). You can compensate for it, or you can just hide any inaccuracy in the jittermargin._
_NOTE2: This is simply the recommended beamrace margin to begin experimenting with: A 2 frameslice beamracing margin is very jitter-margin friendly._
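The steps above can be sketched as a small per-scanline decision function in C. This is an illustration under assumptions, not the actual implementation from WinUAE or any other emulator: the struct and function names are hypothetical, the end of the frame is treated as a slice boundary, and early slices whose wait target would fall in the previous refresh cycle are simply clamped to 0 (present without waiting).

```c
#include <assert.h>

typedef struct {
   unsigned emu_lines;      /* visible emulator scanlines, e.g. 240   */
   unsigned real_lines;     /* visible display scanlines, e.g. 1080   */
   unsigned slices;         /* frameslices per refresh cycle, e.g. 10 */
   unsigned margin_slices;  /* beam racing margin in slices, e.g. 2   */
   unsigned emu_lines_done; /* scanlines plotted so far this frame    */
} beamrace_state;

/* Call once per emulated scanline.
 * Returns -1 for step (5a): no full new frameslice yet, do nothing.
 * Otherwise, step (5b): returns the real display scanline the caller
 * should busyloop for before calling Present(). */
int beamrace_on_scanline(beamrace_state *s)
{
   unsigned lines_per_slice;
   long emu_pos, margin, target;
   int frame_done;

   assert(s && s->slices > 0 && s->emu_lines >= s->slices);
   lines_per_slice = s->emu_lines / s->slices;

   s->emu_lines_done++;
   frame_done = (s->emu_lines_done >= s->emu_lines);

   /* (5a) Not a full new frameslice yet: just count and return. */
   if (!frame_done && s->emu_lines_done % lines_per_slice != 0)
      return -1;

   /* (5b) Emulator raster scaled to display rows, minus the margin. */
   emu_pos = (long)s->emu_lines_done * (long)s->real_lines / (long)s->emu_lines;
   margin  = (long)s->margin_slices * (long)s->real_lines / (long)s->slices;
   target  = emu_pos - margin;

   if (frame_done)
      s->emu_lines_done = 0; /* the next call starts a new frame */

   return target > 0 ? (int)target : 0;
}
```

With 240 emulator lines, a 1080-row display, 10 frameslices and a 2-slice margin, the call that completes emulator line 96 (the bottom edge of frameslice 4) returns 216, which is the top edge of frameslice 3 on the real display, matching the example in step (5b).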
![Example](https://www.blurbusters.com/wp-content/uploads/2018/06/ScanDiagramWithTearLinePositions-690x690.png)
_Note: 120Hz scanout diagram from a different post of mine. Replace with an emu refresh rate matching the real refresh rate, i.e. monitor set to 60 Hz instead. This diagram is simply to help raster veterans conceptualize how modern-day tearlines relate to raster position as a time-based offset from VBI._
![Lagless VSYNC jitter margin](https://www.blurbusters.com/wp-content/uploads/2018/03/BeamChasingJitterMargin-690x129.png)
Bottom line: As long as you keep repeatedly Present()-ing your incompletely-rasterplotted (but progressively more complete) emulator framebuffer ahead of the realraster, the incompleteness of the emulator framebuffer never shows glitches or tearlines. The display never has a chance to display the incompleteness of your emulator framebuffer, because the display's realraster is showing only the latest completed portions of your emulator's framebuffer. You're simply appending new emulator scanlines to the existing emulator framebuffer, and presenting that incomplete emulator framebuffer always ahead of real raster. No tearlines show up because the already-refreshed-part is duplicate (unchanged) where the realraster is. It thusly looks identical to VSYNC ON.
Precision Assumptions:
* Scaling doesn't have to be exact.
* The two frameslice offset gives you a one-frameslice-ahead jitter margin
* You can vary the height of consecutive frameslices if you want, slightly, or lots, or for rounding errors.
* No artifacts show because the frameslice seams are well into the jitter margin.
_Special Note On HLSL-Style Filters: You can use HLSL/fuzzyline style shaders with frameslices. WinUAE just does a full-screen redo on the incomplete emu framebuffer, but one could do it selectively (from just above the realraster all the way to just below the emuraster) as a GPU performance-efficiency optimization._
**Adverse Conditions To Detect To Automatically disable beamracing**
Optional, but for user-friendly ease of use, you can automatically enter/exit beamracing on the fly if desired. You can verify that common conditions such as the following are all met:
* Rotation matches (scan direction same) = true
* Supported refresh rate = true
* Module has a supported raster hook = true
* Emulator performance is sufficient = true
Exiting beamracing can be simply switching to "racing the VBI" (doing a Present() between refresh cycles), so you're just simulating traditional VSYNC ON via VSYNC OFF via that manual VSYNC'ing. This is like 1-frameslice beamracing (next frame response). This provides a quick way to enter/exit beamracing on the fly when conditions change dynamically. A Surface Tablet gets rotated, a module gets switched, refresh rate gets changed mid-game, etc.
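A minimal C sketch of that on-the-fly decision is shown below. The struct fields simply restate the four conditions listed above; the type and function names are hypothetical, and how each condition is actually detected is left to the frontend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
   SYNC_BEAMRACE, /* frameslice beam racing                        */
   SYNC_RACE_VBI  /* fallback: Present() between refresh cycles    */
} sync_mode;

typedef struct {
   bool rotation_matches;        /* scan direction is the same     */
   bool refresh_rate_supported;
   bool core_has_raster_hook;    /* module calls the raster poll   */
   bool performance_sufficient;
} beamrace_conditions;

/* Re-evaluate whenever conditions may have changed (rotation, module
 * switch, refresh rate change). Any failed check drops to racing the VBI. */
sync_mode choose_sync_mode(const beamrace_conditions *c)
{
   assert(c != NULL);
   if (c->rotation_matches && c->refresh_rate_supported &&
       c->core_has_raster_hook && c->performance_sufficient)
      return SYNC_BEAMRACE;
   return SYNC_RACE_VBI;
}
```

Keeping the fallback inside the same VSYNC OFF presentation path means switching modes only changes when Present() is called, not how the display is configured.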
## Questions?
I'd be happy to answer questions.