
Comments (5)

mdrejhon commented on May 21, 2024

Some discussion on Discord:
https://discord.com/channels/778539700981071872/778539700981071875/1130619440719999059


mdrejhon commented on May 21, 2024

I have updated the information to simplify the feature request.

  1. I removed API timing precision completely!
  2. I removed the requirement for SK to modify its existing frame presentation timing

The only immediacy consideration is the thread-safe boolean flag, which needs to be settable at any time during the previous frametime, prior to presentation of the new frame, right up until the new frame. (And the presentation hook thread will do a last-minute white-flash modification of that frame, even if I happen to set the flag only 0.1ms prior.)

The flash should be maybe 5% of screen width at either the left or right edge, leaving sufficient room for a common photodiode tester not to slam against the monitor bezel when testing that location. Paint it as a full-height rectangle along one screen edge.

EDIT: I just remembered I need an additional API to set/remove a non-flash color (e.g. black), so I don't get false positives from in-game material (when testing a game) or in-app material (custom test patterns optimized to measure specific things, such as display-only lag).
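To make the request concrete, here is a minimal sketch of what the hook-side logic could look like, assuming a D3D11 present hook. The names (`g_flashRequested`, `g_filterEnabled`, `FlashIfRequested`) are mine and purely illustrative, not an actual SK interface:

```cpp
// Hypothetical sketch of the requested API (not actual Special K code).
// The tester sets g_flashRequested at any time before the next Present;
// the present hook consumes it at the last possible moment.
#include <d3d11_1.h>
#include <atomic>

std::atomic<bool> g_flashRequested { false }; // set by the lag tester
std::atomic<bool> g_filterEnabled  { false }; // paint non-flash color between flashes

// Called from the Present hook, immediately before the real Present().
void FlashIfRequested (ID3D11DeviceContext1*   ctx,
                       ID3D11RenderTargetView* rtv,
                       UINT width, UINT height)
{
  const bool flash = g_flashRequested.exchange (false);

  if (! flash && ! g_filterEnabled.load ())
    return;

  // Full-height rectangle, ~5% of screen width, at the left edge.
  D3D11_RECT  rect      = { 0, 0, (LONG)(width / 20), (LONG)height };
  const FLOAT white [4] = { 1.0f, 1.0f, 1.0f, 1.0f };
  const FLOAT black [4] = { 0.0f, 0.0f, 0.0f, 1.0f }; // non-flash filter color

  // ClearView (D3D11.1) can clear a sub-rectangle of the backbuffer
  // without binding any pipeline state -- simpler than two triangles.
  ctx->ClearView (rtv, flash ? white : black, &rect, 1);
  ctx->Flush (); // re-flush the last-minute draw (see flush discussion below)
}
```

ClearView is just one convenient way to paint the rectangle; two solid triangles, as mentioned later in the thread, would work equally well.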


Kaldaien commented on May 21, 2024

Okay, I'm getting to work on some of this now. I can accommodate all of these requests except for item 3/4.

Fundamentally, flushing after submitting a finished frame is meaningless in modern graphics APIs.

To elaborate, Special K already flushes OpenGL/D3D9/D3D11 before adding CPU delays (in all framerate limiter modes). This is because in those APIs there's a small chance that queued render commands haven't been submitted to the GPU yet before the framerate limiter makes everything go idle. Normally, if there were no framerate limiter, the game's Swap Buffers/Present would include an immediate implicit flush; Special K replicates the immediate flush but delays the Present.

D3D12 and Vulkan have done away with explicit and implicit flushing completely; they begin executing commands on the GPU as soon as they're submitted.

The only scenario I can think of where a flush after present would do anything at all is single-buffered OpenGL. The command queue in all of these graphics APIs is flushed during any kind of double-buffered present; you can't finish a frame without the API flushing things.

I could eliminate the flush that Special K applies before framerate limiting, but I don't actually know what purpose that would serve. In my mind, that only introduces the possibility that after the framerate limiter wakes up, there's extra GPU work remaining to be done during presentation that might cause a missed deadline. It wouldn't increase framerate or anything like that; the GPU still has to do this work to complete a frame, and the sooner it begins, the better.
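A stripped-down sketch of that ordering (D3D11; illustrative only, not Special K's actual limiter code):

```cpp
// Sketch of the flush -> delay -> present ordering described above.
// Illustrative only; a real limiter sleeps for most of the wait instead
// of burning a CPU core the whole time.
#include <d3d11.h>
#include <dxgi.h>
#include <windows.h>

void LimitedPresent (ID3D11DeviceContext* ctx, IDXGISwapChain* swap,
                     LONGLONG deadlineQpc) // frame-pacing deadline, in QPC ticks
{
  // 1. Flush queued render commands so the GPU starts working on the
  //    frame now, instead of only at the (delayed) Present.
  ctx->Flush ();

  // 2. CPU-side delay until the frame-pacing deadline.
  LARGE_INTEGER now;
  do { QueryPerformanceCounter (&now); } while (now.QuadPart < deadlineQpc);

  // 3. Present; in D3D9/D3D11/OpenGL this includes an implicit flush anyway.
  swap->Present (0, 0);
}
```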


mdrejhon commented on May 21, 2024

Request 4 (other than Flush) is essential for display lag filter algorithm

I hope you are only referring to the Flush part of Request 4. My limiting factor in measuring display lag is the ability to fully filter software/GPU lag out of the measurement.

The margin between setting the flag and the presenter seeing the flag (then doing a last-minute draw before presenting) will be my error margin in how well I can filter the computer/software out of the display lag.

If I cannot set the flag even as little as 0.1ms before your frame-present does its last-minute flash, it throws the algorithm right out the window, and the other requests become useless without that critical gating factor.

That is to say, I need to be able to set a flag that tells your frame-present routine to suddenly add a last-minute draw. If that can't be done, the "beamraced display lag filter algorithm" is thrown out the window, as I would be fully unable to algorithm-out display lag from GPU/software lag...

That being said, I would still have other use cases for controlling the frame cap externally (e.g. controlling the frame rate of TestUFO running in a fullscreen window, if SK can successfully 'latch' onto Chromium's DirectX-based fullscreen buffer). I do have a need for a VRR-compatible TestUFO: Chrome can run with the --disable-gpu-vsync and --disable-frame-throttle command line options, and an external cap can then control the TestUFO framerate. On Windows I even see tearlines in Chrome when TestUFO runs at 2000fps, and it does kind of make WebGL fullscreen games work with VRR (if I force VRR via NVCP). So, provided SpecialK can latch onto the DirectX framebuffer of Google Chrome in fullscreen mode -- FINALLY, I can do VRR-TestUFO!

Then I could write a wrapper around Chromium Embedded Framework (CEF) and simply use the SpecialK API to control the framerate of TestUFO -- without modifying the Chromium source code, just forcing VRR and the --disable-gpu-vsync and --disable-frame-throttle flags. So that'd be another use case for me, unrelated to measuring display lag... meaning even if you're unable to do item 4, this isn't useless.

It's possible Flush might be unnecessary, but some findings:

You flush early, even before the frame is presented? Perhaps that's why SK behaves better than RTSS. How stable is the Latent Sync tearline? Can it stay stable to near raster-exact positions?

I found it kind of depended on the graphics drivers. Flush was necessary on certain GPUs such as the GTX 1080 Ti to stabilize the tearlines to near-pixel-exact positions. As long as I can get tearline steering as accurate with SK as with an external Present()+Flush() app, I'm satisfied.

https://www.youtube.com/watch?v=OZ7Loh830Ec
https://www.youtube.com/watch?v=tQW7-VbrD1g
https://www.youtube.com/watch?v=6M9XdACBUnk

Also, GPUs often fall asleep while waiting for a frame to finally Present(), so when the GPU finally wakes up, the Present() is a bit lagged. A repeat-Flush(), even if draw commands are done, sometimes serves a dual purpose of ending GPU power management right on the spot.

But I did notice that on my RTX 3080, adding a Flush didn't make as much of a difference.

One great way to monitor whether Flush worked is to intentionally configure Latent Sync or Scanline Sync to move the tearline on-screen (middle of screen) while displaying fast horizontally-panning material. In RTSS Scanline Sync, enabling the flush setting in the config (Present+Flush) suddenly stabilized the tearline, at the cost of high GPU utilization (slowdown). That's OK when I'm prioritizing display lag testing (the beamraced lag filtering algorithm is most accurate with perfectly stationary tearlines, as in the YouTube videos of Tearline Jedi).

Tearlines are amazing timer-precision debugging tools! Basically, you intentionally place a tearline at a specific raster, and experiment with various techniques to stabilize it (e.g. busylooping on QueryPerformanceCounter() produced much better timing precision than a timer). What kind of mechanism do you use to 'time' your scanline?

At a 67KHz scan rate, aka "horizontal refresh rate" or "number of scanlines per second" (1080p 60Hz), a 2-pixel jitter in the tearline translates to an error margin of only 2/67000 sec! Incredible that I can beamrace that accurately in mere MonoGame C#... So, the more stable the tearlines are during the beamraced lag filtering algorithm I've invented, the more accurate lag tests can become.
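To make that arithmetic concrete (toy numbers, not SK code):

```cpp
// Tearline jitter as a timing-precision readout (toy calculation).
// At ~67 kHz horizontal scan rate (1080p 60Hz), each scanline lasts
// roughly 15 microseconds of real time.
#include <cstdio>

int main ()
{
  const double scanRateHz   = 67000.0; // scanlines per second
  const double jitterPixels = 2.0;     // observed vertical tearline jitter

  // Timing imprecision implied by the jitter:
  const double jitterSeconds = jitterPixels / scanRateHz;

  std::printf ("%.1f microseconds of timing error\n", jitterSeconds * 1e6);
  // prints: 29.9 microseconds of timing error
  return 0;
}
```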

I don't know if Flush is needed here -- perhaps not -- but my litmus test will be comparing tearline stability in SK versus tearline stability in my external app (for the same complexity of graphics). I would test it on both a GTX 1080 Ti and an RTX 3080; they behave very differently in frameslices/sec count and tearline stability. Surprisingly, the 1080 Ti has more stable rasters than the RTX 3080 due to the more hyper-pipelined design of the 3080, but I can still get sub-0.1ms precision on tearlines.

But since step 4 (last-minute flash draw) is absolutely essential/germane to the algorithm, and since that buffers a single last-minute draw command (a single white rectangle, likely drawn as two solid triangles), that likely has to be flushed "again" to be deterministic. In other words, you've flushed the earlier draw commands -- but since you're adding a last-minute white rectangle to the framebuffer, the framebuffer needs to be re-flushed so the draw command doesn't sleep/power-manage/etc. with amplified tearline jitter. On the other hand, it is very driver-dependent and GPU-dependent; it reduced the tearline-jitter error margin of some GPUs by 90%.

Regardless of what you do, as long as I can achieve near-zero tearline jitter (at least on one of my GPUs), I'm happy, even if I have to cherrypick a GPU that doesn't need a bit of additional flush help. Although that potentially limits the market.

I already did experiments with MonoGame in C#, and in theory I could just write a standalone app doing my "beamraced display latency filtering algorithm" -- since as long as I time tearlines that precisely, I've clearly succeeded at almost zero (way sub-1ms) Present()-to-VGA-output latency (verified by oscilloscope on a VGA output, via a DVI-I adaptor on an older GPU that still had the analog pins). From that basis, I was able to come up with a method of filtering the computer/system out of display latency, since GPU-output-to-photons is considered largely the display's latency (albeit digital cables add a bit of transceiver latency to the display latency).

Regardless, instead of being forced to use only an app I write, I want to be able to use multiple software programs, including software I did not write -- that's why I'm asking SK to add an API instead. It also allows measuring game lag, etc. Fewer wheels reinvented.


mdrejhon commented on May 21, 2024

What kind of mechanism do you use to 'time' your scanline?

Reminder! I'm curious; I could look at the code, but:

  • Do you have an optional busyloop-based precise-presenter?
  • Do you use precise GPU scheduled asynchronous frame-presentation APIs?
  • Have you verified you can temporarily get reliable, perfect tearline locks (1-to-3-pixel jitter in tearlines)?

BTW, I often use tearline jitter as a visual (eyeballing) timing-precision debugging tool. With the display's horizontal scan rate (via QueryDisplayConfig()): your timing imprecision in SpecialK = (vertical tearline jitter in pixels) / (horizontal scan rate in scanlines per second).
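A sketch of reading the horizontal scan rate programmatically via the Win32 QueryDisplayConfig path (first active display only; error handling trimmed for brevity):

```cpp
// Sketch: read the horizontal scan rate (hSyncFreq) of the first active
// display path via QueryDisplayConfig. Error handling trimmed.
#include <windows.h>
#include <vector>
#include <cstdio>

double GetHorizontalScanRateHz ()
{
  UINT32 numPaths = 0, numModes = 0;
  GetDisplayConfigBufferSizes (QDC_ONLY_ACTIVE_PATHS, &numPaths, &numModes);

  std::vector<DISPLAYCONFIG_PATH_INFO> paths (numPaths);
  std::vector<DISPLAYCONFIG_MODE_INFO> modes (numModes);
  QueryDisplayConfig (QDC_ONLY_ACTIVE_PATHS, &numPaths, paths.data (),
                      &numModes, modes.data (), nullptr);

  // The target mode holds the video signal timing, including hSyncFreq.
  const auto& tgt = paths [0].targetInfo;
  const auto& sig = modes [tgt.modeInfoIdx].targetMode.targetVideoSignalInfo;

  return (double)sig.hSyncFreq.Numerator / sig.hSyncFreq.Denominator;
}

int main ()
{
  std::printf ("Horizontal scan rate: %.0f Hz\n", GetHorizontalScanRateHz ());
  // Timing imprecision = (tearline jitter in pixels) / (scan rate) seconds.
  return 0;
}
```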

Just as raster jitter in yesteryear 8-bit games was a debugging tool for beamracing / raster-interrupt precision, tearline jitter is still my timing-precision monitor today.

I don't know if you now use precise GPU-scheduled frame presentation APIs (e.g. telling the GPU that you want a frame presented at a specific exact microsecond). Sadly, such APIs are not available on all GPUs, and I want broad compatibility with all kinds of GPUs (including Intel GPUs). The Beam Raced Display Lag Filter Algorithm is very GPU-independent.

Tearline jitter on one GPU became most stable with a double flush via "Flush-BusyLoop-Present-Flush", although some GPUs had to be thrashed with dummy draws (1-pixel faint-color changes, in a corner) to prevent power-management jitter, aka "Draw-Flush-Draw-Present-Flush-Draw-Present-Flush-Draw-Present-Flush-[until-1-to-2ms-prior]-DrawFlash-Flush-BusyLoop-Present-Flush". Basically, busyloop only for 1ms-ish or so, and only for microsecond tearline alignment. But I can just do that in my own app most of the time. Basically thrashing the GPU to prevent power-management sleeps on GPUs that don't have precise-scheduled asynchronous frame presentation APIs. One can flush before/after present, but on one GPU, a post-present flush was needed for a rock-stationary tearline.

SpecialK doesn't have to do the thrash-trick; it's only useful to prevent GPU power management (sleeping between frames during low GPU utilization). But if I am only doing 1 tearline per refresh cycle, with simple test patterns, I'm only using the GPU 1% of the time; then some GPUs go to sleep, and scheduled presentation jitters by a millisecond. So all the thrashing/flushing keeps the GPU on its toes -- keeps its power use high and clockrate high -- and, assisted by a final CPU busyloop on QueryPerformanceCounter (eating 100% of one CPU core for 1ms), produces microsecond-accurate rock-solid-stationary tearlines (on some systems) that sometimes jitter by only 1 pixel. It's also useful if you need tearingless VSYNC to work properly at high frame rates and low CPU utilization, in reduced-Vertical-Total situations (where it's hard to dejitter the tearline between refresh cycles).
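A sketch of one variant of that thrash sequence, assuming D3D11 (all names illustrative; simplified to dummy-draw+flush pokes rather than full dummy Presents):

```cpp
// Sketch of the anti-power-management "thrash" loop described above
// (D3D11; illustrative only). The dummy draw keeps the GPU awake between
// frames; the final busy-loop does the microsecond tearline alignment.
#include <d3d11_1.h>
#include <dxgi.h>
#include <windows.h>

void ThrashedPresent (ID3D11DeviceContext1* ctx, IDXGISwapChain* swap,
                      ID3D11RenderTargetView* rtv,
                      LONGLONG presentQpc, LONGLONG qpcFreq)
{
  // Keep the GPU clocked up until ~2 ms before the present deadline:
  // a tiny 1-pixel faint clear in a corner, flushed so it actually runs.
  D3D11_RECT  dummy     = { 0, 0, 1, 1 };
  const FLOAT faint [4] = { 0.004f, 0.0f, 0.0f, 1.0f }; // ~1/255 change

  LARGE_INTEGER now;  QueryPerformanceCounter (&now);

  while (presentQpc - now.QuadPart > (2 * qpcFreq) / 1000) // > 2 ms left
  {
    ctx->ClearView (rtv, faint, &dummy, 1);
    ctx->Flush ();                       // prevent GPU power-down
    Sleep (0);                           // yield briefly between pokes
    QueryPerformanceCounter (&now);
  }

  // Final busy-loop on QPC for microsecond tearline alignment.
  do { QueryPerformanceCounter (&now); } while (now.QuadPart < presentQpc);

  swap->Present (0, 0);
  ctx->Flush (); // post-present flush helped on some GPUs (see above)
}
```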

I would disable all of these flush-hacks when playing games or measuring game latency -- they are only useful for filtering GPU/computer lag out of display lag, via perfectly stationary tearlines.

I can forgo SpecialK doing the flush and just use my own app to do the flushes, but it means I sometimes have to use a separate app to filter display lag before I benchmark the game separately. There are also a few esoteric use cases (stabilizing tearlines in tiny VBIs at very high refresh rates, on GPUs that don't have precise scheduled asynchronous frame presentation capabilities) -- like tearingless VSYNC OFF during low GPU utilization, where power management adds annoying tearline jitter if the GPU isn't thrashed (with a dummy pixel) between frames, unless you have an API to tell the GPU not to power-manage.

Now, that said, I can forgo flush and just cherrypick a GPU, rather than have a GPU-independent "Beam Raced Display Lag Filter Algorithm" that uses a variety of thrash/flush tricks.

It's just a special consideration...

