
Emulator Latency - panic
http://byuu.org/articles/latency/
======
pjc50
The only real way round this is not to do the emulation on a commodity multi-
user system (with all its tremendous advantages!), but in some kind of
hardware or bare-metal software, where you can access the original gamepads
with the original sub-millisecond latency.

Still can't quite work around the latency imposed by the display, unless you
go back to CRT or build your own LVDS driver board.

I had the opportunity to play a _Tempest_ arcade machine recently, and the
combination of low latency and the weighted spinning controller was really
tangible.

~~~
byuu
Yes, the best thing that could happen to emulation latency reduction would be
the creation of a skeleton OS framework. Something that piggybacks and can use
existing Linux or BSD hardware drivers, but puts the emulator in kernel space
and lets it manage the hardware as directly as is practical.

I talk about this briefly in my infeasible section. But I'm mostly referring
to myself there: I just don't have the skill and the time to write my own
real-time OS for this.

However, I do hope that one day we see a serious project that isn't just
"lightweight Linux" (eg Lakka, etc) or something running on top of DOS.

I presume that 98% of emulator users aren't going to be willing to boot into
an "emulator OS" just to play games; so it's going to be a lot of work for a
little gain.

In that vein, the ultimate latency reduction would be an entire FPGA emulator,
like kevtris' FPGA NES
([http://kevtris.org/Projects/console/video/page1.html](http://kevtris.org/Projects/console/video/page1.html)),
but now you're approaching extreme effort for a userbase of maybe ten people
:/

~~~
pjc50
I think the C64 Direct-to-TV by Jeri Ellsworth was quite a success in terms of
userbase. Of course the problem with making and selling hardware is increased
IP scrutiny.

I think the Raspberry Pi or Beaglebone would make good candidates for
baremetal emulation. The Pi has a formidable but barely documented vector
processor, and the Beaglebone has the (also under-supported) PRU realtime
subsystems which would be ideal for controllers and audio.

Both also have enough ecosystem to make it worth doing. Retropie is already
popular, although it's a straight port of the normal userland emulators and
doesn't attempt to do anything fancy with the hardware. I think there is a
"market" (not the kind that pays money, but a decent userbase) which will
happily dedicate a cheap SBC to booting straight into emulators.

~~~
stepvhen
Last time I used retropie it wouldn't correctly render vibrato effects in the
music, so all of FFVI's wonderful score came out flat and awful. That was over
a year ago, however; maybe it's been updated.

~~~
nibnib
Retropie is Linux right? That will have the same latency issues discussed. A
more realtime OS is needed.

~~~
anonbanker
retropie is an emulator frontend for linux. he was likely using Snes9x.

Also, linux has realtime kernel patches, which help a lot. Distributions like
KXStudio ship them by default.

------
ilitirit
There are well-known methods of significantly reducing latency (well, usually
the methods are exclusive to certain emulators), but often the implication is
that you're not sticking to faithful emulation of the hardware/software any
more.

FWIW: I think byuu overreacted to the last few questions regarding latency-
related issues.

[http://board.byuu.org/phpbb3/viewtopic.php?f=8&t=1058&start=...](http://board.byuu.org/phpbb3/viewtopic.php?f=8&t=1058&start=60)

Apparently he won't even look at the test results...

Besides all that, while I completely agree that the easiest way to reduce
latency is by getting better hardware etc, in terms of competitive gaming 1f
of lag is quite a big deal at high levels of play for _some_ games. For
example, games often utilize the concept of "just frame" inputs, meaning that
inputs are required on a very specific frame (the definition has been slightly
relaxed in recent years). Now there are two ways to do these: Muscle
Memory (the easiest, you just repeat the move non stop in practice mode till
it becomes second nature), and using an external cue (eg. audio/video). When
it comes to muscle memory, input lag isn't that much of a deal breaker because
once the first move is inputted, everything else will follow correctly, even
if the entire sequence is 1f or 2f off. However, when it comes to using
external cues, even 1f latency can cause the sequence to fail.

~~~
stepvhen
I understand your argument, but I don't think anybody is playing SNES games
competitively on a non-negligible scale. Moreover, if such a game exists, it
would probably fall into the pathological cases he mentioned.

~~~
ilitirit
The thing of course is that the so-called minority _are_ in fact the people
who raise the issue of input lag in the first place. For example, the entire
Speed Running community considers this an important topic.

While they do play on "real" hardware, when runners practice for runs they
generally load up a particular scenario in an emulator and repeat a
particular section that may require just-frame inputs over and over until
they are comfortable enough that it can be considered a valid strategy.

So while it's true that the SR community only forms a minority of gamers, any
sort of argument that relies on the experience of the majority effectively
rules them out of a conversation that affects them the most.

But besides that, it could be argued that these days, SNES Speed Runners
probably _do_ form a non-negligible subgroup, considering that the majority
of gamers don't play games like SMW, Metroid or Megaman any more.

For those who aren't aware, the Speed Running community has charity events
that have in the past raised over $1m for cancer research etc. They may be
"small", but they are far from insignificant in terms of their gaming
presence.

[https://gamesdonequick.com/](https://gamesdonequick.com/)

~~~
byuu
> The thing of course is that the so-called minority are in fact the people
> who raise the issue of input lag in the first place.

And I consider input lag the achilles heel of emulation replacing real
hardware. And if it's some kind of qualifier, I've done speedrunning stuff on
real hardware and emulation too (Ninja Gaiden especially.) Point being: I not
only take it very seriously, I'm in the position to actually do something
about it. And have. I've spent a ton of time with a lot of ideas like this.
The one I wasted the most time on was probably this one:
[http://www.ouma.jp/ootake/delay.html](http://www.ouma.jp/ootake/delay.html)

Before you consider me unreasonable on this topic, dig through all the posts
about input latency on the "bsnes megathread" on ZSNES forums; the bsnes
subforums on the ZSNES forums; the five years of posting history on my own
forum that InMotion Hosting corrupted with a botched MySQL upgrade; the three
years of new posting history on my new forum instance; and all of the
discussions I've had on all of the other sites on the internet over the past
twelve years; and read my article in full please.

See? Look, this is me, talking about this issue, in 2008:
[http://board.zsnes.com/phpBB3/viewtopic.php?p=168556&sid=f27...](http://board.zsnes.com/phpBB3/viewtopic.php?p=168556&sid=f27384ebda7c77d13b2dae63a83697f8#p168556)

See how polite I used to be on this topic? Now tell me you wouldn't be tired
and agitated after _twelve years_ of new people you've never heard of before
popping up and telling you that you're doing it all wrong and they totally
have a _revolutionary_ new way to vastly improve latency.

~~~
ilitirit
I know about the ootake "fiasco", and to be honest, it, and everything you've
been asked about input lag in the past _really_ doesn't matter. (If you've
ever been in a support job, you'll know this is one of the first things you
learn). You implied to a poster in that thread that if he upset you, you would
ban him. Now I'm not about to tell you how to run your forum (or write your
emulator!), but I don't consider that very reasonable. If you don't want to
answer questions about latency, why participate in the first place?

Secondly, I really don't mean any disrespect, but your article completely
misses the point. Competitive gamers don't care about those types of reasons,
and not because they are being obtuse or unreasonable - there is a healthy
overlap with software developers, after all. We too have been studying the
topic for years.

For example, you care about faithful emulation more than anything - we don't.
We don't mind playing without the sound, or turning off graphics layers, or
even turning off a complete subsystem if it improves latency and maintains a
_faithful_ framerate (it's no good if it doesn't slow down when it's supposed
to - see Cave Shmups - or if it runs too fast - see SF2:HF). We often have
different emulators for different games simply because one handles a certain
case better than another. We'll use emulators like Shmupmame that use tricks
to make the input lag closer to what it's like on the arcade cab. And we
certainly don't think in terms of milliseconds (and of course we understand
the issues mentioned in your article...). We only think in terms of frames and
everything around that. What is the FPS? How often does the game loop run?
Once/multiple times every frame? How often is the game state rendered? Can the
emulator process everything before the game's internal loop ends? What about
if I use tricks like running @ 144hz? What about if we alter the input handler
so that it disregards everything but player 1's inputs, or "impossible"
combinations, or buttons that don't exist in that game? Etc etc etc.

And of course we're no stranger to resistance from emulator developers. That's
why so many forks exist. I have my own forks of MAME. And, as with
everything else, sometimes we are wrong. Sometimes the devs are wrong
(recently: see Hunter K's discovery that shaders can in fact introduce
latency, like players have been claiming for years). Often both parties are
wrong and everyone learns something new. That's just the way it goes.

My point is that many of the people you interact with actually do know what
they are talking about. They might not know the minutiae about the inner
workings of Higan, but often this is exactly why they post on your board, as
evidenced by the thread you locked. And in terms of your article, you skip the
known methods already used by developers (and hackers), and you don't offer
anything else besides something else we already know - getting as close to the
bare metal as possible. Is there really anything that you added to the
conversation, besides describing how Higan works? Don't get me wrong, it's
useful for someone who doesn't know anything about the topic. But for everyone
else? That's debatable.

It's completely understandable that you are tired of the subject. But in that
case you should just not respond instead of alienating yourself from your
users. People will figure it out eventually, or just do what they've been
doing for years - use the method that works _for them_ even though they don't
fully understand it.

------
scottlamb
The article talks about the OS audio buffering (for mixing streams from
multiple applications) and application audio buffering (for mixing streams
from within the application), each adding 10–40 ms of latency.

Why is the application audio buffer necessary? Can't the application just send
all its streams on to the OS separately to let the OS do all the mixing with a
single-layer approach? Is there some unrealistically low bound on the number
of possible streams or something that makes this impractical? Could that be
fixed?

This seems like a simple way to save 10–40ms, without giving up mixing audio
across applications (as is necessary with the "WASAPI exclusive mode" the
author described).

~~~
haberman
What kind of "stream" did you have in mind?

For the OS to do software mixing, it is basically going to run a "for" loop
over the audio buffers and add them up to produce the audio buffer that goes
to the hardware. When is it going to run that loop?

~~~
scottlamb
> What kind of "stream" did you have in mind?

The same kind the author means. I didn't coin this term or introduce it to the
discussion.

> For the OS to do software mixing, it is basically going to run a "for" loop
> over the audio buffers and add them up to produce the audio buffer that goes
> to the hardware. When is it going to run that loop?

Right, the author is saying that the OS is already doing this, and
additionally the application does the same thing before sending a combined
stream to the OS. So, to answer your question of "when is [the OS] going to
run that loop?": the same time it does now. I'm not proposing any changes to
how the OS works.

My question is: why does the application need to do that work? Why can't it
send each stream to the OS to have them be combined in only one place?

~~~
haberman
Ah, I see what you are asking. Sorry I misunderstood.

I doubt that the application mixing is actually adding any latency compared to
if the application only had a single logical stream internally.

Regardless of the application's internal structure, you will at every moment
have a set of audio buffers:

1. the one the sound card is currently playing

2. the one the OS is preparing for the sound card (it gets one buffer's worth
of time to synthesize this)

3. the one the application is preparing for the OS (it gets one buffer's
worth of time to synthesize this)

If the OS isn't doing any mixing, then you could make buffer 2 and 3 the same
and save a buffer's worth of latency. But if the OS is doing mixing, then it
needs a chance to add up all the application buffers before they go to the
sound card (that is its "synthesize" step). So the application can't be
writing directly into the OS's buffer.

You might ask: "why can't the application and OS both do their work within a
single buffer's worth of time?"

Hmm, I guess it is an interesting question whether the OS could, inside the
write() call, do the mixing immediately. I'm reaching the point where I'd have
to speculate: I'm not exactly sure how existing OSs design their mixing and
whether this would be feasible or not.

------
nibnib
Great article.

> This process can incur quite a bit of CPU time as well. Attempting to poll
> the keyboard state, mouse state, and all attached gamepads can easily eat
> several milliseconds per call. So it's just not possible to poll every
> millisecond.

If input software layers mean it's not possible to poll every millisecond, why
bother polling the hardware at 1kHz? Is it a just-in-case solution to increase
hardware polling to maximum?

I'm also curious if there are any harder numbers available, maybe by
triggering a USB key input and measuring time for a test program to register
the change. I know this sort of thing is done to compare CRTs and LCDs, I've
never seen it done for a whole PC.

~~~
byuu
Thank you! I was really worried about the tone being too harsh or know-it-all.
Wasn't really meant for audiences that weren't aware of the context. But as an
emulator author, you get people presenting new zany latency reduction ideas
that defy the laws of physics all the time, and they completely dismiss your
own experience in the field, and it's like kids constantly saying, "are we
there yet?" on a long car ride. Eventually you lose your cool, and then well
... you sound like me in that article >_>

> why bother polling the hardware at 1kHz? Is it a just-in-case solution to
> increase hardware polling to maximum?

Yes, pretty much. More of a because-we-can and to combat the cumulative
effects of latency (death by a thousand papercuts) however possible. If you
were to push it to 200Hz (5ms), then it becomes possible that your OS API
returns states immediately after and it stacks with your emulator latency of
5ms to form a 10ms latency. Push it to 1000Hz and that drops to a 6ms maximum
latency.
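The arithmetic here can be made concrete; worst case, a button press lands
just after an OS-level poll, so one full polling interval stacks on top of the
emulator's own 5ms input window (a sketch, assuming the two delays simply
add, as described above):

```python
def worst_case_latency_ms(poll_hz, emulator_window_ms=5.0):
    """A press landing just after a poll waits one full polling
    interval, plus up to one emulator input window on top."""
    poll_interval_ms = 1000.0 / poll_hz
    return poll_interval_ms + emulator_window_ms

# 200 Hz polling:  5 ms + 5 ms = 10 ms worst case
# 1000 Hz polling: 1 ms + 5 ms = 6 ms worst case
```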

It is indeed silly. No one is going to perceive a worst-case 4ms difference.
(And I say worst-case because these sorts of misses tend to average between
best-case and worst-case, so in practice it's probably half that bad.)

We're trying to chase the latency of a gamepad that you literally tell
_exactly_ when to poll the inputs, and which then, within mere cycles on a
21MHz clock, starts reading out the results from its shift register.

> I'm also curious if there are any harder numbers available

That would be fun. I'll admit that many of the numbers are estimated. In the
end, we can only observe the net total of all latency by pressing a button and
seeing how quickly the sprites respond visually and aurally. But it's probably
possible to isolate similar test cases for each source of latency. In a lot of
cases (kernel audio mixing, keyboard responses), we're probably talking much
smaller latencies than CRT vs LCD monitors, so you'd need a huge amount of
precision.

~~~
nibnib
> as an emulator author, you get people presenting new zany latency reduction
> ideas that defy the laws of physics all the time, and they completely
> dismiss your own experience in the field, and it's like kids constantly
> saying, "are we there yet?" on a long car ride.

Yep, that seems to be the usual for nearly every emulator scene.

> But it's probably possible to isolate similar test cases for each source of
> latency. In a lot of cases (kernel audio mixing, keyboard responses), we're
> probably talking much smaller latencies than CRT vs LCD monitors, so you'd
> need a huge amount of precision.

I was thinking of a test program that inverts/cycles the screen colour on
registering a keypress. Put a photodiode in front of the monitor and use an
electrical switch across the controller button contacts. Cheap parts and an
oscilloscope will give microsecond resolution for the end-to-end net case.
Unfortunately not everyone has this kind of setup lying around.

I seem to remember someone marketing a commercial device that did something
like this for timing CRTs vs LCDs but can't remember the name of it.

~~~
ilitirit
Leo Bodnar's Lag Testing device:

[http://www.leobodnar.com/shop/?main_page=product_info&produc...](http://www.leobodnar.com/shop/?main_page=product_info&products_id=212)

[https://www.youtube.com/watch?v=zVw6T9rw0xU](https://www.youtube.com/watch?v=zVw6T9rw0xU)

~~~
nibnib
Thanks :)

------
gregpardo
Been following byuu since my rom hacking days. I always stop by to read his
articles and this is another good one.

------
nwmcsween
If the input is steady could you not do some sort of predictive rendering
where a frame or more is prerendered?

------
Zardoz84
Well, I never felt that ZSNES on DOS (on a 486) was unresponsive compared to
my NES clone, back when I was playing Super Mario 3 on the NES and on ZSNES.

~~~
jdbernard
In case you aren't trolling, the reason you are getting downvoted is probably
because you are comparing apples to oranges. ZSNES is not an accurate
emulator. byuu's goal is to emulate the systems perfectly. Here is an article
he wrote about it:

[http://arstechnica.com/gaming/2011/08/accuracy-takes-power-o...](http://arstechnica.com/gaming/2011/08/accuracy-takes-power-one-mans-3ghz-quest-to-build-a-perfect-snes-emulator/)

So yes, ZSNES feels a lot more responsive than higan. But that's because it
cuts a lot of corners with regards to accuracy of emulation. It is probably
the least accurate emulator for this reason. It prioritizes speed. In light of
that, your comment doesn't really add anything to the discussion.

~~~
byuu
If you can run higan at 100% speed, then ZSNES has more lag than higan does
(due to double buffering, polling only once per frame, etc.)

For me, the issue with the OP is a) comparing to clone NES hardware, and b)
it's subjective. Different people react differently to latency. The numbers I
talk about may be largely estimated, but they're objectively _real_ latencies
that really do exist, even if you can't observe them personally.

It's a very good thing if you can't. Makes gaming a lot more fun. Testing
myself, I don't really detect a change in latency under emulation alone until
I simulate adding about 75ms more than is already there. However, I did notice
a problem when I moved from playing Ninja Gaiden Trilogy (I know, bad port) on
my SNES to higan: all of my timed moves were failing (it's a game that
requires pixel-perfect movements); but I adapted pretty quickly and was able
to beat the game. But again, this is all subjective stuff, so it's not really
adding to the technical discussion any.

~~~
jdbernard
> If you can run higan at 100% speed, then ZSNES has more lag than higan does

Yes, I should have said, "ZSNES feels a lot more responsive than higan on less
powerful systems."

------
pjc50
I've just thought up a conceptually simple, but tremendously difficult-to-
implement and CPU-hungry, way of working round this: avoid latency by seeing
into the future.

Modern GPUs give you a ton of parallel processing. Old gamepads are binary
with a fairly limited number of buttons not all of which can be pressed at
once by someone with two thumbs. The emulated RAM state is also fairly small -
kilobytes.

So, run lots of copies of the emulator. One for each possible change in button
press status from the current state. At each frame (50/60Hz), look at the
current actual inputs and pick (frame,20ms audio samples) from the available
precomputed choices. Start calculating the next frames based on the winning
version of the emulator state, and discard the rest.

(This is effectively branch prediction at the macrostate level).
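For a 12-button SNES pad this scheme means one emulator instance per possible
input state: 2^12 = 4096 in the worst case. A sketch of just the bookkeeping,
with the actual emulation stubbed out and all names hypothetical:

```python
from itertools import product

# The 12 digital inputs on a standard SNES controller.
BUTTONS = ["B", "Y", "Select", "Start", "Up", "Down",
           "Left", "Right", "A", "X", "L", "R"]

def all_input_states():
    """Every combination of the 12 buttons: 2**12 = 4096 states.
    Each state is the frozenset of buttons held down."""
    return [frozenset(b for b, pressed in zip(BUTTONS, bits) if pressed)
            for bits in product([False, True], repeat=len(BUTTONS))]

def pick_precomputed(precomputed_frames, actual_input):
    """Once the real input arrives, keep the speculatively emulated
    frame that matches it and discard the other 4095."""
    return precomputed_frames[frozenset(actual_input)]
```

In practice most of those states are unreachable (Up+Down, Left+Right, or
three-button chords a thumb can't make), so the set could be pruned; but as
noted below, any state you prune and then see forces a full re-emulated frame.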

~~~
larsiusprime
Mentioned in the article:

> There are magic tricks beyond this, such as emulating every possible input
> one frame into the future, to cut out a single frame of latency. But with
> only one controller, this would require higan to emulate up to 4096
> simultaneous SNES systems and well ... higan just isn't that fast, sorry.

~~~
mikeash
Why does it need to emulate every possible input, rather than emulating a
single path assuming that the controller state remains unchanged? You can only
display one future frame to the user, after all, and that seems like the best
one to display.

~~~
byuu
I realize a lot of the 4096 states are extremely unlikely (especially Up+Down
or Left+Right ... not physically possible on an unmodified controller); but if
the inputs you predict end up _wrong_ , then you have no choice but to run the
frame again normally. This is going to cause extreme jitter, where the input
lag doubles for some frames.

Imagine audio stuttering from a scratched CD, or a framerate that suddenly
dips from 60fps to 30fps for a moment and then resumes. With input, the effect
is going to be even more jarring.

If you're going to do this, you absolutely can't have a miss. Ever.

~~~
mikeash
Yep, makes total sense to me now. I was thinking of optimistic prediction, not
deterministic precomputation.

