Debugging memory corruption: who the hell writes “2” into my stack? (2016) (unity.com)
409 points by darknavi on Nov 14, 2021 | 144 comments



I read this and couldn't shake the feeling that it's a fantasy tale because it mentions the mystical Unity support engineer who actually fixes bugs - the one I haven't been able to get a hold of for the past 5 years, despite having critical bugs escalated by support... But then I noticed this is from 2016, so before they started kicking out experienced people to polish their profit margin for the IPO.

Still an amazing debugging story, though :)


Luckily, this particular developer is still at Unity and still fixing bugs—and doing the world a service by writing fascinating breakdowns about them https://blog.unity.com/technology/fixing-time-deltatime-in-u...


I get a 502 bad gateway from that link?


"Mythical", I guess you mean. A mystical engineer might be even less useful.


:)

I now looked up "mystical" in Merriam-Webster: "having a spiritual meaning [..] that is neither apparent to the senses nor obvious to the intelligence"

That sounds like a pretty good description of my relationship with Unity's support engineers. I hope they exist, but to me it's more of a religious thing because I have never met one.

"mythical" says: "existing only in the imagination: fictitious, imaginary, having qualities suitable to myth: legendary".

That also fits well, because so far they exist only in my imagination but reading this story, I am impressed by their legendary problem-solving power.

Thanks to you, today I learned the difference between two similar words :) And now I can even more clearly articulate someone's progress on being less useful ^_^


You can pray to god, and you can pray to unity support. Many people claim they exist and perform miracles, but hardly anyone can point to fungible proof


Which makes tangible vs. fungible your next dictionary dive :^)


Well, non-tangible token makes a lot of sense.


A tangible token is a ... coin.


Had the same thought; my experience with Unity support is that it's a black hole. Good that there are some good examples, even if from times past.


I think it's a sort of tragedy of the commons situation.

When I first met some of the Unity people at the Nordic Game Jam in 2012, they seemed like an awesome team making game development accessible to everyone. At that time, UE was still closed source and out of every indie's reach. And Unity was Mac-only. Since then, Unity has become the de-factor platform for cheap outsourced mobile games, reskin spam, and app store gambling scams. They now have millions of developers using their game engine, but it's an army of people building 1€ apps. Obviously, you can't charge them much because their revenue is so low. And so they had to make support cheap to compensate.

But the part which I don't get is why Unity doesn't offer paid support for medium-sized indie studios. And why they don't publicly sell source code access. I mean their biggest competitor UE4/5 already has all source code on GitHub.


"De facto" - No hyphen, no 'r'.

https://en.wikipedia.org/wiki/De_facto

Post left in the spirit that the OP was a typo, but that someone else might not know and be interested in the Latin roots of this phrase.


Thanks :)


They do sell support

https://unity.com/success-plans


The paid support is not cheap and mostly useless, at least in the Asia region. We still need to go through the normal bug report system, create reproduction steps for them, and get denied for back-porting bug fixes every single time. The only thing that changed is that there's a "bug fix progress update" to make you feel a little bit better. We terminated the support right after the term expired.


They sell both paid support and source code for extra money in addition to the subscription. It's in the plan comparison table.


I contacted them about it in the past and they refused to even tell me the price. That's why I said they need to publicly sell it, because as-is I don't know anyone who succeeded in buying it.


It's determined on a per-contract basis. Expect to spend quite a few months making a deal, and having it come with a bunch of strings attached.

The company I worked for had Unity source access, but we were required to submit any source changes back to Unity (even if they chose not to integrate them), and announce our game at UnityCon 2018.

They also wanted us to help write some of the cutscene timeline tools. I don't know if they used any of the code we wrote, but we definitely did a bunch of work on those timeline tools and sent them back to them.


This is worse than I could possibly imagine. Did your studio at least get credited for the work on timeline tools?


I don't see why we would -- I don't think Unity has any individual credits listed for their engine developers. That's standard for software development; AFAIK Adobe is really the only mainstream software company that still lists credits in their product releases.


> In 2012...Unity was Mac-only

The timeline is a bit off here. Unity was initially Mac-only, but only for a short time. It ran on Windows from version 2.5 in 2009.[1]

[1] https://blog.unity.com/technology/unity-25-for-mac-and-windo...


IL2CPP unfortunately still has issues. I had a friend playtest some of my gamejam entries and any IL2CPP build would just randomly crash on his laptop. Unfortunately I couldn't replicate it, so eventually I just swapped to using the Mono build, which works fine.


You are forced to use IL2CPP on mobile since the Google/Apple mandated ARM64 build is not available through Mono. (Mono itself does support ARM64, just Unity doesn't bother to port it) I recently had a bug which erroneously stripped functions in generated IL2CPP code, and I had to trace generated IL2CPP line by line, it’s so much fun.


Wow, I didn't know that. I can imagine the frustration!


When another thread pushes a socket to be processed by the socket polling thread, the socket polling thread calls select() on that socket. Since select() is a blocking call, when another socket is pushed to the socket polling thread queue it has to somehow interrupt select() so the new socket gets processed ASAP. How does one interrupt select() function? Apparently, we used QueueUserAPC() to execute asynchronous procedure while select() was blocked… and threw an exception out of it!

Not taking anything away from the great debugging, but that is just incredibly bad design and reeks of bugs from a mile away. System calls block for a reason. If you can't handle the blocking, how about a non-blocking architecture? But in this case, the engineer thought "hey, I don't want to block on my precious main thread, so let's just give this blocking thing to another thread, that'll surely solve it!". Except it doesn't when you use that same thread as a worker to execute all the other threads' blocking crap ... but then write the queue push() function to be blocking as well!

Keep it simple folks. Bad design always breeds more bad design.


It's particularly surprising to me because the whole point of select() is to handle these sorts of cases easily. The technique I've often seen is just to have a pipe or other local connection to the thread using select, and then you wake up select when you need it to add something to the queue.
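For anyone who hasn't seen it, here's a minimal sketch of that wake-up technique, assuming a POSIX-style pipe() and illustrative names (the article's actual fix used a loopback socket, which works the same way and also covers Windows):

```cpp
#include <sys/select.h>
#include <unistd.h>

static int wake_pipe[2];  // created once with pipe(wake_pipe); [0]=read end, [1]=write end

// Called by any thread after it pushes a new socket onto the queue.
void wake_poller() {
    char b = 1;
    (void)write(wake_pipe[1], &b, 1);  // makes the pipe readable, unblocking select()
}

// Simplified body of the socket polling thread.
void poll_loop(fd_set watched, int maxfd) {  // 'watched' = sockets currently polled
    for (;;) {
        fd_set readfds = watched;
        FD_SET(wake_pipe[0], &readfds);      // watch the pipe's read end as well
        int nfds = (wake_pipe[0] > maxfd ? wake_pipe[0] : maxfd) + 1;

        if (select(nfds, &readfds, nullptr, nullptr, nullptr) < 0)
            continue;                        // e.g. EINTR

        if (FD_ISSET(wake_pipe[0], &readfds)) {
            char drain[64];
            (void)read(wake_pipe[0], drain, sizeof drain);  // drain the wake-up bytes
            // ...dequeue newly pushed sockets and add them to 'watched'...
        }
        // ...service the sockets that are set in 'readfds'...
    }
}
```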


That was their fix.


My over-editing of my comment got rid of the sense of "they knew it was the fix, why not do that in the first place?"

Which, admittedly, sometimes folks just don't know, but more often I run into complete cluelessness that select()/poll() and friends even exist.


I'd go for a different approach: Add some cancellation event for select() to listen to as well. Then, when another thread submits a new event source, it's only a matter of signalling the cancellation event, so the select() thread can update its poll set.

"Non-blocking" sounds very special but for anything except batch programs it should be used by default because whenever there is more than one possible input (inputs can be trivial as a cancellation signal) one can not afford to be blocked.

Then again, that doesn't mean that one shouldn't use select(). Unless one goes ultra-high frequency where the CPU is busy almost 100% of the time, select() saves CPU cycles, and decreases response time. Without select(), finding an appropriate polling frequency is a tradeoff between wasted compute cycles and increased response time.


> I'd go for a different approach: Add some cancellation event for select() to listen to as well.

That's exactly what the author of the article did.


https://www.youtube.com/watch?v=k238XpMMn38 but it's the rapidly dwindling sanity of unity programmers


There's nothing super special about this design. It apparently uses a single "IO reactor" which handles readiness notifications for a couple of other threads. Those might be a thread pool, and/or run non-blocking code (we don't know too much about this).

This design is pretty much the same that you can e.g. find in the go runtime (netpoll), the .NET core runtime, Rust multithreaded async libraries, Java nio2, etc.

The only thing that was special here is that they used an APC for waking up the reactor in case new events had to be added. It's non-standard, and kind of comparable to using a signal on Unix to interrupt `select` instead of a message on a preregistered socket.


I was once brought over to a project that used exceptions for flow control, sometimes nested several layers deep. It was a nightmare to debug and untangle.


What's funny to me is their solution is very natural and commonly used in unix. My guess is because signal handling on unix is much more prevalent and brings with it many of the same control flow issues you'd see in exceptions.


From the article: "Surprisingly, setting up kernel debugging is extremely easy on Windows."

Microsoft (at least used to) take developers very seriously. The OS actually plays along when you're trying to debug things, and Windows error reporting actually sends you data that you can work with.

Say what you want about their business practices, but I'd deal with Microsoft's developer support over Apple's anytime. And mandatory Youtube: https://www.youtube.com/watch?v=Vhh_GeBPOhs


I will give Windows this

You have Event Viewer, Windbg built-in.

Windbg Preview blows away other debugging tools.

And then there's freeware "API Monitor". This thing is incredible, it's like a GUI for looking at all syscalls and COM events on every program without needing to explicitly debug.

The closest thing I could approximate it to would be something like BPF

http://www.rohitab.com/apimonitor


I'm (among other things) a Linux system engineer and an occasionally reluctant Linux plumber on my team. Debugging kernel panics / bluescreens is night and day between Linux and Windows.

I have a Windows gaming rig that I was getting BSODs on. It took about ten minutes to reconfigure Windows to write full crash dumps, get WinDbg set up, and determine where and what was crashing the system (bad display driver install - DDU and reinstall fixed it). There were plenty of guides as to how to get it set up and how to get it to pull Windows debugging symbols from the Internet. It was completely painless.

I have to keep a well-worn grimoire covered with notes and scribbles on how to get kdump working correctly to do the same sort of work on the Linux machines that I support - assuming that devops even set them up in the right configuration for kdump to work.


I also am a lifelong Linux user and prefer it to Windows.

But I have to hand it to Windows in this realm, despite how I might feel about other parts of it.

Linux has strace, valgrind/kcachegrind, eBPF, etc

And those tools are incredibly powerful, but they have a far steeper learning curve and (personal opinion) worse UX than similar Windows tools.

I've used Windbg Preview to debug code written in niche languages like D and it's been able to sync source code from .d files and let me set breakpoints.

That really blew me away. I know that debug info is standardized in PE/COFF and in Linux with DWARF, but still.

Neat (at least I think so) screenshot of Windbg Preview debugging a D .dll with source synced:

https://media.discordapp.net/attachments/625407836473524246/...

The JS scripting is pretty wild too.

Repo full of them here. Example for generating a mermaid diagram image of callgraph, given a function name:

https://github.com/hugsy/windbg_js_scripts/blob/main/scripts...

(Material from a Defcon workshop by same author):

https://github.com/hugsy/defcon_27_windbg_workshop


I mess around with ~5 Linux boxes, so this is a small sample size, and also all my neighbors' Windows PCs.

Seems like the Windows PCs are more likely to show you a blue screen/crash than a Linux one.

Would that match your experience too?


My experience is that these days only hardware failures crash Windows.

A few years ago the biggest source of BSODs were the performance-optimised video drivers. That has thankfully improved a lot…


Honestly, not really. My Linux daily driver PC and Windows gaming PC are about equally reliable. The only recent blip for me has been with my Windows machine randomly crashing due to what turned out to be a bad RAM DIMM module.

My Windows PC actually seems to have video card problems less often, though I don’t really blame Linux here because they both have nVidia cards.


To clarify: WinDbg is not built-in in the literal sense (you have to download it separately), but it does come with plenty of extension commands which let you peek into Windows internals with ease (e.g. PEB, kernel processes/threads, I/O Request Packets with all their Stack Locations, etc.). And it integrates with the operating system easily as a Just-In-Time debugger.


> WinDbg is not built-in in the literal sense (you have to download it separately)

I’m not sure about that. I think you’re downloading the GUI, not the debugger. The symbolic debugger engine is in DbgEng.dll and is included in Windows.


Perhaps we don't agree on certain terminology.

I would opine that a debugger in the usual sense is software that interacts with both the end user (frontend) and the debuggee program (debugging/tracing engine), allowing the user to study, analyze, and manipulate the dynamic execution of said program.

The debugger stacks on e.g. Linux and Windows are as follows:

- Linux: GDB, LLDB (frontend + debugging framework); libbfd, libdwarf (object file/symbol engine); ptrace (debugging/tracing engine)

- Windows: WinDbg, cdb, ntsd, kd (frontend); DbgEng.dll (debugging framework); DbgHelp.dll (object file/symbol engine); DbgUi*, Nt* syscalls, kd stubs (debugging/tracing engine)

I have a question: would you call gdbserver a debugger?


Modern debuggers are complicated. There’re good reasons to name all of these components “a debugger”. But I think the symbolic engine deserves the name “debugger” more than the GUI.

Note that in Windows, the GUI is the only piece missing from the default OS installation. The rest of the stack (DbgEng.dll, DbgHelp.dll, kernel calls, tracing, and more) is shipped with the OS.

> would you call gdbserver a debugger?

I’m not an expert on Linux. But based on this page https://www.tutorialspoint.com/unix_commands/gdbserver.htm probably not. I think it says that thing’s an IPC server which doesn’t do debugging on its own, doesn’t even need symbols, instead relies on the host to implement the actual debugger.


In Linux this would be like using one of the many tracing tools such as strace, and a dbus viewer.


API Monitor sounds like SnoopDOS.


Doing debugging work on macOS is actually not particularly difficult either; it’s just that very few people do so and the number of online resources is much smaller.


And Linux makes it even easier.


Pros and cons

Pros: no PatchGuard sh*t, complete source available

Cons: your out-of-box experience may vary depending on your distro


Not really. Almost every major distro comes with all those tools prepackaged.


I meant particularly debug symbols.

Fedora/CentOS has -debuginfo for every binary package, while Debian seems to vary per package.


I come from Linux land so some of the details flew over my head. With that said, is there not a valgrind equivalent for Windows? The sentinel thing seems to be default functionality of the valgrind virtual machine. I would expect that the initial part of the investigation would be done automatically by valgrind. Of course, reading the whole post, it looks like it would not have gotten very far.

The other thing is, how can a system call block a thread while somehow that same thread runs other stuff in the middle of that blocking call? Even if that is possible, it sounds like a dangerous game to play, and this class of bugs seems natural.

I can imagine the scenario where a thread is cancelled mid system call, but it will be the kernel that cancels the thread and is responsible for knowing that half-way syscalls need to be cleaned up. This way the application does not get system call leaks. Also, if I remember one debugging session correctly, when you have a blocking syscall interrupted you will get a SIGABRT on Linux.


The syscall writing the data is async, which requires the memory it uses to stay available until it completes. Since the memory was on the stack here, the function that has it on its stack must not return until the IO completes.

But the code that was supposed to wait for completion was interrupted by an exception, which unwound the stack beyond the function owning the relevant memory. So the completing async operation writes into memory owned by somebody else now.

This is a general complication with completion-based async IO. You could get the same problems without any interrupts, if you just don't wait for async IO to complete in some circumstances.
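To make the hazard class concrete, here's a hedged sketch of the general pattern (not Unity's actual code, and the function name is made up): an overlapped WSARecv whose buffer and OVERLAPPED live on the stack.

```cpp
#include <winsock2.h>

// What NOT to do: completion-based I/O targeting a stack frame that may die first.
void dangerous_read(SOCKET s) {              // assumes an overlapped-capable socket
    char buf[256];                           // target buffer lives on this stack frame
    WSABUF wsabuf = { (ULONG)sizeof buf, buf };
    OVERLAPPED ov = {};                      // the kernel writes completion status in here too
    DWORD flags = 0;

    WSARecv(s, &wsabuf, 1, nullptr, &flags, &ov, nullptr);  // may complete later

    // If this function returns -- or an exception unwinds through it -- before
    // the operation completes, the kernel will later write received data and
    // completion status into memory that now belongs to some other function's
    // stack frame. That's exactly the "who writes into my stack" class of bug.
}
```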


Thank you. Now I understand why Rust async is so complicated. It's about lifetimes. It's dangerous to write data asynchronously because... eh... that location might be already invalid.

I know this is not about Rust. Forgive me that I mentally connect this story with async Rust.


There are two ways async IO can work:

* Readiness based: You use `select` (or preferably its successors) to figure out which file descriptor has data available. Then you synchronously read the data, which completes instantly because there is data available. This is the traditional way on Linux, and also the way Rust's async works. This approach doesn't run into any lifetime issues, because you only need to provide the memory (`&mut [u8]`) to a sync call once data is available. This approach works well for sequential streams, like sockets or pipes, but isn't a natural match for random access file IO.

* Completion based: You trigger an IO operation specifying the target memory. The system notifies you when it completes. This approach needs you to keep the memory available until the operation finishes. This is the traditional way on Windows (IO completion ports) and also used by the more recent io_uring on Linux. This one runs into the lifetime issues that caused the bug in the linked article. Thus in Rust it can't safely be used with a `&mut [u8]`, you have to transfer ownership of the buffer to the IO library, so it can keep it alive until the operation completes. The ringbahn crate is an example of this approach in Rust.


Note that what you're referring to as "readiness-based I/O" is not asynchronous I/O in the strictest sense--it's actually non-blocking, synchronous I/O.

I find this article helpful in clearing up confusion of asynchronous vs non-blocking I/O: http://blog.omega-prime.co.uk/?p=155


Good writeup. One thing to note is that Go, Crystal and other systems based on green threads tend to not run into the problem as they have separate stacks for different execution contexts.


More like a general complication with SEH. One of several reasons why I don't allow it in any code I touch.

Anything that obscures the flow of control is a violation of literate programming principles, and exceptions are pretty high on that particular shit list.


> With that said, is there not a valgrind equivalent for Windows?

Yep, it's called Application Verifier and it will check all kinds of system calls and make the memory allocator incredibly aggro towards checking validity

> The other thing is, how can a system call block a thread but somehow that same thread runs other stuff in between that blocking call?

When you call select(), your thread is in a special state called Alertable Wait; one way to get out of it is to have an APC queued to your thread, which has a similar vibe to a UNIX signal (with all of the similar caveats about how easy it is to fuck things up in it). Basically, OP's scenario is similar to corruption caused by a badly written signal handler
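For reference, a minimal sketch of how APC delivery into an alertable wait works, using only documented calls (QueueUserAPC, SleepEx); everything else here is illustrative:

```cpp
#include <windows.h>
#include <cstdio>

// The APC runs on the target thread, but only while that thread sits in an
// alertable wait (SleepEx / WaitForSingleObjectEx / ... with bAlertable = TRUE).
static VOID CALLBACK my_apc(ULONG_PTR param) {
    std::printf("APC ran on the waiting thread, param=%llu\n",
                (unsigned long long)param);
    // Throwing a C++ exception from here -- as in the article -- would unwind
    // straight through whatever kernel32/ws2_32 frames are currently on the
    // waiting thread's stack. Don't.
}

static DWORD WINAPI waiter(LPVOID) {
    SleepEx(INFINITE, TRUE);   // alertable; returns WAIT_IO_COMPLETION after the APC ran
    return 0;
}

int main() {
    HANDLE h = CreateThread(nullptr, 0, waiter, nullptr, 0, nullptr);
    QueueUserAPC(my_apc, h, 42);          // queue the APC to the waiting thread
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
}
```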


> The other thing is, how can a system call block a thread but somehow that same thread runs other stuff in between that blocking call? Even if that is possible that sounds like a dangerous game to play, and this class of bugs seems natural.

I mean, signal handlers exist on Linux, too ...

> when you have a blocking syscall interrupted you will get a sigabort on linux

Doesn't the syscall just return EINTR?


The select() was not interrupted in the traditional sense. It looks like a special Windows API was used (QueueUserAPC) which allows a callback to be run in user space while the thread is blocked in certain kernel calls. My guess is that when used normally, the callback returns and control goes back to select() being blocked, but in this case an exception was thrown, which unwound the stack and left the select() call holding onto an invalid stack address. Jumping out of a syscall using an exception like this will probably cause all kinds of problems, and my guess is this is completely unsupported.

https://docs.microsoft.com/en-us/windows/win32/api/processth...


Throwing from a signal handler on Linux is also UB. Or rather, it will blow up on Linux too, and that's why it's UB in the standard, because it'll blow up on any platform.


With the understanding that somebody had complex logic in the equivalent of a signal handler on Linux, I am not sure this bug is so fantastic after all. I have signal handlers so high on my danger list that I do not recall ever coding any. If I am in a SIGABRT or SIGSEGV handler, I hope I die as cleanly as possible, so that a core dump is useful. No wonder he needed kernel-level debugging.


No, absolutely not. It just illustrates the problems that obviously insane code can lead to and why we call it "insane" in the first place. But it was still a fantastic read following him on his journey from "huh, that's odd" to "I had to use kernel-level debugging to find out they were ~~throwing from an APC~~ using APC."

Imagine how much he could have achieved in those five days if he had just unfucked obvious bug magnets proactively without first painstakingly proving that they were really manifesting as bugs, and how.


There's IBM Rational PurifyPlus, which is kinda the same as valgrind, but with a Visual Studio plugin and fancy GUI. But when we bought it, we only got one license for the entire company because it cost around $10k per seat.


PurifyPlus was an old enterprise product similar to valgrind for Sun/Solaris. It eventually got Linux and Windows support, but for more than a decade it has been one of those unmaintained systems milking legacy enterprise subscriptions as it slowly bitrots into uselessness. It was bought by IBM and then passed on to Unicom. These days it is not a reasonable choice for anyone who isn't a legacy user.


Not sure if PurifyPlus existed for HP-UX, but I know Purify did (at a very previous job, we even managed to find a small memory leak in the standard libc, which we filed with HP; unfortunately, a repro required having a Purify license).


There’s address sanitizer on newer versions of visual studio now but in my experience getting it to actually work with all of your projects’ dependencies can be very hit and miss. Windows Debug build config also does a lot more memory checking (part of the reason it’s so slow) so you’re not totally screwed for automated tools.


> The fix was pretty straightforward: instead of using QueueUserAPC(), we now create a loopback socket to which we send a byte any time we need to interrupt select().

Ha, I have always been using this trick to break from a blocking select() call. I feel less evil now.

On Linux and macOS you could also use a pipe instead since select() doesn't only work with sockets.

On Windows you can alternatively use WSAEventSelect (https://docs.microsoft.com/en-us/windows/win32/api/winsock2/...) on all sockets and have another Event just for signalling, but the downside is that it forces your sockets to be non-blocking.


This is a common enough technique that linux standardized it to some degree: https://man7.org/linux/man-pages/man2/eventfd.2.html
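A rough sketch of the eventfd wake-up pattern, with illustrative structure (a real poller would of course watch its sockets in the same poll set):

```cpp
#include <sys/eventfd.h>
#include <poll.h>
#include <unistd.h>
#include <cstdint>

int main() {
    int efd = eventfd(0, EFD_CLOEXEC);

    // Any thread can wake the poller by writing an 8-byte counter increment:
    uint64_t one = 1;
    write(efd, &one, sizeof one);

    // The polling thread watches the event fd alongside its sockets:
    pollfd fds[] = { { efd, POLLIN, 0 } /*, ...sockets... */ };
    poll(fds, 1, -1);                    // returns here because efd is readable

    uint64_t count = 0;
    read(efd, &count, sizeof count);     // resets the counter; count == 1
    // ...pick up the newly queued sockets, rebuild the poll set, poll again...
    close(efd);
}
```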


Ah yes, forgot about those... A bit similar to the Windows solution. Unfortunately, the loopback socket method is the only cross-platform solution I am aware of and I always kind of felt bad using it...


Immediately guessed the culprit after seeing the stack trace, because (network) I/O was occurring.

Windows overlapped (=async) I/O to a stack allocated struct is pretty certain to blow up just like that, sooner or later.

If you guard against it, it's even worse: since cancelling is asynchronous as well, at least in theory you could end up in a scenario where you just can't ever reliably let execution continue again. Luckily this should be extremely rare... I hope.

Much better idea: allocate async data structures on the heap. In the worst case you leak some memory. Just never deallocate until the I/O has actually completed.

Variation of the same can happen on the hardware level as well. Don't DMA into stack, unless you can really be sure it'll complete fast enough or you can afford to wait for it.


Hmm, you immediately knew that async IO was the culprit... but it wasn't, according to the article. The function that called the async IO (which was a Windows API function, not Unity's) was correctly waiting until the IO completed before releasing the struct (by returning).

The actual problem was that the rug was being pulled out from under it by an exception being injected into a place where it ought to have been impossible. Your suggestion of using heap allocation instead wouldn't have fixed anything; it would've turned the crash into a memory leak, which would've just masked the problem.

The solution they went for in the end addressed this real issue, and the async IO (happening under the hood of that Windows API function) remains unchanged.


Well, overlapped (=async) IO was what ultimately overwrote stack memory. That's the hazard pattern I recognized from previous experience.

Of course the exception in the queued APC was the reason why select's WFSO was prematurely interrupted and the stack unwound, so that is where the bug really was. It's pretty dangerous to execute code in the context of a waiting thread.

While the solution to fix it by using heap allocation is a bit ugly in some sense, if your choices are to abort execution completely or to infrequently leak some bytes of memory, I know I'd choose to leak and perhaps log what just happened.

There are rare cases when the I/O never completes and CancelIo doesn't complete either. Probably hardware faults or kernel driver bugs. Stuff that should absolutely never happen, but can still be defensively programmed around.


Would allocating on the heap have been better than the workaround that they ended up using? I suppose there is something to be said for having the same mechanism on multiple platforms.


I suspect it would have corrupted the heap instead of the stack. If so, this would be even harder to diagnose, because the corruption would show up elsewhere unrelated to the actual problem.

The underlying problem is that you: 1) queue an asynchronous callback and reserve some memory for it to use; 2) wait (or so you think) for the function that calls the callback to be done and return; 3) something interrupts the wait, but the callback is still queued for execution sometime in the future; 4) since the wait is over, you assume the callback was called, and the memory it was going to use can be safely freed; 5) if the memory was on the stack, it gets freed purely by returning, and if it was on the heap, you'd deallocate it; 6) the callback that was still queued fires up and writes to memory it should no longer be able to use.

To really fix this situation, you'd need to free the memory in the callback. This is not possible in this case unless you can rewrite the library select() function significantly, perhaps even rewriting other parts (not familiar enough with those APIs to be really sure).

Even in similar situations where it is possible (e.g. if you have garbage collection), it will often turn a memory corruption into a logical corruption, because you have a callback fired related to something that is not relevant anymore, and any side effects of that callback will happen without the proper context. (in this case the only side effect is probably the memory access, so in this particular instance the problem would be "fixed")

[Edit: if the heap allocation was exception-aware (e.g. RAII that gets invoked during exception unwinding), then you would get heap corruption. If it was manual deallocation and thus skipped during the exception unwind, then indeed you would just get a memory leak, exactly as you said.]


Somehow the lifetime must be at least as long as the async operation takes to complete. All other scenarios could result in memory corruption.

If there's no other option, I'd rather leak that memory.


> How does one interrupt select() function? Apparently, we used QueueUserAPC() to execute asynchronous procedure while select() was blocked… and threw an exception out of it!

<rant>

Let's just ignore not using overlapped I/O and completion ports for a moment. How did using APCs ever seem like a good idea? Like, perhaps a dedicated thread pool for async jobs would work a little better?

If you need to interrupt that specific thread though, sorry for that. Perhaps uhh CancelSynchronousIo() would work since there's IO_STATUS_BLOCK? Or maybe call CancelIo() on the socket in that same APC. However it goes, throwing exceptions in an APC isn't something I'd like to see in production code.

Also, even in Unix world, the select(2) system call is infamously known as a buffer overflow landmine [1]. Just FD_SET any fd >= FD_SETSIZE and BOOM. If you're on Unix, at least use poll() or better. If you're on Windows (and can't afford switching to proper async I/O), maybe WSAEventSelect() would work better?

</rant>

[1]: This does not apply to Windows Sockets though, where fd_set is not a bit array but internally just an FD array in disguise. You get FD_SETSIZE = 64 though, which is conspicuously close to MAXIMUM_WAIT_OBJECTS in WaitForMultipleObjects.
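To illustrate that footnote, a small hedged sketch of why poll() sidesteps the FD_SETSIZE landmine: the caller supplies the array, so there is no fixed-size bitmap to overflow (the helper name is made up):

```cpp
#include <poll.h>
#include <vector>

// Wait for readability on an arbitrary number of sockets.
int wait_readable(const std::vector<int>& socket_fds, int timeout_ms) {
    std::vector<pollfd> fds;
    fds.reserve(socket_fds.size());
    for (int fd : socket_fds)
        fds.push_back({ fd, POLLIN, 0 });   // no upper bound on fd values here

    // With select(), handing an fd >= FD_SETSIZE to FD_SET() writes past the
    // fd_set bitmap; poll() has no such limit.
    return poll(fds.data(), fds.size(), timeout_ms);
}
```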


Throwing an exception to escape a blocked syscall sounds crazy evil. Allocating a syscall I/O buffer on the stack, even more so. I think they deserved this bug.

Good diagnosis though. I think I would have given up at the kernel debugger and just refactored that entire module until it went away. I had no idea kernel debugging was even possible.


> Allocate syscall I/O buffer to stack

That part wasn't done by them though, but by select(). Still a good lesson in why exceptions should only be used if you know exactly how everything that allocates any resources in between works. And sometimes you can't know until it bites you like this.

On another note, getting old isn't that bad. When I saw the title I knew I read this post before but couldn't remember how it went, so the second read was just as exciting. ;)


The crazy thing is using an alertable wait state; IOCP doesn't work with it the way you'd expect. I expected frameworks like the Unity engine to use completion ports.


Completion ports are a solution to maximize efficiency if you are dealing with thousands of connections. I don't really think unity or any other game has that requirement. It's a client, not a webserver.

They probably picked the select-based solution since it's decently portable between Windows and Unix, and good enough for their requirements.


I think he is saying because Unity is expensive, he expected them at minimum to use the best technology available in the stack (IOCP), since the implementation time is fairly minimal (i.e. a developer and a few weeks) compared to the performance gained for all the games using the engine.

In the same way you don't buy Unreal Engine and then expect to see GLFW being used underneath instead of the native platform APIs or something like that. You expect any big game engine to be thoroughly natively integrated, and so be it if they need to maintain multiple native backends. That's their value sell for an engine.

And I think there would be plenty of 32-odd player games using P2P sockets.


I understand that. But what performance gain? Unless your application spends 20% of CPU or more on networking, or you run 1000 connections, there won't be any measurable difference. Realistic numbers for games are likely < 1% of CPU; there's simply not that much data to exchange. Optimizing the graphics API for games is something different, since that is really where CPU and GPU time is being spent.


"It was just a dream, Bender. There's no such thing as two."

https://www.tvfanatic.com/quotes/bender-what-is-it-whoa-what...


For problems like this (though for this particular app it might be hard because it's Too Big), Time-Travel Debugging is practically made for it - https://docs.microsoft.com/en-us/windows-hardware/drivers/de... - hit your first corruption, set the memory breakpoint, then just go backwards.


This should be the top response. With a time traveling debugger this would take 15 minutes to solve (minus the time to make the crash happen). On linux I use the Undo debugger - I don't own stock - I just love the product.


Oh data breakpoints! Our savior. In the bad old days Intel had a big 'blue box' that ran ISIS and had a $1200 cable and circuit board called 'ICE' for 'In-Circuit Emulator'. It could monitor the bus and trap on certain memory operations. My brother worked on that right out of college. He met his wife over that circuit board!

Anyway, fast forward a bit: I'm working at Convergent, which was nascent Intel's biggest single consumer of x86 chips. Intel was coming out with a shiny new version, the 80286, and showed our engineers the spec. They complained "Hey! We're your biggest chip buyer but you didn't consult us on features!"

Intel guy said "Ok, what features do you want?" Our folks said "We'll get back to you..." Then they circulated a questionnaire among us software types for ideas.

I suggested "Let's get rid of that Intel Blue Box, have a feature in the new CPU that traps on bus condition mask and value registers." I even drew out what I wanted.

Well, what do you know, Intel did it. And it remains to this day, and is called "Data breakpoints". Probably the only thing I ever did that will last any length of time, since software has a really short sell-by date as a rule.


On Linux, a conceptually similar idea to the loopback socket could safely and efficiently be implemented with epoll(). The epoll instance itself is a poll-able file descriptor that can be fed to epoll() alongside the sockets.

When a new socket is added, epoll() gets triggered (by the loopback epoll file descriptor that was used to add the new socket, since it's also actively being polled by epoll). epoll() immediately returns to userspace, the event is handled (to do the necessary socket bookkeeping), and then the thread goes back to polling with the updated list of sockets.
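A rough sketch of that nested-epoll idea, with illustrative names; note it only wakes the outer wait once an "inbox" socket is (or becomes) ready:

```cpp
#include <sys/epoll.h>

static int outer_epfd, inbox_epfd;

void setup() {
    outer_epfd = epoll_create1(0);
    inbox_epfd = epoll_create1(0);          // an epoll fd is itself pollable
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = inbox_epfd;
    epoll_ctl(outer_epfd, EPOLL_CTL_ADD, inbox_epfd, &ev);  // watch the inbox too
}

// Called from any thread to hand a new socket to the polling thread.
void submit_socket(int sock) {
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = sock;
    epoll_ctl(inbox_epfd, EPOLL_CTL_ADD, sock, &ev);
}

// Polling thread: wait, harvest the inbox, go around again.
void poll_once() {
    epoll_event events[64];
    int n = epoll_wait(outer_epfd, events, 64, -1);
    for (int i = 0; i < n; ++i) {
        if (events[i].data.fd == inbox_epfd) {
            // Drain the inbox: move newly submitted, ready sockets to the outer set.
            epoll_event ready[64];
            int m = epoll_wait(inbox_epfd, ready, 64, 0);
            for (int j = 0; j < m; ++j) {
                epoll_ctl(inbox_epfd, EPOLL_CTL_DEL, ready[j].data.fd, nullptr);
                epoll_ctl(outer_epfd, EPOLL_CTL_ADD, ready[j].data.fd, &ready[j]);
            }
        } else {
            // ...handle a ready socket...
        }
    }
}
```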


Damn, all this happened in a mere FIVE days? Really shows how far I need to grow as a developer lol


And still you'll always find a guy that claims he would have found the bug in 5 minutes.


And he's probably right. And it also doesn't mean anything.

There are thousands of geeks reading an interesting debug report. It is not very surprising to find one who'd think of the right idea as the first thing.


You are probably right. But also in this field I have found a lot, and I mean A LOT, of one-uppers. When the annoying bug appears, everybody is busy and everything other than the bug is top priority. Then you bite the bullet, fight the bug, and eventually get it fixed. Suddenly the one-uppers show up, ask you "what was it?", and whatever you answer they already suspected it as soon as they read the symptoms, and explain why your solution is suboptimal.

I remember the time that I "solved" a slow database query that was annoying everybody for weeks just by creating an index. Everybody was avoiding the issue like the plague, probably out of fear that it was something serious, but after I created the index (something all of them could do in their sleep), one-uppers showed up to 1) claim that "it was obviously a bad index" and 2) that the index I created could be done better.

My point is that claiming "I knew it" adds absolutely nothing. Claiming that it could be solved in 5 minutes or in a better way adds even less, turning a moment of deserved pride into fuel for your imposter syndrome. What really adds something is fixing the bug and writing a beautiful post explaining the process.



Like others have already mentioned, that's a developer who saw the blood on the walls. They've likely seen this issue before and know the signs to look out for. It doesn't make them any smarter but it does mean they are wiser/more experienced with this tech stack.

As a personal example, my team sees me as the consult for all the arcane fuckery that you find as a result of unusual C++ behaviour or build system issues. This doesn't make me a good dev but it does mean that I've been through my paces there and bled my share. But at the same time I'm practically a blind, helpless child wrt most of the JVM or web tech stacks. IMHO my colleagues are wizards for being able to make heads or tails of issues in those spaces.

Point being, this isn't a matter of "Oh I could solve that in seconds", it's a matter of "Oh god I remember dealing with this type of issue. Here's the cause and how to fix & avoid it".


"Like others have already mentioned, that's a developer who saw the blood on the walls. They've likely seen this issue before and know the signs to look out for. It doesn't make them any smarter but it does mean they are wiser/more experienced with this tech stack."

This. Exactly right.


And that comment is very helpful and informative so I think it is great we have people like that posting! Not sure why anyone would call them out.


The difference is between working out something from scratch vs recognising something you’ve seen before. The second will be much faster, but it depends on experience, not talent.


maybe, but to get there it takes a lifetime. (paraphrasing Picasso here)


> Lesson learnt, folks: do not throw exceptions out of asynchronous procedures if you’re inside a system call!

Maybe unpopular opinion, but don't throw exceptions at all, anywhere. It isn't worth it.


The people with the skill and knowledge to debug something like this truly inspire me.

Reading low level debugging war stories like this it always amazes me that computers work at all let alone as well as they do. It’s surely due to people like this carrying the torch for the rest of us.


Reminds me of a memory corruption bug we encountered in our OS class. We were building a basic OS for a Cortex M3 board and we occasionally printed the wrong strings to the console despite having correct code and assembly. I forgot most of the details but basically we discovered our hardware (?) was not reentrant safe and our bug was caused by an interrupt handler overwriting our callstack. We "fixed" it by disabling interrupts while writing to the console.


Out of curiosity, is it possible to disable ASLR easily on Windows? It's trivial to change on Linux and that (plus Valgrind) has occasionally made obscure debugging life easier.


The executable has to have ASLR enabled as a flag, so you just compile without the flag


You can also disable it globally, e.g. if the source isn't available.

https://stackoverflow.com/a/9561263/1895684


Alternatively it's also possible to disable ASLR in the executable itself by removing the flag 0x40 (opt-in flag for ASLR) in the DllCharacteristics field of the header. There's a neat little tool called SetDllCharacteristics[0] which can do that.

[0] https://blog.didierstevens.com/2010/10/17/setdllcharacterist...


“Somebody had been touching my sentinel’s privates - and it definitely wasn’t a friend.”

A non-C++ programmer might find this weird.


Now that was an enjoyable debugging story, but what was their callback doing that threw an exception?


They threw it on purpose as a way to interrupt select().


Huh - now I can see why exceptions are looked at somewhat askance in low level coding circles.


Who uses non-blocking networking on the client?

It's not like you don't have cores/threads for blocking when you just have one server, how many servers are they going to connect to?

Or is this for the Unity server part?


From what I've seen, it's a common convention with game development to do non-blocking networking and update your network state every tick of your update loop.

It fits in nicely with how everything else works in your game loop, and means you don't need to deal with marshalling data to/from a dedicated thread.


You should keep the networking asynchronous from the rendering, yes, but why non-blocking?


Actually, the select() call was blocking. Non-blocking in this context would mean periodically calling select() with a timeout of 0.


Select being blocking is normal, non-blocking IO does not depend on select being non-blocking.

Select returns the list of non-blocking sockets that have data read/write pending.


select() is just a means to multiplex network I/O. It can be either blocking or non-blocking, depending on your application architecture. I have worked on both kinds.

BTW, the sockets themselves don't even have to be non-blocking.

Also, I'd like to stress that the concepts of blocking/non-blocking I/O and synchronous/asynchronous I/O are really orthogonal. You can do synchronous networking with non-blocking sockets and vice versa.


Or to put it another way: depending on the timeout value, select() can achieve blocking or non-blocking I/O on several sockets simultaneously. In both cases, the sockets themselves can be either blocking or non-blocking.

There is no practical difference between blocking on select() for multiple sockets or blocking on recv() for a single socket - in both cases the thread can't do anything else.
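A tiny sketch of that point: the timeout argument is what decides whether select() blocks (illustrative helper, not from the article):

```cpp
#include <sys/select.h>

// Check a single socket for readability, either blocking or not.
int check_readable(int sock, bool block) {
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(sock, &readfds);

    timeval zero = { 0, 0 };   // zero timeout: poll and return immediately
    // NULL timeout -> block until ready; &zero -> non-blocking check.
    return select(sock + 1, &readfds, nullptr, nullptr, block ? nullptr : &zero);
}
```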


> It's not like you don't have cores/threads for blocking when you just have one server

Some platforms don't have these luxuries.


Why blocking?


Because it's simple, and on a client that doesn't have many connections there is no point in using non-blocking I/O just because it's the latest tech!


Non-blocking IO is ancient tech. It is done because spinning up more threads is expensive and makes performance unreliable. Thread overhead is much less of an issue today, so just spinning up a new thread and running everything with blocking calls is the modern way of doing things; but many of the idioms for games are from a long time ago, when not every CPU was multi-core, so thread-switching overhead was huge and they had to do non-blocking IO to run well.


So if your client has one thread for networking why use non-blocking?

The age is relative: sockets are from the 70s, non-blocking is 2000+, and in most languages it's only been stable since 2010+.


Overlapped (=async) IO has been in Windows since NT came out, so mid-90s. I don't even know what you mean with "most languages". Most programming languages are only stable since 2010+ anyway, we get more languages every year.


> So if your client has one thread for networking why use non-blocking?

No, the client would have one thread for everything, not one thread for networking. You shouldn't do blocking IO if you only have one thread that also needs to do other things.


I don't fully remember the details of these articles, but it has happened in the past that reading them has helped future me debug very big and complicated issues.

I believe this comes from the fact that I am presented with a new _kind_ of bug: an async syscall corrupting memory because the stack was unwound (by an exception). And I integrate this into my mental debug checklist.


About 15 years ago I was debugging an ARM7 memory corruption issue on an embedded target. The chip was running at 40 MHz and the instructions were 32-bit ARM instructions, but the external data bus was only 8 bits wide -- reading instructions from external NOR flash required 4 bus cycles per instruction. So an effective rate of ~10 MHz.

We were good about doing code reviews, stacks weren't overflowing, etc. So it was puzzling. Finally, just like the article said, I figured the only way to find it was to catch it "red handed", in the act.

The good news is that memory locations getting corrupted were always the same.

Long story short, I set up a FIQ [1] -- some of you know the FIQ -- which would check the location on each interrupt. I forget if it checked "for" a value or that it "wasn't" an expected value, ugh, sorry... If the FIQ detected corruption, it did a while (1) that would trigger a breakpoint in the emulator. Then I'd be able to look at the task ID -- we were running Micrium uC/OS-II as I recall -- the call stack, etc.

Originally I set up a timer at 1 MHz to trigger the FIQ, but the overhead of going in & out of the ISR 1 million times per second, at essentially a 10 MHz rate, brought the processor to its knees.

So I slowed the timer interrupt down to 100 kHz (!!), which still soaked up a lot of the CPU slack that we'd been running with. And time after time I'd hit the breakpoint in the FIQ, but the damage had been done microseconds earlier and the breadcrumbs didn't finger a culprit.

Then it happened. Remember, the hardware timer is running completely asynchronously with respect to the application. Finally, the FIQ timer ISR had interrupted some task's code in exactly the function, at exactly the place (maybe a couple instructions later) where the corruption had occurred.

Took about a day start to finish. I'd never seen or heard of using a high-speed timer to try to "catch memory corruption in the act", but as they say, necessity is the mother of invention.

And to non-embedded developers, this is an embedded CPU. No MMU or MPU, etc. just a flat, wild-west open memory map. Read or write whatever you want. Literally every part of the code was suspect.

Good times.

[1] On ARM 7/9, maybe 11, I think also Cortex R -- the Fast Interrupt Request, or FIQ, uses banked registers and doesn't stack anything on entry -- so it's the lowest-latency, lowest overhead ISR you can have. But you can only have one FIQ I believe, so you have to use it judiciously.
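A generic sketch of the trick, with hypothetical addresses and names (the real version was an ARM7 FIQ handler hooked to a hardware timer; the ISR hookup and interrupt acknowledgement are platform-specific and omitted):

```cpp
#include <cstdint>

static volatile uint32_t* const WATCHED_ADDR =
    reinterpret_cast<volatile uint32_t*>(0x20001234);   // hypothetical address that kept getting trashed
static const uint32_t EXPECTED_VALUE = 0xA5A5A5A5;      // expected/sentinel contents

// Wired to a high-frequency hardware timer (e.g. 100 kHz) as the FIQ handler.
extern "C" void timer_fiq_handler() {
    if (*WATCHED_ADDR != EXPECTED_VALUE) {
        // Corruption happened within the last timer period. Spin here so the
        // emulator/debugger breakpoint fires with the offending task's
        // registers and call stack still intact.
        for (;;) { /* breakpoint target */ }
    }
    // ...acknowledge/clear the timer interrupt (hardware-specific)...
}
```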


Saw the red flag on the very first page:

> the socket polling thread then dequeues these requests one by one, calls select()

The select() API is not good regardless of the platform.

If you like the semantics and developing for Linux, use poll() instead. It does exactly the same thing, but the API is good.

Windows API supports thread pools, it's usually a good idea to use StartThreadpoolIo instead.
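For completeness, a hedged sketch of what the thread-pool I/O route might look like, using only documented calls (CreateThreadpoolIo/StartThreadpoolIo/WSARecv); the structure and names are illustrative, not Unity's code:

```cpp
#include <winsock2.h>
#include <windows.h>

// Completion callback: runs on a thread-pool thread when an I/O on the socket finishes.
static void CALLBACK on_io(PTP_CALLBACK_INSTANCE, PVOID /*context*/, PVOID overlapped,
                           ULONG /*io_result*/, ULONG_PTR /*bytes*/, PTP_IO) {
    // 'overlapped' identifies which operation finished. Whatever buffer it
    // targeted must still be alive here -- heap-owned, not a dead stack frame.
    (void)overlapped;
}

// One-time setup per socket (the socket must be overlapped-capable).
PTP_IO bind_to_pool(SOCKET s) {
    return CreateThreadpoolIo((HANDLE)s, on_io, nullptr, nullptr);
}

// Issue one async receive; the buffer and OVERLAPPED are heap-owned by the caller.
void post_receive(SOCKET s, PTP_IO io, WSABUF* buf, OVERLAPPED* ov) {
    StartThreadpoolIo(io);                             // must precede each async call
    DWORD flags = 0;
    int rc = WSARecv(s, buf, 1, nullptr, &flags, ov, nullptr);
    if (rc == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING)
        CancelThreadpoolIo(io);                        // failed synchronously: no completion coming
}
```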


How would poll() have helped in this case? It seems like the error condition in the article (interrupting syscalls in unexpected ways can trash whatever state your syscall wrapper manages for e.g. return values) is a risk regardless of the blocking operation being performed.


poll is not available on Windows. The semantics are similar to WaitForMultipleObjects with overlapped IO, but not quite.

Thread pool API doesn't have this class of bugs. One doesn't need to wake up any threads to change the set of handles being polled.


That's a bit of an apples-to-oranges comparison. The description on the blog says the `select` call runs on a thread which is mainly responsible for checking for IO readiness. They use the information somewhere else to perform the IO. Probably on a thread pool.



This is the sort of thing that would have had me breaking my keyboard over the monitor. The sentinel trick is nice, though; isn't it possible for compilers to insert such sentinels if requested?


Yeah, they're typically called stack canaries or cookies by the compilers I've used. I doubt they would have caught this though.


Ask HN: How do I reach this level of understanding of what to do in these situations..? Really inspired by the post on how we can use different tools and pin down the exact issue


For these types of situations you can really only learn by doing. You can litter code with printf() [although this can actually hide bugs too if you're very unlucky], comment out code until you come to a minimal program that exhibits the buggy behaviour, and then just persevere.


Fantastic write-up. I was gripped!


Async is hard. But fun when you can do it intuitively.


Mixing threads and select? There's your problem right there! Don't use select in threaded programs.


Answer: Windows does!


Technically their program told windows to write to their stack at an arbitrary time in the future. Windows then did what it was told to do causing this bug.


More accurately: Undefined Behaviour does.

Windows only writes to the location because the program gave it that pointer. It's the program's fault that they gave a pointer to an async API when they didn't first guarantee that said pointer would be valid in the future. Really there should be tooling for catching this kind of thing (there is on Linux but less so on Windows).


The author needs to discover JTAG for himself


And how much they cost.

They are great though. I used to use a Lauterbach occasionally back when I worked for a mobile OS company.



