Hacker News
Four lines of code it was four lines of code (conman.org)
166 points by todsacerdoti 2 days ago | 32 comments

"strace -c" is usually how I start looking for things like this. It so often immediately reveal some vast difference in syscalls that'll give a clue, and is usually faster/more convenient as a first step than starting with a full profiler trace or looking through code.

strace was not on my radar. I know it exists, it just never occurred to me to use it. So, I tried it. Here are the results:

    4542 open
    3041 read
    2990 close
    2517 epoll_ctl
    2478 old_mmap
    2475 munmap
    2470 fstat64
    1509 epoll_wait
    1509 clock_gettime
    1501 ioctl
ioctl() (which, combined with epoll_ctl() and epoll_wait(), was the actual problem) does show up in the top 10, but it doesn't scream "I'm the issue!"

It should kind of jump out for not being an "actually do something useful" syscall.

Without knowing what your program does, what your syscalls say to me are:

* why are so many files opened but not closed?

* that's probably too much epoll_ctl; try optimistically performing the real syscall once before actually registering (see the sketch after this list). Or, under some circumstances, do a quick few-fd `poll` since it is stateless.

* is mmap really a win here? Unless you need its swap-like behavior, read is often faster (and it avoids TOCTOU). Also, why are there fewer `fstat`s than `munmap`s?

* why is ioctl so prominent; what is it even doing?

* that's a lot of reading but no writing ... I guess maybe it's split across multiple lower syscalls but it's still weird.

(of course, all these can have legitimate answers, but you need to know them)
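A minimal C sketch of the "real syscall first, register only on EAGAIN" idea from the epoll point above, assuming a non-blocking fd and an existing epoll instance epfd (the function and variable names are placeholders, not anyone's actual code):

    #include <errno.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    /* Try the write immediately; only pay for epoll_ctl if the socket
       isn't ready right now. */
    static ssize_t try_write(int epfd, int fd, const void *buf, size_t len)
    {
        ssize_t n = write(fd, buf, len);
        if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
            return n;                        /* done, or a real error */

        /* Not ready: register (or re-arm) interest in writability. */
        struct epoll_event ev = { .events = EPOLLOUT | EPOLLONESHOT };
        ev.data.fd = fd;
        if (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) < 0 && errno == ENOENT)
            epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
        return -1;                           /* caller waits for EPOLLOUT */
    }

In the happy path (socket stays writable) this does one write per send instead of epoll_ctl + epoll_wait + write.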


I do know how to answer most of the questions.

The program is a gopher server. The list you see is the result of sending several hundred (sequential) requests to the server. The number of calls to open() vs. close() can be explained [1], so (to me) nothing unusual there. I don't directly call mmap(), munmap() or fstat(), so I'm assuming there's code in some library I'm using that's making those calls. And that's the assumption I would have had with ioctl() as well (well, I now know better, but in investigating the issue originally I might have just ignored ioctl() along with mmap()).

There is a lot of reading, due to the requests (blog entries). Writing is to a socket.

You are right in that there is too much epoll_ctl(), but it's unclear to me if I would have found the actual issue from that alone. I did change the code to perform the real system call before registering, but that had little effect.

[1] Not all requests are valid, for instance. In this case, it's a mirror of my blog, and due to how I use the file system as a DB, metadata files may or may not exist.


> * why are so many files opened but not closed?

Guess: about 1/3 of those opens are unsuccessful.


Turns out that mmap() is coming from the memory allocator; it has nothing to do with files.

I did suspect that, but now the implication is that the memory allocator is not reusing memory even though it easily could.

GLIBC used to have a fixed 128 KiB threshold, but since 2.25 (2017) it defaults to dynamically adjusting the threshold to prevent exactly this pattern of excess syscalls. A manual threshold disables this though!

(This may be a microoptimization depending on how much real work you're doing between the syscalls, but there's also a good chance that it's easy to fix.)
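For reference, the threshold in question is glibc's M_MMAP_THRESHOLD; a small sketch of how pinning it manually works (the 128 KiB value is just the historical default, not a recommendation):

    #include <malloc.h>   /* glibc-specific: mallopt(), M_MMAP_THRESHOLD */

    int main(void)
    {
        /* With no call here, glibc slides the threshold up on its own as
           large blocks are freed, so repeated large allocations stop
           bouncing through mmap()/munmap().  Setting it explicitly (or
           via the MALLOC_MMAP_THRESHOLD_ environment variable) fixes the
           value and switches that dynamic adjustment off. */
        mallopt(M_MMAP_THRESHOLD, 128 * 1024);
        return 0;
    }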


I was going to point out that you can mmap() anonymous pages and use them directly as heap, so this tracks.
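For the curious, "anonymous pages as heap" just means something like the sketch below, which is roughly what malloc does for large blocks (the function name is illustrative only):

    #define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
    #include <stddef.h>
    #include <sys/mman.h>

    /* Grab a chunk of anonymous, zero-filled memory straight from the
       kernel; no file is involved, it just shows up as mmap in strace. */
    static void *big_block(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? NULL : p;
    }

The matching "free" is munmap(p, len), which is why the mmap and munmap counts in the strace output track each other so closely.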

Not for me either in this case, but oftentimes you will see it's running down some error branch repeatedly, and it will sometimes give you a filename of what it's working on, which can be a big clue.

You'll need to get used to comparing it to more "usual" output, I guess. ioctl() is a syscall you'd very rarely expect to see many of in most code. If I'm seeing more than half a dozen ioctls, I'll be wondering what is going on (doesn't mean it has to be something wrong, just unusual).

But also, you'll want to look at the wall time and the microseconds per call - the raw syscall count is much less interesting.

Once you then spot something odd, dropping the "-c", using "--trace=" to set which syscalls you're interested in tracing, and getting a live dump of the syscalls while you're testing often reveals very "interesting" patterns (favourite pet peeve: looking for small reads/writes - context-switch overhead almost always makes it pay to buffer in userspace, but a surprising number of applications fail to do so).

Basically, my assumption w/networking code is that if I'm stress-testing it and the wall-time isn't dominated by read/write, something is very wrong.
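On the small-reads/writes peeve above, the usual cure is just a userspace buffer in front of write(); a rough sketch (buffer size, struct and function names are all made up, and error handling is trimmed):

    #include <string.h>
    #include <unistd.h>

    /* Coalesce many small writes into one syscall-sized write(). */
    struct outbuf { int fd; size_t used; char data[16 * 1024]; };

    static void ob_flush(struct outbuf *b)
    {
        size_t off = 0;
        while (off < b->used) {
            ssize_t n = write(b->fd, b->data + off, b->used - off);
            if (n <= 0) break;              /* error handling elided */
            off += (size_t)n;
        }
        b->used = 0;
    }

    static void ob_put(struct outbuf *b, const void *p, size_t len)
    {
        if (len >= sizeof b->data) {        /* too big to buffer */
            ob_flush(b);
            write(b->fd, p, len);           /* pass straight through */
            return;
        }
        if (b->used + len > sizeof b->data)
            ob_flush(b);                    /* one write() per ~16 KiB */
        memcpy(b->data + b->used, p, len);
        b->used += len;
    }

stdio's fwrite/setvbuf gets you the same effect with less code if a FILE* fits the program.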


`strace -c` will show you wall time per call and not just the volume of syscalls.

My apologies, I'm just unfamiliar with what would look normal. These are the results of the buggy TCP implementation? Could you show a comparison for the fixed TCP implementation and the TLS implementation?

> These are the results of the buggy TCP implementation?

Yes. Here's the fixed TCP implementation.

    4542 open
    3041 read
    2990 close
    2478 old_mmap
    2475 munmap
    2470 fstat64
    1501 epoll_ctl
    1007 fcntl64
    1002 epoll_wait
    1002 clock_gettime
I didn't do the TLS implementation, but I suspect it would be similar to the above (since it didn't exhibit the problem). You can see that ioctl() dropped from the list, and that epoll_ctl() and epoll_wait() dropped down the list.

You can often get pretty far with `perf top`, also. Maybe not so much in this case, where it sounds like the overall CPU used was low compared to the amount available from the system, but for hot-spots it can be handy.

yup, plus if the program is already running in the bad state you don't even have to restart it to investigate, just do

  strace -p <pid> -c
to attach

The site seems down, so here's the wayback link: https://web.archive.org/web/20240701032732/https://boston.co...

Best line of the post:

> And as it is with these types of bugs, finding the root cause is not trivial, but the fix almost always is.


This is actually the best case scenario: putting in some effort to find the root cause, having maybe a bit of luck and actually finding it, and then making a minimal change to fix it (hopefully) once and for all. Much better than the typical back-and-forth trial-and-error "I changed something, see if it works now" - which I suspect is much more common in the real world (especially if the reporter is a different person than the one fixing it), but is nothing worth blogging about...

Well, except that his fix can't easily be pushed back into the source tree because he doesn't know what other software is only working because of the original fix. Presumably it must have been needed by at least one other person, probably also after a crazy long debugging session, or it wouldn't have been there at all. As this code is in a public repo for the default library of a well-used language, it's pretty much a given that fixing this by just removing those lines will cause a regression for someone.

Unless I'm misunderstanding the post, it's the author's own code.

(Here's the relevant change, removing those four lines of code: https://github.com/spc476/lua-conmanorg/commit/0ba8d616d1447...)


Oh good spot. I'd misread the repo name as lua-common, I guess my brain "autocorrected" it. The way the article was written, I didn't at all get the impression that it was his code, just that he discovered it and tried hacking it out:

> This bit of code wasn't in the TLS implementation, and as the comment says, it “shouldn't” hurt the TCP path, but hey, it was only needed for local (or Unix) domain sockets to begin with, so let me try removing that bit of code and run a test.

But yeah, if it's his own project, it's far easier to keep track of and watch out for any bug reports that might result from this change.


Heh, which is why you document the code when adding hacks like this in the first place...

>#Adding this line to prevent hangs on XXX terminal

Or whatever.


There was in this case, just above the offending lines, including this gem:

> [...] This does NOT appoear [sic] to happen with TCP sockets, but this doesn't hurt the TCP side in any case.

Oops!


The fix doesn't require removing code though, as far as I can tell. Could just add an if statement.
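The project itself is Lua, so take this only as a C-level illustration of that "if statement": keep the workaround but gate it on the socket family, e.g. with SO_DOMAIN (Linux-specific; the helper name is made up):

    #include <sys/socket.h>

    /* Run the pre-close workaround only for local (AF_UNIX) sockets,
       where the race was actually observed. */
    static int is_unix_socket(int fd)
    {
        int domain = 0;
        socklen_t len = sizeof domain;
        if (getsockopt(fd, SOL_SOCKET, SO_DOMAIN, &domain, &len) < 0)
            return 0;
        return domain == AF_UNIX;
    }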

Getting paid not to turn a screw but to know which screw to turn.

> There's a race condition where writting data then calling close() may cause the other side to receive no data. This does NOT appoear to happen with TCP sockets, but this doesn't hurt the TCP side in any case.

This might come down to a different default for SO_LINGER for UNIX-domain sockets than for INET{,6}-domain sockets, and possibly Unix-domain sockets not having an equivalent of RST to signal the loss. IIUC, typical best practice anyway is to shut down the sending half, then wait for the peer to shut down its sending half before closing the socket. It's been almost 20 years(!) since I've played with this, but the behavior you get by just calling close with unsent data at least used to vary by platform anyway with TCP. [1]

[1] https://web.archive.org/web/20060924184856/http://www.slamb....
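The "shut down, then wait" practice mentioned above looks roughly like this on the sending side (a sketch assuming a blocking socket, with error handling and a drain timeout left out):

    #include <sys/socket.h>
    #include <unistd.h>

    /* Graceful close: signal EOF to the peer, then drain until the peer
       closes its side, so queued data isn't discarded with the socket. */
    static void graceful_close(int fd)
    {
        char scratch[4096];

        shutdown(fd, SHUT_WR);                    /* we're done writing */
        while (read(fd, scratch, sizeof scratch) > 0)
            ;                                     /* wait for peer's EOF */
        close(fd);
    }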


Whenever I need to do network things in Lua, I refer to the turbo.lua framework, which I've used in the past on other projects and I just find it so useful:

https://turbo.readthedocs.io/en/latest/

Whether it's a set of file descriptors, a network socket, or some higher-level session abstraction, I can usually find the "best" way to deal with it by inspecting the turbo.lua sources - or, if feasible, I just use turbo.lua for the application on the table. It has been a seriously valuable tool in the "get some networking/signals/fd_set thing working as quickly as possible" department.

Not that I'm suggesting the author switch to it, just that there is already a battle-tested and well-proven Lua-based framework for these things, and it's worth investigating how it manages these abstractions. The distinction between local and IP sockets is well managed, too.


There seems to be a double-paste error in the link. Should be: https://turbo.readthedocs.io/en/latest

Thanks, fixed.

Why does it take 52 seconds of CPU time to serve 275 gopher requests? That’s a couple orders of magnitude more than I’d expect.

Because they were all done sequentially, not concurrently.

That's elapsed time, not CPU time.


