Refix: Fast, Debuggable, Reproducible Builds (yosefk.com)
106 points by luu 16 days ago | hide | past | favorite | 43 comments

Author doesn't mention Nix by name, so I wonder if they're aware of it. While Nix as a build system has more than its fair share of "dark corners", it does make all paths reproducible (including with a chroot, to get fixed-length paths), and has automagic debuginfod support via [1].

[1]: https://github.com/symphorien/nixseparatedebuginfod

debuginfod has its issues, as discussed in a sister thread. No, I'm not familiar with Nix. What source paths does it produce in the debug info sections? If they are "reproducible" in the sense that everyone gets the same path rather than debug info pointing to their source directory, then running refix on Nix's output solves exactly the problem I want solved!

That is exactly what happens.

Is it possible to embed relative paths (i.e. starting with ./), and instruct your debugger where the source root is?

Surely the way to do this would be to embed a SHA-256 of the source file. That's something that can theoretically be hosted anywhere and easily recovered if you don't have it locally.
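(Sketch of the idea in the comment above, not from the article: key each source file by the SHA-256 of its contents, so whatever the debug info records can be resolved against any store — a local cache, a CI artifact server — that indexes files by content hash. The lookup URL is made up.)

```python
import hashlib

def source_key(contents: bytes) -> str:
    """Content hash that could identify a source file in any store."""
    return hashlib.sha256(contents).hexdigest()

src = b'int main(void) { return 0; }\n'
key = source_key(src)
# A consumer would then fetch e.g. https://sources.example.com/<key>
print(len(key))  # 64 hex chars
```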

although not the target audience of tfa (no build caching, because we have a compact c codebase), we always use relative paths in compiler invocation, and that naturally results in relative paths in debug info.

so we've never had the problem the author set out to resolve -- despite each of us (and the ci) having different source tree locations, gdb picks up the source code just fine.

i guess the build systems people use love canonicalising source paths?

Actually you very much do have the problem I set out to resolve :-) in the sense that if you get 5 core dumps from 5 program versions, you don't know which source tree corresponds to which core dump. You might not experience this as a problem because there's some way of finding the source tree, but you definitely have it :-)

Note that in general, relative paths create 2 problems:

1. You [and tools] don't know where the code is just by inspecting the binary;

2. There may be more than one answer to the question "relative to what?"

In this sense, a relative path is potentially even worse than "an absolute path to a directory created during the build that was since deleted", since there's substitute-path [in gdb, not necessarily any other DWARF consumer, but at least in gdb] and you could find currently existing paths corresponding to the absolute paths in the build and remap them (unless of course they were all the same absolute path like /tmp/xxx but they were directories on different build hosts or created at different build phases - but you hope it won't get this bad.) With relative paths, you have no reasonable way of remapping different meanings of "./" to different absolute paths since "./" is always "./" (though you might remap differently based on what follows "./" in the path; but it's getting really painful at this point.)

This "relative to what?" question is not a problem if you have a monolithic build (which is awesome, everyone should!) with all the source kept under the same root directory; then "relative to what?" has the sane single answer "relative to that root directory." It can be a big problem if there are multiple build systems building multiple binaries in different root directories, and a linker then links those libraries, or they're shared objects loaded at runtime and ending up in the same process or core dump.
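(A toy sketch of the remapping point above, not from the article: gdb-style prefix substitution can map distinct absolute build roots to distinct local trees, but two binaries that both recorded "./" give you nothing to key the rule on. All paths here are invented.)

```python
def remap(path, rules):
    """Apply the first matching prefix rule, substitute-path style."""
    for old, new in rules:
        if path.startswith(old):
            return new + path[len(old):]
    return path

# Two binaries built on two hosts with distinct absolute roots: each
# root can be remapped to the matching local checkout, unambiguously.
rules = [
    ("/build/hostA/proj", "/home/me/src/v1"),
    ("/build/hostB/proj", "/home/me/src/v2"),
]
print(remap("/build/hostA/proj/foo.c", rules))  # /home/me/src/v1/foo.c
print(remap("/build/hostB/proj/foo.c", rules))  # /home/me/src/v2/foo.c

# With relative paths, both binaries record the identical "./foo.c";
# one rule for "./" necessarily hits both, so per-tree remapping fails.
print(remap("./foo.c", rules))  # unchanged: ./foo.c
```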

> 1. You [and tools] don't know where the code is just by inspecting the binary;

Your bug reporting tool should obviously report the version of the software the dump is associated with. It's common to just put the git hash in the binary.
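(Illustration of the "git hash in the binary" approach, not from the article: if the build bakes a recognizable marker string into the binary, any executable or core dump can be matched back to a commit by scanning for it — the same thing `strings app | grep` does. The marker format is made up.)

```python
import re

def find_git_hash(blob: bytes):
    """Scan raw binary data for a baked-in 'GIT:<sha1>' marker."""
    m = re.search(rb'GIT:([0-9a-f]{40})', blob)
    return m.group(1).decode() if m else None

# Toy "binary" with an embedded marker:
binary = b'\x7fELF...code...' + b'GIT:' + b'a' * 40 + b'\x00...'
print(find_git_hash(binary))  # the 40-char hash
print(find_git_hash(b'no marker here'))  # None
```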

> 2. There may be more than one answer to the question relative to what?

The directory CI runs in. If you're not sure, check your CI scripts.

1. What if the version has uncommitted changes? What if the code is compiled from multiple repositories? I should now clone them all?

2. As discussed in GP, there can be more than one such directory if your build isn't monolithic (which it should be, but it often isn't)

> What if the version has uncommitted changes

How is knowing the filenames going to help with that?

> What if the code is compiled from multiple repositories? I should now clone them all?

Presumably you already have a CI script that does this. If not, invest a few minutes into making one.

If the code has uncommitted changes they are typically (~always) kept in the modified files they were compiled from. Knowing the filenames is how you find them.

Great, I now have the path to my git repository, same as every other build, and probably leaked my system username in the process.

This is definitely a hack, and despite their claims that it's more robust, they've introduced a load more failure modes.

I can't imagine it's really that slow to edit the paths properly. It's probably just accidental N^2 or something.

Also it feels like there should be a proper field in ELF/DWARF for this if people are wanting to do it a lot (and it seems like a reasonable thing to do).

"A load more failure modes" - which? This has been used for more than a year by more than 1000 developers and there were never complaints about corrupted binaries or corrupted debug information (except when for some reason "refixing" didn't actually happen for some of the binaries), or build speed issues. This, in a code base producing a binary that crashed debugedit after 30 seconds of processing. So empirically it's fine, and I think also fine on theoretical grounds; if you disagree it'd be interesting to hear the details.

Regarding replacing the paths "properly": it's not accidentally quadratic, it's predictably linear, but with large constants, because there's a lot of data to move and a lot of offsets to update. An example of such a problem (linear complexity, large constants) is the linker. It took more than 40 years for C++ linkers to get fast; mold is the linker that I always said was possible (a very fast one) but didn't actually implement. Now it's implemented, and it takes 5-10 seconds to link that large binary.

This is roughly the speed you can hope for if you not only postulate that fast DWARF editing is possible (as I used to do with fast linking) but actually implement it; it will still be at least an order of magnitude slower than refix, but it will match the speed of sed. Of course, since we had to wait more than 40 years for a fast linker (and a linker is something nobody can do without, and everyone wants it to be fast), we'll have to wait more like 400 years for anyone to actually optimize debugedit to that level. And I much prefer the small annoyance of having leading ////// in the file paths to waiting 400 years for a less buggy and faster debugedit, and then waiting 5-10 seconds per rebuild.

I agree that it would have been great if the DWARF format, as well as the actual DWARF producers, made it easier to change path prefixes, and it'd be especially great if it worked with compressed debug sections without having to uncompress and recompress them.

I can imagine it. In fact, I would trust yosefk 100% if he says it is. Did you try it on a big binary yourself?

Sorry if I wasn't clear but I didn't doubt that it was slow, I doubted that it fundamentally had to be slow.

__FILE__ is the pain of my existence; recently I contributed code to run/build ds3os using nix instead of docker, where I hit issues with __FILE__


For unwanted values of __FILE__, you can "refix" these to something reproducible after the build (of the libraries, object files, or the final binaries, whichever is easier with the given build flow), if that's your problem. Better yet is to remap the path prefix at build time and then "refix" to where the source code really is; but if you'd rather just remove irrelevant absolute paths and replace them with something deterministic, you can do that, too.
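(A toy sketch of my reading of the slash-padding trick the thread discusses — the one that produces "leading //////" in file paths; all names and paths here are made up. Because the replacement string is padded with leading "/" to exactly the placeholder's length, and repeated slashes are harmless in POSIX paths, the strings can be overwritten in place and no offsets in the debug sections need updating.)

```python
def refix_bytes(blob: bytes, old: bytes, new: bytes) -> bytes:
    """Replace a path prefix in raw binary data, length-preserving."""
    if len(new) > len(old):
        raise ValueError("new path must not be longer than the placeholder")
    padded = b'/' * (len(old) - len(new)) + new  # same length as old
    return blob.replace(old, padded)

blob = b'...\x00/placeholder/src/main.c\x00...'
out = refix_bytes(blob, b'/placeholder/src', b'/home/me/src')
print(out)  # b'...\x00/////home/me/src/main.c\x00...'
print(len(out) == len(blob))  # True: nothing moved
```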

Yeah, nix's remove-references-to does exactly that, it replaces the prefix with something deterministic (bunch of zeroes IIRC). I know about the gcc (and recently clang's) options to map the prefix, but the command line is annoying to use. Wish there was simply -relative-prefix or something and it would be the default. The nice thing about nix is that it will detect any references to /nix/store paths that shouldn't be there and fail the build, so you immediately know if there is a reproducibility problem related to the paths.

I am kinda struggling to understand the proposed solution. So, okay, I have an EXE, and I have a PDB file for it, and that PDB file states that the source code lived at Q:\Users\raymond.chen\Documents\work\projects\ContosoFoobar\src\main.c. Instead of having some sort of debugger setting to allow me to remap the path to something existing on my machine (my disks go up only to H:, for one), the proposed solution is to monkey-patch the PDB file itself. But, like, why? It doesn't solve the main problem that you still need to somehow find the sources, manually.

Also: if you really, really want to never go looking for the sources ― just store them in the finished executable/debug info file. Seriously. For some reason, they're (when gzip'ed) somehow smaller than the resulting executable with debug info.
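(A toy illustration of the "just store the sources in the binary" point above, not from the article: gzip'ed source text is small, so stashing the whole tree in an extra section of the executable — e.g. via something like `objcopy --add-section` — is feasible. This only shows the compression round trip on invented data.)

```python
import gzip

# Repetitive source text stands in for a real source tree:
sources = b'int main(void) { return 0; }\n' * 100
packed = gzip.compress(sources)

print(len(packed) < len(sources))  # True: source code compresses well
print(gzip.decompress(packed) == sources)  # True: fully recoverable
```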

Well, PDB isn't DWARF (which is kept inside the executable) so indeed maybe you can't find the PDB from the executable... you could embed its path using a .ver section as described in TFA though (if it's not already there - I never worked seriously enough on Windows to learn). As to finding the sources from the PDB - yes it's easier after the monkey patching if the source is exported on NAS or is kept locally; if you download from CI then CI can put a canonical local directory in the path, convenient for concurrently debugging multiple released versions.

Regarding debugger path remapping - there are a lot of tools consuming debug info, not just "the" debugger, and not all have the setting; assert line info printout doesn't have a setting like that. It's also extra hassle to remap even if you can.

As to storing the sources in the binary - it can be nice, but it's going to be slower for various reasons. How big they are depends in part on how good you are at putting just the code you actually compiled there, which is not always easy. And when an assert fails, you need to extract the source before looking at it; everything now has this step.

The point of refix is to make sources easier/trivial to find with very minor build flow changes - just a compiler flag and a post-build step taking tens of milliseconds.

Oh, you have your CI store all the sources for all the builds, and you have NFS-like access to it? Yeah, OK, this case I can see how having absolute paths would be useful.

In my case at work (and we actually use Linux), we sadly don't have that luxury: the CI is walled off and its disk is purged every midnight, so knowing what folder the project was built in is pointless, since it doesn't even contain the commit hash - it has the CI's unique job number instead, but even that is useless, since the CI for some reason stores info for only about 3 days of build jobs, so... yeah. We resort to baking in the git info as a recognizable string into the binary, and then checking out the sources manually.

P.S. IIRC, PDBs are matched to the EXEs by the EXE's checksum, so generally if the PDB is checked into a symbol server, then retrieving it back is simple - if you know what symbol server to query, of course.

The premise of the tool seems very useful: edit debug symbols and assert messages so that source code can be found by debuggers. But this description does not make it clear how this tool accomplishes the whole task:

> Why not fix the binary coming out of the build cache, so it points to the absolute path of the source files?

What is the absolute path? If you had a virtual file system that allowed you to construct a path to any file at a given commit, this would work great. But who does that other than Google? Or you could agree that every developer will check out the same source code repo at the same path, but then you have to have the right commit checked out.

Ideally you would want your binary to point back to your code repo, like SourceLink does.


If it's a local build, it's the path where your source files are right now. (And you want the full path since you might be working with multiple versions concurrently or you might have sent someone/somewhere a locally built binary and got a core back.)

If it's a CI build, I recommend exporting source files to NAS and defining a retention policy. If the org is averse to NAS (despite this use case being fine and not exposing the weaknesses of NAS), you could define the absolute path as /local/commit-id/whatever and have a system where the source gets cloned there relatively easily. Note that either way CI needn't build with the source at the path you refix to at the post-link step, and with NAS it definitely shouldn't, since for this use case NAS is very bad.

Ok, thanks, that makes more sense to me now.

What about debuginfod instead? Look up your source and even your debug symbols by your build identifier.

debuginfod is infinitely better than nothing, but I think much worse than the binary just pointing to the absolute source path (1). Using debuginfod for this is an example of what I meant by "people standardizing on workarounds" to a problem which I think is more easily solved by "refixing" the binary to the absolute path.

Some issues with debuginfod you won't have with "refixed" binaries:

* If an assert fails, you still don't get an absolute path you can open in an editor in the error message

* This assumes the debug info including the source code was uploaded to the server. This can work for "external releases" but do you want this to happen with every build done by every user? This would be notably slower than refixing the binary. If you don't do this, then for every build you didn't do this for, you still have the problem

* Not all tools consuming DWARF support this. gdb and Valgrind do; does VTune or the sanitizers which emit call stacks with source lines you then want to open in an editor?

* Implementing this reasonably is just much more work than passing 2 flags to gcc and running one post-link pass on every binary coming out of the build system

(1) Well, "much worse" in this specific sense. debuginfod does solve a different problem of distributing debug information to people who have a binary installation without debug info on demand. At this it's infinitely better than refix which doesn't help here at all. However this is not how I'd like to work on my employer's software as a developer.

> firstly, GNU memmem is slower than memchr::memmem::Finder,

Not only slower than the Rust version - also slower than most of the other 200+ string search functions tested. Meanwhile musl has the fastest string search for substrings shorter than 32 chars.

See https://rurban.github.io/smart/results/all/rand2.html with libc1 being GNU libc memmem() and musl1 being musl's memmem().

Weird post. It raises questions, then admonishes you for reading them.

> "Well you're not supposed to get core dumps from versions with uncommitted changes, unless it's your local version that you haven't given to anyone but are testing locally, so you know which version it is. You should only release versions externally thru CI" - so giving anything to anyone to test is now considered "releasing externally" and must necessarily go thru CI, and having trouble finding the source code is now a punishment for straying from proper procedure? How did this discussion, which started at how build caches speed up the build, deteriorate to the point where we're telling developers to change how they work, in ways which will slow them down?

Hey, you brought it up. Don't put this on the reader.

I'm not the author, but I think I understand what this is about. It's the etiquette rather than the law.

This is about bad situations like... here are some examples:

* If you read the description of the Wheel format (Python packages), the archive name allows one to include e.g. a build number, but the spec states such packages shouldn't be released publicly. (I've never tested this with PyPI.org, but wouldn't be surprised if they actually allow such packages, because see next).

* PyPI allows uploading packages with non-release versions, e.g. X.Y.Za1 or X.Y.Zb2 or X.Y.Zrc3. That is, the public index intended for production use allows non-production versions. This usually doesn't bother Python programmers because they rarely use anything other than pip to install packages, whose crafty defaults filter out pre-release versions. This, of course, breaks when easy_install gets into the picture (and that happens when developers run something like "setup.py install").

Now, where this is all going: the protocol for working with publicly released software should stick to the release versions (if you want to debug it, send bug reports etc.). This is to avoid overloading maintainers with investigating custom configurations and the infinite number of things users could do with the software when not following the explicit usage/build instructions.

You, among friends, don't have to follow the protocol, as long as your friends are OK with it. But, as a piece of general advice, OP is right: this is how you behave nicely, and if you don't do that, the users will be upset, and you'll get a lot of extra work with a ton of criticism on top of it.

My point in the quote from GP was that it should be easy to debug any binary, whether it was built in CI or locally, as a formal release or not, and whether it did or didn't include uncommitted changes. Refix will make your binaries easily debuggable in all these cases regardless of how your build system and version control (or several build and version control systems) works.

The author is much too self-deprecating! IMHO this isn’t the dreaded “premature optimization,” it’s avoiding pessimization. I’d like to read them in their unrestricted voice!

Great article!

Thanks! Not sure about the unrestricted voice, it might get me in trouble, though if you browse the archives there are some fairly unrestricted instances for better and for worse

Concur with OP. Either way, you're a gifted technical writer (and you know so much!).

> Not sure about the unrestricted voice, it might get me in trouble

None of your recent tweets (on teams/managers, on tests, on tooling etc) have got you in to any lasting trouble... have they? ;)

But imagine what the unrestricted version could do

Unrestricted voice?


If I had just read your comment I would have thought that he was a rookie/newcomer writer.

I love this. The kind of simple solution that gets the job done.

Just for the record, for nicer inspection of files with such debug information, including compressed sections, and debuginfod support, Rizin[1] can be used, since starting from the 0.7.0 release[2] all of those were added.

[1] https://rizin.re

[2] https://github.com/rizinorg/rizin/releases/tag/v0.7.0

Refix proposes an intriguing approach to achieving fast, debuggable, and reproducible builds - a triad often considered difficult to attain simultaneously. Its methodology, emphasizing minimal binary manipulation to preserve debuggability while ensuring reproducibility, introduces a potentially game-changing technique in software development practices. However, the practicality of implementing Refix in diverse development environments, especially those with complex dependency graphs and varying build processes, raises some concerns. Additionally, while the focus on C/C++ is understandable given the language's compilation model, the broader applicability to other programming paradigms remains to be seen. The proposition of modifying binaries post-compilation might also introduce unforeseen complexities in continuous integration workflows. Refix's novel idea undoubtedly sparks interest, but its adaptation to the multifaceted landscape of build systems calls for a thorough examination of its implications and integrations.

Bad bot.

I love this bot - expert-level concern trolling! It would be fun to chat with it about the details of its wholly bogus concerns. I like talking to LLMs from time to time - it's its own kind of fun!

Yeah, it's a bot, but we are testing our new Function Store for LLMs. You can check it out; it stores the functions used in your LLM agents like LangChain and AutoGen, along with their documentation and usage analyses.
