
> we compile the binary with debug symbols and a flag to compress the debug symbols sections to avoid having huge binary.

How big are the uncompressed debug symbols? I'd expect processing of uncompressed debug symbols to happen via a memory-mapped file, while compressed debug symbols probably need to be extracted into anonymous memory.

https://github.com/llvm/llvm-project/issues/63290
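
To make the distinction concrete, here is a rough sketch of the two paths (my own illustration, not how the backtrace machinery actually works internally; memmap2 and flate2 are only illustrative crate choices, and the byte range is a placeholder):

  use std::{fs::File, io::Read};

  use flate2::read::ZlibDecoder; // illustrative decompressor choice
  use memmap2::Mmap;             // illustrative mmap wrapper choice

  fn main() -> std::io::Result<()> {
      let file = File::open("target/release/engine-gateway")?;

      // Uncompressed .debug_* sections: map the file and let the OS page in
      // only the parts the symbolizer actually touches (file-backed, reclaimable).
      let mapped = unsafe { Mmap::map(&file)? };

      // Compressed sections: the zlib stream must be inflated into a heap
      // buffer (anonymous memory), which is what shows up as extra RSS.
      // The range below is a placeholder; a real symbolizer locates the
      // section via the ELF section headers and skips the compression header.
      let compressed = &mapped[..4096.min(mapped.len())];
      let mut inflated = Vec::new();
      let _ = ZlibDecoder::new(compressed).read_to_end(&mut inflated);

      println!("mapped {} bytes, inflated {} bytes", mapped.len(), inflated.len());
      Ok(())
  }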


Normal build

  cargo build --bin engine-gateway --release
      Finished `release` profile [optimized + debuginfo] target(s) in 1m 00s

  ls -lh target/release/engine-gateway 
  .rwxr-xr-x erebe erebe 198 MB Sun Jan 19 12:37:35 2025    target/release/engine-gateway

what we ship

  export RUSTFLAGS="-C link-arg=-Wl,--compress-debug-sections=zlib -C force-frame-pointers=yes" 
  cargo build --bin engine-gateway --release
      Finished `release` profile [optimized + debuginfo] target(s) in 1m 04s

  ls -lh target/release/engine-gateway
  .rwxr-xr-x erebe erebe 61 MB Sun Jan 19 12:39:13 2025  target/release/engine-gateway

The difference is more impressive on some bigger projects.


The compressed symbols sound like the likely culprit. Do you really need a small executable? The uncompressed symbols need to be loaded into RAM anyway, and if that loading is delayed until they are needed, you will have to allocate memory to decompress them.


I will give it a shot next week and try it out ;P

For this particular service, the size does not really matter. For others, it makes more of a difference (several hundred MB), and as we deploy on customers' infra, we want image sizes to stay reasonable. For now, we apply the same build rules to all our services to stay consistent.


Maybe I'm not communicating well. Or maybe I don't understand how the debug symbol compression works at runtime. But my point is that I don't think you are getting the tradeoff you think you are getting. The smaller executable may end up using more RAM. Usually at the deployment stage, that's what matters.

Smaller executables are more for things like reducing distribution sizes, or reducing process launch latency when disk throughput is the issue. When you invoke compression, you are explicitly trading off runtime performance in order to get the benefit of smaller on-disk or network transmission size. For a hosted service, that's usually not a good tradeoff.


It is most likely me reading too quickly. I was caught off guard by the article gaining traction on a Sunday, and as I have other duties during the weekend, I am reading/responding only when I can sneak some time in.

Regarding your comment, I think you are right that the compression of debug symbols adds to the peak memory, but I think you are mistaken in believing the debug symbols are decompressed when the app/binary is started/loaded. For me, decompression only happens when that section is accessed by a debugger or equivalent. It is not the same thing as when the binary is fully compressed, like with upx for example.

I have done a quick sanity check on my desktop; here is what I got:

  [profile.release]
  lto = "thin"
  debug = true
  strip = false

  export RUSTFLAGS="-C link-arg=-Wl,--compress-debug-sections=zlib -C force-frame-pointers=yes"
  cargo build --bin engine-gateway --release
  
 
Looking at RSS, memory at startup is ~128 MB, and the peak after the panic is ~627 MB.

When compiled with these flags instead:

  export RUSTFLAGS="-C force-frame-pointers=yes" 
  cargo build --bin engine-gateway --release

Looking at RSS, memory at startup is ~128 MB, and the peak after the panic is ~474 MB.

So the peak is indeed higher when the debug sections are compressed, but the memory used by the binary at startup is roughly equivalent (virtual memory too).
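
As a minimal sketch of the kind of check I mean (Linux only, not our actual service code): read VmRSS from /proc/self/status, force a backtrace, and read it again. Capturing is cheap; it is the formatting that resolves symbols and touches the debug sections.

  use std::backtrace::Backtrace;
  use std::fs;

  // Rough RSS reading from /proc/self/status (Linux only).
  fn rss_kb() -> u64 {
      fs::read_to_string("/proc/self/status")
          .unwrap_or_default()
          .lines()
          .find(|l| l.starts_with("VmRSS:"))
          .and_then(|l| l.split_whitespace().nth(1))
          .and_then(|v| v.parse().ok())
          .unwrap_or(0)
  }

  fn main() {
      println!("RSS at startup: {} kB", rss_kb());

      // Capturing is cheap: symbol resolution is deferred.
      let bt = Backtrace::force_capture();

      // Formatting the backtrace resolves symbols, which is when the
      // debug sections are read (and decompressed, if they are compressed).
      println!("{bt}");

      println!("RSS after resolving the backtrace: {} kB", rss_kb());
  }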

I had a hard time finding a source to validate my belief regarding when the debug symbols are decompressed. But based on https://inbox.sourceware.org/binutils/20080622061003.D279F3F... and the help of claude.ai, I would say it is only when those sections are accessed.

For what it's worth, here is the whole answer from claude.ai:

  The debug sections compressed with --compress-debug-sections=zlib are decompressed:

  At runtime by the debugger (like GDB) when it needs to access the debug information:

  When setting breakpoints
  When doing backtraces
  When inspecting variables
  During symbol resolution


  When tools need to read debug info:

  During coredump analysis
  When using tools like addr2line
  During source-level debugging
  When using readelf with the -w option


  The compression is transparent to these tools - they automatically handle the decompression when needed.    The sections remain compressed on disk, and are only decompressed in memory when required.
  This helps reduce the binary size on disk while still maintaining full debugging capabilities, with only a small runtime performance cost when the debug info needs to be accessed.
  The decompression is handled by the libelf/DWARF libraries that these tools use to parse the ELF files.


Thanks for running these checks. I’m learning from this too!


Container images use compression too, so having the debug section uncompressed shouldn't actually make the images any bigger.


The analysis looks rather half-finished. They did not analyze why so much memory was consumed: whether it is a cache that persists after the first call, temporary working memory, or an accumulating memory leak. Nor why it uses so much memory at all.

I couldn't find any other complaints about Rust backtrace printing consuming a lot of memory, which I would have expected if this were normal behaviour. So I wonder if there is anything special about their environment or use case?

I would assume that the same OOM problem would arise when printing a panic backtrace. Either their instance has enough memory to print backtraces, or it doesn't. So I don't understand why they only disable lib backtraces.


Hello,

You can see my other comment https://news.ycombinator.com/item?id=42708904#42756072 for more details.

But yes, the cache does persist after the first call; the resolved symbols stay in the cache to speed up the resolution of subsequent calls.

Regarding the why, it is mainly because:

1. this app is a gRPC server and contains a lot of generated code (you can investigate binary bloat in Rust with https://github.com/RazrFalcon/cargo-bloat)

2. we ship our binary with debug symbols, with these options:

  ENV RUSTFLAGS="-C link-arg=-Wl,--compress-debug-sections=zlib -C force-frame-pointers=yes"

Regarding the panic: indeed, I had the same question on Reddit. For this particular service, we don't expect panics at all; it is just that by default we ship all our Rust binaries with backtraces enabled. And for other apps, we have added an extra API endpoint that triggers a caught panic on purpose, to be sure our sizing is correct.
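
A minimal sketch of what such an endpoint boils down to (hypothetical names, not our actual handler): catch the panic so the process keeps serving, while the default panic hook still prints the message and the backtrace.

  use std::panic;

  // Hypothetical body of a "trigger a test panic" endpoint: the default
  // panic hook prints the message and the backtrace (with RUST_BACKTRACE=1),
  // which exercises symbol resolution and shows the real memory cost,
  // while catch_unwind keeps the process alive.
  fn trigger_test_panic() {
      let result = panic::catch_unwind(|| {
          panic!("intentional panic to validate memory sizing");
      });
      assert!(result.is_err()); // the panic was caught; the service keeps running
  }

  fn main() {
      trigger_test_panic();
      println!("still alive after the caught panic");
  }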


I don't think the 4 GiB instance actually ran into an OOM error. They merely observed a 400 MiB memory spike. The crashing instances were limited to 256 and later 512 MiB.

(Assuming that the article incorrectly used Mib when it meant MiB. Used correctly, b = bit and B = byte.)


No:

> On the other hand, Ropey is not good at:

> Handling texts that are larger than available memory. Ropey is an in-memory data structure.

Also, this is a Rust library, not an editor application.


When using deniable authentication (e.g. Diffie-Hellman plus a MAC), the recipient can verify that the email came from the sender. But they can't prove to a third party that the email came from the sender, and wasn't forged by the recipient.


Mallory sends a message, forged as if from Alice, to Bob. How can Bob determine that it came from Alice and wasn’t forged by Mallory?


No, no, in these systems Alice and Bob both know a secret. Mallory doesn't know the secret, so, Mallory can't forge such a message.

However, Bob can't prove to the world "Alice sent me this message saying she hates cats!" because everybody knows Bob knows the same secret as Alice, so that message could just as easily have been made by Bob. Bob knows he didn't make it, and he knows the only other person who could have was Alice, so he knows he's right - but Alice's cat hatred cannot be proven to others who don't just believe what Bob tells them about Alice.


Now it makes sense why Alice was sending me that kitten in a mixer video.

But seriously, in a case before a court or jury, wouldn't there be much more evidence? Down to your own lawyer sending a complete dump of your phone, with all those Sandy Hook conspiracies and hate messages, to the opposing side?


Sometimes, instead of a complete dump, you might mutually agree on a third-party forensic lab to analyze your phone and only provide relevant details to the opposing side. Usually there are a few rounds of review where you can remove details/records that are not relevant before the distilled records are sent to the opposing counsel.


Deniability is just that, the opportunity to deny. Will denying change what people believe? Well, maybe or maybe not. I'm sure you can think of your own examples without me annoying the moderators.


The idea is to use a MAC instead of a signature. As long as Alice isn't compromised and sharing her key with Mallory (which she could do even in the signature case), when Bob receives a message with a valid MAC on it, he knows that Alice authorized the message.
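
A tiny sketch of the idea using the hmac and sha2 crates (the shared key would come from a Diffie-Hellman exchange in practice; the key and message here are placeholders):

  use hmac::{Hmac, Mac};
  use sha2::Sha256;

  type HmacSha256 = Hmac<Sha256>;

  fn main() {
      // Placeholder: in practice this is derived via Diffie-Hellman.
      // The point is that Alice AND Bob both hold the same key.
      let shared_key = b"placeholder key derived via DH";

      // Alice authenticates her message with a MAC, not a signature.
      let mut mac = HmacSha256::new_from_slice(shared_key).expect("HMAC takes a key of any length");
      mac.update(b"I hate cats");
      let tag = mac.finalize().into_bytes();

      // Bob verifies the tag, so he knows Alice sent it (Mallory lacks the key)...
      let mut check = HmacSha256::new_from_slice(shared_key).unwrap();
      check.update(b"I hate cats");
      check.verify_slice(&tag[..]).expect("valid tag");

      // ...but because Bob could have produced the exact same tag himself,
      // the tag proves nothing about Alice to a third party.
  }

The crucial property is the symmetry: anyone who can verify the tag can also forge it, which is exactly what gives Alice deniability.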


This is always a possibility, but I'm guessing the happy path is the more likely and believable scenario, and it could be verified out of band.


Returning a 3xx redirect to a generic error page is even worse than a 200.


Of the three bad things they've been accused of, I'd consider that by far the least serious. Selling tracking data is an invasion of privacy. Deliberately not showing better discounts violates their core value proposition. Replacing referral links doesn't hurt the user, and isn't much different from blocking ads.


As a user who might use referral links to support a YouTube channel, I do feel that in an indirect way this does hurt the user.


1. Do you support compression for data stored in segments?

2. Does the choice of storage class only affect chunks or also segments?

To me, the best solution seems to be combining initially storing writes on EBS (or even NVMe), to minimize the time until writes can be acknowledged, with creating a chunk on S3 Standard every second or so. But I assume that would require significant engineering effort for applications that require data to be replicated to several AZs before acknowledging it. Though some applications might be willing to sacrifice 1 s of writes on node failure in exchange for cheap and fast writes.

3. You could be clearer about what "latency" means. I see at least three different latencies that could be important to different applications:

a) time until a write is durably stored and acknowledged

b) time until a tailing reader sees a write

c) time to first byte after a read request for old data

4. How do you handle streams which are rarely written to? Will newly appended records to those streams remain in chunks indefinitely? Or do you create tiny segments? Or replace an existing segment with the concatenated data?


(Founder) Thanks for the deep questions!

1) Storage is priced on uncompressed data. We don't currently compress segments.

2) It only affects chunk storage. We do have a 'Native' chunk store in mind, the sketch involves introducing NVMe disks (as a separate service the core depends on) - so we can offer under 5 millisecond end-to-end tail latencies.

3) The append ack latency and end-to-end latency with a tailing reader is largely equivalent for us since latest writes are in memory for a brief period after acknowledgment. If you try the CLI ping command (see GIF on landing page) from the same cloud region as us (AWS us-east-1 only currently), you'll see end-to-end and append ack latency as basically the same. TTFB for older data is ~ TTFB to get a segment data range from object storage, so it can be a few hundred milliseconds.

4) We have a deadline to free chunks, so we PUT a tiny segment if we have to.


> To me the best solution seem like combining storing writes on EBS (or even NVMe) initially to minimize the time until writes can be acknowledged, and creating a chunk on S3 standard every second or so.

Yep, this is approximately Gazette's architecture (https://github.com/gazette/core). It buys the latency profile of flash storage, with the unbounded storage and durability of S3.

An addendum is there's no need to flush to S3 quite that frequently, if readers instead tail ACK'd content from local disk. Another neat thing you can do is hand bulk historical readers pre-signed URLs to files in cloud storage, so those bytes don't need to proxy through brokers.
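
A toy sketch of that write path (my own illustration, not Gazette's actual code): acknowledge an append only after it is fsync'd to a local spool file, and seal the spool into an object-store segment roughly every second.

  use std::fs::File;
  use std::io::Write;
  use std::time::{Duration, Instant};

  struct Spool {
      file: File,
      opened_at: Instant,
  }

  impl Spool {
      // Ack the append only after the local flush has completed.
      fn append(&mut self, record: &[u8]) -> std::io::Result<()> {
          self.file.write_all(record)?;
          self.file.sync_data()
      }

      // Roughly once a second the spool is sealed and uploaded to object
      // storage as an immutable segment (upload and recovery omitted).
      fn should_seal(&self) -> bool {
          self.opened_at.elapsed() >= Duration::from_secs(1)
      }
  }

  fn main() -> std::io::Result<()> {
      let mut spool = Spool {
          file: File::create("/tmp/spool.log")?,
          opened_at: Instant::now(),
      };
      spool.append(b"hello\n")?;
      println!("time to seal the spool yet? {}", spool.should_seal());
      Ok(())
  }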


Here "log" means "append-only stream of small records". This isn't just about traditional logs (including HTTP request logs and error logs). You could use it to store events for an event-sourced application, and even as the write-ahead log (WAL) for a database.

A distributed, but still consistent and durable log is a great building block for higher level abstractions.
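
As a hedged sketch of the kind of interface such a log gives you (made-up names, not S2's actual API):

  // A made-up minimal interface, just to illustrate the primitive being
  // discussed: an ordered, durable, append-only stream of small records.
  trait Log {
      /// Append a record and return its position in the total order.
      fn append(&mut self, record: &[u8]) -> u64;

      /// Read records at or after `seq`, in order.
      fn read_from(&self, seq: u64) -> Vec<(u64, Vec<u8>)>;
  }

  // A database WAL, an event-sourced app, or a replicated state machine
  // can all be layered on top of exactly this kind of interface.
  fn replay(log: &dyn Log, mut apply: impl FnMut(u64, &[u8])) {
      for (seq, record) in log.read_from(0) {
          apply(seq, &record);
      }
  }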


That makes more sense. I suppose an audit log would also fit. I guess append-only backups wouldn't fit the "small" requirement though.


"Small" means 1MiB per record here. But a higher level abstraction could split one logical operation into multiple records. Just like FoundationDB has severe limits on its transaction size, while higher level databases built on top of it work around that limit.

The OP's blog post linked to this article, which explains some scenarios where this storage primitive would be helpful: https://engineering.linkedin.com/distributed-systems/log-wha...

This product offers two advantages over S3: 1) appending a small amount of data is cheap, and 2) writes are forced into a consistent order (so you don't need to implement Paxos or Raft yourself). Neither of these is useful for backups. Raw S3 already works well for that use case, especially now that Amazon has added support for preconditions.


> we are considering an in-memory emulator to open source ourselves

I'd suggest a persistent emulator, using something like SQLite (one row per record). Even for local development, many applications need persistence. And it would even be enough to run a single-node, low-throughput production server that doesn't need robust durability and availability. But it still has enough overhead and limitations not to compete with your cloud offering.
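
Something along these lines, as a very rough sketch using rusqlite (table and column names made up; a real emulator would obviously need the service's actual semantics):

  use rusqlite::{params, Connection};

  // Very rough sketch of the suggested emulator: one row per record,
  // with a per-stream, monotonically increasing sequence number.
  fn main() -> rusqlite::Result<()> {
      let conn = Connection::open("emulator.db")?;
      conn.execute_batch(
          "CREATE TABLE IF NOT EXISTS records (
               stream TEXT    NOT NULL,
               seq    INTEGER NOT NULL,
               body   BLOB    NOT NULL,
               PRIMARY KEY (stream, seq)
           );",
      )?;

      // Append: the next seq is max(seq) + 1 within the stream.
      conn.execute(
          "INSERT INTO records (stream, seq, body)
           SELECT ?1, COALESCE(MAX(seq), -1) + 1, ?2 FROM records WHERE stream = ?1",
          params!["my-stream", b"hello".to_vec()],
      )?;

      // Tail: read everything at or after a sequence number, in order.
      let mut stmt = conn.prepare(
          "SELECT seq, body FROM records WHERE stream = ?1 AND seq >= ?2 ORDER BY seq",
      )?;
      let rows = stmt.query_map(params!["my-stream", 0i64], |row| {
          Ok((row.get::<_, i64>(0)?, row.get::<_, Vec<u8>>(1)?))
      })?;
      for row in rows {
          let (seq, body) = row?;
          println!("{seq}: {} bytes", body.len());
      }
      Ok(())
  }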

What is important, however, is being as close as possible to your production system behavior-wise. So I'd try to share as much of the frontend code (e.g. the gRPC and REST handlers) as possible between the two.

