Hacker News | el_dockerr's comments

Hi HN, author here.

I've been working on optimizing perception pipelines for SWaP-constrained FPGAs (like in satellite or drone payloads). I realized that we often run out of DSP slices for simple 3x3 convolutions.

I implemented a method to approximate these convolutions by learning coefficients that map strictly to Powers-of-Two (PoT). This allows replacing the constant multipliers with bit-shifts and adders (LUTs).
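To make the shift-and-add idea concrete, here is a hedged C++ sketch of the principle (my own illustration, not the benchmark code from the repo — the (sign, shift) encoding is an assumption): a coefficient constrained to a signed power of two turns each constant multiply into a bit-shift plus a sign flip, which maps to LUT/carry logic instead of a DSP slice.

```cpp
#include <array>
#include <cstdint>

// Hypothetical encoding: each kernel coefficient is a signed power of two,
// stored as (sign, shift). Fractional powers of two would use right shifts
// analogously; left shifts are shown here for simplicity.
struct PoTCoeff {
    int8_t  sign;   // +1 or -1
    uint8_t shift;  // coefficient magnitude is 2^shift
};

// Constant multiply replaced by a bit-shift; the sign is absorbed
// by an adder/subtractor in hardware.
inline int32_t pot_mul(int32_t x, PoTCoeff c) {
    int32_t y = x << c.shift;
    return c.sign < 0 ? -y : y;
}

// 3x3 dot product built only from shifts and adds -- no multipliers.
int32_t pot_dot9(const std::array<int32_t, 9>& px,
                 const std::array<PoTCoeff, 9>& k) {
    int32_t acc = 0;
    for (int i = 0; i < 9; ++i) acc += pot_mul(px[i], k[i]);
    return acc;
}
```

In RTL the same structure becomes a wired shift feeding an adder tree, which is where the DSP savings come from.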

The results:

Reduces DSP usage by 33% (2 Muls instead of 3 per atomic dot-product).

Achieves >99% SSIM on correlated images.

The error manifests as a global DC-offset, which Batch Norm layers in CNNs can typically absorb.

I wrote a blog post detailing the math and the hardware implementation. The full C-benchmark and PoC code is on GitHub (linked in the post).

I'd love to hear from the FPGA folks here: Is this trade-off (accuracy vs. resources) something you'd use in production payloads?

Other sources:

[blog] https://www.dockerr.blog/blog/lowrank-hardware-approximation

[git] https://github.com/el-dockerr/Low-Rank_Hardware_Approximatio...

[LinkedIn] https://www.linkedin.com/in/swen-kalski-062b64299/


Hi HN,

I wrote a C++ library that implements transparent memory compression in user space. In the best case, close to 100% of an application's heap can live in the compressed store ^^

The core idea is to catch memory access violations (using AddVectoredExceptionHandler on Windows or userfaultfd/Signals on Linux) to implement a custom paging mechanism. Instead of swapping to disk, it compresses cold pages using LZ4 and stores them in a reserved heap area.

How it works:

It allocates virtual memory but sets protections to PAGE_NOACCESS.

When the app tries to access the memory, the library catches the CPU trap.

It allocates physical RAM, decompresses the data (if it existed), and resumes execution.

An LRU strategy freezes cold pages back to the compressed store when a limit is reached.

Why? Aside from the technical challenge, my main use case is embedded/IoT systems (like the Raspberry Pi). Swapping to SD cards kills them quickly due to write wear. By compressing in RAM, we can extend the lifespan of the hardware and prevent OOM kills in constrained environments without touching the kernel. Alternatively, you can configure it to back cold pages with a file on the hard drive instead, so the application uses (almost) no RAM.

In the future I plan to add an optional AES-128 encryption layer for "ephemeral security" (data is encrypted while cold).

It's a PoC / Alpha right now. I'd love to hear your thoughts on the implementation or potential edge cases with specific C++ STL containers.

Link to code: https://github.com/el-dockerr/ghostmem


Nice project! One question: decompression and page-fault handling also add latency. How do you avoid thrashing in practice? Also, for such low-level memory management, why C++ instead of C? C might give more predictable control without hidden runtime behavior.


Thanks for the feedback! You hit the nail on the head regarding the trade-offs.

1. Latency & Thrashing: You are absolutely right, there is overhead (context switch + LZ4). The intended use case isn't high-frequency access to hot data, but rather increasing density for "warm/cold" data in memory-constrained environments (like embedded/IoT) where the alternative would be an OOM kill or swapping to slow flash storage.

To mitigate thrashing, I'm using a configurable LRU (Least Recently Used) strategy. If the working set fits within Physical Limit + Compression Ratio, it works smoothly. If the active working set exceeds physical RAM, it will indeed thrash—just like OS paging would. It's a trade-off: CPU cycles vs. Capacity.
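A toy sketch of such an LRU over page ids (my own illustration, assuming page-granular eviction; not the library's implementation — in the real thing "evict" means compressing the page back into the cold store):

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

// Tracks which pages are resident; touching a page refreshes its recency,
// and exceeding the resident limit evicts the least recently used page.
struct PageLRU {
    explicit PageLRU(std::size_t limit) : limit_(limit) {}

    // Mark `page` as touched. Returns the id of an evicted page,
    // or -1 if nothing had to be evicted.
    int64_t touch(int64_t page) {
        int64_t victim = -1;
        auto it = pos_.find(page);
        if (it != pos_.end()) {
            order_.erase(it->second);       // already resident: refresh
        } else if (order_.size() >= limit_) {
            victim = order_.back();         // coldest page goes first
            pos_.erase(victim);
            order_.pop_back();
        }
        order_.push_front(page);
        pos_[page] = order_.begin();
        return victim;
    }

private:
    std::size_t limit_;
    std::list<int64_t> order_;  // front = hottest, back = coldest
    std::unordered_map<int64_t, std::list<int64_t>::iterator> pos_;
};
```

The list-plus-map combination keeps every touch and eviction O(1), which matters because touch() sits on the fault path.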

2. Why C++? Valid point regarding runtime opacity. However, I chose C++ for RAII and Templates.

RAII: Managing the life-cycle of VirtualAlloc/VirtualFree and the exception handlers is much safer with destructors, ensuring we don't leak reserved pages or leave handlers dangling.

Templates: To integrate seamlessly with C++ containers (like std::vector), I needed to write a custom Allocator (GhostAllocator<T>). C++ templates make this zero-overhead abstraction possible, whereas in C, I'd have to rely on void* casting macros or manual memory management for generic structures.
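For reference, the minimal allocator shape std::vector needs looks roughly like this (illustrative only — the real GhostAllocator<T> would route allocate() through the protected/compressed heap; the malloc forwarding here is just a stand-in):

```cpp
#include <cstdlib>
#include <vector>

// Minimal C++17 allocator interface. A real version would hand out
// memory from the PROT_NONE-guarded region instead of the normal heap.
template <typename T>
struct GhostAllocator {
    using value_type = T;

    GhostAllocator() = default;
    template <typename U>
    GhostAllocator(const GhostAllocator<U>&) {}  // rebind support

    T* allocate(std::size_t n) {
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};

// Stateless allocators compare equal: any instance can free
// memory allocated by any other.
template <typename T, typename U>
bool operator==(const GhostAllocator<T>&, const GhostAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const GhostAllocator<T>&, const GhostAllocator<U>&) { return false; }

// Usage: std::vector<int, GhostAllocator<int>> v;
```

Since C++17 this handful of members is all the standard requires; std::allocator_traits fills in the rest.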

I try to stick to a "C with Classes" subset + Templates, avoiding heavy runtime features where possible to keep it predictable.


I'm late to the party, but I created a tiny yet useful C/C++ build system called Bodge, designed for building small C++ projects. It can be found here: https://github.com/el-dockerr/bodge. You simply define the build flags, and you can add sub-resources from Git, which can themselves be built with Bodge if needed. You can define further actions, like copying or moving files after the build, and you can create sequences that build linked libraries and executables in a single run. Maybe it's something. At the company I work for it's well liked because it removes the need to distinguish between IDEs and environments, which is a particular pain for Windows users. It's under the GNU license and very small. I will enhance it step by step.

