Hacker News new | past | comments | ask | show | jobs | submit login

Thanks for the detailed reply! But I'm really curious: That performance is a bit beyond the spec'd max spec for such HDDs (3.0ms seek + 2ms latency, so 50K random IO should need around 31 seconds with 8 disks. I'm guessing a bit of clustering in the packet distribution improves seek time so a sector/page contains multiple hits?

I'm interested because I wrote an app-specific indexer, but with requiring "interactive" query response times over a couple TB, for multiple users. But that was years ago, before LevelDB and Snappy, and Kyoto Cabinet had far too much overhead per kv), and on small CPUs and a single 7200rpm disk. I got compressions rates of 5 to 6 using QuickLZ; a non-trivial gain.

I was looking at this problem space again and considering a delta+int compression approach to offsets, given they're just incremental. (And there are cool SIMD algorithms for 'em.) But it sounds like SSTable + fscache is fast enough, wow, that's pretty cool!

The decompression of blocks in some apps doesn't have to be much of a penalty if there's a reasonable amount of clustering going on in the sample set. What I did was instead of just splitting blocks on time, I segmented them based on flow and time. I did L7 inspection, and an old quad-core Core2 could handle 1Gbps, so 10Gbps is probably achievable nowadays, certainly for L4 flows. That way there's great locality for most queries.

Further, the real cost is the seek, and transferring a few more sectors won't cost as much. If you're using mmap'd IO for reading, you might be able to compress pages and not pay any IO penalty, right? And in fact, it might even reduce the number of seeks, due to increasing clustering of packets onto the same page. And I think some of the fastest compression algorithms only look back a very small amount, like 16K or 64K anyways? Although, this is probably easier done just by using a compressed filesystem cause the cache management code is probably nontrivial.

I think the reason we're getting faster performance is that we tend to have packets clustered on disk, as you've surmised. Since packets with particular ports/IPs/etc tend to cluster in time, there's a good chance that at least a few will take advantage of disk caches. Even if we clear the disk cache before the query, the first packet read can cache some read-ahead, and a subsequent packet read may hit that cache entry without requiring an additional seek/read.

As far as compressing offsets, I haven't done any specific measurements but my intuition is that snappy (really any compression algorithm) gives us a huge benefit, since all offsets are stored in-order: they tend to have at least 2 prefix bytes in common, so it's highly compressible.

I experimented with mmap'ing all files in stenographer when it sees them, and it turned out to have negligible performance benefits... I think because the kernel already does disk caching in the background.

I think compression is something we'll defer until we have an explicit need. It sounds super useful, but we tend not to really care about data after a pretty short time anyway... we try to extract "interesting" pcaps from steno pretty quickly (based on alerts, etc). It's a great idea, though, and I'm happy to accept pull requests ;)

Overall, I've been really pleased with how doing the simplest thing actually gives us good performance while maintaining understand-ability. The kernel disk caching means we don't need any in-process caching. The simplest offset encoding + built-in compression gives great compression and speed. O_DIRECT gives really good disk throughput by offloading all write decisions to the kernel. More often than not, more clever code gave little or even negative performance gains.

Yeah it's very impressive how fast general systems have become, eliminating a lot of the need for clever hacks.

I wonder how much would change if you were to use a remote store for recording packets, like S3 or other blob storage. In such cases the transfer time overhead _might_ make the compression tradeoff different. And the whole seek-to-offset might need a chunking system anyways (although I guess you can just Range when requesting a blob, but the overhead is much larger than a disk seek).

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact