

Stenographer – A full-packet-capture utility - dionyziz
https://github.com/google/stenographer

======
dryicerx
A bit surprised to see no support for DPDK. While using AF_PACKET makes the
project a bit more portable, you'll be able to save a lot of cycles by
skipping the kernel altogether.

~~~
kcudrevelc
I blame blissful ignorance: I'm not at all familiar with DPDK. I'll definitely
read up, though!

I wonder if O_DIRECT writes can happen from DPDK memory space? If not, we
don't gain anything, since we'd need to copy packets into RAM for writes
anyway.

Supporting stock Linux is definitely a nice-to-have... I'd like to make this a
relatively easily installed deb. Currently, all dependencies are available via
apt-get in stock Ubuntu.

------
micheloosterhof
Interesting!

An open source high performance rolling packet dump with packet index for
incident response.

To be honest, I had imagined Google already had solutions like this internally
:)

These have been commercially available for a few years, look at RSA Security
Analytics (formerly NetWitness), or BlueCoat Security Analytics (formerly
Solera). Or the (open source) Bro Time Machine. I used to work with one of
these products in the past.

What makes systems like this a lot more powerful is more and easier search and
retrieval. While indexing IP addresses and port numbers is good, it will get
much more useful if you can connect it to something like 'bro' and get session
level data and then index filenames, user-agents, file hashes, and other
pieces of information. I'm sure you can see the use cases.

Having an easy way to query 'all traffic with this particular user agent',
together with the full packet capture, which allows you to write new rules,
can significantly increase the efficiency of a security team.

Apart from the streaming analytics, once the PCAP data is stored, you can use
mapreduce type operations on them to search through yesterday's data with
today's IDS signatures (look at PacketPig/what Packetloop does). Maybe a
lambda architecture is the way to go, or just reprocess old data through the
same stream processing.

Cool work though! I'm curious where this will go next.

------
mlacitation
The design document is a good read and includes high-level details of how
they're grabbing packets (AF_PACKET), the packet index format (leveldb), and
the defensive measures they took (fuzz testing via AFL, setcap, seccomp):

[https://github.com/google/stenographer/blob/master/DESIGN.md](https://github.com/google/stenographer/blob/master/DESIGN.md)

~~~
kcudrevelc
Hey, thanks! If you have any additional questions about the design process,
internals, etc, feel free to ask. I'm the primary author of the project, and
I'll be refreshing the HN post for the next hour or so trying to answer
questions as they come up, and/or updating the docs.

~~~
MichaelGG
What kind of performance do you see when searching over, say, 10TB/day or two
of captured data? It seems like the query would have to open a file for every
minute? Have you considered a higher level index, to tell which minute files
are worth inspecting? (I realize this only helps when searching for more
unique characteristics.)

Is LevelDB the best choice out there for write once KV pairs? For, say, IP
address indexing, what's the final bits/packet overhead of indexing?

I didn't see any compression for the packet data. Did you consider high perf
compression like LZ4?

Is AF_PACKET better than PF_RING+DNA? It's been a while since I looked but
with hardware accel they boasted massive perf advantages.

Excellent design docs and cool work!

~~~
kcudrevelc
Hey, great questions!

Query Performance: Right now, we've got test machines deployed with 8 500GB
disks for packets + 1 indexing disk (all 15KRPM spinning disks). They keep at
90% full, or roughly 460GB/disk, about 1K files/disk. Querying over the entire
corpus (~4TB of packets) for something innocuous like 'port 65432' takes 25
seconds to return ~50K packets (that's after dropping all disk caches). The
same query run again takes 1.5 sec, with disk caches in place. Of course, the
number of packets returned is a huge factor in this... each packet requires a
seek in the packets file. Searching for something that doesn't exist (host
0.0.0.1) takes roughly 5 seconds. Note that time-based queries, like "port
4444 and after 3h ago and before 1h ago" do choose to only query certain
files, taking advantage of the fact that we name files by microsecond
timestamp and we flush files every minute.
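
A rough Python sketch of that file selection, not the project's actual code.
It assumes file names parse to microsecond start timestamps and that each file
holds packets from its own start until the next file starts;
`files_for_window` and its parameters are hypothetical names.

```python
from bisect import bisect_right

MICROS = 1_000_000  # microseconds per second


def files_for_window(file_starts, after_us, before_us):
    """Return start timestamps of files that may hold packets in the window.

    file_starts: sorted list of file start times in microseconds
                 (assumed to be what the file names encode).
    after_us, before_us: inclusive query window in microseconds.
    """
    if not file_starts or after_us > before_us:
        return []
    # The last file starting at or before after_us may still contain
    # packets inside the window, so it is the first candidate.
    lo = max(bisect_right(file_starts, after_us) - 1, 0)
    # Files that start after before_us cannot contain matching packets.
    hi = bisect_right(file_starts, before_us)
    return file_starts[lo:hi]


# Five one-minute files; query a window from 0:70 to 2:10.
starts = [i * 60 * MICROS for i in range(5)]
print(files_for_window(starts, 70 * MICROS, 130 * MICROS))
```

Only the two files overlapping the window are opened; every other file is
skipped without touching its index.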

A big part of query performance is actually over-provisioning disks. We see
disk throughput of roughly 160-180MB/s. If we write 160MB/s, our read
throughput is awful. If we write 100MB/s, it's pretty good. Who would have
thought: disks have limited bandwidth, and it's shared between reads and
writes. :)

We actually don't use LevelDB... we use the SSTables that underlie LevelDB.
Since we know we're write-once, we use
[https://github.com/google/leveldb/blob/master/include/leveld...](https://github.com/google/leveldb/blob/master/include/leveldb/table_builder.h)
directly for writes (and its Go equivalent for reads). I'm familiar with the
file format (they're used extensively inside Google), so it was a simple
solution. That said, it's been very successful... we tend to have indexes in
the 10s of MBs for 2-4GB files. Of course, index size/compressibility is
directly correlated with network traffic: more varied IPs/ports would be
harder to compress. The built-in compression of LevelDB tables is also a boon
here... we get prefix compression on keys, plus snappy compression on packet
seek locations, for free.

We currently do no compression of packets. Doing so would definitely increase
our CPU usage per packet, and I'm really scared of what it would do to reads.
Consider that reading packets in compressed storage would require
decompressing each block a packet is in. On the other hand, if someone wanted
to store packets REALLY long term, they could easily compress the entire
blockfile+index before uploading to more permanent storage. I expect this
would be better than having to do it inline. Even if we did build it in, we'd
probably do it tiered (initial write uncompressed, then compress later on as
possible).

AF_PACKET is no better than PF_RING+DNA, but I also don't think it's any
worse. They both have very specific trade-offs. The big draw for me for
AF_PACKET is that it's already there... any stock Linux machine will already
have it built in and working. Thus steno should "just work", while a PF_RING
solution has a slightly higher barrier to entry. I think PF_RING+DNA should
give similar performance to steno... but libzero currently probably gives
better performance because packets can be shared across processes. This is a
really interesting problem that I'm wondering if we could also solve with
AF_PACKET... but that's a story for another day. Short story: I wanted this to
work on stock linux as much as possible.

~~~
MichaelGG
Thanks for the detailed reply! But I'm really curious: that performance is a
bit beyond the spec'd max for such HDDs (3.0ms seek + 2ms latency, so 50K
random IOs should need around 31 seconds with 8 disks). I'm guessing a bit of
clustering in the packet distribution improves seek time, so a sector/page
contains multiple hits?
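
The back-of-envelope estimate can be spelled out as a small Python sketch. It
assumes every packet costs one full seek plus rotational latency and that
reads spread evenly across the disks; the function name is made up for
illustration.

```python
def worst_case_seconds(packets, seek_ms, latency_ms, disks):
    """Time to fetch `packets` random reads spread evenly over `disks`."""
    per_read_s = (seek_ms + latency_ms) / 1000.0
    return packets * per_read_s / disks


estimate = worst_case_seconds(50_000, 3.0, 2.0, 8)
print(f"{estimate:.2f} s")  # 31.25 s
```

The reported 25 seconds is below that figure, which is what suggests some
reads are served without a fresh seek.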

I'm interested because I wrote an app-specific indexer that required
"interactive" query response times over a couple TB, for multiple users. But
that was years ago (before LevelDB and Snappy, and Kyoto Cabinet had far too
much overhead per kv), and on small CPUs and a single 7200rpm disk. I got
compression ratios of 5 to 6 using QuickLZ; a non-trivial gain.

I was looking at this problem space again and considering a delta+int
compression approach to offsets, given they're just incremental. (And there
are cool SIMD algorithms for 'em.) But it sounds like SSTable + fscache is
fast enough, wow, that's pretty cool!
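
For reference, a minimal Python sketch of the delta+int idea: store the gap
between consecutive sorted offsets as a variable-length integer. This is a
plain scalar version under my own assumptions (LEB128-style 7-bit groups), not
one of the SIMD schemes and not anything stenographer does.

```python
def encode_offsets(offsets):
    """Delta-encode ascending offsets; write each gap as a 7-bit varint."""
    out = bytearray()
    prev = 0
    for off in offsets:
        gap = off - prev
        if gap < 0:
            raise ValueError("offsets must be sorted ascending")
        prev = off
        # Low 7 bits first; the high bit marks "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_offsets(data):
    """Inverse of encode_offsets."""
    offsets = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            offsets.append(current)
            gap = 0
            shift = 0
    return offsets


sample = [1000, 1500, 3000, 70000]
packed = encode_offsets(sample)
print(len(packed), "bytes instead of", 4 * len(sample))
print(decode_offsets(packed) == sample)
```

Small gaps take one or two bytes each, so the saving grows with how tightly
the offsets cluster.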

The decompression of blocks in some apps doesn't have to be much of a penalty
if there's a reasonable amount of clustering going on in the sample set. What
I did was instead of just splitting blocks on time, I segmented them based on
flow and time. I did L7 inspection, and an old quad-core Core2 could handle
1Gbps, so 10Gbps is probably achievable nowadays, certainly for L4 flows. That
way there's great locality for most queries.

Further, the real cost is the seek, and transferring a few more sectors won't
cost as much. If you're using mmap'd IO for reading, you might be able to
compress pages and not pay any IO penalty, right? And in fact, it might even
reduce the number of seeks, due to increasing clustering of packets onto the
same page. And I think some of the fastest compression algorithms only look
back a very small amount, like 16K or 64K anyways? Although, this is probably
easier done just by using a compressed filesystem cause the cache management
code is probably nontrivial.

~~~
kcudrevelc
I think the reason we're getting faster performance is that we tend to have
packets clustered on disk, as you've surmised. Since packets with particular
ports/IPs/etc tend to cluster in time, there's a good chance that at least a
few will take advantage of disk caches. Even if we clear the disk cache before
the query, the first packet read can cache some read-ahead, and a subsequent
packet read may hit that cache entry without requiring an additional
seek/read.

As far as compressing offsets, I haven't done any specific measurements but my
intuition is that snappy (really any compression algorithm) gives us a huge
benefit, since all offsets are stored in-order: they tend to have at least 2
prefix bytes in common, so it's highly compressible.

I experimented with mmap'ing all files in stenographer when it sees them, and
it turned out to have negligible performance benefits... I think because the
kernel already does disk caching in the background.

I think compression is something we'll defer until we have an explicit need.
It sounds super useful, but we tend not to really care about data after a
pretty short time anyway... we try to extract "interesting" pcaps from steno
pretty quickly (based on alerts, etc). It's a great idea, though, and I'm
happy to accept pull requests ;)

Overall, I've been really pleased with how doing the simplest thing actually
gives us good performance while maintaining understandability. The kernel
disk caching means we don't need any in-process caching. The simplest offset
encoding + built-in compression gives great compression and speed. O_DIRECT
gives really good disk throughput by offloading all write decisions to the
kernel. More often than not, more clever code gave little or even negative
performance gains.

~~~
MichaelGG
Yeah it's very impressive how fast general systems have become, eliminating a
lot of the need for clever hacks.

I wonder how much would change if you were to use a remote store for recording
packets, like S3 or other blob storage. In such cases the transfer time
overhead _might_ make the compression tradeoff different. And the whole seek-
to-offset might need a chunking system anyways (although I guess you can just
Range when requesting a blob, but the overhead is much larger than a disk
seek).

------
e28eta
They probably don't want to give away too much (like security details of their
network), but I think it'd be more compelling with some examples of how to use
this for Intrusion Detection.

It's a topic I don't know much about, and I think it'd reinforce the claim
this isn't for user monitoring.

~~~
CHY872
Ok so this isn't a Google product. In brief (please correct if I'm wrong),
Google lets its employees work on their own side projects on company resources
if they assign copyright to Google. This means that it gets published on the
Google github account, but is then denoted to not be a Google product - it's
someone's side project.

I do however have at least anecdotal experience with how these sorts of
systems work. The idea is that as a large company, you traditionally pump all
of your internet through a firewall, which scans it all online, does deep
packet inspection etc to look for attackers.

Then, because it takes up a lot of space, you ditch it, and perhaps keep finer
grained logfiles - perhaps just the DNS requests or headers or suspicious
packets etc.

The idea here is that for many companies, this isn't helpful when you do get
owned - you'll have deleted most of the relevant data (showing exactly what
got exfiltrated, how it happened, etc.) and you might have some logfiles
showing IP addresses, but you know little else.

Since a company of 1000 will use no more than around 1-10TB per day for its
staff, it's actually now feasible to store every packet that is sent in and
out of your network - you could store for 90 days on around 0.1-1PB - which is
actually fairly affordable for a company of that size.

Then, you either run large (more expensive than can be done in a firewall)
jobs over the data offline to look for intrusions, or wait for a breach and
then drill down on the data to try to learn exactly what happened.

The reason why this isn't really a tool for monitoring users is:

a) What can you do to track users that you couldn't already do with systems
that don't store all the data?

b) The target seems to be corporate networks who can and should monitor what
their users are doing on their network.

c) The nature of this sort of data is that because it's not really indexed any
specific searches would be very expensive - perhaps requiring runthroughs of
terabytes of data. So individually spying on many people isn't really doable
without further processing - this is really just a big packet dumper.

If you were going to try and monitor random Joe Public, then you'd certainly
be fitting a device like this to a computer their traffic would be passing
through - but this isn't useful for someone who's not an ISP or nation state
(and in that case, there'd probably be smarter ways of doing this (since here,
you can only sniff local connections)). For Google, the most they'd be able to
sniff is communications from their users to their own servers - which isn't a
huge bonus for the costs.

Even for an ISP, it'd just be massively expensive and unhelpful - a UK ISP
(Plusnet) I just searched up has around 800,000 ADSL users, and at peak time
they see total usage of 130Gbps-ish. Even assuming average half utilisation of
65Gbps, that's still 702TB a day. That's a massive amount of data to store for
any reason. The reason you (bad person) only store the metadata is because the
metadata is the valuable part!
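
The storage figures above can be checked with a few lines of Python. This is
only a sketch of the arithmetic, using decimal units (1 TB = 10^12 bytes); the
function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def tb_per_day(gbps):
    """Terabytes written per day at a sustained rate of `gbps` gigabits/s."""
    bytes_per_second = gbps * 1e9 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e12


def retention_pb(tb_daily, days):
    """Petabytes needed to keep `days` of traffic at `tb_daily` TB/day."""
    return tb_daily * days / 1000


print(tb_per_day(65))                      # ISP at 65Gbps average
print(retention_pb(1, 90), retention_pb(10, 90))  # company at 1-10TB/day
```

At 65Gbps this comes to 702TB a day, and 90 days of 1-10TB/day lands at
roughly 0.09-0.9PB, in line with the 0.1-1PB estimate.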

I welcome corrections :)

~~~
e28eta
Thanks!

> Then, you either run large (more expensive than can be done in a firewall)
> jobs over the data offline to look for intrusions, or wait for a breach and
> then drill down on the data to try to learn exactly what happened.

I was thinking in terms of offline jobs, and don't have a good intuition for
what those rules would look like. I'm also skeptical that your average company
would have the expertise to write a good set of rules. So I was interested to
see that "half" of an IDS tool.

I think the real answer is that it truly is just a rolling packet dump, and
it's up to you to use it however you choose.

I can think of uses outside of network security: capturing traffic from your
mobile devices on your home network (maybe this is just IDS if you're watching
for the contents of your address book to be exfiltrated by a malicious app),
or snooping on people through an Internet cafe, library, or other (small) open
network that you administer.

For these uses, just like IDS, you'd want to run offline jobs against the
data. Whether that's a full scan for something interesting, or an indexing
pass that extracts (portions?) into a more easily viewable form.

~~~
kcudrevelc
Offline jobs are an interesting idea, but they weren't what we were really
thinking of. Instead, we use stenographer more like a database of recent
traffic. Consider this as a simple use case for intrusion detection:

    
    
      set up snort and steno
      foreach snort alert
        request all packets in stream from steno: srcIP,srcPort,dstIP,dstPort match
        OR request all packets on that srcIP,dstIP, to get OTHER connections between those hosts
        store pcap to directory (or central DB, or whatever)
    

Then, when a human analyst wants to investigate the alert, instead of getting
the very limited PCAP that comes out of snort, they get a ton of data they can
use to build context, write new detection rules, etc.
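
The two requests in that loop might be built roughly like this in Python. It
is a sketch under assumptions: the `Alert` record is hypothetical (real snort
output needs its own parser), and the query strings only reuse the `host`,
`port` and `and` terms seen in the example queries earlier in this thread, so
the exact syntax should be checked against the steno docs. Sending the query
to steno is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record holding the flow that triggered the alert."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def stream_query(alert):
    """Query for the alerting stream: both hosts and both ports must match."""
    return (
        f"host {alert.src_ip} and host {alert.dst_ip} "
        f"and port {alert.src_port} and port {alert.dst_port}"
    )


def host_pair_query(alert):
    """Wider query: every connection between the two hosts."""
    return f"host {alert.src_ip} and host {alert.dst_ip}"


alert = Alert("10.0.0.5", 51234, "192.0.2.9", 443)
print(stream_query(alert))
print(host_pair_query(alert))
```

Note that these terms do not distinguish source from destination, so the
stream query matches both directions of the connection, which is usually what
an analyst wants.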

------
warmwaffles
What does this offer that tcpdump doesn't?

~~~
ithkuil
1) Performance. Zero copy ("The kernel writes them from the NIC to shared
memory, then the kernel uses that same shared memory for O_DIRECT writes to
disk. The packets transit the bus twice and are never copied from RAM to
RAM."). Parallelism.

2) Disk management. Rotates old data, etc

3) Indexing, with support for efficient retrieval while writing.

It lets you analyse the traffic after the fact, at 10Gbps line speed.

~~~
akadien
You can get zero-copy for tcpdump with PF_RING or netmap.

~~~
ithkuil
I'm aware of libpcap's ability to share memory with a user buffer, but I
didn't find any mention that the tcpdump utility is actually written to
exploit it for extra fast writes.

Look here how they handle this in stenographer:
[https://github.com/google/stenographer/blob/65fb928e6bce276c...](https://github.com/google/stenographer/blob/65fb928e6bce276ccd2481542dc45cf68fd33530/stenotype/stenotype.cc#L42)

I guess that in principle they could have patched tcpdump, but it's probably
easier to write a smaller piece of software that does exactly what you want
than to extend a mature, complex, general-purpose tool such as tcpdump.

