
Timeline – a high-resolution event log - jsnell
https://github.com/lukego/snabb/blob/timeline/src/core/timeline.md
======
mbrock
I've been running into Luke Gorrie's cool projects in different languages
since my early university days in 2006. Somehow he always does inspiring
stuff, and gives off a really friendly vibe. Cheers!

------
amelius
> The timeline is a high-resolution event log that stores entries in a shared-
> memory ring buffer

What problem is this solving exactly? Why can't one write multiple log files
(i.e., without sharing anything). Also, shouldn't log entries be stored on
disk in most cases?

~~~
jsnell
This is for latency sensitive applications, where you're counting cycles and
can't possibly afford to write nicely formatted log messages to a file. (Or
can't really even afford system calls in the main loop).

So there's a main application process, which does its logging by adding
messages to the shared memory ring buffer, with a message format that's
designed to be compact and cheap to write. And then there's going to be a
reader which reads the log messages, and actually does something with them
(filter, format, write to log files, whatever).

~~~
amelius
Can't the logging be handled by a different thread, which subsequently writes
the log entries to a file? This is how most loggers work I suppose (keep the
main thread free from logging-work as much as possible).

And because you use threads instead of processes, you get shared memory for
free.

~~~
jasonwatkinspdx
This is not a replacement for an application logger or syslog or the like.
This is for sampling traces of extremely high throughput and low latency
networking systems built atop snabb.

~~~
dang
You may be right, but please don't make the criticism personal ("you're just
not getting it", "while failing to understand", etc.) It makes the thread
worse and typically provokes worse yet.

~~~
jasonwatkinspdx
I edited my comment to reflect your criticism, but I disagree with it. It's
pretty clear the person I was replying to just didn't even read the original
posting.

~~~
dang
Assuming that's true, you still shouldn't be personally abrasive in HN
comments. Being right doesn't make it ok; it often makes it worse.

------
frugalmail
This is trying to replicate Java/Log4J2/Disruptor in Lua?

~~~
jsmthrowaway
No. You could build dtrace with software like this, whereas the entire stack
you mentioned only cares indirectly about per-message cycle overhead. This is
for when your tight loop, like your routing engine or game loop, wants to kick
out numbers at timestamps as fast as it can and other software deals with
making them presentable. It's a great technique.

This is built for networking software/hardware, where every cycle matters, and
is actually extremely useful to me in a couple domains. I'm impressed disabled
messages only sink five cycles so I assume there was hand tuning done here.

Edit: And I'm right, it's a hand tuned logical compare (cool):
[https://github.com/lukego/snabb/blob/timeline/src/core/timel...](https://github.com/lukego/snabb/blob/timeline/src/core/timeline.dasl)

If you want reasons to learn assembly, there you go. Too few folks wielding it
in typical Web operations, where it would be extraordinarily useful in any
number of hot paths. Just requires care.

~~~
vardump

      local memory      = shm.map(shmpath, ffi.typeof("char["..size.."]"))
    

mmapp'ed, so buffer memory is always 4096 aligned.

    
    
      -- Log entry format (ring item)
    
      ffi.cdef[[
        struct timeline_entry {
          uint64_t tsc;                    // CPU timestamp
          uint64_t ip;                     // Instruction pointer of logger
          uint32_t aux;                    // TSC_AUX: core (bits 0-7) + numa (12-15)
          uint32_t reserved;               // (pad to 64 bytes)
          uint64_t msg_id;                 // category:24, message:24, priority:16
          uint64_t msg0, msg1, msg2, msg3; // message arguments
        };
      ]]
    

And 64 bytes long. Just like a cache line, probably not a coincidence? No
unnecessary slow memory traffic or cache pollution. Nice.

Such a shame TSC_AUX is always 0 on Windows and most Linux kernels as well. Oh
well. It's not that much code or work to write CPU core index and NUMA group
to an MSR when bringing up a CPU core.

~~~
jsmthrowaway
Yeah, no kidding. The author did their homework and this is really good code
to study. I'm with you, it's almost certainly not a coincidence.

It's aging a bit, but Ulrich Drepper's seminal paper on memory comes to mind:
[https://www.akkadia.org/drepper/cpumemory.pdf](https://www.akkadia.org/drepper/cpumemory.pdf)

Say what you want about Drepper's style and personality, but that paper told
me (a) he's an incredibly knowledgable dude and (b) I'll always have more to
learn, especially about cache.

~~~
vardump
Looks like he's not writing to "reserved" field. So CPU will definitely need
to read the cache line at current log entry before write (RFO, read for
ownership).

I'm not sure whether current CPUs are smart enough, but in theory writing 64
bytes at once to cache aligned address could avoid RFO. If it's L1/L2/L3 miss,
that could be 100-200 cycles saving.

Might have also been better to do RDTSCP first. Otherwise it'll also avoid
reordering instructions. And sabotage attempt to avoid RFO.

Anyways, not sure, didn't profile. (And by profiling I mean using those
countless CPU performance counters to figure out what's going on in the
mysterious black box.)

