
NanoLog – a nanosecond scale logging system for C++ - tzm
https://github.com/PlatformLab/NanoLog
======
muststopmyths
I spent a lot of time trying to build a fast logging system in my last couple
of jobs. The basic lesson (and I'm only talking about C/C++/C# here) is that
you will spend most of your time formatting strings if you do your file I/O
asynchronously.

Since this system has a preprocessor mode, I assume they learnt the same
lesson.

The bigger lesson is that it really doesn't matter how many millions of logs
you can generate per second if you don't have the infrastructure to store and
analyze them easily. No one is going to enjoy digging through these things and
the more you generate, the harder it is to extract meaningful information.

In other words, pay a lot more attention to what you'll do with the logs
rather than how fast you can spew them out. I have rarely needed more than a
few thousand log messages/sec even for a loaded MMO server. I spent way more
time creating ways to look at these logs and making them accessible in near
real time.

~~~
gpderetta
> is that you will spend most of your time formatting strings if you do your
> file I/O asynchronously.

That's why you are supposed to do your formatting in the background thread :).

> doesn't matter how many millions of logs you can generate per second

Usually the issue is not generating millions of line of logs (in whitch case
dispatching to a background thread just adds overhead), but being able to log
with minimal overhead on the surrounding code.

~~~
Matthias247
> That's why you are supposed to do your formatting in the background thread
> :).

Which then has the drawback that every argument that gets transferred to the
background thread must first be heap allocated or static - instead of only the
final buffer being that. My gut feeling is that the additional allocation and
memory management costs typically outweigh the cost of formatting in the
current thread. Even if one pools all logging buffers less buffers are needed
for transferring only formatted buffers.

~~~
paulddraper
Or you'd copy them.

Unless you do something dumb (like copy a vector when you only need to print
the size), copying is significantly faster than string formatting.

~~~
dclusin
I think what OP is hinting at here is cache invalidation/eviction caused by
the additional thread's processing and memory operations. If your requirements
are down to nanosecond granularities then cache misses are probably being
measured and noticeable. A third party logging thread doing memory copies and
other log processing sounds like a fine way to unintentionally evict a bunch
of cache entries. You might be able to mitigate this to some extent with CPU
affinities for threads I suppose. Another option would be to move logging to
the network and record packets in/out with an out of band monitoring solution.

~~~
Rapzid
Logging the raw structured log entry and off-loading any string formatting to
another system is a very interesting idea. Just push the whole "friendly
message" stuff off to other logging infrastructure where the latencies matter
less.

------
MauranKilom
The code overall is pretty clever, but at the core of all this they completely
ignore strict aliasing to dump stuff into a char* buffer indiscriminately...
Look at this function:

[https://github.com/PlatformLab/NanoLog/blob/master/runtime/N...](https://github.com/PlatformLab/NanoLog/blob/master/runtime/NanoLogCpp17.h#L742)

    
    
         T argument = *reinterpret_cast<T*>(*in);
         ...
         uint32_t stringBytes = *reinterpret_cast<uint32_t*>(*in);
    

A total of 6 reinterpret_casts in that file alone.

I didn't see any indication that you have to turn off strict aliasing to use
this library (and would you really want to make all your remaining code slower
just so your logging has better benchmark scores?). Which makes all this code
UB as far as the C++ standard is concerned (no, their allocator does not
magically placement-new the correct object types into the right places).

They could just inspect their T values as char* and/or memcpy them around
instead of all this pointer aliasing (well, as long as their values are
trivial types - I couldn't figure out whether you can log non-trivial stuff
here?) at virtually no additional cost.

Edit: And this right here convinces me that I definitely wouldn't want to use
this library in production... Adding a volatile to fix a threading bug is just
a big no-no.

[https://github.com/PlatformLab/NanoLog/commit/e9691246ede6da...](https://github.com/PlatformLab/NanoLog/commit/e9691246ede6dacbc4be2a56409f9f982856e3eb#diff-7d79f738e30bcbecebcf9183a9c00ed1R749)

~~~
Leszek
You'd think so, but in fact reinterpret_cast from char* (and unsigned char _,
and std::byte_ ) is explicitly allowed by the type aliasing rules.

~~~
MauranKilom
To the best of my knowledge, casting _to_ char* is totally fine (inspecting an
object as bytes) under certain constraints.

What is not fine is pretending that an object lives at a position in memory
where it does not (i.e. treating bytes of memory as some object through a
reinterpret_cast away from char* ).

Edit: To clarify, casting away from char* is of course allowed if you cast to
whatever object type actually lives at the pointed-to location.

~~~
detaro
If I remember correctly, it is fine if there actually was an object of that
type (or a "similar" type) in that location. So casting to char* and back is
fine, casting to char and then to int-type to inspect multiple bytes is fine
as long as alignment plays correctly, ...

It seems like the function only cast to T* if the input pointer hasn't been
changed (the paths that do modify it return early), so there's a value there?

~~~
gpderetta
> casting to char and then to int-type to inspect multiple bytes is fine as
> long as alignment plays correctly

It is only fine if there were the same exact int types at that address.
Remember, UB usually isn't the reinterpret_cast but the dereference.

~~~
detaro
I thought the int-types had a special exception making that always possible
(if there's a bunch of bytes, you can turn them into an int), but I could very
well be wrong on that detail.

~~~
gpderetta
Char is the only exception.

And now std::byte IIRC

------
rafa1981
Author of mini-async-log and mini-async-log-c here.

[https://github.com/RafaGago/mini-async-log](https://github.com/RafaGago/mini-
async-log)

[https://github.com/RafaGago/mini-async-
log-c](https://github.com/RafaGago/mini-async-log-c)

I wrote both some time ago and AFAIK they were the fastest, how does nanolog
compare against them?

At least before the source file preprocessor Nanolog was slower.

Here I had a benchmark project with Nanolog on an old version, maybe you can
PR an update:

[https://github.com/RafaGago/logger-bench](https://github.com/RafaGago/logger-
bench)

~~~
RafaGC
As comments can't be edited: there are two nanolog projects on github. They
may not be the same one.

------
wrs
I did something very similar to the “preprocessor” version of this back in
2001 for Microsoft application crash reporting (the problem then being upload
size, not so much logging overhead). It’s very cool to see how you can do the
whole thing with constexpr now. As I recall, people really didn’t like using
the preprocessor!

------
usefulcat
The description says "the compilation builds a version of the NanoLog library
that is non-portable, even between compilations of the same application". They
also mention that a decompressor application also gets built and can be used
to reconstitute a human-readable version of the log file. My question is, is
the decompressor application also tightly coupled to a particular compilation?
It sounds like the answer is yes, but I'd really like to be wrong about that.

That would really suck for cases where you want to store old log files for
later use. You'd have to also store a copy of the decompressor program that
corresponds to each log file. Or you'd have to convert to human-readable
format before storing the log files, which loses a number of the benefits of
the binary log format (compact size, faster to parse).

~~~
nemetroid
In the related paper[1], this paragraph (5.1.2) strongly suggests that the
decompressor is non-portable as well:

> NanoLog performs best when there is little dynamic information in the log
> message. This is reflected by staticString, a static message, in the
> throughput benchmark. Here, NanoLog only needs to output about 3-4 bytes per
> log message due to its compaction and static extraction techniques.

1: [https://www.usenix.org/conference/atc18/presentation/yang-
st...](https://www.usenix.org/conference/atc18/presentation/yang-stephen)

------
gmueckl
I built a similar system with a focus on profiling events with extremely low
overhead. When taking an timestamp, the processor cycle count register is
read, and pushed into a ring buffer with a tiny descriotion including a static
string label. A background thread perodically writes that buffer contents to
disk. An offline tool visualizes the recorded events and intervals over time
and can do a bit of statistics.

This system has served me quite well, but I need to rework the UI because
zooming to microsecond resolution on a 30 second trace overflows the range
values in the Qt scrollbars.

~~~
liuliu
We use custom format and an offline process to transform the format into
Google Catapult for visualization. Catapult visualization for 30s trace should
be pretty reasonable.

------
Const-me
Recently built similar stuff. Even using same __rdtsc() for time. I didn't
require any preprocessing. Instead, I require the strings come from readonly
section of a module, and logging pointer values. Also logging into circular
buffer in shared memory instead of file, as I was only interested in knowing
what the app did immediately before a rare crash which takes hours to
reproduce.

~~~
bonzini
How do you deal with ASLR (logging from shared libraries or from position-
independent executables)?

~~~
Const-me
In that particular case I’ve cheated. I’m writing out of band file which maps
pointer addresses into strings. In runtime it’s CAtlMap<const char*, int>
really fast because the keys are just addresses. I only have ~200 unique
messages (ignoring format arguments) so the file is only written for the first
few milliseconds. I did that because it was the simplest thing to do. I could
use debug symbols + crash dumps, they can resolve these pointers regardless of
ASLR, just it was way more complex to implement. BTW, the only reason I was
doing that, the main executable is gta5.exe, it’s encrypted, it resists
debugging, and I don’t have debug symbols. If it was my own app, I would
probably do something much simpler instead.

------
nemetroid
From a quick reading of the paper, it sounds like Nanolog is internally
translating logging messages into a compacted format, which must be further
processed to become human readable. This further processing is not included in
the benchmark.

section 4.3, "Decompressor/Aggregator"

> The final component of the NanoLog system is the decompressor/aggregator,
> which takes as input the compacted log file generated by the runtime and
> either outputs a human-readable log file or runs aggregations over the
> compacted log messages. The decompressor reads the dictionary information
> from the log header, then it processes each of the log messages in turn. For
> each message, it uses the log id embedded in the file to find the
> corresponding dictionary entry. It then decompresses the log data as
> indicated in the dictionary entry and combines that data with static
> information from the dictionary to generate a human-readable log message. If
> the decompressor is being used for aggregation, it skips the message
> formatting step and passes the decompressed log data, along with the
> dictionary information, to an aggregation function.

section 5.1.2, "Throughput":

> [...] NanoLog performs best when there is little dynamic information in the
> log message. This is reflected by staticString, a static message, in the
> throughput benchmark. Here, NanoLog only needs to output about 3-4 bytes per
> log message due to its compaction and static extraction techniques. Other
> systems require over an order of magnitude more bytes to represent the
> messages (41-90 bytes).

> [...] Overall, NanoLog is faster than all other logging systems tested. This
> is primarily due to NanoLog consistently outputting fewer bytes per message
> and secondarily because NanoLog defers the formatting and sorting of log
> messages.

section 5.1.3, "Latency":

> All of the other systems except ETW require the logging thread to either
> fully or partially materialize the human-readable log message before
> transferring control to the background thread, resulting in higher
> invocation latencies. NanoLog on the other hand, performs no formatting and
> simply pushes all arguments to the staging buffer. This means less
> computation and fewer bytes copied, resulting in a lower invocation latency.

So NanoLog performs less work and therefore has much higher throughput. This
separation of logging and formatting may possibly be a great idea, but it
should probably be more prominently mentioned wherever the benchmark table is
posted.

------
bt848
Looks interesting but how does one reproduce the comparative benchmarks from
the paper? I see the code that benchmarks nanolog but how do I reproduce the
baseline for glog?

~~~
rafa1981
Author of mini-async-log and its C variant here.

Some time ago I wrote a benchmark for different loggers, but it uses an old
version of Nanolog.

[https://github.com/RafaGago/logger-bench](https://github.com/RafaGago/logger-
bench)

~~~
bt848
Neat. Your code actually contains the issue I was curious about. Why do you
log at level ERROR? As I understand it this causes glog to synchronize the
output stream after every message.

~~~
rafa1981
But worst case latency, the most interesting property of an async logger is
still unaffected I guess...

------
RafaGC
Is this a new incarnation of this project?

[https://github.com/Iyengar111/NanoLog](https://github.com/Iyengar111/NanoLog)

Otherwise it seems that the name is already taken.

~~~
RafaGC
Both projects are a different C++ logger implementations. The first commit of
both repositories is unrelated. If I wanted to benchmark both name clashes are
likely to happen.

This is a legitimate question. Why downvoting?

------
jart
Cool stuff. Judging by his code, this guys mind is probably going to get blown
when he discovers mmap() and bcd arithmetic.

