
I/O library 6x faster than fmt, 10x faster than stdio.h and iostream - RMPR
https://github.com/expnkx/fast_io
======
quietbritishjim
I find the lack of clarity about this library quite suspicious.

* What is the core implementation idea behind this library? Why is it supposedly so much faster than stdio and fmt? There doesn't seem to be much explanation in the readme. The only things I can see are "Locale support optional" and "Zero copy IO", but these mix up formatted and unformatted interfaces - is it supposed to be faster at both? Does the "zero copy IO" mean that it's unbuffered (as another comment here mentions, that usually makes things slower, not faster)?

* What does the API look like? There's no documentation whatsoever - the "documentation" heading in the readme just refers to "./doxygen/html/index.html" which doesn't exist in the repo; I can't even see a Doxyfile. Just a brief example in the readme showing reading and writing would be nice!

* _What_ exactly is it faster at? Reading or writing? Formatted or unformatted IO? If formatted, is it just the formatting that's faster (e.g. is there also a comparison against fputs)? The benchmarks section gives no detail of what code was compared beyond a mention of the examples/ directory, but that contains dozens of subdirectories, most of which have multiple files in them. I find it quite implausible that it's 6x faster at formatting than fmt, and benchmarks are notoriously hard to get right, so combined with the extreme lack of clarity I find it hard to take these claims seriously.

~~~
m00dy
Same here. All I/O libraries eventually hit the OS barrier; I can't really see
how one could be optimised to gain a 6x speed-up. I/O libraries are basically
transporters between userland and kernel space.

~~~
usefulcat
Not when formatting is involved. stdio and iostreams are notoriously not
particularly fast when it comes to formatting. So, while I don't know anything
about this particular library, the 6x figure does not sound suspicious to me.

Source: have spent a fair amount of time working on speeding up human-readable
log output, the performance of which tends to be dominated by formatting.

------
nabla9
In many applications I can get 10x faster I/O using stdio.h just by radically
increasing the buffer size. It turns out that the less you talk to the kernel,
the faster you are.

    
    
    /* typically 16 to 32 times stat.st_blksize
       or BUFSIZ is a good size */
    size_t new_size = 16 * 4096;

    /* must come after fopen() but before any other I/O on fp */
    setvbuf(fp, NULL, _IOFBF, new_size);

~~~
michaelcampbell
I can't remember how many times I've suggested in code reviews that people use
the optional pre-sizing parameter of Java's StringBuffer constructor. Many
times the size of the final string is known exactly, or at least close enough
to pre-allocate. IIRC, the default initial capacity is 16 characters, which is
absurdly low.
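
The same idea in C++ terms, since that's the thread's language (an
illustrative sketch, not tied to any of the libraries above;
std::string::reserve plays the role of StringBuffer's capacity argument):

    #include <string>

    int main() {
        std::string s;
        // One allocation up front instead of repeated grow-and-copy
        // cycles as the string outgrows each successive capacity.
        s.reserve(64 * 1024);
        for (int i = 0; i < 10000; ++i)
            s += "some log line\n";
    }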

~~~
jrpelkonen
I completely agree on the buffer size statement, but at the same time, find it
interesting that you would use StringBuffer enough to make that comment. I
personally haven’t found any compelling use cases for it since Java 1.5
introduced StringBuilder. Would you like to share some ways you find
StringBuffer useful?

~~~
cogman10
You shouldn't use StringBuffer.

The reason it is slow is because it is "thread safe". 99% of the time, you
don't really want that (especially in cases where you know how big the string
will be at the end of everything).

StringBuilder should be preferred for pretty much everything.

~~~
hyperman1
While I agree with the conclusion to strongly prefer StringBuilder, in most
cases the difference is zero: the JVM will prove that no other thread can
access the object, so it removes the locks.

~~~
cogman10
I'm not sure I agree about most cases. It is fairly trivial to come up with
code that goes down a sad path (you just need to defeat the escape analysis),
and that happens all the time.

The JVM's escape analysis isn't perfect.

------
vitaut
Unfortunately performance claims appear to be bogus.

1\. ospan, which the performance claims seem to be based on, doesn't do any
bounds checks, so you can easily get a buffer overflow.

2\. fast_io generates a whopping 50kB of static data just to format an
integer.

So if these benchmark results are correct (I was not able to verify them
because the author hasn't provided the benchmark source):

> format_int 7867424 ns 7866027 ns 89 items_per_second=127.129M/s

> fast_io_ospan_res 6871917 ns 6870708 ns 102 items_per_second=145.545M/s

fast_io gives a 15% perf improvement by replacing a safe format_int API from
[https://github.com/fmtlib/fmt](https://github.com/fmtlib/fmt) with a similar
but unsafe one + 50kB of extra data. Adding safety will likely bring perf back
down, which the last line seems to confirm:

> fast_io_concat 7967591 ns 7966162 ns 88 items_per_second=125.531M/s

This shows that fast_io is slightly slower than the equivalent {fmt} code.
Again, this is from fast_io's own benchmark results, which I haven't been able
to reproduce.

50kB may not seem like much, but for comparison, after a recent binary size
optimization the whole {fmt} library is around 57kB when compiled with `-Os
-flto`: [http://www.zverovich.net/2020/05/21/reducing-library-size.html](http://www.zverovich.net/2020/05/21/reducing-library-size.html)

The floating-point benchmark results are even less meaningful. They appear to
be based on a benchmark that I wrote to test the worst-case Grisu
([https://www.cs.tufts.edu/~nr/cs257/archive/florian-loitsch/printf.pdf](https://www.cs.tufts.edu/~nr/cs257/archive/florian-loitsch/printf.pdf))
performance on unrealistic random data with maximum digit count. fast_io
compares it to Ryu
([https://dl.acm.org/doi/pdf/10.1145/3192366.3192369](https://dl.acm.org/doi/pdf/10.1145/3192366.3192369)),
for which maximum digit count is actually the best case and performance
degrades as the number of digits goes down. A meaningful thing to do would be
to use Milo Yip's benchmark instead:
[https://github.com/miloyip/dtoa-benchmark](https://github.com/miloyip/dtoa-benchmark)

~~~
knoebber
The author has made a rebuttal to this comment:
[https://github.com/expnkx/fast_io/commit/8cf7497593eb185bba1...](https://github.com/expnkx/fast_io/commit/8cf7497593eb185bba13ad0154a6dfab5a534388)

~~~
vitaut
The new benchmark is even less meaningful because

1\. It now constructs an unnecessary `std::string`, penalizing `format_int`:

    value+=fmt::format_int(i).str().size();

2\. The input is consecutive numbers, which makes branches well predicted.
That is not realistic, but it is beneficial for fast_io's integer formatter,
which has a lot of branches.

The precomputed table can be found in
include/fast_io_core_impl/integers/jiaendu/table_gen.h

~~~
usefulcat
> 1\. It now constructs an unnecessary `std::string`, penalizing `format_int`:
>
>     value+=fmt::format_int(i).str().size();

The author appears to be doing the equivalent in the other benchmark:

    value+=fast_io::to<std::string>(i).size();

Maybe you could argue that in both cases it would be better not to measure
time spent creating and destroying strings, but I don't see how the two
benchmarks are not comparable.

~~~
vitaut
If std::string construction was the goal then sure, but that's not what the
original benchmark that I've written was about. The goal was to find the
fastest way to format an integer with or without std::string construction:
[http://www.zverovich.net/2013/09/07/integer-to-string-
conver...](http://www.zverovich.net/2013/09/07/integer-to-string-conversion-
in-cplusplus.html). Those methods that construct a string but not required to
are explicitly marked. The OP made some claims based on the method that
doesn't construct std::string (which is one of the reasons it's rather fast)
and then when I pointed out that it's unsafe switched to a much slower one but
at the same time unnecessarily penalized other methods. It seems like moving a
goalpost just to prove that your method is the fastest (which might be true
but we don't know because of a poor methodology).
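
For reference, format_int can be used either way; the difference being
measured looks roughly like this (a sketch using {fmt}'s public format_int
API, not the benchmark's exact code):

    #include <fmt/format.h>

    int main() {
        // No std::string involved: the digits are written into a small
        // stack buffer inside format_int.
        fmt::format_int f(42);
        auto len = f.size();        // f.data() points at the digits

        // Explicitly constructing a string adds a copy/allocation on top
        // of the formatting work itself.
        std::string s = fmt::format_int(42).str();
        return int(len + s.size());
    }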

------
apankrat
Just keep in mind that stdio is really slow, even on its fast paths.

For example, here's a trivial sscanf() rewrite for a log parsing case that
achieves a 300x speed-up -
[https://gist.github.com/apankrat/20776d68d1d97bca12576a6e204...](https://gist.github.com/apankrat/20776d68d1d97bca12576a6e204b2f74)
- and that's with completely unoptimized code.
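
The core of this kind of rewrite is usually just walking the bytes yourself
instead of having sscanf re-interpret a format string on every call. A minimal
sketch of the idea (illustrative code, not the gist's):

    #include <cstdio>

    // Parse an unsigned decimal integer and advance the cursor. No format
    // string interpretation, no locale handling, no field-width logic.
    static unsigned parse_uint(const char** p) {
        unsigned v = 0;
        while (**p >= '0' && **p <= '9')
            v = v * 10u + unsigned(*(*p)++ - '0');
        return v;
    }

    int main() {
        const char* line = "1590000000 404 1234";
        unsigned ts = parse_uint(&line); ++line;   // skip the space
        unsigned code = parse_uint(&line); ++line;
        unsigned bytes = parse_uint(&line);
        std::printf("%u %u %u\n", ts, code, bytes);
    }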

~~~
AceJohnny2
so the follow-up question (not necessarily to you but to the crowd) is: why is
stdio so slow?

~~~
kstenerud
Because it's an abstraction on top of an abstraction on top of an abstraction.
You need to leak parts of the lower layers to the higher levels in order to
get the speed back (and allow zero-copy). There are probably some non-leaky
optimizations that can be done as well, although that work has been going on
for decades.

Another biggie is floating point. Converting binary floating point to decimal
floating point and vice versa is VERY VERY VERY SLOW, difficult to get right,
and its results are difficult to predict. Unfortunately, ieee754 decimal float
just isn't getting any adoption, so we're stuck doing these costly conversions
every time we deal with text formats or big float implementations.

And scanf and printf are just slow in general because of their generality:
they need to handle a TON of options, and they get really bloated and
complicated as a result.
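
C++17's std::to_chars is a useful contrast here: it is exactly such a
conversion with all of printf's generality stripped away (a sketch;
implementations typically use Grisu- or Ryu-style shortest round-trip
algorithms):

    #include <charconv>   // std::to_chars (C++17)
    #include <cstdio>

    int main() {
        char buf[64];
        double x = 0.1 + 0.2;
        // Shortest decimal representation that round-trips back to the
        // same double; no locale, no format string parsing.
        auto [ptr, ec] = std::to_chars(buf, buf + sizeof buf, x);
        (void)ec;
        std::fwrite(buf, 1, std::size_t(ptr - buf), stdout);
        std::fputc('\n', stdout);
    }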

~~~
thethirdone
> Another biggie is floating point. Converting binary floating point to
> decimal floating point and vice versa is VERY VERY VERY SLOW, difficult to
> get right, and difficult to predict results.

I haven't read deeply into IEEE 754-2008 for decimal floating points, but it
seems like it should be pretty fast (relative to system calls) to convert
binary to decimal because 10 has a factor of 2.

> Unfortunately, ieee754 decimal float just isn't getting any adoption, so
> we're stuck doing these costly conversions every time we deal with text
> formats or big float implementations.

Is there any reason big float implementations should use decimal rather than
binary? It seems like it is very straightforward to make a binary big float,
and do operations on it. In fact, IEEE 754-2008 specifies interchange formats
for all binary floats of bit lengths >= 128 where length is a multiple of 32.

~~~
kstenerud
> I haven't read deeply into IEEE 754-2008 for decimal floating points, but it
> seems like it should be pretty fast (relative to system calls) to convert
> binary to decimal because 10 has a factor of 2.

It could be fast, but legacy has doomed us all to a "canonical" conversion of
sorts, where any other conversion algorithm will likely yield off-by-a-tiny-
amount differences in the output (like 1.200000000031 instead of 1.2). There
are in fact fairly simple conversions that can be done, but they yield results
that are incompatible with printf.
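
The underlying problem is easy to demonstrate (a minimal example; the exact
digits printed depend on the C library):

    #include <cstdio>

    int main() {
        // 1.2 has no exact binary representation; the nearest double is
        // 1.19999999999999995559... Asking for more digits than the
        // default exposes the approximation:
        std::printf("%g\n", 1.2);     // "1.2" (6 significant digits)
        std::printf("%.20g\n", 1.2);  // e.g. "1.1999999999999999556"
    }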

> Is there any reason big float implementations should use decimal rather than
> binary?

At the end of the day, we work in decimal. So every binary result we calculate
has to be converted to its decimal approximation (the meaning of which is
subject to convention). Rounding is also an issue, because you want to keep
your number of significant digits within reason to avoid false precision
errors. Doing all of this in a different base that can't be 1:1 converted adds
a whole slew of bug opportunities and corner cases.

~~~
enriquto
> At the end of the day, we work in decimal.

What? No.

I have spent much of my life working with floating point numbers and never
ever had to resort to anything decimal for serious purposes (that is, except
for some occasional printing of a number for debugging).

~~~
smabie
Are you telling me that you do math with pen and paper in binary?
_Everything_ is decimal. Computers aren't, but almost everyone tries to paper
over that fact as much as possible.

~~~
enriquto
> Computers aren't, but almost everyone tries to paper over that fact as much
> as possible.

Do you realize that many entire fields of scientific computing (i.e., computer
vision, fluid mechanics, solid mechanics, signal processing, numerical partial
differential equations, computational chemistry, climate modeling, and dozens
of other things) couldn't care less whether the inner representation of
floating point numbers is decimal or binary? If binary can be made just a
little faster, so be it.

~~~
smabie
I'm not saying anyone thinks we should have computers process numbers as
decimal. I'm just saying that no one _thinks_ (as in the brain's internal
representation) in binary; decimal, that's it.

~~~
enriquto
I don't think about real numbers as decimals either (for one, the decimal
representation is not unique). For many people real numbers are points on a
straight line.

~~~
laumars
Even if that were true (and I highly doubt it is), when you then need to
transcribe those points on a line for future reference, you'd do so using
decimal. You wouldn't draw number lines on a shopping list against items you
need multiples of, any more than you'd hold up different lengths of string to
bar staff when ordering a round of drinks.

The reason is that decimal is the base system people learn growing up. It's
also widely speculated that one of the origins of decimal is the number of
digits (it's not an accident I've chosen that ambiguous term) on our hands: 10
fingers and thumbs. You might mentally map that to a number line when trying
to assess the distance between numbers or visualise a formula, but ultimately
these are mental models you generate on top of decimal rather than in place of
it.

~~~
jfkebwjsbx
Prices aren’t real numbers. They aren’t even fractional.

They are integers with a fixed point, which is an entirely different domain.

~~~
laumars
You put quantities on shopping lists, not prices. Prices go on the receipt,
and they would be integers rather than real numbers.

edit: you've now edited your post to repeat a lot of what I put above. Though
I doubt it was intentional (probably a race condition from us both posting at
the same time). Anyhow, I'll keep what I posted as it's still relevant.

~~~
jfkebwjsbx
Yeah sorry, I added more because leaving it at the first sentence could be
misunderstood.

It is true that home shopping lists don't have prices on them. I was thinking
more of company ones (like parts lists).

~~~
laumars
Either way your point is valid :)

------
londons_explore
You know what's even faster... write()

I was writing part of a codebase where I couldn't use the heap, so the
standard libraries were off the cards, and I ended up just using write(2,
"hello\n", 6) to send the data directly to the kernel. I found the experience
remarkably simple and painless.

~~~
smabie
No it's not, write() is wayyyy slower. I just did a test: I wrote "hello\n" to
a file 10,000,000 times. Using fprintf() and stdio.h, it takes 0.04s on my
machine. Using write(), it takes 7.6s - over 2 orders of magnitude slower.

Syscalls are extremely expensive and dominate the execution time unless you
are writing large chunks of data all at once. That is precisely why stdio.h
exists: it only occasionally flushes the buffer with write(). You can observe
this when you get a segfault; often it seems like your print statements did
not execute. One can use fflush() to force stdio.h to flush the buffer.

write(), on the other hand, always writes the data - well, at least to the
file buffer cache. fsync() is required to ensure that the data is on disk,
though I've seen systems lie about that too.
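
A minimal reconstruction of that kind of test (illustrative code, not the
actual benchmark; run each half under time(1)):

    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        // Buffered: stdio coalesces these into a few large write() calls.
        FILE* f = std::fopen("/tmp/out_buffered", "w");
        for (int i = 0; i < 10000000; ++i)
            std::fputs("hello\n", f);
        std::fclose(f);

        // Raw: one user/kernel transition per line, ten million times.
        int fd = open("/tmp/out_raw", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        for (int i = 0; i < 10000000; ++i)
            write(fd, "hello\n", 6);
        close(fd);
    }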

~~~
MaxBarraclough
I imagine your compiler optimises the _fprintf_ call into a call to _fputs_ or
_puts_.

~~~
smabie
fwrite, but yeah.

------
bjourne
Here is my fast io "library":
[https://github.com/bjourne/c-examples/blob/master/libraries/...](https://github.com/bjourne/c-examples/blob/master/libraries/fastio/fastio.h)
It is very (very!) limited, but also very fast. :) I think I got about 20-50x
compared to scanf.

------
CoolGuySteve
Formatting is a natural demarcation point for deferring work to another
thread, as it can be surprisingly expensive.

You can create a workqueue where each item is a copy of the argument list and
a const char* of the format string. Then your program only spends a
microsecond or two on the critical path to print (depending on the argument
length).
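
A minimal sketch of that idea (illustrative code; a mutex-guarded queue stands
in for whatever lock-free structure a real logger would use, and the format
strings are assumed to be string literals so the pointers stay valid):

    #include <condition_variable>
    #include <cstdio>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::queue<std::function<void()>> jobs;
    std::mutex mtx;
    std::condition_variable cv;
    bool done = false;

    // Capture the format string pointer and a copy of the arguments; the
    // expensive printf happens on the worker, off the critical path.
    template <typename... Args>
    void log_async(const char* fmt, Args... args) {
        std::lock_guard<std::mutex> lk(mtx);
        jobs.push([=] { std::printf(fmt, args...); });
        cv.notify_one();
    }

    void worker() {
        std::unique_lock<std::mutex> lk(mtx);
        for (;;) {
            cv.wait(lk, [] { return done || !jobs.empty(); });
            while (!jobs.empty()) {
                auto job = std::move(jobs.front());
                jobs.pop();
                lk.unlock();   // format outside the lock
                job();
                lk.lock();
            }
            if (done) break;
        }
    }

    int main() {
        std::thread t(worker);
        log_async("pi is roughly %f (%d digits shown)\n", 3.14159, 5);
        { std::lock_guard<std::mutex> lk(mtx); done = true; }
        cv.notify_one();
        t.join();
    }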

~~~
klodolph
Formatting is generally fast enough in practice that synchronizing with
another thread would slow things down. So it is actually not a good place to
offload work to other threads, since you spend a similar amount of work just
sending the message.

Unless you are formatting a really large message.

~~~
CoolGuySteve
You should time it and come back. You're wrong.

~~~
nkurz
You might be right, but your response comes across as rather rude. I think you
should either soften your message, or provide more details about the timing
--- or both! My instinct is that for standard IO involving moderate amounts of
string interpolation and number formatting in C running on Linux on modern
Intel processors, klodolph is probably closer to right. But maybe you are
right for some other setup? Or maybe klodolph and I are both wrong, in which
case some more details would be helpful.

~~~
CoolGuySteve
I gave a time in my original post: 1-2 usec. Printing a handful of doubles
with a cold instruction cache using the standard stdio headers takes longer
than that due to all the special casing.

klodolph made a handwavey claim about synchronization overhead that is clearly
false. It doesn't even make sense to me, since the device being printed to
will need some sort of synchronization and a syscall anyway.

Both you and klodolph are also making handwavey claims about formatting speed
on Linux.

You both need to measure before you argue. I find it extremely rude to argue
from a position of ignorance, as both of you are doing.

"Well gee, I dunno, but you're probably wrong" is so asinine.

~~~
klodolph
Sure, if you have a cold instruction cache. If stdio is not in your
instruction cache, or at least in L2, then you’re not using it very much, and
you won’t get much benefit from moving it to a different thread. My
measurements put stdio formatting at around 100ns for the typical messages I’m
printing. So, offloading it to another thread is a fairly clear loss.

Even for floating point numbers, you’ll hit the fast path with fairly high
regularity. My measurements put it at around 80ns for a double, on my system,
most of the time.
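
The kind of measurement I mean is easy to sketch (illustrative code; the
numbers vary wildly with the format string, the libc, and cache state):

    #include <chrono>
    #include <cstdio>

    int main() {
        char buf[64];
        const int N = 1000000;
        long total = 0;   // accumulate so the calls can't be optimized away
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i)
            total += std::snprintf(buf, sizeof buf, "value %d: %f\n", i, i * 0.5);
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("%.1f ns per format (%ld bytes)\n", ns / N, total);
    }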

------
andi999
So is IO faster or 'conversion'?

~~~
vitaut
The formatting part, but unfortunately it's just clickbait:
[https://news.ycombinator.com/item?id=23311726](https://news.ycombinator.com/item?id=23311726)

~~~
expnkx
Sorry dude. You are horribly wrong.
[https://github.com/expnkx/fast_io/commit/9b2b084e680b6cd2809...](https://github.com/expnkx/fast_io/commit/9b2b084e680b6cd28096a566b103f08b9ad3a714)

Benchmark:
[https://github.com/expnkx/fast_io/tree/master/benchmarks/001...](https://github.com/expnkx/fast_io/tree/master/benchmarks/0015.fast_io_vs_fmt/benchmark)
Binary Size:
[https://bitbucket.org/ejsvifq_mabmip/fast_io/src/master/benc...](https://bitbucket.org/ejsvifq_mabmip/fast_io/src/master/benchmarks/0015.fast_io_vs_fmt/binary_size/)

