
Fast Transaction Log: Windows - jswny
http://ayende.com/blog/174785/fast-transaction-log-windows
======
rusanu
> Like quite a bit of other Win32 APIs (WriteGather, for example), it looks
> tailor made for database journaling.

Indeed, WriteFileGather and ReadFileScatter are specifically tailored for
writing from and reading into the buffer pool. The IO unit is the sequential
on-disk layout of an extent (8 pages of 8 KB each), but in memory the pages
are not contiguous, so they have to be 'scattered' on read and 'gathered' on
write.
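
The same scatter/gather idea exists on the POSIX side as writev()/readv(); a minimal sketch using Python's os bindings (Unix only), with eight separate 8 KB buffers standing in for non-contiguous buffer-pool pages, gathered into one contiguous 64 KB extent with a single system call:

```python
import os
import tempfile

# Eight 8 KB "pages" living in separate, non-contiguous buffers,
# mimicking buffer-pool pages scattered around memory.
pages = [bytes([i]) * 8192 for i in range(8)]

fd, path = tempfile.mkstemp()
try:
    # Gather-write: one syscall lays all eight pages out
    # sequentially on disk as a single 64 KB extent.
    written = os.writev(fd, pages)
finally:
    os.close(fd)

print(written)                 # 65536 (8 pages x 8 KB)
print(os.path.getsize(path))   # 65536
os.remove(path)
```

The Win32 versions add the requirement that the file be opened unbuffered and the buffers be page-aligned, but the shape of the call is the same.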

You also have to keep in mind that the entire IO stack, Windows and SQL
Server, was designed in the days of spinning media, where sequential access
was ~80x faster than random access. SSD media has very different behavior,
and I'm not sure the typical 'journaling' IO pattern is capable of driving it
to the upper bound of its physical speed.

As a side note, I was close to some folks who worked on ALOJA
([http://hadoop.bsc.es/](http://hadoop.bsc.es/)), and I had a very
interesting discussion with them: the default configuration for Java/Hadoop
provided, out of the box, the best IO performance on Linux. The same
configuration was a disaster on Windows, and basically every parameter had to
be 'tuned' to achieve decent performance. This paper has some of their
conclusions: [https://www.bscmsrc.eu/sites/default/files/bsc-msr_aloja.pdf](https://www.bscmsrc.eu/sites/default/files/bsc-msr_aloja.pdf)

~~~
loeg
SSDs also like nice big sequential writes, although it matters less. It's
easier on their wear-leveling firmware and erase blocks. And fewer commands
back and forth to the drive reduce latency.

~~~
masklinn
> SSDs also like nice big sequential writes

Sure, but the point of the remark is that random IO on spinning rust is
absolutely dreadful compared to sequential (on consumer drives a difference
of two orders of magnitude or more is common). The difference is
significantly lower on SSDs, _and_ SSDs have a ton of IOPS, so random
accesses can happen concurrently, leading to much lower effective latency.

~~~
loeg
You're absolutely correct. Wear leveling can affect total IOPS and drive
longevity though, so it's still something to consider for those eking out the
best total performance from the drive.

------
PaulHoule
Windows is fast at some things and slow at some things.

For instance, metadata reads on the filesystem are much slower on Windows
than on Linux, so it takes much longer to do the moral equivalent of "find /".

Circa 2003 I would say the Apache web server ran about 10x faster on Linux
than on Solaris, but a Solaris mail server was 10x faster than a Linux one.

Turns out that Linux and Apache grew up together to optimize performance for
the forked-process model, but Linux's fsync()-and-friends performance was
much worse than Solaris's at the time, which matters if you want to meet
specifications for reliable delivery.

~~~
corysama
[https://news.ycombinator.com/item?id=11864211](https://news.ycombinator.com/item?id=11864211)
contains an interesting discussion from the author of PyParallel about how
doing IO on NT in the style preferred by Linux is slow. This leads people to
believe "NT IO is slower than Linux". However, doing IO on NT in the style NT
prefers is faster than Linux doing its preferred thing.

~~~
Analemma_
That discussion and others have led me to the conclusion that the NT kernel
has an excellent design and a subpar implementation (since only Microsoft's
team can work on it), whereas Linux has a crappy design and an excellent
implementation (being constantly refined and iterated by anyone). Kind of
makes you wonder what could be possible if Microsoft would ever open-source
it.

~~~
walkingolof
"constantly refined and iterated by anyone"

That's the theory. If you exclude the device-driver folks, how many people
really work on the Linux kernel?

~~~
JonathonW
Here's [1] the current block I/O code from the kernel, annotated with the
authors who last touched each line.

While I don't have time right now to actually go and count the number of
distinct people involved, that's a lot of hands that have touched that source
file over the years. And this view's only showing the changes that make up the
current version of that file-- there are authors not credited there because
their code has since been overwritten or edited by someone else.

[1]
[https://github.com/torvalds/linux/blame/master/block/bio.c](https://github.com/torvalds/linux/blame/master/block/bio.c)

------
Jedd
Been a while since I used AWS in anger, but EC2 instances were massively
(hair-pullingly) variable from one moment to the next. I can't see any detail
on either blog post (GNU/Linux or Microsoft Windows) regarding how they
catered for this, how many runs they did of their custom benchmark code, and
what kind of variances they were seeing in each iteration.

~~~
jonathanoliver
Agreed. I'd love to see the numbers again but running on dedicated hardware.

~~~
joshka
"Jonathan, I highly doubt that. We see similar results when running on
physical hardware. I'm posting the results of EC2 instances here to ensure
that they can be easily reproduced, but we have two identical boxes in the
office that sit there and show Windows being much faster in this kind of
thing." -- from the comments

------
markbnj
As a former Windows/C++/C# dev who has been working on linux for five years
now, I have never automatically assumed Windows was slower than linux. The
main advantages of linux over windows are not in the performance area, imo,
but in any case I think you'd have to average a lot of runs to make sure of
getting reasonably meaningful numbers from an ec2 instance.

------
zsombor
The Linux version was benchmarked with gettimeofday() while the Windows one
used QueryPerformanceCounter. The former has a lower resolution, around 10
microseconds, so the two benchmarks are not directly comparable.

~~~
jschwartzi
gettimeofday() is totally inappropriate for benchmarking. Any time the system
clock is being adjusted by NTP, for instance, your benchmark timing will be
skewed.

They should be using the following API if they're going to use the system time
to measure time differences:

struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);

It's a really good clue that a timing function is inappropriate for
benchmarking when the man page talks about time zones.
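
A minimal sketch of the distinction, using Python's bindings to the same POSIX clocks: CLOCK_REALTIME can be stepped by NTP or the administrator, while CLOCK_MONOTONIC only ever moves forward, which is what you want for measuring intervals.

```python
import time

# Resolution of each clock (the clock_getres() equivalent).
print(time.clock_getres(time.CLOCK_REALTIME))
print(time.clock_getres(time.CLOCK_MONOTONIC))

# Timing an interval with the monotonic clock: immune to NTP
# steps and wall-clock adjustments.
t0 = time.clock_gettime(time.CLOCK_MONOTONIC)
total = sum(range(1_000_000))   # the work being timed
t1 = time.clock_gettime(time.CLOCK_MONOTONIC)

elapsed = t1 - t0
print(elapsed >= 0.0)   # a monotonic clock can never run backwards
```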

~~~
emn13
While that may be true, the kinds of errors you'd expect that to cause do not
appear to be present in their results. It doesn't matter here.

------
ComodoHacker
The site is having problems. Google cache: Windows benchmark[1], Linux
benchmark[2].

1\.
[http://webcache.googleusercontent.com/search?q=cache%3Ahttps...](http://webcache.googleusercontent.com/search?q=cache%3Ahttps%3A%2F%2Fayende.com%2Fblog%2F174785%2Ffast-transaction-log-windows)

2\.
[http://webcache.googleusercontent.com/search?hl=en&q=cache%3...](http://webcache.googleusercontent.com/search?hl=en&q=cache%3Ahttps%3A%2F%2Fayende.com%2Fblog%2F174753%2Ffast-transaction-log-linux)

------
noja
An 80% performance difference? Something doesn't seem right here.

~~~
emn13
80% difference in a microbenchmark is not nothing, but it's hardly unusual. In
a real application, these kinds of differences may well be much less dramatic,
especially if you consider that most apps will be tuned to the OS they're
designed for and thus pick the "happy path" for that OS.

And that 80% difference is in the buffered case, which is also the least
relevant: you can use user-space buffering (which is normal anyhow, on both
OSes) to amortize the system-call cost.
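
A hypothetical sketch of that amortization (the class and record size are mine, not from the post): batching 16-byte records in a 64 KB user-space buffer turns 65,536 write() syscalls into 16.

```python
import os
import tempfile

class BufferedLog:
    """Hypothetical append-only log that batches records in user
    space, issuing one write() syscall per 64 KB instead of one
    per record."""

    def __init__(self, fd, cap=64 * 1024):
        self.fd = fd
        self.cap = cap
        self.buf = bytearray()
        self.syscalls = 0          # number of write() calls issued

    def append(self, record: bytes):
        self.buf += record
        if len(self.buf) >= self.cap:
            self.flush()

    def flush(self):
        if self.buf:
            os.write(self.fd, bytes(self.buf))
            self.syscalls += 1
            self.buf.clear()

fd, path = tempfile.mkstemp()
log = BufferedLog(fd)
for _ in range(64 * 1024):         # 65,536 records of 16 bytes each
    log.append(b"0123456789abcdef")
log.flush()
os.close(fd)

print(log.syscalls)                # 16 syscalls instead of 65,536
os.remove(path)
```

A real transaction log would also need fsync()/FlushFileBuffers at commit boundaries, which is exactly the cost the benchmarks in the post are measuring.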

------
justinsaccount
Either I am reading this wrong or something is not right here.

Buffered:

windows = 0.006 linux = 0.03

80% Win for windows?

But where do those numbers come from?

The time in ms for linux was 522, the time for windows was 410. That's not an
80% win.

Where does the "Write cost" number come from?

In general, I don't think the other numbers are comparing the same things. I
don't think it's a coincidence that both systems had write times of about 10s
and 20s across the different tests. Where Linux took 20s and Windows took
10s, I'd bet they were comparing different behaviors.

~~~
zebracanevra
I believe write cost is ms/write:

0.006 ms/write × 65536 writes ≈ 410 ms

0.03 ms/write × 65536 writes ≈ 1966 ms

~~~
justinsaccount
I figured that's what it was, but the raw data doesn't match up with the
results in the table.

~~~
danbruc
Time is the total running time for 64k writes, write cost is for a single
write, so just dividing time by 65536 gives the write cost. This is correct
for all listed tests on Linux and Windows except two, the two buffered tests
on Linux. I am not sure what went wrong, both write cost values are about 3.75
times to high assuming the total time is correct.
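
Redoing the arithmetic with the numbers from the post: dividing each reported buffered total by the 64k writes reproduces the Windows write cost exactly, while the Linux figure comes out a factor of ~3.77 below the reported 0.03.

```python
writes = 64 * 1024                  # 65,536 writes per test

# Reported buffered totals (ms) and per-write costs from the post:
win_total_ms, win_cost = 410, 0.006
lin_total_ms, lin_cost = 522, 0.03

print(round(win_total_ms / writes, 3))               # 0.006 -- matches the table
print(round(lin_total_ms / writes, 5))               # 0.00797 -- not the reported 0.03
print(round(lin_cost / (lin_total_ms / writes), 2))  # 3.77 -- the ~3.75x discrepancy
```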

------
MrZipf
Windows also has the cool and little-known Common Log File System (CLFS) if
you need logging:

[https://en.wikipedia.org/wiki/Common_Log_File_System](https://en.wikipedia.org/wiki/Common_Log_File_System)

[https://msdn.microsoft.com/en-us/library/windows/desktop/bb9...](https://msdn.microsoft.com/en-us/library/windows/desktop/bb986747\(v=vs.85\).aspx)

~~~
spullara
Not that kind of logging.

~~~
reubenbond
Yes, the same kind of logging - transaction logging.

~~~
spullara
You're right! The name of it was so generic it fooled me. Sorry!

------
UK-AL
It's well known that Windows performs better in these sorts of situations,
probably for the reasons mentioned: Microsoft has products of its own that
rely on good performance in these code paths.

------
zihotki
Duplicate of
[https://news.ycombinator.com/item?id=12084481](https://news.ycombinator.com/item?id=12084481)

~~~
joshka
This one has more comments / points though

