
An unexpected benefit of open-sourcing our code - jgrahamc
https://blog.cloudflare.com/cloudflare-fights-cancer/
======
dekhn
I'm glad you're contributing to genomics technology, but I feel compelled to
point out that zlib is a terrible algorithm for compressing BAM files. The
reason is that over 50% of the compressed BAM file space is spent on encoding
the quality scores at full resolution, which is just wasted effort. Quality
scores are very hard to compress (they are very close to completely random
data), and the best compressors use complex probabilistic models to switch the
encoding technology depending on what category quality scores fall into.
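
To make the point concrete, here is a rough sketch of how poorly zlib does on near-random quality strings, and how much a reduced-resolution encoding helps. Everything in it is an assumption for illustration: the scores are synthetic (uniform over Q0..Q40, which is more random than real sequencer output), and `bin_qualities` with its bin width of 6 is a hypothetical lossy scheme, not something any particular tool implements.

```python
import random
import zlib


def synthetic_qualities(n, seed=0):
    """Return n Phred+33 quality bytes drawn uniformly from Q0..Q40 (synthetic data)."""
    rng = random.Random(seed)
    return bytes(33 + rng.randrange(41) for _ in range(n))


def bin_qualities(quals, width=6):
    """Lossily round each score down to the start of its bin (hypothetical scheme)."""
    return bytes(33 + ((q - 33) // width) * width for q in quals)


def zlib_ratio(data):
    """Compressed size divided by original size, at zlib's maximum level."""
    return len(zlib.compress(data, 9)) / len(data)


full = synthetic_qualities(100_000)
binned = bin_qualities(full)
print(f"full resolution: {zlib_ratio(full):.2f}")
print(f"binned:          {zlib_ratio(binned):.2f}")
```

The exact ratios depend on the synthetic distribution, so treat the numbers as a direction, not a benchmark.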

~~~
chuckcode
I'm all for a replacement for BAM file format and the quality scores. Ideally
something that supports delta based encoding to a reference genome similar to
CRAM and something that more compactly represents quality of bases sequenced.

In the meantime as a user of BAM I'm very very very grateful for faster zlib
as it makes life a lot better with the current large installed base of
programs that use BAM and data sets in BAM. These sorts of improvements really
do make a meaningful difference in the same way that switching to SSDs from
magnetic disks didn't improve the implementation of my software but makes my
computer a lot faster to boot up and compile things.

------
minimax
The CloudFlare x86-64 optimized version of the longest match algorithm is
especially well documented.

[https://github.com/cloudflare/zlib/blob/31043308c3d3edfb487d...](https://github.com/cloudflare/zlib/blob/31043308c3d3edfb487d2c4cbe7290bd5b63c65c/contrib/amd64/longest-match.inc)

It's great material for anyone interested in these types of low level
optimizations.
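
For anyone who wants the idea before reading the assembly, below is a naive sketch of what a longest-match routine computes: the longest earlier run that matches the bytes at the current position. It is a brute-force illustration, not CloudFlare's code and not how zlib actually searches (zlib follows hash chains instead of scanning the whole window). The function name and parameters are made up for this example; the 32 KiB window and the 3..258 match lengths are the DEFLATE limits.

```python
def longest_match(data, pos, window=32768, min_len=3, max_len=258):
    """Return (length, distance) of the longest earlier match for data[pos:].

    Brute-force scan over the window; returns (0, 0) when no match reaches
    min_len bytes.
    """
    best_len, best_dist = 0, 0
    start = max(0, pos - window)
    limit = min(max_len, len(data) - pos)
    for cand in range(start, pos):
        n = 0
        # A match may run past pos and overlap the bytes being encoded,
        # which LZ77-style formats allow.
        while n < limit and data[cand + n] == data[pos + n]:
            n += 1
        # ">=" lets ties go to the nearer candidate (smaller distance).
        if n >= best_len:
            best_len, best_dist = n, pos - cand
    if best_len < min_len:
        return (0, 0)
    return (best_len, best_dist)
```

For example, `longest_match(b"abcabcabcd", 3)` returns `(6, 3)`: six bytes can be copied from three bytes back, overlapping the current position.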

~~~
Matt3o12_
Thanks for the link, although I understand only half of what it is.

Do you know what the file extension inc stands for?

~~~
munificent
"include"

It's a fairly common extension for a file containing C implementation code
(and not just headers) that is intended to be #included by some other file.

~~~
lfowles
You know what, I was trying to find an example for Matt3o12_ where it was
#included, but I can't even find the file at the tip of any of the branches.
Can a better git archaeologist explain what happened to it? Did it get merged
into another file?

Edit:

Here's an example of a .inc file being used

[https://github.com/cloudflare/zlib/blob/31043308c3d3edfb487d...](https://github.com/cloudflare/zlib/blob/31043308c3d3edfb487d2c4cbe7290bd5b63c65c/deflate.c#L1158)

Double Edit:

git log --all -p -- contrib/amd64/longest-match.inc

It was merged directly into deflate.c

------
astazangasta
Can you argue that there is ever a social (not private) benefit to NOT open
sourcing your code?

~~~
pjc50
A sufficiently bad FOSS implementation can drive out a better paid-for
solution, especially if the latter is from a small company.

(I can't immediately name an example of this)

~~~
jahewson
A sufficiently bad FOSS implementation can also prevent a better FOSS
implementation from ever being built, e.g. Xpdf/Poppler.

~~~
TD-Linux
It didn't prevent PDF.js from being built (though maybe it's not "better")

Though I haven't had problems with poppler, so I don't quite understand what
is bad about it.

------
pdknsk
> Both formats are supported by the absolute majority of web browsers [...].

Unfortunately this isn't true. IE still does not support deflate as specified
in the RFC, with a zlib header. It only supports raw deflate.

[https://www.ietf.org/rfc/rfc2616.txt](https://www.ietf.org/rfc/rfc2616.txt)

    
    
      deflate
          The "zlib" format defined in RFC 1950 [31] in combination with
          the "deflate" compression mechanism described in RFC 1951 [29].
    

This has led to most browsers also accepting raw deflate, to be backward
compatible.
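
A small sketch of the difference, using Python's standard `zlib` module. The helper names are mine, and `tolerant_inflate` is only an illustration of the "accept both" behaviour, not how any particular browser implements it.

```python
import zlib


def zlib_wrapped(data):
    """RFC 1950 format: 2-byte header + deflate stream + Adler-32 trailer."""
    return zlib.compress(data, 9)


def raw_deflate(data):
    """RFC 1951 only: negative wbits makes zlib omit the header and trailer."""
    c = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    return c.compress(data) + c.flush()


def tolerant_inflate(body):
    """Try the RFC-specified zlib wrapper first, then fall back to raw deflate."""
    try:
        return zlib.decompress(body)
    except zlib.error:
        return zlib.decompress(body, -zlib.MAX_WBITS)
```

A strict decoder that only calls `zlib.decompress(body)` rejects the raw stream with a header-check error, which is the interoperability problem described above.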

~~~
cwp
He didn't say "all browsers", he said the majority. It's been a long time
since the majority of web traffic came from IE.

------
rurban
I have to applaud CloudFlare for their zlib version. It really is the best.

Some versions on my system even use the software (SW) crc32 variant, e.g.
MacPorts on an i7: 200x slower
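
For reference, a pure software CRC-32 in its simplest form looks roughly like the sketch below: one shift and conditional XOR per input bit. This is a minimal illustration of the checksum zlib computes, not the code from any of the builds mentioned; real software variants use lookup tables, and the fast builds use CPU instructions instead. The function name is hypothetical.

```python
import zlib


def crc32_bitwise(data):
    """Bit-at-a-time CRC-32 (reflected polynomial 0xEDB88320), as used by zlib."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # If the low bit is set, shift and XOR in the polynomial;
            # otherwise just shift. -(crc & 1) is an all-ones mask or zero.
            crc = (crc >> 1) ^ (0xEDB88320 & -(crc & 1))
    return crc ^ 0xFFFFFFFF


# Sanity check against the library implementation.
assert crc32_bitwise(b"123456789") == zlib.crc32(b"123456789")
```

Eight loop iterations per byte is why the naive form is so much slower than anything hardware-assisted.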

------
j_s
The previous discussion of another zlib alternative last month mentioned
unexpected issues incorporating them into other software (that reaches into
internals), as well as the potential for licensing issues.

[https://news.ycombinator.com/item?id=9664655](https://news.ycombinator.com/item?id=9664655)

~~~
nadams
I was actually just wondering if they made it better - why not create a patch
and push it upstream?

------
Mojah
I've experienced something similar, although on a much smaller scale.

A few years back, I got an e-mail that one of my open source tools was being
used as a monitoring solution on a hospital ship. I felt pretty damn proud
too, not having imagined the impact that open sourcing a project could have.

I wrote a quick blogpost on the matter back then; [https://ma.ttias.be/this-is-what-open-source-is-about-mobile...](https://ma.ttias.be/this-is-what-open-source-is-about-mobile-zabbix-used-on-hospital-ship/)

------
halosghost
> -O3

yikes. I didn't realize anyone actually used -O3 in production. Isn't that
widely considered to be a BadIdea™?

~~~
majke
I use it all the time. I've encountered more bugs with -O0 and lack of -O at
all, than with explicit -O3. My experience is with gcc.

~~~
lfowles
Funny, building -O3 actually flushed out some segfaults due to timing...
everything was running too damn fast for my assumptions to hold!

It looks like Debian policy is to build at -O2 with exceptions for building at
-O0 or -O3

------
rollo2
"Because if we can make your photos smaller...we can make your cancer
smaller!"

------
jackgavigan
So... Cloudflare cures Cancer? :-)

~~~
daenney
So about that...

> The files used for this kind of research reach hundreds of gigabytes and
> every time they are compressed and decompressed with our library many
> important seconds are saved, bringing the cure for cancer that much closer.
> At least that's what I am going to tell myself when I go to bed.

I realise it's not meant entirely seriously and I applaud any effort that
helps speed up the exchange and storage of information and this type of
research. But "bringing the cure for cancer that much closer" I find a bit of
a stretch.

