
MPEG-G: the ugly - kierank
https://datageekdom.blogspot.com/2018/09/
======
rayiner
This bit is revealing:

> There are clear non-royalty based incentives for _large companies_ to
> develop new compression algorithms and drive the industry forward. Both
> Google and Facebook have active data compression teams, led by some of the
> world's top experts in the field.

Google and Facebook can afford to spend money on R&D because they throw off
gobs of money from near-monopolies in important economic sectors. This is one
of the archetypal models for R&D, and has a lot of precedent: AT&T Bell Labs
(bankrolled by AT&T's telephone monopoly) and Xerox PARC (bankrolled by the
copier monopoly built on Xerox's patents). Much of the really fundamental
technologies underlying computing were developed this way.

But MPEG is thirty years old now, and the MPEG-1 standard is 25 years old.
Until recently, the MPEG standard has been pushed forward not by a single
giant corporation that can afford to bankroll everything, but a consortium of
companies using patents and licensing to recover their investment into the
R&D. This is one of the other archetypal models for R&D. Many of the other
fundamental technologies underlying computing were developed this way.

(The third archetypal model is the government-funded project, _e.g._ TCP/IP,
which is also an example of a monopoly bankrolling R&D.)

The "benevolent monopoly" model obviously has advantages for open source--
because the company bankrolls R&D by monetizing _something else_, it can
afford to release the results of the research for everyone to use. But it's
not sustainable without the sponsor (and we know this, because open source has
been around for a long time, and there is little precedent for a high-
performance video codec designed by an independent group of open source
developers).[1]

I see people demonizing MPEG and espousing reliance on Google and FB as the
way forward, but it's not clear to me that everyone fully understands the
implications of that approach.

[1] Query whether Theora counts--it was based on an originally proprietary,
patented codec.

~~~
tgb
I think that's the only excerpt from the blog post suggesting that companies
will do this research out of good will. The reality is that, for now, this work
is largely being done by academia and government research groups, and it's
unclear that MPEG is pushing the state of the art forward more than
incrementally.

There's also the idea that our personal medical data (currently only a portion
of all genomic data, but a rapidly increasing one) should be entirely
open to reading, no licenses required.

~~~
cbsmith
It's not good will. It's in their interest to have better compression ratios
and higher quality video.

~~~
labster
It's in all our interest to have better codecs. Self-interest is called good
will when it benefits everyone.

~~~
cbsmith
It's in all our interest to have better codecs, but not at any cost. Only if
you are serving billions of videos a day does it make sense to freely
contribute improvements to codecs.

------
pdkl95
> "barrage of 12 patents from GenomSys"

Based on the patent titles (I'll see if I can read some of them in detail
tomorrow), most of these sound almost exactly like the code I wrote while
working at the JGI[1][2][3] in the early 2000s that managed moving large
amounts of reads from the ABI (Sanger) sequencers, running them through
phred/phrap, and storing it all so the biologists could access it easily. This
included a custom Huffman-tree-based encoder/decoder to efficiently store
FASTA files at (iirc) ~2.5 bits/base (quality scores were just stored as a
packed array of bytes), a _very_ large MySQL backend, and a large set of Perl
libraries that provided easy access to reads/libraries/assemblies/etc. It was
certainly a "method and apparatus" for "storing and accessing" + "indexing"
bioinformatics data using a "compact representation" that provided many
different types of "selective access".

I even had code that did an LD_PRELOAD hack on (circa 2002) Consed that
intercepted calls to open(2) to load reads automagically from the DB. Reading
Huffman-encoded data in bulk from the DB (instead of one file per read)
reduced the network bandwidth required to open an assembly with all its
aligned reads by ~90%. That sounds a lot like "transmission of bioinformatics
data" over a network and "access ... structured in access units". It definitely
involved "reconstruction of genomic reference sequences from compressed
genomic sequence reads".

They may have a more efficient compression method, and we didn't do anything
re: "multiple genomic descriptors" (was that even a thing pre-2004?), but...
no... they didn't invent what is basically a bioinformatics-specific
variation of the same methods used everywhere in the computer industry for as
long as "text file formats" have existed.

[1] [https://jgi.doe.gov/](https://jgi.doe.gov/)

[2] These are my personal comments and opinions only; they are not endorsed by
the Joint Genome Institute, Lawrence Berkeley National Laboratory, or the U.S.
Department of Energy, with which I am no longer affiliated.

[3] While I have no idea if any of that code even exists today (I left the JGI
in 2004), I _did_ mark the source files with the BSD license, since there was
historical precedent.

~~~
twic
> ~2.5 bit/base

Given that there are four bases, i would have thought you could reliably do it
in 2. What am i missing?

~~~
pdkl95
At a minimum there were two additional symbols:

"N" for uNknown, where the raw sensor information - which should be a series
of nice Gaussians[1 top] - was too ambiguous/low-quality for a useful
basecall, but the presence of _something_ can be inferred from the timing
information[1 bottom, with "N"s in some bases]

An additional unused bit pattern was reserved for future emergency additions.
It used the longest codeword, so any future additions wouldn't be optimally
balanced, but I wanted something simple and guaranteed binary-compatible that
could handle a sudden change in requirements.

In some situations, a variation was used that included a few symbols for
partial basecalls like "unknown purine" (A or G) and "unknown pyrimidine" (C
or T).
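
To put rough numbers on it (made-up frequencies, purely for illustration):
with a prefix code like A=00, C=01, G=10, T=110, N=1110 (and 1111 reserved), a
read that is ~75% {A,C,G}, ~20% T and ~5% N averages 0.75*2 + 0.20*3 + 0.05*4
≈ 2.3 bits/base; add the partial-basecall symbols and the tree gets a little
deeper, which is how you end up nearer the ~2.5 figure quoted upthread.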

[1] [http://image.slidesharecdn.com/binoinfo-8-141021112354-conve...](http://image.slidesharecdn.com/binoinfo-8-141021112354-conversion-gate02/95/7-11-638.jpg)

~~~
twic
Makes sense, thanks. Back when i was phredding for a living, i had the luxury
of just cutting off the file when the quality of the read got dodgy!

------
mbreese
If there are really patents protecting this format, it makes it a complete
non-starter for a great deal of work (commercial and academic). Posts like
this scare me. I don't want to devote effort to support a format that I might
not be able to use in the future. The only thing that I could think of that
_might_ work is putting the patents in some sort of defensive portfolio in
much the same way that the Open Invention Network protects Linux.

I understand the desire to develop bioinformatics file formats in a more
disciplined way than we have done in the past, but this process seems like it
may be more of a pain than a benefit. Unfortunately, I couldn't see some of
the MPEG-G talks at ISMB this year (other talks were concurrent).

Could anyone explain what the benefits of the MPEG-G format are over something
like CRAM? I mean, we were already starting to get close to the theoretical
minimum in terms of file size. I personally would like to see more support for
encryption and robustness (against bitrot) in formats, but this could be done
in a very similar way to current formats.

~~~
deugtniet
I agree; the comparison with CRAM is in the MPEG-G whitepaper. But the
author of the blog has some more recent posts, where he is very skeptical of
the claims made with respect to the CRAM format. They're worth a read.

~~~
jkbonfield
Disclaimer - I am the author of the blog.

There are no "comparisons with" CRAM in the MPEG-G preprint, only comparison
between CRAM and DeeZ, taken from the DeeZ paper. Those comparisons are fair
and correct, but obviously were done at the time that paper was written - some
4 years ago. Since then CRAM has moved on (as has deez), but modern CRAM
generally beats modern DeeZ if we restrict ourselves to the formats that
permit random access (DeeZ has a higher compression non-random access mode).

So far there have been no direct data on how well MPEG-G does bar an old slide
from a year ago; ISMB talk I think.

[https://mpeg.chiariglione.org/sites/default/files/events/Mat...](https://mpeg.chiariglione.org/sites/default/files/events/Mattavelli.pdf)

From that we can glean some compression ratios at least. I attempted to
compress the same data set with CRAM, but that data set, while public, is in
FASTQ format instead of BAM. I asked the author of the talk how he had
produced the BAM, but got no response. I made my best stab at creating
something similar, but it's not a satisfactory comparison yet.

------
0xcde4c3db
How did MPEG get involved in genomic data? I thought it was specifically
chartered by ISO for audiovisual formats.

~~~
theophrastus
Interesting, isn't it? As a biochemist I can see only a slight correspondence
between a sequence of audio or visual data and one of genetic data (as a
contrary notion, there is only a small statistical expectation of correlation
between frame N and N+1 in DNA data, yet a lot between whole sequences in
terms of evolutionary homology). Yet there it is, plopped right in the middle
of the Wikipedia standards page[1]. Likely more easily explained from the
business point of view than the scientific one.

[1]
[https://en.wikipedia.org/wiki/Moving_Picture_Experts_Group](https://en.wikipedia.org/wiki/Moving_Picture_Experts_Group)

------
ezoe
So, they got a lot of patents for the DNA compression standard MPEG-G, but
there is no way to get a license for it? That's an unusable standard for 20
years!
What are they thinking?

------
jascenso
All MPEG standards have patents and this one is no exception. If companies
are interested they can license its use (assuming fair terms). This is far
better than having proprietary formats which are locked down, or formats made
by a single company whose patent situation you can't know clearly. Also, the
companies involved invested in the development of this standard and expect
some return.

What I don't like in this post is the call for non-adoption when the author
has a competing format (CRAM) for which the patent situation and the
performance are not clear. It seems a biased opinion.

~~~
jkbonfield
I am the author of _an_ implementation, although not the author of the file
format itself. Yes, that is still a fair point if you look at just the one
blog post. However, there is a series of them where I clearly explain the
process and my involvement, so don't just look at the last.

I agree, though, that the message would be better if it came from a third
party. I was hoping this would happen, but it didn't look likely before the
GA4GH conference (where both MPEG and I were speaking), so I self-published
before that to ensure people were aware and could ask appropriate questions
(of both myself and MPEG, of course).

As for royalties, CRAM comes under the governance of the Global Alliance for
Genomics & Health ([https://www.ga4gh.org](https://www.ga4gh.org)). They
stated explicitly in the recent conference that their standards are royalty
free (as far as is possible to tell) and promote collaboration on the formats
/ interfaces, competition on the implementation. For the record, we are
unaware of any patents covering CRAM and we have filed none of our own, nor do
we intend to for CRAM 4.

~~~
jascenso
I recognize that I have only read the last post; I had searched in the past
for royalty information about CRAM and never found it. Thanks for your answer.

In my opinion, there is clearly a need to assess both solutions with clear
and meaningful data, not only in terms of performance but also in terms of
patents. Conferences are a great place to do that, and thus I completely agree
with you.

However, I don't see an MPEG standard as evil (or ugly) and I do think that
both types of standards (CRAM and MPEG) can coexist. Every company should
decide (based on factual information) what is the best solution for its
needs, and if the MPEG standard brings some advantage, a company may use it
despite the licensing costs. The same happens for video coding standards,
where the patent-heavy HEVC is nowadays used in some scenarios (e.g. iPhone)
and the royalty-free AOM AV1 is used in others (e.g. streaming video). It is
up to "the market" to decide. The main problem with MPEG-G is that the
licensing information is not yet known, since it hasn't reached draft
international standard yet.

~~~
ak217
CRAM is a standard that was started at Sanger/EMBL/EBI and developed freely in
the open by the genomics community over the past decade. As others have told
you, it is royalty free and unencumbered. The work behind the first version of
the standard was published in 2010
([https://genome.cshlp.org/content/21/5/734.full](https://genome.cshlp.org/content/21/5/734.full)).
Since then CRAM has undergone many revisions and improvements from a plurality
of sources, and is widely adopted among users whose use cases demand genomic
sequence compression.

MPEG-G at this point is a three-year-old attempt to patent-troll the genomics
community and grab money. The people involved are not experts and are not
aware of the actual state of the art or motivating requirements in sequence
compression, and are instead trying to dance their way around prior art, as
evidenced by the contents of their patents and presentations.

There is no equivalency here to fit your worldview.

~~~
jkbonfield
Some of the MPEG-G authors are experts in genomics data compression, while
others are experts in video compression. It should, in theory, be a good mix.

MPEG _are_ also well aware of the prior art. The authors of various existing
state-of-the-art genome compression tools were invited to one of the first
MPEG-G conferences, where they presented their work. Do not assume that,
because they do not compare against the state of the art, they are not aware
of its existence or how it performs. It's more likely simply that "10x better
than BAM" is a powerful message that sells products, more so than "a little
bit better than CRAM". It's a standard advertising technique.

------
RichardStallman
It is a mistake to take for granted that "more technological advance" is worth
the price society would pay for it. That price, imposed through patents, is
unacceptable in this case.

We are better off if other people encode in older, less efficient codecs that
we can support in free/libre software, than if they encode the files a
little smaller and we are forbidden by the MPEG patent portfolio from handling
them with free software.

See [https://www.gnu.org/philosophy/software-literary-patents.htm...](https://www.gnu.org/philosophy/software-literary-patents.html)
and [https://www.gnu.org/philosophy/limit-patent-effect.html](https://www.gnu.org/philosophy/limit-patent-effect.html).

You'll note that I do not use the term "open source". Since 1983, I have led
the free software movement, which campaigns to win freedom in our computing by
insisting on software that respects users' freedom. "Open source" was coined
in 1998 to discard that ethical foundation and present the software as a mere
matter of convenience.

See [https://gnu.org/philosophy/open-source-misses-the-point.html](https://gnu.org/philosophy/open-source-misses-the-point.html) for
more explanation of the difference between free software and open source. See
also [https://thebaffler.com/salvos/the-meme-hustler](https://thebaffler.com/salvos/the-meme-hustler) for Evgeny Morozov's
article on the same point.

Which one you advocate is up to you. If you stand for freedom, please show it
-- by saying "free" and "libre", rather than "open".

