Hacker News
MPEG-G: the ugly (datageekdom.blogspot.com)
149 points by kierank 9 months ago | 40 comments

This bit is revealing:

> There are clear non-royalty based incentives for large companies to develop new compression algorithms and drive the industry forward. Both Google and Facebook have active data compression teams, lead by some of the world's top experts in the field.

Google and Facebook can afford to spend money on R&D because they throw off gobs of money from near-monopolies in important economic sectors. This is one of the archetypal models for R&D, and has a lot of precedent: AT&T Bell Labs (bankrolled by AT&T's telephone monopoly) and Xerox PARC (bankrolled by the copier monopoly built on Xerox's patents). Many of the truly fundamental technologies underlying computing were developed this way.

But MPEG is thirty years old now, and the MPEG-1 standard is 25 years old. Until recently, the MPEG standard has been pushed forward not by a single giant corporation that can afford to bankroll everything, but a consortium of companies using patents and licensing to recover their investment into the R&D. This is one of the other archetypal models for R&D. Many of the other fundamental technologies underlying computing were developed this way.

(The third archetypal model is the government-funded project, e.g. TCP/IP, which is also an example of a monopoly bankrolling R&D.)

The "benevolent monopoly" model obviously has advantages for open source--because the company bankrolls R&D by monetizing something else, it can afford to release the results of the research for everyone to use. But it's not sustainable without the sponsor (and we know this, because open source has been around for a long time, and there is little precedent for a high-performance video codec designed by an independent group of open source developers).[1]

I see people demonizing MPEG and espousing reliance on Google and FB as the way forward, but it's not clear to me that everyone fully understands the implications of that approach.

[1] Query whether Theora counts--it was based on an originally proprietary, patented codec.

I think that's the only excerpt from the blog post suggesting that companies will do this research out of good will. The reality is that this work is largely being done by academia and government research groups for now and it's unclear that MPEG is pushing the state of the art forward more than incrementally.

There's also the idea that our personal medical data (which is only a portion of all genomic data currently but it's increasing rapidly) should be entirely open to reading, no licenses required.

It's not good will. It's in their interest to have better compression ratios and higher quality video.

It's in all our interest to have better codecs. Self-interest is called good will when it benefits everyone.

It's in all our interest to have better codecs, but not at any cost. Only if you are serving billions of videos a day does it make sense to freely contribute improvements to codecs.

Concerted effort to improve technology can be done without patents, that's what collaboration is for. Instead, some like MPEG-LA are doing concerted effort to extort money and prevent competing technologies from emerging. That's the opposite of progress.

> that's what collaboration is for

"Collaboration" is a poor model for hard R&D. Some problem domains, like video coding, require PhDs with extensive domain knowledge. These folks don't work for free. Community projects don't have the resources to pay for systematic, concerted research efforts in these areas.

Which is why entities like MPEG-LA exist in the first place. If collaborative projects had developed these technologies first, the companies behind MPEG-LA never would've gotten their patents.

> Instead, some like MPEG-LA are doing concerted effort to extort money and prevent competing technologies from emerging.

Note that the key competitors to the patented MPEG technology are coming from "benevolent monopolies" like Google. Those alternatives are unsustainable without those sponsors. So in reality, "open collaboration" is not one of the options. It's a choice between R&D bankrolled by patents, or R&D bankrolled by Google's search profits or Facebook's social media profits.

> Note that the key competitors to the patented MPEG technology are coming from "benevolent monopolies" like Google.

That may be true in the abstract or in the case of MPEG's multimedia technologies. It is not true in this specific case of genomic file formats.

The existing formats in use in the field (BAM and CRAM) that MPEG-G is looking to supplant come from the genomics community itself. They have come about in the course of bioinformaticians' work analysing sequencing data or as bioinformatics research [1]. Thus this R&D has primarily been bankrolled as scientific research; the funding comes from scientific charities or other scientific research funding.

That is to say, it has come from collaboration.

[1] CRAM originates from "Efficient storage of high throughput DNA sequencing data using reference-based compression", published in Genome Research in 2011. <https://genome.cshlp.org/content/21/5/734.long>

The work behind that article was done at the European Molecular Biology Lab ("EMBL"), which is a government-supported research lab. I wouldn't call that "collaboration" --it's one of the three forms of R&D spending I mentioned above ("government as benevolent monopoly").

These are my made up definitions of course, so we can argue about what counts as "collaboration." I'd say something like W3C or Linux--companies and individuals working together who aren't bankrolled by either patents, the government, or a quasi-monopoly.

Firstly, GA4GH has commercial members as well as academics, and all collaborate together to produce file formats, standards and protocols.

Secondly, you missed out a key part of funding: precompetitive alliances. E.g., see the Pistoia Alliance (https://www.pistoiaalliance.org/), who funded the SequenceSqueeze project into compression of FASTQ (http://www.sequencesqueeze.org/).

The notion here is simple: there are some technologies that are so core they cross all the commercial and academic boundaries. Collaboration rather than competition is considered to be to the mutual benefit of everyone involved. It is this mindset which led to the Alliance for Open Media (AOM), who are also in direct competition with MPEG.

> I wouldn't call that "collaboration" --it's one of the three forms of R&D spending I mentioned above ("government as benevolent monopoly").

The difference between government funding and a 'benevolent monopoly' is that in the first case an entity that does not directly benefit financially is providing the funding, while in the second case the entity providing the funding has a financial incentive to do so.

And that's ignoring the fact that they listed the government as one source of funding, the link provided being an example of what they are talking about, not the sole instance. At least according to their comment, some funding for such things also comes from 'scientific charities'.

One core part of CRAM originated with that article; more has come later.

I'm involved in BAM and CRAM -- it feels like a collaboration to me!

These formats are a collaboration between bioinformaticians at a number of research institutes and companies around the world, with various charity, governmental, or other funding. They are maintained under the auspices of GA4GH, which is indeed something like W3C.

> These folks don't work for free.

Collaboration doesn't mean they work for free. It means many pool resources to solve the same problem, instead of everyone trying to reinvent the wheel and then charging everyone else for it. That's what AOM are doing, which disproves your claims in practice.

> Which is why entities like MPEG-LA exist in the first place.

Nope, they exist because of the messed up legal patent situation, which allows such parasitic entities to run a patent protection racket, which is the opposite of progress. Patent law needs serious reform to prevent this kind of abuse.

> a consortium of companies using patents and licensing to recover their investment

If this is their model, I expect them to be straightforward about it, and I expect that almost everyone will not touch their standards with a 10 ft pole. Standards are not in short supply. I expect a "this food contains known poison" sort of explicit label on their wares.

Instead they make it look like community effort, stay silent about strings attached.

We don't want their RnD under such terms.

> Instead they make it look like community effort, stay silent about strings attached.

Did they? That would be weird because MPEG standards are almost always subject to patents. MPEG is not a forum for organizing “community efforts.” There is an ISO/ITU declaration process for patents, but I’m not sure whether the standard is at a stage where GenomSys was to have made the declaration already.

Show me where they notified others taking part of their patents or intent to patent. They sought out academics and invited them to take part. Yes, I was naive, but I also felt rather misled.

The GenomSys patents aren't even listed in the ISO patent list yet: https://www.iso.org/iso-standards-and-patents.html

I don't know if this is against ISO rules - it is unclear to me whether they only need to add their patents on grant, rather than submission.

The only reason I discovered these was due to an accidental hit from a Google Scholar alert. I didn't even realise it searched patents when I set that up.

> "barrage of 12 patents from GenomSys"

Based on the patent titles (I'll see if I can read some of them in detail tomorrow), most of these sound almost exactly like the code I wrote while working at the JGI[1][2][3] in the early 2000s that managed moving large amounts of reads from the ABI (Sanger) sequencers, running them through phred/phrap, and storing it all so the biologists could access it easily. This included a custom Huffman tree based encoder/decoder to efficiently store FASTA files at (iirc) ~2.5 bit/base (quality scores were just stored as a packed array of bytes), a very large MySQL backend, and a large set of Perl libraries that provided easy access to reads/libraries/assemblies/etc. It was certainly a "method and apparatus" for "storing and accessing" + "indexing" bioinformatics data using a "compact representation" that provided many different types of "selective access".

I even had code that did a LD_PRELOAD hack on (circa 2002) Consed that intercepted calls to open(2) to load reads automagically from the DB. Reading Huffman encoded data in bulk from the DB (instead of one file per read) reduced the network bandwidth required to open an assembly with all its aligned reads by ~90%. That sounds a lot like "transmission of bioinformatics data" over a network and "access ... structured in access units". It definitely involved "reconstruction of genomic reference sequences from compressed genomic sequence reads".

They may have a more efficient compression method, and we didn't do anything re: "multiple genomic descriptors" (was that even a thing <2004?), but... no... they didn't invent what is basically a bioinformatics-specific variation of the same methods used everywhere in the computer industry for as long as "text file formats" have existed.

[1] https://jgi.doe.gov/

[2] These are my personal comments and opinions only, which are not endorsed by or currently affiliated with the Joint Genome Institute, Lawrence Berkeley National Laboratory, or the U.S. Department Of Energy.

[3] While I have no idea if any of that code even exists today (I left the JGI in 2004), I did mark the source files with the BSD license, since there was historical precedent.

> ~2.5 bit/base

Given that there are four bases, i would have thought you could reliably do it in 2. What am i missing?

At a minimum there were two additional symbols:

"N" for uNknown, where the raw sensor information - which should be a series of nice Gaussians[1 top] - was too ambiguous/low-quality for a useful basecall, but the presence of something can be inferred from the timing information[1 bottom, with "N"s in some bases]

An additional unused bit pattern was reserved for future emergency additions. It used the longest bit pattern so future additions wouldn't be balanced properly, but I wanted something simple, guaranteed binary-compatibility, that could handle a sudden change in requirements.

In some situations, a variation was used that included a few symbols for partial basecalls like "unknown purine" (A or G) and "unknown pyrimidine" (C or T).

[1] http://image.slidesharecdn.com/binoinfo-8-141021112354-conve...

Makes sense, thanks. Back when i was phredding for a living, i had the luxury of just cutting off the file when the quality of the read got dodgy!

It's hard with Huffman given you need to deal with N. Realistically you'll end up with 3 bases at 2 bits, 1 at 3 bits, and the other 3-bit code being a prefix for everything else (N, ambiguity codes, etc), so somewhere averaging close to 2.3 bits is the norm.

If N is rare though, you're better off just doing blocks of 2-bit encoding and dropping to something more complex for the rare cases. Or of course just using an arithmetic / range / ANS coder.
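To make the "blocks of 2-bit encoding, dropping to something more complex for the rare cases" idea concrete, here is a minimal sketch. It is only an illustration under my own assumptions: the names `pack` and `unpack` and the choice of a position-to-symbol dictionary for the exceptions are hypothetical, not taken from any of the formats discussed in this thread.

```python
# Hypothetical sketch: A/C/G/T are packed at 2 bits per base, and any
# other symbol (e.g. "N") is recorded in a separate exception table.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Return (length, packed_bytes, exceptions) for a base string."""
    packed = bytearray((len(seq) + 3) // 4)  # 4 bases per byte
    exceptions = {}  # position -> original symbol
    for i, base in enumerate(seq):
        code = CODES.get(base)
        if code is None:
            exceptions[i] = base
            code = 0  # placeholder bits, ignored when decoding
        packed[i // 4] |= code << (2 * (i % 4))
    return len(seq), bytes(packed), exceptions


def unpack(length, packed, exceptions):
    """Invert pack(): rebuild the original base string."""
    out = []
    for i in range(length):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        out.append(exceptions.get(i, BASES[code]))
    return "".join(out)


length, packed, exceptions = pack("ACGTNACGTA")
restored = unpack(length, packed, exceptions)
```

When N is rare the exception table stays tiny, so the cost stays close to 2 bits per base; a real format would also need to serialise that table compactly, which this sketch does not attempt.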

You need to encode for an ambiguous base call (N). So it’s really 3 bits, but you can play with the format to encode it slightly more efficiently.

Probably that certain base sequences are more likely to occur together than random. If you account for this, it'll help in your compression.

That's right - the Huffman tree was built based on the base frequency of use, which actually varies depending on what type of organism you're sequencing[1].

[1] https://en.wikipedia.org/wiki/GC-content#Among-genome_variat...
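For anyone curious how a frequency-based Huffman tree ends up above 2 bits per base, here is a rough sketch. It is not the JGI code (which is not shown in this thread); the function names and the symbol counts are made up for illustration, and real base frequencies depend on the organism.

```python
import heapq
import math


def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for a Huffman code over counts."""
    # Heap entries are (weight, tie_breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]


def average_bits(counts, lengths):
    """Mean code length per symbol, weighted by how often each symbol occurs."""
    total = sum(counts.values())
    return sum(counts[sym] * lengths[sym] for sym in counts) / total


def entropy_bits(counts):
    """Shannon entropy in bits per symbol: the floor for any symbol-wise code."""
    total = sum(counts.values())
    return -sum((w / total) * math.log2(w / total) for w in counts.values() if w > 0)


# Made-up counts per 100 bases, purely illustrative.
counts = {"A": 30, "C": 20, "G": 20, "T": 29, "N": 1}
lengths = huffman_code_lengths(counts)
avg = average_bits(counts, lengths)
floor = entropy_bits(counts)
```

With these invented counts, three bases get 2-bit codes while one base and N get 3-bit codes, for an average of about 2.21 bits per base. The gap between `avg` and `floor` is what an arithmetic, range, or ANS coder can recover.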

The title is just a title, it doesn't have any legal significance. These patents address a specific encoding that is compact and indexed, not the general idea of such encodings. The patents expressly distinguish the claimed method from the approach you're describing (applying traditional compression techniques to FASTA/FASTQ files): https://patentscope.wipo.int/search/en/detail.jsf?docId=WO20...

> [0003] The most used genome information representations of sequencing data are based on zipping FASTQ and SAM formats. The objective is to compress the traditionally used file formats (respectively FASTQ and SAM for non-aligned and aligned data). Such files are constituted by plain text characters and are compressed, as mentioned above, by using general purpose approaches such as LZ (from Lempel and Ziv, the authors who published the first versions) schemes (the well-known zip, gzip etc). When general purpose compressors such as gzip are used, the result of compression is usually a single blob of binary data. The information in such monolithic form results quite difficult to archive, transfer and elaborate particularly when like in the case of high throughput sequencing the volume of data are extremely large. The BAM format is characterized by poor compression performance due to the focus on compression of the inefficient and redundant SAM format rather than on extracting the actual genomic information conveyed by SAM files and due to the adoption of general purpose text compression algorithms such as gzip rather than exploiting the specific nature of each data source (the genomic data itself).

(Note that GZIP, mentioned in the patent as an example of the prior art, uses Huffman coding.)

The patent also expressly distinguishes using an index separate from the bitstream of the genomic data, which would be the case in your method above where you're storing compressed data in a MySQL database:

> 1. For CRAM, data indexing is out of the scope of the specification (see section 12 of CRAM specification v 3.0) and it's implemented as a separate file. Conversely the approach of the invention described in this document employs a data indexing method that is integrated with the encoding process and indexes are embedded in the encoded bit stream.

It also explains:

> [0006] The present invention aims at compressing genomic sequences by organizing and partitioning data so that the redundant information to be coded is minimized and features such as selective access and support for incremental updates are enabled.

Without knowing more about it, it sounds like the approach you describe wouldn't allow for incremental updates.

If there are really patents protecting this format, it makes it a complete non-starter for a great deal of work (commercial and academic). Posts like this scare me. I don't want to devote effort to support a format that I might not be able to use in the future. The only thing that I could think of that might work is putting the patents in some sort of defensive portfolio in much the same way that the Open Invention Network protects Linux.

I understand the desire to develop bioinformatics file formats in a more disciplined way than we have done in the past, but this process seems like it may be more of a pain than a benefit. Unfortunately, I couldn't see some of the MPEG-G talks at ISMB this year (other talks were concurrent).

Could anyone explain what the benefits of the MPEG-G format are over something like CRAM? I mean, we were already starting to get close to the theoretical minimum in terms of file size. I personally would like to see more support for encryption and robustness (against bitrot) in formats, but this could be done in a very similar way to current formats.

I agree, the comparison with CRAM is in the whitepaper of MPEG-G. But the author of the blog has some more recent posts, where he is very skeptical of the claims made with respect to the CRAM format. It's worth the read.

Disclaimer - I am the author of the blog.

There are no "comparisons with" CRAM in the MPEG-G preprint, only a comparison between CRAM and DeeZ, taken from the DeeZ paper. Those comparisons are fair and correct, but obviously were done at the time that paper was written, some 4 years ago. Since then CRAM has moved on (as has DeeZ), but modern CRAM generally beats modern DeeZ if we restrict ourselves to the formats that permit random access (DeeZ has a higher compression non-random access mode).

So far there has been no direct data on how well MPEG-G does, bar an old slide from a year ago (an ISMB talk, I think).


From that we can glean some compression ratios at least. I attempted to compress the same data set with CRAM, but that data set, while public, is in FASTQ format instead of BAM. I asked the author of the talk how he had produced the BAM, but got no response. I made my best stab at creating something similar, but it's not a satisfactory comparison yet.

How did MPEG get involved in genomic data? I thought it was specifically chartered by ISO for audiovisual formats.

Interesting, isn't it? As a biochemist I can see only a slight correspondence between a sequence of audio or visual data and of genetic data (as a contrary notion, there is only a small statistical expectation of correlation between frame N and N+1 for DNA data, yet a lot between whole sequences in terms of evolutionary homology). And yet there it is, plopped right in the middle of the Wikipedia standards page[1]. Likely more easily explained from the business point of view than science.

[1] https://en.wikipedia.org/wiki/Moving_Picture_Experts_Group

So, they got a lot of patents for the DNA compression standard MPEG-G. But there is no way to get a license for it? That's an unusable standard for 20 years!

What are they thinking?

All MPEG standards have patents and this one is not an exception. If companies are interested they can license its use (assuming fair terms). This is far better than having proprietary formats which are locked, or formats made by a single company where the patent situation is not clearly known. Also, the companies involved invested in the development of this standard and expect some return.

What I don't like in this post, is the call for non-adoption when the author has a competing format (CRAM) for which the patent situation and the performance is not clear. It seems a biased opinion.

> What I don't like in this post, is the call for non-adoption when the author has a competing format (CRAM) for which the patent situation and the performance is not clear. It seems a biased opinion.

Actually, the author has been part of the genomics community for a long time. CRAM (and BAM) are existing de-facto standards. There is no rent-seeking organization behind those formats; there are no patents.

MPEG, the Moving Picture Experts Group, is trying to move into the genomics space to make money. They are trying to create a 'standard' called MPEG-G. The very same people who are driving the MPEG-G spec are trying to obtain patents that cover the format.

These patent applications are probably invalid. They seem to be obvious and there seems to be lots of prior art in CRAM and other applications and papers. Proving this in order to invalidate them will be time-consuming and expensive. But also necessary, because you can be sure that the patents, if granted, will be used to extort money from people in bio-informatics. They may also be used offensively against CRAM.

This is a complete waste of time.

I am the author of an implementation, although not the author of the file format itself. Although yes, that is still a fair point if you look at just the one blog post. However, there is a series of them where I clearly explain the process and my involvement, so don't just look at the last.

I agree though the message would be better if it came from a third party. I was hoping this would happen, but it didn't look likely before the GA4GH conference (where both myself and MPEG were speaking), so I self published before that to ensure people were aware and could ask appropriate questions (to both myself and MPEG of course).

As for royalties, CRAM comes under the governance of the Global Alliance for Genomics & Health (https://www.ga4gh.org). They stated explicitly in the recent conference that their standards are royalty free (as far as is possible to tell) and promote collaboration on the formats / interfaces, competition on the implementation. For the record, we are unaware of any patents covering CRAM and we have filed none of our own, nor do we intend to for CRAM 4.

I recognize that I only have read the last post and had searched in the past for royalty information about CRAM and never found it. Thanks for your answer.

In my opinion, there is clearly the need to assess both solutions with clear and meaningful data, not only in terms of performance but also in terms of patents. Conferences are a great place to do it and thus, I completely agree with you.

However, I don't see an MPEG standard as evil (or ugly) and I do think that both types of standards (CRAM and MPEG) can coexist. Every company should decide (based on factual information) what is the best solution for their needs, and if the MPEG standard brings some advantage, a company may use it despite the licensing costs. The same happens for video coding standards, where the patent-heavy HEVC is nowadays used in some scenarios (e.g. iPhone) and the royalty-free AOM AV1 is used in others (e.g. streaming video). It is up to the "market" to decide. The main problem with MPEG-G is that the licensing information is not known yet, since it hasn't reached draft international standard yet.

CRAM is a standard that was started at Sanger/EMBL/EBI and developed freely in the open by the genomics community over the past decade. As others have told you, it is royalty free and unencumbered. The work behind the first version of the standard was published in 2011 (https://genome.cshlp.org/content/21/5/734.full). Since then CRAM has undergone many revisions and improvements from a plurality of sources, and is widely adopted among users whose use cases demand genomic sequence compression.

MPEG-G at this point is a three year old attempt to patent troll the genomics community and grab money. The people involved are not experts and are not aware of the actual state of the art or motivating requirements in sequence compression, and are instead trying to dance their way around prior art, as evidenced by the contents of their patents and presentations.

There is no equivalency here to fit your worldview.

Some of the MPEG-G authors are experts in genomics data compression, while others are experts in video compression. It should, in theory, be a good mix.

MPEG are also well aware of the prior art. The authors of various existing state of the art genome compression tools were invited to one of the first MPEG-G conferences where they presented their work. Do not assume because they do not compare against the state of the art, that they are not aware of its existence or how it performs. It's more likely simply that "10x better than BAM" is a powerful message that sells products, more so than "a little bit better than CRAM". It's standard advertising techniques.

> when the author has a competing format (CRAM)

It's more that CRAM is an incumbent format, developed by (different members of) the same genomics community that made the preceding BAM format in the same space. Both BAM and CRAM have been in common use in the field for 5+ years.

As the newcomer, the onus is on the MPEG-G proponents to compare its performance to the formats already in common use.

There are several points to the blog post, but I think the main point you are missing is this: CRAM represents (as do other things, like BAM) prior art that calls the new patents into question. Still inconclusive for the moment, but the question has been raised, and your answer does not address it.

It is a mistake to take for granted that "more technological advance" is worth the price society would pay for it. That price, imposed through patents, is unacceptable in this case.

We are better off if other people encode in older, less efficient codecs that we can support in free/libre software, than if they encode the files a little smaller and we are forbidden by the MPEG patent portfolio to handle them with free software.

See https://www.gnu.org/philosophy/software-literary-patents.htm... and https://www.gnu.org/philosophy/limit-patent-effect.html.

You'll note that I do not use the term "open source". Since 1983, I have led the free software movement, which campaigns to win freedom in our computing by insisting on software that respects users' freedom. Open source was coined in 1998 to discard the ethical foundation and present the software as a mere matter of convenience.

See https://gnu.org/philosophy/open-source-misses-the-point.html for more explanation of the difference between free software and open source. See also https://thebaffler.com/salvos/the-meme-hustler for Evgeny Morozov's article on the same point.

Which one you advocate is up to you. If you stand for freedom, please show it -- by saying "free" and "libre", rather than "open".
