> There are clear non-royalty based incentives for large companies to develop new compression algorithms and drive the industry forward. Both Google and Facebook have active data compression teams, led by some of the world's top experts in the field.
Google and Facebook can afford to spend money on R&D because they throw off gobs of money from near-monopolies in important economic sectors. This is one of the archetypal models for R&D, and has a lot of precedent: AT&T Bell Labs (bankrolled by AT&T's telephone monopoly) and Xerox PARC (bankrolled by the copier monopoly built on Xerox's patents). Many of the really fundamental technologies underlying computing were developed this way.
But MPEG is thirty years old now, and the MPEG-1 standard is 25 years old. Until recently, the MPEG standard has been pushed forward not by a single giant corporation that can afford to bankroll everything, but by a consortium of companies using patents and licensing to recover their investment in R&D. This is one of the other archetypal models for R&D. Many of the other fundamental technologies underlying computing were developed this way.
(The third archetypal model is the government-funded project, e.g. TCP/IP, which is also an example of a monopoly bankrolling R&D.)
The "benevolent monopoly" model obviously has advantages for open source--because the company bankrolls R&D by monetizing something else, it can afford to release the results of the research for everyone to use. But it's not sustainable without the sponsor (and we know this, because open source has been around for a long time, and there is little precedent for a high-performance video codec designed by an independent group of open source developers).
I see people demonizing MPEG and espousing reliance on Google and FB as the way forward, but it's not clear to me that everyone fully understands the implications of that approach.
 Query whether Theora counts--it was based on an originally proprietary, patented codec.
There's also the idea that our personal medical data (which is only a portion of all genomic data currently but it's increasing rapidly) should be entirely open to reading, no licenses required.
"Collaboration" is a poor model for hard R&D. Some problem domains, like video coding, require PhDs with extensive domain knowledge. These folks don't work for free. Community projects don't have the resources to pay for systematic, concerted research efforts in these areas.
Which is why entities like MPEG-LA exist in the first place. If collaborative projects had developed these technologies first, the companies behind MPEG-LA never would've gotten their patents.
> Instead, some like MPEG-LA are doing concerted effort to extort money and prevent competing technologies from emerging.
Note that the key competitors to the patented MPEG technology are coming from "benevolent monopolies" like Google. Those alternatives are unsustainable without those sponsors. So in reality, "open collaboration" is not one of the options. It's a choice between R&D bankrolled by patents, or R&D bankrolled by Google's search profits or Facebook's social media profits.
That may be true in the abstract or in the case of MPEG's multimedia technologies. It is not true in this specific case of genomic file formats.
The existing formats in use in the field (BAM and CRAM) that MPEG-G is looking to supplant come from the genomics community itself. They have come about in the course of bioinformaticians' work analysing sequencing data, or as bioinformatics research. Thus this R&D has primarily been bankrolled as scientific research; the funding comes from scientific charities or other scientific research funding.
That is to say, it has come from collaboration.
 CRAM originates from "Efficient storage of high throughput DNA sequencing data using reference-based compression", published in Genome Research in 2011. <https://genome.cshlp.org/content/21/5/734.long>
These are my made up definitions of course, so we can argue about what counts as "collaboration." I'd say something like W3C or Linux--companies and individuals working together who aren't bankrolled by either patents, the government, or a quasi-monopoly.
Secondly, you missed out a key part of funding - precompetitive alliances. E.g. see the Pistoia Alliance (https://www.pistoiaalliance.org/), who funded the SequenceSqueeze project into compression of FASTQ (http://www.sequencesqueeze.org/).
The notion here is simple - there are some technologies that are so core they cross all the commercial and academic boundaries. Collaboration rather than competition is considered to be to the mutual benefit of everyone involved. It is this mindset which led to the Alliance for Open Media (AOM), who are also in direct competition with MPEG.
The difference between government funding and a 'benevolent monopoly' is that in the first case an entity that does not directly benefit financially is providing the funding, while in the second case the entity providing the funding has a financial incentive to do so.
And that's ignoring the fact that they listed the government as one source of funding, the link provided being an example of what they are talking about, not the sole instance. At least according to their comment, some funding for such things also comes from 'scientific charities'.
I'm involved in BAM and CRAM -- it feels like a collaboration to me!
These formats are a collaboration between bioinformaticians at a number of research institutes and companies around the world, with various charity, governmental, or other funding. They are maintained under the auspices of GA4GH, which is indeed something like W3C.
Collaboration doesn't mean they work for free. It means many pool resources to solve the same problem, instead of everyone trying to reinvent the wheel and then charging everyone else for it. That's what AOM are doing, which disproves your claims in practice.
> Which is why entities like MPEG-LA exist in the first place.
Nope, they exist because of the messed up legal patent situation, which allows such parasitic entities to engage in a patent protection racket, which is the opposite of progress. Patent law needs serious reform to prevent this kind of abuse.
If this is their model, I expect them to be straightforward about it, and I expect that most everyone will not touch their standards with a 10 ft pole. Standards are not in short supply. I expect a "this food contains known poison" sort of explicit label on their wares.
Instead they make it look like a community effort and stay silent about the strings attached.
We don't want their R&D under such terms.
Did they? That would be weird because MPEG standards are almost always subject to patents. MPEG is not a forum for organizing “community efforts.” There is an ISO/ITU declaration process for patents, but I’m not sure whether the standard is at a stage where GenomSys was to have made the declaration already.
The GenomSys patents aren't even listed in the ISO patent list yet: https://www.iso.org/iso-standards-and-patents.html
I don't know if this is against ISO rules - it is unclear to me whether they only need to add their patents on grant, rather than submission.
The only reason I discovered these was due to an accidental hit from a Google Scholar alert. I didn't even realise it searched patents when I set that up.
Based on the patent titles (I'll see if I can read some of them in detail tomorrow), most of these sound almost exactly like the code I wrote while working at the JGI in the early 2000s that managed moving large amounts of reads from the ABI (Sanger) sequencers, running it through phred/phrap, and storing it all so the biologists could access it easily. This included a custom Huffman tree based encoder/decoder to efficiently store FASTA files at (iirc) ~2.5 bits/base (quality scores were just stored as a packed array of bytes), a very large MySQL backend, and a large set of Perl libraries that provided easy access to reads/libraries/assemblies/etc. It was certainly a "method and apparatus" for "storing and accessing" + "indexing" bioinformatics data using a "compact representation" that provided many different types of "selective access".
I even had code that did an LD_PRELOAD hack on (circa 2002) Consed that intercepted calls to open(2) to load reads automagically from the DB. Reading Huffman encoded data in bulk from the DB (instead of one file per read) reduced the network bandwidth required to open an assembly with all its aligned reads by ~90%. That sounds a lot like "transmission of bioinformatics data" over a network and "access ... structured in access units". It definitely involved "reconstruction of genomic reference sequences from compressed genomic sequence reads".
They may have a more efficient compression method, and we didn't do anything re: "multiple genomic descriptors" (was that even a thing <2004?), but... no... they didn't invent what is basically a bioinformatics-specific variation of the same methods used everywhere in the computer industry for as long as "text file formats" have existed.
These are my personal comments and opinions only; they are not endorsed by the Joint Genome Institute, Lawrence Berkeley National Laboratory, or the U.S. Department of Energy, with which I am no longer affiliated.
 While I have no idea if any of that code even exists today (I left the JGI in 2004), I did mark the source files with the BSD license, since there was historical precedent.
Given that there are four bases, I would have thought you could reliably do it in 2. What am I missing?
"N" for uNknown, where the raw sensor information - which should be a series of nice Gaussians[1 top] - was too ambiguous/low-quality for a useful basecall, but the presence of something can be inferred from the timing information[1 bottom, with "N"s in some bases]
An additional unused bit pattern was reserved for future emergency additions. It used the longest bit pattern, so future additions wouldn't be balanced properly, but I wanted something simple, with guaranteed binary compatibility, that could handle a sudden change in requirements.
In some situations, a variation was used that included a few symbols for partial basecalls like "unknown purine" (A or G) and "unknown pyrimidine" (C or T).
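A code table matching that description (four bases, "N", ambiguity symbols, and a reserved longest pattern) might look like the sketch below. The bit assignments here are my own illustration, not the actual JGI encoder:

```python
# Hypothetical prefix code for basecalls, mimicking the scheme described
# above (not the original code). Common bases get short code words; "N"
# and the ambiguity symbols get longer ones, and the single longest
# pattern is left unused, reserved for future additions.
CODES = {
    "A": "00",
    "C": "01",
    "G": "10",
    "T": "110",
    "N": "1110",     # uNknown basecall
    "R": "11110",    # unknown purine (A or G)
    "Y": "111110",   # unknown pyrimidine (C or T)
    # "111111" reserved for emergency additions
}

def encode(seq):
    """Concatenate the code words for a basecall string."""
    return "".join(CODES[base] for base in seq)

def decode(bits):
    """Walk the bitstring, emitting a symbol at each code-word match.

    Because the table is prefix-free, at most one code word can match
    at any position, so a greedy scan is unambiguous.
    """
    inverse = {v: k for k, v in CODES.items()}
    out, word = [], ""
    for bit in bits:
        word += bit
        if word in inverse:
            out.append(inverse[word])
            word = ""
    return "".join(out)
```

On random ACGT data this averages 2.25 bits/base, creeping toward the ~2.5 bits/base figure mentioned above once rarer symbols appear.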
If N is rare though, you're better off just doing blocks of 2-bit encoding and dropping to something more complex for the rare cases. Or of course just using an arithmetic / range / ANS coder.
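That blocked approach can be sketched in a few lines; the block size and representation here are my own choices for illustration. Blocks containing only ACGT are packed at 2 bits/base, and any block containing "N" (or another symbol) is escaped to plain text:

```python
# Illustrative blocked encoder: 2-bit packing for pure-ACGT blocks,
# falling back to an escaped raw block for the rare chunks containing
# "N" or other symbols.
BLOCK = 16
B2 = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_blocks(seq):
    out = []
    for i in range(0, len(seq), BLOCK):
        chunk = seq[i:i + BLOCK]
        if all(c in B2 for c in chunk):
            packed = 0
            for c in chunk:
                packed = (packed << 2) | B2[c]  # 2 bits per base
            out.append((False, len(chunk), packed))
        else:
            out.append((True, len(chunk), chunk))  # escaped raw block
    return out

def decode_blocks(blocks):
    inv = {v: k for k, v in B2.items()}
    out = []
    for escaped, n, payload in blocks:
        if escaped:
            out.append(payload)
        else:
            # First base occupies the highest bits of the packed word.
            out.append("".join(inv[(payload >> (2 * (n - 1 - j))) & 3]
                               for j in range(n)))
    return "".join(out)
```

If "N" is rare, almost every block takes the 2-bit path, so the overhead of the escape mechanism is amortised away.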
>  The most used genome information representations of sequencing data are based on zipping FASTQ and SAM formats. The objective is to compress the traditionally used file formats (respectively FASTQ and SAM for non-aligned and aligned data). Such files are constituted by plain text characters and are compressed, as mentioned above, by using general purpose approaches such as LZ (from Lempel and Ziv, the authors who published the first versions) schemes (the well-known zip, gzip etc). When general purpose compressors such as gzip are used, the result of compression is usually a single blob of binary data. The information in such monolithic form results quite difficult to archive, transfer and elaborate particularly when like in the case of high throughput sequencing the volume of data are extremely large. The BAM format is characterized by poor compression performance due to the focus on compression of the inefficient and redundant SAM format rather than on extracting the actual genomic information conveyed by SAM files and due to the adoption of general purpose text compression algorithms such as gzip rather than exploiting the specific nature of each data source (the genomic data itself).
(Note that GZIP, mentioned in the patent as an example of the prior art, uses Huffman coding.)
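The "monolithic blob" point in the quoted passage is easy to demonstrate: a general-purpose compressor like zlib/gzip emits one stream that must be inflated from the start, so there is no cheap random access to record N. A toy example (the record contents are invented):

```python
import zlib

# Ten fake FASTQ-like records, compressed as a single stream.
records = [f"@read{i}\nACGTACGTACGT\n+\nIIIIIIIIIIII\n" for i in range(10)]
blob = zlib.compress("".join(records).encode())

# The only way to reach record 7 is to decompress everything before it;
# the compressed blob carries no record boundaries or index.
text = zlib.decompress(blob).decode()
```

Formats like BAM and CRAM work around this by compressing in independent blocks, which is what makes an external index (or an embedded one) possible.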
The patent also expressly distinguishes using an index separate from the bitstream of the genomic data, which would be the case in your method above where you're storing compressed data in a MySQL database:
> 1. For CRAM, data indexing is out of the scope of the specification (see section 12 of CRAM specification v 3.0) and it's implemented as a separate file. Conversely the approach of the invention described in this document employs a data indexing method that is integrated with the encoding process and indexes are embedded in the encoded bit stream.
It also explains:
>  The present invention aims at compressing genomic sequences by organizing and partitioning data so that the redundant information to be coded is minimized and features such as selective access and support for incremental updates are enabled.
Without knowing more about it, it sounds like the approach you describe wouldn't allow for incremental updates.
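The distinction quoted above, an index embedded in the bitstream versus a separate index file, can be sketched as follows. The container layout here is invented for illustration and is not MPEG-G's (or CRAM's) actual wire format:

```python
import struct
import zlib

def write_container(blocks):
    """Concatenate independently compressed blocks, then embed an index
    of (offset, length) pairs in the same byte stream, followed by a
    fixed-size trailer pointing at the index."""
    payload, index = b"", []
    for raw in blocks:
        comp = zlib.compress(raw)
        index.append((len(payload), len(comp)))
        payload += comp
    idx = b"".join(struct.pack("<II", off, ln) for off, ln in index)
    # Trailer: index offset + entry count, so a reader can seek to it.
    return payload + idx + struct.pack("<II", len(payload), len(index))

def read_block(container, i):
    """Selective access: decompress only block i via the embedded index."""
    idx_off, count = struct.unpack("<II", container[-8:])
    entry = container[idx_off + 8 * i: idx_off + 8 * i + 8]
    off, ln = struct.unpack("<II", entry)
    return zlib.decompress(container[off: off + ln])
```

A separate-index design (as with CRAM's `.crai`) stores the `(offset, length)` table in its own file instead; functionally the two are very close, which is part of why the novelty claim is contested.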
I understand the desire to develop bioinformatics file formats in a more disciplined way than we have done in the past, but this process seems like it may be more of a pain than a benefit. Unfortunately, I couldn't see some of the MPEG-G talks at ISMB this year (other talks were concurrent).
Could anyone explain what the benefits of the MPEG-G format are over something like CRAM? I mean, we were already starting to get close to the theoretical minimum in terms of file size. I personally would like to see more support for encryption and robustness (against bitrot) in formats, but this could be done in a very similar way to current formats.
There are no "comparisons with" CRAM in the MPEG-G preprint, only comparison between CRAM and DeeZ, taken from the DeeZ paper. Those comparisons are fair and correct, but obviously were done at the time that paper was written - some 4 years ago. Since then CRAM has moved on (as has deez), but modern CRAM generally beats modern DeeZ if we restrict ourselves to the formats that permit random access (DeeZ has a higher compression non-random access mode).
So far there have been no direct data on how well MPEG-G does, bar an old slide from a year ago (an ISMB talk, I think).
From that we can glean some compression ratios at least. I attempted to compress the same data set with CRAM, but that data set, while public, is in FASTQ format instead of BAM. I asked the author of the talk how he had produced the BAM, but got no response. I tried my best stab at creating something similar, but it's not a satisfactory comparison yet.
What are they thinking?
What I don't like in this post is the call for non-adoption when the author has a competing format (CRAM) for which the patent situation and the performance are not clear. It seems a biased opinion.
Actually, the author has been part of the genomics community for a long time. CRAM (and BAM) are existing de-facto standards. There is no rent-seeking organization behind those formats; there are no patents.
MPEG, the Moving Picture Experts Group, is trying to move into the genomics space to make money. They are trying to create a 'standard' called MPEG-G. The very same people who are driving the MPEG-G spec are trying to obtain patents that cover the format.
These patent applications are probably invalid. They seem to be obvious and there seems to be lots of prior art in CRAM and other applications and papers. Proving this in order to invalidate them will be time-consuming and expensive. But also necessary, because you can be sure that the patents, if granted, will be used to extort money from people in bio-informatics. They may also be used offensively against CRAM.
This is a complete waste of time.
I agree though the message would be better if it came from a third party. I was hoping this would happen, but it didn't look likely before the GA4GH conference (where both myself and MPEG were speaking), so I self published before that to ensure people were aware and could ask appropriate questions (to both myself and MPEG of course).
As for royalties, CRAM comes under the governance of the Global Alliance for Genomics & Health (https://www.ga4gh.org). They stated explicitly in the recent conference that their standards are royalty free (as far as is possible to tell) and promote collaboration on the formats / interfaces, competition on the implementation. For the record, we are unaware of any patents covering CRAM and we have filed none of our own, nor do we intend to for CRAM 4.
In my opinion, there is clearly the need to assess both solutions with clear and meaningful data, not only in terms of performance but also in terms of patents. Conferences are a great place to do it and thus, I completely agree with you.
However, I don't see an MPEG standard as evil (or ugly) and I do think that both types of standards (CRAM and MPEG) can coexist. Every company should decide (based on factual information) what is the best solution for their needs, and if the MPEG standard brings some advantage, a company may use it despite the licensing costs. The same happens for video coding standards, where the patent-heavy HEVC is nowadays used in some scenarios (e.g. iPhone) and the royalty-free AOM AV1 is used in others (e.g. streaming video). It is up to the market to decide. The main problem with MPEG-G is that the licensing information is not known yet, since it hasn't reached draft international standard status yet.
MPEG-G at this point is a three year old attempt to patent troll the genomics community and grab money. The people involved are not experts and are not aware of the actual state of the art or motivating requirements in sequence compression, and are instead trying to dance their way around prior art, as evidenced by the contents of their patents and presentations.
There is no equivalency here to fit your worldview.
MPEG are also well aware of the prior art. The authors of various existing state of the art genome compression tools were invited to one of the first MPEG-G conferences, where they presented their work. Do not assume that, because they do not compare against the state of the art, they are not aware of its existence or how it performs. It's more likely simply that "10x better than BAM" is a powerful message that sells products, more so than "a little bit better than CRAM". It's a standard advertising technique.
It's more that CRAM is an incumbent format, developed by (different members of) the same genomics community that made the preceding BAM format in the same space. Both BAM and CRAM have been in common use in the field for 5+ years.
As the newcomer, the onus is on the MPEG-G proponents to compare its performance to the formats already in common use.
We are better off if other people encode in older, less efficient codecs that we can support in free/libre software, than if they encode the files a little smaller and we are forbidden by the MPEG patent portfolio to handle them with free software.
You'll note that I do not use the term "open source". Since 1983, I have led the free software movement, which campaigns to win freedom in our computing by insisting on software that respects users' freedom. Open source was coined in 1998 to discard that ethical foundation and present the software as a mere matter of convenience; that is the difference between free software and open source. See https://thebaffler.com/salvos/the-meme-hustler for Evgeny Morozov's article on the same point.
Which one you advocate is up to you. If you stand for freedom, please show it -- by saying "free" and "libre", rather than "open".