Ask HN: Has anyone built their own file format? - chinmays
======
ktpsns
This question is really not specific.

In scientific and high-performance computing, people regularly invent new
file formats. Many of these decisions also follow paradigms such as "We don't
like XML or its complexity, so let's pretend it's 1980 and serialise the data
as ASCII, one datum per line".
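
As a rough sketch of what that "one datum per line" paradigm looks like in
practice (the function names here are illustrative, not from any standard):

```python
# Minimal sketch: serialise a float series as plain ASCII, one value per
# line, and read it back. No header, no schema -- just like it's 1980.

def save_ascii(path, values):
    with open(path, "w") as f:
        for v in values:
            f.write(f"{v!r}\n")  # repr() preserves full float precision

def load_ascii(path):
    with open(path) as f:
        return [float(line) for line in f]

save_ascii("series.dat", [1.0, 2.5, -0.125])
print(load_ascii("series.dat"))  # → [1.0, 2.5, -0.125]
```

The appeal is that any language, spreadsheet, or Unix tool can read it; the
cost is that units, precision, and record structure live only in your head.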

Don't forget that your viewpoint comes from your community. If you are a CAD
person, you probably never use JSON. If you do data research, you probably
never use XML. If you do hardware development, you probably open any file with
a hex editor anyway just because data is usually some bitstream for you.

------
zzo38computer
I have designed some file formats for some programs. But your question is not
specific enough, I think.

------
mattbillenstein
I generally use json - interoperable with almost any language.

Use line-delimited json (jsonl) if you need to store lots of records in a
single file -- gzip that file if size is important. foo.json.gz is a common
data interchange format supported in data warehousing systems and the like.
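
A quick sketch of that jsonl + gzip pattern (file and field names here are
made up for the example):

```python
# One JSON object per line ("jsonl"), gzip-compressed: easy to stream,
# append, and load in most data warehousing tools.
import gzip
import json

records = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]

# Write: text-mode gzip, one json.dumps() per line.
with gzip.open("foo.json.gz", "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read: iterate lines, json.loads() each one.
with gzip.open("foo.json.gz", "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)  # → True
```

Because each record is its own line, you can process arbitrarily large files
without parsing the whole thing into memory, which is the main practical win
over a single giant JSON array.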

------
fuzzfactor
For scientific data acquisition I extended an extensible standard format and
collect data to the files in real time while displaying the signal. It's not
as easy as it sounds.

~~~
chinmays
Interesting. What were the challenges that you faced while doing the same?

~~~
fuzzfactor
It was part of a larger, more challenging project, where the computer
programs that I wrote truly performed better and were even more fully
featured than designed. But naturally I don't like to toot my own horn very
much.

First was understanding the approach others had taken as they adapted the
flexible open file structure to their commercial software, so I could read
supposedly "standard" files "exported" from established vendors' proprietary
formats into my system as data input for post-processing my way.

Then moving all the way upstream to buffer and parse incoming live data at
the highest priority without losing ANY.

The parsed data is pre-processed as it is prepared for storage: it is
written to an open file in raw form according to the previously determined
structure, and a facsimile is sent to an autoscaling scrolling graphing
module for real-time (+latency) display on the screen.
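
The shape of that storage-first pipeline can be sketched roughly like this
(all names and the record format are hypothetical, not from the actual
system):

```python
# Schematic of the acquisition loop described above: each incoming sample is
# committed to the open raw file first, then a copy goes to the display, so
# no data is lost even if the live graph lags behind.
import struct

def acquire(samples, raw_path, display):
    with open(raw_path, "ab") as raw:  # file stays open for the whole run
        for sample in samples:
            raw.write(struct.pack("<d", sample))  # raw little-endian double
            raw.flush()                # storage first...
            display(sample)            # ...facsimile to the graph second

shown = []
acquire([0.1, 0.2, 0.3], "run.raw", shown.append)
print(shown)  # → [0.1, 0.2, 0.3]
```

The ordering is the point: the raw record hits disk before the display copy
is made, which is how the original avoids losing samples to a slow UI.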

The commercial software that does this uses proprietary raw formats, is
expensive and confusingly vendor- and generation-specific (even if still
supported by the vendor), and did not start out PC-compatible in early
versions; plus I needed to interface to some instruments made before there
was software anyway.

That way I could collect analog data from antique reference instruments and
process it without limitation, compared to various vendors' exposure of
fewer-than-needed options. Then I could export raw files, or files processed
my way, in an early mainframe academic extensible file format still openable
by today's proprietary software, since it was the only format the competing
vendors ever agreed upon for interchange.

The old file format had almost completely standardized extensions for the lab
application in the 1990's, before one of the vendors achieved domination, at
which point development stopped in 1998; there were only a few vendors, and
each went with where they were at the time. The remaining standard document
and its initial implementation turned out to function much better with
development abandoned. It is so difficult to comprehend, utilize, and
implement that basically each vendor has always known the standard is dead,
and these are the _modules that no one must touch_, so there turns out to be
far better consistency over more decades than if revisions had been made
along the way.

This was so the rest of the world could enjoy a few of my analog things
digitally after that.

The only reason this works is the good fortune resulting from all that early
anti-lock-in standardization effort before there was one dominant player, and
from customers who had been expecting it for so long that they were willing
to accept a not-fully-completed standard.

It turns out the years of delays toward the end were because support from the
strongest vendor, who had been one of the staunchest advocates of an open
standard as they were rising, shifted as their dominance was building.
Instead of engineers, more MBA-type decision makers, who were considering the
competitive implications of a more level playing field for customers, became
involved and ground it to a halt.

But it did put me within closer reach, for the first time, of the machine
learning I was doing back in 1980, when I could put together an adversarial
system using a high-level language on a pair of application-specific
operating systems, using more than one leading vendor's proprietary
application-specific language & data structure, in a more favorable way than
you could on a PC after that type of office machine first gained popularity.
Back then it was good having the first desktop (benchtop) multi-user
multi-tasking system, as well as the largely competitive-performing
single-threaded offerings from other leading research equipment vendors. Not
always literally a _micro_ processor either; some were bigger than that.

After all, I thought that was supposed to be the advantage of high-resolution
digital data, what you could learn & teach from it through computer
processing, because archival storage was not the main objective yet since it
was too expensive to store the full raw data.

I already knew what slide rules, vacuum tubes, and chart recorders could do.

Anyway, the live (progress) display was just the icing on the cake but it had
initiated the whole project because I was finally running out of the big
thermal paper that one of my favorite 1970's-design data systems needed for
its live high resolution display which it used in place of an analog chart
recorder. I had been gifted a large amount of the surplus high-dollar paper
refills from the vendor once this generation of instruments was retired, and
fortunately it had lasted until I had some slow time and dedicated equipment
to do some experimental computer programming. Now that's good customer
service, even though I hadn't purchased a new instrument from them for long
enough that I could have been considered unsupportable.

The best antique small data systems had a provision for streaming out more
live data than they were capable of storing, for those who had access to more
powerful generic outboard storage & computing resources like mainframes or
eventually SOC's (Some Other Computer) like the early Apple.

After reviewing the revised, photocopied, re-revised documentation needed to
interface to the SOC at the time, which was not focused at all[0] on the
Apple that the salespeople were helpfully encouraging operators to add to
their bench, it became clear that actual support was limited to the binder
you were looking at.

[0] this would later turn out to be very fortunate

The customer support was excellent and they would try to help out those few
small-timers trying to interface to things like the proven hardware from
Apple, but their engineers at the time were still having difficulty making it
always work trouble-free for these early adopters, and there was no guarantee.

So it should be understood why I never wanted to do this, and waited decades
into the PC age until I was running out of paper and had no choice.

But it had been on my mind for a bit.

I don't often demo it on the laptop but one day I did go to a 21st century
digital petroleum exploration conference where they had vendors represented
like Halliburton, Schlumberger, and Paradigm. Microsoft had a vendor booth,
not offering _off-the-shelf_ exploration ware but more like custom projects
for the kind of dollars at stake, and you know they had people who could do
it.

The only demo I did was where an exploration geek showed particularly good
understanding and interest in how lab data related to the field. They were at
the booth across from Microsoft, who watched too, but the only thing obvious
from that distance was the detailed animated data-collection graph filling my
screen as it progressed during that phase. That type of display is really not
that much different from commercial spectrometer software already. I don't
think they were hearing the discussion I was having with the other vendor
(maybe with hindsight now I should rethink that). I did have some actual
unexpected VC interest from a major at the conference, but that's a completely
different story.

Well, when the next Windows was coming out, the Windows 8 preview had one of
those interestingly useful features not found in Windows 7: when you click
for more detailed info during a file transfer, you get a real-time graph
instead of just the old progress bar. It plots the ups-and-downs of transfer
rate against progress with a rate indicator floating along, where labs plot
things like the ups-and-downs of an analog signal across progress.

The odds against all of this happening were phenomenal.

Made me feel kind of like Forrest Gump in a way :-)

The rest of the story may or may not be more interesting someday itself . . .

