Why isn’t there a decent file format for tabular data? (successfulsoftware.net)
214 points by LeonB on May 3, 2022 | 345 comments



Seems like the problem here is that there are several high-quality and well-developed formats, but the author and the commenters here dismiss them because of the different trade-offs they make.

csv -- Simple for simple use cases, text-based; however, many edge cases, lacking features, etc.

xlsx -- Works in Excel, a ubiquitous format with a standard; however, complicated and missing scientific features

sqlite -- Designed for relational data, somewhat ubiquitous, types defined but not enforced

parquet / hdf5 / apache feather / etc -- Designed for scientific use cases, robust, efficient, less ubiquitous

capn proto, protocol buffers, avro, thrift -- Have specific features for data communication between systems

xml -- Useful if you are programming in early 2000s

GDBM, Kyoto Cabinet, etc -- Useful if you are programming in late 1990s

Pick your poison I guess. Engineering is about trade offs.


I've seen ubiquitous use of tab-separated value files instead of csv, as a simpler format without quoting support and with the restriction that your data fields can't contain tabs or newlines, which (unlike commas) is okay for many scenarios.


But that’s sort of the problem with csv.

You never really know which rules your csv file follows. Many .csv files are indeed tab separated.


Gotta wonder why the format isn't just a column separator char, a row separator char, and then all the data guaranteed not to have those two chars. Then you could save the thing by finding any two chars that aren't used in the data.

I guess this is why we have a zillion formats.


ASCII Code 29: Group Separator

ASCII Code 30: Record Separator

ASCII Code 31: Unit Separator

https://theasciicode.com.ar/ascii-control-characters/record-...
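
For the curious, a minimal Python sketch of using those separators (no quoting rules needed, assuming the values themselves never contain 0x1E or 0x1F):

    # Join fields with the unit separator and records with the record separator.
    US, RS = "\x1f", "\x1e"
    rows = [["id", "name"], ["1", "Smith, John"], ["2", "Bill"]]
    blob = RS.join(US.join(fields) for fields in rows)

    # Parsing is a plain two-level split; no quoting rules to worry about.
    parsed = [record.split(US) for record in blob.split(RS)]
    assert parsed == rows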


this doesn't directly address the ease-of-editing concern, but if you have csv (or tsv) and want ascii (0x1f, 0x1e) or unicode (U+241F, U+241E) separated (or vice versa), 'miller' [0] is a command line tool that supports converting among these formats with these one-liners:

    mlr --icsv --oasv cat input.csv  # csv to asv
    mlr --iasv --ocsv cat input.asv  # asv to csv
[0] https://miller.readthedocs.io/en/latest/

[1] https://miller.readthedocs.io/en/latest/file-formats/#csvtsv...


You're right, Miller is great for this, and it turns out that Miller already offers Unicode Separated Values (USV).


Thank you for pointing them out. I've browsed and searched through that ascii table countless times, usually looking for something in particular, yet never realized that there was such a thing as separators before.


It makes more sense to me, in these days of ubiquitous encoding of in-band data, that it would be easier to edit if we just used tabs and newlines. If you need a tab in your data, the standard could support either \t (requiring \\t if you need a literal backslash followed by a literal 't') or %09 or &#09; or \x09 or something.


If you require escaping then you may as well stick with csv - better to have the character that needs to be escaped be common enough that people will run into it during testing.


My actual preference is LTSV. Then all that needs escaping or encoding is a quote since a tab or newline inside the quotes is different from one outside the quotes. And I prefer encoding to escaping generally, since it's easier for someone else to come along and build a fairly naive parser.


> Gotta wonder why the format isn't just a column separator char, a row separator char

That's what the article proposes.


Then you can't edit or view it in a normal text editor, which is part of the appeal of CSV.


Maybe normal text editors should support the separators.


That and the fact that the comma (or semicolon, as I prefer it due to its rarer usage) is on every keyboard.

A text editor can be made to handle those separators, but editing or processing the file can be more cumbersome than with csv and tsv.


I guess if you use a strange character nobody has on their keyboard, you could also just make it a binary format for efficiency.


See my sibling comment.


It’s a very meta-level problem. The main problem with CSV is “this is CSV” is an ambiguous statement. One could commit to a rule set, and give it a new name. Probably, someone has done this already.


This is why you say RFC 4180 CSV


Sounds like xkcd 927


works on telnet natively if the data is always shorter than 8 characters, i think.

tab on a terminal actually means "indent to the next alignment point" (which is usually a multiple of 8).


The DBT (DBase III) format was common in the 80s and 90s. It is a typed, fixed-width format that was directly supported by Excel, Visual Basic grid widgets, and many other tools. For example, Norton Commander supported it directly, letting you preview database tables without loading another program.


Do you have the file format documentation?



>The DBT (DBase III) format was common in the 80s and 90s.

That should actually be DBF, the format for the main database tables. DBT was an ancillary format for the memo fields, which were used to store longer pieces of text in one column of a DBF record. Overall, that generic format and associated software is called XBASE. People still use it in production. And data entry (CRUD) using it with plain text or TUI DOS-style apps blows Web and even GUI apps out of the water in speed of use.


The latest version of SQLite has a STRICT table option to enforce the data types. This option is set per table, but even in a STRICT table you can specify the type of some columns as ANY if you want to allow any type of data in that column (this is not the same meaning ANY has in non-strict tables).
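
For illustration, a minimal sketch of a STRICT table with an ANY column (assuming the Python interpreter is linked against SQLite 3.37 or newer, where STRICT was introduced):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id INTEGER, payload ANY) STRICT")
    con.execute("INSERT INTO t VALUES (1, 'anything goes in an ANY column')")
    try:
        # A value that can't be losslessly converted to INTEGER is rejected.
        con.execute("INSERT INTO t VALUES ('not an int', 'x')")
    except sqlite3.Error as e:
        print(e)   # reports a datatype mismatch for t.id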


You are limited to the basic types (integer, floating point, string, and blob), however. I can somewhat get behind the opinionated argument against needing more specific types like those in most common languages, but not the lack of a date type.


> but not the lack of a date type.

i've also found this to be truly bizarre. even more bizarre than not actually respecting (via coercion or error) the specified type... why even have types, then?


What's so special about having a named type for datetime? The user will still need to call functions to manipulate the dates. Is it only for the default display and import?


I've seen a sqlite database with datetimes in three different formats in the same field, because different parts of the application I inherited had different ideas of how to write a datetime and sqlite accepts everything. It's only a string after all.

That's a mistake that the same bad developer couldn't have made with PostgreSQL or MySQL.


I understand that SQLite can store datetimes in the REAL type with no problem; it's just that on import one needs to convert the date strings properly. Of course, stuffing strings in verbatim may work too, but that's about the robustness of the load process or the ETL pipeline.


For dates, having a specific date type instead of a text field is required to have proper behavior when sorting and efficient storage/data transfer.

It's also important to have date-related functions on the DB server side, so that you can use them in filtering data before it gets sent over to the user code running on the client, to avoid unnecessary data transfer and allow proper use of indexes in optimizing it.

Also, it is nice if a DB engine can perform the equivalent of `WHERE year(date)=2021` without actually running that function on every date, but rather automatically optimize it to an index lookup of `WHERE date between '2021-01-01' and '2021-12-31'`.


> ...`WHERE year(date)=2021` without actually running that function on every date, but rather automatically optimize it to an index lookup of `WHERE date between '2021-01-01' and '2021-12-31'`

Sure this would be handy. Are there engines that implement such optimization?

I can also see how the date-range WHERE clause could be used directly in SQLite to leverage the index. Of course, using year() is more expressive. It may also make sense in such a case to simply add a year column and have it indexed.
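
To make the trade-off concrete, a small sketch in SQLite (table and column names are made up; dates stored as ISO-8601 text so the range comparison can use the index):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (id INTEGER, created TEXT)")  # ISO-8601 strings
    con.execute("CREATE INDEX idx_events_created ON events (created)")

    # Wrapping the column in a function forces a scan of every row...
    q_scan  = "SELECT count(*) FROM events WHERE strftime('%Y', created) = '2021'"
    # ...while an explicit range over the raw column can use idx_events_created.
    q_range = ("SELECT count(*) FROM events "
               "WHERE created >= '2021-01-01' AND created < '2022-01-01'")
    print(con.execute(q_scan).fetchone(), con.execute(q_range).fetchone())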


Having a standardised format for data from different sources. If there's no standard, people will use their OS-specific format. That makes it harder to compare datasets from different sources.


In the context of this thread–a decent file format for tabular data–SQLite's type system peculiarities are irrelevant.


Got to pick a date format first; what date it is depends on what calendar you use and where you are in the world. That's before you add time.

So you could base it on UT1/TAI, but then that ignores clocks in space.


Most csv utilities support an alternative delimiter. If I need to edit a file by hand, I'll typically pick an uncommon character for the delimiter (pipe "|" works well since it's uncommon). For me, that pretty much entirely eliminates any of the pain with CSV.


More tools should use the ASCII unit and field separator characters intended for this purpose: https://ronaldduncan.wordpress.com/2009/10/31/text-file-form...


See also Control Character Separated Values: https://www.ccsv.io/


wow, the solution hides in plain sight right before our eyes. Once again.


This is gold


tab-separated-value has never betrayed me! I think it's the default postgresql file export too.


As long as the fields are quoted and escaped : )


I've often wondered what would happen if there was a standard text editor plugin for dealing with parquet and co.

It seems like these formats are disliked, as they are difficult to inspect - but there really isn't any reason UTF-8 bytes arranged in a large sequence (aka CSV) should be any easier to read except for editor support. Sure writes would be slower, but I'd expect most users wouldn't care on modern hardware.


Yeah I don't really understand the downside to a format like parquet here. "Less ubiquitous" seems to be the only one in the parent comment's list?


I suppose that another possible format is the PostScript format, which has some of its own advantages and disadvantages. It has both text and binary forms, and the binary object sequence format can easily be read by C code (I wrote a set of macros to do so) without needing to parse the entire structure into memory first; just read the element that you need, alone, one at a time, in any order. (I have used this format in some of my own software.)

JSON also has its own advantages and disadvantages.

Of course, both of these formats are more structured than CSV, but XML is also a more structured format.

(And, like I and others have mentioned too, the format that they described isn't actually unique; it just doesn't seem to be that common, but there are enough people who have done the same thing independently, and a few programs which support it, that you could use it if you want to, and it should work OK.)


One thing of note is that there isn't a single format that's optimal for storing tabular data - what's optimal depends on your use case. And if performance doesn't matter much, just use CSV.

As a simple example, column-based formats can significantly speed up queries that don't access the full set of columns. They can come in handy for analysis-type SQL queries against a big lump of exported data - where different users query the subset they are interested in.


As best I can tell no one has mentioned Recutils[0]?

It is a little bizarre that csv has never really been nailed down, but yea it's all about trade-offs.

[0]https://www.gnu.org/software/recutils/


Replying to myself, because after reviewing other comments I realized I didn't read the linked post thoroughly enough. Near the end he proposes a format called "usv", basically using the control codes (unit separator \u001f and record separator \u001e) built into ASCII and now Unicode for their intended purpose, which is actually a really good idea! Apparently this had been noted earlier as a format called "adt"[0].

[0]https://ronaldduncan.wordpress.com/2009/10/31/text-file-form...


There are tons of billion-dollar companies that have entire systems utilizing csv and xlsx tabular data for mission-critical processes lol


Surprised that nobody has mentioned ARFF: https://www.cs.waikato.ac.nz/ml/weka/arff.html


>xml -- Useful if you are programming in early 2000s

I guess nobody knows about document formats anymore.


They are re-learning those lessons slowly. I.e. OpenAPI and JSON Schema are pretty much poor re-implementations of SOAP and XSD, but for JSON. I don't want to be that get-off-my-lawn guy but it's laughable how equivalent they are for 99% of daily use cases.


Every time I hear someone talking about validating JSON I just think about how XSD, despite its flaws, is actually pretty decent even at 20 years old.


The first time I used XSD was in 2001 I think, for a format we were developing to do human rights violation reporting.

In one part of the document would be a list of people committing violations, another part a list of people witnessing violations, and then in another part violations.

The violations part would have attributes saying what people had taken part in the violation, and who had witnessed. These attributes were a comma separated list of ids.

This structure is of course easy to represent with XPath, probably via Schematron. But there is no real way to represent this kind of context dependent structure in the first version of XSD (I have not kept up for reasons that shall become clear).

Which led me to declare that XSD sucks. Although I have to admit that Henry Thompson is a great programmer, and one of the most worthwhile speakers on technical issues I have ever had the pleasure to hear, and while his model of XSD validation as a finite state machine is elegant, it still does not make XSD suck any less, because the standard could not validate many common markup structures.


Yeah I also hit the XSD complexity ceiling on occasion. Most of what we used it for back in the early-mid 00s was quite simple though and it did make much of our work simpler.


That XSD is complex is one thing; some things are by their nature complex, OK. That it was not able to validate common XML structures was another thing, and a third was that cross-validation-tool support sucked.

Actually, this combination of difficulties made it sort of enjoyable for me to solve problems stemming from its usage, and to look smart for doing so, while still feeling like I was using a deficient tool.


XML and its associated formats were just so complex. I remember considering getting a book on XML and it was 4 inches thick. Just for a text-based data storage format...

This is just prohibitively complex. Formats like JSON and YAML thrive because they don't have the complexity of trying to fit every possible scenario ever. The KISS principle still works.


Several things though -

1. tech books tend to be too big

2. These XML books tended to have a section on XML and well-formedness, namespaces, UTF-8, examples of designing a format - generally a book or address format - all this stuff probably came to approximately 80-115 pages. Which was what you needed to understand the basics of XML.

3. Then would come the secondary stuff to understand, XPath and XSLT. I would say this would be another 100-150 pages, so a query language and a programming DSL for manipulating the data/document format. All this together: 265 pages.

4. Then validation and entities in DTDs noting that this was old stuff from SGML days and you didn't need it and there was going to be some other way to validate really soon. Another 60 pages? (and then when that validation language came it sucked, as I noted elsewhere)

5. Then, because tech books need to be thick and a 300-page book is not big enough, a bunch of stuff that never amounted to anything, like XLink or some breathless stuff about some XML formats, maybe a talk about SVG and VML, XSL-FO, blah blah blah. Another 300 pages of unnecessary stuff.


The author doesn't like any of those tradeoffs and wishes to make another one; what's the problem with that?

You don't believe the design space is exhaustively explored by the designs and protocols you mentioned, do you? There is always another local optimum to be found.


The problem is that the author states that none of these existing formats are "decent" and falls back on shallow dismissals like "Don’t even get me started on Excel’s proprietary, ghastly binary format."


None of them seem all that conducive to source control or merging. Any good format for that?


Why would we expect data to be amenable to version control systems designed for source code?


I don't _expect_ it but I want it. Surely you can understand the power of storing a config file as tabular data.


There is a git diff driver for CSV that makes the CSV diffing and merging a lot better. https://paulfitz.github.io/2014/07/09/diff-merge-csv.html

Works like a charm!


Looks good, although I don't think I would want the diffs to be in-cell. I might just end up writing my own merge tool that passes conflicting cells to a fallback tool.


CSV should be pretty easy to merge, given that it’s line based.


Although it's the easiest, same-row cell changes still conflict unnecessarily.


Interestingly, tables in XML would be easy to merge, as long as they're formatted with one cell per line. E.g.

    <table>
      <row>
        <cell>1</cell>
        <cell>45</cell>
      </row>
    </table>


We use Subversion to manage CSV files pretty much constantly.


Don't put data in source control; use a database.


Laughs in Lisp

Code is Data would like to have a word


Code may be data but data isn't code.


Data is code that prints/evaluates to itself.


In some programming languages (such as PostScript, where evaluating as itself vs. being executed is a flag separate from the value's type, and can be changed by the cvlit and cvx operators), it is.


The higher-ups might be unhappy if you use that excuse after committing a gigabyte of customer data.


Executable data was never a good idea.


What if you are doing both, as in trying to source control your database? A lot of tools (Liquibase for one) can use CSV files as the basis of read-mostly lookup tables, for example. E.g. if I had a list of "product categories" that I wanted to be part of my database when I bootstrap a new environment to run my software, Liquibase can turn that CSV file into a set of upserts that will run if the file hash has changed. So you can maintain that file with just a regular edit file > git commit, and it will ensure that table is always consistent with what you want and gives you a much easier-to-understand commit history for your database, especially as things inevitably get complicated.


Databases are measured in gigabytes and terabytes. If you put only a portion of it in source control, how do you back up the rest of your database? Furthermore, static data is a minor subset of a database. Data by its nature is volatile. Transactions make up 80% of the data. A projection or snapshot/summary is the summarization of the daily/hourly/minute transactions. If you want a golden copy to bootstrap new environments, I would argue you are better off backing up that golden copy and restoring it using native database tools. Mashing together two very different paradigms will lead to long-term disaster. There was a thread about a year ago, when a DevOps engineer was surprised when his "nonproduction" configuration code deleted his production database. If your database is a terabyte in size, by the time you restore the database -- you are already fired.


I am way late responding...

> If you put only a portion of it in source control, how do you back up the rest of your database?

I think this is simply a misunderstanding of how these DB change management tools work and what they can do to help you with complex migrations over an application's lifetime.

You still back your DB up like normal.

Your "mostly static" data is managed by running inserts/updates/deletes when data changes (either manually, or the tool can usually generate them for you). When you actually apply it, the software also records that the migration has been applied in a database change log table. That way when you want to update your database with the latest migrations, it'll only run what hasn't been applied yet.

That allows your standard backup/restore procedures to work just fine.

> If you want a golden copy to bootstrap new environments, I would argue you are better off backing up that golden copy and restoring it using native database tools

So this is essentially what we are doing with liquibase... Using database dumps without any data as our baseline.

Any DDL changes are managed through migration scripts.

There are a number of things that are not managed by migration script, and are instead apply-on-change in our project though.

We found it better to have our views stored as individual files in liquibase, and have them apply on change, because of nested dependencies and other issues with not having a good source of truth for the view definition.

Functions/procedures were another that are best treated as code rather than migrations. It allows you to deal with conflicts between branches with your standard merge tools.

Our "static" data that would only change when there is an application change is kept in csv files managed by liquibase that apply on any change. That needs to be in sync with the version of the application deployed, so it makes sense to manage that along with the code.


I could use a SQLite file for tabular data/configs but that would still not support merging.


And put that under source control.


> GDBM, Kyoto Cabinet, etc -- Useful if you are programming in late 1990s

Hold your horses:

https://charlesleifer.com/blog/completely-un-scientific-benc...

While I would never consider GDBM for any new project, I wouldn't dismiss its performance for small/locally accessed data stores.


ndjson is actually a really pragmatic choice here that should not be overlooked.

Tabular formats break down when the data stops being tabular. This comes up a lot. People love spreadsheets as editing tools, but then they end up doing things like putting comma-separated values in a cell. I've also seen business people use empty cells to indicate hierarchical "inheritance". An alternate interpretation of that is that the data has some kind of hierarchy and isn't really row based. People just shoehorn all sorts of stuff into spreadsheets because they are there.

With ndjson, every line is a json object. Every cell is a named field. If you need multiple values, you can use arrays for the fields. Json has actual types (int, float, strings, boolean). So you can have both hierarchical and multivalued data in a row. The case where all the fields are simple primitives is just the simple case. It has an actual specification too: https://github.com/ndjson/ndjson-spec. I like it because I can stream process it and represent arbitrarily complex objects/documents instead of having to flatten it into columns. The parsing overhead makes it more expensive to use than tsv though. The file size is fine if you use e.g. gzip compression. It compresses really well generally.
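
As a concrete illustration, a minimal stdlib-only sketch of writing and streaming ndjson (the file name and fields are made up):

    import json

    rows = [
        {"id": 1, "name": "Ada", "tags": ["math", "computing"]},
        {"id": 2, "name": "Grace", "tags": ["navy"], "extra": {"rank": "RADM"}},
    ]

    # Write: one JSON object per line, UTF-8 throughout.
    with open("people.ndjson", "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    # Read: each line parses independently, so the file can be stream-processed.
    with open("people.ndjson", encoding="utf-8") as f:
        for line in f:
            print(json.loads(line)["name"])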

But I also use tab separated values quite often for simpler data. I mainly like it because google spread sheets provides that as an export option and is actually a great editor for tabular data that I can just give to non technical people.

Both file formats can be easily manipulated with command line tools (jq, csvkit, sed, etc.). Both can be processed using mature parsers in a wide range of languages. If you really want, you can edit them with simple text editors, though you probably should be careful with that. Tools like bat know how to format and highlight these files as well. Etc. Tools like that are important because you can use them and script them together rather than reinventing wheels.

Formats like parquet are cumbersome mainly because none of the tools I mention support it. No editors. Not a lot of command line tools. No formatting tools. If you want to inspect the data, you pretty much have to write a program to do it. I guess this would be fixable but people seem to be not really interested in doing that work. Parquet becomes nice when you need to process data at scale and in any case use a lot of specialized tooling and infrastructure. Not for everyone in other words.

Character encoding is not an issue with either tsv or ndjson if you simply use UTF-8, always. I see no good technical reason why you should use anything else. Anything else should be treated as a bug or legacy. Of course a lot of data has encoding issues regardless. Shit in, shit out basically. Fix it at the source, if you can.

The last point is actually key because all of the issues with e.g. csv usually start with people just using really crappy tools to produce source data. Switching to a different file format won't fix these issues since you still deal with the same crappy tools that of course do not support this file format. Anything else you could just fix to not suck to begin with. And if you do, it stops being an issue. The problem is when you can't.

Nothing wrong with tsv if you use UTF-8 and some nice framework that generates properly escaped values and does all the right things. The worst you can say about it is that there are a bit too many choices here and people tend to improvise their own crappy data generation tools with escaping bugs and other issues. Most of the pain is self-inflicted. The reason csv/tsv are popular is that you don't need a lot of frameworks / tools. But of course the flipside is that DIY leads to people introducing all sorts of unnecessary issues. Try not to do that.


> Columns are separated by \u001F (ASCII unit separator)

> Rows are separated by \u001E (ASCII record separator)

That's a nightmare to try to edit yourself in a text editor?

I'd rather just have basically TSV, but with every value always quoted, always UTF-8. Quotes escaped with backslashes, backslashes escaped with backslashes, and that's it. Any binary allowed between the quotes.

I deal with CSVs all day every day. I'm known for these two posts:

https://donatstudios.com/Falsehoods-Programmers-Believe-Abou...

https://donatstudios.com/CSV-An-Encoding-Nightmare

Some friends and I actually started an RFC about 11 years ago for a CSV enhancement with an HTTP inspired header section with metadata including encoding. UTF-8 wasn't as clear of a winner back then. Never went anywhere.


> I'd rather just have basically TSV, but with every value always quoted, always UTF-8. Quotes escaped with backslashes, backslashes escaped with backslashes, and that's it. Any binary allowed between the quotes.

Encode rather than escape, such as encoding an arbitrary byte as %xx where xx is two hex digits. Use this encoding for any %s in the values, as well as any field separators and record separators and any bytes that have special meaning in your format.

Encoding rather than escaping means that given a record I can split it into fields using the built-in string splitting method of whatever language I'm using. Dealing with a format that can have the field separators escaped in the values will usually present less opportunity to use the language's efficient built-in string functions.
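
A rough sketch of that idea, assuming %xx-style encoding applied only to '%' and the two separators (exactly which bytes you encode is a design choice):

    from urllib.parse import unquote

    US, RS = "\x1f", "\x1e"   # unit separator, record separator

    def encode_field(value: str) -> str:
        # Escape '%' first, then the two separators; nothing else needs touching.
        return (value.replace("%", "%25")
                     .replace(US, "%1F")
                     .replace(RS, "%1E"))

    record = US.join(encode_field(v) for v in ["100%", "has\x1ea separator", "plain"])
    # Parsing is the language's built-in split followed by a plain percent-decode.
    fields = [unquote(f) for f in record.split(US)]
    print(fields)   # ['100%', 'has\x1ea separator', 'plain']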


One major limitation with quoted values that can thus contain record delimiters (as opposed to escaping the delimiters) is that it stops systems from being able to load records in parallel.

Some systems ban embedded record delimiters, for this reason.

Btw, I’ve (previously) included at least one of your essays in the “awesome csv” list on GitHub. https://github.com/secretGeek/AwesomeCSV#essays

There’s a few specs mentioned there too — is one of those the spec you worked on?


What's the difference between quoted and escaped delimiters? (Keeping in mind that escape sequences can themselves be escaped, ad infinitum. You can't simply seek to an escape sequence and depend algorithmically on a small, fixed lookbehind.)


I think the parent meant that if newlines were encoded as "\n" (with a backslash) then you could always split on (actual) newlines and process them in parallel without having to tokenize the quotes first.


Yep that’s exactly it.


I think you still can load in parallel, but it just introduces potential for backtracking/correcting the data if you speculate incorrectly.


"That's a nightmare to try to edit yourself in a text editor?"

I have found these "unusual" separators are useful, e.g., I use the ASCII file separator (FS). I use tr and sed to add/remove/change separators. If I had to edit a table interactively I would change the separator to something visible before editing. That said, in nvi(1) or less(1), US is displayed as highlighted ^_ and RS as highlighted ^^. It is not difficult to work with if the data is just text, as it is for me. One could also use a hex editor like bvi.

On large tables, I prefer to make edits non-interactively. I use ed(1) scripts.

Unfortunately, the UNIX sort command's -t option will only support a limited range of characters as separators. US, RS and FS are not among them. If I want to use UNIX sort I have to change the separator to one that sort accepts as a separator before sorting.

The complaints about CSV I read on HN and elsewhere seem to be complaints about what people put into them, i.e., lack of enforced rules about what is acceptable, not about the format itself.


In the printing world, inkjet printers for industrial use use a file format that is all RS, GS, US, and FS characters. It had no line breaks; instead it used an RS character at the beginning of a record. It would routinely break if people tried to open it in a text editor. Nothing wants to deal with a 300 MB file consisting of a single line. Ended up writing my own library to manipulate the files and used a lot of tr, sed, and awk on the command line. It was a pain only because modern editors have forgotten control codes.


At that point I’d be considering compiling my own sort. It can’t be that much code.


>> Columns are separated by \u001F (ASCII unit separator)

>> Rows are separated by \u001E (ASCII record separator)

>

> That's a nightmare to try to edit yourself in a text editor?

It is. When I needed to edit a file with field separators (a little tool I made used field separators) I found that Vim was great, because I could copy an existing character into a specific register, and then never use that register for anything else.


Excellent posts. Coincidentally, a couple of weeks ago while evaluating an HTTP response for a web service, I noticed that for tabular data, CSV is much more efficient than JSON; yet there is a lack of HTTP header support for CSV responses that could provide clients with supplementary information in order to keep parsers adaptable.

If you have a copy of the said RFC, I would like to refer to it.


Don't know the parent poster's RFC... but there is an RFC 4180 which attempts to define CSV, and when I looked it was what Excel generated.


If every value is always UTF-8, then you can't embed arbitrary binary, since arbitrary bytes aren't necessarily valid UTF-8.


We’re talking about a tabular data file format. If you want to include arbitrary binary, use a binary data file. Or base64 encoded data. Most datasets you’d use data like this for are small enough to fit into memory, so let’s not get carried away.

(I happen to use tab delimited files to store data that can’t fit into memory, but that’s okay too)


Yes. I think we're agreeing. I was responding to this "Any binary allowed between the quotes.". Binary data can't generally be dropped directly into a text format without some kind of organized encoding.


Yeah, I think so… I thought they meant using a quote as a flag for “binary data lies ahead”, which really seemed odd to me. But — it is completely possible in a custom file type. But yes, in this case, the entire file wouldn’t be UTF-8, even if all of the non-quotes data would be.

In retrospect, the idea of random binary data enclosed in quotes is what I’m mainly responding to — which I think we can all agree is a bad idea. (If you need to do that, encode it!)


Sure you can, here is an example:

0101010010000111101010101


That's just text. In one sense, yes, it's arbitrary bits. But it's also clearly text. It's probably not what was referred to. If you have binary data that's encoded as text, it seems obvious that that can be embedded in text. It probably wouldn't be worth mentioning that "Any binary allowed between the quotes."


> That's a nightmare to try to edit yourself in a text editor?

You just need a text editor that can support this format.


So… not a text editor then, right?


A Unicode text editor is not an ascii text editor either.


Actually, I think that it is, if it supports UTF-8 without BOM and does not try to do such things like convert quotation marks into non-ASCII quotation marks automatically, etc.

However, using a proper ASCII text editor would be better, if you do not want Unicode, to avoid many of the problems with Unicode if you are loading an unknown file or copying unknown stuff using clipboard, etc, which may be homoglyphs, reverse text direction override, etc.

(I use vim with exclusively ASCII-only mode, and do not use a Unicode locale, on my computer.)


So vim with a plugin isn't a text editor?


I just mean, if you _require_ plugins in order to be able to edit the content, then the content can't easily be described as text. It is fine to use a specialized application to edit a file of a non-text format, I have nothing against that, but you have then left the realm of text editor.

As an example of what I mean, if someone wrote a vim plugin that allowed a user to interact with a sqlite file and change the schema or edit raw row values from vim, it could be a really valuable and useful plugin. But the presence or absence of a plugin for some given text editor doesn't change whether a given format is generally considered a format suitable for being edited in a text editor. What it does instead is convert vim into an editor of non-text files.

I admit that the proposed file format is much closer to being editable in a text editor than a binary format such as sqlite, but the fact that the characters cannot be typed without special functionality suggests that the format is not suitable for a text editor.


The described format _is_ editable in vim without a plugin. It's just a little awkward, because everything will be on a single line, and you have to use more complicated key commands to enter the delimiters (for example `<C-v><C-6>` for \x1e).


> I just mean, if you _require_ plugins in order to be able to edit the content,

Vim doesn't require plugins (as long as you are editing an existing file, and not trying to create a new one).


Parquet is a wonderful file format and is a dream to work with compared to CSV. Parquet embeds the schema in the footer metadata, so the query engines don't need to guess what the column names / data types are.

Parquet used to be poorly supported, but now it's well supported by almost all languages. You can even view Parquet files in text editors now, but that's not something I've ever needed (https://blog.jetbrains.com/blog/2020/02/25/update-on-big-dat...).

Parquet column pruning & predicate pushdown filtering allow for great query performance improvements, see this blog post I wrote for benchmarks: https://coiled.io/blog/parquet-file-column-pruning-predicate...

Delta Lake (Parquet files + a transaction log) makes it quite pleasant to manage a large lake of data stored in Parquet. There are tons of other features Delta allows for like time travel, schema enforcement, schema evolution, etc.
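
As a concrete example of the pruning/pushdown point, a small sketch assuming pandas and pyarrow are installed (the file and column names are made up):

    import pandas as pd

    # Write a tiny Parquet file; the schema travels in the footer metadata.
    df = pd.DataFrame({"year": [2020, 2021, 2021], "price": [9.5, 11.0, 12.5]})
    df.to_parquet("sales.parquet")

    # Column pruning: only the requested column is read from disk.
    prices = pd.read_parquet("sales.parquet", columns=["price"])

    # Predicate pushdown: the pyarrow engine filters row groups before loading.
    recent = pd.read_parquet("sales.parquet", filters=[("year", ">=", 2021)])
    print(prices.shape, recent.shape)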


I've recently been working on a WebAssembly version of Parquet, to bring Parquet to the Web too!

[0]: https://github.com/kylebarron/parquet-wasm

[1]: https://observablehq.com/@kylebarron/geoparquet-on-the-web


Parquet is still relatively poorly supported in the JVM world unless this changed in the last year? Yes, you can use Spark but that's an absolutely huge dependency just to read a file representing a table of data. The alternative - trying to use poorly documented Hadoop libraries - was only marginally better. Maybe the story has changed in the last year?

The other problem with Parquet is that it's overly flexible/supports application-specific metadata. It's all fine when you use a single tool/library for reading and writing files but cross-platform is problematic. Saving a Pandas dataframe to parquet, for example, will include a bunch of Pandas-specific metadata which is ignored or skipped by other libraries.

In this case, it required converting timestamp/datetime columns to a nano time int64 representation before writing data from Pandas (for example), otherwise you could not read those columns using anything that wasn't Pandas.

But maybe this has changed since I last used parquet as a table format?


And DuckDB (DuckDB.org) is a lightweight and super fast library/CLI for working with Parquet.

It’s SQLite for column formats, and for Python users it’s only a pip install away. I use it on the command line to inspect and work with Parquet. It’s also Pandas compatible and actually more performant than Pandas.

No need to use Spark, which is heavy and has tons of boilerplate.


I've been pretty impressed with parquet lately. One thing I've missed is a way to group tables. Is there a standard for that? While parquet is generally column oriented it has support for metadata about tables of multiple columns. However, I'm not aware of any format that groups the tables, short of just zipping a bunch of files.

For context, this would be for an application that passes sqlite files around. So naturally it has good support for the database level of storage. But parquet is so fast for some applications as well as so compressed.


Is Spark what you're looking for? You can do all sorts of joins, groupings, and aggregations with parquet(s) acting as your source(s)


You want to search for “DataFrame” libraries.

Another commenter mentioned Spark. Pandas is another popular one; I’ve not used it, but I think it’s lighter weight, whereas Spark is more for large distributed computation even though it can run locally.

There’s a bunch of these tools which lets you treat parquet files as tables doing joins, aggregations etc.


Arrow is really the future here


Isn’t Apache Arrow an in-memory format that the various DataFrame libraries can standardise on to interact with each other, i.e. inter-process communication (IPC)?

My understanding is your raw data on disk is still a format such as Parquet, but when you load that Parquet in to your application it’s stored as Arrow in-memory for processing?


Arrow also has its own on-disk format called Feather - https://arrow.apache.org/docs/python/feather.html
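
A quick sketch of the round trip (assuming pandas and pyarrow are installed; the file name is made up):

    import pandas as pd
    import pyarrow.feather as feather

    df = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Grace"]})
    feather.write_feather(df, "people.feather")      # Arrow IPC file on disk
    round_tripped = feather.read_feather("people.feather")
    print(round_tripped.dtypes)                      # column types survive, unlike CSV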


Is it possible to diff a parquet file?


I would load the parquet files in Python Pandas and do the diff you want using Pandas.


It's possible, but you can't just diff the file bytes, because you will get spurious differences due to metadata, etc.


Is it possible to diff an image?


Yes, multiple tools exist to diff images, e.g. https://github.com/dmtrKovalenko/odiff


There is a decent file format for tabular data, and the author dismisses it: parquet.

It's compact, encodes all the common data types well, does int/float distinction (thanks for teaching us how important that is, json), stores null records with a mask instead of a special value, uses column-major order, has compression, speedy random access... it has it all. And it isn't bogged down with legacy cruft (yet).

Since you need to look at tabular data outside of a text editor anyway[0], I don't see a ton of benefit to making it a plaintext format. Especially not with the author's suggestion of un-typeable ascii delimiters. If I can't type it on my keyboard, I may as well be looking at a hex dump of a binary file because I can't really edit it.

[0] Who among us hasn't experienced the joy of a pull request updating a checked in csv file? A mess of ,,,,,,,"Birmingham",,,AL, etc.


parquet is great but it's not particularly easy to read or write. the libraries that do exist to work with it are few and far between, and they either have a hundred dependencies or depend on native code (e.g. libarrow). certainly an important dimension in an ideal file format should be the ease of parsing/writing it, and parquet gets an extremely low score on that front imo


Parquet is also column-major which is great for many use cases, but bad for others, where row-major is more useful. For example, if you want to get just the first x rows.


Then you want avro


Sure, but any new format is going to have the same problems. I think you're right that implementation complexity needs to be considered, but it's not like Word or Excel files or something where you need to replicate bug for bug a format accreted over decades.

Parquet isn't trivial to parse / write but that's probably good imo. CSV is really easy to write, and... that just means everybody does it slightly differently. Being somewhat difficult to interact with encourages people to use a library to centralize a bit, but it's not so complex that someone motivated couldn't write a new implementation in a reasonable amount of time.


Unsure if it was added after your comment, but there is a brief comment on parquet (in that it's binary - so it can be annoying to view/edit).


Yeah, he mentioned it quickly, but kind of dismisses it for bad reasons imo.


Editors such as Notepad++ make it fairly easy to insert US and RS ASCII codes. But not quite as simple as typing a comma or return, obviously.


SQLite is one of the few file formats that's recommended by the US Library of Congress for archival storage: https://www.loc.gov/preservation/digital/formats/fdd/fdd0004...

See also this page on the SQLite website (they're understandably very proud of this): https://www.sqlite.org/locrsf.html

I think it's a fantastic format for archiving and distributing data.


I honestly love SQLite, so many things about this database are refreshingly amazing. The lightweight embedded nature, the extreme ubiquity[1], the single-source-file-to-compile idea[2], the gigantic test suite[3], their commitment to stability and backward compatibility[4], etc ...

Simply a modern marvel of engineering. A thing to put on the next Voyager and send to aliens as a representative of what human minds and hands can do.

[1] https://www.sqlite.org/mostdeployed.html

[2]https://www.sqlite.org/amalgamation.html

[3] https://www.sqlite.org/testing.html

[4]https://www.sqlite.org/lts.html


All I'd want is a CLI tool that can dump an SQLite file so that it actually looks like a table (using "|", "-", "+", etc.), and maybe also accept grep-style filter specs (per specified column[s] of course).


This sounds like an interesting weekend project. Are you saying you can't find anything for this?


There is catsql[1] but it barfs in my Python environment.

[1]https://github.com/paulfitz/catsql


you can get a long way with something like

$ echo "<sqlite3 commands / sql queries>" | sqlite3 FILE


sqlite also has roots in the us military (navy iirc). this could explain the US govs willingness to adopt.


The LOC page explains why they recommend SQLite.

https://www.loc.gov/preservation/digital/formats/fdd/fdd0004...


I think it’s because csv is good enough.

All the standards I’ve seen haven’t been worth the effort to implement. So since csv, with all its flaws, is good enough, it crowds out other open standards.

People complain about it, but it’s not really much of a challenge to use csv. I’d also prefer it over the crap (rdf, xml, even schema'd json) proposed by people who value more structure. It’s easier for me to just make clean production and consumption programs than to spend time on a really structured table format.

Although I would love a simple, tabular format so I’d never have to use csv again.


I live in France and let me tell you: CSV is not good enough.

What about France you say? The decimal point is a comma, and to avoid confusion, the list separator is a semicolon. So "1.5, 1.6" becomes "1,5; 1,6" when localized. And if you think it is terrible for CSVs, well, it is worse than that.

If we all used RFC 4180 consistently, it would work, but of course, that would be too simple. Some software insists on localization, so you basically need to know the software and locale that go with that CSV. In the simplest case you can treat both commas and semicolons as separators and it will work for reading; for writing, you have to pick one that the target software hopefully understands.

Hell starts when you have decimal numbers, because sometimes you get exported CSVs where the comma is used both as a decimal point and as a separator. It makes the numbers ambiguous and your file becomes useless. It can result in data loss (personal experience). For example "1,2,3" may be [1.2, 3] or [1, 2.3]. And even without that, you still have to guess whether you are dealing with points and commas or commas and semicolons.

And of course, there are the quoting/escaping issues every country has. Good thing the RFC has clear rules; I heard you can even find software where they are properly implemented. Of course, in France we also have the added bonus of questioning whether or not commas should be quoted since they are not the localized separator.

And character encoding? Of course it is a mess, as with every text file, but CSV makes no effort to help: is it UTF-8? Latin-1? I have seen both, and I have seen files mangled because they were opened with the wrong encoding. Also, line endings should be CRLF, for those who care.

CSV as defined by the RFC is good enough, the problem is how it is used in practice.


I think CSV is crappy because commas are so common in real data.

For almost all scenarios I've had to work with, I'd have been perfectly happy with TSV where literal Tab was a disallowed character. No escaping histrionics required.


I prefer commas because I can see them, unlike tabs. I spend zero time escaping commas because the libraries I read and write with (usually pandas, but pretty much everything) escape for me. So unless I’m manually building csvs myself it’s a non-issue and certainly not histrionics.


I prefer tabs because my data can often have commas in it. Seeing tabs isn’t an issue as I also have invisible characters visible in all of my editors.

But having your delimiter not be allowed in the record (as in \t is disallowed), makes parsing so much easier. CSV is a bear to parse because you have to read each value from a buffer to handle the quoting.

    line.strip().split('\t')
Is so handy. Batteries included are one thing, but I don’t want to pull in a library just to parse CSV files.


Given your example looks very much like Python, you could use the csv module, which is built into the standard library and handles all of this for you in a standards-compliant manner.

You really do want to use a library to parse CSV since there are a number of corner cases. For example, your example code may not read a whole row since rows can have newlines in them.


> You really do want to use a library to parse CSV

Right… you absolutely need to use a library. I’ve written the parsers for CSV which handle the edge cases (at least RFC edge cases). But I don’t want to include a library to be able to do it. Most of my scripts are small, and adding libraries makes them more difficult to move around. So, I shy away from CSV as a format as a result.

I’m okay with not allowing new lines and tabs in my fields. It’s a valid trade off for my data.

Yes, many of my scripts are Python, but not all, so adding new libraries for different languages is more difficult for me to remember as opposed to just splitting a line by a record separator.

Also, many of my files require having comments in the file’s header. Another reason why CSV can’t be easily used, as this isn’t part of the RFC.


I appreciate, as described, some of your needs may differ from the standard format (I'd probably still use it to make interoperability between other languages/people/systems easier though, and have my comments etc. in documentation). However, with all that in mind I may not have communicated well enough about CSV as a built-in.

> Most of my scripts are small, and adding libraries makes them more difficult to move around. So, I shy away from CSV as a format as a result.

The CSV library I speak of is part[0] of the python standard library. Unless you're working with a restricted subset of the language or something it should have no impact on the size of your script or its portability between platforms/locations.

[0] https://docs.python.org/3/library/csv.html


Newlines are pretty common in text too, depending on your data. What do you do about those?


Pipes are usually pretty safe too.


Depends entirely on context. I did consulting in wholesale, and supplier data very often contained pipes as content separators (e.g. for product categories, sizes, etc.).


Not being able to include Tabs and Carriage Returns in your data can be a problem though.


The ideal would be a format that avoids characters that are common in data, but still allows a way to encode rare characters that are part of data.

I think that was the motivation for multi-character delimiters like <% %> or {{ }} in templating languages.


Depending on the library, (double) quoted fields _may_ be able to contain carriage returns (0x0D) and/or line feeds (0x0A).

Sometimes it works, sometimes it doesn't, can be a real pain in the butt.


Unicode just needs a single special delimiter character that is only used as a delimiter.


Ideally a separate row and a column delimiter


As I understand it, ASCII already has this. And isn't every ASCII codepoint valid UTF-8?


CSV has more problems than just quoting: one problem is that excel is so commonly used to open CSV files that various weirdnesses in excel have essentially become part of the CSV format.

For example, if a value looks like a formula, then the only way to get excel to treat it as text is to put a single quote in front of it, "'=not a formula". This in turn means that it's common for values in a CSV to begin with a single quote, but for that single quote to not be intended as part of the value.

I think a decent format could be like CSV, but tab separated, where the header row is mandatory, and with explicit support for data types. Tabs and newlines (along with any other unicode character) can be included via backslash escaping.

eg. (imagine the spaces are tabs)

    int:Id  str:Name      json:Attributes    dyn:Other
    1       John\tSmith   {"age": 30}        int:42
    2       Bill          {"age": 40}        str:Hello
Applications can use their own type specifiers (eg. "formula:") to avoid needing to introduce special rules. The value of a field (before interpretation) is the exact string written in the file. The type specifier indicates how that string should be interpreted as a typed value by the application. For example, the `str` type specifier says that backslash escape sequences should be converted to the corresponding characters when interpreting the value.

This has the following properties:

- Machine and human readable and writable.

- Does not place artificial restrictions on values.

- Can be extended as needed with new type specifiers.

- Can be processed losslessly even if an application does not understand all type specifiers in use.

- Is not ambiguous about whether data is part of a header or not.

- Can store common data patterns (eg. all values in a column have the same type) efficiently without being restrictive (values can still have different types even in the same column).

- Can easily jump to a line or field (newlines and tabs not allowed except as separators - they must be escaped when in a value).

- Easy to remember.
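
For what it's worth, a rough sketch of a reader for the scheme described above; the type names, the dyn: handling, and the escape rules are just the ones proposed in this comment, not any standard:

    import json

    def parse_value(cell: str, spec: str):
        # Interpret one field according to its type specifier.
        if spec == "dyn":                      # dyn: the value carries its own specifier
            spec, _, cell = cell.partition(":")
        if spec == "int":
            return int(cell)
        if spec == "json":
            return json.loads(cell)
        # 'str': undo \t, \n, \\ escapes (naive; fine for an ASCII-only sketch)
        return cell.encode().decode("unicode_escape")

    header = "int:Id\tstr:Name\tjson:Attributes\tdyn:Other".split("\t")
    specs, names = zip(*(h.split(":", 1) for h in header))
    line = '1\tJohn\\tSmith\t{"age": 30}\tint:42'
    row = dict(zip(names, (parse_value(c, s) for c, s in zip(line.split("\t"), specs))))
    print(row)   # {'Id': 1, 'Name': 'John\tSmith', 'Attributes': {'age': 30}, 'Other': 42}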


I like the format description, thanks. Sounds like something I might implement some day for fun!

The reason I still don't like this is because most of the time I'm sharing csv it's to a jira ticket or an email, or to a customer, and the customer insists it needs to be csv.

When the customer wants csv there is nothing I can do.

But in other cases I'm the one viewing this CSV file, so its visual clarity is important. CSV has all the problems you listed, but TSV (and similar, such as your format) has the shortcoming of relying on an invisible separator. And worse, the width of a tab is undefined, undecided.

I don't know what the best trade-off is... Maybe we need better tooling so frontends render CSV/TSV et al. as separate cells.


When the customer wants csv, it's because they will open it on Excel, so anything about standardization or data quality is moot. They will have their data mangled in an application specific format.

You need to export to csv. But that doesn't mean you can't use an actually usable format for anything else.


> - Can be extended as needed with new type specifiers.

That's often not a good thing, at least if that allowance is given to third parties, as it wildly opens up the format to effective incompatibilities (fragmenting the ecosystem) and is a never ending source of security issues.


This assumes that if you don't include such capability, that applications won't find a way to extend it on their own regardless of the spec, and that's empirically false.

There's a reason the vast majority of file formats do have some capacity for extension. By providing this capability you avoid much worse hacks (the devil you know and all that...) and you can ensure certain invariants are upheld (such as applications being able to process records losslessly even if they don't understand the format specifier)


> This assumes that if you don't include such capability, that applications won't find a way to extend it on their own regardless of the spec, and that's empirically false.

Except it's true. For a trivial example, json which does not offer these capabilities doesn't suffer from incompatibilities or security issues anywhere close to how much yaml does. Because a json file which is not POD requires a lot of additional documentation and code.

> There's a reason the vast majority of file formats do have some capacity for extension.

Yes, naïvety (if not outright brain damage): for a long long time it's been assumed that flexibility and extensibility were innocuous, and while we've known that was not at all the case for at least a generation, these ideas remain attractive nuisances.

More modern formats have tended to stay away from these, and those few which have not (like, once again, yaml) are case studies in why they should have.


There are lots of protocols like what you describe and I’ve written a few myself. But they are a pain to teach to other people.

Csv isn’t perfect, but I’d rather live with its weaknesses than use anything else.


CSV is fine. If you care about edge cases, implement RFC 4180:

https://www.rfc-archive.org/getrfc.php?rfc=4180

If you don't, then split each line on ",". Problem solved.

If you find tab delimited easier to read (as I do), then check out the IANA spec for TSV files:

https://www.iana.org/assignments/media-types/text/tab-separa...

It's easier to parse than CSV. Unfortunately, you have to decide how to handle newlines and tabs yourself (probably as \n, \t, with \\ for backslashes).
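
A minimal sketch of that escaping convention (just \t, \n and \\, as described above):

    def escape(field: str) -> str:
        # Backslash first, then the two characters that would break the layout.
        return (field.replace("\\", "\\\\")
                     .replace("\t", "\\t")
                     .replace("\n", "\\n"))

    def unescape(field: str) -> str:
        out, i = [], 0
        while i < len(field):
            if field[i] == "\\" and i + 1 < len(field):
                out.append({"t": "\t", "n": "\n", "\\": "\\"}.get(field[i + 1], field[i + 1]))
                i += 2
            else:
                out.append(field[i])
                i += 1
        return "".join(out)

    row = ["a\tb", "multi\nline", "back\\slash"]
    line = "\t".join(escape(f) for f in row)
    assert [unescape(f) for f in line.split("\t")] == row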


> If you don't, then split each line on ",". Problem solved.

And the millionth bad CSV parser is born.


More than the millionth, I'm sure. If you know your input doesn't have commas (for example, because you generated it) and the code is throwaway, then splitting on comma is fine.

But, yeah, I prefer tab delimited, since "no commas" is painful, and data that contains quotes leads to scenarios where the crappy CSV I generated can't be parsed by RFC 4180 parsers.


yeah the "CSV is fine" bit scared me. people think they know what everyone is doing and why; in reality (where these problems actually need to be solved) no one knows.


> If you don't, then split each line on ",". Problem solved.

What? Real data contains '"' quote character too. That doesn't work.


Read the RFC I linked. It handles all the characters.


Yeah I agree. Just the combination of opening in excel on double click and being dead simple to spit out from any software makes it a winner. It helps that it's usable in plain text too.

If it had used ANYTHING other than half the world's decimal separator as a value separator, it would be a no-brainer. If it had been .psv (pipe-separated values) the edge cases would be so few that we could just ignore them.


Gnumeric (an open source spreadsheet) has a great dialog box that pops up and lets you tell it exactly how to parse the tab delimited / csv file you just handed it. It even has a preview of what your parsing options will do to the file you're opening. Also, "gnumeric foo.csv" does the right thing from the command line.


I don't think it's necessarily that CSV is good enough but that it's the only ASCII tabular format that "double clicks into Excel" by default.

Excel supports a lot more tabular formats these days, but you have to search the ribbon for their importers. Its JSON support is actually surprisingly good, for instance.

Sometimes I think the best thing that the Excel team could do to make the state of the world of tabular formats for us better as developers is just to add a bunch of new file extensions that you can rename files to. For instance, if you could just rename a JSON file to end with something like .XLJSON and a Business Analyst or Accountant can just double click to open in Excel, that would open up a lot more nicer tabular format options beyond CSV with better developer ergonomics.


I never understood why the ASCII separator characters aren't used more. It seems like we're one simple text editor feature away from having easy display and modification. Is there some historical reason for not doing that?


ASCII separators (mnemonics FS, GS, RS, US) are difficult for most users to type, and have no obvious/standardized visual representation.


Early on they were used in some serial and modem protocols, which was problematic when you used them in actual file content. I remember trying to use them in a file, and then when I tried to transfer them, the transfer would stop at RS.


Who is typing out CSVs anyway. You’d either use a spreadsheet tool or do it in code.


From my point of view, CSV is mostly edited in text editors because they are much faster and don't risk making any unintended modifications.


My editor, and lots of editors now, put spaces instead of tabs when I press the tab key. No reason something similar couldn't happen in CSV modes/apps.


Notepad++ shows RS and US as little black boxes with 'RS' and 'US' in them. It works well enough.


I’ve wondered this too. At one point I started writing a hobby programming language and explicitly added escape codes for these extra control characters to be used in string literals. (Ada has them all included in the “ASCII” package which is nice.) It seems like a no-brainer to bake them into file formats like the OP describes.


Because they don't have standardized escaping, and when you are rolling your own ad-hoc escaping scheme you can just use printable separator anyway.


Really? Then I wonder what the ASCII character coded 0x10 is for?


That is what I considered too, actually.


> Tab separated is a bit better than CSV. But can’t store tabs and still has issues with line endings, encodings etc.

There are three main variations of tab-separated-values files:

1) Those that don't allow tabs or line endings in the data.

2) Those that replace tab and newline characters with escaped values (\n for newline, \t for tab, \r for carriage return, \\ for backslash).

3) Those that follow the CSV convention of quoting fields as defined in RFC 4180.

The third option is by far the best and is what Microsoft Excel uses. Microsoft Excel has a save file type called "Unicode Text (.txt)" which saves the data as a tab-separated file using RFC 4180 quoting/escaping with the UTF-16LE character encoding. In older versions of Excel, "Unicode Text (.txt)" was the only way to export any values containing Unicode characters since the "CSV (Comma delimited) (.csv)" export format uses the ANSI encoding (Windows-1252 on Western/US computers), corrupting any characters not contained in the ANSI character set. In late 2016, Microsoft finally added the "CSV UTF-8 (Comma delimited) (.csv)" option for exporting a CSV file containing Unicode characters.

https://en.wikipedia.org/wiki/Tab-separated_values
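
For what it's worth, that third variant is essentially what you get from Python's standard csv module if you just switch the delimiter to a tab; a small round-trip sketch with made-up values:

    import csv, io

    # Tab-delimited with RFC 4180-style quoting: fields containing tabs,
    # quotes or newlines get quoted automatically, embedded quotes doubled.
    rows = [["name", "note"], ["Alice", 'has a\ttab and a "quote"']]

    buf = io.StringIO()
    csv.writer(buf, delimiter="\t").writerows(rows)

    buf.seek(0)
    assert list(csv.reader(buf, delimiter="\t")) == rows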


> Most tabular data currently gets exchanged as: CSV, Tab separated, XML, JSON or Excel. And they are all highly sub-optimal for the job.

> CSV is a mess. One quote in the wrong place and the file is invalid.

That breaks the other formats too, why pick on CSV? I can imagine a format designed to be friendly to syntax errors, but contra Postel's Law I'm not sure it would be an improvement over a strict, fail-fast syntax.

That's CSV/TSV's real shortcoming: about the only generic validation they allow is to make sure the column count is the same for all rows.


> That's CSV/TSV's real shortcoming: about the only generic validation they allow is to make sure the column count is the same for all rows.

Once upon a time, when I was doing a lot of data interchange between a wide variety of systems (OS'es, applications, etc.) I considered proposing an "enhanced CSV" (ECSV) where the values did not start on the second row in the file, but instead the second row would be regular expressions that could be used to validate the contents of the columns that followed, and data would start on row 3.

In other words, you might have:

``` ID,NAME,DATE

   "/^\d+$/","//","/^\d{4}-\d{2}-\d{2}$/"

   867,Alice,1984-01-09

   5309,Bob,1981-11-16
```

(Newlines added because HN doesn't speak Markdown, sigh.)

In the end, I think the solution was far simpler: we just exchanged a separate descriptor file that had column names and their corresponding regexp patterns for validation as a separate (versioned) file, in order to save a few bytes inside each file transmission, which was a real savings when you paid per-byte over an EDI network.
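
For illustration, a minimal validator for that kind of layout (header row, then a row of regexes, then data) might look like the sketch below; the function name and file contents are hypothetical:

    import csv
    import re

    def validate_ecsv(path):
        """Row 1 = column names, row 2 = regexes, data starts on row 3."""
        errors = []
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            # The regex row uses /.../ delimiters, as in the example above.
            patterns = [re.compile(p.strip("/")) for p in next(reader)]
            for lineno, row in enumerate(reader, start=3):
                for name, pattern, value in zip(header, patterns, row):
                    if not pattern.search(value):
                        errors.append((lineno, name, value))
        return errors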


HN actually does support markdown code blocks, you just have to preface lines with four (nope, two)[1] spaces instead of the (newer) triple-backtick codefences.

  ID,NAME,DATE
  "/^\d+$/","//","/^\d{4}-\d{2}-\d{2}$/"
  867,Alice,1984-01-09
  5309,Bob,1981-11-16
[1] apparently it's actually two spaces instead of the normal markdown standard of four, making HN even more non-standard than usual

https://news.ycombinator.com/formatdoc


Yup, I noticed that and my Level of Care was insufficient to go and edit my comment a second time to fix it. But, thanks for confirming that it could have worked!


I spent a while making a binary format for tabularish documents, and even started on an editor for it. What I decided on after some long months of gradual iteration was to give each cell its own header that could contain various forms of type info, flags, and modes, and to define a cell type that described forms of break (space, line, page, etc. - a 16-bit range of break types could be encoded). The document header also described a dictionary mapping for the data so that it could immediately be presented to the editor in a readable form.

But now I just use plain old spreadsheets to do things - I obsoleted my own tech, although I like certain things about it. The editing and storage encoding isn't really the problem so much as the integrity of the literals, which a solution like the regex idea could accommodate.

I do think that CSV would benefit by having a header area that described the encoding of breaks in cells and lines. Maybe that's the only thing that really needs fixing in it. And if it included arbitrary break levels like my thing and dropped the rectangular row-column shape, it would cover a huge number of documents.


We can blame CSV, or we can blame the way people use CSV. Either way CSV is so unreliable that I try to “fail-fast” as soon as possible in automated pipeline.

At work, we explicitly define data structuring process, converting CSV to Parquet with strict schema and technical/structural validation. We assign interns and new grad engineers for this, which is nicely within their capabilities too with minimal training.


>> CSV is a mess. One quote in the wrong place and the file is invalid.

> That breaks the other formats too, why pick on CSV?

I think it's perhaps badly worded, but the implied (and more important) criticism seems to me to be that CSV makes this kind of error much more likely, with its handling of quotes. Having worked with CSV files that had commas in the data (and sometimes quotes too), I quickly learned that I should `set readonly` on my editor and only interact with the file through programmatic tools, and give up any notion of it being a plaintext hand-editable data format.


Indeed, it’s because people use an editor that isn’t designed for editing csv to edit csv.

Every csv file is a text file, but not every text file is a csv file, yet people use text editors to edit them.

A csv editor would forbid you from entering that quote, automatically add a matching one, ask you for your intentions, or whatever, but it shouldn’t silently corrupt your file.

A spreadsheet-like UI, but without formulas, automatic single header row, etc. would be a better UI for csv files.

If you use zip to edit a .xlsx or .docx file, you easily get an invalid file, too.


I think you just described VisiData:

VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python, into a lightweight utility which can handle millions of rows with ease.

https://www.visidata.org/


if you add an extra comma in a CSV (outside of quoting) then the rest of the cells in that row are off by 1. Which is not good obviously. But if you add an extra quote, then the entire rest of the file is garbage.


HDF5 is often used in scientific computing for this.

https://en.wikipedia.org/wiki/Hierarchical_Data_Format


You can just use sqlite then. Very compact, highly popular (in different role though). Seen it used for large datasets - map tiles (millions of jpeg files). Much smaller size than zip or tar archive, indexed, fast.

P.S.

  sqlite> .mode csv
  sqlite> .import city.csv cities


Cool, I didn't know about .mode and .import. Super handy tip.


You can also use

    .mode csv
    .headers on
    .output file.csv
Then run a query and it gets output to file.csv.
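
Roughly the same export from Python's built-in sqlite3 module, in case the CLI isn't handy (the database, table, and file names here are just examples):

    import csv, sqlite3

    con = sqlite3.connect("cities.db")
    cur = con.execute("SELECT * FROM cities")
    with open("file.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)                                 # data rows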


> This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file can be accessed using the POSIX-like syntax /path/to/resource.

That seems a whole higher level of complexity compared to CSV or the other options listed in TFA (perhaps comparable to Excel).


NetCDF4 (built on top of HDF5 largely through sub-setting) is considerably more powerful than excel/libreoffice. Its also easy to access through widely-available libraries. I frequently use the Python `netCDF4` (yes, it really is capitalized that way) library for exploratory work.


Single Cell RNA Seq data is often stored in Loom which is a HDF5 format.

https://linnarssonlab.org/loompy/

It's a little weird at first but it's a great format and has libraries in a lot of major languages. It stores a sparse matrix, which cuts the size down a lot.

https://linnarssonlab.org/loompy/format/index.html


HDF5 has some limitations that make it suboptimal for cloud based storage systems.

Zarr overcomes these limitations for array data and Parquet overcomes these limitations for tabular data.


The OP wants a text based format, he doesn't care about what is optimal.


> Why can’t we have a format where

Does Excel support it? No? Then that's the end of that.

Excel is tabular data to all non developers. The formats supported by Excel are the whole thing.

And if we're inventing a CSV-like format that uses a more convenient character than quotes and commas, maybe jumping to a non-displayable non-typeable character isn't the best?

Honestly, if I were inventing a table format, I'd use a strict subset of HTML5 tables. Strip it down as hard as possible to the bare minimal elements - no unneeded close tags. The only real flaw is that the encoding of whitespace and escaped chars is verbose and ugly.

    <!DOCTYPE html>
    <html>
    <head>
        <title>Example workbook</title>
        <meta charset="utf-8">
        <meta name="subtype" content="tabular data">
    </head>
    <body>
        
    <table><caption>Persons</caption>
    <tr><th>Name<th>Address
    <tr><td>Alice<td>123 Fake St, Faketon FA 12345
    <tr><td>Bob<td>789 Any Pl, Anytown AY 54321
    </table>

    <table><caption>Cars</caption>
    <tr><th>Make<th>Model
    <tr><td>Buick<td>LeSabre
    <tr><td>Pontiac<td>Vibe
    </table>


I've pondered exactly that. After a previous HN thread, I tried my hand at writing a specification that was as minimal as possible but still followed HTML5. For example, many end tags are optional, like the `tr` and `td` end tags. Though I pulled in RDFa for richer data types.

Here's the GitHub repo for what I like to call HSV5: https://github.com/elcritch/hsv5/blob/main/README.md ;)

And an example of the format, pretty similar to yours:

    <!doctype html>
    <html vocab="https://hsv5.org/" typeof="hsv5" lang="en">
    <title>sensor data @ 2020-05-26T00:00:00.000Z</title>
    <table>
    <thead>
    <tr>
      <th property="time/iso">Timestamp
      <th property="temperature/celsius" datatype="f32">Temperature
      <th property="humidity/percent" datatype="f32">Humidity
      <th property="pressure/kPa" datatype="f32">Pressure
    <tbody>
    <tr>2020-05-26T00:00:00.000Z<td>8.32<td>75.5<td>102073
    <tr>2020-05-26T00:01:00.000Z<td>8.31<td>NaN<td>102074
    <tr>2020-05-26T00:02:00.000Z<td>8.31<td>75.4<td>102074
    <tr>2020-05-26T00:03:00.000Z<td>8.30<td>75.4<td>102074
    </table>


Neat!

edit: wait, the stuff after <tr> is implicitly the first cell? That's new to me.


As far as I can tell! Browsers seem fine with it too.


Is <meta name="subtype"> used anywhere else?

(I had considered <html application="tabular-data">; my reason was that it seems to work better with user CSS, and might be simpler to parse in other cases too.)


I can't remember where I saw "subtype". Probably something I misrememebered. Yours is better.


This is a very cool idea but why not just add the closing tags and have a minimal format that is also valid HTML and can easily view the data by pointing your browser at the file?

It's already a universal format and as dtech mentions below, you get Excel compatibility.

I would also add headers and an optional type attribute...

  <tr>
  <th data-type="int">Age</th>
  </tr>
If the file is too big then you probably need to compress it or use a more appropriate format.


The document I posted was valid html6 -- all of those end tags are optional.

From an XML purist's perspective that's abhorrent, but if we're trying to make a terse table format skipping the close tags is good.


SGML originally defined optional tags. The intent was for doc writers to not need to write end tags. XML in my view was a 20 year regression.


Also didn't SGML offer a terse generic closer </> ?


FFS I don't know how I typo'd "html6" I meant HTML5. Too late for edit.


This also allows CSS selectors such as:

  [data-type="int"] {
    color: green;
  }


Excel actually opens HTML tables


Works well until some joker comes along and puts <td> or <table> in your data :)


Well yeah, no matter what, escape chars will be a challenge. Also whitespace. Imagine all the &nbsp; values if you import a whitespace-heavy string.


it will also be at least 2x bigger on disk for no reason.


Your data must be very sparse or include a lot of escape chars for that.

I've seen a lot of CSV where everything is quoted, meaning that the cell separator is effectively

    ","
which is only one character less than

    <td>
and still beats the pants off of JSON or XML. imho, it would be a good compromise, in that there's already partial tooling and GUI support.


But usually it will be 1 character vs 4. So that adds a lot of space that doesn’t add much value.

It’s also harder to read.


I would say invisible control characters would be even harder to read.


US and RS characters are visible in some editors, e.g. Notepad++.


Why not sqlite? You can read/write to a sqlite file in pretty much any language using decent bindings aside from maybe browser JS. I've worked with third party sqlite files several times and it usually isn't that bad. You can't just use any text editor to work with them, but the sqlite CLI is very easy to install and use. If you want a GUI, DB Browser seems to work very well.


I mean, in business contexts at least, Excel is what we end up using nearly all the time because it's ubiquitous. This doesn't mean there aren't issues with it; obviously it's nonFree, and it's notorious for making assumptions about data (see recent coverage of scientists literally renaming a gene (I think?) because Excel kept assuming it was a date or something).

But Excel as a tool is on so many desktops that you can assume access, and so it gets used.

I have a strong sense that this may be the biggest reason why no other format has taken over. It's "good enough" that it steals oxygen from any competing idea.


Yeah, I think you’re spot on. It’s bad enough that a lot of people will implement direct alternatives, but it’s good enough that none of the alternatives will get any traction — unless they can be “just click the file” compatible with Excel (or something at least as ubiquitous and useful as Excel… a browser maybe?)


Fundamentally, why are people so concerned about about hand editing tabular data?

The options for that are pretty terrible. Only editing an existing value in one cell is ok. Everything else - adding new rows, new columns and making sure the file format and column structure is correct - is all terrible. This gets doubly so if you stick it in git and start getting messy pull requests.

I would carefully question what use case you have that this is so important. If you truly do have such a use case, I'd politely suggest tabular data maybe isn't what you need.

Once you remove that requirement, binary solutions for tabular data are possible again and several good options have already been pointed out in other comments.


I regularly need to hand-edit tabular data I have generated from other data when it turns out there are a few weird cases I didn't handle right. I do a lot of one-off data processing in several stages, so it's not just a question of fixing the generator. If I want to edit the structure, sure, I'll take it into Sheets, but for data that's slightly weird, a text editor or shell pipe is often faster.


> No escaping. If you want to put \u001F or \u001E in your data – tough you can’t. Use a different format.

> It would be reasonably compact, efficient to parse and easy to manually edit (Notepad++ shows the unit separator as a ‘US’ symbol).

Is it just me, or won't it be human readable because of the lack of newlines?


I wonder if there is some hacky way around that. For example, an LF (line feed) directly after a RS character is not considered data (but will break the line when shown in a text editor).


I rarely want to see tabular data in a human-readable format. It is always the most tedious way to approach it. My go-to is Excel/LibreOffice Calc. This approach is at least tolerable to edit in a text editor, while something like the OpenDocument Spreadsheet format or the Excel format is impenetrable.


I rarely do it, but it’s nice to be able to Human read when I need to. Also being able to use all the command line text tools is super convenient.

I think it’s a thing where having the option for the 0.1% of times when you need it keeps me using it.


> I rarely do it, but it’s nice to be able to Human read when I need to. Also being able to use all the command line text tools is super convenient.

Sometimes it helps a lot to eyeball what you have before doing one-off scripts to filter/massage your data. Had a recent case where the path of least resistance was database to csv to one-off python to statistics tools with a GUI and tabular display.

Could have probably done some enterprise looking export infrastructure but it was a one off and not worth it.


Yes, that is an issue with the suggested approach. Unless we can persuade all the editor developers to break lines at 'RS' characters (which is not going to happen).


> Columns are separated by \u001F (ASCII unit separator)

> Rows are separated by \u001E (ASCII record separator)

Or, how about columns separated by \u002C, and rows separated by \u000A. And, for bonus points, we can even define unambiguous ways of escaping those two characters so that they CAN appear within column values, if we wanted to, and not tell people that our encoding format is totally stupid and that they need to use a different format.

OP's proposal is equally "highly sub-optimal for the job" for exactly the same imaginary reasons they dislike the currently available encoding formats, but they don't seem to realize it.


> OP's proposal is equally "highly sub-optimal for the job" for exactly the same imaginary reasons they dislike the currently available encoding formats, but they don't seem to realize it.

This is a really unfair appraisal in a bunch of different ways.

Removing the ability to embed record delimiters, for example, means you can process the records in parallel. That’s a massive improvement all by itself.

Stating that their reasons are “imaginary” is just a needless insult, apart from being wrong. Why be like that?


> Stating that their reasons are “imaginary” is just a needless insult, apart from being wrong. Why be like that?

It's not an insult, let alone a needless one: it is a statement of fact.

The reasons OP cites for criticizing CSV are creations of OP's imagination, their subjective beliefs, and not based in objective truth.

For example:

> "CSV is a mess."

Actually, CSV is quite orderly and predictable. OP only imagines it's a mess because of some skewed experience they've personally had.

> "One quote in the wrong place and the file is invalid."

Any file that is encoded improperly is invalid, whatever the format. Imagining that this is strictly limited to CSV is just that: a creation of OP's imagination.

> "It is difficult to parse efficiently using multiple cores, due to the quoting (you can’t start parsing from part way through a file)."

This is absolutely untrue. Anyone who's written a high-volume streaming CSV parser that is multi-process and multi-threaded should be giving OP the side-eye here.

Yes, you can absolutely start parsing from any arbitrary point in the file, and yes, in the degenerate worst-case scenario, you may need to rewind back to the beginning of the file if the data exists in such a way that your arbitrary starting point requires it, but that is true of ANY file format: it is possible to craft a worst-case scenario and specific starting position within that data that would require this. CSV is no exception.

Hope this helps you better understand what I had written.


> Removing the ability to embed record delimiters, for example, means you can process the records in parallel. That’s a massive improvement all by itself.

I can process records from a csv in parallel just fine. Lexing isn't slow.

.csv stream -> tokens -> records (-> aggregate into ~100ms blocks) -> work queue with N workers


You lex the whole file, separate it into records and then process them?

Doesn’t sound like you’re processing the file in parallel to me.

TFA’s way would allow you to start at any arbitrary file position, just advance until you find a record separator then start parsing from there without having to worry about starting in the middle of a quoted string.


Not like sequential steps, a whole bunch of foreach / yield return loops until the last one dumps things onto a thread-safe queue.


The proposal is TSV but using ^_ instead of ^I (tab) to separate columns and ^^ instead of ^M (CR) or ^J (LF) to separate rows. So instead of being unable to put ^I and ^J in your table cells you are now unable to put ^_ and ^^ in your table cells.

That sounds exactly as good, or bad, as TSV. So, okay? Sure, do that if you like, that sounds fine.

Emacs TAGS files, Info files, and BABYL mailboxes https://quimby.gnus.org/notes/BABYL similarly use weird control characters as delimiters.


US and RS characters are much less likely to appear in textual data than Tab or LF.

Also I believe this was the original purpose for US and RS in ASCII. So I'm not sure it is 'weird' to use them for that (although it does seem to be uncommon).


Yeah, it kind of was; also FS and GS. GNU Info files and Babyl files use US that way. Really though you can define whatever weird format you want if you're writing the code for it.


> But it is binary, so can’t be viewed or edited with standard tools, which is a pain.

I've heard this sentiment expressed multiple times before, and a minor quibble I have with it is that the fact that it's binary has nothing to do with whether or not it's a pain. It's a pain because the tools aren't ubiquitous, so you can't count on them always being installed everywhere. But I'd argue that sqlite _is_ ubiquitous at this point and, as others have mentioned, it's a _great_ format for storing tabular data.

JSON is also a fine choice, if you want it to be human readable, and I'm not sure why this is claiming it's "highly sub-optimal" (which I read as dev-speak for 'absolute trash'). JSON is extremely flexible, compresses very well, has great support for viewing in lots of editors, and even has a decent schema specification. Oh, and line-delimited JSON is used in lots of places, and allows readers to begin at arbitrary points in the file.


JSON is good for structured data, but I prefer TSV for simple human-readable tabular data. In situations where it's the right choice, a TSV file consists of data and whitespace and nothing else. You can view and edit it with any imaginable tool, and there is no overhead in the form of delimiters and encodings distracting you from the data.


I really like LTSV. (That stands for labeled tab-separated values.)

LTSV is basically equivalent to a JSON object per line. You have columns consisting of a label, a colon, then a value. The columns are then separated by tabs. The value can be quoted. If you need a tab in the value, it goes inside the quotes.

As the http://ltsv.org/ site suggests, I use it for logging too, so that a log line is easily parseable and a log file is basically a table. Notice there are parsers for many languages, and there are several tools supporting it, including fluentd.
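
A single LTSV record is easy to pick apart by hand; a minimal sketch that ignores the optional quoting mentioned above (the field names are made up):

    def parse_ltsv_line(line):
        # label:value pairs separated by tabs; split on the first colon only.
        return dict(field.split(":", 1) for field in line.rstrip("\n").split("\t"))

    parse_ltsv_line("time:2022-05-03T12:00:00Z\tlevel:info\tmsg:started")
    # -> {'time': '2022-05-03T12:00:00Z', 'level': 'info', 'msg': 'started'}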


Related thread: CSVs: The Good, the Bad, and the Ugly (2020)

https://news.ycombinator.com/item?id=25014721

FWIW I have designed an upgrade to TSV as part of https://www.oilshell.org. Oil actually emits it now from 'pp proc' to pretty print "functions" and their docstrings as a table. It will be used in many parts of Oil, most of which aren't implemented yet.

It's called QTT -- Quoted, Typed Tables (formerly QTSV)

It's built on top of QSN, which is just Rust string literal notation -- https://www.oilshell.org/release/latest/doc/qsn.html

There is a stub doc which links to a wiki proposal -- http://www.oilshell.org/release/latest/doc/qtt.html

------

I think the only way that such a format becomes popular is if Oil itself becomes popular, so I haven't really emphasized it as a spec. (Similar to how JSON wouldn't be popular if the browser / JS weren't popular)

But it is a pretty easy spec -- just take TSV, add types to the column header, and specify that any cell that begins with a single quote is QSN.

A nice property is that almost every TSV file is a valid QTT file -- the exception being TSV files where a cell is just a single quote, or contains malformed QSN.

Note that TSV officially can't represent fields with tabs, but QTT can with '\t'.

As always feel free to contact me if you want to work on any of this

----

The ASCII field separators don't make sense -- they break tools and take you off the narrow waist of Unix. https://www.oilshell.org/blog/2022/02/diagrams.html

A key design point of QTT is that it's a special case of the "lines of text" narrow waist. CSV does not have that property, because fields can contain embedded newlines. And unlike TSV, QTT cells can contain arbitrary values.


You should make the connection of QTT and QSN to Rust more clear by coming up with a backronym for RSON.


Hm I wasn't aware of RSON. https://github.com/rson-rs/rson

QSN isn't intended to be tied to Rust in any way (and isn't), while RSON says it uses the Serde data model.

This gets at an issue I have been having a hard time explaining, mentioned here:

http://www.oilshell.org/blog/2022/03/backlog-arch.html

That is, narrow waists are necessarily a COMPROMISE. JSON is a compromise, and Rust users will be equally unhappy as Lua or Erlang users. That is a feature and not a bug for something meant of interoperability. You are "stuck with" the lowest common denominator, but that's what enables interop.

(The lowest common denominator of them all is Unix byte streams.)

I contrast "monoglot" serialization formats like Python pickle and Go .gob with language-independent formats like JSON, TSV, and HTML. The wisdom of JSON is that Crockford specified it independently of JavaScript.

But both are useful.

It's not clear if RSON is meant to be monoglot or polyglot, but it's a huge difference and it seems more monoglot. QSN on the other hand is definitely a polyglot design like JSON, despite being derived from Rust.


If we are talking about machine transfer, there are tons of formats that work, and many of them are binary. xls, ods, SYLK, and dta are examples (not sure about UTF8 with SYLK and dta). Then there are the "just write it in code" formats like SQL dumpfiles (some xml fits here, too) - which work well and carry really nice, readable type declarations.

If we are talking about human readable, text editor safe formats, we have lots of challenges. ASCII delimiter characters don't display and aren't on the keyboard (and will sometimes have side effects with old wire protocols). Nearly any character that is on a keyboard (or international keyboards) eventually becomes a problem because you will have to escape them when they are not used as a field (, is the classic example) or record delimiter (For example, <CR> can be valid inside of a longer text field). Also, many of the first 31 ASCII characters (the control characters) are used in different and surprising ways (the classic being null) by hardware and software.

BTW what makes CSV bad is that there is no escape character for , or ". Likewise, TSV has the same problem with tabs. JSON, YAML, and XML seem to be much better, but all have edge cases.


if you are ok with a binary format there is apache parquet or apache feather or 'jay' (https://datatable.readthedocs.io/en/latest/api/frame/to_jay....).


No doubt binary formats like Parquet are the way to go for high performance with multi-GB datasets. Seems like total overkill if you have a few hundred or thousand rows of data though. Being able to create/edit/view stuff in a text editor and easily version it is very useful.


It’s strange that we pretend text is not binary. The truth is our many tools are set up to handle binary text data and these tools are not set up for alternate encodings.

If you grab a text file from a Windows machine and bring it to a mac, you’ll see that txt is far from perfect.

This is a long way of saying that if we develop both the format and the tooling then the distinction of text vs “binary” tabular data goes away.


Fair point. But the vast majority of programming tools do already handle UTF-8 text (although perhaps don't do a great job with US/RS characters).

Also, you could write an efficient parser for US/RS-separated data in 5 minutes. For parquet you would have to integrate with a library (with all its dependencies and design choices) or spend days writing your own parser.


Text is text. It's stored as bits, typically 8 bits per character. It allows arbitrary precision but is very space (and memory throughput) inefficient. Fine, compression works but there are many more bits flying around and you're spending CPU cycles on decompression.

What could you do instead? Use appropriate integers, fixed point, or floating point representations. Everybody here knows these things. Nobody is pretending anything.

To the point of portability: IEEE-754 is the same everywhere.


Do people really edit csvs in a text editor? It's horrific, the columns don't line up at all, empty cells are represented by a bunch of commas in a row (which, are you supposed to count all the commas?)

And in terms of versioning, I have seen people commit diffs of csvs before, and they're equally unreadable.

CSV is a plain text format, but that basically buys you nothing. As long as you're going to be loading it into excel or whatever anyway, might as well just pick a good binary format like parquet.


I create and edit CSVs by hand daily. To create/modify simple examples to provide technical support for a data transformation tool.


I frequently edit CSVs by hand. As long as you understand your data, and stakes are low (e.g. it's a one off analysis not prod code) it's pretty easy and convenient.


I frequently use the table syntax in org-mode. No, I wouldn’t edit CSVs because they are miserable, but a friendly syntax with some basic text editor support, absolutely.


Rainbow CSV in VSCode makes it a lot easier to deal with.


Parquet is the answer, human readable is not a valid requirement for tabular data anyway. If you absolute must be human readable CSV is good enough.


You can also easily read it in and view it in a human readable format, I’ve never had a problem with that


> it could be called something like “unicode separated value”

There is. USV uses Unicode Record Separator and Unicode Field Separator.

And data handling tools such as Miller offer USV:

https://miller.readthedocs.io/en/latest/file-formats/#csvtsv...


The proposed format is reasonably sane, but you really want to prevent people from writing them by hand, and to add a bit of metadata to describe the column data types at a minimum, and ideally more information such as allowed values, semantics, etc.

To that end, I suggest putting the tabular data file, along with a metadata descriptor file, inside an archive format (zip, tarball, etc.); that would put just the right size speed-bump to encourage accessing the data through tools and libraries (though if someone is just a bit determined, reading and editing the contained data directly isn't actually impossible or forbidden).

All that said, if you want a better tabular data interchange format badly enough that you're considering devising one, you should probably look at using something even more featureful, like SQLite:

https://www.sqlite.org/appfileformat.html


I put some work into creating a standard, csvz, for putting csv files and their metadata into a zip file.

https://github.com/secretGeek/csvz

It’s a pretty powerful concept.

SimonW’s preferred technique of using sqlite as the means of exchange is also very powerful. Particularly when combined with all of the utils he maintains.


Being able to create small tabular datasets by hand is incredibly useful to me (doing support for data wrangling software).

Having an optional associated meta data file would be useful though.


Are you really typing the data directly into a csv file, or are you exporting csv data from something like a spreadsheet?


I agree with the author. There’s merit in an open format to be developed with a simple spec as pretty much defined here. Of course more formalism should be attached to it, i.e. BNF (am I too old?)

Anyway, one of the giants making billions out of data should maybe commission the publication of such a standard and the implementations in every programming language will follow, I’m sure.

The Unicode bit alone and an enforced MIME type make the case in my opinion, and it should have streaming capabilities as well. Maybe chunking of content so that it can be used for caching and compression, ideally still human readable but not marked up like XML so you can’t see the end of it.


I just came up with a new format guys, BCSV!

    from base64 import b64decode

    def parse(fileData):
        # Each cell is base64-encoded, so commas never need escaping.
        rows = []
        for line in fileData.splitlines():
            rows.append([b64decode(cell) for cell in line.split(",")])
        return rows

Tada!


I literally don't get why JSON is bad:

[{row1}, {row2}, {row3}]

The fact that it can do more is in no way a negative. You can even make a limited JSON parser with reduced capabilities. And with JSON you can do more, like defining header names vs column names vs just arrays of arrays.


Doesn't have a great int64 number story, no native dates / date-times. If you want named tuples, then the names need to go everywhere, otherwise it's a serialization mechanism on top of JSON.


Number size is out of JSON scope, int64 is a problem of JavaScript no matter what format you use.


It also has opinions about floating point numbers.


Doesn’t open in Excel.

And since it doesn’t require one record per line it can be a hassle to read without having to parse it.

It’s really nice to be able to do “head -n 5”


jsonlines (jsonlines.org) is one record per line.


I think the problem is that there is no clear standard or convention for how to do this. Indeed there are many ways to represent tabular data. Most large software projects probably have at least one bespoke file format that they use internally. The trouble is the lack of such a format that everyone agrees on.


A JSON array doesn't allow easily appending a row. JSONLines is a bit better.


I remember my annoyance at XML that you couldn’t simply append a row, due to the closing tag.

I think there was some off-spec dialect called something like “appendable xml” (only snappier) that said the closing tag of the root element was optional, but it was clearly a hack.

I wanted there to be no root element at all, and the filename (or, the last part of the url) would act as the root element. Kind of weird idea in hindsight, but kind of useful too.

One downside is that there’d be no obvious way to specify attributes on the root element.


That's inferior to even csv in all its forms. It cannot be appended to w/o removing the trailing square bracket. It requires column names for each row, and you can't even know how many rows there are w/o parsing. It requires quotes (technically csv allows multiline rows but that's rarely used in practice).

json is absolutely horrid for this type of hack-it-in job. Tabular data just isn't deeply structured data; it should be easy to cut and paste.


The repetition of the keys seems like the sore point here.

I could see a specialized form of JSON using jagged arrays

    {"Header": ["Name", "Street Address", "City"]
    "Body": [
      ["Alice", "123 Fake St", "Faketon"],
      ["Bob", "987 Any Pl", "Anytown"],
    ]}
in that way the keys aren't repeated. It wouldn't be coherent useful JS objects when deserialized, but it would be trivial to convert js table object into a true array of Javascript objects.
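
That last conversion really is trivial; a sketch, assuming the Header/Body layout above:

    def to_records(table):
        # Zip the shared header with each body row to get one dict per record.
        return [dict(zip(table["Header"], row)) for row in table["Body"]]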


Not great because you can't just append records without modifying the entire "body". Better to use JSONL!


From TFA:

>> XML and Javascript are tree structures and not suitable for efficiently storing tabular data (plus other issues).


"Javascript" was meant to be "JSON". (now fixed)


What does the acronym TFA mean here?


"The Fine Article" (though more traditionally, and/or depending on tone, "The Fucking Article").


I thought it was “the featured article” — in any case, it means in this context — “the linked url that this hacker news thread is discussing.”


Not being able to do more is exactly the point. By restricting the space of the file format, we can free the mind to think about how to fit the data to the format.

If you can do anything, it becomes hard to do the right thing. If you can only do the right thing, it becomes trivial to decide.


This table has one column, one row, and thus one cell. What is its value?

[{"col1":"val1", "col1":"val2"}]


If you're really trying to replace csv with json, it would probably look like

    ["header1", "header2"]
    ["val1", "val2"]
    ...
The entire file isn't valid JSON, but if you load each line independent of the next (jsonl), then you're fine.
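
Reading that layout back is only a couple of lines; a sketch with a hypothetical file name:

    import json

    def read_jsonl_table(path):
        with open(path) as f:
            header = json.loads(next(f))  # first line: column names
            return [dict(zip(header, json.loads(line)))
                    for line in f if line.strip()]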


I've done exactly what the author suggests for the exact same reasons. CSV headaches got me searching for other delimiters and I saw there were already unit and record separators, "WTF!?".

As long as you're dealing with text, and you're writing and parsing the data yourself, and you never need to use tools other than things that can be modified to set the item and line delimiters, then it's great. I haven't used the approach since then.

Mostly I haven't used it because half my use cases need binary numerical data (floats) and the others can be worked with entirely in Excel or as CSVs. But I like the idea, even if the support just isn't there.


Have you see tiledb? https://tiledb.com/data-types/dataframes My team is currently transitioning from HDF5 to tiledb for genomics data.


Hi folks, Stavros from TileDB here. Here are my two cents on tabular data. TileDB (Embedded) is a very serious competitor to Parquet, the only other sane choice IMO when it comes to storing large volumes of tabular data (especially when combined with Arrow). Admittedly, we haven’t been advertising TileDB’s tabular capabilities, but that’s only because we were busy with much more challenging applications, such as genomics (population and single-cell), LiDAR, imaging and other very convoluted (from a data format perspective) domains.

Similar to Parquet:

* TileDB is columnar and comes with a lot of compressors, checksum and encryption filters.

* TileDB is built in C++ with multi-threading and vectorization in mind

* TileDB integrates with Arrow, using zero-copy techniques

* TileDB has numerous optimized APIs (C, C++, C#, Python, R, Java, Go)

* TileDB pushes compute down to storage, similar to what Arrow does

Better than Parquet:

* TileDB is multi-dimensional, allowing rapid multi-column conditions

* TileDB builds versioning and time-traveling into the format (no need for Delta Lake, Iceberg, etc)

* TileDB allows for lock-free parallel writes / parallel reads with ACID properties (no need for Delta Lake, Iceberg, etc)

* TileDB can handle more than tables, for example n-dimensional dense arrays (e.g., for imaging, video, etc)

Useful links:

* Github repo (https://github.com/TileDB-Inc/TileDB)

* TileDB Embedded overview (https://tiledb.com/products/tiledb-embedded/)

* Docs (https://docs.tiledb.com/)

* Webinar on why arrays as a universal data model (https://tiledb.com/blog/why-arrays-as-a-universal-data-model)

Happy to hear everyone’s thoughts.


TileDB-VCF does work very well. Which types of data stored as HDF5 are you ingesting into TileDB?


I've used \u001F and \u001E for years. Don't need to check for and un/escape separator characters, very compact, and python's split() and join() makes implementation trivial.

Hope this article makes them more popular.
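
For illustration, the split/join approach amounts to something like this (a sketch: no escaping, so the data must not contain either separator):

    US, RS = "\x1f", "\x1e"  # unit separator, record separator

    def dumps(rows):
        return RS.join(US.join(cells) for cells in rows)

    def loads(data):
        return [record.split(US) for record in data.split(RS)]

    table = [["name", "city"], ["Alice", "Faketon"]]
    assert loads(dumps(table)) == table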


Plain ASCII has everything you need, but nobody uses these features because they’re non-printing chars. On the one hand you want a special format, but on the other you want to edit it like plain text?


The article's proposal seems worse than tab separated.

Not having escaping is unacceptable of course, and once you add escaping you might as well use the standard tab and newline characters as delimiters (and the \t, \n, \\ escapes plus \N for SQL NULL), resulting in a file that is properly formatted by the terminal and text editors and just works with Unix tools and most spreadsheet/database importers unlike a file with the weird separators the article proposes.
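
A sketch of that escaping scheme (roughly what PostgreSQL's COPY text format does); the helper names here are mine:

    def encode_field(value):
        # None becomes \N; backslashes must be doubled before adding \t and \n.
        if value is None:
            return "\\N"
        return value.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")

    def decode_field(text):
        if text == "\\N":
            return None
        out, i = [], 0
        while i < len(text):
            if text[i] == "\\" and i + 1 < len(text):
                out.append({"t": "\t", "n": "\n", "\\": "\\"}.get(text[i + 1], text[i + 1]))
                i += 2
            else:
                out.append(text[i])
                i += 1
        return "".join(out)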


>Not having escaping is unacceptable of course

How often do you use the ASCII US and RS characters in your data (apart from binary data blobs, which it wouldn't be a good fit for)?


I don't know why.

That said, I've enjoyed using Carlo Strozzi's NoSQL program in the past. It's not NoSQL in the current sense of the word. It's a plain text database that stores data in delimited text fields -- e.g. tsv or csv. Create those fields however you want. I just used rlwrap/readline in a simple shell program for input. It does the basic relational database things, but all with plain text. A text editor will work just fine for data entry, and for viewing it later. It will be around as long as Linux commandline utilities are around.

Your data can be viewed in a line-oriented or list-oriented fashion -- the program does the conversion for you in commandline fashion. And you can view everything in a text editor. I think it's just a bit of code tying together a bunch of standard Linux commandline utilities. It's not in a lot of repositories these days. But back when, I found it quite handy for working with line-oriented tabular data. It was the guts of a bailing wire and bubblegum program I wrote to create reinforcing steel barlists back when I only thought I knew what I was doing. My ignorance persists but now I'm aware of and at peace with it.


I like one of the json lines formats, although pretty sure you'd wanna treat floats as strings there, so they don't get mangled.

Each line is a separate object there.


https://git.sr.ht/~sforman/sv

> This is a simple file format for storing tabular content. It's meant to work like CSV files. So far only the low-level encoding is specified, handling e.g. type conversion, etc., is left to a higher-level abstraction layer (as yet unwritten.)

> An *.sv file consists of one or more documents separated by ASCII Group Separator character (0x1D), sv files are UTF-8 encoded.

> Each document can have an optional header indicated by the presence of ASCII Start of Heading character (0x01) followed by zero or more records of metadata. If present, the first record should be field names, the second record should be type names, any further records are user-defined. The first two records (names and types) must be the same length as the data records. The header is concluded by an ASCII Start of Text character (0x02).

> Records are separated by ASCII Record Separator characters (0x1E), and data units (fields) of each record are separated by ASCII Unit Separator characters (0x1F). All record values must be strings.

> ...a grammar:

    sv  ::=  doc [GS sv]

    doc  ::=  [header] records

    header  ::=  SOH records STX

    records  ::=  record [RS records]

    record ::= field [US record]

    field  ::= <sequence of zero or more non-sv Unicode characters>
A table of the sv ASCII characters:

    1   01  00000001    SOH &#001;      Start of Heading
    2   02  00000010    STX &#002;      Start of Text
    29  1D  00011101    GS  &#029;      Group Separator
    30  1E  00011110    RS  &#030;      Record Separator
    31  1F  00011111    US  &#031;      Unit Separator
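
A reader following that grammar is short; this is my own sketch, not code from the linked repo:

    GS, RS, US, SOH, STX = "\x1d", "\x1e", "\x1f", "\x01", "\x02"

    def parse_sv(text):
        docs = []
        for doc in text.split(GS):
            header = None
            if doc.startswith(SOH):
                # Header records sit between SOH and STX.
                head, _, doc = doc[1:].partition(STX)
                header = [rec.split(US) for rec in head.split(RS)]
            records = [rec.split(US) for rec in doc.split(RS)]
            docs.append({"header": header, "records": records})
        return docs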


1. Either you want to edit the file in a text editor and you need to have a separator that has a key on your keyboard. Comma is indeed not the greatest choice, but tab or vertical pipe are usually fine. Most parsers will allow you to define the separator.

2. Or you don’t and you’re much better off with any binary format.

Why aren’t unicode separators keys on your keyboard? Maybe that’s suboptimal, but good luck changing that.


> There doesn’t seem to be anything that is reasonably space efficient, simple and quick to parse and text based (not binary) so you can view and edit it with a standard editor.

> XML and Javascript are tree structures and not suitable for efficiently storing tabular data (plus other issues).

You can certainly be efficient with json(net). See:

Notice how they are separate objects:

   {'name': 'foo', 'age': 2}
   {'name': 'cat', 'age': 6}
You can do it very efficiently: https://github.com/simdjson/simdjson

Compress it if you need compact.

There's also UBF, but it never saw much traction: https://ubf.github.io/ubf/ubf-user-guide.en.html#specificati...


XML can be more efficient, since you do not have to surround the key with quotes. That saves two characters per key in the foo/cat records:

   <r name="foo" age="2"/>
   <r name="cat" age="6"/>


In my view you can't have it both ways - if you want efficiency it's not reasonable to expect to be able to hand edit the data in a text editor. Especially since at the scale that efficiency matters your text editor is going to struggle.

If you don't care that much about efficiency then CSV, JSON, XML all provide perfectly good solutions and they're actually more easily editable in a text editor than what is being suggested. It's also much more sensible if you are hand-editing tabular data to just slurp it into a jupyter notebook and do the edits there and shoot it back out, because it means you've got a straightforward record of what changes you've had to make and can re-run those changes at a click of a button.


Because a file (or computer memory in general) is a linear (1D) storage medium and tabular (2D) data is not well suited to linear serialisation. You can't avoid adding complexity (i.e. making trade-offs) to the format and any code that deals with it.


I've actually started working on something very similar: https://github.com/tmccombs/ssv

So far it's just a python library, but I'm planning on adding editor plugins at least for vim, vscode and maybe emacs, libraries for additional languages, and maybe some cli commands for it.

One distinction from the OP is the delimiters also include a tab (for fields) or newline (for records) by default (but not in "compact" mode). That has the benefit that the files are at least readable with editors and pagers that aren't aware of the format.


Amazon ION perhaps?

https://amzn.github.io/ion-docs/guides/cookbook.html#convert...

As for why, the rationale is that tabular data often needs more than the things that the article decries (e.g. efficient parsing / processing of partial records, sparse values, richly typed values, flexibility of schema evolution (i.e. not having to assume specific columns upfront)...


If it's tabular, I want schema for the columns. Is this column a 'number' or a string? Even better, is there a max length or precision known? Can the cell be null and how is that represented? How are dates formatted? Are they UTC/specific TZ, etc.

Most of my complaints about CSV relate to trying to determine the types used to parse or import as, not how commas are escaped. Excel, for example, actually warns you about this if you try to save a native Excel file as CSV: you are effectively throwing away information.


Mostly the problem comes from how excel is apocalyptically shitty at inferring datatypes, incorrectly assuming non-dates are dates and ISO8601 dates are god knows what, when a sane format would default to text if it didn't know better.


That kind of frustration was what drove me away from spreadsheets in the end. Someone sent me a native excel file for the first time in ages the other day and I just opened it in Pandas without thinking about it.

Pandas also gets things wrong (eg give it a column of unix timestamps with one missing value and there's a good chance it will treat them as float) and it took me a while to adapt to its vector math approach. But once I got comfortable with it working with a new pile of messy data began to feel pleasant rather than annoying.


I think CSV is a decent file format for tabular data. The author claims that CSV files are

> difficult to parse efficiently using multiple cores, due to the quoting (you can’t start parsing from part way through a file).

But I do not see why this is the case.

Step 1: loop over file (in parallel) to determine indices of quote characters

Step 2: loop over indices outside quote regions (in parallel) to determine indices of comma and return characters

Step 3: return two dimensional integer array with pointers to cells of table
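
A rough sketch of steps 1 and 2, assuming RFC 4180 quoting without escaped "" quotes; the chunked quote scan needs to run under a __main__ guard so the process pool can spawn workers:

    from concurrent.futures import ProcessPoolExecutor

    def quote_positions(args):
        # Step 1 worker: absolute indices of '"' within one chunk of the file.
        text, offset = args
        return [offset + i for i, c in enumerate(text) if c == '"']

    def index_csv(data, workers=4):
        size = max(1, len(data) // workers)
        chunks = [(data[i:i + size], i) for i in range(0, len(data), size)]
        with ProcessPoolExecutor(workers) as pool:
            quotes = sorted(q for part in pool.map(quote_positions, chunks) for q in part)

        # Pair consecutive quotes into quoted regions.
        regions = list(zip(quotes[0::2], quotes[1::2]))

        def quoted(i):
            # Naive membership test; fine for a sketch.
            return any(start < i < end for start, end in regions)

        # Step 2: separators outside quoted regions (could be chunked the same way).
        commas = [i for i, c in enumerate(data) if c == "," and not quoted(i)]
        newlines = [i for i, c in enumerate(data) if c == "\n" and not quoted(i)]
        return commas, newlines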


I'm inclined to agree. CSVs which are well-formed (escapes within fields handled consistently) shouldn't be that hard to parse.

I can't think of a reason your algo wouldn't be logically sound for good CSV files, although a little backtracking might be necessary to recognize escaping of delimiters in edge cases.

The author writes "CSV is a mess. One quote in the wrong place and the file is invalid.", but what logical formats can tolerate arbitrary corruption? An unclosed tag is similarly problematic for xml. In both cases you wind up falling back to heuristics.

It's true that CSVs often contain a mess of encodings inside fields, but that's not the problem of the CSV format per se. Validation of field encodings, or validation that the entire file is in a uniform encoding... those are separate requirements.


Indeed, every distributed query engine I've used can easily parallelise CSV in the same file (so long as it's splittable, friends don't let friends gzip their data), with the option to ignore bad rows, log them, or throw your hands up and die.

Admittedly, all of them are Java based and use Hadoop libs for handling CSV, which makes sense, the Elephant ecosystem has spent years getting this stuff right.


“No escaping. If you want to put \u001F or \u001E in your data – tough you can’t. Use a different format.”

I fully support this format just for this.


Simple format resolving most of the issues:

http://www.tsvx.org/


I don't think anyone wants tabular data. They want to ingest it into a system so they can query it, or join it, or aggregate it. They want to get rid of tabular data as quickly as possible

Data at the boundary must be validated, and a file format can't do that for you. Semantics is harder than syntax and you can't push every problem to that level.


The finance world runs on tables. They want to ingest it, query it, join it, aggregate it ... and look at the result in tables. Tables are everywhere, and a generic tabular format is constantly useful for a la carte one off spontaneous data processing. I love strict schema validation as much as the next guy, but sometimes you just want to cut and paste a table.


yeah, um, my life is in tables (stats, finance, etc).

while csv is less than optimal, it's always worked out in practice for me (because we still have to validate anything of significant size and nothing finance or city/nation sized should generally be edited by hand anyway, and errors are everywhere independent of format).

honestly, my bigger nightmare has been people trying to put things into something like xml...

edit: not that there aren't huge problems with the csv/excel/database software and interaction: generally I just end up having to assume everything is text and defining type information my end afterwards on successful parsing


We've had XML for decades and the default interop is still tabular data - probably in large part because unless you go into non-standard extensions, tabular data is the easiest thing to get from a database query.

Spreadsheets haven't exactly gone away either.


And the data in the system is… almost always tabular? Either Excel (which runs half the world) or RDB/SQL (which runs the other half).


It's tables all the way down...


> Why can’t we have a format where:

> <A bunch of rules>

Erm, we do. You just defined it and gave it a name. There's nothing stopping you using it in your software. Persuading others to use it will be trickier given the existing widely-supported options really aren't so bad, but you could give it a go.


Perhaps we “just” need Excel and Numbers to support reading .parquet files, and come up with a CLI tool to view them like head/tail/awk? Yes, they still cannot be read as plain text so functionality is quite restricted with respect to CLI pipelines


Why are all these "separated"? Sounds like null-terminated C strings. Wouldn't it be better to put the dimension in a header and then just dump the contents unseparated? Why the need to view them in simple text editors?


What you suggested works for fixed width fields only, doesn't it?


Yeah, true. Still, one or more bytes of overhead per cell and row/column seems wasteful.


Fixed width formats are only efficient when everything in a column is a similar length. You only need a single 1000 char data value in a column to mean either the whole column is 1000 chars or the data is truncated.


Space efficient, that is.

Fixed width formats are pretty efficient for quick random access.

Btw, you can have a mostly fixed width format, but with an escape that puts larger value somewhere else. That's more or less what some relational databases are doing, I think.


I had also thought of the same format described there (but without Unicode), as have others. The SQLite command shell also implements this format as one of its options.


I prefer tab-separated to comma-separated. With an editor like Excel etc. you can then see exactly where the columns are and which column each datum belongs to.


Why not http://ltsv.org/ ???
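
(For anyone unfamiliar: an LTSV record is just label:value pairs separated by tabs, one record per line. Roughly, in Python:)

    line = "host:127.0.0.1\tstatus:200\tsize:5316"
    record = dict(field.split(":", 1) for field in line.split("\t"))
    # {'host': '127.0.0.1', 'status': '200', 'size': '5316'}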


Take a look at the LAS standard - commonly used for oil field electric log data; it has its positives re indexing.


Don't know why csv can't be parsed in parallel by splitting across lines.


Because the line break might be escaped and inside a data value. You have to parse all the way from character 1 to be sure.
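
A quick way to see it with Python's standard csv module (made-up two-record input):

    import csv, io

    raw = 'id,comment\n1,"line one\nline two"\n'
    print(raw.splitlines())                    # naive split: 3 "lines", the second record cut in half
    print(list(csv.reader(io.StringIO(raw))))  # proper parse: 2 records
    # [['id', 'comment'], ['1', 'line one\nline two']]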


I thought in-band signalling was generally considered harmful these days?

+++


You mean, everybody prefers FTP to HTTP?


This tabular file format doesn't exist.com


What about a HTML/XML table?


SQLite is the answer.


SQL INSERTs


This is a perennial topic. Related prior discussion on this forum alone is voluminous:

> I'm surprised never to have seen any support for CSV files using US/RS instead of TAB/(CR)LF or comma/(CR)LF. Or any support in Unix to use RS instead of LF or NUL to delineate records.

February 27, 2022, on: "The Great Curly Brace Trace Chase (2003)"

https://news.ycombinator.com/item?id=30493914

> If we're going to change the format, why not just use the record separator, and field separator characters.

August 19, 2021, on: "A straightforward way to extend CSV with metadata"

https://news.ycombinator.com/item?id=28236339

> Time to retire the CSV?

August 18, 2021

https://news.ycombinator.com/item?id=28221654

> I think we should just stop using commas and newlines and start using the ASCII unit separator and record separator

June 3, 2021, on "RFC 4180: Common Format and MIME Type for CSV Files (2005)"

https://news.ycombinator.com/item?id=27381832

> The record separator ship sailed long ago. The advantage of newline is that it’s easy to use with a bunch of standard tools like less/head/tail/your text editor.

May 25, 2021, on "Newline Delimited JSON"

https://news.ycombinator.com/item?id=27282810

> What both models lacked was a good way to handle optional/sparse fields.

February 26, 2021

https://news.ycombinator.com/item?id=26271877

> It's very annoying that ascii includes file and record separator characters and csv still exists

December 18, 2020 on: "What If OpenDocument Used SQLite? (2014)"

https://news.ycombinator.com/item?id=25466122

> ASCII 31 is a "unit separator" (or field separator as we'd call it today) and ASCII 30 is a record separator.

November 29, 2020

https://news.ycombinator.com/item?id=25248935

> Easy: CSVs are human readable and writeable and humans tend to have a comma sign on their keyboard, while they don't have the ASCII record seperator visible (teaching them to use eg Alt + 30 will not work).

May 15, 2020 on: "So you want to write your own CSV code (2014)"

https://news.ycombinator.com/item?id=23190918

> ASCII also includes control characters for delimiting text records. If people used these purposed-designed characters instead of comma or tab characters as delimiters, we could avoid many headaches quoting and escaping CSV data.

September 26, 2019, on: "Four Column ASCII (2017)"

https://news.ycombinator.com/item?id=21077054

> many of the pain points of CSV/TSV could be addressed by using US (unit separator) or RS (record separator) bytes instead.

April 11, 2018, on: "Problems with CSVs (2016)"

https://news.ycombinator.com/item?id=16812211

> ASCII actually contains four control characters for this purpose: file, group, record and unit separator. I'm sure a lot of problems would be solved if people just used these.

April 11, 2018, on: "Problems with CSVs (2016)"

https://news.ycombinator.com/item?id=16810744

> Ask HN: Why isn't ASCII codes 28 – 31 used more often to serialize tabular data?

December 15, 2017

https://news.ycombinator.com/item?id=15934407

> ASCII defines a unit separator, and a record separator. I wish we would just bite the bullet and start using this 50 year old solution.

December 27, 2016

https://news.ycombinator.com/item?id=13267340

> It really makes me sad that CSV even exists: ASCII defines field ('unit') & record separator characters (also group & file, but those are less-useful), as well as an escape character.

June 8, 2016 on: "ParaText: CSV parsing at 2.5 GB per second"

https://news.ycombinator.com/item?id=11862769

> ASCII has field record separator for instance. Free CSV.

July 2, 2015 on: "Stop the Vertical Tab Madness (2010)"

https://news.ycombinator.com/item?id=9817935

> Good old ASCII has characters specifically devoted to separating fields, keys, etc. that no-one uses for anything else. Why not use them instead of inventing a new character that does the same thing?

June 5, 2015 on: "Almost every Cassandra feature has some surprising behavior"

https://news.ycombinator.com/item?id=9666275

> ASCII Delimited Text – Not CSV or TAB delimited text

March 26, 2014

https://news.ycombinator.com/item?id=7474600



