Seems like the problem here is that there are several high-quality, well-developed formats, but the author and the commenters here dismiss them because of the different trade-offs they make.
csv -- Simple for simple use cases, text-based; however, it has many edge cases, lacks features, etc
xlsx -- Works in excel, ubiquitous format with a standard, however complicated and missing scientific features
sqlite -- Designed for relational data, somewhat ubiquitous, types defined but not enforced
parquet / hdf5 / apache feather / etc -- Designed for scientific use cases, robust, efficient, less ubiquitous
cap'n proto, protocol buffers, avro, thrift -- Have specific features for data communication between systems
xml -- Useful if you are programming in early 2000s
GDBM, Kyoto Cabinet, etc -- Useful if you are programming in late 1990s
Pick your poison I guess. Engineering is about trade offs.
I've seen ubiquitous use of tab-separated value files instead of csv, as a simpler format without quoting support and with a restriction that your data fields can't contain tabs or newlines, which (unlike commas) is okay for many scenarios.
Gotta wonder why the format isn't just a column separator char, a row separator char, and then all the data guaranteed not to have those two chars. Then you could save the thing by finding any two chars that aren't used in the data.
this doesn't directly address the ease-of-editing concern, but if you have csv (or tsv) and want ascii (0x1f, 0x1e) or unicode (U+241F, U+241E) separated (or vice versa), 'miller' [0] is a command line tool that supports converting among these formats with these one-liners:
mlr --icsv --oasv cat input.csv # csv to asv
mlr --iasv --ocsv cat input.asv # asv to csv
Thank you for pointing them out. I've browsed and searched through that ascii table countless times, usually looking for something in particular, yet never realized that there was such a thing as separators before.
It makes more sense to me in these days of ubiquitous encoding of in-band data that it would be easier to edit if we just used tabs and newlines. If you need a tab in your data, the standard could support either \t (requiring \\t if you need a literal backslash followed by a literal 't') or %09 or &09; or \x09 or something.
If you require escaping then you may as well stick with csv - better to have the character that needs to be escaped to be common enough that people will run into it during testing.
My actual preference is LTSV. Then all that needs escaping or encoding is a quote since a tab or newline inside the quotes is different from one outside the quotes. And I prefer encoding to escaping generally, since it's easier for someone else to come along and build a fairly naive parser.
It’s a very meta-level problem. The main problem with CSV is “this is CSV” is an ambiguous statement. One could commit to a rule set, and give it a new name. Probably, someone has done this already.
The DBT (DBase III) format was common in the 80s and 90s. It is a typed, fixed-width format that was directly supported by Excel and Visual Basic grid widgets, among many other tools.
For example, Norton Commander supported it directly, letting you preview database tables without loading another program.
>The DBT (DBase III) format was common in the 80s and 90s.
That should actually be DBF, the format for the main database tables. DBT was an ancillary format for the memo fields, which were used to store longer pieces of text in one column of a DBF record.
Overall, that generic format and associated software is called XBASE. People still use it in production. And data entry (CRUD) using it with plain text or TUI DOS-style apps blows Web and even GUI apps out of the water in speed of use.
The latest version of SQLite has a STRICT keyword to enforce the declared data types. This option is set per table, but even in a STRICT table you can specify the type of some columns as ANY if you want to allow any type of data in that column (this is not the same meaning ANY has in non-strict tables).
You are limited to the basic types (int, floating point, string, and blob), however. I can somewhat get behind the opinionated argument for not needing more specific types like most common language types, but not the lack of a date type.
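As a minimal sketch of the above, assuming a SQLite build new enough for STRICT (3.37+) reachable from Python's sqlite3 module; the table and column names are just for illustration:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE measurements (
        id    INTEGER PRIMARY KEY,
        value REAL,
        extra ANY              -- accepts any type even in a STRICT table
    ) STRICT
""")
con.execute("INSERT INTO measurements (value, extra) VALUES (?, ?)", (1.5, "note"))
try:
    # In a STRICT table a type mismatch is an error, not a silent coercion.
    con.execute("INSERT INTO measurements (value) VALUES (?)", ("not a number",))
except sqlite3.Error as e:
    print("rejected:", e)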
i've also found this to be truly bizarre. even more bizarre than not actually respecting (via coercion or error) the specified type... why even have types, then?
What's so special about having a named type for datetime? User will still need to call functions to manipulate the dates. If only for the default display and import?
I've seen a sqlite database with datetimes in three different formats in the same field, because different parts of the application I inherited had different ideas of how to write a datetime and sqlite accepts everything. It's only a string after all.
That's a mistake that the same bad developer couldn't have made with PostgreSQL or MySQL.
I understand that SQLite can store datetime in REAL type no problems, just on import one needs to convert the date strings properly. Of course, stuffing strings verbatim may work too, but that's about the robustness of the load process or the ETL pipeline.
For dates, having a specific date type instead of a text field is required to have proper behavior when sorting and efficient storage/data transfer.
It's also important to have date-related functions on the DB server side, so that you can use them in filtering data before it gets sent over to the user code running on the client, to avoid unnecessary data transfer and allow proper use of indexes in optimizing it.
Also, it is nice if a DB engine can perform the equivalent of `WHERE year(date)=2021` without actually running that function on every date, but rather automatically optimize it to an index lookup of `WHERE date between '2021-01-01' and '2021-12-31'`.
> ...`WHERE year(date)=2021` without actually running that function on every date, but rather automatically optimize it to an index lookup of `WHERE date between '2021-01-01' and '2021-12-31'`
Sure this would be handy. Are there engines that implement such optimization?
I can also see how the 'dated' WHERE clause could be used directly in SQLite to leverage the index. Of course, using year() is more expressive. It may also make sense in such a case to simply add a year column and have it indexed.
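To make the point concrete, here is a minimal sketch using SQLite from Python (SQLite has no year(); strftime('%Y', ...) plays that role here). The table and index names are made up for illustration; EXPLAIN QUERY PLAN typically reports an index SEARCH for the range form and a full SCAN for the function form, which is exactly why an engine that rewrites one into the other would be handy:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, day TEXT)")
con.execute("CREATE INDEX idx_events_day ON events(day)")

# Range predicate on the indexed column: the planner can use idx_events_day.
print(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events "
    "WHERE day BETWEEN '2021-01-01' AND '2021-12-31'").fetchall())

# Wrapping the column in a function defeats the index and forces a scan.
print(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events "
    "WHERE strftime('%Y', day) = '2021'").fetchall())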
Having a standardised format for data from different sources. If there's no standard people will use their OS specific format. That makes it harder to compare datasets from different sources
Most csv utilities support an alternative delimiter. If I need to edit a file by hand, I'll typically pick an uncommon character for the delimiter (pipe "|" works well since it's uncommon). For me, that pretty much entirely eliminates any of the pain with CSV.
I've often wondered what would happen if there was a standard text editor plugin for dealing with parquet and co.
It seems like these formats are disliked, as they are difficult to inspect - but there really isn't any reason UTF-8 bytes arranged in a large sequence (aka CSV) should be any easier to read except for editor support. Sure writes would be slower, but I'd expect most users wouldn't care on modern hardware.
I suppose that another possible format is PostScript format, which has some of its own advantages and disadvantages. It has both text and binary formats, and the binary object sequence format can easily be read by C code (I wrote a set of macros to do so) without needing to parse the entire structure into memory first; just read the elements that you need, alone, one at a time, in any order. (I have used this format in some of my own software.)
JSON also has its own advantages and disadvantages.
Of course, both of these formats are more structured than CSV, but XML is also a more structured format.
(And, like I and others have mentioned too, the format they described isn't actually unique; it just doesn't seem to be that common, but there are enough people who have done the same thing independently, and a few programs which support it, that you could use it if you want to, and it should work OK.)
One thing of note is that there isn't a single format that's optimal for storing tabular data - what's optimal depends on your use case. And if performance doesn't matter much, just use CSV.
As a simple example, column based formats can significantly speed up queries that don't access the full set of columns. They can come in handy for analysis type SQL queries against a big lump of exported data - where different users query the subset they are interested it.
Replying to myself, because after reviewing other comments I realized I didn't read the linked post thoroughly enough. Near the end he proposes a format called "usv", basically using the control codes (unit separator \u001f and record separator \u001e) built into ASCII and now Unicode for their intended purpose, which is actually a really good idea! Apparently this had been noted earlier as a format called "adt"[0].
They are re-learning those lessons slowly. I.e. OpenAPI and JSON Schema are pretty much poor re-implementations of SOAP and XSD, but for JSON. I don't want to be that get-off-my-lawn guy but it's laughable how equivalent they are for 99% of daily use cases.
Every time I hear someone talking about validating JSON I just think about how, despite its flaws, XSD is actually pretty decent despite being 20 years old.
The first time I used XSD was in 2001 I think, for a format we were developing to do human rights violation reporting.
In one part of the document would be a list of people committing violations, another part a list of people witnessing violations, and then in another part violations.
The violations part would have attributes saying what people had taken part in the violation, and who had witnessed. These attributes were a comma separated list of ids.
This structure is of course easy to represent with XPath, probably via Schematron. But there is no real way to represent this kind of context dependent structure in the first version of XSD (I have not kept up for reasons that shall become clear).
Which led me to declare that XSD sucks. Although I have to admit that Henry Thompson is a great programmer, and one of the most worthwhile speakers on technical issues I have ever had the pleasure to hear, and while his model of XSD validation as a finite state machine is elegant, it still does not make XSD suck any less, because the standard could not validate many common markup structures.
Yeah I also hit the XSD complexity ceiling on occasion. Most of what we used it for back in the early-mid 00s was quite simple though and it did make much of our work simpler.
That XSD is complex is one thing; some things are by their nature complex, ok. That it was not able to validate common XML structures was another thing, and a third was that support across validation tools sucked.
Actually this combination of difficulties made it sort of enjoyable for me to solve problems stemming from its usage, and to look smart for doing so, while still feeling like I was using a deficient tool.
XML and its associated formats were just so complex. I remember considering getting a book on XML and it was 4 inches thick. Just for a text-based data storage format...
This is just prohibitively complex. Formats like JSON and YAML thrive because they don't have the complexity of trying to fit every possible scenario ever. The KISS principle still works.
2. These XML books tended to have a section on XML and well-formedness, namespaces, UTF-8, examples of designing a format - generally a book or address format - and all this stuff probably came to approximately 80-115 pages. Which was what you needed to understand the basics of XML.
3. Then would come the secondary stuff to understand: XPath and XSLT. I would say this would be another 100 - 150 pages, so a query language and a programming DSL for manipulating the data/document format. All this together, 265 pages.
4. Then validation and entities in DTDs noting that this was old stuff from SGML days and you didn't need it and there was going to be some other way to validate really soon. Another 60 pages? (and then when that validation language came it sucked, as I noted elsewhere)
5. Then because tech books need to be thick and a 300 page book is not big enough a bunch of stuff that never amounted to anything, like Xlink or some breathless stuff about some XML formats, maybe a talk about SVG and VML, XSL-FO blah blah blah. Another 300 pages of unnecessary stuff.
The author doesn't like any of those tradeoffs and wishes to make another one; what's the problem with that?
You don't believe the design space is exhaustively explored by the designs and protocols you mentioned, do you? There is always another local optimum to be found.
The problem is that the author states that none of these existing formats are "decent" and falls back on shallow dismissals like "Don’t even get me started on Excel’s proprietary, ghastly binary format."
Looks good, although I don't think I would want the diffs to be in cell. I might just end up writing my own merge tool that passes conflicting cells to a fallback tool.
In some programming languages (such as PostScript, where evaluating as itself vs being executed, is a flag separate from the value's type, and can be changed by the cvlit and cvx operators), it is.
What if you are doing both, as in trying to source control your database. A lot of tools (liquibase for one) can use CSV files as the basis of read-mostly lookup tables for example. E.g. if I had a list of "product categories" that I wanted to be part of my database when I bootstrap a new environment to run my software. Liquibase can turn that CSV file into a set of upserts that will run if the file hash has changed. So you can maintain that file with just regular edit file > git commit, and it will ensure that table is always consistent with what you want and gives you a much easier to understand commit history for your database, especially as things inevitably get complicated.
Databases are measured in gigabytes and terabytes. If you put only a portion of it in source control, how do you back up the rest of your database? Furthermore, static data is a minor subset of a database. Data by its nature is volatile. Transactions make up 80% of the data. A projection or snapshot/summary is the summarization of the daily/hourly/minute transactions. If you want a golden copy to bootstrap new environments, I would argue you are better off backing up that golden copy and restoring it using native database tools. Mashing together two very different paradigms will lead to long term disaster. There was a thread about a year ago, when a DevOps engineer was surprised when his "nonproduction" configuration code deleted his production database. If your database is a terabyte in size, by the time you restore the database -- you are already fired.
> If you put only a portion of it in source control, how do you back up the rest of your database?
I think this is simply a misunderstanding of how these DB change management tools work and what they can do to help you with complex migrations over an applications lifetime.
You still back your DB up like normal.
Your "mostly static" data is managed by running inserts/updates/deletes when data changes (either manually, or the tool can usually generate them for you). When you actually apply it, the software also records that the migration has been applied in a database change log table. That way, when you want to update your database with the latest migrations, it'll only run what hasn't been applied yet.
That allows your standard backup/restore procedures to work just fine.
> If you want a golden copy to bootstrap new environments, I would argue you are better off backing up that golden copy and restoring it using native database tools
So this is essentially what we are doing with liquibase... Using database dumps without any data as our baseline.
Any DDL changes are managed through migration scripts.
There are a number of things that are not managed by migration script, and are instead apply-on-change in our project though.
We found it better to have our views stored as individual files in liquibase, and have them apply on change, because of nested dependencies and other issues with not having a good source of truth for the view definition.
Functions/procedures were another that are best treated as code rather than migrations. It allows you to deal with conflicts between branches with your standard merge tools.
Our "static" data that would only change when there is an application change is managed through csv files managed by liquibase that apply on any change. That needs to be in-sync with the version of the application deployed, so it makes sense to manage that along with the code.
ndjson is actually a really pragmatic choice here that should not be overlooked.
Tabular formats break down when the data stops being tabular. This comes up a lot. People love spreadsheets as editing tools, but then they end up doing things like putting comma separated values in a cell. I've also seen business people use empty cells to indicate hierarchical 'inheritance'. An alternate interpretation of that is that the data has some kind of hierarchy and isn't really row based. People just shoehorn all sorts of stuff into spreadsheets because they are there.
With ndjson, every line is a json object. Every cell is a named field. If you need multiple values, you can use arrays for the fields. Json has actual types (int, float, strings, boolean). So you can have both hierarchical and multivalued data in a row. The case where all the fields are simple primitives is just the simple case. It has an actual specification too: https://github.com/ndjson/ndjson-spec. I like it because I can stream process it and represent arbitrarily complex objects/documents instead of having to flatten it into columns. The parsing overhead makes it more expensive to use than tsv though. The file size is fine if you use e.g. gzip compression. It compresses really well generally.
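A minimal sketch of that stream-processing style in Python, with a made-up products.ndjson file; nothing here goes beyond the standard json module:

import json

def write_ndjson(path, records):
    # One JSON object per line; arrays and nested objects are fine.
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_ndjson(path):
    # Stream the file: no need to hold the whole dataset in memory.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

write_ndjson("products.ndjson", [
    {"sku": "A1", "price": 9.5, "tags": ["red", "small"]},
    {"sku": "B2", "price": 12.0, "sizes": {"eu": 42, "us": 9}},
])
for row in read_ndjson("products.ndjson"):
    print(row["sku"], row.get("tags", []))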
But I also use tab separated values quite often for simpler data. I mainly like it because Google Sheets provides that as an export option and is actually a great editor for tabular data that I can just give to non technical people.
Both file formats can be easily manipulated with command line tools (jq, csvkit, sed, etc.). Both can be processed using mature parsers in a wide range of languages. If you really want, you can edit them with simple text editors, though you probably should be careful with that. Tools like bat know how to format and highlight these files as well. Etc. Tools like that are important because you can use them and script them together rather than reinventing wheels.
Formats like parquet are cumbersome mainly because none of the tools I mention support it. No editors. Not a lot of command line tools. No formatting tools. If you want to inspect the data, you pretty much have to write a program to do it. I guess this would be fixable but people seem to be not really interested in doing that work. Parquet becomes nice when you need to process data at scale and in any case use a lot of specialized tooling and infrastructure. Not for everyone in other words.
Character encoding is not an issue with either tsv or ndjson if you simply use UTF-8, always. I see no good technical reason why you should use anything else. Anything else should be treated as a bug or legacy. Of course a lot of data has encoding issues regardless. Shit in, shit out basically. Fix it at the source, if you can.
The last point is actually key because all of the issues with e.g. csv usually start with people just using really crappy tools to produce source data. Switching to a different file format won't fix these issues since you still deal with the same crappy tools that of course do not support this file format. Anything else you could just fix to not suck to begin with. And if you do, it stops being an issue. The problem is when you can't.
Nothing wrong with tsv if you use UTF-8 and some nice framework that generates properly escaped values and does all the right things. The worst you can say about it is that there are a bit too many choices here and people tend to improvise their own crappy data generation tools with escaping bugs and other issues. Most of the pain is self-inflicted. The reason csv/tsv are popular is that you don't need a lot of frameworks / tools. But of course the flipside is that DIY leads to people introducing all sorts of unnecessary issues. Try not to do that.
> Columns are separated by \u001F (ASCII unit separator)
> Rows are separated by \u001E (ASCII record separator)
That's a nightmare to try to edit yourself in a text editor?
I'd rather just have basically TSV, but with every value always quoted, always UTF-8. Quotes escaped with backslashes, backslashes escaped with backslashes, and that's it. Any binary allowed between the quotes.
I deal with CSVs all day every day. I'm known for these two posts
Some friends and I actually started an RFC about 11 years ago for a CSV enhancement with an HTTP inspired header section with metadata including encoding. UTF-8 wasn't as clear of a winner back then. Never went anywhere.
> I'd rather just have basically TSV, but with every value always quoted, always UTF-8. Quotes escaped with backslashes, backslashes escaped with backslashes, and that's it. Any binary allowed between the quotes.
Encode rather than escape, such as encoding an arbitrary byte as %xx where xx is two hex digits. Use this encoding for any %s in the values, as well as any field separators and record separators and any bytes that have special meaning in your format.
Encoding rather than escaping means that given a record I can split it into fields using the built-in string splitting method of whatever language I'm using. Dealing with a format that can have the field separators escaped in the values will usually present less opportunity to use the language's efficient built-in string functions.
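A rough sketch of the encode-don't-escape idea in Python, assuming US/RS separators (tab and newline would work the same way); the %xx scheme and helper names are just illustrative:

import re

US, RS = "\x1f", "\x1e"

def encode_field(value: str) -> str:
    # Only '%' and the two separators ever need encoding.
    return value.replace("%", "%25").replace(US, "%1F").replace(RS, "%1E")

def decode_field(field: str) -> str:
    return re.sub(r"%([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)), field)

def parse(text: str):
    # Encoded fields can never contain the separators, so plain
    # str.split() is all the parsing that's needed.
    return [[decode_field(f) for f in record.split(US)]
            for record in text.split(RS) if record]

row = ["100% cotton", "contains \x1f literally"]
encoded = US.join(encode_field(v) for v in row) + RS
assert parse(encoded) == [row]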
One major limitation with quoted values that can thus contain record delimiters (as opposed to escaping the delimiters) is that it stops systems from being able to load records in parallel.
Some systems ban embedded record delimiters, for this reason.
What's the difference between quoted and escaped delimiters? (Keeping in mind that escaping sequences can themselves be escaped, ad infinitum. You can't simply seek to an escape sequence and depend algorithmically on a small, fixed lookbehind.)
I think the parent means that if newlines were encoded as "\n" (with a backslash) then you could always split on (actual) newlines and process the pieces in parallel without having to tokenize the quotes first.
"That's a nightmare to try to edit yourself in a text editor?"
I have found these "unusual" separators are useful, e.g., I use ASCII file separator (FS). I use tr and sed to add/remove/change separators. If I had to edit a table interactively I would change the separator to something visible before editing. That said, in nvi(1) or less(1), US is displayed as a highlighted ^_ and RS as a highlighted ^^. It is not difficult to work with if the data is just text, as it is for me. One could also use a hex editor like bvi.
On large tables, I prefer to make edits non-interactively. I use ed(1) scripts.
Unfortunately, the UNIX sort command's -t option will only support a limited range of characters as separators. US, RS and FS are not among them. If I want to use UNIX sort I have to change the separator to one that sort accepts before sorting.
The complaints about CSV I read on HN and elsewhere seem to be complaints about what people put into them, i.e., lack of enforced rules about what is acceptable, not about the format itself.
In the printing world, inkjet printers for industrial use use a file format that is all RS, GS, US, and FS characters. It had no line breaks; instead it used an RS character at the beginning of each record. It would routinely break if people tried to open it in a text editor. Nothing wants to deal with a 300 MB file consisting of a single line. I ended up writing my own library to manipulate the files and used a lot of tr, sed, and awk on the command line. It was a pain only because modern editors have forgotten control codes.
>> Columns are separated by \u001F (ASCII unit separator)
>> Rows are separated by \u001E (ASCII record separator)
>
> That's a nightmare to try to edit yourself in a text editor?
It is. When I needed to edit a file with field separators (a little tool I made used field separators) I found that Vim was great, because I could copy an existing character into a specific register, and then never use that register for anything else.
Excellent posts. Coincidentally, a couple of weeks ago while evaluating an HTTP response for a web service, I noticed that for tabular data CSV is much more efficient than JSON; yet there is a lack of HTTP header support for CSV responses that could provide clients with supplementary information to keep parsers adaptable.
If you have a copy of said RFC, I would like to refer to it.
We’re talking about a tabular data file format. If you want to include arbitrary binary, use a binary data file. Or base64 encoded data. Most datasets you’d use data like this for are small enough to fit into memory, so let’s not get carried away.
(I happen to use tab delimited files to store data that can’t fit into memory, but that’s okay too)
Yes. I think we're agreeing. I was responding to this "Any binary allowed between the quotes.". Binary data can't generally be dropped directly into a text format without some kind of organized encoding.
Yeah, I think so… I thought they meant using a quote as a flag for “binary data lies ahead”, which really seemed odd to me. But — it is completely possible in a custom file type. But yes, if this case, the entire file wouldn’t be UTF8, even if all of the non-quotes data would be.
In retrospect, the idea of random binary data enclosed in quotes is what I’m mainly responding to — which I think we can all agree is a bad idea. (If you need to do that, encode it!)
That's just text. In one sense, yes, it's arbitrary bits. But it's also clearly text. It's probably not what was referred to. If you have binary data that's encoded as text, it seems obvious that that can be embedded in text. It probably wouldn't be worth mentioning that "Any binary allowed between the quotes."
Actually, I think that it is, if it supports UTF-8 without BOM and does not try to do such things like convert quotation marks into non-ASCII quotation marks automatically, etc.
However, using a proper ASCII text editor would be better, if you do not want Unicode, to avoid many of the problems with Unicode if you are loading an unknown file or copying unknown stuff using clipboard, etc, which may be homoglyphs, reverse text direction override, etc.
(I use vim with exclusively ASCII-only mode, and do not use a Unicode locale, on my computer.)
I just mean, if you _require_ plugins in order to be able to edit the content, then the content can't easily be described as text. It is fine to use a specialized application to edit a file of a non-text format, I have nothing against that, but you have then left the realm of text editor.
As an example of what I mean, if someone wrote a vim plugin that allowed a user to interact with a sqlite file and change the schema or edit raw row values from vim, it could be a really valuable and useful plugin. But the presence or absence of a plugin for some given text editor doesn't change whether a given format is generally considered a format suitable for being edited in a text editor. What it does instead is convert vim into an editor of non-text files.
I admit that the proposed file format is much closer to being editable in a text editor than a binary format such as sqlite, but the fact that the characters cannot be typed without special functionality suggests that the format is not suitable for a text editor.
The described format _is_ editable in vim without a plugin. It's just a little awkward, because everything will be on a single line, and you have to use more complicated key commands to enter the delimiters (for example `<C-v><C-6>` for \x1e).
Parquet is a wonderful file format and is a dream to work with compared to CSV. Parquet embeds the schema in the footer metadata, so the query engines don't need to guess what the column names / data types are.
Delta Lake (Parquet files + a transaction log) makes it quite pleasant to manage a large lake of data stored in Parquet. There are tons of other features Delta allows for like time travel, schema enforcement, schema evolution, etc.
Parquet is still relatively poorly supported in the JVM world. Yes, you can use Spark, but that's an absolutely huge dependency just to read a file representing a table of data. The alternative, trying to use poorly documented Hadoop libraries, was only marginally better. Maybe the story has changed in the last year?
The other problem with Parquet is that it's overly flexible/supports application-specific metadata. It's all fine when you use a single tool/library for reading and writing files but cross-platform is problematic. Saving a Pandas dataframe to parquet, for example, will include a bunch of Pandas-specific metadata which is ignored or skipped by other libraries.
In this case, it required converting timestamp/datetime columns to a nano time int64 representation before writing data from Pandas (for example), otherwise you could not read those columns using anything that wasn't Pandas.
But maybe this has changed since I last used parquet as a table format?
And DuckDB (DuckDB.org) is a lightweight and super fast library/CLI for working with Parquet.
It’s SQLite for column formats, and for Python users it’s only a pip install away. I use it on the command line to inspect and work with Parquet. It’s also Pandas compatible and actually more performant than Pandas.
No need to use Spark, which is heavy and has tons of boilerplate.
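For example, a small sketch assuming a reasonably recent duckdb (with the module-level sql() helper) and a hypothetical local data.parquet file:

# pip install duckdb
import duckdb

# Query a Parquet file in place, no Spark cluster or boilerplate needed.
print(duckdb.sql("SELECT count(*) FROM 'data.parquet'").fetchall())

# Hand the result straight to pandas if that's your workflow.
df = duckdb.sql("SELECT * FROM 'data.parquet' LIMIT 10").df()
print(df.head())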
I've been pretty impressed with parquet lately. One thing I've missed is a way to group tables. Is there a standard for that? While parquet is generally column oriented it has support for metadata about tables of multiple columns. However, I'm not aware of any format that groups the tables, short of just zipping a bunch of files.
For context, this would be for an application that passes sqlite files around. So naturally it has good support for the database level of storage. But parquet is so fast for some applications as well as so compressed.
Another commenter mentioned Spark; Pandas is another popular one. Not used it, but I think it's lighter weight, where Spark is more for large distributed computation even though it can run locally.
There’s a bunch of these tools which lets you treat parquet files as tables doing joins, aggregations etc.
Isn’t Apache Arrow an in-memory format that the various DataFrame libraries can standardise on to interact with each other, e.g. for inter-process communication (IPC)?
My understanding is your raw data on disk is still a format such as Parquet, but when you load that Parquet in to your application it’s stored as Arrow in-memory for processing?
There is a decent file format for tabular data, and the author dismisses it: parquet.
It's compact, encodes all the common data types well, does int/float distinction (thanks for teaching us how important that is, JSON), stores null records with a mask instead of a special value, uses a columnar layout with row groups, has compression, speedy random access... it has it all. And it isn't bogged down with legacy cruft (yet).
Since you need to look at tabular data outside of a text editor anyway[0], I don't see a ton of benefit to making it a plaintext format. Especially not with the author's suggestion of un-typeable ascii delimiters. If I can't type it on my keyboard, I may as well be looking at a hex dump of a binary file because I can't really edit it.
[0] Who among us hasn't experienced the joy of a pull request updating a checked in csv file? A mess of ,,,,,,,"Birmingham",,,AL, etc.
parquet is great but it's not particularly easy to read or write. the libraries that do exist to work with it are few and far between, and those that do exist either have a hundred dependencies or depend on native code (e.g. libarrow). certainly an important dimension in an ideal file format should be the ease of parsing/writing it, and parquet gets an extremely low score on that front imo
Parquet is also column-major which is great for many use cases, but bad for others, where row-major is more useful. For example, if you want to get just the first x rows.
Sure, but any new format is going to have the same problems. I think you're right that implementation complexity needs to be considered, but it's not like Word or Excel files or something where you need to replicate bug for bug a format accreted over decades.
Parquet isn't trivial to parse / write but that's probably good imo. CSV is really easy to write, and... that just means everybody does it slightly differently. Being somewhat difficult to interact with encourages people to use a library to centralize a bit, but it's not so complex that someone motivated couldn't write a new implementation in a reasonable amount of time.
I honestly love SQLite, so many things about this database are refreshingly amazing. The lightweight embedded nature, the extreme ubiquity[1], the single-source-file-to-compile idea[2], the gigantic test suite[3], their commitment to stability and backward compatibility[4], etc
...
Simply a modern marvel of engineering. A thing to put on the next Voyager and send to aliens as a representative of what human minds and hands can do.
All I'd want is a CLI tool that can dump an SQLite file so that it actually looks like a table (using "|", "-", "+", etc.), and maybe also accept grep-style filter specs (per specified column[s] of course).
All the standards I’ve seen haven’t been worth the effort to implement. So since csv, with all its flaws, is good enough it crowds out other open standards.
People complain about it, but it’s not really much of a challenge to use csv. I’d also prefer it over the crap (rdf, xml, even schemad json) proposed by people who value more structure. It’s easier for me to just make clean production and consumption programs than to spend time on a really structured table format.
Although I would love a simple, tabular format so I’d never have to use csv again.
I live in France and let me tell you: CSV is not good enough.
What about France, you say? The decimal point is a comma, and to avoid confusion, the list separator is a semicolon. So "1.5, 1.6" becomes "1,5; 1,6" when localized. And if you think it is terrible for CSVs, well, it is worse than that.
If we all used RFC 4180 consistently, it would work, but of course, that would be too simple. Some software insists on localization. So you need to know the software and locale that go with that CSV. In the simplest case you can treat both commas and semicolons as separators and it will work for reading; for writing, you have to pick one that the target software hopefully understands.
Hell starts when you have decimal numbers, because sometimes you get exported CSVs where the comma is used both as a decimal point and as a separator. It makes the numbers ambiguous and your file becomes useless. It can result in data loss (personal experience). For example "1,2,3" may be [1.2, 3] or [1, 2.3]. And even without that, you still have to guess whether you are using points and commas or commas and semicolons.
And of course, there are the quoting/escaping issues every country has. Good thing the RFC has clear rules; I heard you can even find software where they are properly implemented. Of course, in France we also have the added bonus of questioning whether or not commas should be quoted since they are not the localized separator.
And character encoding? Of course it is a mess, as with every text file, but CSV makes no effort to help. Is it UTF-8? Latin-1? I have seen both, and I have seen files mangled because they were opened with the wrong encoding. Also, line endings should be CRLF, for those who care.
CSV as defined by the RFC is good enough, the problem is how it is used in practice.
I think CSV is crappy because commas are so common in real data.
For almost all scenarios I've had to work with, I'd have been perfectly happy with TSV where literal Tab was a disallowed character. No escaping histrionics required.
I prefer commas because I can see them, unlike tabs. I spend zero time escaping commas because the libraries I read and write with (usually pandas, but pretty much everything) escape for me. So unless I’m manually building csvs myself it’s a non-issue and certainly not histrionics.
I prefer tabs because my data can often have commas in it. Seeing tabs isn’t an issue as I also have invisible characters visible in all of my editors.
But having your delimiter not be allowed in the record (as in \t is disallowed), makes parsing so much easier. CSV is a bear to parse because you have to read each value from a buffer to handle the quoting.
line.strip().split('\t')
Is so handy. Batteries included are one thing, but I don’t want to pull in a library just to parse CSV files.
Given your example looks very much like python you could use the CSV module which is built into the standard library and handles all of this for you in a standards compliant manner.
You really do want to use a library to parse CSV since there are a number of corner cases. For example, your example code may not read a whole row since rows can have newlines in them.
> You really do want to use a library to parse CSV
Right… you absolutely need to use a library. I’ve written the parsers for CSV which handle the edge cases (at least RFC edge cases). But I don’t want to include a library to be able to do it. Most of my scripts are small, and adding libraries makes them more difficult to move around. So, I shy away from CSV as a format as a result.
I’m okay with not allowing new lines and tabs in my fields. It’s a valid trade off for my data.
Yes, many of my scripts are Python, but not all, so adding new libraries for different languages is more difficult for me to remember as opposed to just splitting a line by a record separator.
Also, many of my files require having comments in the file’s header. Another reason why CSV can’t be easily used, as this isn’t part of the RFC.
I appreciate that, as described, some of your needs may differ from the standard format (I'd probably still use it to make interoperability between other languages/people/systems easier, though, and keep my comments etc. in documentation). However, with all that in mind, I may not have communicated well enough about CSV being a built-in.
> Most of my scripts are small, and adding libraries makes them more difficult to move around. So, I shy away from CSV as a format as a result.
The CSV library I speak of is part[0] of the python standard library. Unless you're working with a restricted subset of the language or something it should have no impact on the size of your script or its portability between platforms/locations.
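For instance, a quick sketch of what the stdlib module handles that a naive split(',') would not (quoted commas, embedded quotes, even embedded newlines):

import csv, io

data = 'name,notes\n"Smith, Jane","said ""hi""\non two lines"\n'
for row in csv.reader(io.StringIO(data)):
    print(row)
# ['name', 'notes']
# ['Smith, Jane', 'said "hi"\non two lines']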
Depends entirely on context. I did consulting in wholesale and supplier data very often contained pipes as content separators (e.g. for product categories, sizes, etc.).
CSV has more problems than just quoting: one problem is that excel is so commonly used to open CSV files that various weirdnesses in excel have essentially become part of the CSV format.
For example, if a value looks like a formula, then the only way to get excel to treat it as text is to put a single quote in front of it, "'=not a formula". This in turn means that its common for values in a CSV to begin with a single quote, but for that single quote to not be intended as part of the value.
I think a decent format could be like CSV, but tab separated, where the header row is mandatory, and with explicit support for data types. Tabs and newlines (along with any other unicode character) can be included via backslash escaping.
Applications can use their own type specifiers (eg. "formula:") to avoid needing to introduce special rules. The value of a field (before interpretation) is the exact string written in the file. The type specifier indicates how that string should be interpreted as a typed value by the application. For example, the `str` type specifier says that backslash escape sequences should be converted to the corresponding characters when interpreting the value.
This has the following properties:
- Machine and human readable and writable.
- Does not place artificial restrictions on values.
- Can be extended as needed with new type specifiers.
- Can be processed losslessly even if an application does not understand all type specifiers in use.
- Is not ambiguous about whether data is part of a header or not.
- Can store common data patterns (eg. all values in a column have the same type) efficiently without being restrictive (values can still have different types even in the same column).
- Can easily jump to a line or field (newlines and tabs not allowed except as separators - they must be escaped when in a value).
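A rough reader sketch for the format described above. The comment doesn't spell out where type specifiers live, so this assumes each field looks like type:payload (e.g. str:hello\tworld) and that only \t, \n and \\ escapes exist; the int specifier is my own addition for illustration:

ESCAPES = {"t": "\t", "n": "\n", "\\": "\\"}

def unescape(s):
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(ESCAPES.get(s[i + 1], s[i + 1]))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

def parse_field(field):
    kind, _, payload = field.partition(":")
    if kind == "str":
        return unescape(payload)
    if kind == "int":                 # hypothetical extra specifier
        return int(payload)
    return field                      # unknown specifiers kept losslessly

def parse(text):
    lines = [ln for ln in text.split("\n") if ln]
    header = lines[0].split("\t")     # header row is mandatory
    rows = [[parse_field(f) for f in ln.split("\t")] for ln in lines[1:]]
    return header, rows

header, rows = parse("name\tcount\nstr:hello\\tworld\tint:42\n")
print(header, rows)   # ['name', 'count'] [['hello\tworld', 42]]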
I like the format description, thanks. Sounds like something I might implement some day for fun!
The reason I still don't like this is that most of the time when I'm sharing csv, it's to a jira ticket or an email, or to a customer who insists it needs to be csv.
When customer wants csv there is nothing I can do.
But in other cases I'm the one reading the CSV file, so its visual clarity is important. CSV has all the problems you listed, but TSV (and similar formats, such as yours) has the shortcoming of relying on an invisible separator. And worse, the width of a tab is undefined, undecided.
I don't know what the best trade-off is... Maybe we need better tooling so frontends show CSV/TSV et al. as separated cells.
When the customer wants csv, it's because they will open it on Excel, so anything about standardization or data quality is moot. They will have their data mangled in an application specific format.
You need to export to csv. But that doesn't mean you can't use an actually usable format for anything else.
> - Can be extended as needed with new type specifiers.
That's often not a good thing, at least if that allowance is given to third parties, as it wildly opens up the format to effective incompatibilities (fragmenting the ecosystem) and is a never ending source of security issues.
This assumes that if you don't include such capability, that applications won't find a way to extend it on their own regardless of the spec, and that's empirically false.
There's a reason the vast majority of file formats do have some capacity for extension. By providing this capability you avoid much worse hacks (the devil you know and all that...) and you can ensure certain invariants are upheld (such as applications being able to process records losslessly even if they don't understand the format specifier)
> This assumes that if you don't include such capability, that applications won't find a way to extend it on their own regardless of the spec, and that's empirically false.
Except it's true. For a trivial example, json which does not offer these capabilities doesn't suffer from incompatibilities or security issues anywhere close to how much yaml does. Because a json file which is not POD requires a lot of additional documentation and code.
> There's a reason the vast majority of file formats do have some capacity for extension.
Yes, naïvety (if not outright brain damage): for a long long time it's been assumed that flexibility and extensibility were innocuous, and while we've known that was not at all the case for at least a generation, these ideas remain attractive nuisances.
More modern formats have tended to stay away from these, and those few which have not (like, once again, yaml) are case studies in why they should have.
It's easier to parse than CSV. Unfortunately, you have to decide how to handle newlines and tabs yourself (probably as \n, \t, with \\ for backslashes).
More than the millionth, I'm sure. If you know your input doesn't have commas (for example, because you generated it) and the code is throwaway, then splitting on comma is fine.
But, yeah, I prefer tab delimited, since "no commas" is painful, and data that contains quotes leads to scenarios where the crappy CSV I generated can't be parsed by RFC 4180 parsers.
yeah the "CSV is fine" bit scared me. people think they know what everyone is doing and why; in reality (where these problems actually need to be solved) no one knows.
Yeah I agree. Just the combination of opening in excel on double click and being dead simple to spit out from any software makes it a winner. It helps that it's usable in plain text too.
If it had used ANYTHING other than half the world's decimal separator as a value separator, it would be a no-brainer. If it had been .psv (pipe separated values) the edge cases would be so few that we could just ignore them.
Gnumeric (an open source spreadsheet) has a great dialog box that pops up and lets you tell it exactly how to parse the tab delimited / csv file you just handed it. It even has a preview of what your parsing options will do to the file you're opening. Also, "gnumeric foo.csv" does the right thing from the command line.
I don't think it's necessarily that CSV is good enough but that it's the only ASCII tabular format that "double clicks into Excel" by default.
Excel supports a lot more tabular formats these days, but you have to search the ribbon for their importers. Its JSON support is actually surprisingly good, for instance.
Sometimes I think the best thing that the Excel team could do to make the state of the world of tabular formats for us better as developers is just to add a bunch of new file extensions that you can rename files to. For instance, if you could just rename a JSON file to end with something like .XLJSON and a Business Analyst or Accountant can just double click to open in Excel, that would open up a lot more nicer tabular format options beyond CSV with better developer ergonomics.
I never understood why the ASCII separator characters aren't used more. It seems like we're one simple text editor feature away from having easy display and modification. Is there some historical reason for not doing that?
Early on they were used in some serial and modem protocols, which was problematic when you used them in actual file content. I remember trying to use them in a file, and then when I tried to transfer them, the transfer would stop at RS.
My editor, and lots of editors now, put spaces instead of tabs when I press the tab key. No reason something similar couldn't happen in CSV modes/apps.
I’ve wondered this too. At one point I started writing a hobby programming language and explicitly added escape codes for these extra control characters to be used in string literals. (Ada has them all included in the “ASCII” package which is nice.) It seems like a no-brainer to bake them into file formats like the OP describes.
> Tab separated is a bit better than CSV. But can’t store tabs and still has issues with line endings, encodings etc.
There are three main variations of tab-separated-values files:
1) Those that don't allow tab and line endings.
2) Those that replace tab and newline characters with escaped values (\n for newline, \t for tab, \r for carriage return, \\ for backslash).
3) Those that follow the CSV convention of quoting fields as defined in RFC4180
The third option is by far the best and is what Microsoft Excel uses. Microsoft Excel has a save file type called "Unicode Text (.txt)" which saves the data as a tab-separated file using RFC4180 quoting/escaping with the UTF-16LE character encoding. In older versions of Excel, "Unicode Text (.txt)" was the only way to export any values containing Unicode characters since the "CSV (Comma delimited) (.csv)" export format uses the ANSI encoding (Windows-1252 on Western/US computers), corrupting any characters not contained in the ANSI character set. In late 2016, Microsoft finally added the "CSV UTF-8 (Comma delimited) (.csv)" option for exporting a CSV file containing Unicode characters.
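A small sketch of variation 3 using Python's built-in excel-tab dialect; the utf-16 encoding mirrors Excel's "Unicode Text (.txt)" export (plain utf-8 works the same way if Excel compatibility isn't needed), and the file name is made up:

import csv

rows = [["name", "notes"], ["Jane", "has a\ttab and a\nnewline"]]

with open("out.txt", "w", newline="", encoding="utf-16") as f:
    csv.writer(f, dialect="excel-tab").writerows(rows)

with open("out.txt", newline="", encoding="utf-16") as f:
    print(list(csv.reader(f, dialect="excel-tab")))
# [['name', 'notes'], ['Jane', 'has a\ttab and a\nnewline']]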
> Most tabular data currently gets exchanged as: CSV, Tab separated, XML, JSON or Excel. And they are all highly sub-optimal for the job.
> CSV is a mess. One quote in the wrong place and the file is invalid.
That breaks the other formats too, why pick on CSV? I can imagine a format designed to be friendly to syntax errors, but contra Postel's Law I'm not sure it would be an improvement over a strict, fail-fast syntax.
That's CSV/TSV's real shortcoming: about the only generic validation they allow is to make sure the column count is the same for all rows.
> That's CSV/TSV's real shortcoming: about the only generic validation they allow is to make sure the column count is the same for all rows.
Once upon a time, when I was doing a lot of data interchange between a wide variety of systems (OS'es, applications, etc.) I considered proposing an "enhanced CSV" (ECSV) where the values did not start on the second row in the file, but instead the second row would be regular expressions that could be used to validate the contents of the columns that followed, and data would start on row 3.
(Newlines added because HN doesn't speak Markdown, sigh.)
In the end, I think the solution was far simpler: we just exchanged a separate descriptor file that had column names and their corresponding regexp patterns for validation as a separate (versioned) file, in order to save a few bytes inside each file transmission, which was a real savings when you paid per-byte over an EDI network.
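For what it's worth, a rough sketch of how such an "enhanced CSV" could be validated; the file layout and regexes below are invented purely to illustrate the proposal (row 1 header, row 2 one regex per column, data from row 3):

import csv, io, re

lines = [
    "id,price,date",
    r"\d+,\d+\.\d{2},\d{4}-\d{2}-\d{2}",
    "1,19.99,2021-06-01",
    "2,oops,2021-06-02",
]
reader = csv.reader(io.StringIO("\n".join(lines)))
header = next(reader)
patterns = [re.compile(p + r"\Z") for p in next(reader)]

for lineno, row in enumerate(reader, start=3):
    for col, pat, value in zip(header, patterns, row):
        if not pat.match(value):
            print(f"line {lineno}: column {col!r} value {value!r} "
                  f"does not match {pat.pattern!r}")
# line 4: column 'price' value 'oops' does not match '\\d+\\.\\d{2}\\Z'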
HN actually does support markdown code blocks, you just have to preface lines with four (nope, two)[1] spaces instead of the (newer) triple-backtick codefences.
Yup, I noticed that and my Level of Care was insufficient to go and edit my comment a second time to fix it. But, thanks for confirming that it could have worked!
I spent a while making a binary format for tabularish documents, and even started on an editor for it. What I decided on after some long months of gradual iteration was to give each cell its own header that could contain various forms of type info, flags, and modes, and to define a cell type that described forms of break (space, line, page, etc. - a 16-bit range of break types could be encoded). The document header also described a dictionary mapping for the data so that it could immediately be presented to the editor in a readable form.
But now I just use plain old spreadsheets to do things - I obsoleted my own tech, although I like certain things about it. The editing and storage encoding isn't really the problem so much as the integrity of the literals, which a solution like the regex idea could accommodate.
I do think that CSV would benefit by having a header area that described the encoding of breaks in cells and lines. Maybe that's the only thing that really needs fixing in it. And if it included arbitrary break levels like my thing and dropped the rectangular row-column shape, it would cover a huge number of documents.
We can blame CSV, or we can blame the way people use CSV. Either way CSV is so unreliable that I try to “fail-fast” as soon as possible in automated pipeline.
At work, we explicitly define data structuring process, converting CSV to Parquet with strict schema and technical/structural validation. We assign interns and new grad engineers for this, which is nicely within their capabilities too with minimal training.
>> CSV is a mess. One quote in the wrong place and the file is invalid.
> That breaks the other formats too, why pick on CSV?
I think it's perhaps badly worded, but the implied (and more important) criticism seems to me to be that CSV makes this kind of error much more likely, with its handling of quotes. Having worked with CSV files that had commas in the data (and sometimes quotes too), I quickly learned that I should `set readonly` on my editor and only interact with the file through programmatic tools, and give up any notion of it being a plaintext hand-editable data format.
Indeed, it’s because people use an editor that isn’t designed for editing csv to edit csv.
Every csv file is a text file, but not every text file is a csv file, yet people use text editors to edit them.
A csv editor would forbid you from entering that quote, automatically add a matching one, ask you for your intentions, or whatever, but it shouldn’t silently corrupt your file.
A spreadsheet-like UI, but without formulas, automatic single header row, etc. would be a better UI for csv files.
If you use zip to edit a .xlsx or .docx file, you easily get an invalid file, too.
VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python, into a lightweight utility which can handle millions of rows with ease.
if you add an extra comma in a CSV (outside of quoting) then the rest of the cells in that row are off by 1. Which is not good obviously. But if you add an extra quote, then the entire rest of the file is garbage.
You can just use sqlite then. Very compact, highly popular (in different role though). Seen it used for large datasets - map tiles (millions of jpeg files). Much smaller size than zip or tar archive, indexed, fast.
> This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file can be accessed using the POSIX-like syntax /path/to/resource.
That seems a whole higher level of complexity compared to CSV or the other options listed in TFA (perhaps comparable to Excel).
NetCDF4 (built on top of HDF5 largely through sub-setting) is considerably more powerful than excel/libreoffice. Its also easy to access through widely-available libraries. I frequently use the Python `netCDF4` (yes, it really is capitalized that way) library for exploratory work.
It's a little weird at first, but it's a great format and has libraries in a lot of major languages. It stores a sparse matrix, which cuts the size down a lot.
Does Excel support it? No? Then that's the end of that.
Excel is tabular data to all non developers. The formats supported by Excel are the whole thing.
And if we're inventing a CSV-like format that uses a more convenient character than quotes and commas, maybe jumping to a non-displayable non-typeable character isn't the best?
Honestly, if I were inventing a table format, I'd use a strict subset of HTML5 tables. Strip it down as hard as possible to the bare minimal elements - no unneeded close tags. The only real flaw is that their encoding of whitespace and escaped chars are verbose and ugly.
I've pondered exactly that. After a previous HN thread, I tried my hand at writing a specification that was minimal as possible but followed HTML5. For example many end tags are optional like `tr` and `td` end tags. Though I pulled in RDFa for richer data types.
(I had considered <html application="tabular-data">; my reason was that it seems to work better with user CSS, and might be simpler to parse in other cases too.)
This is a very cool idea, but why not just add the closing tags and have a minimal format that is also valid HTML, so you can easily view the data by pointing your browser at the file?
It's already a universal format and as dtech mentions below, you get Excel compatibility.
I would also add headers and an optional type attribute...
<tr>
<th data-type="int">Age</th>
</tr>
If the file is too big then you probably need to compress it or use a more appropriate format.
Why not sqlite? You can read/write to a sqlite file in pretty much any language using decent bindings aside from maybe browser JS. I've worked with third party sqlite files several times and it usually isn't that bad. You can't just use any text editor to work with them, but the sqlite CLI is very easy to install and use. If you want a GUI, DB Browser seems to work very well.
I mean, in business contexts at least, Excel is what we end up using nearly all the time because it's ubiquitous. This doesn't mean there aren't issues with it; obviously it's nonFree, and it's notorious for making assumptions about data (see recent coverage of scientists literally renaming a gene (I think?) because Excel kept assuming it was a date or something).
But Excel as a tool is on so many desktops that you can assume access, and so it gets used.
I have a strong sense that this may be the biggest reason why no other format has taken over. It's "good enough" that it steals oxygen from any competing idea.
Yeah, I think you're spot on. It's bad enough that a lot of people will implement direct alternatives, but it's good enough that none of the alternatives will get any traction, unless they can be "just click the file" compatible with Excel (or something at least as accessible and useful as Excel… a browser, maybe?)
Fundamentally, why are people so concerned about hand editing tabular data?
The options for that are pretty terrible. Only editing an existing value in one cell is ok. Everything else - adding new rows, new columns and making sure the file format and column structure is correct - is all terrible. This gets doubly so if you stick it in git and start getting messy pull requests.
I would carefully question what use case you have that this is so important. If you truly do have such a use case, I'd politely suggest tabular data maybe isn't what you need.
Once you remove that requirement, binary solutions for tabular data are possible again and several good options have already been pointed out in other comments.
I regularly need to hand-edit tabular data I have generated from other data when it turns out there are a few weird cases I didn't handle right. I do a lot of one-off data processing in several stages, so it's not just a question of fixing the generator. If I want to edit the structure, sure, I'll take it into Sheets, but for data that's slightly weird, a text editor or shell pipe is often faster.
I wonder if there is some hacky way around that. For example, an LF (line feed) directly after a RS character is not considered data (but will break the line when shown in a text editor).
I rarely want to see tabular data in a human-readable format. It is always the most tedious way to approach it. My go-to is Excel/LibreOffice Calc. This approach is at least tolerable to edit in a text editor, while something like the OpenDocument Spreadsheet format or the Excel format is impenetrable.
> I rarely do it, but it’s nice to be able to Human read when I need to. Also being able to use all the command line text tools is super convenient.
Sometimes it helps a lot to eyeball what you have before doing one off scripts to filter/massage your data. Had a recent case where the path of least resistance was database to csv to one off python to statistic tools with a gui and tabular display.
Could have probably done some enterprise looking export infrastructure but it was a one off and not worth it.
Yes, that is an issue with the suggested approach. Unless we can persuade all the editor developers to break lines as 'RS' characters (which is not going to happen).
> Columns are separated by \u001F (ASCII unit separator)
> Rows are separated by \u001E (ASCII record separator)
Or, how about columns separated by \u002C, and rows separated by \u000A. And, for bonus points, we can even define unambiguous ways of escaping those two characters so that they CAN appear within column values, if we wanted to, and not tell people that our encoding format is totally stupid and that they need to use a different format.
OP's proposal is equally "highly sub-optimal for the job" for exactly the same imaginary reasons they dislike the currently available encoding formats, but they don't seem to realize it.
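For illustration, the quoting that stock CSV tooling already does makes both of those characters unambiguous; a quick sketch with Python's standard csv module:

    # Quick sketch: the standard csv module quotes embedded commas and
    # newlines, so round-tripping awkward values is unambiguous.
    import csv, io

    buf = io.StringIO()
    csv.writer(buf).writerow(["plain", "has, a comma", "has a\nnewline"])
    print(repr(buf.getvalue()))   # commas/newlines sit inside quotes

    row = next(csv.reader(io.StringIO(buf.getvalue())))
    print(row)                    # ['plain', 'has, a comma', 'has a\nnewline']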
> OP's proposal is equally "highly sub-optimal for the job" for exactly the same imaginary reasons they dislike the currently available encoding formats, but they don't seem to realize it.
This is a really unfair appraisal in a bunch of different ways.
Removing the ability to embed record delimiters, for example, means you can process the records in parallel. That’s a massive improvement all by itself.
Stating that their reasons are “imaginary” is just a needless insult, apart from being wrong. Why be like that?
> Stating that their reasons are “imaginary” is just a needless insult, apart from being wrong. Why be like that?
It's not an insult, let alone a needless one: it is a statement of fact.
The reasons OP cites for criticizing CSV are creations of OP's imagination, their subjective beliefs, and not based in objective truth.
For example:
> "CSV is a mess."
Actually, CSV is quite orderly and predictable. OP only imagines it's a mess because of some skewed experience they've personally had.
> "One quote in the wrong place and the file is invalid."
Improperly encoded data renders a file invalid in any format. Imagining that this is strictly limited to CSV is just that: a creation of OP's imagination.
> "It is difficult to parse efficiently using multiple cores, due to the quoting (you can’t start parsing from part way through a file)."
This is absolutely untrue. Anyone who's written a high-volume streaming CSV parser that is multi-process and multi-threaded should be giving OP the side-eye here.
Yes, you can absolutely start parsing from any arbitrary point in the file, and yes, in the degenerate worst-case scenario, you may need to rewind back to the beginning of the file if the data exists in such a way that your arbitrary starting point requires it, but that is true of ANY file format: it is possible to craft a worst-case scenario and specific starting position within that data that would require this. CSV is no exception.
Hope this helps you better understand what I had written.
> Removing the ability to embed record delimiters, for example, means you can process the records in parallel. That’s a massive improvement all by itself.
I can process records from a csv in parallel just fine. Lexing isn't slow.
.csv stream -> tokens -> records (-> aggregate into ~100ms blocks) -> work queue with N workers
You lex the whole file, separate it into records and then process them?
Doesn’t sound like you’re processing the file in parallel to me.
TFA’s way would let you start at any arbitrary file position: just advance until you find a record separator, then start parsing from there, without having to worry about starting in the middle of a quoted string.
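A rough sketch of that chunked approach, assuming the article's US/RS delimiters; the file name and worker count here are made up, and the per-chunk "work" is just a record count.

    # Sketch of parallel parsing when record separators cannot appear inside
    # fields: seek to an arbitrary offset, skip past the partial record you
    # landed in, and parse from there.
    import os
    from multiprocessing import Pool

    RS, US = b"\x1e", b"\x1f"
    PATH, WORKERS = "data.usv", 4

    def parse_chunk(bounds):
        start, end = bounds
        with open(PATH, "rb") as f:
            f.seek(start)
            data = f.read(end - start)
            if start != 0:
                cut = data.find(RS)
                if cut == -1:
                    return 0           # chunk lies inside one huge record; not ours
                data = data[cut + 1:]  # the partial record belongs to the previous chunk
            # finish the record that straddles this chunk's end
            while True:
                piece = f.read(4096)
                if not piece:
                    break
                i = piece.find(RS)
                if i != -1:
                    data += piece[:i]
                    break
                data += piece
        records = [rec.split(US) for rec in data.split(RS) if rec]
        return len(records)            # stand-in for real per-chunk work

    if __name__ == "__main__":
        size = os.path.getsize(PATH)
        step = size // WORKERS + 1
        bounds = [(i, min(i + step, size)) for i in range(0, size, step)]
        with Pool(WORKERS) as pool:
            print(sum(pool.map(parse_chunk, bounds)))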
The proposal is TSV but using ^_ instead of ^I (tab) to separate columns and ^^ instead of ^M (CR) or ^J (LF) to separate rows. So instead of being unable to put ^I and ^J in your table cells you are now unable to put ^_ and ^^ in your table cells.
That sounds exactly as good, or bad, as TSV. So, okay? Sure, do that if you like, that sounds fine.
US and RS characters are much less likely to appear in textual data than Tab or LF.
Also I believe this was the original purpose for US and RS in ASCII. So I'm not sure it is 'weird' to use them for that (although it does seem to be uncommon).
Yeah, it kind of was; also FS and GS. GNU Info files and Babyl files use US that way. Really though you can define whatever weird format you want if you're writing the code for it.
> But it is binary, so can’t be viewed or edited with standard tools, which is a pain.
I've heard this sentiment expressed multiple times before, and a minor quibble I have with it is that the fact that it's binary has nothing to do with whether or not it's a pain. It's a pain because the tools aren't ubiquitous, so you can't count on them always being installed everywhere. But I'd argue that sqlite _is_ ubiquitous at this point and, as others have mentioned, it's a _great_ format for storing tabular data.
JSON is also a fine choice, if you want it to be human readable, and I'm not sure why this is claiming it's "highly sub-optimal" (which I read as dev-speak for 'absolute trash'). JSON is extremely flexible, compresses very well, has great support for viewing in lots of editors, and even has a decent schema specification. Oh, and line-delimited JSON is used in lots of places, and allows readers to begin at arbitrary points in the file.
JSON is good for structured data, but I prefer TSV for simple human-readable tabular data. In situations where it's the right choice, a TSV file consists of data and whitespace and nothing else. You can view and edit it with any imaginable tool, and there is no overhead in the form of delimiters and encodings distracting you from the data.
I really like LTSV. (That stands for labeled tab-separated values.)
LTSV is basically equivalent to a JSON object per line. You have columns consisting of a label, a colon, then a value. The columns are then separated by tabs. The value can be quoted. If you need a tab in the value, it goes inside the quotes.
As the http://ltsv.org/ site suggests, I use it for logging too, so that a log line is easily parseable and a log file is basically a table. Note that there are parsers for many languages, and several tools support it, including fluentd.
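A minimal sketch of reading such lines; the quoting variant described above is left out for brevity, and the sample line is made up.

    # Minimal sketch: parse LTSV-style lines (label:value pairs separated by
    # tabs). Values containing colons survive because only the first colon
    # is treated as the label separator.
    def parse_ltsv(line):
        record = {}
        for field in line.rstrip("\n").split("\t"):
            label, _, value = field.partition(":")
            record[label] = value
        return record

    line = "time:2024-01-01T00:00:00Z\thost:example.com\tstatus:200"
    print(parse_ltsv(line))
    # {'time': '2024-01-01T00:00:00Z', 'host': 'example.com', 'status': '200'}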
FWIW I have designed an upgrade to TSV as part of https://www.oilshell.org. Oil actually emits it now from 'pp proc' to pretty print "functions" and their docstrings as a table. It will be used in many parts of Oil, most of which aren't implemented yet.
It's called QTT -- Quoted, Typed Tables (formerly QTSV)
I think the only way that such a format becomes popular is if Oil itself becomes popular, so I haven't really emphasized it as a spec. (Similar to how JSON wouldn't be popular if the browser / JS weren't popular)
But it is a pretty easy spec -- just take TSV, add types to the column header, and specify that any cell that begins with a single quote is QSN.
A nice property is that almost every TSV file is a valid QTT file -- the exceptions being TSV files where a cell is just a single quote, or contains malformed QSN.
Note that TSV officially can't represent fields with tabs, but QTT can with '\t'.
As always feel free to contact me if you want to work on any of this
A key design point of QTT is that it's a special case of the "lines of text" narrow waist. CSV does not have that property, because fields can contain embedded newlines. And unlike TSV, QTT cells can contain arbitrary values.
That is, narrow waists are necessarily a COMPROMISE. JSON is a compromise, and Rust users will be equally unhappy as Lua or Erlang users. That is a feature and not a bug for something meant for interoperability. You are "stuck with" the lowest common denominator, but that's what enables interop.
(The lowest common denominator of them all is Unix byte streams.)
I contrast "monoglot" serialization formats like Python pickle and Go's .gob with language-independent formats like JSON, TSV, and HTML. The wisdom of JSON is that Crockford specified it independently of JavaScript.
But both are useful.
It's not clear if RSON is meant to be monoglot or polyglot, but it's a huge difference and it seems more monoglot. QSN on the other hand is definitely a polyglot design like JSON, despite being derived from Rust.
If we are talking about machine transfer, there are tons of formats that work, and many of them are binary. xls, ods, SYLK, and dta are examples (not sure about UTF-8 with SYLK and dta). Then there are the "just write it in code" formats like SQL dumpfiles (some XML fits here, too) - which work well and carry really nice, readable type declarations.
If we are talking about human readable, text editor safe formats, we have lots of challenges. ASCII delimiter characters don't display and aren't on the keyboard (and will sometimes have side effects with old wire protocols). Nearly any character that is on a keyboard (or on international keyboards) eventually becomes a problem, because you will have to escape it when it appears in data rather than as a field delimiter (comma is the classic example) or record delimiter (<CR>, for example, can be valid inside a longer text field). Also, many of the first 31 ASCII characters (the control characters) are used in different and surprising ways (the classic being null) by hardware and software.
BTW what makes CSV bad is that there is no escape character for , or ". Likewise, TSV has the same problem with tabs. JSON, YAML, and XML seem to be much better, but all have edge cases.
No doubt binary formats like Parquet are the way to go for high performance with multi-GB datasets. Seems like total overkill if you have a few hundred or thousand rows of data though. Being able to create/edit/view stuff in a text editor and easily version it is very useful.
It’s strange that we pretend text is not binary. The truth is our many tools are set up to handle binary text data and these tools are not set up for alternate encodings.
If you grab a text file from a Windows machine and bring it to a mac, you’ll see that txt is far from perfect.
This is a long way of saying that if we develop both the format and the tooling then the distinction of text vs “binary” tabular data goes away.
Fair point. But the vast majority of programming tools do already handle UTF-8 text (although they perhaps don't do a great job with US/RS characters).
Also, you could write an efficient parser for US/RS separated data in 5 minutes. For parquet you would have to integrate with a library (with all its dependencies and design choices) or spend days writing your own parser.
Text is text. It's stored as bits, typically 8 bits per character. It allows arbitrary precision but is very space (and memory throughput) inefficient. Fine, compression works but there are many more bits flying around and you're spending CPU cycles on decompression.
What could you do instead? Use appropriate integers, fixed point, or floating point representations. Everybody here knows these things. Nobody is pretending anything.
To the point of portability: IEEE-754 is the same everywhere.
Do people really edit csvs in a text editor? It's horrific, the columns don't line up at all, empty cells are represented by a bunch of commas in a row (which, are you supposed to count all the commas?)
And in terms of versioning, I have seen people commit diffs of csvs before, and they're equally unreadable.
CSV is a plain text format, but that basically buys you nothing. As long as you're going to be loading it into excel or whatever anyway, might as well just pick a good binary format like parquet.
I frequently edit CSVs by hand. As long as you understand your data, and stakes are low (e.g. it's a one off analysis not prod code) it's pretty easy and convenient.
I frequently use the table syntax in org-mode. No, I wouldn’t edit CSVs because they are miserable, but a friendly syntax with some basic text editor support, absolutely.
The proposed format is reasonably sane, but you really want to prevent people from writing these files by hand, and to add a bit of metadata describing the column data types at a minimum, and ideally more information such as allowed values, semantics, etc.
To that end, I suggest putting the tabular data file, along with a metadata descriptor file, inside an archive format (zip, tarball, etc.); that would add just the right size of speed bump to encourage accessing the data through tools and libraries (though if someone is just a bit determined, reading and editing the contained data directly isn't actually impossible or forbidden).
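A sketch of that bundling idea; the file names and the metadata schema here are made up for illustration.

    # Sketch of the archive-plus-descriptor idea: bundle the data file with a
    # small metadata file describing column types. Names and the metadata
    # schema are made up.
    import json, zipfile

    metadata = {
        "columns": [
            {"name": "age", "type": "int", "nullable": False},
            {"name": "name", "type": "string", "nullable": True},
        ],
    }

    with zipfile.ZipFile("people.table.zip", "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("metadata.json", json.dumps(metadata, indent=2))
        z.writestr("data.tsv", "age\tname\n42\tAda\n85\tGrace\n")

    # Consumers read the descriptor first, then interpret the data accordingly.
    with zipfile.ZipFile("people.table.zip") as z:
        meta = json.loads(z.read("metadata.json"))
        rows = z.read("data.tsv").decode("utf-8").splitlines()
    print(meta["columns"][0], rows[1])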
All that said, if you want a better tabular data interchange format badly enough that you're considering devising one, you should probably look at using something even more featureful, like SQLite:
SimonW’s preferred technique of using sqlite as the means of exchange is also very powerful. Particularly when combined with all of the utils he maintains.
I agree with the author. There’s merit in an open format developed with a simple spec, pretty much as defined here. Of course more formalism should be attached to it, i.e. BNF (am I too old?)
Anyway, one of the giants making billions out of data should maybe commission the publication of such a standard and the implementations in every programming language will follow, I’m sure.
The Unicode bit alone and an enforced MIME type make the case in my opinion, and it should have streaming capabilities as well. Maybe chunking of content so that it can be used for caching and compression, ideally still human readable but not marked up like XML, where you can't see the end of it.
The fact that it can do more is in no way a negative. You can even make a limited JSON parser with reduced capabilities. And with JSON you can express more, like header names vs column names vs just arrays of arrays.
Doesn't have a great int64 number story, no native dates / date-times. If you want named tuples, then the names need to go everywhere, otherwise it's a serialization mechanism on top of JSON.
I think the problem is that there is no clear standard or convention for how to do this. Indeed there are many ways to represent tabular data. Most large software projects probably have at least one bespoke file format that they use internally. The trouble is the lack of such a format that everyone agrees on.
I remember my annoyance at XML that you couldn’t simply append a row, due to the closing tag.
I think there was some off-spec dialect called something like “appendable xml” (only snappier) that said the closing tag of the root element was optional, but it was clearly a hack.
I wanted there to be no root element at all, and the filename (or, the last part of the url) would act as the root element. Kind of weird idea in hindsight, but kind of useful too.
One downside is that there’d be no obvious way to specify attributes on the root element.
That's inferior to even CSV in all its forms. It cannot be appended to without removing the trailing square bracket. It repeats column names for each row, and you can't even know how many rows there are without parsing. It requires quotes (technically CSV allows multiline rows, but that's rarely used in practice).
JSON is absolutely horrid for this type of hack-it-in job. Tabular data just isn't deeply structured data, and it's easy to cut and paste.
That way the keys aren't repeated. It wouldn't deserialize into coherent, useful JS objects directly, but it would be trivial to convert such a table object into a true array of JavaScript objects.
Not being able to do more is exactly the point. By restricting the space of the file format, we can free the mind to think about how to fit the data to the format.
If you can do anything, it becomes hard to do the right thing. If you can only do the right thing, it becomes trivial to decide.
I've done exactly what the author suggests for the exact same reasons. CSV headaches got me searching for other delimiters and I saw there were already unit and record separators, "WTF!?".
As long as you're dealing with text, and you're writing and parsing the data yourself, and you never need to use tools other than things that can be modified to set the item and line delimiters, then it's great. I haven't used the approach since then.
Mostly I haven't used it because half my use cases need binary numerical data (floats) and the others can be worked with entirely in Excel or as CSVs. But I like the idea, even if the support just isn't there.
Hi folks, Stavros from TileDB here. Here are my two cents on tabular data. TileDB (Embedded) is a very serious competitor to Parquet, the only other sane choice IMO when it comes to storing large volumes of tabular data (especially when combined with Arrow). Admittedly, we haven’t been advertising TileDB’s tabular capabilities, but that’s only because we were busy with much more challenging applications, such as genomics (population and single-cell), LiDAR, imaging and other very convoluted (from a data format perspective) domains.
Similar to Parquet:
* TileDB is columnar and comes with a lot of compressors, checksum and encryption filters.
* TileDB is built in C++ with multi-threading and vectorization in mind
* TileDB integrates with Arrow, using zero-copy techniques
I've used \u001F and \u001E for years. Don't need to check for and un/escape separator characters, very compact, and python's split() and join() makes implementation trivial.
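Roughly like this, assuming the two separators never occur in the data (which is the whole contract):

    # Rough sketch of the split()/join() approach: no escaping, just the
    # assumption that \x1f and \x1e never appear in the data.
    US, RS = "\x1f", "\x1e"

    def dumps(rows):
        return RS.join(US.join(cells) for cells in rows)

    def loads(text):
        return [record.split(US) for record in text.split(RS)]

    table = [["name", "age"], ["Ada", "36"], ["Grace", "85"]]
    assert loads(dumps(table)) == table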
Plain ASCII has everything you need, but nobody uses these features because they’re non-printing chars. On the one hand you want a special format, but on the other you want to edit it like plain text?
The article's proposal seems worse than tab separated.
Not having escaping is unacceptable, of course, and once you add escaping you might as well use the standard tab and newline characters as delimiters (with \t, \n, \\ escapes plus \N for SQL NULL). That results in a file that is properly formatted by the terminal and text editors and just works with Unix tools and most spreadsheet/database importers, unlike a file with the weird separators the article proposes.
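A sketch of that escaping scheme as described above; the exact rules here are my reading of the comment, not a formal spec.

    # Sketch of tab/newline delimiters with \t, \n, \\ escapes and \N for
    # NULL, as described in the comment above.
    def escape_field(value):
        if value is None:
            return r"\N"
        return value.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")

    def unescape_field(text):
        if text == r"\N":
            return None
        out, i = [], 0
        while i < len(text):
            if text[i] == "\\" and i + 1 < len(text):
                out.append({"t": "\t", "n": "\n", "\\": "\\"}.get(text[i + 1], text[i + 1]))
                i += 2
            else:
                out.append(text[i])
                i += 1
        return "".join(out)

    row = ["has\ta tab", None, "has a\nnewline", "back\\slash"]
    line = "\t".join(escape_field(f) for f in row)
    assert [unescape_field(f) for f in line.split("\t")] == row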
That said, I've enjoyed using Carlo Strozzi's NoSQL program in the past. It's not NoSQL in the current sense of the word. It's a plain text database that stores data in delimited text fields -- e.g. tsv or csv. Create those fields however you want. I just used rlwrap/readline in a simple shell program for input. It does the basic relational database things, but all with plain text. A text editor will work just fine for data entry, and for viewing it later. It will be around as long as Linux commandline utilities are around.
Your data can be viewed in a line-oriented or list-oriented fashion -- the program does the conversion for you in commandline fashion. And you can view everything in a text editor. I think it's just a bit of code tying together a bunch of standard Linux commandline utilities. It's not in a lot of repositories these days. But back when, I found it quite handy for working with line-oriented tabular data. It was the guts of a bailing wire and bubblegum program I wrote to create reinforcing steel barlists back when I only thought I knew what I was doing. My ignorance persists but now I'm aware of and at peace with it.
> This is a simple file format for storing tabular content. It's meant to work like CSV files. So far only the low-level encoding is specified, handling e.g. type conversion, etc., is left to a higher-level abstraction layer (as yet unwritten.)
> An *.sv file consists of one or more documents separated by ASCII Group Separator character (0x1D), sv files are UTF-8 encoded.
> Each document can have an optional header indicated by the presence of ASCII Start of Heading character (0x01) followed by zero or more records of metadata. If present, the first record should be field names, the second record should be type names, any further records are user-defined. The first two records (names and types) must be the same length as the data records. The header is concluded by an ASCII Start of Text character (0x02).
> Records are separated by ASCII Record Separator characters (0x1E), and data units (fields) of each record are separated by ASCII Unit Separator characters (0x1F). All record values must be strings.
> ...a grammar:
sv ::= doc [GS sv]
doc ::= [header] records
header ::= SOH records STX
records ::= record [RS records]
record ::= field [US record]
field ::= <sequence of zero or more non-sv Unicode characters>
A table of the sv ASCII characters:
1 01 00000001 SOH  Start of Heading
2 02 00000010 STX  Start of Text
29 1D 00011101 GS  Group Separator
30 1E 00011110 RS  Record Separator
31 1F 00011111 US  Unit Separator
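A sketch of a reader for that spec, based only on the description and grammar quoted above (untested against any reference implementation):

    # Sketch of a reader for the quoted *.sv spec: GS-separated documents, an
    # optional SOH...STX header (field names, then type names, then any
    # user-defined records), RS-separated records, US-separated fields.
    SOH, STX, GS, RS, US = "\x01", "\x02", "\x1d", "\x1e", "\x1f"

    def parse_records(text):
        return [rec.split(US) for rec in text.split(RS)] if text else []

    def parse_document(doc):
        header = None
        if doc.startswith(SOH):
            head, _, doc = doc[1:].partition(STX)
            header = parse_records(head)   # [names, types, *user_defined]
        return {"header": header, "records": parse_records(doc)}

    def parse_sv(text):
        return [parse_document(doc) for doc in text.split(GS)]

    sample = (SOH + "name" + US + "age" + RS + "string" + US + "int" + STX +
              "Ada" + US + "36" + RS + "Grace" + US + "85")
    print(parse_sv(sample))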
1. Either you want to edit the file in a text editor and you need to have a separator that has a key on your keyboard. Comma is indeed not the greatest choice, but tab or vertical pipe are usually fine. Most parsers will allow you to define the separator.
2. Or you don’t and you’re much better off with any binary format.
Why aren’t unicode separators keys on your keyboard? Maybe that’s suboptimal, but good luck changing that.
> There doesn’t seem to be anything that is reasonably space efficient, simple and quick to parse and text based (not binary) so you can view and edit it with a standard editor.
> XML and Javascript are tree structures and not suitable for efficiently storing tabular data (plus other issues).
You can certainly be efficient with json(net). See:
In my view you can't have it both ways - if you want efficiency it's not reasonable to expect to be able to hand edit the data in a text editor. Especially since at the scale that efficiency matters your text editor is going to struggle.
If you don't care that much about efficiency then CSV, JSON, XML all provide perfectly good solutions and they're actually more easily editable in a text editor than what is being suggested. It's also much more sensible if you are hand-editing tabular data to just slurp it in to a jupyter notebook and do the edits there and shoot it back out because it means you've got a straight forward record of what changes you've had to make and can re-run those changes at a click of a button.
Because a file (or computer memory in general) is a linear (1D) storage medium and tabular (2D) data is not well suited to linear serialisation. You can't avoid adding complexity (i.e. making trade-offs) to the format and any code that deals with it.
So far it's just a python library, but I'm planning on adding editor plugins at least for vim, vscode and maybe emacs, libraries for additional languages, and maybe some cli commands for it.
One distinction from the OP is the delimiters also include a tab (for fields) or newline (for records) by default (but not in "compact" mode). That has the benefit that the files are at least readable with editors and pagers that aren't aware of the format.
As for why, the rationale is that tabular data often needs more than the things that the article decries (e.g. efficient parsing / processing of partial records, sparse values, richly typed values, flexibility of schema evolution (i.e. not having to assume specific columns upfront)...
If it's tabular, I want schema for the columns. Is this column a 'number' or a string? Even better, is there a max length or precision known? Can the cell be null and how is that represented? How are dates formatted? Are they UTC/specific TZ, etc.
Most of my complaints about CSV relate to trying to determine the types used to parse or import as, not how commas are escaped. Excel, for example, actually warns you about this if you try to save a native Excel file as CSV: you are effectively throwing away information.
Mostly the problem comes from how excel is apocalyptically shitty at inferring datatypes, incorrectly assuming non-dates are dates and ISO8601 dates are god knows what, when a sane format would default to text if it didn't know better.
That kind of frustration was what drove me away from spreadsheets in the end. Someone sent me a native excel file for the first time in ages the other day and I just opened it in Pandas without thinking about it.
Pandas also gets things wrong (eg give it a column of unix timestamps with one missing value and there's a good chance it will treat them as float) and it took me a while to adapt to its vector math approach. But once I got comfortable with it working with a new pile of messy data began to feel pleasant rather than annoying.
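For anyone who hasn't hit it, the timestamp surprise looks like this (column names and values are made up):

    # The int-column-with-a-hole surprise: one missing value forces the whole
    # column to float64, so unix timestamps lose their integer dtype.
    import io
    import pandas as pd

    csv_text = "event,ts\nlogin,1700000000\nlogout,\npurchase,1700000123\n"
    df = pd.read_csv(io.StringIO(csv_text))
    print(df.dtypes["ts"])    # float64, not int64
    print(df["ts"].tolist())  # [1700000000.0, nan, 1700000123.0]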
I'm inclined to agree. CSVs which are well-formed (escapes within fields handled consistently) shouldn't be that hard to parse.
I can't think of a reason your algo wouldn't be logically sound for good CSV files, although a little backtracking might be necessary to recognize escaping of delimiters in edge cases.
The author writes "CSV is a mess. One quote in the wrong place and the file is invalid.", but what logical formats can tolerate arbitrary corruption? An unclosed tag is similarly problematic for xml. In both cases you wind up falling back to heuristics.
It's true that CSVs often contain a mess of encodings inside fields, but that's not the problem of the CSV format per se. Validation of field encodings, or validation that the entire file is in a uniform encoding... those are separate requirements.
Indeed, every distributed query engine I've used can easily parallelise CSV in the same file (so long as it's splittable, friends don't let friends gzip their data), with the option to ignore bad rows, log them, or throw your hands up and die.
Admittedly, all of them are Java based and use Hadoop libs for handling CSV, which makes sense, the Elephant ecosystem has spent years getting this stuff right.
I don't think anyone wants tabular data. They want to ingest it into a system so they can query it, or join it, or aggregate it. They want to get rid of tabular data as quickly as possible.
Data at the boundary must be validated; a file format can't do that for you. Semantics is harder than syntax, and you can't push every problem to that level.
The finance world runs on tables. They want to ingest it, query it, join it, aggregate it ... and look at the result in tables. Tables are everywhere, and a generic tabular format is constantly useful for a la carte one off spontaneous data processing. I love strict schema validation as much as the next guy, but sometimes you just want to cut and paste a table.
yeah, um, my life is in tables (stats, finance, etc).
while csv is less than optimal, it's always worked out in practice for me (because we still have to validate anything of significant size and nothing finance or city/nation sized should generally be edited by hand anyway, and errors are everywhere independent of format).
honestly, my bigger nightmare has been people trying to put things into something like xml...
edit: not that there aren't huge problems with the csv/excel/database software and interaction: generally I just end up having to assume everything is text and defining type information my end afterwards on successful parsing
We've had XML for decades and the default interop is still tabular data - probably in large part because unless you go into non-standard extensions, tabular data is the easiest thing to get from a database query.
Erm, we do. You just defined it and gave it a name. There's nothing stopping you using it in your software. Persuading others to use it will be trickier given the existing widely-supported options really aren't so bad, but you could give it a go.
Perhaps we “just” need Excel and Numbers to support reading .parquet files, and come up with a CLI tool to view them like head/tail/awk? Yes, they still cannot be read as plain text so functionality is quite restricted with respect to CLI pipelines
Why are all these "separated"? Sounds like null-terminated C strings. Wouldn't it be better to put the dimension in a header and then just dump the contents unseparated? Why the need to view them in simple text editors?
Fixed width formats are only efficient when everything in a column is a similar length. You only need a single 1000 char data value in a column to mean either the whole column is 1000 chars or the data is truncated.
Fixed width formats are pretty efficient for quick random access.
Btw, you can have a mostly fixed width format, but with an escape that puts larger value somewhere else. That's more or less what some relational databases are doing, I think.
I had also thought of the same format described there (but without Unicode), and others have done so as well. The SQLite command shell also implements this format as one of the possibilities.
I prefer tab-separated to comma-separated.
With an editor like Excel you can then make it so you see exactly where the columns are and which column each datum belongs to.
This is a perennial topic. Related prior discussion on this forum alone is voluminous:
> I'm surprised never to have seen any support for CSV files using US/RS instead of TAB/(CR)LF or comma/(CR)LF. Or any support in Unix to use RS instead of LF or NUL to delineate records.
February 27, 2022, on: "The Great Curly Brace Trace Chase (2003)"
> The record separator ship sailed long ago. The advantage of newline is that it’s easy to use with a bunch of standard tools like less/head/tail/your text editor.
> Easy: CSVs are human readable and writeable and humans tend to have a comma sign on their keyboard, while they don't have the ASCII record seperator visible (teaching them to use eg Alt + 30 will not work).
May 15, 2020 on: "So you want to write your own CSV code (2014)"
> ASCII also includes control characters for delimiting text records. If people used these purposed-designed characters instead of comma or tab characters as delimiters, we could avoid many headaches quoting and escaping CSV data.
September 26, 2019, on: "Four Column ASCII (2017)"
> ASCII actually contains four control characters for this purpose: file, group, record and unit separator. I'm sure a lot of problems would be solved if people just used these.
> It really makes me sad that CSV even exists: ASCII defines field ('unit') & record separator characters (also group & file, but those are less-useful), as well as an escape character.
June 8, 2016 on: "ParaText: CSV parsing at 2.5 GB per second"
> Good old ASCII has characters specifically devoted to separating fields, keys, etc. that no-one uses for anything else. Why not use them instead of inventing a new character that does the same thing?
June 5, 2015 on: "Almost every Cassandra feature has some surprising behavior"