It seems as though one could easily build a file format far more useful than CSV simply by utilizing these separators, and I'm sure it's been done countless times.
Perhaps this would make an interesting personal project. Are you aware of any hurdles, missing key features, etc. that previous attempts at creating such a format have run into (other than adoption, obviously)?
I've done ETL work with systems that used the ASCII separators. It was very pleasant work. Not having to worry about escaping things (because the ASCII separators weren't permitted to be in valid source data to begin with) was very, very nice.
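To give a sense of how simple the parsing gets, here's a minimal Python sketch (my own illustration, not the actual system I worked with):

    # ASCII separators: FS = 0x1C, GS = 0x1D, RS = 0x1E, US = 0x1F
    RS = "\x1e"  # record separator: ends each record
    US = "\x1f"  # unit separator: splits fields within a record

    def read_records(text):
        # No quoting or escaping logic at all: by convention the
        # separators are simply forbidden in the source data.
        return [rec.split(US) for rec in text.split(RS) if rec]

    data = f"Ada{US}1815{RS}Grace{US}1906{RS}"
    print(read_records(data))  # [['Ada', '1815'], ['Grace', '1906']]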
I'm a Notepad++ person. When I needed to mock up data, typing the characters was easy: just ALT plus the decimal ASCII code on the numeric keypad (FS is 28, GS is 29, RS is 30, US is 31). It took a bit to memorize the codes I needed. Notepad++ renders them as inverse-video text with their initials (US, RS, and so on).
The ASCII unit separator and record separator characters are not well supported by editors. That is why people stick to the (horrible and inconsistent) CSV format.
I started writing up a spec and library for such a format, then my ADHD drew me to other projects before I finished it. Hopefully I'll get back to it someday.
The "compact" file format (without the tab and newline) should be the SSV standard for data interchange and storage. The pretty-printed format should only be used locally to display/edit with non-compliant editors, then convert back as soon as you're done.
In time, editors and file browsers should come to render separators visually in a logical way.
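Roughly this kind of round trip, say (a minimal Python sketch under my own assumptions about the separators, since the spec isn't finished):

    US, RS = "\x1f", "\x1e"  # compact separators for storage/interchange
    TAB, NL = "\t", "\n"     # display separators an ordinary editor can show

    def to_display(compact):
        # Lossless only if the fields contain no literal tabs/newlines.
        assert TAB not in compact and NL not in compact
        return compact.replace(US, TAB).replace(RS, NL)

    def to_compact(display):
        return display.replace(TAB, US).replace(NL, RS)

    compact = f"name{US}born{RS}Ada{US}1815{RS}"
    assert to_compact(to_display(compact)) == compact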
What you've got so far looks promising to me. Pretty much just what I was thinking of doing, in fact, albeit with some details worked out that I hadn't yet considered.
Nice job. I hope you come back to finish the project eventually.
Yes, sometimes, of course. It's a bit like JSON. Sometimes it's easiest to inject a small piece of hand-written data into a test or whatever.
(That said, every text editor ever should have had a "table mode" that uses the ASCII field/record separators (or whatever you choose). I was always confused about why this isn't common. Maybe vim and emacs do?)
A lot of the machine learning world has started using it, and it's annoying as hell: it solves a problem that doesn't exist, has inadequate documentation, lacks a good GUI viewer, and lacks good command-line converters to JSON, XML, CSV, and everything else.
No binary format will ever kill CSV: plain-text formats embody the UNIX philosophy of text files and pipelines of text-processing tools, and nothing is more durable than keeping your data in text-based exchange formats.
You won't remember Parquet in 15 years, but you will have CSV files in 50 years.
> You won't remember Parquet in 15 years, but you will have CSV files in 50 years.
You're probably right about CSV, but probably not about Parquet. Parquet is already 11 years old, there are vast data warehouses that store Parquet, it's a first-class citizen in the Spark ecosystem, and it's a key component of Iceberg. Crucially, formats like Parquet are "good enough" for a use case that doesn't appear to be going away. There is a high probability, in my estimation, that enough places will still be using it in 15 years for it to be memorable, even if it isn't as common or as visible.
CSV would actually be a nice format if it weren't for literal newlines being allowed INSIDE values. That alone makes it much harder to parse correctly with simple code, because you can't count on ASCII-mode readline()-like functions to fetch one record in its entirety.
Considering it also separates records with newlines, they really should have replaced embedded newlines with "\n" and required escaping "\" as "\\".
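Something like this, say (my own sketch of that escaping scheme, not anything from the actual CSV spec):

    def escape_field(s):
        # Order matters: escape the escape character itself first.
        return s.replace("\\", "\\\\").replace("\n", "\\n")

    def unescape_field(s):
        out, i = [], 0
        while i < len(s):
            if s[i] == "\\" and i + 1 < len(s):
                out.append("\n" if s[i + 1] == "n" else s[i + 1])
                i += 2
            else:
                out.append(s[i])
                i += 1
        return "".join(out)

    assert unescape_field(escape_field("a\nb\\c")) == "a\nb\\c"

With that, readline() would always fetch exactly one record.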
I often use them in compound keys (e.g., in a flat key space as might be used by a cache or similar simple key/value store). IMHO, they are superior to other common separators like colons, dashes, etc. because they are (1) semantically appropriate and (2) less likely to be present in the constituent pieces of data, especially if the data in question is already limited to a subset of characters that do not include the separators, which it often is (e.g., a URL).
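For example (a hypothetical sketch; the key pieces and the store are made up):

    US = "\x1f"  # ASCII unit separator

    def cache_key(*parts):
        # Guard anyway: refuse pieces that already contain the separator.
        if any(US in p for p in parts):
            raise ValueError("part contains the separator")
        return US.join(parts)

    key = cache_key("user", "42", "https://example.com/profile")
    # 'user\x1f42\x1fhttps://example.com/profile'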
“Less likely” doesn’t help if you may get arbitrary (user) input. If you can use a byte sequence as the key, a better strategy is to UTF-8-encode the pieces and use 0xFF as the separator byte, which can never occur in UTF-8.
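A minimal sketch of that strategy (Python, my own naming):

    SEP = b"\xff"  # 0xFF never occurs in well-formed UTF-8

    def byte_key(*parts):
        # UTF-8-encoding each piece guarantees the separator byte can
        # never collide with the data, for ANY input string.
        return SEP.join(p.encode("utf-8") for p in parts)

    def split_key(key):
        return [p.decode("utf-8") for p in key.split(SEP)]

    assert split_key(byte_key("user", "a\x1fb", "naïve")) == ["user", "a\x1fb", "naïve"]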
Dedicated separator characters don't solve the problem: you'd still need to escape them. Or validate that the data (which may come from untrusted web forms etc.) does not contain them, which means you have another error condition to handle.
There's an ASCII character for escaping, too (ESC, 0x1B), if you need it.
The advantage of ASV is not that you can't have invalid or insecure data; it's that valid data will almost never contain ASCII control characters in the record fields themselves. Commas, quotation marks, and backslashes, meanwhile, are everywhere.
Or specify that the data can't contain those characters. If it does, you have to use a different format. This keeps everything super simple. And how often are ASCII US and RS characters used in data? I don't think I have ever seen one in the wild, apart from in a .asv file.
I'm no expert on character encodings or Unicode itself, but would this be as simple as checking for the byte 0x1F in the data? Assuming the file is ASCII- or UTF-8-encoded (or attempting to confirm this as much as possible as well), it seems like that check would suffice to validate the absence of the code point in the data, but I imagine it's not quite so simple.
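What I have in mind is something like this (a sketch; from what I can tell, bytes below 0x80 in UTF-8 only ever encode their own code points, so a plain byte scan should work for text):

    def validate(field: bytes) -> bytes:
        # In ASCII or UTF-8, 0x1F/0x1E can only be the US/RS control
        # characters themselves, never part of a multi-byte sequence.
        if b"\x1f" in field or b"\x1e" in field:
            raise ValueError("field contains a reserved separator byte")
        return field

    validate("hello, world".encode("utf-8"))   # fine
    # validate("bad\x1fdata".encode("utf-8"))  # raises ValueError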
For text data, it would work fine, but you'd have to do some finagling with binary data; 0x1F is a perfectly valid byte to have in, say, a 4-byte integer.
My going assumption is that arbitrary binary data should be in a binary format.
Feel free to correct me, but I figure that as long as data can be anything from 0x00 to 0xFF per byte, no format that uses characters in that range will ever be safe. I'm not a big C developer, but I figure null-terminated strings have the same limitation.
But if it's something entered by keyboard, you should be OK using control codes.
Personally, I find tab and return to be fine for text-driven stuff. They show up in an editor just as intended.
The “problem” I’m referring to is that we chose a widely used character as a field separator. Of course you still have to write a parser, etc.; it’s just a lot easier if you choose a dedicated character.
Because they're zero-width. If you can't see a separator when you print your data, it's a machine-only separator, which makes it a bad separator for data that humans need to look at and work with.
(Because CSV is a terrible data exchange format in terms of information per byte. But that makes sense, because it's intentionally a human-readable exchange format, not a machine format.)