It seems as though one could easily build a file format far more useful than CSV simply by utilizing these separators, and I'm sure it's been done countless times.
Perhaps this would make an interesting personal project. Are you aware of any hurdles, missing key features, etc. that previous attempts at creating such a format have run into (other than adoption, obviously)?
I've done ETL work with systems that used the ASCII separators. It was very pleasant work. Not having to worry about escaping things (because the ASCII separators weren't permitted to be in valid source data to begin with) was very, very nice.
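To give a sense of how simple the parsing gets, here's a minimal Python sketch (my own illustration, not the actual system I worked with):

    # ASCII separators: FS = 0x1C, GS = 0x1D, RS = 0x1E, US = 0x1F
    RS = "\x1e"  # record separator: ends each record
    US = "\x1f"  # unit separator: splits fields within a record

    def read_records(text):
        # No quoting or escaping logic at all: by convention the
        # separators are simply forbidden in the source data.
        return [rec.split(US) for rec in text.split(RS) if rec]

    data = f"Ada{US}1815{RS}Grace{US}1906{RS}"
    print(read_records(data))  # [['Ada', '1815'], ['Grace', '1906']]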
I'm a Notepad++ person. When I needed to mock up data, typing the characters was easy: just ALT plus the decimal ASCII code on the numeric keypad (FS is 28, GS is 29, RS is 30, US is 31). It took a bit to memorize the codes I needed. Notepad++ renders them as inverse-video text with their initials (US, RS, and so on).
The ASCII unit separator and record separator characters are not well supported by editors. That is why people stick to the (horrible and inconsistent) CSV format.
I started writing up a spec and library for such a format, then my ADHD drew me to other projects before I finished it. Hopefully I'll get back to it someday.
The "compact" file format (without the tab and newline) should be the SSV standard for data interchange and storage. The pretty-printed format should only be used locally to display/edit with non-compliant editors, then convert back as soon as you're done.
In time, editors and file browsers should come to render separators visually in a logical way.
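Roughly this kind of round trip, say (a minimal Python sketch under my own assumptions about the separators, since the spec isn't finished):

    US, RS = "\x1f", "\x1e"  # compact separators for storage/interchange
    TAB, NL = "\t", "\n"     # display separators an ordinary editor can show

    def to_display(compact):
        # Lossless only if the fields contain no literal tabs/newlines.
        assert TAB not in compact and NL not in compact
        return compact.replace(US, TAB).replace(RS, NL)

    def to_compact(display):
        return display.replace(TAB, US).replace(NL, RS)

    compact = f"name{US}born{RS}Ada{US}1815{RS}"
    assert to_compact(to_display(compact)) == compact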
What you've got so far looks promising to me. Pretty much just what I was thinking of doing, in fact, albeit with some details worked out that I hadn't yet considered.
Nice job. I hope you come back to finish the project eventually.
Yes, sometimes, of course. It's a bit like JSON. Sometimes it's easiest to inject a small piece of hand-written data into a test or whatever.
(That said, every text editor ever should have had a "table mode" that uses the ASCII field/record separators (or whatever you choose). I was always confused about why this isn't common. Maybe vim and emacs do?)
A lot of the machine learning world has started using it, and it's annoying as hell: it solves a problem that doesn't exist, has inadequate documentation, lacks a good GUI viewer, and lacks good command-line converters to JSON, XML, CSV, and everything else.
No binary format will ever kill CSV: plain-text formats embody the UNIX philosophy of text files and pipelines of text-processing tools, and nothing is more durable than keeping your data in text-based exchange formats.
You won't remember Parquet in 15 years, but you will have CSV files in 50 years.
> You won't remember Parquet in 15 years, but you will have CSV files in 50 years.
You're probably right about CSV, but probably not about Parquet. Parquet is already 11 years old, there are vast data warehouses that store Parquet, it's a first-class citizen in the Spark ecosystem, and it's a key component of Iceberg. Crucially, formats like Parquet are "good enough" for a use case that doesn't appear to be going away. There is a high probability, in my estimation, that enough places will still be using it in 15 years for it to be memorable, even if it isn't as common or as visible.
CSV would actually be a nice format if it weren't for literal newlines being allowed INSIDE values. That alone makes it much harder to parse correctly with simple code, because you can't count on ASCII-mode readline()-like functions to fetch one record in its entirety.
Considering it also separates records with newlines, they really should have replaced embedded newlines with "\n" and required escaping "\" as "\\".
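Something like this, say (my own sketch of that escaping scheme, not anything from the actual CSV spec):

    def escape_field(s):
        # Order matters: escape the escape character itself first.
        return s.replace("\\", "\\\\").replace("\n", "\\n")

    def unescape_field(s):
        out, i = [], 0
        while i < len(s):
            if s[i] == "\\" and i + 1 < len(s):
                out.append("\n" if s[i + 1] == "n" else s[i + 1])
                i += 2
            else:
                out.append(s[i])
                i += 1
        return "".join(out)

    assert unescape_field(escape_field("a\nb\\c")) == "a\nb\\c"

With that, readline() would always fetch exactly one record.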
I often use them in compound keys (e.g., in a flat key space as might be used by a cache or similar simple key/value store). IMHO, they are superior to other common separators like colons, dashes, etc. because they are (1) semantically appropriate and (2) less likely to be present in the constituent pieces of data, especially if the data in question is already limited to a subset of characters that do not include the separators, which it often is (e.g., a URL).
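For example (a hypothetical sketch; the key pieces and the store are made up):

    US = "\x1f"  # ASCII unit separator

    def cache_key(*parts):
        # Guard anyway: refuse pieces that already contain the separator.
        if any(US in p for p in parts):
            raise ValueError("part contains the separator")
        return US.join(parts)

    key = cache_key("user", "42", "https://example.com/profile")
    # 'user\x1f42\x1fhttps://example.com/profile'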
“Less likely” doesn’t help if you may get arbitrary (user) input. If you can use a byte sequence as the key, a better strategy is to UTF-8-encode the pieces and use 0xFF as the separator byte, which can never occur in UTF-8.
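A minimal sketch of that strategy (Python, my own naming):

    SEP = b"\xff"  # 0xFF never occurs in well-formed UTF-8

    def byte_key(*parts):
        # UTF-8-encoding each piece guarantees the separator byte can
        # never collide with the data, for ANY input string.
        return SEP.join(p.encode("utf-8") for p in parts)

    def split_key(key):
        return [p.decode("utf-8") for p in key.split(SEP)]

    assert split_key(byte_key("user", "a\x1fb", "naïve")) == ["user", "a\x1fb", "naïve"]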
Dedicated separator characters don't solve the problem: you'd still need to escape them. Or validate that the data (which may come from untrusted web forms etc.) does not contain them, which means you have another error condition to handle.
There's an ASCII character for escaping, too (ESC, 0x1B), if you need it.
The advantage of ASV is not that you can't have invalid or insecure data; it's that valid data will almost never contain ASCII control characters in the record fields themselves. Commas, quotation marks, and backslashes, meanwhile, are everywhere.
Or specify that the data can't contain those characters. If it does, you have to use a different format. This keeps everything super simple. And how often are ASCII US and RS characters used in data? I don't think I have ever seen one in the wild, apart from in a .asv file.
I'm no expert on character encodings or Unicode itself, but would this be as simple as checking for the byte 0x1F in the data? Assuming the file is ASCII- or UTF-8-encoded (or attempting to confirm this as much as possible as well), it seems like that check would suffice to validate the absence of the code point in the data, but I imagine it's not quite so simple.
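What I have in mind is something like this (a sketch; from what I can tell, bytes below 0x80 in UTF-8 only ever encode their own code points, so a plain byte scan should work for text):

    def validate(field: bytes) -> bytes:
        # In ASCII or UTF-8, 0x1F/0x1E can only be the US/RS control
        # characters themselves, never part of a multi-byte sequence.
        if b"\x1f" in field or b"\x1e" in field:
            raise ValueError("field contains a reserved separator byte")
        return field

    validate("hello, world".encode("utf-8"))   # fine
    # validate("bad\x1fdata".encode("utf-8"))  # raises ValueError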
For text data, it would work fine, but you'd have to do some finagling with binary data; 0x1F is a perfectly valid byte to have in, say, a 4-byte integer.
My going assumption is that arbitrary binary data should be in a binary format.
Feel free to correct me, but I figure that as long as data can be anything from 0x00 to 0xFF per byte, no format that uses characters in that range will ever be safe. I'm not a big C developer, but I figure null-terminated strings have the same limitation.
But if it's something entered by keyboard, you should be OK using control codes.
Personally, I find tab and return to be fine for text-driven stuff. They show up in an editor just as intended.
The “problem” I’m referring to is that we chose a widely used character as a field separator. Of course you still have to write a parser, etc.; it’s just a lot easier if you choose a dedicated character.
Because they're zero-width. If you can't see a separator when you print your data, it's a machine-only separator, which makes it a bad separator for data that humans need to look at and work with.
(Because CSV is a terrible data exchange format in terms of information per byte. But that makes sense, because it's intentionally a human-readable exchange format, not a machine format.)