I'm fascinated that this uses the Unicode glyphs / symbols for the unit and record separators rather than the unit and record separators themselves (ASCII US and RS).
Perfect deployment of David Wheeler's aphorism:
> All problems in computer science can be solved by adding another level of indirection.
The answer makes sense to me, but I wish we could fix editors to properly handle the ASCII separators (1C, 1D, 1E, 1F) instead of resorting to Unicode control picture characters (241C, 241D, 241E, 241F).
Maybe if editors are fixed up we could adopt ASCII Separated Values (ASV) as the new standard.
Emacs has handled literal ASCII control characters correctly I believe since around the time I was born - probably somewhat earlier, if we count back further than GNU.
Unicode works fine there too, so it makes no nevermind to me which flavor people use. I just think it's funny how "everything old is new again".
Indeed, if the result is to be encoded with UTF-8, using the 1-byte separators rather than the multi-byte UTF-8 encoding of U+241F would make sense to me.
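For instance, a quick check in Python (my own sketch, not from the article) shows the size difference per separator:

# ASCII US is a single byte in UTF-8; the "symbol for US" needs three.
print(len("\x1f".encode("utf-8")))    # 1  (U+001F, ASCII unit separator)
print(len("\u241f".encode("utf-8")))  # 3  (U+241F, SYMBOL FOR UNIT SEPARATOR)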
I'd also prefer it if escapes were done in the "traditional" manner (for example, "\t" for a tab), because then you can read in stuff with something like input.split("\t").map(unescape): you know any actual tab character in the input is a field separator, and you can then go through the fields to put back the escaped ones.
> you can then read in stuff with something like input.split("\t").map(unescape)
What about input lines like 'asdf\\thjkl\tzxcvb'? That should be two fields, one the string ‘asdf\thjkl’ and the other the string ‘zxcvb’.
I think that your way is a bit like trying to match context-free grammars with a regular expression. The right way is to parse the input character by character.
I think the suggestion is that the field separator is an actual tab character (ascii code 9) but tabs inside the field are `\t`. So, splitting on the tab character always works because fields cannot contain ascii code 9 but must use the two character escape instead.
Although matching up nested pairs of brackets requires something at least as powerful as a pushdown automaton (CFG matcher), discriminating between an arbitrary number of escaped backslashes followed by an unescaped 't' versus an arbitrary number of escaped backslashes followed by the '\t' escape sequence doesn't require anything more powerful than a finite state machine.
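For what it's worth, a minimal Python sketch of that scheme (literal tab as the field separator, "\t" and "\\" as the only in-field escapes; those rules are my assumption, not a spec) can be written as a single left-to-right pass, which is exactly the finite-state behaviour described above:

import re

def unescape(field):
    # One left-to-right pass: "\t" becomes a tab, "\x" becomes x for anything else.
    return re.sub(r'\\(.)', lambda m: '\t' if m.group(1) == 't' else m.group(1), field)

def parse_line(line):
    # A literal tab can only ever be a field separator, so splitting is safe.
    return [unescape(f) for f in line.split('\t')]

print(parse_line('asdf\\thjkl\tzxcvb'))  # ['asdf\thjkl', 'zxcvb'] -- two fields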
Indeed... I didn't read the standard in detail to check whether escaping is allowed/taken into account, but what if my data contains those symbols? I mean, they are perfectly legal Unicode printable characters, unlike the ASCII ones.
The point is that this is essentially ASCII DSV, which gives inherently better hierarchy than CSV, but with visible tokens and stream accommodation. You should read the GitHub README. It's not that long.
As for still needing escapes, using obscure symbols instead of ones that are extremely common in writing inherently means needing far far faaaaaaar fewer of them.
What's the point of visible tokens if it's all squished onto one line? You are not going to be editing this in a regular editor once you have a non-trivial amount of data.
And yes, I read the README and the source code, so I know that newlines are optional, existing tools don't generate them, and the multi-line examples are basically fake.
> What's the point of visible tokens if it's all squished in one line?
It doesn't have to be all squished in one line, it just doesn't hurt anything. Visually splitting squished lines for presentation or perusal is trivial because of the record separator.
> You are not going to be editing this in regular editor
I know (or at least I think) that you meant this in relation to squished lines getting very long, but maybe we can talk about it in a broader context, since record splitting is trivial...
One could easily say these same words about documents written in right-to-left languages. But people in Israel manage to create files too somehow, so that's clearly not an insurmountable barrier.
Editors generally support composing right-to-left languages that way? So I suppose the metaphor suggests that all editors should directly support the visible glyphs semantically?
And yet, that's explicitly not the semantic purpose of those glyphs. The actual delimiters already exist at a lower code point. If we're asking editors to semantically support delimiters we should be asking them to support the semantic delimiters.
You shouldn't need escapes for separator characters precisely because they are not designed for data. Their entire purpose is to separate hierarchical data.
If it turns out that escaping is needed, it will still be far rarer than escaping commas and newlines.
(For text processing, I use octal \034 all the time.)
Perhaps there is a software developer version of "Needs more cowbell" called "Needs more complexity"
Computer languages generally use the Latin alphabet. And even in a case like APL, which some HN commenters call "hieroglyphics", the number of symbols is limited and each is precisely defined (cf. potentially up to 1.1 million Unicode symbols and "emojis" that are open to interpretation).
The OP was probably assuming no human would want to actually read a CSV raw, and so was probably correct from their POV. Your POV is probably from someone who reads CSVs raw. You don't have to be so rude about it, you're being even more smug than the OP, probably.
One of the two likely works with CSVs for a living, and it's definitely not the person suggesting "What if it just was hard to eyeball/edit".
If you don't understand why something is the way it is, it might be better to start with a question than with a statement implying the tech ignores existing tech. Chesterton's fence still applies, and ignoring it means you're outsourcing your work to others. RTFM is a perfectly valid answer at that point.
ASCII has a field delimiter character. The fact that we chose commas and tabs because a field delimiter character is hard to type or see is one of those things that saddens me in computing.
Imagine the amount of pain that could have been spared if we had done it right from the start some 50 years ago.
The great thing about comma as a field separator is (1) the character is visible and (2) the character is common, so if there are escaping bugs in either the generator or the parser, they will quickly become apparent. Much better to fail fast than having a parse error at line 28357283 because a more uncommon separator character somehow still made its way into the data.
The bad thing is that it is common so you have to escape a lot. The much worse thing is that csv implementations have varying ways of handling escaping, or sometimes don't support escaping at all, so in practice csv files can't be used interoperably.
Unfortunately, the ASCII delimiters were used for streaming data. The pain that created was simple: you'd try to send data containing an RS or a US, and it would cause bad things to happen on the other end of the wire.
> Imagine the amount of pain that could have been spared if we had done it right from the start some 50 years ago.
I think it's the old adage: if you design something for idiots, someone will make a better idiot. In this case, it took less than 15 years.
I've used the ASCII delimiters in a webapp once; Javascript in the browser formatted data with them and sent it to my server via HTTP POSTs. I was a bit nervous that something in the path would break the data but happily it all just worked fine.
Currently saving the day in a data pipeline project which depends on a tool that only exports unescaped CSVs. They work very well through the pipeline; Unix split, awk, and then Snowflake all support them nicely. One downside is that they are awkward to type, and you never quite know whether you need to refer to them using octal, hex, or something else, and what special shell escaping might be needed.
Yeah it's really interesting to me how much of what we use/do is shaped by our input devices. Macropads are a start, but I'd love a keyboard with screens on each key, that's not absurdly expensive and can be layered easily.
I would think anyone this serious about keyboard control would be able to use layers, which are becoming pretty common, and would not have to "see" the keycaps.
It seems as though one could easily build a file format far more useful than CSV simply by utilizing these separators, and I'm sure it's been done countless times.
Perhaps this would make an interesting personal project. Are you aware of any hurdles, missing key features, etc. that previous attempts at creating such a format have run into (other than adoption, obviously)?
I've done ETL work with systems that used the ASCII separators. It was very pleasant work. Not having to worry about escaping things (because the ASCII separators weren't permitted to be in valid source data to begin with) was very, very nice.
I'm a Notepad++ person. When I needed to mock up data, typing the characters was easy: just Alt plus the ASCII code on the numeric keypad. It took a bit to memorize the codes I needed. Their visual representation is just inverse text and initials.
The ASCII unit separator and record separator characters are not well supported by editors. That is why people stick to the (horrible and inconsistent) CSV format.
I started writing up a spec and library for such a format, then my ADHD drew me to other projects before I finished it. Hopefully I'll get back to it someday.
The "compact" file format (without the tab and newline) should be the SSV standard for data interchange and storage. The pretty-printed format should only be used locally to display/edit with non-compliant editors, then convert back as soon as you're done.
In time, editors and file browsers should come to render separators visually in a logical way.
What you've got so far looks promising to me. Pretty much just what I was thinking of doing, in fact, albeit with some details worked out that I hadn't yet considered.
Nice job. I hope you come back to finish the project eventually.
Yes, sometimes, of course. It's a bit like JSON. Sometimes it's easiest to inject a small piece of hand-written data into a test or whatever.
(That said, every text editor since forever should have had a "table mode" that uses the ASCII field/record separators (or whatever you choose); I was always confused why this isn't common. Maybe vim and emacs do?)
A lot of the machine learning world has started using it, it's annoying as hell, solves a problem that doesn't exist, has inadequate documentation, lacks a good GUI viewer, and lacks good command line converters to JSON, XML, CSV, and everything else.
No binary format will ever kill CSV: plain-text based formats embody the UNIX philosophy of text files and text processing tool pipes to go with them, and nothing is more durable than keeping your data in text based exchange formats.
You won't remember Parquet in 15 years, but you will have CSV files in 50 years.
> You won't remember Parquet in 15 years, but you will have CSV files in 50 years.
You're probably right about CSV but probably not about Parquet. Parquet is already 11 years old, there are vast data warehouses that store Parquet, it's first-class in the Spark ecosystem, and it's a key component of Iceberg. Crucially, formats like Parquet are "good enough" for a use case that doesn't appear to be going away. There is a high probability, in my estimation, that enough places will still be using it in 15 years for it to be memorable, even if it isn't as common or as visible.
CSV would actually be a nice format if it weren't for literal newlines being allowed INSIDE values. That alone makes it much harder to parse correctly with simple code, because you can't count on ASCII-mode readline()-like functions to fetch one record in its entirety.
Considering it also separates records with newlines, they really should have replaced newlines with "\n" and required escaping "\" with "\\".
I often use them in compound keys (e.g., in a flat key space as might be used by a cache or similar simple key/value store). IMHO, they are superior to other common separators like colons, dashes, etc. because they are (1) semantically appropriate and (2) less likely to be present in the constituent pieces of data, especially if the data in question is already limited to a subset of characters that do not include the separators, which it often is (e.g., a URL).
“Less likely” doesn’t help if you may get arbitrary (user) input. If you can use a byte sequence as the key, a better strategy is to UTF-8-encode the pieces and use 0xFF as the separator byte, which can never occur in UTF-8.
Dedicated separator characters don't solve the problem--you'd still need to escape them. Or validate that the data (which may come from untrusted web forms etc.) does not contain them, which means you have another error condition to handle.
There's an ASCII character for escaping, too, if you need it.
The advantage of ASV is not that you can't have invalid or insecure data, it's that valid data will almost never contain ASCII control characters in the record fields themselves. Commas, quotation marks, and backslashes, meanwhile, are everywhere.
Or specify that the data can't contain these characters. If it does, you have to use a different format. This keeps everything super simple. And how often are ASCII US and RS characters used in data? I don't think I have ever seen one in the wild, apart from in a .asv file.
I'm no expert on character encodings or Unicode itself, but would this be as simple as checking for the byte 0x1F in the data? Assuming the file is ASCII or UTF-8 encoded (or confirming that as far as possible), it seems like that check would suffice to validate the absence of the code point in the data, but I imagine it's not quite so simple.
For text data, it would work fine, but you'd have to do some finagling with binary data; $1F is a perfectly valid byte to have in, say, a 4-byte integer.
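For text, the byte scan really is that simple: UTF-8 never uses byte values below 0x80 inside multi-byte sequences, so a 0x1C-0x1F byte can only be the control character itself. A rough sketch (my own, assuming ASCII/UTF-8 input):

SEPARATORS = {0x1C, 0x1D, 0x1E, 0x1F}  # FS, GS, RS, US

def safe_for_asv(raw: bytes) -> bool:
    # Only meaningful for ASCII/UTF-8 text; binary payloads belong in a binary format.
    return not any(b in SEPARATORS for b in raw)

print(safe_for_asv("plain text".encode("utf-8")))        # True
print(safe_for_asv("has a \x1f in it".encode("utf-8")))  # False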
My going assumption is that arbitrary binary data should be in a binary format.
Feel free to correct me, but I figure that as long as data can be from 0x00 to 0xFF per byte, no format that uses characters in that range will ever be safe. I’m not a big C developer but I figure the null terminated strings have the same limitation.
But if it's something entered by keyboard, you should be OK to use control codes.
Personally, I find tab and return to be fine for text-driven stuff. They show up in an editor just as intended.
The “problem” I’m referring to is that we chose a widely used character as a field separator. Of course you still have to write a parser, etc, it’s just a lot easier if you choose a dedicated character.
Because they're zero-width. If you can't see them when you print your data, it's a machine-only separator, which makes it a bad separator for data that humans need to look at and work with.
(Because CSV is a terrible data exchange format in terms of information per byte. But that makes sense, because it's an intentionally human readable data exchange format, not a machine format)
The point of text-based formats is that you can trivially edit them in a text editor by hand; if typing the character is nontrivial, then it entirely defeats the point (that's also why USV adds very little value, IMHO).
You can actually type a bunch of ASCII control characters very easily on a keyboard. Look at an ASCII table with 32 characters per column (I like this one[1]). The key combo for a control character is Ctrl + the letter on the same row as the control character. So:
BELL Ctrl-G
RECORD SEPARATOR Ctrl-^
UNIT SEPARATOR Ctrl-_
ESCAPE Ctrl-[
You can think of the Ctrl key as clearing the two most-significant bits of the letter's ASCII code. Not all key combos are supported in all environments. Notepad++ doesn't support Ctrl-] (GROUP SEPARATOR) at all, but does support e.g. SHIFT OUT as Ctrl-Shift-N, for instance. The Windows CMD.EXE command line supports many combinations (but not UNIT SEPARATOR, unfortunately), displaying them as e.g. ^[ or ^G in the console.
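The bit trick is easy to verify, e.g. with a few lines of Python (illustrative only):

# Ctrl+<key> classically sends the key's ASCII code with the top two bits cleared.
for key, name in [("G", "BEL"), ("[", "ESC"), ("^", "RS"), ("_", "US")]:
    code = ord(key) & 0x1F
    print(f"Ctrl-{key} -> 0x{code:02X} ({name})")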
If I need a table or Google to figure out how to type something, that's not "very easily"
If you need to train your employees on that, it's not "very easily".
"Very easily" is when I can take any family member who's seen a computer in their life, give them a keyboard and they can figure it out on their own without Google in 2 seconds (like csv).
What I mean is that a simple key combination is easier to use than an Alt code or having to copy and paste from another document. The ASCII table stuff is just fun trivia. "Press Ctrl-_ to insert a column separator" isn't any harder than "Press Ctrl-S to save" or "Press Ctrl-T to open a new browser tab". It's definitely easier than letting your hypothetical family member reinvent character escapes on their own the moment they encounter an address that has an extra comma in it. :-)
What’s the key to enter the euro symbol? That means you can’t use it in a text editor?
There is no perfect solution, but I’d rather open a text file in a decent editor than having to deal with the escaping hell that is CSV.
They could at least have chosen the pipe character “|”: the comma is the thousands separator in many languages (number formatting is kind of important for tabular data, if you ask me) and also shows up in, you know, general prose.
And it was there even before we got euro coins in our hands (I know this because I'm still using my first (mechanical) keyboard that I got with my first own PC in 2001: and there is a “€” symbol on it)
Well some of us do. There's this interesting effect where many people perceive the limitations on their current tools to be equivalent to limitations on their abstract abilities. If they don't know how to do it, it's impossible.
I think that's exactly the point that the parent poster is trying to make by example? Just because we don't have good tooling today for using ASCII delimiter characters, doesn't mean it's impossible -- just like typing the euro symbol on an american keyboard
It doesn't mean it's impossible, but it's definitely cumbersome. Anyone who isn't a native English speaker and has had to type in their native language on an American keyboard can tell you that.
Oh yes certainly. And I think that when you're deep into creation it can be really really hard to remember that experience, and so recently I'm trying to find ways to help pull back the curtain for folks.
Their examples, if anything, convinced me not to use this for a long time.
I need to zoom to be able to tell these apart, so I'll need editor support for it to be convenient to work with these anyway. And then clicking through to the comparisons, it demonstrates the difference existing support for CSV "everywhere" makes - Github renders the CSV examples nicely as tables, while again I need to zoom in to see which separator is which for USV.
Maybe once there is widespread editor support. But if you need editor support for it to be comfortable anyway, then the main benefit vs. using the old-school actual separator characters goes out the window.
I think you're articulating something about this proposal that bothers me.
The thing about the actual separators is that an editor could and should probably display them as they were intended, as data separators. It should be a setting in an editor you control, sort of like how you control tab width and things like that.
Just because a glyph is "invisible" doesn't mean it has to actually be invisible.
The symbols for the separators are hard to read, like you're pointing out, which means someone would eventually replace them with some other graphical display, in which case you were just as well off with the actual separators themselves.
They would have been better off advocating for editor support for actual separator display.
The thing is, while I'll probably just stick with CSV too, I'm sympathetic to the intent, but given I expect it'll need tooling anyway I'm less sympathetic to them not picking the existing separator.
I also think there are failed lessons here that reduces the incentive for switching.
E.g. If you're going to improve on CSV, a key improvement would be to aim to make the format trivially splittable, because the lesson from CSV is that when a format looks this trivial people will assume they can just split on a fixed string or trivial regex, and so the more you can reduce the harm of that the better.
As such, I'd avoid most of the escaping they show, especially for line endings, and just make RS '\n' the record separator, or possibly RS '\n'*. Optionally do the same for US. Require escaping LF immediately after RS/US, and only allow escaping RS, so unescaping can be done with a trivial fixed replace per field if you have a reason to assume your data might have leading linefeeds in fields - a lot of apps will get away with just ignoring that.
Then parsing is reduced to something like `data.split(RS).map{|row| row.split(US).map{|col| col.gsub(ESCAPE,"\n") } }` (assuming RS, US, and ESCAPE are regexps that include the optional trailing linefeeds and escapes leading linefeeds respectively). Being able to copy a correct one-liner from Stackoverflow ought to avoid most of the problems with broken CSV/TSV parsing.
I'm also not convinced adding GS, FS, ETB is a good idea, partly for that reason, partly because a lot of the tools people will want to load data into will not handle more than one set of records, and so you'll end up splitting files anyway, in which case I'd just use a proper archive format... Those characters feel like they're trying to do too much given they're "competing" primarily with CSV/TSV.
Their spec also needs to talk about encoding, because unless I've missed something, they only talk about code points, and that's likely to get people splitting on, e.g., the UTF-8 byte sequence. This to me is another reason for using the ASCII values: they encode the same way in ASCII-based character sets and UTF-8, and so it feels likely to be more robust against the horrors of people doing naive split-based parsing.
CSV isn't even restricted to the comma as the separator. You can use any character you like (pipe | is a common one) and csvkit will happily still work with a simple CLI flag. Pretty much all Unix tools have a similar flag. I've always been able to find an ASCII character that my data doesn't use, though maybe there are exceptions I haven't hit.
I love csvkit, particularly csvstat. I just wish it were quicker on larger files. The types I deal with routinely take 5-20 minutes to run and those are usually the ones I want the csvstat output for the most.
It's all down to font differences. You would use the file with a font that uses larger letters diagonally, for control pictures, instead of tiny letters horizontally.
And the main benefit isn't anything to do with the editor, I have no idea what you meant by that. The main benefit is that commas show up a lot more often in normal text than control pictures do.
There's no space for larger letters diagonally unless I waste screen real estate by increasing the font size, which I categorically will not do. So I'd need to replace a font I'm happy with and find one whose other symbols are readable enough. In which case it's just as easy, and less invasive for me, to adjust my editor to display them using different glyphs. In which case I can just as well do that with the actual ASCII control characters.
The point is that their stated "advantage" does not exist for me. I still need to make changes to my setup to handle them. In which case why should I pick this option? (as you can see elsewhere, especially as this isn't the only issue I have with their format choices).
> And the main benefit isn't anything to do with the editor, I have no idea what you meant by that.
The main benefit relative to using the actual control characters is only the tool support, and that does not work for me without changes to how the symbols are displayed anyway. Hence that "advantage" does not actually buy me anything.
> Unicode separated values (USV) is a data format that uses Unicode symbol characters between data parts. USV competes with comma separated values (CSV), tab separated values (TSV), ASCII separated values (ASV), and similar systems. USV offers more capabilities and standards-track syntax.
> Separators:
>
> ␟ U+241F Symbol for Unit Separator (US)
>
> ␞ U+241E Symbol for Record Separator (RS)
>
> ␝ U+241D Symbol for Group Separator (GS)
>
> ␜ U+241C Symbol for File Separator (FS)
>
> Modifiers:
>
> ␛ U+241B Symbol for Escape (ESC)
>
> ␗ U+2417 Symbol for End of Transmission Block (ETB)
Wait a second … he’s not proposing using unit/record/group/file separators as separators, he’s proposing using the symbols for those separators as separators! Why not just use the separators themselves‽
Yes, rather than using U+001F (the ASCII and Unicode unit separator), he proposes using U+241F (the Unicode symbol for the unit separator). I almost feel like this must be an early April Fools' joke?
Also, he writes ‘comprised of’ rather than ‘composed of’ or ‘comprises’ throughout his RFC.
They cover the reasoning for using the control picture characters instead of the control characters in the FAQ:
"We tried using the control characters, and also tried configuring various editors to show the control characters by rendering the control picture characters.
First, we encountered many difficulties with editor configurations, attempting to make each editor treat the invisible zero-width characters by rendering with the visible letter-width characters.
Second, we encountered problems with copy/paste functionality, where it often didn't work because the editor implementations and terminal implementations copied visible letter-width characters, not the underlying invisible zero-width characters.
Third, users were unable to distinguish between the rendered control picture characters (e.g. the editor saw ASCII 31 and rendered Unicode Unit Separator) versus the control picture characters being in the data content (e.g. someone actually typed Unicode Unit Separator into the data content)."
I can't read those characters at the size I can/prefer to read the text at, so I need the tooling to support and render these differently anyway... This feels like solving the wrong problem in a way that will still end up with the same amount of work.
I don't see any real advantage over using ASCII unit and record separators (.asv).
Also I am not convinced about the need for an escape character. If you really need to use ASCII unit or record separators as data - tough use a different format.
If only editors would display the ASCII unit separator (Notepad++ does) and treat the ASCII record separator as a carriage return (Notepad++ doesn't), then the .asv format would be a huge improvement on CSV.
The ASCII separators are visible in my editor. If something doesn’t support ASCII text, that sounds like a bug which should be fixed, not a reason to misuse graphical characters for something other than their purpose.
Their ABNF uses RS, defined as U+241E, not U+241E + '\n', as the record separator. They seem to add a "USV escape" in front of the linefeeds.
My bet is that this will lead to implementations that wrongly treat "␞␛\n" (RS ESC \n) as the real record separator, the same way lots of "CSV" implementations just split on comma and LF.
Seems to me if you're going to add support for something like that you should just bite the bullet and declare an LF immediately following an RS as part of the record separator, or you're falling in the same trap as CSV of being "close enough" to naively splittable that people will do it because it works often enough.
I'm aware. I don't think that serves a useful purpose - I think the way they've done it is likely to make people more likely to get the parsing wrong for pretty much zero benefit. My guess is you'll end up seeing a lot of "pseudo-USV" parsers the same way we have a ton of "pseudo-CSV" parsers that breaks on escapes or quoted strings with commas, and so I think they fundamentally failed to learn the lessons of CSV.
If you're doing spreadsheets, then it should show in a spreadsheet and not in an editor. It's like complaining that he can't edit jpegs in Sublime or something... there's a reason that's working poorly.
Speaking of which, last time I had a control code heavy file open in Sublime, it actually did show the control codes as special characters, and it was possible to copy/paste those. This proposal is so bad I suspect it will become a standard.
There are a lot of cases where I would rather inspect/quickfix a csv file in a text editor rather than open it as a spreadsheet. Especially cases where something is wrong in the format, and it will just not open as a spreadsheet at all. Adding unnecessary levels of obfuscation to your data should never be considered a good idea imo.
Using Unicode graphic characters as metasyntactic escape characters is fundamentally wrong. Those Unicode characters are for displaying the symbols for Unit Separator, Record Separator, etc. and not for actually being separators! ASCII already has those! Included in Unicode!
To be fair, I don’t quite get those graphic characters, because the original characters should already be displayed that way, shouldn’t they? Now when I see such a character, I have no idea if it’s the real character or just it’s graphic-character counterpart.
I mean, my assumption (yeah, I know) is that the 'display' variant is more for documentation talking _about_ the control character and not meant to _be_ the control character. Abusing the 'display' variant this way seems... misguided.
I am very confused. The author provides "assistive accessibility software to people with vision/cognition/mobility impairments", but these character symbols are indistinguishable for folks with impaired vision.
Good connection to BoldContacts.org :-) Screen readers handle these well. If someone wants to create a font with big bold separators, that would be awesome.
I checked Windows Narrator, and two other utilities and they didn't utter anything. I think requiring users to install a custom font would further hold back the adoption of your format. I hope you succeed; just my 2c.
Back when rolling your own application level protocol on top of TCP was common (as opposed to using http, zeromq, etc) I frequently used file/record/group/unit separators for delimiters, and considered them an underrated gem, especially for plain-text data where they were prohibited to occur in the message body so you didn't have to escape them (still good to scan and reject messages containing them). As a modern example they (and most other ASCII control characters) are disallowed in json strings.
MLLP (Minimal Lower Layer Protocol) -- used extensively to transmit HL7 in health systems -- uses file separators to delimit messages.
0B vertical tab
<content>
1C file separator
0D carriage return
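In other words, the framing is just a one-byte prefix and a two-byte suffix around the HL7 payload; a minimal sketch (my own, assuming the message is already encoded to bytes):

VT, FS, CR = b"\x0b", b"\x1c", b"\x0d"

def mllp_frame(message: bytes) -> bytes:
    # <VT> message <FS><CR>
    return VT + message + FS + CR

def mllp_unframe(frame: bytes) -> bytes:
    # Strip the framing back off (no validation here).
    return frame[1:-2]

wire = mllp_frame(b"MSH|^~\\&|...")
print(mllp_unframe(wire) == b"MSH|^~\\&|...")  # True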
I wrote one of the most popular translators for MLLP, which converts it to HTTP [1].
---
P.S. Ironically, HL7 messages have something literally called a "field separator", but they don't use the field separator character; usually they use the vertical bar.
The way I read the json standard, the only way to include control characters is to encode them as hex. For example BEL can be encoded as "\u0007", but escaping it by using a backslash followed by a literal BEL character is not allowed. So literal control characters should never be in json text.
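That matches what, e.g., Python's json module does (shown here just as an illustration):

import json

# On output, control characters become \u00XX escapes...
print(json.dumps({"note": "bell:\x07"}))   # {"note": "bell:\u0007"}

# ...and on input, a literal control character inside a string is rejected by default.
try:
    json.loads('{"note": "bell:\x07"}')
except json.JSONDecodeError as e:
    print("rejected:", e)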
Funny thing: Excel, which is the most common spreadsheet editor, does not practically support CSV files if you happen to live in a country where the official convention is to use a comma for the decimal point in numbers. Unless you manually adjust the import settings or change your defaults, that is. It has reached meme levels at my work.
Tab-separated files are much better, IMO, at not getting the delimiter confused with the data, at least for a sufficiently sane TSV file.
Unfortunately CSVs vary a lot in the wild. Some people use commas as a delimiter, some use semi-colons. Escaping rules vary. And the text encoding is not specified.
I randomly generated some CSVs and fed them into Excel and Numbers and they were differently interpreted.
This is why I tend to use the Pg COPY version of TSV - works beautifully with 'cut' and friends, loads trivially into most databases, and the 'vary a lot' problem is (ish) avoided by specifying COPY escaping which is clearly documented and something people often already recognise.
Generally my only interaction with CSV itself is to fling it through https://p3rl.org/Text::CSV since that seems to be able to get a pretty decent parse of every sort of CSV I've yet had to deal with in the wild.
> You cannot edit it in regular editor, like csv/tsv/jsonlines.
If only there were shortcuts on modern operating systems to allow us to do things that aren't readily on our keyboards. Like upper case characters. Or copy and paste. Or close windows. Our lives would be so much better.
If ASV had caught on, there could be common shared shortcuts to type them, and fonts would regularly display them (just like the unicode characters proposed). But CSV was simple enough and readily type-able.
> There is no schema or efficient storage, like binary formats.
I'm not quite certain where you're trying to go with this. Binary formats aren't really meant to be human-readable in an average text editor. It doesn't know whether to treat 1, 2, 4, or 8 bytes as an integer or a float. Even current hex editors, which make it easier to navigate these formats, don't really know unless you are able to tell them somehow.
> There is no wide library support.
It's a critical mass problem. Not enough people are using them, so no libraries are being made.
> Not all data is representable.
I'm not quite certain what data couldn't be represented. If you can represent your data in CSV, you can represent it in ASV. It's all plain text that gets interpreted based on what you need. They're nearly a 1:1 replacement: commas get replaced by unit separators, newlines get replaced by record separators. Then you have group and file separators to work with for further levels of abstraction if you need them.
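To make that concrete, a minimal sketch of the 1:1 mapping (fields joined with US 0x1F, records with RS 0x1E; no escaping, so this assumes the separators never occur in the source data):

US, RS = "\x1f", "\x1e"  # unit separator, record separator

def to_asv(rows):
    # rows is a list of lists of string fields.
    return RS.join(US.join(fields) for fields in rows)

def from_asv(text):
    return [record.split(US) for record in text.split(RS)]

data = [["name", "note"], ["Ada", "likes, commas"]]
print(from_asv(to_asv(data)) == data)  # True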
> I'm not quite certain what data couldn't be represented.
What do you do if you receive data already containing a unit separator, or a group separator, and you need to put it into a field? The whole value proposition of ASV over, say, TSV is that you should never need to escape anything, but that's only possible by rejecting some input data.
Re editors: The problem with USV is not that it's hard to type the characters, but rather that the newlines are completely optional. Which means that in the general case, most line-based tools are not going to work with USV.
Now, the readme actually has that optional newline separator thing, but the optionality of it makes it completely useless; it seems like an afterthought. For example, the first "real" USV writer I found, "csv-to-usv", does not emit them [0] and thus produces uneditable files.
And if we are going to end up with uneditable files, might as well go with something schema-full, like parquet or avro. You are going to have the same "critical mass problem", but at least the tooling is much better and you have neat features like schemas.
Literally untrue. (And were it true, it still wouldn't be a reason why one should use this over CSV—not sure what's so hard to grasp about the conversational/contextual premise here.)
CSV is honestly not that problematic. Figuring out if a field contains a comma and then properly quoting it is trivial. And fields without commas don't need quoting. Sometimes your application even guarantees no commas, especially if CSV is baked in from the beginning.
I'm guessing you haven't worked in customer support where people send you their "CSV" files. Even the field delimiter varies (many Europeans use semicolons).
No, I have. I don't consider abuse of the format a problem with the format. Though I can see how having to delimit with special characters will help the type of person who writes print(','.join(stuff)).
```USV works with many kinds of editors. Any editor that can render the USV characters will work. We use vi, emacs, Coda, Notepad++, TextMate, Sublime, VS Code, etc.```
I loaded an example in my fairly generic Emacs and it worked out of the box. The separators were pretty small so I had to increase my font size to distinguish US from RS. And of course I have no idea how to enter those characters. I'm sure there is, but cut & paste worked.
I'm fascinated that a lot of posters in this thread are not understanding the ideas and experiences that the inventors of this file format had. They invented this format because it works for machines as well as for humans. Text editors can handle the proposed Unicode characters just fine. Humans can see them. The only challenges are that it is cumbersome to type the delimiters, and that the format is not used in any relevant software (like Excel). Both are reason enough that USV will not be used anywhere. But I can see why they went this way with their file format.
We might be able to see them, but for me they're just a blur unless I zoom in significantly, so I'll need editor accommodations just as much for these characters as if they used the already existing RS/FS/US/GS characters.
It feels like instead of fixing it properly, they went with an option that will still need tool improvements, will be controversial, and adds unnecessary details (e.g. the SYN they've added will be an active nuisance and I'd be willing to bet will get ignored by enough tools to become a hazard to data integrity).
I quite like an initiative to make use of proper record and unit separators, but this feels poorly thought through in several respects (e.g. their quirky escape character, which acts differently depending on the class of the following character, will be a 'fun' source of bugs; and since splitting records on LF requires three characters, a number of tools will almost certainly treat those three characters as a unit, etc.). These assumptions are based on how slapdash a lot of CSV parsing and generation is; if you want to compete with CSV you ought to learn those lessons.
CSV works for machines as well as humans, why do you assume or imply otherwise? Making the separator hard to type makes this ‘invention’ hard for humans to use. Using the glyphs instead of the semantic Unicode separators might also make this harder to use, even if you can understand why they did it, and to some degree it subverts the intent of the Unicode standard’s separator and glyph characters.
We don't need a new format which works for machines as well as for humans, because there are tons of existing ones. You have CSV or TSV for wide support; JSON Lines if you want very easy editability and structure; and if those don't work for some reason, pretty much any other delimiter/escape scheme would work better (for example: newline for records, "^^" for fields, "^"-style character escaping; or JS-style "\"-escaping with the field separator being "\N").
The USV GitHub repository says it is "the standard for data markup of ...", has 66 stars, and is currently applying for the "text/usv" MIME type. That's about it.
Maybe I'll consider it when it does not belong to a company, has two more zeros in the number of stars, and has RFC/ISO attached to it. Because right now it is not much more of a "standard" than a hobby project I create on a whim.
About the most annoying thing about the modern Internet is this kind of chip-on-the-shoulder comment about "oh he has such a big ego" and nonsense like that.
Man, I preferred it when people could just write up and propose things. The insufferable "is that professional?", "What about consensus?", "Wow the ego to propose something".
Yeah I agree, the project seems like a decent enough idea to discuss, and by the amount of engagement it's getting here there is some data to back up that assertion.
Additionally, I'm sure that those ("they have such a big ego") types of comments and thoughts existed in the early internet as well, since it's a fairly human reaction whenever anyone tries to build or propose something that disrupts the status quo.
USV would have the disadvantage of using multi-byte characters as delimiters, so you have to decode the file in order to separate records. And you still can’t type the characters directly or be guaranteed to display them without font support. This honestly seems like cleverness for cleverness’s sake.
The way I would have gone would be to define the standard to support both, such that the two sets of codes MUST be considered semantically equivalent, but that generation tools SHOULD prefer to generate the control codes for new files.
This way people can initially use the visible glyphs while editors don't support the format, and this will always be supported. But, as editors add support and start to generate the files via tools or manually in tabular interfaces where the codes themselves disappear, usage will automatically transition over to the control codes.
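A reader written under that rule might simply normalize the picture characters down to the control codes before splitting; a rough sketch (mine, not from the proposal, and ignoring USV's escape character):

# Map the Unicode "control picture" glyphs onto the real C0 separators.
PICTURE_TO_CONTROL = str.maketrans({
    "\u241c": "\x1c",  # SYMBOL FOR FILE SEPARATOR   -> FS
    "\u241d": "\x1d",  # SYMBOL FOR GROUP SEPARATOR  -> GS
    "\u241e": "\x1e",  # SYMBOL FOR RECORD SEPARATOR -> RS
    "\u241f": "\x1f",  # SYMBOL FOR UNIT SEPARATOR   -> US
})

def parse(text):
    canonical = text.translate(PICTURE_TO_CONTROL)
    return [record.split("\x1f") for record in canonical.split("\x1e")]

print(parse("a\u241fb\u241ec\u241fd"))  # [['a', 'b'], ['c', 'd']]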
Ah fair enough. Of course you could configure your shell/editor/whatever to make control characters visible. Seems like if you were going to edit USV or ASV by hand you'd probably want a customized editor anyway.
This is so weird, since the purpose of the former characters is displaying the latter characters. If they are actually used for display, then you can’t tell which is which.
> The Synchronous Idle (SYN) symbol is a heartbeat, and is especially useful for streaming data, such as to keep a connection alive.
>
> SYN tells the data reader that data streaming is still in progress.
>
> SYN has no effect on the output content.
>
> Example of a unit that contains a Synchronous Idle:
>
> a␖b␞
Why would this go in-band inside a document format? Just why? If you want keep-alives, use a kind of connection that supports out-of-band keepalives.
If you download the same document twice, and the second time the server is heavily loaded (or it's waiting on some dependency, or whatever), presumably the server will helpfully generate some SYNs in the middle of the document to keep the connection alive (?), but now you've got the same document "spelled" two different ways, that won't checksum alike.
SYN along with the weirdness of
> Escape + [non-USV-special] character: the character is ignored
means that you have arbitrarily many ways of writing semantically-same documents.
I've long wanted a successor to CSV, but this is kinda stupid. People like CSVs because they look good and feel natural even in plain text. This is the same reason that Markdown is successful.
As for including commas in your data, it could just have been managed with a simple escape character like a \, for when there's actually a comma in your data. That's it.
I don't see this as a perfect solution, but CSV is not great either.
A comma is super common in both text and numbers. Here in Europe we often use commas as decimal separator and use a semicolon as value separator.
As a result spreadsheets almost always fail to automatically parse a CSV.
I do like the idea of having a dedicated separator character, that would work right worldwide. And then just standardize the use of a dot as decimal separator in these files.
>As for including commas in your data, it could just have been managed with a simple escape character like a \, for when there's actually a comma in your data. That's it.
Not quite. What if there is a \ in your data? Then you have to escape that.
Seems complex enough that you'd only manipulate files in this format by serializing through a tool, and by then it's competing with established binary formats rather than CSV.
I've actually been employing Emoji Separated Values (ESV), often , here and there when doing some of this kind of work. Granted, it's not standard, but it's been really useful when I've needed it.
*edit Apparently emojis don't fly here, but it was an index finger pointing right.
* Editors will play nicely with the graphical representation. If you need better graphics, it's done with font customization, which everyone already supports.
* It announces that the data is source text, vs transmitted bytes. The type/token distinction is not easy to overcome.
* It sits way out in Unicode's space where a collision is unlikely. The whole reason why CSV-type formats create frustration is because the tooling is ad-hoc, never does the right thing and uses the lower byte spaces where people stuff all kinds of random junk. This is the "fuck it, you get the same treatment as a Youtube video id" kind of solution.
That said, if used, someone will attack it by printing those characters as input.
I certainly hope that anyone proposing a Unicode CSV variant as a joke would pick some raised hand emoji as the separator and the victory gesture (0xe011, also popular as an approximation of how an air quote emoji would look like) as the quote character.
But we already keep stumbling over missing support for the on-demand quote character even with separators like comma and tab, using more exotic characters as the separator will only make it worse. The value of less escaping is negative.
Apparently it is not. They have submitted it to the IETF. I will have to watch closely to see if LibreOffice Calc/Excel and languages/libraries adopt support. Seems like it does solve some common problems with CSV.
And now and then you encounter a web form in the .fi domain that rejects "," and expects ".", but does not tell you that that is the reason for rejecting your input. The web "designers" that deploy such crap in .fi should be sent to Siberia.
Can you not customize the separators used when importing csv-likes into excel? Libreoffice has a neat little window for it that even shows a preview of what values go into which cells.
If you would like to run csv-to-usv from 15+ languages (not only rust!) then check out this demo I made, converting the library to an Extism plugin function: https://github.com/extism/extism-csv-to-usv
Here's a snippet that runs it in your browser:
// Simple example to run this in your browser! But will work in Go, PHP, Ruby, Java, Python, etc...
const extism = await import("https://esm.sh/@extism/extism");
const plugin = await extism.createPlugin("https://cdn.modsurfer.dylibso.com/api/v1/module/a28e7322a6fde92cc27344584b5e86c211dbd5a345fe6ec95f1389733c325541.wasm",
{ useWasi: false }
);
let out = await plugin.call("csv_to_usv", "a,b,c");
console.log(out.text());
I'm sorry but.. why? The library is a single function consisting of 10 lines of Rust code. And would be about 10 LOCs to re-implement in any language that has native csv libs. It seems a little bit unnecessary to load a WASM runtime for that.
If I understand the API correctly from my brief glance, the crate returns a triply-nested vector with the outermost vector being the equivalent of CSV rows, then CSV columns, then "units" which don't have a direct CSV equivalent. It would be helpful if there was an API method that returned results without this final level of nesting, perhaps panicking if there is more than one unit. This would make it easier to deal with the common case (in CSV at least) where each column only has a single value.
I think the units are the csv fields, records are rows, groups would be multiple CSV files (or multiple sheets in an excel file) and file separator... a zip with multiple CSV files? (or multiple excel files).
I'm interested in this too. IMO there's actually a huge benefit to being able to concisely represent 3D / 4D data (i.e. xarrays, slices, datasets) in an easily digestible text format. Mainly thinking about this approach over e.g. the netCDF format or deeply nested JSON.
CSV is like an invasive plant species, or perhaps a curse; you're never going to be able to root it out even though there are a billion better data formats.
CSV can be manually read/edited by non-technical/non-developer humans using commonly available tools like Excel and Notepad. Not many of the better data formats meet that criterion.
Notepad, I agree. Excel... not so much: it tends to change data silently unless you are very cautious with your environment (e.g. dates transformed to number of days since 1900, and some strings to dates)
That helps, no doubt. But last week one of my coworkers touched a CSV with Excel, and all dates went from ISO 8601 to MDY. We are based in Europe (i.e. we use DMY at minimum). In my experience, a CSV touched by Excel cannot be trusted for further analysis.
True, but there's so much scope for people to write naive implementations with join() or split() functions, and then you end up with nothing escaped properly and a big mess.
I have seen Unicode Separated Values. I don't like Unicode and I even more don't like USV. I like ASCII Separated Values, which can encode each separator as a single byte, and can be used with character encodings other than Unicode (and, even if you do use it with Unicode, does not prevent you from using the Unicode control pictures in your data; USV does prevent you from using those characters in your data even though the data is (allegedly) Unicode).
What they say about display and input really depends on the specific editors and viewers that you are using (and perhaps on the fonts as well). When I use vi, I have no difficulty entering ASCII control characters in the text. However, there is also the problem with line breaking, with ASV and with USV, anyways; and they do mention this in the issues anyways.
Fortunately, I can write a program to convert these formats without too much difficulty, even without implementing Unicode (since it is a fixed sequence of bytes that will need to be replaced; however, it does mean that it will need to read multiple bytes to figure out whether or not it is a record separator, which is not as simple as ASV).
I've been using an emoji separated values format for a personal project where fields contain lots of special characters including whitespace.
I'd previously given up using ASV because of the printability and copy/paste problems described. Replacing the control characters by their printable glyphs solves all my previous problems and is as genius as it is naughty.
I sympathize with the arguments people here present against and agree the SYN character and Group Separator are weird -- but cause no harm. I'm not bothered by the same data having multiple representations since I'm insisting on human readability rather than byte-by-byte perfection in the first place.
It took 20 minutes to convert my project and I'm very happy.
Only tooling change I had to make was adding digraphs to vim
digraph rs 9246 us 9247
etc. Easy to type directly in my .usv file. Easy to type and read in some Python consuming it.
Regardless of it becoming a standard and my lingering grouchiness about multi-byte characters, needing to use non-xterm, etc. this works very well for me.
Well, CSV would be much harder to import than something like USV, because the delimiters are well defined in USV and there is no need for quoting strings.
If you live in a place where comma is the decimal separator, your CSV files will often use semicolon as the separator instead of comma. Will this tool cater for that?
If I work with CSV files, they are most often not comma-separated but semicolon-separated because of the numbers. An Excel installation localized for the decimal comma would not read 'real' CSV files correctly.
If csv-to-usv cannot cater for this type of CSV files, it would not be usable in a large part of the world.
Yeah they should add it. The tool is like 20 lines of Rust code. It's a thin wrapper around the csv Rust crate, which does support specifying alternative delimiters.
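For comparison, the same knob in Python's csv module is a single argument, so accepting semicolon CSV is a one-line change:

import csv, io

semicolon_csv = "name;amount\nwidget;3,50\n"

# delimiter=";" is all it takes to read European-style CSV.
rows = list(csv.reader(io.StringIO(semicolon_csv), delimiter=";"))
print(rows)  # [['name', 'amount'], ['widget', '3,50']]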
I am uncertain, but this is likely to reintroduce the issue of Unicode buffer overflow into the mainstream. What are your proposed solutions, considering it is expected to become standardized?
USV defaults to UTF-8, which prevents the issue of Unicode buffer overflow-- presuming you mean the typical kind where Unicode flows into ASCII.
The primary USV library implementation uses Rust, which is notably good at UTF-8 and at conversions between operating system string encodings (such as ASCII) and UTF-8.
In the same way that CSV supports fields that contain nested CSV documents: cumbersomely / painfully, with lots of escaping of the delimiter characters.
This is needlessly adding yet another standard¹ to the mix. If you are in a position to choose what standard you use, just use:
• Whatever is best for the data model and/or languages you use. JSON is a common modern choice, suitable for most things.
• If you want something more tabular, closer to CSV (which is a valid choice for bulk data), use strict RFC 4180 compliant data.
• If you want to specify your own binary super-compact data, use ASN.1. I am also given to understand that Protobuf is a popular modern choice.
If you aren’t in a position to choose your standards, just do whatever you need to do to parse whatever junk you are given, and emit as standards-compliant data as possible as output; again, RFC 4180 is a great way to standardize your own CSV output, as long as you stick to a subset which the receiving party can parse.
Conversion of the file encoding from simple ASCII to UTF-8 has consequences beyond the field/record problem.
Some tools will randomly convert " to 'LEFT DOUBLE QUOTATION MARK' and 'RIGHT DOUBLE QUOTATION MARK' if they see UTF-8 flagging. Thus, the file is converted without your voluntary participation.
Unicode involves more than the set of glyphs and their encodings; it also involves properties, etc. However, it can be an attack vector even ignoring that stuff; it does not have to be Turing-complete to be an attack vector. But, the specific kind of attacks depends on the application.
Different kind of character sets and character encodings will be good for different purposes. Unicode is "equally bad" for many uses.
Yes, Unicode is too complicated and too messy, whether or not it is Turing-complete (it is complicated enough that maybe it is Turing-complete; I don't know).
Y'know, I greatly dislike this. It's an actual emotional reaction. This should not be standardized. No one should use this. This is a bad idea and deserves to die in obscurity.
I'll tell you why, it's pretty simple. The characters this... thing is stealing, exist to represent invisible control sequences. That is their use. The fact that they can be mentioned by direct input is inevitable, but not to be encouraged.
I will be greatly disappointed if this is accepted as a standard. The fact that a USV file looks like a rendered ASV file is a show stopping bug, an anti-feature, an insult to life itself. Kill it with fire.