Show HN: Comma Separated Values (CSV) to Unicode Separated Values (USV) (crates.io)
208 points by jph 8 months ago | 287 comments



Fascinated this uses the Unicode glyphs / symbols for unit and record separator rather than the unit and record separators themselves (ASCII US and RS).

Perfect deployment of David Wheeler's aphorism:

> All problems in computer science can be solved by adding another level of indirection.

https://en.wikipedia.org/wiki/David_Wheeler_(computer_scient...



The answer makes sense to me, but I wish we could fix editors to properly handle the ASCII separators (1C, 1D, 1E, 1F) instead of resorting to Unicode control picture characters (241C, 241D, 241E, 241F).

Maybe if editors are fixed up we could adopt ASCII Separated Values (ASV) as the new standard.


Emacs has handled literal ASCII control characters correctly I believe since around the time I was born - probably somewhat earlier, if we count back further than GNU.

Unicode works fine there too, so it makes no nevermind to me which flavor people use. I just think it's funny how "everything old is new again".


Yes you're right. That's a long term goal.


Why not combine a zero-width character with a visible character, i.e. use two characters for separators?

,<FS> for fields, \n<RS> for records

This removes ambiguity in parsing and remains user readable. It's also relatively easy to auto-fix files edited by users in normal editors.

It also mostly removes need for escaping.

It's also smaller than or the same size as Unicode multi-byte characters (haven't checked).
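For concreteness, here is one reading of this proposal as a minimal Python sketch (the exact two-character pairs, comma+FS and newline+RS, are my assumption of what was meant):

  FIELD_SEP = ",\x1c"     # comma followed by ASCII FS -- assumed pairing
  RECORD_SEP = "\n\x1e"   # newline followed by ASCII RS -- assumed pairing

  def parse(data: str) -> list[list[str]]:
      # A lone comma or newline inside a field is no longer a separator;
      # only the two-character sequence is.
      return [record.split(FIELD_SEP) for record in data.split(RECORD_SEP)]

  sample = "a,b" + FIELD_SEP + "line1\nline2" + RECORD_SEP + "c" + FIELD_SEP + "d"
  assert parse(sample) == [["a,b", "line1\nline2"], ["c", "d"]]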


Wouldn't this easily break your file in subtle ways when someone tries to edit it in their editor and the zero width character is not visible?

How could you tell it apart from a standard CSV file if it looks like a standard CSV file?

They explain why they don't use control characters. Editors are not consistent in how they show control/zero-width characters:

https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...


I tried to create a "printable binary" format for better visual inspection of raw binary data (and a to/from format)

https://github.com/pmarreck/elixir-snippets/blob/master/prin...


Indeed, if the result is to be encoded with UTF-8, using 1-byte separators vs the multi-byte encoding of (241F) would make sense to me.

I'd also prefer if escapes were done in the "traditional" manner of, for example, "\t" for a tab because you can then read in stuff with something like input.split("\t").map(unescape); you know any actual tab character in the input is a field separator, and then you can go through the fields to put back the escaped ones.


> you can then read in stuff with something like input.split("\t").map(unescape)

What about input lines like 'asdf\\thjkl\tzxcvb'? That should be two fields, one the string ‘asdf\thjkl’ and the other the string ‘zxcvb.’

I think that your way is a bit like trying to match context-free grammars with a regular expression. The right way is to parse the input character by character.


> you know any actual tab character in the input is a field separator, and then you can go through the fields to put back the escaped ones

The "\t" in "split" is not a "slash-tee" but an actual tab character and then escape sequences in fields are handled by the "unescape" function.


I think the suggestion is that the field separator is an actual tab character (ascii code 9) but tabs inside the field are `\t`. So, splitting on the tab character always works because fields cannot contain ascii code 9 but must use the two character escape instead.
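To make that concrete, here is a minimal Python sketch of the scheme (the exact unescape rule is an assumption, since none was specified):

  import re

  def unescape(field: str) -> str:
      # "\t" becomes a real tab, "\\" a real backslash; done in one pass,
      # so "\\t" stays a literal backslash followed by "t".
      return re.sub(r"\\(.)", lambda m: "\t" if m.group(1) == "t" else m.group(1), field)

  def parse_line(line: str) -> list[str]:
      # Real tabs are always field boundaries, because in-field tabs must be escaped.
      return [unescape(f) for f in line.split("\t")]

  assert parse_line("asdf\\thjkl\tzxcvb") == ["asdf\thjkl", "zxcvb"]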


Although matching up nested pairs of brackets requires something at least as powerful as a pushdown automaton (CFG matcher), discriminating between an arbitrary number of escaped backslashes followed by an unescaped 't' versus an arbitrary number of escaped backslashes followed by the '\t' escape sequence doesn't require anything more powerful than a finite state machine.


Indeed... I didn't read the standard in detail to check whether escaping is allowed/taken into account, but what if my data contains those symbols? I mean, they are perfectly legal Unicode printable characters, unlike the ASCII ones.


I once attempted to write a blog post about escaping stuff in RSS feeds; while technically correct, nothing could parse the RSS feed for the blog.


There's an escape.


I thought the point is you don't need escapes?

If you still need to implement escape mechanism, might as well do CSV/TSV.


The point is ASCII DSV, which gives innately better hierarchy than CSV, but with visible tokens and stream accommodation. You should read the github readme. It's not that long.

https://github.com/SixArm/usv/tree/main/doc/faq#why-choose-u...

As for still needing escapes, using obscure symbols instead of ones that are extremely common in writing inherently means needing far far faaaaaaar fewer of them.


What's the point of visible tokens if it's all squished in one line? You are not going to be editing this in a regular editor once you have a non-trivial amount of data.

And yes, I read the README and source code, so I know that newlines are optional, existing tools don't generate them, and multi-line examples are basically fake.


> What's the point of visible tokens if it's all squished in one line?

It doesn't have to be all squished in one line, it just doesn't hurt anything. Visually splitting squished lines for presentation or perusal is trivial because of the record separator.

> You are not going to be editing this in regular editor

I know (or at least I think) that you meant this in relation to squished lines getting very long, but maybe we can talk about it in a broader context, since record splitting is trivial...

One could easily say these same words about documents written in right-to-left languages. But people in Israel manage to create files too somehow, so that's clearly not an insurmountable barrier.


Editors generally support composing right-to-left languages that way? So I suppose the metaphor suggests that all editors should directly support the visible glyphs semantically?

And yet, that's explicitly not the semantic purpose of those glyphs. The actual delimiters already exist at a lower code point. If we're asking editors to semantically support delimiters we should be asking them to support the semantic delimiters.


Good point. I'm adding automatic record separator newlines to the crate now.


You shouldn't need escapes for separator characters precisely because they are not designed for data. Their entire purpose is to separate hierarchical data.

If it turns out that escaping is needed, it will still be far rarer than escaping commas and newlines.


This makes me sad; such a missed opportunity.


(For text processing, I use octal \034 all the time.)

Perhaps there is a software developer version of "Needs more cowbell" called "Needs more complexity"

Computer languages generally use the Latin alphabet. And even in a case like APL, which some HN commenters call "hieroglyphics", the number of symbols is limited and each is precisely defined (cf. potentially up to 1.1 million Unicode symbols and "emojis" that are open to interpretation).


Well, yeah, not every language uses the Latin alphabet.


Perfect deployment of HL33tibCe7’s aphorism:

> For every interesting HN post, there’s at least one smug commenter who thinks he knows better, but actually doesn’t

https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...


The OP was probably assuming no human would want to actually read a CSV raw, and so was probably correct from their POV. Your POV is probably from someone who reads CSVs raw. You don't have to be so rude about it, you're being even more smug than the OP, probably.


One of the two likely works with CSVs for a living, and it's definitely not the person suggesting "What if it just was hard to eyeball/edit".

If you don't understand why something is the way it is, it might be better to start with a question than with a statement implying the tech misses existing tech. Chesterton's fence still applies, and ignoring it means you're outsourcing your work to others. RTFM is a perfectly valid answer at that point.


I use CSVs for a living but I rarely read them manually. I’d rather have ASCII than Unicode in my CSVs.

My point above, though, is that everyone has opinions and you don’t have to be a dickhead about “correcting” them.


ASCII has a field delimiter character. The fact that we chose comma and tabs because a field delimiter character is hard to type or see is one of those things that saddens me in computing.

Imagine the amount of pain that could have been spared if we had done it right from the start some 50 years ago.


The great thing about comma as a field separator is (1) the character is visible and (2) the character is common, so if there are escaping bugs in either the generator or the parser, they will quickly become apparent. Much better to fail fast than having a parse error at line 28357283 because a more uncommon separator character somehow still made its way into the data.


We have editors that can work with invisible characters. It’s not hard. I do that all the time in Vim with tabs and CR/LF anyway.

Unfortunately that ship has sailed. We have standards for escaping commas, escaping quotes; it's escaping all the way down.


The bad thing is that it is common so you have to escape a lot. The much worse thing is that csv implementations have varying ways of handling escaping, or sometimes don't support escaping at all, so in practice csv files can't be used interoperably.


> (1) the character is visible and

This can be fixed in the font


This all predates fonts as we know them today. TTYs were limited by the keyboard and the character generator ROM. Today no one thinks twice.


Unfortunately, ASCII delimiters were used in streaming data. The pain that created was simple - you'd try to send data containing an RS or a US and it would cause bad things to happen on the other end of the wire.

> Imagine the amount of pain that could have been spared if we had done it right from the start some 50 years ago.

I think it's Putt's Law: If you design something for idiots, someone will make a better idiot. In this case, it took less than 15 years.


I've used the ASCII delimiters in a webapp once; Javascript in the browser formatted data with them and sent it to my server via HTTP POSTs. I was a bit nervous that something in the path would break the data but happily it all just worked fine.


Currently saving the day in a data pipeline project which depends on a tool that only exports unescaped CSVs. They work very well through the pipeline; Unix split, awk, and then Snowflake all support them nicely. One annoyance is that they are hard to type, and you never quite know whether you need to refer to them using octal, hex, or something else, and what special shell escaping might be needed.


Yeah it's really interesting to me how much of what we use/do is shaped by our input devices. Macropads are a start, but I'd love a keyboard with screens on each key, that's not absurdly expensive and can be layered easily.


Something like the Optimus Maximus?

https://en.wikipedia.org/wiki/Optimus_Maximus_keyboard

(It's been almost 20 years and you still can't get one...)


We'll maybe get there in a roundabout way :

https://en.wikipedia.org/wiki/Sonder_Design


I would think anyone this serious about keyboard control would be able to use layers, which are becoming pretty common, and not have to "see" the keycaps.


I'm hoping the Flux keyboard will deliver on that.

https://fluxkeyboard.com


Interesting. Are you referring to the unit separator (1F)?

https://www.ascii-code.com/31


Yes, we have unit, record, group and file separators. And we chose never to use them.


It seems as though one could easily build a file format far more useful than CSV simply by utilizing these separators, and I'm sure it's been done countless times.

Perhaps this would make an interesting personal project. Are you aware of any hurdles, missing key features, etc. that previous attempts at creating such a format have run into (other than adoption, obviously)?


I've done ETL work with systems that used the ASCII separators. It was very pleasant work. Not having to worry about escaping things (because the ASCII separators weren't permitted to be in valid source data to begin with) was very, very nice.

I'm a Notepad++ person. When I needed to mock up data, typing the characters was easy: just Alt plus the ASCII code on the numeric pad. It took a bit to memorize the codes I needed to use. Their visual representation is just inverse text and initials.


The ASCII unit separator and record separator characters are not well supported by editors. That is why people stick to the (horrible and inconsistent) CSV format.


I started writing up a spec and library for such a format, then my ADHD drew me to other projects before I finished it. Hopefully I'll get back to it someday.

Edit: this is what I have so far: https://github.com/tmccombs/ssv


The "compact" file format (without the tab and newline) should be the SSV standard for data interchange and storage. The pretty-printed format should only be used locally to display/edit with non-compliant editors, then convert back as soon as you're done.

In time, editors and file browsers should come to render separators visually in a logical way.


What you've got so far looks promising to me. Pretty much just what I was thinking of doing, in fact, albeit with some details worked out that I hadn't yet considered.

Nice job. I hope you come back to finish the project eventually.


People don’t like invisible, hard-to-type characters. They prefer suffering quoting, escaping, escaping quotes, and all that fun stuff.


Are people actually typing up *SV files by hand? It's trivial to support editing in an IDE and exporting from data-producing applications.


Yes, sometimes, of course. It's a bit like JSON. Sometimes it's easiest to inject a small piece of hand-written data into a test or whatever.

(That said, every text editor since forever should have had a "table mode" that uses the ASCII field/record separators (or whatever you choose); I was always confused why this isn't common. Maybe vim and emacs do?)


Unfortunately everyone has moved to "Parquet" (Packet? Parket? Pacquet?) already and we've sailed even further.

I absolutely HATE this Parcquage.


https://www.databricks.com/glossary/what-is-parquet

(I don't think everyone has moved to it. I had never heard of it myself.)


A lot of the machine learning world has started using it, it's annoying as hell, solves a problem that doesn't exist, has inadequate documentation, lacks a good GUI viewer, and lacks good command line converters to JSON, XML, CSV, and everything else.


No binary format will ever kill CSV: plain-text based formats embody the UNIX philosophy of text files and text processing tool pipes to go with them, and nothing is more durable than keeping your data in text based exchange formats.

You won't remember Parquet in 15 years, but you will have CSV files in 50 years.


> You won't remember Parquet in 15 years, but you will have CSV files in 50 years.

You're probably right about CSV but probably not Parquet. Parquet is already 11 years old, there are vast data warehouses that store Parquet, it's first-class in the Spark ecosystem, and a key component of Iceberg. Crucially, formats like Parquet are "good enough" for a use case that doesn't appear to be going away. There is a high probability in my estimation that enough places are still using them in 15 years to be memorable even if it isn't as common or as visible.


CSV is actually a nice format if it weren't for literal newlines being allowed INSIDE values. That alone makes it much harder to parse correctly with simple code because you can't count on ASCII mode readline()-like functions to fetch 1 record in entirety.

Considering it also separates records with newlines, they really should have replaced newlines with "\n" and require escaping "\" with "\\".


I often use them in compound keys (e.g., in a flat key space as might be used by a cache or similar simple key/value store). IMHO, they are superior to other common separators like colons, dashes, etc. because they are (1) semantically appropriate and (2) less likely to be present in the constituent pieces of data, especially if the data in question is already limited to a subset of characters that do not include the separators, which it often is (e.g., a URL).


“Less likely” doesn’t help if you may get arbitrary (user) input. If you can use a byte sequence as the key, a better strategy is to UTF-8-encode the pieces and use 0xFF as the separator byte, which can never occur in UTF-8.
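A minimal Python sketch of that strategy (the function names are illustrative):

  def make_key(*parts: str) -> bytes:
      # 0xFF never occurs in valid UTF-8, so it can safely join UTF-8-encoded
      # pieces without any escaping.
      return b"\xff".join(p.encode("utf-8") for p in parts)

  def split_key(key: bytes) -> list[str]:
      return [p.decode("utf-8") for p in key.split(b"\xff")]

  key = make_key("user:42", "https://example.com/a,b;c")
  assert split_key(key) == ["user:42", "https://example.com/a,b;c"]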


Dedicated separator characters don't solve the problem--you'd still need to escape them. Or validate that the data (which may come from untrusted web forms etc.) does not contain them, which means you have another error condition to handle.


There's an ASCII character for escaping, too, if you need it.

The advantage of ASV is not that you can't have invalid or insecure data, it's that valid data will almost never contain ASCII control characters in the record fields themselves. Commas, quotation marks, and backslashes, meanwhile, are everywhere.


Or specify that the data can't contain this data. If it does, you have to use a different format. This keeps everything super simple. And how often are ASCII US and RS characters used in data? I don't think I have ever seen one in the wild, apart from in a .asv file.


I'm no expert on character encodings or Unicode itself, but would this be as simple as checking for the byte 1F in the data? Assuming the file is ASCII or UTF-8 encoded (or attempting to confirm this as much as possible as well), it seems like that check would suffice to validate the absence of the code point in the data, but I imagine it's not quite so simple.


For text data, it would work fine, but you'd have to do some finagling with binary data; $1F is a perfectly valid byte to have in, say, a 4-byte integer.
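A small Python illustration of both halves of that point (the helper name is made up):

  SEPARATORS = {0x1C, 0x1D, 0x1E, 0x1F}   # FS, GS, RS, US

  def text_is_safe(value: str) -> bool:
      # For text this check is enough: reject any ASCII separator code point.
      return not any(ord(ch) in SEPARATORS for ch in value)

  assert text_is_safe("ordinary text, even with commas")
  assert not text_is_safe("bad\x1fdata")

  # For raw binary it is not enough: 0x1F is a perfectly ordinary byte,
  # e.g. inside a little-endian 4-byte integer.
  assert (31).to_bytes(4, "little") == b"\x1f\x00\x00\x00"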


My going assumption is that arbitrary binary data should be in a binary format.

Feel free to correct me, but I figure that as long as data can be from 0x00 to 0xFF per byte, no format that uses characters in that range will ever be safe. I’m not a big C developer but I figure the null terminated strings have the same limitation.

But if it's something entered by keyboard you should be OK to use control codes.

Personally, I find tab and return to be fine for text driven stuff. Shows up in an editor just like intended.


Without escaping, it wouldn't be suitable for arbitrary binary data.


The “problem” I’m referring to is that we chose a widely used character as a field separator. Of course you still have to write a parser, etc, it’s just a lot easier if you choose a dedicated character.


Because they're zero-width. If you can't see them when you print your data, it's a machine-only separator, which makes it a bad separator for data that humans need to look at and work with.

(Because CSV is a terrible data exchange format in terms of information per byte. But that makes sense, because it's an intentionally human readable data exchange format, not a machine format)

Hence https://github.com/SixArm/usv/tree/main/doc/faq#why-choose-u...


I never knew they existed until this post


> ASCII has a field delimiter character.

Where's the key on my keyboard to make one?

The point of text-based formats is that you can edit them in a text editor by hand trivially; if typing the character is nontrivial, then it entirely defeats the point (that's also why USV adds very little value IMHO).


You can actually type a bunch of ASCII control characters very easily on a keyboard. Look at an ASCII table with 32 characters per column (I like this one[1]). The key combo for a control character is Ctrl + the letter on the same row as the control character. So:

BELL Ctrl-G

RECORD SEPARATOR Ctrl-^

UNIT SEPARATOR Ctrl-_

ESCAPE Ctrl-[

You can think of the Ctrl key as clearing the two most-significant bits of the letter's ASCII code. Not all key combos are supported in all environments. Notepad++ doesn't support Ctrl-] (GROUP SEPARATOR) at all, but does support e.g. SHIFT OUT as Ctrl-Shift-N, for instance. The Windows CMD.EXE command line supports many combinations (but not UNIT SEPARATOR, unfortunately), displaying them as e.g. ^[ or ^G in the console.

[1] https://upload.wikimedia.org/wikipedia/commons/1/1b/ASCII-Ta...
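The bit-clearing claim is easy to check, for example with a few lines of Python (the pairs below just restate the table above):

  # Ctrl+<key> conventionally yields the key's ASCII code with the two high bits cleared.
  for key, name in [("G", "BEL"), ("^", "RS"), ("_", "US"), ("[", "ESC")]:
      print(f"Ctrl-{key} -> 0x{ord(key) & 0x1F:02X} ({name})")
  # Ctrl-G -> 0x07 (BEL)
  # Ctrl-^ -> 0x1E (RS)
  # Ctrl-_ -> 0x1F (US)
  # Ctrl-[ -> 0x1B (ESC)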


If I need a table or Google to figure out how to type something, that's not "very easily"

If you need to train your employees on that, it's not "very easily".

"Very easily" is when I can take any family member who's seen a computer in their life, give them a keyboard and they can figure it out on their own without Google in 2 seconds (like csv).


What I mean is that a simple key combination is easier to use than an Alt code or having to copy and paste from another document. The ASCII table stuff is just fun trivia. "Press Ctrl-_ to insert a column separator" isn't any harder than "Press Ctrl-S to save" or "Press Ctrl-T to open a new browser tab". It's definitely easier than letting your hypothetical family member reinvent character escapes on their own the moment they encounter an address that has an extra comma in it. :-)


An editor could easily have an Insert Special Character menu item.


Control underscore is the unit separator character. (Some editors may require you to escape that character, though.)


What’s the key to enter the euro symbol? That means you can’t use it in a text editor?

There is no perfect solution, but I’d rather open a text file in a decent editor than having to deal with the escaping hell that is CSV.

They could have chosen the pipe character “|” at least, but the comma is the thousand separator in many languages (number formatting is kind of important for tabular data, if you ask me) and also, you know, general prose.


> What’s the key to enter the euro symbol?

There's one on French keyboards actually!

And it was there even before we got euro coins in our hands (I know this because I'm still using my first (mechanical) keyboard that I got with my first own PC in 2001: and there is a “€” symbol on it)


There's also the generic currency symbol, ¤, which I think is on some keyboard layouts pre-Euro.


That is an underappreciated gem. It should find more use!


Ooh is that what that is? TIL


Mandatory plug of the new AZERTY:

https://norme-azerty.fr/en/


>What’s the key to enter the euro symbol? That means you can’t use it in a text editor?

Alt gr+E? Like it's shown on the keyboard.


Not on a US keyboard layout. The point is that we insert characters that aren’t written on the keyboard keys with some regularity, like ©, ®, ™, etc


Well some of us do. There's this interesting effect where many people perceive the limitations on their current tools to be equivalent to limitations on their abstract abilities. If they don't know how to do it, it's impossible.


I think that's exactly the point that the parent poster is trying to make by example? Just because we don't have good tooling today for using ASCII delimiter characters, doesn't mean it's impossible -- just like typing the euro symbol on an american keyboard


It doesn't mean it's impossible, but it's definitely cumbersome. Any non-English person who has had to type in their native language on an American keyboard can tell you.


Oh yes certainly. And I think that when you're deep into creation it can be really really hard to remember that experience, and so recently I'm trying to find ways to help pull back the curtain for folks.


Except that not all keyboards have an AltGr key.

European ones tend to have one. US keyboards don't.

Not sure about British. Are they different from US?


Their examples if anything convinced me not to use this for a long time.

I need to zoom to be able to tell these apart, so I'll need editor support for it to be convenient to work with these anyway. And then clicking through to the comparisons, it demonstrates the difference existing support for CSV "everywhere" makes - Github renders the CSV examples nicely as tables, while again I need to zoom in to see which separator is which for USV.

Maybe once there is widespread editor support. But if you need editor support for it to be comfortable anyway, then the main benefit vs. using the old-school actual separator characters goes out the window.


I think you're articulating something about this proposal that bothers me.

The thing about the actual separators is that an editor could and should probably display them as they were intended, as data separators. It should be a setting in an editor you control, sort of like how you control tab width and things like that.

Just because a glyph is "invisible" doesn't mean it has to actually be invisible.

The symbols for the separators are hard to read, like you're pointing out, which means someone would eventually replace them with some other graphical display, in which case you were just as well off with the actual separators themselves.

They would have been better off advocating for editor support for actual separator display.


csvkit makes displaying CSV in a terminal trivial and has all the tools to manipulate/filter data I've ever needed - https://csvkit.readthedocs.io/en/latest/

I don't really get this project at all.


The thing is, while I'll probably just stick with CSV too, I'm sympathetic to the intent, but given I expect it'll need tooling anyway I'm less sympathetic to them not picking the existing separator.

I also think there are failed lessons here that reduces the incentive for switching.

E.g. If you're going to improve on CSV, a key improvement would be to aim to make the format trivially splittable, because the lesson from CSV is that when a format looks this trivial people will assume they can just split on a fixed string or trivial regex, and so the more you can reduce the harm of that the better.

As such, I'd avoid most of the escaping they show, especially for line endings, and just make RS '\n' the record separator, or possibly RS '\n'*. Optionally do the same for US. Require escaping LF immediately after RS/US, and only allow escaping RS, so unescaping can be done with a trivial fixed replace per field if you have a reason to assume your data might have leading linefeeds in fields - a lot of apps will get away with just ignoring that.

Then parsing is reduced to something like `data.split(RS).map{|row| row.split(US).map{|col| col.gsub(ESCAPE,"\n") } }` (assuming RS, US, and ESCAPE are regexps that include the optional trailing linefeeds and escapes leading linefeeds respectively). Being able to copy a correct one-liner from Stackoverflow ought to avoid most of the problems with broken CSV/TSV parsing.

I'm also not convinced adding GS, FS, ETB is a good idea, partly for that reason, partly because a lot of the tools people will want to load data into will not handle more than one set of records, and so you'll end up splitting files anyway, in which case I'd just use a proper archive format... Those characters feels like they're trying to do too much given they're "competing" primarily with CSV/TSV.

Their spec also needs to talk about encoding, because unless I've missed something, they only talk about codepoints, and they're likely to e.g. get people splitting on the UTF8 sequence etc. This to me is another reason for using the ASCII values - they encode the same in ASCII based characters sets and UTF8, and so it feels likely to be more robust against the horrors of people doing naive split-based parsing.
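For what it's worth, here is one reading of the splittable scheme sketched above as minimal Python (the escape handling is my interpretation of the comment, not anything from the USV spec):

  import re

  RS, US, ESC = "\x1e", "\x1f", "\x1b"   # real ASCII control codes, as proposed above

  # RS (plus an optional cosmetic newline) separates records, US likewise separates units;
  # ESC is only needed in front of a linefeed that genuinely starts a field.
  RS_RE, US_RE = re.compile(RS + "\n?"), re.compile(US + "\n?")

  def parse(data: str) -> list[list[str]]:
      return [[field.replace(ESC + "\n", "\n") for field in US_RE.split(record)]
              for record in RS_RE.split(data)]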


CSV isn't even restricted to comma as the separator. You can use any character you like (pipe | is a common one) and csvkit will happily still work with a simple CLI flag. Pretty much all Unix tools have a similar flag. I've always been able to find an ASCII character that my data doesn't use, though maybe there are exceptions I haven't hit.


I love csvkit, particularly csvstat. I just wish it were quicker on larger files. The types I deal with routinely take 5-20 minutes to run and those are usually the ones I want the csvstat output for the most.


I've been tempted a few times to rewrite some csvkit utilities in a faster language than Python.


It's all down to font differences. You would use the file with a font that uses larger letters diagonally, for control pictures, instead of tiny letters horizontally. And the main benefit isn't anything to do with the editor, I have no idea what you meant by that. The main benefit is that commas show up a lot more often in normal text than control pictures do.


There's no space for larger letters diagonally unless I waste screen real estate by increasing the font size, which I categorically will not do. So I'd need to replace a font I'm happy with and find one with other symbols that are readable enough. In which case it's just as easy and less invasive for me to adjust my editor to display them using different glyphs. In which case I can just as well do that with the actual ASCII control characters.

The point is that their stated "advantage" does not exist for me. I still need to make changes to my setup to handle them. In which case why should I pick this option? (as you can see elsewhere, especially as this isn't the only issue I have with their format choices).

> And the main benefit isn't anything to do with the editor, I have no idea what you meant by that.

The main benefit relative to using the actual control characters is only the tool support. That does not work for me without making changes to how the symbols are displayed anyway. Hence that "advantage" does not actually buy me anything.


For those wondering what USV is, like myself:

> Unicode separated values (USV) is a data format that uses Unicode symbol characters between data parts. USV competes with comma separated values (CSV), tab separated values (TSV), ASCII separated values (ASV), and similar systems. USV offers more capabilities and standards-track syntax.

> Separators:

> ␟ U+241F Symbol for Unit Separator (US)
> ␞ U+241E Symbol for Record Separator (RS)
> ␝ U+241D Symbol for Group Separator (GS)
> ␜ U+241C Symbol for File Separator (FS)

> Modifiers:

> ␛ U+241B Symbol for Escape (ESC)
> ␗ U+2417 Symbol for End of Transmission Block (ETB)
> ␖ U+2416 Symbol for Synchronous Idle (SYN)
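For a quick feel of those code points in use, here is a toy Python round-trip (this ignores USV's escape and other rules, so it is not a conforming implementation):

  US, RS = "\u241f", "\u241e"   # the quoted symbol characters

  rows = [["name", "note"], ["Ada", "likes, commas"]]
  usv = RS.join(US.join(fields) for fields in rows)
  print(usv)   # name␟note␞Ada␟likes, commas

  assert [record.split(US) for record in usv.split(RS)] == rows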


Wait a second … he’s not proposing using unit/record/group/file separators as separators, he’s proposing using the symbols for those separators as separators! Why not just use the separators themselves‽

Yes, rather than using U+1F (the ASCII and Unicode unit separator), he proposes using U+241F (the Unicode symbol for the unit separator). I almost feel like this must be an early April Fool’s joke?

Also, he writes ‘comprised of’ rather than ‘composed of’ or ‘comprises’ throughout his RFC.


They cover the reasoning for using the control picture characters instead of the control characters in the FAQ:

"We tried using the control characters, and also tried configuring various editors to show the control characters by rendering the control picture characters.

First, we encountered many difficulties with editor configurations, attempting to make each editor treat the invisible zero-width characters by rendering with the visible letter-width characters.

Second, we encountered problems with copy/paste functionality, where it often didn't work because the editor implementations and terminal implementations copied visible letter-width characters, not the underlying invisible zero-width characters.

Third, users were unable to distinguish between the rendered control picture characters (e.g. the editor saw ASCII 31 and rendered Unicode Unit Separator) versus the control picture characters being in the data content (e.g. someone actually typed Unicode Unit Separator into the data content)."

- https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...


I can't read those characters at the size I can/prefer to read the text at, so I need the tooling to support and render these differently anyway... This feels like solving the wrong problem in a way that will still end up with the same amount of work.



'Too many competing standards' is not one of the quoted reasons.



I don't see any real advantage over using ASCII unit and record separators (.asv).

Also I am not convinced about the need for an escape character. If you really need to use ASCII unit or record separators as data - tough; use a different format.

If only editors would display the ASCII unit separator (Notepad++ does) and treat the ASCII record as a carriage return (Notepad++ doesn't) then .asv format would be a huge improvement on CSV.


'comprised of' is standard verbiage in US patents, and I am guessing he is trying to sound formal and official. Also see https://en.wikipedia.org/wiki/Comprised_of .


An issue with CSV is that commas need to be escaped. Are the U+241F characters escaped in this USV format?


Using a visible character rather than an invisible one makes editing in an editor a lot easier.


The ASCII separators are visible in my editor. If something doesn’t support ASCII text, that sounds like a bug which should be fixed, not a reason to misuse graphical characters for something other than their purpose.


It won't wrap at the record separator, so you'll get a very long line.


The example seems to use `␞\n` as a separator rather than just `␞`. I assume their proposed standard is more definitive.


Their ABNF uses RS, defined as U+241E, not U+241E + '\n', as the record separator. They seem to add a "USV escape" in front of the linefeeds.

My bet is that this will lead to implementations that wrongly treat "␞␛\n" (RS ESC \n) as the real record separator, the same way lots of "CSV" implementations just split on comma and LF.

Seems to me if you're going to add support for something like that you should just bite the bullet and declare an LF immediately following an RS as part of the record separator, or you're falling in the same trap as CSV of being "close enough" to naively splittable that people will do it because it works often enough.


The escape symbol lets you ignore any non-special character, not just newlines: https://github.com/sixarm/usv?tab=readme-ov-file#escape-esc


I'm aware. I don't think that serves a useful purpose - I think the way they've done it is likely to make people more likely to get the parsing wrong for pretty much zero benefit. My guess is you'll end up seeing a lot of "pseudo-USV" parsers the same way we have a ton of "pseudo-CSV" parsers that breaks on escapes or quoted strings with commas, and so I think they fundamentally failed to learn the lessons of CSV.


that's a lie as far as I can see, the csv-to-usv tool does not add any newlines:

[0] https://github.com/SixArm/csv-to-usv-rust-crate/blob/30a0324...



the submitted tool does not produce them, check out the tests - note there is no \n anywhere in USV

https://github.com/SixArm/csv-to-usv-rust-crate/blob/30a0324...


If you're doing spreadsheets, then it should show in a spreadsheet and not in an editor. It's like complaining that he can't edit jpegs in Sublime or something... there's a reason that's working poorly.

Speaking of which, last time I had a control code heavy file open in Sublime, it actually did show the control codes as special characters, and it was possible to copy/paste those. This proposal is so bad I suspect it will become a standard.


There are a lot of cases where I would rather inspect/quickfix a csv file in a text editor rather than open it as a spreadsheet. Especially cases where something is wrong in the format, and it will just not open as a spreadsheet at all. Adding unnecessary levels of obfuscation to your data should never be considered a good idea imo.


They explain in the FAQ that this approach works with most text editors and copy-paste situations.


It doesn't "work" because I can't read the darn things at a sane zoom level.


Using Unicode graphic characters as metasyntactic escape characters is fundamentally wrong. Those Unicode characters are for displaying the symbols for Unit Separator, Record Separator, etc. and not for actually being separators! ASCII already has those! Included in Unicode!


To be fair, I don’t quite get those graphic characters, because the original characters should already be displayed that way, shouldn’t they? Now when I see such a character, I have no idea if it’s the real character or just it’s graphic-character counterpart.


I like the LINE FEED character to be interpreted as a line feed, not displayed as “␊”.


I mean, my assumption (yeah, I know) is that the 'display' variant is more for documentation talking _about_ the control character and not meant to _be_ the control character. Abusing the 'display' variant this way seems... misguided.


I am very confused. The author provides "assistive accessibility software to people with vision/cognition/mobility impairments", but these character symbols are indistinguishable for folks with impaired vision.


Good connection to BoldContacts.org :-) Screen readers handle these well. If someone wants to create a font with big bold separators, that would be awesome.


I checked Windows Narrator, and two other utilities and they didn't utter anything. I think requiring users to install a custom font would further hold back the adoption of your format. I hope you succeed; just my 2c.


I always wonder why we don't use this.


Back when rolling your own application level protocol on top of TCP was common (as opposed to using http, zeromq, etc) I frequently used file/record/group/unit separators for delimiters, and considered them an underrated gem, especially for plain-text data where they were prohibited to occur in the message body so you didn't have to escape them (still good to scan and reject messages containing them). As a modern example they (and most other ASCII control characters) are disallowed in json strings.


MLLP (Minimal Lower Layer Protocol) -- used extensively to transmit HL7 in health systems -- uses file separators to delimit messages.

  0B vertical tab

  <content>

  1C file separator

  0D carriage return
I wrote one of the most popular translators for MLLP, which converts it to HTTP [1].

---

P.S. Ironically, HL7 messages have something literally called a "field separator" but don't use the field separator character; usually they use a vertical bar.

[1] https://github.com/rivethealth/mllp-http


You can put control characters in JSON strings, you just need to escape them.


The way I read the json standard, the only way to include control characters is to encode them as hex. For example BEL can be encoded as "\u0007", but escaping it by using a backslash followed by a literal BEL character is not allowed. So literal control characters should never be in json text.


This makes me wonder if the Escape control character (\u001B) would work in a JSON string. Time to go test things out. :)
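For the curious, a quick check with Python's json module suggests it round-trips fine once escaped:

  import json

  s = "bell:\x07 esc:\x1b"
  encoded = json.dumps(s)
  print(encoded)              # "bell:\u0007 esc:\u001b" -- control chars become \u escapes
  assert json.loads(encoded) == s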


CSV is the javascript of the tabular data world.

Everyone thinks they can do better, but nothing's more widely supported (for a sufficiently generous definition of 'supported')


Funny thing, excel, which is the most common spreadsheet editor, does not practically support CSV files if you happen to live in countries where the default official convention is using commas for decimal points in numbers. Unless you go around and manually set stuff in how it imports it or you change your default settings. It has reached meme levels at my work.

Tab separated files are much better imo in not getting confused with the delimiter for a sufficiently sane tsv file.


Yes it does, but then it uses ; as a separator.


Unfortunately CSVs vary a lot in the wild. Some people use commas as a delimiter, some use semi-colons. Escaping rules vary. And the text encoding is not specified.

I randomly generated some CSVs and fed them into Excel and Numbers and they were differently interpreted.


This is why I tend to use the Pg COPY version of TSV - works beautifully with 'cut' and friends, loads trivially into most databases, and the 'vary a lot' problem is (ish) avoided by specifying COPY escaping which is clearly documented and something people often already recognise.

Generally my only interaction with CSV itself is to fling it through https://p3rl.org/Text::CSV since that seems to be able to get a pretty decent parse of every sort of CSV I've yet had to deal with in the wild.


Countries that use . as the thousands separator (e.g. 1.000) use , as the CSV separator.

Countries that use , as the thousands separator (e.g. 1,000) use ; as the CSV separator.

Why? Because that’s how Excel does it.


Errata: as the decimal separator, not as the thousands separator.


In a POSIX shell, I actually prefer to use the bell character for IFS.

  while IFS="$(printf \\a)" read -r field1 field2...
  do ...
  done
This works just as well as anything outside the range of printing characters.


> I actually prefer to use the bell character for IFS

Heaven help you if you cat the source file in a shell, though!


All downsides, no upsides.

You cannot edit it in regular editor, like csv/tsv/jsonlines.

There is no schema or efficient storage, like binary formats.

There is no wide library support.

Not all data is representable.


> You cannot edit it in regular editor, like csv/tsv/jsonlines.

If only there were shortcuts on modern operating systems to allow us to do things that aren't readily on our keyboards. Like upper case characters. Or copy and paste. Or close windows. Our lives would be so much better.

If ASV had caught on, there could be common shared shortcuts to type them, and fonts would regularly display them (just like the unicode characters proposed). But CSV was simple enough and readily type-able.

> There is no schema or efficient storage, like binary formats.

I'm not quite certain where you're trying to go with this. Binary formats aren't really meant to be human readable in an average text editor. It doesn't know to differentiate 1, 2, 4, or 8 bytes as an integer or a float. Even current hex editors to make it easier to navigate these formats don't really know unless you are able to tell it somehow.

> There is no wide library support.

It's a critical mass problem. Not enough people are using them, so no libraries are being made.

> Not all data is representable.

I'm not quite certain what data couldn't be represented. If you can represent your data in CSV, you can represent it in ASV. It's all plain text that gets interpreted based on what you need. They're nearly a 1:1 replacement. Commas get replaced by unit separators, newlines get replaced by record separators. Then you have group and file separators to work with for further levels of abstraction if you need them.


> I'm not quite certain what data couldn't be represented.

What do you do if you receive data already containing a unit separator, or a group separator, and you need to put it into a field? The whole value proposition of ASV over, say, TSV is that you should never need to escape anything, but that's only possible by rejecting some input data.


Re editors: The problem with USV is not that it's hard to type the characters, but rather that the newlines are completely optional. Which means that in the general case, most line-based tools are not going to work with USV.

Now, the readme actually has that optional newline separator thing, but the optionality of it makes it completely useless; it seems like an afterthought. For example the first "real" USV writer I found, the "csv-to-usv", does not put them [0] and thus makes uneditable files.

And if we are going to end up with uneditable files, might as well go with something schema-full, like parquet or avro. You are going to have the same "critical mass problem", but at least the tooling is much better and you have neat features like schemas.

[0] https://github.com/SixArm/csv-to-usv-rust-crate/blob/30a0324...


Good catch! I just fixed csv-to-usv so it prints newlines now. You're right, Parquet and Avro are both great, for use with schemas.


1. Editors can be improved.
2. Same as CSV etc., then.
3. Libraries can be improved.
4. Escaping characters exist.

ASCII 1963 had 8 separators, 1965 reduced it to 4, and named them. See 6.3.12 of https://dl.acm.org/doi/pdf/10.1145/363831.363839


The task here is to explain why one should use this over CSV. By your own admissions, there is no reason to prefer this over CSV.


There's no standard for CSV files, thus no one can parse them properly

The only time you need to escape a character is if it's a control character that's rarely used, unlike the " and , characters


> There's no standard for CSV files

Literally untrue. (And were it true, it still wouldn't be a reason why one should use this over CSV—not sure what's so hard to grasp about the conversational/contextual premise here.)


I've sketched out a replacement for JSON which would use these characters - https://shkspr.mobi/blog/2017/03/kyli-because-it-is-superior...


CSV is honestly not that problematic. Figuring out if a field contains a comma and then properly quoting it is trivial. And fields without commas don't need quoting. Sometimes your application even guarantees no commas, especially if CSV is designed into it from the beginning.


I'm guessing you haven't worked in customer support where people send you their "CSV" files. Even the field delimiter varies (many Europeans use semi-colons).


No, I have. I don't consider abuse of the format a problem with the format. Though I can see how having to delimit with special characters will help the type of person who writes print(','.join(stuff)).


>I don't consider abuse of the format a problem with the format.

That's a fair point. But you could argue that when the abuse is so widespread, it becomes a de facto part of the format (even if it isn't in the RFC).


Not easily readable / editable using a regular text editor.


According to their GitHub README:

```USV works with many kinds of editors. Any editor that can render the USV characters will work. We use vi, emacs, Coda, Notepad++, TextMate, Sublime, VS Code, etc.```

I loaded an example in my fairly generic Emacs and it worked out of the box. The separators were pretty small so I had to increase my font size to distinguish US from RS. And of course I have no idea how to enter those characters. I'm sure there is, but cut & paste worked.


I'm fascinated that a lot of posters in this thread are not understanding the ideas and experiences that the inventors of this file format had. They invented this format because it works for machines as well as for humans. Text editors can handle the proposed Unicode characters just fine. Humans can see them. The only challenge is that it is cumbersome to type the delimiters. And that the format is not used in any relevant software (like Excel). Both are reason enough that USV will not be used anywhere. But I can see why they went this way on their file format.


We might be able to see them, but for me they're just a blur unless I zoom in significantly, so I'll need editor accommodations just as much for these characters as if they used the already existing RS/FS/US/GS characters.

It feels like instead of fixing it properly, they went with an option that will still need tool improvements, will be controversial, and adds unnecessary details (e.g. the SYN they've added will be an active nuisance and I'd be willing to bet will get ignored by enough tools to become a hazard to data integrity).

I quite like an initiative to make use of proper record and unit separators, but this feels poorly thought through in several respects (e.g. their quirky escape character, which acts differently depending on the class of the following character, will be a 'fun' source of bugs; that splitting records on LF requires three characters almost certainly will mean a number of tools will incorrectly treat those three characters as a unit, etc. -- these assumptions are based on how slapdash a lot of CSV parsing and generation is; if you want to compete with CSV you ought to learn those lessons)


CSV works for machines as well as humans, why do you assume or imply otherwise? Making the separator hard to type makes this ‘invention’ hard for humans to use. Using the glyphs instead of the semantic Unicode separators might also make this harder to use, even if you can understand why they did it, and to some degree it subverts the intent of the Unicode standard’s separator and glyph characters.


We don't need a new format which works for machines as well as for humans, because there are tons of existing ones. You have CSV or TSV for wide support; JSONlines if you want very easy editability and structure; and if those don't work for some reason, pretty much any other delimiter/escape would work better (example: newline for records, "^^" for fields, "^"-style character escaping; or JS-style "\"-escaping with field separator being "\N")


I don't really see what benefit it provides over CSV other than needing to escape less frequently. That hardly seems like it's worth it.


Do those characters map to something visually useful in (typical) unicode fonts?

That would be neat :)

Edit: Apparently, kinda (e.g. https://www.compart.com/en/unicode/U+241E )

Not the most creative....


Who here uses a "regular" text editor, let's be real


The usv github repository says it is "the standard for data markup of ...", has 66 stars, and is currently applying for the "text/usv" MIME type. That's about it.

Maybe I'll consider it when it does not belong to a company, has two more zeros in the number of stars, and has RFC/ISO attached to it. Because right now it is not much more of a "standard" than a hobby project I create on a whim.


Thanks. That was supposed to say "a standards-track" and I goofed. Fixed now.

By the way, USV doesn't belong to a company. It's just me. RFC/ISO is work in progress, and I submitted IETF ID 00 last week.


Yeah, I can't imagine the ego one needs to basically go "Hey everyone, I've invented a new standard!"...


What a pedantic take on what constitutes a 'standard'.


About the most annoying thing about the modern Internet is this kind of chip-on-the-shoulder comment about "oh he has such a big ego" and nonsense like that.

Man, I preferred it when people could just write up and propose things. The insufferable "is that professional?", "What about consensus?", "Wow the ego to propose something".

Time to return to monke.


Yeah I agree, the project seems like a decent enough idea to discuss, and by the amount of engagement it's getting here there is some data to back up that assertion.

Additionally I'm sure that those ("they have such a big ego") types of comments and thoughts existed in the early internet as well, since it's a fairly human reaction whenever anyone tries to build or propose something that disrupts the status quo.


We'll be waiting with bated breath


Not sure I understand the advantage over ASCII Separated Values (ASV) which use ASCII control characters 0x1E and 0x1F


Addressed in the FAQ: https://github.com/SixArm/usv/tree/main/doc/faq#why-choose-u...

Main point: ”USV provides typically-visible letter-width characters (such as Unicode 241F), whereas ASV provides typically-invisible zero-width characters (such as ASCII 31).”


USV would have the disadvantage of using multi-byte characters as delimiters, so you have to decode the file in order to separate records. And you still can’t type the characters directly or be guaranteed to display them without font support. This honestly seems like cleverness for cleverness’s sake.


The way I would have gone would be to define the standard to support both, such that the two sets of codes MUST be considered semantically equivalent, but that generation tools SHOULD prefer to generate the control codes for new files.

This way people can initially use the visible glyphs while editors don't support the format, and this will always be supported. But, as editors add support and start to generate the files via tools or manually in tabular interfaces where the codes themselves disappear, usage will automatically transition over to the control codes.


Ah fair enough. Of course you could configure your shell/editor/whatever to make control characters visible. Seems like if you were going to edit USV or ASV by hand you'd probably want a customized editor anyway.


This is so weird, since the purpose of the former characters is displaying the latter characters. If they are actually used for display, then you can’t tell which is which.


Surprisingly, they actually did write a FAQ entry on it (I'm honestly surprised):

https://github.com/SixArm/usv/tree/main/doc/comparisons#asci...


> The Synchronous Idle (SYN) symbol is a heartbeat, and is especially useful for streaming data, such as to keep a connection alive.

> SYN tells the data reader that data streaming is still in progress.

> SYN has no effect on the output content.

> Example of a unit that contains a Synchronous Idle:

> a␖b␞

Why would this go in-band inside a document format? Just why? If you want keep-alives, use a kind of connection that supports out-of-band keepalives.

If you download the same document twice, and the second time the server is heavily loaded (or it's waiting on some dependency, or whatever), presumably the server will helpfully generate some SYNs in the middle of the document to keep the connection alive (?), but now you've got the same document "spelled" two different ways, that won't checksum alike.

SYN along with the weirdness of

> Escape + [non-USV-special] character: the character is ignored

means that you have arbitrarily many ways of writing semantically-same documents.


This entire thing is a solution in search of a problem, and this is the most obvious one.

Why does a file format need a transport protocol?

---

Existing transport protocols (TCP, QUIC) already provide this.


Great point about SYN. I'll drop it.


I've long wanted a successor to CSV, but this is kinda stupid. People like CSVs because they look good, feel natural even in plaintext. This is the same reason that Markdown in successful.

As for including commas in your data, it could just have been managed with a simple escape character like a \, for when there's actually a comma in your data. That's it.


I don't see this as a perfect solution, but CSV is not great either. A comma is super common in both text and numbers. Here in Europe we often use commas as decimal separator and use a semicolon as value separator.

As a result spreadsheets almost always fail to automatically parse a CSV.

I do like the idea of having a dedicated separator character, that would work right worldwide. And then just standardize the use of a dot as decimal separator in these files.


>As for including commas in your data, it could just have been managed with a simple escape character like a \, for when there's actually a comma in your data. That's it.

Not quite. What if there is a \ in your data? Then you have to escape that.


> Not quite. What if there is a \ in your data? Then you have to escape that.

No problem, any character following a `\` is a literal character. `\\` => literal `\`. `\,` => literal comma. `\a` => literal `a`, etc.

Parsing this is easy, generating it is easy, and there is only one rule to remember for humans reading or generating it.

Each rule added for parsing is one more added complexity and point of failure.


You still have to solve this in USV.


Or two commas in a row can be the escape, without overloading backslash.


That wouldn't allow for empty fields.


Four commas in a row.


Can you parse this 2-row CSV:

SomeCommas,MoreCommas,OnlyOneComma,ALotOfCommas

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


Pathological cases are never difficult to find for any app.


Wrong. Consider the standard backslash escape: represent literal comma as "\,", and literal backslash as "\\". Backslashes are otherwise forbidden.

It will be difficult to find pathological cases for this grammar, because they don't exist.


Well okay, but there's an outer circle of Hell where people write code to manipulate Windows filepaths.


Seems complex enough that you'd only manipulate files in this format by serializing through a tool, and by then it's competing with established binary formats rather than CSV.


I've actually been employing Emoji Separated Values (ESV), often , here and there when doing some of this kind of work. Granted, it's not standard, but it's been really useful when I've needed it.

*edit Apparently emojis don't fly here, but it was an index finger pointing right.


The benefit of this is that you can use different emojis to denote content type.

e.g. if it's a frowny face you know it's an invoice.


Funnily enough, I published a Python library two days ago that uses emojis to indicate where certain non-msgpackable builtin types have been forced into msgpackable objects: https://github.com/umarbutler/persist-cache/blob/main/src/pe...

is used for tuples, for sets, for frozen sets, for pickles, for bytes and for bytearrays.

I thought it was pretty ingenious but clearly I’m not the only one to think of it.


My delimiter over several projects over the years has been:

Only a matter of time before something breaks catastrophically but it hasn't happened yet.


It's sensible in principle:

* Editors will play nicely with the graphical representation. If you need better graphics, it's done with font customization, which everyone already supports.

* It announces that the data is source text, vs transmitted bytes. The type/token distinction is not easy to overcome.

* It sits way out in Unicode's space where a collision is unlikely. The whole reason why CSV-type formats create frustration is because the tooling is ad-hoc, never does the right thing and uses the lower byte spaces where people stuff all kinds of random junk. This is the "fuck it, you get the same treatment as a Youtube video id" kind of solution.

That said, if used, someone will attack it by printing those characters as input.


I'm still confused whether this is a joke or not.


I certainly hope that anyone proposing a Unicode CSV variant as a joke would pick some raised hand emoji as the separator and the victory gesture (0xe011, also popular as an approximation of what an air quote emoji would look like) as the quote character.

But we already keep stumbling over missing support for the on-demand quote character even with separators like comma and tab; using more exotic characters as the separator will only make it worse. The value of less escaping is negative.


Apparently it is not. They have submitted it to the IETF. I will have to watch closely to see if LibreOffice Calc/Excel and languages/libraries adopt support. It seems like it does solve some common problems with CSV.

https://www.ietf.org/archive/id/draft-unicode-separated-valu...

https://datatracker.ietf.org/doc/draft-unicode-separated-val...


I don't think it's a joke; at https://github.com/sixarm/usv they discuss how they're working on IANA standardization.


Completely unreadable. Then again, Germans know the pain of decimal points.

We write 3.000,00 for exactly three thousand, instead of 3,000.00

Now imagine how often parsing breaks.


Finland uses 3 000,00 which is also kind of a pain to parse.

I think the rarely used ' to group thousands is actually the most sensible solution.


And now and then you encounter a web form in the .fi domain that rejects "," and expects ".", but does not tell you that that is the reason for rejecting your input. The web "designers" that deploy such crap in .fi should be sent to Siberia.


In my head 3.000,00 is correct and I always get confused because it seems most(?) people use the other method.


CSV is great because Excel can import it, but it can't import USV, so at that point, why use USV when you can use JSON?

https://github.com/tyleradams/json-toolkit/


Maybe their objective in submitting to the IETF is to get programs like Excel to start supporting it.


That’s… not how Excel/Microsoft works.


Yes you're exactly right.


Can you not customize the separators used when importing CSV-likes into Excel? LibreOffice has a neat little window for it that even shows a preview of what values go into which cells.


Sure if you want to stop and fiddle with Excel.

If you want to just double click and get to work, no.


If you would like to run csv-to-usv from 15+ languages (not only rust!) then check out this demo I made, converting the library to an Extism plugin function: https://github.com/extism/extism-csv-to-usv

Here's a snippet that runs it in your browser:

    // Simple example to run this in your browser! But will work in Go, PHP, Ruby, Java, Python, etc...
    const extism = await import("https://esm.sh/@extism/extism");
        
    const plugin = await extism.createPlugin("https://cdn.modsurfer.dylibso.com/api/v1/module/a28e7322a6fde92cc27344584b5e86c211dbd5a345fe6ec95f1389733c325541.wasm",
      { useWasi: false }
    );

    let out = await plugin.call("csv_to_usv", "a,b,c");
    console.log(out.text());


> esm.sh

> cdn.modsurfer.dylibso.com

Do people routinely do this - just run random code from arbitrary endpoints?

Yikes


I'm sorry but... why? The library is a single function consisting of 10 lines of Rust code, and it would be about 10 LOC to re-implement in any language that has native CSV libs. It seems a little bit unnecessary to load a WASM runtime for that.


But without WASM, how are you going to get 500ms+ startup time and a 3rd-party server dependency in your critical path?


And two domains you're blindly trusting not to be hijacked.


Sorry do you know what “demo” means?


for sure — do it!


I'm good I'm just here to chat, not to promote anything ;)


I can certainly appreciate that! Would encourage you to try sometime. Creating and sharing more than your opinion is a lot of fun.


> Is USV aiming to become a standard?
>
> Yes and we've submitted the first draft of the USV standard to the IETF: link.

This is a nice idea, and all, but seems unlikely to become a meaningful standard without some major backing behind that "we".


Description of USV: https://github.com/sixarm/usv


Absolutely terrible documentation. The RFC doesn't even explain the purpose of the "End of Transmission Block" token.


Good catch, thank you. An accidental omission. Fixed now in the repo. Fixed in the next IETF Internet Draft.


If I understand the API correctly from my brief glance, the crate returns a triply-nested vector with the outermost vector being the equivalent of CSV rows, then CSV columns, then "units" which don't have a direct CSV equivalent. It would be helpful if there was an API method that returned results without this final level of nesting, perhaps panicking if there is more than one unit. This would make it easier to deal with the common case (in CSV at least) where each column only has a single value.


I think the units are the CSV fields, records are rows, groups would be multiple CSV files (or multiple sheets in an Excel file), and file separator... a zip with multiple CSV files? (or multiple Excel files).


My mistake then about the correspondence :)


You're directionally correct. :-)

USV terminology is units, records, groups, files.

Spreadsheet equivalents are cells, lines, sheets, folios.

Database equivalents are fields, rows, tables, schemas.

The USV Rust crate provides iterators str.units(), str.records(), str.groups(), str.files(), so it's easy to get the parts you want.
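
For the simple two-level case (records of units), here is a hedged sketch of consuming USV without the crate, splitting directly on the separator glyphs. It ignores USV's escape sequences and treats separators purely as delimiters, so it is a simplification rather than a full parser:

    // Minimal sketch: split USV text on the separator glyphs directly.
    // Ignores USV's escaping rules, so only suitable for unescaped input.
    fn parse_simple_usv(text: &str) -> Vec<Vec<&str>> {
        text.split('\u{241E}') // ␞ symbol for record separator
            .filter(|record| !record.is_empty())
            .map(|record| record.split('\u{241F}').collect()) // ␟ symbol for unit separator
            .collect()
    }

    fn main() {
        let usv = "a\u{241F}b\u{241F}c\u{241E}d\u{241F}e\u{241F}f\u{241E}";
        println!("{:?}", parse_simple_usv(usv)); // [["a", "b", "c"], ["d", "e", "f"]]
    }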


I'm interested in this too. IMO there's actually a huge benefit to being able to concisely represent 3D / 4D data (i.e. xarrays, slices, datasets) in an easily digestible text format. Mainly thinking about this approach over e.g. the netCDF format or deeply nested JSON.


A similar concept that is (IMHO) much nicer: RSV

It doesn't need any escaping or quoting: a field just has to be valid UTF-8.

The trick is that the delimiters are bytes that are invalid UTF-8.

The spec fits on a napkin, parsing is trivial, you can jump to the middle of a doc and find the nearest row, etc.

Main downside is you need an editor/viewer that can handle it.

https://github.com/Stenway/RSV-Specification
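
A rough sketch of an RSV writer in Rust, based on my reading of the linked spec; the terminator bytes (0xFF after each value, 0xFD after each row) are assumptions worth checking against the spec before relying on them:

    // Rough RSV writer sketch: values are raw UTF-8, terminated by bytes
    // that can never appear in valid UTF-8, so no escaping is needed.
    fn write_rsv(rows: &[Vec<&str>]) -> Vec<u8> {
        let mut out = Vec::new();
        for row in rows {
            for value in row {
                out.extend_from_slice(value.as_bytes()); // any valid UTF-8, commas and newlines included
                out.push(0xFF); // end of value (per my reading of the spec)
            }
            out.push(0xFD); // end of row
        }
        out
    }

    fn main() {
        let bytes = write_rsv(&[vec!["a", "b,c"], vec!["newlines\nare fine"]]);
        println!("{} bytes", bytes.len());
    }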


Thank you. I added a comparison to RSV.


CSV is like an invasive plant species, or perhaps a curse; you're never going to be able to root it out even though there are a billion better data formats.


CSV can be manually read/edited by non-technical/non-developer humans using commonly available tools like Excel and Notepad. Not many of the better data formats meet that criterion.


Notepad, I agree. Excel... not so much: it tends to change data silently unless you are very cautious with your environment (e.g. dates transformed to the number of days since 1900, and some strings to dates).


Actually Excel finally added a "stop &$*@ing up my data" option recently: https://mashable.com/article/microsoft-excel-disable-setting...


That helps, no doubt. But last week one of my coworkers touched a CSV with Excel, and all dates went from ISO 8601 to MDY. We are based in Europe (i.e. we use DMY at minimum). In my experience, a CSV touched by Excel is not trustworthy for further analysis.


For its use case, CSV is a good and simple format with just three simple rules and three special-purpose characters.


True, but there's so much scope for people to do naive implementations with join() or split() functions, and then you end up with nothing escaped properly and a big mess.


I have seen Unicode Separated Values. I don't like Unicode, and I like USV even less. I like ASCII Separated Values, which can encode each separator as a single byte and can be used with character encodings other than Unicode (and, even if you do use it with Unicode, it does not prevent you from using the Unicode control pictures in your data; USV does prevent you from using those characters in your data even though the data is (allegedly) Unicode).

What they say about display and input really depends on the specific editors and viewers that you are using (and perhaps on the fonts as well). When I use vi, I have no difficulty entering ASCII control characters in the text. However, there is also the problem with line breaking, with ASV and with USV, anyways; and they do mention this in the issues anyways.

Fortunately, I can write a program to convert these formats without too much difficulty, even without implementing Unicode (since it is a fixed sequence of bytes that will need to be replaced; however, it does mean that it will need to read multiple bytes to figure out whether or not it is a record separator, which is not as simple as ASV).


I've been using an emoji separated values format for a personal project where fields contain lots of special characters including whitespace.

I'd previously given up using ASV because of the printability and copy/paste problems described. Replacing the control characters by their printable glyphs solves all my previous problems and is as genius as it is naughty.

I sympathize with the arguments people here present against it, and agree the SYN character and Group Separator are weird -- but they cause no harm. I'm not bothered by the same data having multiple representations, since I'm insisting on human readability rather than byte-by-byte perfection in the first place.

It took 20 minutes to convert my project and I'm very happy.

The only tooling change I had to make was adding digraphs to vim:

digraph rs 9246 us 9247

etc. Easy to type directly in my .usv file. Easy to type and read in some Python consuming it.

Regardless of it becoming a standard and my lingering grouchiness about multi-byte characters, needing to use non-xterm, etc., this works very well for me.


Good points. I dropped SYN now because of feedback here.

Thanks for the digraph vi info. Thanks to you, I added a vi-specific page in the repo for your tip.


First time hearing about USV, nifty! However, I think the main adoption challenge here remains Excel support (very tough).


Excel can't even handle CSV correctly without using the import function.


Well, CSV would be much harder to import than something like USV, because the delimiters are well-defined in USV and there is no need for quoting strings.


How do you put a USV example into one column of a USV file without a qualifier?


If you live in a place where comma is the decimal separator, your CSV files will often use semicolon as the separator instead of comma. Will this tool cater for that?


What do you mean cater to that? The point is you separate with a value that is not used within the fields. So decimal your numbers however you want.


This is (nominally) a discussion about the csv-to-usv tool. They are asking if the csv-to-usv tool also accepts semi-colon delimited files as input.

Have you maybe lost track of what post you're commenting under?

(I believe the answer is no BTW, the tool only supports , as delimiter in its input.)


Yes, this. Thank you.

If I work with CSV files they are most often not comma-separated but semicolon-separated because of the numbers. An Excel installation localized for the decimal comma would not read 'real' CSV files correctly.

If csv-to-usv cannot cater for this type of CSV files, it would not be usable in a large part of the world.


Yeah they should add it. The tool is like 20 lines of Rust code. It's a thin wrapper around the csv Rust crate, which does support specifying alternative delimiters.
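
For reference, here's roughly what that looks like with the csv crate's ReaderBuilder (the builder and its delimiter method are real; how csv-to-usv would wire it in is my assumption):

    use csv::ReaderBuilder;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Semicolon-delimited input with decimal commas, common in Europe.
        let data = "name;value\nfoo;1,5\nbar;2,25\n";
        let mut reader = ReaderBuilder::new()
            .delimiter(b';') // accept ';' instead of ','
            .from_reader(data.as_bytes());
        for record in reader.records() {
            println!("{:?}", record?);
        }
        Ok(())
    }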


Yes, the next release adds custom delimiters, ETA one week.


There are some well-researched alternatives to CSV.

Off the top of my head, I can highly recommend SML:

https://dev.stenway.com/SML/SimpleML.html

I recommend watching the 'stop using CSV' video too:

https://youtu.be/mGUlW6YgHjE?si=zDG_9Jv8LSy-ttP4


Thank you, I'll add a todo to compare SML.



Nope, this isn't a good approach. I prefer tab-separated values (TSV) and use it as much as possible.


Instead of using a separator character:

a <comma> b <comma> c <enter> d <comma> e <comma> f

why not use a header character:

<row><cell> a <cell> b <cell> c <row><cell> d <cell> e <cell> f


I am uncertain, but this is likely to reintroduce the issue of Unicode buffer overflow into the mainstream. What are your proposed solutions, considering it is expected to become standardized?


USV defaults to UTF-8, which prevents the issue of Unicode buffer overflow -- presuming you mean the typical kind where Unicode flows into ASCII.

The primary USV library implementation uses Rust, which is notably good at UTF-8 and conversions between operating system string encodings (such as ASCII) and UTF-8.


This is just ESV files with extra complexity!

ESV: eggplant-separated values. Because who is ever going to put AUBERGINE (U+1F346) into a dataset? It's the perfect record separator!


I prefer tomato as a separator. It is tastier.


Alternatives to CSV are also covered at length at:

https://news.ycombinator.com/item?id=31220841


Does USV support nested fields? While reading the USV GitHub README I did not clearly understand the purpose of the "group separator".


In the same way that CSV supports fields that contain nested CSV documents: cumbersomely / painfully, with lots of escaping of the delimiter characters.


This is needlessly adding yet another standard¹ to the mix. If you are in a position to choose what standard you use, just use:

• Whatever is best for the data model and/or languages you use. JSON is a common modern choice, suitable for most things.

• If you want something more tabular, closer to CSV (which is a valid choice for bulk data), use strict RFC 4180 compliant data.

• If you want to specify your own binary super-compact data, use ASN.1. I am also given to understand that Protobuf is a popular modern choice.

If you aren’t in a position to choose your standards, just do whatever you need to do to parse whatever junk you are given, and emit output that is as standards-compliant as possible; again, RFC 4180 is a great way to standardize your own CSV output, as long as you stick to a subset which the receiving party can parse.

Nobody needs “USV”, and nobody should use it.

1. <https://xkcd.com/927/>


Thank you for the feedback. Your perspective makes sense to me for most people. I've added this to the "criticisms" page.


Conversion of file encoding from simple ASCII to UTF-8 has consequences beyond the field/record problem.

Some tools will randomly convert " to 'LEFT DOUBLE QUOTATION MARK' and 'RIGHT DOUBLE QUOTATION MARK' if they see UTF-8 flagging. Thus, the file is converted without your voluntary participation.


I think I'd prefer to wear out my keyboard typing XML tags than deal with this.


Unicode is Turing complete, which makes it an attack vector.


It is a set of glyphs and their encodings. How is that 'Turing complete'?


Unicode involves more than the set of glyphs and their encodings; it also involves properties, etc. However, it can be an attack vector even ignoring that stuff; it does not have to be Turing-complete to be an attack vector. But, the specific kind of attacks depends on the application.

Different kind of character sets and character encodings will be good for different purposes. Unicode is "equally bad" for many uses.


Yes, Turing complete wasn't the right term. It would be better to say that it's surprisingly complex to parse.


Yes, Unicode is too complicated and too messy, whether or not it is Turing-complete (it is complicated enough that maybe it is Turing-complete; I don't know).


USV is doomed because Worse is Better[1] (edit: fix url)

[1] https://dreamsongs.com/RiseOfWorseIsBetter.html


Y'know, I greatly dislike this. It's an actual emotional reaction. This should not be standardized. No one should use this. This is a bad idea and deserves to die in obscurity.

I'll tell you why, it's pretty simple. The characters this... thing is stealing exist to represent invisible control sequences. That is their use. The fact that they can be entered by direct input is inevitable, but not to be encouraged.

I will be greatly disappointed if this is accepted as a standard. The fact that a USV file looks like a rendered ASV file is a show stopping bug, an anti-feature, an insult to life itself. Kill it with fire.


I can't disagree.

I know we're not supposed to do low-brow dismissal here, but everything about this idea is wrong.

Rationale, solution, representation, everything.


Nice! Is there a Python library?


Why not use parquet at this point? (or a row-oriented equivalent like Avro or SQLite)

If you don't have a human-readable file, might as well be compressible, queryable, and metadata-enabled I think.


This is the XKCD comic in action. https://xkcd.com/927/

Someone should write a family of filters of the form CSV2ASV, CSV2USV, CSV2JSON, USV2XML, TOML2USV, USV2Cuneiform...


/me looks at calendar, nope not April 1st yet.



