Don't do this. TSV has won this race, closely followed by CSV. Anything else will cause untold grief for you and your fellow data scientists and programmers. I say this as someone who routinely parses 20GB text files, mostly TSVs and occasionally CSVs, for a living. The solution you are proposing is definitely superior, but it isn't going to get adopted any time soon.
I was surprised to see you list TSV as more common than CSV. I encounter CSVs on a pretty regular basis, but I don't think I've had to parse a TSV in the past 3 or 4 years. As a junior web developer, I don't have much experience, though. Nine times out of ten, the CSV is coming from or going to Excel, or a system that was designed to support Excel. If you don't mind my asking, what types of data do you regularly work with that are in TSV format?
Excel actually doesn't 'care'. It uses the list separator defined in your Windows "Regional Settings", and the default there differs by system locale (semicolon rather than comma in many European locales, for instance).
TSV is nicer for output (on stderr/stdout or a log file), so it tends to crop up when you want to parse the output or log file of something. I haven't seen Excel in use at my workplace yet.
Floating point values with a given precision and some integers. We’ll have to buy a proper supercomputer before the latter take more than seven digits :\
There tends to be less overhead in TSV. Unless you need to represent text with embedded tabs, anything more seems unnecessary. It works with standard *nix tools. Not a bad compromise, and part of the reason that people whose standard "file" is 100GB prefer it.
That's unfortunately a very accurate summary :)
Real estate data, traffic data, weather data, population demographics, stock prices, tweets - I've parsed all that and more. Every one of them was a giant TSV (except the finance ones, which were CSVs, because Excel). Say you purchase the database containing every single home sold or bought in California for the past decade. That's eleven 20GB TSVs with 250 tab-separated columns, plus one data dictionary that tells you what each of the 250 columns means. That's what Reology sells you: gigantic txt files with tabs, which are easy to handle with awk, cut, sed and more.
I could preface a lot of this with 'kids these days', but...
What you write is so true. So many large companies use text files to shuttle around data. I worked at one place that used pipe-delimited 10+GB files. It's not sexy, and using awk/sed/cut seems like a hack at first, but then you realize that it works and it is the simplest solution to the problem.
What would make one pair of ASCII characters (comma/linefeed or tab/linefeed) handle nesting any better than another pair (unit separator/record separator)?
Because CSV actually has three special characters. The field separator is ',' (or tab for TSV). The record separator is '\n' (or "\r\n"). And the quote/escape character is '"'. Commas or newlines that are part of a quoted string are data rather than control characters. Quote characters can be escaped by preceding them with a second quote character. There is an RFC (RFC 4180) that describes all of this.
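For anyone who hasn't read the RFC, here's a quick illustration using Python's csv module, whose default dialect follows essentially these rules (the field values are just made up for the example):

    import csv
    import io

    # One field contains the delimiter, a newline, and a quote character.
    row = ['id-1', 'He said, "hi"\nand left', '42']

    buf = io.StringIO()
    csv.writer(buf).writerow(row)   # QUOTE_MINIMAL: only the tricky field gets quoted,
                                    # and the embedded quotes are doubled
    print(repr(buf.getvalue()))     # 'id-1,"He said, ""hi""\nand left",42\r\n'

    buf.seek(0)
    print(next(csv.reader(buf)))    # round-trips back to the original three fields

The embedded comma, newline and quote all survive, which is exactly what plain split-on-delimiter parsing can't give you.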
You could use ESC (\x1b) to escape itself and either of your delimiter characters, but of course now you've gotten back all that complexity you were trying to avoid by using non-printable characters.
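As a sketch of what that scheme would look like (ESC/US/RS are the real ASCII code points, everything else here is invented for illustration):

    # ESC (0x1b) escapes itself, the unit separator (0x1f) and the record separator (0x1e).
    ESC, US, RS = '\x1b', '\x1f', '\x1e'

    def escape(field: str) -> str:
        out = []
        for ch in field:
            if ch in (ESC, US, RS):
                out.append(ESC)      # prefix any special character with ESC
            out.append(ch)
        return ''.join(out)

    def unescape(field: str) -> str:
        # Assumes well-formed input (no trailing lone ESC).
        out, it = [], iter(field)
        for ch in it:
            out.append(next(it) if ch == ESC else ch)
        return ''.join(out)

    record = US.join(escape(f) for f in ['plain', 'has a \x1f in it'])
    # record.split(US) is now wrong -- it also splits on the escaped separator --
    # so the reader needs a small state machine again, which was the whole complexity
    # you hoped to avoid.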
It's hardly too late. If you just look at the use case of application logging, for example, the latest fashion is logging to JSON-formatted log files, which is INSANE for a different set of reasons. I have servers that generate TBs of log files on a regular basis. An ASCII-delimited log file format standard could be adopted by the application logging space, which could result in some uniform tools that provide better streaming support for log shipping, and it could gain adoption in other adjacent use cases from there.
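Purely hypothetically, the writing side could be as dumb as this (the field names and their order are invented for the example, not part of any standard):

    import sys
    import time

    US, RS = '\x1f', '\x1e'   # ASCII unit separator / record separator

    def log(level: str, component: str, message: str) -> None:
        # One record: fixed field order, fields joined on US, record terminated by RS.
        fields = [time.strftime('%Y-%m-%dT%H:%M:%S'), level, component, message]
        sys.stdout.write(US.join(fields) + RS + '\n')   # trailing \n only so terminals stay readable

    log('INFO', 'auth', 'user logged in')

A shipper could then stream this by splitting the byte stream on RS and each record on US, with no quoting rules or JSON parsing in the hot path.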
I think CSV is more common, since it allows for escaping whereas with TSV I don't believe there is any method of escaping.
I had to deal with a system using TSV once, with that "feature", and because of it we had to do the escaping ourselves at a higher level, with \t and \\.
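Roughly the kind of higher-level escaping that ends up being necessary (a from-memory sketch, not that system's actual code):

    # Real tabs and backslashes in a value are written as the two-character
    # sequences \t and \\ ; backslashes must be escaped first.
    def escape(value: str) -> str:
        return value.replace('\\', '\\\\').replace('\t', '\\t')

    def unescape(value: str) -> str:
        out, i = [], 0
        while i < len(value):
            if value[i] == '\\' and i + 1 < len(value):
                out.append('\t' if value[i + 1] == 't' else value[i + 1])
                i += 2
            else:
                out.append(value[i])
                i += 1
        return ''.join(out)

    line = '\t'.join(escape(v) for v in ['a\tb', 'c\\d'])   # now safe to split on real tabs
    assert [unescape(v) for v in line.split('\t')] == ['a\tb', 'c\\d']

At which point you've reinvented a worse CSV, which I think was the parent's point.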