
Yup exactly, it just pushes the problem around, without solving it.

The delimiters can occur in binary data, or when there's nesting -- trying to store TSV in TSV, or JSON in JSON, etc. The latter definitely happens a lot, for better or worse.
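To make the nesting case concrete, here is a minimal sketch, assuming the ASCII unit separator (0x1F) is the field delimiter and nothing is escaped. The helper names `naive_join` and `naive_split` are made up for illustration.

```python
US = "\x1f"  # ASCII unit separator, assumed here to be the field delimiter


def naive_join(fields):
    """Join fields with the unit separator, with no escaping at all."""
    return US.join(fields)


def naive_split(record):
    """Split a record back into fields on the unit separator."""
    return record.split(US)


# An already-serialized record is stored as a single field of an outer record.
inner = naive_join(["a", "b"])
outer = naive_join([inner, "c"])

# The outer record was built from 2 fields, but it splits into 3,
# because the inner delimiter is indistinguishable from the outer one.
fields = naive_split(outer)
print(len(fields))
```

The same thing happens with a tab inside a TSV field; picking a rarer delimiter only makes the collision less frequent.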



The title says ASCII Delimited Text not ASCII Delimited Binary Data.

For the purposes of CSV, I consider text to be anything that satisfies the regex ^\P{Cc}+$ (https://www.compart.com/en/unicode/category/Cc), and I normally strip chars in that category before saving single-line text. To strip all control chars except the newline, replace every match of [\p{Cc}&&[^\n]]+ with the empty string (that's Java-style class intersection, and without the ^ and $ anchors, since an anchored pattern would only match strings made up entirely of control chars).
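For anyone not on a regex engine with \p{Cc} support (Python's built-in `re` module does not have it), here is a rough equivalent using `unicodedata`. It is only a sketch: the function names `strip_control` and `is_plain_text` are invented for this example, and the `keep` parameter is an assumption about how you might want to whitelist the newline.

```python
import unicodedata


def strip_control(text, keep="\n"):
    """Drop every character in Unicode category Cc, except those listed in keep."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


def is_plain_text(text):
    """Rough equivalent of ^\\P{Cc}+$: non-empty and free of Cc characters."""
    return len(text) > 0 and all(unicodedata.category(ch) != "Cc" for ch in text)


# Multi-line text: the newline survives, the tab and unit separator do not.
print(repr(strip_control("a\tb\nc\x1f")))

# Single-line text: pass keep="" so the newline is stripped as well.
print(repr(strip_control("a\nb", keep="")))
```

Note that tab, newline and carriage return are all in category Cc, so `is_plain_text` rejects any string containing them.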


Thanks for that. Handy. I think I’ll have use for that myself.


You can disallow those metacharacters in the data proper. Then you have a format that can store any UTF-8 or whatever, except the non-whitespace control codes, without any escaping. That solves the problem in an opinionated way, just like how JSON is opinionated (UTF-8 only).

You can convert to another format if you need something crazier than rows and columns consisting of normal text.
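A minimal sketch of what such a format could look like, assuming the unit separator (0x1F) between fields and the record separator (0x1E) between rows. The names `dumps`, `loads` and `check_field` are hypothetical, and treating tab, newline and carriage return as the allowed "whitespace" control codes is my own assumption.

```python
import unicodedata

US = "\x1f"  # unit separator, assumed to sit between fields
RS = "\x1e"  # record separator, assumed to sit between rows
ALLOWED_CONTROLS = {"\t", "\n", "\r"}  # assumption: the whitespace control codes


def check_field(field):
    """Reject any control character that is not explicitly allowed."""
    for ch in field:
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS:
            raise ValueError(f"control character {ch!r} is not allowed in data")
    return field


def dumps(rows):
    """Serialize rows of text fields; no escaping is needed or performed."""
    return RS.join(US.join(check_field(f) for f in row) for row in rows)


def loads(text):
    """Parse the serialized form back into rows of fields."""
    return [record.split(US) for record in text.split(RS)]


rows = [["name", "note"], ["Ada", "line one\nline two\twith a tab"]]
assert loads(dumps(rows)) == rows  # tabs and newlines survive untouched
```

Because the delimiters are rejected on the way in, the parser is just two splits. The cost is that data containing 0x1E or 0x1F (binary, or a nested copy of the same format) has to go through some other format.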


That already exists -- TSV. You disallow the tab metacharacter.


Then I don't understand. Like the sibling comment said, the really "problematic" character in TSV is the line feed. But a tab can occur as well.

The format that I described does not already exist in the form of TSV. And further, based on your original comment, I would have thought that both TSV and this format would be discarded as not useful.


And the newline character. Both of which commonly occur in normal text.



