My wife is a Wend, and so we visited this area in Texas. One of the things I found interesting was that there was a local paper that printed articles in German, English, and Wendish (Sorbian) – there’s a link in the Wiki article. The church we visited was so beautiful.
But you always need to have the code for escaping and you always need to check for it. I don't see how an implementation can be simpler without dropping support for tabs within columns, which would make it non-conformant to the spec.
Because, while you always _should_ implement the proper escaping, that takes work. Not a large amount of work, but more than zero. In many cases your data doesn't contain commas or tabs, so you can do it the super simple way and get back that time. There are more cases where data is tabless than commaless, so using TSV affords you more opportunities to take this quick-and-dirty shortcut when you need a fast solution.
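Concretely, the "super simple way" is just splitting on the delimiter. A rough sketch in Python (the file name is made up), which only holds up as long as no field contains a tab or newline:

```python
# Quick-and-dirty TSV read: split on tabs, no escaping handled.
# Only safe when you know no field can contain a tab or newline.
rows = []
with open("export.tsv") as f:  # hypothetical file name
    for line in f:
        rows.append(line.rstrip("\n").split("\t"))
```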
Ah, so what you actually mean is more performant (for a subset of uses), not simpler?
So if I have a TSV and a CSV containing either pure numbers or complex data (say the contents of each file in a codebase, where each row likely contains both commas and tabs), they would be equivalent in performance, right?
If I have a TSV and a CSV containing natural written language, TSV might be more performant, since there are likely many more commas than tabs (I'm guessing this is your point?).
Regardless of the input data the encoding/decoding code would be equally simple (since they need to account for the same edge cases), right?
Sorry, I probably chose my words incorrectly. I meant "work" and "time" in relation to human work to produce the parser.
With CSV, it's more likely you'll encounter data where you need to implement the escaping. With TSV, you can get away with the simple parser for much longer, as it's comparatively rare to find data that contains tabs.
If you propose to use a TSV parser that does not handle escaping then that sounds very unsafe to me. Do you also want to skip checking for escaped newlines? Or escaped backslashes?
What you are proposing is not using TSV, but a format that completely bans newlines or tabs in any data. There are certainly uses for such a format but to make it non-risky to use you'd need strict validation on the input to the encoder and make it very clear that it is not TSV, since it does not follow the rules of TSV encoding/decoding and will not produce the same data as a proper TSV implementation.
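For reference, the proper escaping isn't a lot of code either. Here's a rough sketch of one common convention (backslash escapes, similar to what MySQL's tab-separated dumps use; this is an illustration, not the only possible scheme):

```python
# One backslash-escaping convention for TSV fields. Order matters when
# encoding: escape backslashes first, or the other escapes get mangled.
def escape_field(field: str) -> str:
    return (field.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n"))

def unescape_field(field: str) -> str:
    out, i = [], 0
    while i < len(field):
        if field[i] == "\\" and i + 1 < len(field):
            # Map escape sequences back; unknown escapes pass through.
            out.append({"t": "\t", "n": "\n", "\\": "\\"}.get(field[i + 1], field[i + 1]))
            i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)

original = "a\tb\\c"
assert unescape_field(escape_field(original)) == original
```

The decoder has to undo exactly what the encoder did; implement only one half and any data containing tabs silently corrupts.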
We're coming at this from different angles. I completely agree with you that the proper way to read these files is using a fully standards-compliant parser. You make the distinction that a parser that can't handle tabs in the data doesn't technically parse "TSV", but rather a subset of TSV-like files with limitations - sure, that makes sense.
What I'm trying to get at is that there are situations in which implementing such a limited parser is justifiable (and for the main discussion in this thread, TSV makes this more commonly achievable than CSV).
With the luxury of time, all our parsers would handle delimiter escaping, Unicode, control characters, byte order marks, etc., perfectly, and truly parse "TSV" and "CSV". Personally, I work on-call in SRE - if something is broken, we need solutions NOW. If I have a CSV of stuff, I am not going to implement a proper parser; I don't even have time to boot up a programming language with a CSV library. I am going to split by comma in the terminal of whatever box I'm logged into to get what I need. Most of the time it'll work, and to the discussion in the thread, TSV makes it more likely to work because it's less likely for the delimiter to be in the data. Less likely to need those 5-6 extra characters of regex lookbehind.
My main point: as a consumer of these files, I prefer it when people send me TSVs rather than CSVs, because I am more likely to be able to use a simple not-really-TSV/CSV parser to read them. Sometimes the data's really messy and I need a real parser, but TSV makes this less likely.
> My main point: as a consumer of these files, I prefer it when people send me TSVs rather than CSVs, because I am more likely to be able to use a simple not-really-TSV/CSV parser to read them
My point is that you are not really talking about CSV/TSV, since your parser does not handle CSV/TSV. You are using a custom data format. Which is fine and perfectly reasonable, and it's probably specified to avoid all those issues.
But it is not CSV or TSV. When you say "a simple not-really-TSV/CSV parser to read them", you mean you are not using CSV or TSV. That's fine for non-CSV and non-TSV usage. Just be clear about what format you are actually using and specify it. It clearly isn't TSV or CSV.
Thanks for the explanation. Ah, I think I see where our difference is.
A website produces a file with a normal CSV exporter. This is a fully standards-compliant, proper CSV. I call this a CSV. They provide this file for download, and I download it unchanged. By this point, I still call the file a CSV.
Next, I parse the CSV file with my non-CSV parser. Here's our point of contention: I still think the original file is a CSV; I have operated upon it with a non-CSV parser, but for my way of thinking, the file itself is still a CSV. You disagree here, because in order for my use of the parser to be correct, I can't possibly have operated upon a CSV file, I must have operated on a CSV-like file.
I was thinking from the perspective of the file itself and where it came from, so using an incorrect parser doesn't change it. You were thinking in terms of the grammar accepted by the parser I'm using - assuming the parser is appropriate, it's impossible for me to be reading a CSV, it must be something else CSV-like.
I think we are both right, and I think we both understand where the other is coming from.
> I have operated upon it with a non-CSV parser, but for my way of thinking, the file itself is still a CSV. You disagree here, because in order for my use of the parser to be correct, I can't possibly have operated upon a CSV file, I must have operated on a CSV-like file.
Not quite my opinion. The file is still a CSV file, but IMO the parser is not a CSV parser unless it supports the full spec. The file is still CSV, and it happens to be compatible with the incomplete parser because it does not use any "harder" CSV features.
Let's say we have a website that uses UTF-8 (declared via Content-Type headers and the like). Some pages on this website use only ASCII, some use higher code points within UTF-8.
I can parse some of these pages with an ASCII decoder, but that does not mean that my ASCII decoder is a UTF-8 decoder, since it only handles the small subset of UTF-8 that aligns with ASCII. In this example your CSV-lite would be like ASCII, and CSV would be UTF-8.
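To put the analogy in code (Python here, purely for illustration):

```python
# Pure-ASCII bytes decode the same under either decoder...
assert b"hello".decode("ascii") == b"hello".decode("utf-8") == "hello"

# ...but the ASCII decoder fails as soon as a higher code point appears.
data = "café".encode("utf-8")  # b'caf\xc3\xa9'
print(data.decode("utf-8"))    # café
try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    print("ASCII decoder rejects it:", err)
```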
I completely understand the concept, I really do. I'm just struggling to work out where the original disagreement came from; I think it's completely my fault for not articulating myself properly, thank you for your patience. I'm going to annotate my original comment here with clarification on what I originally meant:
Because, while you always _should_ implement the proper escaping [in order to extract the information you need from CSV/TSV files that you have received from an external source that produces correctly-formatted CSV/TSV files], that takes [human effort]. Not a large amount of [human effort], but more than zero. In many cases [the data stored in CSV/TSV representation] doesn't contain commas or tabs, so you can [extract the data from the file] the super simple way [by implementing a naive CSV/TSV-like parser that just happens to work for a subset of CSV files that don't contain escaping] and get back that time. [In doing this, you have extracted the information you need from the file, but you have not done the work to implement a real CSV/TSV parser. You have implemented a parser for a mystery format, misused it on a CSV/TSV file, but it happened to work and you got the data you needed]. There are more cases where data [in the CSV/TSV file you got from an external source] is tabless than commaless, so [if the external source happens to provide you a TSV file instead of a CSV file, this] affords you more opportunities to [be able to misuse your TSV-like parser on the TSV file and still get the data you need, giving you a] quick-and-dirty shortcut when you need [an immediate] solution [where you lack the time to get a real CSV/TSV parser and can tolerate the inherent lack of safety in using a CSV/TSV-like parser on a CSV/TSV].
> I think it's completely my fault for not articulating myself properly, thank you for your patience.
Not at all, and thank you for your patience and engaging this deep!
I think the disagreement came from different people. I initially tried to question why /u/guidedlight thought a certain delimiter would be easier/simpler just because it was less common and then we went down a rabbit hole of "what is a CSV/TSV".
I agree with you and I think there are many use-cases for not-as-full CSV/TSV parsers/encoders. My main objection was calling a CSV/TSV implementation a CSV/TSV implementation when it clearly skipped a lot of parts (while it can of course parse a lot of files without those parts). I'd like to call those simpler formats something other than CSV/TSV, but that ship has sailed.
It is simply easier to take basic written human text and put it into a TSV than a CSV, as humans use commas far more than tabs. You can replace a tab with spaces and keep things legible, but replacing a comma in a sentence can literally change the meaning.
Maybe you could wrap everything in quotation marks, but that is ugly.
It's not easier or simpler, since you need the exact same checks and steps, just with a different delimiter. It doesn't matter if you need to do them fewer times for certain inputs, since you need the same checks, the same encoding/decoding steps, and so on.
Do you think you would have an easier time writing a TSV parser than a CSV one? If so, why?
And wrapping in quotes does not solve anything since now you need to both check for escaped quotes and tabs/commas. It's the same but one level deeper.
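To illustrate with Python's stdlib csv module (the record is made up): a naive split breaks on the quoted comma, while a quote-aware parser also has to track the doubled-quote escape to get it right.

```python
import csv, io

# Made-up record: quoting hides a comma, and "" is an escaped quote.
line = '"Smith, John","said ""hi""",42\r\n'

print(line.split(","))
# ['"Smith', ' John"', '"said ""hi"""', '42\r\n']  <- wrong boundaries

print(next(csv.reader(io.StringIO(line))))
# ['Smith, John', 'said "hi"', '42']  <- quote-aware parse
```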
I think what they're saying is that with some minor control over the data in your dataset, you don't need to care about escaping _in your parser_ at all. The same might be said of CSV but I would argue that in the majority of situations tabs are less semantically meaningful than commas and newlines, so it is generally fine just to strip them out.
Obviously this is not a robust solution, but in the cases I've seen, it works adequately. If one were to be doing it "the right way", then I agree with you wholeheartedly.
I get what you are saying, but my point is that that is not CSV or TSV. It's a homemade format with its own rules that just happens to be inspired by TSV or CSV.
Sure. Another option is semicolon-delimited files, which are also in use in Europe, and Pandas handles that fine too.
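For example, something like this (the file name and columns are invented):

```python
import pandas as pd

# Hypothetical European-style export: ';' as the field delimiter
# and ',' as the decimal mark (e.g. "3,14" meaning 3.14).
df = pd.read_csv("measurements.csv", sep=";", decimal=",")
```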
I was responding to your comment that you had never seen commas in data fields in CSV files, and wanted to point out that this is a quite common issue in Europe.
(It also often wreaks havoc with Excel files, btw, as in some locales Excel will then cast strings to decimal numbers when a file is opened...)
> I have reported all of my experience to the ScalaCenter in 2019. I was hoping to see concrete actions, such as building a reporting mechanism, to protect minorities in the community. Unfortunately, I am not aware of such actions taken.
That sounds like a community failure, not just One Bad Guy.
We're looking for a great DevOps engineer. You'd be a crucial part of our development team. If you're interested, or just want to know more, send me an email at will@helloreverb.com.
Reverb Technologies is the company behind Wordnik.com; Reverb for Publishers; Swagger, Atmosphere, and Scalatra; and other specialness soon to be revealed.
Reverb is looking for a senior, hands-on developer capable of interfacing with the Amazon EC2 API and others, who would be responsible for building internal tools to manage our software infrastructure. This would include both back-end workflow as well as user-interface components.
Duties and Responsibilities
- Experience with Cloud deployment tools/scripts for AWS Cloud Services
- Nuts-and-bolts understanding of RHEL, CentOS, performance tuning, monitoring
- Puppet or Chef automated deployment tools
- Nagios, Cacti, other monitoring and alerting systems
- Apache/Nginx support, configuration
- MySQL, Java application deployment
- Strong knowledge of TCP/IP, UDP
- Experience managing high-availability systems
- User interface development using JavaScript (Play-Scalate a bonus)
- Low-level operating system exposure for application and infrastructure deployment
- Experience with both performance and error monitoring & alerting systems
- Track record in a high-uptime, high-traffic application infrastructure
Reverb Technologies (aka Wordnik) has a number of good positions available including: iPad Visual Designer, iPad Interaction Designer, Full-Time Web and Mobile Designer, iOS Developer, Frontend Hacker, Server Engineer, Machine Learning Expert, Computational Linguist, and Analytics and Data-Mining Expert. Job descriptions are at http://www.helloreverb.com/jobs/ (where you'll also see a bit about what it's like to work here).
Feel free to contact me at will@helloreverb.com if you want to apply or have questions — and check out what we're building at http://helloreverb.com (as well as http://wordnik.com, which just got a bit of a refresh).