
Great collection of tools and articles. Plain text is also a great (I'd say the best) choice for datasets. Do NOT use JSON, YAML and friends as your input format; instead, use plain text with a custom parser to import into any SQL database, and then you can easily export to JSON, YAML and friends. See football.db and the Premier League or World Cup match schedules as living examples [1].

[1]: https://github.com/openfootball





If your dataset is mostly a list of strings, sure. If it's anything more structured, why exactly?

I'd argue that using "plaintext" for structured data (a.k.a. inventing your own data representation) will set up both you and the users of your dataset for unnecessary pain dealing with unescaping and parsing.


> If it's anything more structured, why exactly?

It is way easier to input / type and change / update. And compared to, let's say, JSON, YAML or friends, it is at least 5x more compact (less typing is better). See the World Cup all-in-one page schedule in .txt [1] and compare it to versions in JSON, XML and friends, which are page-long dumps.

[1]: https://github.com/openfootball/world-cup/blob/master/2018--...


One advantage I can think of is that it doesn't need to be parsed into a node-based tree structure like JSON. It's a lot easier to stream parts of it at a time.
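A minimal sketch of the streaming point, assuming a hypothetical line-oriented file in which every non-blank, non-comment line is one self-contained record. The function name `stream_records`, the `#` comment convention and the sample lines are illustrative assumptions, not the openfootball format:

```python
import io
from typing import Iterable, Iterator


def stream_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one record per non-blank, non-comment line.

    Only the current line is held in memory, so the input can be an
    open file or any other iterable of lines of arbitrary length.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment marker
            continue
        yield line


# Hypothetical sample input; an open file object would work the same way.
sample = io.StringIO(
    "# hypothetical sample\n"
    "Russia 5-0 Saudi Arabia\n"
    "\n"
    "Egypt 0-1 Uruguay\n"
)
first = next(stream_records(sample))
```

The first record is available before the rest of the input has been read, which is what a tree-building JSON parser typically cannot offer without a dedicated streaming API.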

If the dataset is "more structured", you can try to simplify this structure for great gains. As a byproduct, you get to use text files for the data.

Could you give an example?

See the World Cup match schedule above [1]. For other examples with a geo tree (e.g. country/province/city/district), see the Belgian football clubs [2], or for yet another example the football leagues [3] with tiers (1, 2, 3, etc.), cups, supercups, playoffs, etc. The .txt versions are pretty compact with "tight" / strict error checking, and with JSON, YAML and friends I'd say it would be 2x, 3x or even more effort / typing.

[1]: https://github.com/openfootball/world-cup/blob/master/2018--...
[2]: https://github.com/openfootball/clubs/blob/master/europe/bel...
[3]: https://github.com/openfootball/leagues/blob/master/europe/l...

I see what you mean. I agree, for a human editor with domain knowledge, those files are easier to read and maintain than JSON. However, it's definitely nontrivial to parse as a machine-readable format. If other projects are supposed to consume the .txt files directly (i.e. not going through the command-line utility), you should at least provide an EBNF grammar.

Example: I assume the scorer lists are actually lists of lists, where equivalent JSON could look like this:

  [
    {"player":"Gazinsky", "goals":[{"minute":12}]},
    {"player":"Cheryshev", "goals":[{"minute":43}, {"minute":90, "overtime":1}]},
    ...
  ]

... which is absolutely more verbose.

However, if someone just went by the data, they could get the parsing wrong: it looks like the outer list (of players) is delimited by spaces, but there are also spaces inside the player names. A better approach could be to split the list on ' signs, as each player has at least one time. However, players can have more than one time and could probably also have apostrophes inside their names (e.g. Irish players). So I guess the best delimiter would be a letter after an apostrophe after a number. Except we might also have parentheses, etc.
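To make the ambiguity concrete, here is a rough sketch of a scorer-list parser. It assumes a hypothetical input shape of `Name 12' Name 43', 90+1'` (a name, then one or more minute tokens ending in an apostrophe) and emits the JSON-like structure guessed above. The regex, the `parse_scorers` helper and the sample line are assumptions for illustration, not the project's actual grammar, and parenthesised annotations are deliberately not handled:

```python
import re

# One minute token such as 12' or 90+1' (assumed notation).
MINUTE = r"\d+(?:\+\d+)?'"
# A scorer: a name (may contain spaces or apostrophes), whitespace,
# then one or more minute tokens separated by an optional comma.
SCORER = re.compile(
    rf"\s*(?P<name>\S.*?)\s+(?P<minutes>{MINUTE}(?:,?\s*{MINUTE})*)"
)


def parse_scorers(text: str) -> list[dict]:
    """Split a scorer line into [{"player": ..., "goals": [...]}, ...]."""
    scorers = []
    for match in SCORER.finditer(text):
        goals = []
        for minute, extra in re.findall(r"(\d+)(?:\+(\d+))?'", match.group("minutes")):
            goal = {"minute": int(minute)}
            if extra:  # stoppage-time part, e.g. the 1 in 90+1'
                goal["overtime"] = int(extra)
            goals.append(goal)
        scorers.append({"player": match.group("name"), "goals": goals})
    return scorers


# Hypothetical line, including a name with an apostrophe.
line = "Gazinsky 12' Cheryshev 43', 90+1' O'Shea 77'"
result = parse_scorers(line)
```

The name is matched lazily up to the first whitespace that is followed by a minute token, so an apostrophe inside a name is not mistaken for a time marker. A name that itself ends in digits would still break this, which is the kind of case a published grammar would settle.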


I'm confused about what plain text means if JSON and YAML don't qualify. They are non-binary and non-proprietary. Is CSV plain text? And the example URL of openfootball has data files with fixed column positions and square brackets. Looks like you're packing semantics implicitly into the parser rather than leaving it explicit. I don't see why that's an argument in favor of plain text.

JSON and YAML qualify as plain text, for sure. Plain text is a spectrum, let's say from "free form" English text like your comment to more machine-oriented structured formats like JSON and YAML. YAML, for example, tries to be a more human plain text format than JSON, e.g. it supports keys without enclosing quotes, comments, variants, many shortcuts and much more. JSON is pretty "inhuman" if you start hand-editing from scratch and is NOT recommended for that; see Awesome JSON Next for "Why JSON is NOT a good / great configuration format" or "How to fix JSON" and so on - https://github.com/json-next/awesome-json-next

The football data looks easy for a human to read but a pain in the arse for a program to consume. Personally I think it's terrible, and the fact that they have had to develop a custom 'sportsdb' tool to manage it rather than using something generic like 'jq' is telling.

https://github.com/openfootball/england/blob/master/2019-20/...

To properly parse this file you need to write a parser that cuts fixed-width fields (wait, are the team names padded to fixed width, or is the '-', which doubles as part of the result, a delimiter?), trims strings, knows that "Aug/24" is August 24th, deals with the months wrapping over the two years, is sensitive to indentation, and understands that "Matchday [0-9]+" and the []-bracketed dates are section delimiters. And what about that first line beginning with '=', comments? Where is the formal grammar for this format?
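As an illustration of just one of those steps, here is a sketch of resolving a token like "Aug/24" to a full date in a season that spans two calendar years. It assumes three-letter English month names and that months from July onward belong to the season's first year; both the cut-off and the `parse_match_date` helper are assumptions, not something the format documents:

```python
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def parse_match_date(token: str, season_start_year: int, first_month: int = 7) -> date:
    """Turn "Aug/24" into a date, wrapping early months into the next year.

    first_month is the assumed first month of the season (July by default).
    """
    month_name, day = token.split("/")
    month = MONTHS[month_name]
    year = season_start_year if month >= first_month else season_start_year + 1
    return date(year, month, int(day))


opening = parse_match_date("Aug/24", 2019)   # falls in the first year
new_year = parse_match_date("Jan/1", 2019)   # wraps into the second year
```

An explicit ISO date in the data would make this whole function unnecessary, which is the trade-off being argued about here.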

A CSV of "matchday,fulldate,team1,team2,result" would be just as easy to read, much easier to parse, and probably smaller in size.
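For comparison, a sketch of reading that suggested column layout with Python's standard `csv` module. The rows use placeholder team names and made-up scores, and the `result` column is assumed to look like `4-1`:

```python
import csv
import io

# Placeholder data in the suggested column layout (not real fixtures).
SAMPLE = """matchday,fulldate,team1,team2,result
1,2019-08-09,Team A,Team B,4-1
1,2019-08-10,Team C,Team D,0-5
"""


def read_matches(text: str) -> list[dict]:
    """Parse the CSV text into typed match records."""
    matches = []
    for row in csv.DictReader(io.StringIO(text)):
        goals1, goals2 = (int(g) for g in row["result"].split("-"))
        matches.append({
            "matchday": int(row["matchday"]),
            "date": row["fulldate"],
            "team1": row["team1"],
            "team2": row["team2"],
            "score": (goals1, goals2),
        })
    return matches


matches = read_matches(SAMPLE)
```

The only format-specific logic left is splitting the score; quoting, delimiters and headers are handled by a generic, well-tested reader.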


Good point. See the football.csv project :-) @ https://github.com/footballcsv Here's, for example, the Premier League 2019/20 match schedule - https://github.com/footballcsv/england/blob/master/2010s/201...

The point is, as you say, that the .csv format is easy to read / parse / write with a script for automation, BUT it's way harder to input / type from scratch and to keep up-to-date. That's why you need both types of format (one for hand-writing and one for easy auto-generation).


> rather than using something generic like 'jq' is telling.

The best generic tool for managing (structured) data is SQL. Once you have the datasets imported (via the custom readers / loaders) it's just plain SQL (and works with SQLite, PostgreSQL, MySQL, etc.)
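A small sketch of that workflow using Python's built-in `sqlite3` module: load already-parsed rows, query them with plain SQL, and export to JSON. The `matches` table, its columns and the placeholder rows are assumptions for illustration and not the football.db schema:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches ("
    "matchday INTEGER, team1 TEXT, team2 TEXT, score1 INTEGER, score2 INTEGER)"
)

# Rows as a custom reader / loader might produce them (placeholder data).
rows = [
    (1, "Team A", "Team B", 4, 1),
    (1, "Team C", "Team D", 0, 5),
    (2, "Team B", "Team C", 2, 2),
]
conn.executemany("INSERT INTO matches VALUES (?, ?, ?, ?, ?)", rows)

# Plain SQL from here on: total goals per matchday.
totals = conn.execute(
    "SELECT matchday, SUM(score1 + score2) FROM matches "
    "GROUP BY matchday ORDER BY matchday"
).fetchall()

# Export to JSON by reading rows back as name -> value mappings.
conn.row_factory = sqlite3.Row
exported = json.dumps(
    [dict(row) for row in conn.execute("SELECT * FROM matches ORDER BY rowid")]
)
```

The same statements should run largely unchanged against PostgreSQL or MySQL through their respective drivers, apart from the connection setup and parameter placeholder style.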


For large data repositories, especially public/open datasets, a major concern is versioning. While it is not impossible to render a nice diff between two SQLite files, it's not as ingrained in our everyday tooling (e.g. GitHub) as plain-text diffs.

For small to medium-sized datasets, a nice middle ground would be SQL dumps. Put the dumps in Git for versioning and diffing, and load them into $DATABASE for actual queries.
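A sketch of the dump-and-reload round trip with SQLite, using `Connection.iterdump()` and `Connection.executescript()` from Python's standard library. The `clubs` table and its single row are placeholders, and the helper names are made up for this example:

```python
import sqlite3


def dump_sql(conn: sqlite3.Connection) -> str:
    """Return the whole database as a text SQL script, suitable for Git."""
    return "\n".join(conn.iterdump()) + "\n"


def load_sql(dump: str) -> sqlite3.Connection:
    """Rebuild an in-memory database from a text SQL dump."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(dump)
    return conn


# Placeholder database standing in for the real dataset.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE clubs (name TEXT, city TEXT)")
source.execute("INSERT INTO clubs VALUES ('Club A', 'City X')")
source.commit()

dump = dump_sql(source)      # this text is what would be committed
restored = load_sql(dump)    # and this is what queries would run against
rows = restored.execute("SELECT name, city FROM clubs").fetchall()
```

Because the dump is one statement per line, adding or changing a row shows up as a one-line diff in ordinary Git tooling.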


As great as such formats are for human consumption, they should come with reference specifications and parser implementations to be usable.

You might like the Comma-Separated Values (CSV) Format Specifications (and Tests) org @ https://github.com/csvspecs Trying to improve the world's most popular plain text format (and - surprise, surprise - nobody cares).

Plain text files that don't have a parser/schema also leave room for later breakage when somebody wants to add a field, make something longer, put a comment or note in, etc.


