Hacker News new | past | comments | ask | show | jobs | submit login

The football data looks easy for a human to read but a pain in the arse for a program to consume. Personally I think it's terrible, and the fact that they have had to develop a custom 'sportsdb' tool to manage it rather than using something generic like 'jq' is telling.

https://github.com/openfootball/england/blob/master/2019-20/...

To properly parse this file you need to write a parser that cut fixed-width fields (wait, are the team names padded to fixed-width or is the '-', which doubles as part of the result, a delimiter?), trim strings, knows that "Aug/24" is August 24th, deals with wrapping of the months over the two years, is sensitive to indentation, and understands that "Matchday [0-9]+" and the [] bracketed dates are section delimiters. And what about that first line beginning with '=', comments? Where is the formal grammar for this format?

CSV of "matchday,fulldate,team1,team2,result" would be just as easy to read, much easier to parse, and probably smaller in size






Good point. See the football.csv project :-) @ https://github.com/footballcsv Here's, for example, the Premier League 2019/20 match schedule example - https://github.com/footballcsv/england/blob/master/2010s/201...

The point is as you say - the .csv format is easy to read / parse / write with a script for automation BUT it's way harder to start from scratch to input / type and keep it up-to-date. That's why you need both type of formats (one for hand-writing and one for easy auto-generation).


> rather than using something generic like 'jq' is telling.

The best generic tool for managing (structured) data is SQL. Once you have the datasets imported (via the custom readers / loaders) it's just plain SQL (and works with SQLite, PostgreSQL, MySQL, etc.)


For large data repositories, especially public/open datasets, a major concern is versioning. While it is not impossible to render a nice diff between two SQLite files, it's not as ingrained in our everyday tooling (e.g. GitHub) as plain-text diffs.

For small to medium-sized datasets, a nice middleground would be SQL dumps. Put the dumps in Git for versioning and diffing, and load them into $DATABASE for actual queries.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: