
How to Correct 32,000 Incorrect CSV Files in Fewer Than 32,000 Steps - jeffkeen
https://medium.com/p/a5f1ba25d951
======
randyzwitch
How is this fewer than 32,000 steps? Doesn't the act of running this gem on
every file to clean it count as a step?

~~~
dredmorbius
The point is to have a single parsing rule, rather than individually assessing
and parsing each of the 32,000 files.

Having worked with messy data myself in the past (a badly b0rken archive of
about 125,000 Web archive posts), it's possible to fix issues in a reasonable
amount of time (a few hours' work) using available tools (for me: the tidy
HTML validator and a bunch of one-off sed/awk scripts), working on each class
of error in turn.
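A minimal sketch of that "one script per error class" approach (the file names and the error class here are hypothetical, not from the actual archive):

```shell
# Hypothetical error class: bare '&' characters that should be HTML
# entities. One sed pass fixes every affected file at once, rather
# than hand-editing each one.
mkdir -p archive fixed
printf 'Tom & Jerry &amp; friends\n' > archive/post1.html

for f in archive/*.html; do
  # Replace '& ' (ampersand followed by a space, so never an entity)
  # with '&amp; '; already-correct entities are left alone.
  sed 's/&\( \)/\&amp;\1/g' "$f" > "fixed/$(basename "$f")"
done

cat fixed/post1.html
```

Each distinct class of error gets its own small script like this, and the whole corpus is reprocessed in one pass per class.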

Since the corpus itself was generated programmatically, there was (generally)
a limited set of issues, each representing some bug or another in the original
code to address.

The article here is referring to human-input records, which tend to be far
more creative in how they deviate from spec or expectation, though in a more
constrained space (CSV rather than longer HTML documents).

