
Sidehelm: a pipeline to validate, test, and pull CSV data - iamwil
http://www.sidehelm.com/
======
victorNicollet
It really is not that hard or time-consuming to write a CSV sanitization tool,
one that actually manages to recover data from many kinds of bad (but
consistent) formatting. I'd say my team spent around 6 man-weeks on it.

The real cost is that the fight is never over. Whenever a new customer comes
in with a CSV so badly formatted that it is rejected by your system, further
analysis reveals that there's actually some sanity behind the madness. That
you could in fact add new rules to detect and correct the situation.

We've seen so many horrors. European dates? Try a file with three different
date formats over six date columns. Quotes that should be parsed in some
columns but kept verbatim in others. Columns that contain raw binary data,
without any escaping (but you can still infer the boundaries from context).
UTF-16 with a byte skipped in the middle. Tab-separated files where the first
line is space-separated instead. Cells that are 8MB of white space.
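The mixed-date-format case is often recoverable with a try-each-format-in-order approach; a minimal Python sketch (the format list and its ordering are my own assumptions, not the parent's actual rules):

```python
from datetime import datetime

# Formats seen in the wild, tried in order. The ordering is a policy
# decision: an ambiguous value like "01/02/2020" resolves to whichever
# format matches first (here, day-first).
DATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d", "%d.%m.%Y"]

def parse_messy_date(cell: str):
    """Return the first successful parse, or None if every format fails."""
    cell = cell.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cell, fmt)
        except ValueError:
            continue
    return None
```

In practice you'd want to infer one format per column (sampling many rows) rather than per cell, so a column doesn't silently mix interpretations.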

There is no end to the... creativity of CSV file authors.

~~~
kafkaesq
_There is no end to the... creativity of CSV file authors._

Indeed, it never ceases to amaze. That's why it's so hard to imagine a
"general" solution to this problem. You can think of something to handle the
90% most likely business use cases, maybe... but not the fat tail.

------
grenoire
How trustworthy is Sidehelm in its handling of the data? Do they ensure that
the pipelined data does not come into contact with human eyes and is not
copied?

I cannot really think of many services off the top of my head that would be so
willing to give significant chunks of data to another pipelining service.

~~~
iamwil
Sidehelm operates the same way many ETL or logging services do. Everything it
does is completely automated, and any data retained is for the end user to
inspect, not us. If a domain-specific data error occurs, we notify you so you
can fix it and replay the data through the pipeline.

------
michalskop
Related: for working with CSV data, there is an interesting standardization
project by Open Knowledge: [http://frictionlessdata.io/guides/tabular-data-package/](http://frictionlessdata.io/guides/tabular-data-package/)

It already includes some tools for working with and displaying CSV data.
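The Tabular Data Package linked above centers on a JSON descriptor placed next to the CSV; a minimal sketch of what one looks like (the file name and fields here are illustrative, not taken from the spec's examples):

```json
{
  "name": "orders",
  "resources": [
    {
      "path": "orders.csv",
      "schema": {
        "fields": [
          {"name": "order_id", "type": "integer"},
          {"name": "placed_on", "type": "date"},
          {"name": "total", "type": "number"}
        ]
      }
    }
  ]
}
```

Tooling that understands the descriptor can then validate the CSV against the declared types instead of guessing.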

------
notfed
Please tell me this is not a third party web service?

~~~
iamwil
What is your concern about a 3rd party web service here?

~~~
kafkaesq
Bad "architecture smell", generally (except for niche cases).

Basically you don't want to rely on something that (1) requires a whole new
service / monitoring later, (2) intrinsically exposes your data to untrusted
sources and (3) can of course go "poof!" at any time -- unless you absolutely
have to, or it's for something mostly ancillary to business and your data
pipeline. (Or you're fairly small and just don't have a lot of people).

Especially when it's just CSV parsing, after all.

------
iforgotmypass
Dear HN users,

Could you please link to alternative solutions?

Does AWS Glue solve the same problems?
[https://aws.amazon.com/glue/](https://aws.amazon.com/glue/)

~~~
ZenoArrow
I could do with those suggestions too, we have a strong need for a CSV
validation tool.

After a brief search, this seems promising:

[https://theodi.org/blog/introducing-csvlint](https://theodi.org/blog/introducing-csvlint)

~~~
iamwil
What do you do for work that you have a strong need for a CSV validation tool?

~~~
ZenoArrow
I'm a software tester, but the company I work for gets a lot of third-party
data from CSV files, which frequently do not match the requirements set out
for those files. Using a CSV validation tool (ideally a web service controlled
by the company I work for) would allow our clients to validate the files
before they send them to us.

I'm slowly putting together my own solution for this, but if suitable open-
source software already exists I'd be happy to make use of or modify that.
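The core of such a pre-submission check can be quite small; a Python sketch, where `EXPECTED_HEADER` is a hypothetical stand-in for the real file spec:

```python
import csv
import io

# Hypothetical expected layout; a real tool would load this from a spec.
EXPECTED_HEADER = ["id", "name", "amount"]

def validate_csv(text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file passed."""
    errors = []
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    if rows[0] != EXPECTED_HEADER:
        errors.append(f"header {rows[0]!r} != expected {EXPECTED_HEADER!r}")
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED_HEADER):
            errors.append(
                f"line {i}: expected {len(EXPECTED_HEADER)} fields, got {len(row)}"
            )
    return errors
```

Per-column type checks (dates, numbers, enums) bolt on naturally once the header and row widths are known to be sane.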

~~~
iamwil
I'm guessing you'd want the validation tool to be controlled by the company
you work for, because of the sensitivity of the data being processed?

What type of data/industry is it? It'd help us to know where to look for
people that might need us.

------
automatwon
The 'Parse', 'Transform', 'Validate' icons look a lot like Google Cloud
Platform Icons.

[https://docs.google.com/presentation/d/1vjm5YdmOH5LrubFhHf1v...](https://docs.google.com/presentation/d/1vjm5YdmOH5LrubFhHf1vlqW2O9Z2UqdWA8biN3e8K5U/edit#slide=id.p14)

~~~
jimktrains2
I've been looking for the icon set for presentations I've made for a while and
never found this; thanks!

------
triplenineteen
Does this support unicode characters? Other encodings?

~~~
iamwil
Yes. Part of the reason seemingly simple CSVs are a pain is that they're often
malformed, badly formatted, or in odd encodings. We take care of that for you,
so you can fix it or throw it out.
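One common way to cope with odd encodings is a fallback chain; a hedged sketch (this candidate list is an assumption, not Sidehelm's actual logic):

```python
# Candidate encodings, most specific first. "utf-8-sig" also eats a UTF-8
# BOM if present; "latin-1" accepts any byte sequence, so it acts as a
# last-resort catch-all.
CANDIDATE_ENCODINGS = ["utf-8-sig", "utf-16", "cp1252", "latin-1"]

def decode_csv_bytes(raw: bytes) -> tuple[str, str]:
    """Try candidate encodings in order; return (text, encoding_used)."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, UnicodeError):
            continue
    # Unreachable while latin-1 is in the list, but kept as a safety net.
    return raw.decode("latin-1", errors="replace"), "latin-1"
```

This handles whole-file encoding guesses; the parent thread's "UTF-16 with a byte skipped in the middle" would still need per-chunk repair on top.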

------
TAForObvReasons
Is there a functional demo or video demonstration? All I see is a marketing
page ...

------
jimktrains2
non sequitur: I find it interesting that they're using the Google cloud blue
hex style in their icons, but don't seem to support BigQuery or
BigTable/HBase.

