
How to share your data effectively - Atlas10
http://www.caitlinrivers.com/1/post/2013/04/send-me-your-data-pdf-is-fine-said-no-one-ever-how-to-share-your-data-effectively.html
======
onemorepassword
_My preference is .csv, because it can be read by almost any program._

Except the one most used for it: Excel. Excel sucks at CSV, especially if the
recipients have various internationalized versions. It can't deal properly
with multi-line strings and spaces, and the internationalization in particular
is hell (semicolons instead of commas, decimal separators, date formats).

Many times I have dealt with CSV exports that could be read in any software
except the client's version of Excel.
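
For illustration, a minimal sketch of coping with one of those
internationalized exports in Python/pandas - assuming a semicolon-separated
file with comma decimals and a Windows encoding; the filename is made up:

    # Minimal sketch: reading a "CSV" exported from a European Excel,
    # which often uses ';' as the field separator and ',' for decimals.
    # The filename and encoding are assumptions.
    import pandas as pd

    df = pd.read_csv(
        "export_from_excel.csv",
        sep=";",            # semicolon-separated fields
        decimal=",",        # 1.234,56 rather than 1,234.56
        encoding="cp1252",  # a common Windows encoding for such exports
    )
    print(df.dtypes)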

~~~
shrikant
I've always found the C in CSVs to be needlessly limiting. I'm not aware of
any tool that can contextualise commas in addresses and other "free text"
fields that might form part of the data dump/extract.

For this reason, I am _strongly_ in favour of pipe-separated values in such
files. The probability that a data extract has a pipe symbol in any data field
is quite small (in my experience, it's been 0 so far).

The first thing I do in any Windows system I have to use on a regular basis is
to change the system separator to '|' in the Regional Settings.
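
If you generate the files yourself, a quick sketch of writing pipe-separated
values with Python's standard csv module (the rows are made-up sample data):

    # Rough sketch: pipe-separated output via the standard csv module.
    import csv

    rows = [
        ["name", "address", "notes"],
        ["Doe, Jane", "1 Main St, Springfield", "free text, with commas"],
    ]

    with open("export.psv", "w", newline="") as f:
        csv.writer(f, delimiter="|").writerows(rows)

    # Reading it back is symmetric:
    with open("export.psv", newline="") as f:
        for row in csv.reader(f, delimiter="|"):
            print(row)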

~~~
kyllo
Yes, pipe-delimited format is great, and much easier to parse than comma-
delimited.

Tab-delimited is also better than comma-delimited, I think.

Commas just have far too high a probability of appearing within the data
itself, and when you hit a comma that was supposed to be escaped, it adds an
extra column to the row, screwing everything up.
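
(The quoting itself is simple enough - a compliant writer quotes any field
containing the delimiter, as in the small sketch below - the real problem is
the many tools that don't do it.)

    # Small sketch: a quoted embedded comma does not add an extra column.
    import csv, io

    buf = io.StringIO()
    csv.writer(buf).writerow(["id", "1,2 Example Road", "ok"])
    print(buf.getvalue())   # id,"1,2 Example Road",ok

    row = next(csv.reader(io.StringIO(buf.getvalue())))
    print(len(row), row)    # 3 ['id', '1,2 Example Road', 'ok']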

~~~
jleader
Tabs are great, until you try to explain to a non-programmer why your tool
doesn't parse their file correctly, even though in the word-processor they're
using as a text editor it looks identical to another file that parses
correctly.

------
JustARandomGuy
_Use a flexible file format. My preference is .csv, because it can be read by
almost any program. I'll tolerate .xls, but I'm not pleased with .xlsx (not
everyone uses Excel!). And please, please, please do not use pdf._

Why is .xls tolerated but not .xlsx?

XLS and XLSX are both Excel formats, but XLS is a binary blob while XLSX is
just a zip containing xml files. I'd much rather parse xml files (XLSX) than
the binary blob of XLS.
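
Since an .xlsx really is a zip archive of XML parts, you can peek inside with
nothing but the standard library - a rough sketch (the filename is
hypothetical; the part names in the comment are the usual ones):

    # Rough sketch: listing the XML parts inside an .xlsx workbook.
    import zipfile

    with zipfile.ZipFile("workbook.xlsx") as z:
        for name in z.namelist():
            print(name)      # e.g. xl/workbook.xml, xl/worksheets/sheet1.xml
        print(z.read("xl/worksheets/sheet1.xml")[:200])

In practice a library like openpyxl is easier, since string cells are usually
indirected through xl/sharedStrings.xml.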

~~~
cookiecaper
My experience is that open-source implementations like LibreOffice have much
better support for the older file formats. docx/xlsx/etc may be a "zip of XML
files" but it doesn't really matter if the original data is stored as XML or a
blob when it comes to the end user's rendering experience -- it's about how
the system can interpret that data. And the interpretations of the new file
formats are mostly severely lacking.

~~~
BrianEatWorld
To be fair, isn't Libre also pretty terrible at handling more basic formats
like CSVs?

I still have yet to find a good workaround for the "numbers as text" issue
aside from switching to tsvs.

~~~
maxerickson
What do you mean by good workaround?

Specifying that the column is text during the import at least gets the data
into the sheet.

~~~
BrianEatWorld
Yes, it gets data into the sheet, but means that you need extra steps to run
calculations. There is no option to specify the column as a number during the
import phase. So what takes one step in Excel takes a minimum of two in Libre.

~~~
maxerickson
I thought you were worried about leading zeros or something.

In calc 4.0 on XP, I can choose from a few options for each column and the
default (called "Standard") results in a column that I can use in calculations
(if I highlight the cells it shows the sum, in formulas, etc.).

If I specify the column is "Text", leading zeros are retained.

------
a_p
ProPublica recently published an article (with an unseemly title) about the
nightmare of scraping data from pdfs and putting it into tables. [1] They
wrote some custom software to do the scraping. At the end of the article they
mention some software released recently by a Knight-Mozilla OpenNews Fellow,
Manuel Aristarán, called Tabula, which is a PDF data table scraper.[2] I
haven't tested it yet.

[1] <http://www.propublica.org/nerds/item/heart-of-nerd-darkness-why-dollars-for-docs-was-so-difficult>

[2] <http://source.mozillaopennews.org/en-US/articles/introducing-tabula/>

------
yebyen
There is a lot of bad advice in this article. OK, maybe the article is shorter
than I thought. Here, this is bad:

Use short variable names with no whitespace. Underscores are usually a safe
bet, so instead of "Number of new tuberculosis cases" use "incident_tb"

While it might be clear to the domain experts, I had no idea that
"incident_tb" meant "Number of new tuberculosis cases" other than that it was
placed next to that text.

I think the message is right, but honestly, if they are sending you data to
parse in PDF, you had better just give up now. That person will most likely
never have seen a README file in their life; they're probably not even trying.

~~~
mrtriangle
I agree with you. I actually had a client who demanded that the data in their
project be done in Excel (with multiple files ranging from 1-3 GB in size!).
Both my partner and I tried exhaustively to push for a database or even
JSON/XML solutions, but the client kept repeating "in the real world, business
is done in Excel!" Needless to say, it was a nightmare.

~~~
dav-id
I don't have the full picture here, but by moving away from an Excel format
you are also potentially taking away a lot of the control they have over
managing and working with the data themselves, which Excel is extremely good
at with little to no programming.

~~~
omni
Your point is true generally, but I have yet to use a version of Excel that
can handle multiple gigabytes of data without serious performance degradation
and probably a lot of crashes.

~~~
cosmie
Which would matter, if businesses cared. I spend 4 days every month compiling
a 1.5GB, Excel-based report. 80% of that time is spent with Excel frozen up
and me playing on my phone, hoping that when Excel unfreezes it doesn't crash.

I suggested porting the process to Access with an Excel-based front end, which
would cut the compilation time down to 4 hours at most. I was denied on the
grounds that whoever supports it after me will more than likely not have
Access experience, and was told to continue with the 4 days of hours-long
freeze-ups.

Corporate life for you, I guess.

------
th0ma5
A new tool for pulling tables out of PDFs was released recently by Mozilla
Open News: <http://source.mozillaopennews.org/en-US/articles/introducing-tabula/>

------
droithomme
It seems like most federal and state agencies nowadays make much of their data
- data that they are forced by law to make public - available only in PDF
files. The reason for this seems to be to prevent use and analysis of the data
as much as possible, while technically complying with public disclosure
requirements. It is often extremely difficult to get public data in a usable
form out of most public web sites, and the ways in which this happens are so
suboptimal that they move past the level of mere government incompetence to
intentional obfuscation.

~~~
brudgers
Use any format other than PDF and large numbers of citizens will not have the
proper tool to access it - e.g. a person using a public computer in a library.

It's not malice, even if it hampers outside researchers. Any format other than
printed (e.g. PDF) is going to require the user to deal with the metadata in
order to extract useful information - that's just the nature of data.

~~~
amboar
They could publish the data in multiple formats - PDF plus something actually
useful - if you're worried about someone at a library not having the tools.
The point is, don't cripple the data from an analysis perspective.

------
moron4hire
I was really hoping to come to this comment page and see a dozen top-level
comments claiming you could easily parse a PDF with XYZ tool, because the PDF
represented the table with reasonable data, and how could you not know that,
you poor excuse for a computer geek, in true HN (and Reddit and 4chan and
whatever your flavor of BBS is) fashion.

But no, we are well and truly fucked.

------
DanBC
Perhaps we're at the point where some bunch of people have already created a
Best Current Practice standard for preparing data prior to data sharing?
Because it seems like something that someone, somewhere, should have done
already.

\---

People can be great at using their particular tool, and then they'll do
something to make you realise that they're just an expert with that tool and
that their wider computer knowledge isn't so great.

I've had people want online bank statements as PDFs rather than CSVs because
"anyone can change the Excel file"; I had people asking me to scan engineering
drawings on paper to gif files ("Nice and small! We don't have enough space
for that other format!" (we did have space)) and then import those gifs into
AutoCAD so we could update our versions of the drawings when the customer
updated theirs. (The guy before me had given up trying to explain why it was a
dumb idea and had just scanned about 1,000 (of about 10,000) documents and
saved them as gifs.)

Some people aren't really sharing data. They're just sharing the information
they've derived from that data; the interpretation they're making.

------
astrosi
I'm always surprised that FITS [0] doesn't get more of a look-in outside of
astronomy. In my experience it works very well for storing tables of data as
well as images.

[0] <http://en.wikipedia.org/wiki/FITS>
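
For what it's worth, a minimal sketch of reading a FITS table in Python,
assuming astropy is installed; the filename and extension index are
hypothetical:

    # Minimal sketch: reading a FITS binary table with astropy.
    from astropy.io import fits

    with fits.open("catalog.fits") as hdul:
        hdul.info()              # list the header/data units
        table = hdul[1].data     # table extensions usually start at index 1
        print(table.columns)
        print(table[0])          # first row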

------
noonespecial
This reads like a giant advertisement for why scientists should learn to code.
Not only would they not be helpless in the face of stupid formats, they'd be
far more likely to pick a decent one to store the data in the first place.

------
coldtea
Posted "Excel as a database" link lower in the discussion threads, but I think
it's worth a comment of it's own:

<http://wyorock.com/excelasadatabase.htm>

------
kyllo
It does seem to be a big challenge to make data available both in analyzed,
"crunched" format for readers of a report (in which case PDF is usually fine),
and in "raw" format for analysts and developers (in which case one of CSV
flatfile, JSON or XML is usually the best format). Too often, people don't
seem to recognize the difference.

The internet isn't just about software and protocols; it's also about the
data. Sharing data is kind of the whole point of networking, which is why
making your data available in a machine-parseable format is so important.

------
pvaldes
Several pieces of good advice, but they miss the real problem here. Of course
you CAN extract text from a PDF and you can extract a table from Excel. A few
lines of Perl or your favourite language and you have it.

Data are worthless if you can't trust them. Plain CSV is easy to read but also
easy to change, even on the fly. PDF can be changed too, but it's not so easy,
and if someone makes a subtle change to a number, the error doesn't propagate
through all the pages into the totals the way it does in Excel. Excel
macros/formulas can be a source of headaches.

So to share your data, use a format that: 1) you can trust (reasonably); 2)
all your other reasons go here...

And if this format is full of nested tables, and you find it difficult to
extract the info from those tables, don't throw out the format, ask for help
instead.

PDF supports passwords, is very compact, is not so easy to change on the fly,
and can be encrypted. Maybe not the best, but not the worst of the available
tools, in my opinion.

------
abraxasz
Yes, a million times yes. I'd like to emphasize the following points:

(2), (3), (4) are all tied to the same idea: Your data will be parsed by a
script, not by me. I'm not going to use excel to analyse your data. I'm gonna
import them into R/Python/... and then extract the features I'm interested in.
I've been a graduate student in statistics for almost 2 years now, and when
the data are correctly packaged, I've never had to use excel.

(5) Come on, I know that nobody likes writing READMEs. Heck, I don't like it
myself. But do you really expect me to guess what your two-letter variable
name means? And please, make your README searchable, and short. The CDC data
comes with a 200-page manual explaining what the data is, and finding what
you're looking for is a nightmare.

------
bliker
...and the horrors of math equations and columns in PDF. We should really come
up with some lightweight format where style is separated from data, make a
universal converter from doc and xls files, and save the world from data stuck
in PDFs.

~~~
bdunbar
We could call it .. LaTeX.

~~~
bliker
Why do people share exported PDFs and not their LaTeX files?

~~~
bdunbar
Because no one has heard of LaTeX.

------
crazygringo
As long as the PDF contains text (and isn't just scans of paper), it's usually
not TOO hard to select and copy the relevant tables/text, paste into a text
editor, and use regexes to transform it into whatever you want (CSV, SQL
insert statements, etc.).

Since there's no universal data format (e.g. the data equivalent of PDF), I
pretty much assume that _whatever_ format data comes in, I'll have to be doing
regex transforms to it in a text editor, in order to import it wherever I
want.

So as long as it's copy-pastable as text, PDF doesn't seem appreciably worse
than pretty much any other format.
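
As a rough sketch of what those regex transforms tend to look like -
collapsing runs of whitespace in text copied out of a PDF into commas (the
sample lines are made up, and real tables usually need more careful per-column
patterns):

    # Rough sketch: whitespace-separated lines pasted from a PDF -> CSV.
    import re

    pasted = [
        "Alice   42   2013-04-01",
        "Bob     17   2013-04-02",
    ]

    for line in pasted:
        print(re.sub(r"\s{2,}", ",", line.strip()))
    # Alice,42,2013-04-01
    # Bob,17,2013-04-02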

~~~
zdw
Even better, use `pdftotext` from the Xpdf package to turn it into text, so
you skip the manual steps.

I do this with all my banking data - dump the PDFs in a folder, a script
converts them to CSV, then to ledger (CLI accounting program) format.
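
The skeleton of that kind of pipeline, assuming the Xpdf/poppler `pdftotext`
binary is on the PATH; the folder name and the parsing step are placeholders:

    # Rough sketch: batch-convert PDFs to text, then parse further.
    import pathlib, subprocess

    for pdf in pathlib.Path("statements").glob("*.pdf"):
        txt = pdf.with_suffix(".txt")
        # -layout preserves the column positions, which makes the text
        # far easier to parse afterwards
        subprocess.run(["pdftotext", "-layout", str(pdf), str(txt)], check=True)
        # ...parse txt into CSV / ledger entries here...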

------
minimize_me
<http://source.mozillaopennews.org/en-US/articles/introducing-tabula/>

Wasn't this on here a few days ago?

------
navait
I started a White House petition to mandate that federal agencies also provide
plain-text data: <http://wh.gov/MIXT>

------
dpcx
I've used scraperwiki.com before to convert PDFs to "usable" data in cases
like these. Yeah, it's a pain, but then it's available for everyone.

------
kimagure
Maybe it's a naive question, but why shouldn't you use quotes to encapsulate
things that are meant to be text, so that you can catch things like
whitespace?

Also, do people use formats other than comma- and tab-delimited?

My own experience with CSV is dealing with my own data and instrument readouts
(which are almost always tab-delimited with a header section), so I don't
know.

~~~
itafroma
Where did you read that you shouldn't use quotes to encapsulate text? RFC 4180
(<http://tools.ietf.org/html/rfc4180>) says you may.

Many CSV generators and parsers don't conform to RFC 4180, but if you're
planning on transmitting data using CSV, it's simple enough to say "use an RFC
4180-compliant parser."

~~~
kimagure
Yeah, I was just wondering why the author kind of explicitly recommended
against using whitespace, when most parsers (including MS Excel, in my
experience) handle whitespace encapsulated in quotes just fine.

~~~
itafroma
That's for variable names, not run-of-the-mill text. Variable names (and, in
certain situations, dictionary keys) generally can't have spaces, so if you're
generating code off of the data set, having to generate variable names (or
dictionary keys) without spaces is an additional step.
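
A tiny sketch of that extra step - turning a human-readable column header into
a safe identifier/dictionary key (the function name is made up):

    # Tiny sketch: column header -> snake_case identifier.
    import re

    def to_identifier(header):
        return re.sub(r"[^0-9a-z]+", "_", header.lower()).strip("_")

    print(to_identifier("Number of new tuberculosis cases"))
    # number_of_new_tuberculosis_cases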

------
bloaf
I suppose I should mention Wolfram's expedition into this domain - the CDF
format:

<http://www.wolfram.com/cdf/compare-cdf/how-cdf-compares.html>

<http://en.wikipedia.org/wiki/Computable_Document_Format>

~~~
rurounijones
"It is a _closed format_ created by Wolfram Research"

"CDF files can be read using a _proprietary_ CDF Player with a _restrictive_
license"

No thank you.

------
hmottestad
RDF, RDF, RDF, RDF, RDF, RDF, RDF.

Yes. Please use RDF.

And please use a common vocabulary.

If you want to know how to do this, have a look at the UK government's open
data portal (which uses RDF): data.gov.uk

And their SPARQL (query) endpoint: <http://data.gov.uk/sparql>
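
A small sketch of what "RDF with a common vocabulary" can look like in
practice, using rdflib and the Dublin Core terms vocabulary; the subject URI
and values are invented:

    # Small sketch: one dataset description as RDF, serialized to Turtle.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS, XSD

    g = Graph()
    dataset = URIRef("http://example.org/dataset/tb-cases-2013")
    g.add((dataset, DCTERMS.title, Literal("New tuberculosis cases, 2013")))
    g.add((dataset, DCTERMS.issued, Literal("2013-04-24", datatype=XSD.date)))

    print(g.serialize(format="turtle"))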

~~~
Homunculiheaded
RDF took Prolog clauses, restricted them to triplets, and made their
representation dramatically more verbose. Then it removed the reasoning engine
part of logic programming and replaced it with OWL which is by default
computationally intractable.

I have tried for many, many years to learn to appreciate RDF and semantic
technologies in general, but every time I get involved it screams of a
solution desperately looking for a problem - and the general case of that
problem (reasoning about relations) was already solved better decades ago
(and, with the recently revived interest in logic/relational programming,
seems likely to improve soon).

If you want to represent a graph structure in your data, why not sets of JSON
objects? Sure, you end up with an ad hoc vocabulary, but I've never seen a
successful project make use of the RDF ideal of massively shared vocabularies.

~~~
hmottestad
JSON doesn't really support datatypes, so it's not really useful for sending
data to someone else. Even simple things such as float vs. double will be a
challenge, not to mention dates or locations.

Pity you haven't come to like it. It's being tested on a lot of stuff; some
sticks, some doesn't.

And OWL has a number of subsets for whatever complexity you feel like
enjoying.

------
wiggity
This advice is pertinent even _within_ Excel, especially for those who rely on
any of its summary features (pivot charts, data analysis, etc.). It is
essentially an illustration of the benefits of separating content from
presentation.

------
smrtinsert
TOML will solve all these problems. More structured than JSON, easier to
consume than XML.

<https://github.com/mojombo/toml>

~~~
lttlrck
Is it really more structured than JSON?

~~~
smrtinsert
First-class dates say yes, it is!
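
A quick illustration of that point, using Python's standard-library tomllib
(3.11+); the keys and values are made up:

    # Quick sketch: TOML dates parse to real datetime objects, not strings.
    import tomllib

    doc = tomllib.loads("released = 2013-04-24T12:00:00Z\ncount = 3")
    print(type(doc["released"]))   # <class 'datetime.datetime'>
    print(doc["count"])            # 3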

------
sokrates
The "D" in PDF does not stand for data, but for documents.

------
g8oz
You can embed documents within PDFs, right? That feature should be used more
to include source data that can be manipulated by end users.

------
dmourati
If PDF is all you have, at least there is pdfescape:

<http://www.pdfescape.com/>

------
rowdyrabbit
Why not just provide multiple formats?

------
Lambdanaut
Some "import from PDF" functionality would be immensely valuable. It's not
impossible.

------
johngalt
'Send me your data in PDF and ONLY PDF' said most courts in the US.

