Hacker News new | past | comments | ask | show | jobs | submit login
Data Organization in Spreadsheets (tandfonline.com)
77 points by fanf2 8 months ago | hide | past | web | favorite | 27 comments

>"Spreadsheets are often used as a multipurpose tool for data entry, storage, analysis, and visualization. Most spreadsheet programs allow users to perform all of these tasks, however we believe that spreadsheets are best suited to data entry and storage, and that analysis and visualization should happen separately."

in every organization I've worked in, its usually the exact opposite - spreadsheets are used to do analysis and create reports while the source data is coming from other systems (SAP, Salesforce, quickbooks, etc)

Sometimes the spreadsheets are the output of one system, but the input to the next: the spreadsheets of the Office of National Statistics in the UK are the final, and official output of a long chain of number wrangling, but when it comes to re-consuming them to power their new data-driven website it's a real challenge -- I wound up working on a project to allow programmatic consumption of these complex spreadsheets.

(https://github.com/sensiblecodeio/databaker for the top-level project; https://github.com/sensiblecodeio/xypath for the library I created).

Get in touch if you're interested -- dave.mckee@gmail.com

Oh fantastic, I want to do more with the data but the spreadsheets are designed for people rather than machines. Will have to check this out, thanks!

See also DataWrangler.

No enterprise application is complete until it gets Export to Excel feature.

I've seen and heard from colleagues that there is a lot of data being stored and shared in CSV too ..and then analysis done in excel.

that seems to change as an automation and IT culture starts to develop in an organization; interestingly, IT can sometimes be a barrier to data organization (too slow, cross-org costs etc) and cause people to continue to use CSV and excel.

fwiw - we built a service called "yukon data solutions" that solves these classes of problems for people trying to transition from CSV and spreadsheets into automated and reproducible analytics and reporting.

FYI Yukon was the codename for Microsoft SQL Server 2005. Dunno if you care.

oh - that's cool, didn't know that :) Our's is based on the Canadian Air force CC-106 Yukon.

While I don't necessarily agree with the author there, I don't think there is contradiction between their and your statements. Tools are not necessarily commonly used for what they are best suited for.

Also the context is bit different; research vs businesses

Spreadsheets are also great UIs for configuration files to be used for models. Perhaps not for the programmer (though you can write your excel serializer/deserializer) but for users yes. If you have to make large changes on many entries, try updating a json or xml file by hand, or spend two hours clicking through buttons and drop downs in a UI vs a few seconds on a spreadsheet for any decent excel user.

There's a interesting recent paper, also cited by these folks, about errors in gene expression data. Spreadsheets (or at least Excel) are a major culprit, interpreting gene SEP15 as September 15, for example. This is dangerous when it happens in only part of the data, like some columns.

Gene name errors are widespread in the scientific literature

Genome Biology 2016 17:177

https://doi.org/10.1186/s13059-016-1044-7 (Open Access)

Hadley Wickham's tidy data is also an excellent read: http://vita.had.co.nz/papers/tidy-data.html

I work closely with social scientists and it's common in those disciplines to embed narrative and presentation data in spreadsheets.. articles like this and the Good Enough Practices in Scientific Computing (http://journals.plos.org/ploscompbiol/article?id=10.1371/jou...) are nice summaries of good practices.

A technique I use, is to enter the data in an sqlite3 database, using SQLiteStudio and use spreadsheets for importing, exporting and presentation only.

> save the data in plain text files.

Not if you are working in excel! It can corrupt all sorts of unicode data when exporting to csv.

Excel has a CSV UTF-8 export option...use it!

Excel has many other sins. If you have a long numeric ID in a csv file, it will import it as a floating point and truncate (round) the ID.

Learn how to use the "From Text" wizard importer...

Once you have are aware of it there are ways around that. The problem is that this will surprise I think pretty much anyone who hasn't been bitten yet. When you double click on a csv file, it just opens it. Then you have some weird behavior of things not matching when they should.

Yeah, it's a pain. Zip Codes that begin with zero, FedEx tracking numbers, as you mentioned any long numbers. All of these gotchas have a work-around but they can be very tedious to look for and address. The worst part is that when you think you've fixed whatever needed fixing, save the CSV file and then remember something you missed and reopen it all of your first round changes are lost.

I've gotten in the habit of saving a copy to Excel first, then to CSV again. If I realize I still have a bug in my CSV copy I reopen the Excel version so any earlier formatting edits I've made are still intact.

I've yet to find a reliable way to prevent Excel from destroying dates (or seeing dates in other data) though. In this regard I prefer LibreOffice. (Or SQLite, if I'm the only person working on it.)

I might have missed something obvious though so feel free to point it out. Google failed me though.

Use From Text or Get & Transform...you can tell Excel how to Type each column.


Two questions:

1. What do people use (currently) for data standardization?

2. What do you do if it's not in tabular format, but is something like JSON/noSQL?

I use RDF and standard open schemas like schema.org or VIVO-ISF, with local schema extensions as needed. The datasets I need to standardize tend to be things like scholarly research networking data or data about healthcare clinics and providers, which tend to be highly connected with some sparsely populated fields, both of which lend themselves well to a graph-based model.

Right now I use the Karma Data Integration tool (http://usc-isi-i2.github.io/karma/) to transform my source data to RDF triples. It can handle data in a variety of shapes, not just tabular, though it does struggle with say, highly-nested XML data. I want to try LinkedPipes ETL (https://etl.linkedpipes.com/) on my next project, whatever it might be.

At risk of shamelessly promoting, we built Parabola (https://parabola.io) with many of these same best practices in mind.

HN discussion: https://news.ycombinator.com/item?id=17147272

Ah, yes, I remember Parabola.

I used to give a "data science" (read: data munging) training at a large accounting firm, and this article seems to track my recommendations almost exactly. The crazy things people do with spreadsheets never ceases to amaze me.

Can’t believe this was accepted as a paper...

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact