
Frictionless Data: Lightweight standards and tooling for data sharing - rkda
http://frictionlessdata.io/
======
vitorbaptistaa
Happy to see this on the front page. I work for Open Knowledge
International, which develops the Frictionless Data standards. Feel free to ask
me anything, and I'll make sure that I or someone else from the team answers
it.

------
jasode
The frictionlessdata landing page has very generalized verbiage, so here's my
technical summary of it...

The main idea of the "container" or "package" hinges on a file called
_"datapackage.json"_ [1].

An analogy would be "sfv" files like _"checksums.sfv"_ for verifying the
integrity of files. Since so many people use "sfv" as a de facto standard, many
programs exist to scan it and verify the associated files. Another analogy
would be a _DTD_ for XML files.

Similarly, if everybody could converge on the file _"datapackage.json"_ as a
metadata & schema description standard, a useful ecosystem of utilities and
libraries for processing data would take advantage of it.

One example library would be:
[https://github.com/frictionlessdata/datapackage-py](https://github.com/frictionlessdata/datapackage-py)

(In the Python source code for "package.py" [2], Ctrl+F search for
_"datapackage.json"_ to see how it looks for that particular file.)

With a data wrangling API like that, one could then do joins on csv files
directly[3] and write the results to another csv file _with the associated
"datapackage.json"_.

Instead of passing "dumb" csv or raw json files around, add a little
"intelligence" to the dataset by way of _"datapackage.json"_ so tools can
parse the schema and process csv/json at a higher abstraction level. That
leads to more "effortless" and "frictionless" data interoperability.

What I can't tell so far is whether _"datapackage.json"_ already has momentum
of adoption across many communities such as Julia, Tensorflow, Hadoop, etc.
and we need to get on the bandwagon -- or -- adoption is still in its infancy
and there are _other competing data "container/package" specifications_ to
look at.

[1] [http://frictionlessdata.io/guides/data-package/](http://frictionlessdata.io/guides/data-package/)

[2] [https://github.com/frictionlessdata/datapackage-py/blob/mast...](https://github.com/frictionlessdata/datapackage-py/blob/master/datapackage/package.py)

[3] [http://frictionlessdata.io/guides/joining-tabular-data-in-py...](http://frictionlessdata.io/guides/joining-tabular-data-in-python/)

~~~
_pwalsh
Hi,

(I work on the Frictionless Data specifications and tooling at Open Knowledge
International.)

Thanks. We are working on the website at present [1], trying to strike a
balance between targeting technical and non-technical users, which is hard
to get right.

About momentum - I can address that. We have seen significant momentum in the
last 2 years: around open data / government transparency / civic tech (our
natural environment - see [https://okfn.org](https://okfn.org) for details),
around scientific / academic research via work enabled by a grant from
Sloan [2] (see [http://frictionlessdata.io/case-studies/](http://frictionlessdata.io/case-studies/)
for a small selection, with more reports coming), and in general around data
wrangling and data science efforts (including integration of Table Schema [3]
with Pandas [4]).

In terms of _big_ data / machine learning - we have not actively worked in
that space to date.

In terms of Julia and other languages, we have a Julia library in development
via our Tool Fund [5]. It will join implementations [6] in PHP, Java,
R, and Clojure that are already underway via the Tool Fund, and accompany the
Python, Javascript and Ruby implementations that we maintain directly at Open
Knowledge International.

[1]: [https://github.com/frictionlessdata/frictionlessdata.io/issu...](https://github.com/frictionlessdata/frictionlessdata.io/issues/129)
[2]: [https://sloan.org](https://sloan.org)
[3]: [http://specs.frictionlessdata.io/table-schema/](http://specs.frictionlessdata.io/table-schema/)
[4]: [https://pandas-docs.github.io/pandas-docs-travis/generated/p...](https://pandas-docs.github.io/pandas-docs-travis/generated/pandas.io.json.build_table_schema.html)
[5]: [http://toolfund.frictionlessdata.io](http://toolfund.frictionlessdata.io)
[6]: [https://github.com/frictionlessdata](https://github.com/frictionlessdata)

~~~
philipov
Do you support yaml for people who dislike json?

~~~
_pwalsh
Hi. One could read the YAML first with the library of choice, and then load
the resulting descriptor into the Data Package or Table Schema libraries.

------
craig_peacock
This is an overly complicated data container format for not much advantage. To
be honest, everything you can do with this can be done as well or
better with SQLite, an actual database system. Having to implement 4 different
parsers and validation functions spanning a mix of csv, xml and json just to
access what is essentially a csv file is not feasible.

~~~
curragh
I agree that SQLite is amazing, and the problem that I had with some of the
datapackage implementations (CSVLint) is that they stored validation errors
in-memory (this is a deal breaker for data sets larger than a few hundred MB)
and didn't work well when cross-validating data between multiple files. That's
why I created ETLyte
([https://github.com/sorrell/etlyte](https://github.com/sorrell/etlyte)) which
reads data into a SQLite DB, writes errors to the DB, and streams output to
file/stdout.

I disagree that there is "not much advantage" in the format though. I use much
of the "resources" area of the data container format and find it tremendously
helpful for validating the expected datatypes (remember, SQLite has no true
datatypes for columns), defining expected values, and defining some of the
"ETL" functionality in ETLyte, like derived columns.

Also on the horizon is a fuzzing tool I'm creating to help exercise the
boundaries and variations of data that an ETL process can expect, and this
wouldn't be possible without a data container format. So again, I think there
are very good use cases for it that we haven't even tapped into yet.

------
jakubp
CSV as a serialization format? Ouch. Could we do better? My experience with
CSV has been nothing but pain in the past: ambiguous formats, quoting issues,
incompatible libraries between languages and popular GUI tools like Excel or
data vis apps.

I wonder if there's anything better.

~~~
slowmotiony
I hate csv so much. Simply opening a csv in Excel and saving it breaks the
format so badly that most parsers cannot understand it anymore. Plus there is
no way to know which encoding the csv came in - Notepad++ might say one thing
and PowerShell will say another. Have fun figuring out why Cyrillic characters
or umlauts cause issues in your scripts...
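The encoding ambiguity is easy to reproduce: the same CSV bytes decode to different text depending on which encoding the reader guesses, and nothing in a bare .csv file says which guess is right (sample name is my own illustration):

```python
# A cp1252-encoded CSV read back under two different encoding guesses.
raw = "name\nMüller\n".encode("cp1252")

as_cp1252 = raw.decode("cp1252")                    # what the author wrote
as_utf8 = raw.decode("utf-8", errors="replace")     # what a UTF-8 reader sees

print(as_cp1252.splitlines()[1])  # Müller
print(as_utf8.splitlines()[1])    # M�ller
```

This is one of the things a Data Package descriptor can pin down: the resource can declare its encoding explicitly instead of leaving every consumer to guess.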

~~~
jarman
The problem is not CSV, it's Excel. It's hard to blame the format when the
application is insane enough to use locale settings to determine file parsing.

------
sandGorgon
It is relevant to point out yesterday's article by Wes McKinney on Apache
Arrow and the future of high-performance data formats -
[https://news.ycombinator.com/item?id=15335462](https://news.ycombinator.com/item?id=15335462)

