
Show HN: Concept of a marketplace for machine learning datasets - akshaynathr
http://www.datapie.in/
======
minimaxir
There are a few services popping up with the aim of providing data
repositories for analysis/ML (Kaggle, data.world, /r/datasets).

As someone who likes making analyses from random datasets, I have a few issues
with these types of services:

1) There is often no indication of the distribution rights of the data, or
whether the data was obtained ethically from the source (i.e. following the
ToS). I made this mistake when I used an OKCupid dataset released on an Open
Data Repository; it turned out it had been scraped with a logged-in account,
and the dataset was taken down via DMCA.

2) There is no indication of the _quality_ of the data, and as a result, it
may take an absurd amount of time cleaning the data for accuracy. Some
datasets may not be salvageable.

3) Bandwidth. Good datasets have lots of data for better models, and these
sites may not be able to support the transfer volume. (BigQuery public
datasets solve this problem, however.)

~~~
jamesblonde
Good points. We are soon going to release a p2p system for sharing datasets,
backed by Hadoop clusters. You install the Hadoop stack (localhost or
distributed), then you can free-text search for datasets that have been made
'public' on any Hadoop cluster that participates in the 'ecosystem'. We expect
it to be self-policing, but there will be a way to report illegal distribution
of datasets. The solution is based on a variant of BitTorrent where files are
downloaded in order (rather than in the effectively random order that
BitTorrent's rarest-piece selection produces). Files can be downloaded to
either HDFS or a Kafka topic. We will demo it in 2 weeks here:
[https://fosdem.org/2017/schedule/event/democratizing_deep_le...](https://fosdem.org/2017/schedule/event/democratizing_deep_learning/)
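To make the in-order vs. rarest-piece distinction concrete, here is a minimal sketch of the two selection policies. It is not the code of the system described above; the function names, the peer bitfields, and the four-piece file are all made up for illustration.

```python
from collections import Counter


def next_piece_in_order(have, num_pieces):
    """Return the lowest-numbered piece still missing, or None when complete."""
    for index in range(num_pieces):
        if index not in have:
            return index
    return None


def next_piece_rarest_first(have, peer_bitfields, num_pieces):
    """Return the missing piece held by the fewest peers (ties: lowest index)."""
    availability = Counter()
    for bitfield in peer_bitfields:
        availability.update(bitfield)
    candidates = [
        i for i in range(num_pieces) if i not in have and availability[i] > 0
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda i: (availability[i], i))


# Hypothetical swarm: each peer advertises the set of piece indices it holds.
peers = [{0, 1, 2, 3}, {0, 1, 3}, {0, 3}]
have = set()

in_order = next_piece_in_order(have, 4)            # picks piece 0
rarest = next_piece_rarest_first(have, peers, 4)   # picks piece 2 (one holder)
```

In-order selection is what lets a consumer start processing the head of a file while the tail is still arriving; rarest-first optimizes for swarm health instead. The in-order sketch ignores availability for brevity.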

The system will be bootstrapped with lots of interesting big datasets:
ImageNet, 10m images, YouTube-8M, Reddit comments, HN comments, etc. Our
experience is that we need a central point for researchers to get easy access
to open datasets that doesn't require an AWS or GCE account.

~~~
Cacti
Could you explain what problem you're trying to solve here? Are there really
that many researchers who have access to modern (and expensive) GPU hardware
that don't have bandwidth or disk space available? Or are there many
researchers who are putting in lots of time assembling a dataset but don't
have the bandwidth to distribute it?

~~~
jamesblonde
It's more a case of providing a quick and easy way to share large datasets,
backed by HDFS. So, researchers don't have a good way to share datasets (apart
from AWS/GCE).

We work with climate science researchers who have multi-TB datasets, and they
have no efficient way to share them. Same goes for genomics researchers who
routinely pay lots of money for Aspera licenses just to download datasets
faster than TCP allows. We are using a LEDBAT protocol tuned to give good
bandwidth over high-latency links while only scavenging spare bandwidth, since
it runs at lower priority than TCP.
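For readers unfamiliar with LEDBAT, the sketch below shows the general shape of its delay-based window update as described in RFC 6817. It is not the tuned variant mentioned above, whose parameters are not given here; the constants and the function name are illustrative assumptions.

```python
TARGET = 0.100        # target queuing delay in seconds (RFC 6817 upper bound)
GAIN = 1.0            # window gain; assumed value for illustration
MSS = 1452            # sender maximum segment size in bytes; assumed value
MIN_CWND = 2 * MSS    # floor so the flow never stalls completely


def ledbat_on_ack(cwnd, base_delay, current_delay, bytes_acked):
    """Return the new congestion window (bytes) after one acknowledgment.

    base_delay is the smallest one-way delay seen so far (empty queue);
    current_delay is the latest measurement. Their difference estimates
    how much queuing this flow and its competitors are causing.
    """
    queuing_delay = current_delay - base_delay
    off_target = (TARGET - queuing_delay) / TARGET
    cwnd += GAIN * off_target * bytes_acked * MSS / cwnd
    return max(cwnd, MIN_CWND)


start = 10 * MSS
grown = ledbat_on_ack(start, base_delay=0.05, current_delay=0.05, bytes_acked=MSS)
shrunk = ledbat_on_ack(start, base_delay=0.05, current_delay=0.25, bytes_acked=MSS)
```

The window grows while measured queuing delay is below the target and shrinks once it rises above it, which is why a LEDBAT flow backs off before loss-based TCP flows sharing the same link.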

For the machine learning researcher: I'd like to test this RNN on the Reddit
comments dataset... 3 days later, after finding a poor-quality torrent... oh,
now I can do it. On our system: search, find, click to download. We will move
towards downloading (random) samples of very large datasets (even to Kafka,
from where they can be processed as they are downloaded).
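One standard way to draw a uniform random sample from a dataset that arrives as a stream, without knowing its length up front, is reservoir sampling. This is a generic sketch of that technique, not a description of how the system above samples; the function name and the stand-in stream are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Uses Algorithm R: memory stays at k items however long the stream is.
    """
    rng = rng or random.Random()
    sample = []
    for count, item in enumerate(stream, start=1):
        if count <= k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / count.
            j = rng.randrange(count)
            if j < k:
                sample[j] = item
    return sample


# Stand-in for records consumed from a download stream or a topic.
records = (f"comment-{i}" for i in range(100_000))
picked = reservoir_sample(records, k=5, rng=random.Random(0))
```

Because each record is inspected once and then discarded, this works on data that is still being downloaded.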

~~~
amelius
Sounds nice. Could you consider making it more general than sharing datasets
for ML? I mean, it sounds like a really generic solution that anyone could
benefit from, not just researchers.

------
EternalData
There's probably some utility to it: a lot of problems involve hacking
together datasets, sometimes in dubious ways. There's also value, especially
for startups that are looking to build simple neural net applications (ex:
identifying plates of food from different restaurants), which are very
data-dependent. Researchers may also want to recoup the cost of assembling
datasets (ex: MTurk, processing power) and open up datasets that may never
have been open before.

My general sense on this, though, is that I'd like there to be more of an
incentive for people to open up their datasets to the larger public. Maybe I'm
being idealistic, but I'm imagining a crowdsourcing-type function where you
pay for dataset X together with other users and then it's released under MIT,
forever free, etc.

As others have mentioned, that'll probably bump up against usage rights
issues, a larger problem you'll have to deal with independent of your need to
sell or distribute the datasets in question.

------
akshaynathr
Hi everyone. While working on some of my projects involving machine learning
algorithms and deep neural networks, I have found that there is a lack of
training datasets in many areas. Also, many of them are scattered throughout
the web, some are too large for an individual to process, etc. So I thought of
this idea of having a marketplace for data, structured for machine learning
communities. It can be a one-stop place for researchers, scientists,
students, data analysts, etc. Looking for some valuable opinions.

~~~
nerdponx
Are you familiar with the UCI repository?
[https://archive.ics.uci.edu/ml/datasets.html](https://archive.ics.uci.edu/ml/datasets.html)

~~~
akshaynathr
Yeah. I was thinking of a crowdsourced marketplace platform which follows a
particular standard so that all datasets support the major ML platforms, with
API support in programming languages for preprocessing huge datasets, etc.
People can prepare and buy/sell datasets. This can bring people from various
streams together to solve really major issues using the data.

~~~
huac
What is a 'particular standard for data'?

~~~
akshaynathr
Many datasets available online do not clearly indicate the quality of the
data. So I think having a particular standard of quality can really help.
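As one illustration of what a minimal quality indicator could look like, the sketch below reports per-column missing values and duplicate rows for a CSV. The checks, the function name, and the sample data are assumptions for illustration, not a standard proposed in this thread.

```python
import csv
import io


def quality_report(csv_text):
    """Summarize row count, empty cells per column, and duplicate rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fields = list(rows[0].keys()) if rows else []
    # A cell counts as missing when it is absent, empty, or whitespace only.
    missing = {
        f: sum(1 for r in rows if not (r[f] or "").strip()) for f in fields
    }
    # Rows are duplicates when every column value matches an earlier row.
    duplicates = len(rows) - len({tuple(r.items()) for r in rows})
    return {"rows": len(rows), "missing": missing, "duplicate_rows": duplicates}


# Made-up sample: one empty label and one repeated row.
sample = "id,label\n1,cat\n2,\n1,cat\n"
report = quality_report(sample)
```

Publishing even a small summary like this next to each dataset would let a buyer estimate cleaning effort before downloading anything.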

------
hazelnut
I think the idea is great, but you should think about this sentence: "Buy and
sell your data like Ebay" next to an image of connected people. It makes you
look like a shady dealer in user profiles. To be successful, it's crucial that
you draw a clear line there.

~~~
verdverm
also, how does this fit with...

""" Datapie offers data analysis without downloading the data. This means you
need not download the massive data. No need to have massive distributed
systems to process it. """

Where is your company trying to fit into the market? Is this a
[http://zerotoonebook.com/](http://zerotoonebook.com/) or are we commoditized
in this space already?

IBM offers many of the same data sets, paired with your company's private
data, both annotated by Watson. IBM also does not learn from, or improve their
own models, w.r.t. your company's proprietary data.

------
pgroves
I swear this is what Infochimps used to be, but now I don't really see a
reference to it on their website. Except for a 404 when I click on "Resources"
-> "Data Marketplace"[1]. I'm guessing that means they moved away from that
business. Looks like they now focus on tools, not data itself.

[1]
[http://www.infochimps.com/marketplace](http://www.infochimps.com/marketplace)

~~~
Palomides
I think they pivoted and then got bought out, but yeah, I recall the same.

------
amelius
What I like more is the concept of a job-agency for AIs. Basically, the job-
agency is a broker between people who have data and need an algorithm, and
people who have an algorithm but no data. The broker can then work as a
matchmaker, but also provide protection against data/algorithm theft by
managing hardware themselves.

As an example, see [1].

[1] [http://www.aigency.co/](http://www.aigency.co/)

------
_wmd
relevant:
[https://www.reddit.com/r/datasets/top/](https://www.reddit.com/r/datasets/top/)

~~~
btown
For those unfamiliar with Reddit, you can choose in a dropdown what time
period to filter top posts by; here's a direct link to the all-time top
dataset posts there:
[https://www.reddit.com/r/datasets/top/?sort=top&t=all](https://www.reddit.com/r/datasets/top/?sort=top&t=all)

------
pjackson5
Would there be much of an opportunity for someone who's into photo-realistic
3D rendering to create some of these datasets? For starters I was thinking of
making something like the Make3D dataset -
[http://make3d.cs.cornell.edu/data.html](http://make3d.cs.cornell.edu/data.html)
for some of my own experiments.

------
markkurt
not sure why it bothered me but the scrim on your top section could use an
extra couple pixels of padding.

like the idea - minimaxir had some good thoughts.

------
chattamatt
Facebook page button doesn't work, fwiw

------
ungaro
link seems broken to me, but here is something that i know:
[http://academictorrents.com/](http://academictorrents.com/)

------
akshaynathr
Link is now working again. It went down due to HN load.

