
Lehigh research team to investigate a “Google for research data” - sprague
https://www.lehigh.edu/engineering/news/faculty/2018/20180820-davison-heflin-jia-dataset-search-engine-nsf-award.html
======
gervase
Yikes, talk about poor timing [0]!

Of course, the proposal [1] was submitted at the beginning of August before
this was available, but it must still be a bit of a gut check for the research
team.

Hopefully they can figure out a way to inject additional novelty into the
project.

[0]:
[https://news.ycombinator.com/item?id=17919297](https://news.ycombinator.com/item?id=17919297)

[1]:
[https://www.nsf.gov/awardsearch/showAward?AWD_ID=1816325](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1816325)

~~~
toomuchtodo
Not poor timing at all! A non-Google alternative is necessary for when Google
(inevitably) decides to shelve their version.

------
xt00
I've been curious about how to pay people for their datasets, and I think
research into that would be really interesting. It would be great if you could
pay known, trustworthy sources for vetted data, and those sources would
essentially crowdsource the data by literally pounding the pavement or by
working through difficult-to-parse datasets. Entire businesses are based on
that, so I could see this getting out of hand quickly when somebody wants
access to a dataset but does not want others to have it. Still, I wonder
whether anyone has studied the "how to pay people for datasets" problem: how
to get good-quality data that is not gamed to maximize profit, while still
giving people an incentive to collect it.

~~~
lingz
There are already several established organizations that hold and manage
billing for data sets within specific domains. An example is the LDC
([https://www.ldc.upenn.edu/](https://www.ldc.upenn.edu/)) which hosts huge
amounts of natural language + voice data in many languages, submitted by
universities around the world.

Personally, I think there is a big downside to attaching it to billing. The
data is quite difficult to obtain (financially and logistically), especially
if you are outside a university. Even within a university, procuring data
could take months of bureaucracy. For an independent student or developer,
the data becomes largely inaccessible.

------
shawn
I'd like a dataset which maps (zip code, age, single) to a distribution of
incomes. E.g. (60642, 18, single) is likely <$20k income. Ideally it would
return a big list of (age, zip code, income, year, single) entries.

Is it possible to find this data using Google's dataset search? If not, making
an easier frontend for it might be one way to add novelty.

It's also hard to figure out whether the data exists or whether your search
terms are poor.
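
For concreteness, the lookup described above could be sketched roughly like
this. It is only an illustration: the records are made up, and the names
`records` and `income_distributions` are hypothetical, not taken from any real
dataset or API.

```python
from collections import defaultdict
from statistics import median

# Made-up rows in the (age, zip code, income, year, single) shape
# described above. These are NOT real figures.
records = [
    (18, "60642", 15000, 2017, True),
    (18, "60642", 19000, 2017, True),
    (18, "60642", 22000, 2017, True),
    (35, "60642", 80000, 2017, False),
]


def income_distributions(rows):
    """Group incomes by (zip code, age, single); each value is a sorted list."""
    groups = defaultdict(list)
    for age, zip_code, income, _year, single in rows:
        groups[(zip_code, age, single)].append(income)
    return {key: sorted(values) for key, values in groups.items()}


dist = income_distributions(records)
incomes = dist[("60642", 18, True)]

# Summaries of the distribution for one (zip code, age, single) key.
median_income = median(incomes)
share_under_20k = sum(i < 20000 for i in incomes) / len(incomes)
```

With real data, the empirical list per key would stand in for the
"distribution of incomes", and the share under a threshold is one way to
express "likely <$20k".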

~~~
spydum
This already exists. Look at Claritas:
[https://claritas360.claritas.com/mybestsegments/#zipLookup](https://claritas360.claritas.com/mybestsegments/#zipLookup)

~~~
shawn
This is magical. How'd you find out about this?

~~~
conception
The dataset you are looking for is a pretty common marketing research dataset.
Knowing that may help your Google queries.

------
modells
Google may not be the best model to aid researchers, or the most useful and
profitable. An AWS meets Coursera meets helpful tech and engineering
consulting/support shop seems like a better, full-service model to help bio
people accomplish their work while having the support of a top-notch
IT/engineering organization. Professional services without the delay, cost or
extractive tendencies... more like kick-ass support that gets things done
right now.

------
kyle_v
Surprised no one has mentioned Google Scholar; the solution already exists.
Not to mention Google by itself is already a pretty great research tool if you
know what you're looking for. You can even search by file type, e.g. "cure for
cancer filetype:pdf"

[https://scholar.google.com/](https://scholar.google.com/)

~~~
imh
This isn't for finding papers. It's for finding the data the papers were
written about.

------
j_star
There are already quite a few sites like this, including the one I work on for
Canadian research data: [https://www.frdr.ca](https://www.frdr.ca)

We have a search engine that covers every Canadian data set we can find (both
academic and governmental) and the ability to upload your own data set
directly to the site.

------
mooman219
Incorrect title, or misleading at least. The NSF awarded $500K for a "Lehigh
research team to investigate a 'Google for research data'". You're likely not
going to make a "Google for X" on $500K.

~~~
chiefalchemist
I'm thinking the better question is: What would it take for Google to be
"Google for research data"? I have to presume the answer is: not much.

~~~
foobar2020
I don't know what it took but they did it recently:

[https://news.ycombinator.com/item?id=17919297](https://news.ycombinator.com/item?id=17919297)

