
Show HN: TheBigDB - A simpler open database of facts - christophe971
http://thebigdb.com/
======
barakm
Former Freebase, current Google engineer here:

First of all, let me say that I'm glad more people are thinking/working in the
space of triples. Even unstructured ones like this.

But when there's no semi-strict schema, it gets really, really tricky. Free
text is hard, and actual meaning is hard to separate. (I say semi-strict, as
Freebase is schema-last -- feel free to create your own! -- but has some level
of enforcement)

For specific domains you may be okay with tags. And for some limited
applications it probably works great. Triples are cool!

But when you start talking about larger, broader datasets, ones that no one
person or small group can curate, you're going to start running into
collisions.

There's certainly an argument to be made for metaschema --
<https://developers.google.com/freebase/v1/search-metaschema> -- and
crowdsourcing these sorts of things could be interesting.

I think there's a lot of interesting work to be done. But I doubt that this is
"better" per se, or at the very least, is little more than a toy.

And hey, I built such a toy graph engine once upon a time (be gentle -- it was
really a demo hack) <https://github.com/barakmich/jgd> -- you can even query
it with Freebase's old MQL. (Which I have mixed feelings about, but is cool in
its own way)

I guess my argument is, don't throw the baby out with the bathwater. And feel
free to ping me for more!

~~~
christophe971
First of all, I'm glad to be talking to a Freebase engineer :)

Second: It is not intended to be a Freebase killer, but it is certainly better
for me and what I needed it for: building denisthebot.com - which can answer
questions about a whole lot of subjects while being completely agnostic about
what "categories" these subjects fit in.

Third: Since we all know no-one is going to change anyone's mind here, I won't
discuss the merits of a DB structured the way I built it. But calling it a toy
is quite flattering; I know of some very powerful things that were called "toys"
for some time :) Also, it would be a very easy toy to use, which will suit a
lot of people, according to the emails I received.

Also, you're right: "don't throw the baby out with the bathwater".

------
crazygringo
Comparison with Freebase:

> _Simpler structure: There are no datatypes, namespaces, lists, domains. Just
> ordered nodes. Having a dead simple structure like that allows developers to
> quickly and intuitively know how to access the info they want._

I don't see how this makes it simpler or intuitive at ALL. If there's no
convention as to whether I should search for "born on" or "born_date" or
"year_born", or whether the date will be "1900-08-01" or "08-01-1900" or
"1900/08"... then how is this supposed to be useful?

The central problem is, there are lots of textual ways of describing the same
thing. Without standardized datatypes and standardized tags, it quickly
becomes a messy, useless free-for-all.

I don't see how TheBigDB gets around this. The FAQ explains how it's
_different_ from Freebase/Wikidata, but I don't at all understand how it's
supposed to be _better_, or even as good.

~~~
JPKab
This is stupid. I hate to say it, but by taking all of the complexity out of
Freebase, they forgot the problem Freebase was trying to solve: semantic
integration.

Right on the front page they have their biggest flaw in the example:

Apple weight 150g

Ok, so what about the company Apple? How are they semantically distinguishing
one item from another? I guess Steve Jobs is Founder of a Fruit that weighs
150 grams.
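
A minimal Python sketch of that collision. The statements and the identifiers
(`fruit/apple`, `company/apple`) are invented for illustration; they are not
TheBigDB's or Freebase's actual data model.

```python
from collections import defaultdict

# Statements keyed by plain text, as in the front-page example.
text_facts = [
    ("Apple", "weight", "150g"),
    ("Apple", "founder", "Steve Jobs"),
]


def describe(facts, subject):
    """Collect every predicate/value pair recorded for a subject."""
    result = defaultdict(list)
    for s, p, o in facts:
        if s == subject:
            result[p].append(o)
    return dict(result)


# Both statements land on the same node: a 150g thing founded by Steve Jobs.
merged = describe(text_facts, "Apple")

# Hypothetical identifiers keep the two subjects apart.
id_facts = [
    ("fruit/apple", "weight", "150g"),
    ("company/apple", "founder", "Steve Jobs"),
]
fruit = describe(id_facts, "fruit/apple")
company = describe(id_facts, "company/apple")
```

With text keys, `merged` holds both the weight and the founder; with distinct
identifiers, `fruit` and `company` each hold only their own statement.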

~~~
christophe971
What you're talking about is the larger problem of "context". Thankfully, I
already wrote the code and the documentation to attach statements to contexts,
solving this problem in an elegant way. I haven't released it yet to keep the
core of the service simple, and to help people understand what it does first.

It will be released when the time is right. Thanks for your
comment!

~~~
JPKab
I apologize for calling your project stupid. I should not have done that. I'm
interested in seeing what you are doing on context. Thanks for being classy.

~~~
christophe971
No problem, I'm not afraid of criticism and insults, there is always some kind
of information in them :)

------
andyjohnson0
Sounds like a much simplified version of Douglas Lenat's Cyc project [1],
which has been going since the mid eighties and is attempting to build a
structured knowledgebase/ontology of everyday knowledge. They have a freely
downloadable subset called OpenCyc [2]. It seemed pretty impressive last time
I looked at it.

[1] <http://en.wikipedia.org/wiki/Cyc>

[2] <http://www.cyc.com/platform/opencyc>

~~~
JPKab
Does anyone actually DO anything with Cyc? It always appears to be vaporware
to me.

Maybe I'm just impatient, and something mindblowingly awesome, based on Cyc,
is around the corner.

~~~
Houshalter
Well according to Wikipedia:

> Lenat was frustrated by Automated Mathematician's constraint to a single
> domain and so developed Eurisko; his frustration with the effort of encoding
> domain knowledge for Eurisko led to Lenat's subsequent (and, as of 2008,
> continuing) development of Cyc. Lenat envisions ultimately coupling the Cyc
> knowledgebase with the Eurisko discovery engine.

I don't know what he intends to do with it from there, but it could
potentially make for some very powerful AIs.

------
ChuckMcM
I wonder if you could do machine learning on schemata. Basically start
learning about dates (as an example) and, as it learns, update the information
with what it has learned. Something that has one person putting in { name
"foo", born "10/1/92"} and someone else putting in { name "bar", born
"september 30th, 1966" } and then going back and replacing the dates with an
ISO standard date type but with a change history so you could look backwards
in time at the data and see how the database had "improved" it. (or not). Then
by voting on the improvements you teach the system to clean up its data
representations. Crazy? Insightful? Stupid? I don't know but it was the
question that popped into my head.
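
Roughly what I have in mind, as a Python sketch. Everything here is an
assumption made for illustration: the list of known formats, the `history`
layout, and the `votes` counter. Reading "10/1/92" as month-first is itself a
guess baked into the format list.

```python
from datetime import datetime

# Formats the cleaner currently "knows". The list and its order are assumptions;
# the second pattern only handles a literal "th" suffix.
KNOWN_FORMATS = ["%m/%d/%y", "%B %dth, %Y"]


def normalize_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def improve(record, field):
    """Replace a field with its normalized form, keeping the old value in a history list."""
    new_value = normalize_date(record[field])
    if new_value is not None and new_value != record[field]:
        record.setdefault("history", []).append(
            {"field": field, "old": record[field], "new": new_value, "votes": 0}
        )
        record[field] = new_value
    return record


foo = improve({"name": "foo", "born": "10/1/92"}, "born")
bar = improve({"name": "bar", "born": "september 30th, 1966"}, "born")
```

Each rewrite keeps the original string, so a bad "improvement" can be voted
down and rolled back from `history`.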

~~~
vidarh
The problem is that many possible formats conflict in ways that make
resolution impossible without cross-referencing with other sources.

Which date is "10/3/5"? Is it March 5th 2010? March 5th 1910? March 10th 1905?
March 10th 2005? October 3rd 1905? October 3rd 2005? (or another century
entirely, though the 20th and 21st would be most likely). And don't think the
"/" vs. "-" as separator is sufficient to tell them apart.

And you'll find a lot of other variations - I'm used to writing 10/3-5 for
example... But I'm not even consistent, I might write 10/3/5 or 10-3-5, or
5/3/10 / 5-3-10; anywhere I want to be explicit, I would write 2005-03-10
exactly because I'm used to seeing so many ambiguous dates that can't easily
be resolved.
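
To put a number on it, here is a small Python sketch that enumerates the
readings of "10/3/5". It assumes a two-digit year falls in the 1900s or 2000s
and allows any of the three numbers to be the year, month or day, which
includes a few orderings nobody would realistically write.

```python
from datetime import date
from itertools import permutations


def interpretations(text, centuries=(1900, 2000)):
    """Enumerate every calendar date a three-number string could mean."""
    parts = [int(p) for p in text.replace("-", "/").split("/")]
    found = set()
    # Try each assignment of the three numbers to (year, month, day).
    for y, m, d in permutations(parts):
        for century in centuries:
            try:
                found.add(date(century + y, m, d))
            except ValueError:
                pass  # not a real calendar date, e.g. month 13
    return sorted(found)


candidates = interpretations("10/3/5")
```

Under those assumptions the string has twelve distinct readings, and nothing
in the text itself says which one was meant.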

What about the value 5.123? Is it a floating point value with "123" after the
decimal point, or the integer 5123? The "decimal point" is "," in many
countries, and the thousand separator is usually, but not always, "." in
countries that use "," as the decimal marker. If you treat things as "just
text" you are going to have to potentially deal with dozens of different
combinations of decimal points and quantity markers (depending on country, the
markers don't all occur only every 3 digits to the left from the decimal
marker...)
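
The same thing in a few lines of Python. `parse_number` is a hypothetical
helper that only knows two conventions; real data also has spaces,
apostrophes and non-3-digit grouping, which this ignores.

```python
def parse_number(text, decimal_mark):
    """Parse text under an explicit convention: '.' or ',' as the decimal mark."""
    group_mark = "," if decimal_mark == "." else "."
    # Drop the grouping mark, then rewrite the decimal mark as '.' for float().
    cleaned = text.replace(group_mark, "").replace(decimal_mark, ".")
    return float(cleaned)


point_decimal = parse_number("5.123", decimal_mark=".")  # '.' is the decimal mark
comma_decimal = parse_number("5.123", decimal_mark=",")  # '.' groups thousands
```

The two results differ by a factor of a thousand, and the caller has to supply
the convention because the text does not carry it.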

Interpreting small text fragments is fraught with a near infinite number of
obnoxious details like this, and part of the problem is that few people know
even most of them, so most will be unable to quickly resolve ambiguities
without cross referencing with other data (or worse: they _think_ they know,
or don't even recognize that there is an ambiguity in the first place).

------
troymc
One nice property of the Wikidata database is that it is a "secondary
database. Wikidata will record not just statements, but their sources, thus
reflecting the diversity of knowledge available and supporting the notion of
verifiability." [1]

I think that's far better than voting. Voting for facts amounts to relying on
a logical fallacy: appeal to the majority. [2] (Voting is fine for popularity
contests, or things that can only be matters of opinion, but facts?)

[1] <http://www.wikidata.org/wiki/Wikidata:Introduction>

[2] <https://en.wikipedia.org/wiki/Argumentum_ad_populum>

------
kmike84
Is it possible to download all data and use it under some open license (like
CC-BY)? I can't find data license terms.

If not, then sorry, Freebase is vastly superior IMHO - from a user's point of
view I don't see a point in a crowdsourced proprietary database (even if the
API is currently free).

~~~
christophe971
While it is not the case right now, I'm heavily invested in open source
myself, so it is very possible that it ends up under CC-BY at some point.

~~~
vidarh
It's worth pointing out that in most of the world facts are not
copyrightable, and a _collection_ of facts is often not copyrightable either.
In those instances, CC-BY or whatever license you apply will be
unenforceable. I believe Creative Commons themselves explicitly believe that
CC-BY and similar licenses are _not_ appropriate for collections of facts.

E.g. they explicitly say on their website: "Copyright does not protect the
facts or ideas underlying the creative expression. So, Creative Commons
licenses do not apply to ideas, factual information or other non-creative
elements that are not protected by copyright."

Courts in many jurisdictions explicitly refuse to accept "sweat of the brow"
arguments for copyright, explicitly requiring an element of creativity. Some
countries do have "database copyrights" that can protect arrangements of facts
in certain ways, while in others a collection of straight up facts can not be
copyrighted pretty much no matter what.

------
danso
Have you/do you plan to seed your database with the already structured data
from Freebase? It should be relatively straightforward, right? Well, I mean,
minus the time to properly map the Freebase schema into your format. But
that's probably less time than it takes to wait for people to fill in enough
facts.

------
oelmekki
Congrats, it looks great.

So, if I understand correctly, it lets people crowdsource any kind of
structured and descriptive data?

~~~
christophe971
That's exactly right! There is a lot of data on the web, but almost none of it
is structured in a way that lets others easily access and edit it.

For example, as you can see on <http://denisthebot.com>, having that kind of
API makes it trivial to build smart question-answering engines.

~~~
oelmekki
Great. It may offer a good way to build some kind of open alternative to
Google Now.

~~~
christophe971
Absolutely, thanks for your question!

------
namank
Excellent! I've been working on something similar. Trying to come up with a
schema that is data-centric is hard enough let alone focusing on the ease of
use by developers. Good luck!

In _Can I send how many requests I want?_, I think you might mean _Can I send
as many requests as I want?_

~~~
christophe971
Woops, good catch! Fixed. And thank you!

------
timdorr
Any plans to degrade votes over time, so that new or updated facts can more
quickly gain precedence?

~~~
DanWaterworth
Facts don't change.

~~~
oelmekki
It's Thursday.

~~~
DanWaterworth
Your very clever. Do you really think I hadn't considered a statement like
that?

Facts are verifiable statements and things that are verifiable are verifiable
repeatably. This is not a verifiable statement.

~~~
jacalata
'Yugoslavia is a country'. Are you really going to exclude that? So, no
statements about political geography at all, then. Same problem for physical
geography ('New Moore Island exists'). Nothing about climate (because 'the
rain in Spain' could change at any time), nothing about population levels.
Even the sample 'average weight of an apple' is pretty suspect. I think your
database is going to be quite limited.

~~~
DanWaterworth
> 'Yugoslavia is a country'. Are you really going to exclude that?

Yes, of course.

> So, no statements about political geography at all, then.

Absolutely not, you just have to make time explicit.

~~~
msellout
What calendar will you use?

Edit: It seems we're arguing about what a priori knowledge is capable of
serving as a base for factual deductions. The Kantian approach is to say that
we all agree on time and space and everything can be based off of these self-
evident truths. I think there is not such a clear boundary between objective
truth and induction.

Edit2: I'd also like to take this moment to point out that "you're" is the
proper contraction of "you are", since we're getting all semantic.

~~~
DanWaterworth
This has suddenly become very philosophical. My view is that a database of
facts should contain things that are believed to be facts. It should be
possible to remove facts that are shown to be incorrect, but those things
should never have been true.

> I'd also like to take this moment to point out that "you're" is the proper
> contraction of "you are", since we're getting all semantic.

I know, it annoys me too. By the time I'd realized it was too late to edit.
Typos happen.

~~~
jacalata
"My view is that a database of facts should contain things that are believed
to be facts. It should be possible to remove facts that are shown to be
incorrect, but those things should never have been true."

But they definitely were true. I thought you were making a distinction between
'something that is true' and 'something that is a fact (ie: is unchangingly
true)' which I don't think most people make.

------
pmtarantino
Any chance to release this as open source? For example, people would like to
have it installed on their own servers and use it for their own things. I
think it would be useful for fandom. For example, the Star Wars DB or Lord of
the Rings DB :-)

~~~
christophe971
I'm seriously considering it actually, but not 100% sure at the moment... :)
And yes, it would be awesome for fandom, but the whole point of the service is
to be able to have all kinds of data in it, so... :)

~~~
pmtarantino
Yes, but people are usually more motivated to contribute to the biggest
database about a specific topic than to the biggest database ever :)

Check, for example, Wikia, which hosts wikis for TV shows that are more
complete than Wikipedia.

------
feniv
Don't be deterred by the negative comments about the unstructured data. It's a
tough problem but not an impossible one. I know because I'm also battling the
same question building a free-form, NLP-based self-tracking app to help track
daily data ( <http://thyself.io> ). The problem for me is that it's hard to
perform analytics when one datapoint is in "miles walked" and the other is in
"laps ran".

As you said, conventions help mitigate the problem a little bit but the end
user can hardly be expected to stick to best practices.

I have hope though. This is a problem worth solving.

------
vineet
Reminds me of Freebase. They built a huge data-set as well as tools and an API
to access it. Have you talked to anyone on the team? (they are now at
Google) How would you say that you are different from them?

~~~
christophe971
Never talked to them, but I did address the differences between TheBigDB and
Freebase here: <http://thebigdb.com/faq>

------
jeffdavis
I like this idea in the sense of an experiment. I'm not sure where it will end
up, but it could be interesting.

As others have pointed out, some kind of conventions must be established
around the semantics, and something must be done to avoid redundancy (which
leads to inconsistency) and ambiguity.

I agree with those criticisms, but if the community also helps develop the
schema, it will be interesting to see. What collisions will happen? What will
be the result of queries that reach far across disciplines?

------
adventured
I appreciate any new service that attempts to organize data / information.
With that in mind, I hope this succeeds.

A suggestion: it needs a demo query box on the site. Shouldn't be too hard to
let a rate limited IP address throw a few keywords at it and spit back
results. I'd like to see what the db contains before I invest too much time
(how many topics, how many facts, etc).

~~~
christophe971
You're right, it would be a plus for people who quickly want to test the API.

In the meantime, you can browse what the DB contains at
<http://browser.thebigdb.com>, or just "gem install thebigdb" and start
copying and pasting the code examples to see how the API really behaves.

Thanks for your suggestion!

------
gojomo
Based on observations and prior experience (esp. Bitzi), I believe the wiki
approach of "correct-in-place" leads to better convergence and community than
"downvote the errors, add a corrected entry, upvote the better entry".

(Voting democracy may help prevent people from being oppressed in certain
ways, but it isn't much of a truth-discovery mechanism.)

------
indeyets
Interesting concept. It's like RDF for human beings. It's easier for human
beings to look at unstructured data, but at the same time it makes it
extremely hard to do interesting stuff programmatically. You just can't do
reliable inferencing.

------
xivSolutions
Cool project, man. And, way to show class in handling the "detractors" here. I
firmly believe that constructive debate is a good thing.

------
c0n5pir4cy
Also quite similar to Freebase: <http://www.freebase.com/>

~~~
christophe971
Not really. I addressed this in the FAQ: <http://thebigdb.com/faq>

~~~
firdaus
Sounds a bit like what <http://fluidinfo.com/> wanted to do originally.

------
ximeng
You should have a list of the most recent facts added to give people a taste
of what's in the database.

~~~
christophe971
Yes, while you can see what's in the DB here: <http://browser.thebigdb.com>,
you can't see what's recent. I'll do something about that.

------
hnriot
Do you support downloads of the data?

------
peteypao2013
Ahh, too yellow!

~~~
christophe971
Sorry!

