
Microsoft Concept Graph - uyoakaoma
https://concept.research.microsoft.com/Home/Introduction
======
aerovistae
This is a thing I think about often which I always conclude is not currently
possible in the way I wish it were.

While we can make a concept graph, what I often wonder is whether it's really
possible to make a computer _think_ of a thing, truly have an _idea_ of it in
its "head" the way it is for a person.

When you think of an apple, you don't just connect to a text description of it
and a picture and a bunch of links to "fruit" and "seed" and "food." You sort
of see it and feel it and taste it and know its value. It's rendered in
senses, not in text.

I am not confident that it will be possible for a computer to understand
something that way for a very long time. I think until we understand how that
information is encoded in our own minds, getting a machine to truly understand
it the same way will be elusive.

When I was recently considering this, the fundamental difference I came down
to was this: a living thing _wants things_ , _needs things._ So long as a
computer does not have any desires, I just don't see how it could ever
understand the world the way we do. What would anything matter to you if you
didn't eat, drink, sleep, feel, get bored, get curious?

I think those aspects of a living thing drive our understanding of everything
else. Without that, it's all just text.

But of course I do understand perfectly that I am speaking of a longer
timeline sort of project and that a Probase-like component is still a big part
of it and can still independently move things forward quite a bit.

~~~
sdrinf
What you're looking for is called the _symbolic grounding problem_ :

| But as an approach to general intelligence, classical symbolic AI has been
disappointing. A major obstacle here is the symbol grounding problem [18, 19].
The symbolic elements of a representation in classical AI – the constants,
functions, and predicates – are typically hand-crafted, rather than grounded
in data from the real world. Philosophically speaking, this means their
semantics are parasitic on meanings in the heads of their designers rather
than deriving from a direct connection with the world. Pragmatically, hand-
crafted representations cannot capture the rich statistics of real-world
perceptual data, cannot support ongoing adaptation to an unknown environment,
and are an obvious barrier to full autonomy. By contrast, none of these
problems afflict machine learning. Deep neural networks in particular have
proven to be remarkably effective for supervised learning from large datasets
using backpropagation. [..] The hybrid neural-symbolic reinforcement learning
architecture we propose relies on a deep learning solution to the symbol
grounding problem.

Source: Marta Garnelo et al: Towards Deep Symbolic Reinforcement Learning
[https://arxiv.org/pdf/1609.05518.pdf](https://arxiv.org/pdf/1609.05518.pdf)

~~~
aerovistae
Thank you for linking me to this. I had never heard of it. That is exactly it.

~~~
visarga
Besides "symbolic grounding", also look up "word vectors". They are an attempt
to ground words in the statistics of their surrounding words in very large
bodies of text.
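A toy sketch of that idea: represent each word by counts of its neighbours and compare words with cosine similarity. (Real word vectors such as word2vec learn dense embeddings instead of raw counts; the corpus here is invented for illustration.)

```python
from collections import Counter, defaultdict
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    """Build count-based context vectors: each word is represented
    by the words appearing within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    vectors[word][sent[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "stocks rose on the news".split(),
]
vecs = cooccurrence_vectors(corpus)
# "cat" and "dog" share contexts (sat, on, the), so they score
# higher together than "cat" and "stocks" do.
```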

~~~
kybernetikos
I also recommend 'Ventus' by Karl Schroeder. It's a fun scifi read, covers
some of these concepts and can be downloaded for free:
[http://www.kschroeder.com/my-books/ventus/my-
books/ventus/fr...](http://www.kschroeder.com/my-books/ventus/my-
books/ventus/free-ebook-version)

------
rspeer
The Terms of Use appear to be very restrictive.

Not only does it have the "non-commercial" restriction, limiting its use to
throwaway projects that are not expected to succeed, but _derivative works_
are disallowed.

> Unless otherwise specified, the Services are for your personal and non-
> commercial use. You may not modify, copy, distribute, transmit, display,
> perform, reproduce, publish, license, create derivative works from,
> transfer, or sell any information, software, products or services obtained
> from the Services.

As far as I can tell, you are free to admire the Concept Graph from a
distance, but not to build anything on it.

------
wslh
It reminds me of Freebase [1], acquired by Google and later deprecated; you
can find the data in [2], and there is the new Google Knowledge Graph Search
API [3]. It may not be enough for making computers think, but it can help
augment a search engine. In Freebase you could perform queries like "give me
the VCs who have had great exits in telecommunication companies". It is very
useful to apply these kinds of queries to news because they add context.
DBpedia [4] is another interesting project on this subject.

[1]
[https://en.wikipedia.org/wiki/Freebase](https://en.wikipedia.org/wiki/Freebase)

[2]
[https://developers.google.com/freebase/](https://developers.google.com/freebase/)

[3] [https://developers.google.com/knowledge-
graph/](https://developers.google.com/knowledge-graph/)

[4] [http://wiki.dbpedia.org/](http://wiki.dbpedia.org/)

~~~
kmote00
Freebase was mentioned in the article (along with Cyc) and was noted to have
2000 concepts compared to MCG's 5.4 million (and Cyc's 120K).

~~~
crypto5
I suspect MS Concept Graph's concepts can be represented as compound triples
in Freebase. E.g., MS Graph may have Jacques Chirac as an instance of the
"President of France" concept, whereas in Freebase such knowledge would be
represented as:

Jacques Chirac - occupation - President

Jacques Chirac - country - France

I find the latter to be much more efficient.
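A minimal in-memory sketch of why the decomposed form composes well: the compound concept falls out of intersecting two simple queries. (The tiny triple set and the `query` helper are invented for illustration, not Freebase's actual MQL API.)

```python
# A toy (subject, predicate, object) store in the decomposed style.
triples = {
    ("Jacques Chirac", "occupation", "President"),
    ("Jacques Chirac", "country", "France"),
    ("Angela Merkel", "occupation", "Chancellor"),
    ("Angela Merkel", "country", "Germany"),
}

def query(predicate=None, obj=None):
    """Return subjects whose triples match the given predicate/object."""
    return {
        s for (s, p, o) in triples
        if (predicate is None or p == predicate) and (obj is None or o == obj)
    }

# The compound concept "President of France" is recovered by
# intersecting two simple queries:
presidents_of_france = query("occupation", "President") & query("country", "France")
```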

------
paragraft
I'm in no position to discuss the actual service, but was really surprised by
the citation requests at the bottom. If you use the service, please cite these
6 papers. If you use the data, please cite these other 2.

Is this a new norm that's come about from publish/perish since I was at uni?
I've always assumed that you cite what you actually refer to, and even if you
just cite as a reference to describe a working project, surely one suffices.
Six though?

~~~
roel_v
Maybe you were at uni 60 years ago, but yes, sometimes you cite more than one
source.

~~~
arethuza
Well, when I was at university it was considered bad form to cite sources you
hadn't actually consulted yourself.

------
coldnebo
Oh, I just realized that 'tagging' in this way is kind of an implementation of
Minsky's k-lines.
[https://en.m.wikipedia.org/wiki/K-line_(artificial_intellige...](https://en.m.wikipedia.org/wiki/K-line_\(artificial_intelligence\))

Cool!

------
unoti
It doesn't know about "vibrator", but it knows all about "astable
multivibrator". Sounds a lot like me as a kid!

This data could be used to automatically generate trivia questions and to
power other kinds of word games...

------
vvvvvoid
Training data is the mirror of society?
[https://concept.research.microsoft.com/Home/Demo?instance=wo...](https://concept.research.microsoft.com/Home/Demo?instance=woman&smooth=0.0001)

vs.

[https://concept.research.microsoft.com/Home/Demo?instance=ma...](https://concept.research.microsoft.com/Home/Demo?instance=man&smooth=0.0001)

In this context, the disclaimer makes much more sense.

~~~
zyx321
"Man" is a somewhat ambiguous word, and the algorithm is clearly interpreting
it to mean "the human species" first and foremost.

~~~
thanatropism
[https://concept.research.microsoft.com/Home/Demo?instance=bi...](https://concept.research.microsoft.com/Home/Demo?instance=billionaire&smooth=0.0001)

------
pbnjay
Pretty neat implementation!

Is there any way to monetize a similar independent project like this? I
understand it can help ML tasks with disambiguation but that's even farther
out of my expertise. I ask because I did very similar work for my CS PhD
dissertation in 2013. Basically covering their 2nd aim, but with fewer scoring
methods and a viz component.

It would be cool to dust off my old code and try it on this data set either
way...

~~~
barakm
Say, you were an early contributor to Cayley!
([https://github.com/cayleygraph/cayley](https://github.com/cayleygraph/cayley))

Things were slow there for a while, but we have our own namespace now, we've
done about a release per quarter for a bit, and have a small but thriving
community on our discussion board:
[https://discourse.cayley.io](https://discourse.cayley.io)

Currently up for discussion is reification :)

------
thomas4g
This is really cool! I wonder if there's any intersection between this and
MIT's Concept Net
([http://conceptnet5.media.mit.edu/](http://conceptnet5.media.mit.edu/))
somewhere down the road.

~~~
rspeer
Interesting - I was just about to link to the new version of ConceptNet
([http://conceptnet.io](http://conceptnet.io)).

Certainly a lot of the same language is used to describe it. Different areas
of focus. There's room for both in the world, but dang, the names are going to
be confusing.

------
foota
Anyone know if we can build something off this? The text "Disclaimer: The
data, service, and algorithm provided by this website are based on the
automatically computing and training of public available data. They are only
for academic use. The user of such data, service, and algorithm shall be
responsible for contents created by the algorithm by complying with compliance
with applicable laws and regulations." makes me hesitate.

------
foota
I think this could be used to do some really neat procedural generation of
concepts in a game world (ala the kind of experience that Dwarf Fortress
offers)

------
Mathnerd314
> Microsoft

> largest OS vendor

> "We may not be able to find any reasonable object other than Microsoft."

This seems a bit contrived, considering that Android has the larger install
base.

~~~
ctolkien
Much like the real world and real people, there's no guarantee that the most
popular concepts will be technically correct.

------
bsbechtel
It seems to me one of the problems with machine learning in the nlp domain is
that language concepts are mutable, but at varying degrees. In the dog/cat
example used in the original post, the degree of mutability is very low, given
the concept of a dog and a cat are rooted in the physical world.

However, consider more abstract human concepts or language that is new and
changing often. Ironically, much of the language used to describe AI falls
into this category (and is thus subject to confusion among humans).

Any sort of machine learning algorithm would need to include some sort of
'adaptability' parameter that could tell the machine when to discard the
current concept of the word and try forming a new one. This would need to be
based on checks of both the immediate context of the phrase and related phrases.

Disclaimer: My knowledge of machine learning is limited to passive reading, so
this may already be a part of any nlp algorithm, or I'm just completely off
base. So please consider my comments are coming from the perspective of an
outsider!

------
mitbal
What is the difference from word embedding methods? Isn't a concept just
another word with high semantic similarity to others?

------
deviate_X
It seems to be quite opinionated

[https://concept.research.microsoft.com/Home/Demo?instance=hi...](https://concept.research.microsoft.com/Home/Demo?instance=hillary&smooth=0.01)

It would be interesting to know more about how the graph is formed, and how it
avoids "gaming" the engine.

The Probase link is giving me a 400 error.

~~~
Mathnerd314
My guess is it only parses certain word forms. "Blatant state-shtuppers" is in
this blog post:
[http://www.transterrestrial.com/?p=63723](http://www.transterrestrial.com/?p=63723)

> "Let’s put blatant State-shtuppers such as Hillary, Bernie, and Obama at
> about 7 or an 8."

This matches Hearst Pattern #1 from [https://www.microsoft.com/en-
us/research/wp-content/uploads/...](https://www.microsoft.com/en-
us/research/wp-content/uploads/2012/05/paper.pdf):

> NP such as {NP,}*{(or, and)} NP

Hillary usually appears by herself, rather than in a list. Apparently Probase
doesn't pick up the plentiful "X is a Y" associations, e.g. the "Hillary is a
liar" from [http://thefederalist.com/2015/08/27/poll-voters-
overwhelming...](http://thefederalist.com/2015/08/27/poll-voters-
overwhelmingly-say-hillary-is-a-dishonest-liar/) or "Hillary is a candidate"
from [http://www.huffingtonpost.com/jeffrey-sachs/hillary-is-
the-c...](http://www.huffingtonpost.com/jeffrey-sachs/hillary-is-the-
candidate_b_9168938.html)

Or maybe it does, and they're ranked down. They do have a truth-detection
phase, but it's mostly syntactic, and the top categories all have negative
examples ("Hillary is not a candidate", "Hillary is not a democrat", etc.).
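Pattern #1 can be sketched with a regex. (Illustrative only: Probase identifies genuine noun phrases syntactically over billions of sentences; here an NP is crudely approximated by a single word, so a multi-word concept like "domesticated animals" would lose its modifier.)

```python
import re

# Rough sketch of Hearst pattern #1: "NP such as {NP,}* {(or|and)} NP".
PATTERN = re.compile(
    r"([\w-]+) such as "                    # hypernym (the concept)
    r"((?:[\w-]+, )*(?:and |or )?[\w-]+)"   # comma-separated instances
)

def extract_isa(sentence):
    """Return (instance, concept) pairs found in one sentence."""
    pairs = []
    for concept, tail in PATTERN.findall(sentence):
        for inst in re.split(r",\s*|\band\s+|\bor\s+", tail):
            if inst.strip():
                pairs.append((inst.strip(), concept))
    return pairs
```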

------
rasengan0
Where is the data? Probase is sanitized; there is no MS
[https://concept.research.microsoft.com/Home/Demo?instance=pr...](https://concept.research.microsoft.com/Home/Demo?instance=prick&smooth=0.0001)

------
DodgyEggplant
"sex" >> Sorry that current Microsoft Concept Graph doesn't contain this
instance.

------
rsiqueira
Please note that Microsoft has censored some words like BLACK or F*CK so they
do not appear in their search results: "Sorry that current Microsoft Concept
Graph doesn't contain this instance."

But words like "WHITE" are ok, identified as "neutral color, traditional
color, classic color non obtrusive color".

This is the Concept Graph demo where you can verify if the word is censored or
not:
[https://concept.research.microsoft.com/Home/Demo](https://concept.research.microsoft.com/Home/Demo)

~~~
ComodoHacker
I hope this censoring was applied to a limited demo dataset and not to the
whole dataset. Otherwise I can't really trust such "research".

In the end, it's our digital world that reflects our minds. And we should have
the courage to look into the mirror.

------
salex89
Oh joy, the Semantic Web all over again...

~~~
mark_l_watson
The semantic web is still in play, in the form of linked data, schema.org,
etc. I was looking at SKOS just yesterday to help solve a particular problem.
Google's and Facebook's knowledge graphs are born out of knowledge
engineering, the semantic web, etc.

I, for one, am willing to declare victory for semantic web technology.

------
amelius
I was under the impression that deep learning could already extract clusters
of symbols and group them into concepts, with no other input than just large
bodies of text, but I could be wrong.

------
davidfm
Some interesting highlights from a quick scan of the data:

item hot 44

complex carbohydrate entirely grain product whole wheat bread 4620

free rich company datum size 33222

issue stress pain depression sickness 11110

testing device glucometer diabetes blood sugar test strips insulin pump 7138

big deal real estate investment opportunity 4135

small portion couple small cookie 2438

microsoft hardware failure bad hard drive 2281

affordable and multifunctional furniture piece sofa 1750

environmental factor diet 1588

so called designer sandwich cranberry 1460

practical add on towel rack 1459

practical accessory towel rack 1498

shop el corte ingles department store chain 1499

combustible material clothe 1405

------
foota
Looks like there's some other information about this at
[https://www.microsoft.com/en-
us/research/project/probase/](https://www.microsoft.com/en-
us/research/project/probase/) and
[http://haixun.olidu.com/probase/browser.htm](http://haixun.olidu.com/probase/browser.htm)

~~~
rayshan
There are only screenshots of the Probase Browser. Does anyone have a link to
a working instance?

~~~
kmote00
[https://concept.research.microsoft.com/Home/Demo](https://concept.research.microsoft.com/Home/Demo)

~~~
rayshan
Thanks, I saw that, but it's not the same graph-based UI as the screenshots.

------
daxfohl
How do nlp frameworks like this deal with the fact that "literally" means "not
literally" in some contexts?

~~~
visarga
There are dictionaries of word senses, such as WordNet.

In order to identify word senses, first extract the words from text, collect
many examples of their surrounding contexts, and apply clustering to them. If
a word has more than one sense, the senses appear as distinct clusters.
Furthermore, words can be replaced with word embeddings (numerical
representations of their meanings).
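A toy sketch of that pipeline: collect one context set per occurrence of a target word, then group occurrences greedily by overlap. (Real systems cluster dense embeddings rather than raw word sets; the corpus, threshold, and function names here are invented for illustration.)

```python
def context_sets(sentences, target, window=3):
    """One set of neighbouring words per occurrence of `target`."""
    occurrences = []
    for sent in sentences:
        for i, w in enumerate(sent):
            if w == target:
                lo, hi = max(0, i - window), i + window + 1
                occurrences.append(set(sent[lo:hi]) - {target})
    return occurrences

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_senses(contexts, threshold=0.2):
    """Greedy clustering: each occurrence joins the first cluster
    whose combined context overlaps enough, else starts a new one."""
    clusters = []
    for ctx in contexts:
        for cl in clusters:
            if jaccard(ctx, set().union(*cl)) >= threshold:
                cl.append(ctx)
                break
        else:
            clusters.append([ctx])
    return clusters

corpus = [
    "he sat on the bank of the river".split(),
    "the river bank was muddy and wet".split(),
    "she took a loan from the bank".split(),
    "the bank raised the loan interest rates".split(),
]
senses = cluster_senses(context_sets(corpus, "bank"))
# Two clusters emerge: a river/water sense and a money/finance sense.
```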

------
MarkMc
Somewhat related: I develop accounting software, and I want my users to be
able to say, "Show me all the unpaid invoices for Paul Jones" or "Record a
Mastercard payment to ABC Supplies for £32.20".

Are there any libraries or platforms that would help me implement this kind of
natural language UI?

~~~
Maarten88
Microsoft's LUIS service will easily do that for you. You train it to
recognize a user intent (show_invoices, record_payment) and classify
additional entities (payment_status, creditor_name, or payment_type). It works
remarkably well.

You use the portal to register your application and enter some example
sentences and specify your expected interpretation. Then you can start
recognizing input strings using the API. You can then improve the recognizer
by manually correcting input from actual use.

[https://www.luis.ai/](https://www.luis.ai/)
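For a feel of what such a service returns, here is a toy hand-written approximation. (LUIS itself learns a statistical classifier from your example utterances rather than matching patterns; the intent and entity names below just reuse the ones from the comment above and are not LUIS API calls.)

```python
import re

# Hand-written stand-in for a trained intent/entity recognizer.
INTENTS = [
    ("show_invoices", re.compile(
        r"show .*?(?P<payment_status>unpaid|paid) invoices for "
        r"(?P<creditor_name>.+)", re.I)),
    ("record_payment", re.compile(
        r"record a (?P<payment_type>\w+) payment to "
        r"(?P<creditor_name>.+?) for £?(?P<amount>[\d.]+)", re.I)),
]

def parse(utterance):
    """Return the first matching intent plus its named entities."""
    for intent, pattern in INTENTS:
        m = pattern.search(utterance)
        if m:
            return {"intent": intent, "entities": m.groupdict()}
    return {"intent": None, "entities": {}}
```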

------
wrabbitfoot
[https://en.wikipedia.org/wiki/List_of_animals_by_number_of_n...](https://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons)

------
MichaelMoser123
I think they did not build the graph by hand - they must have automated the
process of creating it (interesting how they did that); the article mentions
four team members, and if you did this by hand you would need more hands.

... well, the article links to this paper where they seem to be automating
the process:

"Probase: A Probabilistic Taxonomy for Text Understanding"
[https://www.microsoft.com/en-us/research/wp-content/uploads/...](https://www.microsoft.com/en-us/research/wp-content/uploads/2012/05/paper.pdf)

Thanks for the link - now I have something to read. MS Research has some
really bright people working for them; wow.

------
mycall
Reminds me of Cyc.

~~~
mark_l_watson
Different though. OpenCyc has not seen an update in a long while, but the data
and the inferencing system are still good.

------
lqdc13
This is amazing! Kudos to Microsoft! This will really help people in NLP
applications and small scale search engines for disambiguation.

Does anyone have a script for calculating the similarity scores, or know which
papers have the formulas?

------
smellf
I was unable to parse "animals other than dogs such as cats" because it isn't
a sentence; it has no predicate. Shouldn't language sort of be their thing
here?

------
bonniemuffin
I'd like to be able to casually browse this taxonomy to see what's related to
what, in the same way that I enjoy browsing Wikipedia with no particular goal
in mind.

~~~
foota
Unfortunately it doesn't sound like the data could be used for this,
"Disclaimer: The data, service, and algorithm provided by this website are
based on the automatically computing and training of public available data.
They are only for academic use. The user of such data, service, and algorithm
shall be responsible for contents created by the algorithm by complying with
compliance with applicable laws and regulations."

------
z3t4
I see a disturbing trend with tech blogs having binary images instead of html
tables and graphs.

------
vasaulys
This looks a _lot_ like word2vec applied in large scale to me. [1] I have a
feeling Google already does this.

[1]
[https://en.wikipedia.org/wiki/Word2vec](https://en.wikipedia.org/wiki/Word2vec)

------
ryanbertrand
Why does Microsoft not like making mobile optimized web pages?

~~~
FnuGk
because they don't have a mobile platform

------
hammock
Can this data be put to commercial use?

------
ndonnellan
What do these mean:

- MI

- NPMI

- PMI^K

- BLC
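Presumably these are the association scores offered for ranking instance/concept pairs: MI is mutual information, NPMI is normalized pointwise mutual information, PMI^k is a PMI variant that damps PMI's bias toward rare pairs, and BLC I believe is the "basic level conceptualization" score specific to Probase/MCG and defined in their papers. A sketch of the PMI family from raw co-occurrence counts (my own illustration, not their exact formulas):

```python
from math import log

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information: log p(x,y) / (p(x) p(y))."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return log(p_xy / (p_x * p_y))

def npmi(count_xy, count_x, count_y, total):
    """PMI normalized into [-1, 1] by -log p(x,y): 1 means x and y
    always co-occur, 0 means they are independent."""
    p_xy = count_xy / total
    return pmi(count_xy, count_x, count_y, total) / -log(p_xy)

def pmi_k(count_xy, count_x, count_y, total, k=2):
    """PMI^k variant: log p(x,y)^k / (p(x) p(y)); k > 1 penalizes
    rare pairs that plain PMI over-rewards."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return log(p_xy ** k / (p_x * p_y))
```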

------
sixo
Short text understanding? God, I just want a decent thesaurus.

~~~
linusw
Funny you should mention that - my friend just submitted
[https://news.ycombinator.com/item?id=12852302](https://news.ycombinator.com/item?id=12852302)
for feedback on his stab at a thesaurus!

